CN111831812A - Reading comprehension data set automatic generation method and device based on knowledge graph - Google Patents

Publication number
CN111831812A
Authority
CN
China
Prior art keywords
candidate
examples
instances
instance
relations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010991922.9A
Other languages
Chinese (zh)
Other versions
CN111831812B (en)
Inventor
赵撼宇
袁莎
唐杰
谢年韬
马全跃
曹岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202010991922.9A priority Critical patent/CN111831812B/en
Publication of CN111831812A publication Critical patent/CN111831812A/en
Application granted granted Critical
Publication of CN111831812B publication Critical patent/CN111831812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology

Abstract

The invention discloses a method and device for automatically generating a reading comprehension data set based on a knowledge graph. The method comprises the following steps: extracting instances and/or relations from a given question; acquiring candidate instances and/or candidate relations corresponding to the instances from a pre-generated concept knowledge graph, and replacing the instances and/or relations in the given question with the candidate instances and/or candidate relations to generate a new question; if the answer to the new question can be obtained from the concept knowledge graph, obtaining article segments corresponding to the instances in the new question from other data sources; and generating a reading comprehension data set from the new question, its answer, and the article segments. The invention can automatically generate a reading comprehension data set based on the concept knowledge graph, thereby saving labor cost; the constructed reading comprehension data set is more complex and places higher demands on reasoning ability.

Description

Reading comprehension data set automatic generation method and device based on knowledge graph
Technical Field
The invention relates to the technical field of natural language processing, in particular to a reading comprehension data set automatic generation method and device based on a knowledge graph.
Background
Natural language processing tasks include part-of-speech tagging, syntactic analysis, reading comprehension, and the like. Part-of-speech tagging and syntactic analysis focus on context within a small range (for example, within a sentence), whereas reading comprehension requires analyzing and processing contextual semantic information over a wider range and at a deeper level. Such wider and deeper contextual semantic information plays a very important role in understanding text and better serves the goal of natural language processing: enabling a computer to read, process, and understand the inherent meaning of text. Reading comprehension is therefore one of the most popular research directions in natural language processing. Because research on reading comprehension depends on large-scale, high-quality data sets, such a data set must be built before the research can be carried out.
At present, reading comprehension data sets are mostly constructed either by purely manual annotation or by iterating between manual annotation and an algorithm. These methods have the following defects: first, the full set of questions and article segments must be defined in advance, which consumes considerable manpower and material resources; second, although labeling the data set with an algorithm saves some labor, the algorithm's accuracy is limited and a large amount of manual review is still required, so the labor cost remains high; third, when a reading comprehension data set is constructed for a specific field, annotators must have the relevant domain knowledge, which raises the requirements on annotators and further increases the cost of data annotation; and fourth, manually labeled reading comprehension data sets are mostly relatively simple and rarely involve more complex reasoning questions, so they cannot effectively promote related research, and manually designed reasoning questions are sometimes unreasonable because of the annotator's subjective knowledge.
Disclosure of Invention
In order to overcome the defects and shortcomings in the prior art, the invention provides the following technical scheme.
The invention provides a method for automatically generating a reading comprehension data set based on a knowledge graph, which comprises the following steps:
extracting instances and/or relations from a given question;
acquiring candidate instances and/or candidate relations corresponding to the instances from a pre-generated concept knowledge graph, and replacing the instances and/or relations in the given question with the candidate instances and/or candidate relations to generate a new question;
if the answer to the new question can be obtained from the concept knowledge graph, obtaining article segments corresponding to the instances in the new question from other data sources;
and generating a reading comprehension data set from the new question, its answer, and the article segments.
Preferably, the method further comprises the following steps:
acquiring a candidate relation corresponding to the candidate instance from the pre-generated concept knowledge graph, and replacing the instance and the relation in the given question with the candidate instance and the candidate relation corresponding to the candidate instance to generate a new question.
Preferably, acquiring the candidate instances corresponding to the instances from the pre-generated concept knowledge graph includes:
locating the instance in the concept knowledge graph;
and acquiring other instances belonging to the same concept as the instance as candidate instances.
Preferably, acquiring other instances belonging to the same concept as the instance as candidate instances includes:
in the concept knowledge graph, acquiring the upper-level concept corresponding to the instance according to the instanceOf relation, and acquiring other instances under that upper-level concept as candidate instances;
and/or
in the concept knowledge graph, acquiring the higher-level concept of the upper-level concept corresponding to the instance according to the subclassOf relation, and acquiring instances corresponding to other lower-level concepts of that higher-level concept as candidate instances.
Preferably, before replacing the instance in the given question with the candidate instance, the method further comprises: screening the candidate instances and deleting the candidate instances that do not meet the requirements.
Preferably, screening the candidate instances comprises:
screening the candidate instances according to the similarity between the instance and the candidate instances;
and/or
screening the candidate instances according to their out-degree and in-degree in the concept knowledge graph and/or their frequency in a large-scale corpus;
and/or
screening according to the relation attributes of the concepts corresponding to the instance and the candidate instances.
Preferably, acquiring a candidate instance and/or a candidate relation corresponding to the instance from the pre-generated concept knowledge graph, and replacing the instance and/or relation in the given question with the candidate instance and/or candidate relation, includes:
if a question contains one instance and one relation, recalling the other first-degree relations corresponding to the instance according to the concept knowledge graph, and replacing the relation in the question with the other first-degree relations;
or
if a question contains one instance and a plurality of relations, recalling the other last-hop relations corresponding to the instance according to the concept knowledge graph, and replacing the last-hop relation in the question with the other last-hop relations;
or
if a question contains a plurality of instances and a plurality of relations, randomly selecting one instance, recalling its other first-degree relations, and replacing the corresponding relation in the question with those other first-degree relations.
Preferably, the method further comprises the following steps:
evaluating the questions in the reading comprehension data set with a pre-trained language model, and deleting the questions, together with their answers, whose evaluation index does not meet the set condition.
A second aspect of the invention provides a storage medium storing a computer program that performs the method described above when run.
A third aspect of the present invention provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions that are loaded and executed by the processor so that the processor performs the method described above.
The invention has the following beneficial effects: with the scheme provided by the invention, a reading comprehension data set can be generated automatically, without manual participation, from an existing concept knowledge graph and a small number of simple seed questions, which greatly saves labor cost; moreover, because the concept knowledge graph contains hypernym-hyponym (upper-level and lower-level) relations, the constructed reading comprehension data set is more complex, places higher demands on reasoning ability, and has higher research value; in addition, the construction process is not limited to a specific field, so reading comprehension data sets in different fields can be constructed automatically.
Drawings
FIG. 1 is a flow chart of the method for automatically generating a reading comprehension data set based on a knowledge graph according to the present invention;
FIG. 2 is a schematic diagram of the method for obtaining candidate instances in the concept knowledge graph according to the present invention;
FIG. 3 is a schematic diagram of the method for obtaining candidate relations in the concept knowledge graph according to the present invention;
FIG. 4 is a schematic structural diagram of the device for automatically generating a reading comprehension data set based on a knowledge graph according to the present invention.
Detailed Description
For better understanding of the above technical solutions, the following detailed descriptions will be provided in conjunction with the drawings and the detailed description of the embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
Example one
As shown in FIG. 1, an embodiment of the present invention provides a method for automatically generating a reading comprehension data set based on a knowledge graph, including:
S101, extracting instances and/or relations from a given question;
S102, acquiring candidate instances and/or candidate relations corresponding to the instances from a pre-generated concept knowledge graph, and replacing the instances and/or relations in the given question with the candidate instances and/or candidate relations to generate a new question;
S103, if the answer to the new question can be obtained from the concept knowledge graph, obtaining article segments corresponding to the instances in the new question from other data sources;
and S104, generating a reading comprehension data set from the new question, its answer, and the article segments.
A question contains instances and relations. For example, the question "Who is the executive director of Company A?" contains one instance, "Company A", and one relation, "executive director".
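The text does not prescribe how instances and relations are extracted. As a minimal sketch, assuming the instance and relation names are already known (for example, taken from the concept knowledge graph), a plain string lookup illustrates the idea. The names `KNOWN_INSTANCES`, `KNOWN_RELATIONS`, and `extract` are hypothetical and not part of the described method.

```python
# Hypothetical vocabularies; in practice they might come from the concept knowledge graph.
KNOWN_INSTANCES = {"Company A", "Company B"}
KNOWN_RELATIONS = {"executive director", "address", "number of employees"}


def extract(question, instances=KNOWN_INSTANCES, relations=KNOWN_RELATIONS):
    """Return (instances, relations) mentioned in the question, in order of appearance."""
    lowered = question.lower()

    def find(vocab):
        # Record where each known term occurs; -1 means the term is absent.
        hits = [(lowered.find(term.lower()), term) for term in vocab]
        return [term for pos, term in sorted(hits) if pos >= 0]

    return find(instances), find(relations)


found_instances, found_relations = extract("Who is the executive director of Company A?")
print(found_instances, found_relations)
```

A real system would more likely use entity linking and relation detection models; the lookup above only shows the expected input and output of step S101.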
Step S101 determines the instances and relations in the question. The instances and relations determined in this step are the initial, or original, instances and relations; a newly generated question may contain new instances and/or new relations. Note that these terms are relative to the current replacement: whenever an instance and/or relation in a question is replaced to generate a new question, the question being replaced is the original question, and the instances and relations it contains are the original instances and original relations. This holds even when a newly generated question later has its own instance and/or relation replaced to generate yet another question: in that replacement, it is the original question, and its instances and relations are the original instances and original relations. What the replacement produces is the new question, the new relation, and the new instance.
Step S102 is then executed: candidate instances and/or candidate relations corresponding to the instances are acquired from the pre-generated concept knowledge graph and used to replace the instances and/or relations in the given question, generating a new question.
The concept knowledge graph contains a plurality of instances and relations; the instances are associated with one another, and each instance generally corresponds to one or more relations. Other instances belonging to the same concept as the instance in the question serve as candidate instances, and the relations of that instance other than the relation already in the question serve as candidate relations; replacing the instance and/or relation in the question with them generates a new question.
In this step, new questions may be generated in one or more of the following ways:
acquiring a candidate instance corresponding to the instance from the pre-generated concept knowledge graph, and replacing the instance in the given question with the candidate instance to generate a first group of new questions;
acquiring a candidate relation corresponding to the instance from the pre-generated concept knowledge graph, and replacing the relation in the given question with the candidate relation to generate a second group of new questions;
and acquiring a candidate instance and a candidate relation corresponding to the instance from the pre-generated concept knowledge graph, replacing the instance in the given question with the candidate instance and the relation in the given question with the candidate relation, to generate a third group of new questions.
Further, a candidate relation corresponding to the candidate instance may be acquired from the concept knowledge graph; the instance in the given question is then replaced with the candidate instance, and the relation in the given question is replaced with the candidate relation corresponding to the candidate instance, generating a fourth group of new questions. For example, after the first group of new questions is generated, the candidate relations corresponding to the candidate instances may be acquired and used to replace the relations in the first group of new questions, yielding the fourth group. The candidate relations corresponding to a candidate instance are the relations of that candidate instance in the concept knowledge graph other than the relation in the given question.
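The four groups of new questions can be sketched as follows. This is an illustrative sketch only: the question template, the toy dictionaries `CANDIDATE_INSTANCES` and `RELATIONS_OF`, and the function `generate_groups` are assumptions introduced here, not structures defined by the method.

```python
from itertools import product

# Toy data standing in for the concept knowledge graph (all names are hypothetical).
CANDIDATE_INSTANCES = {"Company A": ["Company B"]}
RELATIONS_OF = {
    "Company A": ["president", "address"],
    "Company B": ["president", "number of employees"],
}
TEMPLATE = "What is the {relation} of {instance}?"


def generate_groups(instance, relation):
    """Build the four groups of new questions for a one-instance, one-relation question."""
    cand_instances = CANDIDATE_INSTANCES.get(instance, [])
    cand_relations = [r for r in RELATIONS_OF.get(instance, []) if r != relation]

    def ask(i, r):
        return TEMPLATE.format(instance=i, relation=r)

    return {
        # Group 1: replace only the instance.
        "first": [ask(i, relation) for i in cand_instances],
        # Group 2: replace only the relation.
        "second": [ask(instance, r) for r in cand_relations],
        # Group 3: replace both, using candidate relations of the original instance.
        "third": [ask(i, r) for i, r in product(cand_instances, cand_relations)],
        # Group 4: replace both, using candidate relations of each candidate instance.
        "fourth": [ask(i, r) for i in cand_instances
                   for r in RELATIONS_OF.get(i, []) if r != relation],
    }


groups = generate_groups("Company A", "president")
for name, questions in groups.items():
    print(name, questions)
```

In this toy data the third group asks for the address of Company B, which the graph does not record; such questions are the ones removed later, when the answer cannot be obtained from the concept knowledge graph.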
Acquiring the candidate instances corresponding to the instances from the pre-generated concept knowledge graph specifically comprises the following steps:
locating the instance in the concept knowledge graph;
and acquiring other instances belonging to the same concept as the instance as candidate instances.
Specifically, in the concept knowledge graph, the upper-level concept corresponding to the instance is acquired according to the instanceOf relation, and other instances under that upper-level concept are acquired as candidate instances;
and/or, in the concept knowledge graph, the higher-level concept of the upper-level concept corresponding to the instance is acquired according to the subclassOf relation, and instances corresponding to other lower-level concepts of that higher-level concept are acquired as candidate instances.
For example, as shown in FIG. 2, the original instance in a given question is "Records of the Grand Historian". In the concept knowledge graph, the upper-level concept "history book" of this instance is acquired according to the instanceOf relation, and the other instances under "history book", namely "Book of Han", "Records of the Three Kingdoms", and "Zizhi Tongjian", are acquired as candidate instances;
and/or, in the concept knowledge graph, the higher-level concept "book" of the upper-level concept "history book" is acquired according to the subclassOf relation, and the instances "The Three-Body Problem" and "The Wandering Earth" under another lower-level concept of "book", namely "science fiction book", are acquired as candidate instances.
Therefore, according to the method of the present invention, as shown in FIG. 2, the candidate instances acquired for the original instance "Records of the Grand Historian" include "Book of Han", "Records of the Three Kingdoms", "Zizhi Tongjian", "The Three-Body Problem", "The Wandering Earth", and the like.
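The two retrieval paths (siblings via instanceOf, and instances of sibling concepts via subclassOf) can be sketched over a toy graph. The dictionary layout, the illustrative book titles, and the function name `candidate_instances` are assumptions made for this sketch; a real concept knowledge graph would typically be stored as triples in a graph database.

```python
# Toy concept knowledge graph; the layout and titles are illustrative assumptions.
INSTANCE_OF = {  # instance -> concept
    "Records of the Grand Historian": "history book",
    "Book of Han": "history book",
    "Records of the Three Kingdoms": "history book",
    "Zizhi Tongjian": "history book",
    "The Three-Body Problem": "science fiction book",
    "The Wandering Earth": "science fiction book",
}
SUBCLASS_OF = {  # concept -> higher-level concept
    "history book": "book",
    "science fiction book": "book",
}


def candidate_instances(instance):
    """Collect candidates for `instance` through instanceOf and subclassOf."""
    concept = INSTANCE_OF[instance]
    # Path 1: other instances under the same upper-level concept.
    siblings = [i for i, c in INSTANCE_OF.items() if c == concept and i != instance]
    # Path 2: instances of other concepts that share the same higher-level concept.
    parent = SUBCLASS_OF.get(concept)
    cousins = [i for i, c in INSTANCE_OF.items()
               if parent is not None and c != concept and SUBCLASS_OF.get(c) == parent]
    return siblings + cousins


print(candidate_instances("Records of the Grand Historian"))
```

Either path can be used alone, matching the "and/or" in the description; the sketch simply concatenates both.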
After the candidate instances are acquired, they replace the instance in the given question to generate the first group of new questions. For example, in the above embodiment, "Records of the Grand Historian" is replaced with "Book of Han", "Records of the Three Kingdoms", "Zizhi Tongjian", "The Three-Body Problem", and "The Wandering Earth" to create new questions. If the original question is "Who is the author of Records of the Grand Historian?", replacing the original instance with the candidate instances generates the first group of new questions: "Who is the author of Book of Han?", "Who is the author of Records of the Three Kingdoms?", "Who is the author of Zizhi Tongjian?", "Who is the author of The Three-Body Problem?", "Who is the author of The Wandering Earth?", and the like.
In a preferred embodiment of the present invention, to guard against instance errors introduced while the concept knowledge graph was constructed (for example, incomplete instances or wrongly written characters), the method further includes, before replacing the instance in the given question with the candidate instance: screening the candidate instances and deleting those that do not meet the requirements. Specifically, in the embodiment of the present invention, the candidate instances may be ranked and screened from at least one of the following perspectives:
From the perspective of the semantic information of an instance, the candidate instances are screened according to the similarity between the instance and each candidate instance, and candidate instances whose similarity to the instance does not reach a set threshold are deleted as unsatisfactory. The similarity can be calculated as follows: embeddings of the original instance and the candidate instances are obtained using algorithms such as cw2vec, node2vec, or GCN; the similarity between each candidate instance and the original instance is then calculated and compared with the set threshold; and any candidate instance whose similarity does not reach the threshold is deleted. The similarity threshold can be set manually according to actual conditions.
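The thresholding step can be sketched as below. The description does not name a similarity measure, so cosine similarity is an assumption here, and the two-dimensional vectors are toy stand-ins for embeddings that would in practice come from a trained model; `screen_by_similarity` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def screen_by_similarity(original_vec, candidate_vecs, threshold):
    """Keep the candidates whose similarity to the original reaches the threshold."""
    return [name for name, vec in candidate_vecs.items()
            if cosine(original_vec, vec) >= threshold]


# Toy embeddings (assumed values, not produced by any real model).
original = [1.0, 0.0]
candidates = {"close candidate": [0.9, 0.1], "distant candidate": [0.0, 1.0]}
kept = screen_by_similarity(original, candidates, threshold=0.8)
print(kept)
```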
From the perspective of the statistical characteristics of an instance, the candidate instances are screened according to their out-degree and in-degree in the concept knowledge graph and/or their frequency in a large-scale corpus; candidate instances whose out-degree and in-degree and/or corpus frequency do not reach the set thresholds are deleted as unsatisfactory. The thresholds for out-degree, in-degree, and/or corpus frequency can be set manually according to actual conditions.
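A minimal sketch of the statistical screen follows, assuming the graph is available as (head, relation, tail) triples and corpus frequencies as a dictionary. The sketch requires all three thresholds to be met, which is only one reading of the "and/or" above; the function name, the default thresholds, and the sample data are hypothetical.

```python
def screen_by_statistics(candidates, edges, corpus_freq,
                         min_out=1, min_in=1, min_freq=1):
    """Keep candidates whose out-degree, in-degree, and corpus frequency reach the thresholds."""
    out_deg, in_deg = {}, {}
    for head, _relation, tail in edges:
        out_deg[head] = out_deg.get(head, 0) + 1
        in_deg[tail] = in_deg.get(tail, 0) + 1
    return [c for c in candidates
            if out_deg.get(c, 0) >= min_out
            and in_deg.get(c, 0) >= min_in
            and corpus_freq.get(c, 0) >= min_freq]


# Hypothetical triples; "Compny A" mimics a misspelled, poorly connected instance.
edges = [
    ("Company A", "president", "Person X"),
    ("Person X", "works for", "Company A"),
    ("Compny A", "address", "City Y"),
]
corpus_freq = {"Company A": 120, "Compny A": 0}
kept = screen_by_statistics(["Company A", "Compny A"], edges, corpus_freq)
print(kept)
```

Erroneous instances tend to have few edges and rarely appear in text, which is why low degree and low frequency are useful signals.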
From the perspective of the relation attributes of the concept corresponding to an instance, the candidate instances are screened according to the relation attributes of the concepts corresponding to the instance and to the candidate instances; a candidate instance is deleted as unsatisfactory if the similarity between the relation attributes of its concept and those of the instance's concept does not reach the set threshold. This threshold can be set manually according to actual conditions.
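One way to compare the relation attributes of two concepts is set overlap. The sketch below assumes Jaccard similarity, which the description does not specify, and uses a hypothetical table `CONCEPT_RELATIONS` that lists the relations attached to each concept.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical relation attributes per concept.
CONCEPT_RELATIONS = {
    "history book": {"author", "dynasty", "volumes"},
    "science fiction book": {"author", "publisher", "volumes"},
    "company": {"president", "address"},
}


def screen_by_relation_attributes(original_concept, candidate_to_concept, threshold):
    """Keep candidates whose concept shares enough relation attributes with the original's."""
    base = CONCEPT_RELATIONS[original_concept]
    return [cand for cand, concept in candidate_to_concept.items()
            if jaccard(base, CONCEPT_RELATIONS[concept]) >= threshold]


kept = screen_by_relation_attributes(
    "history book",
    {"The Three-Body Problem": "science fiction book", "Company A": "company"},
    threshold=0.4,
)
print(kept)
```

A candidate whose concept shares relations such as "author" with the original concept can sensibly fill the same slot in a question, while a company cannot take the place of a book in "Who is the author of ...?".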
In practice, the candidate instances can be screened from one perspective or from several perspectives at once; in the latter case, a candidate instance is retained as satisfactory only when it meets the requirements from every perspective used, and is otherwise deleted.
In addition, the screening may be performed on all candidate instances together after they have all been determined, or each time a candidate instance is determined. Moreover, if a candidate instance is found to be unsatisfactory after it has already replaced the original instance, for example during a spot check, then the candidate instance, the newly generated question containing it, the corresponding answer, and the given article segment may all be deleted together.
In the embodiment of the present invention, different methods may be used for different questions to acquire candidate relations and replace the relation in the given question, specifically one or more of the following:
if a question contains one instance or candidate instance and one relation, recalling the other first-degree relations corresponding to the instance or candidate instance according to the concept knowledge graph, and replacing the relation in the question with the other first-degree relations;
if a question contains one instance or candidate instance and a plurality of relations, recalling the other last-hop relations corresponding to the instance or candidate instance according to the concept knowledge graph, and replacing the last-hop relation in the question with the other last-hop relations;
if a question contains a plurality of instances or candidate instances and a plurality of relations, randomly selecting one instance or candidate instance, recalling its other first-degree relations, and replacing the corresponding relation in the question with those other first-degree relations.
The above cases are illustrated below.
For example, one question is: "Who is the president of Company A?". This question contains one instance, "Company A", and one relation, "president". According to the present invention, the candidate relations are acquired by recalling the other first-degree relations corresponding to the instance "Company A" from the concept knowledge graph. As shown in FIG. 3, the other first-degree relations include, for example, "address" and "number of employees"; these are the candidate relations. The candidate relations then replace the relation in the question to generate new questions: replacing "president" with "address" and "number of employees" gives "What is the address of Company A?" and "What is the number of employees of Company A?".
As another example, one question is: "Which college did the president of Company A graduate from?". This question contains one instance, "Company A", and two relations, "president" and "graduation college". According to the present invention, the candidate relations are acquired by recalling, from the concept knowledge graph, the other relations that can take the place of the last-hop relation among the relations of the instance. As shown in FIG. 3, for the relations "president" and "graduation college" of the instance "Company A", the last-hop relation is "graduation college", and the other last-hop relations include, for example, "gender", "native place", and "educational background"; these are the candidate relations. The candidate relations then replace the last-hop relation in the given question to generate new questions: replacing "graduation college" with "gender", "native place", and "educational background" gives "What is the native place of the president of Company A?", "What is the educational background of the president of Company A?", and "What is the gender of the president of Company A?".
As another example, one question is: "Which movie stars Chen A and is directed by Wang A?". This question contains two instances, "Chen A" and "Wang A", and two relations, "lead actor" and "director". According to the present invention, the candidate relation is acquired by randomly selecting one instance, recalling its other first-degree relations, and using them to replace the corresponding relation in the question. For example, another first-degree relation of the instance "Chen A", "exhibition", is recalled as a candidate relation and replaces the relation "lead actor" in the question, giving a new question that asks for the movie related to Chen A by "exhibition" and directed by Wang A.
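The three recall strategies illustrated above can be sketched over a toy graph. The nested-dictionary layout of `GRAPH`, the entity names "Person X", "Movie M", and "Movie N", the relation "screenwriter", and all function names are hypothetical; representing the multi-instance case as (instance, relation) pairs is also an assumption of this sketch.

```python
import random

# Toy graph: entity -> {relation: target}. All entries are illustrative assumptions.
GRAPH = {
    "Company A": {"president": "Person X", "address": "City Y",
                  "number of employees": "5000"},
    "Person X": {"graduation college": "University Z", "gender": "female",
                 "native place": "City W"},
    "Chen A": {"lead actor": "Movie M", "exhibition": "Movie N"},
    "Wang A": {"director": "Movie M", "screenwriter": "Movie N"},
}


def first_degree_candidates(instance, relation):
    """Case 1: other first-degree relations of the instance."""
    return [r for r in GRAPH.get(instance, {}) if r != relation]


def last_hop_candidates(instance, relations):
    """Case 2: follow all relations but the last, then list alternatives to the last hop."""
    node = instance
    for rel in relations[:-1]:
        node = GRAPH[node][rel]
    return [r for r in GRAPH.get(node, {}) if r != relations[-1]]


def multi_instance_candidates(instance_relation_pairs, rng=random):
    """Case 3: pick one instance at random and recall its other first-degree relations."""
    instance, relation = rng.choice(instance_relation_pairs)
    return instance, first_degree_candidates(instance, relation)


print(first_degree_candidates("Company A", "president"))
print(last_hop_candidates("Company A", ["president", "graduation college"]))
print(multi_instance_candidates([("Chen A", "lead actor"), ("Wang A", "director")]))
```

Passing a seeded `random.Random` instance as `rng` makes the third case reproducible.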
Step S103 is then executed: if the answer to the new question can be obtained from the concept knowledge graph, the article segment corresponding to the instance in the new question is obtained from another data source.
In the present invention, if a new question generated by steps S101 to S102 has an answer in the concept knowledge graph, the new question is considered reasonable. The data corresponding to the instance contained in the new question can then be obtained, for example by crawling the encyclopedia page corresponding to the instance, and used as the given article segment. If the new question has no answer in the concept knowledge graph, it is considered unreasonable and cannot contribute data to the data set, so it is deleted.
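The answerability check can be sketched as a path lookup in the graph. The nested-dictionary graph, the question record fields, and the `fetch_article` callable (a stand-in for whatever crawler retrieves the encyclopedia page) are all assumptions of this sketch.

```python
def answer_from_graph(graph, instance, relations):
    """Follow the relation path from the instance; return None if any hop is missing."""
    node = instance
    for rel in relations:
        node = graph.get(node, {}).get(rel)
        if node is None:
            return None
    return node


def collect_answerable(graph, questions, fetch_article):
    """Keep only questions the graph can answer, attaching the answer and an article segment."""
    kept = []
    for q in questions:
        answer = answer_from_graph(graph, q["instance"], q["relations"])
        if answer is None:
            continue  # no answer in the graph: the question is discarded
        kept.append({**q, "answer": answer, "article": fetch_article(q["instance"])})
    return kept


# Hypothetical graph and generated questions.
graph = {
    "Company A": {"president": "Person X"},
    "Person X": {"native place": "City W"},
}
questions = [
    {"question": "What is the native place of the president of Company A?",
     "instance": "Company A", "relations": ["president", "native place"]},
    {"question": "What is the address of Company A?",
     "instance": "Company A", "relations": ["address"]},
]
kept = collect_answerable(graph, questions, lambda name: f"Article about {name}")
print(kept)
```

Only the first question survives, because the toy graph records no "address" for Company A.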
Step S104 is then executed: the reading comprehension data set is constructed from the new question, its answer, and the article segment.
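As a rough illustration of the final assembly, each retained question can be stored as one record. The field names (`id`, `context`, `question`, `answer`) and the JSON Lines serialization are assumptions; the description does not define a storage format.

```python
import json


def build_dataset(records):
    """Turn (article, question, answer) records into numbered data set entries."""
    return [{"id": i, "context": r["article"], "question": r["question"],
             "answer": r["answer"]}
            for i, r in enumerate(records)]


def to_jsonl(dataset):
    """Serialize the data set as JSON Lines, one entry per line."""
    return "\n".join(json.dumps(entry, ensure_ascii=False) for entry in dataset)


# Hypothetical record produced by the earlier steps.
dataset = build_dataset([
    {"article": "Article about Company A",
     "question": "What is the native place of the president of Company A?",
     "answer": "City W"},
])
print(to_jsonl(dataset))
```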
In a preferred embodiment of the present invention, the method further comprises:
evaluating the questions in the reading comprehension data set with a pre-trained language model, and deleting the questions, together with their answers, whose evaluation index does not meet the set condition.
The questions are evaluated so that errors such as grammatical mistakes do not make the reading comprehension data set inaccurate and affect its subsequent use.
In implementation, the evaluation returns a result that can be presented as an evaluation index. If the index does not meet the set condition, the corresponding question is deleted; if it meets the set condition, the corresponding question is retained.
If the evaluation process is completed before step S103, after the problem whose evaluation index does not meet the set condition is deleted, step S103 is executed only for the remaining problems. If the evaluation process is performed after step S103 before the reading comprehension data set is constructed, by executing step S103, although the answer and the given article fragment corresponding to the question that does not meet the set condition have been found, the question and the corresponding answer and the given article fragment are deleted when the reading comprehension data set is constructed, and are not used for constructing the reading comprehension data set. If the evaluation process is performed after step S104, although the question whose evaluation index does not meet the set condition is already saved in the reading comprehension data set, it is also deleted, and the answer corresponding to the question is also deleted from the reading comprehension data set.
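The evaluation step can be sketched as a filter over records. The text does not specify the evaluation index or the set condition, so both are passed in as callables here; `toy_score` is only a stand-in for a pre-trained language model score (it counts immediately repeated words) and is an assumption of this example.

```python
def filter_by_evaluation(records, score_fn, meets_condition):
    """Retain records whose question score satisfies the set condition;
    a dropped question is removed together with its answer."""
    return [rec for rec in records if meets_condition(score_fn(rec["question"]))]


def toy_score(question):
    """Stand-in for a language-model score: number of immediately repeated words."""
    words = question.lower().split()
    return sum(1 for a, b in zip(words, words[1:]) if a == b)


records = [
    {"question": "Who is the chairman of company A?", "answer": "person P"},
    {"question": "Who is the the chairman of company A?", "answer": "person P"},
]
kept = filter_by_evaluation(records, toy_score, lambda score: score == 0)
```

Because the filter takes a plain list of records, the same function could be applied before step S103, between S103 and S104, or after S104, as the three placements in the text describe.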
In the present invention, specifically, a pre-trained language model is used to evaluate the questions in the reading comprehension data set.
Wherein the pre-trained language model can be implemented by means of the prior art.
Other evaluation methods may also be employed, as will be appreciated by those skilled in the art.
In addition, a query statement is the computer-readable form of a question: when the computer looks up the answer to a given question and its article segments, it expresses the question as a query statement. Thus, in a specific implementation, the content of a question may include the question expressed in natural language and the corresponding computer query statement, for example:
Who is the executive director of company A?
select ?x where { ?x <executive director> <company A> }
In step S102, the instances and relations in the natural-language question and in the corresponding computer query statement are replaced synchronously.
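A minimal sketch of the synchronous replacement follows. It assumes that a term appears verbatim in the natural-language question and inside angle brackets in the query statement; the function name `replace_synchronously` and the instance "company B" are hypothetical.

```python
def replace_synchronously(question, query, old, new):
    """Replace a term in both forms of a question, or in neither.

    Assumes the term appears verbatim in the natural-language question and
    as <term> in the query statement. Returns None if either form lacks it,
    so the two forms can never drift apart.
    """
    query_token = "<" + old + ">"
    if old not in question or query_token not in query:
        return None
    return question.replace(old, new), query.replace(query_token, "<" + new + ">")


pair = replace_synchronously(
    "Who is the executive director of company A?",
    "select ?x where { ?x <executive director> <company A> }",
    "company A",
    "company B",
)
```

Returning `None` when only one form contains the term is a design choice of this sketch: it avoids producing a question whose text and query no longer describe the same thing.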
Embodiment two
As shown in FIG. 4, another aspect of the present invention provides a functional module architecture corresponding to the method of the first embodiment; that is, an embodiment of the present invention provides a knowledge-graph-based device for automatically generating a reading comprehension data set, comprising:
an instance and/or relation extraction module 201, configured to extract instances and/or relations from a given question;
a candidate instance obtaining module 202, configured to obtain candidate instances corresponding to the instances from a pre-generated concept knowledge graph;
a candidate relation obtaining module 203, configured to obtain candidate relations corresponding to the instances from the pre-generated concept knowledge graph;
a new question generation module 204, configured to replace the instances and/or relations in the given question with the candidate instances and/or candidate relations to generate a new question;
an article segment obtaining module 205, configured to obtain, if an answer to the new question can be obtained from the concept knowledge graph, article segments corresponding to the instances in the new question from other data sources;
a reading comprehension data set generation module 206, configured to generate a reading comprehension data set from the new question, its answer and the article segments.
Further, the candidate relation obtaining module is further configured to obtain, from the pre-generated concept knowledge graph, a candidate relation corresponding to the candidate instance, so that the instance and the relation in the given question are replaced by the candidate instance and its corresponding candidate relation to generate a new question.
Further, the candidate instance obtaining module is specifically configured to:
locate the instance in the concept knowledge graph;
and acquire other instances belonging to the same concept as the instance as candidate instances.
Further, the acquiring, by the candidate instance obtaining module, of other instances belonging to the same concept as the instance as candidate instances includes:
in the concept knowledge graph, acquiring the upper-layer concept corresponding to the instance according to the instanceOf relation, and acquiring the other instances under that upper-layer concept as candidate instances;
and/or, in the concept knowledge graph, acquiring the parent concept of the upper-layer concept corresponding to the instance according to the subClassOf relation, and acquiring the instances of the other child concepts of that parent concept as candidate instances.
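The two recall paths above can be sketched with two small lookup tables. The tables `INSTANCE_OF` and `SUBCLASS_OF`, their contents, and the function names are assumptions made for illustration; a real concept knowledge graph would be queried instead.

```python
# Illustrative concept knowledge graph fragments (all names are made up).
INSTANCE_OF = {
    "company A": "technology company",
    "company B": "technology company",
    "company C": "retail company",
}
SUBCLASS_OF = {
    "technology company": "company",
    "retail company": "company",
}


def same_concept_candidates(instance):
    """Other instances under the same upper-layer concept (instanceOf path)."""
    concept = INSTANCE_OF.get(instance)
    if concept is None:
        return []
    return sorted(i for i, c in INSTANCE_OF.items() if c == concept and i != instance)


def sibling_concept_candidates(instance):
    """Instances of the other child concepts of the parent concept (subClassOf path)."""
    concept = INSTANCE_OF.get(instance)
    parent = SUBCLASS_OF.get(concept)
    if parent is None:
        return []
    siblings = {c for c, p in SUBCLASS_OF.items() if p == parent and c != concept}
    return sorted(i for i, c in INSTANCE_OF.items() if c in siblings)
```

For "company A", the first function yields instances of the same concept, while the second reaches one level higher and returns instances of neighbouring concepts, which are typically less similar to the original instance.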
Further, the candidate instance obtaining module is further configured to screen the candidate instances and delete those that do not meet the requirements.
Further, the screening of the candidate instances by the candidate instance obtaining module includes:
screening the candidate instances according to the similarity between the instance and the candidate instances;
and/or screening the candidate instances according to their out-degree and in-degree in the concept knowledge graph and/or their frequency in a large-scale corpus;
and/or screening according to the relation attributes of the concepts corresponding to the instance and the candidate instances.
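One possible way to combine these screening criteria is sketched below. The text allows the criteria to be used separately or together and does not fix any measure or threshold, so everything here is an assumption: similarity is taken as the Jaccard overlap of relation sets, degree is the sum of in-degree and out-degree, and the thresholds and data are invented for the example.

```python
# Illustrative statistics for a few candidates (all values are made up).
RELATIONS = {
    "company A": {"chairman", "founded", "headquarters"},
    "company B": {"chairman", "founded", "headquarters"},
    "company C": {"chairman", "founded"},
    "company D": {"mascot"},
}
DEGREE = {"company B": 12, "company C": 1, "company D": 9}       # in-degree + out-degree
FREQUENCY = {"company B": 500, "company C": 300, "company D": 40}  # corpus counts


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def screen_candidates(instance, candidates, min_similarity=0.5, min_degree=2, min_frequency=10):
    """Keep candidates that pass all three assumed thresholds."""
    kept = []
    for cand in candidates:
        if jaccard(RELATIONS.get(instance, set()), RELATIONS.get(cand, set())) < min_similarity:
            continue  # relation sets too different from the original instance
        if DEGREE.get(cand, 0) < min_degree:
            continue  # too sparsely connected in the graph
        if FREQUENCY.get(cand, 0) < min_frequency:
            continue  # too rare in the corpus
        kept.append(cand)
    return kept


screened = screen_candidates("company A", ["company B", "company C", "company D"])
```

In this toy data "company C" fails the degree check and "company D" fails the similarity check, so only "company B" survives.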
Further, the candidate relation obtaining module is specifically configured to:
if a question includes one instance or candidate instance and one relation, recall, according to the concept knowledge graph, other first-degree relations corresponding to the instance or candidate instance as candidate relations;
or, if a question includes one instance or candidate instance and a plurality of relations, recall, according to the concept knowledge graph, other last-hop relations corresponding to the instance or candidate instance as candidate relations;
or, if a question includes a plurality of instances or candidate instances and a plurality of relations, randomly recall other first-degree relations of one instance or candidate instance as candidate relations.
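The choice among the three cases depends only on how many instances and relations a question contains, which can be sketched as a small dispatch function. The function name and the returned labels are hypothetical; combinations that the text does not cover return `None`.

```python
def choose_recall_strategy(instances, relations):
    """Pick the candidate-relation recall strategy from the shape of the question."""
    if len(instances) == 1 and len(relations) == 1:
        return "first_degree"          # recall other first-degree relations
    if len(instances) == 1 and len(relations) > 1:
        return "last_hop"              # recall other last-hop relations
    if len(instances) > 1 and len(relations) > 1:
        return "random_first_degree"   # random first-degree relation of one instance
    return None                        # shape not covered by the three cases
```

For example, `choose_recall_strategy(["company A"], ["chairman of the board", "graduation college"])` selects the last-hop strategy.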
Further, the device for automatically generating the reading comprehension data set further comprises an evaluation module, configured to evaluate the questions in the reading comprehension data set with a pre-trained language model and to delete the questions, together with their answers, whose evaluation indexes do not meet the set condition.
The device can be implemented with the knowledge-graph-based method for automatically generating a reading comprehension data set provided in the first embodiment; for the specific implementation, refer to the description of the first embodiment, which is not repeated here.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a computer program, and the method of the embodiment can be executed when the computer program runs.
The embodiment of the present invention further provides an electronic device, which includes a processor and a memory connected to the processor, where the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor, so that the processor can execute the method according to the first embodiment.
Compared with prior-art methods for constructing reading comprehension data sets, the technical scheme disclosed by the invention constructs the reading comprehension data set automatically by means of the concept knowledge graph, and has the following advantages:
First, the scheme needs only an existing concept knowledge graph, a small number of seed questions and their corresponding query statements. From these it extracts the instances and relations in the questions, obtains candidate instances and candidate relations, substitutes them into the questions to generate new questions, and finally obtains the answers to the new questions and the data corresponding to the instances. The reading comprehension data set is thus constructed automatically, which greatly reduces labor cost and is of considerable economic value.
Second, the scheme is based on the concept knowledge graph, which summarizes and abstracts existing human knowledge. By means of the hypernym-hyponym relations in the concept knowledge graph, a more complex reading comprehension data set can be constructed whose questions can be answered only with reasoning ability, which gives the scheme higher research value.
In addition, the scheme is not limited to a specific field. In application, reading comprehension data sets for different fields can be constructed automatically simply by replacing the specific content of the concept knowledge graph and the small number of seed questions.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A knowledge-graph-based method for automatically generating a reading comprehension data set, characterized by comprising the following steps:
extracting instances and/or relations from a given question;
acquiring, from a pre-generated concept knowledge graph, candidate instances and/or candidate relations corresponding to the instances, and replacing the instances and/or relations in the given question with the candidate instances and/or candidate relations to generate a new question;
if an answer to the new question can be obtained from the concept knowledge graph, obtaining article segments corresponding to the instances in the new question from other data sources;
and generating a reading comprehension data set from the new question, its answer and the article segments.
2. The knowledge-graph-based method for automatically generating a reading comprehension data set according to claim 1, further comprising:
acquiring, from the pre-generated concept knowledge graph, a candidate relation corresponding to the candidate instance, and replacing the instance and the relation in the given question with the candidate instance and its corresponding candidate relation to generate a new question.
3. The method of claim 1, wherein the acquiring of candidate instances corresponding to the instances from the pre-generated concept knowledge graph comprises:
locating the instance in the concept knowledge graph;
and acquiring other instances belonging to the same concept as the instance as candidate instances.
4. The knowledge-graph-based method for automatically generating a reading comprehension data set according to claim 3, wherein the acquiring of other instances belonging to the same concept as the instance as candidate instances comprises:
in the concept knowledge graph, acquiring the upper-layer concept corresponding to the instance according to the instanceOf relation, and acquiring the other instances under that upper-layer concept as candidate instances;
and/or
in the concept knowledge graph, acquiring the parent concept of the upper-layer concept corresponding to the instance according to the subClassOf relation, and acquiring the instances of the other child concepts of that parent concept as candidate instances.
5. The knowledge-graph-based method for automatically generating a reading comprehension data set according to claim 1, further comprising, before replacing the instance in the given question with the candidate instance: screening the candidate instances and deleting those that do not meet the requirements.
6. The knowledge-graph-based method for automatically generating a reading comprehension data set according to claim 5, wherein the screening of the candidate instances comprises:
screening the candidate instances according to the similarity between the instance and the candidate instances;
and/or
screening the candidate instances according to their out-degree and in-degree in the concept knowledge graph and/or their frequency in a large-scale corpus;
and/or
screening according to the relation attributes of the concepts corresponding to the instance and the candidate instances.
7. The method of claim 1, wherein the acquiring of candidate instances and/or candidate relations corresponding to the instances from the pre-generated concept knowledge graph and the replacing of the instances and/or relations in the given question with the candidate instances and/or candidate relations comprise:
if a question includes one instance and one relation, recalling other first-degree relations corresponding to the instance according to the concept knowledge graph, and replacing the relation in the question with the other first-degree relations;
or
if a question includes one instance and a plurality of relations, recalling other last-hop relations corresponding to the instance according to the concept knowledge graph, and replacing the last-hop relation in the question with the other last-hop relations;
or
if a question includes a plurality of instances and a plurality of relations, randomly recalling other first-degree relations of one instance, and replacing the corresponding relation in the question with the other first-degree relations of that instance.
8. The knowledge-graph-based method for automatically generating a reading comprehension data set according to claim 1, further comprising:
evaluating the questions in the reading comprehension data set with a pre-trained language model, and deleting the questions, together with their answers, whose evaluation indexes do not meet the set condition.
9. A storage medium, characterized in that the storage medium stores a computer program which is capable of performing the method according to any one of claims 1 to 8 when the computer program runs.
10. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method according to any of claims 1-8.
CN202010991922.9A 2020-09-21 2020-09-21 Reading comprehension data set automatic generation method and device based on knowledge graph Active CN111831812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010991922.9A CN111831812B (en) 2020-09-21 2020-09-21 Reading comprehension data set automatic generation method and device based on knowledge graph

Publications (2)

Publication Number Publication Date
CN111831812A true CN111831812A (en) 2020-10-27
CN111831812B CN111831812B (en) 2020-12-15

Family

ID=72918500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010991922.9A Active CN111831812B (en) 2020-09-21 2020-09-21 Reading comprehension data set automatic generation method and device based on knowledge graph

Country Status (1)

Country Link
CN (1) CN111831812B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170308531A1 (en) * 2015-01-14 2017-10-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method, system and storage medium for implementing intelligent question answering
CN109829037A (en) * 2017-11-22 2019-05-31 上海智臻智能网络科技股份有限公司 Method, system, server and the storage medium of intelligent automatic question answering
CN109947912A (en) * 2019-01-25 2019-06-28 四川大学 A kind of model method based on paragraph internal reasoning and combined problem answer matches
CN110347803A (en) * 2019-07-18 2019-10-18 北京百度网讯科技有限公司 Obtain method and apparatus, the electronic equipment, readable medium read and understand material

Also Published As

Publication number Publication date
CN111831812B (en) 2020-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zhao Hanyu

Inventor after: Yuan Sha

Inventor after: Xie Niantao

Inventor after: Ma Quanyue

Inventor after: Cao Gang

Inventor before: Zhao Hanyu

Inventor before: Yuan Sha

Inventor before: Tang Jie

Inventor before: Xie Niantao

Inventor before: Ma Quanyue

Inventor before: Cao Gang