CN113704499A

CN113704499A - Accurate and efficient intelligent education knowledge graph construction method

Info

Publication number: CN113704499A
Application number: CN202111038104.8A
Authority: CN
Inventors: 徐强
Original assignee: Guangdong Zhaoyang Information Technology Co ltd
Current assignee: Guangdong Zhaoyang Information Technology Co ltd
Priority date: 2020-09-24
Filing date: 2021-09-06
Publication date: 2021-11-26

Abstract

The invention provides an accurate and efficient method for constructing an intelligent education knowledge graph. The ontology structure of the knowledge graph is constructed from the authoritative textbooks and supplementary teaching materials of a discipline. Ontology construction is semi-automatic: ontology knowledge is obtained using statistical and unsupervised methods and combined with ontology knowledge from other knowledge graphs, and the ontology is built under expert guidance and refined during crowdsourced semi-automatic semantic annotation. Structured external data sources are processed according to the ontology structure of the discipline knowledge graph to obtain RDF external-source data. The annotated data are then used as training data, and entities and relations are extracted from internet text according to the ontology structure using supervised, semi-supervised and unsupervised methods to obtain expanded data. The method can perform targeted text conversion and error-correction normalization on educational knowledge point data, has high accuracy and reliability, and constructs the most accurate knowledge graph in the most efficient manner.

Description

Accurate and efficient intelligent education knowledge graph construction method

[ technical field ]

The invention relates to the technical field of intelligent teaching, and in particular to an accurate and efficient method for constructing an intelligent education knowledge graph.

[ background of the invention ]

Artificial intelligence technology is widely applied in the teaching field and runs through five major links: teaching, learning, practicing, testing and evaluating. For example, teaching: intelligent lesson preparation systems and teacher capability graphs; learning: personalized learning content and intelligent learning paths; practicing: personalized exercises and speech practice; testing: third-party assessment; evaluating: learning report feedback and classroom behavior monitoring. Artificial intelligence technology can basically cover these links of the teaching process and helps teachers and students process educational knowledge data in a targeted and accurate way and teach and learn efficiently. However, in both teaching and learning, the volume of educational knowledge data is huge and its structure is complex. Although teachers and students can look up knowledge points with a search engine, doing so is time-consuming and inefficient, and the quality of the search results is uneven. To make the learning of educational knowledge more comprehensive, educational knowledge data need to be accurately mined and associated so that a corresponding educational knowledge graph can be constructed.

Constructing a knowledge graph is often costly. Because current natural language processing methods are still imperfect, fully automatic construction rarely yields a reasonably accurate knowledge graph; DBpedia and YAGO, for example, contain many errors. Fully manual construction guarantees accuracy but requires enormous labor and time, so building a large-scale knowledge graph entirely by hand is almost impossible. How to reconcile accuracy and efficiency, balance automated methods with manual participation, and construct the most accurate knowledge graph in the most efficient way is therefore a major open problem in knowledge graph construction.

[ summary of the invention ]

To address the defects of the prior art, the invention provides an accurate and efficient method for constructing an intelligent education knowledge graph. By making full use of high-quality professional data in the domain and massive internet data, the method efficiently constructs a practically usable domain knowledge graph with high accuracy.

In order to solve the technical problems, the invention provides an accurate and efficient educational knowledge graph construction method, which comprises the following steps:

S1, domain ontology construction: the ontology structure of the knowledge graph is constructed from the authoritative textbooks and supplementary teaching materials of the discipline; a semi-automatic ontology construction method is adopted, in which ontology knowledge is obtained using statistical and unsupervised methods and combined with ontology knowledge from other knowledge graphs, and the ontology is built under expert guidance and refined during crowdsourced semi-automatic semantic annotation;

S2, crowdsourced semi-automatic semantic annotation: text pages are crowdsourced to multiple annotators, who annotate them with a semantic annotation tool according to the constructed ontology to obtain high-quality annotated data;

S3, external-source data completion: well-structured data from other sources are processed according to the ontology structure (that is, the structured external data sources are processed according to the ontology structure of the discipline knowledge graph) and then integrated with the annotated data;

S4, information extraction: the annotated data are used as training data, and entities and relations are extracted from internet text according to the ontology structure of the discipline knowledge graph using supervised, semi-supervised and unsupervised methods to obtain expanded data.

Further, in step S1, constructing the ontology structure of the knowledge graph from the authoritative textbooks and supplementary teaching materials of the discipline specifically includes the following steps:

S101, summarizing the domain core concepts: domain terms are first obtained using statistical methods, and the domain core concepts are derived from these terms; higher-quality knowledge graphs or data sources are then consulted, and the concepts are supplemented and refined during the crowdsourced semi-automatic semantic annotation step; the resulting core concepts are organized with reference to two basic principles of ontology construction, namely that the design of classes in the ontology should follow the principles of independence and sharing;

S102, defining the domain relations and their constraints: relations are core basic elements of the ontology and describe the interactions between concepts and instances in the domain; the relations directly determine the richness of knowledge in the ontology-based knowledge graph and the functional scope of other application systems built on the knowledge graph;

S103, ontology review: constructing the domain ontology requires the participation and cooperation of domain experts; the ontology is revised and refined according to the experts' guidance to obtain the final discipline domain ontology.

Further, the number of classes contained in the ontology of step S101 should be kept to a minimum, and redundant classes should be removed wherever possible.

Further, step S102 also includes the following steps: (1) performing unsupervised open relation extraction on texts in the geography discipline using an OpenIE method, and identifying meaningful relations in the extraction results; (2) consulting higher-quality knowledge graphs or data sources; (3) determining relations from the domain core concepts and encyclopedia infoboxes: each domain core concept has many instances, most of which have corresponding infoboxes in encyclopedias, and the important relations under a concept can be obtained by aggregating the infobox relations of multiple instances of that concept; (4) supplementing new relations during crowdsourced semi-automatic semantic annotation: if a new relation is found during annotation that cannot be expressed by the existing relations, it needs to be added.
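As a minimal sketch of sub-step (3), the relation names in the information boxes (infoboxes) of several instances of one concept can be aggregated by counting how many instances carry each name. The function name `candidate_relations`, the `min_support` threshold and the sample data are illustrative assumptions and are not specified in the source.

```python
from collections import Counter


def candidate_relations(infoboxes, min_support=0.5):
    """Return relation names found in at least `min_support` of the infoboxes.

    `infoboxes` maps an instance name to the set of relation names that
    appear in its encyclopedia infobox. The support threshold is an
    assumption made for this sketch.
    """
    counts = Counter()
    for fields in infoboxes.values():
        counts.update(set(fields))  # count each relation name once per instance
    total = len(infoboxes)
    kept = [name for name, c in counts.items() if c / total >= min_support]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda name: (-counts[name], name))


# Hypothetical infoboxes for three instances of the concept "river".
river_infoboxes = {
    "Yangtze": {"length", "source", "mouth", "basin area"},
    "Yellow River": {"length", "source", "mouth"},
    "Pearl River": {"length", "mouth", "alias"},
}
print(candidate_relations(river_infoboxes, min_support=0.6))
```

Relations that survive the threshold are only candidates; under the method described above they would still be reviewed by experts and refined during annotation.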

Further, in step S2, annotating with a semantic annotation tool to obtain high-quality annotated data specifically means: using a crowdsourced semi-automatic semantic annotation tool developed on the basis of Pundit, the HTML text obtained by digitizing the discipline's textbooks and supplementary teaching materials is taken as the annotation object and the discipline domain ontology as the annotation basis; semi-automatic semantic annotation is performed with the semantic annotation system to produce annotated data, and the discipline domain ontology is refined in the process. Semantic annotation based on a domain ontology is the process of extracting structured knowledge from a document under the guidance of the domain ontology, that is, describing the plain-text knowledge in the document in RDF (Resource Description Framework). The semantic annotation process generally includes two steps: (1) type annotation: marking the words in the document that correspond to concepts in the ontology and treating them as instances of those concepts; (2) relation annotation: finding the relations between instances that correspond to relations in the ontology, which enriches the information attached to the instances. During semantic annotation, instances and the relations between them are represented as triples (E1, R, E2), where R is the relation between instances E1 and E2.

Further, when instances and the relations between them are represented as triples, high-quality triples are obtained through the following steps:

S301, acquiring the instance set and the relation name set: the instance set of each concept c is denoted E = {e1, …, eN}; the Baidu Encyclopedia infobox corresponding to each instance ei is crawled to obtain the set of relation names in all infoboxes, R = {r1, …, rM}, of size M;

S302, connecting edges: if the infobox of instance ei contains rj, the weight of the edge between ei and rj is set to 1; otherwise it is set to 0; to avoid a sparse graph, edges between instances and between relation names are also added; for instance-to-instance edges, a relation name vector V is first defined for each instance, with dimension equal to the size M of the relation name set; if relation name rk appears in the infobox of the instance, the k-th component is set to 1, otherwise to 0; the cosine similarity between the relation name vectors of two instances is then used as the weight of the edge between them; similarly, an instance vector is defined for each relation name, and the cosine similarity between the instance vectors of two relation names is used as the weight of the edge between them;

S303, iterative computation: a graph reinforcement algorithm is run iteratively to obtain a typicality ranking of the instances and relation names under each concept;

S304, adding the relation names with high typicality, together with their value information, to the knowledge graph.
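Steps S301 to S303 can be sketched as follows. The source names a graph reinforcement algorithm but does not give its update rule, so the iteration below is an assumed HITS-style mutual reinforcement over the instance-to-relation edges only; the function names are hypothetical, and how the instance-to-instance and relation-to-relation weights enter the iteration is left open here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(infoboxes):
    """S301/S302: build edge weights from {instance: set of relation names}."""
    instances = sorted(infoboxes)
    relations = sorted({r for fields in infoboxes.values() for r in fields})
    # Instance-relation edge: 1 if the infobox contains the relation name, else 0.
    w_er = [[1.0 if r in infoboxes[e] else 0.0 for r in relations] for e in instances]
    # Rows of w_er are the relation name vectors of instances;
    # columns are the instance vectors of relation names.
    columns = [list(col) for col in zip(*w_er)]
    w_ee = [[cosine(a, b) for b in w_er] for a in w_er]
    w_rr = [[cosine(a, b) for b in columns] for a in columns]
    return instances, relations, w_er, w_ee, w_rr


def typicality(w_er, iterations=50):
    """S303 (assumed update rule): instances and relation names reinforce each other."""
    n, m = len(w_er), len(w_er[0])
    e_score, r_score = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        r_score = [sum(w_er[i][j] * e_score[i] for i in range(n)) for j in range(m)]
        total = sum(r_score) or 1.0
        r_score = [s / total for s in r_score]
        e_score = [sum(w_er[i][j] * r_score[j] for j in range(m)) for i in range(n)]
        total = sum(e_score) or 1.0
        e_score = [s / total for s in e_score]
    return e_score, r_score


# Hypothetical infoboxes for instances of one concept.
boxes = {"A": {"length", "mouth"}, "B": {"length"}, "C": {"length", "alias"}}
instances, relations, w_er, w_ee, w_rr = build_graph(boxes)
e_score, r_score = typicality(w_er)
ranking = sorted(zip(relations, r_score), key=lambda p: -p[1])
print(ranking)  # relation names ordered by typicality, highest first
```

Under this sketch, step S304 would keep the top-ranked relation names; the cut-off is a design choice that the source does not fix.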

Further, the semantic annotation system, as a key system for knowledge graph construction, must meet the following main requirements:

(1) annotation basis: the semantic annotation system provides ontology-based semantic annotation, so one or more ontology description files must be imported, or the system must be configured with files containing ontology information, to serve as its basic annotation basis;

(2) annotation object: most textbook and supplementary-book data are stored as static web page files, so the semantic annotation system needs to support annotation of static web page files;

(3) annotation mode: the semantic annotation system must provide the basic annotation functions, namely type annotation and relation annotation; in addition, because the textbook and supplementary-book data contain a large number of pictures that need to be annotated, the system also needs to support picture annotation;

(4) ontology language: the semantic annotation system supports at least one of the ontology languages RDF(S), DAML+OIL, XML and OWL.

Further, in view of the goal of constructing a geography discipline knowledge graph, the semantic annotation system must also meet the following requirements:

(1) collaborative annotation: the semantic annotation system is based on the B/S (browser/server) mode;

(2) annotation review: the semantic annotation system should provide a degree of user permission control; users are mainly annotators and auditors, where an annotator can edit and delete only his or her own annotation records, and an auditor can edit and delete the annotation records of all annotators on the current page;

(3) annotation provenance: for every piece of knowledge generated by annotating a page, metadata that allow it to be traced back to its specific annotation source must be stored when the knowledge is generated; provenance is usually implemented with XPointer, a language for locating data in an XML file by characteristics such as position, character content or attribute values;

(4) storage of annotated data: the annotated data are stored in an RDF database, preferably Sesame, which implements a general RDF data management framework and provides programming interfaces that make it easy to integrate different storage systems, reasoning engines and query engines;

(5) coreference resolution: through the instance query capability of the semantic annotation tool, an existing instance can be selected for annotation when the same instance is encountered again, which avoids generating duplicate, redundant instances and the instance coreference problem they would cause.

To meet the above requirements, and in view of the geography discipline knowledge graph to be constructed, a corresponding semantic annotation architecture is provided: on the basis of the geography discipline ontology and a resource management system, annotators use the semantic annotation system to generate an annotation database, which is finally cleaned and exported as annotated data.
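The review, provenance and coreference requirements above can be illustrated with a small in-memory model. This is only a sketch under assumptions: the class and field names (`Annotation`, `AnnotationStore`, `xpointer`) and the identifier format are hypothetical, and a real deployment would store records in an RDF database rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    triple: tuple   # (E1, R, E2)
    annotator: str  # user who created the record
    page_url: str   # provenance: the annotated page
    xpointer: str   # provenance: location of the annotated text within the page


@dataclass
class AnnotationStore:
    records: list = field(default_factory=list)
    instances: dict = field(default_factory=dict)  # label -> instance identifier

    def get_or_create_instance(self, label):
        """Coreference: reuse the existing instance for a label instead of minting a duplicate."""
        return self.instances.setdefault(label, f"inst:{len(self.instances) + 1}")

    def can_modify(self, user, role, record):
        """Review: auditors may change any record, annotators only their own."""
        if role == "auditor":
            return True
        return role == "annotator" and record.annotator == user


store = AnnotationStore()
river = store.get_or_create_instance("Yangtze")
sea = store.get_or_create_instance("East China Sea")
record = Annotation(
    triple=(river, "flowsInto", sea),
    annotator="alice",
    page_url="http://example.org/textbook/page12.html",  # illustrative URL
    xpointer="xpointer(/html/body/p[3])",                 # illustrative pointer
)
store.records.append(record)
print(store.can_modify("bob", "annotator", record))  # another annotator cannot edit it
```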

Further, the information extraction of step S4 specifically includes the following steps:

S401, entity set expansion: expanded data are RDF triple data extracted from text with methods such as machine learning, using the previously obtained annotated data and external-source data; expansion is performed on the entity set of each concept in the knowledge graph, and the method used is word vectors; a word vector model is trained on a large text corpus so that each word is mapped to a vector of fixed dimension, and the semantic relatedness of two words can then be measured by the cosine distance between their vectors;

S402, relation extraction: relations are extracted using unsupervised, supervised and semi-supervised methods; the unsupervised methods are a rule-based method and an LDA model; in the rule-based method, a regular expression template is defined for each relation to be extracted, and text descriptions of that relation are then extracted from the text; the LDA model is an unsupervised machine learning technique that can identify relation class information hidden in text and represents the features of each relation class as a bag of words; in the supervised method, existing relation data in the knowledge graph are used as training data and the corresponding triples are then extracted from text, with a relatively simple multilayer perceptron used to prevent overfitting while the available data are still insufficient; in the semi-supervised method, a distant supervision method based on a multilingual attention mechanism is adopted, which exploits the information that is consistent across languages to achieve better extraction than a single language allows.
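The rule-based part of step S402 can be sketched with a regular expression template. The template, the relation name `capitalOf` and the English sample sentence are illustrative assumptions; the source does not list its templates, and templates for Chinese text would look different.

```python
import re

# Hypothetical template for one relation: "<E1> is the capital of <E2>".
CAPITAL_OF = re.compile(
    r"(?P<e1>[A-Z][\w ]*?) is the capital of (?P<e2>[A-Z][\w ]*?)[.,;]"
)


def extract(text, pattern, relation):
    """Return (E1, R, E2) triples for every match of `pattern` in `text`."""
    return [(m.group("e1"), relation, m.group("e2")) for m in pattern.finditer(text)]


text = (
    "Beijing is the capital of China. "
    "Paris is the capital of France, and Lyon is a large city."
)
print(extract(text, CAPITAL_OF, "capitalOf"))
# [('Beijing', 'capitalOf', 'China'), ('Paris', 'capitalOf', 'France')]
```

Templates like this tend to be precise but narrow, which is consistent with the source combining them with LDA, supervised and distantly supervised extraction.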

Further, after the entity set is expanded in step S401, an entity disambiguation operation is also included.
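The entity set expansion of step S401 can be sketched with cosine similarity between word vectors. The toy three-dimensional vectors, the mean-similarity scoring and the threshold are assumptions made for illustration; in practice the vectors would come from a model trained on a large corpus, and the accepted candidates would then go through entity disambiguation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_entity_set(seeds, vectors, threshold=0.8):
    """Return (word, score) pairs whose mean similarity to the seed entities
    reaches `threshold`, best first. Scoring by the mean is an assumption."""
    scored = []
    for word, vec in vectors.items():
        if word in seeds:
            continue
        score = sum(cosine(vec, vectors[s]) for s in seeds) / len(seeds)
        if score >= threshold:
            scored.append((word, score))
    return sorted(scored, key=lambda pair: -pair[1])


# Toy vectors standing in for trained word embeddings.
vectors = {
    "Yangtze": [0.9, 0.1, 0.0],
    "Yellow River": [0.8, 0.2, 0.0],
    "Pearl River": [0.85, 0.15, 0.05],
    "Mount Tai": [0.1, 0.9, 0.1],
}
seeds = {"Yangtze", "Yellow River"}  # known instances of the concept "river"
print(expand_entity_set(seeds, vectors))
```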

The invention mainly has the following beneficial effects:

With this technical solution, high-quality professional data in the domain and massive internet data can be fully utilized, and targeted text conversion and error-correction normalization can be performed on educational knowledge point data, which improves accuracy and reliability. The solution reconciles accuracy and efficiency, balances automated methods with manual participation, and constructs the most accurate, practically usable domain knowledge graph in the most efficient way.

[ description of the drawings ]

FIG. 1 is a schematic flow chart of an accurate and efficient educational knowledge graph construction method of the present invention;

FIG. 2 is a diagram of a discipline knowledge graph construction route using the accurate and efficient educational knowledge graph construction method of the present invention;

FIG. 3 is an architecture diagram of semantic annotation in the method for constructing an accurate and efficient educational knowledge graph according to the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 2, the invention is further explained by taking the construction of a geography discipline knowledge graph for basic education as an example. The intelligent education knowledge graph construction method of the invention provides a systematic solution, the "four-step method": step 1, domain ontology construction; step 2, crowdsourced semi-automatic semantic annotation; step 3, external-source data completion; and step 4, information extraction.

Step 1, domain ontology construction, builds the ontology structure of the knowledge graph, which can be understood as the skeleton of the knowledge graph. Step 2, crowdsourced semi-automatic semantic annotation, distributes text pages to multiple annotators, who annotate them with a semantic annotation tool according to the ontology constructed in step 1 to obtain high-quality annotated data. Step 3, external-source data completion, processes well-structured data from other sources according to the ontology structure and then integrates them with the annotated data. Step 4, information extraction, extracts information from text at scale for entities or relations that are sparse in the knowledge graph. Steps 1 and 2 form the skeleton of the knowledge graph and are its foundation and core. The two steps iterate with each other: the ontology guides annotation, and new situations encountered during annotation feed back to improve the ontology structure. Steps 1 and 2 ensure the accuracy of the knowledge graph. Targeted, controllable expansion and completion are then carried out on the basis of the high-quality annotated data obtained in steps 1 and 2, which ensures the coverage of the knowledge graph and the efficiency of its construction. Steps 3 and 4 form the flesh and blood of the knowledge graph and also iterate with each other: step 4 can extract information from text using the relations and entities obtained in step 3, and step 3 can use the new entities and relations extracted in step 4 to supplement the knowledge graph with related knowledge from structured data of other sources. See FIG. 1 for the process of building a domain knowledge graph with the "four-step method".

Through these four steps, high-quality professional data in the domain and massive internet data can be fully utilized, and a practically usable domain knowledge graph with high accuracy can be constructed efficiently.

Therefore, as shown in fig. 1, the method for constructing an accurate and efficient educational knowledge graph according to the present invention comprises the following steps:

S1, domain ontology construction. The ontology structure of the knowledge graph (which can be understood as the skeleton of the knowledge graph) is constructed from the authoritative textbooks and supplementary teaching materials of the discipline. A semi-automatic ontology construction method is adopted: ontology knowledge is obtained using statistical and unsupervised methods and combined with ontology knowledge from other knowledge graphs, and the ontology is built under expert guidance and refined during crowdsourced semi-automatic semantic annotation. (Manual ontology construction is usually carried out by a large number of domain experts working together. Automatic ontology construction, often called ontology learning, aims to obtain ontology knowledge automatically from data resources using knowledge acquisition, machine learning, statistical and similar techniques, thereby reducing the cost of ontology construction. Semi-automatic construction lies between the two. For most domains, fully automatic ontology construction is difficult to achieve, so automatic construction usually has to be carried out under the guidance of the user.);

S2, crowdsourced semi-automatic semantic annotation. Text pages are crowdsourced to multiple annotators, who annotate them with a semantic annotation tool according to the constructed ontology to obtain high-quality annotated data;

S3, external-source data completion. Well-structured data from other sources are processed according to the ontology structure and then integrated with the annotated data;

S4, information extraction. The annotated data are used as training data, and entities and relations are extracted from internet text such as Baidu Encyclopedia according to the ontology structure of the discipline knowledge graph using supervised, semi-supervised and unsupervised methods to obtain expanded data.

Steps S1 and S2 form the skeleton of the knowledge graph and are its foundation and core. They iterate with each other: the ontology guides annotation, and new situations encountered during annotation feed back to improve the ontology structure, which ensures the accuracy of the knowledge graph. Targeted, controllable expansion and completion are carried out on the basis of the high-quality annotated data obtained in steps S1 and S2, which ensures the coverage of the knowledge graph and the efficiency of its construction. Steps S3 and S4 form the flesh and blood of the knowledge graph and also iterate with each other: step S4 can extract information from text using the relations and entities obtained in step S3, and step S3 can use the new entities and relations extracted in step S4 to supplement the knowledge graph with related knowledge from structured data of other sources.

In step S1 of the method of the invention, constructing the ontology structure of the knowledge graph (which can be understood as the skeleton of the knowledge graph) from the authoritative textbooks and supplementary teaching materials of the discipline specifically includes the following steps:

Step S101, summarizing the domain core concepts. Domain terms are first obtained using statistical methods, and the domain core concepts are derived from these terms; higher-quality knowledge graphs or data sources are then consulted, and the concepts are supplemented and refined during the crowdsourced semi-automatic semantic annotation step. The resulting core concepts are organized with reference to two basic principles of ontology construction: the design of classes in the ontology should follow the principles of independence and sharing. Independence means that a class can exist on its own without depending on a specific domain; sharing means that a class can be shared, that is, reusing it is both possible and necessary. Furthermore, the number of classes contained in the ontology should be kept to a minimum, and redundant classes should be removed wherever possible.

Step S102, defining the domain relations and their constraints. Relations are core basic elements of an ontology and describe the interactions between concepts and instances in the domain. The relations directly determine the richness of knowledge in the ontology-based knowledge graph and the functional scope of other application systems built on the knowledge graph. Step S102 further includes the following steps: (1) performing unsupervised open relation extraction on texts in the geography discipline using an OpenIE method, and identifying meaningful relations in the extraction results; (2) consulting higher-quality knowledge graphs or data sources, such as Baidu Encyclopedia; (3) determining relations from the core concepts and encyclopedia infoboxes: each core concept has many instances, most of which have corresponding infoboxes in encyclopedias, and the important relations under a concept can be obtained by aggregating the infobox relations of multiple instances of that concept; (4) supplementing new relations during crowdsourced semi-automatic semantic annotation: if a new relation is found during annotation that cannot be expressed by the existing relations, it needs to be added.

Step S103, ontology review. Constructing the domain ontology requires the participation and cooperation of domain experts. The ontology is therefore revised and refined according to the experts' guidance to obtain the final discipline domain ontology.

In step S2, crowdsourced semi-automatic semantic annotation uses a crowdsourced semi-automatic semantic annotation tool developed on the basis of Pundit. The HTML text obtained by digitizing the discipline's textbooks and supplementary teaching materials is taken as the annotation object, and semi-automatic semantic annotation is performed with the semantic annotation system on the basis of the discipline domain ontology to produce annotated data; the discipline domain ontology is refined in the process. The Pundit-based tool meets the requirements of annotation review, annotation provenance, coreference resolution, data storage and so on that arise in crowdsourced annotation, and greatly improves annotation efficiency. Semantic annotation means marking up raw data so that they carry semantic information that both people and machines can understand. In the invention, semantic annotation based on a domain ontology is the process of extracting structured knowledge from a document under the guidance of the domain ontology, that is, describing the plain-text knowledge in the document in RDF. The semantic annotation process generally includes two steps: (1) type annotation, which marks the words in the document that correspond to concepts in the ontology and treats them as instances of those concepts; (2) relation annotation, which finds the relations between instances that correspond to relations in the ontology and thereby enriches the information attached to the instances. Instances and the relations between them are usually represented as triples (E1, R, E2), where R is the relation between instances E1 and E2.
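The result of type annotation and relation annotation can be written out as RDF statements. The sketch below serializes them in N-Triples syntax using only the standard library; the namespace `http://example.org/geo/`, the helper names and the sample instances are hypothetical, while the `rdf:type` IRI is the standard one.

```python
from urllib.parse import quote

BASE = "http://example.org/geo/"  # hypothetical namespace for this sketch
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(local_name):
    """Build an IRI reference from a label, replacing spaces and escaping the rest."""
    return f"<{BASE}{quote(local_name.replace(' ', '_'))}>"


def type_statement(instance, concept):
    """Type annotation: the word is an instance of an ontology concept."""
    return f"{iri(instance)} {RDF_TYPE} {iri(concept)} ."


def relation_statement(e1, relation, e2):
    """Relation annotation: the triple (E1, R, E2)."""
    return f"{iri(e1)} {iri(relation)} {iri(e2)} ."


lines = [
    type_statement("Yangtze", "River"),
    type_statement("East China Sea", "Sea"),
    relation_statement("Yangtze", "flowsInto", "East China Sea"),
]
print("\n".join(lines))
```

Output in this form can be loaded by any RDF store that accepts N-Triples, which fits the requirement that annotated data be kept in an RDF database.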

From the comparison results, it can be concluded that the semantic annotation system is a key system for knowledge graph construction, and that its main requirements are as follows:

(1) annotation basis: the semantic annotation system provides ontology-based semantic annotation, so it must be possible to import one or more ontology description files, or to configure the system with files containing ontology information, so that the system has a basic annotation basis;

(2) annotation object: semantic annotation systems generally support annotation of text files or static web page files; at present most teaching materials, such as supplementary books, are stored as static web page files, so the semantic annotation system needs to support annotation of static web page files;

(4) ontology language: most semantic annotation tools currently support only one or a few ontology languages, such as RDF(S), DAML+OIL and XML, and few support OWL, the latest ontology description language recommended by the W3C; to make better use of different ontology languages, the semantic annotation system should support the current mainstream ontology languages, such as one or more of RDF(S), DAML+OIL, XML and OWL.

In addition to the 4 basic requirements above, and in view of the geography discipline knowledge graph we are building, we consider the following requirements equally important for the semantic annotation system:

(1) collaborative annotation: early semantic annotation systems generally used the C/S (client/server) mode, which requires annotators to install a client and makes both software configuration and the annotation process inconvenient; with the development of the internet, semantic annotation systems based on the B/S (browser/server) mode have gradually appeared, and these conveniently support collaborative annotation by a large number of annotators, which markedly increases annotation speed;

(2) annotation review: the annotation system should provide a degree of user permission control; in the simple case there are two types of user, annotators and auditors, where an annotator can edit and delete only his or her own annotation records, and an auditor can edit and delete the annotation records of all annotators on the current page;

(3) annotation provenance: for every piece of knowledge generated by annotating a page, metadata that allow it to be traced back to its specific annotation source must be stored when the knowledge is generated; provenance is usually implemented with XPointer, a language for locating data in an XML file by characteristics such as position, character content or attribute values;

(4) storage of annotated data: storage of the annotated data is also a key consideration; many excellent RDF databases are currently available, among which Sesame is an open-source project that has a simple structure, is easy to deploy, is fully featured and easy to operate, implements a general RDF data management framework, and provides programming interfaces for integrating different storage systems, reasoning engines and query engines;

(5) coreference resolution: instance coreference is a problem that arises frequently when annotating web page data, namely that the same instance appears in different web documents; to avoid generating duplicate, redundant instances, the annotation tool should be able to query instances, so that an existing instance can be selected for annotation when the same instance is encountered again, which avoids the coreference problem caused by creating a new instance.

To meet the above requirements, and in view of the geographic discipline knowledge graph currently being constructed, a corresponding semantic annotation architecture is proposed, as shown in Fig. 3. On the basis of the geographic discipline ontology and the resource management system, annotators use the semantic annotation system to generate an annotation database, which is finally cleaned and exported as annotation data.

In step S3, after the structured external data source is processed according to the ontology structure of the discipline knowledge graph, RDF external source data consistent in structure with the annotation data are obtained. The external data source is generally a knowledge graph published on the internet or another website with a high degree of structure, and is characterized by a large data volume and good structure. The Baidu Baike infobox is a good source of triple facts for expanding a domain knowledge graph; on the basis of the instances obtained in the crowdsourcing semantic annotation and entity set expansion steps, high-quality triples can be obtained through the following steps.

S301, acquiring the instance set and the relation name set. For each concept c, its instance set is denoted E = {e1, …, eN}. For each instance ei, the corresponding Baidu Baike infobox is crawled, and the relation names found in all infoboxes are collected into a set R = {r1, …, rM} of size M;

S302, connecting edges. If the infobox of instance ei contains relation name rj, the weight of the edge between ei and rj is set to 1; otherwise it is set to 0. To avoid a sparse graph, we also add edges between instances and between relation names. For instance-to-instance edges, a relation name vector V is first defined for each instance, with dimension equal to the size M of the relation name set: the k-th component is set to 1 if relation name rk appears in the infobox of the instance and to 0 otherwise. The cosine similarity between the relation name vectors of two instances is then used as the weight of the edge between them. Similarly, an instance vector can be defined for each relation name, and the cosine similarity between the instance vectors of two relation names is used as the weight of the edge between them;
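The edge weights described in S302 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `cosine` and `build_edges` and the toy infobox data are hypothetical, and the infoboxes are assumed to have already been crawled into a dictionary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_edges(infoboxes):
    """infoboxes maps each instance name to the set of relation names in its infobox."""
    instances = sorted(infoboxes)
    relations = sorted(set().union(*infoboxes.values()))
    # Instance-to-relation edges: 1 if the infobox contains the relation name, else 0.
    inst_rel = {(e, r): 1 if r in infoboxes[e] else 0
                for e in instances for r in relations}
    # Relation name vector per instance (dimension M) and instance vector per relation name.
    inst_vec = {e: [inst_rel[(e, r)] for r in relations] for e in instances}
    rel_vec = {r: [inst_rel[(e, r)] for e in instances] for r in relations}
    # Instance-to-instance and relation-to-relation edges weighted by cosine similarity.
    inst_inst = {(a, b): cosine(inst_vec[a], inst_vec[b])
                 for a in instances for b in instances if a < b}
    rel_rel = {(p, q): cosine(rel_vec[p], rel_vec[q])
               for p in relations for q in relations if p < q}
    return inst_rel, inst_inst, rel_rel


# Hypothetical toy data for the concept "river"; "Mount Tai" is deliberately out of place.
infoboxes = {
    "Yangtze": {"length", "source", "mouth"},
    "Yellow River": {"length", "source"},
    "Mount Tai": {"elevation"},
}
inst_rel, inst_inst, rel_rel = build_edges(infoboxes)
```

In this toy data the two rivers share most relation names and therefore receive a high instance-to-instance weight, while "Mount Tai" shares none and receives a weight of 0.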

S303, iterative calculation. A graph reinforcement algorithm is applied iteratively; after the iterative computation, a ranking of the instances and relation names by typicality under each concept is obtained.

It is worth noting that the above steps also serve to check for classification errors among the instances in the knowledge graph: if an instance has a low typicality under a concept in step S303, it is likely to have been misclassified.
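The text does not specify the iterative algorithm in detail. As an illustration only, the sketch below assumes a simple mutual-reinforcement scheme (a lazy power iteration in which each node keeps its own score and absorbs the weighted scores of its neighbours), and then flags instances with low typicality as possible classification errors. The function name `typicality_scores`, the threshold of 0.1 and the toy edge weights are all assumptions.

```python
def typicality_scores(edges, iterations=50):
    """Lazy power iteration over an undirected weighted graph.

    edges maps (node_a, node_b) pairs to positive weights; zero-weight edges are omitted.
    """
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, []).append((b, weight))
        neighbours.setdefault(b, []).append((a, weight))
    if not neighbours:
        return {}
    score = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Each node keeps its own score and adds the weighted scores of its neighbours.
        updated = {
            node: score[node] + sum(w * score[other] for other, w in links)
            for node, links in neighbours.items()
        }
        top = max(updated.values())
        # Normalise so that the most typical node has score 1.0.
        score = {node: value / top for node, value in updated.items()}
    return score


# Hypothetical toy graph for the concept "river" (instances and relation names as nodes).
edges = {
    ("Yangtze", "length"): 1.0,
    ("Yangtze", "source"): 1.0,
    ("Yangtze", "mouth"): 1.0,
    ("Yellow River", "length"): 1.0,
    ("Yellow River", "source"): 1.0,
    ("Mount Tai", "elevation"): 1.0,
    ("Yangtze", "Yellow River"): 0.82,
    ("length", "source"): 1.0,
    ("length", "mouth"): 0.71,
    ("mouth", "source"): 0.71,
}
instances = ["Yangtze", "Yellow River", "Mount Tai"]
scores = typicality_scores(edges)
# Instances whose typicality falls below the (assumed) threshold are suspected misclassifications.
flagged = [name for name in instances if scores[name] < 0.1]
```

Because "Mount Tai" is connected only to a relation name that no other instance uses, its score decays towards zero during the iteration and it is the only instance flagged.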

In step S4, information extraction uses the annotation data as training data and, following the ontology structure of the discipline knowledge graph, extracts entities and relations from internet text such as encyclopedias by supervised, semi-supervised and unsupervised methods to obtain the expanded data. It comprises the following steps:

S401, entity set expansion. Expanded data are RDF triple data extracted from text by machine learning and similar methods, using the previously obtained annotation data and external source data. We expand on the basis of the entity set of each concept in the knowledge graph, and the method used is word vectors. Trained on a large text corpus, a word vector model maps each word to a vector of fixed dimension, so that the semantic relatedness of two words can be measured by the cosine distance between their vectors.
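Entity set expansion with word vectors can be sketched as below. This is an illustration under assumptions: the vectors are hand-made three-dimensional toy values rather than trained embeddings, and comparing each candidate with the centroid of the seed entities is one possible strategy that the text does not prescribe. The code uses cosine similarity, which is one minus the cosine distance mentioned above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_entity_set(seeds, vectors, threshold=0.9):
    """Return (word, similarity) pairs for non-seed words close to the seed centroid."""
    dims = len(next(iter(vectors.values())))
    centroid = [sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dims)]
    scored = [(word, cosine(centroid, vec))
              for word, vec in vectors.items() if word not in seeds]
    accepted = [pair for pair in scored if pair[1] >= threshold]
    return sorted(accepted, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors; real vectors would come from a trained word vector model.
vectors = {
    "Yangtze": [0.9, 0.1, 0.0],
    "Yellow River": [0.8, 0.2, 0.1],
    "Pearl River": [0.85, 0.15, 0.05],
    "Mount Tai": [0.1, 0.9, 0.2],
    "Beijing": [0.2, 0.3, 0.9],
}
expanded = expand_entity_set({"Yangtze", "Yellow River"}, vectors)
```

With these toy values only "Pearl River" is close enough to the seed rivers to be added to the entity set of the concept.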

Strictly speaking, entity set expansion should be followed by an entity disambiguation step. However, ambiguity is common in general-purpose knowledge graphs and rare in domain knowledge graphs. For example, "apple" is both a fruit and a technology company, but almost no domain knowledge graph covers both technology companies and fruit.

S402, relation extraction. We use three kinds of methods for relation extraction: unsupervised, supervised and semi-supervised. The unsupervised methods are a rule-based method and an LDA model. In the rule-based method, a regular-expression template is defined for each relation to be extracted, and textual descriptions of that relation are then extracted from the text. The LDA model is an unsupervised machine learning technique that can identify latent relation categories in text and represents the characteristics of each relation category as a bag of words. In the supervised method, the relation data already in the knowledge graph are used as training data to extract the corresponding triples from the text; because the existing data are still limited, a relatively simple multilayer perceptron is used to prevent overfitting. In the semi-supervised method, a distant supervision method based on a multilingual attention mechanism is adopted, which exploits the information that is consistent across languages to achieve better extraction than a single-language approach.
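The rule-based method can be illustrated with a small sketch. The relation names, the regular-expression templates and the English sample text below are hypothetical; templates for the actual corpus would be written for Chinese text and for the relations defined in the ontology.

```python
import re

# One or more capitalised words, used here as a crude stand-in for an entity mention.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# One regular-expression template per relation to be extracted (hypothetical examples).
TEMPLATES = {
    "capitalOf": re.compile(rf"(?:The )?({NAME}) is the capital of (?:the )?({NAME})"),
    "flowsInto": re.compile(rf"(?:The )?({NAME}) flows into (?:the )?({NAME})"),
}


def extract_triples(text):
    """Apply every template to the text and return (E1, R, E2) triples."""
    triples = []
    for relation, pattern in TEMPLATES.items():
        for match in pattern.finditer(text):
            triples.append((match.group(1), relation, match.group(2)))
    return triples


text = "Beijing is the capital of China. The Yangtze flows into the East China Sea."
triples = extract_triples(text)
```

Each template yields triples in the (E1, R, E2) form used throughout the method, so the output can be merged directly with the annotation data after checking.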

Embodiment:

The 4 steps of the accurate and efficient educational knowledge graph construction method are specifically as follows:

Step S1, constructing the domain ontology. Based on authoritative textbooks and supplementary teaching materials for the geographic discipline, the unsupervised OpenIE method and related statistical methods are used, the ontology structures of other knowledge graphs are consulted, and the guidance of experts in the geographic discipline and front-line teachers is incorporated to complete the construction of a geographic discipline ontology for the basic education field;

In step S1, coverage and accuracy are very important evaluation indexes for a discipline ontology in the basic education field. Because automatic construction of Chinese ontologies is not yet mature, ontology knowledge obtained by ontology learning, statistical learning and similar methods is combined with the characteristics of the basic education field and with the ontology knowledge of other knowledge graphs, and the geographic discipline ontology is constructed under expert guidance.

Step S2, crowdsourcing semi-automatic semantic annotation. The text obtained by digitizing the geographic discipline textbooks and supplementary teaching materials is taken as the annotation object and the geographic discipline ontology as the annotation basis; semi-automatic semantic annotation is carried out with the semantic annotation system to form the annotation data, and the geographic discipline ontology is refined in the process;

In step S2, the annotation data are the foundation and focus of the geographic discipline knowledge graph, and we adopt crowdsourcing semi-automatic semantic annotation to ensure both quality and efficiency. The source of the annotation data is the text of the textbooks and supplementary teaching materials in HTML format. Semantic annotation based on a domain ontology is the process of extracting structured knowledge from a document under the guidance of the domain ontology, that is, describing the plain-text knowledge in the document in the RDF language. The semantic annotation process comprises two steps: (1) type annotation: marking the words in the document that correspond to concepts in the ontology and taking them as instances of those concepts; (2) relation annotation: finding the relations between instances that correspond to relations in the ontology; relation annotation enriches the information attached to the instances. In annotation, instances and the relations between them are typically represented as triples (E1, R, E2), where R is the relation between instances E1 and E2.
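The two annotation steps and the resulting triples can be sketched as follows. This is an illustration under assumptions: the namespace `http://example.org/geo/` and the helper names are hypothetical, and N-Triples is chosen here only as one simple RDF serialization; the text does not specify which serialization the annotation system uses.

```python
from urllib.parse import quote

BASE = "http://example.org/geo/"  # hypothetical namespace for the ontology
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(name):
    """Build an IRI reference for a concept, instance or relation name."""
    return "<" + BASE + quote(name) + ">"


def to_ntriples(type_annotations, relation_annotations):
    """Serialize annotation results as N-Triples lines."""
    lines = []
    # (1) Type annotation: each marked word becomes an instance of a concept.
    for instance, concept in type_annotations:
        lines.append(f"{iri(instance)} {RDF_TYPE} {iri(concept)} .")
    # (2) Relation annotation: each (E1, R, E2) triple links two instances.
    for e1, relation, e2 in relation_annotations:
        lines.append(f"{iri(e1)} {iri(relation)} {iri(e2)} .")
    return lines


# Hypothetical annotations made on a textbook page.
type_annotations = [("Yangtze River", "River")]
relation_annotations = [("Yangtze River", "flowsInto", "East China Sea")]
lines = to_ntriples(type_annotations, relation_annotations)
```

Lines in this form can be loaded into an RDF database for storage and later cleaning.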

Step S3, supplementing external source data. After the structured external data source is processed according to the ontology structure of the geographic discipline knowledge graph, the external source data are obtained and serve as an important part of the geographic discipline knowledge graph;

In step S3, the external source data are RDF data that are obtained by processing the external data source according to the ontology structure of the geographic discipline field and that are consistent in structure with the annotation data. The external data source is generally a knowledge graph published on the internet or another website with a high degree of structure, and is characterized by a large data volume and good structure. The 3 external data sources used in the geographic discipline knowledge graph are introduced below.

(1) GeoNames is a relatively authoritative knowledge graph in the geographic information field. It contains more than 10 million place names, and its data are highly accurate. The data are mainly in English, and the more important place names also have names (labels) in other languages; for example, more than 610,000 place names have Chinese names. Each place name record has 19 attributes (some of which may be empty). Some attributes can be used directly as triple facts in the knowledge graph, such as longitude. Others need to be processed according to the ontology structure: for example, the feature code attribute is processed into a relation between an instance and a concept, and attributes such as the first-level administrative division code (admin1 code) and the second-level administrative division code (admin2 code) are processed to obtain the hierarchical relations between place names.

(2) The Baidu Baike infobox is a good source of triple facts for expanding a domain knowledge graph.

(3) Chinese administrative division information. The administrative divisions of China are important in the geographic discipline, so information on the administrative divisions of China down to the township level, mainly the hierarchical relations between divisions, is obtained from the website of the National Bureau of Statistics (http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmchxcfdm/2016/). Because these data are presented in a fully structured form, they are added to the knowledge graph directly after being processed according to the ontology structure.
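The processing of a GeoNames record can be sketched as below. The 19 column names follow the tab-separated layout of the public GeoNames dump; everything else is an assumption made for illustration: the predicate names (`instanceOf`, `locatedInAdmin1`), the mapping from feature codes to ontology concepts, and the sample row, whose values are illustrative and not copied from the dump.

```python
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature class", "feature code", "country code", "cc2", "admin1 code",
    "admin2 code", "admin3 code", "admin4 code", "population", "elevation",
    "dem", "timezone", "modification date",
]

# Hypothetical mapping from GeoNames feature codes to concepts in the ontology.
FEATURE_TO_CONCEPT = {"PPLC": "Capital", "STM": "River"}


def row_to_triples(line):
    """Convert one tab-separated GeoNames record into (subject, predicate, object) triples."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GEONAMES_COLUMNS):
        raise ValueError("expected 19 tab-separated fields")
    row = dict(zip(GEONAMES_COLUMNS, values))
    subject = row["name"]
    triples = []
    # Attributes usable directly as triple facts.
    for attribute in ("latitude", "longitude"):
        if row[attribute]:
            triples.append((subject, attribute, float(row[attribute])))
    # Feature code processed into an instance-concept relation.
    concept = FEATURE_TO_CONCEPT.get(row["feature code"])
    if concept:
        triples.append((subject, "instanceOf", concept))
    # Administrative division code processed into a hierarchical relation.
    if row["admin1 code"]:
        region = row["country code"] + "." + row["admin1 code"]
        triples.append((subject, "locatedInAdmin1", region))
    return triples


# Illustrative record (values are placeholders, not taken from the GeoNames dump).
sample = "\t".join([
    "1", "Beijing", "Beijing", "", "39.9", "116.4", "P", "PPLC", "CN", "",
    "22", "", "", "", "0", "", "0", "Asia/Shanghai", "2020-01-01",
])
triples = row_to_triples(sample)
```

Empty attributes are skipped rather than emitted as empty triples, which matches the remark that some of the 19 attributes may be empty.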

Step S4, information extraction. Using the annotation data as training data, and following the ontology structure of the geographic discipline knowledge graph, entities and relations are extracted from internet text such as Baidu Baike by supervised, semi-supervised and unsupervised methods to obtain the expanded data.

In step S4, the expanded data are RDF triple data extracted from text by machine learning and similar methods, using the previously obtained annotation data and external source data. The expanded data are an important component of the geographic discipline knowledge graph. The text corpora we use are the "World Geography", "Chinese Geography" and "Geography" volumes of the Encyclopedia of China (hereinafter "Encyclopedia of China text") and the text of Baidu Baike and Wikipedia (hereinafter "Baidu-Wiki text"). The two corpora have different characteristics: the Encyclopedia of China text is small in quantity but high in quality, whereas the Baidu-Wiki text is of average quality but large in quantity. We use three kinds of methods for relation extraction: unsupervised, supervised and semi-supervised. The unsupervised methods are a rule-based method and an LDA model. In the rule-based method, a regular-expression template is defined for each relation to be extracted, and textual descriptions of that relation are then extracted from the text. The LDA model is an unsupervised machine learning technique that can identify latent relation categories in text and represents the characteristics of each relation category as a bag of words. In the supervised method, the relation data already in the knowledge graph are used as training data to extract the corresponding triples from the text; because the existing data are still limited, a relatively simple multilayer perceptron is used to prevent overfitting. In the semi-supervised method, a distant supervision method based on a multilingual attention mechanism is adopted, which exploits the information that is consistent across languages to achieve better extraction than a single-language approach.
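How relation data already in the knowledge graph can be turned into training examples may be sketched as follows. This illustrates only the usual distant supervision labelling heuristic (a sentence that mentions both entities of a known triple is labelled with that relation); it is an assumption about the data preparation and does not implement the multilayer perceptron or the multilingual attention model mentioned above. The function name and sample sentences are hypothetical.

```python
def label_sentences(sentences, known_triples):
    """Label each sentence that mentions both entities of a known (E1, R, E2) triple."""
    examples = []
    for sentence in sentences:
        for e1, relation, e2 in known_triples:
            # Heuristic: co-occurrence of both entities is taken as evidence of the relation.
            if e1 in sentence and e2 in sentence:
                examples.append((sentence, e1, e2, relation))
    return examples


# Hypothetical knowledge graph facts and corpus sentences.
known_triples = [("Beijing", "capitalOf", "China")]
sentences = [
    "Beijing is the capital of China.",
    "Beijing hosted the 2008 Olympics.",
    "China has a long coastline.",
]
examples = label_sentences(sentences, known_triples)
```

Labels produced in this way are noisy, because two entities can co-occur without expressing the relation, which is one reason a simple model and careful checking are advisable when the data are limited.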

The above embodiments are only preferred embodiments of the present invention and do not limit its scope of protection; in addition to the cases listed in the specific embodiments, all equivalent variations made according to the methods and principles of the present invention fall within the scope of protection of the present invention.

Claims

1. An accurate and efficient educational knowledge graph construction method, characterized by comprising the following steps:

S3, supplementing external source data: processing well-structured data from other sources according to the ontology structure, and integrating the processed data with the annotation data;

2. The method according to claim 1, wherein in step S1 the ontology structure of the knowledge graph is constructed on the basis of authoritative textbooks and supplementary teaching materials of the discipline, the specific steps comprising:

3. The method according to claim 2, characterized in that the number of classes contained in the ontology in step S101 is kept as small as possible, with redundant classes removed wherever possible.

4. The method according to claim 2, wherein step S102 further comprises the steps of: (1) performing unsupervised open relation extraction on text in the geographic discipline field using the OpenIE method, and identifying meaningful relations among the extracted results; (2) consulting knowledge graphs or data sources of higher quality; (3) determining relations according to the core concepts of the field and encyclopedia infoboxes: each core concept of the field has many instances, most of which have corresponding infoboxes in encyclopedias, and the important relations under a concept can be obtained by combining the infobox relations of multiple instances of that concept; (4) supplementing new relations during crowdsourcing semi-automatic semantic annotation: if a new relation is found during annotation and cannot be expressed by the existing relations, the new relation is added.

5. The method according to claim 1, 2, 3 or 4, wherein in step S2, "annotating with a semantic annotation tool to obtain high-quality annotation data" specifically means that, using a crowdsourcing semi-automatic semantic annotation tool developed on the basis of Pundit, the HTML text obtained by digitizing the discipline textbooks and supplementary teaching materials is taken as the annotation object and the discipline ontology as the annotation basis, semi-automatic semantic annotation is carried out with the semantic annotation system to form the annotation data, and the discipline ontology is refined in the process; semantic annotation based on a domain ontology is the process of extracting structured knowledge from a document under the guidance of the domain ontology, that is, describing the plain-text knowledge in the document in the RDF (Resource Description Framework) language; the semantic annotation process generally comprises two steps: (1) type annotation: marking the words in the document that correspond to concepts in the ontology and taking them as instances of those concepts; (2) relation annotation: finding the relations between instances that correspond to relations in the ontology, relation annotation enriching the information attached to the instances; and, in semantic annotation, instances and the relations between them are represented as triples (E1, R, E2), where R is the relation between instances E1 and E2.

6. The method of claim 5, wherein, when instances and the relations between instances are represented as triples, high-quality triples are obtained by the following steps:

7. The method according to claim 5, wherein the semantic annotation system, as a key system for knowledge graph construction, mainly meets the following requirements:

8. The method of claim 7, wherein, in view of the geographic discipline knowledge graph to be constructed, the semantic annotation system further meets the following requirements:

9. The method according to any one of claims 1 to 8, wherein the information extraction of step S4 specifically comprises the following steps:

10. The method of claim 9, further comprising the operation of entity disambiguation after entity set expansion in step S401.