CN116127090B - Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction

Info

Publication number
CN116127090B
Authority
CN
China
Prior art keywords
aviation system
entity
aviation
attribute
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211699386.0A
Other languages
Chinese (zh)
Other versions
CN116127090A (en)
Inventor
何柳
安然
陶剑
卓雨东
刘姝妍
贺薇
裴育
王孝天
高龙
董洪飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Aero Polytechnology Establishment
Original Assignee
China Aero Polytechnology Establishment

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an aviation system knowledge graph construction method based on fusion and semi-supervised information extraction, comprising the following steps: constructing an aviation system knowledge system base; constructing an aviation system tag and entity category vocabulary; constructing an aviation system domain knowledge system; constructing an aviation system entity fusion model based on attribute and neighbor features; constructing an aviation system semi-supervised information extraction model based on reading comprehension; and extracting aviation system information to generate the aviation system knowledge graph. To address knowledge system construction, the invention proposes a system fusion algorithm based on tag extraction and semantic features, which reduces construction difficulty and enriches the knowledge system. To address entity fusion, it proposes an entity fusion algorithm based on attribute and neighbor features, which improves the entity fusion result. To address the difficulty of acquiring high-quality large-scale data, it proposes a semi-supervised information extraction framework based on reading comprehension, so that the model gradually acquires the characteristics of the aviation system domain and the knowledge graph becomes more practical.

Description

Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
Technical Field
The invention belongs to the technical field of knowledge graph construction, and particularly relates to an aviation system knowledge graph construction method based on fusion and semi-supervision information extraction.
Background
With the development of Internet technology and the rapid growth of network data, the knowledge graph was first proposed by Google to improve search engine results. A knowledge graph converts real-world concepts and their associations into structured triples and stores them as a graph, forming a networked, structured knowledge base. This networked structure expresses relationships and models the real world well; it can organize and integrate complex information on the Internet and convert it into valuable knowledge, so that information resources are better understood and used. It also offers strong semantic processing and open interoperability, which improves performance on many downstream tasks. Knowledge graphs are now widely used in scenarios such as question answering and retrieval, intelligent customer service, risk control, and user recommendation, and have achieved good results in fields such as finance, law, medical care, industry, and government.
Because data differ between vertical domains and application scenarios are specific, knowledge graphs from different domains cannot be directly migrated and reused; they must be rebuilt according to the knowledge characteristics and scenario requirements. However, the existing knowledge graph construction process lacks general tool support that combines knowledge system fusion, entity fusion, and information extraction, and it faces the following problems. First, reusable knowledge systems are not managed in a unified way, so no reference can be found during construction; this raises construction difficulty and lowers the utilization of existing knowledge systems. In addition, the same attribute is expressed differently in different systems, so the quality of knowledge system fusion affects the quality of the resulting graph. Second, entity fusion does not account for long-valued attributes or for redundancy among relation nodes, which reduces fusion accuracy. Third, high-quality large-scale training data are difficult to acquire, which degrades deep learning models, and extraction models trained in different domains are difficult to migrate and reuse because knowledge and business scenarios differ. For aviation systems, an aviation system knowledge graph construction method based on fusion and semi-supervised information extraction is therefore needed to address these problems: knowledge system construction and organization are cumbersome, knowledge systems cannot be reused, high-quality training data are difficult to obtain, and entity fusion does not fully consider entity attributes and neighbor features.
Disclosure of Invention
To address the defects of the prior art, the invention provides an aviation system knowledge graph construction method based on fusion and semi-supervised information extraction. The method comprises constructing an aviation system knowledge system base, constructing an aviation system tag and entity category vocabulary, constructing an aviation system domain knowledge system, constructing an aviation system entity fusion model based on attribute and neighbor features, constructing an aviation system semi-supervised information extraction model based on reading comprehension, and extracting aviation system information to generate the aviation system knowledge graph. To address knowledge system construction, the invention proposes a system fusion algorithm based on tag extraction and semantic features, which reduces construction difficulty and enriches the knowledge system. To address entity fusion, it proposes an entity fusion algorithm based on attribute and neighbor features, which improves the entity fusion result. To address the difficulty of acquiring high-quality large-scale data, it proposes a semi-supervised information extraction framework based on reading comprehension, so that the model gradually acquires the characteristics of the aviation system domain and the knowledge graph becomes more practical.
The invention provides an aviation system knowledge graph construction method based on fusion and semi-supervision information extraction, which comprises the following steps:
S1, constructing an aviation system knowledge system base;
S11, acquiring raw aviation system data: tabular data, JSON data, and Neo4j graph database data are acquired separately and together serve as the raw aviation system data;
S12, preprocessing the aviation system data to obtain the attributes of aviation system entities and the relationships between the entities;
S13, storing a unified aviation system knowledge system: considering that some aviation system knowledge graphs contain only entities with attributes, or only relationship information between entities, the attributes of the aviation system entities obtained in step S12 are stored in an aviation system entity attribute table, and the relationships between aviation system entities are stored in an aviation system entity relationship table;
S14, constructing the aviation system knowledge system base: integrating the aviation system entity attribute table and entity relationship table as the aviation system knowledge system base;
S2, constructing an aviation system tag and entity category vocabulary;
S21, extracting aviation system keywords: extracting keywords of each aviation system entity from its name and descriptive attribute values using a tf-idf keyword extraction algorithm;
S22, acquiring the aviation system tag and entity category vocabulary: counting and ranking the keywords of aviation system entities of the same category to obtain high-frequency aviation system keywords, and constructing the aviation system tag and entity category vocabulary from these high-frequency keywords;
S3, constructing an aviation system domain knowledge system: for html webpage files, performing aviation system entity attribute extraction, aviation system entity tag extraction, aviation system structural feature discovery and entity category alignment, and aviation system knowledge system fusion, to obtain the aviation system domain knowledge system;
S4, constructing an aviation system entity fusion model based on attribute and neighbor features: for the knowledge system fusion in the aviation system domain knowledge system constructed in step S3, entity fusion is improved based on attribute and neighbor features, yielding the aviation system domain knowledge system after entity fusion;
S5, constructing an aviation system semi-supervised information extraction model based on reading comprehension, and generating the aviation system knowledge graph;
S51, acquiring input data for the aviation system semi-supervised information extraction model: acquiring triples from the aviation system domain knowledge system, acquiring text from unstructured aviation system domain texts, and combining the triples and the text into the input data;
S52, preprocessing the input data based on a question generation template to generate a question corpus pair set;
S53, generating aviation system pre-annotation data for the question corpus pair set using a reading comprehension model;
S54, acquiring aviation system annotation data for the question corpus pair set based on a set threshold;
S55, extracting aviation system information based on the aviation system annotation data, and generating the aviation system knowledge graph;
S551, for the aviation system entity recognition task, extracting information with a CRF model if the training data are sufficient, and extracting aviation system information with a Bert-BiLSTM-CRF model if the training data are insufficient;
S552, for the aviation system relation extraction task, extracting aviation system information with a Bert-BiGRU-ATT model;
S553, combining the extracted aviation system information, returning to step S53 for training iteration, and finally generating the aviation system knowledge graph.
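The iteration described in steps S53 to S55 can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the patented implementation: `reader` and `train_extractor` are hypothetical callables standing in for the reading comprehension model and the extraction model, and the threshold of 0.8 and the round count are assumed values, since the text only states that a threshold is set.

```python
from typing import Callable, Iterable

def semi_supervised_rounds(
    pairs: Iterable[tuple[str, str]],
    reader: Callable[[str, str], tuple[str, float]],
    train_extractor: Callable[[list[tuple[str, str, str]]], None],
    threshold: float = 0.8,   # assumed value; the source only says a threshold is set
    rounds: int = 3,          # assumed number of training iterations
) -> list[tuple[str, str, str]]:
    """Pre-label with a reader, keep confident answers, train, and repeat."""
    pairs = list(pairs)
    kept: list[tuple[str, str, str]] = []
    for _ in range(rounds):
        kept = []
        for question, passage in pairs:
            # S53: the reader proposes an answer span with a confidence score
            answer, confidence = reader(question, passage)
            # S54: only answers at or above the threshold become annotation data
            if confidence >= threshold:
                kept.append((question, passage, answer))
        # S55: the extraction model is trained on the selected annotation data;
        # how its predictions are fed back to the reader is left to the caller
        train_extractor(kept)
    return kept
```

Passing the two models as callables keeps the loop independent of any particular model library.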
Further, the step S3 specifically includes the following steps:
S31, inputting an html webpage file and extracting aviation system entity attributes: analyzing the webpage structure of encyclopedia websites, identifying the specific website from the webpage, and parsing the InfoBox with the rules set for that website to obtain the aviation system entity attributes;
S32, extracting aviation system entity tags: extracting aviation system entity tags from the descriptive aviation system information of the encyclopedia website using the unsupervised latent Dirichlet allocation (LDA) algorithm;
S33, aviation system structural feature discovery and entity category alignment: traversing the aviation system tag and entity category vocabulary with the extracted aviation system entity tags, and returning the corresponding aviation system entity category if a tag is hit; otherwise, calculating the semantic similarity with each aviation system entity category, and returning the category with the highest semantic similarity as the aviation system candidate entity category;
S34, fusing the aviation system knowledge system: performing attribute fusion and structure fusion on the aviation system entities aligned to aviation system entity categories;
S35, acquiring the aviation system domain knowledge system: forming the aviation system domain knowledge system from the aviation system entities obtained by knowledge system fusion, and supplementing the aviation system knowledge system base and the aviation system tag and entity category vocabulary accordingly.
Preferably, the step S4 specifically includes the following steps:
S41, normalizing aviation system entity attributes: the normalization comprises normalizing attribute names and unifying the expression of attribute values;
S42, blocking aviation system entities: the aviation system entities are blocked first by entity name and then by entity category;
S421, converting each aviation system entity name into a bi-gram sequence, and establishing an inverted index table of aviation system entities;
S422, inserting the aviation system entities into the inverted index table;
S423, placing in the same block the aviation system entities under any inverted index entry whose value has a length greater than 1;
S424, dividing the aviation system entities of the same category within a block into the same sub-block, and taking the sub-blocks as the final aviation system entity blocking result;
S43, aligning aviation system entities: extracting the attribute features and neighbor features of the aviation system entities, calculating the matching degree of each, and taking their average as the matching score of an aviation system entity pair;
S44, fusing aviation system entities: fusing the aviation system entity pairs whose matching score exceeds a set threshold, the fusion comprising attribute fusion and relationship fusion; when an aviation system entity attribute is single-valued, the semantic similarity of the two attribute values is compared, and one value is retained if the similarity is higher than a set value, otherwise both values are retained; when an aviation system entity attribute is multi-valued, all values are retained directly.
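The blocking in steps S421 to S424 can be illustrated as follows. This is a minimal sketch, assuming character bi-grams, unique entity names, and the reading that an index entry shared by more than one entity defines a block; the function names are hypothetical and do not come from the source.

```python
from collections import defaultdict

def bigrams(name: str) -> set[str]:
    """Character bi-grams of an entity name (a very short name yields itself)."""
    return {name[i:i + 2] for i in range(len(name) - 1)} or {name}

def block_entities(entities: list[tuple[str, str]]) -> list[set[str]]:
    """entities holds (name, category) pairs; returns sub-blocks of candidate matches."""
    # S421/S422: inverted index from bi-gram to the entities containing it
    index: dict[str, set[int]] = defaultdict(set)
    for i, (name, _) in enumerate(entities):
        for gram in bigrams(name):
            index[gram].add(i)

    blocks = set()
    for ids in index.values():
        # S423: only index entries shared by more than one entity form a block
        if len(ids) < 2:
            continue
        # S424: split the block into sub-blocks by entity category
        by_category = defaultdict(set)
        for i in ids:
            by_category[entities[i][1]].add(entities[i][0])
        for names in by_category.values():
            if len(names) > 1:
                blocks.add(frozenset(names))
    return [set(b) for b in blocks]
```

Only entity pairs inside the same sub-block then need a matching score in step S43, which avoids comparing every entity with every other entity.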
Preferably, the question generation template in the step S52 comprises a first question generation template and a second question generation template; the aviation system entity recognition task data are preprocessed based on the first question generation template to generate a first question corpus pair set; the aviation system relation extraction task data are preprocessed based on the second question generation template to generate a second question corpus pair set; the question corpus pair set comprises the first question corpus pair set and the second question corpus pair set;
The reading comprehension model in the step S53 obtains the conditional probability of the aviation system output text by attaching a fully connected layer with a softmax activation function, determines an aviation system text segment by its start position and end position, calculates the score of the text segment, namely the answer confidence, and finally outputs the aviation system text segment with the highest confidence;
in the step S54, annotation data are obtained from the first question corpus pair set using a BIO labeling mode and from the second question corpus pair set using a second labeling mode; in the BIO labeling mode, a list of 'O' with the length of the sentence is generated, the position of the entity word in the sentence is found, the entity category is converted into pinyin, the first character of the entity word is labeled 'B-category pinyin' and the remaining characters of the entity word are labeled 'I-category pinyin', and the result is written to a file row by row to obtain aviation system annotation data; in the second labeling mode, the relation categories in the knowledge system are acquired, a category number-category correspondence table is generated, and the category number, entity pair, and sentence are concatenated and written to a file row by row to obtain aviation system annotation data;
The Bert-BiLSTM-CRF model in the step S551 comprises a Bert pre-training model, a bidirectional LSTM model, and a CRF layer; the Bert pre-training model captures the hidden information in the corpus context, and its output is fed as vectors into the bidirectional LSTM model; the bidirectional LSTM model automatically extracts sentence features from the complete context information and selects the label with the maximum probability at each step; the CRF layer then constrains the output labels, filtering unreasonable results through transition probabilities, so that the linear feature functions of the original linear-chain CRF are combined with the nonlinear output of the bidirectional LSTM;
in the step S552, the Bert-BiGRU-ATT model comprises a Bert vectorization input layer, a hidden layer, and an output layer; the Bert vectorization input layer vectorizes the text and feeds it to the hidden layer; the hidden layer comprises a BiGRU layer, an attention layer, and a Dense fully connected layer and is used to calculate the probability weight that should be assigned to each word vector; and the BiGRU layer performs deep feature extraction on the context information.
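The BIO labeling mode described for step S54 can be sketched in a few lines. This is an illustrative sketch only: it tags characters rather than segmented words, labels only the first occurrence of the entity, and assumes the category has already been converted to pinyin (for example `fadongji`), because the conversion tool is not named in the text. The function names are hypothetical.

```python
def bio_tags(sentence: str, entity: str, category: str) -> list[str]:
    """Character-level BIO tags for the first occurrence of `entity` in `sentence`."""
    tags = ["O"] * len(sentence)          # start from an all-'O' list of sentence length
    start = sentence.find(entity)         # locate the entity word in the sentence
    if not entity or start == -1:
        return tags
    tags[start] = f"B-{category}"         # first character of the entity word
    for i in range(start + 1, start + len(entity)):
        tags[i] = f"I-{category}"         # remaining characters of the entity word
    return tags

def to_rows(sentence: str, tags: list[str]) -> str:
    """One 'character tag' pair per row, ready to be written to a file."""
    return "\n".join(f"{ch} {tag}" for ch, tag in zip(sentence, tags))
```

For example, `bio_tags("AB发动机C", "发动机", "fadongji")` marks the three entity characters and leaves the rest as `O`.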
Preferably, the unsupervised latent Dirichlet allocation (LDA) algorithm in step S32 specifically includes the following steps:
S321, setting the number k of tags, traversing the aviation system documents, and randomly associating each word with one tag;
S322, for each aviation system document d, scanning each word w and calculating the proportion $p(\text{topic}_t \mid \text{document}_d)$ of words in document d that belong to tag t, and the proportion $p(\text{word}_w \mid \text{topic}_t)$ of assignments to tag t, over all aviation system documents, that come from word w;
S323, updating the probability $p(\text{word}_w \text{ with } \text{topic}_t)$ that word w belongs to tag t:
$$p(\text{word}_w \text{ with } \text{topic}_t) = p(\text{topic}_t \mid \text{document}_d) \times p(\text{word}_w \mid \text{topic}_t) \tag{1}$$
Preferably, the attribute fusion in the step S34 traverses the aviation system entity attributes, converts each aviation system entity attribute name into a Bert word vector, calculates the cosine similarity between the Bert word vectors, and retains only one of any aviation system entity attributes whose cosine similarity is higher than a set value; the cosine similarity is calculated as follows:
Similarity(attr1,attr2)=cos(Bert(attr1),Bert(attr2)) (2)
wherein attr1 and attr2 represent a first attribute and a second attribute of the aviation system entity, respectively; Bert(attr1) and Bert(attr2) represent the first Bert word vector and the second Bert word vector, respectively.
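Equation (2) and the retention rule can be illustrated with plain vectors. This is a minimal sketch under stated assumptions: the short vectors stand in for Bert word vectors, which would normally come from a pre-trained model, and the threshold of 0.9 is an assumed value, since the text only speaks of a set value. The function names are hypothetical.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fuse_attributes(vectors: dict[str, list[float]], threshold: float = 0.9) -> list[str]:
    """Keep an attribute name only if it is not too similar to one already kept."""
    kept: list[str] = []
    for name, vec in vectors.items():
        if all(cosine(vec, vectors[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With this rule the first attribute encountered is the one retained, so the traversal order decides which of two near-duplicate names survives.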
Preferably, the step S12 specifically includes the following steps:
S121, table header analysis: for tabular data, the header of the Excel table is parsed to obtain the attributes of aviation system entities and the relationships between the entities;
S122, JSON format analysis: for JSON data, the JSON structure is parsed to obtain the attributes of aviation system entities and the relationships between the entities;
S123, aviation system entity attribute and relationship statistics: for Neo4j graph database data, all data in the aviation system database are queried with Cypher statements, the attributes contained in each type of aviation system entity are counted, and the relationships between aviation system entities are counted.
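The three preprocessing paths in steps S121 to S123 can be sketched as below. This is a hedged illustration, not the patented code: it assumes the Excel table has been exported to CSV, assumes a JSON layout that maps each entity type to a list of records, and shows the Cypher statements only as strings that are not executed here. All function and constant names are hypothetical.

```python
import csv
import io
import json

def attributes_from_header(csv_text: str) -> list[str]:
    """S121: treat each column header of a tabular export as an attribute name."""
    return next(csv.reader(io.StringIO(csv_text)), [])

def attributes_from_json(doc: str) -> dict[str, list[str]]:
    """S122: map each entity type to the sorted attribute names of its records."""
    data = json.loads(doc)
    return {
        entity_type: sorted({key for record in records for key in record})
        for entity_type, records in data.items()
    }

# S123: example Cypher statements for a Neo4j database (shown only, not run here)
ATTRIBUTES_PER_LABEL = (
    "MATCH (n) UNWIND labels(n) AS label UNWIND keys(n) AS key "
    "RETURN label, collect(DISTINCT key) AS attributes"
)
RELATIONS_BETWEEN_LABELS = (
    "MATCH (a)-[r]->(b) "
    "RETURN DISTINCT labels(a) AS head, type(r) AS relation, labels(b) AS tail"
)
```

The outputs of all three paths have the same shape (attribute names per entity type, and relation types between entity types), which is what allows them to be merged into the entity attribute table and the entity relationship table of step S13.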
Preferably, the aviation system semi-supervised information extraction model in step S5 includes an aviation system data preprocessing module, an aviation system pre-labeling data generating module, an aviation system labeling data selecting and generating module, and an aviation system information extraction module.
Compared with the prior art, the invention has the technical effects that:
1. According to the aviation system knowledge graph construction method based on fusion and semi-supervised information extraction, a system fusion algorithm based on tag extraction and semantic features is proposed for the knowledge system construction problem: an aviation system knowledge system base is constructed to manage aviation system knowledge systems in a unified way, and system fusion is performed on the basis of this base, which reduces construction difficulty and enriches the knowledge system. For the entity fusion problem, an aviation system entity fusion algorithm based on attribute and neighbor features is proposed, which considers the attributes and neighbor features of aviation system entities in addition to the semantics of their names, thereby improving the entity fusion result.
2. The method provides an aviation system semi-supervised information extraction framework based on reading comprehension: pre-annotation data are generated by a reading comprehension model on the basis of the constructed aviation system knowledge system base and sent to an information extraction model for training, and the model's prediction results are sent back to the reading comprehension model for iteration. Through this iterative training, the model gradually acquires the characteristics of the aviation system domain, and its performance on information extraction tasks improves.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings.
FIG. 1 is a flow chart of an aviation system knowledge graph construction method based on fusion and semi-supervised information extraction;
FIG. 2 is a flow chart of a semantic feature based system fusion algorithm of the present application;
FIG. 3 is a flow chart of an aviation system entity fusion model based on attribute and neighbor features of the present application;
FIG. 4 is an exemplary diagram of an entity partition of the present application;
FIG. 5 is a flow chart of entity fusion of the present application;
FIG. 6 is a flow chart of an aviation system semi-supervised information extraction model based on reading comprehension of the present application;
FIG. 7 is a schematic diagram of the Roberta_wwm_ext_large model of the present application;
FIG. 8 is a schematic illustration of aviation system annotation data labeling of the present application;
FIG. 9 is a schematic representation of the Bert-BiLSTM-CRF model of the present application;
FIG. 10 is a schematic diagram of the Bert-BiGRU-ATT model of the present application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Fig. 1 shows an aviation system knowledge graph construction method based on fusion and semi-supervised information extraction, which comprises the following steps:
S1, constructing an aviation system knowledge system base, as shown in FIG. 2.
S11, acquiring raw aviation system data: tabular data, JSON data, and Neo4j graph database data are acquired separately and together serve as the raw aviation system data.
S12, preprocessing the aviation system data to obtain the attributes of the aviation system entities and the relationships between the entities.
S121, table header analysis: for tabular data, the header of the Excel table is parsed to obtain the attributes of aviation system entities and the relationships between the entities.
S122, JSON format analysis: for JSON data, the JSON structure is parsed to obtain the attributes of aviation system entities and the relationships between the entities.
S123, aviation system entity attribute and relationship statistics: for Neo4j graph database data, all data in the aviation system database are queried with Cypher statements, the attributes contained in each type of aviation system entity are counted, and the relationships between aviation system entities are counted.
S13, storing a unified aviation system knowledge system: considering that some aviation system knowledge graphs may contain only entities with attributes, or only relationship information between entities, the attributes of the aviation system entities obtained in step S12 are stored in an aviation system entity attribute table, and the relationships between aviation system entities are stored in an aviation system entity relationship table.
S14, constructing the aviation system knowledge system base: the aviation system entity attribute table and entity relationship table are integrated as the aviation system knowledge system base.
S2, constructing an aviation system tag and entity category vocabulary, as shown in FIG. 2.
S21, extracting aviation system keywords: keywords of each aviation system entity are extracted from its name and descriptive attribute values using a tf-idf keyword extraction algorithm.
S22, acquiring the aviation system tag and entity category vocabulary: the keywords of aviation system entities of the same category are counted and ranked to obtain high-frequency aviation system keywords, and the aviation system tag and entity category vocabulary is constructed from these high-frequency keywords.
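Steps S21 and S22 can be sketched with a small tf-idf implementation. This is a minimal sketch under stated assumptions: documents are already tokenized (Chinese text would first need a word segmenter, which the text does not name), the unsmoothed `log(N / df)` weighting is one common tf-idf variant, and `top_k` and `min_count` are assumed parameters. The function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(docs: list[list[str]], top_k: int = 3) -> list[list[str]]:
    """S21: the top_k tokens of each document ranked by tf-idf (ties broken by name)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result

def category_vocabulary(keywords_per_entity: list[list[str]], min_count: int = 2) -> list[str]:
    """S22: keywords shared by at least min_count entities of one category."""
    counts = Counter(k for kws in keywords_per_entity for k in kws)
    return [k for k, c in counts.most_common() if c >= min_count]
```

The tf-idf scores are computed over the whole corpus, while the frequency count in `category_vocabulary` is taken over the entities of a single category, so a keyword enters the vocabulary only if several entities of that category share it.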
S3, constructing an aviation system domain knowledge system: aviation system entity attribute extraction, aviation system entity tag extraction, aviation system structural feature discovery and entity category alignment, and aviation system knowledge system fusion are performed on the html webpage files to obtain the aviation system domain knowledge system, as shown in FIG. 2.
S31, inputting an html webpage file and extracting aviation system entity attributes: the InfoBox of an encyclopedia website has structured attribute names and can help a user construct a knowledge system. The webpage structures of the encyclopedia website and the Wikipedia website are analyzed, the specific website is identified from the webpage, and the InfoBox is parsed with the rules set for that website to obtain the aviation system entity attributes.
Because the encyclopedia website lacks entity tags and its entities cannot be aligned directly with the entity categories in a knowledge system, the invention applies a tag extraction algorithm to the descriptive information of the encyclopedia website to extract aviation system entity tags, and then finds the aviation system candidate entity category according to the constructed aviation system tag and entity category vocabulary.
S32, extracting aviation system entity tags: aviation system entity tags are extracted from the descriptive aviation system information of the encyclopedia website using the unsupervised latent Dirichlet allocation (LDA) algorithm.
The unsupervised implicit dirichlet distribution LDA algorithm specifically comprises the following steps:
s321, setting the number k of the labels, traversing the aviation system document, and randomly associating the words with one label.
S322, for each aviation system document d, scanning each word w, and calculating the proportion $p(\mathrm{topic}_t \mid \mathrm{document}_d)$ of words in the aviation system document d that belong to the tag t, and the proportion $p(\mathrm{word}_w \mid \mathrm{topic}_t)$ of the assignments to the tag t, over all aviation system documents, that come from the word w.
S323, updating the probability $p(\mathrm{word}_w \text{ with } \mathrm{topic}_t)$ that the word w belongs to the tag t:
$p(\mathrm{word}_w \text{ with } \mathrm{topic}_t) = p(\mathrm{topic}_t \mid \mathrm{document}_d) \times p(\mathrm{word}_w \mid \mathrm{topic}_t)$ (1).
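Steps S321 to S323 can be sketched as a small sampling loop. This is a minimal illustration under stated assumptions, not the implementation of the invention: the smoothing constants `alpha` and `beta`, the number of iterations, the sampling of the new tag in proportion to formula (1), and the function name `lda_assign` are all assumptions added for the example.

```python
import random
from collections import Counter

def lda_assign(docs, k, iterations=20, seed=0, alpha=0.1, beta=0.01):
    """Assign each word occurrence to one of k tags (topics)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # S321: randomly associate every word occurrence with one tag.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [Counter(zs) for zs in z]      # tag counts per document
    topic_word = [Counter() for _ in range(k)]  # word counts per tag
    topic_total = [0] * k
    for doc, zs in zip(docs, z):
        for w, t in zip(doc, zs):
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = z[d][i]
                # Remove the current assignment before re-estimating it.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                weights = []
                for t in range(k):
                    # S322: smoothed p(topic_t | document_d) and p(word_w | topic_t).
                    p_t_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + k * alpha)
                    p_w_t = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    # S323: formula (1), the product of the two proportions.
                    weights.append(p_t_d * p_w_t)
                new = rng.choices(range(k), weights=weights)[0]
                z[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return z, topic_word
```

The most frequent words of each entry in `topic_word` could then serve as candidate entity tags for a document.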
S33, aviation system structural feature discovery and entity category alignment: based on the extracted aviation system entity tags, the aviation system tag and entity category vocabulary is traversed; if a tag is hit, the corresponding aviation system entity category is returned; otherwise, the semantic similarity with each aviation system entity category is calculated, and the aviation system entity category with the highest semantic similarity is returned as the aviation system candidate entity category.
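The lookup-then-fallback logic of S33 can be sketched as follows. The text does not specify how semantic similarity is computed, so this example substitutes a character-level ratio from the standard library `difflib` module purely as a stand-in; a real system would pass in an embedding-based similarity. The function name `align_category` and the vocabulary layout are assumptions.

```python
from difflib import SequenceMatcher

def string_ratio(a, b):
    """Stand-in for semantic similarity (assumption): character-level ratio."""
    return SequenceMatcher(None, a, b).ratio()

def align_category(tags, tag_vocab, similarity=string_ratio):
    """tags: non-empty list of extracted entity tags.
    tag_vocab: {category: [tag words]}.
    """
    # First pass: return the category whose word list contains a tag.
    for tag in tags:
        for category, words in tag_vocab.items():
            if tag in words:
                return category
    # Fallback: return the category name most similar to any tag.
    return max(tag_vocab, key=lambda c: max(similarity(t, c) for t in tags))
```

Passing the similarity function as a parameter keeps the lookup logic independent of whichever similarity model is actually used.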
S34, fusing the aviation system knowledge system: attribute fusion and structure fusion are carried out on the aviation system entities aligned to the aviation system entity categories.
The attribute fusion traverses the aviation system entity attributes, converts the aviation system entity attribute names into Bert word vectors, calculates the cosine similarity of the Bert word vectors, and retains only one of any aviation system entity attributes whose cosine similarity is higher than a set value; the cosine similarity is calculated as:
$\mathrm{Similarity}(attr1, attr2) = \cos(\mathrm{Bert}(attr1), \mathrm{Bert}(attr2))$ (2)
wherein attr1 and attr2 represent a first attribute and a second attribute of the aviation system entity, respectively; Bert(attr1) and Bert(attr2) represent the first Bert word vector and the second Bert word vector, respectively.
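Formula (2) and the retention rule can be sketched as below. The example assumes the attribute-name vectors have already been produced by some Bert encoder and are supplied as plain lists of floats; the vectors, the threshold of 0.9 and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors, as in formula (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def fuse_attributes(vectors, threshold=0.9):
    """vectors: {attribute name: embedding}.

    Keeps an attribute only if it is not too similar to one already kept,
    so that each group of near-duplicate names leaves a single attribute.
    """
    kept = []
    for name, vec in vectors.items():
        if all(cosine(vec, vectors[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With this greedy rule the first attribute name encountered in each similar group is the one retained; other tie-breaking choices are equally possible.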
S35, acquiring the aviation system domain knowledge system: the aviation system entities obtained from the aviation system knowledge system fusion form the aviation system domain knowledge system, which is used to supplement the aviation system knowledge system base and the aviation system tag and entity category vocabulary, respectively.
At present, although crowdsourced knowledge graph websites exist, their data take different forms and their knowledge systems cannot be fused directly. The invention therefore collects and organizes these knowledge systems, builds a knowledge system base, and manages and stores them in a unified format, laying a solid foundation for knowledge system fusion. A user lacks references when constructing a knowledge system. The InfoBox of an encyclopedia website can provide structural information but lacks entity tag information, so it cannot be matched directly to the categories in the knowledge system base for fusion. The invention therefore analyzes the knowledge graph data, counts the keywords in entity names and descriptive attributes, maps categories to the high-frequency keywords, and constructs a tag-entity category table. When a webpage is parsed, its descriptive information is obtained, entity keywords are extracted by the tag extraction model, and the tag-entity category table is used to map them to a category in the knowledge system base, thereby reusing the knowledge system. The invention uses a knowledge system fusion algorithm based on semantic features to calculate the similarity of attribute names and merges attributes with high similarity, which improves the quality of the knowledge system while retaining its richness; this is an important inventive point of the invention.
S4, constructing an aviation system entity fusion model based on attributes and neighbor features: the fusion of the aviation system knowledge system in the aviation system domain knowledge system constructed in step S3 is improved based on attributes and neighbor features to obtain the entity-fused aviation system domain knowledge system, as shown in fig. 3.
S41, normalizing aviation system entity attributes: the aviation system entity attributes are normalized, and the normalization comprises attribute name normalization and unification of attribute value expressions.
The diversity of language expression can lead to different descriptions of the same attribute, so attribute names are unified according to the defined knowledge system. Attribute values can generally be divided into three categories: numeric attributes, date attributes and string attributes. Numeric attributes need unified units; for example, 125 cm and 1.25 m are different forms of the same value, and whether attribute values match is compared only after conversion to a unified unit. Date attributes also have various expressions and need to be converted into a year-month-day format before comparison. Unifying the expression forms improves the effect of the subsequent entity alignment. In a specific embodiment, standard units are set for several common types of numeric attributes, as shown in table 1, and the numeric values are then extracted with regular expressions and converted into the standard units; for date attributes, date expression templates are designed to extract the year, month and day numbers from the date, and the conversion is then performed with regular expressions.
TABLE 1
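The attribute value normalization of S41 can be sketched with regular expressions. The conversion table and the choice of the meter as the standard length unit are hypothetical examples and are not taken from Table 1; the date templates shown are likewise only two illustrative patterns.

```python
import re

# Hypothetical conversion factors to an assumed standard unit (meters).
LENGTH_TO_METERS = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}

LENGTH_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(mm|cm|km|m)\s*")

def normalize_length(text):
    """Convert a string such as '125 cm' to meters, or return None."""
    match = LENGTH_PATTERN.fullmatch(text)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * LENGTH_TO_METERS[unit]

# Illustrative date templates: separator-based and Chinese year/month/day.
DATE_PATTERNS = [
    re.compile(r"(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})"),
    re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日"),
]

def normalize_date(text):
    """Convert a recognized date expression to 'YYYY-MM-DD', or return None."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            year, month, day = (int(g) for g in match.groups())
            return f"{year:04d}-{month:02d}-{day:02d}"
    return None
```

After normalization, "125 cm" and "1.25 m" yield the same number and can be compared directly, which is the behavior the paragraph above describes.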
S42, aviation system entity blocking: blocking based on entity names and blocking based on entity categories are respectively performed on the aviation system entities.
Considering that a knowledge graph in a vertical domain contains thousands of entities or more, pairwise matching of all entities would reduce the efficiency of entity fusion. Entity blocking is therefore introduced: entity pairs that may refer to the same entity are placed in the same block, and entity alignment is performed within each block, which reduces the number of pairwise comparisons.
Redundant entities have the characteristic that the names of the same entity share overlapping sequences but are written in diverse ways. In a specific embodiment, "Galaxy-1" and "Galaxy-1" refer to the same entity although the two name forms differ considerably; if text similarity were compared directly, they would not be selected as an entity pair to be aligned, which reduces the accuracy of blocking, as shown in fig. 4. The invention therefore performs blocking on the Bi-gram sequences of entity names and, because entities of different categories cannot refer to the same entity, introduces entity categories to correct the blocking result, which improves the blocking accuracy.
S421, converting the aviation system entity names into Bi-gram sequences, and establishing an aviation system entity inverted index table.
S422, inserting the aviation system entities into the aviation system entity inverted index table.
S423, placing the aviation system entities corresponding to a key whose value list has a length greater than 1 in the aviation system entity inverted index table into the same block.
S424, further dividing the aviation system entities of the same category within a block into the same sub-block, as the final aviation system entity blocking result.
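Steps S421 to S424 can be sketched as below. The sketch assumes character-level Bi-grams and represents each entity as a `(name, category)` pair; the entity names used with it are hypothetical, and representing a block as a tuple of entity indices is a choice made for the example.

```python
from collections import defaultdict

def bigrams(name):
    """Set of character Bi-grams of an entity name (S421)."""
    return {name[i:i + 2] for i in range(len(name) - 1)}

def block_entities(entities):
    """entities: list of (name, category). Returns sorted blocks of indices."""
    # S421/S422: inverted index from Bi-gram to the entities containing it.
    index = defaultdict(set)
    for i, (name, _) in enumerate(entities):
        for gram in bigrams(name):
            index[gram].add(i)
    blocks = set()
    for ids in index.values():
        # S423: only keys shared by more than one entity form a block.
        if len(ids) > 1:
            # S424: split the block into sub-blocks by entity category.
            by_category = defaultdict(list)
            for i in sorted(ids):
                by_category[entities[i][1]].append(i)
            for members in by_category.values():
                if len(members) > 1:
                    blocks.add(tuple(members))
    return sorted(blocks)
```

Only entity pairs inside the same returned block would then be passed to entity alignment, which is how blocking reduces the number of pairwise comparisons.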
Entity fusion finds and merges entity pairs that refer to the same object, so as to improve the quality of the knowledge graph. Because identical names may refer to different entities and different names may refer to the same entity, as much entity information as possible needs to be used to improve the accuracy of entity pair matching. Methods for calculating the matching degree of entity pairs fall mainly into two types. One is supervised: a feature encoding model is obtained through training, the features of the entity pairs to be aligned are encoded, the similarity between the encodings is calculated, and entity pairs whose similarity exceeds a threshold are merged. The other is unsupervised: the matching degree of entity pairs is calculated from entity information, and entity pairs with a high matching degree are fused. In consideration of the generality of the construction tool, the invention selects an unsupervised method for entity fusion and designs an entity fusion algorithm based on entity attributes and neighbor features, which makes full use of the features of the entity and improves the accuracy of entity fusion; the algorithm framework is shown in fig. 5. The input of the algorithm is an entity pair to be matched, and the output is the matching score of the entity pair.
S43, aviation system entity alignment: an aviation system entity has two kinds of information, attributes and neighbor structure. To improve the accuracy of entity fusion, both kinds of information are fully used: the attribute features and the neighbor features of the aviation system entities are extracted and their matching degrees are calculated separately. Because attribute information and structural information are considered equally important, the matching degrees of the two kinds of features are averaged to give the matching score of the aviation system entity pair.
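The averaging in S43 can be sketched as follows. The text does not state how the two matching degrees are computed, so this example assumes two simple measures: the share of common attributes with equal values, and the Jaccard overlap of neighbor sets. Both measures, the dictionary layout of an entity and the function names are illustrative assumptions.

```python
def attribute_match(attrs_a, attrs_b):
    """Assumed measure: fraction of shared attribute names with equal values."""
    shared = set(attrs_a) & set(attrs_b)
    if not shared:
        return 0.0
    return sum(attrs_a[k] == attrs_b[k] for k in shared) / len(shared)

def neighbor_match(neighbors_a, neighbors_b):
    """Assumed measure: Jaccard overlap of the two neighbor sets."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_score(entity_a, entity_b):
    """Average of the attribute and neighbor matching degrees (S43)."""
    attr = attribute_match(entity_a["attrs"], entity_b["attrs"])
    neigh = neighbor_match(entity_a["neighbors"], entity_b["neighbors"])
    return (attr + neigh) / 2
```

Entity pairs whose score exceeds a chosen threshold would then be candidates for fusion.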
S44, fusing aviation system entities: aviation system entity pairs whose matching score exceeds the set threshold are fused, and the fusion comprises attribute fusion and relationship fusion.
Attribute values may conflict when data are fused, so the invention designs a conflict resolution strategy to ensure the accuracy and diversity of the data after entity fusion. When an aviation system entity attribute is a single-valued attribute, the semantic similarity of the attribute values is compared; if it is higher than a set value, one value is retained, otherwise both values are retained. When an aviation system entity attribute is a multi-valued attribute, all values are retained directly.
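The conflict resolution strategy can be sketched as below. Semantic similarity is again replaced by a character-level ratio from `difflib` as a labeled stand-in, and the threshold of 0.8 and the function name `merge_attribute` are assumptions made for the example.

```python
from difflib import SequenceMatcher

def string_ratio(a, b):
    """Stand-in for semantic similarity (assumption): character-level ratio."""
    return SequenceMatcher(None, str(a), str(b)).ratio()

def merge_attribute(value_a, value_b, multi_valued,
                    threshold=0.8, similarity=string_ratio):
    """Return the list of values kept after fusing one attribute."""
    if multi_valued:
        # Multi-valued attribute: retain all values from both entities.
        return list(value_a) + list(value_b)
    if similarity(value_a, value_b) > threshold:
        # Single-valued and similar enough: retain one value.
        return [value_a]
    # Single-valued but dissimilar: retain both values.
    return [value_a, value_b]
```

Returning a list in every case lets the caller store the fused attribute uniformly, whether one or several values survive.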
Acquiring data from multiple sources improves the knowledge diversity of the knowledge graph but can introduce data redundancy. The main purpose of entity fusion is to merge the nodes of the same entity from different data sources, which helps correct and complete the knowledge graph, increases its knowledge diversity, and improves its quality. Two cases exist for entities: the same name may refer to different entities, and different names may refer to the same entity, so entity names alone cannot be used for fusion. To address the entity fusion problem, the invention provides an entity fusion algorithm based on attributes and neighbor features, which comprehensively considers the attributes and neighbor features of entities on the basis of entity name semantics and thereby improves the quality of entity fusion; this is another important inventive point of the invention.
S5, constructing an aviation system semi-supervised information extraction model based on reading comprehension, and generating the aviation system knowledge graph: the aviation system semi-supervised information extraction model comprises an aviation system data preprocessing module, an aviation system pre-labeling data generation module, an aviation system labeling data selection and generation module, and an aviation system information extraction module, as shown in fig. 6.
S51, acquiring input data of the aviation system semi-supervised information extraction model: triples are obtained from the aviation system domain knowledge system, texts are obtained from unstructured text in the aviation system domain, and the triples and the texts are combined into the input data.
S52, performing aviation system data preprocessing on the input data based on question generation templates to generate a question corpus pair set.
The question generation templates comprise a first question generation template and a second question generation template. The aviation system entity recognition task data are preprocessed based on the first question generation template to generate a first question corpus pair set; the aviation system relation extraction task data are preprocessed based on the second question generation template to generate a second question corpus pair set; the question corpus pair set comprises the first question corpus pair set and the second question corpus pair set.
S53, generating aviation system pre-labeling data with a reading comprehension model for the question corpus pair set: the reading comprehension model obtains the conditional probability of the output aviation system text by attaching a fully connected layer with a softmax activation function, determines an aviation system text segment by its start position and end position, and calculates the score of the text segment, namely the answer confidence; the reading comprehension model finally outputs the aviation system text segment with the highest confidence.
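The segment selection in S53 can be sketched as below. The text does not give the scoring formula, so this example assumes the common choice of multiplying the start-position probability by the end-position probability; the probabilities are supplied directly rather than produced by a model, and the function name `best_span` and the `max_len` limit are assumptions.

```python
def best_span(tokens, start_probs, end_probs, max_len=10):
    """Return (text segment, confidence) for the highest-scoring span.

    The confidence is assumed to be start probability times end probability.
    """
    best_score, best_range = 0.0, None
    for start, p_start in enumerate(start_probs):
        # The end position must not precede the start position.
        for end in range(start, min(start + max_len, len(tokens))):
            score = p_start * end_probs[end]
            if score > best_score:
                best_score, best_range = score, (start, end)
    if best_range is None:
        return None, 0.0
    start, end = best_range
    return tokens[start:end + 1], best_score
```

The returned confidence is the kind of value that a threshold in the following step could be applied to when selecting pre-labeled data.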
In a specific embodiment, as shown in fig. 7, considering the generality of the framework, the reading comprehension model uses the roberta_wwm_ext_large model (wwm, whole word masking) trained on massive Chinese corpora. It is an upgraded version of the Bert (Bidirectional Encoder Representations from Transformers) model and uses the LTP toolkit of Harbin Institute of Technology as the word segmentation tool. With whole word masking and dynamic masking, the model performs better on larger data sets and with more training steps. RoBERTa also removes the next sentence prediction (NSP) task, which had a poor effect, and is trained for a longer time, with a larger batch size and with more data, which improves the effect of the model on downstream tasks.
S54, acquiring aviation system labeling data for the question corpus pair set based on a set threshold: for the first question corpus pair set, the labeling data are obtained in the BIO labeling mode, and for the second question corpus pair set, the labeling data are obtained in a second labeling mode. In the BIO labeling mode, a list of O labels with the same length as the sentence is generated, the position of the entity word is found in the sentence, the entity category is converted into pinyin, the first character of the entity word is labeled B- followed by the category pinyin, the remaining characters are labeled I- followed by the category pinyin, and the result is written to a file line by line to obtain the aviation system labeling data. In the second labeling mode, the relation categories in the knowledge system are obtained, a category number-category correspondence table is generated, and the category number, the entity pair and the sentence are concatenated and written to a file line by line to obtain the aviation system labeling data. In a specific embodiment, as shown in fig. 8, the left side shows the result of the BIO labeling mode and the right side shows the result of the second labeling mode.
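The BIO labeling mode can be sketched as below. The example assumes character-level labeling of a Chinese sentence and takes the category pinyin as an already converted string; the sentence, the entity and the function name `bio_labels` are illustrative.

```python
def bio_labels(sentence, entity, category_pinyin):
    """Return one BIO label per character of the sentence."""
    # Start from a list of O labels with the same length as the sentence.
    labels = ["O"] * len(sentence)
    start = sentence.find(entity) if entity else -1
    if start == -1:
        return labels
    # First character of the entity word: B-<category pinyin>.
    labels[start] = f"B-{category_pinyin}"
    # Remaining characters of the entity word: I-<category pinyin>.
    for i in range(start + 1, start + len(entity)):
        labels[i] = f"I-{category_pinyin}"
    return labels
```

Each character and its label could then be written to a file one pair per line, matching the line-by-line output described above.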
S55, extracting aviation system information based on the aviation system labeling data, and generating an aviation system knowledge graph.
S551, for the aviation system entity recognition task, if the training data are sufficient, a conditional random field (CRF) model is used for information extraction; if the training data are insufficient, a Bert-BiLSTM-CRF model is used to extract the aviation system information.
The main task of entity recognition is to extract the entities in the domain according to the defined knowledge system. Different models have different characteristics and suit different scenarios. For example, the CRF model trains quickly and extracts well when labeling data are sufficient, so it suits scenarios with sufficient labeling data in which structured data must be extracted quickly; the Bert-BiLSTM-CRF model requires a certain training time but reduces the amount of labeling data required, so it suits scenarios with insufficient labeling data. Presetting multiple models allows adaptation to different extraction scenarios and improves the generality of extraction.
As shown in fig. 9, the Bert-BiLSTM-CRF model comprises a Bert pre-training model, a bidirectional LSTM (long short-term memory) model and a CRF layer. The Bert pre-training model captures the hidden information in the corpus context and inputs the result as vectors into the bidirectional LSTM model. The bidirectional LSTM model automatically extracts sentence features from the complete context information and selects the label with the maximum probability value for output at each step. The CRF layer then constrains the output labels and filters out unreasonable results through transition probabilities, so that the linear feature functions of the linear-chain CRF are combined with the nonlinear output of the bidirectional LSTM, which improves the model effect of entity recognition.
S552, for the aviation system relation extraction task, a Bert-BiGRU-ATT model is used to extract the aviation system information.
As shown in fig. 10, the Bert-BiGRU-ATT model comprises a Bert vectorization input layer, a hidden layer and an output layer. The Bert vectorization input layer vectorizes the text. The hidden layer comprises a BiGRU (bidirectional gated recurrent unit) layer, an Attention layer and a Dense fully connected layer; the hidden layer is used for calculating the probability weight to be assigned to each word vector, and the BiGRU layer performs deep feature extraction on the context information.
The probability weights are calculated through the Attention mechanism, which further extracts text features and gives higher weights to the key information in the text, so as to distinguish different text content. In a sentence, different words play different roles in relation classification: some words contribute very little, while words that describe relations are very important. The Attention layer is introduced to distinguish them: the vector h_ijt output by the activation processing of the BiGRU network is input into the Attention layer; the Attention matrix is obtained by accumulating the products of the probability weights assigned by the Attention mechanism and the hidden layer states; and a Softmax function is used for normalization to obtain the predicted label.
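The weighting described above can be sketched in plain Python. The text does not give the exact attention formula, so the example assumes one common formulation: each hidden state is scored by the dot product of its tanh activation with a weight vector, the scores are normalized by softmax into probability weights, and the weighted hidden states are summed. The weight vector `w` and the function name `attention_pool` are assumptions.

```python
import math

def attention_pool(hidden_states, w):
    """Return (probability weights, weighted sum of hidden states)."""
    # Score each hidden state: dot product of tanh(h) with the weight vector.
    scores = [sum(math.tanh(x) * wi for x, wi in zip(h, w))
              for h in hidden_states]
    # Softmax normalization (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Accumulate the products of the weights and the hidden states.
    dim = len(hidden_states[0])
    context = [sum(a * h[j] for a, h in zip(alphas, hidden_states))
               for j in range(dim)]
    return alphas, context
```

In a full model the resulting context vector would be passed to the Dense layer for relation classification; here it only shows how more informative states receive larger weights.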
S553, comprehensively extracting aviation system information, returning to the step S43 for training iteration, and finally generating an aviation system knowledge graph.
A knowledge graph stores structured triple data, whereas most acquired data are stored as unstructured text. Extracting structured triples from unstructured text has the following problems: (1) the extraction effect of a deep learning model depends on the quality and quantity of the training data, and high-quality large-scale training data are difficult to acquire; (2) knowledge and business scenarios differ, and extraction models trained in different domains are difficult to transfer and reuse.
To address these two problems, the invention provides an aviation system semi-supervised information extraction model based on reading comprehension. Based on the defined knowledge system, the model automatically generates pre-labeling data through the reading comprehension model, selects high-quality data and inputs them into the information extraction model for training, and feeds the results predicted by the extraction model back into the reading comprehension model for iterative training, which improves the effect of the model on extraction tasks in the aviation system domain. Automatically generating pre-labeling data in combination with iterative model training addresses the problems that labeling data are difficult to acquire and that models cannot be transferred and reused, and improves the generality of the model; this is another important inventive point of the invention.
In the aviation system knowledge graph construction method based on fusion and semi-supervised information extraction, for the problem of knowledge system construction, a knowledge system fusion algorithm based on tag extraction and semantic features is provided: an aviation system knowledge system base is constructed to manage the aviation system knowledge systems in a unified way, and knowledge system fusion is performed on the basis of this base, which reduces the construction difficulty and improves the richness of the knowledge system. For the problem of entity fusion, an aviation system entity fusion algorithm based on attributes and neighbor features is provided, which comprehensively considers the attributes and neighbor features of the aviation system entities on the basis of their name semantics and improves the effect of entity fusion. For the problem that high-quality large-scale data are difficult to acquire, an aviation system semi-supervised information extraction framework based on reading comprehension is provided: pre-labeling data are generated through the reading comprehension model based on the constructed aviation system knowledge system base and sent to the information extraction model for training, and the results predicted by the model are sent back to the reading comprehension model for iteration, so that the model gradually acquires the characteristics of the aviation system domain through iterative training and its performance on the information extraction task improves.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention, all of which are intended to be encompassed by the claims.

Claims (7)

1. The aviation system knowledge graph construction method based on fusion and semi-supervision information extraction is characterized by comprising the following steps of:
S1, constructing an aviation system knowledge system base;
S11, acquiring original aviation system data: table-type data, JSON-type data and Neo4j graph database data are respectively acquired and together serve as the original aviation system data;
S12, preprocessing the aviation system data to obtain the attributes of the aviation system entities and the relationships among the entities;
S13, storing a unified aviation system knowledge system: considering that some aviation system knowledge graphs contain only entities and attributes, or only relationship information between entities, storing the attributes of the aviation system entities obtained in step S12 in an aviation system entity attribute table, and storing the relationships between the aviation system entities in an aviation system entity relationship table;
S14, constructing an aviation system knowledge system base: integrating the entity attribute table and the entity relation table of the aviation system as an aviation system knowledge system library;
S2, constructing an aviation system tag and entity category vocabulary;
S21, extracting aviation system keywords: extracting keywords of the aviation system entities from the names and descriptive attribute values of the aviation system entities by a TF-IDF keyword extraction algorithm;
S22, acquiring the aviation system tag and entity category vocabulary: counting and ranking the keywords of aviation system entities of the same category to obtain aviation system high-frequency keywords, and constructing the aviation system tag and entity category vocabulary from the aviation system high-frequency keywords;
S3, constructing an aviation system domain knowledge system: for html webpage files, performing aviation system entity attribute extraction, aviation system entity tag extraction, aviation system structural feature discovery and entity category alignment, and aviation system knowledge system fusion to obtain the aviation system domain knowledge system;
S4, constructing an aviation system entity fusion model based on attributes and neighbor features: improving, based on attributes and neighbor features, the fusion of the aviation system knowledge system in the aviation system domain knowledge system constructed in step S3, to obtain the entity-fused aviation system domain knowledge system;
S41, normalizing entity attributes of an aviation system: normalizing the entity attribute of the aviation system, wherein the normalization process comprises normalization of attribute names and unification of attribute value expression;
S42, aviation system entity blocking: performing entity name-based blocking and entity category-based blocking on the aviation system entities, respectively;
S421, converting the aviation system entity names into Bi-gram sequences, and establishing an aviation system entity inverted index table;
S422, inserting the aviation system entities into the aviation system entity inverted index table;
S423, placing the aviation system entities corresponding to a key whose value list has a length greater than 1 in the aviation system entity inverted index table into the same block;
S424, further dividing the aviation system entities of the same category within a block into the same sub-block, as the final aviation system entity blocking result;
S43, aviation system entity alignment: extracting the attribute features and neighbor features of the aviation system entities, calculating their matching degrees respectively, and taking the average as the matching score of the aviation system entity pair;
S44, fusing aviation system entities: fusing the aviation system entity pairs whose matching scores exceed a set threshold, the fusion comprising attribute fusion and relationship fusion; when an aviation system entity attribute is a single-valued attribute, comparing the semantic similarity of the attribute values, and retaining one value if the semantic similarity is higher than a set value, otherwise retaining both values; when an aviation system entity attribute is a multi-valued attribute, directly retaining all values;
S5, constructing an aviation system semi-supervised information extraction model based on reading comprehension, and generating an aviation system knowledge graph;
S51, acquiring input data of the aviation system semi-supervised information extraction model: acquiring triples from the aviation system domain knowledge system, acquiring texts from unstructured text in the aviation system domain, and combining the triples and the texts into the input data;
S52, performing aviation system data preprocessing on the input data based on question generation templates to generate a question corpus pair set;
S53, generating aviation system pre-labeling data with a reading comprehension model for the question corpus pair set;
S54, acquiring aviation system labeling data for the question corpus pair set based on a set threshold;
S55, extracting aviation system information based on the aviation system labeling data, and generating the aviation system knowledge graph;
S551, for the aviation system entity recognition task, if the training data are sufficient, using a CRF model for information extraction, and if the training data are insufficient, using a Bert-BiLSTM-CRF model to extract the aviation system information;
the Bert-BiLSTM-CRF model comprises a Bert pre-training model, a bidirectional LSTM model and a CRF layer, wherein the Bert pre-training model is used for capturing hidden information in a corpus context, a result is input into the bidirectional LSTM model as a vector, sentence characteristics are automatically extracted by the bidirectional LSTM model through complete context information, a label with a maximum probability value is selected for output in each step, reasonable limitation is carried out on the output label through the CRF layer, unreasonable results are screened through transition probability by the CRF layer, and an original linear characteristic function in a linear chain CRF is combined with nonlinear output of the bidirectional LSTM;
S552, aiming at an aviation system relation extraction task, extracting aviation system information by adopting a Bert-BiGRU-ATT model;
the Bert-BiGRU-ATT model comprises a Bert vectorization input layer, a hidden layer and an output layer, wherein the Bert vectorization input layer vectorizes the text and inputs it into the hidden layer, the hidden layer comprises a BiGRU layer, an Attention layer and a Dense fully connected layer, the hidden layer is used for calculating the probability weight to be assigned to each word vector, and the BiGRU layer is used for deep feature extraction of the context information;
S553, integrating the extracted aviation system information, returning to step S43 for training iteration, and finally generating the aviation system knowledge graph.
2. The method for constructing an aviation system knowledge graph based on fusion and semi-supervised information extraction according to claim 1, wherein the step S3 specifically comprises the following steps:
S31, inputting an HTML webpage file and extracting aviation system entity attributes: analyzing the structure of the aviation system pages of encyclopedia websites, determining from the webpage which specific website it belongs to, and parsing the InfoBox with rules set for each website to obtain the aviation system entity attributes;
S32, extracting aviation system entity tags: extracting aviation system entity tags from the aviation system descriptive information of the encyclopedia websites by using an unsupervised latent Dirichlet allocation (LDA) algorithm;
S33, aviation system structural feature discovery and entity category alignment: based on the extracted aviation system entity tags, traversing the aviation system tag and entity category vocabulary, and returning the corresponding aviation system entity category if a tag is hit; otherwise, calculating the semantic similarity between the tag and the aviation system entity categories, and returning the aviation system entity category with the highest semantic similarity as the aviation system candidate entity category;
S34, fusing the aviation system knowledge system: performing attribute fusion and structure fusion on the aviation system entities aligned to the aviation system entity categories;
S35, acquiring the aviation system domain knowledge system: forming the aviation system entities obtained by fusing the aviation system knowledge system into an aviation system domain knowledge system, and supplementing the aviation system knowledge system library and the aviation system tag and entity category vocabulary, respectively.
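The alignment logic of step S33 (vocabulary hit first, highest semantic similarity otherwise) can be sketched as follows. The claim does not specify how semantic similarity is computed, so the character-bigram cosine used here is only a toy stand-in, and the function names and sample vocabulary are hypothetical.

```python
import math
from collections import Counter


def char_bigram_vector(text):
    """Toy stand-in for a semantic embedding: counts of character bigrams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def align_category(tag, vocabulary, categories):
    """Map an extracted entity tag to an entity category.

    vocabulary: dict tag -> category (the tag and entity category vocabulary).
    Returns (category, "hit") on a vocabulary hit, otherwise the most
    similar category together with the marker "candidate".
    """
    if tag in vocabulary:
        return vocabulary[tag], "hit"
    tag_vec = char_bigram_vector(tag)
    best = max(categories, key=lambda c: cosine(tag_vec, char_bigram_vector(c)))
    return best, "candidate"
```

Returning the `"candidate"` marker keeps similarity-based matches distinguishable from exact vocabulary hits, in line with the claim's wording of a "candidate entity category".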
3. The method for constructing an aviation system knowledge graph based on fusion and semi-supervised information extraction according to claim 1, wherein the question generation templates in the step S52 include a first question generation template and a second question generation template, and the aviation system entity recognition task data preprocessing is performed based on the first question generation template to generate a first question corpus pair set; preprocessing the aviation system relation extraction task data based on a second question generation template to generate a second question corpus pair set; the question corpus pair sets comprise the first question corpus pair set and the second question corpus pair set;
the reading comprehension model in the step S53 obtains the conditional probability of the aviation system output text by attaching a fully connected layer with a softmax activation function, determines an aviation system text segment by its starting position and ending position, calculates the score of the aviation system text segment, namely the answer confidence, and finally outputs the aviation system text segment with the highest confidence;
in the step S54, a BIO labeling mode is applied to the first question corpus pair set to obtain labeling data, and a second labeling mode is applied to the second question corpus pair set to obtain labeling data; in the BIO labeling mode, a list of 'O' labels with the same length as the sentence is generated, the position of the entity word is found in the sentence, the entity category is converted into pinyin, the first character of the entity word is labeled 'B-category pinyin', the remaining positions of the entity word are labeled 'I-category pinyin', and the result is written to a file row by row to obtain aviation system labeling data; in the second labeling mode, the relation categories in the knowledge system are obtained, a category number-category correspondence table is generated, and the category number, the entity pair and the sentence are spliced and written to a file row by row to obtain aviation system labeling data.
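The BIO labeling mode of step S54 can be sketched in a few lines. This is an illustrative sketch under stated assumptions: labeling is done per character, only the first occurrence of the entity word is labeled, the row format `character label` is hypothetical, and the pinyin conversion is assumed to have been done beforehand (the category `fadongji` is a made-up example).

```python
def bio_label(sentence, entity, category_pinyin):
    """Label one entity mention in a sentence, character by character."""
    labels = ["O"] * len(sentence)       # all-"O" list of sentence length
    if not entity:
        return labels
    start = sentence.find(entity)        # position of the entity word
    if start == -1:
        return labels                    # entity not present: keep all "O"
    labels[start] = "B-" + category_pinyin
    for i in range(start + 1, start + len(entity)):
        labels[i] = "I-" + category_pinyin
    return labels


def to_rows(sentence, labels):
    """Format one 'character label' row per position (assumed file layout)."""
    return [f"{ch} {label}" for ch, label in zip(sentence, labels)]
```

For example, `bio_label("涡扇发动机故障", "发动机", "fadongji")` labels the three characters of the entity and leaves every other position as `O`.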
4. The method for constructing an aviation system knowledge graph based on fusion and semi-supervised information extraction according to claim 2, wherein the unsupervised latent Dirichlet allocation (LDA) algorithm in step S32 specifically comprises the following steps:
S321, setting the number k of tags, traversing the aviation system documents, and randomly assigning each word to one of the tags;
S322, for each aviation system document d, scanning each word w and calculating the proportion $p(\mathrm{topic}_t \mid \mathrm{document}_d)$ of the words in aviation system document d that belong to tag t, and the proportion $p(\mathrm{word}_w \mid \mathrm{topic}_t)$ of the assignments to tag t, over all aviation system documents, that come from word w;
S323, updating the probability $p(\mathrm{word}_w \text{ with } \mathrm{topic}_t)$ that word w belongs to tag t:
$$p(\mathrm{word}_w \text{ with } \mathrm{topic}_t) = p(\mathrm{topic}_t \mid \mathrm{document}_d) \times p(\mathrm{word}_w \mid \mathrm{topic}_t) \qquad (1)$$
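Steps S321 to S323 can be sketched as a small sampling loop. Several details are assumptions not stated in the claim: the current word's own assignment is removed before the two proportions are computed, a tiny smoothing constant avoids zero probabilities, the new tag is sampled in proportion to equation (1), and the number of sweeps is arbitrary.

```python
import random
from collections import Counter


def lda_tags(docs, k, sweeps=20, seed=0, smooth=1e-9):
    """docs: list of token lists. Returns (assignments, per-tag word counts)."""
    rng = random.Random(seed)
    # S321: randomly assign every word occurrence to one of the k tags.
    assign = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_tag = [Counter(a) for a in assign]       # tag counts per document
    tag_word = [Counter() for _ in range(k)]     # word counts per tag
    for doc, a in zip(docs, assign):
        for w, t in zip(doc, a):
            tag_word[t][w] += 1
    tag_total = [sum(c.values()) for c in tag_word]

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assign[d][i]
                # Remove the current assignment before re-estimating (assumption).
                doc_tag[d][old] -= 1
                tag_word[old][w] -= 1
                tag_total[old] -= 1
                weights = []
                for t in range(k):
                    # S322: p(topic_t | document_d) and p(word_w | topic_t).
                    p_t_d = (doc_tag[d][t] + smooth) / (len(doc) - 1 + k * smooth)
                    p_w_t = (tag_word[t][w] + smooth) / (tag_total[t] + smooth)
                    weights.append(p_t_d * p_w_t)    # S323: equation (1)
                new = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = new
                doc_tag[d][new] += 1
                tag_word[new][w] += 1
                tag_total[new] += 1
    return assign, tag_word
```

After the sweeps, the most frequent words in each entry of `tag_word` could serve as candidate entity tags; how the tags are read off is not specified in the claim.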
5. The method for constructing an aviation system knowledge graph based on fusion and semi-supervised information extraction according to claim 2, wherein the attribute fusion in the step S34 comprises traversing the aviation system entity attributes, converting the aviation system entity attribute names into Bert word vectors, calculating the cosine similarity of the Bert word vectors, and retaining only one of the aviation system entity attributes whose cosine similarity is higher than a given threshold; the cosine similarity is calculated as follows:
$$\mathrm{Similarity}(attr1, attr2) = \cos(\mathrm{Bert}(attr1), \mathrm{Bert}(attr2)) \qquad (2)$$
wherein attr1 and attr2 represent a first attribute and a second attribute of the aviation system entity, respectively, and Bert(attr1) and Bert(attr2) represent the first Bert word vector and the second Bert word vector, respectively.
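The attribute fusion of equation (2) can be sketched as a greedy de-duplication pass. The sketch makes several assumptions: `embed` is a hypothetical callable standing in for the Bert word-vector lookup, the threshold value 0.9 is arbitrary, and the first attribute of each similar group is the one retained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors, as in equation (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0


def fuse_attributes(names, embed, threshold=0.9):
    """Keep one attribute name from each group of near-duplicates.

    embed: callable name -> vector (assumed stand-in for Bert(attr)).
    An attribute is kept only if its similarity to every attribute
    already kept does not exceed the threshold.
    """
    kept = []
    for name in names:
        vec = embed(name)
        if all(cosine(vec, embed(other)) <= threshold for other in kept):
            kept.append(name)
    return kept
```

With toy vectors in which "max speed" and "maximum speed" are nearly parallel and "wingspan" is orthogonal to both, the function keeps "max speed" and "wingspan".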
6. The method for constructing an aviation system knowledge graph based on fusion and semi-supervised information extraction according to claim 1, wherein the step S12 specifically comprises the following steps:
S121, table header parsing: for table-type data, parsing the header of the Excel table to obtain the attributes of the aviation system entities and the relationships between the entities;
S122, JSON format parsing: for JSON-type data, parsing the JSON structure to obtain the attributes of the aviation system entities and the relationships between the entities;
S123, aviation system entity attribute and relation statistics: for Neo4j graph database data, querying all data of the aviation system database through Cypher statements, counting the attributes contained in each type of aviation system entity, and counting the relationships between the aviation system entities.
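The JSON parsing of step S122 and the per-type statistics of step S123 can be sketched as follows. The JSON layout is purely hypothetical (a list of records with the keys `type`, `attributes` and `relations`), since the claim does not describe the actual schema, and no Neo4j access is shown.

```python
import json
from collections import defaultdict


def collect_schema(json_text):
    """Collect attribute names per entity type and relation triples.

    Assumed record layout (hypothetical):
      {"type": ..., "attributes": {...},
       "relations": [{"name": ..., "target_type": ...}]}
    """
    records = json.loads(json_text)
    attributes = defaultdict(set)    # entity type -> attribute names
    relations = set()                # (source type, relation, target type)
    for record in records:
        attributes[record["type"]].update(record.get("attributes", {}))
        for rel in record.get("relations", []):
            relations.add((record["type"], rel["name"], rel["target_type"]))
    return {t: sorted(names) for t, names in attributes.items()}, sorted(relations)
```

The same aggregation (attribute names per entity type, distinct relation triples) is what step S123 describes for graph database data, where the records would come from Cypher query results instead of a JSON file.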
7. The method for constructing an aviation system knowledge graph based on fusion and semi-supervised information extraction according to claim 1, wherein the aviation system semi-supervised information extraction model in step S5 comprises an aviation system data preprocessing module, an aviation system pre-annotation data generating module, an aviation system annotation data selecting and generating module and an aviation system information extraction module.
CN202211699386.0A 2022-12-28 2022-12-28 Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction Active CN116127090B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211699386.0A CN116127090B (en) 2022-12-28 2022-12-28 Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction

Publications (2)

Publication Number Publication Date
CN116127090A CN116127090A (en) 2023-05-16
CN116127090B true CN116127090B (en) 2023-11-21

Family

ID=86302108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211699386.0A Active CN116127090B (en) 2022-12-28 2022-12-28 Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction

Country Status (1)

Country Link
CN (1) CN116127090B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701665A (en) * 2023-08-08 2023-09-05 滨州医学院 Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method
CN116821376B (en) * 2023-08-30 2024-03-08 北京华琦远航国际咨询有限公司 Knowledge graph construction method and system in coal mine safety production field

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019843A (en) * 2018-09-30 2019-07-16 北京国双科技有限公司 The processing method and processing device of knowledge mapping
CN111444351A (en) * 2020-03-24 2020-07-24 清华苏州环境创新研究院 Method and device for constructing knowledge graph in industrial process field
CN112182241A (en) * 2020-09-24 2021-01-05 四川大学 Automatic construction method of knowledge graph in field of air traffic control
CN112542223A (en) * 2020-12-21 2021-03-23 西南科技大学 Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record
CN112800247A (en) * 2021-04-09 2021-05-14 华中科技大学 Semantic encoding/decoding method, equipment and communication system based on knowledge graph sharing
CN114896417A (en) * 2022-05-20 2022-08-12 郑州轻工业大学 Method for constructing computer education knowledge graph based on knowledge graph
CN115422370A (en) * 2022-08-31 2022-12-02 苏州空天信息研究院 Demand influence domain analysis method based on knowledge graph
CN115329101A (en) * 2022-09-06 2022-11-11 南京邮电大学 Electric power Internet of things standard knowledge graph construction method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
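Link Prediction with Supervised Learning on an Industry 4.0 related Knowledge Graph; Irlan Grangel-González et al.; 《2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)》; 1-8 *
基于BERT-Bi-LSTM-CRF模型的自主式交通系统参与主体识别方法 [Method for identifying participants in autonomous transportation systems based on the BERT-Bi-LSTM-CRF model]; 唐进君 et al.; 《交通信息与安全》; Vol. 40 (No. 240); 80-90 *
航空制造知识图谱构建研究综述 [Survey of research on knowledge graph construction for aviation manufacturing]; 邱凌 et al.; 《计算机应用研究》; Vol. 39 (No. 4); 968-977 *
航空语义知识库构建方法研究 [Research on construction methods for an aviation semantic knowledge base]; 董洪飞 et al.; 《航空标准化与质量》; (No. 5); 52-56 *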

Also Published As

Publication number Publication date
CN116127090A (en) 2023-05-16

Similar Documents

Publication Publication Date Title
CN112199511B (en) Cross-language multi-source vertical domain knowledge graph construction method
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN112487190B (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN113157859B (en) Event detection method based on upper concept information
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN113761208A (en) Scientific and technological innovation information classification method and storage device based on knowledge graph
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN115390806A (en) Software design mode recommendation method based on bimodal joint modeling
CN114881043A (en) Deep learning model-based legal document semantic similarity evaluation method and system
Aghaei et al. Question answering over knowledge graphs: A case study in tourism
CN111428502A (en) Named entity labeling method for military corpus
CN114117000A (en) Response method, device, equipment and storage medium
Tarride et al. A comparative study of information extraction strategies using an attention-based neural network
CN111930892B (en) Scientific and technological text classification method based on improved mutual information function
CN112613318B (en) Entity name normalization system, method thereof and computer readable medium
CN115827871A (en) Internet enterprise classification method, device and system
Zhang et al. An overview on supervised semi-structured data classification
CN115344668A (en) Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN114942981A (en) Question-answer query method and device, electronic equipment and computer readable storage medium
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning
Hao Naive Bayesian Prediction of Japanese Annotated Corpus for Textual Semantic Word Formation Classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant