CN116050408A - Knowledge graph construction method and knowledge graph construction system in civil engineering standardization field - Google Patents


Info

Publication number
CN116050408A
Authority
CN
China
Prior art keywords
model
training
knowledge graph
civil engineering
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310092861.6A
Other languages
Chinese (zh)
Inventor
白久林
陈耀坤
刘毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202310092861.6A
Publication of CN116050408A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a knowledge graph construction method and construction system for the field of civil engineering standards. The method comprises: screening the sentences to be extracted from the standard text, revising them, and selecting the sentences that conform to the Schema label definition so that they are suitable for extraction by a neural network model; using part of the screened text for model training and annotating the training set according to the designed triple Schema labels to obtain a triple training set for civil engineering standard texts; constructing a CE-CasRel model, training it on the annotated data set, and verifying it on a verification set; inputting the standard text to be extracted into the trained CE-CasRel model, extracting the triples of the civil engineering standards, and storing them in a json file; and reading the json file, parsing the triple data, and using the Neo4j API to build and visualize a knowledge graph of the civil engineering standardization field. The resulting knowledge graph meets the requirements of intelligent drawing review.

Description

Knowledge graph construction method and knowledge graph construction system in civil engineering standardization field
Technical Field
The invention belongs to the technical field of intelligent drawing review and relates to a knowledge graph construction method and construction system for the field of civil engineering standardization, in particular to such a method and system based on a deep neural network.
Background
The entire life cycle of a building project is constrained by specifications and standards. Drawing review, a key step in guaranteeing design and construction quality, still depends heavily on manual checking. Manual review currently has several problems: reviewers differ in skill, and reviewers in different regions and institutions interpret the specifications differently, so they reach different judgments on whether a drawing is acceptable; drawings contain many repeated review items, yet reviewers must check and confirm compliance one by one, which causes repeated labor and lowers efficiency; building specifications contain numerous, wide-ranging clauses that are hard to understand, and different specifications may impose different requirements on the same building component, so a complete and error-free review is difficult to guarantee in practice; and the degree of automation is low, which cannot meet national requirements for the future development of intelligent construction.
For these reasons, intelligent drawing review is receiving more and more attention and research, and its importance is increasingly prominent. The most complex part of intelligent review is how a computer understands the knowledge in the specifications and reasons over it. Current research can be divided into two parts: the representation of specification knowledge and the interpretation of rules.
Knowledge is usually abstracted into triples of entities, attributes and relations, and is traditionally stored in formats such as RDF and XML. In recent years, ontology-based knowledge graphs, which represent knowledge as a graph-shaped relation network, have developed as a more intuitive form. A domain ontology must be established for a specific application scenario or actual demand; ontologies differ between professional domains and often require a large amount of manual processing, so the approach is not intelligent enough. In the civil engineering field, the relation network defined by an ontology cannot capture specification clauses at a fine level of detail, and how to construct a unified and efficient ontology framework suited to knowledge extraction in civil engineering remains an open problem.
Rule interpretation processes a specification into a specific format that a computer can recognize, and building a knowledge graph can itself be regarded as rule interpretation. Traditional interpretation is done manually by experts. In recent years, following developments in other fields, scholars have begun to study interpretation by means of natural language processing, ontologies and similar techniques. Existing interpretation methods fall into two categories, shallow structuring and pattern matching, and both have limitations. Shallow structuring can process specification text only at a coarse granularity, cannot analyze it at word level, and requires considerable manual maintenance. Pattern matching depends heavily on regular expressions, is inflexible, and is costly to maintain when later updates are needed.
Disclosure of Invention
In view of this, the invention provides a knowledge graph construction method and construction system for the field of civil engineering standardization. It addresses the high later maintenance cost caused by the limitations of the two currently common interpretation methods, shallow structuring and pattern matching, and meets the requirements that intelligent drawing review places on the knowledge graph.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A knowledge graph construction method in the civil engineering standardization field comprises the following steps:
S1, screening the sentences to be extracted from the standard text, revising the text format and sentence patterns of the screened text, and selecting the sentences that conform to the Schema label definition so that they are suitable for extraction by a neural network model;
S2, using part of the text screened in step S1 for model training: segmenting the text into sentences, and annotating the training set according to the designed triple Schema labels to obtain a triple training set for civil engineering standard texts;
S3, constructing a CE-CasRel model, training it on the annotated data set, and verifying it on a verification set; when the CE-CasRel model is constructed, the pre-trained model used is a BERT model optimized on a civil engineering data set;
S4, repeating steps S1 and S2, inputting the resulting standard text to be extracted into the trained CE-CasRel model, extracting the triples of the civil engineering standards, and storing them in a json file;
S5, reading the json file, parsing the triple data, and using the Neo4j API to build a knowledge graph of the civil engineering standardization field and visualize it.
Further, the specific process of step S2 is:
S21, using a regular expression to segment the text and collecting the segmented sentences under the same folder to obtain the data set to be annotated; the regular expression treats periods, semicolons and colons as sentence delimiters;
S22, dividing the data set into a training set and a test set according to a preset proportion, and annotating the training set with annotation software to obtain a triple training set for training knowledge extraction;
S23, for triple extraction and knowledge graph construction, dividing the labels of the triple training set into entity labels and relation labels according to a new Schema label definition rule; the entity labels MSub, Sub and Prop represent the objects to be reviewed in drawing review, and the relation labels Sub_Prop, Prop_Bx and Bx_RProp are the three basic relations derived from a specification clause, expressing that a certain attribute of a certain subject should meet a certain specification requirement.
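Steps S21 and S22 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the restriction to the full-width marks 。；： and the fixed random seed are all hypothetical, and the default ratio of 0.8 merely stands in for the "preset proportion".

```python
import random
import re

# Zero-width split point after each delimiter named in S21.
# Assumption: only full-width Chinese marks are used, so ASCII ":" in
# ratios and "." in decimals never split a sentence.
SENTENCE_END = re.compile(r"(?<=[。；：])")


def split_sentences(text):
    """Split text after each delimiter, keeping the delimiter with its sentence."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def split_dataset(sentences, train_ratio=0.8, seed=0):
    """Shuffle the sentences and return (training set, test set)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Keeping the delimiter attached to its sentence preserves the original text exactly, which makes character offsets in later annotation easier to trace back. Splitting on an empty match requires Python 3.7 or later.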
Further, the entity labels in step S23 are as follows. The entity labels MSub, Sub and Prop represent the objects to be reviewed in drawing review. Sub represents the subject (Subject) of the object under review, such as a component or structure; Prop represents a property (Property) of Sub that is constrained by RProp or LRProp, such as thickness or cross-sectional area; MSub represents the parent set of Sub and stands in a containment relation with Sub elements.
The entity label RProp is connected to Prop and represents the requirement (Requirement) placed on Prop. A specification clause is satisfied only if the constraint given by RProp is met.
The entity label Bx represents the behavior (Behavior) between Prop and RProp, most commonly a comparison or a dependency, for example "greater than", "having" or "satisfying".
The entity label RSub is the parent element of RProp and is used when RProp is an attribute of a certain object or concept. For example: the "effective height" should preferably not be less than 1/3 of the beam height.
The entity label LRProp is a precondition connected to Bx and is treated as a limitation: only when the precondition given by LRProp is satisfied does the review go on to determine whether RProp is met.
The entity label MProp acts as the upper-level attribute of Prop when attributes are nested. For example: "concrete/MProp" and "strength grade/Prop" in "concrete strength grade of prefabricated part".
The entity labels CSub, CProp, CBx and CRProp form a condition set (Condition). They are introduced when a specification sentence is so complex that a complete clause qualifies the fact set and LRProp alone can no longer summarize the semantics. The condition set covers the four most basic elements of a sentence in a rule.
The relation labels are as follows:
The relation labels Sub_Prop, Prop_Bx and Bx_RProp are the three basic relations derived from a specification clause, expressing that a certain attribute of a certain subject should meet a certain specification requirement.
The relation label RSub_RProp indicates that "a specification requirement" is "an attribute of a certain subject".
The relation label MProp_Prop represents an "attribute of an attribute" relation, i.e. Prop is an attribute of MProp.
The relation label LRProp_Bx represents an element acting as a precondition for reviewing the main sentence; this patent regards that element as a qualifier of the Bx element.
The relation label MSub_Sub indicates that MSub is the parent set of Sub, i.e. MSub contains Sub.
The relation labels CSub_CProp, CProp_CBx and CBx_CRProp are analogous to the three basic relations; the only difference is that they apply to the condition set rather than the fact set.
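The label definitions above can be captured in a small Python structure that also checks whether an annotated triple is allowed. This is only a sketch: the constant and function names are hypothetical, and it assumes that each relation name encodes its head label before the underscore and its tail label after it.

```python
# Entity labels: 4 basic labels plus 8 modifiers and qualifiers.
BASIC_ENTITIES = ("Sub", "Prop", "Bx", "RProp")
EXTRA_ENTITIES = ("MSub", "RSub", "LRProp", "MProp",
                  "CSub", "CProp", "CBx", "CRProp")
ENTITY_LABELS = set(BASIC_ENTITIES + EXTRA_ENTITIES)

# Relation labels mapped to (head label, tail label).
# Assumption: the relation name lists the head label first.
RELATION_LABELS = {
    name: tuple(name.split("_"))
    for name in (
        "Sub_Prop", "Prop_Bx", "Bx_RProp",           # 3 basic relations
        "RSub_RProp", "MProp_Prop", "LRProp_Bx", "MSub_Sub",
        "CSub_CProp", "CProp_CBx", "CBx_CRProp",     # condition set
    )
}


def is_valid_triple(head_label, relation, tail_label):
    """Return True if the Schema allows this (head, relation, tail) labelling."""
    return RELATION_LABELS.get(relation) == (head_label, tail_label)
```

For the document's own example, "concrete/MProp" and "strength grade/Prop" are linked by MProp_Prop, so `is_valid_triple("MProp", "MProp_Prop", "Prop")` holds, while swapping the two labels does not.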
Further, the specific process of step S3 is:
S31, constructing the CE-CasRel model: at the encoder end, the original encoder is replaced by a BERT model retrained on a civil engineering data set, referred to as CEBERT; at the decoder end, a stacked pointer labeling method is used to handle nested entities; and before the tail entity of a triple is predicted, Conditional Layer Normalization (CLN) is applied with the feature vector of the head entity as the condition;
S32, feeding the annotated training set text into the CEBERT encoder; the encoder applies multi-head attention, layer normalization and residual connections to each specification clause, extracts text features, and produces a sentence-level vector representation;
S33, passing the vector representation produced by the pre-trained model CEBERT to the head entity recognition layer of the decoder, where a linear layer and a sigmoid activation function decide whether each token is the start or the end of a head entity; the identified starts and ends are then paired to obtain the set of candidate head entities;
S34, randomly selecting one head entity and applying CLN once to the sentence-level vector representation with that entity as the condition; the CLN-processed representation is then used as input to a linear layer and a sigmoid activation function, which predict the tail entity positions under each relation;
S35, after training of the CE-CasRel model is finished, verifying the trained model on the test set data, and saving the trained model weight parameters once the required precision and recall are reached.
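The check in step S35 can be sketched as follows. The patent does not state the matching criterion, so this example assumes that a predicted triple counts as correct only if its head, relation and tail all match a gold triple exactly; the function name is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Triple-level precision, recall and F1 under exact matching."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Using sets means a triple predicted twice for the same clause is counted once; scores would normally be accumulated over all clauses in the test set before the thresholds are compared.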
Further, the CEBERT encoder in step S32 is obtained as follows: a Python crawler collects a large amount of text from civil engineering encyclopedia entries and standard texts, and data cleaning such as splitting long texts and filtering out irrelevant content is performed to build a civil engineering data set. A Chinese RoBERTa pre-trained model is selected, and the data set is fed into it as the input corpus for fine-tuning; the training method and evaluation metrics follow the original RoBERTa design. The result is the model CEBERT, fine-tuned on the civil engineering data set.
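Further, the CLN expression in step S31 is:
$$\mathrm{CLN}(y, c_\gamma, c_\beta) = (\gamma + W_1 c_\gamma) \odot \frac{y - \operatorname{mean}(y)}{\operatorname{std}(y)} + (\beta + W_2 c_\beta)$$
where y is the feature information input to the CLN structure, c_γ and c_β are the two pieces of condition information to be fused, γ and β are the gain and bias of layer normalization, and W_1 and W_2 are trainable weights;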
the sigmoid activation function in step S33 is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
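The start/end decision of step S33 can be illustrated with a short sketch. It assumes a threshold of 0.5 and the nearest-end pairing rule used by the original CasRel model; the patent's stacked pointer labeling for nested entities would need a richer pairing rule, and the function names here are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_spans(start_logits, end_logits, threshold=0.5):
    """Pair each predicted start with the nearest predicted end at or after it."""
    starts = [i for i, z in enumerate(start_logits) if sigmoid(z) > threshold]
    ends = [i for i, z in enumerate(end_logits) if sigmoid(z) > threshold]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))
    return spans
```

Each returned pair is a (start index, end index) token span for one candidate head entity; a start with no end after it is dropped.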
Further, the specific process of step S4 is: steps S1 and S2 are repeated, the resulting standard text to be extracted is input into the trained CE-CasRel model, and the set of triples conforming to the Schema label definition of step S23 is extracted; through Python data processing, the extracted triples are stored line by line according to their clause, where one line holds one specification clause and one clause corresponds to several triples; all clauses are stored in a json file.
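The storage layout of step S4 (one line per clause, several triples per clause) could look like the sketch below. The field names "text" and "triples" and the five-element triple layout are hypothetical, since the patent does not specify the json structure.

```python
import json


def save_triples(path, records):
    """Write one JSON object per line: a clause and the triples extracted from it."""
    with open(path, "w", encoding="utf-8") as f:
        for clause, triples in records:
            row = {"text": clause, "triples": [list(t) for t in triples]}
            # ensure_ascii=False keeps Chinese text readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_triples(path):
    """Read the file back as a list of dicts, one per clause."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A file in this one-object-per-line layout can also be read with pandas via `pandas.read_json(path, lines=True)`.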
Further, the specific steps of step S5 are:
S51, defining the nodes in the Neo4j graph database according to the entity labels of the Schema in step S23, and defining the edges according to the relation labels;
S52, reading the json file with Python's json and pandas libraries, reading the triple set of each specification clause line by line, and adding the triples extracted from each set to a Python list according to the defined Schema labels;
S53, connecting Python to Neo4j with the py2neo library; adding the entities in the list of step S52 as Neo4j nodes and the relations in that list as Neo4j edges, which completes the construction of the knowledge graph;
S54, starting Neo4j from the console, and viewing and saving the graph in the browser.
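Steps S51 to S53 might be sketched as below. The row layout (a "triples" list of [head, head label, relation, tail, tail label]) and both function names are hypothetical, and the Neo4j part has not been run against a live database here; it relies on py2neo's `Graph`, `Node`, `Relationship`, `Graph.merge` and `Graph.create`.

```python
def collect_graph_elements(rows):
    """Turn clause rows into de-duplicated node and edge lists."""
    nodes, edges = set(), set()
    for row in rows:
        for head, head_label, relation, tail, tail_label in row["triples"]:
            nodes.add((head_label, head))
            nodes.add((tail_label, tail))
            edges.add(((head_label, head), relation, (tail_label, tail)))
    return sorted(nodes), sorted(edges)


def write_to_neo4j(nodes, edges, uri, user, password):
    """Write nodes and edges to Neo4j (requires py2neo and a running server)."""
    from py2neo import Graph, Node, Relationship  # imported lazily on purpose

    graph = Graph(uri, auth=(user, password))
    bound = {}
    for label, name in nodes:
        node = Node(label, name=name)
        graph.merge(node, label, "name")   # one node per (label, name)
        bound[(label, name)] = node
    for head, relation, tail in edges:
        # create() adds only the relationship, since both nodes are already bound;
        # running the script twice would therefore duplicate edges.
        graph.create(Relationship(bound[head], relation, bound[tail]))
```

Separating the pure list-building step from the database write makes the first half testable without a server; the entity label becomes the Neo4j node label and the relation label becomes the relationship type.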
A knowledge graph construction system in the field of civil engineering standardization comprises a preprocessing module, a knowledge extraction module and a construction module, wherein:
the preprocessing module segments the specification clauses into sentences and preprocesses the segmented sentences to obtain convertible specification sentences;
the knowledge extraction module uses the deep neural network model CE-CasRel to extract the triples in the specification clauses, yielding structured specification clauses stored as triples;
and the construction module builds the knowledge graph, storing the structured specification clauses by writing the triples into a graph database through the py2neo library.
The invention has the beneficial effects that:
1. The knowledge graph construction method disclosed by the invention combines knowledge representation and rule interpretation and, from the review point of view, proposes an abstract representation for the specification clauses to be reviewed: "a certain attribute of a certain subject should meet a certain specification requirement". This abstraction is summarized as 4 basic entities and 3 basic relations, with a further 8 entities and 7 relations added as modifiers or qualifiers. The method abstracts and refines the core information of a specification clause, which facilitates extraction by a deep learning model and also serves rule interpretation and knowledge graph construction; it meets the requirements of intelligent drawing review for knowledge graphs; and it addresses the high later maintenance cost caused by the limitations of the two existing interpretation methods, shallow structuring and pattern matching, in knowledge representation and rule interpretation.
2. Intelligent drawing review cannot be separated from the existing specification clauses, so the basic problem of storing specifications in a database in structured form must be solved. Knowledge graphs, as graph databases, are one current research direction, but specifications are highly specialized, and existing methods cannot be applied directly to the civil engineering field. The method of the invention is suited to knowledge graph construction in civil engineering, improves an existing information extraction model using a civil engineering training data set, and benefits follow-up research.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a knowledge graph construction method in the civil engineering standardization field according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a network structure of a CE-CasRel model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a structured document according to an embodiment of the present invention;
FIG. 4 is a diagram showing the knowledge graph in the civil engineering standardization field in an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described below through specific examples, and those skilled in the art can readily understand other advantages and effects of the invention from this disclosure. The invention may also be implemented or applied through other specific embodiments, and the details in this description may be modified or varied from different viewpoints and for different applications without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments only illustrate the basic idea of the invention schematically, and that the following embodiments and the features in them may be combined with each other where no conflict arises.
The drawings are for illustration only; they are schematic rather than physical representations and should not be construed as limiting the invention. To better illustrate the embodiments, some elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numbers in the drawings of the embodiments correspond to the same or similar components. In the description of the invention, terms such as "upper", "lower", "left", "right", "front" and "rear" that indicate an orientation or positional relationship are based on the orientation or positional relationship shown in the drawings. They are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only and should not be construed as limiting the invention, and those of ordinary skill in the art can understand their specific meaning according to the circumstances.
The knowledge graph construction method and construction system for the civil engineering standardization field shown in FIG. 1 involve the following steps:
S1, screening the sentences to be extracted from the standard text, and revising the sentence patterns of the screened text to some extent so that it is suitable for extraction by a neural network model;
In this embodiment, the standard files are collected by methods such as web crawling. Most of the collected files are in PDF or Word format and are converted into TXT files with software such as ABBYY or Acrobat DC. The pictures and tables in the specifications are deleted, irrelevant or non-extractable content such as tables of contents, blank lines and page numbers is removed with regular expressions, and the sentences conforming to the Schema label definition are selected. The selected sentences should contain keywords expressing, for example, comparison or subordination relations.
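The regular-expression cleaning and keyword screening described in this embodiment can be sketched as follows. The patterns, the keyword list and the function names are all hypothetical; a real standard would need patterns tuned to its own layout, and the keywords would be the Chinese comparison and subordination terms used in the clauses.

```python
import re

# Assumed shapes of noise lines in the converted TXT files.
PAGE_NUMBER = re.compile(r"^\s*-?\s*\d+\s*-?\s*$")   # e.g. "12" or "- 12 -"
TOC_ENTRY = re.compile(r"[.·…]{3,}\s*\d+\s*$")       # dotted leader ending in a page number

# Hypothetical keyword list standing in for comparison/subordination terms.
KEYWORDS = ("greater than", "less than", "having", "satisfying")


def clean_lines(raw_text):
    """Drop blank lines, bare page numbers and table-of-contents entries."""
    kept = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or PAGE_NUMBER.match(line) or TOC_ENTRY.search(line):
            continue
        kept.append(line)
    return kept


def select_candidates(lines, keywords=KEYWORDS):
    """Keep only lines that contain at least one screening keyword."""
    return [line for line in lines if any(k in line for k in keywords)]
```

The keyword filter is deliberately coarse: it only narrows the text down to sentences likely to match the Schema, and the final selection in step S1 still depends on the label definitions.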
S2, using part of the text screened in step S1 for model training: segmenting the text into sentences, and annotating the training set according to the designed triple Schema labels to obtain a triple training set for civil engineering standard texts. Specifically:
S21, using a regular expression to segment the text and saving the segmented sentences as txt files; the txt files are collected under the same folder to obtain the data set to be annotated; the regular expression treats periods, semicolons and colons as sentence delimiters;
S22, dividing the data set into a training set and a test set in the proportion 8:2, and annotating the training set with annotation software to obtain a triple training set for training knowledge extraction;
S23, for triple extraction and knowledge graph construction, a new Schema label set is defined, and the labels of the triple training set are divided into entity labels and relation labels according to the new Schema label definition rule, as shown in Tables 1 and 2 respectively. From the perspective of rule review, this patent proposes an abstract representation for the specification clauses to be reviewed: "a certain attribute of a certain subject should meet a certain specification requirement". The label Sub denotes "a certain subject", Prop denotes "a certain attribute", RProp denotes "a certain specification requirement", and Bx denotes "should meet", which is essentially a behavior (Behavior) between Prop and RProp. Sometimes a single RProp label cannot fully summarize the semantics, and the label RSub is introduced to extend it, so that RProp can be regarded as an attribute of RSub. Similarly, when a single Sub cannot fully summarize the semantics, the label MSub is introduced as the parent set of Sub. The label LRProp is similar to RProp, but the prefix L marks a precondition: the rule review is performed only when the comparison between Prop and LRProp is true. Some clauses contain attributes that modify other attributes, such as "concrete/MProp" and "strength grade/Prop" in "concrete strength grade of prefabricated part"; in that case MProp is introduced as the upper-level attribute of Prop. In most cases these labels fully summarize the rule semantics, but sometimes a conditional clause restricts the review requirement of the main sentence, so the condition set CSub, CProp, CBx and CRProp is introduced to describe the conditional clause. For the remaining definitions, refer to Tables 1 and 2.
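Table 1: Entity labels
Label | Meaning
Sub | Subject under review, e.g. a component or structure
Prop | Property of Sub, e.g. thickness or cross-sectional area
Bx | Behavior between Prop and RProp, e.g. "greater than", "having", "satisfying"
RProp | Requirement placed on Prop
MSub | Parent set of Sub
RSub | Parent element of RProp
LRProp | Precondition connected to Bx
MProp | Upper-level attribute of Prop
CSub, CProp, CBx, CRProp | Condition-set counterparts of Sub, Prop, Bx and RProp
Table 2: Relation labels
Label | Meaning
Sub_Prop, Prop_Bx, Bx_RProp | The three basic relations: a certain attribute of a certain subject should meet a certain specification requirement
RSub_RProp | The specification requirement is an attribute of a certain subject
MProp_Prop | Prop is an attribute of MProp
LRProp_Bx | LRProp is a precondition qualifying Bx
MSub_Sub | MSub contains Sub
CSub_CProp, CProp_CBx, CBx_CRProp | Counterparts of the three basic relations for the condition set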
Traditional triple extraction methods define only a simple Schema framework, whereas the semantics of specification rules are complex and hard to cover with a simple Schema definition. Specification clauses are semantically rich, and in terms of content each sentence contains multiple entities; to preserve the complete semantics of a clause, several nested triples are needed to express its exact meaning. From the review perspective, this patent therefore proposes an abstract representation for the specification clauses to be reviewed: "a certain attribute of a certain subject should meet a certain specification requirement". This abstraction is summarized as 4 basic entities and 3 basic relations, with a further 8 entities and 7 relations added as modifiers or qualifiers. The method abstracts and refines the core information of a specification clause, which facilitates extraction by a deep learning model and also serves rule interpretation and knowledge graph construction.
S3, constructing a CE-CasRel model, training it on the annotated data set, and verifying it on a verification set; when the CE-CasRel model is constructed, the pre-trained model used is a BERT model optimized on a civil engineering data set. Specifically:
S31, as shown in FIG. 2, the overall idea of the CE-CasRel model is: first identify the head entity, i.e. the subject; then, on the basis of that entity, identify all tail entities (objects) under every given relation. To do so, the feature vector of the head entity is used as the condition for one Conditional Layer Normalization (CLN) of the sentence feature vectors.
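The Conditional Layer Normalization (CLN) expression is:
$$\mathrm{CLN}(y, c_\gamma, c_\beta) = (\gamma + W_1 c_\gamma) \odot \frac{y - \operatorname{mean}(y)}{\operatorname{std}(y)} + (\beta + W_2 c_\beta)$$
where y is the feature information input to the CLN structure, c_γ and c_β are the two pieces of condition information to be fused, and γ and β are the gain and bias of layer normalization.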
Conventional information fusion methods such as addition and multiplication ignore the directionality between pieces of information, but in a triple this directionality matters. The CLN expression shows that the trained weights W_1 and W_2 map the condition information into different spaces, so that the direction information is reflected.
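A minimal pure-Python sketch of CLN for a single feature vector follows. It assumes the widely used formulation in which the two conditions are projected by W_1 and W_2 and added to the layer-normalization gain γ and bias β; the function names, the `eps` term and the list-based vectors are illustrative choices, not details from the patent.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def conditional_layer_norm(y, c_gamma, c_beta, gamma, beta, w1, w2, eps=1e-12):
    """Normalize y, then scale and shift it with condition-dependent gain and bias."""
    mean = sum(y) / len(y)
    var = sum((v - mean) ** 2 for v in y) / len(y)
    std = math.sqrt(var + eps)
    gain = [g + d for g, d in zip(gamma, matvec(w1, c_gamma))]   # gamma + W_1 c_gamma
    bias = [b + d for b, d in zip(beta, matvec(w2, c_beta))]     # beta + W_2 c_beta
    return [g * (v - mean) / std + b for g, v, b in zip(gain, y, bias)]
```

With W_1 and W_2 set to zero the function reduces to ordinary layer normalization, which is why such layers are often initialized that way: the condition starts with no influence and gains it during training. In the model, the head entity's feature vector would be passed as both `c_gamma` and `c_beta`.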
The original encoder is replaced by a BERT model retrained by using a data set in the civil field at the encoder end, and the patent is called CEBERT; in order to solve the entity nesting problem at the decoding end, a stacked pointer labeling method is used; in this embodiment, the CE-CasRel model loss function is:
Figure BDA0004070831540000092
where: sentence x_j belongs to the training set D and denotes the j-th input sample;
T_j = {(s, r, o)} is the set of all triples contained in text x_j;
s ∈ T_j denotes a head entity appearing in T_j; θ = {W_start, b_start, W_end, b_end} denotes the weights W and biases b that are learned when predicting the head entity; log p_θ(s|x_j) denotes the logarithm of the probability of predicting the head entity;
r ∈ T_j|s denotes the set of relations that appear in T_j and whose head entity is s;
Figure BDA0004070831540000093
denotes the weights W and biases b that are learned when predicting tail entities and relations;
Figure BDA0004070831540000094
denotes the logarithm of the probability of predicting the tail entity;
R is the set of all relations; r ∈ R\T_j|s, the difference between R and T_j|s, denotes the other relations that do not appear in T_j for head entity s;
Figure BDA0004070831540000095
denotes that, for sentence x_j, head entity s and a relation r that does not appear in T_j, the tail entity should be identified as null;
Figure BDA0004070831540000096
denotes the logarithm of the predicted probability for tail entities that are not in the sentence and tail entities that have no corresponding head entity.
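The loss figure itself is not reproduced in this text. For orientation, the objective of the original CasRel model, which matches the symbol definitions above, has the following form. Treating the CE-CasRel loss as identical to it is an assumption, as is writing \phi_r for the relation-specific parameters and o_\varnothing for the null tail entity. In this form it is a log-likelihood to be maximized, so a training loss would be its negative.

```latex
J(\Theta) = \sum_{j=1}^{|D|} \Bigg[
    \sum_{s \in T_j} \log p_{\theta}(s \mid x_j)
  + \sum_{r \in T_j|s} \log p_{\phi_r}(o \mid s, x_j)
  + \sum_{r \in R \setminus T_j|s} \log p_{\phi_r}(o_{\varnothing} \mid s, x_j)
\Bigg]
```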
S32, the labeled training set text is taken as input and fed into the CEBERT encoder; the encoder applies multi-head attention, layer normalization and residual connections to each specification clause, extracts text features, and obtains a sentence-level vector representation.
In this embodiment, the encoder CEBERT is designed as follows: a Python crawler is used to crawl a large amount of text from civil engineering related encyclopedia entries and specification texts, and data cleaning such as splitting long texts and filtering irrelevant content is performed to build a civil engineering domain data set. A Chinese RoBERTa pre-trained model is selected, and the data set is fed into the model as the input corpus for fine-tuning; the training method and evaluation metrics follow the original RoBERTa design. The result is the model CEBERT, fine-tuned on the civil engineering domain data set.
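The cleaning step can be pictured with the minimal sketch below, which splits long text at sentence-ending punctuation and drops lines that look irrelevant. The 128-character limit, the relevance heuristic, the function names and the sample sentences are all assumptions; the patent does not specify its cleaning rules.

```python
import re

# Split after Chinese sentence-ending punctuation (zero-width, so the mark is kept).
SENTENCE_END = re.compile(r"(?<=[。；！？])")
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")


def split_long_text(text: str, max_chars: int = 128) -> list[str]:
    """Group sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for sentence in SENTENCE_END.split(text):
        sentence = sentence.strip()
        if not sentence:
            continue
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def is_relevant(line: str, min_chars: int = 5) -> bool:
    """Keep lines that are long enough and contain at least one Chinese character."""
    return len(line) >= min_chars and CHINESE_CHAR.search(line) is not None


# Made-up sample text standing in for crawled specification content.
raw = "梁的截面宽度不宜小于200mm。柱的截面尺寸应符合下列规定；矩形截面柱的边长不应小于300mm。"
corpus = [chunk for chunk in split_long_text(raw, max_chars=20) if is_relevant(chunk)]
print(corpus)
```

The resulting list of chunks is the kind of plain-text corpus that could then be passed to a masked language model fine-tuning script.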
Model pre-training is essentially a form of transfer learning: a neural network model is first trained on a large-scale data set and then further trained on a target task; the model obtained in the first stage is the pre-trained model. Because related research in the civil engineering field is scarce, no pre-trained model had previously been fine-tuned on a civil engineering domain data set. This patent fine-tunes on civil engineering domain data, which can improve the accuracy of related natural language processing tasks in the field by 5.4%.
S33, the vector representation produced by the pre-trained model CEBERT is passed to the head entity identification layer of the decoder, where a linear layer and a sigmoid activation function determine whether each token is the start or end of a head entity; the identified starts and ends are then paired to obtain the candidate head entity set. The sigmoid activation function expression is:
Figure BDA0004070831540000101
Conventional knowledge extraction models use a Softmax activation function for the final classification, but Softmax outputs are mutually exclusive, so it is better suited to multi-class problems. The data set used for knowledge extraction in this patent contains many overlapping entities, so the sigmoid function, which suits multi-label classification, is more appropriate.
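The difference can be seen in a small sketch with made-up scores for one token under three tags, two of which apply at once. The scores and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Squash one score into (0, 1) independently of the other scores."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores: list[float]) -> list[float]:
    """Turn scores into probabilities that compete and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for one token; the first two tags both apply.
scores = [3.0, 2.5, -4.0]
independent = [sigmoid(s) for s in scores]
exclusive = softmax(scores)

tags_sigmoid = [i for i, p in enumerate(independent) if p > 0.5]
tags_softmax = [i for i, p in enumerate(exclusive) if p > 0.5]
print(tags_sigmoid, tags_softmax)
```

With sigmoid both applicable tags pass the threshold, while softmax forces them to share probability mass and only one survives.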
In this embodiment, the linear layer and sigmoid activation function expressions for predicting the head entity are:
Figure BDA0004070831540000102
Figure BDA0004070831540000103
where: x_i denotes the i-th token in the sentence, which in this patent is a single Chinese character;
{W_start, b_start, W_end, b_end} are the weights W and biases b that are learned in the linear layer;
σ denotes the sigmoid activation function;
Figure BDA0004070831540000104
and
Figure BDA0004070831540000105
denote the probabilities that the i-th token is the start position and the end position of the head entity, respectively;
As shown in FIG. 2, in this embodiment the nearest matching principle means that, after all start_s and end_s positions are obtained, each start is paired with the nearest end that follows it, and the pair is treated as a complete entity.
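A minimal sketch of this pairing is given below. The 0.5 threshold, the function name and the hand-written probabilities are assumptions; in the model the probabilities would come from the linear layer and sigmoid described above.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each predicted start with the nearest predicted end at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for start in starts:
        following = [end for end in ends if end >= start]
        if following:
            spans.append((start, following[0]))  # ends is ascending, so this is the nearest
    return spans


# One token per Chinese character; two nested entities share the same end position.
tokens = list("框架梁的截面宽度")
start_probs = [0.9, 0.1, 0.8, 0.1, 0.2, 0.1, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.2]

spans = decode_spans(start_probs, end_probs)
entities = ["".join(tokens[start:end + 1]) for start, end in spans]
print(spans, entities)
```

Because start and end positions are tagged independently, the nested entities "框架梁" and "梁" are both recovered, which a single label per token could not express.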
S34, entering the relation-specific tail entity identification layer: a head entity is randomly sampled, and one CLN is applied to the sentence-level vector representation with that entity as the condition; the CLN-processed sentence vector representation is then used as input, and a linear layer and a sigmoid activation function are again used to predict the tail entity positions under the different relations.
In this embodiment, the linear layer and sigmoid activation function expression for predicting tail entities and relationships is:
Figure BDA0004070831540000111
Figure BDA0004070831540000112
wherein:
Figure BDA0004070831540000113
CLN is used to fuse the sequence encoding with the head entity encoding; since the input condition contains only one entity, the two conditions are set to the same entity.
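The relation-specific tagging can be sketched as one pair of start and end taggers per relation, all applied to the same conditioned token features. The feature values and weights below are hand-picked rather than trained, and the function name and dictionary layout are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tag_tails(features, relation_params):
    """Return, per relation, the start and end probability of every token.

    features: one feature vector per token, already conditioned on a head entity.
    relation_params: {relation: (w_start, b_start, w_end, b_end)}.
    """
    def score(weights, bias, feature):
        return sigmoid(sum(w * h for w, h in zip(weights, feature)) + bias)

    return {
        relation: {
            "start": [score(w_start, b_start, h) for h in features],
            "end": [score(w_end, b_end, h) for h in features],
        }
        for relation, (w_start, b_start, w_end, b_end) in relation_params.items()
    }


# Toy 2-dimensional features for three tokens and untrained, hand-picked weights.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
params = {
    "Sub_Prop": ([4.0, -4.0], -2.0, [4.0, -4.0], -2.0),
    "Prop_Bx": ([-4.0, -4.0], -2.0, [-4.0, -4.0], -2.0),
}
probs = tag_tails(features, params)
print(probs["Sub_Prop"]["start"])
print(probs["Prop_Bx"]["start"])
```

A relation whose probabilities all stay below the threshold yields no tail entity for this head, which corresponds to the null tail case in the loss definition.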
S35, after training of the CE-CasRel model is finished, the trained model is verified with the test set data, and the trained model weight parameters are retained once the required precision and recall are reached.
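Precision and recall over extracted triples can be computed as in the sketch below. Counting a triple as correct only when head, relation and tail all match exactly is an assumption, as are the example triples.

```python
def triple_prf(predicted, gold):
    """Precision, recall and F1 over sets of (head, relation, tail) triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical gold annotation and model output for one clause.
gold = [
    ("beam", "Sub_Prop", "width"),
    ("width", "Prop_Bx", "not less than"),
    ("not less than", "Bx_RProp", "200 mm"),
]
predicted = [
    ("beam", "Sub_Prop", "width"),
    ("width", "Prop_Bx", "not less than"),
    ("beam", "Sub_Prop", "height"),
    ("height", "Prop_Bx", "not less than"),
]
precision, recall, f1 = triple_prf(predicted, gold)
print(precision, recall, f1)
```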
Using the BERT pre-trained model as the encoder removes the need for manual feature extraction, yields a more reasonable feature representation, improves model accuracy, and saves parameter tuning time during training. The stacked pointer labeling method used in the decoder effectively handles the large number of overlapping entity relations in specification texts; applying CLN to the head entity fuses the feature vectors of the head entity and the sentence better than conventional feature addition.
S4, inputting the specification text to be extracted into the trained CE-CasRel model, extracting the triples of the civil engineering specification domain, and storing the triples in a json file;
In this embodiment, the specific steps are as follows: steps S1 and S2 are repeated to obtain more processed specification texts; the specification text to be extracted is input into the trained CE-CasRel model, and the set of triples conforming to the Schema label definition is extracted; through Python data processing, the extracted triples are stored line by line according to clause, with one line per specification clause and each clause corresponding to multiple triples; all clauses are stored in a json file. FIG. 3 is a schematic diagram of the stored structured specification.
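One way to realize "one line per clause, several triples per clause" is to write one JSON object per line, as sketched below. The field names `text` and `triples`, the file name and the sample clause are assumptions; FIG. 3 is not reproduced here, so the actual layout may differ.

```python
import json
import os
import tempfile


def save_clauses(records, path):
    """Write one JSON object per line: a clause and its list of triples."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_clauses(path):
    """Read the file back, one clause record per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up clause with its three chained triples.
records = [
    {
        "text": "梁的截面宽度不宜小于200mm。",
        "triples": [
            ["梁", "Sub_Prop", "截面宽度"],
            ["截面宽度", "Prop_Bx", "不宜小于"],
            ["不宜小于", "Bx_RProp", "200mm"],
        ],
    }
]

path = os.path.join(tempfile.mkdtemp(), "clauses.json")
save_clauses(records, path)
loaded = load_clauses(path)
print(loaded[0]["triples"])
```

Passing `ensure_ascii=False` keeps the Chinese text readable in the file instead of writing escape sequences.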
S5, reading the json file, parsing the triple data, and using the Neo4j API to build and visualize the knowledge graph in the civil engineering standardization field.
FIG. 4 shows the resulting knowledge graph in the civil engineering standardization field. In this embodiment, the specific steps are as follows:
S51, defining the nodes in the Neo4j graph database according to the entity labels in the Schema, and defining the edges in the Neo4j graph database according to the relation labels.
S52, reading the json file with the json and pandas libraries of Python, reading the triple set of each specification clause line by line; according to the defined Schema labels, the triples are extracted from the triple set and added to a Python list.
S53, connecting Python to Neo4j with the py2neo library of Python; the entities in the list from step S52 are added as Neo4j nodes, and the relations in the list from step S52 are added as Neo4j edges, completing the construction of the knowledge graph.
S54, starting the Neo4j program from the console and opening "http://localhost:7474/" in a browser, where the graph can be viewed and saved.
With py2neo, the Neo4j database can be operated from Python, which avoids using the Cypher query syntax of Neo4j; the Schema label system defined in this patent converts the relations between triples from two dimensions to one, so the triples can be used directly to construct the knowledge graph without extra conversion operations.
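Steps S52 and S53 might look roughly like the sketch below. Deriving the node labels by splitting the relation name at the underscore, the `text` and `triples` field names, the `name` property, and the connection URI and credentials are all assumptions; the py2neo part has not been run against a live database here and needs py2neo installed and Neo4j running.

```python
import json


def collect_graph(json_lines):
    """Turn clause records into a de-duplicated node list and an edge list."""
    nodes, edges = set(), []
    for line in json_lines:
        record = json.loads(line)
        for head, relation, tail in record["triples"]:
            # Assumed convention: a relation "A_B" links a node labeled A to one labeled B.
            head_label, tail_label = relation.split("_", 1)
            nodes.add((head_label, head))
            nodes.add((tail_label, tail))
            edges.append(((head_label, head), relation, (tail_label, tail)))
    return sorted(nodes), edges


def write_to_neo4j(nodes, edges, uri="bolt://localhost:7687", auth=("neo4j", "password")):
    """Create the nodes and edges in Neo4j (placeholder URI and credentials)."""
    from py2neo import Graph, Node, Relationship  # imported lazily so parsing works without it

    graph = Graph(uri, auth=auth)
    node_objects = {key: Node(key[0], name=key[1]) for key in nodes}
    for head_key, relation, tail_key in edges:
        # Reusing the same Node object keeps an entity from being created twice in one run.
        graph.create(Relationship(node_objects[head_key], relation, node_objects[tail_key]))


# Made-up input line in the one-clause-per-line layout.
sample_lines = [
    json.dumps({
        "text": "The width of the beam shall not be less than 200 mm.",
        "triples": [
            ["beam", "Sub_Prop", "width"],
            ["width", "Prop_Bx", "not less than"],
        ],
    })
]
nodes, edges = collect_graph(sample_lines)
print(nodes)
print(edges)
```

Calling `write_to_neo4j(nodes, edges)` would then populate the database; running the script twice would create duplicate nodes, so a production version would need a merge step.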
A knowledge graph construction system in the civil engineering standardization field based on a deep neural network comprises a preprocessing module, a knowledge extraction module and a construction module, wherein:
the preprocessing module is used for splitting the specification clauses into sentences and preprocessing the split sentences to obtain specification sentences suitable for conversion;
the knowledge extraction module is used for extracting the triples in the specification clauses with the deep neural network model CE-CasRel to obtain structured specification clauses stored in the form of triples;
and the construction module is used for constructing the knowledge graph, storing the structured specification clauses, and storing the triples into a graph database based on the py2neo library.
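The way the three modules fit together can be sketched as follows. Sentence splitting on periods, semicolons and colons follows the claims; the function names, the stub extractor standing in for CE-CasRel, the in-memory store and the sample clause are assumptions.

```python
import re

# Chinese full stop, plus semicolons and colons in full-width and ASCII form.
# The ASCII period is left out so decimal numbers are not split.
DELIMITERS = re.compile(r"[。；：;:]")


def split_sentences(clause: str) -> list[str]:
    """Preprocessing module: split one clause into non-empty sentences."""
    return [part.strip() for part in DELIMITERS.split(clause) if part.strip()]


def run_pipeline(clauses, extract, store):
    """Chain the modules: split each clause, extract triples, hand them to storage."""
    for clause in clauses:
        for sentence in split_sentences(clause):
            store(sentence, extract(sentence))


# Made-up clause; the extractor is a stub and the store is a plain dictionary.
clause = "框架梁应符合下列规定：截面宽度不宜小于200mm；截面高宽比不宜大于4。"
stored = {}
run_pipeline([clause], extract=lambda sentence: [], store=stored.__setitem__)
print(list(stored))
```

Passing the extractor and the store as functions keeps the preprocessing logic independent of the model and of the graph database.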
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (8)

1. A method for constructing a knowledge graph in the civil engineering standardization field, characterized by comprising the following steps:
S1, screening the sentences to be extracted from the specification text, modifying the format and sentence patterns of the screened text, and selecting sentences that conform to the Schema label definition so that they are suitable for extraction by a neural network model;
S2, taking part of the text screened in step S1 for model training, splitting the text into sentences, and labeling the training set according to the designed triple Schema labels to obtain a triple training set in the field of civil engineering specification texts;
S3, constructing a CE-CasRel model, training it with the data set labeled in step S2, and verifying it with a validation set; wherein, when the CE-CasRel model is constructed, the pre-trained model is a BERT model optimized on a civil engineering domain data set;
S4, repeating steps S1 and S2, inputting the resulting specification text to be extracted into the trained CE-CasRel model, extracting the triples of the civil engineering specification domain, and storing the triples in a json file;
S5, reading the json file, parsing the triple data, and using the Neo4j API to build and visualize the knowledge graph in the civil engineering standardization field.
2. The method for constructing a knowledge graph in the civil engineering standardization field as claimed in claim 1, wherein the specific process of step S2 is as follows:
S21, splitting the text into sentences with a regular expression and collecting the split sentences in the same folder to obtain the data set to be labeled; wherein the regular expression treats periods, semicolons and colons as sentence delimiters;
S22, dividing the data set into a training set and a test set according to a preset proportion, and labeling the training set with labeling software to obtain a triple training set for training knowledge extraction;
S23, in order to extract triples and construct the knowledge graph, the labels of the triple training set are divided into entity labels and relation labels according to the new Schema label definition rule; the entity labels MSub, Sub and Prop represent the objects to be examined in drawing review, and the relation labels Sub_Prop, Prop_Bx and Bx_RProp are the three basic relations of a specification clause, expressing that a certain property of a certain subject shall meet a certain specification requirement.
3. The method for constructing a knowledge graph in the civil engineering standardization field as claimed in claim 2, wherein the specific process of step S3 is as follows:
S31, constructing a CE-CasRel model, in which, at the encoder end, the original encoder is replaced by a BERT model retrained on a civil engineering domain data set, called CEBERT; at the decoder end, a stacked pointer labeling method is used to solve the entity nesting problem; and, before the tail entity of a triple is predicted, Conditional Layer Normalization (CLN) is performed with the feature vector of the head entity as the condition;
S32, taking the labeled training set text as input and feeding it into the CEBERT encoder; the encoder CEBERT applies multi-head attention, layer normalization and residual connections to each specification clause, extracts text features, and obtains a sentence-level vector representation;
S33, passing the vector representation produced by the pre-trained model CEBERT to the head entity identification layer of the decoder, and determining with a linear layer and a sigmoid activation function whether each token is the start or end of a head entity; then pairing the identified starts and ends to obtain the candidate head entity set;
S34, randomly sampling a head entity and applying one CLN to the sentence-level vector representation with that entity as the condition; the CLN-processed sentence vector representation is used as input, and a linear layer and a sigmoid activation function are again used to predict the tail entity positions under the different relations;
S35, after training of the CE-CasRel model is finished, verifying the trained model with the test set data, and retaining the trained model weight parameters once the required precision and recall are reached.
4. The method for constructing a knowledge graph in the civil engineering standardization field according to claim 3, wherein the CEBERT encoder described in step S32 is obtained as follows: a Python crawler is used to crawl a large amount of text from civil engineering related encyclopedia entries and specification texts, and data cleaning such as splitting long texts and filtering irrelevant content is performed to build a civil engineering domain data set; a Chinese RoBERTa pre-trained model is selected, and the data set is fed into the model as the input corpus for fine-tuning; the training method and evaluation metrics follow the original RoBERTa design; the result is the model CEBERT, fine-tuned on the civil engineering domain data set.
5. The method for constructing a knowledge graph in the civil engineering standardization field as claimed in claim 4, wherein the CLN expression in step S31 is:
Figure FDA0004070831530000021
where y is the feature information input to the CLN structure, and c_γ and c_β are the two pieces of input condition information to be fused;
the sigmoid activation function expression in step S33 is:
Figure FDA0004070831530000022
6. The method for constructing a knowledge graph in the civil engineering standardization field as claimed in claim 5, wherein the specific process of step S4 is as follows: repeating steps S1 and S2, inputting the resulting specification text to be extracted into the trained CE-CasRel model, and extracting the set of triples conforming to the Schema label definition in step S23; through Python data processing, storing the extracted triples line by line according to clause, with one line per specification clause and each clause corresponding to multiple triples; all clauses are stored in a json file.
7. The method for constructing a knowledge graph in the civil engineering standardization field as claimed in claim 6, wherein the specific steps of step S5 are as follows:
S51, defining the nodes in the Neo4j graph database according to the entity labels in the Schema of step S23, and defining the edges in the Neo4j graph database according to the relation labels;
S52, reading the json file with the json and pandas libraries of Python, reading the triple set of each specification clause line by line; extracting the triples from the triple set according to the defined Schema labels and adding them to a Python list;
S53, connecting Python to Neo4j with the py2neo library of Python; adding the entities in the list from step S52 as Neo4j nodes, and adding the relations in the list from step S52 as Neo4j edges, to complete the construction of the knowledge graph;
S54, starting the Neo4j program from the console, and viewing and saving the graph in a browser.
8. A knowledge graph construction system based on the method for constructing a knowledge graph in the civil engineering standardization field according to any one of claims 1 to 7, characterized by comprising a preprocessing module, a knowledge extraction module and a construction module, wherein:
the preprocessing module is used for splitting the specification clauses into sentences and preprocessing the split sentences to obtain specification sentences suitable for conversion;
the knowledge extraction module is used for extracting the triples in the specification clauses with the deep neural network model CE-CasRel to obtain structured specification clauses stored in the form of triples;
and the construction module is used for constructing the knowledge graph, storing the structured specification clauses, and storing the triples into a graph database based on the py2neo library.
CN202310092861.6A 2023-02-07 2023-02-07 Knowledge graph construction method and knowledge graph construction system in civil engineering standardization field Pending CN116050408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310092861.6A CN116050408A (en) 2023-02-07 2023-02-07 Knowledge graph construction method and knowledge graph construction system in civil engineering standardization field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310092861.6A CN116050408A (en) 2023-02-07 2023-02-07 Knowledge graph construction method and knowledge graph construction system in civil engineering standardization field

Publications (1)

Publication Number Publication Date
CN116050408A true CN116050408A (en) 2023-05-02

Family

ID=86125430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310092861.6A Pending CN116050408A (en) 2023-02-07 2023-02-07 Knowledge graph construction method and knowledge graph construction system in civil engineering standardization field

Country Status (1)

Country Link
CN (1) CN116050408A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725223A (en) * 2023-11-20 2024-03-19 中国科学院成都文献情报中心 Knowledge discovery-oriented scientific experiment knowledge graph construction method and system
CN118036733A (en) * 2024-04-11 2024-05-14 浙江建木智能系统有限公司 Knowledge graph construction method, system and medium for ship test training

Similar Documents

Publication Publication Date Title
CN112199511B (en) Cross-language multi-source vertical domain knowledge graph construction method
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN112732934B (en) Power grid equipment word segmentation dictionary and fault case library construction method
CN111552821B (en) Legal intention searching method, legal intention searching device and electronic equipment
CN110990590A (en) Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning
CN116050408A (en) Knowledge graph construction method and knowledge graph construction system in civil engineering standardization field
CN113157859B (en) Event detection method based on upper concept information
Hu et al. Considering optimization of English grammar error correction based on neural network
CN113254507B (en) Intelligent construction and inventory method for data asset directory
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN113377916B (en) Extraction method of main relations in multiple relations facing legal text
CN114443855A (en) Knowledge graph cross-language alignment method based on graph representation learning
CN114817454A (en) NLP knowledge graph construction method combining information content and BERT-BilSTM-CRF
CN114238653A (en) Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education
CN117574858A (en) Automatic generation method of class case retrieval report based on large language model
CN111104492B (en) Civil aviation field automatic question and answer method based on layering Attention mechanism
CN117094390A (en) Knowledge graph construction and intelligent search method oriented to ocean engineering field
CN116258204A (en) Industrial safety production violation punishment management method and system based on knowledge graph
Hirayama et al. Development of template-free form recognition system
CN115934966A (en) Automatic labeling method based on remote sensing image recommendation information
CN112749278B (en) Classification method for building engineering change instructions
Zheng Individualized Recommendation Method of Multimedia Network Teaching Resources Based on Classification Algorithm in a Smart University
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination