CN111694966B

CN111694966B - Chemical industry field oriented multi-level knowledge graph construction method and system

Info

Publication number: CN111694966B
Application number: CN202010523776.7A
Authority: CN
Inventors: 孙涛; 王�琦; 翟娇娇
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2020-06-10
Filing date: 2020-06-10
Publication date: 2023-07-21
Anticipated expiration: 2040-06-10
Also published as: CN111694966A

Abstract

The invention discloses a method and a system for constructing a multi-level knowledge graph oriented to the chemical industry field. The method comprises the following steps: acquiring data from the different levels that influence the production state of a chemical process; performing relation extraction on the acquired data to obtain triplet data; constructing single-level knowledge graphs from the extracted triplet data; integrating the single-level knowledge graphs to obtain a multi-level knowledge graph; performing a completion operation on the multi-level knowledge graph; and performing quality assessment on the multi-level knowledge graph. If the quality assessment is passed, the current multi-level knowledge graph is determined to be a qualified knowledge graph; otherwise, the method returns to the step of acquiring the data of the different levels of the chemical process.

Description

Chemical industry field oriented multi-level knowledge graph construction method and system

Technical Field

The disclosure relates to the technical field of knowledge graph construction, in particular to a method and a system for constructing a multi-level knowledge graph oriented to the chemical field.

Background

The statements in this section merely mention background art related to the present disclosure and do not necessarily constitute prior art.

Technological development has brought tremendous advances to industrial production and daily life, and complex equipment systems have emerged as a representative product of that development. As industrial technology has evolved, complex equipment systems have been applied in industry, giving rise to today's complex industrial processes. Complex industrial processes span many industrial fields, one of which is the chemical industry. A chemical process has the following characteristics: large scale, complex structure, complex business logic, strong coupling between production units, numerous factors affecting the production process, and the like.

Due to the complexity of the chemical process itself, the existing fault properties are characterized as follows:

(1) Complexity: the reasons and symptoms of the faults are no longer in one-to-one correspondence due to the extremely strong coupling between the production units of the chemical process. One-to-many, many-to-one, or many-to-many situations now occur.

(2) Transmissibility: a fault in a tiny component may be accompanied by faults in related components on the same path, so that the fault propagates laterally. Because the fault can spread widely, its cause is also hard to pin down.

(3) Multiple failure concurrence: due to its complexity and transmissibility, multiple failure concurrency is unavoidable.

(4) Time delay: a tiny component may fail, and other components may then fail through propagation. However, when the first component fails, the chemical system may not yet show any abnormality; only over time does the component fault progress from quantitative change to qualitative change and cause the system to fail.

(5) Layering: influencing factors at different levels of the chemical process affect the production state, such as production process data and process flows. In chemical systems, tail gas is often recycled and heat often has to be exchanged between cold and hot devices. As process complexity increases, material quality and supply problems can also affect the production state.

However, traditional fault diagnosis technology analyzes the chemical process only at the single level of production process data. It considers neither the intricate associations among the factors influencing the chemical process nor the changes in fault properties. As a result, incomplete analysis and inaccurate diagnosis are unavoidable.

Disclosure of Invention

To remedy the deficiencies of the prior art, the present disclosure provides a method and a system for constructing a multi-level knowledge graph oriented to the chemical industry field. Traditional fault diagnosis technology analyzes the chemical process only at the single level of production process data and considers neither the intricate associations among the factors influencing the chemical process nor the changes in fault properties; to address these deficiencies, a method for automatically constructing a multi-level knowledge graph of the chemical process is provided. In follow-up work, the multi-level knowledge graph serves as a knowledge base that covers information comprehensively and expresses these intricate relationships, providing strong data support for fault reasoning so that the accuracy of fault diagnosis can be improved.

In a first aspect, the present disclosure provides a method for constructing a multi-level knowledge graph for a chemical industry field;

the method for constructing the multi-level knowledge graph oriented to the chemical industry field comprises the following steps:

acquiring data from the different levels that influence the production state of a chemical process;

performing relation extraction on the acquired data to obtain triplet data;

constructing single-level knowledge graphs from the extracted triplet data;

and integrating the single-level knowledge graphs to obtain a multi-level knowledge graph.

In a second aspect, the present disclosure provides a chemical industry field oriented multi-level knowledge graph construction system;

The chemical industry field oriented multi-level knowledge graph construction system comprises:

an acquisition module configured to: acquire data from the different levels that influence the production state of a chemical process;

an extraction module configured to: perform relation extraction on the acquired data to obtain triplet data;

a build module configured to: construct single-level knowledge graphs from the extracted triplet data;

an integration module configured to: integrate the single-level knowledge graphs to obtain a multi-level knowledge graph.

In a third aspect, the present disclosure also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first aspect.

In a fourth aspect, the present disclosure also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.

In a fifth aspect, the present disclosure also provides a computer program (product) comprising a computer program for implementing the method of any one of the preceding aspects when run on one or more processors.

Compared with the prior art, the beneficial effects of the present disclosure are:

The multi-level knowledge graph takes the different levels of the chemical process into account and expresses the coupling between the factors influencing the chemical process in the form of triples. This form can express the complexity, transmissibility and multi-fault concurrency of faults. At the same time, abnormal parts of the system can be found promptly through changes in the state of the multi-level knowledge graph, so that an abnormality can be detected before the fault undergoes qualitative change. Compared with a conventional knowledge graph, the multi-level knowledge graph is richer in content and covers knowledge more comprehensively, and it can provide strong data support for fault diagnosis.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.

FIG. 1 is a schematic diagram of a chemical process multi-level knowledge graph according to an embodiment of the disclosure;

FIG. 2 is a schematic diagram of an automatic acquisition data program framework according to a first embodiment of the present disclosure;

FIG. 3 is a flowchart of a multi-level knowledge graph automation construction for a chemical process according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of the structure of the ProjE model considering semantic information according to the first embodiment of the present disclosure.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, such as, for example, processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.

Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.

Example 1

The embodiment provides a multi-level knowledge graph construction method oriented to the chemical field;

S101: acquiring data from the different levels that influence the production state of a chemical process;

S102: performing relation extraction on the acquired data to obtain triplet data;

S103: constructing single-level knowledge graphs from the extracted triplet data;

S104: integrating the single-level knowledge graphs to obtain a multi-level knowledge graph.

As one or more embodiments, the method further comprises:

S105: performing a completion operation on the multi-level knowledge graph.

As one or more embodiments, the method further comprises:

S106: performing quality assessment on the multi-level knowledge graph; if the quality assessment is passed, determining that the current multi-level knowledge graph is a qualified knowledge graph; otherwise, returning to the step of acquiring the data of the different levels of the chemical process.
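The control flow of steps S101 to S106 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the four callables, the `quality_threshold` parameter and the `max_rounds` safety limit are hypothetical names introduced here, and the merging step simply pools triples so that identical entity names act as the links.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_until_qualified(
    acquire: Callable[[], Dict[str, list]],
    extract: Callable[[Dict[str, list]], Dict[str, List[Triple]]],
    complete: Callable[[List[Triple]], List[Triple]],
    assess: Callable[[List[Triple]], float],
    quality_threshold: float,
    max_rounds: int = 5,  # assumption: the source does not bound the loop
) -> Tuple[List[Triple], int]:
    """Repeat S101-S106 until the assessed quality reaches the threshold."""
    for round_no in range(1, max_rounds + 1):
        raw_by_level = acquire()                  # S101: data per level
        triples_by_level = extract(raw_by_level)  # S102: triples per level
        # S103/S104: pooling the triples lets identical entity names link
        # them within a level and across levels.
        graph = sorted({t for ts in triples_by_level.values() for t in ts})
        graph = complete(graph)                   # S105: completion
        if assess(graph) >= quality_threshold:    # S106: quality assessment
            return graph, round_no
    raise RuntimeError("no qualified knowledge graph within max_rounds")
```

Passing the stages in as callables keeps the sketch independent of how each stage is actually realized.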

As one or more embodiments, in S101, data of different layers in the chemical process is obtained; the method comprises the following specific steps:

acquiring data of a production process data layer, a process flow layer, a material layer and an equipment parameter layer;

Further, the data of the production process data layer comprises data collected during the production process of the chemical equipment, including reactor pressure measurements, separator temperature measurements, stripper liquid level measurements, and the like.

Further, the data of the process flow layer reflects the associations between production variables that arise from the order in which equipment is brought into use during production. For example, in the chemical process, the materials are fed and then fully reacted in a reactor, after which the reaction products undergo gas-liquid separation in a separator. This progressive relationship between pieces of equipment gives the production parameters a corresponding progressive relationship, and a progressive association likewise exists between the materials and the reactor; these associations are the data of the process flow layer.

Further, the data of the material layer comprises the raw material parameters involved in the chemical process, including which raw materials are used, the quality of the raw materials, the amount of the raw materials, and the like.

Further, the data of the equipment parameter layer comprises the parameters of the production equipment of the chemical process, such as equipment material, service life, model, and the like.

It should be understood that the data of the different layers of the chemical process obtained in S101 includes deterministic data and uncertain data. Deterministic data refers to original data that is known and correct and that affects the production state; uncertain data refers to information obtained from multiple data sources that is not guaranteed to be fully correct.

As one or more embodiments, in S102, relationship extraction is performed on the obtained data to obtain triplet data; the method comprises the following specific steps:

Relation extraction is performed on the acquired data according to the nature of the data to obtain triplet data: for acquired text data, triplet data is extracted using dependency syntax analysis; for acquired numerical or tabular data, the Pearson correlation coefficient method is used to find the correlations between variables, from which triplet data is extracted.

Further, the nature of the data itself includes text and numerical values or tables. Text refers to the literal information obtained at the process flow level, the material level and the equipment parameter level, from which relations must be extracted. For example, for the sentence "the stripper uses low-pressure saturated steam humidified with soft water, which enters from the bottom of the tower", taken from a styrene separation chemical process, the described relation (stripper, humidification, low-pressure saturated steam) needs to be extracted. Numerical values or tables refer to information with numerical characteristics at the production process data level, the material level and the equipment parameter level; for example, whether the reactor pressure measurement and the separator pressure measurement are correlated during production is a numerical relation that needs to be extracted.
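For the numerical case, the Pearson-based extraction can be sketched as below. The sample readings, the 0.8 cut-off, the relation label "influence" and the function names are illustrative assumptions, not values from the disclosure. Because the Pearson coefficient is symmetric, the head/tail order chosen here (alphabetical) carries no causal meaning.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (std_x * std_y)


def correlation_triples(
    series: Dict[str, Sequence[float]],
    threshold: float = 0.8,       # assumed cut-off
    relation: str = "influence",  # assumed relation label
) -> List[Tuple[str, str, str]]:
    """Emit one triple per variable pair whose |r| reaches the threshold."""
    triples = []
    for a, b in combinations(sorted(series), 2):
        if abs(pearson(series[a], series[b])) >= threshold:
            triples.append((a, relation, b))
    return triples


# Made-up readings for illustration only.
readings = {
    "reactor pressure value": [2700, 2705, 2710, 2720, 2730],
    "separator pressure value": [2630, 2634, 2641, 2650, 2661],
    "stripper liquid level value": [50, 48, 51, 49, 50],
}
numeric_triples = correlation_triples(readings)
```

With these readings only the two pressure series are strongly correlated, so a single triple linking them is produced.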

Illustratively, for text relation extraction, known deterministic knowledge is used to build a dependency syntax dictionary that marks the syntactic collocation relations of each entity. The syntactic collocation relations mainly include: subject-predicate relation, verb-complement structure, adverbial-verb structure, preposition-object structure, fronted object, verb-object relation, adverbial-head structure, and the like.

In practice, the dependency syntax dictionary works on split sentences and describes the collocation and dependency relations among words. The HanLP tool (a natural language analysis tool) is then used for word segmentation, syntactic analysis is performed on each segmented word, the dictionary is traversed, and triples are extracted according to the syntactic collocation relations marked in the dictionary.

For example, for the sentence "the flow regulator controls the steam flow entering the stripper" in the styrene separation chemical process, the word segmentation result is [flow regulator/nr, control/v, stripper steam flow/ns], and the syntactic analysis result is: flow regulator: {}, control: {subject-predicate relation = [flow regulator], verb-object relation = [stripper steam flow]}, stripper steam flow: {}. The dictionary is traversed for each segmented word. The words "flow regulator" and "stripper steam flow" are not contained in the relation dictionary, but "control" has a subject-predicate relation and a verb-object relation in the relation list and is a verb, so a subject-predicate-object structure can be identified. The flow regulator is taken from the subject-predicate relation as the head entity, the stripper steam flow is taken from the verb-object relation as the tail entity, and "control" is taken as the relation, so the triple (flow regulator, control, stripper steam flow) is extracted. After the triples are extracted, they are integrated through entity alignment, so that a knowledge graph of the text data can be constructed.
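The traversal in this example can be sketched as follows. The sketch starts from an already parsed sentence and does not call HanLP; the dictionary layout, the relation keys `subject-predicate` and `verb-object`, and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

SUBJECT_PREDICATE = "subject-predicate"  # assumed key for the subject relation
VERB_OBJECT = "verb-object"              # assumed key for the object relation


def extract_svo_triples(
    parse: Dict[str, Dict[str, List[str]]],
    pos_tags: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (head, relation, tail) for every verb that has both a subject
    and an object among its dependents."""
    triples = []
    for word, dependents in parse.items():
        if not pos_tags.get(word, "").startswith("v"):
            continue  # only verbs can act as the relation
        for head in dependents.get(SUBJECT_PREDICATE, []):
            for tail in dependents.get(VERB_OBJECT, []):
                triples.append((head, word, tail))
    return triples


# Segmentation and parse result of the example sentence, written by hand.
pos_tags = {"flow regulator": "nr", "control": "v", "stripper steam flow": "ns"}
parse = {
    "flow regulator": {},
    "control": {
        SUBJECT_PREDICATE: ["flow regulator"],
        VERB_OBJECT: ["stripper steam flow"],
    },
    "stripper steam flow": {},
}
text_triples = extract_svo_triples(parse, pos_tags)
```

Only "control" is a verb with both kinds of dependents, so it yields the single triple (flow regulator, control, stripper steam flow).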

It should be appreciated that the triplet data is in the form of: < h, r, t > where h is the head entity, t is the tail entity, and r is the relationship between the two entities.

As one or more embodiments, in S103, a single-level knowledge graph is constructed from the extracted triplet data; the method comprises the following specific steps:

The entities of the extracted triplet data are aligned and all triples are linked to construct a single-level deterministic knowledge graph.

Further, the specific steps of constructing a single-level deterministic knowledge graph from the extracted triplet data are as follows. The entities in the chemical industry field include chemical equipment such as condensers, separators and flow regulators, and chemical equipment parameters such as molar content and pressure. The entities in each triple are aligned directly, without entity recognition.

For example, the triples <stripper, influence, steam flow> and <stripper, influence, separator> both contain the entity "stripper", so the entity "stripper" in the two triples can be aligned to link them. The other triples are aligned in the same way, so that all triples are connected and a single-level deterministic knowledge graph is constructed.

Specifically, the single-level deterministic knowledge graph refers to a network-structured knowledge graph obtained by extracting independent triples from the acquired original information that is known and correct and that affects the production state, and then linking these originally scattered triples through entity alignment.

For example, for the styrene separation chemical process, the production process data level contains the two independent triples <reactor pressure value, influence, separator pressure value> and <reactor pressure value, influence, compressor power value>. The two triples share the entity "reactor pressure value"; once this common entity is aligned, the two triples are linked, and the separator pressure value and the compressor power value are connected through the reactor pressure value. Aligning the entities of all triples on a single level in this way constructs a single-level knowledge graph.
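Entity alignment by identical name can be sketched with a plain adjacency map, as below. The function names are hypothetical, and exact string equality is assumed as the alignment criterion, in line with the statement that entities are aligned directly without entity recognition.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def align_entities(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Merge triples into one undirected adjacency map; identical entity
    names collapse into a single node."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for head, _relation, tail in triples:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return dict(adjacency)


def is_linked(adjacency: Dict[str, Set[str]], start: str, goal: str) -> bool:
    """Breadth-first search: can goal be reached from start?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


single_level = align_entities([
    ("reactor pressure value", "influence", "separator pressure value"),
    ("reactor pressure value", "influence", "compressor power value"),
])
```

After alignment, the separator pressure value and the compressor power value are reachable from each other through the shared node "reactor pressure value".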

As one or more embodiments, the method further comprises:

S103-4: multi-source data fusion: the acquired uncertain knowledge is fused using a multi-source data fusion algorithm; knowledge whose reliability is higher than a set threshold is merged into the single-level deterministic knowledge graph and knowledge whose reliability is lower than the set threshold is discarded, yielding a supplemented single-level knowledge graph.

Further, the specific steps of fusion by using the multi-source data fusion algorithm comprise:

S103-41: aggregating data from different sources into blocks on the basis of the entity keywords of each layer, and taking the data as candidate matching knowledge;

S103-42: matching the candidate matching knowledge in the same block against the knowledge of the original knowledge graph by using the multi-source data fusion coefficient W; if W is larger than a set threshold, the candidate matching knowledge is considered correct and is added to the knowledge graph.

The multisource data fusion coefficients W are defined as follows:

W is made up of two parts: confidence, which is the confidence score, and a second part, which is the average of the entity similarity and the relation similarity. confidence in turn consists of two parts, Q and cf. Q is the confidence of the data source; for relatively authoritative websites or knowledge bases such as Baidu Baike and CNKI, the Q value is relatively high. cf is a confidence calculated for each pair of entities based on the distance between the two entities and the distance between the entity and the relation in the sentence.

The confidence formula is based on dependency syntax analysis, which relies on the dependency relations among sentence components. After the sentence is segmented, the entities and relations are identified, and the positions of the words, relations and entities are numbered in sequence from right to left as 0, 1, 2, and so on. In the formula, L denotes an entity position and R denotes a relation position. $L_i - L_j$ represents the distance between entity 1 and entity 2; $L_i - R$ represents the distance between entity 1 and the relation. The greater the distance, the less likely it is that a semantic relation exists between the entity and the relation, and the lower the confidence. The latter part of the formula calculates the similarity between the candidate matching entity pair and the knowledge in the knowledge base.

entity_sim is the entity similarity and relation_sim is the relation similarity. The average of the entity similarity and the relation similarity is taken as the similarity of the knowledge; if this similarity is larger than the set threshold of 0.5, the knowledge is considered reliable.

The entity_sim calculation method comprises the following steps:

First, word segmentation is performed on the text, the text is modeled with word vectors obtained by word2vec, and the cosine of the angle between the two text vectors is calculated as the cosine similarity to measure their similarity.
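A minimal sketch of this calculation is given below. It assumes that a text vector is the average of its word vectors (the source does not say how word vectors are combined) and uses hand-made 3-dimensional vectors in place of real word2vec output; all names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def text_vector(tokens: Sequence[str],
                word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is empty or zero."""
    if not u or not v:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def entity_sim(tokens_a: Sequence[str], tokens_b: Sequence[str],
               word_vectors: Dict[str, List[float]]) -> float:
    return cosine_similarity(text_vector(tokens_a, word_vectors),
                             text_vector(tokens_b, word_vectors))


# Toy vectors standing in for word2vec embeddings.
toy_vectors = {
    "reactor": [1.0, 0.0, 0.0],
    "pressure": [0.0, 1.0, 0.0],
    "separator": [0.0, 0.0, 1.0],
}
```

In practice the `word_vectors` mapping would come from a trained word2vec model over the segmented corpus.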

The relation_sim calculation method comprises the following steps:

According to the relation in the candidate matching entity pair, the knowledge base of the same block is traversed with the entity as the center to check whether it contains a relation that is highly similar to the relation in the candidate matching entity pair.

If not, the whole knowledge base is traversed to check whether such a relation exists; if it still does not, relation_sim is set to 0.

If such a relation exists, the distance L from the entity to the matching relation in the knowledge base is calculated, with the distance increasing by 1 for every intervening triple, and relation_sim is 1/L.
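One possible reading of this rule is sketched below. It assumes that a relation on a triple touching the entity has distance L = 1, that each further triple on the path adds 1, and that "similar relation" means an exact label match; it also searches a single triple list rather than first the block and then the whole knowledge base. These choices and the function name are assumptions, not details from the disclosure.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def relation_sim(entity: str, relation: str, triples: Iterable[Triple]) -> float:
    """Return 1/L for the nearest triple carrying the relation, else 0.0."""
    # Index each entity to the (relation, other entity) pairs it takes part in.
    incident: Dict[str, List[Tuple[str, str]]] = {}
    for head, rel, tail in triples:
        incident.setdefault(head, []).append((rel, tail))
        incident.setdefault(tail, []).append((rel, head))

    seen = {entity}
    queue = deque([(entity, 1)])  # (node, distance of triples touching it)
    while queue:
        node, distance = queue.popleft()
        for rel, other in incident.get(node, []):
            if rel == relation:
                return 1.0 / distance
            if other not in seen:
                seen.add(other)
                queue.append((other, distance + 1))
    return 0.0  # no matching relation anywhere in the knowledge base


knowledge_base = [
    ("reactor", "feeds", "separator"),
    ("separator", "controls", "valve"),
]
```

Because the search is breadth-first, the first match found is also the closest one, so the returned value is the largest 1/L available.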

After the multi-source data fusion model is adopted, knowledge with the reliability higher than the set threshold value is fused into the knowledge graph, and knowledge with the reliability lower than the set threshold value is abandoned.
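The blocking and thresholding steps (S103-41 and S103-42) can be sketched as follows. The fusion coefficient W is passed in as a callable because its formula is not reproduced in this text; the keyword sets, the scores and the function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def block_by_keywords(candidates: Iterable[Triple],
                      level_keywords: Dict[str, Set[str]]) -> Dict[str, List[Triple]]:
    """S103-41: group candidate triples by the level whose entity keywords
    they mention."""
    blocks: Dict[str, List[Triple]] = {level: [] for level in level_keywords}
    for triple in candidates:
        head, _relation, tail = triple
        for level, keywords in level_keywords.items():
            if head in keywords or tail in keywords:
                blocks[level].append(triple)
    return blocks


def fuse_block(graph_triples: Iterable[Triple], block: Iterable[Triple],
               fusion_coefficient: Callable[[Triple], float],
               threshold: float) -> List[Triple]:
    """S103-42: keep candidates whose W exceeds the threshold, drop the rest."""
    fused = list(graph_triples)
    for triple in block:
        if triple not in fused and fusion_coefficient(triple) > threshold:
            fused.append(triple)
    return fused


# Illustrative data: the W values are invented, not computed.
existing = [("reactor pressure value", "influence", "separator pressure value")]
candidates = [
    ("reactor pressure value", "influence", "compressor power value"),
    ("reactor pressure value", "influence", "raw material colour"),
]
assumed_w = {candidates[0]: 0.9, candidates[1]: 0.2}
blocks = block_by_keywords(candidates, {"production process data": {"reactor pressure value"}})
fused = fuse_block(existing, blocks["production process data"], assumed_w.get, threshold=0.5)
```

Blocking first means each candidate is only compared with knowledge of its own level, which is the complexity saving the method relies on.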

It should be appreciated that, as in S103-41, knowledge extraction in the relation extraction stage is centered on the entity keywords. Therefore, during fusion the data can be fused with the data of its own layer, which avoids traversing the whole knowledge base and reduces the computational complexity.

It should be appreciated that multi-source data fusion is needed because the accuracy of the acquired uncertain data cannot be guaranteed, and once incorrect data is added to the constructed knowledge graph, fault diagnosis errors can result. In addition, the knowledge graph is intended to support adaptive learning; once incorrect data is added, the knowledge graph would turn into an incorrect knowledge base as adaptive learning proceeds, and the accuracy of fault diagnosis could not be guaranteed. Therefore, in order to use the uncertain information, the collected multi-source data must be evaluated with the multi-source data fusion model W, and data can be added to the knowledge graph only if it is judged to be true and credible.

As one or more embodiments, in S104, integrating the single-level knowledge graph to obtain a multi-level knowledge graph; the method comprises the following specific steps:

The single-level knowledge graphs are integrated by means of entity alignment to obtain a multi-level knowledge graph. For example, for the styrene separation chemical process, there is a triple 1 <reactor pressure value, influence, compressor power value> at the production process data level, a triple 2 <reactor pressure value, influence, reactor level value> at the process flow level, and a triple 3 <reactor, influence, reactor pressure value> at the equipment parameter level. The entity "reactor pressure value" is present in all three triples, so the three independent triples are linked through it. This is illustrated in FIG. 1, where two nodes connected by a dashed line are the same entity even though they lie at different levels, as is the case for the reactor pressure value, which appears at three levels. By connecting the same entities across triples, the knowledge graphs of the individual levels are integrated into a multi-level knowledge graph.
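The cross-level linking in this example can be sketched as below. The level names follow the text, while the function name and the returned structures are assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def integrate_levels(levels: Dict[str, List[Triple]]):
    """Merge per-level triples and report the entities that occur on more
    than one level (the dashed links between levels)."""
    entity_levels: Dict[str, Set[str]] = {}
    merged: List[Tuple[str, str, str, str]] = []
    for level, triples in levels.items():
        for head, relation, tail in triples:
            merged.append((level, head, relation, tail))
            entity_levels.setdefault(head, set()).add(level)
            entity_levels.setdefault(tail, set()).add(level)
    bridges = {entity: sorted(found)
               for entity, found in entity_levels.items() if len(found) > 1}
    return merged, bridges


styrene_levels = {
    "production process data": [
        ("reactor pressure value", "influence", "compressor power value")],
    "process flow": [
        ("reactor pressure value", "influence", "reactor level value")],
    "equipment parameter": [
        ("reactor", "influence", "reactor pressure value")],
}
merged, bridges = integrate_levels(styrene_levels)
```

Here `bridges` contains only "reactor pressure value", which appears on all three levels and therefore ties the three single-level graphs together.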

As one or more embodiments, in S105, a completion operation is performed on the multi-level knowledge graph; the method comprises the following specific steps: knowledge graph completion is performed by the ProjE model considering semantic information.

Scoring function of the ProjE model considering semantic information:

Here h(e, r) denotes the scoring function, i denotes the i-th entity in the set of entities to be scored, and $h(e, r)_i$ denotes the score of the i-th entity in that set. W denotes the $s \times k$ matrix formed by the entities to be scored, s denotes the number of entities to be scored, and $b_p$ denotes the bias vector. The inner product of the vector of the i-th entity to be scored and e represents the similarity of the two entities themselves, where e denotes the entity h in the original triple <h, r, ?>. The larger this inner product, the smaller the distance between the two vectors, meaning that they are semantically closer, that is, the more similar the two entities are. The inner product of the neighborhood context vectors of the two entities represents the similarity of the semantic information of their neighbor nodes.

It will be appreciated that the ProjE model has a parameter scale of only $n_e k + n_r k + 5k$ and offers good computation speed and prediction capability, but it has its own shortcomings. In fact, knowledge graphs are so widely applicable not only because they can express information in diverse forms but also because they contain rich semantic information. The ProjE model focuses only on the relationship between the candidate entities and $e \oplus r$, without exploiting the rich semantic information in the knowledge graph. Rich semantic information is precisely the advantage of the multi-level knowledge graph, so a ProjE model considering semantic information is proposed, which integrates this semantic information into the ProjE model so that the link prediction task can be completed well.

The task of the ProjE model considering semantic information is to predict the missing entity in the triple <h, r, ?>. The possible entities form a set, the set of entities to be scored. The scoring function is calculated for each entity to be scored, and the entity with the highest score is taken as the correct entity.

As can be seen from the formula, in addition to the original ProjE term, the other two terms incorporate semantic information: the similarity between the entity h and the entity to be scored, and the similarity between the neighbor nodes of the entity h and the neighbor nodes of the entity to be scored.

Meanwhile, an aggregator function is designed to learn entity neighborhood context information by aggregating relevant embedded vectors.

Here N(e) is the vector representation of the aggregated context information of entity e, n(e) is the set of components in the context information of entity e, and Mean is the aggregator function. The aggregator function can take many forms, such as mean, max and pooling; in the improved ProjE model, the Mean function was chosen empirically as the aggregator function.

n(e) is obtained as follows: given a triple <h, r, t>, the neighborhood context of entity h consists of the nodes located near h other than t. That is, it is the nodes surrounding h in the knowledge graph that take part, and through h they locally influence the triple <h, r, t>. Since an entity may have a large number of neighbor nodes, a random walk method is used as a preprocessing step to collect a neighborhood set for each entity.

Specifically, given a node h, k rounds of random walks of length l are run, and n(e) is created by adding all the distinct nodes visited in those walks. After n(e) has been created, N(e) is obtained by means of the aggregator function Mean.
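The neighborhood collection and the Mean aggregator can be sketched as follows. The sketch assumes uniform choice among neighbors, treats visited nodes as a set, and excludes the tail entity t from the walk; the parameter names `rounds` and `length` correspond to k and l in the text, and the embeddings are toy values.

```python
import random
from typing import Dict, Iterable, List, Optional, Set


def collect_neighborhood(adjacency: Dict[str, List[str]], start: str,
                         rounds: int, length: int,
                         exclude: Iterable[str] = (),
                         rng: Optional[random.Random] = None) -> Set[str]:
    """Run `rounds` random walks of `length` steps from `start` and return
    the set of nodes visited, leaving out `start` and excluded nodes."""
    rng = rng or random.Random(0)
    excluded = set(exclude)
    visited: Set[str] = set()
    for _ in range(rounds):
        node = start
        for _ in range(length):
            options = [n for n in adjacency.get(node, []) if n not in excluded]
            if not options:
                break  # dead end: stop this walk early
            node = rng.choice(options)
            if node != start:
                visited.add(node)
    return visited


def mean_aggregate(nodes: Iterable[str],
                   embeddings: Dict[str, List[float]]) -> List[float]:
    """Mean aggregator: element-wise average of the neighbors' vectors."""
    vectors = [embeddings[n] for n in sorted(nodes)]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy graph for the triple <h, r, t>: t is excluded from h's context.
toy_adjacency = {"h": ["a", "t"], "a": ["b"], "b": []}
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
context_set = collect_neighborhood(toy_adjacency, "h", rounds=2, length=2, exclude={"t"})
context_vector = mean_aggregate(context_set, toy_embeddings)
```

Capping the walk at a fixed number of rounds and steps bounds the size of the neighborhood set even for entities with many neighbors.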

In the formula, N(e) represents the entity neighborhood context information of the entity h in the original triple <h, r, ?>, and the corresponding term for the i-th entity represents the entity neighborhood context information of the i-th entity in the set of entities to be scored. The larger the inner product of the two, the smaller the distance between them, meaning that the two vectors are semantically closer, i.e. the greater the likelihood that the two entities are related. FIG. 4 shows the structure of the ProjE model considering semantic information.

Illustratively, the specific steps of performing the completion operation on the multi-level knowledge graph in S105 are as follows. In the known correct multi-level knowledge graph, the triples of the knowledge graph of each level are divided into N parts, where N is a positive integer; N-1 parts of the triples of each level are gathered into one data set as the training set, and the remaining 1 part of each level is gathered into one data set as the test set. The scoring function of the ProjE model considering semantic information is trained on the training set to mine implicit knowledge, and the accuracy of the implicit knowledge is verified with the test set, yielding a trained ProjE model considering semantic information. The multi-level knowledge graph is then completed based on the trained model.
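The per-level split into N parts can be sketched as below. Shuffling with a fixed seed and taking the first part as the test part are assumptions added for reproducibility, and the function name and sample triples are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def split_levels(levels: Dict[str, List[Triple]], n_parts: int,
                 seed: int = 0) -> Tuple[List[Triple], List[Triple]]:
    """Split every level into n_parts; pool n_parts - 1 parts of each level
    as the training set and the remaining part as the test set."""
    train: List[Triple] = []
    test: List[Triple] = []
    rng = random.Random(seed)
    for triples in levels.values():
        shuffled = list(triples)
        rng.shuffle(shuffled)
        parts = [shuffled[i::n_parts] for i in range(n_parts)]
        test.extend(parts[0])  # assumption: the first part is held out
        for part in parts[1:]:
            train.extend(part)
    return train, test


sample_levels = {
    "production process data": [(f"p{i}", "influence", f"p{i + 1}") for i in range(10)],
    "process flow": [(f"f{i}", "influence", f"f{i + 1}") for i in range(5)],
}
train_set, test_set = split_levels(sample_levels, n_parts=5)
```

Splitting level by level, rather than over the pooled triples, keeps every level represented in both the training set and the test set.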

Some implicit knowledge exists in the knowledge graph. Because it is not expressed explicitly, it is easily overlooked in fault diagnosis, which leads to inaccurate diagnosis. For example, in the chemical industry, it is known that improper raw material proportioning, improper flow control and similar conditions can cause equipment nodulation, and that equipment nodulation can cause the production process to fail. In the knowledge graph, the raw material proportioning parameters and the flow control parameters are related to the corresponding equipment, and fault causes can be inferred from these relations; this is known, explicit knowledge. However, in the actual chemical process, information such as the quality grade of the raw materials and the material of the equipment can also influence equipment nodulation to some extent, yet in a knowledge graph of explicit knowledge, such information has no link to the equipment concerned. This can lead to errors in decision making, so the knowledge that is not shown in the knowledge graph needs to be mined with the ProjE model considering semantic information and used for the completion operation.

It should be understood that knowledge in a knowledge graph is divided into display knowledge and implicit knowledge:

explicit knowledge refers to: knowledge of the correct determination is already known. The knowledge is generally derived from production process data and collected and mastered data in the chemical industry, and the explicit knowledge is characterized by sparsity, namely sparse relation between entities.

Implicit knowledge refers to: unknown correct knowledge. That is, there is a relationship between entities in the knowledge graph, but the two entities are not linked. This correct knowledge is not embodied in the knowledge-graph, but is implicit in the knowledge-graph. For the fault diagnosis of the chemical process, each piece of knowledge is important. Only if the implicit knowledge is mined, the information covered by the knowledge graph is more complete. The present disclosure uses the projE model that considers semantic information for knowledge graph completion.

As one or more embodiments, in S106, quality assessment is performed on the multi-level knowledge graph. The specific steps are as follows:

one of the entities in a known triplet <h, r, t> is removed, so that the triplet takes the form <h, r, ?>; the ProjE model considering semantic information is then used to predict the missing entity t of the triplet <h, r, ?>. If the predicted entity is consistent with the original entity, the quality of the knowledge graph is high; otherwise, the quality of the knowledge graph is low. The quality of the knowledge in the knowledge graph is judged in this way.

For example, for a multi-level knowledge graph of the styrene separation process, <reactor pressure value, influence, separator pressure value> is known to be correct knowledge. One entity of the triplet is removed, giving <reactor pressure value, influence, ?>, and the ProjE model considering semantic information is used to predict whether the missing entity of the triplet is the separator pressure value. One entity is removed from each of multiple triplets in this way; if the ProjE model considering semantic information accurately predicts that the missing entities are the entities of the original triplets, the constructed multi-level knowledge graph is of higher quality.
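
The assessment in S106 can be illustrated with the minimal sketch below. It assumes that the trained model is exposed as a scoring function `score(h, r, t)` returning a larger number for a more plausible triplet; that interface, the names `predict_tail` and `tail_prediction_accuracy`, and the use of the fraction of exact matches as the quality measure are assumptions made for illustration and are not taken from the disclosure.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)
Scorer = Callable[[str, str, str], float]


def predict_tail(score: Scorer, head: str, relation: str,
                 candidates: Iterable[str]) -> str:
    """Return the candidate entity with the highest score for <head, relation, ?>."""
    return max(candidates, key=lambda entity: score(head, relation, entity))


def tail_prediction_accuracy(score: Scorer, known: List[Triple],
                             entities: List[str]) -> float:
    """Fraction of known triplets whose removed tail entity is predicted back exactly."""
    if not known:
        raise ValueError("need at least one known triplet")
    hits = sum(
        1 for h, r, t in known
        if predict_tail(score, h, r, entities) == t
    )
    return hits / len(known)
```

A value close to 1 would indicate that most removed entities are recovered, which corresponds to the higher-quality case described above; where to place the pass threshold is left open here.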

Fig. 1 is a schematic diagram of the multi-level knowledge graph proposed in the present disclosure. As shown in Fig. 1, the knowledge graph to be constructed is divided into different levels, and the levels are associated with one another.

Fig. 2 is a schematic diagram of multi-level knowledge graph data acquisition in the chemical field according to the present disclosure.

The working steps are as follows. The data acquisition program is divided into four modules: a scheduling module (which holds the link requests to be crawled next), a crawler module (which extracts the required data and the links to be crawled next), a downloading module (which connects to the Internet and obtains web page responses), and a data processing module (which processes the crawled data).

The specific workflow of the framework is as follows: when a request needs to be made, the scheduling module receives the request from the crawler module and passes it to the downloading module. The downloading module sends the request to the specified website, receives the response, and passes the response to the crawler module. The crawler module parses the web page response obtained by the downloading module, extracts the required data and the link requests to be crawled next, passes the data to the data processing module, and passes the link requests to the scheduling module. The data processing module cleans and formats the crawled data into a usable form.
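
The interaction of the four modules can be sketched as below. This is a simplified illustration under stated assumptions rather than the program of the disclosure: the downloading module is represented by a caller-supplied `fetch` function so that the sketch runs without network access, links are found with a simple regular expression, and the names `Scheduler`, `parse`, `clean` and `crawl` as well as the `max_pages` limit are hypothetical.

```python
import re
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Set, Tuple


class Scheduler:
    """Scheduling module: holds link requests still to be crawled, without repeats."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._seen: Set[str] = set()

    def add(self, url: str) -> None:
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None


def parse(html: str) -> Tuple[str, List[str]]:
    """Crawler module: extract the page text and the links to crawl next."""
    links = re.findall(r'href="([^"]+)"', html)
    text = re.sub(r"<[^>]+>", " ", html)  # drop tags, keep visible text
    return text, links


def clean(text: str) -> str:
    """Data processing module: collapse whitespace into a usable form."""
    return " ".join(text.split())


def crawl(start_url: str, fetch: Callable[[str], str],
          max_pages: int = 10) -> Dict[str, str]:
    """Run the scheduler -> downloader -> crawler -> data processing loop."""
    scheduler = Scheduler()
    scheduler.add(start_url)
    results: Dict[str, str] = {}
    while len(results) < max_pages:
        url = scheduler.pop()
        if url is None:
            break
        response = fetch(url)          # downloading module
        text, links = parse(response)  # crawler module
        results[url] = clean(text)     # data processing module
        for link in links:             # new link requests return to the scheduler
            scheduler.add(link)
    return results
```

For a real crawl, `fetch` would wrap an HTTP client; here any function mapping a URL to page text can be passed in, for example the lookup method of an in-memory dictionary of pages.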

Fig. 3 is a flow chart for automatically constructing a multi-level knowledge graph in the chemical field according to the disclosure.

Example II

The embodiment provides a multi-level knowledge graph construction system oriented to the chemical industry field;

It should be noted that the above-mentioned acquisition module, extraction module, construction module and integration module correspond to steps S101 to S104 in the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the first embodiment. It should also be noted that the modules described above may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.

The descriptions of the foregoing embodiments each have their own emphasis; for details not described in one embodiment, refer to the related description of another embodiment.

The proposed system may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.

Example III

The embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first embodiment.

It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.

In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.

The method in the first embodiment may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the various examples described in connection with the present embodiments can be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present application.

Example IV

The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.

The foregoing describes only preferred embodiments of the present disclosure and is not intended to limit the disclosure; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. The method for constructing the multi-level knowledge graph oriented to the chemical industry field comprises the following steps:

performing relation extraction on the acquired data to obtain triplet data;

constructing a single-level knowledge graph from the extracted triplet data;

integrating the single-level knowledge graphs to obtain a multi-level knowledge graph;

after the step of constructing the single-level knowledge graph from the extracted triplet data and before the step of integrating the single-level knowledge graphs to obtain the multi-level knowledge graph, the method further comprises multi-source data fusion, wherein the acquired uncertain knowledge is fused by using a multi-source data fusion algorithm, knowledge whose credibility is higher than a set threshold is selected and fused into the single-level deterministic knowledge graph, and knowledge whose credibility is lower than the set threshold is discarded, to obtain the supplemented single-level knowledge graph;

the specific steps of fusion by using the multi-source data fusion algorithm comprise:

(1) Carrying out block aggregation on data from different sources by taking entity keywords of each layer as a basis, and taking the data as candidate matching knowledge;

(2) Matching the candidate matching knowledge in the same block with the knowledge of the original knowledge graph by utilizing a multi-source data fusion coefficient W, and if W is larger than a set threshold value, considering the candidate matching knowledge as correct knowledge, and adding the correct knowledge into the knowledge graph;

the multi-source data fusion coefficient W is defined as follows:

W is composed of two parts: confidence, which is the confidence score, and the average value of the entity similarity and the relationship similarity; confidence is in turn composed of two parts, Q and cf, where Q is the confidence of the data source (the higher the value of Q, the more credible the source), and cf is a confidence calculated for each pair of entities based on the distances between the entities and the relation;

in the confidence formula, dependency syntax analysis is performed according to the dependency relations among sentence components; after the sentence is segmented into words, the entities and relations are identified, and the positions of the relations and entities are marked in sequence from right to left as 0, 1, 2, ..., where L denotes the position of an entity, R denotes the position of the relation, one term denotes the distance between entity 1 and entity 2, and L_i - R denotes the distance between entity 1 and the relation;

Entity_sim is the text similarity between entities and Relationship_sim is the similarity between relations; the average of the two is taken as the similarity of the knowledge, and if this similarity is greater than the set threshold of 0.5, the knowledge is considered more credible;

Entity_sim is calculated as follows:

the text is segmented into words, the text is modeled using word vectors obtained with word2vec, and the cosine of the angle between the two text vectors is calculated using cosine similarity to measure their similarity;

Relationship_sim is calculated as follows:

the knowledge base of the same block is traversed, centered on the entity, according to the relation in the candidate matching entity pair, to check whether the knowledge base contains a relation with high similarity to the relation in the candidate matching entity pair;

if not, the entire knowledge base is traversed to check whether such a relation exists; if it still does not, Relationship_sim is 0;

if so, the distance L from the entity in the knowledge base to the matching relation is calculated, the distance increasing by 1 for each intervening triplet, and Relationship_sim is 1/L;

after the multi-source data fusion model is applied, knowledge whose credibility is higher than the set threshold is selected and fused into the knowledge graph, and knowledge whose credibility is lower than the set threshold is discarded;

the multi-level knowledge graph construction method further comprises performing a completion operation on the multi-level knowledge graph, the specific steps comprising: in the known correct multi-level knowledge graph, dividing the triplets of the knowledge graph at each level into N parts, N being a positive integer; pooling N-1 parts of the triplets of each level into one data set as the training set, and pooling the remaining 1 part of the triplets of each level into one data set as the test set; training the scoring function of the ProjE model considering semantic information on the training set to mine implicit knowledge; verifying the accuracy of the implicit knowledge on the test set to obtain a trained ProjE model considering semantic information; and completing the multi-level knowledge graph based on the trained ProjE model considering semantic information;

the multi-level knowledge graph construction method further comprises the steps of carrying out quality assessment on the multi-level knowledge graph, if the quality assessment is qualified, the current multi-level knowledge graph is the qualified knowledge graph, otherwise, returning to the step of acquiring data of different layers in the chemical process, wherein the specific steps comprise: removing one of the entities of the known triplet; and predicting the missing entity of the triplet by using a ProjE model considering semantic information, if the predicted entity is consistent with the original entity, indicating that the quality of the knowledge graph is high, otherwise, indicating that the quality of the knowledge graph is low, and judging the quality of the knowledge in the knowledge graph.

2. The method of claim 1, wherein the relationship extraction is performed on the acquired data to obtain triplet data; the method comprises the following specific steps:

3. The method of claim 1, wherein the extracted triplet data is used to construct a single-level knowledge-graph; the method comprises the following specific steps:

according to the extracted triplet data, performing triplet entity alignment, and associating all triples to construct a single-level deterministic knowledge graph;

or,

integrating the single-level knowledge graphs to obtain a multi-level knowledge graph; the method comprises the following specific steps:

and integrating the single-level knowledge graph by means of entity alignment to obtain a multi-level knowledge graph.

4. A multi-level knowledge graph construction system oriented to the chemical industry field, characterized by comprising:

an integration module configured to: integrate the single-level knowledge graphs to obtain a multi-level knowledge graph;

the multi-source data fusion coefficient W is defined as follows:

in the confidence formula, dependency syntax analysis is performed according to the dependency relations among sentence components; after the sentence is segmented into words, the entities and relations are identified, and the positions of the words, relations and entities are marked in sequence from right to left as 0, 1, 2, ..., where L denotes the position of an entity, R denotes the position of the relation, one term denotes the distance between entity 1 and entity 2, and L_i - R denotes the distance between entity 1 and the relation;

Entity_sim is calculated as follows:

the text is segmented into words, the text is modeled using word vectors obtained with word2vec, and the cosine of the angle between the two text vectors is calculated using cosine similarity to measure their similarity;

Relationship_sim is calculated as follows:

the multi-level knowledge graph construction system further comprises performing a completion operation on the multi-level knowledge graph, the specific steps comprising: in the known correct multi-level knowledge graph, dividing the triplets of the knowledge graph at each level into N parts, N being a positive integer; pooling N-1 parts of the triplets of each level into one data set as the training set, and pooling the remaining 1 part of the triplets of each level into one data set as the test set; training the scoring function of the ProjE model considering semantic information on the training set to mine implicit knowledge; verifying the accuracy of the implicit knowledge on the test set to obtain a trained ProjE model considering semantic information; and completing the multi-level knowledge graph based on the trained ProjE model considering semantic information;

the multi-level knowledge graph construction system further comprises the steps of carrying out quality assessment on the multi-level knowledge graph, if the quality assessment is qualified, the current multi-level knowledge graph is the qualified knowledge graph, otherwise, returning to the step of acquiring data of different layers in the chemical process, wherein the specific steps comprise: removing one of the entities of the known triplet; and predicting the missing entity of the triplet by using a ProjE model considering semantic information, if the predicted entity is consistent with the original entity, indicating that the quality of the knowledge graph is high, otherwise, indicating that the quality of the knowledge graph is low, and judging the quality of the knowledge in the knowledge graph.

5. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of claims 1-3.

6. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-3.