CN117667890A

CN117667890A - Knowledge base construction method and system for standard digitization

Info

Publication number: CN117667890A
Application number: CN202311635195.2A
Authority: CN
Inventors: 岳高峰; 高亮; 王志强; 温娜
Original assignee: China National Institute of Standardization
Current assignee: China National Institute of Standardization
Priority date: 2023-12-01
Filing date: 2023-12-01
Publication date: 2024-03-08
Anticipated expiration: 2043-12-01
Also published as: CN117667890B

Abstract

The invention discloses a knowledge base construction method and system for standard digitization. The method comprises: obtaining standard digitization knowledge data and preprocessing the knowledge data; obtaining a knowledge association level from the preprocessed knowledge data and extracting named entities and entity relations; screening the named entities with a screening model to obtain first entities; screening the named entities with the screening model, according to the knowledge association level, to obtain second entities; merging the first entities and the second entities to obtain comprehensive entities; classifying the knowledge data with a classification model according to the comprehensive entities to obtain classification data; and constructing a knowledge base model according to the entity relations and the classification data, and outputting a knowledge base. The method not only can improve the precision of the carbon emission method of the low-carbon park, but also has better interpretability, and can be directly applied to a carbon emission system of the low-carbon park.

Description

Knowledge base construction method and system for standard digitization

Technical Field

The invention relates to the field of standard digitization, in particular to a knowledge base construction method and system for standard digitization.

Background

Knowledge base construction technology is widely applied in the field of standard digitization. It helps the builders of a standard digitization knowledge base construct it in a timely and efficient manner and optimize the processing of its data. At present, such knowledge bases are characterized by a huge volume of user information, diverse data types and high information density, and construction methods involve many uncertain factors, so the construction of a standard digitization knowledge base carries considerable uncertainty. Although some knowledge base construction methods and systems for standard digitization have been proposed, this uncertainty has not yet been effectively resolved.

Disclosure of Invention

The invention aims to provide a knowledge base construction method and system for standard digitization.

In order to achieve the above purpose, the invention is implemented according to the following technical scheme:

the invention comprises the following steps:

acquiring standard digitized knowledge data, and preprocessing the knowledge data;

acquiring knowledge association level according to the preprocessed knowledge data, and extracting named entities and entity relations from the knowledge data; comprising the following steps:

calculating the similarity between knowledge data:

wherein the similarity between knowledge data R and knowledge data C is $\eta(R, C)$, the union of knowledge data R and knowledge data C is $R \cup C$, and the intersection of knowledge data R and knowledge data C is $R \cap C$; a similarity of 0.83 to 1 corresponds to knowledge association level one, a similarity of 0.51 to 0.82 to level two, a similarity of 0.31 to 0.5 to level three, and a similarity of 0 to 0.3 to level four;
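As an illustration of the similarity and level mapping described above, the following is a minimal sketch. It assumes that $\eta(R, C)$ is the ratio of the size of the intersection to the size of the union (the formula itself is not reproduced in this text), and that each knowledge datum is represented as a set of tokens. The function names `similarity` and `association_level` are hypothetical. Values falling between two stated bands (for example 0.825) are assigned to the lower band, which is also an assumption.

```python
def similarity(r: set, c: set) -> float:
    """Assumed form of eta(R, C): |R intersect C| / |R union C|."""
    union = r | c
    if not union:
        return 0.0  # two empty sets share nothing measurable
    return len(r & c) / len(union)


def association_level(eta: float) -> int:
    """Map a similarity in [0, 1] to knowledge association level 1 to 4."""
    if eta >= 0.83:
        return 1
    if eta >= 0.51:
        return 2
    if eta >= 0.31:
        return 3
    return 4
```

Under this sketch, the similarity of 0.13 reported for the two example texts in the detailed description maps to level four.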

screening the named entities with a screening model to obtain a first entity, and screening the named entities according to the knowledge association level to obtain a second entity;

combining the first entity and the second entity to obtain a comprehensive entity, and classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data; comprising the following steps:

fusing the first entity and the second entity, deleting the repeated named entity, obtaining the local density and the relative distance of the named entity, and calculating a decision value:

wherein the local density of named entity e is [symbol missing in source], the relative distance of named entity e is $\mu_e$, and the decision value of named entity e is $\theta_e$; sorting the decision values in descending order, selecting the first n named entities as cluster centers, and taking the first m named entities smaller than n as micro cluster centers;

increasing the number of micro cluster centers, and selecting the number of micro cluster centers that has the least influence on the result;

distributing the remaining named entities to class clusters where named entities with higher density and closer distance are located according to a distribution strategy of a density peak clustering algorithm;

calculating the similarity between named entities:

wherein the similarity between named entity e and named entity j is $F_{ej}$, named entity j belongs to the K nearest neighbors of named entity e, $j \in K(e)$, and the Euclidean distance between named entity g and named entity e is $c_{eg}$; constructing a similarity matrix between named entities;

calculating the similarity among the micro clusters:

wherein the similarity between micro cluster $v_e$ and micro cluster $v_j$ is [symbol missing in source], the j-th micro cluster center is numbered $v_j$, the similarity from micro cluster $v_e$ to micro cluster $v_j$ is $F_{ej}$, and the similarity from micro cluster $v_j$ to micro cluster $v_e$ is $F_{je}$; constructing a similarity matrix among the micro clusters according to these similarities, merging each micro cluster that does not contain a final cluster center into the most similar micro cluster that contains a cluster center, and outputting a comprehensive entity;
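The center selection and assignment steps of the density peak clustering described above can be sketched as follows. This is a basic density peak scheme written under several assumptions, not the patented procedure: the local density is taken as the number of neighbors within a hypothetical `cutoff` distance, the decision value $\theta_e$ is taken as the product of local density and relative distance $\mu_e$, and the micro cluster merging step is not shown. The function name `density_peak_clusters` is illustrative.

```python
import numpy as np


def density_peak_clusters(points: np.ndarray, cutoff: float, n_centers: int) -> np.ndarray:
    """Return one cluster label per point using a basic density peak scheme."""
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Local density: number of other points closer than the cutoff.
    density = (dist < cutoff).sum(axis=1) - 1
    # Visit points from densest to sparsest; the stable sort fixes tie order.
    order = np.argsort(-density, kind="stable")
    n = len(points)
    rel_dist = np.zeros(n)
    parent = np.full(n, -1)
    for rank, idx in enumerate(order):
        if rank == 0:
            # The densest point gets its largest distance to any point.
            rel_dist[idx] = dist[idx].max()
            continue
        denser = order[:rank]
        nearest = denser[np.argmin(dist[idx, denser])]
        rel_dist[idx] = dist[idx, nearest]
        parent[idx] = nearest
    # Assumed decision value: local density times relative distance.
    decision = density * rel_dist
    centers = order[np.argsort(-decision[order], kind="stable")[:n_centers]]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    # Each remaining point joins the cluster of its nearest denser neighbor.
    for idx in order:
        if labels[idx] == -1:
            labels[idx] = labels[parent[idx]]
    return labels
```

Because points are visited in descending density order, the nearest denser neighbor of every point is already labeled when that point is reached, which is what makes the single assignment pass sufficient.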

and constructing a knowledge base model according to the entity relation and the classification data, and outputting a knowledge base.

Further, the preprocessing in step A includes removing duplicate data, word segmentation, stop-word removal, extracting the knowledge association level, smoothing noisy data, normalization and digitization.

Further, the method for extracting the named entity and the entity relation of the knowledge data comprises the following steps:

extracting keywords of the knowledge data, gridding the keywords and mapping them into a rectangular coordinate system, and obtaining a histogram with the frequency of keyword occurrence in a cell as the horizontal axis and the number of cells containing the same number of keywords as the vertical axis;

setting a threshold on the number of points in a cell; if the number of points in a cell is greater than or equal to the threshold, temporarily retaining the cell; sorting the cells by the number of keywords they contain, and retaining the top three cells;

for each temporarily retained cell, taking that cell as the center: if the number of temporarily retained cells exceeds the threshold, retaining the cell permanently, otherwise removing it;

forming the reserved unit cells into a characteristic cell matrix, wherein the expression is as follows:

wherein the retained cell in row 1, column 1 is $u_{11}$, the feature grid matrix is [symbol missing in source], the number of rows of the feature grid matrix is m, the number of columns of the feature grid matrix is n, and the elements of the feature grid matrix are output as named entities;

weights are assigned to different phrases in sentences to obtain embedded representation of sentence-level features, and the relationship types in the sentences are judged through the full connection layer, wherein the expression is as follows:

$$z_r = \beta(g_r B_c + v_r)$$

wherein the embedding of sentence c is $B_c$, the sigmoid function is $\beta$, the relation classification result r of sentence c is $z_r$, the bias (error) parameter of relation classification result r is $v_r$, and the relation coefficient of relation classification result r is $g_r$; judging the relations of the knowledge data and outputting the relation result.
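The relation classification expression $z_r = \beta(g_r B_c + v_r)$ can be illustrated with a short sketch. The way phrase weights are obtained is not specified in the text, so the softmax weighting in `sentence_embedding` is an assumption, and the function names and array shapes are hypothetical: `g` holds one coefficient row per relation type and `v` one bias per relation type.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """The s-shaped function beta."""
    return 1.0 / (1.0 + np.exp(-x))


def sentence_embedding(phrase_vectors: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Weighted sum of phrase vectors; softmax weights are an assumed choice."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ phrase_vectors


def relation_scores(b_c: np.ndarray, g: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Compute z_r = beta(g_r . B_c + v_r) for every relation type r."""
    return sigmoid(g @ b_c + v)
```

Each entry of the result lies in (0, 1), so a relation type can be accepted when its score exceeds a chosen cut-off; the cut-off itself is not given in the text.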

Further, the method for screening the named entity by adopting the screening model to obtain the first entity comprises the following steps:

selecting, from the named entities, an optimal model containing a plurality of named entities, and taking the optimal model as the initial value;

removing the initial value from the full set of named entities, forming subsets from the remainder, and combining each subset with the initial value;

screening the subsets with an all-subsets model and selecting the models whose statistics are significant; stopping the calculation if no named entity is added, otherwise continuing the screening;

judging whether the statistics of the subset are all significant and the regression error is the smallest; if so, re-forming the subset, otherwise screening out a first set of named entities whose statistics are significant;

performing regression on all subsets, deleting the independent variable whose statistic has the smallest absolute value below the critical value, and repeating the operation until all subsets have been traversed;

judging whether the statistics of all independent variables are significant; if not, deleting again, otherwise screening out a second set of named entities whose statistics are significant;

merging the first set and the second set of named entities, deleting repeated named entities to obtain a named entity set, and screening the named entity set according to regression error to obtain the first entity.
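The deletion step above (repeatedly removing the independent variable whose statistic has the smallest absolute value below the critical value) resembles backward elimination in stepwise regression. The following is a minimal sketch of that one step only, assuming ordinary least squares with an intercept and t statistics; the critical value of 2.0 and the function name `backward_eliminate` are assumptions, and the forward (all-subsets) part of the procedure is not shown.

```python
import numpy as np


def backward_eliminate(x: np.ndarray, y: np.ndarray, critical: float = 2.0) -> list[int]:
    """Return the column indices of x that survive backward elimination."""
    kept = list(range(x.shape[1]))
    while kept:
        # Ordinary least squares fit with an intercept column.
        design = np.column_stack([np.ones(len(y)), x[:, kept]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        dof = len(y) - design.shape[1]
        sigma2 = resid @ resid / dof
        cov = sigma2 * np.linalg.inv(design.T @ design)
        # Absolute t statistic of each kept variable (intercept excluded).
        t_stats = np.abs(coef[1:] / np.sqrt(np.diag(cov)[1:]))
        weakest = int(np.argmin(t_stats))
        if t_stats[weakest] >= critical:
            break  # every remaining variable is significant
        kept.pop(weakest)
    return kept
```

The sketch assumes more observations than fitted coefficients and a full-rank design matrix; neither condition is checked.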

Further, the method for screening the named entity to obtain the second entity according to the knowledge association level comprises the following steps:

obtaining an association matrix of the named entities according to the association of knowledge association levels among the named entities, classifying the named entities by adopting the association, and calculating a first screening index according to the maximum association degree in the named entity class and the minimum association degree among the classes, wherein the expression of the first screening index is as follows:

wherein the first screening index of named entity a is [symbol missing in source], the association degree within a named entity class is $x_n$, and the association degree between named entity classes is $x_o$;

obtaining a second screening index according to the information entropy of the named entities and the sample variance of the named entities, wherein the calculation formula of the second screening index is as follows:

wherein the number of named entities is w, the proportion of the j-th index of the b-th named entity is $u_{bj}$, the mean value of the named entities is [symbol missing in source], and the b-th named entity is $s_b$; obtaining a decisive screening index according to the first screening index and the second screening index:

wherein the decisive screening index of the a-th named entity is [symbol missing in source]; sorting the decisive screening indexes in descending order, and taking the named entity with the largest decisive screening index as the second entity.

Further, the method for classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data comprises the following steps:

calculating the density of the t-th iteration generation comprehensive entity:

wherein the density of comprehensive entity i at the t-th iteration is [symbol missing in source], the i-th comprehensive entity is $s_i$, the j-th comprehensive entity is $s_j$, the k-th nearest neighbor of comprehensive entity $s_j$ at the t-th iteration is [symbol missing in source], and the reverse k-nearest-neighbor set of comprehensive entity $s_i$ among all comprehensive entities at the t-th iteration is [symbol missing in source]; judging the stripping mark of the comprehensive entities at the t-th iteration, wherein the expression is as follows:

wherein the stripping mark of comprehensive entity i at the t-th iteration is [symbol missing in source], the initial density of comprehensive entity i at the t-th iteration is [symbol missing in source], and the density threshold of comprehensive entity i at the t-th iteration is [symbol missing in source]; determining the stripped boundary set, wherein the expression is:

wherein the boundary set stripped at the t-th iteration according to the stripping mark $\chi$ is [symbol missing in source], and the set of comprehensive entities remaining after the t-th stripping iteration is $S^{(t)}$; determining the remaining set of comprehensive entities after stripping;

judging whether the stripping is finished; if not, continuing to strip; if so, obtaining the connection threshold of the comprehensive entities and completing the initial clustering;

and carrying out fuzzy division on the boundary comprehensive entities, wherein the expression is as follows:

wherein the a-th cluster is $D_a$, the j-th initial cluster is [symbol missing in source], the a-th initial cluster is [symbol missing in source], the distances from boundary comprehensive entity $s_i$ to these initial clusters are [symbols missing in source], and the quantity relating boundary comprehensive entity $s_i$ to cluster $D_a$ is $z(s_i, D_a)$; classification data are obtained from the fuzzy partition.
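The density step above relies on the reverse k-nearest-neighbor set of each comprehensive entity, that is, the set of entities that count it among their own k nearest neighbors. A minimal sketch of computing these sets is given below, assuming the comprehensive entities are available as numeric vectors compared by Euclidean distance; the function name `reverse_knn_sets` is hypothetical, and the density, stripping and fuzzy partition formulas themselves are not reproduced.

```python
import numpy as np


def reverse_knn_sets(points: np.ndarray, k: int) -> list[set[int]]:
    """For each point i, collect the indices j whose k nearest neighbors include i."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    knn = np.argsort(dist, axis=1, kind="stable")[:, :k]
    reverse: list[set[int]] = [set() for _ in range(len(points))]
    for j, neighbors in enumerate(knn):
        for i in neighbors:
            reverse[int(i)].add(j)
    return reverse
```

An entity with an empty or small reverse set is rarely chosen as a neighbor by others, which is the usual reason such entities are treated as boundary candidates in peeling-style clustering.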

Further, the method for constructing the knowledge base model according to the entity relation and the classification data comprises the following steps:

the knowledge base model is constructed based on a graph neural network and a Transformer model; the graph neural network models the entity relations and the classification data, with the classification data represented as nodes and the entity relations represented as edges, and learns the representations of the nodes and the relations among them;

inputting the classification data into the Transformer model, learning the representation of the text using the self-attention mechanism and context information, and fusing the entity representation learned by the graph neural network with the text representation learned by the Transformer model by weighted average;

storing the fused entity and text representations in the knowledge base, inferring new knowledge relations with the graph neural network, dividing the classification data into a training set and a test set by random sampling, training the knowledge base with the training set, and testing the trained knowledge base with the test set; training stops when the AUC values on the test set are all greater than or equal to 0.64; otherwise a crossover operator is added to the test set data and training continues.
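Two pieces of the construction step above lend themselves to a short sketch: the weighted-average fusion of the two representations and the AUC stopping test with the 0.64 threshold. The graph neural network and Transformer model are not implemented here; `fuse`, `auc` and `should_stop` are hypothetical helper names, the fusion weight of 0.5 is an assumption, and the AUC is computed directly from its pairwise definition.

```python
import numpy as np


def fuse(graph_repr: np.ndarray, text_repr: np.ndarray, weight: float = 0.5) -> np.ndarray:
    """Weighted average of two representations of the same shape."""
    return weight * graph_repr + (1.0 - weight) * text_repr


def auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """Chance that a random positive outscores a random negative; ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


def should_stop(labels: np.ndarray, scores: np.ndarray, threshold: float = 0.64) -> bool:
    """Stop training once the test-set AUC reaches the threshold."""
    return auc(labels, scores) >= threshold
```

The pairwise AUC is quadratic in the number of test samples, which is acceptable for a sketch but would normally be replaced by a rank-based computation on larger test sets.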

In a second aspect, a knowledge base construction system for standard digitization comprises:

a preprocessing module, configured to acquire standard digitized knowledge data and preprocess the knowledge data;

an extraction module, configured to acquire the knowledge association level according to the preprocessed knowledge data and extract named entities and entity relations from the knowledge data;

an entity screening module, configured to screen the named entities with a screening model to obtain a first entity, and to screen the named entities according to the knowledge association level to obtain a second entity;

a classification module, configured to combine the first entity and the second entity to obtain a comprehensive entity, and to classify the knowledge data with a classification model according to the comprehensive entity to obtain classification data;

a construction module, configured to construct a knowledge base model according to the entity relations and the classification data, and to output a knowledge base.

The beneficial effects of the invention are as follows:

compared with the prior art, the invention has the following technical effects:

through preprocessing, extraction of named entities and entity relations, named entity screening, data classification and knowledge base construction, the invention improves the accuracy of standard digitization knowledge base construction, greatly saves resources and labor cost, and improves working efficiency. It can construct the knowledge base from standard digitized knowledge data in real time, which is of significance for standard digitization knowledge base construction. It adapts to different knowledge base construction systems and to the construction requirements of different users, and therefore has a certain universality.

Drawings

FIG. 1 is a flowchart illustrating steps of a knowledge base construction method and system for standard digitization according to the present invention.

Detailed Description

The invention is further described by the following specific examples, which are presented to illustrate, but not to limit, the invention.

The invention provides a knowledge base construction method and system for standard digitization, comprising the following steps:

as shown in fig. 1, in this embodiment, the steps include:

calculating the similarity between knowledge data:

in the actual evaluation, two standard digitized texts are provided:

text 1: "With the popularization of smart phones, the share of mobile shopping in electronic commerce is increasing year by year. Data show that the share of mobile shopping reached XX% in year XXXX and is expected to rise to XX% in year XX. Therefore, electronic commerce enterprises need to optimize the mobile user experience and improve the mobile conversion rate.";

text 2: "Medical image recognition technology uses deep learning algorithms to automatically analyze and recognize medical images and assist doctors in diagnosing diseases. The technology can greatly improve diagnostic efficiency and accuracy and reduce doctors' workload. At present, medical image recognition technology is widely applied in many fields, such as lung nodule detection and fundus lesion screening.";

the similarity of text 1 and text 2 is 0.13, so the knowledge association level of text 1 and text 2 is four; the named entities extracted from text 1 are smart phone, electronic commerce, mobile shopping, XXXX year, XX%, user experience and conversion rate; the entity relations of text 1 are: smart phones are becoming popular; the share of mobile shopping in electronic commerce is increasing year by year; the share of mobile shopping reached XX% in year XXXX; the share of mobile shopping is expected to rise to XX% in year XX; electronic commerce enterprises need to optimize the mobile user experience; and electronic commerce enterprises need to improve the mobile conversion rate;

the named entities extracted from text 2 are medical image recognition technology, deep learning algorithm, medical image, doctor, disease diagnosis, lung nodule detection and fundus lesion screening; the entity relations of text 2 are: medical image recognition technology uses deep learning algorithms; medical image recognition technology automatically analyzes and recognizes medical images; medical image recognition technology assists doctors in disease diagnosis; medical image recognition technology improves diagnostic efficiency and accuracy; medical image recognition technology reduces doctors' workload; and the application fields of medical image recognition technology are lung nodule detection and fundus lesion screening;

in the actual evaluation, the first entities of text 1 are smart phone, electronic commerce, mobile shopping, user experience and conversion rate, and the first entities of text 2 are medical image recognition technology, deep learning algorithm, disease diagnosis, lung nodule detection and fundus lesion screening; the second entities of text 1 are smart phone, electronic commerce, mobile shopping and conversion rate, and the second entities of text 2 are medical image recognition technology, disease diagnosis, lung nodule detection and fundus lesion screening;

calculating the similarity between named entities:

wherein the similarity between named entity e and named entity j is $F_{ej}$, named entity j belongs to the K nearest neighbors of named entity e, $j \in K(e)$, and the Euclidean distance between named entity g and named entity e is $c_{eg}$; constructing a similarity matrix between named entities;

calculating the similarity among the micro clusters:

in the actual evaluation, the comprehensive entities of text 1 are smart phone, electronic commerce, mobile shopping, user experience and conversion rate, and the comprehensive entities of text 2 are medical image recognition technology, deep learning algorithm, disease diagnosis, lung nodule detection and fundus lesion screening;

the classification data of text 1 are: class 1, smart phone, electronic commerce and mobile shopping; class 2, user experience and conversion rate; the classification data of text 2 are: class 1, deep learning algorithm; class 2, medical image recognition technology, disease diagnosis, lung nodule detection and fundus lesion screening;
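The first part of the merging step (fusing the first and second entities and deleting repeats) can be checked against the text 1 example with a few lines. This sketch covers only the fusion and de-duplication; the clustering that follows in the method is not shown, and the helper name `merge_entities` is hypothetical.

```python
def merge_entities(first: list[str], second: list[str]) -> list[str]:
    """Concatenate both lists and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(first + second))


# First and second entities of text 1, as listed in the evaluation above.
first_1 = ["smart phone", "electronic commerce", "mobile shopping",
           "user experience", "conversion rate"]
second_1 = ["smart phone", "electronic commerce", "mobile shopping",
            "conversion rate"]
merged_1 = merge_entities(first_1, second_1)
```

For text 1 the merged list contains the same five entities that the evaluation reports as the comprehensive entities, because every second entity already appears among the first entities.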

In this embodiment, the preprocessing in step A includes removing duplicate data, word segmentation, stop-word removal, extracting the knowledge association level, smoothing noisy data, normalization and digitization.

In this embodiment, the method for extracting named entities and entity relationships from the knowledge data includes:

$$z_r = \beta(g_r B_c + v_r)$$

In this embodiment, the method for screening the named entity to obtain the first entity by using a screening model includes:

In this embodiment, the method for screening the named entity to obtain the second entity according to the knowledge association level includes:

In this embodiment, the method for classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data includes:

calculating the density of the t-th iteration generation comprehensive entity:

wherein the density of comprehensive entity i at the t-th iteration is [symbol missing in source], the i-th comprehensive entity is $s_i$, the j-th comprehensive entity is $s_j$, the k-th nearest neighbor of comprehensive entity $s_j$ at the t-th iteration is [symbol missing in source], and the reverse k-nearest-neighbor set of comprehensive entity $s_i$ among all comprehensive entities at the t-th iteration is [symbol missing in source]; judging the stripping mark of the comprehensive entities at the t-th iteration, wherein the expression is as follows:

In this embodiment, the method for constructing a knowledge base model according to the entity relationship and the classification data includes:

In this embodiment, the method for optimizing the knowledge base model includes:

initializing a population and dividing it into probes and followers, determining whether the followers follow the probes according to the change of the fitness values, wherein the position update rule of the probes is expressed as follows:

wherein the iteration number is t, the maximum iteration number is $t_{max}$, the relation parameter is $\delta$, the position of the i-th population individual in dimension j at the t-th iteration is [symbol missing in source], the normally distributed random number is H, the all-ones matrix is Y, the safety threshold is $V_T$, the early warning value of the individual position is $E_S$, the exponential function with the natural constant e as its base is $\exp(\cdot)$, and the position of the i-th population individual in dimension j at the (t+1)-th iteration is [symbol missing in source]; updating the position of the followers, wherein the expression is:

wherein the global worst position at the t-th iteration is [symbol missing in source], the optimal position of the probes at the (t+1)-th iteration is [symbol missing in source], the population size is m, and the matrix whose elements have absolute value less than or equal to 1 is B; randomly selecting 13% of the population individuals as observers and updating their positions, wherein the expression is as follows:

wherein the optimal individual position at the t-th iteration is [symbol missing in source], the random parameter is p, the fitness value of sparrow individual i is $g_i$, the global best fitness value is $r_g$, the global worst fitness value is $r_w$, and the constant is $\rho$; the iteration stops when the fitness value reaches its minimum.
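The symbols listed above for the probe update ($\delta$, H, Y, $V_T$, $E_S$, $\exp(\cdot)$) match the roles of the quantities in the commonly used sparrow search update for discoverers, so the following sketch assumes that form: exponential decay of the position when the warning value is below the safety threshold, and a shift by a normal random number times an all-ones vector otherwise. This is an assumption, since the patent's own formula is not reproduced in this text, and the function names are hypothetical. The follower and observer position updates are not shown; only the 13% observer selection is.

```python
import numpy as np


def update_probe(position, rank, max_iter, delta, warning, safety, rng):
    """One probe move; `rank` is the 1-based index i of the individual."""
    if warning < safety:
        # Warning value E_S below safety threshold V_T: decay the position.
        return position * np.exp(-rank / (delta * max_iter))
    # Otherwise shift every dimension by the same normal draw H (H times Y).
    return position + rng.normal() * np.ones_like(position)


def select_observers(population_size, rng, fraction=0.13):
    """Pick a random 13% of the individuals (at least one) as observers."""
    count = max(1, round(fraction * population_size))
    return rng.choice(population_size, size=count, replace=False)
```

A caller would pass a NumPy generator such as `np.random.default_rng(0)` as `rng` so that runs are reproducible.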

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A knowledge base construction method for standard digitization, comprising the steps of:

calculating the similarity between knowledge data:

calculating the similarity between named entities:

calculating the similarity among the micro clusters:

2. The knowledge base construction method for standard digitization according to claim 1, wherein the preprocessing in step A comprises removing duplicate data, word segmentation, stop-word removal, extracting the knowledge association level, smoothing noisy data, normalization and digitization.

3. The knowledge base construction method for standard digitization according to claim 1, wherein the method for extracting named entities and entity relationships from the knowledge data comprises:

wherein the retained cell in row 1, column 1 is $u_{11}$, the feature grid matrix is [symbol missing in source], the number of rows of the feature grid matrix is m, the number of columns of the feature grid matrix is n, and the elements of the feature grid matrix are output as named entities;

$$z_r = \beta(g_r B_c + v_r)$$

4. A method for building a knowledge base for standard digitization according to claim 1, wherein the method for screening the named entities to obtain a first entity using a screening model comprises:

5. The method for building a standardized digitized knowledge base of claim 1 wherein the method for screening the named entities for a second entity based on the knowledge correlation rating comprises:

6. The method for constructing a standardized digitized knowledge base of claim 1 wherein the method for classifying the knowledge data to obtain classification data using a classification model based on the integrated entity comprises:

calculating the density of the t-th iteration generation comprehensive entity:

wherein the stripping mark of comprehensive entity i at the t-th iteration is [symbol missing in source], the initial density of comprehensive entity i at the t-th iteration is [symbol missing in source], and the density threshold of comprehensive entity i at the t-th iteration is [symbol missing in source]; determining the stripped boundary set, wherein the expression is:

wherein the a-th cluster is $D_a$, the j-th initial cluster is [symbol missing in source], the a-th initial cluster is [symbol missing in source], the distances from boundary comprehensive entity $s_i$ to these initial clusters are [symbols missing in source], and the quantity relating boundary comprehensive entity $s_i$ to cluster $D_a$ is $z(s_i, D_a)$; classification data are obtained from the fuzzy partition.

7. A knowledge base construction system for standard digitization, comprising: