CN117667890A - Knowledge base construction method and system for standard digitization - Google Patents
Knowledge base construction method and system for standard digitization
- Publication number
- CN117667890A (application CN202311635195.2A)
- Authority
- CN
- China
- Prior art keywords
- entity
- named
- knowledge
- screening
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a knowledge base construction method and system for standard digitization. The method comprises: acquiring standard digitized knowledge data and preprocessing the knowledge data; obtaining a knowledge association level from the preprocessed knowledge data and extracting named entities and entity relations; screening the named entities with a screening model to obtain first entities; screening the named entities according to the knowledge association level to obtain second entities; merging the first entities and the second entities to obtain comprehensive entities; classifying the knowledge data with a classification model according to the comprehensive entities to obtain classification data; and constructing a knowledge base model according to the entity relations and the classification data and outputting a knowledge base. The method can improve the accuracy of standard digitization knowledge base construction, save resources and labor cost, and adapt to different knowledge base construction systems.
Description
Technical Field
The invention relates to the field of standard digitization, in particular to a knowledge base construction method and system for standard digitization.
Background
Knowledge base construction technology is widely used in the field of standard digitization. It helps the builders of a standard digitization knowledge base construct it promptly and efficiently, and thereby optimize the processing of the knowledge base data. At present, knowledge bases are characterized by a huge volume of user information, diverse data types and high information density, and knowledge base construction methods involve many uncertain factors, so construction methods for standard digitization knowledge bases carry considerable uncertainty. Although some knowledge base construction methods and systems for standard digitization have been proposed, this uncertainty has not yet been effectively resolved.
Disclosure of Invention
The invention aims to provide a knowledge base construction method and system for standard digitization.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
the invention comprises the following steps:
acquiring standard digitized knowledge data, and preprocessing the knowledge data;
acquiring knowledge association level according to the preprocessed knowledge data, and extracting named entities and entity relations from the knowledge data; comprising the following steps:
calculating the similarity between knowledge data:
$$\eta(R, C) = \frac{|R \cap C|}{|R \cup C|}$$
wherein the similarity between knowledge data $R$ and knowledge data $C$ is $\eta(R, C)$, the union of knowledge data $R$ and knowledge data $C$ is $R \cup C$, and the intersection of knowledge data $R$ and knowledge data $C$ is $R \cap C$; a similarity of 0.83 to 1 corresponds to knowledge association level one, a similarity of 0.51 to 0.82 to level two, a similarity of 0.31 to 0.5 to level three, and a similarity of 0 to 0.3 to level four;
screening the named entities with a screening model to obtain a first entity, and screening the named entities according to the knowledge association level to obtain a second entity;
combining the first entity and the second entity to obtain a comprehensive entity, and classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data; comprising the following steps:
fusing the first entity and the second entity, deleting the repeated named entity, obtaining the local density and the relative distance of the named entity, and calculating a decision value:
wherein the decision value $\theta_e$ of named entity $e$ is calculated from the local density of named entity $e$ and the relative distance $\mu_e$ of named entity $e$; the decision values are sorted in descending order, the first $n$ named entities are selected as cluster centers, and the first $m$ named entities smaller than $n$ are taken as micro-cluster centers;
increasing the number of micro-cluster centers, and acquiring the number of micro-cluster centers that has the least influence on the result;
distributing the remaining named entities to class clusters where named entities with higher density and closer distance are located according to a distribution strategy of a density peak clustering algorithm;
calculating the similarity between named entities:
wherein the similarity between named entity $e$ and named entity $j$ is $F_{ej}$, named entity $j$ belongs to the $K$ nearest neighbors of named entity $e$, $j \in K(e)$, and the Euclidean distance between named entity $g$ and named entity $e$ is $c_{eg}$; constructing a similarity matrix between named entities;
calculating the similarity among the micro clusters:
wherein the formula gives the similarity between micro-cluster $v_e$ and micro-cluster $v_j$, the $j$-th micro-cluster center is $v_j$, the similarity from micro-cluster $v_e$ to micro-cluster $v_j$ is $F_{ej}$, and the similarity from micro-cluster $v_j$ to micro-cluster $v_e$ is $F_{je}$; constructing a similarity matrix among the micro-clusters according to these similarities, merging each micro-cluster that does not contain a final cluster center with the most similar micro-cluster that contains a cluster center, and outputting the comprehensive entity;
and constructing a knowledge base model according to the entity relation and the classification data, and outputting a knowledge base.
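As an illustrative, non-limiting sketch of the similarity and grading step above, the following Python code assumes that each piece of knowledge data is represented as a set of tokens and that the similarity is the ratio of the intersection size to the union size. The function names `jaccard_similarity` and `association_level` are hypothetical, and grading by lower bounds (so that a value such as 0.825 falls into level two) is an assumption, since the stated ranges leave small gaps between levels.

```python
def jaccard_similarity(r, c):
    """Ratio of shared items to all items across two collections."""
    r, c = set(r), set(c)
    union = r | c
    if not union:
        return 0.0  # assumption: two empty inputs are treated as unrelated
    return len(r & c) / len(union)


def association_level(similarity):
    """Map a similarity in [0, 1] to knowledge association level 1 to 4.

    Assumption: each level is entered at its lower bound (0.83, 0.51, 0.31).
    """
    if similarity >= 0.83:
        return 1
    if similarity >= 0.51:
        return 2
    if similarity >= 0.31:
        return 3
    return 4


if __name__ == "__main__":
    doc_r = {"mobile", "shopping", "conversion"}
    doc_c = {"shopping", "conversion", "diagnosis"}
    eta = jaccard_similarity(doc_r, doc_c)  # 2 shared items out of 4 in total
    print(eta, association_level(eta))
```

How the knowledge data are tokenized is not specified in the text, so the similarity obtained for real documents depends on that choice.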
Further, the preprocessing in step A includes removing duplicate data, word segmentation, stop-word removal, extracting the knowledge association level, smoothing noisy data, normalization and digitization.
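As an illustrative sketch of part of this preprocessing (duplicate removal, word segmentation, stop-word removal and normalization), the following Python code uses only the standard library. The stop-word list, the regular-expression tokenizer (a stand-in for a real word segmenter, which Chinese text would require) and the function names are assumptions made for illustration.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # hypothetical list


def preprocess(documents):
    """Deduplicate documents, tokenize them, and drop stop words."""
    seen, cleaned = set(), []
    for doc in documents:
        key = doc.strip().lower()
        if not key or key in seen:
            continue  # skip blank entries and exact duplicates
        seen.add(key)
        tokens = re.findall(r"[a-z0-9]+", key)
        cleaned.append([t for t in tokens if t not in STOP_WORDS])
    return cleaned


def min_max_normalize(values):
    """Scale numbers to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    docs = ["The share of mobile shopping",
            "the share of mobile shopping ",
            "Medical image recognition"]
    print(preprocess(docs))
    print(min_max_normalize([2, 4, 6]))
```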
Further, the method for extracting the named entity and the entity relation of the knowledge data comprises the following steps:
extracting keywords of the knowledge data, gridding the keywords, mapping the keywords into a rectangular coordinate system, and obtaining a histogram whose horizontal axis is the frequency of occurrence of keywords in a cell and whose vertical axis is the number of cells containing the same number of keywords;
setting a threshold on the number of points in a cell, temporarily retaining a cell if its count is greater than or equal to the threshold, sorting the cells by the number of keywords they contain, and keeping the cells ranked in the top three;
taking each temporarily retained cell as a center, permanently retaining the cell if the number of temporarily retained cells exceeds the threshold, and otherwise removing the cell;
forming the retained cells into a feature grid matrix, with the expression:
$$\begin{bmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{bmatrix}$$
wherein the retained cell in row 1, column 1 is $u_{11}$, the number of rows of the feature grid matrix is $m$, the number of columns of the feature grid matrix is $n$, and the elements of the feature grid matrix are output as named entities;
assigning weights to the different phrases in a sentence to obtain an embedded representation of sentence-level features, and judging the relation type in the sentence through a fully connected layer, with the expression:
$$z_r = \beta(g_r B_c + v_r)$$
wherein the embedding of sentence $c$ is $B_c$, the sigmoid function is $\beta$, the relation classification result $r$ of sentence $c$ is $z_r$, the error parameter of relation classification result $r$ is $v_r$, and the relation coefficient of relation classification result $r$ is $g_r$; the relation of the knowledge data is judged and the relation result is output.
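As an illustrative sketch of the relation classification expression above, the following Python code scores each relation type $r$ as the sigmoid of $g_r B_c + v_r$. It assumes that $B_c$ is a plain list of floats, that $g_r$ is a weight vector of the same length, and that $v_r$ is a scalar; the relation names and numeric values in the usage lines are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic (s-type) function."""
    return 1.0 / (1.0 + math.exp(-x))


def relation_scores(embedding, weights, biases):
    """Score each relation type r as sigmoid(g_r . B_c + v_r)."""
    scores = {}
    for relation, g_r in weights.items():
        linear = sum(g * b for g, b in zip(g_r, embedding)) + biases[relation]
        scores[relation] = sigmoid(linear)
    return scores


def predict_relation(embedding, weights, biases):
    """Return the relation type with the highest score."""
    scores = relation_scores(embedding, weights, biases)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    b_c = [2.0, -1.0]                                     # sentence embedding B_c
    g = {"uses": [1.0, 0.0], "assists": [0.0, 1.0]}       # coefficients g_r
    v = {"uses": 0.0, "assists": 0.0}                     # error parameters v_r
    print(relation_scores(b_c, g, v))
    print(predict_relation(b_c, g, v))
```

In practice the coefficients and error parameters would be learned during training; here they are fixed by hand only to make the arithmetic visible.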
Further, the method for screening the named entity by adopting the screening model to obtain the first entity comprises the following steps:
screening an optimal model containing a plurality of named entities from the named entities, and setting the optimal model as the initial value;
removing the initial value from all named entities, combining the remaining named entities to form subsets, and combining each subset with the initial value;
screening the subsets with an all-subsets model and selecting the model whose statistics are significant; stopping the calculation if no named entity is added, and otherwise continuing the screening;
judging whether the statistics of the subset are all significant and the regression error is minimal; if so, re-forming the subset; otherwise, screening out a first set of named entities with significant statistics;
performing regression on all subsets, deleting the independent variable whose statistic has the smallest absolute value below the critical value, and repeating this operation until all subsets are traversed;
judging whether the statistics of all independent variables are significant; if not, deleting again; otherwise, screening out a second set of named entities with significant statistics;
and merging the first set of named entities and the second set of named entities, deleting duplicate named entities to obtain a named entity set, and screening the named entity set according to the regression error to obtain the first entity.
Further, the method for screening the named entity to obtain the second entity according to the knowledge association level comprises the following steps:
obtaining an association matrix of the named entities according to the association of knowledge association levels among the named entities, classifying the named entities by adopting the association, and calculating a first screening index according to the maximum association degree in the named entity class and the minimum association degree among the classes, wherein the expression of the first screening index is as follows:
wherein the first screening index of named entity $a$ is calculated from the relevance within a named entity class, $x_n$, and the relevance between named entity classes, $x_o$;
obtaining a second screening index according to the information entropy of the named entity and the sample variance of the named entity, with the calculation formula:
wherein the number of named entities is $w$, the proportion of the $j$-th index of the $b$-th named entity is $u_{bj}$, the $b$-th named entity is $s_b$, and the sample variance is taken about the mean value of the named entities; obtaining a decisive screening index according to the first screening index and the second screening index:
wherein the formula gives the decisive screening index of the $a$-th named entity; the decisive screening indexes are sorted in descending order, and the named entities with the largest decisive screening indexes are taken as the second entity.
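As an illustrative sketch of the two quantities on which the second screening index is based, the following Python code computes a normalized information entropy from proportions and a sample variance. How the two are combined into the screening index is not reproduced here, so the code stops at the ingredients; the normalization by $\ln w$ follows the common entropy-weight convention and is an assumption, as are the function names.

```python
import math
import statistics


def proportions(values):
    """Turn non-negative index values into proportions that sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def information_entropy(props):
    """Normalized Shannon entropy in [0, 1] of proportions that sum to 1."""
    w = len(props)
    if w < 2:
        return 0.0
    # Terms with zero proportion contribute nothing to the entropy.
    return -sum(u * math.log(u) for u in props if u > 0) / math.log(w)


def sample_variance(values):
    """Sample variance about the mean, dividing by (w - 1)."""
    return statistics.variance(values)


if __name__ == "__main__":
    index_values = [4.0, 3.0, 2.0, 1.0]   # hypothetical index j over w = 4 entities
    u = proportions(index_values)
    print(information_entropy(u), sample_variance(index_values))
```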
Further, the method for classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data comprises the following steps:
calculating the density of the t-th iteration generation comprehensive entity:
wherein the formula gives the density of comprehensive entity $i$ at the $t$-th iteration, the $i$-th comprehensive entity is $s_i$, the $j$-th comprehensive entity is $s_j$, and the density is computed from the $k$-th nearest neighbor of comprehensive entity $s_j$ at the $t$-th iteration and from the reverse $k$-nearest-neighbor set of comprehensive entity $s_i$ among all comprehensive entities at the $t$-th iteration; judging the stripping mark of each comprehensive entity at the $t$-th iteration, with the expression:
wherein the stripping mark of comprehensive entity $i$ at the $t$-th iteration is determined from the initial density of comprehensive entity $i$ and the density threshold of comprehensive entity $i$ at the $t$-th iteration; determining the stripped boundary set, with the expression:
wherein the boundary set stripped at the $t$-th iteration is determined according to the stripping mark $\chi$, and the set of comprehensive entities remaining after the $t$-th stripping is $S^{(t)}$; determining the comprehensive entity set that remains after stripping;
judging whether the stripping is finished; if not, continuing to strip; if so, obtaining a connection threshold of the comprehensive entities and completing the initial clustering;
and carrying out fuzzy division on the boundary comprehensive entity, wherein the expression is as follows:
wherein the $a$-th cluster is $D_a$, the membership degree of boundary comprehensive entity $s_i$ in cluster $D_a$ is $z(s_i, D_a)$, and the membership degree is computed from the distances from boundary comprehensive entity $s_i$ to the $j$-th initial cluster and to the $a$-th initial cluster; the classification data are obtained from the fuzzy partition.
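As an illustrative sketch of one boundary-stripping pass of the kind described above, the following Python code builds the $k$-nearest-neighbor lists, derives the reverse $k$-nearest-neighbor set of each comprehensive entity, and strips the entities whose reverse set is small. Using the size of the reverse set as the density, and a single fixed threshold, are simplifying assumptions; the density and threshold expressions of the method itself are not reproduced here, and the function names are hypothetical.

```python
import math


def k_nearest(points, k):
    """Map each point index to the indices of its k nearest other points."""
    neighbors = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbors[i] = others[:k]
    return neighbors


def reverse_k_nearest(neighbors):
    """Map each index i to the set of indices that list i among their k nearest."""
    reverse = {i: set() for i in neighbors}
    for j, near in neighbors.items():
        for i in near:
            reverse[i].add(j)
    return reverse


def strip_boundary(points, k, threshold):
    """One stripping pass: return (boundary indices, remaining indices)."""
    reverse = reverse_k_nearest(k_nearest(points, k))
    # Assumption: density is the size of the reverse k-nearest-neighbor set.
    boundary = {i for i, members in reverse.items() if len(members) < threshold}
    remaining = set(reverse) - boundary
    return boundary, remaining


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
    print(strip_boundary(pts, k=2, threshold=2))
```

Repeating the pass on the remaining set until a stopping condition holds gives the iterative behavior described in the text; the stripped entities are the ones later assigned by fuzzy partition.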
Further, the method for constructing the knowledge base model according to the entity relation and the classification data comprises the following steps:
the knowledge base model is constructed based on a graph neural network and a Transformer model; the graph neural network is used to model the entity relations and the classification data, with the classification data represented as nodes and the entity relations represented as edges, and the graph neural network learns the representations of the nodes and the relations among them;
inputting the classification data into the Transformer model, learning the representation of the text by using a self-attention mechanism and context information, and fusing the entity representation learned by the graph neural network with the text representation learned by the Transformer model by weighted averaging;
storing the fused entity and text representations in the knowledge base, inferring new knowledge relations with the graph neural network, dividing the classification data into a training set and a test set by random sampling, training the knowledge base with the training set, testing the trained knowledge base with the test set, and stopping training when the AUC values on the test set are all greater than or equal to 0.64; otherwise adding a crossover operator to the test set data and continuing training.
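As an illustrative sketch of the weighted-average fusion and the AUC-based stopping rule above, the following Python code fuses two equal-length representations and computes the AUC as the fraction of positive and negative pairs that are ranked correctly. The weight of 0.5, the function names and the sample labels and scores are assumptions; the graph neural network and Transformer models themselves are not shown.

```python
def fuse(entity_vec, text_vec, entity_weight=0.5):
    """Weighted average of two equal-length representations."""
    if len(entity_vec) != len(text_vec):
        raise ValueError("representations must have the same length")
    w = entity_weight
    return [w * e + (1.0 - w) * t for e, t in zip(entity_vec, text_vec)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def should_stop(labels, scores, threshold=0.64):
    """Stop training once the test-set AUC reaches the threshold."""
    return auc(labels, scores) >= threshold


if __name__ == "__main__":
    print(fuse([1.0, 0.0], [0.0, 1.0], entity_weight=0.25))
    y_true = [1, 1, 0, 0]
    y_score = [0.9, 0.4, 0.5, 0.1]
    print(auc(y_true, y_score), should_stop(y_true, y_score))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable for large test sets.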
In a second aspect, a knowledge base construction system for standard digitization, comprises:
a preprocessing module, configured to acquire standard digitized knowledge data and preprocess the knowledge data;
an extraction module, configured to acquire a knowledge association level according to the preprocessed knowledge data and to extract named entities and entity relations from the knowledge data;
an entity screening module, configured to screen the named entities with a screening model to obtain a first entity, and to screen the named entities according to the knowledge association level to obtain a second entity;
a classification module, configured to combine the first entity and the second entity to obtain a comprehensive entity, and to classify the knowledge data with a classification model according to the comprehensive entity to obtain classification data;
a construction module, configured to construct a knowledge base model according to the entity relations and the classification data and to output a knowledge base.
The beneficial effects of the invention are as follows:
compared with the prior art, the invention has the following technical effects:
through preprocessing, extraction of named entities and entity relations, named entity screening, data classification and knowledge base construction, the invention improves the accuracy of the knowledge base and hence of standard digitization knowledge base construction. It greatly saves resources and labor cost, improves working efficiency, and can construct the knowledge base from standard digitized knowledge data in real time. It adapts to different knowledge base construction systems and to the construction requirements of different users, and therefore has a certain universality.
Drawings
FIG. 1 is a flowchart illustrating steps of a knowledge base construction method and system for standard digitization according to the present invention.
Detailed Description
The invention is further described by the following specific examples, which are presented to illustrate, but not to limit, the invention.
The invention discloses a knowledge base construction method and system for standard digitization.
As shown in FIG. 1, in this embodiment, the method comprises the following steps:
acquiring standard digitized knowledge data, and preprocessing the knowledge data;
acquiring knowledge association level according to the preprocessed knowledge data, and extracting named entities and entity relations from the knowledge data; comprising the following steps:
calculating the similarity between knowledge data:
$$\eta(R, C) = \frac{|R \cap C|}{|R \cup C|}$$
wherein the similarity between knowledge data $R$ and knowledge data $C$ is $\eta(R, C)$, the union of knowledge data $R$ and knowledge data $C$ is $R \cup C$, and the intersection of knowledge data $R$ and knowledge data $C$ is $R \cap C$; a similarity of 0.83 to 1 corresponds to knowledge association level one, a similarity of 0.51 to 0.82 to level two, a similarity of 0.31 to 0.5 to level three, and a similarity of 0 to 0.3 to level four;
in the actual evaluation, two standard digitized texts are provided:
text 1: "With the popularization of smart phones, the share of mobile shopping in electronic commerce is increasing year by year. The data show that the mobile shopping share reached XX% in XXXX, and it is expected to rise to XX% in XX. Therefore, electronic commerce enterprises need to optimize the user experience of the mobile terminal and improve the conversion rate of the mobile terminal.";
text 2: "Medical image recognition technology uses a deep learning algorithm to automatically analyze and recognize medical images and assist doctors in diagnosing diseases. The technology can greatly improve diagnosis efficiency and accuracy and lighten the workload of doctors. At present, medical image recognition technology is widely applied in many fields, such as lung nodule detection and fundus lesion screening.";
the similarity between text 1 and text 2 is 0.13, so the knowledge association level of text 1 and text 2 is four; the named entities extracted from text 1 are smart phone, electronic commerce, mobile shopping, XXXX year, XX%, user experience and conversion rate; the entity relations of text 1 are: smart phones are becoming popular; the share of mobile shopping in electronic commerce is increasing year by year; the mobile shopping share in XXXX year has risen; the mobile shopping share in XX year is expected to rise; the mobile shopping share in XX year rises; electronic commerce enterprises need to optimize the user experience of the mobile terminal; and electronic commerce enterprises need to improve the conversion rate of the mobile terminal;
the named entities extracted from text 2 are medical image recognition technology, deep learning algorithm, medical image, doctor, disease diagnosis, lung nodule detection and fundus lesion screening; the entity relations of text 2 are: medical image recognition technology uses a deep learning algorithm; medical image recognition technology automatically analyzes and recognizes medical images; medical image recognition technology assists doctors in disease diagnosis; medical image recognition technology improves diagnosis efficiency and accuracy; medical image recognition technology lightens the workload of doctors; and the application fields of medical image recognition technology are lung nodule detection and fundus lesion screening;
screening the named entities with a screening model to obtain a first entity, and screening the named entities according to the knowledge association level to obtain a second entity;
in actual evaluation, the first entity of the text 1 is a smart phone, electronic commerce, mobile shopping, user experience and conversion rate, and the first entity of the text 2 is a medical image recognition technology, a deep learning algorithm, disease diagnosis, lung nodule detection and fundus lesion screening; the second entity of the text 1 is a smart phone, electronic commerce, mobile shopping and conversion rate, and the second entity of the text 2 is a medical image recognition technology, disease diagnosis, lung nodule detection and fundus lesion screening;
combining the first entity and the second entity to obtain a comprehensive entity, and classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data; comprising the following steps:
fusing the first entity and the second entity, deleting the repeated named entity, obtaining the local density and the relative distance of the named entity, and calculating a decision value:
wherein the decision value $\theta_e$ of named entity $e$ is calculated from the local density of named entity $e$ and the relative distance $\mu_e$ of named entity $e$; the decision values are sorted in descending order, the first $n$ named entities are selected as cluster centers, and the first $m$ named entities smaller than $n$ are taken as micro-cluster centers;
increasing the number of micro-cluster centers, and acquiring the number of micro-cluster centers that has the least influence on the result;
distributing the remaining named entities to class clusters where named entities with higher density and closer distance are located according to a distribution strategy of a density peak clustering algorithm;
calculating the similarity between named entities:
wherein the similarity between named entity $e$ and named entity $j$ is $F_{ej}$, named entity $j$ belongs to the $K$ nearest neighbors of named entity $e$, $j \in K(e)$, and the Euclidean distance between named entity $g$ and named entity $e$ is $c_{eg}$; constructing a similarity matrix between named entities;
calculating the similarity among the micro clusters:
wherein the formula gives the similarity between micro-cluster $v_e$ and micro-cluster $v_j$, the $j$-th micro-cluster center is $v_j$, the similarity from micro-cluster $v_e$ to micro-cluster $v_j$ is $F_{ej}$, and the similarity from micro-cluster $v_j$ to micro-cluster $v_e$ is $F_{je}$; constructing a similarity matrix among the micro-clusters according to these similarities, merging each micro-cluster that does not contain a final cluster center with the most similar micro-cluster that contains a cluster center, and outputting the comprehensive entity;
in actual evaluation, the comprehensive entity of the text 1 is a smart phone, electronic commerce, mobile shopping, user experience and conversion rate, and the comprehensive entity of the text 2 is a medical image recognition technology, a deep learning algorithm, disease diagnosis, lung nodule detection and fundus lesion screening;
the classification data of text 1 are class 1: smart phone, electronic commerce, mobile shopping; and class 2: user experience, conversion rate; the classification data of text 2 are class 1: deep learning algorithm; and class 2: medical image recognition technology, disease diagnosis, lung nodule detection, fundus lesion screening;
and constructing a knowledge base model according to the entity relation and the classification data, and outputting a knowledge base.
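As an illustrative sketch of the decision-value step in the clustering procedure of this embodiment, the following Python code follows the conventional density peak clustering formulation: the local density is the number of other points within a cutoff distance, the relative distance is the distance to the nearest point of higher density, and the decision value is their product. That product form, the cutoff-count density, the point coordinates and the function names are assumptions, since the exact expression is not reproduced in the text.

```python
import math


def decision_values(points, cutoff):
    """Return (density, relative_distance, decision) lists for the points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff.
    density = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
               for i in range(n)]
    relative = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if density[j] > density[i]]
        # A point with no denser point takes its largest distance instead.
        relative.append(min(higher) if higher else max(dist[i]))
    decision = [d * r for d, r in zip(density, relative)]
    return density, relative, decision


def top_centers(decision, n_centers):
    """Indices of the n_centers largest decision values, in descending order."""
    order = sorted(range(len(decision)), key=lambda i: decision[i], reverse=True)
    return order[:n_centers]


if __name__ == "__main__":
    group_a = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0)]
    group_b = [(10, 10), (10, 11), (11, 10), (10, 9), (9, 10)]
    rho, mu, theta = decision_values(group_a + group_b, cutoff=1.1)
    print(top_centers(theta, 2))
```

Each remaining point would then be assigned to the cluster of its nearest neighbor of higher density, as in the distribution strategy named in the text.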
In this embodiment, the preprocessing in step A includes removing duplicate data, word segmentation, stop-word removal, extracting the knowledge association level, smoothing noisy data, normalization and digitization.
In this embodiment, the method for extracting named entities and entity relationships from the knowledge data includes:
extracting keywords of the knowledge data, gridding the keywords, mapping the keywords into a rectangular coordinate system, and obtaining a histogram whose horizontal axis is the frequency of occurrence of keywords in a cell and whose vertical axis is the number of cells containing the same number of keywords;
setting a threshold on the number of points in a cell, temporarily retaining a cell if its count is greater than or equal to the threshold, sorting the cells by the number of keywords they contain, and keeping the cells ranked in the top three;
taking each temporarily retained cell as a center, permanently retaining the cell if the number of temporarily retained cells exceeds the threshold, and otherwise removing the cell;
forming the retained cells into a feature grid matrix, wherein the expression is as follows:
$$\begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & \cdots & u_{mn} \end{pmatrix}$$
wherein the retained cell in row 1, column 1 is $u_{11}$, the number of rows of the feature grid matrix is $m$, and the number of columns of the feature grid matrix is $n$; the elements of the feature grid matrix are output as named entities;
assigning weights to the different phrases in a sentence to obtain an embedded representation of sentence-level features, and judging the relation type in the sentence through a fully connected layer, wherein the expression is as follows:
$$z_r = \beta\left(g_r B_c + v_r\right)$$
wherein the embedded representation of sentence $c$ is $B_c$, the sigmoid (S-shaped) function is $\beta$, the classification result of sentence $c$ for relation $r$ is $z_r$, the error parameter of relation classification result $r$ is $v_r$, and the relation coefficient of relation classification result $r$ is $g_r$; judging the relations in the knowledge data and outputting the relation result.
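The relation classification expression $z_r = \beta(g_r B_c + v_r)$ can be illustrated with a short sketch. It assumes that $B_c$ is a real-valued vector, that each $g_r$ is a coefficient vector of the same length so that $g_r B_c$ is a dot product, and that $\beta$ is the logistic sigmoid. The function names and the numeric values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, standing in for beta."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relation_scores(b_c, g, v):
    """Compute z_r = sigmoid(g_r . B_c + v_r) for every relation r."""
    if len(g) != len(v):
        raise ValueError("need one coefficient row and one error term per relation")
    scores = []
    for g_r, v_r in zip(g, v):
        if len(g_r) != len(b_c):
            raise ValueError("coefficient row and sentence embedding differ in length")
        scores.append(sigmoid(sum(w * x for w, x in zip(g_r, b_c)) + v_r))
    return scores


# Hypothetical sentence embedding with two relation types.
scores = relation_scores([2.0, 1.0], [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0])
# Assumption: the relation with the highest score is reported.
best = max(range(len(scores)), key=scores.__getitem__)
```

Reporting the highest-scoring relation is one possible decision rule; the source only states that the relation type is judged through the fully connected layer.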
In this embodiment, the method for screening the named entity to obtain the first entity by using a screening model includes:
screening out, from the named entities, an optimal model containing a plurality of named entities, and setting the optimal model as the initial value;
removing the initial value from the full set of named entities, combining the remaining named entities into subsets, and combining each subset with the initial value;
screening the subsets by using an all-subsets model and selecting the model whose statistics are significant; stopping the calculation if no named entity is added, and otherwise continuing the screening;
judging whether the statistics of the subset are all significant and the regression error is minimal; if so, re-forming the subset, and otherwise screening out a first set of named entities with significant statistics;
performing regression on all subsets, deleting the independent variable whose statistic has the smallest absolute value below the critical value, and repeating the operation until all subsets have been traversed;
judging whether the statistics of all independent variables are significant; if not, deleting again, and otherwise screening out a second set of named entities with significant statistics;
merging the first set and the second set of named entities, deleting repeated named entities to obtain a named entity set, and screening the named entity set according to regression error to obtain the first entity.
In this embodiment, the method for screening the named entity to obtain the second entity according to the knowledge association level includes:
obtaining an association matrix of the named entities according to the knowledge association levels among the named entities, classifying the named entities according to the association, and calculating a first screening index from the maximum association degree within a named entity class and the minimum association degree between classes, wherein the expression of the first screening index is as follows:
wherein, for the $a$-th named entity, the association degree within a named entity class is $x_n$ and the association degree between named entity classes is $x_o$;
obtaining a second screening index according to the information entropy of the named entities and the sample variance of the named entities, wherein the calculation formula of the second screening index is as follows:
wherein the number of named entities is $w$, the proportion of the $j$-th index of the $b$-th named entity is $u_{bj}$, and the $b$-th named entity is $s_b$, the sample variance being taken about the mean value of the named entities; obtaining a decisive screening index according to the first screening index and the second screening index:
wherein the decisive screening index is evaluated for the $a$-th named entity; sorting the decisive screening indexes in descending order, and taking the named entity with the largest decisive screening index as the second entity.
In this embodiment, the method for classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data includes:
calculating the density of each comprehensive entity at the $t$-th iteration:
wherein the $i$-th comprehensive entity is $s_i$ and the $j$-th comprehensive entity is $s_j$, and the density of comprehensive entity $s_i$ at the $t$-th iteration is defined in terms of the $k$-th nearest neighbor of comprehensive entity $s_j$ and the reverse $k$-nearest-neighbor set of comprehensive entity $s_i$ within the whole set of comprehensive entities at the $t$-th iteration; judging the stripping mark of each comprehensive entity at the $t$-th iteration, wherein the expression is as follows:
wherein the stripping mark of comprehensive entity $i$ at the $t$-th iteration is determined from the initial density of comprehensive entity $i$ and the density threshold of comprehensive entity $i$ at the $t$-th iteration; determining the stripped boundary set, wherein the expression is:
wherein $\chi$ is the stripping mark according to which the boundary set is stripped at the $t$-th iteration, and $S^{(t)}$ is the set of comprehensive entities remaining after the $t$-th stripping iteration; determining the remaining comprehensive entity set after stripping;
judging whether the stripping is finished; if not, continuing to strip; if so, obtaining the connection threshold of the comprehensive entities and completing the initial clustering;
performing fuzzy division on the boundary comprehensive entities, wherein the expression is as follows:
wherein the $a$-th cluster is $D_a$, and $z(s_i, D_a)$ relates boundary comprehensive entity $s_i$ to cluster $D_a$ and is computed from the distances between $s_i$ and the initial clusters, including the $j$-th initial cluster and the $a$-th initial cluster; obtaining the classification data according to the fuzzy division.
In this embodiment, the method for constructing a knowledge base model according to the entity relationship and the classification data includes:
the knowledge base model is constructed based on a graph neural network and a Transformer model; the graph neural network models the entity relations and the classification data, with the classification data represented as nodes and the entity relations represented as edges, and is used to learn the representations of the nodes and the relations among them;
inputting the classification data into the Transformer model, learning the representation of the text by using the self-attention mechanism and context information, and fusing the entity representation learned by the graph neural network with the text representation learned by the Transformer model by weighted averaging;
storing the fused entity and text representations in the knowledge base and inferring new knowledge relations with the graph neural network; dividing the classification data into a training set and a test set by random sampling, training the knowledge base with the training set, and testing the trained knowledge base with the test set; stopping training when the AUC values on the test set are all greater than or equal to 0.64, and otherwise adding a crossover operator to the test set data and continuing training.
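The weighted-average fusion and the AUC stopping rule described above can be sketched as follows. This is an illustration under stated assumptions: the two representations are plain vectors of equal length, a single scalar weight is used (the source does not give the weights), and AUC is computed by the pairwise ranking definition for binary labels. All function names are hypothetical.

```python
def fuse(entity_vec, text_vec, weight=0.5):
    """Weighted average of a graph-side and a text-side representation."""
    if len(entity_vec) != len(text_vec):
        raise ValueError("representations must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * e + (1.0 - weight) * t for e, t in zip(entity_vec, text_vec)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def should_stop(auc_values, threshold=0.64):
    """Stop training once every test-set AUC reaches the threshold."""
    return all(a >= threshold for a in auc_values)


fused = fuse([1.0, 0.0], [0.0, 1.0], weight=0.25)
test_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

With these inputs three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75 and the 0.64 threshold from the text would be met.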
In this embodiment, the method for optimizing the knowledge base model includes:
initializing a population and dividing it into probes and followers, determining whether the followers follow the probes according to the change in fitness value, and updating the positions of the probes, wherein the position update rule of the probes is expressed as:
wherein the iteration number is $t$, the maximum number of iterations is $t_{max}$, the relation parameter is $\delta$, the normally distributed random number is $H$, the all-ones matrix is $Y$, the safety threshold is $V_T$, the early-warning value of the population individual positions is $E_S$, and $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base; the rule maps the position of the $i$-th population individual in the $j$-th dimension at the $t$-th iteration to its position at the $(t+1)$-th iteration; updating the positions of the followers, wherein the expression is:
wherein the rule uses the global worst position of the probes at the $t$-th iteration and the optimal position of the probes at the $(t+1)$-th iteration, the population size is $m$, and $B$ is a matrix whose elements have absolute values less than or equal to 1; randomly selecting 13% of the population individuals as observers and updating their positions, wherein the expression is:
wherein the rule uses the optimal population individual position at the $t$-th iteration, the random parameter is $p$, the fitness value of sparrow individual $i$ is $g_i$, the global optimal fitness value is $r_g$, the global worst fitness value is $r_w$, and the constant is $\rho$; stopping the iteration when the fitness value reaches its minimum.
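The probe update can be illustrated with a short sketch. Because the update expressions themselves are not legible in this text, the sketch assumes the commonly published sparrow search rule: when the early-warning value $E_S$ is below the safety threshold $V_T$, each coordinate is scaled by $\exp(-i / (\delta \, t_{max}))$; otherwise a normally distributed number $H$ multiplied by the all-ones matrix $Y$ is added. The function name, the sampling of $\delta$ from $(0, 1]$, and the example values are assumptions.

```python
import math
import random


def update_probe(position, rank, t_max, warning_value, safety_threshold, rng):
    """Return the next position of one probe (assumed standard SSA rule).

    position         -- current coordinates of the individual (list of floats)
    rank             -- 1-based index i of the individual in the population
    t_max            -- maximum number of iterations
    warning_value    -- early-warning value E_S, assumed to lie in [0, 1]
    safety_threshold -- safety threshold V_T
    rng              -- random.Random instance supplying delta and H
    """
    if warning_value < safety_threshold:
        # No danger detected: contract each coordinate towards the origin.
        delta = 1.0 - rng.random()  # relation parameter, drawn from (0, 1]
        factor = math.exp(-rank / (delta * t_max))
        return [x * factor for x in position]
    # Danger detected: shift every coordinate by one normal draw H
    # (H times the all-ones matrix Y).
    h = rng.gauss(0.0, 1.0)
    return [x + h for x in position]


rng = random.Random(0)
calm = update_probe([2.0, -4.0], 1, 100, 0.3, 0.8, rng)
alarm = update_probe([2.0, -4.0], 1, 100, 0.9, 0.8, rng)
```

Passing the random generator in explicitly keeps the update reproducible, which helps when comparing optimization runs of the knowledge base model.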
In a second aspect, a knowledge base construction system for standard digitization, comprises:
a preprocessing module, configured to acquire standard digitized knowledge data and preprocess the knowledge data;
an extraction module, configured to acquire the knowledge association level according to the preprocessed knowledge data and extract named entities and entity relations from the knowledge data;
an entity screening module, configured to screen the named entities by using a screening model to obtain a first entity, and screen the named entities according to the knowledge association level to obtain a second entity;
a classification module, configured to combine the first entity and the second entity to obtain a comprehensive entity, and classify the knowledge data by using a classification model according to the comprehensive entity to obtain classification data;
a construction module, configured to construct a knowledge base model according to the entity relations and the classification data, and output a knowledge base.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (7)
1. A knowledge base construction method for standard digitization, comprising the steps of:
acquiring standard digitized knowledge data, and preprocessing the knowledge data;
acquiring knowledge association level according to the preprocessed knowledge data, and extracting named entities and entity relations from the knowledge data; comprising the following steps:
calculating the similarity between knowledge data:
wherein the similarity between knowledge data $R$ and knowledge data $C$ is $\eta(R, C)$, the union of knowledge data $R$ and knowledge data $C$ is $R \cup C$, and the intersection of knowledge data $R$ and knowledge data $C$ is $R \cap C$; a similarity of 0.83 to 1 corresponds to knowledge association level one, a similarity of 0.51 to 0.82 to level two, a similarity of 0.31 to 0.5 to level three, and a similarity of 0 to 0.3 to level four;
screening the named entities by using a screening model to obtain a first entity, and screening the named entities according to the knowledge association level to obtain a second entity;
combining the first entity and the second entity to obtain a comprehensive entity, and classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data; comprising the following steps:
fusing the first entity and the second entity, deleting the repeated named entity, obtaining the local density and the relative distance of the named entity, and calculating a decision value:
wherein the relative distance of named entity $e$ is $\mu_e$ and the decision value of named entity $e$ is $\theta_e$, computed from the local density and the relative distance of named entity $e$; sorting the decision values in descending order, selecting the first $n$ named entities as class cluster centers, and taking the first $m$ named entities smaller than $n$ as micro cluster centers;
increasing the number of micro cluster centers, and obtaining the number of micro cluster centers at which the result is least affected;
distributing the remaining named entities to class clusters where named entities with higher density and closer distance are located according to a distribution strategy of a density peak clustering algorithm;
calculating the similarity between named entities:
wherein the similarity between named entity $e$ and named entity $j$ is $F_{ej}$, named entity $j$ is a $K$-nearest neighbor of named entity $e$, that is, $j \in K(e)$, and the Euclidean distance between named entity $g$ and named entity $e$ is $c_{eg}$; constructing a similarity matrix between the named entities;
calculating the similarity among the micro clusters:
wherein the similarity between micro cluster $v_e$ and micro cluster $v_j$ is obtained from $F_{ej}$, the similarity of micro cluster $v_e$ to micro cluster $v_j$, and $F_{je}$, the similarity of micro cluster $v_j$ to micro cluster $v_e$, where $v_j$ is the $j$-th micro cluster; constructing a similarity matrix among the micro clusters according to these similarities, merging each micro cluster that does not contain a final class cluster center into the most similar micro cluster that contains a class cluster center, and outputting the comprehensive entity;
and constructing a knowledge base model according to the entity relation and the classification data, and outputting a knowledge base.
2. The knowledge base construction method for standard digitization according to claim 1, wherein the preprocessing in step A comprises removing duplicate data, word segmentation, stop-word removal, extracting knowledge association levels, smoothing noisy data, normalization, and digitization.
3. The knowledge base construction method for standard digitization according to claim 1, wherein the method for extracting named entities and entity relationships from the knowledge data comprises:
extracting keywords from the knowledge data, gridding the keywords, mapping them into a rectangular coordinate system, and obtaining a histogram whose horizontal axis is the number of keyword occurrences in a cell and whose vertical axis is the number of cells containing that number of keywords;
setting a threshold on the number of points in a cell, temporarily retaining a cell if its number of points is greater than or equal to the threshold, sorting the cells by the number of keywords they contain, and retaining the top three cells in the sorted order;
taking each temporarily retained cell as a center, permanently retaining it if the number of temporarily retained cells around it exceeds the threshold, and otherwise removing it;
forming the retained cells into a feature grid matrix, wherein the expression is as follows:
$$\begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & \cdots & u_{mn} \end{pmatrix}$$
wherein the retained cell in row 1, column 1 is $u_{11}$, the number of rows of the feature grid matrix is $m$, and the number of columns of the feature grid matrix is $n$; the elements of the feature grid matrix are output as named entities;
assigning weights to the different phrases in a sentence to obtain an embedded representation of sentence-level features, and judging the relation type in the sentence through a fully connected layer, wherein the expression is as follows:
$$z_r = \beta\left(g_r B_c + v_r\right)$$
wherein the embedded representation of sentence $c$ is $B_c$, the sigmoid (S-shaped) function is $\beta$, the classification result of sentence $c$ for relation $r$ is $z_r$, the error parameter of relation classification result $r$ is $v_r$, and the relation coefficient of relation classification result $r$ is $g_r$; judging the relations in the knowledge data and outputting the relation result.
4. The knowledge base construction method for standard digitization according to claim 1, wherein the method for screening the named entities by using a screening model to obtain a first entity comprises:
screening out, from the named entities, an optimal model containing a plurality of named entities, and setting the optimal model as the initial value;
removing the initial value from the full set of named entities, combining the remaining named entities into subsets, and combining each subset with the initial value;
screening the subsets by using an all-subsets model and selecting the model whose statistics are significant; stopping the calculation if no named entity is added, and otherwise continuing the screening;
judging whether the statistics of the subset are all significant and the regression error is minimal; if so, re-forming the subset, and otherwise screening out a first set of named entities with significant statistics;
performing regression on all subsets, deleting the independent variable whose statistic has the smallest absolute value below the critical value, and repeating the operation until all subsets have been traversed;
judging whether the statistics of all independent variables are significant; if not, deleting again, and otherwise screening out a second set of named entities with significant statistics;
merging the first set and the second set of named entities, deleting repeated named entities to obtain a named entity set, and screening the named entity set according to regression error to obtain the first entity.
5. The knowledge base construction method for standard digitization according to claim 1, wherein the method for screening the named entities according to the knowledge association level to obtain a second entity comprises:
obtaining an association matrix of the named entities according to the knowledge association levels among the named entities, classifying the named entities according to the association, and calculating a first screening index from the maximum association degree within a named entity class and the minimum association degree between classes, wherein the expression of the first screening index is as follows:
wherein, for the $a$-th named entity, the association degree within a named entity class is $x_n$ and the association degree between named entity classes is $x_o$;
obtaining a second screening index according to the information entropy of the named entities and the sample variance of the named entities, wherein the calculation formula of the second screening index is as follows:
wherein the number of named entities is $w$, the proportion of the $j$-th index of the $b$-th named entity is $u_{bj}$, and the $b$-th named entity is $s_b$, the sample variance being taken about the mean value of the named entities; obtaining a decisive screening index according to the first screening index and the second screening index:
wherein the decisive screening index is evaluated for the $a$-th named entity; sorting the decisive screening indexes in descending order, and taking the named entity with the largest decisive screening index as the second entity.
6. The knowledge base construction method for standard digitization according to claim 1, wherein the method for classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data comprises:
calculating the density of each comprehensive entity at the $t$-th iteration:
wherein the $i$-th comprehensive entity is $s_i$ and the $j$-th comprehensive entity is $s_j$, and the density of comprehensive entity $s_i$ at the $t$-th iteration is defined in terms of the $k$-th nearest neighbor of comprehensive entity $s_j$ and the reverse $k$-nearest-neighbor set of comprehensive entity $s_i$ within the whole set of comprehensive entities at the $t$-th iteration; judging the stripping mark of each comprehensive entity at the $t$-th iteration, wherein the expression is as follows:
wherein the stripping mark of comprehensive entity $i$ at the $t$-th iteration is determined from the initial density of comprehensive entity $i$ and the density threshold of comprehensive entity $i$ at the $t$-th iteration; determining the stripped boundary set, wherein the expression is:
wherein $\chi$ is the stripping mark according to which the boundary set is stripped at the $t$-th iteration, and $S^{(t)}$ is the set of comprehensive entities remaining after the $t$-th stripping iteration; determining the remaining comprehensive entity set after stripping;
judging whether the stripping is finished; if not, continuing to strip; if so, obtaining the connection threshold of the comprehensive entities and completing the initial clustering;
performing fuzzy division on the boundary comprehensive entities, wherein the expression is as follows:
wherein the $a$-th cluster is $D_a$, and $z(s_i, D_a)$ relates boundary comprehensive entity $s_i$ to cluster $D_a$ and is computed from the distances between $s_i$ and the initial clusters, including the $j$-th initial cluster and the $a$-th initial cluster; obtaining the classification data according to the fuzzy division.
7. A knowledge base construction system for standard digitization, comprising:
a preprocessing module, configured to acquire standard digitized knowledge data and preprocess the knowledge data;
an extraction module, configured to acquire the knowledge association level according to the preprocessed knowledge data and extract named entities and entity relations from the knowledge data;
an entity screening module, configured to screen the named entities by using a screening model to obtain a first entity, and screen the named entities according to the knowledge association level to obtain a second entity;
a classification module, configured to combine the first entity and the second entity to obtain a comprehensive entity, and classify the knowledge data by using a classification model according to the comprehensive entity to obtain classification data;
a construction module, configured to construct a knowledge base model according to the entity relations and the classification data, and output a knowledge base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311635195.2A CN117667890B (en) | 2023-12-01 | 2023-12-01 | Knowledge base construction method and system for standard digitization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117667890A true CN117667890A (en) | 2024-03-08 |
CN117667890B CN117667890B (en) | 2024-08-02 |
Family
ID=90083858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311635195.2A Active CN117667890B (en) | 2023-12-01 | 2023-12-01 | Knowledge base construction method and system for standard digitization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117667890B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875051A (en) * | 2018-06-28 | 2018-11-23 | 中译语通科技股份有限公司 | Knowledge mapping method for auto constructing and system towards magnanimity non-structured text |
CN111737496A (en) * | 2020-06-29 | 2020-10-02 | 东北电力大学 | Power equipment fault knowledge map construction method |
US20210081376A1 (en) * | 2018-05-25 | 2021-03-18 | ZFusion Technology Co., Ltd. Xiamen | Construction method, device, computing device, and storage medium for constructing patent knowledge database |
CN114792145A (en) * | 2022-05-27 | 2022-07-26 | 中国标准化研究院 | Standard digital management maintenance system and method based on knowledge graph |
CN114925212A (en) * | 2022-05-06 | 2022-08-19 | 神州医疗科技股份有限公司 | Relation extraction method and system for automatically judging and fusing knowledge graph |
CN115329101A (en) * | 2022-09-06 | 2022-11-11 | 南京邮电大学 | Electric power Internet of things standard knowledge graph construction method and device |
Non-Patent Citations (7)
Title |
---|
SANGHYUK ROY CHOI 等: "Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review", 《BIOLOGY》, vol. 12, no. 7, 22 July 2023 (2023-07-22), pages 1 - 29 * |
孙嘉睿 等: "模糊边界剥离聚类", 《山东大学学报(理学版)》, vol. 59, no. 3, 28 November 2023 (2023-11-28), pages 2 - 3 * |
岳喜超 等: "结合主成分与熵权的关键变量筛选算法", 《中国电子科学研究院学报》, vol. 18, no. 7, 20 July 2023 (2023-07-20), pages 2 * |
彭维湘: "基于VBA编程的全子集模型筛选算法", 《统计与决策》, vol. 39, no. 11, 7 June 2023 (2023-06-07), pages 1 - 3 * |
李智冈 等: "基于加权核密度估计与微簇合并的密度峰值聚类算法", 《信息与控制》, 2 November 2023 (2023-11-02), pages 2 * |
谢雪莲: "基于知识图谱嵌入的急性心肌梗死辅助诊断模型", 《中国优秀硕士学位论文全文数据库 医药卫生科技辑》, no. 02, 15 February 2023 (2023-02-15), pages 062 - 233 * |
赵小瑞 等: "基于空间格的连续时间空域特征提取算法", 《舰船电子对抗》, vol. 46, no. 4, 25 August 2023 (2023-08-25), pages 1 - 2 * |
Also Published As
Publication number | Publication date |
---|---|
CN117667890B (en) | 2024-08-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105677873B (en) | Text Intelligence association cluster based on model of the domain knowledge collects processing method | |
CN109189767B (en) | Data processing method and device, electronic equipment and storage medium | |
CN108038492A (en) | A kind of perceptual term vector and sensibility classification method based on deep learning | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
CN110633365A (en) | Word vector-based hierarchical multi-label text classification method and system | |
CN110569982A (en) | Active sampling method based on meta-learning | |
CN110097096B (en) | Text classification method based on TF-IDF matrix and capsule network | |
CN112214335B (en) | Web service discovery method based on knowledge graph and similarity network | |
CN112529638B (en) | Service demand dynamic prediction method and system based on user classification and deep learning | |
CN109033087B (en) | Method for calculating text semantic distance, deduplication method, clustering method and device | |
CN116756347B (en) | Semantic information retrieval method based on big data | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN114611491A (en) | Intelligent government affair public opinion analysis research method based on text mining technology | |
CN114093445B (en) | Patient screening marking method based on partial multi-marking learning | |
CN117351484B (en) | Tumor stem cell characteristic extraction and classification system based on AI | |
CN113722494A (en) | Equipment fault positioning method based on natural language understanding | |
CN117667890B (en) | Knowledge base construction method and system for standard digitization | |
CN116629716A (en) | Intelligent interaction system work efficiency analysis method | |
CN116401368A (en) | Intention recognition method and system based on topic event analysis | |
CN113222018B (en) | Image classification method | |
CN111767402B (en) | Limited domain event detection method based on counterstudy | |
CN112487816B (en) | Named entity identification method based on network classification | |
CN117437976B (en) | Disease risk screening method and system based on gene detection | |
CN113010668B (en) | Text clustering method, text clustering device, electronic equipment and computer readable storage medium | |
CN114036946B (en) | Text feature extraction and auxiliary retrieval system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |