CN117667890A - Knowledge base construction method and system for standard digitization - Google Patents

Knowledge base construction method and system for standard digitization

Info

Publication number
CN117667890A
Authority
CN
China
Prior art keywords
entity
named
knowledge
screening
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311635195.2A
Other languages
Chinese (zh)
Other versions
CN117667890B (en
Inventor
岳高峰
高亮
王志强
温娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China National Institute of Standardization
Original Assignee
China National Institute of Standardization
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China National Institute of Standardization
Priority to CN202311635195.2A
Publication of CN117667890A
Application granted
Publication of CN117667890B
Legal status: Active
Anticipated expiration

Classifications

    • G06F16/21  Design, administration or maintenance of databases
    • G06F16/285  Clustering or classification
    • G06F16/288  Entity relationship models
    • G06F18/22  Matching criteria, e.g. proximity measures
    • G06F18/2321  Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/241  Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F40/295  Named entity recognition
    • Y02P90/30  Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge base construction method and system for standard digitization. The method comprises: acquiring standard digitized knowledge data and preprocessing the knowledge data; obtaining a knowledge association level from the preprocessed knowledge data and extracting named entities and entity relations; screening the named entities with a screening model to obtain first entities, and screening the named entities according to the knowledge association level to obtain second entities; merging the first entities and the second entities to obtain comprehensive entities; classifying the knowledge data with a classification model according to the comprehensive entities to obtain classification data; and constructing a knowledge base model from the entity relations and the classification data and outputting a knowledge base. The method can improve the precision of the carbon emission method of the low-carbon park, has better interpretability, and can be applied directly to a carbon emission system of the low-carbon park.

Description

Knowledge base construction method and system for standard digitization
Technical Field
The invention relates to the field of standard digitization, in particular to a knowledge base construction method and system for standard digitization.
Background
Knowledge base construction technology is widely applied in the field of standard digitization. It helps the builders of a standard digitization knowledge base construct the knowledge base promptly and efficiently and optimize the processing of its data. At present, knowledge bases are characterized by a huge volume of user information, diverse data types and high information density, and construction methods involve many uncertain factors, so methods for constructing a standard digitization knowledge base carry considerable uncertainty. Although some knowledge base construction methods and systems for standard digitization have been proposed, this uncertainty has not yet been effectively resolved.
Disclosure of Invention
The invention aims to provide a knowledge base construction method and system for standard digitization.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
the invention comprises the following steps:
acquiring standard digitized knowledge data, and preprocessing the knowledge data;
acquiring a knowledge association level from the preprocessed knowledge data, and extracting named entities and entity relations from the knowledge data, comprising:
calculating the similarity between knowledge data:
wherein $\eta(R, C)$ is the similarity between knowledge data R and knowledge data C, $R \cup C$ is the union of knowledge data R and knowledge data C, and $R \cap C$ is their intersection; a similarity of 0.83 to 1 corresponds to knowledge association level one, 0.51 to 0.82 to level two, 0.31 to 0.5 to level three, and 0 to 0.3 to level four;
screening the named entities with a screening model to obtain a first entity, and screening the named entities according to the knowledge association level to obtain a second entity;
merging the first entity and the second entity to obtain a comprehensive entity, and classifying the knowledge data with a classification model according to the comprehensive entity to obtain classification data, comprising:
fusing the first entity and the second entity, deleting duplicate named entities, obtaining the local density and the relative distance of each named entity, and calculating a decision value:
wherein the decision value $\theta_e$ of named entity e is computed from its local density and its relative distance $\mu_e$; the decision values are sorted in descending order, the first n named entities are selected as cluster centers, and the first m named entities smaller than n are taken as micro-cluster centers;
increasing the number of micro-cluster centers, and selecting the number of micro-cluster centers that has the least influence on the result;
assigning the remaining named entities to the clusters of the named entities with higher density and closer distance, according to the assignment strategy of the density peak clustering algorithm;
calculating the similarity between named entities:
wherein $F_{ej}$ is the similarity between named entity e and named entity j, named entity j is among the K nearest neighbors of named entity e, $j \in K(e)$, and $c_{eg}$ is the Euclidean distance between named entity g and named entity e; a similarity matrix between named entities is constructed;
calculating the similarity among the micro-clusters:
wherein the similarity between micro-cluster $v_e$ and micro-cluster $v_j$ is computed from $F_{ej}$, the similarity of micro-cluster $v_e$ to micro-cluster $v_j$, and $F_{je}$, the similarity of micro-cluster $v_j$ to micro-cluster $v_e$, with $v_j$ the j-th micro-cluster center; a similarity matrix among the micro-clusters is constructed from these similarities, each micro-cluster that contains no final cluster center is merged with the most similar micro-cluster that contains a cluster center, and the comprehensive entity is output;
and constructing a knowledge base model according to the entity relation and the classification data, and outputting a knowledge base.
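The similarity expression itself is not legible in the steps above, but the union and intersection it names suggest a set-overlap ratio. The following is a minimal sketch under that assumption: it treats the similarity as the Jaccard index over token sets and maps it to the four association levels. The function names `jaccard` and `association_level` are illustrative, and assigning values that fall between the stated ranges (for example 0.825) to the lower level is also an assumption.

```python
def jaccard(r: set, c: set) -> float:
    """Assumed form of eta(R, C): |R ∩ C| / |R ∪ C|."""
    union = r | c
    if not union:
        return 0.0  # convention for two empty sets (assumption)
    return len(r & c) / len(union)


def association_level(similarity: float) -> int:
    """Map a similarity in [0, 1] to knowledge association level 1 to 4."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if similarity >= 0.83:
        return 1
    if similarity >= 0.51:
        return 2
    if similarity >= 0.31:
        return 3
    return 4


r = {"mobile", "shopping", "conversion", "rate"}
c = {"mobile", "payment", "conversion"}
eta = jaccard(r, c)             # 2 shared tokens out of 5 distinct ones
level = association_level(eta)
```

How the knowledge data are turned into sets (characters, words, or keywords) is not stated in the text, so the token sets here are placeholders.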
Further, the preprocessing includes removing duplicate data, word segmentation, stop-word removal, extracting the knowledge association level, smoothing noisy data, normalization and digitization.
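Several of the preprocessing operations listed above can be illustrated briefly. The sketch below is only an assumption about how such a pipeline might look: it covers deduplication, a simple regex-based word segmentation, stop-word removal and min-max normalization. The stop-word list and the names `preprocess` and `min_max_normalize` are hypothetical, and segmentation of Chinese text would need a dedicated tokenizer instead of the regex used here.

```python
import re

STOP_WORDS = {"the", "of", "and", "a", "in", "is", "to"}  # illustrative list


def preprocess(records: list[str]) -> list[list[str]]:
    """Deduplicate records, split them into tokens, and drop stop words."""
    seen = set()
    result = []
    for text in records:
        key = text.strip().lower()
        if not key or key in seen:
            continue  # skip empty and duplicate records
        seen.add(key)
        tokens = re.findall(r"\w+", key)
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result


def min_max_normalize(values: list[float]) -> list[float]:
    """Scale numeric values to [0, 1]; constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


docs = preprocess([
    "The share of mobile shopping is rising",
    "the share of mobile shopping is rising",   # duplicate after lowercasing
    "Medical image recognition assists doctors in diagnosis",
])
scaled = min_max_normalize([2.0, 4.0, 6.0])
```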
Further, the method for extracting the named entities and entity relations from the knowledge data comprises:
extracting keywords of the knowledge data, gridding the keywords and mapping them into a rectangular coordinate system, and obtaining a histogram whose horizontal axis is the frequency with which keywords occur in a cell and whose vertical axis is the number of cells containing the same number of keywords;
setting a threshold on the number of points in a cell, temporarily retaining a cell if its count is greater than or equal to the threshold, sorting the cells by the number of keywords they contain, and retaining the top three cells;
for each temporarily retained cell, taking the cell as the center, permanently retaining it if the number of temporarily retained cells exceeds the threshold, and otherwise removing it;
forming the retained cells into a feature cell matrix, expressed as:
$$\begin{bmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{bmatrix}$$
wherein $u_{11}$ is the retained cell in row 1, column 1, m is the number of rows and n is the number of columns of the feature cell matrix; the elements of the feature cell matrix are output as named entities;
assigning weights to the different phrases in a sentence to obtain an embedded representation of sentence-level features, and determining the relation type in the sentence through a fully connected layer, expressed as:
$$z_r = \beta(g_r B_c + v_r)$$
wherein $B_c$ is the embedding of sentence c, $\beta$ is the sigmoid function, $z_r$ is the result for relation class r of sentence c, $v_r$ is the error parameter of relation class r, and $g_r$ is the relation coefficient of relation class r; the relation of the knowledge data is determined and the relation result is output.
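The relation classification expression above is a sigmoid applied to a linear function of the sentence embedding. A minimal sketch of that computation follows; the embedding, coefficients and the three relation types are made-up values for illustration, and choosing the highest-scoring relation is an assumption about how the result is read off.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relation_scores(b_c, g, v):
    """Return z_r = beta(g_r . B_c + v_r) for every relation type r."""
    return [
        sigmoid(sum(w * x for w, x in zip(g_r, b_c)) + v_r)
        for g_r, v_r in zip(g, v)
    ]


b_c = [0.5, -1.0, 2.0]                 # stand-in sentence embedding
g = [[1.0, 0.0, 0.5],                  # coefficients for relation type 0
     [0.0, 1.0, 0.0],                  # relation type 1
     [-1.0, 0.0, 0.0]]                 # relation type 2
v = [0.0, 0.0, 0.0]                    # error parameters
z = relation_scores(b_c, g, v)
predicted = max(range(len(z)), key=z.__getitem__)  # highest-scoring type
```

In a trained model the embedding would come from the phrase-weighting step and the coefficients from training; neither is specified further in the text.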
Further, the method for screening the named entities with the screening model to obtain the first entity comprises:
screening, from the named entities, an optimal model containing a plurality of named entities, and setting the optimal model as the initial value;
removing the initial value from all named entities, combining the remainder into subsets, and combining the subsets with the initial value;
screening the subsets with an all-subsets model and selecting the models whose statistics are significant; stopping the calculation if no named entity is added, and otherwise continuing the screening;
judging whether the statistics of the subset are all significant and the regression error is minimal; if so, re-forming the subset, and otherwise screening out a first set of named entities with significant statistics;
performing regression on all subsets, deleting the independent variable whose statistic has the smallest absolute value below the critical value, and repeating the operation until all subsets are traversed;
judging whether the statistics of all independent variables are significant; if not, deleting again, and otherwise screening out a second set of named entities with significant statistics;
merging the first set of named entities and the second set of named entities, deleting duplicate named entities to obtain a named entity set, and screening the named entity set according to regression error to obtain the first entity.
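The deletion step above (regress, remove the independent variable whose statistic is smallest and below the critical value, repeat) resembles backward elimination in stepwise regression. The sketch below illustrates only that step, under several assumptions the text does not confirm: the statistic is the ordinary least squares t statistic, the critical value is 2.0, and a hypothetical response variable `y` is available for each sample. All function names are illustrative.

```python
def solve_inverse(a):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]


def t_statistics(x, y):
    """OLS fit with intercept; return the t statistic of every column of x."""
    rows = [[1.0] + list(r) for r in x]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    inv = solve_inverse(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [t - sum(b * v for b, v in zip(beta, r)) for r, t in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(rows) - k)
    return [beta[i] / (sigma2 * inv[i][i]) ** 0.5 for i in range(1, k)]


def backward_eliminate(x, y, names, critical=2.0):
    """Repeatedly drop the variable with the smallest |t| below `critical`."""
    keep = list(range(len(names)))
    while keep:
        t = t_statistics([[r[i] for i in keep] for r in x], y)
        worst = min(range(len(keep)), key=lambda i: abs(t[i]))
        if abs(t[worst]) >= critical:
            break
        del keep[worst]
    return [names[i] for i in keep]


# Toy data: y depends on x1 only; x2 is an unrelated alternating signal.
x = [[1, 1], [2, -1], [3, 1], [4, -1], [5, 1], [6, -1], [7, 1], [8, -1]]
y = [2.1, 3.9, 5.9, 8.1, 10.1, 11.9, 13.9, 16.1]
selected = backward_eliminate(x, y, ["x1", "x2"])
```

The sketch needs more samples than variables and a fit that is not exact, since the residual variance appears in the denominator of the t statistic.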
Further, the method for screening the named entities according to the knowledge association level to obtain the second entity comprises:
obtaining an association matrix of the named entities from the knowledge association levels among them, classifying the named entities by association, and calculating a first screening index from the maximum association degree within a named entity class and the minimum association degree between classes, the first screening index being expressed as:
wherein the first screening index of named entity a is computed from $x_n$, the association degree within a named entity class, and $x_o$, the association degree between named entity classes;
obtaining a second screening index from the information entropy of the named entities and the sample variance of the named entities, the second screening index being calculated as:
wherein w is the number of named entities, $u_{bj}$ is the proportion of the j-th index of the b-th named entity, $s_b$ is the b-th named entity, and the mean is taken over the named entities; a decisive screening index is obtained from the first screening index and the second screening index:
the decisive screening indices of the named entities are sorted in descending order, and the named entity with the largest decisive screening index is taken as the second entity.
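The expressions for the screening indices are not legible in the text, so the way the ingredients are combined cannot be reproduced. The sketch below therefore computes only the two quantities the second screening index is said to depend on: an information entropy over the proportions $u_{bj}$ and a sample variance. Normalizing the entropy by $\ln w$, as in the entropy weight method, is an assumption, and the function names are illustrative.

```python
import math


def column_entropy(u: list[list[float]], j: int) -> float:
    """Normalized Shannon entropy of index j over w named entities.

    u[b][j] is the proportion of index j held by named entity b, so each
    column of u is assumed to sum to 1.
    """
    w = len(u)
    if w < 2:
        raise ValueError("need at least two named entities")
    total = 0.0
    for b in range(w):
        p = u[b][j]
        if p > 0:
            total -= p * math.log(p)
    return total / math.log(w)   # 1.0 means the index does not discriminate


def sample_variance(s: list[float]) -> float:
    """Unbiased sample variance (divides by len(s) - 1)."""
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / (len(s) - 1)


u = [[0.25, 0.7], [0.25, 0.1], [0.25, 0.1], [0.25, 0.1]]
entropy_uniform = column_entropy(u, 0)   # uniform column: no discrimination
entropy_skewed = column_entropy(u, 1)    # skewed column: lower entropy
variance = sample_variance([1.0, 2.0, 3.0, 4.0])
```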
Further, the method for classifying the knowledge data with a classification model according to the comprehensive entity to obtain classification data comprises:
calculating the density of the comprehensive entities in the t-th iteration:
wherein the density of comprehensive entity i in the t-th iteration is defined from $s_i$, the i-th comprehensive entity, $s_j$, the j-th comprehensive entity, the k-th nearest neighbor of $s_j$ in the t-th iteration, and the reverse k-nearest-neighbor set of $s_i$ among all comprehensive entities in the t-th iteration; the peeling flag of each comprehensive entity in the t-th iteration is then determined, expressed as:
wherein the peeling flag of comprehensive entity i in the t-th iteration is determined from its initial density and its density threshold in the t-th iteration; the peeled boundary set is then determined, expressed as:
wherein the boundary set peeled in the t-th iteration is selected according to the peeling flag $\chi$, and $S^{(t)}$ is the set of comprehensive entities remaining after the t-th peeling iteration; the remaining comprehensive entity set is determined after each peel;
judging whether peeling is finished; if not, continuing to peel, and if so, obtaining the connection threshold of the comprehensive entities and completing the initial clustering;
performing fuzzy partition on the boundary comprehensive entities, expressed as:
wherein $D_a$ is the a-th cluster and $z(s_i, D_a)$ is the fuzzy partition value of boundary comprehensive entity $s_i$ with respect to cluster $D_a$, computed from the distance of $s_i$ to the a-th initial cluster and its distances to each j-th initial cluster; the classification data are obtained from the fuzzy partition.
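The density and peeling expressions are not legible in the text, but the description refers to k nearest neighbors and reverse k-nearest-neighbor sets. As a rough illustration only, the sketch below assumes the density of a point is the size of its reverse k-nearest-neighbor set and performs a single peeling pass that separates low-density points as boundary points. The functions `knn`, `reverse_knn_density` and `peel_once`, and the fixed threshold, are hypothetical stand-ins.

```python
import math


def knn(points, k):
    """Indices of the k nearest neighbours of every point (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(others[:k])
    return result


def reverse_knn_density(points, k):
    """Density of point i = number of points that list i among their k-NN."""
    density = [0] * len(points)
    for neighbours in knn(points, k):
        for j in neighbours:
            density[j] += 1
    return density


def peel_once(points, k, threshold):
    """Split indices into (boundary, remaining) by a density threshold."""
    density = reverse_knn_density(points, k)
    boundary = [i for i, d in enumerate(density) if d < threshold]
    remaining = [i for i, d in enumerate(density) if d >= threshold]
    return boundary, remaining


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
density = reverse_knn_density(points, k=2)
boundary, remaining = peel_once(points, k=2, threshold=1)
```

In the example the isolated point (5, 5) is nobody's nearest neighbor, so it is the only one peeled. An iterative version would recompute densities on the remaining set after each pass.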
Further, the method for constructing the knowledge base model from the entity relations and the classification data comprises:
constructing the knowledge base model on the basis of a graph neural network and a Transformer model, modeling the entity relations and the classification data with the graph neural network, in which the classification data are represented as nodes and the entity relations as edges, and using the graph neural network to learn the node representations and the relations among the nodes;
inputting the classification data into the Transformer model, learning the text representation with the self-attention mechanism and context information, and fusing the entity representation learned by the graph neural network with the text representation learned by the Transformer model by weighted averaging;
storing the fused entity and text representations in the knowledge base, inferring new knowledge relations with the graph neural network, dividing the classification data into a training set and a test set by random sampling, training the knowledge base with the training set, testing the trained knowledge base with the test set, and stopping training when the AUC values on the test set are all greater than or equal to 0.64; otherwise adding a crossover operator to the test set data and continuing training.
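Two pieces of the construction step lend themselves to a short illustration: the weighted-average fusion of the two representations and the AUC-based stopping check. The sketch below assumes both representations are vectors of the same dimension and that the fusion weight `alpha` is a free parameter, neither of which is stated in the text. The AUC is computed by pairwise comparison of positive and negative scores, and the labels and scores are made up.

```python
def fuse(graph_vec, text_vec, alpha=0.5):
    """Weighted average of a GNN entity vector and a Transformer text vector."""
    if len(graph_vec) != len(text_vec):
        raise ValueError("representations must share one dimension")
    return [alpha * g + (1 - alpha) * t for g, t in zip(graph_vec, text_vec)]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


AUC_THRESHOLD = 0.64  # stopping value given in the text

fused = fuse([1.0, 0.0], [0.0, 1.0], alpha=0.25)
score = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
stop_training = score >= AUC_THRESHOLD
```

The graph neural network, the Transformer model and the crossover operator are outside the scope of this sketch.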
In a second aspect, a knowledge base construction system for standard digitization comprises:
a preprocessing module, configured to acquire standard digitized knowledge data and preprocess the knowledge data;
an extraction module, configured to obtain a knowledge association level from the preprocessed knowledge data and extract named entities and entity relations from the knowledge data;
an entity screening module, configured to screen the named entities with a screening model to obtain a first entity, and to screen the named entities according to the knowledge association level to obtain a second entity;
a classification module, configured to merge the first entity and the second entity to obtain a comprehensive entity, and to classify the knowledge data with a classification model according to the comprehensive entity to obtain classification data;
a construction module, configured to construct a knowledge base model from the entity relations and the classification data and output a knowledge base.
The beneficial effects of the invention are as follows:
compared with the prior art, the invention has the following technical effects:
through preprocessing, named entity and entity relation extraction, named entity screening, data classification and knowledge base construction, the invention improves the accuracy of standard digitization knowledge base construction. It greatly saves resources and labor cost, improves working efficiency, and allows a knowledge base to be constructed from standard digitized knowledge data in real time, which is of significance for standard digitization knowledge base construction. It can adapt to different knowledge base construction systems and to the construction requirements of different users, and therefore has a certain universality.
Drawings
FIG. 1 is a flowchart illustrating steps of a knowledge base construction method and system for standard digitization according to the present invention.
Detailed Description
The invention is further described by the following specific examples, which are presented to illustrate, but not to limit, the invention.
The invention provides a knowledge base construction method and system for standard digitization.
As shown in FIG. 1, in this embodiment the steps include:
acquiring standard digitized knowledge data, and preprocessing the knowledge data;
acquiring a knowledge association level from the preprocessed knowledge data, and extracting named entities and entity relations from the knowledge data, comprising:
calculating the similarity between knowledge data:
wherein $\eta(R, C)$ is the similarity between knowledge data R and knowledge data C, $R \cup C$ is the union of knowledge data R and knowledge data C, and $R \cap C$ is their intersection; a similarity of 0.83 to 1 corresponds to knowledge association level one, 0.51 to 0.82 to level two, 0.31 to 0.5 to level three, and 0 to 0.3 to level four;
in an actual evaluation, two standard digitized texts are provided:
text 1: "With the popularization of smartphones, the share of mobile shopping in e-commerce is increasing year by year. Data show that the mobile shopping share reached XX% in XXXX and is expected to rise to XX% in year XX. E-commerce enterprises therefore need to optimize the mobile user experience and improve the mobile conversion rate.";
text 2: "Medical image recognition technology uses deep learning algorithms to automatically analyze and recognize medical images and assist doctors in diagnosing diseases. The technology can greatly improve diagnostic efficiency and accuracy and reduce doctors' workload. At present, medical image recognition technology is widely applied in many fields, such as lung nodule detection and fundus lesion screening.";
the similarity between text 1 and text 2 is 0.13, so their knowledge association level is four; the named entities extracted from text 1 are smartphone, e-commerce, mobile shopping, XXXX year, XX%, user experience and conversion rate, and the entity relations of text 1 are: smartphones are becoming popular; the share of mobile shopping in e-commerce is increasing year by year; the mobile shopping share in XXXX increased; the mobile shopping share in year XX is expected to increase; the mobile shopping share in year XX increased; e-commerce enterprises need to optimize the mobile user experience; e-commerce enterprises need to improve the mobile conversion rate;
the named entities extracted from text 2 are medical image recognition technology, deep learning algorithm, medical image, doctor, disease diagnosis, lung nodule detection and fundus lesion screening, and the entity relations of text 2 are: medical image recognition technology uses deep learning algorithms; medical image recognition technology automatically analyzes and recognizes medical images; medical image recognition technology assists doctors in disease diagnosis; medical image recognition technology improves diagnostic efficiency and accuracy; medical image recognition technology reduces doctors' workload; the application fields of medical image recognition technology are lung nodule detection and fundus lesion screening;
screening the named entities with a screening model to obtain a first entity, and screening the named entities according to the knowledge association level to obtain a second entity;
in the actual evaluation, the first entity of text 1 is smartphone, e-commerce, mobile shopping, user experience and conversion rate, and the first entity of text 2 is medical image recognition technology, deep learning algorithm, disease diagnosis, lung nodule detection and fundus lesion screening; the second entity of text 1 is smartphone, e-commerce, mobile shopping and conversion rate, and the second entity of text 2 is medical image recognition technology, disease diagnosis, lung nodule detection and fundus lesion screening;
merging the first entity and the second entity to obtain a comprehensive entity, and classifying the knowledge data with a classification model according to the comprehensive entity to obtain classification data, comprising:
fusing the first entity and the second entity, deleting duplicate named entities, obtaining the local density and the relative distance of each named entity, and calculating a decision value:
wherein the decision value $\theta_e$ of named entity e is computed from its local density and its relative distance $\mu_e$; the decision values are sorted in descending order, the first n named entities are selected as cluster centers, and the first m named entities smaller than n are taken as micro-cluster centers;
increasing the number of micro-cluster centers, and selecting the number of micro-cluster centers that has the least influence on the result;
assigning the remaining named entities to the clusters of the named entities with higher density and closer distance, according to the assignment strategy of the density peak clustering algorithm;
calculating the similarity between named entities:
wherein $F_{ej}$ is the similarity between named entity e and named entity j, named entity j is among the K nearest neighbors of named entity e, $j \in K(e)$, and $c_{eg}$ is the Euclidean distance between named entity g and named entity e; a similarity matrix between named entities is constructed;
calculating the similarity among the micro-clusters:
wherein the similarity between micro-cluster $v_e$ and micro-cluster $v_j$ is computed from $F_{ej}$, the similarity of micro-cluster $v_e$ to micro-cluster $v_j$, and $F_{je}$, the similarity of micro-cluster $v_j$ to micro-cluster $v_e$, with $v_j$ the j-th micro-cluster center; a similarity matrix among the micro-clusters is constructed from these similarities, each micro-cluster that contains no final cluster center is merged with the most similar micro-cluster that contains a cluster center, and the comprehensive entity is output;
in the actual evaluation, the comprehensive entity of text 1 is smartphone, e-commerce, mobile shopping, user experience and conversion rate, and the comprehensive entity of text 2 is medical image recognition technology, deep learning algorithm, disease diagnosis, lung nodule detection and fundus lesion screening;
the classification data of text 1 are: class 1, smartphone, e-commerce and mobile shopping; class 2, user experience and conversion rate; the classification data of text 2 are: class 1, deep learning algorithm; class 2, medical image recognition technology, disease diagnosis, lung nodule detection and fundus lesion screening;
and constructing a knowledge base model according to the entity relation and the classification data, and outputting a knowledge base.
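The cluster-center selection in the steps above follows the pattern of the density peak clustering algorithm that the text names. The decision-value expression is not legible, so the sketch below assumes the usual form from that algorithm: the decision value is the product of the local density (points within a cutoff distance) and the relative distance (distance to the nearest point of higher density). The cutoff, the toy coordinates and the function names are illustrative assumptions.

```python
import math


def decision_values(points, cutoff):
    """Return (density, relative distance, decision value) per point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    mu = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser point gets its largest distance instead.
        mu.append(min(higher) if higher else max(dist[i]))
    theta = [r * m for r, m in zip(rho, mu)]
    return rho, mu, theta


def top_centers(theta, n_centers):
    """Indices of the n_centers largest decision values, descending."""
    order = sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)
    return order[:n_centers]


points = [(0, 0), (0, 1), (1, 0), (0.4, 0.4),
          (10, 10), (10, 11), (11, 10), (10.4, 10.4)]
rho, mu, theta = decision_values(points, cutoff=0.8)
centers = top_centers(theta, 2)   # one center per visible group
```

Named entities would first need a vector representation for the distances to be meaningful; the text does not say which representation is used.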
In this embodiment, the preprocessing includes removing duplicate data, word segmentation, stop-word removal, extracting the knowledge association level, smoothing noisy data, normalization and digitization.
In this embodiment, the method for extracting the named entities and entity relations from the knowledge data comprises:
extracting keywords of the knowledge data, gridding the keywords and mapping them into a rectangular coordinate system, and obtaining a histogram whose horizontal axis is the frequency with which keywords occur in a cell and whose vertical axis is the number of cells containing the same number of keywords;
setting a threshold on the number of points in a cell, temporarily retaining a cell if its count is greater than or equal to the threshold, sorting the cells by the number of keywords they contain, and retaining the top three cells;
for each temporarily retained cell, taking the cell as the center, permanently retaining it if the number of temporarily retained cells exceeds the threshold, and otherwise removing it;
forming the retained cells into a feature cell matrix, expressed as:
$$\begin{bmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{bmatrix}$$
wherein $u_{11}$ is the retained cell in row 1, column 1, m is the number of rows and n is the number of columns of the feature cell matrix; the elements of the feature cell matrix are output as named entities;
assigning weights to the different phrases in a sentence to obtain an embedded representation of sentence-level features, and determining the relation type in the sentence through a fully connected layer, expressed as:
$$z_r = \beta(g_r B_c + v_r)$$
wherein $B_c$ is the embedding of sentence c, $\beta$ is the sigmoid function, $z_r$ is the result for relation class r of sentence c, $v_r$ is the error parameter of relation class r, and $g_r$ is the relation coefficient of relation class r; the relation of the knowledge data is determined and the relation result is output.
In this embodiment, the method for screening the named entity to obtain the first entity by using a screening model includes:
screening an optimal model containing a plurality of named entities from the named entities, and setting the optimal model as an initial value;
removing the initial value from all named entities, combining the remaining named entities into subsets, and merging each subset with the initial value;
screening the subsets with a full-subset model and selecting the models whose statistics are significant; stopping the calculation if no named entity is added, and otherwise continuing the screening;
judging whether the statistics of the subset are all significant and the regression error is minimal; if so, forming the subset anew, and otherwise screening out a first set of named entities whose statistics are significant;
performing regression on all subsets, deleting the independent variable whose statistic has the smallest absolute value below the critical value, and repeating this operation until all subsets are traversed;
judging whether the statistics of all independent variables are significant; if not, deleting again, and otherwise screening out a second set of named entities whose statistics are significant;
and merging the named entity first set and the named entity second set, deleting repeated named entities to obtain a named entity set, and screening the named entity set according to regression errors to obtain the first entity.
In this embodiment, the method for screening the named entity to obtain the second entity according to the knowledge association level includes:
obtaining an association matrix of the named entities from the knowledge association levels among the named entities, classifying the named entities by association, and calculating a first screening index of named entity $a$ from the maximum association degree within a named-entity class and the minimum association degree between classes, by an expression in which $x_n$ is the association degree within a named-entity class and $x_o$ is the association degree between named-entity classes;
obtaining a second screening index from the information entropy of the named entities and the sample variance of the named entities, by a formula in which $w$ is the number of named entities, $u_{bj}$ is the proportion of the $j$-th index of the $b$-th named entity, $s_b$ is the $b$-th named entity, and the mean of the named entities also appears;
obtaining a decisive screening index of the $a$-th named entity from the first screening index and the second screening index, sorting the decisive screening indexes in descending order, and taking the named entity with the largest decisive screening index as the second entity.
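The two ingredients named for the second screening index, and the final selection step, can be sketched as below. This is only an illustration under assumptions: Shannon entropy with the natural logarithm and the unbiased sample variance are assumed, and how the patent combines them into the second and decisive screening indexes is not recoverable from this text, so the `decisive` scores in the example are made-up placeholder values.

```python
import math
import statistics


def information_entropy(proportions):
    """Shannon entropy (natural log) of proportions u_bj that sum to 1."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def sample_variance(values):
    """Unbiased sample variance (n - 1 denominator)."""
    return statistics.variance(values)


def select_second_entity(decisive_index):
    """Return the named entity with the largest decisive screening index."""
    return max(decisive_index, key=decisive_index.get)


entropy_uniform = information_entropy([0.5, 0.5])
variance_example = sample_variance([1, 2, 3, 4])
# Placeholder scores for illustration only; not derived from the patent's formula.
decisive = {"tolerance": 0.42, "test method": 0.77, "scope": 0.31}
second_entity = select_second_entity(decisive)
```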
In this embodiment, the method for classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data includes:
calculating the density of each comprehensive entity in the $t$-th iteration by an expression in which $s_i$ is the $i$-th comprehensive entity and $s_j$ is the $j$-th comprehensive entity, and which involves the density of comprehensive entity $i$ in the $t$-th iteration, the $k$-th nearest neighbor of $s_j$ in the $t$-th iteration, and the reverse $k$-nearest-neighbor set of $s_i$ among all comprehensive entities in the $t$-th iteration;
determining the stripping flag of each comprehensive entity in the $t$-th iteration by an expression involving the stripping flag of comprehensive entity $i$, its initial density, and its density threshold in the $t$-th iteration;
determining the stripped boundary set by an expression involving the boundary set stripped in the $t$-th iteration according to the stripping flag $\chi$ and the set $S^{(t)}$ of comprehensive entities remaining after the $t$-th stripping iteration, and determining the remaining comprehensive entity set after stripping;
judging whether the stripping is finished; if not, continuing to strip; if so, obtaining the connection threshold of the comprehensive entities and completing the initial clustering;
performing fuzzy partitioning of the boundary comprehensive entities by an expression in which $D_a$ is the $a$-th cluster and $z(s_i, D_a)$ relates boundary comprehensive entity $s_i$ to cluster $D_a$, and which involves the $j$-th initial cluster, the $a$-th initial cluster, and the distances from boundary comprehensive entity $s_i$ to those initial clusters; classification data is obtained from the fuzzy partition.
In this embodiment, the method for constructing a knowledge base model according to the entity relationship and the classification data includes:
the knowledge base model is built on a graph neural network and a Transformer model; the graph neural network models the entity relations and the classification data, with the classification data represented as nodes and the entity relations as edges, and learns the representations of the nodes and the relations among them;
inputting the classification data into the Transformer model, learning the text representation using the self-attention mechanism and context information, and fusing the entity representation learned by the graph neural network with the text representation learned by the Transformer model by weighted averaging;
storing the fused entity and text representations in the knowledge base, and inferring new knowledge relations with the graph neural network; dividing the classification data into a training set and a test set by random sampling, training the knowledge base with the training set, and testing the trained knowledge base with the test set; stopping training when the AUC values on the test set are all greater than or equal to 0.64, and otherwise adding a crossover operator to the test set data and continuing training.
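The fusion, data split, and stopping criterion described above can be sketched as follows. The fusion weight `alpha`, the 80/20 split ratio, the fixed seed, and all function names are assumptions for illustration; only the weighted averaging, the random sampling, and the 0.64 AUC threshold come from the text. The graph neural network and Transformer themselves are not modeled here.

```python
import random

AUC_THRESHOLD = 0.64  # stopping threshold stated in the text


def fuse(entity_vec, text_vec, alpha=0.5):
    """Weighted average of the entity and text representations (alpha is hypothetical)."""
    return [alpha * e + (1 - alpha) * t for e, t in zip(entity_vec, text_vec)]


def random_split(items, test_ratio=0.2, seed=0):
    """Randomly divide items into a training set and a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def should_stop(labels, scores):
    """Training stops once the test-set AUC reaches the threshold."""
    return auc(labels, scores) >= AUC_THRESHOLD


fused = fuse([1.0, 0.0], [0.0, 1.0])
train, test = random_split(range(10))
test_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise AUC is quadratic in the number of examples, which is adequate for a sketch; a rank-based computation would be preferable for large test sets.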
In this embodiment, the method for optimizing the knowledge base model includes:
initializing a population and dividing it into probes and followers, determining whether a follower follows a probe according to the change in fitness value, and updating the probe positions by an expression in which $t$ is the iteration number, $t_{max}$ is the maximum number of iterations, $\delta$ is the relation parameter, $H$ is a normally distributed random number, $Y$ is the all-ones matrix, $V_T$ is the safety threshold, $E_S$ is the early-warning value of the individual's position, $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base, and the position of the $i$-th individual in dimension $j$ appears at iterations $t$ and $t+1$;
updating the follower positions by an expression involving the globally worst probe position at iteration $t$, the best probe position at iteration $t+1$, the population size $m$, and a matrix $B$ whose elements have absolute value less than or equal to 1;
randomly selecting 13% of the individuals in the population as observers and updating their positions by an expression involving the best individual position at iteration $t$, the random parameter $p$, the fitness value $g_i$ of sparrow individual $i$, the global best fitness value $r_g$, the global worst fitness value $r_w$, and the constant $\rho$; the iteration stops when the fitness value reaches its minimum.
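The symbols listed for the probe update match those of the standard sparrow search algorithm, so a sketch of that standard rule is given here. This is an assumption: the patent's own expression is not reproduced in this text, and the function name, argument names, and the use of one shared normal draw per update are illustrative choices.

```python
import math
import random


def update_probe(position, i, t_max, delta, e_s, v_t, rng):
    """One probe update under the standard sparrow-search rule (assumed form).

    position: coordinates of individual i at iteration t
    i: 1-based index of the individual; t_max: maximum number of iterations
    delta: relation parameter; e_s: early-warning value; v_t: safety threshold
    """
    if e_s < v_t:
        # No danger sensed: scale every coordinate by exp(-i / (delta * t_max)).
        factor = math.exp(-i / (delta * t_max))
        return [x * factor for x in position]
    # Danger sensed: add a normally distributed step H to every coordinate (H * Y).
    h = rng.gauss(0.0, 1.0)
    return [x + h for x in position]


rng = random.Random(0)
safe_move = update_probe([1.0, 2.0], i=1, t_max=10, delta=0.5, e_s=0.2, v_t=0.8, rng=rng)
alarm_move = update_probe([1.0, 2.0], i=1, t_max=10, delta=0.5, e_s=0.9, v_t=0.8, rng=rng)
```

Below the safety threshold the probe contracts its position multiplicatively; at or above it, every coordinate is shifted by the same random step.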
In a second aspect, a knowledge base construction system for standard digitization comprises:
a preprocessing module, configured to acquire standard digitized knowledge data and preprocess the knowledge data;
an extraction module, configured to obtain knowledge association levels from the preprocessed knowledge data and extract named entities and entity relations from the knowledge data;
an entity screening module, configured to screen the named entities with a screening model to obtain a first entity, and to screen the named entities according to the knowledge association level to obtain a second entity;
a classification module, configured to combine the first entity and the second entity to obtain a comprehensive entity, and to classify the knowledge data with a classification model according to the comprehensive entity to obtain classification data;
a construction module, configured to construct a knowledge base model from the entity relations and the classification data and output a knowledge base.
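The data flow among the five modules can be sketched as a simple pipeline. The class name, the callable signatures, and the order-preserving merge of the two entity lists are assumptions made for illustration; the stand-in lambdas in the usage example carry no real logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KnowledgeBasePipeline:
    """Wires the five modules together; each field is a caller-supplied function."""
    preprocess: Callable   # raw data -> preprocessed data
    extract: Callable      # data -> (association levels, named entities, relations)
    screen: Callable       # (entities, levels) -> (first entities, second entities)
    classify: Callable     # (data, comprehensive entities) -> classification data
    construct: Callable    # (relations, classification data) -> knowledge base

    def run(self, raw):
        data = self.preprocess(raw)
        levels, entities, relations = self.extract(data)
        first, second = self.screen(entities, levels)
        # Combine both screening results, dropping duplicates but keeping order.
        comprehensive = list(dict.fromkeys(list(first) + list(second)))
        classified = self.classify(data, comprehensive)
        return self.construct(relations, classified)


pipeline = KnowledgeBasePipeline(
    preprocess=lambda raw: [r.strip() for r in raw],
    extract=lambda data: ({"level": 1}, ["a", "b", "c"], [("a", "rel", "b")]),
    screen=lambda entities, levels: (["a", "b"], ["b", "c"]),
    classify=lambda data, comprehensive: {"comprehensive": comprehensive},
    construct=lambda relations, classified: {"relations": relations, **classified},
)
knowledge_base = pipeline.run([" clause 4.1 "])
```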
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (7)

1. A knowledge base construction method for standard digitization, comprising the steps of:
acquiring standard digitized knowledge data, and preprocessing the knowledge data;
acquiring knowledge association level according to the preprocessed knowledge data, and extracting named entities and entity relations from the knowledge data; comprising the following steps:
calculating the similarity between knowledge data by an expression in which $\eta(R, C)$ is the similarity between knowledge data $R$ and knowledge data $C$, $R \cup C$ is the union of knowledge data $R$ and knowledge data $C$, and $R \cap C$ is the intersection of knowledge data $R$ and knowledge data $C$; a similarity of 0.83 to 1 corresponds to knowledge association level one, a similarity of 0.51 to 0.82 to level two, a similarity of 0.31 to 0.5 to level three, and a similarity of 0 to 0.3 to level four;
screening the named entities with a screening model to obtain a first entity, and screening the named entities according to the knowledge association level to obtain a second entity;
combining the first entity and the second entity to obtain a comprehensive entity, and classifying the knowledge data by using a classification model according to the comprehensive entity to obtain classification data; comprising the following steps:
fusing the first entity and the second entity, deleting repeated named entities, obtaining the local density and the relative distance of each named entity, and calculating a decision value by an expression involving the local density of named entity $e$, in which $\mu_e$ is the relative distance of named entity $e$ and $\theta_e$ is the decision value of named entity $e$; sorting the decision values in descending order, selecting the first $n$ named entities as cluster centers, and taking the first $m$ named entities smaller than $n$ as micro-cluster centers;
increasing the number of micro-cluster centers and obtaining the number of micro-cluster centers that has the least influence on the result;
assigning the remaining named entities to the clusters containing named entities of higher density and closer distance, according to the allocation strategy of the density peak clustering algorithm;
calculating the similarity between named entities by an expression in which $F_{ej}$ is the similarity between named entity $e$ and named entity $j$, $j \in K(e)$ indicates that named entity $j$ is among the $K$ nearest neighbors of named entity $e$, and $c_{eg}$ is the Euclidean distance between named entity $g$ and named entity $e$; constructing a similarity matrix between the named entities;
calculating the similarity among the micro-clusters by an expression involving the similarity of micro-cluster $v_e$ and micro-cluster $v_j$, in which $v_j$ is the $j$-th micro-cluster center, $F_{ej}$ is the similarity from micro-cluster $v_e$ to micro-cluster $v_j$, and $F_{je}$ is the similarity from micro-cluster $v_j$ to micro-cluster $v_e$; constructing a similarity matrix among the micro-clusters from these similarities, merging each micro-cluster that does not contain a final cluster center with the most similar micro-cluster that contains a cluster center, and outputting the comprehensive entity;
and constructing a knowledge base model according to the entity relation and the classification data, and outputting a knowledge base.
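The similarity grading recited in claim 1 can be sketched as below, assuming that $\eta(R, C)$ is the Jaccard ratio $|R \cap C| / |R \cup C|$; that form is inferred from the union and intersection named in the claim and is not confirmed by this text. The claim leaves small gaps between the ranges (for example between 0.82 and 0.83); this sketch assigns such values to the lower level, which is also an assumption.

```python
def similarity(r, c):
    """Assumed Jaccard form of eta(R, C): |R intersect C| / |R union C|."""
    r, c = set(r), set(c)
    union = r | c
    return len(r & c) / len(union) if union else 0.0


def association_level(eta):
    """Map a similarity in [0, 1] to knowledge association level 1 to 4."""
    if eta >= 0.83:
        return 1
    if eta >= 0.51:
        return 2
    if eta >= 0.31:
        return 3
    return 4


eta = similarity({"scope", "terms", "symbols"}, {"terms", "symbols", "tests"})
level = association_level(eta)
```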
2. The knowledge base construction method for standard digitization according to claim 1, wherein the preprocessing in step a comprises removing duplicate data, word segmentation, stop-word removal, extracting knowledge association levels, smoothing noisy data, normalization, and digitization.
3. The knowledge base construction method for standard digitization according to claim 1, wherein the method for extracting named entities and entity relationships from the knowledge data comprises:
extracting keywords from the knowledge data, gridding the keywords and mapping them into a rectangular coordinate system, and obtaining a histogram whose horizontal axis is the frequency with which keywords occur in a cell and whose vertical axis is the number of cells containing the same number of keywords;
setting a threshold on the number of points in a cell; if the number in a cell is greater than or equal to the threshold, temporarily retaining the cell; sorting the cells by the number of keywords they contain and retaining the top three cells;
taking each temporarily retained cell as a center, if the number of temporarily retained cells exceeds the threshold, permanently retaining the cell, and otherwise removing it;
forming the retained cells into a feature grid matrix, expressed as:
$$\begin{bmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{bmatrix}$$
wherein $u_{11}$ is the retained cell in row 1, column 1, $m$ is the number of rows of the feature grid matrix, $n$ is the number of columns of the feature grid matrix, and the elements of the feature grid matrix are output as named entities;
assigning weights to the different phrases in a sentence to obtain an embedded representation of sentence-level features, and determining the relation type in the sentence through a fully connected layer, expressed as:
$$z_r = \beta(g_r B_c + v_r)$$
wherein $B_c$ is the embedding of sentence $c$, $\beta$ is the sigmoid function, $z_r$ is the result for relation class $r$ of sentence $c$, $v_r$ is the error parameter of relation class $r$, and $g_r$ is the relation coefficient of relation class $r$; the relations in the knowledge data are determined and the relation result is output.
4. The knowledge base construction method for standard digitization according to claim 1, wherein the method for screening the named entities with a screening model to obtain the first entity comprises:
screening, from the named entities, an optimal model containing a plurality of named entities, and setting the optimal model as the initial value;
removing the initial value from all named entities, combining the remaining named entities into subsets, and merging each subset with the initial value;
screening the subsets with a full-subset model and selecting the models whose statistics are significant; stopping the calculation if no named entity is added, and otherwise continuing the screening;
judging whether the statistics of the subset are all significant and the regression error is minimal; if so, forming the subset anew, and otherwise screening out a first set of named entities whose statistics are significant;
performing regression on all subsets, deleting the independent variable whose statistic has the smallest absolute value below the critical value, and repeating this operation until all subsets are traversed;
judging whether the statistics of all independent variables are significant; if not, deleting again, and otherwise screening out a second set of named entities whose statistics are significant;
and merging the named entity first set and the named entity second set, deleting repeated named entities to obtain a named entity set, and screening the named entity set according to regression errors to obtain the first entity.
5. The knowledge base construction method for standard digitization according to claim 1, wherein the method for screening the named entities according to the knowledge association level to obtain the second entity comprises:
obtaining an association matrix of the named entities from the knowledge association levels among the named entities, classifying the named entities by association, and calculating a first screening index of named entity $a$ from the maximum association degree within a named-entity class and the minimum association degree between classes, by an expression in which $x_n$ is the association degree within a named-entity class and $x_o$ is the association degree between named-entity classes;
obtaining a second screening index from the information entropy of the named entities and the sample variance of the named entities, by a formula in which $w$ is the number of named entities, $u_{bj}$ is the proportion of the $j$-th index of the $b$-th named entity, $s_b$ is the $b$-th named entity, and the mean of the named entities also appears;
obtaining a decisive screening index of the $a$-th named entity from the first screening index and the second screening index, sorting the decisive screening indexes in descending order, and taking the named entity with the largest decisive screening index as the second entity.
6. The knowledge base construction method for standard digitization according to claim 1, wherein the method for classifying the knowledge data with a classification model according to the comprehensive entity to obtain classification data comprises:
calculating the density of each comprehensive entity in the $t$-th iteration by an expression in which $s_i$ is the $i$-th comprehensive entity and $s_j$ is the $j$-th comprehensive entity, and which involves the density of comprehensive entity $i$ in the $t$-th iteration, the $k$-th nearest neighbor of $s_j$ in the $t$-th iteration, and the reverse $k$-nearest-neighbor set of $s_i$ among all comprehensive entities in the $t$-th iteration;
determining the stripping flag of each comprehensive entity in the $t$-th iteration by an expression involving the stripping flag of comprehensive entity $i$, its initial density, and its density threshold in the $t$-th iteration;
determining the stripped boundary set by an expression involving the boundary set stripped in the $t$-th iteration according to the stripping flag $\chi$ and the set $S^{(t)}$ of comprehensive entities remaining after the $t$-th stripping iteration, and determining the remaining comprehensive entity set after stripping;
judging whether the stripping is finished; if not, continuing to strip; if so, obtaining the connection threshold of the comprehensive entities and completing the initial clustering;
performing fuzzy partitioning of the boundary comprehensive entities by an expression in which $D_a$ is the $a$-th cluster and $z(s_i, D_a)$ relates boundary comprehensive entity $s_i$ to cluster $D_a$, and which involves the $j$-th initial cluster, the $a$-th initial cluster, and the distances from boundary comprehensive entity $s_i$ to those initial clusters; classification data is obtained from the fuzzy partition.
7. A knowledge base construction system for standard digitization, comprising:
a preprocessing module, configured to acquire standard digitized knowledge data and preprocess the knowledge data;
an extraction module, configured to obtain knowledge association levels from the preprocessed knowledge data and extract named entities and entity relations from the knowledge data;
an entity screening module, configured to screen the named entities with a screening model to obtain a first entity, and to screen the named entities according to the knowledge association level to obtain a second entity;
a classification module, configured to combine the first entity and the second entity to obtain a comprehensive entity, and to classify the knowledge data with a classification model according to the comprehensive entity to obtain classification data;
a construction module, configured to construct a knowledge base model from the entity relations and the classification data and output a knowledge base.
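The decision-value step of claim 1 (local density, relative distance, descending sort, selection of the first $n$ centers) can be sketched with the standard density peak clustering quantities. Several things here are assumptions: the decision value is taken as the product of local density and relative distance, the local density is a simple cutoff count rather than the patent's own density expression, and the `cutoff` parameter and function names are hypothetical. Micro-cluster selection and merging are not shown.

```python
import math


def decision_values(points, cutoff):
    """Return theta_e = density_e * mu_e for every point (assumed product form)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff) for i in range(n)]
    mu = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Relative distance: distance to the nearest denser point; the densest
        # point has none, so its largest distance to any point is used instead.
        mu.append(min(higher) if higher else max(dist[i]))
    return [r * m for r, m in zip(rho, mu)]


def top_centers(theta, n_centers):
    """Indices of the n_centers largest decision values, in descending order."""
    order = sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)
    return order[:n_centers]


points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),   # group around (0, 0)
          (10, 0), (11, 0), (9, 0), (10, 1)]          # group around (10, 0)
theta = decision_values(points, cutoff=1.2)
centers = top_centers(theta, 2)
```

Points that are both dense and far from any denser point receive large decision values, which is why the two group centers rank first in this example.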
CN202311635195.2A 2023-12-01 2023-12-01 Knowledge base construction method and system for standard digitization Active CN117667890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311635195.2A CN117667890B (en) 2023-12-01 2023-12-01 Knowledge base construction method and system for standard digitization

Publications (2)

Publication Number Publication Date
CN117667890A true CN117667890A (en) 2024-03-08
CN117667890B CN117667890B (en) 2024-08-02

Family

ID=90083858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311635195.2A Active CN117667890B (en) 2023-12-01 2023-12-01 Knowledge base construction method and system for standard digitization

Country Status (1)

Country Link
CN (1) CN117667890B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN111737496A (en) * 2020-06-29 2020-10-02 东北电力大学 Power equipment fault knowledge map construction method
US20210081376A1 (en) * 2018-05-25 2021-03-18 ZFusion Technology Co., Ltd. Xiamen Construction method, device, computing device, and storage medium for constructing patent knowledge database
CN114792145A (en) * 2022-05-27 2022-07-26 中国标准化研究院 Standard digital management maintenance system and method based on knowledge graph
CN114925212A (en) * 2022-05-06 2022-08-19 神州医疗科技股份有限公司 Relation extraction method and system for automatically judging and fusing knowledge graph
CN115329101A (en) * 2022-09-06 2022-11-11 南京邮电大学 Electric power Internet of things standard knowledge graph construction method and device

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
SANGHYUK ROY CHOI 等: "Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review", 《BIOLOGY》, vol. 12, no. 7, 22 July 2023 (2023-07-22), pages 1 - 29 *
孙嘉睿 等: "模糊边界剥离聚类", 《山东大学学报(理学版)》, vol. 59, no. 3, 28 November 2023 (2023-11-28), pages 2 - 3 *
岳喜超 等: "结合主成分与熵权的关键变量筛选算法", 《中国电子科学研究院学报》, vol. 18, no. 7, 20 July 2023 (2023-07-20), pages 2 *
彭维湘: "基于VBA编程的全子集模型筛选算法", 《统计与决策》, vol. 39, no. 11, 7 June 2023 (2023-06-07), pages 1 - 3 *
李智冈 等: "基于加权核密度估计与微簇合并的密度峰值聚类算法", 《信息与控制》, 2 November 2023 (2023-11-02), pages 2 *
谢雪莲: "基于知识图谱嵌入的急性心肌梗死辅助诊断模型", 《中国优秀硕士学位论文全文数据库 医药卫生科技辑》, no. 02, 15 February 2023 (2023-02-15), pages 062 - 233 *
赵小瑞 等: "基于空间格的连续时间空域特征提取算法", 《舰船电子对抗》, vol. 46, no. 4, 25 August 2023 (2023-08-25), pages 1 - 2 *

Also Published As

Publication number Publication date
CN117667890B (en) 2024-08-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant