CN117891959A - Document metadata storage method and system based on Bayesian network - Google Patents
Document metadata storage method and system based on Bayesian network
- Publication number
- CN117891959A (application number CN202410298022.4A)
- Authority
- CN
- China
- Prior art keywords
- document
- metadata
- literature
- data
- quality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Library & Information Science (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document metadata storage method and system based on a Bayesian network. The method comprises the following steps: evaluating document quality; storing documents by class; predicting missing document metadata; and constructing a document knowledge graph. Known document metadata is acquired and stored in a known-metadata data set, and a predicted value of the document metadata is calculated by inference from the existing data using techniques such as a Bayesian network. This makes it possible to infer and assign values to missing metadata when the given document metadata is incomplete, and solves the problem in the prior art that missing metadata is not inferred when document metadata is missing.
Description
Technical Field
The invention relates to the technical field of document metadata storage, in particular to a document metadata storage method and system based on a Bayesian network.
Background
With the rapid development of information technology, big data and artificial intelligence have become hot topics in many fields. Document metadata storage is concerned with how to organize, store and retrieve document information effectively. A Bayesian network model is a probabilistic graphical model that reflects the probabilistic dependency relationships among input variables. In the technical field of document metadata storage, it can be used to analyze association relationships among document information, including collaboration relationships among authors, citation relationships among documents, and association relationships among topics. The introduction of Bayesian network technology provides new ideas and methods for managing and using document information, and enriches the research content and application areas of the document metadata storage field.
A conventional document data storage system works in one of two ways. In the first, it extracts a plurality of pieces of data from a target patent document, determines the category of each piece of extracted data, performs deep-learning-based semantic similarity calculation on data in the same category, merges identical or similar data, and imports the merged data into a storage table generated from a patent document metadata template. In the second, it performs document identification on an initial document, stores the initial document by category, allocates the document operation permissions of user terminals, and organizes and stores the documents as online documents.
For example, the patent with publication No. CN116975068A discloses a metadata-based patent document data storage method, device and storage medium, comprising: extracting a plurality of pieces of data from a target patent document according to a patent document metadata template; determining the category of each piece of extracted data based on the document structure; traversing each piece of extracted data, performing deep-learning-based semantic similarity calculation on data of the same category, determining the relationships between data of the same category, and merging identical or similar data; and importing the merged data into a storage table generated according to the patent document metadata template.
For example, the invention patent with announcement No. CN113239207B discloses an online document induction and storage system based on document data analysis, comprising a document identification module, a classification storage module, a heat calculation module and a permission allocation module. The document identification module performs document identification on an initial document, where document identification covers duplicate-document identification and latest-document identification; the classification storage module stores the initial document by class according to the document identifier in the initial document information; the heat calculation module calculates the heat (popularity) of the online documents on the server; and the permission allocation module allocates the document operation permissions of user terminals, organizes and stores the online documents in a rational way, and sets differentiated document operation permissions for visitors.
However, in the course of implementing the technical solution of the embodiments of the present application, the applicant found that the above technology has at least the following technical problem:
In the prior art, document identification is performed on an initial document; a plurality of pieces of data are extracted from a target document and the category of each piece is determined; each piece of extracted data is traversed, deep-learning-based semantic similarity calculation is performed on data in the same category, and the relationships between data in the same category are determined; identical or similar data are merged and stored by category; and the document operation permissions of visitors are set differently. However, when document metadata is missing, the prior art does not infer the missing metadata.
Disclosure of Invention
The embodiments of the application provide a Bayesian-network-based document metadata storage method and system, which solve the problem that the prior art cannot infer missing metadata when document metadata is missing, and make it possible to infer and assign values to missing metadata when the given document metadata is incomplete.
The embodiments of the application provide a document metadata storage method based on a Bayesian network, comprising the following steps: acquiring document information, establishing a document quality assessment index model, evaluating the quality level of each document and obtaining the high-quality documents; acquiring the metadata of all high-quality documents, establishing a document metadata relationship evaluation index model, determining the dependency relationships between document metadata based on the Bayesian network in combination with a Bayesian algorithm, analyzing the type to which each document belongs, and storing documents of the same type together; if the document metadata is incomplete, constructing a document metadata estimated value model and inferring the value of the missing metadata; and if the document metadata is complete, constructing a document knowledge graph according to the similarity of the stored high-quality document metadata.
Further, the specific analysis process for establishing the document quality assessment index model is as follows: the document information comprises the number of times the document is cited, publisher information data and document originality data; the document metadata comprises document title data, document author information data and document keyword data; the citation count, publisher information data and document originality data of each document are obtained from a document database; the citation count and document originality data of each document are processed by eliminating repeated data, which yields first cleaning data; the mean of the first cleaning data is calculated to obtain a first average value; abnormal data in the first cleaning data is identified and replaced with the first average value, which yields second cleaning data; the second cleaning data is taken as the cleaned citation count and document originality data; and the citation count, publisher information data and document originality data of each document are comprehensively analyzed to establish a document quality assessment index model, which is used to obtain a document quality assessment index.
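The two cleaning passes described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how abnormal data is identified, so the z-score rule and the `z_cutoff` parameter are assumptions, as are the function name `clean_counts` and the `(doc_id, value)` record layout.

```python
from statistics import mean, pstdev


def clean_counts(records, z_cutoff=2.0):
    """Return (first_mean, second_cleaning) for a list of (doc_id, value) records.

    Assumption: a value is "abnormal" when it lies more than z_cutoff
    population standard deviations from the first average value. The source
    text does not specify the detection rule.
    """
    # First cleaning pass: drop repeated records, keeping the first occurrence.
    seen = set()
    first_cleaning = []
    for doc_id, value in records:
        if doc_id not in seen:
            seen.add(doc_id)
            first_cleaning.append((doc_id, value))

    # First average value, computed on the deduplicated data.
    values = [value for _, value in first_cleaning]
    first_mean = mean(values)
    spread = pstdev(values)

    # Second cleaning pass: replace abnormal values with the first average value.
    second_cleaning = []
    for doc_id, value in first_cleaning:
        is_abnormal = spread > 0 and abs(value - first_mean) / spread > z_cutoff
        second_cleaning.append((doc_id, first_mean if is_abnormal else value))
    return first_mean, second_cleaning


# Hypothetical citation counts; document "a" appears twice, "f" is an outlier.
citations = [("a", 10), ("a", 10), ("b", 12), ("c", 11),
             ("d", 9), ("e", 10), ("f", 500)]
first_mean, cleaned = clean_counts(citations)
```

Because the replacement value is the mean computed before outliers are removed, a single extreme value still pulls the replacement upward; that follows the order of steps given in the text.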
Further, the specific analysis method for obtaining the high-quality documents is as follows: based on the document quality assessment index model, the document quality assessment index of each document is acquired, and the indexes are analyzed to obtain a preset screening threshold for the document quality assessment index; if the document quality assessment index is not lower than the preset screening threshold, the document is a high-quality document; if the document quality assessment index is lower than the preset screening threshold, the document is a low-quality document; and the high-quality documents and the low-quality documents are stored in separate partitions.
Further, the specific analysis process for determining the dependency relationships between document metadata based on the Bayesian network is as follows: the document title data is taken as the first acquired document metadata and the document author information data as the second acquired document metadata, and the document metadata relationship evaluation index of the two is obtained by analysis; the document author information data is then taken as the first acquired document metadata and the document keyword data as the second acquired document metadata, and the document metadata relationship evaluation index of the two is obtained by analysis; the probabilistic dependency relationships among the document title data, the document author information data and the document keyword data are analyzed in combination with the Bayesian network model, and from these it is determined whether the document metadata are interdependent; if the document metadata are interdependent, this indicates that the document title data, the document author information data and the document keyword data come from the same document; if they are not interdependent, this indicates that they come from different documents.
Further, the specific analysis process of the document metadata relationship evaluation index is as follows: the document title data, document author information data and document keyword data of all high-quality documents are acquired; a document metadata relationship evaluation index model formula is constructed, and the document metadata relationship evaluation index is calculated according to that formula. The document metadata relationship evaluation index model formula is: [formula not preserved in this text]. In the formula, the symbols, which are likewise not preserved, denote in order: the document metadata relationship evaluation index of the first acquired document metadata and the second acquired document metadata; the first acquired document metadata and the second acquired document metadata; a constant; the index number of a probabilistic dependency relationship evaluation index between document metadata, together with its range; and the total number of probabilistic dependency relationship evaluation indexes between document metadata.
Further, the specific analysis process for determining the type to which a document belongs is as follows: the document title data, document author information data and document keyword data of the documents to be learned (the training documents) are acquired from a document database; these are used as the input data of a document classification judgment model, and the model is trained so that it can judge the type of a document; the document title data, document author information data and document keyword data of a document are then input into the trained document classification judgment model to obtain the document type, and documents of the same type are stored together.
Further, the specific analysis process for constructing the document metadata estimated value model is as follows: if missing metadata is detected when the acquired document metadata is input into the document metadata relationship evaluation index model formula, a document metadata estimated value model is constructed; known document metadata is acquired and stored in a known-metadata data set; and the document metadata estimated value is calculated. The document metadata estimated value model formula is: [formula not preserved in this text]. In the formula, the symbols other than π and e are likewise not preserved; in order, the terms denote: the document metadata estimated value; the first acquired document metadata and the second acquired document metadata in the known-metadata data set; a correction factor of the acquired document metadata; the circle constant π; and the natural constant e.
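Since the estimated value formula itself is not available in this text, the sketch below swaps in a much simpler, generic technique to illustrate the idea of inferring a missing field from a known-metadata data set: it estimates the conditional probability of one field given another from co-occurrence counts and picks the most probable value. It is not the patent's formula. The function names, the field names `author` and `keyword`, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_conditional_counts(known_records, given_field, target_field):
    """Count how often each target value co-occurs with each given value."""
    table = defaultdict(Counter)
    for record in known_records:
        given = record.get(given_field)
        target = record.get(target_field)
        if given is not None and target is not None:
            table[given][target] += 1
    return table


def infer_missing(record, table, given_field, target_field):
    """Return (value, probability) for the target field of one record.

    If the field is present it is returned unchanged with probability 1.0.
    If nothing is known about the given value, (None, 0.0) is returned.
    """
    if record.get(target_field) is not None:
        return record[target_field], 1.0
    counts = table.get(record.get(given_field))
    if not counts:
        return None, 0.0
    value, hits = counts.most_common(1)[0]
    # Relative frequency as an estimate of P(target | given).
    return value, hits / sum(counts.values())


# Hypothetical known-metadata data set.
known = [
    {"author": "Li", "keyword": "bayesian network"},
    {"author": "Li", "keyword": "bayesian network"},
    {"author": "Li", "keyword": "knowledge graph"},
    {"author": "Wang", "keyword": "data storage"},
]
table = fit_conditional_counts(known, "author", "keyword")
guess, prob = infer_missing({"author": "Li", "keyword": None},
                            table, "author", "keyword")
unknown = infer_missing({"author": "Zhao", "keyword": None},
                        table, "author", "keyword")
```

A full Bayesian network would condition on several parent fields at once (for example title and author together); the single-parent table here is only the smallest case of that idea.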
Further, the specific analysis method for detecting missing metadata is as follows: the acquired document metadata is input into the document metadata relationship evaluation index model formula; if the calculation result of the formula is 0, metadata is missing; if the calculation result of the formula is not 0, no metadata is missing.
Further, the specific analysis method for constructing the document knowledge graph is as follows: the document metadata of all high-quality documents in the high-quality document storage area is obtained from a document information database and analyzed to obtain a sample document metadata average estimated value; the similarity of the document metadata is judged, similar document metadata is associated, and the document knowledge graph is constructed.
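As an illustration of associating similar document metadata into a graph, the sketch below links two documents when the Jaccard similarity of their keyword sets reaches a threshold. The text does not name a similarity measure or a threshold, so Jaccard similarity, the `min_similarity` parameter, the function names and the sample data are all assumptions.

```python
from itertools import combinations


def jaccard(left, right):
    """Jaccard similarity of two keyword collections (0.0 when both are empty)."""
    left, right = set(left), set(right)
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def build_graph(keywords_by_doc, min_similarity=0.5):
    """Return {(doc_a, doc_b): similarity} for every sufficiently similar pair."""
    edges = {}
    # Sorting the ids gives each undirected edge one canonical orientation.
    for doc_a, doc_b in combinations(sorted(keywords_by_doc), 2):
        score = jaccard(keywords_by_doc[doc_a], keywords_by_doc[doc_b])
        if score >= min_similarity:
            edges[(doc_a, doc_b)] = score
    return edges


# Hypothetical keyword metadata for three high-quality documents.
docs = {
    "d1": ["bayesian network", "metadata"],
    "d2": ["bayesian network", "metadata", "knowledge graph"],
    "d3": ["optics"],
}
edges = build_graph(docs)
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a sketch; a large store would need an index or blocking step before comparison.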
The embodiments of the application provide a document metadata storage system based on a Bayesian network, comprising the following modules. Document quality assessment module: used to acquire document information, establish a document quality assessment index model, evaluate the quality level of each document and obtain the high-quality documents. Document classification storage module: used to acquire the metadata of all high-quality documents, establish a document metadata relationship evaluation index model, determine the dependency relationships among the nodes of the Bayesian network in combination with a Bayesian algorithm, analyze the type to which each document belongs, and store documents of the same type together. Missing document metadata prediction module: used to construct a document metadata estimated value model and infer the value of the missing metadata if the document metadata is incomplete. Document knowledge graph construction module: used to construct a document knowledge graph according to the similarity of the stored high-quality document metadata if the document metadata is complete.
One or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
1. Known document metadata is acquired and stored in a known-metadata data set, and the document metadata estimated value is calculated by inference from the existing data using techniques such as a Bayesian network. Missing metadata can therefore be inferred and assigned when the given document metadata is incomplete, which effectively solves the problem in the prior art that missing metadata is not inferred when document metadata is missing.
2. The acquired document metadata is input into the document metadata relationship evaluation index model formula to judge the integrity of the document metadata. If the calculation result of the formula is 0, metadata is missing; if the calculation result is not 0, no metadata is missing. In this way it is detected whether metadata is missing, and hence whether the given document metadata is complete.
3. The document title data, document author information data and document keyword data of all high-quality documents are acquired, and a document metadata relationship evaluation index model formula is constructed. The document title data is taken as the first acquired document metadata and the document author information data as the second acquired document metadata, and the document metadata relationship evaluation index of the two is obtained by analysis. The probabilistic dependency relationships among the document title data, document author information data and document keyword data are then analyzed, so that it can be determined whether the document metadata are interdependent.
Drawings
FIG. 1 is a flowchart of a method for storing document metadata based on a Bayesian network according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a Bayesian network-based document metadata storage system according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide a Bayesian-network-based document metadata storage method and system that solve the problem that the prior art cannot infer missing metadata when document metadata is missing. Known document metadata is acquired and stored in a known-metadata data set, and the document metadata estimated value is obtained by inference from the existing data using Bayesian network technology, so that missing metadata is inferred when document metadata is missing.
To solve the problem that missing metadata is not inferred when document metadata is missing, the overall idea of the technical solution in the embodiments of the application is as follows:
The citation count, publisher information data and document originality data of each document are obtained, and the citation count and document originality data are cleaned. The citation count, publisher information data and document originality data of each document are comprehensively analyzed to establish a document quality assessment index model, the document quality assessment index of each document is obtained, and high-quality documents and low-quality documents are stored in separate partitions. The document title data, document author information data and document keyword data of all high-quality documents are obtained, a document metadata relationship evaluation index model formula is constructed, and the document metadata relationship evaluation index is calculated according to that formula: the document author information data is taken as the first acquired document metadata and the document keyword data as the second acquired document metadata, and the document metadata relationship evaluation index of the two is obtained by analysis. In combination with the Bayesian network model, the probabilistic dependency relationships among the document title data, document author information data and document keyword data are analyzed, and from these it is determined whether the document metadata are interdependent. If missing metadata is detected in this process, known document metadata is acquired and stored in a known-metadata data set, a document metadata estimated value model is constructed, and the document metadata estimated value is calculated. The document metadata of all high-quality documents in the high-quality document storage area is obtained and analyzed to obtain a sample document metadata average estimated value, the similarity of the document metadata is judged, and a document knowledge graph is constructed. This achieves the effect of inferring and assigning values to missing metadata when the given document metadata is incomplete.
For a better understanding of the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of a Bayesian network-based document metadata storage method according to an embodiment of the present application. The method is applied to a Bayesian network-based document metadata storage system and comprises the following steps. Document quality assessment: acquiring document information, establishing a document quality assessment index model, evaluating the quality level of each document and obtaining the high-quality documents. Document classification storage: acquiring the metadata of all high-quality documents, establishing a document metadata relationship evaluation index model, determining the dependency relationships between document metadata based on the Bayesian network in combination with a Bayesian algorithm, analyzing the type to which each document belongs, and storing documents of the same type together. Missing document metadata prediction: if the document metadata is incomplete, constructing a document metadata estimated value model and inferring the value of the missing metadata. Document knowledge graph construction: if the document metadata is complete, constructing a document knowledge graph according to the similarity of the stored high-quality document metadata.
Further, the specific analysis process for establishing the document quality assessment index model is as follows: the document information comprises the number of times the document is cited, publisher information data and document originality data; the document metadata comprises document title data, document author information data and document keyword data; the citation count, publisher information data and document originality data of each document are obtained from a document database; the citation count and document originality data of each document are processed by eliminating repeated data, which yields first cleaning data; the mean of the first cleaning data is calculated to obtain a first average value; abnormal data in the first cleaning data is identified and replaced with the first average value, which yields second cleaning data; the second cleaning data is taken as the cleaned citation count and document originality data; and the citation count, publisher information data and document originality data of each document are comprehensively analyzed to establish a document quality assessment index model, which is used to obtain a document quality assessment index.
In this embodiment, the document information includes, but is not limited to, the citation count, publisher information data and document originality data; the document data source, the document structure, the language expression ability and the like may also be considered. The document metadata includes, but is not limited to, the document title data, document author information data and document keyword data; the document abstract, the publication date, the institution to which the author belongs and the like may also be considered. Besides a machine learning algorithm such as the random forest method, the document quality assessment index can also be obtained by a more precise calculation method, specifically: a document quality assessment index model formula is constructed and the document quality assessment index is calculated. The index helps personnel quickly identify the higher-quality documents among a massive number of documents, which saves time and energy; it also helps evaluate the influence and importance of a document, so that personnel can understand the standing and degree of influence of a given document in its academic field. The document quality assessment index model formula is: [formula not preserved in this text]. In the formula, the symbols other than e are likewise not preserved; in order, the terms denote: the document quality assessment index, which helps assess the influence and importance of a document; the citation count, the publisher information data and the document originality data; the allowable deviation values of the citation count, the publisher information data and the document originality data; the weights of the citation count, the publisher information data and the document originality data in the index; and the natural constant e. The allowable deviation values are extracted from the document information database. Because fully accurate data is rare in actual calculation and the data contains errors and uncertainty, the allowable deviation values are introduced to make the calculated document quality assessment index more reliable.
Further, the specific analysis method for obtaining high-quality documents is as follows: based on the document quality assessment index model, obtain the document quality assessment index of each document, and analyze the indexes to obtain a preset screening threshold; if a document's quality assessment index is not lower than the preset screening threshold, the document is a high-quality document; if it is lower than the preset screening threshold, the document is a low-quality document; the high-quality documents and the low-quality documents are stored in separate partitions.
In this embodiment, the preset screening threshold is obtained by summing the document quality assessment indexes and taking their average. Because each index already combines multiple evaluation factors, the threshold reflects a comprehensive view of document quality, and high-quality and low-quality documents can be distinguished against it, helping personnel screen documents more effectively and better manage and use document resources.
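The screening rule described above can be illustrated with a minimal sketch. It assumes the quality indexes have already been computed; the function name `partition_by_quality`, the document identifiers and the index values are hypothetical and are not taken from the patent.

```python
from statistics import mean

def partition_by_quality(indexes):
    """Split documents into high- and low-quality partitions.

    `indexes` maps a document id to its quality assessment index.
    The threshold is the average of all indexes, as described in the text.
    """
    threshold = mean(indexes.values())
    # "Not lower than the threshold" means >=, so a document exactly at the
    # threshold is treated as high quality.
    high = {doc: q for doc, q in indexes.items() if q >= threshold}
    low = {doc: q for doc, q in indexes.items() if q < threshold}
    return threshold, high, low

# Hypothetical index values for three documents.
scores = {"doc_a": 0.75, "doc_b": 0.25, "doc_c": 0.5}
threshold, high, low = partition_by_quality(scores)
```

Because the threshold is a mean, it is recomputed whenever the document collection changes, so the same document may move between partitions as new documents are added.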
Further, the specific analysis process for determining the dependency relationship between the document metadata based on the Bayesian network comprises: taking the document title data as the first acquired document metadata and the document author information data as the second acquired document metadata, and analyzing to obtain their document metadata relationship evaluation index; taking the document author information data as the first acquired document metadata and the document keyword data as the second acquired document metadata, and analyzing to obtain their document metadata relationship evaluation index; analyzing, in combination with a Bayesian network model, the probability dependency relationships among the document title data, the document author information data and the document keyword data, and thereby determining whether the document metadata are interdependent; if the document metadata are interdependent, the document title data, the document author information data and the document keyword data come from the same document; if the document metadata are not interdependent, the document title data, the document author information data and the document keyword data come from different documents.
In this embodiment, the Bayesian network is a probabilistic graphical model for representing dependency relationships between variables. It can be used to describe the dependency relationships between document metadata, including whether document title data, author information data and keyword data are derived from the same document. For example, probability inference is performed with the Bayesian network: the document metadata relationship evaluation index is obtained from the known document author information and document keyword data, and the probability distribution of the document's title data is determined from it. A statistical method, such as the chi-square test, is then used to compare the probability distributions of the title data of different documents. If the probability distributions differ significantly, a dependency relationship exists between them; conversely, if the probability distributions tend to agree, they are not interdependent.
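The chi-square comparison mentioned above can be sketched as follows. This is a generic Pearson chi-square test of independence on a 2x2 contingency table, not the patent's own procedure; the counts are invented for illustration, and the function assumes every row and column total is non-zero.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Critical value of the chi-square distribution for 1 degree of freedom
# at the 0.05 significance level.
CRITICAL_DF1_ALPHA_05 = 3.841

# Hypothetical counts: rows = a given author present / absent,
# columns = a given keyword present / absent.
counts = [[30, 10], [10, 30]]
stat = chi_square_statistic(counts)
dependent = stat > CRITICAL_DF1_ALPHA_05
```

A statistic above the critical value is read here as a significant difference, which the text interprets as a dependency between the two metadata fields.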
Further, the specific analysis process of the document metadata relationship evaluation index is as follows: acquire the document title data, document author information data and document keyword data of all high-quality documents; construct a document metadata relationship evaluation index model formula, and calculate the document metadata relationship evaluation index according to it. In the document metadata relationship evaluation index model formula [formula not reproduced in the source text], the quantities are: the document metadata relationship evaluation index of the first acquired document metadata and the second acquired document metadata; the first acquired document metadata and the second acquired document metadata; a constant; the serial number of a probability dependency relationship evaluation index between the document metadata; and the total number of probability dependency relationship evaluation indexes between the document metadata.
In this embodiment, the document metadata relationship evaluation index can be used to determine the relationships between document metadata, helping personnel better understand those relationships, evaluate the relatedness of documents, and support academic research and decision-making. The constant in the formula avoids the instability that would arise if the denominator were 0. The analysis of the dependency relationships between document metadata is carried out on the basis of complete document metadata.
Further, the specific analysis process for analyzing the type to which a document belongs is: acquire the document title data, document author information data and document keyword data of the documents to be learned from the document database; use them as training input to obtain a document classification judgment model for judging the document type; then input the document title data, document author information data and document keyword data of a document into the document classification judgment model to obtain its type, and classify and store documents of the same type.
In this embodiment, the document types include journal articles, conference articles, monographs, and the like. Storing documents of the same type together lets a user search for and find documents more conveniently and locate documents of the required type directly, improving retrieval efficiency and accuracy; it also allows different types of documents to be managed in a targeted manner, facilitating version control, updating, maintenance and similar operations. The collected document metadata is preprocessed, including text cleaning, word segmentation, stop-word removal and similar operations, and the document types are labeled or encoded so that the model can be trained by supervised learning. A recurrent neural network (RNN) model is selected for the text classification task: the preprocessed document metadata is input into the RNN model for training, with the labeled document types as supervision signals, and the model parameters are adjusted and optimized according to the evaluation results, thereby realizing automatic judgment of the document type.
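The preprocessing and label-encoding steps described above can be sketched as follows. The stop-word list, the whitespace tokenizer (suitable only for space-delimited text such as English; Chinese text would need a dedicated word segmenter) and the sample records are assumptions made for illustration. The RNN training step itself is not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "on", "in"}

def preprocess(text):
    """Clean text, split it into tokens, and drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # text cleaning
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def encode_labels(labels):
    """Map each document type to an integer id for supervised learning."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return mapping, [mapping[label] for label in labels]

# Hypothetical (title, document type) training records.
records = [
    ("A Survey of Bayesian Networks for Metadata", "journal article"),
    ("Proceedings: Graph Storage in the Cloud!", "conference article"),
]
tokens = [preprocess(title) for title, _ in records]
mapping, y = encode_labels([doc_type for _, doc_type in records])
```

The token lists and integer labels produced here are the kind of input a text classifier would then be trained on.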
Further, the specific analysis process for constructing the document metadata estimated value model is as follows. The document metadata estimated value can be obtained through analysis by a document information platform, or by a more accurate calculation method, which is: if missing metadata is detected when the acquired document metadata is input into the document metadata relationship evaluation index model formula, construct the document metadata estimated value model; acquire the known document metadata and store it in a known metadata data set; calculate the document metadata estimated value, which is used to supplement the missing document metadata so that the document data is more complete and comprehensive. Complete documents help personnel carry out deeper data analysis and mining and discover potential associations and patterns. In the document metadata estimated value model formula [formula not reproduced in the source text], the quantities are: the document metadata estimated value; the first acquired document metadata and the second acquired document metadata in the known metadata data set; the correction factor of the acquired document metadata; the constant π (the ratio of a circle's circumference to its diameter); and the natural constant e.
In this embodiment, when a document's metadata is missing, the missing metadata can be supplemented with the value predicted by the document metadata estimated value model formula, making the document data more complete and comprehensive. Because document metadata may become inaccurate when the document itself is updated, revised or changed, a correction factor is required to correct it.
Further, the specific analysis method for detecting missing metadata is as follows: input the acquired document metadata into the document metadata relationship evaluation index model formula; if the calculation result is 0, metadata is missing; if the calculation result is not 0, no metadata is missing.
In this embodiment, when an item of acquired document metadata is missing, its value is 0, and substituting it into the document metadata relationship evaluation index model formula yields a result of 0, so it can be determined that the document lacks metadata. Conversely, if the result of the formula is not 0, both the first acquired document metadata and the second acquired document metadata are non-zero, so the document does not lack metadata.
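The zero-result check can be illustrated with a small sketch. The patent's relationship formula is not available here, so `relation_index` below is a hypothetical stand-in (a plain product) chosen only because it shares the one property the text relies on: it is 0 whenever either input is 0. The numeric encoding in `encode` and the field names are likewise assumptions.

```python
def encode(value):
    """Hypothetical encoding: 0 for a missing field, a positive number otherwise."""
    if value is None or not value.strip():
        return 0.0
    return float(len(value.strip()))

def relation_index(first, second):
    """Stand-in for the relationship formula: zero if either input is zero."""
    return first * second

def has_missing_metadata(record):
    """Apply the check to the two metadata pairs named in the text."""
    pairs = [("title", "author"), ("author", "keywords")]
    return any(
        relation_index(encode(record.get(a)), encode(record.get(b))) == 0
        for a, b in pairs
    )

complete = {"title": "Graph storage", "author": "Li Wei", "keywords": "graph; storage"}
incomplete = {"title": "Graph storage", "author": "Li Wei", "keywords": ""}
```

With these inputs, `has_missing_metadata(complete)` is false and `has_missing_metadata(incomplete)` is true, so only the second record would be routed to the estimated value model.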
Further, the specific analysis method for constructing the document knowledge graph is as follows: obtain the document metadata of all high-quality documents in the high-quality document storage area from the document information database, and analyze it to obtain a sample document metadata average estimated value; determine the similarity of the document metadata, associate similar document metadata, and construct the document knowledge graph.
In this embodiment, a sample document metadata average estimated value model formula is constructed, and the sample document metadata average estimated value is calculated according to it. The average estimated value reflects the average of the document metadata and provides a benchmark and reference for evaluating document metadata similarity, helping to analyze and compare the similarity between documents so that the document data can be better understood and used. In the sample document metadata average estimated value model formula [formula not reproduced in the source text], the quantities are: the sample document metadata average estimated value; each item of sample document metadata, identified by its serial number; and the total number of sample document metadata. The document metadata similarity evaluation index can be obtained by a text similarity algorithm, such as the TF-IDF method, or by a more accurate calculation method, which is: construct a document metadata similarity evaluation index model formula and calculate the document metadata similarity evaluation index. The index measures the degree of similarity between documents and helps users understand the degree of association between them, supporting information retrieval and knowledge graph construction. In the document metadata similarity evaluation index model formula [formula not reproduced in the source text], the quantities are: the document metadata similarity evaluation index; the comparison document and the sample documents, where the sample documents are all high-quality documents in the high-quality document storage area obtained from the document information database; the comparison document metadata; and a constant, which avoids the instability that would arise if the denominator were 0. The sample document metadata average estimated value is taken as the center data of the knowledge graph: the higher the document metadata similarity evaluation index, the closer the document metadata is to the center data of the knowledge graph; the lower the index, the farther it is from the center data. The document knowledge graph is constructed on this basis.
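The TF-IDF alternative named above can be sketched as follows. This is a generic TF-IDF with a smoothed IDF term and cosine similarity, not the patent's similarity formula; the token lists and the edge threshold `0.5` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for each token list."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF so that no term receives a zero or negative weight.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical keyword metadata of three high-quality documents.
docs = [
    ["bayesian", "network", "metadata"],
    ["bayesian", "network", "storage"],
    ["protein", "folding"],
]
vecs = tfidf_vectors(docs)
# Link documents whose similarity reaches the (assumed) threshold.
edges = [
    (i, j)
    for i in range(len(vecs))
    for j in range(i + 1, len(vecs))
    if cosine(vecs[i], vecs[j]) >= 0.5
]
```

Each pair in `edges` would become an association between two document nodes in the knowledge graph; documents with no shared terms, such as the third one here, stay unconnected.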
As shown in Fig. 2, a schematic structural diagram of the Bayesian network-based document metadata storage system according to an embodiment of the present application, the system includes: a document quality assessment module, configured to acquire document information, establish a document quality assessment index model, evaluate the document quality level and obtain high-quality documents; a document classification storage module, configured to acquire the metadata of all high-quality documents, establish a document metadata relationship evaluation index model, determine the dependency relationships among the nodes of the Bayesian network in combination with a Bayesian algorithm, analyze the type to which each document belongs, and classify and store documents of the same type; a document missing data prediction module, configured to construct a document metadata estimated value model and infer the value of the missing metadata if the document metadata is incomplete; and a document knowledge graph construction module, configured to construct a document knowledge graph according to the similarity of the stored high-quality document metadata if the document metadata is complete.
The technical solution provided by the embodiments of the present application has at least the following technical effects or advantages. Relative to the metadata-based patent document data storage method, device and storage medium disclosed in publication No. CN116975068A, the embodiment of the application judges the integrity of the document metadata by inputting the acquired document metadata into the document metadata relationship evaluation index model formula: if the calculation result is 0, metadata is missing; if the calculation result is not 0, no metadata is missing. Whether metadata is missing is thereby detected, and whether the given document metadata is complete is determined. Relative to announcement No. [number not given], the embodiment of the application acquires known document metadata and stores it in a known metadata data set, calculates the document metadata estimated value, and uses the existing data information together with techniques such as the Bayesian network for inference and prediction, so that when the given document metadata is incomplete, the missing metadata is inferred and assigned.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
1. A Bayesian network-based document metadata storage method, comprising the steps of:
acquiring document information, establishing a document quality assessment index model, evaluating the document quality level and obtaining high-quality documents;
acquiring the metadata of all high-quality documents, establishing a document metadata relationship evaluation index model, determining, in combination with a Bayesian algorithm, the dependency relationship between the document metadata based on the Bayesian network, analyzing the type to which each document belongs, and classifying and storing documents of the same type;
if the document metadata is incomplete, constructing a document metadata estimated value model and inferring the value of the missing metadata;
if the document metadata is complete, constructing a document knowledge graph according to the similarity of the stored high-quality document metadata.
2. The Bayesian network-based document metadata storage method according to claim 1, wherein the specific analysis process for establishing the document quality assessment index model is as follows:
the document information comprises the number of times the document is cited, publisher information data and document originality data;
the document metadata comprises document title data, document author information data and document keyword data;
obtaining the number of times cited, the publisher information data and the document originality data of each document from a document database;
performing data processing on the number of times cited and the document originality data of each document, and eliminating repeated data therein to obtain first cleaning data;
calculating the average of the first cleaning data to obtain a first average value;
identifying abnormal data in the first cleaning data, and replacing the abnormal data with the first average value;
obtaining second cleaning data, and taking the second cleaning data as the cleaned number of times cited and document originality data;
comprehensively analyzing the number of times cited, the publisher information data and the document originality data of each document, and establishing a document quality assessment index model for obtaining a document quality assessment index.
3. The bayesian network-based document metadata storage method according to claim 2, wherein the specific analysis method for obtaining the high-quality document is as follows:
Based on a literature quality evaluation index model, acquiring the literature quality evaluation index of each literature, analyzing the literature quality evaluation index and obtaining a preset screening threshold value of the literature quality evaluation index;
if the document quality assessment index is not lower than a preset screening threshold value, indicating that the document is a high-quality document;
if the document quality assessment index is lower than a preset screening threshold value, indicating that the document is a low-quality document;
and the high-quality documents and the low-quality documents are stored in a partitioned manner.
4. A bayesian network-based document metadata storage method according to claim 3, wherein the specific analysis process for determining the dependency relationship between the bayesian network-based document metadata comprises:
Taking the document title data as first acquired document metadata, taking the document author information data as second acquired document metadata, and analyzing to obtain a document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata;
Taking the document author information data as first acquired document metadata, taking the document keyword data as second acquired document metadata, and analyzing to obtain a document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata;
Analyzing probability dependency relations among document title data, document author information data and document keyword data by combining a Bayesian network model, and accordingly analyzing whether document metadata are interdependent;
if the document metadata are interdependent, the document title data, the document author information data and the document keyword data come from the same document;
if the document metadata are not interdependent, the document title data, the document author information data, and the document keyword data are described as coming from different documents.
5. The bayesian network-based document metadata storage method according to claim 4, wherein the specific analysis process of the document metadata relation evaluation index is as follows:
acquiring document title data, document author information data and document keyword data of all high-quality documents;
constructing a document metadata relation evaluation index model formula, and calculating a document metadata relation evaluation index according to the document metadata relation evaluation index model formula;
the document metadata relationship evaluation index model formula is as follows:
[formula not reproduced in the source text]
wherein the quantities in the formula are: the document metadata relationship evaluation index of the first acquired document metadata and the second acquired document metadata; the first acquired document metadata and the second acquired document metadata; a constant; the serial number of a probability dependency relationship evaluation index between the document metadata; and the total number of probability dependency relationship evaluation indexes between the document metadata.
6. The Bayesian network-based document metadata storage method according to claim 2, wherein the specific analysis process for analyzing the type to which a document belongs is:
Acquiring document title data, document author information data and document keyword data of a document to be learned from a document database;
Taking the document title data, the document author information data and the document keyword data of the document to be learned as input data of a document classification judgment model, and training to obtain the document classification judgment model for judging the type of the document;
The document title data, the document author information data, and the document keyword data are input into a document classification judgment model, thereby obtaining the document type, and classification storage of the same type of documents is performed.
7. The method for storing literature metadata based on bayesian network according to claim 1, wherein the specific analysis process for constructing the literature metadata predictive value model is as follows:
if missing metadata is detected when the acquired document metadata is input into the document metadata relationship evaluation index model formula, constructing the document metadata estimated value model;
acquiring known document metadata and storing the known document metadata into a known metadata data set;
calculating a document metadata estimated value, wherein the document metadata estimated value model formula is as follows:
[formula not reproduced in the source text]
wherein the quantities in the formula are: the document metadata estimated value; the first acquired document metadata and the second acquired document metadata in the known metadata data set; the correction factor of the acquired document metadata; the constant π (the ratio of a circle's circumference to its diameter); and the natural constant e.
8. The Bayesian network-based document metadata storage method according to claim 7, wherein the specific analysis method for detecting missing metadata is as follows:
inputting the acquired document metadata into the document metadata relationship evaluation index model formula;
if the calculation result of the document metadata relationship evaluation index model formula is 0, metadata is missing;
if the calculation result of the document metadata relationship evaluation index model formula is not 0, no metadata is missing.
9. The method for storing document metadata based on bayesian network according to claim 1, wherein the specific analysis method for constructing the document knowledge graph is as follows:
acquiring document metadata of all high-quality documents in a high-quality document storage area, and analyzing to obtain an average estimated value of the sample document metadata;
determining the similarity of the document metadata, associating similar document metadata, and constructing the document knowledge graph.
10. A bayesian network-based document metadata storage system, the bayesian network-based document metadata storage system comprising:
a document quality assessment module, configured to acquire document information, establish a document quality assessment index model, evaluate the document quality level and obtain high-quality documents;
a document classification storage module, configured to acquire the metadata of all high-quality documents, establish a document metadata relationship evaluation index model, determine the dependency relationships among the nodes of the Bayesian network in combination with a Bayesian algorithm, analyze the type to which each document belongs, and classify and store documents of the same type;
a document missing data prediction module, configured to construct a document metadata estimated value model and infer the value of the missing metadata if the document metadata is incomplete;
a document knowledge graph construction module, configured to construct a document knowledge graph according to the similarity of the stored high-quality document metadata if the document metadata is complete.
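The data-cleaning steps recited in claim 2 (eliminate repeated data, compute a first average, replace abnormal data with that average) can be sketched as follows. The claims do not say how repeated or abnormal data are identified, so two assumptions are made here: repeats are records sharing a document id, and a value is abnormal when it lies more than `z_limit` population standard deviations from the first average. The function name and sample values are hypothetical.

```python
from statistics import mean, pstdev

def clean_counts(records, z_limit=2.0):
    """Return (first_average, second_cleaning_data) for (doc_id, value) records."""
    # First cleaning data: keep only the first record seen for each document id.
    first = {}
    for doc_id, value in records:
        first.setdefault(doc_id, value)

    values = list(first.values())
    first_average = mean(values)
    spread = pstdev(values)

    # Second cleaning data: replace abnormal values with the first average.
    second = {}
    for doc_id, value in first.items():
        abnormal = spread > 0 and abs(value - first_average) > z_limit * spread
        second[doc_id] = first_average if abnormal else value
    return first_average, second

# Hypothetical citation counts; "d2" is repeated and "d6" is an outlier.
records = [("d1", 10), ("d2", 12), ("d2", 12), ("d3", 11),
           ("d4", 9), ("d5", 13), ("d6", 500)]
first_average, cleaned = clean_counts(records)
```

Note that the first average is computed before the outlier is replaced, as the claim orders the steps, so the replacement value is itself pulled toward the outlier.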
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410298022.4A CN117891959B (en) | 2024-03-15 | 2024-03-15 | Document metadata storage method and system based on Bayesian network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117891959A true CN117891959A (en) | 2024-04-16 |
CN117891959B CN117891959B (en) | 2024-05-10 |
Family
ID=90641568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410298022.4A Active CN117891959B (en) | 2024-03-15 | 2024-03-15 | Document metadata storage method and system based on Bayesian network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117891959B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104636426A (en) * | 2014-12-22 | 2015-05-20 | 河海大学 | Multi-factor comprehensive quantitative analysis and sorting method for academic influences of scientific research institutions |
CN106021222A (en) * | 2016-05-09 | 2016-10-12 | 浙江农林大学 | Analysis method and device for scientific research literature theme evolution |
CN106570088A (en) * | 2016-10-20 | 2017-04-19 | 浙江大学 | Discovering and evolution tracking method for scientific research document topics |
CN107315738A (en) * | 2017-07-05 | 2017-11-03 | 山东大学 | A kind of innovation degree appraisal procedure of text message |
CN109801687A (en) * | 2019-01-15 | 2019-05-24 | 合肥工业大学 | A kind of construction method and system of the causality knowledge base towards medicine |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | bayesian-based LDA topic label calibration method, system and medium |
WO2020207431A1 (en) * | 2019-04-12 | 2020-10-15 | 智慧芽信息科技(苏州)有限公司 | Document classification method, apparatus and device, and storage medium |
CN111930962A (en) * | 2020-09-02 | 2020-11-13 | 平安国际智慧城市科技股份有限公司 | Document data value evaluation method and device, electronic equipment and storage medium |
WO2022116324A1 (en) * | 2020-12-04 | 2022-06-09 | 中国科学院深圳先进技术研究院 | Search model training method, apparatus, terminal device, and storage medium |
CN117520800A (en) * | 2023-11-25 | 2024-02-06 | 北京豆果信息技术有限公司 | Training method, system, electronic equipment and medium for nutrition literature model |
-
2024
- 2024-03-15 CN CN202410298022.4A patent/CN117891959B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104636426A (en) * | 2014-12-22 | 2015-05-20 | 河海大学 | Multi-factor comprehensive quantitative analysis and sorting method for academic influences of scientific research institutions |
CN106021222A (en) * | 2016-05-09 | 2016-10-12 | 浙江农林大学 | Analysis method and device for scientific research literature theme evolution |
CN106570088A (en) * | 2016-10-20 | 2017-04-19 | 浙江大学 | Discovering and evolution tracking method for scientific research document topics |
CN107315738A (en) * | 2017-07-05 | 2017-11-03 | 山东大学 | Method for evaluating the innovation degree of text information |
CN109801687A (en) * | 2019-01-15 | 2019-05-24 | 合肥工业大学 | Construction method and system for a medicine-oriented causality knowledge base |
CN112233736A (en) * | 2019-01-15 | 2021-01-15 | 合肥工业大学 | Knowledge base construction method and system |
WO2020207431A1 (en) * | 2019-04-12 | 2020-10-15 | 智慧芽信息科技(苏州)有限公司 | Document classification method, apparatus and device, and storage medium |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | Bayesian-based LDA topic label calibration method, system and medium |
CN111930962A (en) * | 2020-09-02 | 2020-11-13 | 平安国际智慧城市科技股份有限公司 | Document data value evaluation method and device, electronic equipment and storage medium |
WO2022116324A1 (en) * | 2020-12-04 | 2022-06-09 | 中国科学院深圳先进技术研究院 | Search model training method, apparatus, terminal device, and storage medium |
CN117520800A (en) * | 2023-11-25 | 2024-02-06 | 北京豆果信息技术有限公司 | Training method, system, electronic equipment and medium for nutrition literature model |
Also Published As
Publication number | Publication date |
---|---|
CN117891959B (en) | 2024-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111078994B (en) | | Portrait-based medical science popularization article recommendation method and system |
CN112070615B (en) | | Financial product recommendation method and device based on knowledge graph |
Votto et al. | | Applying and assessing performance of earned duration management control charts for EPC project duration monitoring |
Davila Delgado et al. | | Big data analytics system for costing power transmission projects |
CN115794803B (en) | | Engineering audit problem monitoring method and system based on big data AI technology |
CN113297044B (en) | | Operation and maintenance risk early warning method and device |
CN111199493A (en) | | Arrearage risk identification method based on customer payment information and credit investigation information |
Florez‐Perez et al. | | Using machine learning to analyze and predict construction task productivity |
CN112445844A (en) | | Financial data management control system of big data platform |
Riesener et al. | | Methodology for Automated Master Data Management using Artificial Intelligence |
Gunawan et al. | | C4.5, K-Nearest Neighbor, Naïve Bayes, and Random Forest Algorithms Comparison to Predict Students' on TIME Graduation |
CN117891234A (en) | | Method and device for detecting running state of machine room, storage medium and electronic equipment |
Fan | | Data mining model for predicting the quality level and classification of construction projects |
CN116562901B (en) | | Automatic generation method of anti-fraud rule based on machine learning |
CN117891959B (en) | | Document metadata storage method and system based on Bayesian network |
Chrisnanto et al. | | The uses of educational data mining in academic performance analysis at higher education institutions (case study at UNJANI) |
Banga et al. | | Implementation of machine learning techniques in software reliability: A framework |
Perez-Valiente et al. | | Identification of reservoir analogues in the presence of uncertainty |
CN113034316B (en) | | Patent value conversion analysis method and system |
CN112506930B (en) | | Data insight system based on machine learning technology |
CN111680572B (en) | | Dynamic judgment method and system for power grid operation scene |
Sharma et al. | | Hybrid Software Reliability Model for Big Fault Data and Selection of Best Optimizer Using an Estimation Accuracy Function |
CN114329966A (en) | | Method and system for evaluating health degree of remote control system of natural gas pipeline |
CN113657599A (en) | | Accident cause and effect reasoning method and device, electronic equipment and readable storage medium |
CN111724048A (en) | | Characteristic extraction method for finished product library scheduling system performance data based on characteristic engineering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||