CN117891959B - Document metadata storage method and system based on Bayesian network - Google Patents

Document metadata storage method and system based on Bayesian network

Publication number
CN117891959B
Authority
CN
China
Prior art keywords
document
metadata
literature
data
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410298022.4A
Other languages
Chinese (zh)
Other versions
CN117891959A (en
Inventor
甘克勤
李景
张明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China National Institute of Standardization
Original Assignee
China National Institute of Standardization
Application filed by China National Institute of Standardization
Priority to CN202410298022.4A
Publication of CN117891959A
Application granted
Publication of CN117891959B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document metadata storage method and system based on a Bayesian network. The method comprises the following steps: evaluating document quality; storing documents by class; predicting missing document metadata; and constructing a document knowledge graph. Known document metadata is acquired and stored in a known-metadata data set, and a predicted value of the document metadata is calculated by inference from the existing data using techniques such as a Bayesian network. Missing metadata is thereby inferred and assigned when the given document metadata is incomplete, which solves the prior-art problem that missing metadata is not inferred when document metadata is missing.

Description

Document metadata storage method and system based on Bayesian network
Technical Field
The invention relates to the technical field of document metadata storage, in particular to a document metadata storage method and system based on a Bayesian network.
Background
With the rapid development of information technology, big data and artificial intelligence have become hot topics in many fields. Document metadata storage is concerned with how to organize, store and retrieve document information effectively. A Bayesian network is a probabilistic graphical model that represents the probabilistic dependencies among variables. In the field of document metadata storage it can be used to analyze associations among document information, including collaboration between authors, citation relationships between documents, and associations between topics. The introduction of Bayesian network technology provides new ideas and methods for managing and using document information and enriches the research content and application areas of document metadata storage.
Conventional document data storage systems follow one of two approaches. In the first, several pieces of data are extracted from a target patent document, the category of each extracted piece is determined, deep-learning-based semantic similarity is calculated for data in the same category, identical or similar data are merged, and the merged data is imported into a storage table generated from a patent document metadata template. In the second, an initial document is given a document identifier and stored by category, document operation permissions are allocated to user terminals, and documents are sorted and stored as online documents.
For example, the invention patent with publication No. CN116975068A discloses a metadata-based patent document data storage method, device and storage medium, comprising: extracting a plurality of pieces of data from a target patent document according to a patent document metadata template; determining the category of each piece of extracted data based on the document structure; traversing each piece of extracted data, performing deep-learning-based semantic similarity calculation on data of the same category, determining the relations between data of the same category, and merging identical or similar data; and importing the merged data into a storage table generated according to the patent document metadata template.
As another example, the invention patent with announcement No. CN113239207B discloses an online document induction and storage system based on document data analysis, comprising a document identification module, a classification storage module, a heat calculation module and a permission distribution module. The document identification module performs document identification on an initial document, the identification covering repeated documents and the latest documents. The classification storage module stores the initial document by class according to the document identifier in the initial document information. The heat calculation module calculates the heat of the online documents in a server. The permission distribution module allocates the document operation permissions of user terminals, sorts the documents and stores them as online documents, and sets the document operation permissions of visitors differently.
However, in the course of implementing the technical scheme of the embodiment of the application, the applicant found that the above technology has at least the following technical problem:
In the prior art, document identification is performed on an initial document, a plurality of pieces of data are extracted from a target document, the category of each piece of extracted data is determined, each piece is traversed, deep-learning-based semantic similarity calculation is performed on data of the same category, the relations between data of the same category are determined, identical or similar data are merged and stored by class, and the document operation permissions of visitors are set differently. None of these steps infers missing metadata, so the prior art has the problem that missing metadata cannot be inferred when document metadata is missing.
Disclosure of Invention
The embodiment of the application solves the problem that the prior art cannot infer the missing metadata when the literature metadata is missing by providing the literature metadata storage method and system based on the Bayesian network, and realizes the inference and assignment of the missing metadata when the given literature metadata is incomplete.
The embodiment of the application provides a document metadata storage method based on a Bayesian network, which comprises the following steps: acquiring document information, establishing a document quality assessment index model, evaluating the document quality level and obtaining high-quality documents; acquiring the metadata of all high-quality documents, establishing a document metadata relation evaluation index model, determining the dependency relationships between the document metadata based on the Bayesian network in combination with a Bayesian algorithm, analyzing the type to which each document belongs, and storing documents of the same type by class; if the document metadata is incomplete, constructing a document metadata estimated value model and inferring the value of the missing metadata; if the document metadata is complete, constructing a document knowledge graph according to the similarity of the stored high-quality document metadata.
Further, the specific analysis process for establishing the document quality assessment index model is as follows: the document information comprises the number of times the document is cited, publisher information data and document originality data; the document metadata comprises document title data, document author information data and document keyword data; the number of times cited, the publisher information data and the document originality data of each document are obtained from a document database; data processing is performed on the number of times cited and the originality data of each document, and repeated data in them is eliminated to obtain first cleaning data; the average of the first cleaning data is calculated to obtain a first average value; abnormal data in the first cleaning data is identified and replaced by the first average value to obtain second cleaning data, which is taken as the cleaned number of times cited and the cleaned document originality data; the number of times cited, the publisher information data and the document originality data of each document are analyzed comprehensively, and a document quality assessment index model is established for obtaining the document quality assessment index.
Further, the specific analysis method for obtaining the high-quality literature comprises the following steps: based on a literature quality evaluation index model, acquiring the literature quality evaluation index of each literature, analyzing the literature quality evaluation index and obtaining a preset screening threshold value of the literature quality evaluation index; if the document quality assessment index is not lower than a preset screening threshold value, indicating that the document is a high-quality document; if the document quality assessment index is lower than a preset screening threshold value, indicating that the document is a low-quality document; and the high-quality documents and the low-quality documents are stored in a partitioned manner.
Further, the specific analysis process for determining the dependency relationships between the document metadata based on the Bayesian network comprises: taking the document title data as the first acquired document metadata and the document author information data as the second acquired document metadata, and analyzing them to obtain the document metadata relation evaluation index of the first and second acquired document metadata; taking the document author information data as the first acquired document metadata and the document keyword data as the second acquired document metadata, and analyzing them to obtain the document metadata relation evaluation index of the first and second acquired document metadata; analyzing, in combination with a Bayesian network model, the probability dependency relationships among the document title data, the document author information data and the document keyword data, and thereby determining whether the document metadata are interdependent; if the document metadata are interdependent, the document title data, the document author information data and the document keyword data come from the same document; if the document metadata are not interdependent, the document title data, the document author information data and the document keyword data come from different documents.
Further, the specific analysis process of the document metadata relation evaluation index is as follows: acquiring document title data, document author information data and document keyword data of all high-quality documents; constructing a document metadata relation evaluation index model formula, and calculating the document metadata relation evaluation index according to that formula. [The formula and its symbols are not preserved in this text.] The quantities the formula defines are: the document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata; the first acquired document metadata and the second acquired document metadata themselves; a constant; the number of a probability dependency evaluation index between document metadata; and the total number of probability dependency evaluation indexes between document metadata.
Further, the specific analysis process of the type to which the analysis document belongs is as follows: acquiring document title data, document author information data and document keyword data of a document to be learned from a document database; taking the document title data, the document author information data and the document keyword data of the document to be learned as input data of a document classification judgment model, and training to obtain the document classification judgment model for judging the type of the document; the document title data, the document author information data, and the document keyword data are input into a document classification judgment model, thereby obtaining the document type, and classification storage of the same type of documents is performed.
Further, the specific analysis process for constructing the document metadata estimated value model is as follows: if missing metadata is detected when the acquired document metadata is input into the document metadata relation evaluation index model formula, a document metadata estimated value model is constructed; known document metadata is acquired and stored in a known-metadata data set; and the document metadata estimated value is calculated with the document metadata estimated value model formula. [The formula and its symbols are not preserved in this text.] The quantities the formula defines are: the document metadata estimated value; the first and second acquired document metadata of the known-metadata data set; the correction factor of the acquired document metadata; the circumference ratio (pi); and the natural constant.
Further, the specific analysis method for detecting missing metadata comprises: inputting the acquired document metadata into the document metadata relation evaluation index model formula; if the calculation result of the formula is 0, metadata is missing; if the calculation result of the formula is not 0, no metadata is missing.
Further, the specific analysis method for constructing the document knowledge graph comprises: obtaining, from a document information database, the document metadata of all high-quality documents in the high-quality document storage area, and analyzing it to obtain a sample document metadata average estimated value; and judging the similarity of the document metadata, associating similar document metadata, and constructing the document knowledge graph.
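The graph-construction step above (judge similarity, then associate similar metadata) can be illustrated with a minimal sketch. The text does not state how similarity is judged, so the Jaccard overlap of keyword sets, the `min_similarity` cutoff and the function names below are assumptions chosen for illustration, not the patented calculation.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two keyword collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_graph(docs, min_similarity=0.5):
    """Link every pair of documents whose keyword similarity reaches the cutoff.

    docs maps a document id to its keyword list; the result is an
    undirected adjacency map {doc_id: set of linked doc_ids}.
    """
    graph = {doc_id: set() for doc_id in docs}
    for (id_a, kw_a), (id_b, kw_b) in combinations(docs.items(), 2):
        if jaccard(kw_a, kw_b) >= min_similarity:
            graph[id_a].add(id_b)
            graph[id_b].add(id_a)
    return graph

# Hypothetical high-quality documents and their keyword metadata.
docs = {
    "d1": ["bayesian", "metadata", "storage"],
    "d2": ["bayesian", "metadata"],
    "d3": ["chemistry"],
}
graph = build_graph(docs)
```

An adjacency map keeps the sketch dependency-free; a production system would more likely store the associations in a graph database.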
The embodiment of the application provides a document metadata storage system based on a Bayesian network, which comprises the following modules. Document quality assessment module: used for acquiring document information, establishing a document quality assessment index model, evaluating the document quality level and obtaining high-quality documents. Document classification storage module: used for acquiring the metadata of all high-quality documents, establishing a document metadata relation evaluation index model, determining the dependency relationships among the nodes of the Bayesian network in combination with a Bayesian algorithm, analyzing the type to which each document belongs, and storing documents of the same type by class. Missing document metadata prediction module: used for constructing a document metadata estimated value model and inferring the value of the missing metadata if the document metadata is incomplete. Document knowledge graph construction module: used for constructing a document knowledge graph according to the similarity of the stored high-quality document metadata if the document metadata is complete.
One or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
1. Known document metadata is acquired and stored in a known-metadata data set, and a predicted value of the document metadata is calculated by inference from the existing data using techniques such as a Bayesian network. Missing metadata is thereby inferred and assigned when the given document metadata is incomplete, which effectively solves the prior-art problem that missing metadata cannot be inferred when document metadata is missing.
2. The acquired document metadata is input into the document metadata relation evaluation index model formula to judge its integrity: if the calculation result of the formula is 0, metadata is missing; if the calculation result is not 0, no metadata is missing. Whether metadata is missing is thereby detected, and whether the given document metadata is complete is judged.
3. The document title data, document author information data and document keyword data of all high-quality documents are acquired, and a document metadata relation evaluation index model formula is constructed. With the document title data as the first acquired document metadata and the document author information data as the second acquired document metadata, the document metadata relation evaluation index of the two is obtained by analysis, and the probability dependency relationships among the document title data, the document author information data and the document keyword data are analyzed, thereby determining whether the document metadata are interdependent.
Drawings
FIG. 1 is a flowchart of a document metadata storage method based on a Bayesian network according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a document metadata storage system based on a Bayesian network according to an embodiment of the present application.
Detailed Description
By providing the document metadata storage method and system based on a Bayesian network, the embodiment of the application solves the problem that the prior art cannot infer missing metadata when document metadata is missing. Known document metadata is acquired and stored in a known-metadata data set, and the estimated value of the document metadata is calculated by inference from the existing data using Bayesian network technology, thereby realizing the inference of missing metadata when document metadata is missing.
The technical scheme in the embodiment of the application aims to infer missing metadata when document metadata is missing, and the overall idea is as follows:
The number of times each document is cited, the publisher information data and the document originality data are acquired, and data processing is performed on the number of times cited and the originality data. These three kinds of data are analyzed comprehensively to establish a document quality assessment index model, the document quality assessment index of each document is obtained, and high-quality and low-quality documents are stored in separate partitions. The document title data, document author information data and document keyword data of all high-quality documents are acquired, a document metadata relation evaluation index model formula is constructed, and the document metadata relation evaluation index is calculated according to it; with the document author information data as the first acquired document metadata and the document keyword data as the second acquired document metadata, the relation evaluation index of the two is obtained by analysis. In combination with a Bayesian network model, the probability dependency relationships among the document title data, the document author information data and the document keyword data are analyzed, and whether the document metadata are interdependent is determined from them. If missing metadata is detected during this process, known document metadata is acquired and stored in a known-metadata data set, a document metadata estimated value model is constructed, and the document metadata estimated value is calculated. The document metadata of all high-quality documents in the high-quality document storage area is then acquired, a sample document metadata average estimated value is obtained by analysis, the similarity of the document metadata is judged, and a document knowledge graph is constructed. Missing metadata is thereby inferred and assigned when the given document metadata is incomplete.
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
FIG. 1 shows a flowchart of a document metadata storage method based on a Bayesian network according to an embodiment of the present application. The method is applied to a document metadata storage system based on a Bayesian network and includes the following steps. Document quality assessment: acquiring document information, establishing a document quality assessment index model, evaluating the document quality level and obtaining high-quality documents. Document classification storage: acquiring the metadata of all high-quality documents, establishing a document metadata relation evaluation index model, determining the dependency relationships between the document metadata based on the Bayesian network in combination with a Bayesian algorithm, analyzing the type to which each document belongs, and storing documents of the same type by class. Missing document metadata prediction: if the document metadata is incomplete, constructing a document metadata estimated value model and inferring the value of the missing metadata. Document knowledge graph construction: if the document metadata is complete, constructing a document knowledge graph according to the similarity of the stored high-quality document metadata.
Further, the specific analysis process for establishing the document quality assessment index model is as follows: the document information includes the number of times the document is cited, publisher information data and document originality data; the document metadata includes document title data, document author information data and document keyword data; the number of times cited, the publisher information data and the document originality data of each document are obtained from a document database; data processing is performed on the number of times cited and the originality data of each document, and repeated data in them is eliminated to obtain first cleaning data; the average of the first cleaning data is calculated to obtain a first average value; abnormal data in the first cleaning data is identified and replaced by the first average value to obtain second cleaning data, which is taken as the cleaned number of times cited and the cleaned document originality data; the number of times cited, the publisher information data and the document originality data of each document are analyzed comprehensively, and a document quality assessment index model is established for obtaining the document quality assessment index.
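The cleaning steps just described (eliminate repeated data, compute a first average, replace abnormal data with that average) can be sketched as follows. The text does not say how abnormal data is identified, so the z-score rule, the `z_cutoff` parameter and the `(document_id, value)` record layout are assumptions made for illustration.

```python
from statistics import mean, pstdev

def clean_records(records, z_cutoff=2.0):
    """Return (first_average, second_cleaning_data) for (doc_id, value) records."""
    # Step 1: eliminate repeated records, keeping first occurrences in order.
    seen = set()
    first_cleaning = []
    for record in records:
        if record not in seen:
            seen.add(record)
            first_cleaning.append(record)

    # Step 2: first average value over the first cleaning data.
    values = [value for _, value in first_cleaning]
    first_average = mean(values)

    # Step 3: flag abnormal values (assumed rule: z-score above the cutoff)
    # and replace each of them with the first average value.
    spread = pstdev(values)

    def is_abnormal(value):
        return spread > 0 and abs(value - first_average) / spread > z_cutoff

    second_cleaning = [
        (doc_id, first_average if is_abnormal(value) else value)
        for doc_id, value in first_cleaning
    ]
    return first_average, second_cleaning

# Hypothetical citation counts; ("d2", 12) is repeated and 500 is an outlier.
records = [("d1", 10), ("d2", 12), ("d2", 12), ("d3", 11),
           ("d4", 9), ("d5", 10), ("d6", 500)]
first_average, cleaned = clean_records(records)
```

Because the first average is computed before the outlier is replaced, the replacement value is itself pulled toward the outlier; this follows the order of steps given in the text.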
In this embodiment, the document information includes, but is not limited to, the number of times the document is cited, publisher information data and document originality data; the document data source, document structure, language expression capability and the like may also be considered. The document metadata includes, but is not limited to, document title data, document author information data and document keyword data; the document abstract, the document publication date, the institution to which the document author belongs and the like may also be considered. Besides a machine learning algorithm such as a random forest, the document quality assessment index can be obtained by a more accurate calculation method, specifically: constructing a document quality assessment index model formula and calculating the document quality assessment index. The index helps personnel quickly identify higher-quality documents among massive numbers of documents, saving time and energy, and helps assess the influence and importance of a document, so that personnel can understand the standing and degree of influence of a given document in its academic field. [The formula and its symbols are not preserved in this text.] The quantities the formula defines are: the document quality assessment index; the number of times the document is cited, the publisher information data and the document originality data; the allowable deviation values of those three quantities, which are extracted from the document information database; the weight ratios of those three quantities; and the natural constant. The allowable deviation values are introduced because completely accurate data is rare in actual calculation; they account for errors and uncertainty in the data, so that the calculated document quality assessment index is more reliable.
Further, the specific analysis method for obtaining the high-quality literature comprises the following steps: based on a literature quality evaluation index model, acquiring the literature quality evaluation index of each literature, analyzing the literature quality evaluation index and obtaining a preset screening threshold value of the literature quality evaluation index; if the document quality assessment index is not lower than the preset screening threshold value, the document is indicated to be a high-quality document; if the document quality assessment index is lower than a preset screening threshold value, indicating that the document is a low-quality document; and the high-quality documents and the low-quality documents are stored in a partitioned manner.
In this embodiment, by summing up and averaging the document quality evaluation indexes to obtain a preset screening threshold of the document quality evaluation indexes, multiple evaluation indexes can be comprehensively considered to obtain a comprehensive document quality evaluation index, and thus a preset screening threshold is set, high-quality documents and low-quality documents can be distinguished according to the preset screening threshold, thereby helping personnel to screen documents more effectively so as to better manage and utilize document resources.
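The screening rule in this embodiment (threshold equal to the average of the quality indexes; an index not lower than the threshold marks a high-quality document) can be sketched as below. The function name and the sample index values are hypothetical.

```python
from statistics import mean

def partition_by_quality(quality_index):
    """Split documents into high- and low-quality partitions.

    quality_index maps a document id to its quality assessment index.
    The threshold is the mean of all indexes, as in the embodiment.
    """
    threshold = mean(quality_index.values())
    # "Not lower than the threshold" means an index equal to it is high quality.
    high = {doc for doc, q in quality_index.items() if q >= threshold}
    low = set(quality_index) - high
    return threshold, high, low

# Hypothetical indexes; "b" sits exactly on the threshold.
threshold, high, low = partition_by_quality({"a": 0.75, "b": 0.5, "c": 0.25})
```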
Further, the specific analysis process for determining the dependency relationship between the literature metadata based on the Bayesian network comprises the steps of taking the literature header data as first acquired literature metadata, taking the literature author information data as second acquired literature metadata, and analyzing to obtain a literature metadata relationship evaluation index of the first acquired literature metadata and the second acquired literature metadata; taking the document author information data as first acquired document metadata, taking the document keyword data as second acquired document metadata, and analyzing to obtain a document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata; analyzing probability dependency relations among document title data, document author information data and document keyword data by combining a Bayesian network model, and accordingly analyzing whether document metadata are interdependent; if the document metadata are interdependent, the document title data, the document author information data and the document keyword data come from the same document; if the document metadata are not interdependent, the document title data, the document author information data, and the document keyword data are described as coming from different documents.
In this embodiment, the bayesian network is a probabilistic graph model for representing the dependency relationship between variables, which can be used to describe the dependency relationship between document metadata, including whether document title data, author information data and keyword data are derived from the same document, for example, probability inference is performed by using the bayesian network, and a document metadata relationship evaluation index is obtained according to known document author information and document keyword data, so as to determine the probability distribution of the title data of the document, and statistical methods, such as chi-square test, are used to compare the probability distributions of the title data of different documents, and if there is a significant difference in the probability distributions, it is indicated that there is a dependency relationship between them; conversely, if their probability distributions tend to agree, this indicates that they are not interdependent.
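The probabilistic inference mentioned in this embodiment (using known author and keyword data to obtain a probability distribution over title data) can be illustrated with a small discrete network. The sketch assumes the simplest structure, in which a title term is the parent of the author and keyword nodes (a naive Bayes network), with conditional probability tables estimated from counts and Laplace smoothing. The structure, the field names and the sample records are assumptions for illustration, not the model formula of the patent.

```python
from collections import Counter, defaultdict

def fit(records, target, evidence):
    """Estimate count tables for a network with edges target -> each evidence field."""
    prior = Counter(r[target] for r in records)
    cond = {e: defaultdict(Counter) for e in evidence}
    for r in records:
        for e in evidence:
            cond[e][r[target]][r[e]] += 1
    return prior, cond

def posterior(prior, cond, observed, alpha=1.0):
    """Return P(target | observed) via Bayes' rule with Laplace smoothing alpha."""
    total = sum(prior.values())
    scores = {}
    for t, n_t in prior.items():
        p = n_t / total  # prior P(target = t)
        for e, value in observed.items():
            vocab = {v for counts in cond[e].values() for v in counts}
            # Smoothed likelihood P(evidence = value | target = t).
            p *= (cond[e][t][value] + alpha) / (n_t + alpha * len(vocab))
        scores[t] = p
    norm = sum(scores.values())
    return {t: s / norm for t, s in scores.items()}

# Hypothetical complete metadata records used as training data.
records = [
    {"author": "author_a", "keyword": "inference", "title_term": "bayesian"},
    {"author": "author_a", "keyword": "graph", "title_term": "bayesian"},
    {"author": "author_b", "keyword": "database", "title_term": "storage"},
    {"author": "author_b", "keyword": "index", "title_term": "storage"},
    {"author": "author_a", "keyword": "index", "title_term": "storage"},
]
prior, cond = fit(records, "title_term", ["author", "keyword"])
dist = posterior(prior, cond, {"author": "author_a", "keyword": "inference"})
best = max(dist, key=dist.get)
```

The most probable value (`best`) could then be assigned to a missing title field; a richer network structure would need a dedicated library or a structure-learning step.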
Further, the specific analysis process of the document metadata relation evaluation index is as follows: acquiring document title data, document author information data and document keyword data of all high-quality documents; constructing a document metadata relation evaluation index model formula, and calculating a document metadata relation evaluation index according to the document metadata relation evaluation index model formula; the literature metadata relation evaluation index model formula is as follows: in the above, the ratio of/> Expressed as/>First acquisition literature metadata/>And second acquisition literature metadata/>Index is evaluated according to the literature metadata relationship of (1)/>/>Respectively expressed as first acquired document metadata and second acquired document metadata,/>Is constant,/>Number expressed as probability dependency evaluation index between document metadata,/>,/>Expressed as the total number of probability dependency evaluation indexes between document metadata.
In this embodiment, the document metadata relationship evaluation index can be used to determine the relationships between document metadata, helping personnel better understand those relationships, evaluate the relevance of documents, and support academic research and decision-making. The constant in the formula avoids the instability that would arise when the denominator is 0. The analysis of dependency relationships between document metadata is carried out on the basis of complete document metadata.
Further, the specific process of analyzing the type to which a document belongs is as follows: acquire the document title data, document author information data, and document keyword data of the documents to be learned from the document database; use these data as the input of a document classification judgment model, and train the model so that it can judge the type to which a document belongs; then input the document title data, document author information data, and document keyword data of a document into the trained document classification judgment model to obtain the document type, and store documents of the same type together.
In this embodiment, document types include journal articles, conference papers, monographs, and the like. Storing documents of the same type together lets users search more conveniently and locate documents of the required type directly, which improves retrieval efficiency and accuracy. It also allows different types of documents to be managed in a targeted way, which facilitates version control, updating, and maintenance. The collected document metadata is preprocessed through text cleaning, word segmentation, stop-word removal, and similar operations, and the document types are labeled or encoded so that the model can be trained by supervised learning. A recurrent neural network model is selected for the text classification task: the preprocessed document metadata is input into the recurrent neural network model for training with the labeled document types as the supervision signal, and the model parameters are adjusted and optimized according to the evaluation results, so that the document type is judged automatically.
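The preprocessing and label-encoding steps above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list, the function names `preprocess` and `encode_labels`, and the sample inputs are hypothetical, and the regular expression handles ASCII text only (Chinese metadata would need a dedicated word segmenter). The recurrent neural network itself is not shown; the output of this step would be its training input.

```python
import re

# Illustrative stop-word list (an assumption, not taken from the patent).
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "on", "in", "based"}


def preprocess(text):
    """Clean the text, split it into tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def encode_labels(labels):
    """Map each document type to an integer id for supervised learning."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return mapping, [mapping[label] for label in labels]


# Hypothetical title and document types.
tokens = preprocess("A Method for Document Storage, Based on the Bayesian Network!")
mapping, encoded = encode_labels(["journal", "conference", "monograph", "journal"])
```

Sorting the label set before numbering keeps the mapping stable across runs, which matters when a trained model is reloaded later.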
Further, the specific analysis process for constructing the document metadata predicted value model is as follows. The document metadata predicted value can be obtained through analysis by a document information platform, or through the more accurate calculation method below: if missing metadata is detected when the acquired document metadata is input into the document metadata relationship evaluation index model formula, a document metadata predicted value model is constructed; known document metadata is acquired and stored in a known metadata set; and the document metadata predicted value is calculated and used to fill in the missing document metadata, making the document data more complete and comprehensive. Complete documents help personnel carry out deeper data analysis and mining and discover potential associations and patterns. The document metadata predicted value model formula is as follows: [formula not reproduced in this text]. In the formula, the result is the document metadata predicted value; the remaining terms are the first acquired document metadata and the second acquired document metadata of the known metadata set, the correction factor of the acquired document metadata, the constant π (the ratio of a circle's circumference to its diameter), and the natural constant e.
In this embodiment, when a document's metadata is missing, the missing values can be predicted and filled in with the document metadata predicted value model formula, making the document data more complete and comprehensive. Because document metadata may become inaccurate when the document itself is updated, revised, or changed, a correction factor is needed to correct the document metadata.
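As an illustration of inferring a missing field from a known metadata set, the sketch below uses a two-node Bayesian network (author as parent, keyword as child) with a conditional probability table estimated by counting. This is a stand-in technique, not the patent's predicted value model formula, which is not reproduced in this text; the function names, the Laplace smoothing parameter `alpha`, and the sample records are all assumptions.

```python
from collections import Counter, defaultdict


def fit_conditional_table(records, parent, child):
    """Count how often each child value occurs with each parent value."""
    table = defaultdict(Counter)
    for record in records:
        if record.get(parent) and record.get(child):
            table[record[parent]][record[child]] += 1
    return table


def infer_missing(table, parent_value, alpha=1.0):
    """Return the most probable child value and its smoothed probability."""
    child_values = sorted({value for counts in table.values() for value in counts})
    if not child_values:
        raise ValueError("no complete records to learn from")
    counts = table.get(parent_value, Counter())
    total = sum(counts.values()) + alpha * len(child_values)
    posterior = {
        value: (counts.get(value, 0) + alpha) / total for value in child_values
    }
    best = max(child_values, key=lambda value: posterior[value])
    return best, posterior[best]


# Hypothetical known metadata set (complete records only).
known_metadata = [
    {"author": "author_1", "keyword": "metadata"},
    {"author": "author_1", "keyword": "metadata"},
    {"author": "author_1", "keyword": "ontology"},
    {"author": "author_2", "keyword": "ontology"},
]

table = fit_conditional_table(known_metadata, parent="author", child="keyword")
# A document by author_1 whose keyword field is missing.
predicted_keyword, probability = infer_missing(table, "author_1")
```

Laplace smoothing keeps every candidate value at a non-zero probability, so an author unseen in the known set still yields a (uniform) prediction instead of an error.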
Further, the specific analysis method for detecting missing metadata is as follows: input the acquired document metadata into the document metadata relationship evaluation index model formula; if the calculation result is 0, metadata is missing; if the calculation result is not 0, no metadata is missing.
In this embodiment, when an item of acquired document metadata is missing, its value is 0, and substituting it into the document metadata relationship evaluation index model formula yields 0, so it can be determined that the document lacks metadata. Conversely, if the result of the document metadata relationship evaluation index model formula is not 0, both the first acquired document metadata and the second acquired document metadata are non-zero, so the document does not lack metadata.
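The zero-result check can be sketched as below. Because the relationship evaluation index formula itself is not reproduced in this text, the sketch substitutes a hypothetical stand-in (a simple product) that shares only the property the paragraph relies on: the result is 0 exactly when one of the two inputs is 0. The numeric encoding of a field by its text length, the field names, and the function names are likewise assumptions. The field pairs follow the pairing stated in the claims (title with author, author with keywords).

```python
def encode_field(value):
    """Encode a metadata field as a number: 0 if missing, else text length.

    The length encoding is an illustrative assumption.
    """
    if value is None:
        return 0
    return len(str(value).strip())


def stand_in_index(first, second):
    """Hypothetical stand-in for the relationship evaluation index.

    A product is 0 exactly when either encoded input is 0.
    """
    return encode_field(first) * encode_field(second)


def has_missing_metadata(record, fields=("title", "author", "keywords")):
    """Check consecutive field pairs; a zero index flags missing metadata."""
    pairs = zip(fields, fields[1:])
    return any(
        stand_in_index(record.get(first), record.get(second)) == 0
        for first, second in pairs
    )


# Hypothetical records.
complete = {
    "title": "Document metadata storage",
    "author": "author_1",
    "keywords": "metadata; Bayesian network",
}
blank_author = dict(complete, author="")
no_keywords = {"title": complete["title"], "author": complete["author"]}

results = [
    has_missing_metadata(complete),
    has_missing_metadata(blank_author),
    has_missing_metadata(no_keywords),
]
```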
Further, the specific analysis method for constructing the document knowledge graph is as follows: obtain the document metadata of all high-quality documents in the high-quality document storage area from the document information database, and analyze it to obtain the sample document metadata average estimated value; judge the similarity of the document metadata, associate similar document metadata, and construct the document knowledge graph.
In this embodiment, a sample document metadata average estimated value model formula is constructed, and the sample document metadata average estimated value is calculated from it. The average estimated value reflects the average level of the document metadata and provides a benchmark and reference for evaluating document metadata similarity, which helps in analyzing and comparing the similarity between documents and in understanding and using the document data. The sample document metadata average estimated value model formula is as follows: [formula not reproduced in this text]. In the formula, the result is the sample document metadata average estimated value; the remaining terms are the individual sample document metadata, their index, and the total number of sample document metadata. The document metadata similarity evaluation index can be obtained with a text similarity algorithm such as the TF-IDF method, or with the more accurate calculation method below. A document metadata similarity evaluation index model formula is constructed; it measures the degree of similarity between documents and helps users understand how closely documents are associated, in support of information retrieval and knowledge graph construction. The document metadata similarity evaluation index is then calculated. The document metadata similarity evaluation index model formula is as follows: [formula not reproduced in this text]. In the formula, the result is the document metadata similarity evaluation index. The sample document metadata average estimated value serves as the center data of the knowledge graph: the higher the document metadata similarity evaluation index, the closer the document metadata is to the center data of the knowledge graph, and the lower the index, the farther away it is. Two of the terms denote the comparison document and the sample document, where the sample documents are all high-quality documents in the high-quality document storage area obtained from the document information database, from which the document knowledge graph is constructed. A further term denotes the comparison document metadata, and the constant is included to avoid the instability that would arise when the denominator is 0.
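The text names TF-IDF as one way to obtain a similarity index. The sketch below illustrates that alternative only, not the patent's own similarity formula: it builds TF-IDF vectors, compares them with cosine similarity, and links documents whose similarity reaches a threshold, giving the edges of a small graph. The smoothed IDF variant, the threshold value, the function names, and the sample documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (dict) per tokenized document."""
    n_docs = len(token_lists)
    document_frequency = Counter()
    for tokens in token_lists:
        document_frequency.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        length = len(tokens) or 1
        vectors.append({
            # Smoothed IDF: ln((1 + N) / (1 + df)) + 1 (an assumed variant).
            term: (count / length)
            * (math.log((1 + n_docs) / (1 + document_frequency[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_similar(doc_ids, vectors, threshold):
    """Return (id_a, id_b, similarity) edges for pairs at or above threshold."""
    edges = []
    for i in range(len(doc_ids)):
        for j in range(i + 1, len(doc_ids)):
            similarity = cosine_similarity(vectors[i], vectors[j])
            if similarity >= threshold:
                edges.append((doc_ids[i], doc_ids[j], similarity))
    return edges


# Hypothetical tokenized keyword data for three documents.
doc_ids = ["d1", "d2", "d3"]
token_lists = [
    ["bayesian", "network", "metadata"],
    ["bayesian", "network", "metadata"],
    ["sewage", "treatment"],
]
vectors = tfidf_vectors(token_lists)
edges = link_similar(doc_ids, vectors, threshold=0.5)
```

Pairwise linking is quadratic in the number of documents; comparing each document only against a single center vector, as the text describes for the average estimated value, would be linear.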
As shown in fig. 2, a schematic structural diagram of the Bayesian network-based document metadata storage system provided by an embodiment of the present application, the system includes: a document quality assessment module, used to acquire document information, establish a document quality evaluation index model, evaluate the document quality level, and obtain high-quality documents; a document classification storage module, used to acquire the metadata of all high-quality documents, establish a document metadata relationship evaluation index model, determine the dependency relationships between the nodes of the Bayesian network in combination with a Bayesian algorithm, analyze the type to which each document belongs, and store documents of the same type together; a document missing data prediction module, used to construct a document metadata predicted value model and infer the values of the missing metadata if the document metadata is incomplete; and a document knowledge graph construction module, used to construct a document knowledge graph according to the similarity of the stored high-quality document metadata if the document metadata is complete.
The technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages. Relative to the metadata-based patent document data storage method, device, and storage medium disclosed in publication No. CN116975068A, the embodiments of the present application judge the integrity of document metadata by inputting the acquired document metadata into the document metadata relationship evaluation index model formula: a calculation result of 0 indicates that metadata is missing, and a non-zero result indicates that no metadata is missing, so that missing metadata is detected and it is determined whether the given document metadata is complete. Relative to the prior art cited by announcement number [number missing in the source], the embodiments of the present application acquire known document metadata and store it in a known metadata set, then calculate the document metadata predicted value by inference and prediction from the existing data using technologies such as the Bayesian network, so that missing metadata is inferred and assigned when the given document metadata is incomplete.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (7)

1. A method for storing document metadata based on a bayesian network, comprising the steps of:
acquiring document information, establishing a document quality evaluation index model, evaluating the document quality level, and obtaining high-quality documents;
acquiring the metadata of all high-quality documents, establishing a document metadata relationship evaluation index model, determining the dependency relationships between the document metadata based on the Bayesian network in combination with a Bayesian algorithm, analyzing the type to which each document belongs, and storing documents of the same type together;
if the document metadata is incomplete, constructing a document metadata predicted value model and inferring the values of the missing metadata;
if the document metadata is complete, constructing a document knowledge graph according to the similarity of the stored high-quality document metadata;
The specific analysis process for determining the dependency relationship between the literature metadata based on the Bayesian network comprises the following steps:
taking the document title data as first acquired document metadata, taking the document author information data as second acquired document metadata, and analyzing to obtain a document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata;
Taking the document author information data as first acquired document metadata, taking the document keyword data as second acquired document metadata, and analyzing to obtain a document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata;
Analyzing probability dependency relations among document title data, document author information data and document keyword data by combining a Bayesian network model, and accordingly analyzing whether document metadata are interdependent;
if the document metadata are interdependent, the document title data, the document author information data and the document keyword data come from the same document;
if the document metadata are not interdependent, the document title data, the document author information data and the document keyword data come from different documents;
The specific analysis process of the document metadata relation evaluation index comprises the following steps:
Acquiring document title data, document author information data and document keyword data of all high-quality documents;
constructing a document metadata relation evaluation index model formula, and calculating a document metadata relation evaluation index according to the document metadata relation evaluation index model formula;
the document metadata relationship evaluation index model formula is as follows: [formula not reproduced in this text];
in the formula, the result is expressed as the document metadata relationship evaluation index of the first acquired document metadata and the second acquired document metadata, and the other terms are expressed as the first acquired document metadata and the second acquired document metadata, a constant, the number of the probability dependency evaluation index between document metadata, and the total number of probability dependency evaluation indexes between document metadata;
the specific analysis process for constructing the document metadata predicted value model comprises the following steps:
if missing metadata is detected when the acquired document metadata is input into the document metadata relationship evaluation index model formula, constructing a document metadata predicted value model;
acquiring known document metadata and storing it in a known metadata set;
calculating a document metadata predicted value, wherein the document metadata predicted value model formula is as follows: [formula not reproduced in this text];
in the formula, the result is expressed as the document metadata predicted value, and the other terms are expressed as the first acquired document metadata and the second acquired document metadata of the known metadata set, the correction factor of the acquired document metadata, the constant π, and the natural constant e.
2. The Bayesian network-based document metadata storage method according to claim 1, wherein the specific analysis process for establishing the document quality evaluation index model is as follows:
the document information comprises the number of times the document is cited, publisher information data, and document originality data;
the document metadata comprises document title data, document author information data, and document keyword data;
obtaining the number of times each document is cited, the publisher information data, and the document originality data from a document database;
processing the citation counts and document originality data of each document, and eliminating repeated data from them to obtain first cleaning data;
calculating the average of the first cleaning data to obtain a first average value;
identifying abnormal data in the first cleaning data and replacing the abnormal data with the first average value;
obtaining second cleaning data, and taking the second cleaning data as the cleaned citation counts and document originality data;
comprehensively analyzing the citation counts, publisher information data, and document originality data of each document, and establishing a document quality evaluation index model for obtaining a document quality evaluation index.
3. The bayesian network-based document metadata storage method according to claim 2, wherein the specific analysis method for obtaining the high-quality document is as follows:
Based on a literature quality evaluation index model, acquiring the literature quality evaluation index of each literature, analyzing the literature quality evaluation index and obtaining a preset screening threshold value of the literature quality evaluation index;
If the document quality assessment index is not lower than a preset screening threshold value, indicating that the document is a high-quality document;
If the document quality assessment index is lower than a preset screening threshold value, indicating that the document is a low-quality document;
And the high-quality documents and the low-quality documents are stored in a partitioned manner.
4. The bayesian network-based document metadata storage method according to claim 2, wherein the specific analysis process of the type to which the analysis document belongs is:
acquiring document title data, document author information data and document keyword data of a document to be learned from a document database;
Taking the document title data, the document author information data and the document keyword data of the document to be learned as input data of a document classification judgment model, and training to obtain the document classification judgment model for judging the type of the document;
the document title data, the document author information data, and the document keyword data are input into a document classification judgment model, thereby obtaining the document type, and classification storage of the same type of documents is performed.
5. The bayesian network-based document metadata storage method according to claim 1, wherein the specific analysis method for detecting missing metadata is as follows:
Inputting the acquired document metadata into a document metadata relationship evaluation index model formula;
if the calculation result of the document metadata relationship evaluation index model formula is 0, metadata is missing;
if the calculation result of the document metadata relationship evaluation index model formula is not 0, no metadata is missing.
6. The method for storing document metadata based on bayesian network according to claim 1, wherein the specific analysis method for constructing the document knowledge graph is as follows:
acquiring the document metadata of all high-quality documents in the high-quality document storage area, and analyzing it to obtain the sample document metadata average estimated value;
judging the similarity of the document metadata, associating similar document metadata, and constructing the document knowledge graph.
7. A bayesian network-based document metadata storage system, the bayesian network-based document metadata storage system comprising:
a document quality assessment module, used to acquire document information, establish a document quality evaluation index model, evaluate the document quality level, and obtain high-quality documents;
a document classification storage module, used to acquire the metadata of all high-quality documents, establish a document metadata relationship evaluation index model, determine the dependency relationships between the nodes of the Bayesian network in combination with a Bayesian algorithm, analyze the type to which each document belongs, and store documents of the same type together;
a document missing data prediction module, used to construct a document metadata predicted value model and infer the values of the missing metadata if the document metadata is incomplete;
a document knowledge graph construction module, used to construct a document knowledge graph according to the similarity of the stored high-quality document metadata if the document metadata is complete;
The specific analysis process for determining the dependency relationship between the literature metadata based on the Bayesian network comprises the following steps:
taking the document title data as first acquired document metadata, taking the document author information data as second acquired document metadata, and analyzing to obtain a document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata;
Taking the document author information data as first acquired document metadata, taking the document keyword data as second acquired document metadata, and analyzing to obtain a document metadata relation evaluation index of the first acquired document metadata and the second acquired document metadata;
Analyzing probability dependency relations among document title data, document author information data and document keyword data by combining a Bayesian network model, and accordingly analyzing whether document metadata are interdependent;
if the document metadata are interdependent, the document title data, the document author information data and the document keyword data come from the same document;
if the document metadata are not interdependent, the document title data, the document author information data and the document keyword data come from different documents;
The specific analysis process of the document metadata relation evaluation index comprises the following steps:
Acquiring document title data, document author information data and document keyword data of all high-quality documents;
constructing a document metadata relation evaluation index model formula, and calculating a document metadata relation evaluation index according to the document metadata relation evaluation index model formula;
the document metadata relationship evaluation index model formula is as follows: [formula not reproduced in this text];
in the formula, the result is expressed as the document metadata relationship evaluation index of the first acquired document metadata and the second acquired document metadata, and the other terms are expressed as the first acquired document metadata and the second acquired document metadata, a constant, the number of the probability dependency evaluation index between document metadata, and the total number of probability dependency evaluation indexes between document metadata;
the specific analysis process for constructing the document metadata predicted value model comprises the following steps:
if missing metadata is detected when the acquired document metadata is input into the document metadata relationship evaluation index model formula, constructing a document metadata predicted value model;
acquiring known document metadata and storing it in a known metadata set;
calculating a document metadata predicted value, wherein the document metadata predicted value model formula is as follows: [formula not reproduced in this text];
in the formula, the result is expressed as the document metadata predicted value, and the other terms are expressed as the first acquired document metadata and the second acquired document metadata of the known metadata set, the correction factor of the acquired document metadata, the constant π, and the natural constant e.
CN202410298022.4A 2024-03-15 2024-03-15 Document metadata storage method and system based on Bayesian network Active CN117891959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410298022.4A CN117891959B (en) 2024-03-15 2024-03-15 Document metadata storage method and system based on Bayesian network

Publications (2)

Publication Number Publication Date
CN117891959A CN117891959A (en) 2024-04-16
CN117891959B true CN117891959B (en) 2024-05-10

Family

ID=90641568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410298022.4A Active CN117891959B (en) 2024-03-15 2024-03-15 Document metadata storage method and system based on Bayesian network

Country Status (1)

Country Link
CN (1) CN117891959B (en)

Also Published As

Publication number Publication date
CN117891959A (en) 2024-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant