CN117785862B - Biological safety database extraction verification method and system - Google Patents

Publication number
CN117785862B
Authority
CN
China
Prior art keywords
data
extractor
calibration
acquiring
rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410216976.6A
Other languages
Chinese (zh)
Other versions
CN117785862A (en
Inventor
肖娜
赵超
张兮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202410216976.6A priority Critical patent/CN117785862B/en
Publication of CN117785862A publication Critical patent/CN117785862A/en
Application granted granted Critical
Publication of CN117785862B publication Critical patent/CN117785862B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to the field of data extraction and verification, and in particular to a method and a system for extracting and verifying data from a biological safety database. The method comprises: acquiring data from the biological safety database, cleaning the data, and acquiring the quotation rate and abnormal conditions of the data in historical extractions; performing format conversion and data encoding on the data, obtaining the data classification and data form of the data, and obtaining the data within a set confidence threshold interval; and acquiring known reference data, comparing the data with the reference data, carrying out demonstration on the data, adding calibration words to the data according to the demonstration result, and generating a feedback mechanism. By verifying the data in multiple respects, the accuracy of the data is effectively improved.

Description

Biological safety database extraction verification method and system
Technical Field
The invention relates to the field of data extraction and verification, in particular to a method and a system for extracting and verifying a biological safety database.
Background
A biosafety database is a database system for storing, managing and analyzing biosafety-related data. Its data cover a wide range, including but not limited to animal epidemic diseases, plant epidemics, human infectious diseases, plant pests and invasive species, and reflect the influence of organisms and related factors on ecosystems and human health. Such data are an important basis for formulating policies and taking countermeasures, and can provide scientific and accurate data support for relevant departments and researchers to better cope with biosafety risks.
Currently, research institutions and similar bodies are generally responsible for constructing and managing biosafety databases so as to ensure the accuracy of the data. However, data in a biosafety database may be inaccurate, incomplete or inconsistent because of problems in data collection, processing or storage, which may affect the reliability of the data and the accuracy of analysis results. If the data are not sufficiently verified after extraction, the extracted data will be inaccurate and may lead to erroneous judgments and decisions. At present, data are extracted from biosafety databases according to actual needs and relevant regulations, and because of the limitations of existing verification methods, errors or anomalies may go undetected, so the extracted data are often inaccurate.
Disclosure of Invention
(1) Technical problem to be solved
The invention aims to provide a method and a system for extracting and verifying data from a biosafety database, so as to solve the problem that data extracted from a biosafety database has low accuracy. The method acquires data from the biosafety database, cleans the data, and acquires the quotation rate and abnormal conditions of the data in historical extractions; performs format conversion and data encoding on the data, obtains the data classification and data form of the data, and obtains the data within a set confidence threshold interval; acquires known reference data, compares the data with the reference data, carries out demonstration on the data, adds calibration words to the data according to the demonstration result, and generates a feedback mechanism. By verifying the data in multiple respects, the accuracy of the data is effectively improved.
(2) Technical solution
To achieve the above object, in one aspect, the present invention provides a biosafety database extraction verification method, the method comprising:
Acquiring data extracted from a biological safety database, and cleaning the data to obtain first data; the data cleaning is to delete data carrying specific calibration words and repeated data by using a data processing method; acquiring the quotation rate and abnormal conditions of the first data in historical extractions; the abnormal condition means that when extracted data is demonstrated to be an abnormal value, the data is marked with an abnormal calibration word; selecting, from the first data, the data whose quotation rate is greater than a set threshold, assigning a set weight to the data carrying the abnormal calibration word, and recording the result as second data;
Processing the second data to obtain third data; the data processing comprises data format conversion and data encoding; the data format conversion is to convert data in different formats into a preset unified format through a data format conversion method; the data coding is to code non-numerical data into numerical data by adopting a coding algorithm;
acquiring data classification and data form of third data; the data classification is to obtain fourth data from the third data through a data classification algorithm; the data form is obtained by descriptive statistical analysis of fourth data, and whether the fourth data obeys normal distribution is checked by a normal checking method; when normal distribution is obeyed, acquiring data in a set confidence threshold interval and recording the data as fifth data; when the normal distribution is not obeyed, the fifth data is equal to the fourth data;
Acquiring known reference data, comparing fifth data with the reference data, and deleting data with a difference value between the fifth data and the reference data larger than a set difference threshold value to obtain sixth data; carrying out demonstration on the sixth data to obtain a demonstration result, adding a calibration word to the sixth data according to the demonstration result, and feeding back to a database management center; when the database management center receives feedback, a data feedback mechanism is generated; if there is no known reference data, the sixth data is equal to the fifth data.
Further, the method for acquiring the quotation rate comprises the following steps:
Acquiring the extractor success rate and a characteristic attribute model of the first data, and acquiring a first quotation rate of the first data through the characteristic attribute model; the characteristic attributes comprise the unit attribute, time attribute and region attribute of the first data in historical extractions, and the characteristic attribute model establishes a mapping relation among the unit attribute, the time attribute, the region attribute and the data through a neural network algorithm; the unit attribute, time attribute and region attribute of the extractor are obtained and recorded as the extractor characteristics, and the extractor characteristics are input into the characteristic attribute model to obtain the first quotation rate; and the quotation rate is obtained from the extractor success rate and the first quotation rate through a linear weighting algorithm.
Further, the method for obtaining the extractor success rate comprises the following steps:
The success rate of the extractor is obtained through the prediction of a success model of the extractor, wherein the success model of the extractor is obtained through the establishment of the mapping relation among the understanding degree, the data retention rate, the retrieval frequency and the extractor through a neural network algorithm; obtaining understanding degree of the keywords through a professional level model, wherein the keywords are obtained by extracting retrieval information, and the professional level model establishes mapping relations among vocabulary difficulty, historical retrieval data conditions of extractors, professional backgrounds of extractors and the keywords through a neural network algorithm; acquiring historical data retention rate and historical search frequency of the historical extraction data of the extractor, and predicting the historical data retention rate and the historical search frequency through a time sequence prediction model; and inputting the understanding degree, the data retention rate and the search frequency into an extractor success model to obtain the extractor success rate.
Further, the method further comprises:
The calibration words added to the data according to the demonstration result comprise an abnormal calibration word, an error calibration word and a defect calibration word; when the database management center monitors that the data has the calibration word, a data feedback mechanism is generated; the data feedback mechanism is used for classifying the data with the calibration words according to the calibration words to generate corresponding information and feeding back the information to the data source provider, and the data source provider verifies the data with the calibration words according to the information and feeds back verification information to the database management center; when the database management center receives feedback verification information of the data source provider, whether the calibration words are removed or not is judged according to the verification information, management operation is carried out on the data, and notification information is generated and fed back to a history extractor of the data.
Further, the method further comprises:
When the database management center receives feedback verification information of the data source provider, a third party unit is introduced to verify the data and obtain a judging result; and according to the judging result, if the data accords with the condition of removing the calibration word, removing the data calibration word and performing management operation, wherein the management operation comprises deleting the data and correcting the data, and if the data does not accord with the condition of removing the calibration word, the management operation is to list the data into a deactivated database.
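The feedback mechanism and management operation described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the record fields (`id`, `provider`, `label`, `value`), the label strings, and the rule that a supplied corrected value selects "correct" rather than "delete" are hypothetical and are not specified by the text.

```python
from collections import defaultdict

# The three calibration words named in the text; the string values are assumed.
LABELS = ("abnormal", "error", "defect")


def build_feedback(records):
    """Group labelled records by calibration word, one message per provider and label."""
    grouped = defaultdict(list)
    for rec in records:
        if rec.get("label") in LABELS:
            grouped[(rec["provider"], rec["label"])].append(rec["id"])
    return [
        {"to": provider, "label": label, "record_ids": ids}
        for (provider, label), ids in sorted(grouped.items())
    ]


def manage(record, removal_ok, corrected_value=None):
    """Return (operation, record) after the third-party judging result.

    removal_ok: whether the data meets the condition for removing the label.
    corrected_value: assumed input; if given, the data is corrected,
    otherwise it is deleted.
    """
    if not removal_ok:
        # Condition not met: the data goes to the deactivated database.
        return "deactivate", record
    cleared = {**record, "label": None}
    if corrected_value is None:
        return "delete", cleared
    return "correct", {**cleared, "value": corrected_value}


records = [
    {"id": 1, "provider": "lab_a", "label": "error", "value": 0.53},
    {"id": 2, "provider": "lab_a", "label": "error", "value": 0.10},
    {"id": 3, "provider": "lab_b", "label": "abnormal", "value": 0.20},
    {"id": 4, "provider": "lab_b", "label": None, "value": 0.30},
]
messages = build_feedback(records)
```

Grouping by provider and label keeps one message per verification task, which matches the idea of classifying labelled data before notifying the data source provider.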
Based on the same inventive concept, the invention also provides a biosafety database extraction verification system, the system comprising:
The first data verification module is used for acquiring data extracted from the biological safety database, and cleaning the data to obtain first data; the data cleaning is to delete data carrying specific calibration words and repeated data by using a data processing method; acquiring the quotation rate and abnormal conditions of the first data in historical extractions; the abnormal condition means that when extracted data is demonstrated to be an abnormal value, the data is marked with an abnormal calibration word; selecting, from the first data, the data whose quotation rate is greater than a set threshold, assigning a set weight to the data carrying the abnormal calibration word, and recording the result as second data;
The second data verification module is used for obtaining third data through data processing of the second data; the data processing comprises data format conversion and data encoding; the data format conversion is to convert data in different formats into a preset unified format through a data format conversion method; the data coding is to code non-numerical data into numerical data by adopting a coding algorithm;
The third data verification module is used for acquiring data classification and data form of third data; the data classification is to obtain fourth data from the third data through a data classification algorithm; the data form is obtained by descriptive statistical analysis of fourth data, and whether the fourth data obeys normal distribution is checked by a normal checking method; when normal distribution is obeyed, acquiring data in a set confidence threshold interval and recording the data as fifth data; when the normal distribution is not obeyed, the fifth data is equal to the fourth data;
The data demonstration feedback module is used for acquiring known reference data, comparing fifth data with the reference data, and deleting data with a difference value between the fifth data and the reference data larger than a set difference threshold value to obtain sixth data; carrying out demonstration on the sixth data to obtain a demonstration result, adding a calibration word to the sixth data according to the demonstration result, and feeding back to a database management center; when the database management center receives feedback, a data feedback mechanism is generated; if there is no known reference data, the sixth data is equal to the fifth data.
Further, the system further comprises:
The quotation rate acquisition module is used for acquiring the extractor success rate and a characteristic attribute model of the first data, and acquiring a first quotation rate of the first data through the characteristic attribute model; the characteristic attributes comprise the unit attribute, time attribute and region attribute of the first data in historical extractions, and the characteristic attribute model establishes a mapping relation among the unit attribute, the time attribute, the region attribute and the data through a neural network algorithm; the unit attribute, time attribute and region attribute of the extractor are obtained and recorded as the extractor characteristics, and the extractor characteristics are input into the characteristic attribute model to obtain the first quotation rate; and the quotation rate is obtained from the extractor success rate and the first quotation rate through a linear weighting algorithm.
Further, the system further comprises:
The extractor success rate module is used for predicting and obtaining the extractor success rate through an extractor success model, wherein the extractor success model is used for establishing the mapping relation among the understanding degree, the data retention rate, the retrieval frequency and the extractor through a neural network algorithm; obtaining understanding degree of the keywords through a professional level model, wherein the keywords are obtained by extracting retrieval information, and the professional level model establishes mapping relations among vocabulary difficulty, historical retrieval data conditions of extractors, professional backgrounds of extractors and the keywords through a neural network algorithm; acquiring historical data retention rate and historical search frequency of the historical extraction data of the extractor, and predicting the historical data retention rate and the historical search frequency through a time sequence prediction model; and inputting the understanding degree, the data retention rate and the search frequency into an extractor success model to obtain the extractor success rate.
Further, the system further comprises:
the data feedback mechanism module is used for adding calibration words to the data according to the demonstration result, wherein the calibration words comprise abnormal calibration words, error calibration words and defect calibration words; when the database management center monitors that the data has the calibration word, a data feedback mechanism is generated; the data feedback mechanism is used for classifying the data with the calibration words according to the calibration words to generate corresponding information and feeding back the information to the data source provider, and the data source provider verifies the data with the calibration words according to the information and feeds back verification information to the database management center; when the database management center receives feedback verification information of the data source provider, whether the calibration words are removed or not is judged according to the verification information, management operation is carried out on the data, and notification information is generated and fed back to a history extractor of the data.
Further, the system further comprises:
The data management module is used for introducing a third party unit to verify the data and obtain a judging result when the database management center receives the feedback verification information of the data source provider; according to the judging result, if the data meets the condition for removing the calibration word, the calibration word of the data is removed and a management operation is performed, wherein the management operation comprises deleting the data and correcting the data; if the data does not meet the condition for removing the calibration word, the management operation is to move the data into a deactivated database.
(3) Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
1. The accuracy of the data is improved by fully verifying the data, and is further verified according to the data quotation rate and the success rate of the data extractor.
2. Closed-loop management of the data is formed through the data feedback mechanism, which effectively improves the accuracy of the data.
Drawings
FIG. 1 is a flow chart of a method for extracting and verifying a biosafety database according to embodiment 1 of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
Before the embodiments are described, the application scenario of the invention needs to be explained. The purpose of biosafety database extraction verification is to improve the accuracy of the data extracted from a biosafety database. Current biosafety databases are typically managed by relevant institutions or specialized organizations that are responsible for storing, processing and analyzing the data, but the accuracy of the data is not effectively guaranteed. For example, in the research and judgment of an epidemic, inaccurate biosafety data leads to misjudgment of the epidemic situation: if the number of cases is underestimated, the actual epidemic may be more serious than assessed and more people may become infected; conversely, if the number of cases is overestimated, unnecessary panic and waste of resources may result. Likewise, if the data acquired during drug and vaccine development is inaccurate, the development progress and results may be affected; for example, erroneous data may send development in the wrong direction or delay its progress. Therefore, after the extractor extracts the data, the data needs to be sufficiently verified to ensure its accuracy. The lower the accuracy of the data, the more likely it is that relevant departments or research institutions will make wrong decisions or reach wrong research conclusions based on inaccurate data; the higher the accuracy of the data, the more accurate the judgments and decisions that relevant departments and researchers can make.
The invention fully verifies the accuracy of data by comprehensively considering the influence of various factors on data accuracy, including the historical extraction record of the data and the influence of extractor-related factors on data extraction, thereby avoiding, as far as possible, errors or anomalies in the extracted data caused by the limitations of the verification method. For example, researchers at a research institution need to study the transmission of a disease between animals, and extract the relevant data on the disease from a biosafety database for research and analysis according to their authorized permissions. However, retrieval returns a large amount of disorganized data, and because of differing data sources, data storage processes and other reasons, the data may contain duplicates, missing values or inconsistent formats; using the data directly after extraction would reduce the efficiency and accuracy of the research. The method and system of the invention therefore fully verify the extracted data while also considering the influence of extractor-related factors on data accuracy, and form a corresponding feedback mechanism for data accuracy. The feedback mechanism feeds the demonstration results for the extracted data back to the data source provider, which verifies the data accordingly; this forms a closed-loop management mode for the data and can effectively improve the accuracy of the data in the database.
Example 1: as shown in fig. 1, the present embodiment provides a biosafety database extraction verification method, the method including:
S1, acquiring data extracted from a biological safety database, and cleaning the data to obtain first data; the data cleaning is to delete data carrying specific calibration words and repeated data by using a data processing method; acquiring the quotation rate and abnormal conditions of the first data in historical extractions; the abnormal condition means that when extracted data is demonstrated to be an abnormal value, the data is marked with an abnormal calibration word; selecting, from the first data, the data whose quotation rate is greater than a set threshold, assigning a set weight to the data carrying the abnormal calibration word, and recording the result as second data. The data in the biosafety database come from different sources and may be inaccurate, incomplete or inconsistent because of errors or omissions during data acquisition, processing or storage, so some common data problems need to be handled after the data are extracted. Python is a powerful programming language that is widely used in data processing, and its libraries and tools make data cleaning convenient. For example, repeated data, erroneous data and defective data are common problems in data extraction, and repeated data and other data that do not meet requirements can be deleted with Python data processing methods. The data cleaning in this embodiment deletes the data carrying specific calibration words and the repeated data, where the specific calibration words include the error calibration word and the defect calibration word. The quotation rate of the data is the ratio of the number of times the data was adopted after extraction to the number of times the data was extracted; a larger ratio indicates that the data receives more attention and is more likely to be adopted, which reflects to a certain degree that the data has higher accuracy.
The abnormal calibration word means that the result obtained for the data after demonstration is uncertain or unclear. For example, the error of the data after demonstration is larger than the set precision value while the theoretical error is within a reasonable range, so it cannot be determined whether the data is erroneous. A likely cause is that the demonstration requires high precision whereas the original data source did not require such precision during acquisition; in this case the data is judged to be abnormal data and the abnormal calibration word is added. The discrepancy may also be caused by factors such as differences in demonstration equipment, demonstration environment and the skill of the personnel carrying out the demonstration. Therefore, when abnormal data is present in the extracted data, that data is given a relatively small weight, which reduces its influence on the result of the final decision or study.
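Step S1 can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the `Record` fields, the label strings, and the default threshold (0.5) and abnormal weight (0.3) are hypothetical values chosen for the example.

```python
from dataclasses import dataclass, replace

# Assumed label strings for the calibration words named in the text.
ERROR, DEFECT, ABNORMAL = "error", "defect", "abnormal"


@dataclass(frozen=True)
class Record:
    key: str                       # records with equal keys count as duplicates
    value: float
    labels: frozenset = frozenset()
    times_extracted: int = 0       # historical extraction count
    times_adopted: int = 0         # how often the data was adopted after extraction
    weight: float = 1.0


def quotation_rate(rec):
    """Adopted extractions divided by total extractions (0 if never extracted)."""
    if rec.times_extracted == 0:
        return 0.0
    return rec.times_adopted / rec.times_extracted


def step_s1(records, rate_threshold=0.5, abnormal_weight=0.3):
    # Cleaning: drop error/defect-labelled records and duplicates (first data).
    seen, first = set(), []
    for rec in records:
        if rec.labels & {ERROR, DEFECT} or rec.key in seen:
            continue
        seen.add(rec.key)
        first.append(rec)
    # Keep records above the threshold and down-weight abnormal ones (second data).
    second = []
    for rec in first:
        if quotation_rate(rec) > rate_threshold:
            if ABNORMAL in rec.labels:
                rec = replace(rec, weight=abnormal_weight)
            second.append(rec)
    return second
```

Records are kept immutable and copied with `replace`, so the original extraction result stays available for the later feedback step.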
S2, obtaining third data through data processing of the second data; the data processing comprises data format conversion and data encoding; the data format conversion is to convert data in different formats into a preset unified format through a data format conversion method; the data encoding is to encode non-numerical data into numerical data by adopting an encoding algorithm. Because of different data sources or storage modes, the data may be in inconsistent formats, such as Excel tables, CSV files or JSON, and the different formats are converted into a unified format through a data format conversion method. For example, the conversion can be carried out by a custom script written in a programming language such as Python or Java according to the specific requirements. Some non-numerical data is inconvenient for subsequent data verification, analysis and research; common methods for encoding non-numerical data into numerical data include one-hot encoding, numeric encoding and continuous encoding. Different encoding methods may affect the results differently, so in practice an appropriate method is chosen according to the nature of the data and the purpose of the analysis. For example, suppose the extracted data concerns the incidence of a certain disease among men and women of different age groups. Sex is non-numerical data, which is unfavorable for data analysis, so numeric encoding can be adopted; in this case it is a simple binary encoding in which the encoding algorithm encodes male as 1 and female as 0.
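A minimal Python sketch of step S2, using only the standard library, is given below. The unified format (a list of dictionaries with `sex`, `age_group` and `incidence` fields), the field names and the sample rows are assumptions made for illustration; only the male = 1, female = 0 coding comes from the text.

```python
import csv
import io
import json

# Binary coding from the example in the text: male -> 1, female -> 0.
SEX_CODES = {"male": 1, "female": 0}


def load_csv(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse a JSON array of objects into a list of row dictionaries."""
    return json.loads(text)


def to_unified(rows):
    """Convert rows to the assumed unified format and encode sex numerically."""
    unified = []
    for row in rows:
        unified.append({
            "sex": SEX_CODES[str(row["sex"]).strip().lower()],
            "age_group": str(row["age_group"]).strip(),
            "incidence": float(row["incidence"]),
        })
    return unified


# Illustrative inputs in two source formats (values are made up).
csv_text = "sex,age_group,incidence\nMale,30-35,0.21\n"
json_text = '[{"sex": "female", "age_group": "30-35", "incidence": 0.32}]'
third_data = to_unified(load_csv(csv_text) + load_json(json_text))
```

An Excel source would need a third-party reader and is left out to keep the sketch self-contained.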
S3, acquiring data classification and data form of third data; the data classification is to obtain fourth data from the third data through a data classification algorithm; the data form is obtained by descriptive statistical analysis of the fourth data, and whether the fourth data obeys a normal distribution is checked by a normality test method; when the normal distribution is obeyed, the data within a set confidence threshold interval is acquired and recorded as fifth data; when the normal distribution is not obeyed, the fifth data is equal to the fourth data. Because requirements differ, an extraction may retrieve several types of data for research and analysis, so the data is classified with a data classification algorithm. A data classification algorithm is an algorithm that groups similar data together according to data characteristics; such algorithms are mainly divided into supervised and unsupervised learning algorithms. Common data classification algorithms include Bayesian classification, the K-nearest-neighbor algorithm, decision tree classification, logistic regression and random forest; this embodiment uses a random forest classification algorithm. The extracted data may take various data forms, but according to statistical analysis, for data that follows a normal distribution, the data falling within the set confidence threshold interval has higher accuracy. Common normality test methods include the P-P plot, the Q-Q plot, the Kolmogorov-Smirnov (KS) test and the Shapiro-Wilk test; this embodiment combines the P-P plot with the Shapiro-Wilk test.
The Shapiro-Wilk test is a statistical method for checking whether data conforms to a normal distribution; its basic idea is to compare the actual distribution of the data with a theoretical normal distribution. The P-P plot (Probability-Probability Plot) is a visual test method based on a probability distribution plot and is used to evaluate whether a data set conforms to a specified probability distribution; its result may be affected by factors such as sample size and data fluctuation. Therefore, when the P-P plot is used, combining it with the Shapiro-Wilk test yields a more accurate conclusion.
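The selection of fifth data in step S3 can be sketched as below. Two points are assumptions rather than statements of the text: the "set confidence threshold interval" is interpreted as the sample mean plus or minus z sample standard deviations for a chosen confidence level, and the outcome of the normality test is passed in as a flag (it could come, for example, from a Shapiro-Wilk test in a statistics library) so that the sketch needs only the standard library.

```python
from statistics import NormalDist, mean, stdev


def select_fifth_data(fourth, is_normal, confidence=0.95):
    """Return the fifth data from the fourth data.

    fourth: list of numeric values (fourth data).
    is_normal: result of the normality test, decided elsewhere.
    confidence: assumed confidence level defining the interval.
    """
    # Not normally distributed (or too few points): fifth data equals fourth data.
    if not is_normal or len(fourth) < 2:
        return list(fourth)
    mu, sigma = mean(fourth), stdev(fourth)
    # Two-sided z value for the chosen confidence level (about 1.96 for 0.95).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low, high = mu - z * sigma, mu + z * sigma
    return [v for v in fourth if low <= v <= high]


# Illustrative values with one extreme point.
sample = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 20.0]
kept = select_fifth_data(sample, is_normal=True)
```

With a small sample the extreme value inflates the standard deviation, so in practice the confidence level would need to be tuned to the data.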
S4, acquiring known reference data, comparing fifth data with the reference data, and deleting data whose difference from the reference data is larger than a set difference threshold to obtain sixth data; carrying out demonstration on the sixth data to obtain a demonstration result, adding a calibration word to the sixth data according to the demonstration result, and feeding back to a database management center; when the database management center receives feedback, a data feedback mechanism is generated; if there is no known reference data, the sixth data is equal to the fifth data. In general, the data extraction unit may already hold some reference data for the related data, including reference values or external data, and this reference data can be used as comparison data to establish a comparison relationship between the extracted data and the reference data. For example, a linear regression algorithm can be used to calculate the relationship between the two; linear regression uses regression analysis in mathematical statistics to determine the quantitative relationship between two or more interdependent variables. When the difference between the extracted data and the reference data is larger than the set difference threshold, the extracted data is deleted. For example, the data extracted this time concerns the incidence of a certain disease by sex and age group, and shows an incidence of 53% for women aged 30-35; the extraction unit has recently analyzed this disease itself and found the incidence for women in this age group to be 32%. The difference between the two is too large, so this data can be deleted.
Demonstration is then carried out on the extracted data, and a calibration word is added to the data according to the demonstration result to reflect its actual condition; an added calibration word indicates that the data was found after demonstration to be possibly erroneous, defective or abnormal. The extraction unit feeds the data with the added calibration word back to the database management center. When the database management center detects that data carries a calibration word, the feedback mechanism is started.
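The reference comparison in step S4 can be sketched as follows, using the 53% versus 32% example from the text. The dictionary layout, the key format, the 0.10 difference threshold and the male value are hypothetical; a plain absolute difference is used here instead of the linear regression the text mentions as one option.

```python
def compare_with_reference(fifth, reference, diff_threshold):
    """Return the sixth data.

    fifth: mapping from a data key to its extracted value.
    reference: mapping from a data key to its known reference value (may be empty).
    diff_threshold: assumed maximum allowed absolute difference.
    """
    # No known reference data: sixth data equals fifth data.
    if not reference:
        return dict(fifth)
    sixth = {}
    for key, value in fifth.items():
        ref = reference.get(key)
        # Drop values that deviate from their reference by more than the threshold.
        if ref is not None and abs(value - ref) > diff_threshold:
            continue
        sixth[key] = value
    return sixth


fifth_data = {("female", "30-35"): 0.53, ("male", "30-35"): 0.30}
reference_data = {("female", "30-35"): 0.32}
sixth_data = compare_with_reference(fifth_data, reference_data, diff_threshold=0.10)
```

Values without a matching reference entry are kept, since the text only calls for deletion when a difference can actually be computed.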
Further, the method for acquiring the quotation rate comprises the following steps:
Acquiring the extractor success rate and a characteristic attribute model of the first data, and acquiring a first quotation rate of the first data through the characteristic attribute model; the characteristic attributes comprise the unit attribute, the time attribute and the region attribute of the first data in historical extraction, and the characteristic attribute model establishes a mapping relation among the unit attribute, the time attribute, the region attribute and the data through a neural network algorithm; the unit attribute, the time attribute and the region attribute of the extractor are obtained and recorded as the extractor characteristics, and the extractor characteristics are input into the characteristic attribute model to obtain the first quotation rate; the quotation rate is obtained from the extractor success rate and the first quotation rate through a linear weighting algorithm. The quotation rate of data reflects its accuracy to some extent; however, data in the database is affected by a number of factors during extraction, and these factors determine whether the data is finally adopted.
These factors comprise the unit attribute, the time attribute and the region attribute. The unit attribute refers to the unit extracting the data, such as a research institution or a relevant department; different units differ in the emphasis, requirements and precision of the data they extract. The time attribute refers to the time period of the data to be extracted; time can be divided, for example, by season, by morning and evening, or by year, because data differs from one period to another. The region attribute refers to the regional characteristics of the extracted data; regions can be divided, for example, by country, by longitude and latitude, or by river basin, because different regions are subject to different environmental factors and the data differs accordingly. For example, suppose the extractor belongs to a research institution and needs data to study the spread of a certain disease in coastal areas in summer; the extracted data must then be more accurate and comprehensive. A relevant department formulating policy may adopt a diffusion rate accurate to 2 decimal places, whereas a research institution may require 4-5 decimal places to ensure the accuracy of its research results. Meanwhile, temperatures in summer are relatively high, and heat-tolerant bacteria or viruses may survive at higher rates; hand-foot-and-mouth disease, an intestinal infectious disease of children, and bacillary dysentery both occur more frequently in summer. Coastal areas are windy and humid all year round, and these factors contribute to the spread of disease.
The unit attribute, the time attribute and the region attribute therefore influence whether data is finally adopted during extraction. Consequently, whether and how data has been adopted under the corresponding attributes reflects the accuracy of the data.
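The final linear weighting step can be illustrated with a short sketch. The text does not state the weights or the threshold, so the values 0.4, 0.6 and 0.6 below are assumptions, as are the function names and sample records; the two input rates are taken as already produced by the extractor success model and the characteristic attribute model.

```python
def quotation_rate(extractor_success_rate, first_quotation_rate,
                   w_success=0.4, w_first=0.6):
    """Linear weighting of the two rates; the weights are assumed values."""
    if abs(w_success + w_first - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for rate in (extractor_success_rate, first_quotation_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie in [0, 1]")
    return w_success * extractor_success_rate + w_first * first_quotation_rate


def select_above_threshold(rates_by_id, threshold):
    """Keep the ids whose combined quotation rate exceeds the set threshold."""
    return [
        data_id
        for data_id, (success, first) in rates_by_id.items()
        if quotation_rate(success, first) > threshold
    ]


# Hypothetical records: id -> (extractor success rate, first quotation rate)
rates = {"d1": (0.8, 0.5), "d2": (0.5, 0.5), "d3": (0.9, 0.9)}
selected = select_above_threshold(rates, threshold=0.6)
print(selected)
```

Keeping the weights as parameters makes it easy to tune how much the extractor's track record counts relative to the attribute-based rate.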
Further, the method for obtaining the extractor success rate comprises the following steps:
The extractor success rate is predicted by an extractor success model, wherein the extractor success model establishes a mapping relation among the understanding degree, the data retention rate, the retrieval frequency and the extractor through a neural network algorithm; the understanding degree of the keywords is obtained through a professional level model, wherein the keywords are extracted from the retrieval information, and the professional level model establishes a mapping relation among the vocabulary difficulty, the extractor's historical retrieval data, the extractor's professional background and the keywords through a neural network algorithm; the historical data retention rate and the historical retrieval frequency of the extractor's historical extractions are acquired, and the data retention rate and the retrieval frequency are predicted from them through a time series prediction model; the understanding degree, the data retention rate and the retrieval frequency are input into the extractor success model to obtain the extractor success rate. A data extractor must obtain the relevant extraction qualification to have the legal right to extract data from the biosafety database, so the extractor may need to receive training on the extraction methods and requirements of the biosafety database to ensure that the extractor has the corresponding skills and capabilities. Since the extractor is the person specifically responsible for data extraction, the success rate of the extractor's historical extractions also affects the accuracy of the data to some extent and can be used as one of the factors for judging data accuracy; if the success rate is low, the data can be extracted again for verification and comparison.
The vocabulary difficulty is obtained by a knowledge-graph-based difficulty coefficient calculation method, which calculates the difficulty coefficient of a word from the semantic relations and the knowledge network in the knowledge graph. Semantic relations in the knowledge graph include word sense relations, semantic similarity, concept relations and the like, and these relations can be used to measure the cognitive difficulty of a word. By domain, knowledge graphs can be divided into general knowledge graphs and domain-specific knowledge graphs (industry knowledge graphs, vertical knowledge graphs); a general knowledge graph is a structured encyclopedic knowledge base, while a domain-specific knowledge graph targets a particular field and can be regarded as an industry knowledge base built on semantic technology. This embodiment uses a domain-specific knowledge graph. The data retention rate is the percentage of the finally retained data volume in the total retrieved data volume when the extractor successfully extracts data; the higher the percentage, the higher the accuracy of the data to some extent. The retrieval frequency is the number of retrievals the extractor performs in the course of successfully extracting data; the more retrievals, the lower the extractor's extraction efficiency and the lower the accuracy of the extracted data. The extractor's historical retrieval data refers to which data the extractor retrieved and which data the extractor extracted in past operations; the fact that the extractor extracted data of a given type indicates some degree of understanding of that type of information.
The professional background of the extractor refers to the extractor's expertise and work experience; an extractor has a clearer understanding when extracting knowledge related to the extractor's own professional field, which also helps improve the accuracy of the extracted data.
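The data retention rate defined above, and the step that forecasts it from history, can be sketched as follows. The text does not name a particular time series prediction model, so simple exponential smoothing is used here purely as a stand-in; the smoothing factor and the sample history are assumptions.

```python
def retention_rate(retained_volume, retrieved_volume):
    """Share of the retrieved data volume that was finally retained."""
    if retrieved_volume <= 0:
        raise ValueError("retrieved_volume must be positive")
    return retained_volume / retrieved_volume


def exp_smooth_forecast(history, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing.

    Stand-in for the unspecified time series prediction model; alpha is an
    assumed smoothing factor in (0, 1].
    """
    if not history:
        raise ValueError("history must not be empty")
    level = history[0]
    for observation in history[1:]:
        level = alpha * observation + (1 - alpha) * level
    return level


# Hypothetical history of one extractor, oldest first.
past_retention = [0.6, 0.7, 0.8]
past_retrievals = [6, 4, 4]

predicted_retention = exp_smooth_forecast(past_retention)
predicted_retrievals = exp_smooth_forecast(past_retrievals)
print(retention_rate(45, 60), predicted_retention, predicted_retrievals)
```

The two forecasts, together with the understanding degree, would then be the inputs to the extractor success model described in the text.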
Further, the method further comprises:
The calibration words added to the data according to the demonstration result comprise an abnormal calibration word, an error calibration word and a defect calibration word; when the database management center detects that data carries a calibration word, a data feedback mechanism is generated; the data feedback mechanism classifies the data carrying calibration words by calibration word, generates corresponding information and feeds it back to the data source provider, and the data source provider verifies the data carrying the calibration word according to the information and feeds verification information back to the database management center; when the database management center receives the verification information fed back by the data source provider, it judges according to the verification information whether to remove the calibration word, performs a management operation on the data, and generates notification information that is fed back to the historical extractors of the data. That is, the calibration words added according to the demonstration result comprise abnormal, error and defect (missing value) calibration words; when the database system detects that data carries a calibration word, it classifies the data by calibration word, generates corresponding information and feeds it back to the data source provider. When the data source provider receives such information, for example for an error calibration word, a data source provider with the appropriate experimental conditions repeats the experiment to demonstrate the data. While the data source provider carries out the demonstration, the data keeps its calibration word, so that when an extractor extracts data carrying a calibration word during this period, the data is handled accordingly in the data cleaning stage.
After the data source provider feeds the demonstration result back to the database management center, the database management center judges from the feedback whether to remove the calibration word of the data and at the same time generates notification information for the historical extractors, because a historical extractor may already have used the data for decisions or research whose accuracy is affected. The historical extractors can then correct or redo their decisions or research in time according to the notification from the biosafety database management center. Through this feedback mechanism the data is managed in a closed loop: the data source provider supplies data to the database management center, which is responsible for storage, processing and the like; the data extractor extracts the data and, when demonstration shows a problem with the data, feeds back to the database management center; the database management center generates information from this feedback and sends it to the data source provider; and the data source provider demonstrates the data and feeds the result back to the database management center. In this way the accuracy of the data in the database is effectively improved.
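The classification and notification steps of the feedback mechanism can be sketched as below. This is only an illustration of the data flow: the enum, the function names, the tuple layouts and the sample providers and extractors are hypothetical and do not come from the text.

```python
from collections import defaultdict
from enum import Enum


class CalibrationWord(Enum):
    ABNORMAL = "abnormal"
    ERROR = "error"
    DEFECT = "defect"


def build_provider_messages(flagged):
    """Group flagged data by data source provider and calibration word.

    `flagged` is a list of (data_id, provider, CalibrationWord) tuples.
    Returns {provider: {calibration word: [data_id, ...]}}.
    """
    messages = defaultdict(lambda: defaultdict(list))
    for data_id, provider, word in flagged:
        messages[provider][word.value].append(data_id)
    return {provider: dict(by_word) for provider, by_word in messages.items()}


def history_extractors(data_id, extraction_log):
    """Return the extractors who previously extracted data_id, without repeats."""
    return sorted({who for who, what in extraction_log if what == data_id})


flagged = [
    ("d1", "lab_a", CalibrationWord.ERROR),
    ("d2", "lab_a", CalibrationWord.ABNORMAL),
    ("d3", "lab_b", CalibrationWord.ERROR),
    ("d4", "lab_a", CalibrationWord.ERROR),
]
extraction_log = [("alice", "d1"), ("bob", "d2"), ("carol", "d1"), ("alice", "d1")]

messages = build_provider_messages(flagged)
to_notify = history_extractors("d1", extraction_log)
print(messages)
print(to_notify)
```

Grouping by calibration word before contacting a provider means each provider receives one message per kind of problem rather than one per record.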
Further, the method further comprises:
When the database management center receives the verification information fed back by the data source provider, a third party unit is introduced to verify the data and obtain a judging result; according to the judging result, if the data meets the condition for removing the calibration word, the calibration word is removed and a management operation is performed, the management operation comprising deleting the data and correcting the data; if the data does not meet the condition for removing the calibration word, the management operation is to place the data in a deactivated database. When the database management center receives the verification information fed back by the data source provider, it may, depending on the specific situation, introduce a third party unit to fully demonstrate that verification information and obtain a judging result, in order to ensure the accuracy of the feedback. Whether the calibration word is removed is decided according to the judging result. If the calibration word is removed, the judging result is consistent with the verification information fed back by the data source provider, and the database management center deletes or corrects the data. If the calibration word is not removed, the judging result is inconsistent with the feedback of the data source provider; the data then has no reference value and is not suitable for adoption, and the management operation may place the data in the deactivated database, where data is suspended from external extraction until it is finally demonstrated to have reference value and to be suitable for adoption.
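The management decision described above reduces to a small decision rule, sketched here under stated assumptions. The record layout, the status strings and the rule for choosing between deletion and correction (correct when a corrected value is supplied, delete otherwise) are illustrative choices; the text only says that the management operation includes deleting and correcting.

```python
from enum import Enum


class Operation(Enum):
    DELETE = "delete"
    CORRECT = "correct"
    DEACTIVATE = "deactivate"


def manage(record, meets_removal_condition, corrected_value=None):
    """Apply the management operation and return (operation, updated record).

    `record` is a dict with the hypothetical keys 'value',
    'calibration_word' and 'status'. The input dict is not modified.
    """
    updated = dict(record)
    if not meets_removal_condition:
        # Calibration word stays; data goes to the deactivated database.
        updated["status"] = "deactivated"
        return Operation.DEACTIVATE, updated
    updated["calibration_word"] = None
    if corrected_value is None:
        updated["status"] = "deleted"
        return Operation.DELETE, updated
    updated["value"] = corrected_value
    updated["status"] = "active"
    return Operation.CORRECT, updated


record = {"value": 53.0, "calibration_word": "error", "status": "active"}
op_fix, fixed = manage(record, meets_removal_condition=True, corrected_value=32.0)
op_hold, held = manage(record, meets_removal_condition=False)
print(op_fix, fixed)
print(op_hold, held)
```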
Example 2: based on the same inventive concept, the present embodiment further provides a biosafety database extraction verification system, including:
The first data verification module is used for acquiring data extracted from the biological safety database, and cleaning the data to obtain first data; the data cleaning is to delete the data of the specific calibration words and the repeated data by using a data processing method; acquiring the quotation rate and abnormal conditions of the first data in the history extraction; the abnormal condition is that when the extracted data is demonstrated to be an abnormal value, the data is marked with an abnormal calibration word; acquiring first data, wherein the quotation rate in the first data is larger than a set threshold value, and marking the data with the abnormal calibration words as second data after the data with the abnormal calibration words are given a set weight;
The second data verification module is used for obtaining third data through data processing of the second data; the data processing comprises data format conversion and data encoding; the data format conversion is to convert data in different formats into a preset unified format through a data format conversion method; the data coding is to code non-numerical data into numerical data by adopting a coding algorithm;
The third data verification module is used for acquiring data classification and data form of third data; the data classification is to obtain fourth data from the third data through a data classification algorithm; the data form is obtained by descriptive statistical analysis of fourth data, and whether the fourth data obeys normal distribution is checked by a normal checking method; when normal distribution is obeyed, acquiring data in a set confidence threshold interval and recording the data as fifth data; when the normal distribution is not obeyed, the fifth data is equal to the fourth data;
The data demonstration feedback module is used for acquiring known reference data, comparing fifth data with the reference data, and deleting data with a difference value between the fifth data and the reference data larger than a set difference threshold value to obtain sixth data; carrying out demonstration on the sixth data to obtain a demonstration result, adding a calibration word to the sixth data according to the demonstration result, and feeding back to a database management center; when the database management center receives feedback, a data feedback mechanism is generated; if there is no known reference data, the sixth data is equal to the fifth data.
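As an illustration of the third data verification module, the sketch below checks normality and then keeps the values inside an interval around the mean. The normal checking method is not named in this passage, so the Jarque-Bera test is used as a stand-in (it is asymptotic and weak on small samples); reading the "set confidence threshold interval" as mean ± z standard deviations is likewise an assumption, as are `alpha`, `z` and the sample data.

```python
import math
import statistics


def jarque_bera_pvalue(xs):
    """p-value of the Jarque-Bera normality test (chi-squared, 2 d.o.f.)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of chi-squared with 2 degrees of freedom.
    return math.exp(-jb / 2)


def fifth_data(fourth, alpha=0.05, z=1.96):
    """Return the fifth data derived from the fourth data."""
    if len(fourth) < 3 or statistics.pstdev(fourth) == 0:
        return list(fourth)  # too little variation to test
    if jarque_bera_pvalue(fourth) < alpha:
        return list(fourth)  # normality rejected: fifth data = fourth data
    mean = statistics.fmean(fourth)
    sd = statistics.stdev(fourth)
    low, high = mean - z * sd, mean + z * sd
    return [x for x in fourth if low <= x <= high]


# Synthetic inputs: evenly spaced normal quantiles, and a heavily skewed set.
normal_like = [statistics.NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
skewed = [1.0] * 20 + [100.0]

kept = fifth_data(normal_like)
unchanged = fifth_data(skewed)
print(len(normal_like), len(kept), len(unchanged))
```

For the normal-like sample the tails beyond the interval are trimmed, while the skewed sample fails the test and is passed through unchanged, matching the rule that the fifth data equals the fourth data when normality does not hold.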
Further, the system further comprises:
The quotation rate acquisition module is used for acquiring the success rate of the extractor and a characteristic attribute model of the first data, and acquiring a first quotation rate of the first data through the characteristic attribute model; the characteristic attribute comprises a unit attribute, a time attribute and a region attribute of the first data in history extraction, and the characteristic attribute model establishes a mapping relation among the unit attribute, the time attribute, the region attribute and the data through a neural network algorithm; the unit attribute, the time attribute and the region attribute of the extractor are obtained and recorded as the characteristics of the extractor, and the characteristics of the extractor are input into a characteristic attribute model to obtain the first quotation rate; and obtaining the quotation rate by the extractor success rate and the first quotation rate through a linear weighting algorithm.
Further, the system further comprises:
The extractor success rate module is used for predicting and obtaining the extractor success rate through an extractor success model, wherein the extractor success model is used for establishing the mapping relation among the understanding degree, the data retention rate, the retrieval frequency and the extractor through a neural network algorithm; obtaining understanding degree of the keywords through a professional level model, wherein the keywords are obtained by extracting retrieval information, and the professional level model establishes mapping relations among vocabulary difficulty, historical retrieval data conditions of extractors, professional backgrounds of extractors and the keywords through a neural network algorithm; acquiring historical data retention rate and historical search frequency of the historical extraction data of the extractor, and predicting the historical data retention rate and the historical search frequency through a time sequence prediction model; and inputting the understanding degree, the data retention rate and the search frequency into an extractor success model to obtain the extractor success rate.
Further, the system further comprises:
the data feedback mechanism module is used for adding calibration words to the data according to the demonstration result, wherein the calibration words comprise abnormal calibration words, error calibration words and defect calibration words; when the database management center monitors that the data has the calibration word, a data feedback mechanism is generated; the data feedback mechanism is used for classifying the data with the calibration words according to the calibration words to generate corresponding information and feeding back the information to the data source provider, and the data source provider verifies the data with the calibration words according to the information and feeds back verification information to the database management center; when the database management center receives feedback verification information of the data source provider, whether the calibration words are removed or not is judged according to the verification information, management operation is carried out on the data, and notification information is generated and fed back to a history extractor of the data.
Further, the system further comprises:
The data management module is used for introducing a third party unit to verify the data and obtain a judging result when the database management center receives the feedback verification information of the data source provider; and according to the judging result, if the data accords with the condition of removing the calibration word, removing the data calibration word and performing management operation, wherein the management operation comprises deleting the data and correcting the data, and if the data does not accord with the condition of removing the calibration word, the management operation is to list the data into a deactivated database.
It should be noted that, regarding the system in the above embodiment, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment regarding the method, and will not be described in detail herein.
Finally, it should be noted that: although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and changes may be made without departing from the spirit and principles of the present invention.

Claims (10)

1. A biosafety database extraction verification method, the method comprising:
acquiring data extracted from a biological safety database, and cleaning the data to obtain first data; the data cleaning is to delete the data of the specific calibration words and the repeated data by using a data processing method; acquiring the quotation rate and abnormal conditions of the first data in the history extraction; the abnormal condition is that when the extracted data is demonstrated to be an abnormal value, the data is marked with an abnormal calibration word; acquiring first data, wherein the quotation rate in the first data is larger than a set threshold value, and marking the data with the abnormal calibration words as second data after the data with the abnormal calibration words are given a set weight;
Processing the second data to obtain third data; the data processing comprises data format conversion and data encoding; the data format conversion is to convert data in different formats into a preset unified format through a data format conversion method; the data coding is to code non-numerical data into numerical data by adopting a coding algorithm;
acquiring data classification and data form of third data; the data classification is to obtain fourth data from the third data through a data classification algorithm; the data form is obtained by descriptive statistical analysis of fourth data, and whether the fourth data obeys normal distribution is checked by a normal checking method; when normal distribution is obeyed, acquiring data in a set confidence threshold interval and recording the data as fifth data; when the normal distribution is not obeyed, the fifth data is equal to the fourth data;
Acquiring known reference data, comparing fifth data with the reference data, and deleting data with a difference value between the fifth data and the reference data larger than a set difference threshold value to obtain sixth data; carrying out demonstration on the sixth data to obtain a demonstration result, adding a calibration word to the sixth data according to the demonstration result, and feeding back to a database management center; when the database management center receives feedback, a data feedback mechanism is generated; if there is no known reference data, the sixth data is equal to the fifth data.
2. The biosafety database extraction verification method of claim 1 wherein said method of obtaining the quotation rate includes:
Acquiring the success rate of the extractor and a characteristic attribute model of the first data, and acquiring a first quoting rate of the first data through the characteristic attribute model; the characteristic attribute comprises a unit attribute, a time attribute and a region attribute of the first data in history extraction, and the characteristic attribute model establishes a mapping relation among the unit attribute, the time attribute, the region attribute and the data through a neural network algorithm; the unit attribute, the time attribute and the region attribute of the extractor are obtained and recorded as the characteristics of the extractor, and the characteristics of the extractor are input into a characteristic attribute model to obtain the first quotation rate; and obtaining the quotation rate by the extractor success rate and the first quotation rate through a linear weighting algorithm.
3. The biosafety database extraction verification method of claim 2, wherein the extractor success rate acquisition method includes:
The success rate of the extractor is obtained through the prediction of a success model of the extractor, wherein the success model of the extractor is obtained through the establishment of the mapping relation among the understanding degree, the data retention rate, the retrieval frequency and the extractor through a neural network algorithm; obtaining understanding degree of the keywords through a professional level model, wherein the keywords are obtained by extracting retrieval information, and the professional level model establishes mapping relations among vocabulary difficulty, historical retrieval data conditions of extractors, professional backgrounds of extractors and the keywords through a neural network algorithm; acquiring historical data retention rate and historical search frequency of the historical extraction data of the extractor, and predicting the historical data retention rate and the historical search frequency through a time sequence prediction model; and inputting the understanding degree, the data retention rate and the search frequency into an extractor success model to obtain the extractor success rate.
4. The biosafety database extraction verification method of claim 1, said method further comprising:
The calibration words added to the data according to the demonstration result comprise an abnormal calibration word, an error calibration word and a defect calibration word; when the database management center monitors that the data has the calibration word, a data feedback mechanism is generated; the data feedback mechanism is used for classifying the data with the calibration words according to the calibration words to generate corresponding information and feeding back the information to the data source provider, and the data source provider verifies the data with the calibration words according to the information and feeds back verification information to the database management center; when the database management center receives feedback verification information of the data source provider, whether the calibration words are removed or not is judged according to the verification information, management operation is carried out on the data, and notification information is generated and fed back to a history extractor of the data.
5. The biosafety database extraction verification method of claim 4, said method further comprising:
When the database management center receives feedback verification information of the data source provider, a third party unit is introduced to verify the data and obtain a judging result; and according to the judging result, if the data accords with the condition of removing the calibration word, removing the data calibration word and performing management operation, wherein the management operation comprises deleting the data and correcting the data, and if the data does not accord with the condition of removing the calibration word, the management operation is to list the data into a deactivated database.
6. A biosafety database extraction verification system, the system comprising:
The first data verification module is used for acquiring data extracted from the biological safety database, and cleaning the data to obtain first data; the data cleaning is to delete the data of the specific calibration words and the repeated data by using a data processing method; acquiring the quotation rate and abnormal conditions of the first data in the history extraction; the abnormal condition is that when the extracted data is demonstrated to be an abnormal value, the data is marked with an abnormal calibration word; acquiring first data, wherein the quotation rate in the first data is larger than a set threshold value, and marking the data with the abnormal calibration words as second data after the data with the abnormal calibration words are given a set weight;
The second data verification module is used for obtaining third data through data processing of the second data; the data processing comprises data format conversion and data encoding; the data format conversion is to convert data in different formats into a preset unified format through a data format conversion method; the data coding is to code non-numerical data into numerical data by adopting a coding algorithm;
The third data verification module is used for acquiring data classification and data form of third data; the data classification is to obtain fourth data from the third data through a data classification algorithm; the data form is obtained by descriptive statistical analysis of fourth data, and whether the fourth data obeys normal distribution is checked by a normal checking method; when normal distribution is obeyed, acquiring data in a set confidence threshold interval and recording the data as fifth data; when the normal distribution is not obeyed, the fifth data is equal to the fourth data;
The data demonstration feedback module is used for acquiring known reference data, comparing fifth data with the reference data, and deleting data with a difference value between the fifth data and the reference data larger than a set difference threshold value to obtain sixth data; carrying out demonstration on the sixth data to obtain a demonstration result, adding a calibration word to the sixth data according to the demonstration result, and feeding back to a database management center; when the database management center receives feedback, a data feedback mechanism is generated; if there is no known reference data, the sixth data is equal to the fifth data.
7. The biosafety database extraction verification system of claim 6 wherein said system further comprises:
The quotation rate acquisition module is used for acquiring the success rate of the extractor and a characteristic attribute model of the first data, and acquiring a first quotation rate of the first data through the characteristic attribute model; the characteristic attribute comprises a unit attribute, a time attribute and a region attribute of the first data in history extraction, and the characteristic attribute model establishes a mapping relation among the unit attribute, the time attribute, the region attribute and the data through a neural network algorithm; the unit attribute, the time attribute and the region attribute of the extractor are obtained and recorded as the characteristics of the extractor, and the characteristics of the extractor are input into a characteristic attribute model to obtain the first quotation rate; and obtaining the quotation rate by the extractor success rate and the first quotation rate through a linear weighting algorithm.
8. The biosafety database extraction verification system of claim 7 wherein said system further comprises:
The extractor success rate module is used for predicting and obtaining the extractor success rate through an extractor success model, wherein the extractor success model is used for establishing the mapping relation among the understanding degree, the data retention rate, the retrieval frequency and the extractor through a neural network algorithm; obtaining understanding degree of the keywords through a professional level model, wherein the keywords are obtained by extracting retrieval information, and the professional level model establishes mapping relations among vocabulary difficulty, historical retrieval data conditions of extractors, professional backgrounds of extractors and the keywords through a neural network algorithm; acquiring historical data retention rate and historical search frequency of the historical extraction data of the extractor, and predicting the historical data retention rate and the historical search frequency through a time sequence prediction model; and inputting the understanding degree, the data retention rate and the search frequency into an extractor success model to obtain the extractor success rate.
9. The biosafety database extraction verification system of claim 6 wherein said system further comprises:
the data feedback mechanism module is used for adding calibration words to the data according to the demonstration result, wherein the calibration words comprise abnormal calibration words, error calibration words and defect calibration words; when the database management center monitors that the data has the calibration word, a data feedback mechanism is generated; the data feedback mechanism is used for classifying the data with the calibration words according to the calibration words to generate corresponding information and feeding back the information to the data source provider, and the data source provider verifies the data with the calibration words according to the information and feeds back verification information to the database management center; when the database management center receives feedback verification information of the data source provider, whether the calibration words are removed or not is judged according to the verification information, management operation is carried out on the data, and notification information is generated and fed back to a history extractor of the data.
10. The biosafety database extraction verification system of claim 9, wherein said system further comprises:
the data management module, configured to, when the database management center receives the verification information fed back by the data source provider, introduce a third-party unit to verify the data and obtain a judgment result; and, according to the judgment result, if the data meets the condition for removing the calibration word, to remove the calibration word from the data and perform a management operation, wherein the management operation comprises deleting the data and correcting the data, and, if the data does not meet the condition for removing the calibration word, the management operation is to move the data into a deactivated database.
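The feedback flow described in claims 9 and 10 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `CalibrationWord`, `Record`, `build_feedback` and `apply_verification` are hypothetical, and the boolean `removal_ok` merely stands in for the third-party judgment result, whose actual criteria the claims do not specify.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class CalibrationWord(Enum):
    # The three kinds of calibration words named in claim 9.
    ABNORMAL = "abnormal"
    ERROR = "error"
    DEFECT = "defect"


@dataclass
class Record:
    # Hypothetical record layout; the claims do not define a schema.
    record_id: str
    provider: str
    calibration_words: set = field(default_factory=set)
    history_extractors: list = field(default_factory=list)
    status: str = "active"  # "active", "corrected", "deleted" or "deactivated"


def build_feedback(records):
    """Group flagged record ids by data source provider and calibration word."""
    feedback = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for word in rec.calibration_words:
            feedback[rec.provider][word].append(rec.record_id)
    # Convert to plain dicts so the result is easy to inspect or serialize.
    return {provider: dict(words) for provider, words in feedback.items()}


def apply_verification(rec, removal_ok, action="correct"):
    """Apply the management operation and return notifications.

    removal_ok: assumed stand-in for the third-party judgment result.
    action: "correct" or "delete"; only used when removal_ok is True.
    """
    if removal_ok:
        if action not in ("correct", "delete"):
            raise ValueError(f"unknown action: {action!r}")
        rec.calibration_words.clear()
        rec.status = "corrected" if action == "correct" else "deleted"
    else:
        # Condition for removal not met: move the data to the deactivated set.
        rec.status = "deactivated"
    # One notification per historical extractor of this record.
    return [
        f"{user}: record {rec.record_id} is now {rec.status}"
        for user in rec.history_extractors
    ]


records = [
    Record("r1", "lab-A", {CalibrationWord.ERROR}, ["alice", "bob"]),
    Record("r2", "lab-A", {CalibrationWord.DEFECT}, ["carol"]),
    Record("r3", "lab-B"),
]
feedback = build_feedback(records)
notes_ok = apply_verification(records[0], removal_ok=True, action="correct")
notes_bad = apply_verification(records[1], removal_ok=False)
```

In this sketch, records without calibration words generate no feedback, and every historical extractor of a handled record receives exactly one notification, whichever management operation was chosen.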

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410216976.6A CN117785862B (en) 2024-02-28 2024-02-28 Biological safety database extraction verification method and system

Publications (2)

Publication Number Publication Date
CN117785862A (en) 2024-03-29
CN117785862B (en) 2024-05-03

Family

ID=90380184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410216976.6A Active CN117785862B (en) 2024-02-28 2024-02-28 Biological safety database extraction verification method and system

Country Status (1)

Country Link
CN (1) CN117785862B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821354A (en) * 2023-04-26 2023-09-29 浙江药科职业大学 Construction method of traditional Chinese medicine knowledge graph
CN116932523A (en) * 2023-08-14 2023-10-24 北京三维天地科技股份有限公司 Platform for integrating and supervising third party environment detection mechanism
CN117453764A (en) * 2023-10-12 2024-01-26 上海禾亘科技有限责任公司 Data mining analysis method

Also Published As

Publication number Publication date
CN117785862A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
Amini et al. Agricultural databases evaluation with machine learning procedure
US11093519B2 (en) Artificial intelligence (AI) based automatic data remediation
US10909188B2 (en) Machine learning techniques for detecting docketing data anomalies
CN111027615B (en) Middleware fault early warning method and system based on machine learning
CN113779272B (en) Knowledge graph-based data processing method, device, equipment and storage medium
CN107077413A (en) The test frame of data-driven
CN107168995B (en) Data processing method and server
US20150220868A1 (en) Evaluating Data Quality of Clinical Trials
CN112365939B (en) Data management method and system based on medical health big data
CN110910991B (en) Medical automatic image processing system
EP3779757A1 (en) Simulated risk contributions
CN113221960A (en) Construction method and collection method of high-quality vulnerability data collection model
CN117785862B (en) Biological safety database extraction verification method and system
Kumar et al. Learning constraint programming models from data using generate-and-aggregate
CN111143616B (en) Video image data management method
CN116383742B (en) Rule chain setting processing method, system and medium based on feature classification
CN112732690B (en) Stabilizing system and method for chronic disease detection and risk assessment
CN111597510B (en) Power transmission and transformation operation detection data quality assessment method and system
CN113722230A (en) Integrated assessment method and device for vulnerability mining capability of fuzzy test tool
CN113555124A (en) Blood routine sample difference checking method based on machine learning
Ebrahimzadeh et al. A Hybrid Recurrent Neural Network Approach for Detecting Abnormal User Behavior in Social Networks
US20230177472A1 (en) Method for detecting inaccuracies and gaps and for suggesting deterioration mechanisms and actions in inspection reports
CN113190805A (en) Code asset management system
Ash et al. WCLD: Curated Large Dataset of Criminal Cases from Wisconsin Circuit Courts
CN114118359A (en) Breeding environment evaluation method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant