CN115455131A - Data storage method, system, equipment and storage medium based on multi-source isomerism - Google Patents


Info

Publication number
CN115455131A
Authority
CN
China
Prior art keywords
document
fingerprint value
fingerprint
document content
value
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211007920.7A
Other languages
Chinese (zh)
Inventor
肖芳
罗敏
郭佳璟
樊欣
宋娇
甘早斌
卓应忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Weipuzhitu Data Technology Co ltd
Huazhong University of Science and Technology
Original Assignee
Chongqing Weipuzhitu Data Technology Co ltd
Huazhong University of Science and Technology
Application filed by Chongqing Weipuzhitu Data Technology Co ltd, Huazhong University of Science and Technology

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/325: Hash tables
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3346: Query execution using probabilistic model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data storage method, system, device and storage medium for multi-source heterogeneous data. The method comprises: acquiring a plurality of document contents according to a plurality of metadata in a database, and acquiring a plurality of feature values corresponding to each document content; calculating on each feature value according to a text similarity algorithm and a term frequency-inverse document frequency (TF-IDF) weighting algorithm to obtain a fingerprint value corresponding to each document content; comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set of existing documents to obtain a comparison result; and, when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not a duplicate and storing the corresponding metadata in the storage system. Because a fingerprint value is calculated from the feature values of each document and compared with the fingerprint value set of existing documents to identify non-duplicate documents, multi-source heterogeneous document contents can be deduplicated and stored rapidly in batches.

Description

Data storage method, system, equipment and storage medium based on multi-source isomerism
Technical Field
The invention relates to the technical field of data processing, and in particular to a data storage method, system, device and storage medium for multi-source heterogeneous data.
Background
With the continuous development of big data processing technology, the data generated by various information systems have become increasingly interrelated, forming information networks such as social networks, the mobile internet, biomolecular relationship networks, digital resources and knowledge graphs. In the library, information and archive management field, research focuses mainly on heterogeneous information networks composed of digital resources, including the organization, management, disclosure, use and analysis of digital resources. The goal of this research is to improve service quality and efficiency, that is, to serve readers faster, better and more conveniently across the whole document service flow, from the user's search request to the acquisition of resources or knowledge.
However, current digital resource information networks exhibit the following features: (1) The amount of data is extremely large: with the explosive growth of global information, resource metadata is held by different digital resource providers, the total amount runs into the billions, and the storage cost is high. (2) The data repetition rate is high: because of business or technical barriers between digital resource providers, the resources of different providers cross, overlap and complement one another, so the access quality of the resources is low. (3) The metadata is multi-source and heterogeneous: almost every digital resource provider has its own open access mode with a specific standard, with irregular fields, differing meanings and differing formats, which makes knowledge conversion difficult. Therefore, how to store multi-source heterogeneous data without duplication is a problem to be solved urgently.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a data storage method, system, device and storage medium for multi-source heterogeneous data, which address the problem of how to store multi-source heterogeneous data without duplication.
According to a first aspect of the present invention, a data storage method for multi-source heterogeneous data is provided, including:
acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content;
calculating each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;
acquiring a fingerprint value set of existing documents, and performing hamming distance comparison on a fingerprint value corresponding to the content of each document and each fingerprint value in the fingerprint value set to obtain a comparison result;
and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not repeated, and storing the corresponding metadata into a storage system.
On the basis of the technical scheme, the invention can be improved as follows.
Optionally, the step of obtaining a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database, and obtaining a plurality of feature values corresponding to each document content includes:
acquiring a plurality of metadata in a PgSQL database;
obtaining a plurality of document contents with different formats in a plurality of data sources according to the plurality of metadata;
and extracting the key fields in the content of each document as a plurality of corresponding characteristic values.
Optionally, the key fields at least include: title, author, and author affiliation.
Optionally, the step of calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content includes:
obtaining a weight value corresponding to each feature value by using the TF-IDF weighting algorithm, according to the term frequency and the inverse document frequency corresponding to each feature value;
and calculating each feature value of each document content according to the Simhash text similarity algorithm and the weight value to obtain a fingerprint value corresponding to each document content.
Optionally, the step of obtaining the weight value corresponding to each feature value by using the TF-IDF weighting algorithm according to the term frequency and the inverse document frequency corresponding to each feature value further includes:
acquiring the weight values corresponding to the plurality of feature values of each document content;
performing linear fitting on the weight values corresponding to the plurality of feature values according to Zipf's law to obtain a fitting function;
and optimizing the weight values corresponding to the plurality of feature values according to the fitting function to obtain optimized weight values corresponding to the plurality of feature values.
Optionally, the step of comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result includes:
and comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set by using a sliding window algorithm to obtain a comparison fingerprint value set which accords with a difference threshold value in the fingerprint value set.
Optionally, when the comparison result meets a preset condition, the step of determining that the document content corresponding to the fingerprint value is not repeated includes:
and when the number of elements in the comparison fingerprint value set is equal to 0, judging that the fingerprint values of the document content corresponding to the comparison fingerprint value set are not repeated.
According to a second aspect of the present invention, there is provided a multi-source heterogeneous based data storage system, comprising:
the characteristic acquisition module is used for acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database and acquiring a plurality of characteristic values corresponding to each document content;
the fingerprint calculation module is used for calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;
the fingerprint comparison module is used for acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result;
and the data storage module is used for judging that the document content corresponding to the fingerprint value is not repeated when the comparison result meets a preset condition, and storing the corresponding metadata into a storage system.
According to a third aspect of the present invention, there is provided an apparatus including a memory and a processor, where the processor is configured to implement the steps of any one of the data storage methods of the first aspect when executing a computer management program stored in the memory.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium on which a computer management program is stored, wherein the computer management program, when executed by a processor, implements the steps of any one of the data storage methods of the first aspect.
The invention provides a data storage method, system, device and storage medium for multi-source heterogeneous data. The method comprises: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of feature values corresponding to each document content; calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result; and, when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not a duplicate and storing the corresponding metadata in a storage system.
The method extracts feature values from document contents with different formats in a plurality of data sources, so that these contents can be processed in a unified feature-processing manner. A fingerprint value is obtained for each document by using the Simhash text similarity algorithm and the TF-IDF weighting algorithm, so that the fingerprint value derived from a document's feature values better matches expectations, which reduces the complexity and improves the accuracy of fingerprint value calculation. The fingerprint value of each document is then compared with the fingerprint values of existing documents to identify non-duplicate document contents, which are stored. Document contents with different formats can thus be deduplicated and stored rapidly in batches, which greatly reduces the complexity of document deduplication and the computing power required of the server, and improves the storage efficiency of document contents with different formats.
Drawings
FIG. 1 is a flow chart of a data storage method based on multi-source heterogeneity provided by the present invention;
FIG. 2 is a graph comparing accuracy of an improved Simhash algorithm with that of an original Simhash algorithm;
FIG. 3 is a graph comparing recall rates of an improved Simhash algorithm and an original Simhash algorithm according to the present invention;
FIG. 4 is a graph comparing the execution time of the improved Simhash algorithm with that of the original Simhash algorithm;
FIG. 5 is a schematic structural diagram of a data storage system based on multi-source heterogeneous technologies according to the present invention;
FIG. 6 is a schematic diagram of a hardware structure of a possible apparatus provided in the present invention;
FIG. 7 is a schematic diagram of a hardware structure of a possible computer-readable storage medium according to the present invention.
Detailed Description
The following detailed description of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
FIG. 1 is a flowchart of the data storage method for multi-source heterogeneous data provided by the present invention. As shown in FIG. 1, the method includes:
Step S100: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of feature values corresponding to each document content;
It should be noted that the method of this embodiment may be executed by a computer terminal device with data processing, network communication and program running functions, for example a computer or a tablet computer; it may also be executed by a server device or a cloud server with the same or similar functions, which is not limited in this embodiment. For ease of understanding, this embodiment and the following embodiments are described by taking a server device as an example.
It is to be understood that the PgSQL database is an object relational database system, and may be used to store metadata, and other object relational databases may also be used in the method of this embodiment instead of the PgSQL database, which is not limited in this embodiment.
It should be understood that the above metadata may be data describing basic information of the content of the above document and corresponding stored information, such as: document name, document number, document source, document type, document storage address, and the like.
It should be further noted that each document content has a plurality of corresponding feature values, and a feature value may be a key field of the document, for example: title, author, author affiliation, etc.
It will also be appreciated that the plurality of document contents described above are documents to be stored that require deduplication.
It should also be understood that document contents with different formats in a plurality of data sources means that the documents differ in data source and format, for example Chinese journal papers, foreign association papers, Chinese academic papers, etc.
In a specific implementation, a plurality of metadata in the PgSQL database are acquired, corresponding document contents are acquired through the plurality of metadata, and a plurality of feature values corresponding to each document content are extracted.
Step S200: calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;
In a specific implementation, the Simhash text similarity algorithm and the TF-IDF weighting algorithm are used to calculate on each feature value of each document content to obtain a feature fingerprint value corresponding to each feature value, and the feature fingerprint values of each document content are merged, accumulated and reduced in dimension to obtain the fingerprint value corresponding to that document content.
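The merging, accumulation and dimension reduction described above can be illustrated with a minimal weighted Simhash sketch. It is an assumption-laden illustration, not the patented implementation: the function name `simhash`, the 64-bit width and the use of SHA-256 as the per-token hash are all hypothetical choices, and the weights are taken as given (for example TF-IDF weights).

```python
import hashlib


def simhash(weighted_tokens, bits=64):
    """Compute a Simhash fingerprint from a {token: weight} mapping.

    Hypothetical sketch: SHA-256 truncated to `bits` bits stands in for
    whatever per-token hash a real system would use.
    """
    totals = [0.0] * bits
    for token, weight in weighted_tokens.items():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Accumulate: add the weight where the bit is 1, subtract where it is 0.
        for i in range(bits):
            if (token_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # Dimension reduction: keep only the sign of each accumulated component.
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because only the sign of each accumulated component survives, a token with a dominant weight determines most bits of the fingerprint, which is why the choice of weights matters for duplicate detection.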
Step S300: acquiring a fingerprint value set of an existing document, and performing hamming distance comparison on a fingerprint value corresponding to the content of each document and each fingerprint value in the fingerprint value set to obtain a comparison result;
It should be noted that the fingerprint value set of existing documents may be the fingerprint values of all documents already stored at a designated location. The designated location may be local to the server or in a distributed storage system, which is not limited in this embodiment. To further improve data reading efficiency, the fingerprint values of the existing documents may be stored in Redis, a distributed in-memory store.
It is understood that the Hamming distance, also called the code distance in information coding, is the number of bit positions at which two code words of equal length take different values. For example, 10101 and 00110 differ at the first, fourth and fifth bits, so their Hamming distance is 3.
It should be understood that the Hamming distance comparison described above is specifically as follows: let the fingerprint value of a document content to be stored be newhash, and let the fingerprint value of an existing document be redishash, where:
$$\mathrm{newhash} = x_1 x_2 \ldots x_i, \qquad \mathrm{redishash} = y_1 y_2 \ldots y_i$$
The Hamming distance is calculated as:
$$d(\mathrm{newhash}, \mathrm{redishash}) = \sum_{k=1}^{i} x_k \oplus y_k$$
In a specific implementation, the fingerprint value corresponding to each document content is compared with each fingerprint value in the fingerprint value set of existing documents, to obtain the Hamming distance between the fingerprint value of each of the plurality of document contents and the fingerprint value of each existing document.
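As a small illustration of this comparison, the Hamming distance between two integer fingerprints can be computed by XOR-ing them and counting the set bits. The function names below are hypothetical, and the existing fingerprints are passed as a plain Python collection rather than read from any particular store.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def distances_to_existing(new_fp: int, existing_fps):
    """Map each existing fingerprint to its Hamming distance from new_fp."""
    return {fp: hamming_distance(new_fp, fp) for fp in existing_fps}
```

For the example values 10101 and 00110, `hamming_distance(0b10101, 0b00110)` gives 3.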
Step S400: and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not repeated, and storing the corresponding metadata into a storage system.
It should be noted that the preset condition may be set according to actual requirements. For example, the preset condition may be: when no fingerprint value in the existing document set lies within a Hamming distance of less than 7 of the fingerprint value of a document content, the document content corresponding to that fingerprint value is judged not to be a duplicate.
It can be understood that, in view of the defects in the background art, the embodiment of the invention provides a data storage method for multi-source heterogeneous data. The method comprises: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of feature values corresponding to each document content; calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and, when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not a duplicate and storing the corresponding metadata in a storage system.
The method extracts feature values from document contents with different formats in a plurality of data sources, so that these contents can be processed in a unified feature-processing manner. A fingerprint value is obtained for each document by using the Simhash text similarity algorithm and the TF-IDF weighting algorithm, so that the fingerprint value derived from a document's feature values better matches expectations, which reduces the complexity and improves the accuracy of fingerprint value calculation. The fingerprint value of each document is then compared with the fingerprint values of existing documents to identify non-duplicate document contents, which are stored. Document contents with different formats can thus be deduplicated and stored rapidly in batches, which greatly reduces the complexity of document deduplication and the computing power required of the server, and improves the storage efficiency of document contents with different formats.
In a possible embodiment, the step of obtaining a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database, and obtaining a plurality of feature values corresponding to each document content includes:
step S101: acquiring a plurality of metadata in a PgSQL database;
step S102: obtaining a plurality of document contents with different formats in a plurality of data sources according to the plurality of metadata;
step S103: and extracting the key fields in the content of each document as a plurality of corresponding characteristic values.
In the embodiment of the invention, the key fields in the document contents with different formats in the data sources are extracted, so that the uniform characteristic matching processing can be performed on the document contents with different multi-source structures, the complexity of data processing is greatly reduced, and the accuracy of data processing is improved.
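A minimal sketch of steps S101 to S103 is given below, assuming the records have already been fetched as Python dictionaries. The source names, field names and the `FIELD_MAPS` table are hypothetical; real providers will use their own keys, and the normalization shown (whitespace collapsing and lower-casing) is only one plausible choice.

```python
# Hypothetical per-source field mappings onto the unified key fields.
FIELD_MAPS = {
    "source_a": {"title": "title", "author": "authors", "affiliation": "org"},
    "source_b": {"title": "ArticleTitle", "author": "Creator", "affiliation": "Institution"},
}


def extract_features(record, source):
    """Map a source-specific record onto unified key fields (feature values)."""
    features = {}
    for unified_name, source_key in FIELD_MAPS[source].items():
        value = record.get(source_key, "")
        # Some sources may deliver multi-valued fields as lists.
        if isinstance(value, (list, tuple)):
            value = "; ".join(str(v) for v in value)
        # Collapse whitespace and lower-case so formats become comparable.
        features[unified_name] = " ".join(str(value).split()).lower()
    return features
```

With such a mapping, two records that describe the same document in different source formats yield the same feature values, which is what allows a single fingerprinting routine to serve all sources.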
In a possible embodiment, the step of calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content includes:
Step S201: obtaining a weight value corresponding to each feature value by using the TF-IDF weighting algorithm, according to the term frequency and the inverse document frequency corresponding to each feature value;
Step S202: calculating each feature value of each document content according to the Simhash text similarity algorithm and the weight value to obtain a fingerprint value corresponding to each document content.
In the embodiment of the invention, the weight value of each feature value is optimized by using the TF-IDF weighting algorithm, and the fingerprint value is calculated for each feature value of each document content according to the Simhash text similarity algorithm and the optimized weight value, so that the calculated fingerprint value reflects the characteristics of each feature value more accurately, which greatly improves the accuracy of subsequent duplicate data identification.
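Step S201 can be sketched as follows. The source does not state which TF-IDF variant is used, so this example assumes one common smoothed form, idf = ln((1 + N) / (1 + df)) + 1, which keeps every weight positive; the function name and the tokenized inputs are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Return {token: tf * idf} for one tokenized document.

    corpus is a list of tokenized documents; the smoothed idf used here is
    an assumed variant, not one confirmed by the source.
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    weights = {}
    for token, count in counts.items():
        # Document frequency: how many corpus documents contain the token.
        df = sum(1 for other in corpus if token in other)
        tf = count / total
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[token] = tf * idf
    return weights
```

Tokens that occur in every document receive a low weight, while tokens that are rare across the corpus receive a higher one, so they contribute more to the resulting fingerprint.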
In a possible embodiment, the step of obtaining the weight value corresponding to each feature value by using the TF-IDF weighting algorithm according to the term frequency and the inverse document frequency corresponding to each feature value further includes:
Step S2011: acquiring the weight values corresponding to the plurality of feature values of each document content;
Step S2012: performing linear fitting on the weight values corresponding to the plurality of feature values according to Zipf's law to obtain a fitting function;
Step S2013: optimizing the weight values corresponding to the plurality of feature values according to the fitting function to obtain optimized weight values corresponding to the plurality of feature values.
In the embodiment of the invention, linear fitting is further performed on the weight values of the feature values of each document content, so that the weight value corresponding to each feature value is further optimized and reflects each feature accurately, which further improves the accuracy of subsequent duplicate data identification.
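The source does not spell out how the fitting function is applied to the weights. One possible reading, sketched here purely as an assumption, is to rank the weights, fit the straight line log(w) = a + b * log(rank) that Zipf's law predicts, and replace each weight with the fitted value at its rank. The function names are hypothetical, and the sketch requires at least two strictly positive weights.

```python
import math


def zipf_fit(weights):
    """Least-squares fit of log(w) = a + b * log(rank); returns (a, b)."""
    ranked = sorted(weights, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(w) for w in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def smooth_weights(weighted):
    """Replace each weight with the fitted value at its rank (assumed reading)."""
    items = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    a, b = zipf_fit([w for _, w in items])
    return {
        token: math.exp(a + b * math.log(rank))
        for rank, (token, _) in enumerate(items, start=1)
    }
```

Weights that already follow an exact 1/rank law are left unchanged by this smoothing, while outliers are pulled towards the fitted line.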
In a possible embodiment, the step of comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result includes:
step S301: and comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set by using a sliding window algorithm to obtain a comparison fingerprint value set which accords with a difference threshold value in the fingerprint value set.
It should be noted that the sliding window algorithm may be a platform and a specific time for acquiring the first appearance of the document according to the document type, where the document type includes a special bibliography, a periodical, a book, a newspaper, a conference document, a scientific report, a standard document, a patent document, a academic paper, and a government publication, and the current document platform usually does not include all documents, for example, the patent document usually uses the national intellectual property office as a main publishing platform, so that documents belonging to each document platform are distinguished by document partitions in the distributed memory, and the sliding window algorithm is convenient to identify and read.
It can be understood that the algorithm of the sliding window is specifically: according to the type of the document, acquiring a platform and time when the document appears for the first time, and taking the time as a node or combining the platform and the time into a node, wherein the method specifically comprises the following steps: traversing a platform and time of the first occurrence of the document in the distributed memory, wherein the condition that the platform of the first occurrence cannot be found by the document or the condition that the platforms of the first occurrence of the document occur at the same time exists in the document, if the condition is the condition, the time is taken as a node, a preset time year is advanced forward, fingerprint values within the time year are selected for matching, the preset time year is 5 years in the embodiment, for example, if the time of the first occurrence of the document is 2010, the 2010 is taken as a time node, and the fingerprint values of the document between 2005 and 2010 in the distributed memory are traversed; if the platform and the time of the first appearance of the document can be inquired in the traversal distributed memory, combining the platform and the time into a node, advancing a certain time period forwards by the time of the first appearance of the document and the time of the first appearance of the document, reading fingerprint values of document metadata of other platforms of the same type for matching, for example, if the platform of the first appearance of the document is EI and the time of the first appearance is 2010, traversing all document fingerprint values in the traversal distributed memory between 2005 and 2010 except the EI platform.
It should be understood that the difference threshold may be a single threshold obtained by experiment or a threshold range; a fingerprint value that meets the difference threshold is one whose Hamming distance is smaller than the threshold, or whose Hamming distance falls within the threshold range.
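As a small sketch of this check, the Hamming distance between two integer fingerprints can be computed by counting the set bits of their XOR. The function names, the optional `threshold_range` parameter, and the choice of inclusive range bounds are assumptions for illustration only.

```python
from typing import Iterable, List, Optional, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def matches_within_threshold(
    target: int,
    candidates: Iterable[int],
    threshold: int,
    threshold_range: Optional[Tuple[int, int]] = None,
) -> List[int]:
    """Return the candidate fingerprints that meet the difference threshold.

    With a single threshold, a candidate matches when its distance is
    strictly smaller than the threshold. With a range (assumed inclusive),
    a candidate matches when its distance lies inside the range.
    """
    matched = []
    for fp in candidates:
        d = hamming_distance(target, fp)
        if threshold_range is None:
            ok = d < threshold
        else:
            low, high = threshold_range
            ok = low <= d <= high
        if ok:
            matched.append(fp)
    return matched
```

Under the strict "smaller than" convention, a rule such as "Hamming distance less than or equal to 2" corresponds to `threshold=3`.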
According to the embodiment of the invention, the fingerprint values of the documents are matched by using a sliding window algorithm, and the matching efficiency of the fingerprint values is greatly improved by combining the time dimension and the space dimension, and meanwhile, the calculation pressure of a server is also reduced.
In a possible embodiment, the step of determining that the document content corresponding to the fingerprint value is not repeated when the comparison result satisfies a preset condition includes:
Step S302: when the number of elements in the comparison fingerprint value set is equal to 0, it is determined that the fingerprint value of the document content corresponding to the comparison fingerprint value set is not duplicated.
It should be noted that the number of elements in the comparison fingerprint value set may be greater than 1, equal to 1, or equal to 0. When the number of elements is greater than 1, manual intervention can be used to judge whether the fingerprint value of the corresponding document content duplicates the fingerprint value of an existing document; when the number of elements is equal to 1, the fingerprint value of the corresponding document content can be judged to duplicate that of a stored document; when the number of elements is equal to 0, the fingerprint value of the corresponding document content can be judged not to be duplicated.
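The three cases above amount to a simple dispatch on the size of the comparison fingerprint value set. The sketch below is illustrative only; the `Decision` enum and the `decide` function are hypothetical names that do not appear in the embodiment.

```python
from enum import Enum
from typing import Sequence


class Decision(Enum):
    """Hypothetical outcome labels for one incoming document."""
    NOT_DUPLICATED = "not duplicated"
    DUPLICATED = "duplicated"
    MANUAL_REVIEW = "manual intervention"


def decide(comparison_set: Sequence[int]) -> Decision:
    """Map the number of matching fingerprints to a decision."""
    count = len(comparison_set)
    if count == 0:
        return Decision.NOT_DUPLICATED  # step S302: no match found
    if count == 1:
        return Decision.DUPLICATED      # exactly one matching fingerprint
    return Decision.MANUAL_REVIEW       # several matches: ask a person
```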
In a specific implementation, when the fingerprint value of the document content corresponding to the comparison fingerprint value set is determined not to be duplicated, the document and the metadata corresponding to the document content are stored in a distributed storage system, wherein each node in the distributed storage system is formed by an inner-disk to form two layers of storage. For metadata operations, the operation type and the operation time of each CRUD (create, read, update, delete) operation are stored in the distributed storage system together with each update. After a piece of metadata has been updated and stored, that piece of metadata is cleaned from the PgSQL database, thereby releasing storage space and reducing the computing pressure on the server.
In a possible application scenario, in order to further explain the performance and effect enhancement of the embodiment of the present invention, the embodiment further provides a test environment and a test effect analysis, and in the application scenario, the operating environment of a system corresponding to the method of the present invention is as follows:
1. Physical environment
CPU: i7, 16 cores; memory: 64 GB; the IP addresses of the three test machines are 192.168.21.106, 192.168.21.107, and 192.168.21.108.
2. Network environment
ClickHouse (Click Stream Data WareHouse, a columnar storage database) is deployed at 192.168.21.106, and Redis is deployed at 192.168.21.107.
3. Original data set storage mode and address
The original data set is stored as avro files in the directory 192.168.21.108/data/base_data, and the processed data set is stored in the ztdb_base database of ClickHouse.
4. Result data set storage mode and address
The resulting data set is stored in the Redis db1 table. The duplicate data sets found by the algorithm are backed up in the Dulp_Data table of the Data_2020 database of ClickHouse. The Hamming distance calculation results are stored in the HMD_data table of the data_2020 database of ClickHouse.
In the application scenario, the following index terms are used. TP (true positive): the judgment is correct and the record is a duplicate. TN (true negative): the judgment is correct and the record is not a duplicate. FP (false positive): the judgment is wrong; the record is not a duplicate of the record corresponding to the target simhash value but is judged to be one (the record may in fact be non-duplicated, or may be a duplicate of another record). FN (false negative): the judgment is wrong; the record is a duplicate of the record corresponding to the target simhash value but is judged not to be one.
In the application scenario, the method further comprises three test indexes: de-duplication rate, precision rate and recall rate, wherein:
Accuracy: the ratio of the number of correctly classified samples to the total number of samples; in this experiment, the ratio of the number of documents whose duplicate or non-duplicate status is predicted correctly to the total number of documents, also called the de-duplication rate.
Precision: the ratio of the number of samples correctly classified as positive to the total number of samples classified as positive; in this experiment, the ratio of the number of correctly predicted duplicate documents to the number of documents predicted to be duplicates, also called the precision rate.
Recall: the ratio of the number of correctly classified positive samples to the number of known positive samples; in this experiment, the ratio of the number of correctly predicted duplicate documents to the number of known duplicate documents, also called the recall rate.
The formulas for Accuracy (Accuracy), precision (Precision), recall (Recall) of the results are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
In the test of the application scenario, the data set is described as follows: the data set year is 2020. The test results are as follows: the total number of documents evaluated by the algorithm of the embodiment of the invention is 225277. Of these, 142950 documents have a Hamming distance less than or equal to 2 and are predicted to be duplicates; a preset rule check confirms that 142926 of them are predicted correctly, i.e., TP is 142926, and FP is accordingly 24. The remaining 82327 documents have a Hamming distance greater than 2 and are predicted not to be duplicates; the rule check finds that 7924 of them are in fact duplicates, i.e., FN is 7924, and TN is accordingly 74403.
The experimental results calculated from these values are: accuracy (de-duplication rate) 96.47%, precision 99.98%, and recall 94.75%.
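The reported percentages can be reproduced from the four counts above (TP = 142926, FP = 24, FN = 7924, TN = 74403) using the standard confusion-matrix definitions of accuracy, precision, and recall. The function name below is a hypothetical helper introduced for this check.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all documents
        "precision": tp / (tp + fp),     # correct duplicates / predicted duplicates
        "recall": tp / (tp + fn),        # correct duplicates / known duplicates
    }


metrics = classification_metrics(tp=142926, fp=24, fn=7924, tn=74403)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# accuracy: 96.47%
# precision: 99.98%
# recall: 94.75%
```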
In the test of the application scenario, the Chinese document can be tested, so that the de-duplication effect of the Chinese document data is obtained, wherein the total amount of the Chinese document data in the experiment is 2347285.
In the above Chinese document test, the test results are as follows: the total number of documents evaluated by the algorithm of the embodiment of the invention is 2347285. Of these, 297898 documents have a Hamming distance less than or equal to 2 and are predicted to be duplicates; a preset rule check confirms that 295716 of them are predicted correctly, i.e., TP is 295716, and FP is accordingly 2182. The remaining 2049387 documents have a Hamming distance greater than 2 and are predicted not to be duplicates; the preset rule check detects 46037 duplicates among them, with FN stated as 46307, and TN is calculated as 2003350.
In the Chinese document test of the application scenario, the experimental results calculated from these values are: accuracy (de-duplication rate) 97.95%, precision 99.27%, and recall 86.46%.
The preset rule check may be performed manually.
In order to further explain the improvement in performance and effect shown by the test results in this embodiment, the implementation of the invention also provides a comparison of the improved simhash algorithm and the original simhash algorithm in five subject-word fields (internet, education, AI, medical treatment, and housing), covering accuracy, recall rate, and execution time; refer to fig. 2, fig. 3, and fig. 4. As can be seen from fig. 2, fig. 3, and fig. 4, compared with the existing Simhash algorithm, the multi-source heterogeneous data storage method provided by the present application has significantly improved accuracy and recall rate and a significantly reduced execution time. The multi-source heterogeneous data storage method provided by the embodiment of the present application can therefore rapidly implement batch deduplication and storage of multi-source heterogeneous document contents, which reduces the complexity of document deduplication and the computing power required of the server and improves the storage efficiency of multi-source heterogeneous documents.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a data storage system based on multi-source heterogeneity according to an embodiment of the present invention. As shown in fig. 5, the data storage system based on multi-source heterogeneity includes a feature obtaining module 100, a fingerprint calculating module 200, a fingerprint comparing module 300, and a data storage module 400, where:
the feature obtaining module 100 is configured to acquire, according to a plurality of metadata in the PgSQL database, a plurality of document contents with different formats in a plurality of data sources, and acquire a plurality of characteristic values corresponding to each document content; the fingerprint calculating module 200 is configured to calculate each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; the fingerprint comparing module 300 is configured to obtain a fingerprint value set of existing documents, and perform Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and the data storage module 400 is configured to determine that the document content corresponding to the fingerprint value is not duplicated when the comparison result meets a preset condition, and store the corresponding metadata in a storage system.
It can be understood that the data storage system based on the multi-source heterogeneity provided by the present invention corresponds to the data storage method based on the multi-source heterogeneity provided by the foregoing embodiments, and the related technical features of the data storage system based on the multi-source heterogeneity may refer to the related technical features of the data storage method based on the multi-source heterogeneity, and are not described herein again.
Referring to fig. 6, fig. 6 is a schematic diagram of an embodiment of an apparatus according to an embodiment of the present invention. As shown in fig. 6, an embodiment of the present invention provides an apparatus, which includes a memory 1310, a processor 1320, and a computer program 1311 stored in the memory 1310 and executable on the processor 1320, where the processor 1320 executes the computer program 1311 to implement the following steps:
acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content; calculating each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and when the comparison result meets a preset condition, determining that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata into a storage system.
Referring to fig. 7, fig. 7 is a schematic diagram of an embodiment of a computer-readable storage medium according to the present invention. As shown in fig. 7, the present embodiment provides a computer-readable storage medium 1400, on which a computer program 1411 is stored, the computer program 1411 when executed by a processor implements the steps of:
acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content; calculating each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result; and when the comparison result meets a preset condition, determining that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata into a storage system.
The invention provides a data storage method, system, device, and storage medium based on multi-source heterogeneity, wherein the method comprises the following steps: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content; calculating each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and when the comparison result meets a preset condition, determining that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata into a storage system.
The method extracts characteristic values from document contents with different formats in a plurality of data sources, so that these document contents can be processed in a unified feature-processing manner. A fingerprint value corresponding to each document is then obtained using the Simhash text similarity algorithm and the TF-IDF (term frequency-inverse document frequency) weighting algorithm, so that the fingerprint value obtained for each document from its plurality of characteristic values better matches expectations, which reduces the complexity and improves the accuracy of fingerprint value calculation. The fingerprint value of each document is compared with the fingerprint values of existing documents to obtain the non-duplicated document contents, which are then stored. Document contents with different formats can thus be deduplicated and stored rapidly in batches, which greatly reduces the complexity of document deduplication and the computing power required of the server and improves the storage efficiency of documents with different formats.
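The fingerprint step summarized above can be illustrated with a generic weighted Simhash. The sketch below is not the embodiment's implementation: the 64-bit width, the MD5-based token hash, the smoothed IDF formula, and the function names are all assumptions chosen only to make the example self-contained.

```python
import hashlib
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(tokens: List[str], doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """TF-IDF weight per token (assumed smoothed IDF, always positive)."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    weights = {}
    for token, count in counts.items():
        tf = count / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(token, 0))) + 1.0
        weights[token] = tf * idf
    return weights


def token_hash(token: str, bits: int = 64) -> int:
    """Stable hash of a token, truncated to the given bit width (assumed MD5)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weights: Dict[str, float], bits: int = 64) -> int:
    """Weighted Simhash: each token votes +w or -w on every bit position."""
    acc = [0.0] * bits
    for token, weight in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i  # positive total vote sets the bit
    return fingerprint
```

In such a sketch, the tokens would come from the extracted key fields of a document, and two documents whose fingerprints differ in only a few bits would be treated as candidate duplicates.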
It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A data storage method based on multi-source heterogeneity, characterized by comprising the following steps:
acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content;
calculating each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;
acquiring a fingerprint value set of existing documents, and performing hamming distance comparison on a fingerprint value corresponding to the content of each document and each fingerprint value in the fingerprint value set to obtain a comparison result;
and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not repeated, and storing the corresponding metadata into a storage system.
2. The method for storing data based on multi-source heterogeneity according to claim 1, wherein the step of obtaining a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database, and obtaining a plurality of feature values corresponding to each document content includes:
acquiring a plurality of metadata in a PgSQL database;
obtaining a plurality of document contents with different formats in a plurality of data sources according to the plurality of metadata;
and extracting the key fields in the content of each document as a plurality of corresponding characteristic values.
3. The multi-source heterogeneous-based data storage method of claim 2, wherein the key fields comprise at least: title, author, signing authority.
4. The data storage method based on multi-source heterogeneity according to claim 1, wherein the step of calculating each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content comprises:
obtaining, according to the term frequency and the inverse document frequency corresponding to each characteristic value, a weight value corresponding to each characteristic value by using the TF-IDF weighting algorithm;
and calculating each characteristic value of each document content according to the Simhash text similarity algorithm and the weight value to obtain a fingerprint value corresponding to each document content.
5. The multi-source heterogeneous data storage method according to claim 4, wherein the step of obtaining the weight value corresponding to each characteristic value by using the TF-IDF weighting algorithm according to the term frequency and the inverse document frequency corresponding to each characteristic value further comprises:
acquiring weight values corresponding to a plurality of characteristic values corresponding to each document content;
performing linear fitting on the weight values corresponding to the plurality of characteristic values according to Zipf's law to obtain a fitting function;
and optimizing the weight values corresponding to the plurality of characteristic values according to the fitting function to obtain the optimized weight values corresponding to the plurality of characteristic values.
6. The multi-source heterogeneous data storage method according to claim 1, wherein the step of performing hamming distance comparison on the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result comprises:
and carrying out hamming distance comparison on the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set by using a sliding window algorithm to obtain a comparison fingerprint value set which accords with a difference threshold value in the fingerprint value set.
7. The multi-source heterogeneous data storage method according to claim 6, wherein the step of determining that the document content corresponding to the fingerprint value is not repeated when the comparison result meets a preset condition includes:
and when the number of elements in the comparison fingerprint value set is equal to 0, judging that the fingerprint values of the document content corresponding to the comparison fingerprint value set are not repeated.
8. A data storage system based on multi-source heterogeneity, characterized by comprising:
the characteristic acquisition module is used for acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database and acquiring a plurality of characteristic values corresponding to each document content;
the fingerprint calculation module is used for calculating each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;
the fingerprint comparison module is used for acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result;
and the data storage module is used for judging that the document content corresponding to the fingerprint value is not repeated when the comparison result meets a preset condition, and storing the corresponding metadata into a storage system.
9. An apparatus, comprising a memory, and a processor configured to implement the steps of the multi-source heterogeneous based data storage method according to any one of claims 1-7 when executing a computer management class program stored in the memory.
10. A computer-readable storage medium, having stored thereon a computer management class program, which when executed by a processor, performs the steps of the multi-source heterogeneous based data storage method according to any one of claims 1 to 7.
CN202211007920.7A 2022-08-22 2022-08-22 Data storage method, system, equipment and storage medium based on multi-source isomerism Pending CN115455131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211007920.7A CN115455131A (en) 2022-08-22 2022-08-22 Data storage method, system, equipment and storage medium based on multi-source isomerism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211007920.7A CN115455131A (en) 2022-08-22 2022-08-22 Data storage method, system, equipment and storage medium based on multi-source isomerism

Publications (1)

Publication Number Publication Date
CN115455131A true CN115455131A (en) 2022-12-09

Family

ID=84299015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211007920.7A Pending CN115455131A (en) 2022-08-22 2022-08-22 Data storage method, system, equipment and storage medium based on multi-source isomerism

Country Status (1)

Country Link
CN (1) CN115455131A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076474A (en) * 2023-10-16 2023-11-17 之江实验室 Method, device, equipment and medium for updating offline multi-mode literature data
CN117076474B (en) * 2023-10-16 2024-03-12 之江实验室 Method, device, equipment and medium for updating offline multi-mode literature data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination