CN115455131A - Data storage method, system, equipment and storage medium based on multi-source heterogeneity

Info

Publication number: CN115455131A
Application number: CN202211007920.7A
Authority: CN
Inventors: 肖芳; 罗敏; 郭佳璟; 樊欣; 宋娇; 甘早斌; 卓应忠
Original assignee: Chongqing Weipuzhitu Data Technology Co ltd; Huazhong University of Science and Technology
Current assignee: Chongqing Weipuzhitu Data Technology Co ltd; Huazhong University of Science and Technology
Priority date: 2022-08-22
Filing date: 2022-08-22
Publication date: 2022-12-09

Abstract

The invention provides a data storage method, system, equipment and storage medium based on multi-source heterogeneity. The method comprises the following steps: acquiring a plurality of document contents according to a plurality of metadata in a database, and acquiring a plurality of characteristic values corresponding to each document content; calculating on each characteristic value according to a text similarity algorithm and a term frequency-inverse document frequency (TF-IDF) weighting algorithm to obtain a fingerprint value corresponding to each document content; comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set of existing documents to obtain a comparison result; and, when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata in the storage system. By calculating a fingerprint value from the characteristic values of each document and comparing it with the fingerprint value set of existing documents to identify non-duplicated documents, the contents of multi-source heterogeneous documents can be rapidly deduplicated and stored in batches.

Description

Data storage method, system, equipment and storage medium based on multi-source heterogeneity

Technical Field

The invention relates to the technical field of data processing, and in particular to a data storage method, system, equipment and storage medium based on multi-source heterogeneity.

Background

With the continuous development of big data processing technology, the data generated by various information systems have become increasingly interrelated, forming information networks such as social networks, the mobile internet, biomolecular relationship networks, digital resources and knowledge graphs. In the library, information and archive management industry, research focuses mainly on heterogeneous information networks composed of digital resources, including the organization, management, disclosure, use and analysis of digital resources. The goal of this research is to improve service quality and efficiency, that is, to serve readers faster, better and more conveniently throughout the document service flow from the user's search request to the acquisition of resources or knowledge.

However, current digital resource information networks exhibit the following features: (1) The amount of data is extremely large. With the explosive growth of global information, resource metadata is held by different digital resource providers, the total amount exceeds billions, and the storage cost is high. (2) The data repetition rate is high. Because of business or technical barriers between digital resource providers, the resources of different providers cross, overlap and complement one another, and the access quality of the resources is low. (3) The metadata is multi-source and heterogeneous. Almost every digital resource provider has its own open access mode with a specific standard, with irregular fields, differing meanings and differing formats, which makes knowledge conversion difficult. Therefore, how to store multi-source heterogeneous data without duplication is a problem to be solved urgently.

Disclosure of Invention

In view of the technical problems in the prior art, the invention provides a data storage method, system, equipment and storage medium based on multi-source heterogeneity, which are used to solve the problem of how to store multi-source heterogeneous data without duplication.

According to a first aspect of the present invention, a data storage method based on multi-source heterogeneity is provided, including:

acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content;

calculating on each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;

acquiring a fingerprint value set of existing documents, and performing hamming distance comparison on a fingerprint value corresponding to the content of each document and each fingerprint value in the fingerprint value set to obtain a comparison result;

and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not repeated, and storing the corresponding metadata into a storage system.

On the basis of the technical scheme, the invention can be improved as follows.

Optionally, the step of obtaining a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database, and obtaining a plurality of feature values corresponding to each document content includes:

acquiring a plurality of metadata in a PgSQL database;

obtaining a plurality of document contents with different formats in a plurality of data sources according to the plurality of metadata;

and extracting the key fields in the content of each document as a plurality of corresponding characteristic values.

Optionally, the key fields at least include: title, author, and author affiliation.

Optionally, the step of calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF term frequency-inverse document frequency weighting algorithm to obtain a fingerprint value corresponding to each document content includes:

according to the term frequency and the inverse document frequency corresponding to each characteristic value, obtaining a weight value corresponding to each characteristic value by using the TF-IDF (term frequency-inverse document frequency) weighting algorithm;

and calculating each characteristic value of each document content according to a Simhash text similarity calculation method and the weight value to obtain a fingerprint value corresponding to each document content.

Optionally, the step of obtaining the weight value corresponding to each feature value by using the TF-IDF (term frequency-inverse document frequency) weighting algorithm according to the term frequency and the inverse document frequency corresponding to each feature value further includes:

acquiring weight values corresponding to a plurality of characteristic values corresponding to each document content;

performing linear fitting on the weight values corresponding to the plurality of characteristic values according to Zipf's law to obtain a fitting function;

and optimizing the weight values corresponding to the plurality of characteristic values according to the fitting function to obtain the weight values corresponding to the optimized plurality of characteristic values.

Optionally, the step of comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result includes:

and comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set by using a sliding window algorithm to obtain a comparison fingerprint value set which accords with a difference threshold value in the fingerprint value set.

Optionally, when the comparison result meets a preset condition, the step of determining that the document content corresponding to the fingerprint value is not repeated includes:

and when the number of elements in the comparison fingerprint value set is equal to 0, judging that the fingerprint values of the document content corresponding to the comparison fingerprint value set are not repeated.

According to a second aspect of the present invention, there is provided a multi-source heterogeneous based data storage system, comprising:

the characteristic acquisition module is used for acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database and acquiring a plurality of characteristic values corresponding to each document content;

the fingerprint calculation module is used for calculating on each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;

the fingerprint comparison module is used for acquiring a fingerprint value set of the existing literature, and performing hamming distance comparison on a fingerprint value corresponding to the content of each literature and each fingerprint value in the fingerprint value set to obtain a comparison result;

and the data storage module is used for judging that the document content corresponding to the fingerprint value is not repeated when the comparison result meets a preset condition, and storing the corresponding metadata into a storage system.

According to a third aspect of the present invention, there is provided an apparatus including a memory and a processor, where the processor is configured to implement the steps of any one of the data storage methods based on multi-source heterogeneity of the first aspect when executing a computer management program stored in the memory.

According to a fourth aspect of the present invention, there is provided a computer-readable storage medium on which a computer management program is stored, wherein the computer management program, when executed by a processor, implements the steps of any one of the data storage methods based on multi-source heterogeneity of the first aspect.

The invention provides a data storage method, system, equipment and storage medium based on multi-source heterogeneity. The method comprises the following steps: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content; calculating on each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result; and, when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata in a storage system.
The method extracts characteristic values from document contents with different formats in a plurality of data sources, so that these contents can be processed in a unified feature-processing manner. The fingerprint value of each document is obtained by using the Simhash text similarity algorithm and the TF-IDF weighting algorithm, so that the fingerprint value derived from a document's characteristic values better matches expectations, which reduces the complexity and improves the accuracy of fingerprint value calculation. The fingerprint value of each document is then compared with the fingerprint values of existing documents to identify non-duplicated document contents, which are then stored. Document contents with different formats can thus be deduplicated and stored rapidly in batches, which greatly reduces the complexity of document deduplication and the computing power required of the server, and improves the storage efficiency of document contents with different formats.

Drawings

FIG. 1 is a flow chart of a data storage method based on multi-source heterogeneity provided by the present invention;

FIG. 2 is a graph comparing accuracy of an improved Simhash algorithm with that of an original Simhash algorithm;

FIG. 3 is a graph comparing recall rates of an improved Simhash algorithm and an original Simhash algorithm according to the present invention;

FIG. 4 is a graph comparing the execution time of the improved Simhash algorithm with that of the original Simhash algorithm;

FIG. 5 is a schematic structural diagram of a data storage system based on multi-source heterogeneous technologies according to the present invention;

FIG. 6 is a schematic diagram of a hardware structure of a possible apparatus provided in the present invention;

FIG. 7 is a schematic diagram of a hardware structure of a possible computer-readable storage medium according to the present invention.

Detailed Description

The following detailed description of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

FIG. 1 is a flowchart of a data storage method based on multi-source heterogeneity. As shown in FIG. 1, the method includes:

Step S100: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content;

It should be noted that the method of this embodiment may be executed by a computer terminal device with data processing, network communication and program running functions, for example a computer or a tablet computer; it may also be executed by a server device or a cloud server with the same or similar functions, which is not limited in this embodiment. For ease of understanding, this embodiment and the following embodiments are described taking a server device as an example.

It is to be understood that the PgSQL database is an object relational database system, and may be used to store metadata, and other object relational databases may also be used in the method of this embodiment instead of the PgSQL database, which is not limited in this embodiment.

It should be understood that the above metadata may be data describing basic information of the content of the above document and corresponding stored information, such as: document name, document number, document source, document type, document storage address, and the like.

It should be further noted that each document content has a plurality of corresponding characteristic values, and a characteristic value may be a key field of the document, for example: title, author, author affiliation, etc.

It will also be appreciated that the contents of a plurality of documents described above are documents to be stored which require deduplication.

It should also be understood that document contents with different formats in a plurality of data sources means that the documents differ in data source and format, for example Chinese journal papers, foreign association papers and Chinese dissertations.

In a specific implementation, a plurality of metadata in the PgSQL database are acquired, corresponding document contents are acquired through the plurality of metadata, and a plurality of feature values corresponding to each document content are extracted.

Step S200: calculating on each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;

In a specific implementation, the Simhash text similarity algorithm and the TF-IDF weighting algorithm are applied to each characteristic value of each document content to obtain a characteristic fingerprint value corresponding to each characteristic value, and the characteristic fingerprint values of each document content are merged, accumulated and reduced in dimension to obtain the fingerprint value corresponding to that document content.
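
The merge, accumulate and dimension-reduction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual implementation: the 64-bit fingerprint width, the use of MD5 as the per-feature hash, and the names `feature_hash` and `simhash` are all assumptions, and the weights are taken as given.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the source text does not state one


def feature_hash(feature: str) -> int:
    """Hash one characteristic value to a FINGERPRINT_BITS-bit integer (MD5 is an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:FINGERPRINT_BITS // 8], "big")


def simhash(weighted_features: dict) -> int:
    """Combine weighted characteristic-value hashes into one document fingerprint."""
    totals = [0.0] * FINGERPRINT_BITS
    for feature, weight in weighted_features.items():
        h = feature_hash(feature)
        for bit in range(FINGERPRINT_BITS):
            # Merge and accumulate: +weight for a 1 bit, -weight for a 0 bit.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        # Dimension reduction: keep only the sign of each accumulated column.
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


# Hypothetical characteristic values (title, author, affiliation) with example weights.
doc = {"a data storage method": 0.5, "li ming": 0.3, "example university": 0.3}
print(f"{simhash(doc):016x}")
```

Because every bit of the result is decided by a weighted vote, a characteristic value with a larger weight has more influence on the fingerprint than one with a smaller weight.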

Step S300: acquiring a fingerprint value set of an existing document, and performing hamming distance comparison on a fingerprint value corresponding to the content of each document and each fingerprint value in the fingerprint value set to obtain a comparison result;

It should be noted that the fingerprint value set of existing documents may be the fingerprint values of all documents already stored at a designated location. The designated location may be local to the server or in a distributed storage system, which is not limited in this embodiment. To further improve data reading efficiency, the fingerprint values of the existing documents may be stored in Redis, a distributed in-memory store.

It is understood that, in information coding, the number of bit positions at which two valid code words take different values is called the code distance, also known as the Hamming distance, of the two code words. For example, 10101 and 00110 differ in the first, fourth and fifth bits, so their Hamming distance is 3.
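
The worked example above can be checked with a short sketch. Representing fingerprints as Python integers and the function name `hamming_distance` are assumptions made for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    # XOR leaves a 1 exactly where the two values disagree.
    return bin(a ^ b).count("1")


print(hamming_distance(0b10101, 0b00110))  # prints 3
```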

It should be understood that the Hamming distance comparison described above is specifically as follows. Let the fingerprint value of the document content to be stored be newhash, and let the fingerprint value of an existing document be redishash, where:

newhash = x_1 x_2 … x_i, redishash = y_1 y_2 … y_i;

the Hamming distance is the number of positions at which the two fingerprint values differ, which can be written as:

$$d(\text{newhash}, \text{redishash}) = \sum_{k=1}^{i} \left( x_k \oplus y_k \right)$$

where x_k and y_k are the k-th bits of the two fingerprint values and ⊕ denotes exclusive OR.

In a specific implementation, the fingerprint value corresponding to each document content is compared with each fingerprint value in the fingerprint value set of existing documents to obtain the Hamming distance between each of the plurality of document contents and the fingerprint value of each document in the existing document set.

Step S400: and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not repeated, and storing the corresponding metadata into a storage system.

It should be noted that the preset condition may be set according to actual requirements. For example, the preset condition may be: when the Hamming distance between the fingerprint value of a document among the plurality of document contents and the fingerprint value of every document in the existing document set is not less than 7, the document content corresponding to that fingerprint value is judged not to be duplicated.

It can be understood that, in view of the defects described in the background, the embodiment of the invention provides a data storage method based on multi-source heterogeneity. The method comprises the following steps: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of characteristic values corresponding to each document content; calculating on each characteristic value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and, when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata in a storage system.
The method extracts characteristic values from document contents with different formats in a plurality of data sources, so that these contents can be processed in a unified feature-processing manner. The fingerprint value of each document is obtained by using the Simhash text similarity algorithm and the TF-IDF weighting algorithm, so that the fingerprint value derived from a document's characteristic values better matches expectations, which reduces the complexity and improves the accuracy of fingerprint value calculation. The fingerprint value of each document is then compared with the fingerprint values of existing documents to identify non-duplicated document contents, which are then stored. Document contents with different formats can thus be deduplicated and stored rapidly in batches, which greatly reduces the complexity of document deduplication and the computing power required of the server, and improves the storage efficiency of documents with different formats.

In a possible embodiment, the step of obtaining a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database, and obtaining a plurality of feature values corresponding to each document content includes:

Step S101: acquiring a plurality of metadata in a PgSQL database;

Step S102: obtaining a plurality of document contents with different formats in a plurality of data sources according to the plurality of metadata;

Step S103: extracting the key fields in each document content as the corresponding plurality of characteristic values.
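
Step S103 can be illustrated with a minimal sketch of key-field extraction from differently formatted records. Everything specific here is an assumption made for illustration: the source names `source_a` and `source_b`, their field names, the `FIELD_MAPS` table, and the normalization rules (lower-casing, whitespace collapsing, splitting authors on `;`).

```python
# Hypothetical mapping from each source's field names to the unified key fields.
FIELD_MAPS = {
    "source_a": {"title": "title", "author": "creator", "organization": "affiliation"},
    "source_b": {"title": "ti", "author": "au", "organization": "org"},
}


def normalize(value) -> str:
    """Lower-case, collapse whitespace, and join multi-valued fields with '; '."""
    parts = value.split(";") if isinstance(value, str) else list(value)
    cleaned = [" ".join(str(part).lower().split()) for part in parts]
    return "; ".join(part for part in cleaned if part)


def extract_features(source: str, record: dict) -> dict:
    """Return the unified key fields (title, author, organization) of one record."""
    mapping = FIELD_MAPS[source]
    return {key: normalize(record.get(raw, "")) for key, raw in mapping.items()}


record_a = {"title": "Deep  Learning", "creator": ["Li Ming", "Wang Fang"], "affiliation": "HUST"}
record_b = {"ti": "deep learning", "au": "Li Ming; Wang Fang", "org": "hust"}
print(extract_features("source_a", record_a) == extract_features("source_b", record_b))  # prints True
```

Once both records are reduced to the same unified fields, the later fingerprinting steps no longer need to know which source or format a document came from.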

In the embodiment of the invention, the key fields in document contents with different formats in a plurality of data sources are extracted, so that multi-source heterogeneous document contents can undergo uniform feature matching, which greatly reduces the complexity of data processing and improves its accuracy.

In a possible embodiment, the step of calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF term frequency-inverse document frequency weighting algorithm to obtain a fingerprint value corresponding to each document content includes:

Step S201: according to the term frequency and the inverse document frequency corresponding to each characteristic value, obtaining a weight value corresponding to each characteristic value by using the TF-IDF (term frequency-inverse document frequency) weighting algorithm;

Step S202: calculating on each characteristic value of each document content according to the Simhash text similarity algorithm and the weight value to obtain a fingerprint value corresponding to each document content.

In the embodiment of the invention, the weight value of each characteristic value is optimized by using the TF-IDF (term frequency-inverse document frequency) weighting algorithm, and the fingerprint value is calculated from each characteristic value of each document content according to the Simhash text similarity algorithm and the optimized weight values, so that the calculated fingerprint value reflects each characteristic value more accurately, and the accuracy of subsequent duplicate data identification is greatly improved.
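
As an illustration of the weighting in step S201, the following is a minimal sketch rather than the embodiment's actual implementation. It assumes that each document is represented as a list of terms (whole characteristic values or tokens derived from them), that `corpus` is a hypothetical list of term sets for known documents, and that a smoothed inverse document frequency is used; the name `tf_idf_weights` is also an assumption.

```python
import math
from collections import Counter


def tf_idf_weights(doc_terms, corpus):
    """Return a TF-IDF weight for every distinct term of one document.

    doc_terms: list of terms of the document being weighted.
    corpus: list of term collections, one per known document (hypothetical input).
    """
    counts = Counter(doc_terms)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc_terms)
        df = sum(1 for other in corpus if term in other)
        # Smoothed IDF (an assumption) keeps every weight positive and finite.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = tf * idf
    return weights


corpus = [{"data", "storage"}, {"data", "fingerprint"}, {"data", "hash"}]
weights = tf_idf_weights(["data", "fingerprint"], corpus)
print(weights["fingerprint"] > weights["data"])  # prints True
```

A term that appears in every known document ("data") receives a lower weight than an equally frequent term that appears in only one, which is the behaviour the weighted Simhash calculation relies on.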

In a possible embodiment, the step of obtaining the weight value corresponding to each feature value by using the TF-IDF (term frequency-inverse document frequency) weighting algorithm according to the term frequency and the inverse document frequency corresponding to each feature value further includes:

Step S2011: acquiring the weight values corresponding to the plurality of characteristic values of each document content;

Step S2012: performing linear fitting on the weight values corresponding to the plurality of characteristic values according to Zipf's law to obtain a fitting function;

Step S2013: optimizing the weight values corresponding to the plurality of characteristic values according to the fitting function to obtain the optimized weight values corresponding to the plurality of characteristic values.

In the embodiment of the invention, the weighted values of the characteristic values of each document content are further subjected to linear fitting, so that the weighted value corresponding to each characteristic value is further optimized, each characteristic can be accurately reflected by the weighted value of each characteristic value, and the accuracy of subsequent repeated data identification is further improved.
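
The text does not spell out how the fitting function is built or applied, so the following is only one possible reading, offered as a sketch. It assumes that the weights are ranked in descending order, that a straight line is fitted to log(weight) against log(rank) by least squares (the usual linear form of Zipf's law), and that each weight is replaced by the fitted value at its rank. The name `zipf_smooth` is hypothetical, and all weights are assumed to be positive.

```python
import math


def zipf_smooth(weights: dict) -> dict:
    """Replace each weight by the value of a log-log linear fit at its rank."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    n = len(ranked)
    if n < 2:
        return dict(weights)  # nothing to fit
    xs = [math.log(rank) for rank in range(1, n + 1)]
    ys = [math.log(weight) for _, weight in ranked]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # least-squares slope of log(weight) versus log(rank)
    intercept = mean_y - slope * mean_x
    return {
        term: math.exp(intercept + slope * x)
        for (term, _), x in zip(ranked, xs)
    }


noisy = {"title": 0.9, "author": 0.55, "organization": 0.3}
print(zipf_smooth(noisy))
```

Under this reading the fit smooths out irregular weights while preserving their rank order; weights that already follow a 1/rank law are returned essentially unchanged.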

In a possible embodiment, the step of comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result includes:

Step S301: comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set by using a sliding window algorithm to obtain a comparison fingerprint value set consisting of the fingerprint values in the fingerprint value set that meet a difference threshold.

It should be noted that the sliding window algorithm may obtain, according to the document type, the platform on which the document first appeared and the specific time of that first appearance. The document types include monographs, periodicals, books, newspapers, conference documents, scientific and technical reports, standard documents, patent documents, dissertations and government publications. A given document platform usually does not contain all documents; for example, patent documents usually take the national intellectual property office as their main publishing platform. The documents belonging to each document platform are therefore kept in separate document partitions in the distributed memory, which makes them easy for the sliding window algorithm to identify and read.

It can be understood that the sliding window algorithm is specifically as follows. According to the document type, the platform and time of the document's first appearance are obtained, and either the time alone or the combination of platform and time is taken as a node. Specifically, the distributed memory is traversed for the platform and time of the document's first appearance. In some cases the platform of first appearance cannot be found, or the document first appeared on several platforms at the same time. In these cases the time is taken as the node, the window is extended backward by a preset number of years, and the fingerprint values within that period are selected for matching. The preset number of years is 5 in this embodiment; for example, if the document first appeared in 2010, then 2010 is taken as the time node and the fingerprint values of documents between 2005 and 2010 in the distributed memory are traversed. If the platform and time of the document's first appearance can both be found when traversing the distributed memory, the platform and time are combined into a node, the window is extended backward by a certain period from the time of first appearance, and the fingerprint values of document metadata of the same type on other platforms are read for matching. For example, if the document first appeared on the EI platform in 2010, the fingerprint values of all documents between 2005 and 2010 in the distributed memory, except those of the EI platform, are traversed.
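
The candidate selection described above can be sketched as a simple filter. This is an illustration under assumptions, not the embodiment's implementation: stored fingerprints are modelled as an in-memory list rather than partitions in a distributed memory, and the record layout `StoredFingerprint`, the function name `window_candidates` and the platform name `platform_b` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class StoredFingerprint:
    fingerprint: int
    doc_type: str
    platform: str
    year: int


def window_candidates(
    stored: Iterable[StoredFingerprint],
    doc_type: str,
    first_year: int,
    first_platform: Optional[str] = None,
    window_years: int = 5,
) -> List[StoredFingerprint]:
    """Select the stored fingerprints that fall inside the sliding window."""
    selected = []
    for item in stored:
        if item.doc_type != doc_type:
            continue  # only documents of the same type are compared
        if not (first_year - window_years <= item.year <= first_year):
            continue  # outside the time window
        if first_platform is not None and item.platform == first_platform:
            continue  # platform known: read only the other platforms
        selected.append(item)
    return selected


stored = [
    StoredFingerprint(1, "journal", "EI", 2008),
    StoredFingerprint(2, "journal", "platform_b", 2008),
    StoredFingerprint(3, "journal", "platform_b", 2003),
    StoredFingerprint(4, "patent", "platform_b", 2009),
]
print([item.fingerprint for item in window_candidates(stored, "journal", 2010, "EI")])  # prints [2]
```

When the platform of first appearance is unknown, passing `first_platform=None` falls back to the time-only node, so the EI record from 2008 is included as well.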

It should be understood that the difference threshold may be a threshold obtained by experiment or may be a threshold range; a fingerprint value that meets the difference threshold is one whose Hamming distance is smaller than the threshold or falls within the threshold range.

According to the embodiment of the invention, the fingerprint values of the documents are matched by using a sliding window algorithm, and the matching efficiency of the fingerprint values is greatly improved by combining the time dimension and the space dimension, and meanwhile, the calculation pressure of a server is also reduced.

In a possible embodiment, the step of determining that the document content corresponding to the fingerprint value is not repeated when the comparison result satisfies a preset condition includes:

Step S302: when the number of elements in the comparison fingerprint value set is equal to 0, judging that the fingerprint values of the document contents corresponding to the comparison fingerprint value set are not repeated.

It should be noted that the number of elements in the comparison fingerprint value set may be greater than 1, equal to 1, or equal to 0. When the number of elements is greater than 1, whether the document content duplicates an existing document can be determined through manual intervention; when the number of elements is equal to 1, the document content can be judged to duplicate the matched document; when the number of elements is equal to 0, the document content can be judged not to be duplicated.

In a specific implementation, when the fingerprint value of a document content is determined not to be duplicated, the metadata corresponding to the document content and the document are stored in a distributed storage system, in which each node is formed of memory and disk to provide two tiers of storage. For metadata operations, that is, the create, read, update and delete (CRUD) operations, the operation type and operation time are also stored in the distributed storage system at each update. After a piece of metadata has been updated and stored, it is cleaned out of the PgSQL database, which releases storage space and reduces the computing pressure on the server.
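
The three outcomes can be sketched as follows. This is a minimal illustration that assumes a match is a stored fingerprint whose Hamming distance to the new fingerprint is below the difference threshold; the default threshold of 7 is taken from the example value mentioned in this embodiment, and the names `matching_fingerprints` and `decide` and the returned labels are hypothetical.

```python
from typing import Iterable, List


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def matching_fingerprints(new_fp: int, existing: Iterable[int], threshold: int = 7) -> List[int]:
    """Build the comparison set: stored fingerprints closer than the threshold."""
    return [fp for fp in existing if hamming_distance(new_fp, fp) < threshold]


def decide(matches: List[int]) -> str:
    """Map the size of the comparison set to an action."""
    if len(matches) == 0:
        return "store"          # not duplicated: store the metadata
    if len(matches) == 1:
        return "duplicate"      # duplicates the single matched document
    return "manual_review"      # several matches: leave to manual intervention


existing = [0x0000, 0xFFFF]
print(decide(matching_fingerprints(0x0001, existing)))  # prints duplicate
print(decide(matching_fingerprints(0x00FF, existing)))  # prints store
```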

In a possible application scenario, in order to further explain the performance and effect enhancement of the embodiment of the present invention, the embodiment further provides a test environment and a test effect analysis, and in the application scenario, the operating environment of a system corresponding to the method of the present invention is as follows:

1. Physical environment

CPU: i7, 16 cores; memory: 64 GB. The IP addresses of the three test machines are 192.168.21.106, 192.168.21.107 and 192.168.21.108.

2. Network environment

ClickHouse (Click Stream Data WareHouse, a columnar storage database) is deployed at 192.168.21.106, and Redis is deployed at 192.168.21.107.

3. Original data set storage mode and address

The original data set is stored as Avro files in the directory 192.168.21.108/data/base_data, and the processed data set is stored in the ztdb_base database of ClickHouse.

4. Result data set storage mode and address

The result data set is stored in the db1 table of Redis. The duplicate data sets found by the algorithm are backed up in the Dulp_Data table of the Data_2020 database of ClickHouse. The Hamming distance calculation results are stored in the HMD_data table of the data_2020 database of ClickHouse.

The application scenario also uses the following numerical indicators, defined as follows:

TP (True Positive): the judgment is correct, and the record is a duplicate.

TN (True Negative): the judgment is correct, and the record is not a duplicate.

FP (False Positive): the judgment is wrong; the record corresponding to the target Simhash value is not a duplicate but is judged to be one. Such a record may in fact be non-duplicated, or may duplicate a different record.

FN (False Negative): the judgment is wrong; the record corresponding to the target Simhash value is a duplicate but is judged not to be one.

The application scenario also uses three test metrics: accuracy (de-duplication rate), precision and recall, wherein:

Accuracy: the ratio of the number of correctly classified samples to the total number of samples. In this experiment, it is the ratio of the number of documents whose duplicate status is predicted correctly to the total number of documents, and it is also called the de-duplication rate.

Precision: the ratio of the number of samples correctly classified as positive to the total number of samples classified as positive. In this experiment, it is the ratio of the number of correctly predicted duplicate documents to the number of documents predicted to be duplicates.

Recall: the ratio of the number of samples correctly classified as positive to the number of known positive samples. In this experiment, it is the ratio of the number of correctly predicted duplicate documents to the number of known duplicate documents.

The formulas for Accuracy (Accuracy), precision (Precision), recall (Recall) of the results are defined as follows:
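The formula images are not reproduced in this text. Assuming the standard confusion-matrix definitions in terms of TP, TN, FP and FN as defined above, the three metrics can be written as:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN}
\end{aligned}
```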

In the test of this application scenario, the data set is described as follows: the data set year is 2020. The test results are as follows: the total number of documents evaluated by the algorithm of the embodiment of the invention is 225277. Among the documents predicted to be duplicates, those with a Hamming distance less than or equal to 2 are selected, 142950 in total; a preset rule check confirms that 142926 of them are predicted correctly, that is, TP is 142926 and FP is accordingly 24. Among the documents predicted not to be duplicates, those with a Hamming distance greater than 2 are selected, 82327 in total; the rule check finds that 7924 of them are duplicates, that is, FN is 7924 and TN is accordingly 74403.

The experimental results calculated from these figures are: accuracy (de-duplication rate) 96.47%, precision 99.98%, and recall 94.75%.
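As a worked check, the percentages for the 2020 data set can be recomputed from the four reported counts. The sketch below assumes the standard confusion-matrix definitions of accuracy, precision and recall; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total      # share of all documents judged correctly
    precision = tp / (tp + fp)        # share of predicted duplicates that are real
    recall = tp / (tp + fn)           # share of real duplicates that were found
    return accuracy, precision, recall


# Counts reported for the 2020 data set.
acc, prec, rec = confusion_metrics(tp=142926, fp=24, fn=7924, tn=74403)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
# prints: accuracy=96.47% precision=99.98% recall=94.75%
```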

In the test of this application scenario, Chinese documents can also be tested to obtain the de-duplication effect on Chinese document data; the total amount of Chinese document data in this experiment is 2347285.

In the above Chinese document test, the results are as follows: the total number of documents evaluated by the algorithm of the embodiment of the invention is 2347285. Among the documents predicted to be duplicates, the number with a Hamming distance less than or equal to 2 is 297898, of which a preset rule check confirms 295716 as predicted correctly, that is, TP is 295716 and FP is accordingly 2182. Among the documents predicted not to be duplicates, 2049387 with a Hamming distance greater than 2 are selected; the preset rule check detects 46037 duplicates among them, that is, FN is 46307, and TN is accordingly 2003350.

In the Chinese document test of this application scenario, the experimental results calculated from these figures are: accuracy (de-duplication rate) 97.95%, precision 99.27%, and recall 86.46%.

The preset rule check may be performed manually.

To further illustrate the improvement in performance and effect shown by the test results of this embodiment, the embodiment of the invention also compares the improved Simhash algorithm with the original Simhash algorithm in five subject areas: Internet, education, AI, medical treatment and housing. The comparison covers accuracy, recall and execution time; see fig. 2, fig. 3 and fig. 4. As figs. 2, 3 and 4 show, compared with the existing Simhash algorithm, the multi-source heterogeneous data storage method provided by the present application significantly improves accuracy and recall and significantly reduces execution time. The method provided by the embodiment of the present application can therefore rapidly perform batch deduplication and storage of multi-source heterogeneous document contents, greatly reducing the complexity of document deduplication and the computing power required of the server, and improving the storage efficiency of multi-source heterogeneous documents.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a data storage system based on multi-source heterogeneity according to an embodiment of the present invention. As shown in fig. 5, the data storage system based on multi-source heterogeneity includes a feature obtaining module 100, a fingerprint calculating module 200, a fingerprint comparing module 300, and a data storage module 400, where:

the feature obtaining module 100 is configured to acquire, according to a plurality of metadata in the PgSQL database, a plurality of document contents with different formats in a plurality of data sources, and to acquire a plurality of feature values corresponding to each document content; the fingerprint calculating module 200 is configured to calculate each feature value according to the Simhash text similarity algorithm and the TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; the fingerprint comparing module 300 is configured to obtain a fingerprint value set of existing documents, and to perform Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and the data storage module 400 is configured to determine, when the comparison result meets a preset condition, that the document content corresponding to the fingerprint value is not duplicated, and to store the corresponding metadata in a storage system.
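The fingerprint calculation and comparison performed by modules 200 and 300 can be illustrated with a minimal sketch. It is not the implementation of the embodiment: it assumes the feature weights have already been computed (for example by TF-IDF), uses MD5 truncated to 64 bits as the per-feature hash purely as an illustrative choice, and takes the Hamming distance threshold of 2 from the test section above. The names `simhash`, `hamming` and `find_matches` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List

BITS = 64  # fingerprint width; an assumption for this sketch


def _hash64(token: str) -> int:
    """Hash a feature to a 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weighted_features: Dict[str, float]) -> int:
    """Weighted Simhash: add the weight where a hash bit is 1, subtract where 0."""
    acc = [0.0] * BITS
    for token, weight in weighted_features.items():
        h = _hash64(token)
        for i in range(BITS):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:            # positive column sum becomes a 1 bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_matches(fp: int, existing: Iterable[int], threshold: int = 2) -> List[int]:
    """Existing fingerprints within the Hamming distance threshold of fp."""
    return [e for e in existing if hamming(fp, e) <= threshold]


# Usage: identical weighted features always yield identical fingerprints.
fp = simhash({"title": 2.0, "author": 1.0})
print(hamming(fp, simhash({"author": 1.0, "title": 2.0})))  # prints: 0
```

The linear scan in `find_matches` is the simplest possible comparison; it is used here only to make the threshold test explicit.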

It can be understood that the data storage system based on the multi-source heterogeneity provided by the present invention corresponds to the data storage method based on the multi-source heterogeneity provided by the foregoing embodiments, and the related technical features of the data storage system based on the multi-source heterogeneity may refer to the related technical features of the data storage method based on the multi-source heterogeneity, and are not described herein again.

Referring to fig. 6, fig. 6 is a schematic diagram of an embodiment of an apparatus according to an embodiment of the present invention. As shown in fig. 6, an embodiment of the present invention provides an apparatus, which includes a memory 1310, a processor 1320, and a computer program 1311 stored in the memory 1310 and executable on the processor 1320, where the processor 1320 executes the computer program 1311 to implement the following steps:

acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of feature values corresponding to each document content; calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata into a storage system.

Referring to fig. 7, fig. 7 is a schematic diagram of an embodiment of a computer-readable storage medium according to the present invention. As shown in fig. 7, the present embodiment provides a computer-readable storage medium 1400, on which a computer program 1411 is stored, the computer program 1411 when executed by a processor implements the steps of:

acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of feature values corresponding to each document content; calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and comparing the fingerprint value corresponding to each document content with each fingerprint value in the fingerprint value set to obtain a comparison result; and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata into a storage system.

The invention provides a data storage method, system, equipment and storage medium based on multi-source heterogeneity, wherein the method comprises the following steps: acquiring a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in a PgSQL database, and acquiring a plurality of feature values corresponding to each document content; calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content; acquiring a fingerprint value set of existing documents, and performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result; and when the comparison result meets a preset condition, judging that the document content corresponding to the fingerprint value is not duplicated, and storing the corresponding metadata into a storage system.
The method extracts feature values from document contents with different formats in a plurality of data sources, so that these document contents can be processed in a unified feature processing manner. A fingerprint value corresponding to each document is obtained using the Simhash text similarity algorithm and the TF-IDF (term frequency-inverse document frequency) weighting algorithm, so that the fingerprint value derived from the plurality of feature values of each document better matches expectations, which reduces the complexity and improves the accuracy of fingerprint value calculation. The fingerprint value of each document is then compared with the fingerprint values of existing documents to obtain the non-duplicated document contents, which are then stored. In this way, batch deduplication and storage of document contents with different formats can be performed rapidly, greatly reducing the complexity of document deduplication and the computing power required of the server, and improving the storage efficiency of documents with different formats.

It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A data storage method based on multi-source isomerism is characterized by comprising the following steps:

2. The method for storing data based on multi-source heterogeneity according to claim 1, wherein the step of obtaining a plurality of document contents with different formats in a plurality of data sources according to a plurality of metadata in the PgSQL database, and obtaining a plurality of feature values corresponding to each document content includes:

acquiring a plurality of metadata in a PgSQL database;

3. The multi-source heterogeneous-based data storage method of claim 2, wherein the key fields comprise at least: title, author, signing authority.

4. The data storage method based on multi-source isomerism according to claim 1, wherein the step of calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content comprises:

5. The multi-source heterogeneous data storage method according to claim 4, wherein the step of obtaining the weight value corresponding to each feature value by using a TF-IDF (term frequency-inverse document frequency) weighting algorithm according to the term frequency and inverse document frequency corresponding to each feature value further comprises:

performing linear fitting on the weight values corresponding to the plurality of feature values according to Zipf's law to obtain a fitting function;

and optimizing the weight values corresponding to the plurality of feature values according to the fitting function to obtain the optimized weight values corresponding to the plurality of feature values.

6. The multi-source heterogeneous data storage method according to claim 1, wherein the step of performing hamming distance comparison on the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set to obtain a comparison result comprises:

performing Hamming distance comparison between the fingerprint value corresponding to each document content and each fingerprint value in the fingerprint value set by using a sliding window algorithm, to obtain a comparison fingerprint value set consisting of the fingerprint values in the fingerprint value set that meet a difference threshold.

7. The multi-source heterogeneous data storage method according to claim 6, wherein the step of determining that the document content corresponding to the fingerprint value is not repeated when the comparison result meets a preset condition includes:

8. A data storage system based on multi-source isomerism, characterized by comprising:

the fingerprint calculation module is used for calculating each feature value according to a Simhash text similarity algorithm and a TF-IDF (term frequency-inverse document frequency) weighting algorithm to obtain a fingerprint value corresponding to each document content;

9. An apparatus, comprising a memory, and a processor configured to implement the steps of the multi-source heterogeneous based data storage method according to any one of claims 1-7 when executing a computer management class program stored in the memory.

10. A computer-readable storage medium, having stored thereon a computer management class program, which when executed by a processor, performs the steps of the multi-source heterogeneous based data storage method according to any one of claims 1 to 7.
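The Zipf's-law fitting recited in claim 5 is not detailed in this text. One possible reading, offered only as an assumption, is a least-squares line fitted in log-log space to the weights sorted by rank, with the fitted line then used to produce smoothed weights. The sketch below follows that reading; the names `zipf_fit` and `fitted_weight` are hypothetical, and all weights are assumed to be positive.

```python
import math


def zipf_fit(weights):
    """Fit log(w) = a + b * log(rank) by least squares over weights sorted descending.

    Requires at least two positive weights. Returns the pair (a, b).
    """
    ranked = sorted(weights, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(w) for w in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fitted_weight(rank, a, b):
    """Weight predicted by the fitted function for a given 1-based rank."""
    return math.exp(a + b * math.log(rank))


# Usage: weights that follow w = 1 / rank exactly give a slope of about -1.
a, b = zipf_fit([1.0, 1 / 2, 1 / 3, 1 / 4, 1 / 5])
```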