CN110321347A

CN110321347A - Data matching method and device, storage medium, terminal

Info

Publication number: CN110321347A
Application number: CN201910464307.XA
Authority: CN
Inventors: 汤奇峰; 李青山
Original assignee: Shanghai Data Trading Center Ltd
Current assignee: Shanghai Data Trading Center Ltd
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2019-10-11

Abstract

A data matching method and device, a storage medium, and a terminal. The data matching method includes: performing a hash operation on source data provided by a data supplier to obtain multiple source hash buckets, where each source hash bucket has a serial number and contains multiple source data items; performing a hash operation on query data of a data demander to obtain multiple query hash buckets, where each query hash bucket has a serial number and contains multiple query data items; and matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain a matching result, where the serial number of each query hash bucket corresponds to the serial number of its source hash bucket. The technical solution of the present invention can improve the efficiency of data matching.

Description

Data matching method and device, storage medium, terminal

Technical field

The present invention relates to the technical field of data processing, and in particular to a data matching method and device, a storage medium, and a terminal.

Background technique

In data trading and data circulation, a data trading platform usually needs to match the data required by a data demander against the data provided by a data supplier. Two typical scenarios require data matching. In the first, the data demander sends the data identifiers (Identity, ID) it wishes to query to the data supplier's front-end processor, which receives the ID file and matches it against its own stored IDs. In the second, the data demander sends a query request to the data supplier, the data supplier returns the specified number of data records according to the demander's request, and the data demander's front-end processor matches the returned data against its own stored IDs.

In both scenarios, data matching takes place on the front-end processor of the data supplier or of the data demander, and the front-end processor is usually a single-machine environment.

However, data matching in a single-machine environment requires matching against the full data set, so matching efficiency is low. In addition, existing data matching relies on relational databases, whose storage ceiling is on the order of hundreds of millions of table records; when the data volume reaches or approaches that ceiling, the read and write speed of a relational database drops significantly. Apart from the open-source MySQL, commercial databases such as Oracle also have drawbacks such as high cost and difficulty of use. As a result, data matching performed with a relational database performs poorly.

Summary of the invention

The technical problem solved by the present invention is how to improve the efficiency of data matching.

To solve the above technical problem, an embodiment of the present invention provides a data matching method, including: performing a hash operation on source data provided by a data supplier to obtain multiple source hash buckets, where each source hash bucket has a serial number and contains multiple source data items; performing a hash operation on query data of a data demander to obtain multiple query hash buckets, where each query hash bucket has a serial number and contains multiple query data items; and matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain a matching result, where the serial number of each query hash bucket corresponds to the serial number of its source hash bucket.

Optionally, matching the query data in each query hash bucket against the source data in the corresponding source hash bucket includes: matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain multiple bucket matching results; and merging the multiple bucket matching results to obtain the matching result.

Optionally, matching the query data in each query hash bucket against the source data in the corresponding source hash bucket includes: using multiple processes to match the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket.

Optionally, before the hash operation is performed on the source data provided by the data supplier, the method further includes: receiving the source data from the data supplier. Performing the hash operation on the query data of the data demander includes: obtaining the query data from a server of the data demander and performing the hash operation on the query data.

Optionally, before the hash operation is performed on the source data provided by the data supplier, the method further includes: receiving the query data from the data demander. Performing the hash operation on the source data provided by the data supplier includes: obtaining the source data from a server of the data supplier and performing the hash operation on the source data.

Optionally, the number of source hash buckets is the same as the number of query hash buckets.

Optionally, the data matching method further includes: performing statistical analysis on the matching result to obtain the matching distribution between the query data in each query hash bucket and the source data in the corresponding source hash bucket.

To solve the above technical problem, an embodiment of the present invention also discloses a data matching device, including: a first hash module configured to perform a hash operation on source data provided by a data supplier to obtain multiple source hash buckets, where each source hash bucket has a serial number and contains multiple source data items; a second hash module configured to perform a hash operation on query data of a data demander to obtain multiple query hash buckets, where each query hash bucket has a serial number and contains multiple query data items; and a data matching module configured to match the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain a matching result, where the serial number of each query hash bucket corresponds to the serial number of its source hash bucket.

An embodiment of the present invention also discloses a storage medium storing computer instructions that, when run, execute the steps of the data matching method.

An embodiment of the present invention also discloses a terminal including a memory and a processor, where the memory stores computer instructions that can be run on the processor, and the processor executes the steps of the data matching method when running the computer instructions.

Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:

By performing a hash operation on the source data provided by the data supplier and on the query data of the data demander, the technical solution of the present invention divides the source data and the query data into source hash buckets and query hash buckets, respectively, according to their data features. Because source data and query data that share the same data feature are assigned to hash buckets with corresponding serial numbers, the query data in a query hash bucket only needs to be matched against the source data in the corresponding source hash bucket. This avoids full matching of the data and improves the efficiency of data matching.

Further, multiple processes are used to match the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket. In the technical solution of the present invention, each process can be responsible for matching the data in one pair of query hash bucket and source hash bucket, and the processes can run in parallel. The matching of multiple bucket pairs handled by multiple processes can therefore proceed at the same time, which further improves the efficiency of data matching.

Brief description of the drawings

Fig. 1 is a flowchart of a data matching method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another data matching method according to an embodiment of the present invention;

Fig. 3 is a flowchart of yet another data matching method according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a data matching device according to an embodiment of the present invention.

Detailed description of the embodiments

As described in the background, data matching in a single-machine environment requires matching against the full data set, so matching efficiency is low. In addition, existing data matching relies on relational databases, whose storage ceiling is on the order of hundreds of millions of table records; when the data volume reaches or approaches that ceiling, the read and write speed of a relational database drops significantly. Apart from the open-source MySQL, commercial databases such as Oracle also have drawbacks such as high cost and difficulty of use. As a result, data matching performed with a relational database performs poorly.

To make the above objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a data matching method according to an embodiment of the present invention.

The data matching method of this embodiment may be executed by the front-end processor of the data supplier, by the front-end processor of the data demander, or by the data trading center.

In this embodiment, the data supplier is the party that provides data in the trading system; the data demander is the party that requests data in the trading system; and a front-end processor is the server through which the data supplier or the data demander connects with the data trading center.

The data matching method shown in Fig. 1 may include the following steps:

Step S101: performing a hash operation on source data provided by a data supplier to obtain multiple source hash buckets, where each source hash bucket has a serial number and contains multiple source data items;

Step S102: performing a hash operation on query data of a data demander to obtain multiple query hash buckets, where each query hash bucket has a serial number and contains multiple query data items;

Step S103: matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain a matching result, where the serial number of each query hash bucket corresponds to the serial number of its source hash bucket.

It should be noted that the step numbers in this embodiment do not limit the order in which the steps are executed.

In a specific implementation of step S101, a hash operation may be performed on the source data provided by the data supplier. Specifically, the source data may include a source data ID, label data, and so on. The source data ID is the identifier of the source data, and the label data is the data content of the source data. The hash operation may be performed on the source data ID.

The process of hashing the source data can be regarded as a process of sharding the source data; that is, each source data item is assigned to one of multiple source hash buckets. The number of source hash buckets may be preconfigured, for example 64, 12, and so on; the embodiments of the present invention do not limit this.

In a specific implementation, a hash algorithm may be applied to the source data ID to compute a fixed-length sequence number for that ID, and a modulo (remainder) operation on the sequence number then yields the serial number of the source hash bucket to which the source data item is assigned. For example, when there are 64 source hash buckets, the sequence number of the source data ID is taken modulo 64.
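The hash-then-modulo step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `bucket_index` is hypothetical, and BLAKE2b from the standard library is used only as a stand-in hash algorithm.

```python
import hashlib

NUM_BUCKETS = 64  # preconfigured bucket count, as in the 64-bucket example


def bucket_index(record_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an ID to a bucket serial number in [0, num_buckets)."""
    # Hash the ID to a fixed-length digest (8 bytes; BLAKE2b is a stand-in).
    digest = hashlib.blake2b(record_id.encode("utf-8"), digest_size=8).digest()
    # Read the digest as an unsigned integer: the fixed-length sequence number.
    sequence_number = int.from_bytes(digest, "big")
    # The remainder modulo the bucket count is the bucket serial number.
    return sequence_number % num_buckets
```

Because the function is deterministic, the same ID always lands in the same bucket, which is what later allows matching to be restricted to corresponding buckets.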

When source data is distributed among multiple source hash buckets, at least one source hash bucket may end up containing no data.

Unlike step S101, the object of the hash operation in step S102 is the query data of the data demander. Specifically, the query data may also include a query data ID and label data, and the hash operation may be performed on the query data ID.

In a specific implementation, a hash algorithm may be applied to the query data ID to compute a fixed-length sequence number for that ID, and a modulo operation on the sequence number then yields the serial number of the query hash bucket to which the query data item is assigned. For example, when there are 64 query hash buckets, the sequence number of the query data ID is taken modulo 64.

For the detailed process of hashing the query data to obtain multiple query hash buckets, refer to the specific implementation of step S101; the present invention places no limitation on this.

It should be noted that, to guarantee the accuracy of subsequent data matching, the total number of source hash buckets is the same as the total number of query hash buckets. In addition, the same hash algorithm is used for hashing the source data and for hashing the query data.

In a specific embodiment, the number of source hash buckets and query hash buckets is first determined, for example 16. The hash algorithm is then applied to the source data and to the query data, and a modulo operation is applied to each hash result, for example taking the hash result modulo 16. This yields the serial number of the source hash bucket to which each source data item is assigned and the serial number of the query hash bucket to which each query data item is assigned.
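A minimal sketch of this 16-bucket case, assuming both sides are partitioned by one shared function so that the bucket count and hash algorithm are guaranteed to be identical. The names `partition` and `bucket_index`, the sample IDs, and the BLAKE2b stand-in hash are all illustrative assumptions.

```python
import hashlib


def bucket_index(record_id, num_buckets):
    """Hash an ID and take the result modulo the bucket count."""
    digest = hashlib.blake2b(record_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


def partition(ids, num_buckets):
    """Return num_buckets lists; the list at position i is bucket number i."""
    buckets = [[] for _ in range(num_buckets)]
    for record_id in ids:
        buckets[bucket_index(record_id, num_buckets)].append(record_id)
    return buckets


NUM_BUCKETS = 16
# Hypothetical sample IDs; both sides use the same function and bucket count.
source_buckets = partition(["a1", "b2", "c3", "d4"], NUM_BUCKETS)
query_buckets = partition(["b2", "d4", "z9"], NUM_BUCKETS)
```

With only a handful of IDs spread over 16 buckets, most buckets stay empty, which is harmless: an empty bucket simply contributes no matches.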

In a specific implementation of step S103, the query data in each query hash bucket may be matched against the source data in the source hash bucket corresponding to that query hash bucket.

In this embodiment, source data and query data that share the same data feature are assigned to hash buckets with corresponding serial numbers; for example, the serial numbers are identical, or a preset correspondence exists between them. The query data in a query hash bucket therefore only needs to be matched against the source data in the corresponding source hash bucket, which avoids full matching of the data and improves the efficiency of data matching.

For example, for the query hash bucket and source hash bucket with serial number 000, the data in the two buckets are matched against each other; for the query hash bucket and source hash bucket with serial number 001, the data in the two buckets are matched against each other; and so on, up to the query hash bucket and source hash bucket with serial number 063.

It can be understood that data matching may be a process of comparing data: if a source data item is identical to a query data item, the source data item is determined to match the query data item.
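Bucket-by-bucket comparison can be sketched as below, assuming the simplest correspondence (identical serial numbers) and exact equality of IDs as the match criterion. The function name `match_buckets` and the hand-built buckets are hypothetical.

```python
def match_buckets(query_buckets, source_buckets):
    """Compare each query bucket only with the source bucket of the same number."""
    if len(query_buckets) != len(source_buckets):
        raise ValueError("both sides must use the same number of buckets")
    matches = []
    for query_ids, source_ids in zip(query_buckets, source_buckets):
        source_set = set(source_ids)  # exact-equality lookup within one bucket
        matches.extend(q for q in query_ids if q in source_set)
    return matches


# Hand-built buckets for illustration (not produced by a real hash).
query_buckets = [["a"], [], ["c", "x"]]
source_buckets = [["a", "k"], ["b"], ["c"]]
matched = match_buckets(query_buckets, source_buckets)
```

Each query item is looked up only inside one source bucket rather than in the whole source data set, which is where the saving over full matching comes from.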

In a specific application scenario of the present invention, when the stored IDs of the data supplier or data demander are sharded, the number of hash buckets may be adjusted dynamically according to the total data size of the stored IDs, to keep the data volume in each hash bucket from becoming too large. For stored IDs of roughly 50 GB to 200 GB, the capacity of each hash bucket may be chosen as 1 GB to 2 GB, giving about 100 hash buckets. Under this guideline, the data volume in each hash bucket is not so large that matching a single shard becomes too slow, nor so small that the number of hash buckets, and with it the number of matching processes, grows excessive and hurts matching efficiency.
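The sizing rule amounts to dividing the total data size by a target bucket capacity. A minimal sketch, assuming a hypothetical helper `choose_bucket_count` and a default target of 1.5 GB per bucket (the midpoint of the 1 GB to 2 GB range mentioned above; the default is an assumption, not a value from the source):

```python
GB = 1024 ** 3  # bytes per gigabyte (binary convention, an assumption)


def choose_bucket_count(total_bytes, target_bucket_bytes=3 * GB // 2):
    """Pick a bucket count so each bucket holds about target_bucket_bytes."""
    # Integer ceiling division; always return at least one bucket.
    return max(1, (total_bytes + target_bucket_bytes - 1) // target_bucket_bytes)


buckets_for_150gb = choose_bucket_count(150 * GB)          # 150 GB at 1.5 GB each
buckets_for_200gb = choose_bucket_count(200 * GB, 2 * GB)  # 200 GB at 2 GB each
```

Both calls yield 100 buckets, consistent with the figure of about 100 given for this size range.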

In a preferred embodiment of the present invention, the xxhash algorithm may be chosen for hashing the source data and the query data. The xxhash algorithm satisfies both the randomness requirement and the computation-speed requirement.
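A hedged sketch of plugging in xxhash from Python. It assumes the third-party `xxhash` package is installed and uses its 64-bit variant; the source does not say which xxhash variant or which language is used. If the package is missing, the sketch falls back to BLAKE2b so that it still runs. The names `hash64` and `bucket_index` are illustrative.

```python
import hashlib

try:
    import xxhash  # third-party package; assumed to be installed
except ImportError:
    xxhash = None


def hash64(record_id: str) -> int:
    """Return a 64-bit unsigned hash of the ID."""
    data = record_id.encode("utf-8")
    if xxhash is not None:
        return xxhash.xxh64(data).intdigest()
    # Fallback stand-in when xxhash is unavailable.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def bucket_index(record_id: str, num_buckets: int = 64) -> int:
    """Bucket serial number: 64-bit hash modulo the bucket count."""
    return hash64(record_id) % num_buckets
```

Whichever hash is used, both parties must use the same one; mixing xxhash on one side with the fallback on the other would send identical IDs to different buckets.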

In a specific embodiment of the present invention, step S103 shown in Fig. 1 may include the following steps: matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain multiple bucket matching results; and merging the multiple bucket matching results to obtain the matching result.

In this embodiment, because the data matching for each pair of query hash bucket and source hash bucket is carried out independently, multiple independent bucket matching results are obtained. The number of bucket matching results is the same as the number of query hash buckets (or source hash buckets).

Furthermore, the multiple bucket matching results may be stored in separate files.

To make the matching result easier to inspect, the multiple bucket matching results may be merged. One way to merge is to combine the files holding the bucket matching results into a single summary file in one pass. Alternatively, recursive merging may be used: the files holding two bucket matching results are merged into a larger file, and this is repeated until a single summary file is produced.
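Both merge strategies can be sketched as follows. This is an illustration under assumptions: bucket results are taken to be plain files that can be concatenated byte for byte, and the function names, file names, and sample contents are hypothetical.

```python
import os
import shutil
import tempfile


def merge_all(paths, out_path):
    """Concatenate every per-bucket result file into one summary file."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
    return out_path


def merge_pairwise(paths, workdir):
    """Merge files two at a time, recursively, until one summary file remains."""
    if not paths:
        raise ValueError("no bucket result files to merge")
    if len(paths) == 1:
        return paths[0]
    merged = []
    for i in range(0, len(paths), 2):
        pair = paths[i:i + 2]
        if len(pair) == 1:
            merged.append(pair[0])  # odd file out is carried to the next round
            continue
        fd, out_path = tempfile.mkstemp(dir=workdir, suffix=".merged")
        os.close(fd)
        merged.append(merge_all(pair, out_path))
    return merge_pairwise(merged, workdir)


def demo():
    """Write three tiny bucket result files and merge them both ways."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = []
        for serial, line in enumerate(["id-a\n", "id-b\n", "id-c\n"]):
            path = os.path.join(workdir, f"bucket_{serial:03d}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(line)
            paths.append(path)
        flat = merge_all(paths, os.path.join(workdir, "summary.txt"))
        nested = merge_pairwise(paths, workdir)
        with open(flat, encoding="utf-8") as f1, open(nested, encoding="utf-8") as f2:
            return f1.read(), f2.read()
```

Both strategies produce the same summary content here; the pairwise variant trades extra intermediate files for merges that could be run independently of one another.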

In a specific embodiment of the present invention, step S103 shown in Fig. 1 may include the following step: using multiple processes to match the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket.

In a specific implementation, matching the data in many pairs of query hash buckets and source hash buckets with a single process is often too slow to meet real-time requirements. Multiple processes may therefore be used, each matching the data in one pair of query hash bucket and source hash bucket on its own, with no data shared between processes.

In this embodiment, each process can be responsible for matching the data in one pair of query hash bucket and source hash bucket, and the processes can run in parallel. The matching of multiple bucket pairs handled by multiple processes can therefore proceed at the same time, which further improves the efficiency of data matching.
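A minimal multi-process sketch using Python's standard `multiprocessing.Pool`. The source does not name a process framework, so the pool, the function names, and the sample buckets are assumptions. Each worker receives one bucket pair and returns its own bucket matching result, sharing no state with other workers.

```python
from multiprocessing import Pool


def match_bucket_pair(pair):
    """Worker: match one (query bucket, source bucket) pair independently."""
    query_ids, source_ids = pair
    source_set = set(source_ids)
    return [q for q in query_ids if q in source_set]


def match_in_parallel(query_buckets, source_buckets, processes=4):
    """Match all bucket pairs with a pool of worker processes."""
    pairs = list(zip(query_buckets, source_buckets))
    with Pool(processes=processes) as pool:
        # map() preserves order, so result i belongs to bucket serial number i.
        return pool.map(match_bucket_pair, pairs)


if __name__ == "__main__":
    # Hand-built buckets for illustration only.
    query_buckets = [["a", "x"], ["b"], []]
    source_buckets = [["a"], ["b", "y"], ["c"]]
    print(match_in_parallel(query_buckets, source_buckets, processes=2))
```

The pool size caps the number of concurrent processes, so a large bucket count does not translate directly into an equally large process count.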

Referring to Fig. 2, unlike the embodiment shown in Fig. 1, the following step may be included before step S101: Step S201: receiving the source data from the data supplier;

Step S102 may include the following step: Step S202: obtaining the query data from a server of the data demander and performing a hash operation on the query data.

The data matching method of this embodiment of the present invention may be executed by the front-end processor of the data demander.

In a specific implementation, the front-end processor of the data demander may perform the hash operation on the query data in advance. The data supplier sends the source data to the front-end processor of the data demander. The front-end processor of the data demander receives the source data from the data supplier, performs the hash operation on the source data items one by one, and places each hashed source data item into its source hash bucket for data matching in step S103.
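The Fig. 2 flow can be sketched as a small class on the demander side. This is an illustration only: the class `DemanderFrontEnd`, its methods, the in-memory buckets, the BLAKE2b stand-in hash, and returning the label data of matched IDs are all assumptions rather than details from the source.

```python
import hashlib


def bucket_index(record_id, num_buckets):
    """Hash an ID and take the result modulo the bucket count."""
    digest = hashlib.blake2b(record_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


class DemanderFrontEnd:
    def __init__(self, query_ids, num_buckets=8):
        self.num_buckets = num_buckets
        # Query data is hashed in advance, before any source data arrives.
        self.query_buckets = [set() for _ in range(num_buckets)]
        for query_id in query_ids:
            self.query_buckets[bucket_index(query_id, num_buckets)].add(query_id)
        # Source buckets map source ID -> label data and fill as records arrive.
        self.source_buckets = [{} for _ in range(num_buckets)]

    def receive(self, source_id, label):
        """Hash one incoming source record and place it in its source bucket."""
        index = bucket_index(source_id, self.num_buckets)
        self.source_buckets[index][source_id] = label

    def match(self):
        """Match each query bucket against the source bucket of the same number."""
        matched = {}
        for query_ids, sources in zip(self.query_buckets, self.source_buckets):
            for query_id in query_ids:
                if query_id in sources:
                    matched[query_id] = sources[query_id]
        return matched
```

A usage example: construct the front end with the demander's query IDs, call `receive` once per source record sent by the supplier, then call `match` to collect the label data for every query ID that was found.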

Specifically, the source data may include a source data ID and label data, and the data demander may perform the hash operation on the source data ID.

It should be noted that, besides the source data ID and label data, the source data may also include data of any other practicable type. Similarly, besides the query data ID and label data, the query data may also include data of any other practicable type. The embodiments of the present invention do not limit this.

Referring to Fig. 3, unlike the embodiment shown in Fig. 1, the following step may be included before step S101: Step S301: receiving the query data from the data demander;

Step S101 may include the following step: Step S302: obtaining the source data from a server of the data supplier and performing a hash operation on the source data.

The data matching method of this embodiment of the present invention may be executed by the front-end processor of the data supplier.

In a specific implementation, the front-end processor of the data supplier may perform the hash operation on the source data in advance. The data demander sends the query data to the front-end processor of the data supplier. The front-end processor of the data supplier receives the query data from the data demander, performs the hash operation on the query data items one by one, and places each hashed query data item into its query hash bucket for data matching in step S103.

In a specific embodiment of the present invention, the data matching method shown in Fig. 1 may further include the following step: performing statistical analysis on the matching result to obtain the matching distribution between the query data in each query hash bucket and the source data in the corresponding source hash bucket.

In a specific implementation, the statistical analysis of the matching result may count, for each pair of source hash bucket and query hash bucket, the amount of matched data, the time taken to complete the matching, and so on.

Furthermore, the result of the statistical analysis may also be presented visually.

By collecting statistics on the matching results and matching time for the source hash buckets and query hash buckets, the matching distribution across all data shards can be obtained, and analysis of the matching results gives an intuitive picture of the matching result of each pair of data shards. The allocation strategy for the hash buckets can then be optimized according to the visualized results.
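Per-bucket statistics can be sketched as below, assuming the quantities of interest are the match count and elapsed time per bucket pair. The function names, the dictionary layout, and the text bar chart (a stand-in for whatever visualization is actually used) are illustrative assumptions.

```python
import time


def match_with_stats(query_buckets, source_buckets):
    """Match bucket pairs and record match count and elapsed time for each."""
    stats = []
    pairs = zip(query_buckets, source_buckets)
    for serial, (query_ids, source_ids) in enumerate(pairs):
        started = time.perf_counter()
        source_set = set(source_ids)
        matched = sum(1 for q in query_ids if q in source_set)
        elapsed = time.perf_counter() - started
        stats.append({
            "bucket": serial,
            "queries": len(query_ids),
            "sources": len(source_ids),
            "matched": matched,
            "seconds": elapsed,
        })
    return stats


def render_bars(stats):
    """Very small text visualization: one bar of '#' per bucket."""
    return "\n".join(
        f"{row['bucket']:03d} {'#' * row['matched']} {row['matched']}"
        for row in stats
    )


# Hand-built buckets for illustration only.
stats = match_with_stats([["a", "b"], ["c"]], [["a", "b", "z"], ["d"]])
chart = render_bars(stats)
```

A strongly uneven chart would suggest that the bucket count or the hash choice is worth revisiting.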

Referring to Fig. 4, an embodiment of the present invention also discloses a data matching device. The data matching device 40 may include a first hash module 401, a second hash module 402, and a data matching module 403.

The first hash module 401 is configured to perform a hash operation on source data provided by a data supplier to obtain multiple source hash buckets, where each source hash bucket has a serial number and contains multiple source data items. The second hash module 402 is configured to perform a hash operation on query data of a data demander to obtain multiple query hash buckets, where each query hash bucket has a serial number and contains multiple query data items. The data matching module 403 is configured to match the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain a matching result, where the serial number of each query hash bucket corresponds to the serial number of its source hash bucket.

By performing a hash operation on the source data provided by the data supplier and on the query data of the data demander, this embodiment of the present invention divides the source data and the query data into source hash buckets and query hash buckets, respectively, according to their data features. Because source data and query data that share the same data feature are assigned to hash buckets with corresponding serial numbers, the query data in a query hash bucket only needs to be matched against the source data in the corresponding source hash bucket. This avoids full matching of the data and improves the efficiency of data matching.

For more details on the working principle and working mode of the data matching device 40, refer to the descriptions relating to Fig. 1 to Fig. 3, which are not repeated here.
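The three-module structure of the device can be mirrored in a short sketch. The class names, the `run` method, the sample IDs, and the BLAKE2b stand-in hash are hypothetical; only the division into a first hash module, a second hash module, and a data matching module follows the description above.

```python
import hashlib


class HashModule:
    """Splits IDs into numbered hash buckets (role of modules 401 and 402)."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets

    def bucketize(self, ids):
        buckets = [[] for _ in range(self.num_buckets)]
        for record_id in ids:
            digest = hashlib.blake2b(record_id.encode("utf-8"), digest_size=8).digest()
            buckets[int.from_bytes(digest, "big") % self.num_buckets].append(record_id)
        return buckets


class DataMatchModule:
    """Matches each query bucket with the same-numbered source bucket (403)."""

    def match(self, query_buckets, source_buckets):
        result = []
        for query_ids, source_ids in zip(query_buckets, source_buckets):
            source_set = set(source_ids)
            result.extend(q for q in query_ids if q in source_set)
        return result


class DataMatchingDevice:
    def __init__(self, num_buckets=16):
        # Both hash modules share one bucket count and one hash algorithm.
        self.first_hash_module = HashModule(num_buckets)   # source data
        self.second_hash_module = HashModule(num_buckets)  # query data
        self.data_match_module = DataMatchModule()

    def run(self, source_ids, query_ids):
        source_buckets = self.first_hash_module.bucketize(source_ids)
        query_buckets = self.second_hash_module.bucketize(query_ids)
        return self.data_match_module.match(query_buckets, source_buckets)


device = DataMatchingDevice()
result = device.run(["a", "b", "c"], ["b", "c", "d"])
```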

An embodiment of the present invention also discloses a storage medium storing computer instructions that, when run, can execute the steps of the method shown in Fig. 1, Fig. 2, or Fig. 3. The storage medium may include ROM, RAM, a magnetic disk, an optical disc, and so on. The storage medium may also include non-volatile memory or non-transitory memory.

An embodiment of the present invention also discloses a terminal. The terminal may include a memory and a processor, where the memory stores computer instructions that can be run on the processor. The processor can execute the steps of the method shown in Fig. 1, Fig. 2, or Fig. 3 when running the computer instructions. The terminal includes, but is not limited to, terminal devices such as mobile phones, computers, and tablet computers.

Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be as defined by the claims.

Claims

1. A data matching method, characterized by comprising:

performing a hash operation on source data provided by a data supplier to obtain multiple source hash buckets, wherein each source hash bucket has a serial number and contains multiple source data items;

performing a hash operation on query data of a data demander to obtain multiple query hash buckets, wherein each query hash bucket has a serial number and contains multiple query data items;

matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain a matching result, wherein the serial number of each query hash bucket corresponds to the serial number of its source hash bucket.

2. The data matching method according to claim 1, characterized in that matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket comprises:

matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain multiple bucket matching results;

merging the multiple bucket matching results to obtain the matching result.

3. The data matching method according to claim 1, characterized in that matching the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket comprises:

using multiple processes to match the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket.

4. The data matching method according to claim 1, characterized in that, before performing the hash operation on the source data provided by the data supplier, the method further comprises:

receiving the source data from the data supplier;

and performing the hash operation on the query data of the data demander comprises:

obtaining the query data from a server of the data demander and performing the hash operation on the query data.

5. The data matching method according to claim 1, characterized in that, before performing the hash operation on the source data provided by the data supplier, the method further comprises:

receiving the query data from the data demander;

and performing the hash operation on the source data provided by the data supplier comprises:

obtaining the source data from a server of the data supplier and performing the hash operation on the source data.

6. The data matching method according to claim 1, characterized in that the number of source hash buckets is the same as the number of query hash buckets.

7. The data matching method according to claim 1, characterized by further comprising:

performing statistical analysis on the matching result to obtain the matching distribution between the query data in each query hash bucket and the source data in the corresponding source hash bucket.

8. A data matching device, characterized by comprising:

a first hash module configured to perform a hash operation on source data provided by a data supplier to obtain multiple source hash buckets, wherein each source hash bucket has a serial number and contains multiple source data items;

a second hash module configured to perform a hash operation on query data of a data demander to obtain multiple query hash buckets, wherein each query hash bucket has a serial number and contains multiple query data items;

a data matching module configured to match the query data in each query hash bucket against the source data in the source hash bucket corresponding to that query hash bucket to obtain a matching result, wherein the serial number of each query hash bucket corresponds to the serial number of its source hash bucket.

9. A storage medium storing computer instructions, characterized in that the computer instructions, when run, execute the steps of the data matching method according to any one of claims 1 to 7.

10. A terminal comprising a memory and a processor, the memory storing computer instructions that can be run on the processor, characterized in that the processor executes the steps of the data matching method according to any one of claims 1 to 7 when running the computer instructions.