CN116050376A - Data comparison method, device, equipment and storage medium - Google Patents

Info

Publication number
CN116050376A
Authority
CN
China
Prior art keywords: preset, data, codes, target, short
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211543024.2A
Other languages
Chinese (zh)
Inventor
贺宁
魏程琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Unisinsight Technology Co Ltd
Original Assignee
Chongqing Unisinsight Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Unisinsight Technology Co Ltd filed Critical Chongqing Unisinsight Technology Co Ltd
Priority to CN202211543024.2A priority Critical patent/CN116050376A/en
Publication of CN116050376A publication Critical patent/CN116050376A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying

Abstract

Embodiments of the application provide a data comparison method, device, equipment and storage medium. The method generates a long code to be compared and a short code to be compared based on the data to be compared; performs a first comparison of the short code to be compared against a plurality of preset short codes in a preset short code data set to determine a plurality of target short codes; determines a plurality of candidate long codes corresponding to the target short codes based on the preset data identifiers of the target short codes; performs a second comparison of the long code to be compared against the candidate long codes to obtain a plurality of target long codes; and generates an initial comparison result for the data to be compared based on the target structured data associated with each target long code. The data comparison is thereby split into a short code comparison and a long code comparison.

Description

Data comparison method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a data comparison method, a device, equipment and a storage medium.
Background
The field of artificial intelligence is developing ever faster and is being applied ever more widely. Take data comparison, and image comparison in particular, as an example. Because the base library holds a huge volume of data, comparing the full data obtained by analyzing one image to be compared against the full data of every base image in the library consumes a great deal of computing power and time. In addition, the amount of data produced by analyzing the original data keeps growing, so the overall operation speed keeps falling. A more efficient data comparison method is therefore needed to increase operation speed, save computing power and reduce time consumption.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a data comparison method, apparatus, device and storage medium that solve the technical problems of slow operation speed, wasted computing power and long time consumption in current data comparison methods.
In view of the foregoing, the present invention provides a data comparison method, which includes: obtaining data to be compared, and generating a long code to be compared and a short code to be compared based on the data to be compared; performing a first comparison of the short code to be compared against a plurality of preset short codes in a preset short code data set, and determining a plurality of target short codes; determining a plurality of candidate long codes corresponding to the target short codes based on the preset data identifiers of the target short codes, wherein the candidate long codes and the target short codes are associated with the preset data identifiers; performing a second comparison of the long code to be compared against the candidate long codes to obtain a plurality of target long codes; and generating an initial comparison result for the data to be compared based on the target structured data associated with each target long code.
In an embodiment of the present invention, determining a plurality of candidate long codes corresponding to each target short code based on the preset data identifier of each target short code includes: determining the position of a target partition according to a preset partition identifier, wherein the preset partition identifier is obtained based on a preset data identifier; and determining a preset disk based on the preset micro-service pointed by the target partition position, and reading the preset disk according to the preset data identifier through the preset micro-service to obtain a plurality of candidate long codes.
In an embodiment of the present invention, performing a second comparison between the long code to be compared and the candidate long code to obtain a plurality of target long codes includes: performing second comparison on the long codes to be compared and the candidate long codes through a preset micro service to obtain second similarity of each candidate long code; a plurality of target long codes is determined from the plurality of candidate long codes based on the second similarity.
In an embodiment of the present invention, reading, by the preset micro service, the preset disk according to the preset data identifier, to obtain a plurality of candidate long codes includes: matching is carried out according to preset data identifiers and preset data identifiers of a plurality of preset combined files in a preset disk so as to obtain a plurality of target combined files matched with the preset data identifiers, wherein the target combined files comprise preset long codes and preset structured data; and determining the preset long code as a candidate long code to obtain a plurality of candidate long codes and preset structured data associated with the candidate long codes.
In an embodiment of the present invention, generating an initial comparison result of data to be compared based on target structured data associated with each target long code includes: if a candidate long code is determined to be a target long code, determining the preset structured data associated with that candidate long code to be the target structured data; generating a candidate comparison result from the target structured data and the target long code associated with it, so as to obtain a plurality of candidate comparison results; and generating an initial comparison result based on the plurality of candidate comparison results.
In an embodiment of the present invention, generating an initial comparison result of data to be compared based on target structured data associated with each target long code includes: if a candidate long code is determined to be a target long code, determining the preset structured data associated with that candidate long code to be the target structured data; generating a candidate comparison result from the target structured data, so as to obtain a plurality of candidate comparison results; and generating an initial comparison result based on the plurality of candidate comparison results.
In an embodiment of the present invention, after the preset disk is read by the preset micro service according to the preset data identifier to obtain the plurality of candidate long codes, the method further includes: if a preset condition is met, matching a plurality of candidate long codes from a distributed shared storage according to the preset data identifiers, wherein the preset condition includes at least one of the preset magnetic disk, the preset magnetic disk being damaged, and the preset micro-service undergoing service drift, and the distributed shared storage stores a plurality of preset long codes and the preset data identifiers associated with the preset long codes; and matching the preset data identifiers of the candidate long codes against a preset database to obtain a plurality of preset structured data, wherein the preset database stores the plurality of preset structured data and the preset data identifiers associated with the preset structured data.
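The fallback path in this embodiment can be sketched as follows. This is only an illustration under stated assumptions: `load_candidates`, `read_local`, `shared_long_codes` and `database` are hypothetical names, plain dictionaries stand in for the distributed shared storage and the preset database, and an `OSError` stands in for the preset condition (for example a damaged disk).

```python
from typing import Callable, Dict, Iterable, List, Tuple

# (long code, structured data); both shapes are assumptions for illustration.
Record = Tuple[List[float], dict]


def load_candidates(
    data_ids: Iterable[str],
    read_local: Callable[[List[str]], Dict[str, Record]],
    shared_long_codes: Dict[str, List[float]],
    database: Dict[str, dict],
) -> Dict[str, Record]:
    """Read candidates from the local disk, falling back to shared storage."""
    ids = list(data_ids)
    try:
        # Normal path: one local read returns long code and structured data.
        return read_local(ids)
    except OSError:
        # Fallback path: long codes come from shared storage and structured
        # data from the database, both matched by the same data identifier.
        return {
            data_id: (shared_long_codes[data_id], database[data_id])
            for data_id in ids
            if data_id in shared_long_codes and data_id in database
        }
```

The fallback costs one extra database lookup per candidate, which is the cost the combined file is meant to avoid on the normal path.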
In an embodiment of the present invention, after generating an initial comparison result of the data to be compared based on the target structured data associated with each target long code, the method further includes: performing third comparison on the to-be-compared structured data and each target structured data in the initial comparison result to obtain a third similarity of each target structured data, wherein the to-be-compared structured data is generated based on the to-be-compared data; and if the third similarity is greater than a preset third similarity threshold, determining the target structured data as an intermediate comparison result.
In an embodiment of the present invention, before the first comparison is performed between the short code to be compared and a plurality of preset short codes in the preset short code data set, the method further includes: acquiring a plurality of preset original data, generating original characteristic data and preset structured data based on the preset original data, and generating preset short codes and preset long codes based on the original characteristic data; storing preset structured data of each preset original data into a preset database, storing original characteristic data of each preset original data into distributed shared storage, generating a preset short code data set based on preset short codes of each preset original data, and storing the preset short code data set into a preset memory; and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file of each preset original data in a preset magnetic disk.
In an embodiment of the present invention, after generating the original feature data and the preset structured data based on the preset original data and generating the preset short code and the preset long code based on the original feature data, the method further includes: configuring a preset data identifier for each preset original data, wherein the preset data identifier includes a preset feature identifier and a preset partition identifier, the preset feature identifier is used for distinguishing the preset original data from one another, and the preset partition identifier is used for representing the disk partition of the preset disk in which the combined file of the preset original data is stored; storing the preset structured data and preset data identifier of each preset original data into a preset database, storing the original feature data and preset data identifier of each preset original data into distributed shared storage, generating a preset short code data set based on the preset short code and preset data identifier of each preset original data, and storing the preset short code data set into a preset memory; and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file and the preset data identifier of each preset original data in a preset magnetic disk.
In an embodiment of the present invention, before configuring the preset data identifier for each preset original data, the method further includes: acquiring the total service amount of a plurality of preset micro services and the total partition amount of a plurality of preset disks in a current cluster, wherein the total partition amount is greater than or equal to the total service amount; and determining a reference value from the total partition amount and the total service amount, and determining the preset micro service to which each disk partition of each preset disk points, wherein the reference value includes a quotient and a remainder.
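One way to read the quotient and remainder is as an even split of partitions across micro services. The sketch below is an assumption, not the rule stated in this application: `assign_partitions` is a hypothetical helper that gives every service `quotient` consecutive partitions and hands one extra partition to each of the first `remainder` services.

```python
from typing import Dict


def assign_partitions(total_partitions: int, total_services: int) -> Dict[int, int]:
    """Map each partition index to the index of the service that owns it."""
    if total_services <= 0 or total_partitions < total_services:
        raise ValueError("need at least one partition per service")
    # The reference value: how many partitions each service gets at minimum,
    # and how many services get one more.
    quotient, remainder = divmod(total_partitions, total_services)
    mapping: Dict[int, int] = {}
    partition = 0
    for service in range(total_services):
        count = quotient + (1 if service < remainder else 0)
        for _ in range(count):
            mapping[partition] = service
            partition += 1
    return mapping
```

With 7 partitions and 3 services, the quotient is 2 and the remainder is 1, so service 0 owns three partitions and services 1 and 2 own two each.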
In an embodiment of the present invention, generating the preset short code data set based on the preset short codes and the preset data identifiers of the preset original data includes: determining a salient feature existence state based on the preset structured data of each preset original data; if the salient feature existence state of a preset original data is "exists", determining that preset original data to be first original data, determining preset salient features based on the preset structured data of the first original data, and dividing the preset short codes and preset data identifiers of each first original data into at least one preset short code data subset with salient features according to the preset salient features; if the salient feature existence state of a preset original data is "does not exist", determining that preset original data to be second original data, and dividing the preset short codes and preset data identifiers of the second original data into a preset short code data subset without salient features; and generating the preset short code data set based on each preset short code data subset with salient features and the preset short code data subset without salient features.
In an embodiment of the present invention, performing a first comparison between a short code to be compared and a plurality of preset short codes in a preset short code data set, and determining a plurality of target short codes includes: generating structural data to be compared based on the data to be compared, and determining significant characteristics to be compared; if the to-be-compared significant features are the same as the preset significant features corresponding to the significant feature preset short code data subsets, determining the preset short codes in the significant feature preset short code data subsets as screened short codes; if the preset short code data subset without the significant features is not empty, determining the preset short codes in the preset short code data subset without the significant features as screened short codes; and comparing the short codes to be compared with the short codes after screening for the first time to determine a plurality of target short codes.
The embodiment of the invention also provides a data comparison device, which comprises: an acquisition module, used for acquiring data to be compared and generating a long code to be compared and a short code to be compared based on the data to be compared; a first comparison module, used for performing a first comparison of the short code to be compared against a plurality of preset short codes in the preset short code data set, and determining a plurality of target short codes; a candidate long code determining module, used for determining a plurality of candidate long codes corresponding to the target short codes based on the preset data identifiers of the target short codes, wherein the candidate long codes and the target short codes are associated with the preset data identifiers; a second comparison module, used for performing a second comparison of the long code to be compared against the candidate long codes to obtain a plurality of target long codes; and a result generation module, used for generating an initial comparison result for the data to be compared based on the target structured data associated with each target long code.
The embodiment of the invention also provides electronic equipment, which comprises a processor, a memory and a communication bus; the communication bus is used for connecting the processor and the memory; the processor is configured to execute a computer program stored in the memory to implement the method according to any of the embodiments described above.
The embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program for causing a computer to perform the method according to any one of the above embodiments.
As described above, the data comparison method, device, equipment and storage medium provided by the invention have the following beneficial effects:
According to the method, a long code to be compared and a short code to be compared are generated based on the data to be compared; the short code to be compared is compared a first time against a plurality of preset short codes in a preset short code data set to determine a plurality of target short codes; a plurality of candidate long codes corresponding to the target short codes are determined based on the preset data identifiers of the target short codes; the long code to be compared is compared a second time against the candidate long codes to obtain a plurality of target long codes; and an initial comparison result for the data to be compared is generated based on the target structured data associated with each target long code. The data comparison is thereby split into a short code comparison and a long code comparison.
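The two-stage flow can be sketched as follows. The similarity measures are assumptions made for illustration, since the application does not fix them: short codes are treated as bit-packed integers compared by Hamming distance, long codes as float vectors compared by cosine similarity, and `short_index`, `long_store`, `max_distance` and `min_similarity` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two bit-packed short codes."""
    return bin(a ^ b).count("1")


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two long codes (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compare(
    query_short: int,
    query_long: List[float],
    short_index: Dict[str, int],                    # data id -> short code
    long_store: Dict[str, Tuple[List[float], dict]],  # data id -> (long code, structured data)
    max_distance: int = 2,
    min_similarity: float = 0.9,
) -> List[Tuple[str, float, dict]]:
    # First comparison: cheap coarse filter over every short code.
    target_ids = [
        data_id
        for data_id, short_code in short_index.items()
        if hamming(query_short, short_code) <= max_distance
    ]
    # Second comparison: long codes are fetched only for the survivors.
    results = []
    for data_id in target_ids:
        long_code, structured = long_store[data_id]
        similarity = cosine(query_long, long_code)
        if similarity >= min_similarity:
            results.append((data_id, similarity, structured))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Only the entries that pass the short code filter incur a long code read, which is where the saving in computing power comes from.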
Drawings
FIG. 1 is a flow chart of a data comparison method shown in an exemplary embodiment of the present application.
FIG. 2 is a flow chart of a data storage method shown in an exemplary embodiment of the present application.
FIG. 3 is a flow chart of a data comparison method shown in an exemplary embodiment of the present application.
FIG. 4 is a block diagram of a data comparison apparatus shown in an exemplary embodiment of the present application.
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
Referring to fig. 1, fig. 1 is a flowchart of a data comparison method according to an exemplary embodiment of the present application. As shown in fig. 1, the method at least includes steps S101 to S104, and is described in detail as follows:
Step S101, obtaining data to be compared, and generating a long code to be compared and a short code to be compared based on the data to be compared.
The data to be compared may be picture data or other original data set by those skilled in the art.
To-be-compared structured data and to-be-compared characteristic data are generated based on the data to be compared, and the long code to be compared and the short code to be compared are generated based on the to-be-compared characteristic data. The short code to be compared may be determined based on the long code to be compared. The long code to be compared, the short code to be compared and the to-be-compared structured data may be generated by methods known to those skilled in the art, which are not limited herein.
It should be noted that the long code to be compared, the short code to be compared and the to-be-compared structured data are all generated from the same data to be compared. To associate them, a to-be-compared identifier can be set for each piece of data to be compared, and the long code, short code and structured data generated from that data can be linked through this identifier. When they are stored in the corresponding storage space of the base library, the long code, short code and structured data of a given piece of data to be compared can then be queried by its to-be-compared identifier.
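Since the application leaves code generation to known methods, the following is only one possible sketch of deriving a short code from a long code. Sign binarization of the leading components is an assumption chosen for brevity, and `Codes`, `make_short_code` and `make_codes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Codes:
    compare_id: str         # to-be-compared identifier linking both codes
    long_code: List[float]  # full feature vector
    short_code: int         # bit-packed coarse signature


def make_short_code(long_code: List[float], bits: int = 16) -> int:
    """Pack the signs of the first `bits` components into an integer."""
    code = 0
    for value in long_code[:bits]:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def make_codes(compare_id: str, feature_vector: List[float], bits: int = 16) -> Codes:
    """Produce the long code and short code for one piece of data."""
    long_code = list(feature_vector)
    return Codes(compare_id, long_code, make_short_code(long_code, bits))
```

Both codes carry the same identifier, so either one can be looked up from the other later.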
Step S102, the short codes to be compared are compared with a plurality of preset short codes in a preset short code data set for the first time, and a plurality of target short codes are determined.
In an embodiment, performing first comparison on the short code to be compared and a plurality of preset short codes in a preset short code data set, and determining a plurality of target short codes includes:
generating structural data to be compared based on the data to be compared, and determining significant characteristics to be compared according to the structural data to be compared;
screening a plurality of preset short codes in a preset short code data set in a preset memory according to the to-be-compared significant features, wherein at least one part of preset short codes are pre-associated with preset significant features;
determining preset short codes with preset significant features consistent with the significant features to be compared and preset short codes without associated preset significant features as screened short codes;
and comparing the short codes to be compared with the short codes after screening for the first time to obtain a plurality of target short codes.
In an embodiment, to further improve comparison efficiency, when the significant features to be compared span at least two dimensions, a plurality of preset dimensions, such as gender and age, may be set for the preset significant features. When the significant features to be compared are obtained, the dimension of each is obtained as well, and the preset significant features and the significant features to be compared are compared dimension by dimension. For example, if the significant features to be compared are female and under 20 years old, the preset short codes whose preset significant feature is male can be screened out first, followed by the preset short codes whose preset significant feature is an age over 20. The preset short codes retained are those whose gender is female or unknown and whose age is under 20 or unknown. Whether the gender dimension or the age dimension is screened first can be set by those skilled in the art.
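The per-dimension screening can be sketched as below. The representation is an assumption: each entry carries a dictionary of dimension values, `None` stands for "unknown", and age is reduced to the hypothetical bands `"under_20"` and `"20_plus"`.

```python
from typing import Dict, List, Optional

Features = Dict[str, Optional[str]]


def passes_screen(preset: Features, query: Features) -> bool:
    """Keep a preset entry unless a known value contradicts the query."""
    for dimension, wanted in query.items():
        if wanted is None:
            continue  # the query says nothing about this dimension
        have = preset.get(dimension)
        if have is not None and have != wanted:
            return False
    return True


def screen(presets: List[dict], query: Features) -> List[dict]:
    """Return the preset entries that survive screening on every dimension."""
    return [entry for entry in presets if passes_screen(entry["features"], query)]
```

Because each dimension is checked independently, the order in which gender and age are screened does not change the result, only the amount of work done.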
In an embodiment, the method further includes, before comparing the short code to be compared with the plurality of preset short codes in the preset short code data set for the first time and determining the plurality of target short codes:
acquiring a plurality of preset original data, generating original characteristic data and preset structured data based on the preset original data, and generating preset short codes and preset long codes based on the original characteristic data. Taking picture data comparison as an example, the preset original data may be image data of a plurality of base pictures; it may also be other data set by a person skilled in the art. The original characteristic data and the preset structured data are generated from the preset original data in a manner similar to that used to generate the to-be-compared characteristic data and the to-be-compared structured data from the data to be compared, and the preset long codes and preset short codes are generated in a manner similar to that used for the to-be-compared long codes and short codes in the above embodiment. This makes the to-be-compared long codes, short codes and structured data comparable with the preset long codes, preset short codes and preset structured data. The details are not repeated here.
Storing preset structured data of each preset original data into a preset database, storing original characteristic data of each preset original data into distributed shared storage, generating a preset short code data set based on preset short codes of each preset original data, and storing the preset short code data set into a preset memory;
and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file of each preset original data in a preset magnetic disk.
The size of the preset combined file of each preset original data is the same. It can be understood that the preset short code is similar to the keyword group of the preset original data, and the preset long code is similar to the brief description of the preset original data, so that a plurality of data which are not necessarily required can be roughly screened out by first comparison of the short code to be compared and the preset short code, thereby being beneficial to reducing the waste of calculation power.
In an embodiment, the original characteristic data includes the preset short code and the preset long code. The distributed shared storage stores the full set of original characteristic data, while the preset disk may be configured to store the preset combined files of all or only part of the preset original data, for example only the preset combined files of the preset original data from the last 3 months.
Combining the preset long code and the preset structured data into one preset combined file binds them one to one, so a single read of the preset long code also retrieves the preset structured data. Once the long code comparison is complete, the corresponding preset structured data is already available and the preset database does not need to be read again, which greatly reduces both the load on the preset database and the read time.
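A combined record of this kind might look like the sketch below. The byte layout is an assumption, not the format used in the application: a 4-byte little-endian length, the long code as float32 values, then the structured data as JSON. `pack_combined` and `unpack_combined` are hypothetical names.

```python
import json
import struct
from typing import List, Tuple


def pack_combined(long_code: List[float], structured: dict) -> bytes:
    """Bind a long code and its structured data into one byte record."""
    code_bytes = struct.pack(f"<{len(long_code)}f", *long_code)
    meta_bytes = json.dumps(structured, sort_keys=True).encode("utf-8")
    # The length prefix tells the reader where the long code ends.
    return struct.pack("<I", len(code_bytes)) + code_bytes + meta_bytes


def unpack_combined(blob: bytes) -> Tuple[List[float], dict]:
    """Recover both parts from a single read of the record."""
    (code_len,) = struct.unpack_from("<I", blob, 0)
    long_code = list(struct.unpack_from(f"<{code_len // 4}f", blob, 4))
    structured = json.loads(blob[4 + code_len:].decode("utf-8"))
    return long_code, structured
```

Note that float32 storage rounds values that are not exactly representable, so a round trip is exact only for values such as 0.5 or 2.0.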
In an embodiment, after generating the original feature data and the preset structured data based on the preset original data and generating the preset short code and the preset long code based on the original feature data, the method further includes:
configuring a preset data identifier for each preset original data, wherein the preset data identifier includes a preset feature identifier and a preset partition identifier. The preset feature identifier is used for distinguishing the preset original data from one another, and the preset partition identifier is used for representing the disk partition of the preset disk in which the combined file of the preset original data is stored. In other words, the location of the corresponding long code file is known from the preset partition identifier, so there is no need to match the preset feature identifier across the whole preset disk; the long code file can be read directly from the location indicated by the preset partition identifier, which greatly reduces the file read time;
storing the preset structured data and preset data identifier (or only the preset feature identifier) of each preset original data into a preset database, storing the original feature data and preset data identifier (or only the preset feature identifier) of each preset original data into distributed shared storage, generating a preset short code data set based on the preset short code and preset data identifier of each preset original data, and storing the preset short code data set into a preset memory;
and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file and the preset data identifier (or only the preset feature identifier) of each preset original data in a preset magnetic disk.
In other words, the preset long code, preset short code, preset structured data and original feature data generated from the same preset original data all correspond to the same preset data identifier, or to the same preset feature identifier. When these data are stored in the preset database, the distributed shared storage, the preset memory and the preset disk, the identifier is stored with them, which makes the corresponding data easy to query. The preset data identifier held in the preset memory must include both the preset feature identifier and the preset partition identifier, because the preset short code in the preset memory relies on it to locate the partition of the corresponding preset combined file. The identifiers held in the preset database, the distributed shared storage and the preset disk may be either the preset feature identifier alone or the preset feature identifier plus the preset partition identifier.
It should be noted that, if only the preset feature identifier is stored in the preset database, the preset disk and the distributed shared storage, then when the preset long code is looked up by the preset data identifier, the preset feature identifier can be split out of the preset data identifier and used to search for the preset long code.
Through the preset data identification, the data can be more conveniently extracted from a preset memory, a preset database, distributed shared storage and a preset disk.
As an example, the preset partition identifier is the part of the data source, appended to the identifier. It may occupy two digits and is determined by the partition from which the data was actually consumed.
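Composing and splitting such an identifier can be sketched as follows. The encoding is an assumption: the partition is written as a two-digit decimal suffix on the feature identifier, and `make_data_id` and `split_data_id` are hypothetical names.

```python
from typing import Tuple


def make_data_id(feature_id: str, partition: int) -> str:
    """Append a two-digit partition suffix to the feature identifier."""
    if not 0 <= partition <= 99:
        raise ValueError("partition must fit in two digits")
    return f"{feature_id}{partition:02d}"


def split_data_id(data_id: str) -> Tuple[str, int]:
    """Recover the feature identifier and the partition from a data identifier."""
    return data_id[:-2], int(data_id[-2:])
```

A fixed-width suffix keeps the split unambiguous regardless of the length of the feature identifier.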
The original characteristic data is stored on a ceph disk (distributed shared storage) to ensure the data security of the system.
In an embodiment, storing the preset combined file and the preset data identifier of each preset original data in the preset disk further includes:
acquiring the preset partition identifier in each preset data identifier, and determining, based on the preset partition identifier, the disk partition in which the preset combined file is to be stored;
acquiring data acquisition time of preset original data corresponding to each preset combined file;
And storing each preset combined file based on the disk partition and the data acquisition time.
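The storage steps above can be sketched as follows. This is only an illustrative layout: the directory scheme, the `.bin` suffix and the function name `combined_file_path` are assumptions, as is the convention (taken from the example embodiments) that the last two characters of the identifier are the partition identifier.

```python
from datetime import datetime
from pathlib import PurePosixPath


def combined_file_path(root: str, record_id: str, acquired_at: datetime) -> PurePosixPath:
    """Build a storage path from the partition identifier and the acquisition date."""
    part = record_id[-2:]  # assumption: last two characters are the partition identifier
    day = acquired_at.strftime("%Y%m%d")  # assumption: one directory per acquisition day
    return PurePosixPath(root) / f"part={part}" / day / f"{record_id}.bin"


path = combined_file_path("/data/local", "000000123407", datetime(2022, 12, 2, 9, 30))
print(path)
```

Grouping by partition first and date second keeps all files of one partition under a single directory, which is convenient when one micro-service owns that partition.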
In an embodiment, generating the preset short code dataset based on the preset short codes and the preset data identifiers of the preset raw data comprises:
determining a salient feature existence state based on preset structured data of each preset original data;
if the existence state of the salient features of the preset original data is existence, determining the preset original data as first original data, determining preset salient features based on preset structured data of the first original data, and dividing preset short codes and preset data identifiers of each first original data into at least one preset short code data subset with salient features according to the preset salient features;
if the existence state of the salient features of the preset original data is nonexistent, determining the preset original data as second original data, and dividing preset short codes and preset data identifiers of the second original data into preset short code data subsets without the salient features;
and generating the preset short code data set based on each preset short code data subset with salient features and the preset short code data subset without salient features.
That is, before storing the preset short codes, the preset short codes are classified according to whether the preset original data corresponding to each preset short code has preset significant features and what preset significant features are included, so as to obtain a plurality of subsets, and finally, a complete preset short code data set is obtained.
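A minimal sketch of this classification, assuming gender is the salient feature (as in the later worked example). The field name `gender`, the key `NO_FEATURE` and the tuple layout of the records are hypothetical choices made for illustration.

```python
from collections import defaultdict

NO_FEATURE = "__none__"  # hypothetical key for the subset without salient features


def build_short_code_dataset(records):
    """Group (record_id, short_code) pairs by the salient feature of each record.

    records: iterable of (record_id, short_code, structured_data) tuples.
    """
    dataset = defaultdict(list)
    for record_id, short_code, structured in records:
        feature = structured.get("gender")  # assumption: salient feature field
        key = feature if feature in ("male", "female") else NO_FEATURE
        dataset[key].append((record_id, short_code))
    return dict(dataset)


records = [
    ("000000000100", b"\x0f", {"gender": "male"}),
    ("000000000221", b"\xf0", {"gender": "female"}),
    ("000000000342", b"\xaa", {}),
]
dataset = build_short_code_dataset(records)
print(sorted(dataset))
```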
In an embodiment, the storage of the preset short codes in the preset memory can be performed through the preset significant features, and the preset short codes with different preset significant features are respectively stored.
For example, pure structured data is stored in a database, pure feature data is stored in Ceph storage, and a combined file of the long feature (long code) and the preset structured data is stored in a local disk. First, the part (partition) of the data source is appended to the recordID (preset feature identifier); it is two digits long, and its value depends on which partition the data was actually consumed from. The preset structured data is stored in the preset database, and the long and short features (original feature data) are stored in a Ceph disk (distributed shared storage) to ensure the data security of the system. The preset short codes are stored in the preset memory, grouped by preset significant feature. Finally, because the feature length is fixed, the preset structured data is appended directly after the preset long code and the combination is written to a file. After the combination is completed, the combined file is stored according to the part position and the snapshot date.
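The combination of a fixed-length long code with its structured data might look like the sketch below. The 8-byte length, the JSON encoding of the structured data and the function names are all assumptions; the text only states that the feature length is fixed and that the structured data follows it.

```python
import json

LONG_CODE_LEN = 8  # hypothetical fixed length in bytes; real long codes would be larger


def pack_combined(long_code: bytes, structured: dict) -> bytes:
    """Append the structured data directly after the fixed-length long code."""
    if len(long_code) != LONG_CODE_LEN:
        raise ValueError("long code must have the fixed length")
    # assumption: structured data is serialized as UTF-8 JSON
    return long_code + json.dumps(structured, sort_keys=True).encode("utf-8")


def unpack_combined(blob: bytes):
    """Split a combined record back into (long_code, structured_data)."""
    return blob[:LONG_CODE_LEN], json.loads(blob[LONG_CODE_LEN:].decode("utf-8"))


blob = pack_combined(b"\x01" * LONG_CODE_LEN, {"gender": "male", "age": 30})
long_code, structured = unpack_combined(blob)
print(len(blob), structured)
```

Because the long code length is fixed, no separator or length prefix is needed to find where the structured data begins.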
In an embodiment, before configuring the preset data identifier for each preset raw data, the method further includes:
Acquiring the total service amount of a plurality of preset micro services and the total partition amount of a plurality of preset disks in a current cluster, wherein the total partition amount is greater than or equal to the total service amount;
and determining a reference value according to the partition number and the service number, and determining preset micro-services pointed by the disk partitions of each preset disk, wherein the reference value comprises a quotient value and a remainder value.
The method for determining the preset micro-service pointed by the disk partition of each preset disk comprises the following steps of:
determining the quotient value and the remainder value of the partition number and the service number, and determining the quotient value and the remainder value as reference values;
if the remainder value is zero, configuring quotient disk partitions for each preset micro-service, the disk partitions of preset micro-service 0 being identified as 0 to quotient-1, and so on for each subsequent preset micro-service, wherein quotient is the quotient value;
if the remainder value is greater than zero, identifying the disk partitions of the preset micro-services whose number is smaller than the remainder value as 0 to quotient, quotient+1 to 2×quotient+1, and so on, so that each of these preset micro-services is configured with quotient+1 disk partitions, and configuring each preset micro-service whose number is greater than or equal to the remainder value with the next quotient disk partitions in sequence until all disk partitions are assigned, wherein remainder is the remainder value.
Since local disk data is not shared externally, it is necessary to determine which service each part (disk partition) belongs to. The calculation divides the number of parts (total partition amount) by the number of services (total service amount). If there is no remainder, each micro-service is assigned quotient parts: starting from micro-service 0, the part numbers are 0 to quotient-1, and so on. If the division leaves a remainder, the micro-services whose number is smaller than the remainder are each assigned quotient+1 parts (0 to quotient, quotient+1 to 2×quotient+1, and so on), and each micro-service whose number is greater than or equal to the remainder is assigned the next quotient parts in sequence, until all parts are assigned. (Note: the number of parts needs to be greater than the number of services.)
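One reading of this quotient-and-remainder rule is sketched below. It is chosen to agree with the worked example given later in the description (62 partitions and 3 services yielding 0-20, 21-41 and 42-61); the function name `assign_partitions` and the use of `range` objects are illustrative assumptions.

```python
def assign_partitions(total_partitions: int, total_services: int) -> dict:
    """Map each micro-service number to a contiguous range of partition numbers."""
    if total_services <= 0 or total_partitions < total_services:
        raise ValueError("need at least as many partitions as services")
    quotient, remainder = divmod(total_partitions, total_services)
    assignment = {}
    start = 0
    for service in range(total_services):
        # services numbered below the remainder take one extra partition
        count = quotient + 1 if service < remainder else quotient
        assignment[service] = range(start, start + count)
        start += count
    return assignment


layout = assign_partitions(62, 3)
for service, parts in layout.items():
    print(service, parts[0], parts[-1])
```

With no remainder, for instance 6 partitions and 3 services, every service simply receives `quotient` consecutive partitions.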
In an embodiment, performing first comparison on the short code to be compared and a plurality of preset short codes in a preset short code data set, and determining a plurality of target short codes includes:
generating structural data to be compared based on the data to be compared, and determining significant characteristics to be compared;
if the significant feature to be compared is the same as the preset significant feature corresponding to a preset short code data subset with significant features, determining the preset short codes in that preset short code data subset as screened short codes;
if the preset short code data subset without significant features is not empty, determining the preset short codes in the preset short code data subset without significant features as screened short codes;
and comparing the short codes to be compared with the short codes after screening for the first time to determine a plurality of target short codes.
That is, the preset short codes are pre-screened according to the significant features to be compared, the preset short codes with the same significant features as the significant features to be compared are directly determined as screened short codes for subsequent first comparison, the preset short codes without the significant features are also determined as screened short codes for subsequent first comparison, and the preset short codes with the significant features different from the significant features to be compared are screened out without subsequent first comparison, so that the data amount can be effectively reduced and the calculation force can be saved.
It should be noted that the preset significant features may span multiple dimensions. Taking the gender dimension as an example, the preset significant feature may be male, female or unknown. When the significant feature to be compared is also in the gender dimension and its value is female, the preset short codes whose preset significant feature is female or unknown may be determined as screened short codes for the first comparison. The precondition for screening is that the dimensions of the preset significant features and the dimensions of the significant feature to be compared intersect; the significant features to be compared within that intersection are then used to screen the preset significant features and thus obtain the screened short codes.
It should also be noted that if the significant feature to be compared, as determined from the to-be-compared structured data generated from the data to be compared, is empty, or the data to be compared has no significant feature to be compared, then the first comparison is performed between the short code to be compared and all preset short codes in the preset memory.
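The pre-screening described above can be sketched as a lookup over the classified subsets. The dictionary layout and the `NO_FEATURE` key are hypothetical; a query feature of `None` stands for data to be compared that has no significant feature.

```python
NO_FEATURE = "__none__"  # hypothetical key for the subset without significant features


def screen_short_codes(dataset: dict, query_feature):
    """Return the (record_id, short_code) pairs that take part in the first comparison."""
    if query_feature is None:
        # no significant feature to compare: use every preset short code
        return [item for subset in dataset.values() for item in subset]
    # matching subset plus the subset without significant features
    return list(dataset.get(query_feature, [])) + list(dataset.get(NO_FEATURE, []))


dataset = {
    "male": [("id1", b"\x01")],
    "female": [("id2", b"\x02")],
    NO_FEATURE: [("id3", b"\x03")],
}
screened = screen_short_codes(dataset, "female")
print(screened)
```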
In an embodiment, after the short code to be compared has been compared for the first time with all or part of the preset short codes, a first similarity between each preset short code and the short code to be compared is obtained. The preset short codes can be sorted from high to low by first similarity, and the top-ranked preset short codes are used as target short codes; for example, the ten thousand highest-ranked preset short codes are used as target short codes. Alternatively, the preset short codes whose first similarity is higher than a preset first similarity threshold can be used as target short codes. The target short codes may also be determined in other ways known to those skilled in the art.
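Both selection rules (top-ranked and threshold) can be sketched over precomputed similarities. How the first similarity itself is computed is not specified here, so the sketch takes the scores as given; the function name and parameters are illustrative assumptions.

```python
import heapq


def select_targets(similarities: dict, top_n=None, threshold=None):
    """Pick target identifiers from {record_id: similarity}, highest similarity first."""
    items = similarities.items()
    if threshold is not None:
        items = [(rid, s) for rid, s in items if s > threshold]
    if top_n is not None:
        return heapq.nlargest(top_n, items, key=lambda kv: kv[1])
    return sorted(items, key=lambda kv: kv[1], reverse=True)


scores = {"a": 0.91, "b": 0.40, "c": 0.77, "d": 0.85}
print(select_targets(scores, top_n=2))
print(select_targets(scores, threshold=0.8))
```

The same helper would apply to the second and third comparisons, which use the same two selection rules on their own similarities.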
Step S103, determining a plurality of candidate long codes corresponding to the target short codes based on the preset data identification of each target short code.
Wherein, the candidate long code and the target short code are both associated with the preset data identifier, that is, the candidate long code has the same preset data identifier as the target short code.
In an embodiment, determining a plurality of candidate long codes corresponding to the target short codes based on the preset data identifiers of the target short codes includes:
determining the position of a target partition according to a preset partition identifier, wherein the preset partition identifier is obtained based on a preset data identifier;
and determining a preset disk based on the preset micro-service pointed by the target partition position, and reading the preset disk according to the preset data identifier through the preset micro-service to obtain a plurality of candidate long codes.
In an embodiment, the preset data identifier and the preset feature identifier may be identical, with the preset partition identifier being a part of the preset feature identifier. For example, the preset data identifier has 12 characters, the preset feature identifier is that same 12-character preset data identifier, and the preset partition identifier is the last two of the 12 characters. In this case, reading the preset disk through the preset micro-service according to the preset feature identifier is the same as reading it according to the preset data identifier, and the corresponding candidate long code is found. Each preset long code in the preset disk is stored together with its preset data identifier or preset feature identifier.
In another embodiment, the preset data identifier and the preset feature identifier may differ. For example, the preset data identifier has 12 characters in total, the preset feature identifier is the first 10 of the 12 characters, and the preset partition identifier is the last two of the 12 characters. In this case, the preset disk is read through the preset micro-service according to the preset feature identifier, and the corresponding candidate long code is found. Each preset long code in the preset disk is stored together with its preset data identifier or preset feature identifier. If the preset data identifier is stored in the preset disk, only its first 10 characters are compared, that is, only the preset feature identifier is compared.
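The two identifier layouts can be illustrated with a small helper. The 12-character identifier and the two-character partition suffix follow the examples above; the function name and the boolean flag are assumptions.

```python
def split_record_id(record_id: str, feature_is_whole_id: bool = False):
    """Return (feature_id, partition_id) for a preset data identifier."""
    partition_id = record_id[-2:]  # last two characters mark the partition
    # either the whole identifier or everything before the partition suffix
    feature_id = record_id if feature_is_whole_id else record_id[:-2]
    return feature_id, partition_id


print(split_record_id("123456789007"))
print(split_record_id("123456789007", feature_is_whole_id=True))
```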
In the above embodiment, the preset short code and the preset combined file stored in the local disk correspond to the same preset data identifier, and the preset data identifier includes a preset partition identifier indicating the partition position of the preset combined file corresponding to the preset short code. The target partition position of the preset combined file containing the preset long code that corresponds to the preset short code can therefore be located through the preset partition identifier. The preset disk is then determined from the preset micro-service to which the target partition position points, and that preset micro-service reads the data and performs the subsequent second comparison of the long code to be compared with the candidate long codes.
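Routing target identifiers to the micro-service that owns their partition might be sketched as follows. The ownership rule assumes the contiguous quotient-and-remainder layout of the later worked example (62 partitions, 3 services); the function names and default values are illustrative.

```python
from collections import defaultdict


def owner_of(partition: int, total_partitions: int, total_services: int) -> int:
    """Return the micro-service number that owns a partition (assumed contiguous layout)."""
    quotient, remainder = divmod(total_partitions, total_services)
    boundary = remainder * (quotient + 1)  # partitions held by the larger services
    if partition < boundary:
        return partition // (quotient + 1)
    return remainder + (partition - boundary) // quotient


def group_by_service(record_ids, total_partitions=62, total_services=3):
    """Group identifiers by owning service, using the last two digits as the partition."""
    groups = defaultdict(list)
    for rid in record_ids:
        groups[owner_of(int(rid[-2:]), total_partitions, total_services)].append(rid)
    return dict(groups)


print(group_by_service(["000000000020", "000000000021", "000000000061"]))
```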
Step S104, performing a second comparison of the long code to be compared with the candidate long codes to obtain a plurality of target long codes.
In an embodiment, performing a second comparison of the to-be-compared long code and the candidate long code to obtain a plurality of target long codes includes:
performing second comparison on the long codes to be compared and the candidate long codes through a preset micro service to obtain second similarity of each candidate long code;
a plurality of target long codes is determined from the plurality of candidate long codes based on the second similarity.
For example, the candidate long codes are sorted by second similarity to obtain a candidate long code queue, and the candidate long codes within a preset top ranking of the queue, that is, the TopK candidate long codes, are determined as target long codes.
For another example, a candidate long code having a second similarity greater than a preset second similarity threshold is determined as the target long code.
The above-mentioned preset first similarity threshold and preset second similarity threshold, and the preset third similarity threshold mentioned later, may be fixed values, which those skilled in the art may adjust according to needs and the scenario.
In this way, the candidate long codes can be located, extracted and compared more quickly. The partition position is obtained directly from the preset partition identifier, so it is not necessary to use the preset feature identifier to compare, one by one, the preset data identifiers of the preset combined files on all disks in the current cluster, which greatly saves computing power.
In an embodiment, reading, by the preset micro service, the preset disk according to the preset data identifier, and obtaining the plurality of candidate long codes includes:
matching is carried out according to preset data identifiers and preset data identifiers of a plurality of preset combined files in a preset disk so as to obtain a plurality of target combined files matched with the preset data identifiers, wherein the target combined files comprise preset long codes and preset structured data;
and determining the preset long code as a candidate long code to obtain a plurality of candidate long codes and preset structured data associated with the candidate long codes.
As described in the foregoing embodiment, the preset long code and the preset structured data are bound in the preset disk to form the preset combined file. When a candidate long code is extracted, the preset long code and the preset structured data need not be split and can be extracted together, so that once a target long code is determined, the corresponding preset structured data is available directly and does not need to be searched for again in the preset database.
Step S105, generating an initial comparison result of the data to be compared based on the target structural data associated with each target long code.
For example, the preset structured data in the preset combined file where each target long code is located may be determined as target structured data, and the initial comparison result is generated based on the target structured data of each target long code. The initial comparison result may also be generated based on the preset combined file corresponding to each target long code.
Of course, the target structured data can also be obtained by searching a preset database based on the preset data identification of the target long code.
In an embodiment, generating an initial comparison result of data to be compared based on target structured data associated with each target long code includes:
if the candidate long code is determined to be the target long code, determining preset structural data associated with the candidate long code to be the target structural data;
generating a candidate comparison result according to the target structured data and the target long code associated with the target structured data, so as to obtain a plurality of candidate comparison results, wherein the preset combined file where the target long code is located is used as the candidate comparison result;
and generating an initial comparison result based on the plurality of candidate comparison results.
In an embodiment, generating an initial comparison result of data to be compared based on target structured data associated with each target long code includes:
if the candidate long code is determined to be the target long code, determining preset structural data associated with the candidate long code to be the target structural data;
generating a candidate comparison result according to the target structured data, so as to obtain a plurality of candidate comparison results; and generating an initial comparison result based on the plurality of candidate comparison results.
In an embodiment, after the preset disk is read by the preset microservice according to the preset feature identifier, obtaining the plurality of candidate long codes further includes:
if a preset condition is met, the preset condition comprising at least one of the preset disk being damaged and the preset micro-service undergoing service drift:
matching a plurality of candidate long codes from a distributed shared storage according to preset data identifiers, wherein the distributed shared storage stores a plurality of preset long codes and preset data identifiers associated with the preset long codes;
and matching the preset data identifiers of the candidate long codes from a preset database to obtain a plurality of preset structured data, wherein the preset database stores the plurality of preset structured data and preset data identifiers associated with the preset structured data.
In the above embodiment, the preset structured data is stored in both the preset database and the local disk. When the preset disk is damaged or the preset micro-service undergoes service drift, so that the preset structured data and the preset long codes in the local disk are lost or incomplete, the preset database can be queried to ensure the completeness of the preset structured data finally obtained, and the distributed shared storage can be queried to ensure the completeness of the preset long codes.
In addition, after the preset disk is damaged or the preset micro-service undergoes service drift, the preset long codes that need to be compensated can be identified from the preset partition identifiers in the preset data identifiers of the preset long codes in the distributed shared storage. Based on the preset data identifiers of those preset long codes and on the preset structured data and preset data identifiers in the preset database, each preset long code to be compensated is combined with its preset structured data to obtain a recombined preset combined file, thereby restoring the local disk (preset disk).
For example, when the disk is damaged or the service drifts, the search result is compensated. On the one hand, the files of the local disk are recombined according to the corresponding part number (the preset partition identifier in the preset data identifier) in Ceph and the preset structured data in the database. On the other hand, during compensation, the long code is queried from Ceph storage and the structured data is queried from the database.
For example, when an abnormality occurs in the environment, such as a pod drifting (service drift), the data of the corresponding local disk (preset disk) is lost. After a comparison request is received, the missing data of that part is obtained from Ceph, and the corresponding structured data is obtained from the database. The other, unaffected micro-services use the results computed from their local disks; the results are then aggregated, and the correct TopK data is returned once aggregation is complete.
In addition, after the service drifts, a different disk is mounted, and data compensation is needed. The part numbers are confirmed according to the service number, and the long codes on Ceph are then synchronized to the disk. The synchronization is one-to-one, and each long code is written to disk after it has been combined with its structured data in memory.
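The compensation path can be sketched with plain dictionaries standing in for the local disk, the Ceph storage and the database. These stand-ins and the function name are assumptions made for illustration; a real deployment would use the respective storage clients.

```python
def fetch_candidate(record_id, local_disk: dict, ceph: dict, database: dict):
    """Return (long_code, structured_data), compensating when the local copy is missing."""
    entry = local_disk.get(record_id)
    if entry is not None:
        return entry
    # local data lost (disk damage or service drift): query the shared stores instead
    long_code = ceph[record_id]
    structured = database[record_id]
    # recombine and write back so the local disk is restored for later requests
    local_disk[record_id] = (long_code, structured)
    return long_code, structured


local = {}
ceph = {"000000000107": b"LONGCODE"}
db = {"000000000107": {"gender": "male"}}
print(fetch_candidate("000000000107", local, ceph, db))
```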
In an embodiment, after generating the initial comparison result of the data to be compared based on the target structured data associated with each target long code, the method further includes:
performing third comparison on the to-be-compared structured data and each target structured data in the initial comparison result to obtain a third similarity of each target structured data, wherein the to-be-compared structured data is generated based on the to-be-compared data;
and if the third similarity is greater than a preset third similarity threshold, determining the target structured data as an intermediate comparison result.
Alternatively, the target structured data is sorted by third similarity, and the top-ranked target structured data is determined as the intermediate comparison result.
At this point there may still be many pieces of target structured data. These pieces of target structured data, together with the corresponding long codes and short codes, may be displayed as the final comparison result so that the relevant staff can perform further comparison.
Continuing with the picture example, when the initial comparison result, the intermediate comparison result and the final comparison result are displayed, the corresponding preset original data, such as the preset original picture, can be displayed at the same time, which makes it convenient for the relevant staff to compare the picture to be compared with the preset original picture.
It can be seen that, in the embodiments of the present application, the preset original data must be processed and stored before data comparison is performed. Referring to fig. 2, fig. 2 is a flowchart of a data storage method shown in an exemplary embodiment of the present application. As shown in fig. 2, the preset original data is analyzed and warehoused to obtain the original feature data (the features and feature data shown in fig. 2) and the preset structured data (the structured data shown in fig. 2); the preset structured data is spliced after the original feature data, and two digits marking the partition (part) position are appended to the preset data identifier (recordID) of the preset original data. Specifically, the analyzed data is ingested into Kafka (a high-throughput distributed publish-subscribe messaging system). Because the length of the feature data is fixed, the structured data is spliced directly after the feature data, and the last two digits of the recordID mark the Kafka part from which the data came. The original feature data is stored in the distributed shared storage Ceph: feature data without structured data is stored in the Ceph disk, and feature data with structured data is stored in the local disk (preset disk). To save local storage resources, all feature data may be stored in the Ceph disk while only hot data (for example, data whose acquisition time is within the last 7 days) is stored in the local disk, where it is stored according to its part position. In local disk storage, pods and local disks are in one-to-one correspondence, so the data of the partitions is spread essentially evenly across the disks.
That is, when the feature files are stored, the pure structured data is stored in a database (preset database), the pure feature data is stored in Ceph storage, and the combined file of the long feature and the structured data is stored in a local disk (preset disk). When the analyzed data enters the system, the part of the data source is appended to the recordID; it is two digits long, and its value depends on which partition the data was actually consumed from. The structured data is stored in the database, the long and short features are stored in a Ceph disk to ensure the data security of the system, and the short codes are stored in memory grouped by significant feature. Finally, because the feature length is fixed, the structured data is appended directly after the long code and the combination is written to a file. After the combination is completed, the combined file is stored according to the part position and the snapshot date. Because local disk data is not shared externally, it is necessary to determine which service each part belongs to. The calculation divides the number of parts by the number of services. If there is no remainder, the parts are distributed evenly: each micro-service is assigned quotient parts, and starting from micro-service 0 the part numbers are 0 to quotient-1, and so on. If the division leaves a remainder, services 0 to (remainder-1) are each assigned (quotient+1) parts, that is, 0 to quotient, quotient+1 to 2×quotient+1, and so on, and each service whose number is greater than or equal to the remainder is assigned the next quotient parts in sequence, until all parts are assigned. Note that the number of parts needs to be larger than the number of services.
Referring to fig. 3, fig. 3 is a flowchart illustrating a data comparison method according to an exemplary embodiment of the present application. As shown in fig. 3, after a query request is received, the short features (short codes) are first filtered according to the significant features and then compared. After the short feature comparison is completed, the ten thousand short features with the highest similarity are obtained. The recordID (preset data identifier) corresponding to each of these short features is then used to determine in which file on which disk each piece of data is located (that is, the corresponding part is found), the long features and structured data are obtained from those files, and after the long codes are compared a result is returned quickly. For example, when a query request or other comparison task is received, the picture is first parsed by a parsing algorithm into structured data (here, the structured data to be compared) and long and short features (the feature data to be compared, and the long code to be compared and short code to be compared generated from it). It is first judged whether the structured data to be compared contains a significant feature (the significant feature to be compared). If it does not, the short code to be compared is compared against all preset short codes to obtain the ten thousand most similar pieces of data. If it does, only the preset short features under the matching significant feature class and under the unknown (ambiguous) class are compared. Taking gender (male, female, unknown) as an example, if the significant feature of the data is male, the male and unknown classes are compared. After the short code comparison is completed, the part where each long code is located is found from the recordID corresponding to each target short code obtained by the comparison.
The disk position is confirmed according to the micro-service to which the part points, and that service reads the data and performs the subsequent long code comparison. After the long codes and the corresponding structured data are loaded into the service, the long codes are compared. When the long code comparison is completed, the comparison results are aggregated and sorted by long code similarity, and the TopK results by similarity are returned as the result data.
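The final aggregation step, in which each micro-service returns its own similarity-ranked hits and the receiving service merges them into one TopK list, might be sketched as below. The `(similarity, record_id)` tuple layout and the function name are assumptions.

```python
import heapq


def merge_top_k(per_service_results, k):
    """Merge per-service lists of (similarity, record_id) into a global top-k list."""
    merged = (hit for hits in per_service_results for hit in hits)
    return heapq.nlargest(k, merged, key=lambda hit: hit[0])


results = [
    [(0.97, "id1"), (0.61, "id4")],  # hits from service 0
    [(0.88, "id2")],                 # hits from service 1
    [(0.93, "id3"), (0.42, "id5")],  # hits from service 2
]
top = merge_top_k(results, 3)
print(top)
```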
In addition, unexpected events may occur. For example, when a disk is damaged or a service drifts, the search result is compensated. On the one hand, the files of the local disk are recombined according to the corresponding part number in Ceph and the data in the database. After the service drifts, a different disk is mounted, and data compensation is needed: the part numbers are confirmed according to the service number, and the long codes on Ceph are then synchronized to the disk. The synchronization is one-to-one, and each long code is written to disk after it has been combined with its structured data in memory. On the other hand, during compensation, the long code is queried from Ceph storage and the structured data is queried from the database. When the environment is abnormal, for example when a pod drifts, the data of the corresponding local disk is lost; after a comparison request is received, the missing data of that part is obtained from Ceph, and the corresponding structured data is obtained from the database. The other, unaffected micro-services use the results computed from their local disks; the results are then aggregated, and the correct TopK data is returned once aggregation is complete.
In the following, the data comparison method is further described by way of example. Consider a business system with a partition number (total partition amount) of 62 and a service number (total service amount) of 3, with gender taken as the significant feature. The example is described as follows:
The correspondence between services and partitions is service 0: 0-20, service 1: 21-41 and service 2: 42-61, and each service (preset micro-service) consumes the data of its corresponding partitions. When automatically acquired data is ingested and analyzed, the structured data (preset structured data) and the feature data (original feature data) are first stored in the database (preset database) and the Ceph shared storage (distributed shared storage) respectively. It is then judged whether a preset significant feature (gender in this example) exists for the preset short code. If it does not exist, the preset short code data and the recordID (preset data identifier) are stored in memory under the ambiguous class; if it exists, the preset short code data and the recordID are stored under the corresponding class. At the same time, the preset structured data is spliced after the preset long code, and the combined data is stored in the local disk, organized by part and date. When the data of a query request is ingested and parsed by the algorithm into the long and short features to be compared and the structured data to be compared, the data is first classified according to the significant feature to be compared, and the short feature files whose significant feature does not match (that is, the preset short code data subsets whose preset significant feature differs from the significant feature to be compared) are filtered out, which reduces the number of comparisons. For example, when the incoming data is male, the short features are compared only against the data whose gender is male or unknown. After the comparison, the data is retrieved: the part of each piece of data is found through the last two digits of its recordID (the preset partition identifier), the corresponding service then retrieves the preset long codes and compares their similarity, and finally the service that received the request aggregates the results and returns the final result data.
This embodiment provides a data comparison method that: obtains data to be compared and generates a long code to be compared and a short code to be compared from it; performs a first comparison between the short code to be compared and a plurality of preset short codes in a preset short code data set to determine a plurality of target short codes; determines a plurality of candidate long codes corresponding to the target short codes based on the preset data identifiers of the target short codes; performs a second comparison between the long code to be compared and the candidate long codes to obtain a plurality of target long codes; and generates an initial comparison result of the data to be compared based on the target structured data associated with the target long codes. The data comparison is thereby split into a short code comparison and a long code comparison.
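The two-stage flow summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not say how the codes are encoded or scored, so the sketch assumes short codes are integers compared by Hamming distance and long codes are float vectors compared by cosine similarity. All names (`compare`, `short_index`, `long_store`, `max_distance`, `min_similarity`) are hypothetical.

```python
import math

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded short codes."""
    return bin(a ^ b).count("1")

def cosine(u, v) -> float:
    """Cosine similarity between two long-code vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def compare(query_short, query_long, short_index, long_store,
            max_distance=2, min_similarity=0.9):
    """short_index: {record_id: short_code} (assumed layout).
    long_store: {record_id: (long_code, structured_data)} (assumed layout)."""
    # First comparison: a cheap scan over short codes selects target record IDs.
    targets = [rid for rid, code in short_index.items()
               if hamming(query_short, code) <= max_distance]
    # Second comparison: only the long codes of those IDs are read and scored.
    results = []
    for rid in targets:
        long_code, structured = long_store[rid]
        score = cosine(query_long, long_code)
        if score >= min_similarity:
            results.append({"record_id": rid, "similarity": score, **structured})
    return sorted(results, key=lambda r: r["similarity"], reverse=True)
```

The point of the split is that the expensive long-code scoring runs only on the records that survive the short-code scan.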
In addition, the method provided by this embodiment reduces the number of preset short code comparisons by storing the data held in memory (the preset short codes) in classes according to salient features. The short codes are first classified by salient feature; at query time, the salient feature extracted from the structured data to be compared selects the classes to search, which narrows the comparison range and addresses the problem of an excessive number of comparisons.
The preset partition identifier in the recordID (preset data identifier) points directly to the partition position, which reduces the file indexing (reading) time. Because the recordID records which file holds the long code, fewer files need to be read when a long code is fetched.
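As an illustration of how a partition identifier embedded in the recordID can route a lookup, the sketch below assumes the recordID is a decimal string whose last two characters are the partition number, and reuses the example ranges given earlier in the description (service 0: 0-20, service 1: 21-41, service 2: 42-61). The function names and the ID layout are assumptions for illustration only.

```python
from bisect import bisect_right

# Partition ranges taken from the example in the description:
# service 0 -> 0-20, service 1 -> 21-41, service 2 -> 42-61.
RANGE_STARTS = [0, 21, 42]
TOTAL_PARTITIONS = 62

def partition_of(record_id: str) -> int:
    """Read the partition identifier from the last two characters of the ID."""
    partition = int(record_id[-2:])
    if not 0 <= partition < TOTAL_PARTITIONS:
        raise ValueError(f"partition {partition} out of range")
    return partition

def service_for(record_id: str) -> int:
    """Return the index of the micro-service that owns the record's partition."""
    return bisect_right(RANGE_STARTS, partition_of(record_id)) - 1
```

Because the owner is computed from the ID alone, no index file has to be consulted before the request is forwarded to the right service.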
Finally, the long code feature and the structured data are stored together (bound one-to-one) on the local disk, so the database does not need to be read after the comparison is completed. This removes the database query step, greatly reduces the load on the database and the read time, and speeds up the comparison capability of the whole scheme. By combining the long code with the structured data, a small amount of storage space is sacrificed in exchange for skipping the query of structured data in the database, which greatly accelerates the return of query results.
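One way to bind a long code to its structured data in a single on-disk record is sketched below. The byte layout (two little-endian 32-bit lengths, then float32 values, then UTF-8 JSON) is an assumption chosen for the example; the source does not describe the actual file format, and `pack_record` and `unpack_record` are hypothetical names.

```python
import json
import struct

def pack_record(long_code, structured) -> bytes:
    """Serialize one long code and its structured data into a single record."""
    payload = json.dumps(structured, ensure_ascii=False).encode("utf-8")
    header = struct.pack("<II", len(long_code), len(payload))
    body = struct.pack(f"<{len(long_code)}f", *long_code)
    return header + body + payload

def unpack_record(blob: bytes):
    """Recover (long_code, structured) from a record written by pack_record."""
    dim, payload_len = struct.unpack_from("<II", blob, 0)
    offset = struct.calcsize("<II")
    long_code = list(struct.unpack_from(f"<{dim}f", blob, offset))
    offset += struct.calcsize(f"<{dim}f")
    structured = json.loads(blob[offset:offset + payload_len].decode("utf-8"))
    return long_code, structured
```

A single read of such a record returns both the vector to score and the attributes to report, which is the reason no database lookup is needed afterwards. The cost is that the structured data is stored twice, once in the database and once next to the long code.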
The method provided by this embodiment reduces the number of feature comparisons on the one hand, and speeds up reading the structured data of the feature comparison results on the other. The short features (preset short codes) are classified in memory, which reduces the number of short-feature comparisons, and the recordID points to the file in which a given long code is located, which speeds up reading the long code file and reduces the file indexing time when long codes are indexed. The long code is spliced with the structured data, so the structured data is obtained directly when the long code is fetched and compared. Because the database query is avoided, the overall processing time is reduced.
Referring to fig. 4, which is a block diagram of a data comparison device according to an exemplary embodiment of the present application, the present embodiment provides a data comparison device 400, which includes:
an acquisition module 401, configured to acquire data to be compared, and generate a long code to be compared and a short code to be compared based on the data to be compared;
a first comparison module 402, configured to compare the short code to be compared with a plurality of preset short codes in a preset short code data set for the first time, and determine a plurality of target short codes;
a candidate long code determining module 403, configured to determine, based on preset data identifiers of the target short codes, a plurality of candidate long codes corresponding to the target short codes, where each of the candidate long codes and the target short codes is associated with the preset data identifier;
a second comparison module 404, configured to perform a second comparison on the long code to be compared and the candidate long code, so as to obtain a plurality of target long codes;
the result generating module 405 is configured to generate an initial comparison result of the data to be compared based on the target structured data associated with each target long code.
In an embodiment, the candidate long code determination module is configured to: determining the position of a target partition according to a preset partition identifier, wherein the preset partition identifier is obtained based on a preset data identifier; and determining a preset disk based on the preset micro-service pointed by the target partition position, and reading the preset disk according to the preset data identifier through the preset micro-service to obtain a plurality of candidate long codes.
In an embodiment, the second comparison module is configured to: performing second comparison on the long codes to be compared and the candidate long codes through a preset micro service to obtain second similarity of each candidate long code; a plurality of target long codes is determined from the plurality of candidate long codes based on the second similarity.
In an embodiment, the candidate long code determination module is further configured to: matching the preset data identifiers against the preset file identifiers of a plurality of preset combined files in the preset disk to obtain a plurality of target combined files matched with the preset data identifiers, wherein each target combined file comprises a preset long code and preset structured data; and determining the preset long code as a candidate long code to obtain a plurality of candidate long codes and the preset structured data associated with the candidate long codes. That is, a preset combined file that matches a preset data identifier is determined as a target combined file.
In an embodiment, the result generation module is configured to: if the candidate long code is determined to be the target long code, determining preset structural data associated with the candidate long code to be the target structural data; generating candidate comparison pair results according to the target structured data and the target long code associated with the target structured data to obtain a plurality of candidate comparison pair results; an initial comparison result is generated based on the plurality of candidate comparison pair results.
In an embodiment, the result generation module is further configured to: if the candidate long code is determined to be the target long code, determining preset structural data associated with the candidate long code to be the target structural data; generating candidate comparison pair results according to the target structured data to obtain a plurality of candidate comparison pair results; an initial comparison result is generated based on the plurality of candidate comparison pair results.
In an embodiment, the candidate long code determination module is further configured to: if the preset conditions are met, the preset conditions comprise at least one of the preset magnetic disk, the preset magnetic disk is damaged, and the preset micro-service is subject to service drift; matching a plurality of candidate long codes from a distributed shared storage according to preset data identifiers, wherein the distributed shared storage stores a plurality of preset long codes and preset data identifiers associated with the preset long codes; and matching the preset data identifiers of the candidate long codes from a preset database to obtain a plurality of preset structured data, wherein the preset database stores the plurality of preset structured data and preset data identifiers associated with the preset structured data.
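The fallback path described above can be sketched with plain dictionaries standing in for the three stores. How a damaged disk or a drifted service is detected is not specified in the source, so the sketch simply assumes the caller passes `None` for an unusable local disk; `load_candidates` and its parameters are hypothetical names.

```python
def load_candidates(record_ids, local_disk, shared_storage, database):
    """Return {record_id: (long_code, structured_data)} for the given IDs.

    local_disk is None when the disk is unusable or the service has moved to
    another node (assumed convention); both slower stores are consulted instead.
    """
    if local_disk is not None:
        # Normal path: one read yields the long code and its structured data.
        return {rid: local_disk[rid] for rid in record_ids if rid in local_disk}
    # Fallback path: long codes from shared storage, structured data from the database.
    return {rid: (shared_storage[rid], database[rid])
            for rid in record_ids
            if rid in shared_storage and rid in database}
```

Both paths are keyed by the same preset data identifier, which is what lets the shared storage and the database reconstruct the pair that the combined file would have supplied.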
In an embodiment, the device further includes a third comparison module, configured to perform a third comparison on the to-be-compared structured data and each target structured data in the initial comparison result, to obtain a third similarity of each target structured data, where the to-be-compared structured data is generated based on the to-be-compared data; and if the third similarity is greater than a preset third similarity threshold, determining the target structured data as an intermediate comparison result.
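A minimal sketch of such a third comparison follows. The source does not define how the third similarity is computed, so the sketch assumes it is the fraction of the query's attributes that a candidate matches exactly; the function names and the default threshold are likewise assumptions.

```python
def structured_similarity(query: dict, candidate: dict) -> float:
    """Fraction of the query's attributes that the candidate matches exactly."""
    if not query:
        return 0.0
    matches = sum(1 for key, value in query.items() if candidate.get(key) == value)
    return matches / len(query)

def third_comparison(query: dict, initial_results: list, threshold: float = 0.5) -> list:
    """Keep the target structured data whose similarity exceeds the threshold."""
    return [item for item in initial_results
            if structured_similarity(query, item) > threshold]
```

The comparison is strict (`>`), matching the "greater than a preset third similarity threshold" wording, so a candidate that scores exactly the threshold is dropped.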
In an embodiment, the device further includes a preset module, configured to obtain a plurality of preset original data, generate original feature data and preset structured data based on the preset original data, and generate a preset short code and a preset long code based on the original feature data; storing preset structured data of each preset original data into a preset database, storing original characteristic data of each preset original data into distributed shared storage, generating a preset short code data set based on preset short codes of each preset original data, and storing the preset short code data set into a preset memory; and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file of each preset original data in a preset magnetic disk.
In an embodiment, the preset module is further configured to: configuring a preset data identifier for each preset original data, wherein the preset data identifier comprises a preset feature identifier and a preset partition identifier, the preset feature identifier is used for distinguishing the preset original data from one another, and the preset partition identifier is used for representing the disk partition position of the preset disk in which the combined file of the preset original data is stored; storing the preset structured data and the preset data identifier of each preset original data into a preset database, storing the original characteristic data and the preset data identifier of each preset original data into distributed shared storage, generating a preset short code data set based on the preset short code and the preset data identifier of each preset original data, and storing the preset short code data set into a preset memory; and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file and the preset data identifier of each preset original data in a preset magnetic disk.
In an embodiment, the preset module is further configured to: before configuring preset data identifiers for each preset original data, acquiring the total service amount of a plurality of preset micro services and the total partition amount of a plurality of preset magnetic disks in a current cluster, wherein the total partition amount is greater than or equal to the total service amount; and determining a reference value according to the partition number and the service number, and determining preset micro-services pointed by the disk partitions of each preset disk, wherein the reference value comprises a quotient value and a remainder value.
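The quotient-and-remainder assignment can be sketched as below. The source does not state which services receive the leftover partitions; giving one extra partition to each of the first `remainder` services is an assumption, chosen because it reproduces the 0-20, 21-41, 42-61 example from the description when 62 partitions are split across 3 services. `assign_partitions` is a hypothetical name.

```python
def assign_partitions(total_partitions: int, total_services: int):
    """Split partitions 0..total_partitions-1 into one contiguous range per service."""
    if total_services <= 0 or total_partitions < total_services:
        raise ValueError("need at least one partition per service")
    quotient, remainder = divmod(total_partitions, total_services)
    ranges, start = [], 0
    for service in range(total_services):
        # The first `remainder` services each take one extra partition.
        size = quotient + (1 if service < remainder else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

For 62 partitions and 3 services the quotient is 20 and the remainder is 2, so the first two services own 21 partitions each and the third owns 20.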
In an embodiment, the preset module is further configured to: determining a salient feature existence state based on preset structured data of each preset original data; if the existence state of the salient features of the preset original data is existence, determining the preset original data as first original data, determining preset salient features based on preset structured data of the first original data, and dividing preset short codes and preset data identifiers of each first original data into at least one preset short code data subset with salient features according to the preset salient features; if the existence state of the salient features of the preset original data is nonexistent, determining the preset original data as second original data, and dividing preset short codes and preset data identifiers of the second original data into preset short code data subsets without the salient features; the preset short code data set is generated based on each preset short code data subset with the obvious characteristic and the preset short code data subset without the obvious characteristic.
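The division into subsets can be sketched as follows, assuming the salient feature is a single attribute of the structured data (gender in the description's example) and that a missing or empty value means the salient feature does not exist. The tuple layout, the `feature_key` parameter and the use of `None` as the key of the no-feature subset are assumptions for illustration.

```python
from collections import defaultdict

NO_FEATURE = None  # key of the subset holding records without a salient feature

def build_short_code_dataset(records, feature_key="gender"):
    """records: iterable of (record_id, short_code, structured_data) tuples."""
    dataset = defaultdict(list)
    for record_id, short_code, structured in records:
        # A missing or empty value means the salient feature does not exist.
        feature = structured.get(feature_key) or NO_FEATURE
        dataset[feature].append((record_id, short_code))
    return dict(dataset)
```

Each subset keeps the short code together with its data identifier, so a hit in the first comparison can be followed straight to the matching long code.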
In an embodiment, the first comparison module is further configured to: generating structural data to be compared based on the data to be compared, and determining significant characteristics to be compared; if the to-be-compared significant features are the same as the preset significant features corresponding to the significant feature preset short code data subsets, determining the preset short codes in the significant feature preset short code data subsets as screened short codes; if the preset short code data subset without the significant features is not empty, determining the preset short codes in the preset short code data subset without the significant features as screened short codes; and comparing the short codes to be compared with the short codes after screening for the first time to determine a plurality of target short codes.
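The screening step can be sketched as below, assuming the preset short code data set is a dictionary keyed by salient feature, with `None` as the key of the subset without a salient feature. The source does not say what happens when the query itself has no salient feature; comparing against every subset in that case is an assumption of this sketch, and `screen_short_codes` is a hypothetical name.

```python
def screen_short_codes(dataset, query_feature):
    """dataset: {salient_feature_or_None: [(record_id, short_code), ...]}."""
    if query_feature is None:
        # Assumption: a query with no salient feature is compared against everything.
        return [entry for entries in dataset.values() for entry in entries]
    screened = []
    # Subset whose preset salient feature equals the query's salient feature.
    screened.extend(dataset.get(query_feature, []))
    # Records without a salient feature could match any query, so keep them too.
    screened.extend(dataset.get(None, []))
    return screened
```

With a male query, for instance, the female subset is never scanned, which is where the reduction in first-comparison work comes from.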
In this embodiment, the device essentially provides a plurality of modules for executing the method of any of the above embodiments; for the specific functions and technical effects, refer to the above embodiments, which are not repeated here.
Referring to fig. 5, an embodiment of the present invention also provides an electronic device 500 comprising a processor 501, a memory 502, and a communication bus 503;
a communication bus 503 is used to connect the processor 501 and the memory 502;
the processor 501 is configured to execute computer programs stored in the memory 502 to implement the methods of one or more of the embodiments described above.
The embodiment of the present invention also provides a computer-readable storage medium, on which a computer program is stored,
the computer program is for causing a computer to perform the method according to any one of the above embodiments.
The embodiment of the present application further provides a non-volatile readable storage medium storing one or more modules (programs); when the one or more modules are applied to a device, the device is caused to execute the instructions of the steps included in Embodiment One of the present application.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the invention. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (16)

1. A data comparison method, the method comprising:
obtaining data to be compared, and generating a long code to be compared and a short code to be compared based on the data to be compared;
performing first comparison on the short codes to be compared and a plurality of preset short codes in a preset short code data set, and determining a plurality of target short codes;
determining a plurality of candidate long codes corresponding to the target short codes based on preset data identifiers of the target short codes, wherein the candidate long codes and the target short codes are associated with the preset data identifiers;
performing second comparison on the long codes to be compared and the candidate long codes to obtain a plurality of target long codes;
and generating an initial comparison result of the data to be compared based on the target structural data associated with each target long code.
2. The data comparison method of claim 1, wherein determining a plurality of candidate long codes corresponding to the target short code based on the preset data identification of each target short code comprises:
determining the target partition position according to a preset partition identifier, wherein the preset partition identifier is obtained based on the preset data identifier;
and determining a preset disk based on a preset micro service pointed by the target partition position, and reading the preset disk according to the preset data identifier through the preset micro service to obtain a plurality of candidate long codes.
3. The data comparison method of claim 2, wherein the second comparison of the long code to be compared with the candidate long code to obtain a plurality of target long codes comprises:
performing second comparison on the to-be-compared long codes and the candidate long codes through the preset micro service to obtain second similarity of the candidate long codes;
a plurality of target long codes are determined from a plurality of the candidate long codes based on the second similarity.
4. The data comparison method of claim 2, wherein reading the preset disk by the preset microservice according to the preset data identifier, obtaining the plurality of candidate long codes comprises:
matching the preset data identifier against the preset data identifiers of a plurality of preset combined files in the preset disk to obtain a plurality of target combined files matched with the preset data identifier, wherein the preset combined files comprise preset long codes and preset structured data;
and determining the preset long code as the candidate long code to obtain a plurality of candidate long codes and preset structural data associated with the candidate long codes.
5. The data comparison method of claim 4, wherein generating the initial comparison result of the data to be compared based on the target structured data associated with each of the target long codes comprises:
if the candidate long code is determined to be a target long code, determining preset structural data associated with the candidate long code as the target structural data;
generating candidate comparison pair results according to the target structured data and the target long code associated with the target structured data to obtain a plurality of candidate comparison pair results; and generating the initial comparison result based on a plurality of candidate comparison pair results.
6. The data comparison method of claim 4, wherein generating the initial comparison result of the data to be compared based on the target structured data associated with each of the target long codes comprises:
if the candidate long code is determined to be the target long code, determining preset structural data associated with the candidate long code to be the target structural data;
generating candidate comparison pair results according to the target structured data to obtain a plurality of candidate comparison pair results; and generating the initial comparison result based on a plurality of candidate comparison pair results.
7. The data comparison method of claim 2, wherein obtaining the plurality of candidate long codes after reading the preset disk according to the preset data identifier by the preset micro service further comprises:
if the preset conditions are met, the preset conditions comprise at least one of the preset magnetic discs, the magnetic discs are damaged, and the preset micro services are subject to service drift;
matching a plurality of candidate long codes from a distributed shared storage according to the preset data identifiers, wherein the distributed shared storage stores a plurality of preset long codes and preset data identifiers associated with the preset long codes;
and matching the preset data identifiers of the candidate long codes to obtain a plurality of preset structured data from a preset database, wherein the preset database stores the plurality of preset structured data and preset data identifiers associated with the preset structured data.
8. The data comparison method of any of claims 1-7, wherein after generating the initial comparison result of the data to be compared based on the target structured data associated with each of the target long codes, the method further comprises:
performing third comparison on the to-be-compared structured data and each target structured data in the initial comparison result to obtain a third similarity of each target structured data, wherein the to-be-compared structured data is generated based on the to-be-compared data;
and if the third similarity is larger than a preset third similarity threshold, determining the target structured data as an intermediate comparison result.
9. The data comparison method according to any of claims 1-7, wherein before performing the first comparison on the short code to be compared and a plurality of preset short codes in a preset short code data set, the method further comprises:
acquiring a plurality of preset original data, generating original characteristic data and preset structured data based on the preset original data, and generating preset short codes and preset long codes based on the original characteristic data;
storing preset structured data of each preset original data into a preset database, storing original characteristic data of each preset original data into distributed shared storage, generating a preset short code data set based on preset short codes of each preset original data, and storing the preset short code data set into a preset memory;
and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file of each preset original data in a preset magnetic disk.
10. The data comparison method of claim 9, wherein after generating the original characteristic data and the preset structured data based on the preset original data and generating the preset short code and the preset long code based on the original characteristic data, the method further comprises:
configuring a preset data identifier for each preset original data, wherein the preset data identifier comprises a preset feature identifier and a preset partition identifier, the preset feature identifier is used for distinguishing the preset original data from one another, and the preset partition identifier is used for representing the disk partition position of the preset disk in which the combined file of the preset original data is stored;
storing preset structured data of each preset original data and the preset data identifier into a preset database, storing original characteristic data of each preset original data and the preset data identifier into distributed shared storage, generating a preset short code data set based on preset short codes of each preset original data and the preset data identifier, and storing the preset short code data set into a preset memory;
and generating a preset combined file based on the preset long code and the preset structured data, and storing the preset combined file of each preset original data and the preset data identifier in a preset magnetic disk.
11. The data comparison method of claim 10, wherein before configuring the preset data identifier for each of the preset raw data, the method further comprises:
acquiring the total service amount of a plurality of preset micro services and the total partition amount of a plurality of preset disks in a current cluster, wherein the total partition amount is greater than or equal to the total service amount;
and determining a reference value according to the partition number and the service number, and determining the preset micro-service pointed by the disk partition of each preset disk, wherein the reference value comprises a quotient value and a remainder value.
12. The data comparison method of claim 10, wherein generating a preset short code dataset based on the preset short codes of each of the preset raw data and the preset data identification comprises:
determining a salient feature existence state based on preset structured data of each preset original data;
if the existence state of the salient features of the preset original data is existence, determining the preset original data as first original data, determining preset salient features based on preset structured data of the first original data, and dividing preset short codes of all the first original data and the preset data identifiers into at least one preset short code data subset with salient features according to the preset salient features;
if the existence state of the salient features of the preset original data is nonexistent, determining the preset original data as second original data, and dividing preset short codes of the second original data and the preset data identifiers into preset short code data subsets without salient features;
and generating the preset short code data set based on each preset short code data subset with the significant characteristic and the preset short code data subset without the significant characteristic.
13. The data comparison method of claim 12, wherein performing the first comparison on the short code to be compared and a plurality of preset short codes in a preset short code data set, and determining a plurality of target short codes comprises:
generating structural data to be compared based on the data to be compared, and determining significant characteristics to be compared;
if the to-be-compared significant features are the same as the preset significant features corresponding to the preset short code data subsets with significant features, determining the preset short codes in the preset short code data subsets with significant features as screened short codes;
if the preset short code data subset without the significant features is not empty, determining the preset short codes in the preset short code data subset without the significant features as screened short codes;
and comparing the short codes to be compared with the short codes after screening for the first time to determine a plurality of target short codes.
14. A data comparison apparatus, the apparatus comprising:
the acquisition module is used for acquiring data to be compared and generating a long code to be compared and a short code to be compared based on the data to be compared;
the first comparison module is used for performing first comparison on the short codes to be compared and a plurality of preset short codes in a preset short code data set, and determining a plurality of target short codes;
the candidate long code determining module is used for determining a plurality of candidate long codes corresponding to the target short codes based on preset data identifiers of the target short codes, wherein the candidate long codes and the target short codes are associated with the preset data identifiers;
the second comparison module is used for carrying out second comparison on the long codes to be compared and the candidate long codes to obtain a plurality of target long codes;
and the result generation module is used for generating an initial comparison result of the data to be compared based on the target structural data associated with each target long code.
15. An electronic device comprising a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-13.
16. A computer-readable storage medium, having a computer program stored thereon,
the computer program for causing the computer to perform the method of any one of claims 1-13.
CN202211543024.2A 2022-12-02 2022-12-02 Data comparison method, device, equipment and storage medium Pending CN116050376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211543024.2A CN116050376A (en) 2022-12-02 2022-12-02 Data comparison method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116050376A (en) 2023-05-02

Family

ID=86112272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211543024.2A Pending CN116050376A (en) 2022-12-02 2022-12-02 Data comparison method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116050376A (en)

Similar Documents

Publication Publication Date Title
CN106657213B (en) File transmission method and device
CN110532347B (en) Log data processing method, device, equipment and storage medium
CN110674360B (en) Tracing method and system for data
RU2568276C2 (en) Method of extracting useful content from mobile application setup files for further computer data processing, particularly search
CN111506608A (en) Method and device for comparing structured texts
US9213759B2 (en) System, apparatus, and method for executing a query including boolean and conditional expressions
CN107590233B (en) File management method and device
CN110196952B (en) Program code search processing method, device, equipment and storage medium
CN110895587A (en) Method and device for determining target user
CN111831750A (en) Block chain data analysis method and device, computer equipment and storage medium
CN111666278B (en) Data storage method, data retrieval method, electronic device and storage medium
CN107943849B (en) Video file retrieval method and device
CN114116811B (en) Log processing method, device, equipment and storage medium
CN114611039B (en) Analysis method and device of asynchronous loading rule, storage medium and electronic equipment
CN116185393A (en) Method, device, equipment, medium and product for generating interface document
CN113204706B (en) Data screening and extracting method and system based on MapReduce
CN116050376A (en) Data comparison method, device, equipment and storage medium
CN112612817B (en) Data processing method, device, terminal equipment and computer readable storage medium
CN114936269A (en) Document searching platform, searching method, device, electronic equipment and storage medium
CN113094415B (en) Data extraction method, data extraction device, computer readable medium and electronic equipment
CN110889017B (en) Retrieval method and terminal for information encrypted through base64
CN109308299B (en) Method and apparatus for searching information
CN110321435B (en) Data source dividing method, device, equipment and storage medium
CN108132971B (en) Analysis method and device for database fragment files
CN111695031A (en) Label-based searching method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination