CN115145991A - Data processing method and system suitable for heterogeneous data

Info

Publication number: CN115145991A
Application number: CN202211059354.4A
Authority: CN
Inventors: 章水鑫; 叶丹青; 杨威
Original assignee: Nanjing Sanbaiyun Information Technology Co ltd
Current assignee: Nanjing Sanbaiyun Information Technology Co ltd
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2022-10-04
Anticipated expiration: 2042-08-31
Also published as: CN115145991B

Abstract

The invention provides a data processing method and system suitable for heterogeneous data. A first data format of the corresponding first heterogeneous file in each heterogeneous data source is extracted; the first dimension set with the highest similarity to the target dimension set is determined as a second dimension set; the target dimension set is compared with the second dimension set to obtain a first dimension difference set, the first data format corresponding to the second dimension set is taken as a second data format, and the second heterogeneous file corresponding to the second data format is processed according to the first dimension difference set to obtain a third heterogeneous file; the target information in the first heterogeneous files is traversed in sequence according to the first dimension difference set and added to the third heterogeneous file, the third heterogeneous file and the first dimension difference set are updated continuously, and once the first dimension difference set is determined to be an empty set, the current third heterogeneous file is taken as the fusion output file in the target generation format.

Description

Data processing method and system suitable for heterogeneous data

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data processing method and system suitable for heterogeneous data.

Background

Heterogeneous data is data with differing structures. In production and daily life there are many kinds of heterogeneous data sources, such as files, relational databases, non-relational databases, web interfaces, and the like.

Taking files as an example of a heterogeneous data source, related files stored by different database entities may have different data structures. When heterogeneous data from several different database entities needs to be accessed and integrated in a unified manner, the heterogeneous data must be processed uniformly.

Disclosure of Invention

The embodiments of the invention provide a data processing method and system suitable for heterogeneous data, which can process various heterogeneous data rapidly according to their different attributes when heterogeneous data from several different database entities needs to be integrated in a unified manner.

In a first aspect of the embodiments of the present invention, a data processing method suitable for heterogeneous data is provided, including:

extracting heterogeneous data acquisition targets and target generation formats in the heterogeneous data processing requests, determining a plurality of heterogeneous data sources according to the heterogeneous data acquisition targets, and extracting a first data format of a corresponding first heterogeneous file in each heterogeneous data source;

acquiring a target dimension set in the target generation format, comparing the first dimension set of each first data format with the target dimension set, and determining the first dimension set with the highest similarity to the target dimension set as a second dimension set;

comparing the target dimension set with the second dimension set to obtain a first dimension difference set, taking a first data format corresponding to the second dimension set as a second data format, and processing a second heterogeneous file corresponding to the second data format according to the first dimension difference set to obtain a third heterogeneous file;

and sequentially traversing the target information corresponding to the first heterogeneous file according to the first dimension difference set, adding the target information into a third heterogeneous file, continuously updating the third heterogeneous file and the first dimension difference set, and judging the current third heterogeneous file to be a fusion output file in a target generation format after judging that the first dimension difference set is an empty set.

Optionally, in a possible implementation manner of the first aspect, the obtaining a target dimension set in the target generation format, comparing the first dimension set of each first data format with the target dimension set, and determining that the first dimension set with the highest similarity to the target dimension set is the second dimension set includes:

counting dimensions included in the first data formats of all heterogeneous data sources to obtain total dimension information, displaying the total dimension information, and selecting at least one dimension in the total dimension information according to selection information input by a user to generate a target dimension set;

comparing the first dimensions included in each first dimension set with the target dimensions in the target dimension set, and determining the same dimension number and the difference dimension number of each first dimension set and the target dimension set;

calculating according to the first dimension number of the first dimension set, the target dimension number of the target dimension set, the same dimension number and the difference dimension number to obtain a similarity quantization value of the first dimension set and the target dimension set;

and determining the first dimension set with the highest similarity quantization value with the target dimension set as a second dimension set.

Optionally, in a possible implementation manner of the first aspect, the method further includes:

determining a first dimension set whose similarity quantization value with the target dimension set is 0 as a third dimension set, and displaying the heterogeneous data source and the third dimensions corresponding to the third dimension set;

if the user selects at least one third dimension, adding the selected third dimension serving as a target dimension to the target dimension set;

and if the user does not select the third dimension, converting the heterogeneous data source into a non-determined heterogeneous data source.

Optionally, in a possible implementation manner of the first aspect, the calculating according to the first dimension number of the first dimension set, the target dimension number of the target dimension set, the same dimension number, and the difference dimension number to obtain a similarity quantization value between the first dimension set and the target dimension set includes:

calculating according to the first dimension number, the target dimension number of the target dimension set and the same dimension number to obtain a similar number ratio, and calculating according to the first dimension number, the target dimension number of the target dimension set and the difference dimension number to obtain a difference number ratio;

respectively carrying out weighted calculation on the similar number ratio and the difference number ratio to obtain a similarity quantization value, the similarity quantization value being calculated by the following formula,

$$S_i = k_a \cdot \frac{a_i}{n_i + m} - k_d \cdot \frac{d_i}{n_i + m}$$

wherein $S_i$ is the similarity quantization value of the $i$-th first dimension set and the target dimension set, $a_i$ is the same dimension number, $n_i$ is the first dimension number of the $i$-th first dimension set, $m$ is the target dimension number of the target dimension set, $k_a$ is the same number weight value, $d_i$ is the difference dimension number, and $k_d$ is the difference number weight value.

displaying the automatically determined second dimension set, and, if it is determined that the user actively adjusts one of the first dimension sets to be the second dimension set, adjusting the automatically determined second dimension set to be a first dimension set;

extracting a first same dimension number and a first difference dimension number of the adjusted first dimension set, and a second same dimension number and a second difference dimension number of the adjusted second dimension set;

if the first same dimension number is larger than the second same dimension number and the first difference dimension number is larger than the second difference dimension number, performing a forward adjustment on the same number weight value;

if the first same dimension number is smaller than the second same dimension number and the first difference dimension number is smaller than the second difference dimension number, performing a negative adjustment on the difference number weight value;

the adjusted same number weight value and the adjusted difference number weight value are calculated by formulas (not reproduced in this text) whose quantities are: the adjusted same number weight value, the forward adjustment base, the first adjustment constant, the number of forward adjustments, the adjusted difference number weight value, the negative adjustment base, the second adjustment constant, and the number of negative adjustments.

Optionally, in a possible implementation manner of the first aspect, the comparing the target dimension set with the second dimension set to obtain a first dimension difference set, taking a first data format corresponding to the second dimension set as a second data format, and processing a second heterogeneous file corresponding to the second data format according to the first dimension difference set to obtain a third heterogeneous file includes:

comparing the target dimension set with the second dimension set to obtain a first dimension difference set, wherein the first dimension difference set comprises a first difference subset and a second difference subset, the first difference subset has a first difference dimension, and the second difference subset has a second difference dimension;

the first difference dimension is a dimension which is in the second dimension set and is not in the target dimension set, and the second difference dimension is a dimension which is in the target dimension set and is not in the second dimension set;

deleting the dimension items and the information corresponding to the first difference dimension in the second heterogeneous file;

and establishing corresponding dimension items in the second heterogeneous file according to the second difference dimension to obtain a third heterogeneous file.
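The construction of the third heterogeneous file from the two difference subsets can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a file can be modelled as a list of dictionaries keyed by dimension name, and the function name `build_third_file` and the sample values are hypothetical.

```python
def build_third_file(second_file, second_dims, target_dims):
    """Derive the third file from the second (reference) file.

    Assumption: a "file" is a list of dict rows keyed by dimension name.
    """
    # First difference subset: in the second dimension set, not in the target.
    first_diff = second_dims - target_dims
    # Second difference subset: in the target, not in the second dimension set.
    second_diff = target_dims - second_dims

    third_file = []
    for row in second_file:
        # Delete the dimension items (and their information) of the first subset.
        new_row = {k: v for k, v in row.items() if k not in first_diff}
        # Establish empty dimension items for the second subset.
        for dim in second_diff:
            new_row[dim] = None
        third_file.append(new_row)
    return third_file, first_diff, second_diff


second_file = [
    {"phone": "555-0100", "address": "1 Main St", "height": 170, "gender": "F"}
]
second_dims = {"phone", "address", "height", "gender"}
target_dims = {"phone", "address", "payroll"}

third_file, first_diff, second_diff = build_third_file(
    second_file, second_dims, target_dims
)
```

Here the height and gender items are dropped and an empty payroll item is created, so the third file already has the shape of the target dimension set before any information is migrated into it.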

Optionally, in a possible implementation manner of the first aspect, the creating, according to the second difference dimension, a corresponding dimension item in the second heterogeneous file to obtain a third heterogeneous file includes:

if it is determined that the user has input a main body selection instruction, determining all dimension main bodies corresponding to the third heterogeneous file, and displaying the dimension main bodies enlarged according to a first preset display mode;

establishing a perspective selection superposition layer corresponding to the third heterogeneous file, determining a layer area corresponding to each dimension main body in the perspective selection superposition layer, acquiring a layer outline of the corresponding layer area, and displaying pixel points corresponding to the layer outline according to preset pixel values;

counting the trigger traces of the user in the layer areas, acquiring the trigger trace proportion of each triggered layer area, and selecting the dimension main body corresponding to each layer area whose trigger trace proportion is larger than the preset proportion as a dimension main body reserved in the third heterogeneous file.

Optionally, in a possible implementation manner of the first aspect, the establishing a perspective selection superposition layer corresponding to the third heterogeneous file, determining a layer area corresponding to each dimension main body in the perspective selection superposition layer, acquiring a layer outline of the corresponding layer area, and displaying pixel points corresponding to the layer outline according to preset pixel values includes:

performing imaging processing on the third heterogeneous file to obtain a third heterogeneous image, and superimposing the perspective selection superposition layer on the third heterogeneous image;

performing coordinate processing on the third heterogeneous image and the perspective selection superposition layer so that corresponding first pixel points of the third heterogeneous image and second pixel points of the perspective selection superposition layer have the same coordinates;

determining the corresponding layer area according to the coordinates of the first pixel points corresponding to the dimension main body, determining the layer outline of the layer area, and controlling the layer outline to be displayed according to the preset pixel value.
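The selection of dimension main bodies from trigger traces can be illustrated with a short sketch. The text does not define how the trigger trace proportion is computed, so the sketch assumes it is the share of the user's trigger points that fall inside a layer area, and it assumes rectangular layer areas given in pixel coordinates. The function name `select_dimension_bodies` and the default `preset_ratio` are hypothetical.

```python
def select_dimension_bodies(regions, trigger_points, preset_ratio=0.5):
    """Return the dimension main bodies whose layer area was triggered enough.

    regions: name -> (x0, y0, x1, y1), a rectangle with x1 and y1 exclusive
             (assumed shape; the text does not fix one).
    trigger_points: list of (x, y) pixel coordinates touched by the user.
    """
    counts = {name: 0 for name in regions}
    for x, y in trigger_points:
        for name, (x0, y0, x1, y1) in regions.items():
            if x0 <= x < x1 and y0 <= y < y1:
                counts[name] += 1
                break  # a point is attributed to at most one layer area

    total = len(trigger_points)
    if total == 0:
        return []
    # Assumed definition: proportion = points in the area / all trigger points.
    return [name for name, hit in counts.items() if hit / total > preset_ratio]


regions = {"phone": (0, 0, 100, 20), "address": (0, 20, 100, 40)}
trace = [(10, 5), (20, 8), (30, 10), (40, 22)]
selected = select_dimension_bodies(regions, trace)
```

In this example one of the four trigger points strays into the neighbouring area, yet only the phone area exceeds the preset proportion, which is the kind of error tolerance the threshold is meant to provide.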

Optionally, in a possible implementation manner of the first aspect, the sequentially traversing, according to the first dimension difference set, the corresponding target information of the first heterogeneous file, adding the target information to a third heterogeneous file, continuously updating the third heterogeneous file and the first dimension difference set, and after determining that the first dimension difference set is an empty set, determining that the current third heterogeneous file is a fusion output file in a target generation format includes:

traversing each first heterogeneous file according to the second difference dimension to obtain target information corresponding to the first heterogeneous file, and adding the target information into a third heterogeneous file to obtain an updated third heterogeneous file;

acquiring a current adding dimension corresponding to the current added target information, and deleting a second difference dimension corresponding to the current adding dimension from the second difference subset to obtain an updated second difference subset;

and continuously detecting the number of the second difference dimensions in the second difference subset, and judging that the first dimension difference set is an empty set when judging that the number of the second difference dimensions is 0.
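The traversal loop above can be sketched as follows, under stated assumptions: files are lists of dictionaries, and records in different files are matched through a shared key, here the hypothetical field `record_id`, which the text does not specify. The function name `fuse` is likewise illustrative.

```python
def fuse(third_file, first_files, second_diff, key="record_id"):
    """Fill the third file until the second difference subset is empty.

    Assumption: rows of different files are matched on a shared `key` field.
    The third file is updated in place and also returned.
    """
    remaining = set(second_diff)
    index = {row[key]: row for row in third_file}

    for source_rows in first_files.values():
        # Dimensions this source can supply among those still missing.
        available = remaining & {dim for row in source_rows for dim in row}
        for dim in available:
            for row in source_rows:
                target_row = index.get(row.get(key))
                if target_row is not None and dim in row:
                    target_row[dim] = row[dim]
            # The added dimension is removed from the second difference subset.
            remaining.discard(dim)
        if not remaining:
            break  # empty difference set: the third file is the fusion output
    return third_file, remaining


third_file = [
    {"record_id": 1, "phone": "555-0100", "address": "1 Main St", "payroll": None}
]
first_files = {"file_2": [{"record_id": 1, "payroll": 5000, "tax": 500}]}

fused, remaining = fuse(third_file, first_files, {"payroll"})
```

When `remaining` comes back empty, the current third file corresponds to the fusion output file; a non-empty result means some target dimension was not found in any source.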

In a second aspect of the embodiments of the present invention, a data processing system suitable for heterogeneous data is provided, including:

the extraction module is used for extracting heterogeneous data acquisition targets and target generation formats in the heterogeneous data processing requests, determining a plurality of heterogeneous data sources according to the heterogeneous data acquisition targets, and extracting a first data format of a corresponding first heterogeneous file in each heterogeneous data source;

the acquisition module is used for acquiring a target dimension set in the target generation format, comparing the first dimension set of each first data format with the target dimension set, and determining the first dimension set with the highest similarity to the target dimension set as a second dimension set;

the comparison module is used for comparing the target dimension set with the second dimension set to obtain a first dimension difference set, taking a first data format corresponding to the second dimension set as a second data format, and processing a second heterogeneous file corresponding to the second data format according to the first dimension difference set to obtain a third heterogeneous file;

and the updating module is used for sequentially traversing the corresponding target information of the first heterogeneous file according to the first dimension difference set, adding the target information into a third heterogeneous file, continuously updating the third heterogeneous file and the first dimension difference set, and judging the current third heterogeneous file to be a fusion output file in a target generation format after judging that the first dimension difference set is an empty set.

Beneficial effects:

1. When integrating the required data, the scheme first determines the heterogeneous data sources and then obtains the similarity quantization value between the dimension set of each heterogeneous data source and the target dimension set. The first dimension set with the highest similarity quantization value with the target dimension set is determined as the second dimension set, and the heterogeneous data sources are thereby divided into two types according to the similarity quantization values: one is the reference file, from which useless information is deleted; the other is the files to be integrated, from which the required information is extracted and added to the reference file. This reduces the data processing amount during data integration and improves the integration efficiency. When heterogeneous data from several different database entities needs to be integrated in a unified manner, the scheme can process various heterogeneous data rapidly according to their different attributes.

2. When the similarity quantization value is calculated, the scheme considers not only the number of dimensions that the first dimension set and the target dimension set have in common but also the number in which they differ. The same number is considered in order to reduce the amount of data migrated, and the difference number is considered in order to reduce the amount of data deleted, both of which reduce the data processing amount during integration. By combining the two, the scheme obtains a better similarity measure that takes several factors into account, reducing the data processing amount and improving the integration efficiency. In addition, the same number weight value and the difference number weight value in the calculation model are adjusted according to adjustment information actively input by the user, so that the next similarity quantization value better matches the user's requirements.

3. Data integration usually involves a large amount of data, and the perspective selection superposition layer allows the information selected by the user to be displayed enlarged so that the user can view it clearly. In addition, the area the user intends to select can be identified accurately from the number of triggered coordinates. Because the user's triggering of an area is subject to some error, the scheme leaves room for that error: the corresponding dimension main body is determined when the trigger trace proportion is larger than the preset proportion.

Drawings

FIG. 1 is a schematic flowchart of a data processing method suitable for heterogeneous data according to an embodiment of the present invention;

FIG. 2 is a block diagram of a data processing system suitable for heterogeneous data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, which is a schematic flowchart of a data processing method suitable for heterogeneous data according to an embodiment of the present invention, the execution subject of the method shown in FIG. 1 may be a software and/or hardware device. The execution subject of the present application may include, but is not limited to, at least one of the following: user equipment, network equipment, etc. The user equipment may include, but is not limited to, a computer, a smart phone, a Personal Digital Assistant (PDA), the above-mentioned electronic equipment, and the like. The network equipment may include, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud consisting of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a group of loosely coupled computers forms one super virtual computer. This embodiment does not limit this. The method comprises the following steps S1 to S4:

S1, extracting heterogeneous data acquisition targets and target generation formats in a heterogeneous data processing request, determining a plurality of heterogeneous data sources according to the heterogeneous data acquisition targets, and extracting a first data format of a corresponding first heterogeneous file in each heterogeneous data source.

It can be understood that when heterogeneous data from several different database entities needs to be integrated in a unified manner, a user may input a heterogeneous data processing request. On receiving the request, the server parses it to obtain the heterogeneous data acquisition target and the target generation format.

The heterogeneous data acquisition target may refer to the data sources from which data needs to be acquired, so the scheme can use it to determine a plurality of heterogeneous data sources, for example file 1 (e.g., a personnel file), file 2 (e.g., a financial file), file 3 (e.g., a legal file), and file 4 (e.g., a payroll file). After the heterogeneous data sources are determined, the scheme extracts the first data format of the corresponding first heterogeneous file in each heterogeneous data source. The first data format may refer to, for example, the dimension information in file 1 (e.g., a personnel file), such as a telephone dimension, an address dimension, a height dimension, and a gender dimension, or the dimension information in file 2 (e.g., a financial file), such as a payroll dimension, a tax dimension, and a profit dimension.
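As an illustration of extracting the first data format, the sketch below reads the dimension names of two differently structured files. The concrete formats (CSV and JSON), the function names, and the sample contents are assumptions made for the example; the text only states that the first data format carries the dimension information of each file.

```python
import csv
import io
import json


def dimensions_from_csv(text):
    """Treat the CSV header row as the file's dimension names."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    return {name.strip() for name in header if name.strip()}


def dimensions_from_json(text):
    """Treat the keys of each JSON record as dimension names."""
    records = json.loads(text)
    if isinstance(records, dict):
        records = [records]
    dims = set()
    for record in records:
        dims.update(record.keys())
    return dims


# Hypothetical sample contents for file 1 (personnel) and file 2 (finance).
personnel_csv = "phone,address,height,gender\n555-0100,1 Main St,170,F\n"
finance_json = '[{"payroll": 5000, "tax": 500, "profit": 1200}]'

first_dimension_sets = {
    "file_1": dimensions_from_csv(personnel_csv),
    "file_2": dimensions_from_json(finance_json),
}
```

Reducing every source to a plain set of dimension names is a design choice of the sketch: it lets the later comparison steps work the same way regardless of how each source stores its data.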

S2, acquiring a target dimension set in the target generation format, comparing the first dimension set of each first data format with the target dimension set, and determining the first dimension set with the highest similarity to the target dimension set as a second dimension set.

After the target generation format is determined, a target dimension set in the target generation format is obtained, where the target dimension set is, for example, { telephone dimension, address dimension, and payroll dimension }.

It will be appreciated that each data source has a first data format corresponding to its first heterogeneous file and therefore a corresponding first dimension set. For example, the first dimension set of file 1 (e.g., a personnel file) may be { telephone dimension, address dimension, height dimension, gender dimension }, and that of file 2 (e.g., a financial file) may be { payroll dimension, tax dimension, profit dimension }.

According to the scheme, after the target dimension set and the first dimension set are obtained, the target dimension set and the first dimension set are compared to obtain the similarity, and then the first dimension set with the highest similarity with the target dimension set is determined to be the second dimension set.

For example, after comparison, the first dimension set corresponding to the file 1 is found to have the highest similarity with the target dimension set, and then the first dimension set corresponding to the file 1 is used as the second dimension set.

In some embodiments, S2 (the obtaining a target dimension set in the target generation format, comparing the first dimension set of each first data format with the target dimension set, and determining that the first dimension set with the highest similarity to the target dimension set is the second dimension set) includes S21-S24:

S21, counting dimensions included in the first data formats of all the heterogeneous data sources to obtain total dimension information, displaying the total dimension information, and selecting at least one dimension in the total dimension information according to selection information input by a user to generate a target dimension set.

For example, the total dimension information for file 1 may comprise 10 dimensions. After counting is completed, the total dimension information is shown on a display device, where the user can view the 10 dimensions and select those to be integrated, for example { telephone dimension, address dimension }. For file 2, the total dimension information may comprise 5 dimensions; after counting is completed, these are likewise shown on the display device, and the user can select the dimensions to be integrated, for example { payroll dimension }. The target dimension set finally obtained is { telephone dimension, address dimension, payroll dimension }.
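Step S21 can be sketched as a union followed by a validated selection. The function names and the shortened dimension names are hypothetical, and the user's selection is represented by a plain list rather than real display input.

```python
def total_dimension_info(first_dimension_sets):
    """Union of every source's dimensions, sorted for a stable display order."""
    total = set()
    for dims in first_dimension_sets.values():
        total |= dims
    return sorted(total)


def build_target_dimension_set(total, selected):
    """Keep only selections present in the displayed total; require at least one."""
    target = {dim for dim in selected if dim in total}
    if not target:
        raise ValueError("at least one known dimension must be selected")
    return target


first_dimension_sets = {
    "file_1": {"phone", "address", "height", "gender"},
    "file_2": {"payroll", "tax", "profit"},
}
total = total_dimension_info(first_dimension_sets)
# Stand-in for the selection information input by the user.
target_dims = build_target_dimension_set(total, ["phone", "address", "payroll"])
```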

S22, comparing the first dimensions included in each first dimension set with the target dimensions in the target dimension set, and determining the same dimension number and the difference dimension number of each first dimension set and the target dimension set.

For example, the first dimension set corresponding to file 1 includes 10 first dimensions and the first dimension set corresponding to file 2 includes 5 first dimensions. The scheme compares the first dimensions of file 1 and of file 2 with the target dimensions in the target dimension set and determines the same dimension number and the difference dimension number of each first dimension set with respect to the target dimension set.
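The two counts of step S22 map directly onto set operations. One interpretation is assumed here: the difference dimension number is taken to be the number of dimensions in the first dimension set that are absent from the target dimension set, since the text associates it with the amount of data to delete. The function name is hypothetical.

```python
def count_same_and_difference(first_dims, target_dims):
    """Return (same dimension number, difference dimension number).

    Assumption: "difference" counts dimensions of the first set that are
    not in the target set, i.e. those that would have to be deleted.
    """
    same = len(first_dims & target_dims)
    difference = len(first_dims - target_dims)
    return same, difference


target_dims = {"phone", "address", "payroll"}
file_1_counts = count_same_and_difference(
    {"phone", "address", "height", "gender"}, target_dims
)
file_2_counts = count_same_and_difference({"payroll", "tax", "profit"}, target_dims)
```

With the four-dimension and three-dimension example sets used earlier in the description, file 1 shares two dimensions with the target and differs in two, while file 2 shares one and differs in two.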

S23, calculating according to the first dimension number of the first dimension set, the target dimension number of the target dimension set, the same dimension number and the difference dimension number to obtain a similarity quantization value of the first dimension set and the target dimension set.

In some embodiments, S23 (the obtaining of the similarity quantization value between the first dimension set and the target dimension set by performing the calculation according to the first dimension number of the first dimension set, the target dimension number of the target dimension set, the same dimension number, and the difference dimension number) includes S231-S232:

S231, calculating according to the first dimension number, the target dimension number of the target dimension set and the same dimension number to obtain a similar number ratio, and calculating according to the first dimension number, the target dimension number of the target dimension set and the difference dimension number to obtain a difference number ratio.

The scheme obtains the first dimension number, the target dimension number of the target dimension set and the same dimension number and calculates the similar number ratio from them; it likewise obtains the first dimension number, the target dimension number and the difference dimension number and calculates the difference number ratio.

S232, respectively carrying out weighted calculation on the similar number ratio and the difference number ratio to obtain a similarity quantization value, the similarity quantization value being calculated by the following formula,

$$S_i = k_a \cdot \frac{a_i}{n_i + m} - k_d \cdot \frac{d_i}{n_i + m}$$

wherein $S_i$ is the similarity quantization value of the $i$-th first dimension set and the target dimension set, $a_i$ is the same dimension number, $n_i$ is the first dimension number of the $i$-th first dimension set, $m$ is the target dimension number of the target dimension set, $k_a$ is the same number weight value, $d_i$ is the difference dimension number, and $k_d$ is the difference number weight value.

In the above formula, $n_i + m$ represents the sum of the first dimension number of the $i$-th first dimension set and the target dimension number of the target dimension set; $\frac{a_i}{n_i + m}$ represents the similar number ratio, and the larger the same dimension number $a_i$, the larger the corresponding similar number ratio; $\frac{d_i}{n_i + m}$ represents the difference number ratio, and the larger the difference dimension number $d_i$, the larger the corresponding difference number ratio. Finally, the similarity quantization value is calculated comprehensively from the same number weight value $k_a$ and the difference number weight value $k_d$. When the similarity quantization value $S_i$ of the $i$-th first dimension set and the target dimension set is less than 0, the scheme may assign $S_i$ a fixed value, for example 0.

It should be noted that the same number weight value $k_a$ in the above formula is greater than the difference number weight value $k_d$. The purpose is to give the similar number ratio a larger share in the result and to increase its weight as a reference.
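The weighted calculation can be sketched as a small function. Two points are assumptions of the sketch rather than statements of the text: the weighted difference number ratio is subtracted from the weighted similar number ratio (inferred from the remark that the value can fall below 0), and the weights 0.7 and 0.3 are illustrative values chosen only so that the same number weight exceeds the difference number weight.

```python
def similarity_value(n_first, n_target, same, difference, w_same=0.7, w_diff=0.3):
    """Similarity quantization value of one first dimension set.

    Assumed form: w_same * same / (n_first + n_target)
                  - w_diff * difference / (n_first + n_target),
    with negative results replaced by the fixed value 0.
    """
    total = n_first + n_target
    if total == 0:
        return 0.0
    value = w_same * same / total - w_diff * difference / total
    return max(value, 0.0)


# File 1: 4 dimensions, 2 shared with a 3-dimension target, 2 different.
s_file_1 = similarity_value(4, 3, same=2, difference=2)
# File 2: 3 dimensions, 1 shared, 2 different.
s_file_2 = similarity_value(3, 3, same=1, difference=2)
# A set sharing nothing with the target is clamped to 0.
s_unrelated = similarity_value(3, 3, same=0, difference=3)
```

Under these illustrative weights file 1 scores higher than file 2, which matches the example in which the first dimension set of file 1 becomes the second dimension set.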

It should be noted that the scheme considers not only the number of dimensions that the first dimension set and the target dimension set have in common but also the number in which they differ. The same number is considered in order to reduce the amount of data migrated, and the difference number is considered in order to reduce the amount of data deleted, both of which reduce the data processing amount during integration. By combining the two, the scheme obtains a better similarity measure that takes several factors into account, reducing the data processing amount and improving the integration efficiency.

S24, determining the first dimension set with the highest similarity quantization value with the target dimension set as a second dimension set.

It can be understood that, after the quantized value of the similarity with the target dimension set is obtained, the present scheme may determine the first dimension set with the highest quantized value of the similarity with the target dimension set as the second dimension set. When data integration is performed, the data processing amount can be reduced, and the integration efficiency can be improved.
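Selecting the second dimension set then amounts to taking the source with the highest value. The sketch below is self-contained and repeats the scoring under the same assumptions as before (subtractive form, clamping at 0, illustrative weights 0.7 and 0.3). The tie-breaking rule, first source encountered wins, is also an assumption, since the text does not address ties.

```python
from typing import Optional, Tuple


def pick_second_dimension_set(
    first_dimension_sets, target_dims, w_same=0.7, w_diff=0.3
) -> Tuple[Optional[str], float]:
    """Return (source name, similarity value) of the best-matching source."""
    best_source, best_value = None, -1.0
    for source, dims in first_dimension_sets.items():
        same = len(dims & target_dims)
        difference = len(dims - target_dims)
        total = len(dims) + len(target_dims)
        value = max((w_same * same - w_diff * difference) / total, 0.0) if total else 0.0
        # Strict comparison: on a tie the earlier source is kept (assumption).
        if value > best_value:
            best_source, best_value = source, value
    return best_source, best_value


first_dimension_sets = {
    "file_1": {"phone", "address", "height", "gender"},
    "file_2": {"payroll", "tax", "profit"},
}
best_source, best_value = pick_second_dimension_set(
    first_dimension_sets, {"phone", "address", "payroll"}
)
```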

On the basis of the above embodiment, the present solution further includes S25 to S27:

S25, determining each first dimension set whose similarity quantization value with respect to the target dimension set is 0 as a third dimension set, and displaying the heterogeneous data source and the third dimensions corresponding to the third dimension set.

It can be understood that there may be a plurality of heterogeneous data sources and a plurality of corresponding first dimension sets. In actual operation two cases may arise: in the first, after the heterogeneous data sources are determined, the user has no need for the data in a given heterogeneous data source and therefore selects none of it; in the second, the user has overlooked data in that heterogeneous data source. In either case, if the corresponding first dimension set has no dimension in common with the target dimension set, its similarity quantization value with respect to the target dimension set is 0.

According to the scheme, each first dimension set whose similarity quantization value with respect to the target dimension set is 0 is determined to be a third dimension set, and the corresponding heterogeneous data source and third dimensions are displayed so that the user can review them.

S26, if the user selects at least one third dimension, adding the selected third dimension as a target dimension to the target dimension set.

It can be understood that if the user overlooked the data, then after being prompted the user may reselect: the user selects at least one third dimension, and the selected third dimension is added to the target dimension set as a target dimension, which improves the accuracy of the target dimension set.

S27, if the user does not select any third dimension, converting the heterogeneous data source into a non-determined heterogeneous data source.

It can be understood that if the user confirms that nothing was overlooked and does not want to select any of the corresponding third dimensions, the scheme converts the heterogeneous data source into a non-determined heterogeneous data source.
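Steps S25 to S27 can be illustrated with a minimal sketch. It assumes the similarity quantization values have already been computed per data source and that the user's choices arrive as a mapping from source name to selected dimensions; the function name and both data shapes are hypothetical.

```python
def handle_zero_similarity(
    scores: dict[str, float],
    first_sets: dict[str, set[str]],
    target_dims: set[str],
    user_selection: dict[str, set[str]],
) -> tuple[set[str], set[str]]:
    """Return (updated target dimension set, non-determined data sources)."""
    updated_target = set(target_dims)
    non_determined: set[str] = set()
    for source, score in scores.items():
        if score != 0:
            continue
        # S25: this first dimension set becomes a third dimension set
        # and would be displayed to the user at this point.
        third_dims = first_sets[source]
        chosen = user_selection.get(source, set()) & third_dims
        if chosen:
            # S26: selected third dimensions become target dimensions.
            updated_target |= chosen
        else:
            # S27: nothing selected, so the source is no longer determined.
            non_determined.add(source)
    return updated_target, non_determined
```

Because S26 enlarges the target dimension set, the similarity quantization values would presumably need to be recalculated afterwards; the text does not state this explicitly.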

On the basis of the above embodiment, the present solution further includes S281-S284:

S281, displaying the automatically determined second dimension set; if it is judged that the user actively adjusts one of the first dimension sets to be the second dimension set, adjusting the automatically determined second dimension set to be a first dimension set.

It can be understood that, after the second dimension set is determined automatically, the scheme displays it. If the user considers that the automatically determined second dimension set does not meet the requirements, the user actively adjusts one of the first dimension sets to be the second dimension set, and the automatically determined second dimension set is adjusted to be a first dimension set.

For example, take first dimension set 1 and first dimension set 2. After the similarity is calculated as in the foregoing embodiment, the similarity quantization value of first dimension set 1 is greater than that of first dimension set 2, so the scheme automatically determines first dimension set 1 as the second dimension set, while first dimension set 2 remains a first dimension set. If the user considers that this result does not meet the requirements, the user actively adjusts first dimension set 2 to be the second dimension set and first dimension set 1 to be a first dimension set.

S282, extracting a first same dimension number and a first difference dimension number in the adjusted first dimension set, and a second same dimension number and a second difference dimension number in the adjusted second dimension set.

According to the scheme, after the user actively adjusts, the first same dimension number and the first difference dimension number in the adjusted first dimension set and the second same dimension number and the second difference dimension number in the adjusted second dimension set can be obtained.

S283, if the first same dimension number is greater than the second same dimension number, and the first difference dimension number is greater than the second difference dimension number, performing a forward adjustment on the same-quantity weight value.

It can be understood that, if the first same dimension number is greater than the second same dimension number and the first difference dimension number is greater than the second difference dimension number, the reason the user actively selected the adjusted first dimension set is that its same dimension number is larger. In other words, the user regards the same dimension number as a more important reference than the difference dimension number. The same-quantity weight value therefore needs to be adjusted forward, so that subsequent calculations of the similarity quantization value place more emphasis on the same dimension number and the calculated value better suits the current calculation scenario.

S284, if the first same dimension number is smaller than the second same dimension number, and the first difference dimension number is smaller than the second difference dimension number, performing a negative adjustment on the difference-quantity weight value.

It can be understood that, if the first same dimension number is smaller than the second same dimension number and the first difference dimension number is smaller than the second difference dimension number, the reason the user actively selected the adjusted first dimension set is that its difference dimension number is smaller. In other words, the user regards the difference dimension number as a more important reference than the same dimension number. The difference-quantity weight value therefore needs to be adjusted negatively, so that subsequent calculations of the similarity quantization value reflect the user's emphasis on the difference dimension number and the calculated value better suits the current calculation scenario.

wherein the quantities in the weight adjustment formulas are, in order: the adjusted same-quantity weight value, the forward adjustment base, the first adjustment constant, and the number of forward adjustments; followed by the adjusted difference-quantity weight value, the negative adjustment base, the second adjustment constant, and the number of negative adjustments.

In the above formulas, the forward adjustment amplitude is applied to the same-quantity weight value so that it becomes larger, and the negative adjustment amplitude is applied to the difference-quantity weight value so that it becomes smaller. In an actual application scenario, the same-quantity weight value or the difference-quantity weight value may be adjusted many times. The adjustment amplitude is therefore gradually reduced as the number of adjustments increases, which prevents an excessive number of adjustments from driving the adjusted weight values too far from their initial values. By gradually slowing the adjustment of both weight values, the scheme avoids weight values that become too large or too small after repeated adjustment, and thus preserves the accuracy of the calculated similarity quantization value.
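The feedback rule of S283 and S284 can be sketched as below. The adjustment formulas themselves are not reproduced in this text; the sketch assumes an amplitude of base / (constant + adjustment count), which is one simple form that shrinks as the count grows, as the description requires. The default values and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WeightState:
    same: float = 0.7        # same-quantity weight value (assumed initial value)
    difference: float = 0.3  # difference-quantity weight value (assumed initial value)
    forward_count: int = 0   # number of forward adjustments so far
    negative_count: int = 0  # number of negative adjustments so far


def adjust_weights(state: WeightState,
                   first_same: int, first_diff: int,
                   second_same: int, second_diff: int,
                   forward_base: float = 0.1, negative_base: float = 0.1,
                   c1: float = 1.0, c2: float = 1.0) -> WeightState:
    """Apply one user-feedback adjustment (assumed decaying amplitude)."""
    if first_same > second_same and first_diff > second_diff:
        # S283: forward adjustment enlarges the same-quantity weight value.
        state.forward_count += 1
        state.same += forward_base / (c1 + state.forward_count)
    elif first_same < second_same and first_diff < second_diff:
        # S284: negative adjustment shrinks the difference-quantity weight value.
        state.negative_count += 1
        state.difference -= negative_base / (c2 + state.negative_count)
    # Mixed cases leave both weights unchanged.
    return state
```

Under these assumed defaults, the first forward adjustment adds 0.05 and the second adds about 0.033, so repeated feedback moves the weights by progressively smaller steps.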

And S3, comparing the target dimension set with the second dimension set to obtain a first dimension difference set, taking a first data format corresponding to the second dimension set as a second data format, and processing a second heterogeneous file corresponding to the second data format according to the first dimension difference set to obtain a third heterogeneous file.

According to the scheme, after the second dimension set is obtained, the target dimension set is compared with the second dimension set to obtain the first dimension difference set.

According to the scheme, the first data format corresponding to the second dimension set is used as the second data format, and then the second heterogeneous file corresponding to the second data format is processed by utilizing the first dimension difference set to obtain the third heterogeneous file.

In some embodiments, S3 (the comparing the target dimension set with the second dimension set to obtain a first dimension difference set, taking a first data format corresponding to the second dimension set as a second data format, and processing a second heterogeneous file corresponding to the second data format according to the first dimension difference set to obtain a third heterogeneous file) includes S31 to S34:

S31, comparing the target dimension set with the second dimension set to obtain a first dimension difference set, wherein the first dimension difference set comprises a first difference subset and a second difference subset, the first difference subset has a first difference dimension, and the second difference subset has a second difference dimension.

According to the scheme, the target dimension set is compared with the second dimension set to obtain a difference result, and then the difference result is utilized to obtain the first dimension difference set. The first set of dimension differences for the present scheme includes a first subset of differences and a second subset of differences.

And S32, the first difference dimension is a dimension which is in the second dimension set and is not in the target dimension set, and the second difference dimension is a dimension which is in the target dimension set and is not in the second dimension set.

The first difference subset corresponds to a plurality of first difference dimensions, and the first difference dimensions are dimensions which are contained in the second dimension set but not contained in the target dimension set; the second difference subset corresponds to a plurality of second difference dimensions, and the second difference dimensions are dimensions which are contained in the target dimension set and are not contained in the second dimension set.

It can be understood that, in the present solution, the second dimension set and the target dimension set are respectively used as references to find corresponding difference dimensions to determine a corresponding first difference subset and a corresponding second difference subset.

And S33, deleting the dimension items and the information corresponding to the first difference dimension in the second heterogeneous file.

It can be understood that the first difference dimension is a dimension which is in the second dimension set but is not in the target dimension set, and when data is integrated, data corresponding to the first difference dimension is data which is not needed by a user, so that dimension items and information corresponding to the first difference dimension in the second heterogeneous file need to be deleted, and useless information needs to be removed.

And S34, establishing corresponding dimension items in the second heterogeneous file according to the second difference dimension to obtain a third heterogeneous file.

It can be understood that the second difference dimension is a dimension that is in the target dimension set but not in the second dimension set; that is, it corresponds to data that is required when integrating data but that the second dimension set lacks. The scheme therefore establishes a dimension item corresponding to the second difference dimension in the second heterogeneous file to obtain the third heterogeneous file.

Taking an EXCEL table as an example, establishing a dimension item corresponding to the second difference dimension in the second heterogeneous file may mean adding a column for that dimension, for example a payroll dimension column. The payroll data queried subsequently can then be filled into that column.
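Steps S31 to S34 amount to two set differences followed by a column delete and a column create. A minimal sketch, assuming the second heterogeneous file has already been loaded as a list of row dictionaries (the loading step and the function names are hypothetical):

```python
def dimension_difference(second_dims: list[str],
                         target_dims: list[str]) -> tuple[list[str], list[str]]:
    """S31/S32: split the first dimension difference set into its two subsets."""
    # First difference dimensions: in the second set, not in the target set.
    first_diff = [d for d in second_dims if d not in target_dims]
    # Second difference dimensions: in the target set, not in the second set.
    second_diff = [d for d in target_dims if d not in second_dims]
    return first_diff, second_diff


def build_third_file(rows: list[dict], second_dims: list[str],
                     target_dims: list[str]) -> tuple[list[dict], list[str]]:
    """S33/S34: drop unwanted dimension items and create the missing ones."""
    first_diff, second_diff = dimension_difference(second_dims, target_dims)
    third_rows = []
    for row in rows:
        # S33: delete dimension items (and their data) the user does not need.
        kept = {k: v for k, v in row.items() if k not in first_diff}
        # S34: create empty dimension items to be filled in later.
        for dim in second_diff:
            kept[dim] = None
        third_rows.append(kept)
    return third_rows, second_diff
```

The returned second difference subset is what remains to be filled from the other heterogeneous files in the later traversal step.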

In some embodiments, S34 (the creating of the corresponding dimension item in the second heterogeneous file according to the second difference dimension, resulting in a third heterogeneous file) includes S341-S343:

S341, if it is judged that the user inputs a main body selection instruction, determining all dimension main bodies corresponding to the third heterogeneous file, and displaying the dimension main bodies in enlarged form according to a first preset display mode.

It can be understood that a user of the present scheme may input a main body selection instruction; in response, the server determines all dimension main bodies corresponding to the third heterogeneous file and displays them in enlarged form according to the first preset display mode.

It should be noted that data integration generally involves a large amount of data; by enlarging the information selected by the user in this way, the scheme lets the user view it clearly.

S342, establishing a perspective selection overlay corresponding to the third heterogeneous file, determining a layer region corresponding to each dimension main body in the perspective selection overlay, acquiring a layer contour of the corresponding layer region, and displaying the pixel points corresponding to the layer contour according to a preset pixel value.

The scheme provides a perspective selection overlay that corresponds to the third heterogeneous file. For example, the overlay may cover the third heterogeneous file without obstructing the user's view of its content. The perspective selection overlay may be preset in the server as software or an APP and displayed in response to a call instruction from the user; the user may adjust its position and coverage area so that it covers the third heterogeneous file in the manner the user requires.

It should be noted that the perspective selection overlay contains a layer region for each dimension main body; for example, the phone dimension corresponds to one layer region and the address dimension to another. Each layer region has a corresponding layer contour, and the scheme displays the pixel points of the layer contour according to a preset pixel value.

In some embodiments, S342 (the establishing of the perspective selection overlay corresponding to the third heterogeneous file, determining a layer region corresponding to each dimension main body in the perspective selection overlay, acquiring a layer contour of the corresponding layer region, and displaying the pixel points corresponding to the layer contour according to a preset pixel value) includes S3421 to S3423:

S3421, performing imaging processing on the third heterogeneous file to obtain a third heterogeneous image, and superimposing a perspective selection overlay on the third heterogeneous image.

Firstly, the scheme performs format conversion on the third heterogeneous file, converting it into an image (for example, in PDF format) to obtain the required third heterogeneous image, and then superimposes the perspective selection overlay on top of the third heterogeneous image.

S3422, performing coordinate processing on the third heterogeneous image and the perspective selection overlay, so that the first pixel points of the third heterogeneous image and the corresponding second pixel points of the perspective selection overlay have the same coordinates.

The scheme places the third heterogeneous image and the perspective selection overlay in a common coordinate system, so that the first pixel points of the image and the corresponding second pixel points of the overlay have the same coordinates. It can be understood that these shared coordinates are what put the third heterogeneous image and the perspective selection overlay in correspondence.

S3423, determining a corresponding layer area according to the coordinates of the first pixel point corresponding to the dimension main body, determining a layer contour of the layer area, and controlling the layer contour to be displayed according to a preset pixel value.

According to the scheme, the corresponding layer area is determined according to the coordinate of the first pixel point corresponding to the dimension main body, after the layer area is determined, the layer outline of the layer area is determined, and then the layer outline is controlled to be displayed according to the preset pixel value. For example, the layer outline may be controlled to be highlighted with a pixel value corresponding to yellow.
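The contour display of S3423 can be illustrated with a small sketch. It assumes each layer region is an axis-aligned rectangle given as (x0, y0, x1, y1) in the coordinates shared by the image and the overlay, and that the overlay is a row-major grid of pixel values; both representations and the function names are assumptions, not taken from the text.

```python
Region = tuple[int, int, int, int]  # (x0, y0, x1, y1), corners inclusive


def layer_contour(region: Region) -> set[tuple[int, int]]:
    """Return the border pixel coordinates of a rectangular layer region."""
    x0, y0, x1, y1 = region
    contour = set()
    for x in range(x0, x1 + 1):
        contour.add((x, y0))  # top edge
        contour.add((x, y1))  # bottom edge
    for y in range(y0, y1 + 1):
        contour.add((x0, y))  # left edge
        contour.add((x1, y))  # right edge
    return contour


def draw_contour(overlay: list[list], region: Region, pixel_value) -> list[list]:
    """Set every contour pixel of the region to the preset pixel value."""
    for x, y in layer_contour(region):
        overlay[y][x] = pixel_value
    return overlay
```

Only the border pixels are changed, so the interior of the overlay stays transparent and the content of the underlying image remains visible.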

S343, counting the trigger traces of the user in the layer regions, acquiring the trigger trace proportion of each triggered layer region, and selecting the dimension main body corresponding to each layer region whose trigger trace proportion is larger than a preset proportion as a dimension main body reserved in the third heterogeneous file.

According to the scheme, the user's trigger operations on the layer regions are collected in real time to obtain the user's trigger traces. The trigger trace proportion of each triggered layer region is then obtained, and the dimension main body corresponding to each layer region whose trigger trace proportion is larger than the preset proportion (for example, larger than 60%) is selected as a dimension main body reserved in the third heterogeneous file.

In some embodiments, S343 (the counting of the trigger traces of the user in the layer regions, acquiring the trigger trace proportion of each triggered layer region, and selecting the dimension main body corresponding to each layer region whose trigger trace proportion is larger than a preset proportion as a dimension main body reserved in the third heterogeneous file) includes S3431 to S3432:

S3431, counting the vertical coordinates of the triggered pixel points in each triggered layer region to obtain a vertical coordinate trigger number.

Illustratively, the information corresponding to the dimension main bodies is arranged horizontally, and the user slides from top to bottom, forming a top-to-bottom trigger trace. The scheme acquires, in real time, the vertical coordinate of each triggered pixel point in the layer region to obtain the vertical coordinate trigger number.

S3432, calculating according to the vertical coordinate triggering quantity and the total quantity of the vertical coordinates to obtain a triggering trace proportion, and selecting a dimension main body corresponding to the layer area with the triggering trace proportion larger than a preset proportion.

According to the scheme, after the vertical coordinate triggering number and the vertical coordinate total number are obtained, the triggering trace proportion can be calculated according to the vertical coordinate triggering number and the vertical coordinate total number, and the dimension main body corresponding to the layer area with the triggering trace proportion larger than the preset proportion (for example, larger than 60%) is selected.

By counting triggered vertical coordinates, the scheme can accurately identify the region the user intends to select. Because the user's trigger operation is subject to some error, the scheme leaves the user a margin for error: the corresponding dimension main body is determined as soon as the trigger trace proportion exceeds the preset proportion.
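The calculation in S3431 and S3432 can be sketched as follows. The sketch assumes rectangular layer regions given as (x0, y0, x1, y1), counts each distinct vertical coordinate touched inside a region once, and takes the region height in pixels as the total number of vertical coordinates; these details and the function names are assumptions.

```python
Region = tuple[int, int, int, int]  # (x0, y0, x1, y1), corners inclusive


def trigger_trace_ratio(triggered: list[tuple[int, int]], region: Region) -> float:
    """S3431/S3432: share of the region's vertical coordinates that were triggered."""
    x0, y0, x1, y1 = region
    total_rows = y1 - y0 + 1
    # Distinct vertical coordinates of triggered pixels inside the region.
    hit_rows = {y for x, y in triggered if x0 <= x <= x1 and y0 <= y <= y1}
    return len(hit_rows) / total_rows


def select_dimension_bodies(regions: dict[str, Region],
                            triggered: list[tuple[int, int]],
                            preset_ratio: float = 0.6) -> list[str]:
    """Keep the dimension main bodies whose ratio exceeds the preset proportion."""
    return [name for name, region in regions.items()
            if trigger_trace_ratio(triggered, region) > preset_ratio]
```

A swipe that covers 8 of the 10 rows of the phone region but only 3 of the 10 rows of the address region would select the phone dimension alone at the 60% threshold.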

And S4, sequentially traversing corresponding target information of the first heterogeneous file according to the first dimension difference set, adding the target information into a third heterogeneous file, continuously updating the third heterogeneous file and the first dimension difference set, and judging that the current third heterogeneous file is a fusion output file in a target generation format after judging that the first dimension difference set is an empty set.

After the third heterogeneous file is obtained, the information corresponding to the first dimension difference set needs to be integrated into it. The scheme therefore uses the first dimension difference set to traverse the target information corresponding to the first heterogeneous files in turn and adds that information to the third heterogeneous file, continuously updating both the third heterogeneous file and the first dimension difference set. Once the first dimension difference set is judged to be an empty set, data integration is complete, and the current third heterogeneous file is judged to be the fusion output file in the target generation format.

In some embodiments, S4 (adding the target information corresponding to the first heterogeneous file sequentially traversed according to the first dimension difference set to a third heterogeneous file, continuously updating the third heterogeneous file and the first dimension difference set, and determining that the current third heterogeneous file is a merged output file in a target generation format after determining that the first dimension difference set is an empty set) includes S41 to S43:

S41, traversing each first heterogeneous file according to the second difference dimension to obtain target information corresponding to the first heterogeneous file, and adding the target information into a third heterogeneous file to obtain an updated third heterogeneous file.

According to the scheme, each first heterogeneous file is traversed by using the second difference dimension to obtain the corresponding target information of the first heterogeneous file, and the target information is added into the third heterogeneous file to obtain the updated third heterogeneous file.

S42, obtaining a current adding dimension corresponding to the current added target information, and deleting a second difference dimension corresponding to the current adding dimension from the second difference subset to obtain an updated second difference subset.

After the target information needing to be added is obtained, the scheme determines the current adding dimension of the target information needing to be added, and then deletes the second difference dimension corresponding to the current adding dimension from the second difference subset to obtain an updated second difference subset.

And S43, continuously detecting the number of the second difference dimensions in the second difference subset, and judging that the first dimension difference set is an empty set when the number of the second difference dimensions is judged to be 0.

It can be understood that the scheme continuously checks the number of second difference dimensions in the second difference subset. When that number is judged to be 0, the first dimension difference set is judged to be an empty set, which means that data integration is complete, and the current third heterogeneous file is judged to be the fusion output file in the target generation format.
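The loop of S41 to S43 can be sketched as below. The text does not say how records from different files are matched, so the sketch assumes every row carries a shared key field (here "name") and that each first heterogeneous file has been loaded as a list of row dictionaries; the key field and the function name are hypothetical.

```python
def fuse(third_rows: list[dict], first_files: list[list[dict]],
         second_diff: list[str], key: str = "name") -> tuple[list[dict], list[str]]:
    """Fill the pending dimensions of the third file from the first files."""
    remaining = list(second_diff)          # current second difference subset
    for dim in list(remaining):
        for rows in first_files:           # S41: traverse each first file
            values = {r[key]: r[dim] for r in rows if key in r and dim in r}
            if not values:
                continue                   # this file does not carry the dimension
            for row in third_rows:
                if row.get(key) in values:
                    row[dim] = values[row[key]]
            remaining.remove(dim)          # S42: dimension has been added
            break
    # S43: an empty list means the third file is the fusion output file.
    return third_rows, remaining
```

If a dimension cannot be found in any first heterogeneous file, it stays in the returned list, which signals that the difference set is not yet empty.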

Referring to fig. 2, a schematic structural diagram of a data processing system suitable for heterogeneous data according to an embodiment of the present invention is shown, where the data processing system suitable for heterogeneous data includes:

an updating module, configured to sequentially traverse the target information corresponding to the first heterogeneous file according to the first dimension difference set, add the target information into a third heterogeneous file, continuously update the third heterogeneous file and the first dimension difference set, and, after the first dimension difference set is judged to be an empty set, judge the current third heterogeneous file to be a fusion output file in a target generation format.

The apparatus in the embodiment shown in fig. 2 can be correspondingly used to perform the steps in the method embodiment shown in fig. 1, and the implementation principle and technical effect are similar, which are not described herein again.

Referring to fig. 3, which is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention, the electronic device 30 includes: a processor 31, a memory 32 and a computer program; wherein

A memory 32 for storing the computer program, which may also be a flash memory (flash). The computer program is, for example, an application program, a functional module, or the like that implements the above method.

A processor 31 for executing the computer program stored in the memory to implement the steps performed by the apparatus in the above method. Reference may be made in particular to the description relating to the preceding method embodiment.

Alternatively, the memory 32 may be separate or integrated with the processor 31.

When the memory 32 is a device independent of the processor 31, the apparatus may further include:

a bus 33 for connecting the memory 32 and the processor 31.

The present invention also provides a readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to implement the methods provided by the various embodiments described above.

The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.

In the above embodiments of the apparatus, it should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data processing method suitable for heterogeneous data is characterized by comprising the following steps:

2. The data processing method for heterogeneous data according to claim 1,

the obtaining of the target dimension set in the target generation format, comparing the first dimension set of each first data format with the target dimension set, and determining the first dimension set with the highest similarity to the target dimension set as a second dimension set includes:

3. The data processing method suitable for heterogeneous data according to claim 2, further comprising:

if the user selects at least one third dimension, adding the selected third dimension as a target dimension to the target dimension set;

4. The data processing method for heterogeneous data according to claim 3,

the calculating according to the first dimension number of the first dimension set, the target dimension number of the target dimension set, the same dimension number and the difference dimension number to obtain a similarity quantization value of the first dimension set and the target dimension set includes:

wherein the quantities in the formula are, in order: the same dimension number; the first dimension number of the first dimension set; the target dimension number of the target dimension set; the same-quantity weight value; the difference dimension number; and the difference-quantity weight value.

5. The data processing method applicable to heterogeneous data according to claim 4, further comprising:

wherein the quantities in the formulas are, in order: the adjusted same-quantity weight value, the forward adjustment base, the first adjustment constant, and the number of forward adjustments; followed by the adjusted difference-quantity weight value, the negative adjustment base, the second adjustment constant, and the number of negative adjustments.

6. The data processing method for heterogeneous data according to claim 4,

the comparing the target dimension set with the second dimension set to obtain a first dimension difference set, using a first data format corresponding to the second dimension set as a second data format, and processing a second heterogeneous file corresponding to the second data format according to the first dimension difference set to obtain a third heterogeneous file, including:

7. The data processing method suitable for heterogeneous data according to claim 6,

wherein establishing corresponding dimension items in the second heterogeneous file according to the second difference dimension to obtain the third heterogeneous file comprises:

counting the trigger traces of the user in the layer areas, obtaining the trigger trace proportion of each triggered layer area, and selecting the dimension main body corresponding to each layer area whose trigger trace proportion is greater than the preset proportion as a dimension main body retained in the third heterogeneous file.
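The selection step of claim 7 can be sketched as a simple threshold filter. The claim does not define how the trigger trace proportion is measured, so the sketch assumes it is each layer area's share of all recorded trigger traces. The function name `select_reserved_subjects`, the parameter names, and the sample values are hypothetical.

```python
def select_reserved_subjects(trace_counts, preset_proportion):
    """Pick the dimension main bodies whose layer area was triggered often enough.

    trace_counts maps each dimension main body to the number of trigger
    traces counted in its layer area.
    """
    total = sum(trace_counts.values())
    if total == 0:
        return []  # no layer area was triggered
    reserved = []
    for subject, count in trace_counts.items():
        if count == 0:
            continue  # only triggered layer areas are considered
        # Assumption: proportion = this area's share of all trigger traces.
        if count / total > preset_proportion:
            reserved.append(subject)
    return reserved


print(select_reserved_subjects({"price": 6, "stock": 3, "color": 1, "size": 0}, 0.2))
```

A different definition of the proportion, for example trace pixels relative to the pixels of the layer area, would only change the expression compared against the preset proportion.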

8. The data processing method suitable for heterogeneous data according to claim 7,

wherein establishing the perspective selection superposition layer corresponding to the third heterogeneous file, determining the layer area corresponding to each dimension main body in the perspective selection superposition layer, obtaining the layer outline of the corresponding layer area, and displaying the pixel points corresponding to the layer outline according to preset pixel values comprises:

performing coordinate processing on the third heterogeneous image and the perspective selection superposition layer so that the first pixel points of the third heterogeneous image and the corresponding second pixel points of the perspective selection superposition layer have the same coordinates.

9. The data processing method suitable for heterogeneous data according to claim 6,

wherein sequentially traversing the target information corresponding to the first heterogeneous files according to the first dimension difference set, adding the target information to the third heterogeneous file, continuously updating the third heterogeneous file and the first dimension difference set, and determining the current third heterogeneous file to be the fusion output file in the target generation format once the first dimension difference set is determined to be an empty set comprises:

obtaining the current added dimension corresponding to the currently added target information, and deleting the second difference dimension corresponding to the current added dimension from the second difference subset to obtain an updated second difference subset.
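The traverse-and-update loop of claim 9 can be sketched as follows. This is an illustration under simplifying assumptions, not the claimed implementation: each heterogeneous file is modelled as a flat dictionary from dimension name to target information, and the function name `fuse_missing_dimensions` and all sample data are hypothetical.

```python
def fuse_missing_dimensions(third_file, first_files, difference_set):
    """Fill the third file from the first files until no dimension is missing.

    Returns the fused file and the dimensions that are still missing.
    """
    fused = dict(third_file)         # work on a copy of the third file
    remaining = set(difference_set)  # dimensions not yet added
    for source in first_files:       # traverse the first files in order
        if not remaining:
            break                    # empty difference set: fusion is complete
        # Dimensions this source can supply (iterating a dict yields its keys).
        for dim in sorted(remaining.intersection(source)):
            fused[dim] = source[dim]  # add the target information
            remaining.discard(dim)    # update the difference set
    return fused, remaining


base = {"name": "A", "price": 10}
sources = [{"name": "A", "stock": 5}, {"color": "red", "stock": 7}]
fused, remaining = fuse_missing_dimensions(base, sources, {"stock", "color"})
print(fused, remaining)
```

When `remaining` comes back empty, `fused` plays the role of the fusion output file. A non-empty `remaining` means no traversed source supplied those dimensions.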

10. A data processing system suitable for heterogeneous data, comprising: