WO2018166343A1

WO2018166343A1 - Data fusion method and device, storage medium and electronic device

Info

Publication number: WO2018166343A1
Application number: PCT/CN2018/077184
Authority: WO
Inventors: 甘骏; 苏可; 饶孟良
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2017-03-13
Filing date: 2018-02-26
Publication date: 2018-09-20
Also published as: CN108572947B; CN108572947A

Abstract

Disclosed are a data fusion method and device, a storage medium and an electronic device. The method comprises: extracting attributes in first data and second data, wherein the first data and the second data comprise a correlation between an attribute and an attribute value; calculating a semantic similarity value between various attributes, determining a semantic similarity value greater than a pre-set first threshold, and determining an attribute corresponding to each semantic similarity value as a pair of common attributes of the first data and the second data; and by comparing the attribute values corresponding to each pair of common attributes, determining a similarity value between the first data and the second data, and if the similarity value between the first data and the second data is greater than a pre-set second threshold, then fusing the first data and the second data. By means of the embodiments of the present invention, the data fusion rate is improved on the premise of ensuring the accuracy of data fusion.

Description

Data fusion method and device, storage medium and electronic device

The present application claims priority to a Chinese patent application with a priority date of March 13, 2017.

Technical field

Embodiments of the present invention relate to the field of data processing, and in particular, to a data fusion method and apparatus, a storage medium, and an electronic device.

Background art

At present, in the process of data fusion, it is necessary to first determine whether the data can be fused, usually by determining whether the features included in the data match. The existing processing method compares the features included in the data as strings in order to complete the data fusion. However, strict string-based matching of features results in a low data fusion rate. In other words, data that should in fact be fused is left unfused.

Summary of the invention

In view of this, the embodiments of the present invention provide a data fusion method and device, a storage medium, and an electronic device, to at least solve the technical problem of low data fusion rate in the related art.

An embodiment of the present invention provides a data fusion method, where the method includes:

Extracting attributes in the first data and the second data, wherein the first data and the second data include a correspondence between an attribute and an attribute value;

Calculating semantic similarity values between the attributes;

Determining a semantic similarity value that is greater than a preset first threshold, and determining an attribute corresponding to each of the semantic similarity values as a pair of common attributes of the first data and the second data;

Determining a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of common attributes;

Fusing the first data and the second data if the similarity value between the first data and the second data is greater than a preset second threshold.

Optionally, determining the similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes, including:

Obtaining, from the first data and the second data, an attribute value corresponding to each pair of shared attributes, and calculating a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;

And determining a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes.

Optionally, the method further includes:

Calculating, in the first data and the second data, a weight value corresponding to each pair of common attributes.

Optionally, determining the similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes, including:

Accumulating the products of the semantic similarity value between the attribute values corresponding to each pair of common attributes and the weight value corresponding to that pair, to obtain the similarity value between the first data and the second data.

Optionally, before determining the similarity value between the first data and the second data according to the semantic similarity value between the attribute values corresponding to each pair of the shared attributes, the method further includes:

Filtering out, from the common attributes, any pair of common attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.

Optionally, before the calculating the semantic similarity value between the attributes, the method further includes:

Extracting the attribute value corresponding to each attribute in the first data and the second data, and acquiring the attributes whose attribute values have a similarity value greater than a preset fourth threshold.

Optionally, the calculating a semantic similarity value between the respective attributes includes:

Calculating a semantic similarity value between the attributes whose attribute values have a similarity value greater than the preset fourth threshold.

Optionally, the method further includes:

Determining, by querying a preset synonym database, attributes that are synonyms of each other as a pair of common attributes of the first data and the second data.

Optionally, the calculating a semantic similarity value between the respective attributes includes:

Calculating semantic similarity values between attributes that are not synonyms.

Optionally, the calculating a semantic similarity value between the respective attributes includes:

Obtaining the semantic vector corresponding to each attribute by using a preset word embedding model;

Calculating the semantic similarity value between the semantic vectors corresponding to the respective attributes.

The embodiment of the invention further provides a data fusion method, the method comprising:

Extracting attribute values in the first data and the second data, wherein the first data and the second data include a correspondence between an attribute and an attribute value;

Calculating the similarity value between the respective attribute values;

Determining a similarity value between the first data and the second data according to a similarity value between the respective attribute values;

Optionally, before the calculating the similarity value between the attribute values, the method further includes:

Extracting attributes in the first data and the second data;

Calculating semantic similarity values between the attributes;

Determining a semantic similarity value greater than a preset first threshold, and determining an attribute corresponding to each of the semantic similarity values as a pair of shared attributes of the first data and the second data.

Optionally, the calculating a similarity value between each attribute value includes:

Calculating the semantic similarity value between the attribute values corresponding to the same pair of common attributes.

Optionally, determining, according to the similarity value between the respective attribute values, a similarity value between the first data and the second data, including:

Optionally, the method further includes:

Obtaining the attributes whose attribute values have a similarity value greater than a preset fourth threshold.

Calculating the string similarity value between the respective attribute values.

Embodiments of the present invention also provide a data fusion device including one or more processors and one or more memories storing program units, wherein the program units are executed by the processor, and the program units include:

An extraction module, configured to extract an attribute in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value;

a first calculation module configured to calculate a semantic similarity value between the respective attributes;

a first determining module, configured to determine a semantic similarity value that is greater than a preset first threshold, and determine the attributes corresponding to each such semantic similarity value as a pair of common attributes of the first data and the second data;

a second determining module, configured to determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes;

The fusion module is configured to fuse the first data and the second data when a similarity value between the first data and the second data is greater than a preset second threshold.

Optionally, the second determining module includes:

a first calculation submodule configured to obtain, from the first data and the second data, an attribute value corresponding to each pair of shared attributes, and calculate a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;

The first determining submodule is configured to determine a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes.

Optionally, the device further includes:

The second calculating module is configured to calculate, in the first data and the second data, a weight value corresponding to each pair of shared attributes.

Optionally, the first determining submodule includes:

The accumulating submodule is configured to accumulate the products of the semantic similarity value between the attribute values corresponding to each pair of common attributes and the weight value corresponding to that pair, to obtain the similarity value between the first data and the second data.

Optionally, the device further includes:

The screening module is configured to filter out, from the common attributes, any pair of common attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.

Optionally, the device further includes:

An acquiring module, configured to extract the attribute value corresponding to each attribute in the first data and the second data, and obtain the attributes whose attribute values have a similarity value greater than a preset fourth threshold.

Optionally, the first computing module includes:

The second calculation submodule is configured to calculate a semantic similarity value between the attributes corresponding to the attribute values whose similarity values are greater than a preset fourth threshold.

Optionally, the device further includes:

The third determining module is configured to determine, by querying the preset synonym database, an attribute belonging to the synonym as a pair of common attributes of the first data and the second data.

Optionally, the first computing module includes:

A third computing sub-module configured to calculate semantic similarity values between attributes that are not synonymous.

Optionally, the first computing module includes:

An obtaining sub-module, configured to obtain a semantic vector corresponding to each attribute by using a preset word embedding model;

The fourth calculation sub-module is configured to calculate a semantic similarity value between the semantic vectors corresponding to the respective attributes.

An extraction module, configured to extract an attribute value in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value;

a calculation module configured to calculate a similarity value between respective attribute values;

a determining module, configured to determine a similarity value between the first data and the second data according to a similarity value between the respective attribute values;

The embodiment of the present invention further provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute the method described in the embodiment of the present invention.

An embodiment of the present invention further provides an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the method described in the embodiment of the present invention by using the computer program.

In the data fusion method provided by the embodiment of the present invention, first, the attributes in the first data and the second data are extracted, wherein the first data and the second data include a correspondence between an attribute and an attribute value. Second, the semantic similarity values between the respective attributes are calculated, the semantic similarity values greater than a preset first threshold are determined, and the attributes corresponding to each such semantic similarity value are determined as a pair of common attributes of the first data and the second data. Finally, the similarity value between the first data and the second data is determined by comparing the attribute values corresponding to each pair of common attributes; if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused. The embodiment of the present invention determines the common attributes of the first data and the second data based on semantic similarity values, then compares the similarity between the attribute values corresponding to the common attributes, and finally determines the similarity value between the first data and the second data. Compared with the related art, the embodiment of the present invention improves the data fusion rate while ensuring the accuracy of data fusion.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art may obtain other drawings from them.

FIG. 1 is a flowchart of a data fusion method according to an embodiment of the present invention;

FIG. 2 is a flowchart of another data fusion method according to an embodiment of the present invention;

FIG. 3 is a flowchart of another data fusion method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data fusion device according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a portion of an electronic device according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of the present application.

An embodiment of the present invention provides a data fusion method. FIG. 1 is a flowchart of the data fusion method according to an embodiment of the present invention. The data fusion method includes:

S101: Extract an attribute in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value.

In the embodiment of the present invention, both the first data and the second data include correspondences between attributes and attribute values. For example, the first data may include the correspondence singer-Andy Lau, and the second data may include the correspondence singer-Hua Tsai. Here the two attribute names (both rendered as "singer" in this translation) are attributes, and Andy Lau and Hua Tsai are the attribute values corresponding to them, respectively.

In the embodiment of the present invention, before the first data and the second data are fused, it is first determined whether they can be fused. In an actual application, each attribute included in the first data and the second data is extracted first.

S102: Calculate a semantic similarity value between each attribute.

In the embodiment of the present invention, after the attributes in the first data and the second data are extracted, the semantic similarity values between the attributes are calculated. By calculating semantic similarity values, the embodiment of the present invention can identify attributes that substantially refer to the same thing without requiring an exact string match, thereby avoiding the low fusion rate caused by strict string-based matching of features.

In practical applications, the semantic similarity between each attribute in the first data and each attribute in the second data can be calculated. In an implementation manner, first, a semantic vector corresponding to each attribute is obtained by using a preset word embedding model; second, the semantic similarity value between the semantic vectors corresponding to the attributes is calculated, and this value is taken as the semantic similarity value between the attributes.
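The two-step computation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the table `TOY_EMBEDDINGS`, its three-dimensional vectors, and the attribute name "performer" are hypothetical stand-ins for a real pre-trained word embedding model, and cosine similarity is one common way to compare vectors that the embodiment does not prescribe.

```python
import math

# Hypothetical toy embeddings; a real system would look vectors up in a
# pre-trained word embedding model instead of this hand-written table.
TOY_EMBEDDINGS = {
    "singer": [0.9, 0.1, 0.3],
    "performer": [0.8, 0.2, 0.4],
    "duration": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def attribute_similarity(attr_a, attr_b, embeddings=TOY_EMBEDDINGS):
    """Semantic similarity value between two attribute names."""
    return cosine_similarity(embeddings[attr_a], embeddings[attr_b])


close = attribute_similarity("singer", "performer")
far = attribute_similarity("singer", "duration")
```

With these toy vectors, `close` is much larger than `far`, which is the property the first threshold relies on.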

S103: Determine a semantic similarity value that is greater than a preset first threshold, and determine an attribute corresponding to each of the semantic similarity values as a pair of shared attributes of the first data and the second data.

In the embodiment of the present invention, after the semantic similarity values between the attributes are obtained by calculation, the semantic similarity values greater than a preset first threshold are determined. Further, the attributes corresponding to each semantic similarity value greater than the preset first threshold are determined as common attributes of the first data and the second data. That is to say, a pair of attributes having a sufficiently high semantic similarity value is determined as a pair of common attributes of the first data and the second data.

For example, if the semantic similarity value between the "singer" attribute in the first data and the "singer" attribute in the second data is greater than the preset first threshold, the two attributes are determined as a pair of common attributes of the first data and the second data.
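The thresholding in S103 can be sketched as below. The score table `SCORES`, the attribute names, and the threshold 0.8 are hypothetical values chosen only for illustration; any attribute similarity function could be passed in place of `toy_similarity`.

```python
def find_common_attribute_pairs(attrs_a, attrs_b, similarity, first_threshold):
    """Return every (attribute of first data, attribute of second data) pair
    whose semantic similarity value is greater than first_threshold."""
    pairs = []
    for a in attrs_a:
        for b in attrs_b:
            if similarity(a, b) > first_threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical precomputed scores standing in for an embedding-based measure.
SCORES = {
    ("singer", "performer"): 0.95,
    ("singer", "release date"): 0.10,
    ("length", "performer"): 0.05,
    ("length", "release date"): 0.30,
}


def toy_similarity(a, b):
    return SCORES.get((a, b), 0.0)


pairs = find_common_attribute_pairs(
    ["singer", "length"], ["performer", "release date"], toy_similarity, 0.8
)
```

Only the pair ("singer", "performer") exceeds the threshold here, so it is the only pair of common attributes.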

S104: Determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes.

In the embodiment of the present invention, after the common attributes of the first data and the second data are determined, the similarity between the attribute values corresponding to each pair of common attributes is compared. Optionally, this similarity may be the semantic similarity value or the string similarity value between the attribute values corresponding to each pair of common attributes. Finally, the similarity value between the first data and the second data is determined according to the similarity between the attribute values corresponding to each pair of common attributes.

In one implementation, the similarity is the semantic similarity value between the attribute values corresponding to each pair of common attributes. First, the attribute values corresponding to each pair of common attributes are obtained from the first data and the second data, and the semantic similarity value between the attribute values corresponding to the same pair of common attributes is calculated. For example, for the pair of common attributes formed by the two "singer" attributes, the corresponding attribute values Andy Lau and Hua Tsai are obtained, and the semantic similarity value between Andy Lau and Hua Tsai is calculated. Second, the similarity value between the first data and the second data is determined according to the semantic similarity values between the attribute values corresponding to each pair of common attributes. That is, the degree of similarity between the first data and the second data depends on the similarity between the attribute values corresponding to their common attributes.

In order to improve the efficiency of calculating the similarity value between the first data and the second data, before that similarity value is determined, the embodiment of the present invention identifies, among the common attributes, those whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold, and excludes them. This improves the accuracy of the common attributes and reduces their number. That is to say, a pair of attributes whose attribute values have a semantic similarity value not greater than the preset third threshold is not treated as a common attribute of the first data and the second data. In this way, the common attributes determined in advance are screened further to remove those that are not true common attributes, which improves the efficiency of the subsequent similarity calculation.
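A minimal sketch of this screening step is given below. The records, the attribute names "performer", "style" and "genre", the value scores, and the threshold 0.5 are all hypothetical; `toy_value_similarity` stands in for whatever semantic similarity measure is applied to attribute values.

```python
def filter_common_pairs(common_pairs, first_data, second_data,
                        value_similarity, third_threshold):
    """Keep only the common attribute pairs whose attribute values have a
    semantic similarity value greater than third_threshold."""
    kept = []
    for attr_a, attr_b in common_pairs:
        score = value_similarity(first_data[attr_a], second_data[attr_b])
        if score > third_threshold:
            kept.append((attr_a, attr_b))
    return kept


# Hypothetical value-level scores used only for this illustration.
TOY_VALUE_SCORES = {("Andy Lau", "Hua Tsai"): 0.9, ("pop", "rock"): 0.2}


def toy_value_similarity(a, b):
    return TOY_VALUE_SCORES.get((a, b), 0.0)


first_data = {"singer": "Andy Lau", "style": "pop"}
second_data = {"performer": "Hua Tsai", "genre": "rock"}
common_pairs = [("singer", "performer"), ("style", "genre")]

kept = filter_common_pairs(
    common_pairs, first_data, second_data, toy_value_similarity, 0.5
)
```

The pair ("style", "genre") is dropped because its values score 0.2, so only one pair takes part in the later accumulation.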

In addition, an embodiment of the present invention provides a method for determining the similarity value between the first data and the second data. First, in the first data and the second data, a weight value corresponding to each pair of common attributes is calculated. Optionally, the weight values are calculated by using the term frequency-inverse document frequency (TF-IDF) algorithm. Second, the products of the semantic similarity value between the attribute values corresponding to each pair of common attributes and the weight value corresponding to that pair are accumulated to obtain the similarity value between the first data and the second data.

For example, if the semantic similarity value between the attribute values Andy Lau and Hua Tsai, which correspond to the pair of common attributes formed by the two "singer" attributes, is 90%, and the weight value corresponding to that pair is 0.6, then the product of 90% and 0.6 (that is, 0.54) is used as one addend of the accumulation. The product for every other pair of common attributes is obtained in the same way, and the products are accumulated to obtain the similarity value between the first data and the second data.
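The accumulation can be written as a weighted sum, as in the following sketch. The first pair reuses the 90% and 0.6 figures from the example above; the second pair ("title", "name") with score 1.0 and weight 0.4 is a hypothetical addition so that the sum has more than one addend.

```python
def record_similarity(pair_scores, pair_weights):
    """Accumulate score * weight over every pair of common attributes."""
    return sum(pair_scores[pair] * pair_weights[pair] for pair in pair_scores)


# Semantic similarity value between the attribute values of each common pair.
pair_scores = {
    ("singer", "performer"): 0.9,  # Andy Lau vs. Hua Tsai, from the example
    ("title", "name"): 1.0,        # hypothetical second pair
}

# Weight value of each common pair (for example, obtained with TF-IDF).
pair_weights = {
    ("singer", "performer"): 0.6,  # from the example
    ("title", "name"): 0.4,        # hypothetical
}

similarity_value = record_similarity(pair_scores, pair_weights)
```

Here the two addends are 0.54 and 0.4, so `similarity_value` is approximately 0.94, which would then be compared with the second threshold.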

S105: If the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused.

After the similarity value between the first data and the second data is calculated, it is determined whether the similarity value is greater than a preset second threshold. If the similarity value is greater than the second threshold, the first data and the second data are fused; that is, first data and second data that point to the same entity are merged and de-duplicated, so that the remaining data each point to different entities. For example, the song library contains the song "Forget Love Water" from music application A, which has several attributes, such as the singer Andy Lau and a song length of 4 minutes; the song library also stores the song "Forget Love Water" from music application B, with attributes such as the singer Andy Lau and a release date of 1994. Since the two are essentially the same song, in order to avoid song query errors, the system needs to fuse them into a single song "Forget Love Water" stored in the song library, where the fused song contains all attributes of the two songs. If the similarity value is not greater than the second threshold, the first data and the second data cannot be fused.
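The fusion decision for the song example can be sketched as follows. The dictionary layout, the "title" attribute, the similarity value 0.94, and the threshold 0.8 are assumptions made for illustration. The sketch also assumes that, when both records carry the same attribute name, the value from the first record is kept; the embodiment does not state a conflict rule.

```python
def fuse_if_similar(first_data, second_data, similarity_value, second_threshold):
    """Return the fused record, or None when the records cannot be fused."""
    if similarity_value <= second_threshold:
        return None
    fused = dict(first_data)
    for attribute, value in second_data.items():
        # Assumption: on a name clash, keep the first record's value.
        fused.setdefault(attribute, value)
    return fused


song_from_a = {
    "title": "Forget Love Water",
    "singer": "Andy Lau",
    "length": "4 minutes",
}
song_from_b = {
    "title": "Forget Love Water",
    "singer": "Andy Lau",
    "release date": "1994",
}

fused = fuse_if_similar(song_from_a, song_from_b, 0.94, 0.8)
not_fused = fuse_if_similar(song_from_a, song_from_b, 0.50, 0.8)
```

The fused record carries the attributes of both sources (title, singer, length, release date), while the second call returns `None` because 0.50 does not exceed the threshold.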

In the data fusion method provided by the embodiment of the present invention, first, the attributes in the first data and the second data are extracted, wherein the first data and the second data include a correspondence between an attribute and an attribute value. Second, the semantic similarity values between the respective attributes are calculated, the semantic similarity values greater than a preset first threshold are determined, and the attributes corresponding to each such semantic similarity value are determined as a pair of common attributes of the first data and the second data. Finally, the similarity value between the first data and the second data is determined by comparing the attribute values corresponding to each pair of common attributes; if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused. The embodiment of the present invention determines the common attributes of the first data and the second data based on semantic similarity values, then compares the similarity between the attribute values corresponding to the common attributes, and finally determines the similarity value between the first data and the second data. Compared with the related art, the embodiment of the present invention improves the data fusion rate while ensuring the accuracy of data fusion.

The embodiment of the present invention further provides a data fusion method. FIG. 2 is a flowchart of another data fusion method according to an embodiment of the present invention. The data fusion method includes:

S201: Extract an attribute in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value.

S202: Extract an attribute value corresponding to each attribute in the first data and the second data, and obtain an attribute corresponding to the attribute value whose similarity value is greater than a preset fourth threshold.

S203: Calculate a semantic similarity value between attributes corresponding to the attribute value whose similarity value is greater than a preset fourth threshold.

In the embodiment of the present invention, the similarity values between the attribute values in the first data and the second data are calculated, the attribute values whose similarity value is greater than a preset fourth threshold are determined, and the attributes corresponding to those attribute values are acquired. That is to say, by calculating the similarity values between attribute values, the embodiment screens out the attributes that are more likely to be common attributes of the first data and the second data, namely the attributes whose attribute values have a similarity value greater than the preset fourth threshold. On this basis, the semantic similarity values between those attributes are calculated to determine the common attributes of the first data and the second data, which improves the efficiency of determining the common attributes.

In addition, before determining the common attributes of the first data and the second data, the embodiment of the present invention may query a preset synonym database and determine attributes that are synonyms of each other as common attributes of the first data and the second data. Further, before calculating the semantic similarity values between the attributes whose attribute values have a similarity value greater than the fourth threshold, the embodiment may filter out the attributes that have already been determined as common attributes through the synonym database, and then determine the other common attributes of the first data and the second data. This also improves the efficiency of determining the common attributes.
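The synonym lookup and the value-based pre-filtering can be combined as in the sketch below. Everything concrete in it is an assumption: the synonym table `SYNONYMS`, the attribute names, the threshold 0.6, and the use of `difflib.SequenceMatcher` from the Python standard library as the attribute value similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical synonym database: each entry is a set of two synonymous names.
SYNONYMS = {frozenset({"singer", "vocalist"})}


def value_similarity(a, b):
    """String similarity ratio in [0, 1] between two attribute values."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_attribute_pairs(first_data, second_data, fourth_threshold):
    """Split attribute pairs into synonym pairs (already common attributes)
    and candidates that still need a semantic similarity check."""
    synonym_pairs, candidates = [], []
    for attr_a, value_a in first_data.items():
        for attr_b, value_b in second_data.items():
            if frozenset({attr_a, attr_b}) in SYNONYMS:
                synonym_pairs.append((attr_a, attr_b))
            elif value_similarity(value_a, value_b) > fourth_threshold:
                candidates.append((attr_a, attr_b))
    return synonym_pairs, candidates


first_data = {"singer": "Andy Lau", "length": "4 minutes"}
second_data = {"vocalist": "Andy Lau", "duration": "4 min", "release date": "1994"}

synonym_pairs, candidates = candidate_attribute_pairs(first_data, second_data, 0.6)
```

Of the six possible attribute pairs, one is settled by the synonym table and only one remains as a candidate, so the more expensive semantic comparison of attribute names runs on a single pair.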

S204: Determine a semantic similarity value that is greater than a preset first threshold, and determine an attribute corresponding to each of the semantic similarity values as a pair of shared attributes of the first data and the second data.

S205: Determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes.

S206: If the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused.

S201 and S204-S206 in the embodiment of the present invention are the same as the above-described processes of S101 and S103-S105, and can be understood by referring to the above explanation.

In the data fusion method provided by the embodiment of the present invention, the attributes that are more likely to be common attributes of the first data and the second data are screened out by calculating the similarity values between the attribute values in the first data and the second data. In addition, the common attributes that are synonyms can be determined through the synonym database before the other common attributes of the first data and the second data are determined. Both measures improve the efficiency of determining the common attributes in data fusion.

In addition, the embodiment of the present invention determines the common attributes of the first data and the second data based on semantic similarity values, then compares the similarity between the attribute values corresponding to the common attributes, and finally determines the similarity value between the first data and the second data. Compared with the related art, the embodiment of the present invention improves the data fusion rate while ensuring the accuracy of data fusion.

The embodiment of the present invention further provides a data fusion method. FIG. 3 is a flowchart of another data fusion method according to an embodiment of the present invention. The data fusion method includes:

S301: Extract an attribute value in the first data and the second data, where the first data and the second data include a correspondence between the attribute and the attribute value.

In the embodiment of the present invention, each attribute value in the first data and the second data is first extracted. For example, if the first data includes the correspondence singer-Andy Lau and the second data includes the correspondence singer-Hua Tsai, then Andy Lau in the first data and Hua Tsai in the second data are attribute values.

S302: Calculate a similarity value between each attribute value.

In the embodiment of the present invention, after extracting attribute values in the first data and the second data, calculating similarity values between the respective attribute values, for example, calculating similarity values between the attribute values Andy Lau and Hua Tsai.

In practical applications, semantic similarity values between individual attribute values can be calculated. In order to improve the accuracy, the embodiment of the present invention can also directly calculate the string similarity value between each attribute value.
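A string similarity value between attribute values can be sketched with the Python standard library, as below. The choice of `difflib.SequenceMatcher` and the normalization (trimming whitespace and ignoring case) are assumptions; the embodiment does not name a particular string similarity measure.

```python
from difflib import SequenceMatcher


def string_similarity(value_a, value_b):
    """Return a similarity ratio in [0, 1] between two attribute values,
    ignoring surrounding whitespace and letter case."""
    a = value_a.strip().casefold()
    b = value_b.strip().casefold()
    return SequenceMatcher(None, a, b).ratio()


same = string_similarity("Andy Lau", " andy lau")
alias = string_similarity("Andy Lau", "Hua Tsai")
```

The first comparison scores 1.0 after normalization, while the alias pair scores low even though both names refer to the same person. That gap is where a semantic similarity value between attribute values is useful.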

In an implementation manner, the semantic similarity value between attribute values may be calculated as follows: a semantic vector corresponding to each attribute value is obtained by using a preset word embedding model, and the semantic similarity value between the semantic vectors corresponding to the attribute values is then calculated. This value is the semantic similarity value between the attribute values.
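
A minimal sketch of this calculation is given here. The table `TOY_EMBEDDINGS` is a hypothetical stand-in for the preset word embedding model, its vectors are invented for illustration, and cosine similarity is assumed as the measure between semantic vectors, since the passage does not fix one.

```python
import math

# Hypothetical semantic vectors; a real system would look these up
# in a trained word embedding model.
TOY_EMBEDDINGS = {
    "Andy Lau": [0.9, 0.1, 0.3],
    "Hua Tsai": [0.8, 0.2, 0.4],
    "Beijing": [0.1, 0.9, 0.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity(value_a, value_b, embeddings=TOY_EMBEDDINGS):
    """Semantic similarity between two attribute values via their vectors."""
    return cosine_similarity(embeddings[value_a], embeddings[value_b])
```

With these toy vectors, `semantic_similarity("Andy Lau", "Hua Tsai")` is higher than `semantic_similarity("Andy Lau", "Beijing")`.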

S303: Determine a similarity value between the first data and the second data according to the similarity value between the respective attribute values.

In the embodiment of the present invention, after calculating the similarity value between the respective attribute values, the similarity value between the first data and the second data is determined according to the similarity value between the respective attribute values.

In an implementation manner, before the similarity values between the respective attribute values are calculated, the attributes in the first data and the second data are first extracted, and semantic similarity values between the respective attributes are calculated, thereby determining the common attributes of the first data and the second data. Optionally, attributes whose semantic similarity value is greater than the preset first threshold are determined as common attributes of the first data and the second data.
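
The selection of common attributes by the first threshold can be sketched as follows. The function name `find_common_attributes` and the `similarity` callback are hypothetical; any attribute-level semantic similarity function could be passed in.

```python
from typing import Callable, Iterable, List, Tuple


def find_common_attributes(
    attrs_a: Iterable[str],
    attrs_b: Iterable[str],
    similarity: Callable[[str, str], float],
    first_threshold: float,
) -> List[Tuple[str, str]]:
    """Return attribute pairs whose semantic similarity exceeds the threshold."""
    attrs_b = list(attrs_b)  # allow repeated iteration over the second side
    pairs = []
    for a in attrs_a:
        for b in attrs_b:
            if similarity(a, b) > first_threshold:
                pairs.append((a, b))
    return pairs
```

For example, with a similarity function that scores ("singer", "vocalist") at 0.9 and every other pair at 0.1, a first threshold of 0.5 yields the single pair ("singer", "vocalist").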

When calculating the similarity value between the attribute values, the embodiment of the present invention may only calculate the semantic similarity value between the attribute values corresponding to the same pair of shared attributes, so as to improve the calculation efficiency of the similarity between the attribute values.

In addition, the similarity value between the first data and the second data may be determined according to the semantic similarity values between the attribute values corresponding to each pair of common attributes of the first data and the second data. Optionally, a weight value is calculated in advance for each pair of common attributes in the first data and the second data. Then, for each pair of common attributes, the semantic similarity value between the corresponding attribute values is multiplied by the weight value of that pair, and the products are accumulated to obtain the similarity value between the first data and the second data.
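
The weighted accumulation can be written as a short sketch. How the weight values are obtained is not covered by this passage, so the weights in the example are hypothetical inputs, and the name `record_similarity` is illustrative.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def record_similarity(value_sims: Dict[Pair, float], weights: Dict[Pair, float]) -> float:
    """Accumulate similarity * weight over every pair of common attributes."""
    return sum(value_sims[pair] * weights[pair] for pair in value_sims)


# Hypothetical similarities and weights for two pairs of common attributes.
sims = {("singer", "vocalist"): 0.9, ("album", "record"): 0.5}
weights = {("singer", "vocalist"): 0.6, ("album", "record"): 0.4}
total = record_similarity(sims, weights)  # 0.9 * 0.6 + 0.5 * 0.4
```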

In order to improve the efficiency of determining the similarity value between the first data and the second data, the embodiment of the present invention filters out in advance, from the determined common attributes, those common attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold. This improves the accuracy of the common attributes and also reduces their number, which improves the efficiency of determining the similarity value between the first data and the second data.
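
The filtering step can be sketched as a simple dictionary filter. The function name `filter_common_attributes` and the sample values are hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def filter_common_attributes(value_sims: Dict[Pair, float], third_threshold: float) -> Dict[Pair, float]:
    """Keep only the common-attribute pairs whose attribute values have a
    semantic similarity strictly greater than the third threshold."""
    return {pair: s for pair, s in value_sims.items() if s > third_threshold}


kept = filter_common_attributes(
    {("singer", "vocalist"): 0.9, ("album", "region"): 0.2},
    third_threshold=0.3,
)
```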

In addition, before the common attributes of the first data and the second data are determined, in order to improve the efficiency of determining them, the attributes corresponding to attribute values whose similarity value is greater than a preset fourth threshold are first determined. Semantic similarity values are then calculated only between those attributes, the semantic similarity values greater than the preset first threshold are determined, and the attributes corresponding to each such semantic similarity value are determined as a pair of common attributes of the first data and the second data.
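
This pre-selection can be sketched as follows, assuming each piece of data is represented as a dictionary from attribute to attribute value (a representation chosen for illustration). The name `candidate_attribute_pairs` and the `value_similarity` callback are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def candidate_attribute_pairs(
    record_a: Dict[str, str],
    record_b: Dict[str, str],
    value_similarity: Callable[[str, str], float],
    fourth_threshold: float,
) -> List[Tuple[str, str]]:
    """Return attribute pairs whose attribute values are similar enough to be
    worth an attribute-level semantic comparison afterwards."""
    return [
        (attr_a, attr_b)
        for attr_a, val_a in record_a.items()
        for attr_b, val_b in record_b.items()
        if value_similarity(val_a, val_b) > fourth_threshold
    ]
```

Only the pairs returned here would then be passed to the attribute-level semantic similarity calculation and compared against the first threshold.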

In addition, in the embodiment of the present invention, attributes that are synonyms of each other are directly determined as common attributes of the first data and the second data by querying a preset synonym database. Subsequently, semantic similarity values need to be calculated only between the attributes that were not already selected as common attributes through the synonym database, that is, between attributes that are not synonymous, thereby improving the efficiency of determining common attributes.
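
A minimal sketch of the synonym shortcut follows. The synonym database is modeled here as a set of unordered pairs (`frozenset`), which is an assumption made for illustration; the function name `split_by_synonyms` is hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def split_by_synonyms(
    attrs_a: Iterable[str],
    attrs_b: Iterable[str],
    synonym_db: Set[FrozenSet[str]],
) -> Tuple[List[Pair], List[Pair]]:
    """Return (pairs accepted via the synonym database,
    remaining pairs that still need a semantic similarity calculation)."""
    attrs_a, attrs_b = list(attrs_a), list(attrs_b)
    common = [(a, b) for a in attrs_a for b in attrs_b
              if frozenset((a, b)) in synonym_db]
    # Attributes already matched through synonyms are not compared again.
    matched_a = {a for a, _ in common}
    matched_b = {b for _, b in common}
    remaining = [(a, b)
                 for a in attrs_a if a not in matched_a
                 for b in attrs_b if b not in matched_b]
    return common, remaining
```

For attributes `["singer", "album"]` and `["vocalist", "record", "year"]` with a database containing only the pair {singer, vocalist}, the first list holds ("singer", "vocalist") and the second holds the two pairs that involve "album".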

The embodiment of the present invention further provides a method for calculating the semantic similarity value between attributes. Optionally, the semantic vector corresponding to each attribute is first obtained by using a preset word embedding model. The semantic similarity value between the semantic vectors corresponding to the attributes is then calculated; this value is the semantic similarity value between the respective attributes.

S304: If the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are merged.

After the similarity value between the first data and the second data is calculated, it is determined whether the similarity value is greater than a preset second threshold. If the similarity value is greater than the second threshold, the first data and the second data are fused; otherwise, the first data and the second data cannot be fused.
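
The threshold decision can be sketched as follows. The passage does not specify how two pieces of data are combined once fusion is allowed, so this sketch assumes a plain union of attribute-value pairs in which the first data wins on a conflicting attribute; the name `fuse_if_similar` is hypothetical.

```python
from typing import Dict, Optional


def fuse_if_similar(
    record_a: Dict[str, str],
    record_b: Dict[str, str],
    similarity: float,
    second_threshold: float,
) -> Optional[Dict[str, str]]:
    """Return the fused data if the similarity exceeds the second threshold,
    otherwise None (the two pieces of data are kept separate)."""
    if similarity > second_threshold:
        fused = dict(record_b)
        fused.update(record_a)  # assumption: the first data takes precedence
        return fused
    return None
```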

In the data fusion method provided by the embodiment of the present invention, first, an attribute value in the first data and the second data is extracted, where the first data and the second data include a correspondence between an attribute and an attribute value. Second, calculate the similarity value between each attribute value. Finally, a similarity value between the first data and the second data is determined according to a similarity value between the respective attribute values. And if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are merged. The embodiment of the present invention determines the similarity value between the first data and the second data by directly calculating the similarity value between the attribute values in the first data and the second data, thereby improving the efficiency of data fusion.

Further, the common attributes of the first data and the second data are determined based on semantic similarity values, the attribute values corresponding to the common attributes are then compared, and the similarity value between the first data and the second data is finally determined, so that the data fusion rate is improved while the accuracy of data fusion is ensured.

Embodiments of the present invention provide a data fusion apparatus including one or more processors and one or more memories storing program units, wherein the program units are executed by the processor. FIG. 4 is a schematic structural diagram of a data fusion device according to an embodiment of the present invention, where the device includes:

The extracting module 401 is configured to extract the attributes in the first data and the second data, wherein the first data and the second data include a correspondence between the attribute and the attribute value;

The first calculation module 402 is configured to calculate a semantic similarity value between the respective attributes;

The first determining module 403 is configured to determine a semantic similarity value that is greater than a preset first threshold, and determine an attribute corresponding to each of the semantic similarity values as a pair of common attributes of the first data and the second data;

The second determining module 404 is configured to determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes;

The fusion module 405 is configured to fuse the first data and the second data when a similarity value between the first data and the second data is greater than a preset second threshold.
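
How the five modules could cooperate is sketched here. Everything in the sketch is an illustrative assumption: the table `TOY_ATTRIBUTE_SIMILARITY` stands in for a word embedding model, attribute values are compared with `difflib.SequenceMatcher` rather than a semantic measure, all pairs of common attributes are weighted equally, and the default thresholds are arbitrary.

```python
from difflib import SequenceMatcher

# Hypothetical attribute-level semantic similarities.
TOY_ATTRIBUTE_SIMILARITY = {
    frozenset(("singer", "vocalist")): 0.9,
    frozenset(("album", "record")): 0.8,
}


def extract_attributes(record):  # role of extraction module 401
    return list(record.keys())


def attribute_similarity(a, b):  # role of first calculation module 402
    if a == b:
        return 1.0
    return TOY_ATTRIBUTE_SIMILARITY.get(frozenset((a, b)), 0.0)


def common_attribute_pairs(record_a, record_b, first_threshold):  # module 403
    return [(a, b)
            for a in extract_attributes(record_a)
            for b in extract_attributes(record_b)
            if attribute_similarity(a, b) > first_threshold]


def record_similarity(record_a, record_b, pairs):  # role of module 404
    if not pairs:
        return 0.0
    sims = [SequenceMatcher(None, record_a[a], record_b[b]).ratio()
            for a, b in pairs]
    return sum(sims) / len(sims)  # assumption: equal weights for all pairs


def fuse(record_a, record_b, first_threshold=0.5, second_threshold=0.7):  # module 405
    pairs = common_attribute_pairs(record_a, record_b, first_threshold)
    if record_similarity(record_a, record_b, pairs) > second_threshold:
        return {**record_b, **record_a}  # assumption: first data wins conflicts
    return None
```

Calling `fuse({"singer": "Andy Lau", "album": "Love"}, {"vocalist": "Andy Lau", "record": "Love", "year": "1990"})` returns one combined dictionary, while data whose shared attribute values differ strongly yields `None`.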

It should be noted that the foregoing extraction module 401, first calculation module 402, first determining module 403, second determining module 404, and fusion module 405 can run in a terminal as part of the device, and the functions implemented by the above modules can be performed by a processor in the terminal. The terminal may be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or the like.

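The second determining module includes:

A first computing submodule, configured to obtain, from the first data and the second data, the attribute values corresponding to each pair of shared attributes, and calculate a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;

A first determining submodule, configured to determine the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes.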

It should be noted that the foregoing first computing submodule and the first determining submodule may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.

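Optionally, the device further includes:

A second computing module, configured to calculate, in the first data and the second data, a weight value corresponding to each pair of shared attributes.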

It should be noted that the foregoing second computing module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.

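Correspondingly, the first determining submodule comprises:

An accumulation submodule, configured to accumulate the products of the semantic similarity value between the attribute values corresponding to each pair of shared attributes and the weight value corresponding to that pair of shared attributes, to obtain the similarity value between the first data and the second data.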

It should be noted here that the above-mentioned accumulation sub-module can be operated in the terminal as a part of the device, and the functions implemented by the above module can be executed by the processor in the terminal.

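In addition, the device further includes:

A screening module, configured to filter out, from the common attributes, the common attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.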

It should be noted here that the screening module can be operated in the terminal as part of the device, and the functions implemented by the above module can be performed by the processor in the terminal.

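The device also includes:

An acquisition module, configured to extract the attribute value corresponding to each attribute in the first data and the second data, and obtain the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.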

It should be noted that the foregoing acquisition module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.

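Correspondingly, the first calculation module comprises:

A second computing submodule, configured to calculate semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.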

It should be noted that the foregoing second computing sub-module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.

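The device also includes:

A third determining module, configured to determine, by querying a preset synonym database, attributes that are synonyms as a pair of common attributes of the first data and the second data.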

It should be noted that the foregoing third determining module may be operated in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.

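Correspondingly, the first calculation module comprises:

A third computing submodule, configured to calculate semantic similarity values between attributes that are not synonymous.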

It should be noted that the foregoing third computing sub-module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.

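Optionally, the first computing module includes:

An obtaining submodule, configured to separately obtain the semantic vector corresponding to each attribute by using a preset word embedding model;

A fourth computing submodule, configured to calculate the semantic similarity value between the semantic vectors corresponding to the attributes.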

It should be noted that the foregoing obtaining sub-module and the fourth computing sub-module may be run in the terminal as part of the device, and the functions implemented by the above-mentioned modules may be performed by a processor in the terminal.

It should be noted here that the above extraction module, calculation module, determination module and fusion module can be operated in the terminal as part of the device.

The data fusion device provided by the embodiment of the present invention can implement the following functions: extracting the attributes in the first data and the second data, wherein the first data and the second data include a correspondence between attributes and attribute values; calculating a semantic similarity value between the respective attributes, determining the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of common attributes of the first data and the second data; determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of common attributes; and, if the similarity value between the first data and the second data is greater than a preset second threshold, fusing the first data and the second data. The embodiment of the present invention determines the common attributes of the first data and the second data based on semantic similarity values, then compares the attribute values corresponding to the common attributes, and finally determines the similarity value between the first data and the second data. Compared with the related art, the embodiment of the present invention improves the data fusion rate while ensuring the accuracy of data fusion.

Correspondingly, an embodiment of the present invention further provides an electronic device, as shown in FIG. 5, which may include:

A processor 501, a memory 502, an input device 503, and an output device 504. The number of processors 501 in the electronic device may be one or more, and one processor is taken as an example in FIG. 5. In some embodiments of the present invention, the processor 501, the memory 502, the input device 503, and the output device 504 may be connected by a bus or by other means; connection by a bus is taken as an example in FIG. 5.

The memory 502 can be used to store computer programs and modules, such as the program instructions/modules corresponding to the data fusion method and device in the embodiments of the present invention. The processor 501 executes various functional applications and data processing by running the software programs and modules stored in the memory 502, that is, it implements the above data fusion method. The memory 502 can mainly include a storage program area and a storage data area, wherein the storage program area can store an operating system, an application required for at least one function, and the like. Moreover, the memory 502 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. In some examples, the memory 502 can further include memory remotely located relative to the processor 501, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 503 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function controls of the electronic device.

Specifically, in this embodiment, the processor 501 loads the executable files corresponding to the processes of one or more applications into the memory 502 according to the following instructions, and the processor 501 runs the applications stored in the memory 502 to implement various functions:

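Extracting attributes in the first data and the second data, wherein the first data and the second data include a correspondence between an attribute and an attribute value;

Calculating semantic similarity values between the attributes;

Determining a semantic similarity value that is greater than a preset first threshold, and determining an attribute corresponding to each such semantic similarity value as a pair of common attributes of the first data and the second data;

Determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of common attributes;

If the similarity value between the first data and the second data is greater than a preset second threshold, fusing the first data and the second data.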

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments, and details are not described herein again.

A person skilled in the art can understand that the structure shown in FIG. 5 is merely illustrative, and the electronic device can be a terminal device such as a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 5 does not limit the structure of the above electronic device. For example, the electronic device may also include more or fewer components (such as a network interface, display device, etc.) than shown in FIG. 5, or have a different configuration than that shown in FIG. 5.

A person of ordinary skill in the art may understand that all or part of the steps of the foregoing embodiments may be completed by a program instructing hardware related to the terminal device. The program may be stored in a computer readable storage medium, and the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Embodiments of the present invention also provide a storage medium. Optionally, in this embodiment, a computer program is stored in the storage medium, wherein the computer program is configured to perform the data fusion method when run.

Optionally, in this embodiment, the foregoing storage medium may be located on at least one of the plurality of network devices in the network shown in the foregoing embodiment.

Optionally, in the present embodiment, the storage medium is arranged to store program code for performing the following steps:

Extracting attributes in the first data and the second data, where the first data and the second data include a correspondence between the attribute and the attribute value;

Calculate semantic similarity values between attributes;

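Determining a semantic similarity value that is greater than a preset first threshold, and determining an attribute corresponding to each semantic similarity value as a pair of common attributes of the first data and the second data;

Determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of common attributes;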

If the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused.

Optionally, in this embodiment, the foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media that can store program code.

The data fusion method and apparatus, storage medium, and electronic apparatus according to the present invention are described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various improvements can be made to the data fusion method and apparatus, the storage medium, and the electronic apparatus proposed by the present invention without departing from the scope of the present invention. Therefore, the scope of the invention should be determined by the content of the appended claims.

For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the relevant description of the method embodiment. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without any creative effort.

It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Furthermore, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements that are inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

The data fusion method and device, the storage medium and the electronic device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application according to the idea of the present invention. In summary, the content of this description should not be construed as limiting the invention.

Industrial applicability

Attributes in the first data and the second data are extracted, wherein the first data and the second data include a correspondence between an attribute and an attribute value. A semantic similarity value between the respective attributes is calculated, the semantic similarity values greater than a preset first threshold are determined, and the attributes corresponding to each such semantic similarity value are determined as a pair of common attributes of the first data and the second data. A similarity value between the first data and the second data is determined by comparing the attribute values corresponding to each pair of common attributes, and if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused. The present invention determines the common attributes of the first data and the second data based on semantic similarity values, then compares the attribute values corresponding to the common attributes, and finally determines the similarity value between the first data and the second data. Compared with the related art, the embodiment of the present invention improves the data fusion rate while ensuring the accuracy of data fusion.

Claims

A data fusion method, the method comprising:

Extracting attributes in the first data and the second data, wherein the first data and the second data include a correspondence between an attribute and an attribute value;

Calculate semantic similarity values between attributes;

Determining a semantic similarity value that is greater than a preset first threshold, and determining an attribute corresponding to each of the semantic similarity values as a pair of common attributes of the first data and the second data;

Determining a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of common attributes;

And if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are merged.
The data fusion method according to claim 1, wherein the determining a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes comprises:

Obtaining, from the first data and the second data, an attribute value corresponding to each pair of shared attributes, and calculating a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;

And determining a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes.
The data fusion method according to claim 2, wherein the method further comprises:

In the first data and the second data, a weight value corresponding to each pair of common attributes is calculated.
The data fusion method according to claim 3, wherein the determining a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes comprises:

Accumulating the products of the semantic similarity value between the attribute values corresponding to each pair of common attributes and the weight value corresponding to that pair of common attributes, to obtain the similarity value between the first data and the second data.
The data fusion method according to claim 2, wherein before the determining a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes, the method further comprises:

Filtering out, from the common attributes, the common attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.
The data fusion method according to claim 1, wherein before the calculating the semantic similarity value between the attributes, the method further comprises:

Extracting the attribute value corresponding to each attribute in the first data and the second data, and acquiring the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.
The data fusion method according to claim 6, wherein said calculating a semantic similarity value between the respective attributes comprises:

Calculating a semantic similarity value between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
The data fusion method according to claim 1, wherein before the calculating the semantic similarity value between the attributes, the method further comprises:

Determining, by querying a preset synonym database, attributes that are synonyms as a pair of common attributes of the first data and the second data.
The data fusion method according to claim 8, wherein said calculating a semantic similarity value between the respective attributes comprises:

Calculating semantic similarity values between the attributes that are not synonymous.
The data fusion method according to claim 1, wherein said calculating a semantic similarity value between the respective attributes comprises:

Separately obtaining the semantic vector corresponding to each attribute by using a preset word embedding model;

Calculating the semantic similarity value between the semantic vectors corresponding to the attributes.
A data fusion method, wherein the method comprises:

Extracting attribute values in the first data and the second data, wherein the first data and the second data include a correspondence between an attribute and an attribute value;

Calculate the similarity value between each attribute value;

Determining a similarity value between the first data and the second data according to a similarity value between the respective attribute values;

And if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are merged.
The data fusion method according to claim 11, wherein before the calculating the similarity value between the respective attribute values, the method further comprises:

Extracting attributes in the first data and the second data;

Calculate semantic similarity values between attributes;

Determining a semantic similarity value greater than a preset first threshold, and determining an attribute corresponding to each of the semantic similarity values as a pair of shared attributes of the first data and the second data.
The data fusion method according to claim 12, wherein said calculating a similarity value between respective attribute values comprises:

Calculate the semantic similarity value between the attribute values corresponding to the same pair of common attributes.
The data fusion method according to claim 13, wherein the determining a similarity value between the first data and the second data according to a similarity value between the respective attribute values comprises:

And determining a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes.
The data fusion method according to claim 14, wherein the method further comprises:

In the first data and the second data, a weight value corresponding to each pair of common attributes is calculated.
The data fusion method according to claim 15, wherein the determining a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes comprises:

Accumulating the products of the semantic similarity value between the attribute values corresponding to each pair of common attributes and the weight value corresponding to that pair of common attributes, to obtain the similarity value between the first data and the second data.
The data fusion method according to claim 14, wherein before the determining a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes, the method further comprises:

Filtering out, from the common attributes, the common attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.
The data fusion method according to claim 12, wherein before the calculating the semantic similarity value between the attributes, the method further comprises:

Obtaining the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.
The data fusion method according to claim 18, wherein said calculating a semantic similarity value between the respective attributes comprises:

Calculating a semantic similarity value between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
The data fusion method according to claim 12, wherein before the calculating the semantic similarity value between the attributes, the method further comprises:

Determining, by querying a preset synonym database, attributes that are synonyms as a pair of common attributes of the first data and the second data.
The data fusion method according to claim 20, wherein said calculating a semantic similarity value between the respective attributes comprises:

Calculating semantic similarity values between the attributes that are not synonymous.
The data fusion method according to claim 12, wherein said calculating a semantic similarity value between the respective attributes comprises:

Separately obtaining the semantic vector corresponding to each attribute by using a preset word embedding model;

Calculating the semantic similarity value between the semantic vectors corresponding to the attributes.
The data fusion method according to claim 11, wherein said calculating a similarity value between respective attribute values comprises:

Calculating the string similarity value between the respective attribute values.
A data fusion device comprising one or more processors and one or more memories storing program units, wherein the program units are executed by the processor, the program units comprising:

An extraction module, configured to extract an attribute in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value;

a first calculation module configured to calculate a semantic similarity value between the respective attributes;

a first determining module, configured to determine a semantic similarity value that is greater than a preset first threshold, and determine an attribute corresponding to each of the semantic similarity values as a pair of common attributes of the first data and the second data;

a second determining module, configured to determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes;

The fusion module is configured to fuse the first data and the second data when a similarity value between the first data and the second data is greater than a preset second threshold.
The data fusion device of claim 24, wherein the second determining module comprises:

a first calculation submodule configured to obtain, from the first data and the second data, an attribute value corresponding to each pair of shared attributes, and calculate a semantic similarity value between the attribute values corresponding to the same pair of shared attributes ;

The first determining submodule is configured to determine a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes.
The data fusion device of claim 25, wherein the device further comprises:

a second calculation module, configured to calculate a weight value corresponding to each pair of common attributes in the first data and the second data.
The data fusion device of claim 26, wherein the first determining submodule comprises:

an accumulating submodule, configured to accumulate the products of the semantic similarity value between the attribute values corresponding to each pair of common attributes and the weight value corresponding to that pair of common attributes, to obtain the similarity value between the first data and the second data.
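The accumulation described above amounts to a weighted sum, which can be sketched as follows. The function name `weighted_similarity` and the example values are assumptions for illustration; the claim does not say how the weight values are derived, so they are treated here as given inputs.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def weighted_similarity(value_sims: Dict[Pair, float],
                        weights: Dict[Pair, float]) -> float:
    """Sum, over all common attribute pairs, of value similarity times weight."""
    return sum(value_sims[pair] * weights[pair] for pair in value_sims)


# Hypothetical inputs: two common attribute pairs with assumed weights.
value_sims = {("name", "full name"): 0.9, ("city", "location"): 0.5}
weights = {("name", "full name"): 0.7, ("city", "location"): 0.3}
score = weighted_similarity(value_sims, weights)  # 0.9 * 0.7 + 0.5 * 0.3
```

If the weights are normalized to sum to 1 and each value similarity lies in [0, 1], the result also lies in [0, 1], which makes it directly comparable with a fixed second threshold.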
The data fusion device of claim 25, wherein the device further comprises:

a screening module, configured to filter, from the common attributes, any pair of common attributes whose corresponding attribute values have a semantic similarity value that is not greater than a preset third threshold.
The data fusion device of claim 24, wherein the device further comprises:

an acquiring module, configured to extract the attribute value corresponding to each attribute in the first data and the second data, and obtain the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.
The data fusion device of claim 29, wherein the first calculation module comprises:

a second calculation submodule, configured to calculate a semantic similarity value between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
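The pre-selection by attribute value can be sketched as follows. This is an illustrative assumption rather than the claimed implementation: the claim does not state which similarity measure is applied to the attribute values at this stage, so a `difflib.SequenceMatcher` ratio is used, and the function name `candidate_attribute_pairs` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def candidate_attribute_pairs(first: Dict[str, str], second: Dict[str, str],
                              fourth_threshold: float) -> List[Tuple[str, str]]:
    """Keep only attribute pairs whose values are similar enough.

    Semantic similarity between attributes then only needs to be
    calculated for the pairs returned here.
    """
    candidates = []
    for attr_a, value_a in first.items():
        for attr_b, value_b in second.items():
            if SequenceMatcher(None, value_a, value_b).ratio() > fourth_threshold:
                candidates.append((attr_a, attr_b))
    return candidates


first = {"name": "Alice Zhang", "city": "Shenzhen"}
second = {"full name": "Alice Zhang", "location": "Shenzhen", "age": "30"}
candidates = candidate_attribute_pairs(first, second, 0.8)
```

In this toy example only two of the six possible attribute pairs survive the pre-selection, so the more expensive semantic comparison runs on a smaller set.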
The data fusion device of claim 24, wherein the device further comprises:

a third determining module, configured to determine, by querying a preset synonym database, attributes that are synonyms of each other as a pair of common attributes of the first data and the second data.
The data fusion device of claim 31, wherein the first calculation module comprises:

a third calculation submodule, configured to calculate semantic similarity values between attributes that are not synonyms.
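The synonym lookup and the restriction of the semantic calculation to non-synonymous attributes can be sketched as follows. The representation of the synonym database as a list of sets, the function name `split_by_synonym`, and the sample entries are assumptions for illustration only.

```python
from typing import List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def split_by_synonym(first_attrs: Sequence[str], second_attrs: Sequence[str],
                     synonym_groups: List[Set[str]]) -> Tuple[List[Pair], List[Pair]]:
    """Split attribute pairs into synonym pairs and remaining pairs.

    Synonym pairs are taken directly as common attributes; only the
    remaining pairs need a semantic similarity calculation.
    """
    common: List[Pair] = []
    remaining: List[Pair] = []
    for a in first_attrs:
        for b in second_attrs:
            if any(a in group and b in group for group in synonym_groups):
                common.append((a, b))
            else:
                remaining.append((a, b))
    return common, remaining


# Hypothetical synonym database with a single entry.
synonym_groups = [{"birthplace", "place of birth"}]
common, remaining = split_by_synonym(
    ["birthplace", "height"], ["place of birth", "stature"], synonym_groups)
```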
The data fusion device of claim 24, wherein the first calculation module comprises:

an obtaining submodule, configured to obtain a semantic vector corresponding to each attribute by using a preset word embedding model; and

a fourth calculation submodule, configured to calculate a semantic similarity value between the semantic vectors corresponding to the respective attributes.
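The two submodules above can be sketched as follows. This is a minimal illustration under stated assumptions: the claim does not name the similarity measure applied to the semantic vectors, so cosine similarity is assumed, and the three-dimensional vectors in `TOY_EMBEDDINGS` are invented values standing in for the output of a real word embedding model.

```python
import math
from typing import Dict, List

# Hypothetical embedding table; a preset word embedding model would
# normally supply these vectors.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "birthplace": [0.9, 0.1, 0.0],
    "hometown": [0.8, 0.2, 0.1],
    "height": [0.0, 0.1, 0.9],
}


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute_similarity(a: str, b: str,
                         embeddings: Dict[str, List[float]]) -> float:
    """Semantic similarity between two attributes via their semantic vectors."""
    return cosine_similarity(embeddings[a], embeddings[b])


close = attribute_similarity("birthplace", "hometown", TOY_EMBEDDINGS)
far = attribute_similarity("birthplace", "height", TOY_EMBEDDINGS)
```

With these toy vectors, "birthplace" scores much higher against "hometown" than against "height", so only the first pair would exceed a high first threshold.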
A data fusion device comprising one or more processors and one or more memories storing program units, wherein the program units are executed by the one or more processors, the program units comprising:

an extraction module, configured to extract attribute values from first data and second data, wherein the first data and the second data each include a correspondence between an attribute and an attribute value;

a calculation module configured to calculate a similarity value between respective attribute values;

a determining module, configured to determine a similarity value between the first data and the second data according to the similarity values between the respective attribute values; and

a fusion module, configured to fuse the first data and the second data when the similarity value between the first data and the second data is greater than a preset second threshold.
A storage medium, wherein a computer program is stored in the storage medium, the computer program being configured to perform, when run, the method of any one of claims 1 to 23.
An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to perform the method of any one of claims 1 to 23 by running the computer program.