WO2018166343A1 - Data fusion method and device, storage medium and electronic device - Google Patents

Data fusion method and device, storage medium and electronic device

Info

Publication number
WO2018166343A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
attribute
similarity value
attributes
value
Prior art date
Application number
PCT/CN2018/077184
Other languages
French (fr)
Chinese (zh)
Inventor
甘骏
苏可
饶孟良
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018166343A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • Embodiments of the present invention relate to the field of data processing, and in particular, to a data fusion method and apparatus, a storage medium, and an electronic device.
  • the embodiments of the present invention provide a data fusion method and device, a storage medium, and an electronic device, to at least solve the technical problem of low data fusion rate in the related art.
  • An embodiment of the present invention provides a data fusion method, where the method includes:
  • when the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are merged.
  • determining the similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes including:
  • the method further includes:
  • determining the similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes including:
  • for each pair of shared attributes, the semantic similarity value between the corresponding attribute values is multiplied by the weight value corresponding to that pair, and the products are accumulated to obtain the similarity value between the first data and the second data.
  • before determining the similarity value between the first data and the second data according to the semantic similarity value between the attribute values corresponding to each pair of shared attributes, the method further includes:
  • any common attribute whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold is filtered out of the common attributes.
  • the method further includes:
  • the calculating a semantic similarity value between the respective attributes includes:
  • the method further includes:
  • attributes that are synonyms of each other are determined as a pair of common attributes of the first data and the second data by querying a preset synonym database.
  • the calculating a semantic similarity value between the respective attributes includes:
  • the calculating a semantic similarity value between the respective attributes includes:
  • the embodiment of the invention further provides a data fusion method, the method comprising:
  • when the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are merged.
  • the method further includes:
  • the calculating a similarity value between each attribute value includes:
  • determining, according to the similarity value between the respective attribute values, a similarity value between the first data and the second data including:
  • the method further includes:
  • determining the similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes including:
  • for each pair of shared attributes, the semantic similarity value between the corresponding attribute values is multiplied by the weight value corresponding to that pair, and the products are accumulated to obtain the similarity value between the first data and the second data.
  • before determining the similarity value between the first data and the second data according to the semantic similarity value between the attribute values corresponding to each pair of shared attributes, the method further includes:
  • any common attribute whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold is filtered out of the common attributes.
  • the method further includes:
  • the calculating a semantic similarity value between the respective attributes includes:
  • the method further includes:
  • attributes that are synonyms of each other are determined as a pair of common attributes of the first data and the second data by querying a preset synonym database.
  • the calculating a semantic similarity value between the respective attributes includes:
  • the calculating a semantic similarity value between the respective attributes includes:
  • the calculating a similarity value between each attribute value includes:
  • Embodiments of the present invention also provide a data fusion device including one or more processors, and one or more memories storing program units, wherein the program units are executed by the processor, and the program units include:
  • An extraction module configured to extract an attribute in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value;
  • a first calculation module configured to calculate a semantic similarity value between the respective attributes
  • a first determining module configured to determine the semantic similarity values that are greater than a preset first threshold, and to determine the attributes corresponding to each such semantic similarity value as a pair of common attributes of the first data and the second data;
  • a second determining module configured to determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes
  • the fusion module is configured to fuse the first data and the second data when a similarity value between the first data and the second data is greater than a preset second threshold.
  • the second determining module includes:
  • a first calculation submodule configured to obtain, from the first data and the second data, an attribute value corresponding to each pair of shared attributes, and calculate a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;
  • the first determining submodule is configured to determine a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes.
  • the device further includes:
  • the second calculating module is configured to calculate, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
  • the first determining submodule includes:
  • the accumulating submodule is configured to accumulate the products of the semantic similarity value between the attribute values corresponding to each pair of shared attributes and the weight value corresponding to that pair, to obtain the similarity value between the first data and the second data.
  • the device further includes:
  • the screening module is configured to filter, from the common attribute, a common attribute corresponding to the attribute value whose semantic similarity value is not greater than a preset third threshold.
  • the device further includes:
  • an acquiring module configured to extract an attribute value corresponding to each attribute in the first data and the second data, and obtain an attribute corresponding to the attribute value whose similarity value is greater than a preset fourth threshold.
  • the first computing module includes:
  • the second calculation submodule is configured to calculate a semantic similarity value between the attributes corresponding to the attribute values whose similarity values are greater than a preset fourth threshold.
  • the device further includes:
  • the third determining module is configured to determine, by querying the preset synonym database, an attribute belonging to the synonym as a pair of common attributes of the first data and the second data.
  • the first computing module includes:
  • a third computing sub-module configured to calculate semantic similarity values between attributes that are not synonymous.
  • the first computing module includes:
  • an obtaining submodule configured to obtain a semantic vector corresponding to each attribute by using a preset word embedding model;
  • the fourth calculation sub-module is configured to calculate a semantic similarity value between the semantic vectors corresponding to the respective attributes.
  • Embodiments of the present invention also provide a data fusion device including one or more processors, and one or more memories storing program units, wherein the program units are executed by the processor, and the program units include:
  • An extraction module configured to extract an attribute value in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value;
  • a calculation module configured to calculate a similarity value between respective attribute values
  • a determining module configured to determine a similarity value between the first data and the second data according to a similarity value between the respective attribute values
  • the fusion module is configured to fuse the first data and the second data when a similarity value between the first data and the second data is greater than a preset second threshold.
  • the embodiment of the present invention further provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute the method described in the embodiment of the present invention.
  • An embodiment of the present invention further provides an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the method described in the embodiment of the present invention by using the computer program.
  • the present invention determines the common attribute of the first data and the second data based on the semantic similarity value, and then compares the similarity between the attribute values corresponding to the shared attribute, and finally determines the similarity value between the first data and the second data.
  • the embodiment of the present invention improves the data fusion rate under the premise of ensuring the accuracy of data fusion.
  • FIG. 1 is a flowchart of a data fusion method according to an embodiment of the present invention
  • FIG. 2 is a flowchart of another data fusion method according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of another data fusion method according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a data fusion device according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a portion of an electronic device according to an embodiment of the present invention.
  • Referring to FIG. 1, a flowchart of a data fusion method according to an embodiment of the present invention is shown.
  • S101 Extract an attribute in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value.
  • the first data and the second data in the embodiment of the present invention both include the correspondence between the attribute and the attribute value.
  • For example, the first data may include the correspondence "singer: Andy Lau", and the second data may include the correspondence "singer: Hua Tsai", where the two attribute names are different words that are both rendered here as "singer". The two attribute names are attributes, and Andy Lau and Hua Tsai are the attribute values corresponding to them, respectively.
  • each attribute included in the first data and the second data is first extracted.
  • After the attributes in the first data and the second data are extracted, the semantic similarity value between the attributes is calculated. By calculating semantic similarity values, the embodiment of the present invention can identify attributes that refer to essentially the same entity without requiring an exact string match, thereby avoiding the overly strict matching that results from fusing data based on string matching of features.
  • the semantic similarity between the attributes in the first data and the attributes in the second data can be calculated.
  • Specifically, a semantic vector corresponding to each attribute is obtained by using a preset word embedding model, and the semantic similarity value between the semantic vectors corresponding to the attributes is then calculated; this value is taken as the semantic similarity value between the respective attributes.
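The embedding-based comparison can be illustrated with a minimal sketch. The source names neither a specific word embedding model nor a specific vector similarity measure, so the cosine similarity, the three-dimensional toy vectors, and the attribute names `singer`, `vocalist` and `duration` below are all assumptions made for illustration.

```python
import math

# Hypothetical toy embeddings; a real system would look these up in a
# pretrained word embedding model instead of hard-coding them.
TOY_EMBEDDINGS = {
    "singer": [0.9, 0.1, 0.3],
    "vocalist": [0.8, 0.2, 0.4],
    "duration": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute_similarity(attr_a, attr_b, embeddings=TOY_EMBEDDINGS):
    """Semantic similarity of two attribute names via their embedding vectors."""
    return cosine_similarity(embeddings[attr_a], embeddings[attr_b])
```

With these toy vectors, `attribute_similarity("singer", "vocalist")` is much higher than `attribute_similarity("singer", "duration")`, which is the behaviour the later threshold step relies on.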
  • S103 Determine a semantic similarity value that is greater than a preset first threshold, and determine an attribute corresponding to each of the semantic similarity values as a pair of shared attributes of the first data and the second data.
  • The semantic similarity values that are greater than a preset first threshold are determined. The attributes corresponding to each semantic similarity value greater than the preset first threshold are then determined as a common attribute between the first data and the second data. That is, a pair of attributes with a sufficiently high semantic similarity value is determined as a common attribute between the first data and the second data.
  • For example, if the semantic similarity value between the two "singer" attributes is higher than the preset first threshold, the two attributes are determined as a pair of common attributes of the first data and the second data.
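Step S103 can be sketched as a threshold filter over all attribute pairs. This is only an illustration: the function name, the precomputed score table, and the threshold value 0.8 are hypothetical, and the similarity function is passed in so that any semantic measure could be used.

```python
def find_common_attribute_pairs(attrs_a, attrs_b, similarity, first_threshold):
    """Return attribute pairs whose semantic similarity exceeds the first threshold."""
    pairs = []
    for a in attrs_a:
        for b in attrs_b:
            if similarity(a, b) > first_threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical precomputed semantic similarity scores between attribute names.
TOY_SCORES = {
    ("singer", "vocalist"): 0.95,
    ("singer", "release_date"): 0.10,
    ("length", "vocalist"): 0.05,
    ("length", "release_date"): 0.30,
}


def toy_similarity(a, b):
    return TOY_SCORES.get((a, b), 0.0)


common = find_common_attribute_pairs(
    ["singer", "length"], ["vocalist", "release_date"],
    toy_similarity, first_threshold=0.8,
)
```

Only the pair `("singer", "vocalist")` survives the filter here, because it is the only pair whose score is above the assumed threshold.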
  • S104 Determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes.
  • The semantic similarity value or the string similarity value between the attribute values corresponding to each pair of common attributes may be calculated, and the similarity value between the first data and the second data is finally determined according to the similarity between the attribute values corresponding to each pair of shared attributes.
  • the similarity is a semantic similarity value between the attribute values corresponding to each pair of common attributes.
  • an attribute value corresponding to each pair of shared attributes is obtained, and a semantic similarity value between the attribute values corresponding to the same pair of shared attributes is calculated.
  • For example, for the attribute values Andy Lau and Hua Tsai corresponding to a pair of shared attributes, the semantic similarity value between Andy Lau and Hua Tsai is calculated.
  • a similarity value between the first data and the second data is determined according to a semantic similarity value between attribute values corresponding to each pair of shared attributes. That is, the degree of similarity between the first data and the second data depends on the similarity between the attribute values corresponding to the common attributes of the first data and the second data.
  • Before determining the similarity value between the first data and the second data, the embodiment of the present invention screens the determined common attributes: any common attribute whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold is excluded, so as to improve the accuracy of the shared attributes and reduce their number. That is, an attribute whose attribute values have a semantic similarity value not greater than the preset third threshold does not belong to the common attributes between the first data and the second data.
  • In other words, the common attributes between the first data and the second data are screened in advance to remove those that are not true common attributes, which improves the efficiency of the subsequent calculation of the similarity value between the first data and the second data.
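The screening against the third threshold might look like the following sketch. The function name, the toy value-similarity table, the spurious second pair, and the threshold 0.5 are assumptions for illustration only.

```python
def screen_common_pairs(common_pairs, first, second, value_similarity, third_threshold):
    """Keep only pairs whose attribute values are similar enough to each other."""
    kept = []
    for attr_a, attr_b in common_pairs:
        score = value_similarity(first[attr_a], second[attr_b])
        if score > third_threshold:
            kept.append((attr_a, attr_b))
    return kept


# Hypothetical semantic similarity scores between attribute values.
TOY_VALUE_SCORES = {("Andy Lau", "Hua Tsai"): 0.9, ("4 minutes", "1994"): 0.1}


def toy_value_similarity(a, b):
    return TOY_VALUE_SCORES.get((a, b), 0.0)


first = {"singer": "Andy Lau", "length": "4 minutes"}
second = {"vocalist": "Hua Tsai", "release_date": "1994"}
# Suppose an earlier step wrongly proposed ("length", "release_date") as well.
candidates = [("singer", "vocalist"), ("length", "release_date")]
screened = screen_common_pairs(candidates, first, second, toy_value_similarity, 0.5)
```

The spurious pair is dropped because its values score 0.1, which is not greater than the assumed third threshold.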
  • an embodiment of the present invention provides a method for determining a similarity value between a first data and a second data.
  • weight values corresponding to each pair of common attributes are calculated in the first data and the second data.
  • the weight value corresponding to each pair of common attributes in the first data and the second data is calculated by using a Term Frequency-Inverse Document Frequency (tf-idf) algorithm.
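The source does not say how tf-idf is applied to attributes, so the following is only one possible reading, stated as an assumption: each record is treated as a document and each attribute name as a term. Because an attribute name occurs at most once per record, the term frequency is 1 and the weight reduces to a smoothed inverse document frequency. The function name and the toy library (including the third record) are hypothetical.

```python
import math


def attribute_idf_weights(records):
    """Smoothed inverse document frequency of each attribute name across records."""
    n = len(records)
    document_frequency = {}
    for record in records:
        for attribute in record:
            document_frequency[attribute] = document_frequency.get(attribute, 0) + 1
    return {
        attribute: math.log((1 + n) / (1 + df)) + 1.0
        for attribute, df in document_frequency.items()
    }


# Hypothetical toy library of records.
library = [
    {"title": "Forget Love Water", "singer": "Andy Lau", "length": "4 minutes"},
    {"title": "Forget Love Water", "singer": "Andy Lau", "release_date": "1994"},
    {"title": "Another Song", "length": "3 minutes"},
]
weights = attribute_idf_weights(library)
```

Under this reading, an attribute that appears in every record (here `title`) gets the lowest weight, and rarer attributes get higher weights. A real system would likely normalise the weights so that they are comparable across attribute pairs.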
  • For example, suppose the two "singer" attributes form a pair of common attributes, the semantic similarity value between the corresponding attribute values Andy Lau and Hua Tsai is 90%, and the weight value corresponding to this pair of shared attributes is 0.6. The product of 90% and 0.6 for this pair is used as one addend of the accumulation. In the same way, the product for each pair of common attributes is obtained, and the products are accumulated to obtain the similarity value between the first data and the second data.
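The accumulation can be written as a weighted sum. In this sketch the first pair uses the figures from the example above (similarity 0.90, weight 0.6); the second pair (similarity 0.50, weight 0.4) is a hypothetical addition so that the sum has more than one term.

```python
def weighted_similarity(pair_scores):
    """Sum of (value similarity * weight) over all pairs of shared attributes."""
    return sum(similarity * weight for similarity, weight in pair_scores)


pair_scores = [
    (0.90, 0.6),  # the "singer" pair from the example
    (0.50, 0.4),  # hypothetical second pair of shared attributes
]
total = weighted_similarity(pair_scores)  # 0.54 + 0.20
```

The result is the similarity value between the first data and the second data, which is then compared with the second threshold.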
  • For example, the song library contains the song "Forget Love Water" from music application A, with several attributes, such as the singer Andy Lau and a song length of 4 minutes; the song library also stores the song "Forget Love Water" from music application B, with attributes including the singer Andy Lau and a release date of 1994.
  • If the similarity value between the two songs is greater than the second threshold, the system needs to fuse the two songs, that is, merge them into a single song "Forget Love Water" stored in the song library, where the merged song contains all the attributes of the two songs above; if the similarity value is not greater than the second threshold, the first data and the second data cannot be fused.
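The fusion decision for the two songs can be sketched as follows. The function name, the dictionary representation of a song, the similarity value 0.9, and the threshold 0.7 are assumptions; the source does not say how conflicting values for the same attribute are resolved, so keeping the first record's value is also an assumption.

```python
def fuse_if_similar(first, second, similarity, second_threshold):
    """Merge two records into one when their similarity exceeds the threshold."""
    if similarity <= second_threshold:
        return None  # the records cannot be fused
    merged = dict(second)
    merged.update(first)  # assumption: the first record wins on conflicts
    return merged


song_a = {"title": "Forget Love Water", "singer": "Andy Lau", "length": "4 minutes"}
song_b = {"title": "Forget Love Water", "singer": "Andy Lau", "release_date": "1994"}

fused = fuse_if_similar(song_a, song_b, similarity=0.9, second_threshold=0.7)
rejected = fuse_if_similar(song_a, song_b, similarity=0.5, second_threshold=0.7)
```

Here `fused` carries the title, singer, length and release date, that is, all attributes of both songs, while `rejected` is `None`.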
  • the embodiment of the present invention determines the common attribute of the first data and the second data based on the semantic similarity value, and then compares the similarity between the attribute values corresponding to the shared attribute, and finally determines the similarity value between the first data and the second data.
  • the embodiment of the present invention improves the data fusion rate under the premise of ensuring the accuracy of data fusion.
  • the embodiment of the present invention further provides a data fusion method.
  • Referring to FIG. 2, a flowchart of another data fusion method according to an embodiment of the present invention is shown.
  • S201 Extract an attribute in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value.
  • S202 Extract an attribute value corresponding to each attribute in the first data and the second data, and obtain an attribute corresponding to the attribute value whose similarity value is greater than a preset fourth threshold.
  • S203 Calculate a semantic similarity value between attributes corresponding to the attribute value whose similarity value is greater than a preset fourth threshold.
  • By calculating the similarity values between attribute values, the attributes that are more likely to be common attributes between the first data and the second data are selected, namely the attributes corresponding to attribute values whose similarity value is greater than the preset fourth threshold. The semantic similarity values are then calculated only between these attributes to determine the common attributes between the first data and the second data, which improves the efficiency of determining the common attributes.
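The value-based pre-selection of S202 could be sketched like this. The function name, the toy value-similarity table, the attribute names, and the threshold 0.5 are hypothetical; any string or semantic measure could be plugged in as `value_similarity`.

```python
def candidate_attribute_pairs(first, second, value_similarity, fourth_threshold):
    """Attribute pairs whose values are similar enough to be worth comparing."""
    candidates = []
    for attr_a, value_a in first.items():
        for attr_b, value_b in second.items():
            if value_similarity(value_a, value_b) > fourth_threshold:
                candidates.append((attr_a, attr_b))
    return candidates


# Hypothetical similarity scores between attribute values.
TOY_VALUE_SCORES = {("Andy Lau", "Hua Tsai"): 0.9}


def toy_value_similarity(a, b):
    return TOY_VALUE_SCORES.get((a, b), 0.0)


first = {"singer": "Andy Lau", "length": "4 minutes"}
second = {"vocalist": "Hua Tsai", "release_date": "1994"}
candidates = candidate_attribute_pairs(
    first, second, toy_value_similarity, fourth_threshold=0.5
)
```

Only the candidate pairs returned here would then go through the more expensive semantic comparison of attribute names in S203.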
  • In addition, before determining the common attributes of the first data and the second data, the embodiment of the present invention may determine attributes that are synonyms as common attributes of the first data and the second data by querying the preset synonym database. Further, before calculating the semantic similarity value between the attributes corresponding to attribute values whose similarity value is greater than the fourth threshold, the embodiment may filter out the attributes already determined as common attributes through the synonym lookup, and then determine the other common attributes between the first data and the second data, which also improves the efficiency of determining the common attributes.
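A sketch of the synonym shortcut follows. The source does not describe the structure of the synonym database, so representing it as a list of synonym sets is an assumption, as are the function name and the attribute names.

```python
def split_by_synonyms(attrs_a, attrs_b, synonym_groups):
    """Return (pairs matched via synonyms, pairs still needing semantic comparison)."""
    common = [
        (a, b)
        for a in attrs_a
        for b in attrs_b
        if any(a in group and b in group for group in synonym_groups)
    ]
    matched_a = {a for a, _ in common}
    matched_b = {b for _, b in common}
    # Attributes already matched through the synonym lookup are filtered out.
    remaining = [
        (a, b)
        for a in attrs_a
        for b in attrs_b
        if a not in matched_a and b not in matched_b
    ]
    return common, remaining


# Hypothetical synonym database: each set groups interchangeable attribute names.
SYNONYM_GROUPS = [{"singer", "vocalist", "performer"}]

common, remaining = split_by_synonyms(
    ["singer", "length"], ["vocalist", "release_date"], SYNONYM_GROUPS
)
```

In this example only one pair is left for the semantic similarity calculation, instead of the four pairs that a full comparison would require.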
  • S204 Determine a semantic similarity value that is greater than a preset first threshold, and determine an attribute corresponding to each of the semantic similarity values as a pair of shared attributes of the first data and the second data.
  • S205 Determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes.
  • S201 and S204-S206 in the embodiment of the present invention are the same as the above-described processes of S101 and S103-S105, and can be understood by referring to the above explanation.
  • In the data fusion method provided by this embodiment, calculating the similarity value between the attribute values in the first data and the second data selects the attributes that are more likely to be common attributes between the first data and the second data. In addition, the common attributes that are synonyms can be determined through the synonym database before the other common attributes between the first data and the second data are determined, thereby improving the efficiency of determining the common attributes in data fusion.
  • the embodiment of the present invention determines the common attribute of the first data and the second data based on the semantic similarity value, and then compares the similarity between the attribute values corresponding to the shared attribute, and finally determines the similarity value between the first data and the second data.
  • the embodiment of the present invention improves the data fusion rate under the premise of ensuring the accuracy of data fusion.
  • the embodiment of the present invention further provides a data fusion method.
  • Referring to FIG. 3, a flowchart of another data fusion method according to an embodiment of the present invention is shown.
  • the data fusion method includes:
  • S301 Extract an attribute value in the first data and the second data, where the first data and the second data include a correspondence between the attribute and the attribute value.
  • each attribute value in the first data and the second data is first extracted.
  • For example, the first data includes the correspondence between "singer" and Andy Lau, and the second data includes the correspondence between "singer" and Hua Tsai.
  • Andy Lau in the first data and Hua Tsai in the second data are attribute values.
  • semantic similarity values between individual attribute values can be calculated.
  • the embodiment of the present invention can also directly calculate the string similarity value between each attribute value.
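A string similarity value can be computed in several ways, and the source does not name one. As an assumed stand-in, this sketch uses the ratio from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def string_similarity(value_a, value_b):
    """Similarity in [0, 1] based on the matching blocks of the two strings."""
    return SequenceMatcher(None, value_a, value_b).ratio()


same = string_similarity("Andy Lau", "Andy Lau")      # identical strings
nickname = string_similarity("Andy Lau", "Hua Tsai")  # same person, different name
```

Identical strings score 1.0, while the nickname pair scores low even though both values refer to the same person. That gap is the reason a semantic similarity value can be the better choice for attribute values.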
  • To calculate the semantic similarity value between attribute values, a semantic vector corresponding to each attribute value may be obtained by using a preset word embedding model, and the semantic similarity value between the semantic vectors corresponding to the attribute values is then calculated; this value is the semantic similarity value between the attribute values.
  • S303 Determine a similarity value between the first data and the second data according to the similarity value between the respective attribute values.
  • the similarity value between the first data and the second data is determined according to the similarity value between the respective attribute values.
  • the attribute whose semantic similarity value is greater than the preset first threshold is determined as a common attribute of the first data and the second data.
  • the embodiment of the present invention may only calculate the semantic similarity value between the attribute values corresponding to the same pair of shared attributes, so as to improve the calculation efficiency of the similarity between the attribute values.
  • The similarity value between the first data and the second data may be determined according to the semantic similarity value between the attribute values corresponding to each pair of common attributes of the first data and the second data.
  • Specifically, the weight value of each pair of shared attributes in the first data and the second data is calculated in advance; then, the products of the semantic similarity value between the attribute values corresponding to each pair of shared attributes and the weight value corresponding to that pair are accumulated to obtain the similarity value between the first data and the second data.
  • The embodiment of the present invention filters out in advance, from the determined common attributes, any common attribute whose corresponding attribute values have a semantic similarity value not greater than the preset third threshold. This reduces the number of shared attributes and improves the efficiency of determining the similarity value between the first data and the second data.
  • Attributes that are synonyms are directly determined as common attributes of the first data and the second data by querying the preset synonym database; subsequently, after the attributes already determined as common attributes through the synonym lookup are filtered out, semantic similarity values only need to be calculated between attributes that are not synonyms, thereby improving the efficiency of determining common attributes.
  • the embodiment of the present invention further provides a method for calculating a semantic similarity value between each attribute.
  • the semantic vector corresponding to each attribute is separately obtained by using a preset word embedding model.
  • the semantic similarity value between the semantic vectors corresponding to each attribute is calculated, that is, the semantic similarity value between the respective attributes.
  • If the similarity value between the first data and the second data is greater than the preset second threshold, the first data and the second data are fused; otherwise, the first data and the second data cannot be fused.
  • an attribute value in the first data and the second data is extracted, where the first data and the second data include a correspondence between an attribute and an attribute value.
  • a similarity value between the first data and the second data is determined according to a similarity value between the respective attribute values.
  • when the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are merged.
  • the embodiment of the present invention determines the similarity value between the first data and the second data by directly calculating the similarity value between the attribute values in the first data and the second data, thereby improving the efficiency of data fusion.
  • Embodiments of the present invention provide a data fusion apparatus including one or more processors and one or more memories storing program units, wherein the program units are executed by the processor.
  • FIG. 4 is a schematic structural diagram of a data fusion device according to an embodiment of the present invention, where the device includes:
  • the extracting module 401 is configured to extract the attributes in the first data and the second data, wherein the first data and the second data include a correspondence between the attribute and the attribute value;
  • the first calculation module 402 is configured to calculate a semantic similarity value between the respective attributes
  • the first determining module 403 is configured to determine the semantic similarity values that are greater than a preset first threshold, and to determine the attributes corresponding to each such semantic similarity value as a pair of common attributes of the first data and the second data;
  • the second determining module 404 is configured to determine a similarity value between the first data and the second data by comparing attribute values corresponding to each pair of shared attributes;
  • the fusion module 405 is configured to fuse the first data and the second data when a similarity value between the first data and the second data is greater than a preset second threshold.
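To show how modules 401 to 405 fit together, here is a rough end-to-end sketch. It is not the device described in the source: the class name, the equal weighting of attribute pairs (instead of calculated weight values), the toy attribute similarity, the string-based value similarity, and both thresholds are assumptions.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """String similarity in [0, 1], used here as the value similarity."""
    return SequenceMatcher(None, a, b).ratio()


class DataFusionDevice:
    """Hypothetical composition of modules 401 to 405 as plain methods."""

    def __init__(self, attribute_similarity, value_similarity,
                 first_threshold, second_threshold):
        self.attribute_similarity = attribute_similarity
        self.value_similarity = value_similarity
        self.first_threshold = first_threshold
        self.second_threshold = second_threshold

    def extract_attributes(self, record):  # extraction module 401
        return list(record.keys())

    def common_pairs(self, first, second):  # modules 402 and 403
        return [
            (a, b)
            for a in self.extract_attributes(first)
            for b in self.extract_attributes(second)
            if self.attribute_similarity(a, b) > self.first_threshold
        ]

    def record_similarity(self, first, second, pairs):  # module 404
        if not pairs:
            return 0.0
        scores = [self.value_similarity(first[a], second[b]) for a, b in pairs]
        return sum(scores) / len(scores)  # assumption: equal weights

    def fuse(self, first, second):  # fusion module 405
        pairs = self.common_pairs(first, second)
        if self.record_similarity(first, second, pairs) <= self.second_threshold:
            return None
        matched_in_second = {b for _, b in pairs}
        merged = dict(first)
        for attribute, value in second.items():
            if attribute not in matched_in_second:
                merged.setdefault(attribute, value)
        return merged


def toy_attribute_similarity(a, b):
    """Hypothetical semantic similarity between attribute names."""
    if a == b:
        return 1.0
    return 0.95 if {a, b} == {"singer", "vocalist"} else 0.1


device = DataFusionDevice(toy_attribute_similarity, string_similarity, 0.8, 0.7)
song_a = {"title": "Forget Love Water", "singer": "Andy Lau", "length": "4 minutes"}
song_b = {"title": "Forget Love Water", "vocalist": "Andy Lau", "release_date": "1994"}
fused = device.fuse(song_a, song_b)
```

In this sketch the fused record keeps the first record's attribute name for each matched pair and adds only the unmatched attributes of the second record, so `vocalist` is folded into `singer`.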
  • The terminal may also be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or the like.
  • the second determining module includes:
  • a first calculation submodule configured to obtain, from the first data and the second data, an attribute value corresponding to each pair of shared attributes, and calculate a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;
  • the first determining submodule is configured to determine a similarity value between the first data and the second data according to a semantic similarity value between attribute values corresponding to each pair of shared attributes.
  • The first calculation submodule and the first determining submodule described above may run in the terminal as part of the device, and the functions implemented by these modules may be performed by a processor in the terminal.
  • the device further includes:
  • the second calculating module is configured to calculate, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
  • the foregoing second computing module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.
  • the first determining submodule comprises:
  • the accumulating submodule is configured to accumulate the products of the semantic similarity value between the attribute values corresponding to each pair of shared attributes and the weight value corresponding to that pair, to obtain the similarity value between the first data and the second data.
  • the above-mentioned accumulation sub-module can be operated in the terminal as a part of the device, and the functions implemented by the above module can be executed by the processor in the terminal.
  • the device further includes:
  • the screening module is configured to filter out, from the shared attributes, any pair of shared attributes whose attribute values have a semantic similarity value not greater than a preset third threshold.
  • the screening module can be operated in the terminal as part of the device, and the functions implemented by the above module can be performed by the processor in the terminal.
  • the device also includes:
  • an acquiring module configured to extract an attribute value corresponding to each attribute in the first data and the second data, and obtain an attribute corresponding to the attribute value whose similarity value is greater than a preset fourth threshold.
  • the foregoing acquisition module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.
  • the first calculation module comprises:
  • the second calculation submodule is configured to calculate a semantic similarity value between the attributes corresponding to the attribute values whose similarity values are greater than a preset fourth threshold.
  • the foregoing second computing sub-module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.
  • the device also includes:
  • the third determining module is configured to determine, by querying the preset synonym database, an attribute belonging to the synonym as a pair of common attributes of the first data and the second data.
  • the foregoing third determining module may be operated in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.
  • the first calculation module comprises:
  • a third computing sub-module configured to calculate semantic similarity values between attributes that are not synonymous.
  • the foregoing third computing sub-module may be run in the terminal as part of the device, and the functions implemented by the above module may be performed by a processor in the terminal.
  • the first computing module includes:
  • an obtaining sub-module configured to obtain a semantic vector corresponding to each attribute by using a preset word embedding model;
  • the fourth calculation sub-module is configured to calculate a semantic similarity value between the semantic vectors corresponding to the respective attributes.
  • the foregoing obtaining sub-module and the fourth computing sub-module may be run in the terminal as part of the device, and the functions implemented by the above-mentioned modules may be performed by a processor in the terminal.
  • Embodiments of the present invention also provide a data fusion device including one or more processors, and one or more memories storing program units, wherein the program units are executed by the processor, and the program units include:
  • An extraction module configured to extract an attribute value in the first data and the second data, where the first data and the second data include a correspondence between an attribute and an attribute value;
  • a calculation module configured to calculate a similarity value between respective attribute values
  • a determining module configured to determine a similarity value between the first data and the second data according to a similarity value between the respective attribute values
  • the fusion module is configured to fuse the first data and the second data when a similarity value between the first data and the second data is greater than a preset second threshold.
  • the foregoing extraction module, calculation module, determining module, and fusion module may run in the terminal as part of the device.
  • the data fusion device can implement the following functions: extracting attributes from the first data and the second data, wherein the first data and the second data include a correspondence between attributes and attribute values; calculating semantic similarity values between the attributes, determining the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data; determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes; and, if the similarity value between the first data and the second data is greater than a preset second threshold, fusing the first data and the second data.
  • the embodiment of the present invention determines the shared attributes of the first data and the second data based on semantic similarity values, then compares the similarity between the attribute values corresponding to the shared attributes, and finally determines the similarity value between the first data and the second data.
  • the embodiment of the present invention improves the data fusion rate under the premise of ensuring the accuracy of data fusion.
  • an embodiment of the present invention further provides an electronic device, as shown in FIG. 5, which may include:
  • the number of processors 501 in the browser server may be one or more, and one processor is taken as an example in FIG. 5.
  • the processor 501, the memory 502, the input device 503, and the output device 504 may be connected by a bus or by other means; a bus connection is taken as an example in FIG. 5.
  • the memory 502 can be used to store a computer program and a module, such as a data fusion method and a program instruction/module corresponding to the device in the embodiment of the present invention.
  • the processor 501 performs various functional applications and data processing by running the software programs and modules stored in the memory 502, thereby implementing the above data fusion method.
  • the memory 502 can mainly include a storage program area and a storage data area, wherein the storage program area can store an operating system, an application required for at least one function, and the like.
  • memory 502 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
  • memory 502 can further include memory remotely located relative to processor 501, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Input device 503 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function controls of the browser server.
  • the processor 501 loads the executable files corresponding to the processes of one or more applications into the memory 502 according to the following instructions, and runs the applications stored in the memory 502 to implement various functions:
  • if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused.
  • the structure shown in FIG. 5 is merely illustrative, and the electronic device can be a terminal device such as a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD.
  • Fig. 5 does not limit the structure of the above electronic device.
  • the electronic device may also include more or fewer components (such as a network interface, display device, etc.) than shown in FIG. 5, or have a different configuration than that shown in FIG.
  • Embodiments of the present invention also provide a storage medium.
  • a computer program is stored in the storage medium, wherein the computer program is configured to execute, when run, the data fusion method described in the embodiments of the present invention.
  • the foregoing storage medium may be located on at least one of the plurality of network devices in the network shown in the foregoing embodiment.
  • the storage medium is arranged to store program code for performing the following steps:
  • if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused.
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
  • since the device embodiment basically corresponds to the method embodiment, reference may be made to the relevant parts of the description of the method embodiment.
  • the device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
  • calculating semantic similarity values between the attributes, determining the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data; determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes; and, if the similarity value between the first data and the second data is greater than a preset second threshold, fusing the first data and the second data.
  • the present invention determines the common attribute of the first data and the second data based on the semantic similarity value, and then compares the similarity between the attribute values corresponding to the shared attribute, and finally determines the similarity value between the first data and the second data.
  • the embodiment of the present invention improves the data fusion rate under the premise of ensuring the accuracy of data fusion.

Abstract

Disclosed are a data fusion method and device, a storage medium and an electronic device. The method comprises: extracting attributes in first data and second data, wherein the first data and the second data comprise a correlation between an attribute and an attribute value; calculating a semantic similarity value between various attributes, determining a semantic similarity value greater than a pre-set first threshold, and determining an attribute corresponding to each semantic similarity value as a pair of common attributes of the first data and the second data; and by comparing the attribute values corresponding to each pair of common attributes, determining a similarity value between the first data and the second data, and if the similarity value between the first data and the second data is greater than a pre-set second threshold, then fusing the first data and the second data. By means of the embodiments of the present invention, the data fusion rate is improved on the premise of ensuring the accuracy of data fusion.

Description

Data fusion method and device, storage medium and electronic device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 13, 2017, with priority number 2017101459761 and the invention title "Data fusion method and device", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of data processing, and in particular to a data fusion method and device, a storage medium, and an electronic device.
Background
At present, in the process of data fusion, it is first necessary to determine whether the data can be fused, which usually means determining whether the features contained in the data can be fused. The existing approach compares the features contained in the data on a string basis and completes data fusion accordingly. However, strict string-based matching of features leads to a low data fusion rate. In other words, data that could in fact be fused is left unfused.
Summary of the Invention
In view of this, embodiments of the present invention provide a data fusion method and device, a storage medium, and an electronic device, to at least solve the technical problem of a low data fusion rate in the related art.
An embodiment of the present invention provides a data fusion method, the method including:
extracting attributes from first data and second data, wherein the first data and the second data include a correspondence between attributes and attribute values;
calculating semantic similarity values between the attributes;
determining the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data;
determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes;
if the similarity value between the first data and the second data is greater than a preset second threshold, fusing the first data and the second data.
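The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: records are modelled as Python dictionaries, `similarity` is a character-based placeholder from the standard library standing in for a real semantic measure, the record-level score is a plain mean, and fusion is a dictionary merge. All function names and threshold defaults are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Placeholder score in [0, 1]; an assumption standing in for a semantic measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def shared_attribute_pairs(first: dict, second: dict, first_threshold: float) -> list:
    """Pair up attributes whose name similarity exceeds the first threshold."""
    pairs = []
    for attr_a in first:
        for attr_b in second:
            if similarity(attr_a, attr_b) > first_threshold:
                pairs.append((attr_a, attr_b))
    return pairs


def record_similarity(first: dict, second: dict, pairs: list) -> float:
    """Mean similarity of the values of each shared pair (the mean is an assumption)."""
    if not pairs:
        return 0.0
    scores = [similarity(str(first[a]), str(second[b])) for a, b in pairs]
    return sum(scores) / len(scores)


def fuse_if_similar(first: dict, second: dict,
                    first_threshold: float = 0.8,
                    second_threshold: float = 0.8):
    """Return the merged record, or None when the records are not similar enough."""
    pairs = shared_attribute_pairs(first, second, first_threshold)
    if record_similarity(first, second, pairs) > second_threshold:
        merged = dict(second)
        merged.update(first)  # values from the first record win on key clashes
        return merged
    return None
```

For two records that agree on `name` and `phone`, `fuse_if_similar` returns one dictionary holding the union of their attributes; for records about different people it returns `None`.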
Optionally, determining the similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes includes:
obtaining, from the first data and the second data, the attribute values corresponding to each pair of shared attributes, and calculating a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;
determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes.
Optionally, the method further includes:
calculating, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
Optionally, determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes includes:
accumulating the products of the semantic similarity value between the attribute values corresponding to each pair of shared attributes and the weight value corresponding to that pair of shared attributes, to obtain the similarity value between the first data and the second data.
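The weighted accumulation can be written in a few lines. In this sketch the two dictionaries keyed by attribute pair, and the function name, are assumptions made for illustration; the text does not say how the weight values themselves are computed, so they are taken as given inputs.

```python
def weighted_record_similarity(value_similarities: dict, weights: dict) -> float:
    """Sum, over all shared attribute pairs, of value similarity times pair weight."""
    return sum(value_similarities[pair] * weights[pair]
               for pair in value_similarities)


# Hypothetical inputs: two shared pairs with their value similarities and weights.
sims = {("name", "full_name"): 0.9, ("tel", "phone"): 0.5}
wts = {("name", "full_name"): 0.6, ("tel", "phone"): 0.4}
score = weighted_record_similarity(sims, wts)  # 0.9 * 0.6 + 0.5 * 0.4
```

If the weights sum to 1, the result stays in the same range as the individual similarity values, which makes it directly comparable with the second threshold.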
Optionally, before determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes, the method further includes:
filtering out, from the shared attributes, any pair of shared attributes whose attribute values have a semantic similarity value not greater than a preset third threshold.
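A possible reading of this screening step, again with a hypothetical dictionary keyed by attribute pair: pairs whose value similarity does not exceed the third threshold are dropped before the record-level score is computed.

```python
def screen_shared_pairs(value_similarities: dict, third_threshold: float) -> dict:
    """Keep only the shared pairs whose value similarity is strictly above the threshold."""
    return {pair: score
            for pair, score in value_similarities.items()
            if score > third_threshold}
```

A pair whose score equals the threshold is removed, matching the "not greater than" wording of the step.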
Optionally, before calculating the semantic similarity values between the attributes, the method further includes:
extracting the attribute values corresponding to the attributes in the first data and the second data, and obtaining the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.
Optionally, calculating the semantic similarity values between the attributes includes:
calculating the semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
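One way to picture this pre-filtering: only attribute pairs whose values already look alike become candidates for the semantic comparison of attribute names. The sketch below assumes dictionary records and uses `difflib.SequenceMatcher` as the value similarity; both choices, and the function name, are illustrative assumptions rather than something the text prescribes.

```python
from difflib import SequenceMatcher


def candidate_attribute_pairs(first: dict, second: dict, fourth_threshold: float) -> list:
    """Return attribute pairs whose values are more similar than the fourth threshold."""
    candidates = []
    for attr_a, value_a in first.items():
        for attr_b, value_b in second.items():
            ratio = SequenceMatcher(None, str(value_a), str(value_b)).ratio()
            if ratio > fourth_threshold:
                candidates.append((attr_a, attr_b))
    return candidates
```

Semantic similarity between attribute names then needs to be computed only for the returned candidates, which reduces the number of name comparisons.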
Optionally, before calculating the semantic similarity values between the attributes, the method further includes:
determining, by querying a preset synonym database, attributes that are synonyms as a pair of shared attributes of the first data and the second data.
Optionally, calculating the semantic similarity values between the attributes includes:
calculating the semantic similarity values between attributes that are not synonyms.
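The synonym lookup can be sketched as a split of all attribute pairs into those settled by the synonym database and those left for semantic comparison. The representation of the database as a list of `frozenset` groups and the function name are assumptions made for this example.

```python
def split_by_synonyms(attrs_a: list, attrs_b: list, synonym_sets: list):
    """Return (shared pairs found by synonym lookup, pairs still to be compared)."""
    shared, remaining = [], []
    for a in attrs_a:
        for b in attrs_b:
            if any(a in group and b in group for group in synonym_sets):
                shared.append((a, b))
            else:
                remaining.append((a, b))
    return shared, remaining


# Hypothetical synonym database with a single group.
synonyms = [frozenset({"phone", "telephone", "tel"})]
shared, remaining = split_by_synonyms(["tel", "name"], ["phone", "city"], synonyms)
```

Only the pairs in `remaining` would go on to the semantic similarity calculation.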
Optionally, calculating the semantic similarity values between the attributes includes:
obtaining a semantic vector corresponding to each attribute by using a preset word embedding model;
calculating the semantic similarity values between the semantic vectors corresponding to the attributes.
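As an illustration of comparing semantic vectors, the sketch below uses cosine similarity, which is a common choice but an assumption here, since the text does not name a specific measure. The three-dimensional vectors are made-up toy values, not the output of any real word embedding model.

```python
import math

# Toy vectors standing in for the output of a word embedding model (hypothetical values).
TOY_EMBEDDINGS = {
    "phone": [0.9, 0.1, 0.0],
    "telephone": [0.8, 0.2, 0.1],
    "address": [0.0, 0.2, 0.9],
}


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def attribute_similarity(attr_a: str, attr_b: str, embeddings: dict) -> float:
    """Semantic similarity of two attribute names via their embedding vectors."""
    return cosine_similarity(embeddings[attr_a], embeddings[attr_b])
```

With these toy vectors, "phone" and "telephone" score close to 1 while "phone" and "address" score close to 0, so only the first pair would pass a first threshold such as 0.8.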
An embodiment of the present invention further provides a data fusion method, the method including:
extracting attribute values from first data and second data, wherein the first data and the second data include a correspondence between attributes and attribute values;
calculating similarity values between the attribute values;
determining a similarity value between the first data and the second data according to the similarity values between the attribute values;
if the similarity value between the first data and the second data is greater than a preset second threshold, fusing the first data and the second data.
Optionally, before calculating the similarity values between the attribute values, the method further includes:
extracting attributes from the first data and the second data;
calculating semantic similarity values between the attributes;
determining the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data.
Optionally, calculating the similarity values between the attribute values includes:
calculating a semantic similarity value between the attribute values corresponding to the same pair of shared attributes.
Optionally, determining the similarity value between the first data and the second data according to the similarity values between the attribute values includes:
determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes.
Optionally, the method further includes:
calculating, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
Optionally, determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes includes:
accumulating the products of the semantic similarity value between the attribute values corresponding to each pair of shared attributes and the weight value corresponding to that pair of shared attributes, to obtain the similarity value between the first data and the second data.
Optionally, before determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes, the method further includes:
filtering out, from the shared attributes, any pair of shared attributes whose attribute values have a semantic similarity value not greater than a preset third threshold.
Optionally, before calculating the semantic similarity values between the attributes, the method further includes:
obtaining the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.
Optionally, calculating the semantic similarity values between the attributes includes:
calculating the semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
Optionally, before calculating the semantic similarity values between the attributes, the method further includes:
determining, by querying a preset synonym database, attributes that are synonyms as a pair of shared attributes of the first data and the second data.
Optionally, calculating the semantic similarity values between the attributes includes:
calculating the semantic similarity values between attributes that are not synonyms.
Optionally, calculating the semantic similarity values between the attributes includes:
obtaining a semantic vector corresponding to each attribute by using a preset word embedding model;
calculating the semantic similarity values between the semantic vectors corresponding to the attributes.
Optionally, calculating the similarity values between the attribute values includes:
calculating string similarity values between the attribute values.
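The text does not say which string similarity measure is meant. One common option, shown here purely as an assumption, is a normalised Levenshtein (edit) distance: the number of single-character edits divided by the length of the longer string, subtracted from 1. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A character-level score like this tolerates small spelling differences between attribute values, but unlike a semantic measure it cannot recognise two differently worded values with the same meaning.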
An embodiment of the present invention further provides a data fusion device, including one or more processors and one or more memories storing program units, wherein the program units are executed by the processors, and the program units include:
an extraction module, configured to extract attributes from first data and second data, wherein the first data and the second data include a correspondence between attributes and attribute values;
a first calculation module, configured to calculate semantic similarity values between the attributes;
a first determining module, configured to determine the semantic similarity values greater than a preset first threshold, and determine the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data;
a second determining module, configured to determine a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes;
a fusion module, configured to fuse the first data and the second data when the similarity value between the first data and the second data is greater than a preset second threshold.
Optionally, the second determining module includes:
a first calculation submodule, configured to obtain, from the first data and the second data, the attribute values corresponding to each pair of shared attributes, and calculate a semantic similarity value between the attribute values corresponding to the same pair of shared attributes;
a first determining submodule, configured to determine the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes.
Optionally, the device further includes:
a second calculation module, configured to calculate, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
Optionally, the first determining submodule includes:
an accumulating submodule, configured to accumulate the products of the semantic similarity value between the attribute values corresponding to each pair of shared attributes and the weight value corresponding to that pair of shared attributes, to obtain the similarity value between the first data and the second data.
Optionally, the device further includes:
a screening module, configured to filter out, from the shared attributes, any pair of shared attributes whose attribute values have a semantic similarity value not greater than a preset third threshold.
Optionally, the device further includes:
an obtaining module, configured to extract the attribute values corresponding to the attributes in the first data and the second data, and obtain the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.
Optionally, the first calculation module includes:
a second calculation submodule, configured to calculate the semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
Optionally, the device further includes:
a third determining module, configured to determine, by querying a preset synonym database, attributes that are synonyms as a pair of shared attributes of the first data and the second data.
Optionally, the first calculation module includes:
a third calculation submodule, configured to calculate the semantic similarity values between attributes that are not synonyms.
Optionally, the first calculation module includes:
an obtaining submodule, configured to obtain a semantic vector corresponding to each attribute by using a preset word embedding model;
a fourth calculation submodule, configured to calculate the semantic similarity values between the semantic vectors corresponding to the attributes.
An embodiment of the present invention further provides a data fusion device, including one or more processors and one or more memories storing program units, wherein the program units are executed by the processors, and the program units include:
an extraction module, configured to extract attribute values from first data and second data, wherein the first data and the second data include a correspondence between attributes and attribute values;
a calculation module, configured to calculate similarity values between the attribute values;
a determining module, configured to determine a similarity value between the first data and the second data according to the similarity values between the attribute values;
a fusion module, configured to fuse the first data and the second data when the similarity value between the first data and the second data is greater than a preset second threshold.
An embodiment of the present invention further provides a storage medium, wherein a computer program is stored in the storage medium, and the computer program is configured to execute, when run, the method described in the embodiments of the present invention.
An embodiment of the present invention further provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute, by means of the computer program, the method described in the embodiments of the present invention.
In the data fusion method provided by the embodiments of the present invention, first, attributes are extracted from first data and second data, wherein the first data and the second data include a correspondence between attributes and attribute values. Second, semantic similarity values between the attributes are calculated, the semantic similarity values greater than a preset first threshold are determined, and the attributes corresponding to each such semantic similarity value are determined as a pair of shared attributes of the first data and the second data. Finally, a similarity value between the first data and the second data is determined by comparing the attribute values corresponding to each pair of shared attributes, and if the similarity value between the first data and the second data is greater than a preset second threshold, the first data and the second data are fused. The present invention determines the shared attributes of the first data and the second data based on semantic similarity values, then compares the similarity between the attribute values corresponding to the shared attributes, and finally determines the similarity value between the first data and the second data. Compared with the related art, the embodiments of the present invention improve the data fusion rate while ensuring the accuracy of data fusion.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a data fusion method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another data fusion method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another data fusion method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data fusion device according to an embodiment of the present invention; and
FIG. 5 is a partial schematic structural diagram of an electronic device according to an embodiment of the present invention.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
An embodiment of the present invention provides a data fusion method. Referring to FIG. 1, which is a flowchart of a data fusion method according to an embodiment of the present invention, the method may specifically include:
S101: Extract the attributes in first data and second data, where the first data and the second data each include correspondences between attributes and attribute values.
In this embodiment, both the first data and the second data include correspondences between attributes and attribute values. For example, the first data may include the correspondence performer-Andy Lau (演唱者-刘德华), and the second data may include the correspondence singer-Hua Zai (歌手-华仔). Here, "performer" and "singer" are attributes, and "Andy Lau" and "Hua Zai" are the attribute values corresponding to "performer" and "singer", respectively.
In this embodiment, before the first data and the second data are fused, it must first be determined whether they can be fused. In practice, the first step is to extract every attribute included in the first data and the second data.
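Step S101 can be pictured with a minimal Python sketch. It assumes each piece of data is held as a dictionary mapping attribute names to attribute values; the record contents and the helper name `extract_attributes` are illustrative only and do not come from the application.

```python
# Hypothetical records: each maps an attribute name to its attribute value.
first_data = {"performer": "Andy Lau", "title": "忘情水"}
second_data = {"singer": "Hua Zai", "title": "忘情水"}


def extract_attributes(record):
    """Return the attribute names of a record, in insertion order."""
    return list(record.keys())


first_attributes = extract_attributes(first_data)
second_attributes = extract_attributes(second_data)
```

Keeping the attribute names separate from their values mirrors the text: the names are compared first (S102, S103), and the values only later (S104).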
S102: Calculate the semantic similarity values between the attributes.
In this embodiment, after the attributes in the first data and the second data have been extracted, the semantic similarity values between the attributes are calculated. Using semantic similarity values, this embodiment can identify attributes that in substance refer to the same entity without requiring an exact string match, and thus avoids the low data fusion rate caused by strict string-based matching of features.
In practice, the semantic similarity between each attribute in the first data and each attribute in the second data can be calculated. In one implementation, a preset word embedding model is first used to obtain the semantic vector corresponding to each attribute; then the semantic similarity values between these semantic vectors are calculated and taken as the semantic similarity values between the attributes.
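The embedding-based implementation can be sketched as follows. The application does not name a particular vector similarity measure, so cosine similarity is an assumption here; the tiny `EMBEDDINGS` table stands in for a preset word embedding model, and its numbers are invented for illustration.

```python
import math

# Toy embedding table standing in for a preset word embedding model
# (the vectors are made-up values, not output of a real model).
EMBEDDINGS = {
    "performer": [0.9, 0.1, 0.3],
    "singer": [0.8, 0.2, 0.3],
    "duration": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute_similarity(attr_a, attr_b, embeddings=EMBEDDINGS):
    """Semantic similarity of two attribute names via their vectors."""
    return cosine_similarity(embeddings[attr_a], embeddings[attr_b])
```

With these toy vectors, "performer" scores much closer to "singer" than to "duration", which is the behaviour the step relies on.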
S103: Identify the semantic similarity values greater than a preset first threshold, and determine the attributes corresponding to each such semantic similarity value to be a pair of shared attributes of the first data and the second data.
In this embodiment, after the semantic similarity values between the attributes have been calculated, the semantic similarity values greater than the preset first threshold are identified. The attributes corresponding to each of these semantic similarity values are then determined to be shared attributes of the first data and the second data. In other words, a pair of attributes with a sufficiently high semantic similarity value is treated as a pair of shared attributes of the first data and the second data.
For example, if "performer" in the first data and "singer" in the second data have a semantic similarity value higher than the preset first threshold, then "performer" and "singer" are determined to be a pair of shared attributes of the first data and the second data.
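The thresholding in S103 amounts to a pairwise filter. The sketch below assumes the similarity function is supplied by the caller; the score table, the threshold of 0.8 and the function names are hypothetical.

```python
def find_shared_attributes(first_attrs, second_attrs, similarity, first_threshold):
    """Return every (attr_a, attr_b) pair whose similarity exceeds the threshold."""
    shared = []
    for attr_a in first_attrs:
        for attr_b in second_attrs:
            if similarity(attr_a, attr_b) > first_threshold:
                shared.append((attr_a, attr_b))
    return shared


# Hypothetical precomputed semantic similarity values between attribute names.
SCORES = {
    ("performer", "singer"): 0.95,
    ("performer", "duration"): 0.20,
    ("title", "singer"): 0.10,
    ("title", "duration"): 0.15,
}


def lookup_similarity(attr_a, attr_b):
    return SCORES.get((attr_a, attr_b), 0.0)


pairs = find_shared_attributes(
    ["performer", "title"], ["singer", "duration"], lookup_similarity, 0.8
)
```

Only the performer/singer pair clears the assumed threshold, so it is the only pair carried forward to S104.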
S104: Determine the similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes.
In this embodiment, after the shared attributes of the first data and the second data have been determined, the similarity between the attribute values corresponding to each pair of shared attributes is compared. Optionally, this similarity may be a semantic similarity value or a string similarity value between the two attribute values. The similarity value between the first data and the second data is then determined from the similarities between the attribute values of all pairs of shared attributes.
In one implementation, the similarity is the semantic similarity value between the attribute values corresponding to each pair of shared attributes. First, the attribute values corresponding to each pair of shared attributes are obtained from the first data and the second data, and the semantic similarity value between the attribute values of the same pair is calculated. For example, for the shared attribute pair "performer" and "singer", the corresponding attribute values "Andy Lau" and "Hua Zai" are obtained, and the semantic similarity value between "Andy Lau" and "Hua Zai" is calculated. Second, the similarity value between the first data and the second data is determined from the semantic similarity values between the attribute values of each pair of shared attributes. That is, the similarity between the first data and the second data depends on the similarity between the attribute values corresponding to their shared attributes.
To make the calculation of the similarity value between the first data and the second data more efficient, this embodiment, before determining that similarity value, identifies the shared attributes whose attribute values have a semantic similarity value not greater than a preset third threshold and removes them. This improves the accuracy of the shared attributes and reduces their number. In other words, a pair of attributes whose attribute values have a semantic similarity value not greater than the preset third threshold is not regarded as a shared attribute of the first data and the second data. By screening out pairs that are not genuinely shared in advance, this embodiment speeds up the subsequent calculation of the similarity value between the first data and the second data.
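The third-threshold screening can be sketched as a second filter over the shared attribute pairs. The value similarity function is left to the caller; the exact-match function, the records and the threshold below are stand-ins chosen only to keep the example small.

```python
def prune_shared_attributes(shared_pairs, first, second, value_similarity, third_threshold):
    """Keep only the pairs whose attribute values are similar enough."""
    kept = []
    for attr_a, attr_b in shared_pairs:
        if value_similarity(first[attr_a], second[attr_b]) > third_threshold:
            kept.append((attr_a, attr_b))
    return kept


def exact_match(value_a, value_b):
    # Stand-in for a semantic similarity measure between attribute values.
    return 1.0 if value_a == value_b else 0.0


first = {"performer": "Andy Lau", "title": "Song A"}
second = {"singer": "Andy Lau", "name": "Song B"}
kept = prune_shared_attributes(
    [("performer", "singer"), ("title", "name")], first, second, exact_match, 0.5
)
```

Pairs dropped here no longer contribute terms to the record-level similarity, which is where the efficiency gain described above would come from.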
In addition, an embodiment of the present invention provides a method for determining the similarity value between the first data and the second data. First, the weight value of each pair of shared attributes in the first data and the second data is calculated. Optionally, the term frequency-inverse document frequency (tf-idf) algorithm is used to calculate these weight values. Second, for each pair of shared attributes, the semantic similarity value between the corresponding attribute values is multiplied by the weight value of that pair, and the products are summed to obtain the similarity value between the first data and the second data.
For example, suppose that for the shared attribute pair "performer" and "singer" the semantic similarity value between the corresponding attribute values "Andy Lau" and "Hua Zai" is 90%, and the weight value of this pair is 0.6. The product of 90% and 0.6 is then one term of the sum. The products for the remaining pairs of shared attributes are obtained in the same way and added up, yielding the similarity value between the first data and the second data.
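The weighted accumulation can be written out directly. The first pair below uses the figures from the example (0.9 and 0.6); the second pair (1.0 and 0.4) is an invented value added only so that the sum has more than one term, and the weights are taken as given rather than computed with tf-idf.

```python
def record_similarity(pair_scores):
    """Sum of value_similarity * weight over all pairs of shared attributes."""
    return sum(similarity * weight for similarity, weight in pair_scores)


# (semantic similarity between attribute values, weight of the attribute pair)
pair_scores = [
    (0.9, 0.6),  # performer / singer, figures from the example in the text
    (1.0, 0.4),  # hypothetical second pair, e.g. identical song titles
]
score = record_similarity(pair_scores)
```

Under these assumptions the first pair contributes 0.54 and the second 0.4, so the similarity value between the two records is about 0.94.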
S105: If the similarity value between the first data and the second data is greater than a preset second threshold, fuse the first data and the second data.
In this embodiment, after the similarity value between the first data and the second data has been calculated, it is determined whether the similarity value is greater than the preset second threshold. If it is, the first data and the second data are fused. Fusion may consist of merging and de-duplicating first data and second data that refer to the same entity, so that only data referring to different entities is retained. For example, a song library stores the song 《忘情水》 obtained from music application A, with attributes such as singer Andy Lau and a duration of 4 minutes; the library also stores the song 《忘情水》 obtained from music application B, with attributes such as singer Andy Lau and a release year of 1994. Because the two entries are in substance the same song, the system needs to fuse them to avoid song query errors: they are merged into a single entry 《忘情水》 in the song library, and the fused entry contains all attributes of both original entries. If the similarity value is not greater than the second threshold, the first data and the second data cannot be fused.
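The song-library example can be sketched as a dictionary merge guarded by the second threshold. The record layout, the attribute names and the numeric values are assumptions; the sketch keeps the first record's value when both records carry the same attribute, which is one possible de-duplication policy rather than the one prescribed by the application.

```python
def fuse(first, second, similarity_value, second_threshold):
    """Merge two records if they are similar enough, otherwise return None."""
    if similarity_value <= second_threshold:
        return None
    merged = dict(first)
    for attribute, value in second.items():
        # Attributes already present in the first record are not duplicated.
        merged.setdefault(attribute, value)
    return merged


song_from_a = {"title": "忘情水", "singer": "刘德华", "duration": "4 minutes"}
song_from_b = {"title": "忘情水", "singer": "刘德华", "release_year": "1994"}

fused = fuse(song_from_a, song_from_b, similarity_value=0.94, second_threshold=0.8)
not_fused = fuse(song_from_a, song_from_b, similarity_value=0.30, second_threshold=0.8)
```

The fused entry carries the title, singer, duration and release year, matching the statement that the merged song contains all attributes of both entries. This sketch does not collapse differently named shared attributes such as "performer" and "singer" into one.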
In the data fusion method provided by this embodiment, attributes are first extracted from the first data and the second data, where the first data and the second data each include correspondences between attributes and attribute values. Next, semantic similarity values between the attributes are calculated, the semantic similarity values greater than the preset first threshold are identified, and the attributes corresponding to each such semantic similarity value are determined to be a pair of shared attributes of the first data and the second data. Finally, the similarity value between the first data and the second data is determined by comparing the attribute values corresponding to each pair of shared attributes; if this similarity value is greater than the preset second threshold, the first data and the second data are fused. This embodiment determines the shared attributes of the first data and the second data based on semantic similarity values, then compares the attribute values corresponding to the shared attributes, and thereby determines the similarity value between the first data and the second data. Compared with the related art, this embodiment improves the data fusion rate while preserving the accuracy of data fusion.
An embodiment of the present invention further provides a data fusion method. Referring to FIG. 2, which is a flowchart of another data fusion method according to an embodiment of the present invention, the data fusion method specifically includes:
S201: Extract the attributes in first data and second data, where the first data and the second data each include correspondences between attributes and attribute values.
S202: Extract the attribute values corresponding to the attributes in the first data and the second data, and obtain the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold.
S203: Calculate the semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
In this embodiment, the similarity values between the attribute values in the first data and the second data are calculated, the attribute values whose similarity value is greater than the preset fourth threshold are identified, and the attributes corresponding to those attribute values are obtained. In other words, by calculating similarity values between attribute values, this embodiment selects the attributes that are more likely to be shared attributes of the first data and the second data, namely the attributes whose attribute values have a similarity value greater than the preset fourth threshold. The semantic similarity values are then calculated only between these attributes in order to determine the shared attributes of the first data and the second data, which makes determining the shared attributes more efficient.
In addition, before determining the shared attributes of the first data and the second data, this embodiment may query a preset thesaurus and determine in advance that attributes which are synonyms are shared attributes of the first data and the second data. Further, before calculating the semantic similarity values between the attributes whose attribute values have a similarity value greater than the fourth threshold, this embodiment may exclude the attributes already determined to be shared attributes through the thesaurus, and then determine the remaining shared attributes of the first data and the second data. This also makes determining the shared attributes more efficient.
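The two shortcuts of this embodiment, the thesaurus lookup and the fourth-threshold pre-screening on attribute values, can be combined in one pass. In this sketch the thesaurus is a hypothetical set of synonym pairs, and `difflib.SequenceMatcher.ratio()` from the Python standard library is used as the attribute value similarity; the application leaves that measure open, so this choice is an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical preset thesaurus: each entry is an unordered pair of synonyms.
SYNONYMS = {frozenset({"performer", "singer"})}


def value_similarity(value_a, value_b):
    """String similarity in [0, 1] between two attribute values."""
    return SequenceMatcher(None, value_a, value_b).ratio()


def candidate_pairs(first, second, fourth_threshold):
    """Split attribute pairs into thesaurus matches and pre-screened candidates."""
    synonym_pairs, candidates = [], []
    for attr_a, value_a in first.items():
        for attr_b, value_b in second.items():
            if frozenset({attr_a, attr_b}) in SYNONYMS:
                # Already a shared attribute; no semantic comparison needed.
                synonym_pairs.append((attr_a, attr_b))
            elif value_similarity(value_a, value_b) > fourth_threshold:
                # Worth a semantic comparison of the attribute names (S203).
                candidates.append((attr_a, attr_b))
    return synonym_pairs, candidates


first = {"performer": "Andy Lau", "length": "4 minutes"}
second = {"singer": "Andy Lau", "duration": "4 min", "year": "1994"}
synonym_pairs, candidates = candidate_pairs(first, second, fourth_threshold=0.6)
```

Of the six possible attribute pairs, one is settled by the thesaurus and only one remains for the semantic comparison in S203, which illustrates the efficiency argument made above.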
S204: Identify the semantic similarity values greater than a preset first threshold, and determine the attributes corresponding to each such semantic similarity value to be a pair of shared attributes of the first data and the second data.
S205: Determine the similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes.
S206: If the similarity value between the first data and the second data is greater than a preset second threshold, fuse the first data and the second data.
S201 and S204-S206 in this embodiment are performed in the same way as S101 and S103-S105 above, and can be understood with reference to the explanations given there.
In the data fusion method provided by this embodiment, the similarity values between the attribute values in the first data and the second data are calculated in order to select the attributes that are more likely to be shared attributes of the first data and the second data. Attributes that are synonyms can also be determined to be shared attributes by means of a thesaurus, after which the remaining shared attributes of the first data and the second data are determined. This makes determining the shared attributes during data fusion more efficient.
In addition, this embodiment determines the shared attributes of the first data and the second data based on semantic similarity values, then compares the attribute values corresponding to the shared attributes, and thereby determines the similarity value between the first data and the second data. Compared with the related art, this embodiment improves the data fusion rate while preserving the accuracy of data fusion.
An embodiment of the present invention further provides a data fusion method. Referring to FIG. 3, which is a flowchart of another data fusion method according to an embodiment of the present invention, the data fusion method includes:
S301: Extract the attribute values in first data and second data, where the first data and the second data each include correspondences between attributes and attribute values.
In this embodiment, the attribute values in the first data and the second data are extracted first. For example, the first data includes the correspondence performer-Andy Lau (演唱者-刘德华) and the second data includes the correspondence singer-Hua Zai (歌手-华仔), where "Andy Lau" in the first data and "Hua Zai" in the second data are attribute values.
S302: Calculate the similarity values between the attribute values.
In this embodiment, after the attribute values in the first data and the second data have been extracted, the similarity values between the attribute values are calculated, for example the similarity value between the attribute values "Andy Lau" and "Hua Zai".
In practice, the semantic similarity values between the attribute values can be calculated. To improve accuracy, this embodiment may instead directly calculate the string similarity values between the attribute values.
In one implementation, the semantic similarity values between the attribute values are calculated by using a preset word embedding model to obtain the semantic vector corresponding to each attribute value, and then calculating the semantic similarity values between these semantic vectors, which are taken as the semantic similarity values between the attribute values.
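As a counterpart to the embedding-based approach, the string similarity option for S302 could be sketched with a normalized edit distance. The application does not specify which string similarity measure is used, so Levenshtein distance and the normalization by the longer length are assumptions, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this particular measure, 刘德华 and 华仔 score 0.0 even though both refer to the same person, whereas identical strings score 1.0. A string measure is therefore strict, while pairing such values relies on the semantic similarity described above.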
S303: Determine the similarity value between the first data and the second data according to the similarity values between the attribute values.
In this embodiment, after the similarity values between the attribute values have been calculated, the similarity value between the first data and the second data is determined from them.
In one implementation, before the similarity values between the attribute values are calculated, the attributes in the first data and the second data are extracted and the semantic similarity values between the attributes are calculated, so as to determine the shared attributes of the first data and the second data. Optionally, attributes whose semantic similarity value is greater than a preset first threshold are determined to be shared attributes of the first data and the second data.
When calculating the similarity values between the attribute values, this embodiment may calculate only the semantic similarity values between the attribute values corresponding to the same pair of shared attributes, which makes the calculation of the similarity between attribute values more efficient.
In addition, the similarity value between the first data and the second data may be determined from the semantic similarity values between the attribute values corresponding to each pair of shared attributes of the first data and the second data. Optionally, the weight value of each pair of shared attributes in the first data and the second data is calculated in advance; then, for each pair of shared attributes, the semantic similarity value between the corresponding attribute values is multiplied by the weight value of that pair, and the products are summed to obtain the similarity value between the first data and the second data.
To determine the similarity value between the first data and the second data more efficiently, this embodiment removes in advance, from the determined shared attributes, those whose attribute values have a semantic similarity value not greater than a preset third threshold. This improves the accuracy of the shared attributes, reduces their number, and makes determining the similarity value between the first data and the second data more efficient.
In addition, to determine the shared attributes more efficiently, before the shared attributes of the first data and the second data are determined, the attributes corresponding to the attribute values whose similarity value is greater than a preset fourth threshold are identified first. Among the semantic similarity values between these attributes, those greater than the preset first threshold are identified, and the attributes corresponding to them are determined to be shared attributes of the first data and the second data.
In addition, this embodiment may query a preset thesaurus and directly determine in advance that attributes which are synonyms are shared attributes of the first data and the second data. Afterwards, the attributes already determined to be shared attributes through the thesaurus are excluded, and semantic similarity values are calculated only between attributes that are not synonyms, which makes determining the shared attributes more efficient.
An embodiment of the present invention further provides a method for calculating the semantic similarity values between the attributes. Optionally, a preset word embedding model is first used to obtain the semantic vector corresponding to each attribute. Then the semantic similarity values between these semantic vectors are calculated and taken as the semantic similarity values between the attributes.
S304: If the similarity value between the first data and the second data is greater than a preset second threshold, fuse the first data and the second data.
In this embodiment, after the similarity value between the first data and the second data has been calculated, it is determined whether the similarity value is greater than the preset second threshold. If it is, the first data and the second data are fused; otherwise, the first data and the second data cannot be fused.
In the data fusion method provided by this embodiment, the attribute values in the first data and the second data are extracted first, where the first data and the second data each include correspondences between attributes and attribute values. Next, the similarity values between the attribute values are calculated. Finally, the similarity value between the first data and the second data is determined according to the similarity values between the attribute values. If the similarity value between the first data and the second data is greater than the preset second threshold, the first data and the second data are fused. By directly calculating the similarity values between the attribute values in the first data and the second data to determine the similarity value between the first data and the second data, this embodiment improves the efficiency of data fusion.
Further, the shared attributes of the first data and the second data are determined based on semantic similarity values, the attribute values corresponding to the shared attributes are then compared, and the similarity value between the first data and the second data is thereby determined, which improves the data fusion rate while preserving the accuracy of data fusion.
An embodiment of the present invention provides a data fusion device, including one or more processors and one or more memories storing program units, where the program units are executed by the processors. Referring to FIG. 4, which is a schematic structural diagram of a data fusion device according to an embodiment of the present invention, the device includes:
an extraction module 401, configured to extract the attributes in first data and second data, where the first data and the second data each include correspondences between attributes and attribute values;
a first calculation module 402, configured to calculate the semantic similarity values between the attributes;
a first determining module 403, configured to identify the semantic similarity values greater than a preset first threshold, and determine the attributes corresponding to each such semantic similarity value to be a pair of shared attributes of the first data and the second data;
a second determining module 404, configured to determine the similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes; and
a fusion module 405, configured to fuse the first data and the second data when the similarity value between the first data and the second data is greater than a preset second threshold.
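The division of labour among modules 401-405 can be pictured as one class whose methods correspond to the modules. This is only a structural sketch: the class name, the injected similarity functions, the equal weighting of shared attribute pairs and the toy thresholds are all assumptions, not details taken from the application.

```python
class DataFusionDevice:
    """Illustrative counterpart of modules 401-405; all names are hypothetical."""

    def __init__(self, attribute_similarity, value_similarity,
                 first_threshold, second_threshold):
        self.attribute_similarity = attribute_similarity
        self.value_similarity = value_similarity
        self.first_threshold = first_threshold
        self.second_threshold = second_threshold

    def extract(self, first, second):
        # Extraction module 401: attribute names of both records.
        return list(first), list(second)

    def shared_attributes(self, first_attrs, second_attrs):
        # First calculation module 402 and first determining module 403.
        return [(a, b) for a in first_attrs for b in second_attrs
                if self.attribute_similarity(a, b) > self.first_threshold]

    def record_similarity(self, first, second, shared):
        # Second determining module 404; equal weights are an assumption.
        if not shared:
            return 0.0
        total = sum(self.value_similarity(first[a], second[b]) for a, b in shared)
        return total / len(shared)

    def fuse(self, first, second):
        # Fusion module 405: merge only above the second threshold.
        first_attrs, second_attrs = self.extract(first, second)
        shared = self.shared_attributes(first_attrs, second_attrs)
        if self.record_similarity(first, second, shared) <= self.second_threshold:
            return None
        merged = dict(first)
        for attribute, value in second.items():
            merged.setdefault(attribute, value)
        return merged


def toy_attribute_similarity(a, b):
    # Stand-in for a semantic measure between attribute names.
    return 1.0 if {a, b} <= {"performer", "singer"} or a == b else 0.0


def toy_value_similarity(a, b):
    # Stand-in for a semantic measure between attribute values.
    return 1.0 if a == b else 0.0


device = DataFusionDevice(toy_attribute_similarity, toy_value_similarity, 0.8, 0.5)
fused = device.fuse(
    {"title": "忘情水", "performer": "刘德华"},
    {"title": "忘情水", "singer": "刘德华", "year": "1994"},
)
```

Passing the similarity functions in as parameters keeps the sketch independent of any particular word embedding model or string measure.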
It should be noted that the extraction module 401, the first calculation module 402, the first determining module 403, the second determining module 404, and the fusion module 405 may run in a terminal as part of the apparatus, and the functions implemented by these modules may be performed by a processor in the terminal. The terminal may be a smartphone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or another terminal device.
The second determining module includes:
a first calculation submodule, configured to obtain, from the first data and the second data, the attribute values corresponding to each pair of shared attributes, and to calculate a semantic similarity value between the attribute values corresponding to the same pair of shared attributes; and
a first determining submodule, configured to determine the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes.
It should be noted that the first calculation submodule and the first determining submodule may run in a terminal as part of the apparatus, and the functions implemented by these modules may be performed by a processor in the terminal.
Optionally, the apparatus further includes:
a second calculation module, configured to calculate, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
It should be noted that the second calculation module may run in a terminal as part of the apparatus, and the functions implemented by this module may be performed by a processor in the terminal.
Correspondingly, the first determining submodule includes:
an accumulation submodule, configured to multiply the semantic similarity value between the attribute values corresponding to each pair of shared attributes by the weight value corresponding to that pair, and to sum the products to obtain the similarity value between the first data and the second data.
It should be noted that the accumulation submodule may run in a terminal as part of the apparatus, and the functions implemented by this module may be performed by a processor in the terminal.
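The weighted accumulation performed by the accumulation submodule amounts to a sum of products. The sketch below is a minimal illustration under stated assumptions: the per-pair value similarities and weights are taken as already computed and are passed in as dictionaries keyed by attribute pair, and the function name is hypothetical. How the weight values themselves are calculated is not shown here.

```python
def weighted_record_similarity(value_similarities, weights):
    """Sum, over the shared attribute pairs, of value similarity times pair weight."""
    return sum(
        similarity * weights[pair]
        for pair, similarity in value_similarities.items()
    )


# Hypothetical inputs: two shared attribute pairs with their similarities and weights.
similarities = {("name", "title"): 1.0, ("director", "directed by"): 0.5}
pair_weights = {("name", "title"): 0.75, ("director", "directed by"): 0.25}
score = weighted_record_similarity(similarities, pair_weights)  # 1.0*0.75 + 0.5*0.25
```

A pair with a larger weight contributes more to the final similarity value, so an identifying attribute such as a name can dominate less informative ones.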
In addition, the apparatus further includes:
a screening module, configured to remove, from the shared attributes, any pair of shared attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.
It should be noted that the screening module may run in a terminal as part of the apparatus, and the functions implemented by this module may be performed by a processor in the terminal.
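The screening step can be illustrated as a simple filter. This is a sketch under assumptions: the value similarities for each shared attribute pair are already available in a dictionary, and the function name and threshold are hypothetical.

```python
def screen_shared_attributes(value_similarities, third_threshold):
    """Drop shared attribute pairs whose value similarity is not greater than the threshold."""
    return {
        pair: similarity
        for pair, similarity in value_similarities.items()
        if similarity > third_threshold
    }
```

A pair whose similarity is exactly equal to the third threshold is removed, matching the "not greater than" wording above.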
The apparatus further includes:
an obtaining module, configured to extract the attribute values corresponding to the attributes in the first data and the second data, and to obtain the attributes corresponding to attribute values whose similarity value is greater than a preset fourth threshold.
It should be noted that the obtaining module may run in a terminal as part of the apparatus, and the functions implemented by this module may be performed by a processor in the terminal.
Correspondingly, the first calculation module includes:
a second calculation submodule, configured to calculate semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
It should be noted that the second calculation submodule may run in a terminal as part of the apparatus, and the functions implemented by this module may be performed by a processor in the terminal.
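This pre-filter narrows the attribute pairs that need a semantic comparison to those whose values already look alike. The sketch below is one possible reading, not the embodiment's implementation: it assumes dictionary records, uses the `difflib` character-level ratio as a stand-in for the value similarity measure, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def value_similarity(a: str, b: str) -> float:
    """Stand-in value similarity: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_attribute_pairs(first, second, fourth_threshold):
    """Keep attribute pairs whose values have similarity above the fourth threshold."""
    return [
        (a, b)
        for (a, value_a), (b, value_b) in product(first.items(), second.items())
        if value_similarity(value_a, value_b) > fourth_threshold
    ]
```

Only the returned pairs would then be passed to the semantic similarity calculation, which reduces the number of attribute comparisons.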
The apparatus further includes:
a third determining module, configured to query a preset synonym dictionary and to determine attributes that are synonyms of each other as a pair of shared attributes of the first data and the second data.
It should be noted that the third determining module may run in a terminal as part of the apparatus, and the functions implemented by this module may be performed by a processor in the terminal.
Correspondingly, the first calculation module includes:
a third calculation submodule, configured to calculate semantic similarity values between attributes that are not synonyms.
It should be noted that the third calculation submodule may run in a terminal as part of the apparatus, and the functions implemented by this module may be performed by a processor in the terminal.
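The synonym lookup can be sketched as a split of the attribute pairs into those resolved by the dictionary and those left for the semantic similarity calculation. This is an illustrative sketch: the synonym dictionary is modelled as a list of sets, identical attribute names are treated as synonyms (an assumption), and the function name is hypothetical.

```python
from itertools import product


def split_by_synonyms(first_attributes, second_attributes, synonym_groups):
    """Return (shared pairs found in the synonym dictionary, remaining pairs)."""
    shared, remaining = [], []
    for a, b in product(first_attributes, second_attributes):
        # Assumption: identical names count as synonyms without a dictionary entry.
        if a == b or any(a in group and b in group for group in synonym_groups):
            shared.append((a, b))
        else:
            remaining.append((a, b))
    return shared, remaining
```

Pairs in the first list become shared attributes directly, and only pairs in the second list need a semantic similarity value.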
Optionally, the first calculation module includes:
an obtaining submodule, configured to obtain a semantic vector for each attribute by using a preset word embedding model; and
a fourth calculation submodule, configured to calculate semantic similarity values between the semantic vectors corresponding to the attributes.
It should be noted that the obtaining submodule and the fourth calculation submodule may run in a terminal as part of the apparatus, and the functions implemented by these modules may be performed by a processor in the terminal.
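A common way to compare semantic vectors is cosine similarity, which is assumed here; the passage above does not name a specific measure. The sketch uses a tiny hand-written table of made-up vectors in place of a trained word embedding model, and all names and numbers in it are hypothetical.

```python
import math

# Made-up three-dimensional vectors standing in for a trained word embedding model.
TOY_EMBEDDINGS = {
    "director": (0.9, 0.1, 0.0),
    "filmmaker": (0.8, 0.2, 0.1),
    "year": (0.0, 0.1, 0.9),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def attribute_similarity(a, b, embeddings=TOY_EMBEDDINGS):
    """Semantic similarity of two attribute names via their embedding vectors."""
    return cosine_similarity(embeddings[a], embeddings[b])
```

In this toy table, "director" and "filmmaker" point in nearly the same direction and score close to 1, while "director" and "year" score close to 0.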
An embodiment of the present invention further provides a data fusion apparatus including one or more processors and one or more memories storing program units, where the program units are executed by the processors and include:
an extraction module, configured to extract attribute values from first data and second data, where the first data and the second data each include correspondences between attributes and attribute values;
a calculation module, configured to calculate similarity values between the attribute values;
a determining module, configured to determine a similarity value between the first data and the second data according to the similarity values between the attribute values; and
a fusion module, configured to fuse the first data and the second data when the similarity value between the first data and the second data is greater than a preset second threshold.
It should be noted that the extraction module, the calculation module, the determining module, and the fusion module may run in a terminal as part of the apparatus.
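This second apparatus scores two records from their attribute values alone. The sketch below shows one way that could look, under several assumptions: records are dictionaries, value similarity is the `difflib` character-level ratio, and the record score is the mean, over the first record's values, of the best match among the second record's values. The aggregation rule and the function names are hypothetical and are not specified by the embodiment.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def value_based_similarity(first, second):
    """Mean best-match similarity of the first record's values against the second's."""
    if not first or not second:
        return 0.0
    best_matches = [
        max(string_similarity(value, other) for other in second.values())
        for value in first.values()
    ]
    return sum(best_matches) / len(best_matches)
```

The resulting score would then be compared with the second threshold to decide whether to fuse the two records.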
The data fusion apparatus provided by the embodiments of the present invention can implement the following functions. Attributes are extracted from first data and second data, where the first data and the second data each include correspondences between attributes and attribute values. Semantic similarity values between the attributes are calculated, the semantic similarity values greater than a preset first threshold are identified, and the attributes corresponding to each such semantic similarity value are determined as a pair of shared attributes of the first data and the second data. A similarity value between the first data and the second data is determined by comparing the attribute values corresponding to each pair of shared attributes, and if that similarity value is greater than a preset second threshold, the first data and the second data are fused. The embodiments of the present invention determine the shared attributes of the first data and the second data based on semantic similarity values, then compare the attribute values corresponding to the shared attributes, and finally determine the similarity value between the first data and the second data. Compared with the related art, the embodiments of the present invention improve the data fusion rate while preserving the accuracy of data fusion.
Correspondingly, an embodiment of the present invention further provides an electronic device. Referring to FIG. 5, the electronic device may include:
a processor 501, a memory 502, an input device 503, and an output device 504. The browser server may contain one or more processors 501; FIG. 5 uses one processor as an example. In some embodiments of the present invention, the processor 501, the memory 502, the input device 503, and the output device 504 may be connected by a bus or in another manner; FIG. 5 uses a bus connection as an example.
The memory 502 may be used to store computer programs and modules, such as the program instructions/modules corresponding to the data fusion method and apparatus in the embodiments of the present invention. The processor 501 is configured to run the software programs and modules stored in the memory 502 to execute various functional applications and data processing, that is, to implement the data fusion method described above. The memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like. In addition, the memory 502 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. In some examples, the memory 502 may further include memories located remotely from the processor 501, and these remote memories may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 503 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the browser server.
Specifically, in this embodiment, the processor 501 loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and runs the application programs stored in the memory 502 to implement the following functions:
extracting attributes from first data and second data, where the first data and the second data each include correspondences between attributes and attribute values;
calculating semantic similarity values between the attributes;
identifying the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data;
determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes; and
fusing the first data and the second data if the similarity value between the first data and the second data is greater than a preset second threshold.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments, and details are not repeated here.
A person of ordinary skill in the art can understand that the structure shown in FIG. 5 is only illustrative. The electronic device may be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 5 does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components than those shown in FIG. 5 (such as a network interface or a display device), or have a configuration different from that shown in FIG. 5.
A person of ordinary skill in the art can understand that all or some of the steps of the methods in the foregoing embodiments may be completed by a program instructing hardware related to the terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium stores a computer program, where the computer program is configured to execute the data fusion method when run.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in the network shown in the foregoing embodiments.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
extracting attributes from first data and second data, where the first data and the second data each include correspondences between attributes and attribute values;
calculating semantic similarity values between the attributes;
identifying the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data;
determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes; and
fusing the first data and the second data if the similarity value between the first data and the second data is greater than a preset second threshold.
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
The data fusion method and apparatus, storage medium, and electronic device according to the present invention are described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various improvements can be made to the data fusion method and apparatus, storage medium, and electronic device proposed by the present invention without departing from the content of the present invention. Therefore, the protection scope of the present invention should be determined by the content of the appended claims.
Because the apparatus embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for related details. The apparatus embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement the embodiments without creative effort.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The data fusion method and apparatus, storage medium, and electronic device provided by the embodiments of the present invention are described in detail above. Specific examples are used in this document to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Industrial applicability
Attributes are extracted from first data and second data, where the first data and the second data each include correspondences between attributes and attribute values. Semantic similarity values between the attributes are calculated, the semantic similarity values greater than a preset first threshold are identified, and the attributes corresponding to each such semantic similarity value are determined as a pair of shared attributes of the first data and the second data. A similarity value between the first data and the second data is determined by comparing the attribute values corresponding to each pair of shared attributes, and if that similarity value is greater than a preset second threshold, the first data and the second data are fused. The present invention determines the shared attributes of the first data and the second data based on semantic similarity values, then compares the attribute values corresponding to the shared attributes, and finally determines the similarity value between the first data and the second data. Compared with the related art, the embodiments of the present invention improve the data fusion rate while preserving the accuracy of data fusion.

Claims (36)

  1. A data fusion method, comprising:
    extracting attributes from first data and second data, wherein the first data and the second data each include correspondences between attributes and attribute values;
    calculating semantic similarity values between the attributes;
    identifying the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data;
    determining a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes; and
    fusing the first data and the second data if the similarity value between the first data and the second data is greater than a preset second threshold.
  2. The data fusion method according to claim 1, wherein determining the similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes comprises:
    obtaining, from the first data and the second data, the attribute values corresponding to each pair of shared attributes, and calculating a semantic similarity value between the attribute values corresponding to the same pair of shared attributes; and
    determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes.
  3. The data fusion method according to claim 2, further comprising:
    calculating, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
  4. The data fusion method according to claim 3, wherein determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes comprises:
    multiplying the semantic similarity value between the attribute values corresponding to each pair of shared attributes by the weight value corresponding to that pair, and summing the products to obtain the similarity value between the first data and the second data.
  5. The data fusion method according to claim 2, further comprising, before determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes:
    removing, from the shared attributes, any pair of shared attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.
  6. The data fusion method according to claim 1, further comprising, before calculating the semantic similarity values between the attributes:
    extracting the attribute values corresponding to the attributes in the first data and the second data, and obtaining the attributes corresponding to attribute values whose similarity value is greater than a preset fourth threshold.
  7. The data fusion method according to claim 6, wherein calculating the semantic similarity values between the attributes comprises:
    calculating semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
  8. The data fusion method according to claim 1, further comprising, before calculating the semantic similarity values between the attributes:
    querying a preset synonym dictionary, and determining attributes that are synonyms of each other as a pair of shared attributes of the first data and the second data.
  9. The data fusion method according to claim 8, wherein calculating the semantic similarity values between the attributes comprises:
    calculating semantic similarity values between attributes that are not synonyms.
  10. The data fusion method according to claim 1, wherein calculating the semantic similarity values between the attributes comprises:
    obtaining a semantic vector for each attribute by using a preset word embedding model; and
    calculating semantic similarity values between the semantic vectors corresponding to the attributes.
  11. A data fusion method, comprising:
    extracting attribute values from first data and second data, wherein the first data and the second data each include correspondences between attributes and attribute values;
    calculating similarity values between the attribute values;
    determining a similarity value between the first data and the second data according to the similarity values between the attribute values; and
    fusing the first data and the second data if the similarity value between the first data and the second data is greater than a preset second threshold.
  12. The data fusion method according to claim 11, further comprising, before calculating the similarity values between the attribute values:
    extracting attributes from the first data and the second data;
    calculating semantic similarity values between the attributes; and
    identifying the semantic similarity values greater than a preset first threshold, and determining the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data.
  13. The data fusion method according to claim 12, wherein calculating the similarity values between the attribute values comprises:
    calculating a semantic similarity value between the attribute values corresponding to the same pair of shared attributes.
  14. The data fusion method according to claim 13, wherein determining the similarity value between the first data and the second data according to the similarity values between the attribute values comprises:
    determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes.
  15. The data fusion method according to claim 14, further comprising:
    calculating, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
  16. The data fusion method according to claim 15, wherein determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes comprises:
    multiplying the semantic similarity value between the attribute values corresponding to each pair of shared attributes by the weight value corresponding to that pair, and summing the products to obtain the similarity value between the first data and the second data.
  17. The data fusion method according to claim 14, further comprising, before determining the similarity value between the first data and the second data according to the semantic similarity values between the attribute values corresponding to each pair of shared attributes:
    removing, from the shared attributes, any pair of shared attributes whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.
  18. The data fusion method according to claim 12, wherein, before calculating the semantic similarity values between the respective attributes, the method further comprises:
    obtaining the attributes corresponding to attribute values whose similarity value is greater than a preset fourth threshold.
  19. The data fusion method according to claim 18, wherein calculating the semantic similarity values between the respective attributes comprises:
    calculating the semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
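Claims 18 and 19 describe a pre-selection: only attributes whose values are already similar enough are passed on to the semantic comparison of attribute names. A minimal sketch of one possible reading is given below. The use of `difflib.SequenceMatcher` as the value similarity measure, the function name `candidate_attribute_pairs`, and the sample records are assumptions made for illustration; the claims do not prescribe a particular measure.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Dict, List, Tuple


def candidate_attribute_pairs(
    first: Dict[str, str],
    second: Dict[str, str],
    fourth_threshold: float,
) -> List[Tuple[str, str]]:
    """Return attribute pairs whose values are more similar than the threshold."""
    candidates = []
    for (attr_a, val_a), (attr_b, val_b) in product(first.items(), second.items()):
        # ratio() returns a value in [0, 1]; 1.0 means the two strings are identical.
        if SequenceMatcher(None, val_a, val_b).ratio() > fourth_threshold:
            candidates.append((attr_a, attr_b))
    return candidates


# Hypothetical records with differently named attributes.
first = {"name": "Inception", "year": "2010"}
second = {"title": "Inception", "released": "2010", "studio": "Warner Bros."}
pairs = candidate_attribute_pairs(first, second, fourth_threshold=0.8)
```

Only the pairs returned here would then be compared semantically, which reduces the number of attribute-name comparisons.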
  20. The data fusion method according to claim 12, wherein, before calculating the semantic similarity values between the respective attributes, the method further comprises:
    determining, by querying a preset synonym database, attributes that are synonyms as a pair of shared attributes of the first data and the second data.
  21. The data fusion method according to claim 20, wherein calculating the semantic similarity values between the respective attributes comprises:
    calculating the semantic similarity values between attributes that are not synonyms.
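The synonym lookup of claims 20 and 21 can be sketched as a split of the candidate attribute pairs into those resolved by the synonym database and those left for semantic comparison. The contents of `SYNONYM_GROUPS`, the function names, and the decision to treat identical attribute names as synonyms are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical synonym database: each frozenset is one group of synonymous attribute names.
SYNONYM_GROUPS = [
    frozenset({"name", "title"}),
    frozenset({"birthday", "date_of_birth"}),
]


def are_synonyms(a: str, b: str) -> bool:
    """Identical names, or names found in the same synonym group, count as synonyms."""
    return a == b or any(a in group and b in group for group in SYNONYM_GROUPS)


def split_by_synonyms(
    first_attrs: List[str], second_attrs: List[str]
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (pairs shared via the synonym database, pairs left for semantic comparison)."""
    shared, remaining = [], []
    for a in first_attrs:
        for b in second_attrs:
            (shared if are_synonyms(a, b) else remaining).append((a, b))
    return shared, remaining


shared, remaining = split_by_synonyms(["name", "year"], ["title", "released"])
```

Here `shared` holds the pairs that become shared attributes directly, and only the pairs in `remaining` need a semantic similarity value.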
  22. The data fusion method according to claim 12, wherein calculating the semantic similarity values between the respective attributes comprises:
    obtaining a semantic vector corresponding to each attribute by using a preset word embedding model; and
    calculating the semantic similarity values between the semantic vectors corresponding to the respective attributes.
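The two steps of claim 22 can be illustrated with a toy embedding table. In this sketch the dictionary `EMBEDDINGS` stands in for the preset word embedding model, its vectors are made up, and cosine similarity is assumed as the similarity measure between semantic vectors; the claim itself does not name a specific measure.

```python
import math
from typing import Dict, List

# Toy stand-in for a preset word embedding model (the vectors are made up).
EMBEDDINGS: Dict[str, List[float]] = {
    "name": [0.9, 0.1, 0.0],
    "title": [0.8, 0.2, 0.1],
    "year": [0.0, 0.1, 0.9],
}


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute_similarity(a: str, b: str) -> float:
    """Look up the semantic vector of each attribute and compare the two vectors."""
    return cosine_similarity(EMBEDDINGS[a], EMBEDDINGS[b])
```

With these made-up vectors, `attribute_similarity("name", "title")` is higher than `attribute_similarity("name", "year")`, so a suitable first threshold would accept the former pair as shared attributes and reject the latter.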
  23. The data fusion method according to claim 11, wherein calculating the similarity values between the respective attribute values comprises:
    calculating string similarity values between the respective attribute values.
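One common way to obtain a string similarity value between two attribute values is a normalized edit distance. The sketch below assumes the Levenshtein distance for this purpose; the claim only requires some string similarity value, so this choice and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Dividing by the length of the longer string keeps the value between 0 and 1 regardless of string length, so a single threshold can be applied to all attribute values.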
  24. A data fusion device, comprising one or more processors and one or more memories storing program units, wherein the program units are executed by the one or more processors, the program units comprising:
    an extraction module, configured to extract attributes from first data and second data, wherein the first data and the second data each include correspondences between attributes and attribute values;
    a first calculation module, configured to calculate semantic similarity values between the respective attributes;
    a first determining module, configured to determine the semantic similarity values that are greater than a preset first threshold, and to determine the attributes corresponding to each such semantic similarity value as a pair of shared attributes of the first data and the second data;
    a second determining module, configured to determine a similarity value between the first data and the second data by comparing the attribute values corresponding to each pair of shared attributes; and
    a fusion module, configured to fuse the first data and the second data when the similarity value between the first data and the second data is greater than a preset second threshold.
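The cooperation of the modules in claim 24 can be sketched end to end as follows. This is an illustration under several assumptions, none of which is taken from the claims: records are modeled as dictionaries, the two similarity functions are passed in as callables, the record similarity is taken as the plain average of the value similarities, and fusion keeps the first record's values while adding the attributes of the second record that were not matched.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Similarity = Callable[[str, str], float]


def find_shared_attributes(
    first: Record, second: Record, attr_sim: Similarity, first_threshold: float
) -> List[Tuple[str, str]]:
    """Extraction, first calculation, and first determining modules."""
    return [(a, b) for a in first for b in second if attr_sim(a, b) > first_threshold]


def record_similarity(
    first: Record, second: Record, shared: List[Tuple[str, str]], value_sim: Similarity
) -> float:
    """Second determining module (assumed here: unweighted average over shared pairs)."""
    if not shared:
        return 0.0
    return sum(value_sim(first[a], second[b]) for a, b in shared) / len(shared)


def fuse(
    first: Record,
    second: Record,
    attr_sim: Similarity,
    value_sim: Similarity,
    first_threshold: float,
    second_threshold: float,
) -> Optional[Record]:
    """Fusion module: merge the records only if they are similar enough."""
    shared = find_shared_attributes(first, second, attr_sim, first_threshold)
    if record_similarity(first, second, shared, value_sim) <= second_threshold:
        return None
    merged = dict(first)
    matched = {b for _, b in shared}
    for attr, value in second.items():
        if attr not in matched:
            merged.setdefault(attr, value)  # assumed policy: first record wins
    return merged


# Toy similarity functions standing in for the semantic comparisons.
def toy_attr_sim(a: str, b: str) -> float:
    synonyms = {("name", "title"), ("year", "released")}
    return 1.0 if a == b or (a, b) in synonyms else 0.0


def toy_value_sim(x: str, y: str) -> float:
    return 1.0 if x.strip().lower() == y.strip().lower() else 0.0


first = {"name": "Inception", "year": "2010"}
second = {"title": "inception", "released": "2010", "studio": "Warner Bros."}
merged = fuse(first, second, toy_attr_sim, toy_value_sim, 0.5, 0.5)
```

In this toy run the two records describe the same entity, so `merged` contains the attributes of the first record plus the unmatched `studio` attribute of the second; records that fall at or below the second threshold are left unfused and `fuse` returns `None`.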
  25. The data fusion device according to claim 24, wherein the second determining module comprises:
    a first calculation submodule, configured to obtain, from the first data and the second data, the attribute values corresponding to each pair of shared attributes, and to calculate the semantic similarity value between the attribute values corresponding to the same pair of shared attributes; and
    a first determining submodule, configured to determine the similarity value between the first data and the second data according to the semantic similarity value between the attribute values corresponding to each pair of shared attributes.
  26. The data fusion device according to claim 25, wherein the device further comprises:
    a second calculation module, configured to calculate, in the first data and the second data, a weight value corresponding to each pair of shared attributes.
  27. The data fusion device according to claim 26, wherein the first determining submodule comprises:
    an accumulation submodule, configured to accumulate, over all pairs of shared attributes, the product of the semantic similarity value between the attribute values corresponding to each pair and the weight value corresponding to that pair, to obtain the similarity value between the first data and the second data.
  28. The data fusion device according to claim 25, wherein the device further comprises:
    a filtering module, configured to filter out, from the shared attributes, any shared attribute whose corresponding attribute values have a semantic similarity value not greater than a preset third threshold.
  29. The data fusion device according to claim 24, wherein the device further comprises:
    an obtaining module, configured to extract the attribute values corresponding to the respective attributes in the first data and the second data, and to obtain the attributes corresponding to attribute values whose similarity value is greater than a preset fourth threshold.
  30. The data fusion device according to claim 29, wherein the first calculation module comprises:
    a second calculation submodule, configured to calculate the semantic similarity values between the attributes corresponding to the attribute values whose similarity value is greater than the preset fourth threshold.
  31. The data fusion device according to claim 24, wherein the device further comprises:
    a third determining module, configured to determine, by querying a preset synonym database, attributes that are synonyms as a pair of shared attributes of the first data and the second data.
  32. The data fusion device according to claim 31, wherein the first calculation module comprises:
    a third calculation submodule, configured to calculate the semantic similarity values between attributes that are not synonyms.
  33. The data fusion device according to claim 24, wherein the first calculation module comprises:
    an obtaining submodule, configured to obtain a semantic vector corresponding to each attribute by using a preset word embedding model; and
    a fourth calculation submodule, configured to calculate the semantic similarity values between the semantic vectors corresponding to the respective attributes.
  34. A data fusion device, comprising one or more processors and one or more memories storing program units, wherein the program units are executed by the one or more processors, the program units comprising:
    an extraction module, configured to extract attribute values from first data and second data, wherein the first data and the second data each include correspondences between attributes and attribute values;
    a calculation module, configured to calculate similarity values between the respective attribute values;
    a determining module, configured to determine a similarity value between the first data and the second data according to the similarity values between the respective attribute values; and
    a fusion module, configured to fuse the first data and the second data when the similarity value between the first data and the second data is greater than a preset second threshold.
  35. A storage medium, wherein a computer program is stored in the storage medium, and the computer program is configured to perform, when run, the method according to any one of claims 1 to 23.
  36. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to perform, by means of the computer program, the method according to any one of claims 1 to 23.
PCT/CN2018/077184 2017-03-13 2018-02-26 Data fusion method and device, storage medium and electronic device WO2018166343A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710145976.1 2017-03-13
CN201710145976.1A CN108572947B (en) 2017-03-13 2017-03-13 A kind of data fusion method and device

Publications (1)

Publication Number Publication Date
WO2018166343A1 (en) 2018-09-20

Family

ID=63522782

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077184 WO2018166343A1 (en) 2017-03-13 2018-02-26 Data fusion method and device, storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN108572947B (en)
WO (1) WO2018166343A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840080B (en) * 2018-12-28 2022-08-26 东软集团股份有限公司 Character attribute comparison method and device, storage medium and electronic equipment
CN110222200A (en) * 2019-06-20 2019-09-10 京东方科技集团股份有限公司 Method and apparatus for entity fusion
CN110704405B (en) * 2019-08-29 2020-11-10 南京医渡云医学技术有限公司 Data fusion method and device based on disease indexes
CN111104795A (en) * 2019-11-19 2020-05-05 平安金融管理学院(中国·深圳) Company name matching method and device, computer equipment and storage medium
CN113032775B (en) * 2019-12-25 2024-02-06 中国电信股份有限公司 Information processing method and information processing system
CN111882416A (en) * 2020-07-24 2020-11-03 未鲲(上海)科技服务有限公司 Training method and related device of risk prediction model
CN112163485B (en) * 2020-09-18 2023-11-24 杭州海康威视系统技术有限公司 Data processing method and device, database system and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07302265A (en) * 1994-05-10 1995-11-14 Nippon Telegr & Teleph Corp <Ntt> Method for refining data for similarity discrimination and device for performing the method
CN103530334A (en) * 2013-09-29 2014-01-22 方正国际软件有限公司 System and method for data matching based on comparison module
CN104182517A (en) * 2014-08-22 2014-12-03 北京羽乐创新科技有限公司 Data processing method and data processing device
CN105488176A (en) * 2015-11-30 2016-04-13 华为软件技术有限公司 Data processing method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1955960A (en) * 2005-10-28 2007-05-02 日电(中国)有限公司 File information table structure device and browing and search system using it
CN103207859B (en) * 2012-01-11 2016-07-06 北京四维图新科技股份有限公司 The method and apparatus of integrated database
CN103617192B (en) * 2013-11-07 2017-06-16 北京奇虎科技有限公司 The clustering method and device of a kind of data object
CN104504138A (en) * 2014-12-31 2015-04-08 广州索答信息科技有限公司 Human-based information fusion method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517077A (en) * 2019-08-21 2019-11-29 天津货比三价科技有限公司 Commodity similarity analysis method, apparatus and storage medium based on attributive distance
CN112256882A (en) * 2020-10-16 2021-01-22 美林数据技术股份有限公司 Multi-similarity-based cross-system network entity fusion method
CN116257420A (en) * 2023-03-14 2023-06-13 北京崇迅科技有限公司 Computer intelligent regulation and control system and method based on data fusion
CN116257420B (en) * 2023-03-14 2023-12-15 山西融创智联信息科技有限公司 Computer intelligent regulation and control system and method based on data fusion

Also Published As

Publication number Publication date
CN108572947B (en) 2019-11-19
CN108572947A (en) 2018-09-25

Similar Documents

Publication Publication Date Title
WO2018166343A1 (en) Data fusion method and device, storage medium and electronic device
JP6594543B2 (en) Order clustering method and apparatus and method and apparatus for countering malicious information
US8468146B2 (en) System and method for creating search index on cloud database
US10679055B2 (en) Anomaly detection using non-target clustering
US10635668B2 (en) Intelligently utilizing non-matching weighted indexes
WO2017206376A1 (en) Searching method, searching device and non-volatile computer storage medium
US20230076387A1 (en) Systems and methods for providing a comment-centered news reader
WO2016015431A1 (en) Search method, apparatus and device and non-volatile computer storage medium
WO2020233360A1 (en) Method and device for generating product evaluation model
US20190213057A1 (en) Adding descriptive metadata to application programming interfaces for consumption by an intelligent agent
US20120047124A1 (en) Database query optimizations
CN109977233B (en) Idiom knowledge graph construction method and device
CN111291571A (en) Semantic error correction method, electronic device and storage medium
CN113660541B (en) Method and device for generating abstract of news video
US11899770B2 (en) Verification method and apparatus, and computer readable storage medium
CN106202440B (en) Data processing method, device and equipment
CN110555108B (en) Event context generation method, device, equipment and storage medium
JP2016212879A (en) Information processing method and information processing apparatus
CN107016028B (en) Data processing method and apparatus thereof
CN110019783B (en) Attribute word clustering method and device
CN112069267A (en) Data processing method and device
CN109213972B (en) Method, device, equipment and computer storage medium for determining document similarity
CN112363814A (en) Task scheduling method and device, computer equipment and storage medium
CN114357180A (en) Knowledge graph updating method and electronic equipment
CN109033070B (en) Data processing method, server and computer readable medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18767250

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 18767250

Country of ref document: EP

Kind code of ref document: A1