CN112733939A - Similarity feature vector construction method and device, electronic equipment and storage medium

Info

Publication number: CN112733939A
Application number: CN202110037613.2A
Authority: CN
Inventors: 黄艳香; 吴信东; 白强伟
Original assignee: Shanghai Minglue Artificial Intelligence Group Co Ltd
Current assignee: Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2021-04-30

Abstract

The application provides a method and a device for constructing a similarity feature vector, an electronic device, and a storage medium. At least one attribute feature and the attribute value corresponding to each attribute feature are determined from each of a plurality of data matching tags. For any two data matching tags, the attribute similarity between attribute features of the same category is determined based on the attribute value of each attribute feature of each data matching tag. Finally, based on the attribute similarity and the attribute number ratio between every two data matching tags, a similarity feature vector is constructed for input to a matching model that determines whether the entities in the plurality of data matching tags match. In this way, a plurality of data matching tags can be effectively converted into similarity feature vectors, the limitations of the matching model can be reduced, and the accuracy of the entity matching result can be improved.

Description

Similarity feature vector construction method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a method and an apparatus for constructing a similarity feature vector, an electronic device, and a storage medium.

Background

As enterprises grow, factors such as changes of administrators, dispersed physical layout, and autonomous systems leave data with complex sources (different types of relational databases, data from different departments, and the like) and heterogeneous structures (SQL and NoSQL databases, text files, Hive big data, and the like), so unified management of the data assets of different departments is not simple. In the digital transformation of an enterprise, the integration and fusion of multi-source heterogeneous data is a necessary basis for building upper-layer applications, and entity matching is a very important part of the data fusion process.

At present, feature vectors constructed by traditional machine learning methods, word embedding methods, and the like are often tied to the number of features. When the numbers of features acquired from different data sources differ, using a matching model trained for one particular data source yields matching results that are not accurate enough. In addition, a matching model trained on a single data source cannot perform entity matching across data sources, so its limitations are considerable.

Disclosure of Invention

In view of this, an object of the present application is to provide a method and an apparatus for constructing a similarity feature vector, an electronic device, and a storage medium, which can effectively convert data matching tags acquired from different data sources into the similarity feature vector, and further, are helpful for improving the robustness of a matching model and reducing the limitations of the matching model, so as to improve the accuracy of an entity matching result.

The embodiment of the application provides a construction method of a similarity characteristic vector, which comprises the following steps:

acquiring a plurality of data matching labels to be matched;

for each data matching label, determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching label;

for every two data matching labels, determining attribute similarity between attribute features of the same category in the two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label;

determining the attribute number ratio between the two data matching labels based on the attribute features and the number of the attribute features included in each of the two data matching labels;

and constructing a similarity feature vector used for being input into a matching model for determining whether the entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.

Further, the constructing a similarity feature vector for inputting to a matching model for determining whether an entity in the plurality of data matching tags matches based on the attribute similarity between every two data matching tags in the plurality of data matching tags and the attribute count ratio includes:

for every two data matching labels, determining statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature;

and constructing a similarity feature vector which is input to a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.

Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, the determining the attribute count ratio between the two data matching tags based on the attribute features and the number of the attribute features included in each of the two data matching tags includes:

determining a first number of attribute features that the two data matching tags have in common, and a second number of attribute features that each of the two data matching tags includes;

based on the first number and the second number of each data matching tag, determining a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio between the two data matching tags.

Further, determining a similarity ratio between the two data matching labels by:

determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags;

and for each similarity interval, determining the similarity ratio of the two data matching labels in the similarity interval based on the first number and the third number.

Further, the statistical similarity includes a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity.

The embodiment of the present application further provides a device for constructing a similarity feature vector, where the device for constructing includes:

the tag acquisition module is used for acquiring a plurality of data matching tags to be matched;

the first determining module is used for determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tags aiming at each data matching tag;

the second determining module is used for determining attribute similarity between the attribute features of the same category in each two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label;

a third determining module, configured to determine, based on the attribute features and the number of attribute features included in each of the two data matching tags, an attribute count ratio between the two data matching tags;

and the vector construction module is used for constructing a similarity characteristic vector which is input into a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.

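Further, when the vector construction module is configured to construct a similarity feature vector for input to a matching model that determines whether an entity in the plurality of data matching tags matches, based on attribute similarity between every two data matching tags in the plurality of data matching tags and an attribute count ratio, the vector construction module is configured to: determine, for every two data matching tags, the statistical similarity and similarity ratio between the two data matching tags based on the attribute similarity of the two data matching tags under each attribute feature; and construct, based on the statistical similarity, the similarity ratio and the attribute count ratio between every two data matching tags in the plurality of data matching tags, a similarity feature vector for input to a matching model that determines whether the entities in the plurality of data matching tags match.

Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, the third determining module, when configured to determine the attribute count ratio between the two data matching tags based on the attribute features included in each of the two data matching tags and the number of the attribute features, is configured to: determine a first number of attribute features that the two data matching tags have in common, and a second number of attribute features that each of the two data matching tags includes; and determine, based on the first number and the second number of each data matching tag, a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio between the two data matching tags.

Further, the vector construction module is configured to determine a similarity ratio between the two data matching tags by: determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags; and, for each similarity interval, determining the similarity ratio of the two data matching tags in the similarity interval based on the first number and the third number.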

An embodiment of the present application further provides an electronic device, including a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the above method for constructing a similarity feature vector.

The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for constructing a similarity feature vector are performed as described above.

The method, apparatus, electronic device, and storage medium for constructing a similarity feature vector provided in the embodiments of the present application determine at least one attribute feature and the attribute value corresponding to each attribute feature from each of the obtained plurality of data matching tags. For every two data matching tags, the attribute similarity between attribute features of the same category in the two data matching tags is determined based on the attribute value of each attribute feature in each of the two data matching tags, and the attribute count ratio between the two data matching tags is determined according to the attribute features and the number of attribute features included in each data matching tag. Finally, based on the attribute similarity and the attribute count ratio between every two data matching tags in the plurality of data matching tags, a similarity feature vector is constructed for input to a matching model that determines whether the entities in the plurality of data matching tags match. Therefore, data matching tags acquired from different data sources can be effectively converted into similarity feature vectors, which helps improve the robustness of the matching model and reduce its limitations, so that the accuracy of the entity matching result can be improved.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a flowchart of a method for constructing a similarity feature vector according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of another method for constructing a similarity feature vector according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a similarity feature vector constructing apparatus according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.

Research shows that, at present, feature vectors constructed by traditional machine learning methods, word embedding methods, and the like are often tied to the number of features. When the numbers of features acquired from different data sources differ, using a matching model trained for one particular data source yields matching results that are not accurate enough. In addition, a matching model trained on a single data source cannot perform entity matching across data sources, so its limitations are considerable.

Based on this, the embodiment of the application provides a method for constructing the similarity feature vector, which is beneficial to improving the robustness of the matching model and reducing the limitation of the matching model, so that the accuracy of the entity matching result can be improved.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for constructing a similarity feature vector according to an embodiment of the present disclosure. As shown in fig. 1, a method for constructing a similarity feature vector provided in an embodiment of the present application includes:

S101, obtaining a plurality of data matching tags to be matched.

In this step, a plurality of data matching tags to be matched are obtained from different data sources, where each data matching tag contains an entity to be matched. For example, given data matching tag 1: "Zhang San, male, 30 years old, company A" and data matching tag 2: "Zhang San, male, 28 years old, company B", it can be judged whether the entity "Zhang San" in the two data matching tags is the same person, or whether "company A" and "company B" are the same company.

Here, the number of attribute features in each data matching tag may be the same or different. For example, for data matching tag 3: "Zhang San, 30 years old, company A", the number of attribute features in data matching tag 3 differs from that in data matching tag 1 and data matching tag 2.

S102, for each data matching tag, determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tag.

In this step, for each data matching tag of the acquired multiple data matching tags, at least one attribute feature included in the data matching tag and an attribute value corresponding to each attribute feature are identified.

Corresponding to the above embodiment, for data matching tag 1: "Zhang San, male, 30 years old, company A", the following are identified: attribute feature A₁₁ "name" with corresponding attribute value V₁₁ "Zhang San"; attribute feature A₁₂ "gender" with corresponding attribute value V₁₂ "male"; attribute feature A₁₃ "age" with corresponding attribute value V₁₃ "30 years old"; and attribute feature A₁₄ "work unit" with corresponding attribute value V₁₄ "company A". Similarly, the attribute features and corresponding attribute values in data matching tag 2 and data matching tag 3 can be obtained, which is not described herein again.
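As an illustration only, the three example data matching tags can be held as mappings from attribute feature to attribute value. The following is a minimal sketch; the dictionary representation, the type alias `Tag` and the helper `attribute_features` are hypothetical names chosen for this example and are not prescribed by the embodiment, which does not specify how the attribute features are identified.

```python
from typing import Dict, List, Optional

# Hypothetical representation: attribute feature name -> attribute value.
# An attribute feature that a tag lacks is absent (or mapped to None).
Tag = Dict[str, Optional[str]]

tag1: Tag = {"name": "Zhang San", "gender": "male",
             "age": "30 years old", "work unit": "company A"}
tag2: Tag = {"name": "Zhang San", "gender": "male",
             "age": "28 years old", "work unit": "company B"}
tag3: Tag = {"name": "Zhang San",
             "age": "30 years old", "work unit": "company A"}


def attribute_features(tag: Tag) -> List[str]:
    """Return the attribute features of a tag that carry a non-null value."""
    return [name for name, value in tag.items() if value is not None]
```

With this representation, data matching tag 1 and data matching tag 2 each yield four attribute features, while data matching tag 3 yields three, and looking up "gender" in data matching tag 3 gives no value.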

S103, for every two data matching labels, determining attribute similarity between attribute features of the same category in the two data matching labels based on the attribute values corresponding to the attribute features in each data matching label.

In this step, the attribute similarity between attribute features of the same category is determined for every two data matching tags in the plurality of data matching tags. Specifically, for every two obtained data matching tags, the attribute similarity sim(V_i) under each attribute feature of the same category is determined based on the attribute value corresponding to each attribute feature in the two data matching tags, where sim(V_i) is the attribute similarity of the two data matching tags under the i-th attribute feature of the same category.

Corresponding to the above embodiment, the attribute similarity between attribute features of the same category is determined separately for data matching tag 1 and data matching tag 2, for data matching tag 2 and data matching tag 3, and for data matching tag 1 and data matching tag 3.

Taking data matching tag 1 and data matching tag 3 as an example, data matching tag 1 has: attribute feature A₁₁ "name" with corresponding attribute value V₁₁ "Zhang San"; attribute feature A₂₁ "gender" with corresponding attribute value V₂₁ "male"; attribute feature A₃₁ "age" with corresponding attribute value V₃₁ "30 years old"; and attribute feature A₄₁ "work unit" with corresponding attribute value V₄₁ "company A". Data matching tag 3 has: attribute feature A₁₃ "name" with corresponding attribute value V₁₃ "Zhang San"; attribute feature A₃₃ "age" with corresponding attribute value V₃₃ "30 years old"; and attribute feature A₄₃ "work unit" with corresponding attribute value V₄₃ "company A".

Here, since data matching tag 3 does not contain the attribute feature "gender", the attribute value V₂₃ corresponding to attribute feature A₂₃ "gender" of data matching tag 3 is null. In this case, the similarity between data matching tag 1 and data matching tag 3 under the attribute feature "gender" is 0.

When the attribute values of an attribute feature of the same category are non-null in both data matching tags, the similarity of the two data matching tags under that attribute feature can be calculated with an existing similarity calculation method, such as the traditional edit distance, Jaccard similarity, cosine similarity, or semantic similarity based on word embedding. That is, such a method can be used to calculate the similarity sim(V₁) of the two data matching tags under the attribute feature "name", the similarity sim(V₃) under the attribute feature "age", and the similarity sim(V₄) under the attribute feature "work unit".
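The per-attribute rule above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a character-set Jaccard similarity, which is only one of the methods the embodiment names, and the function names `jaccard` and `attribute_similarity` are hypothetical.

```python
from typing import Optional


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty strings count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def attribute_similarity(value_m: Optional[str],
                         value_n: Optional[str]) -> float:
    """sim(V_i) for one attribute category of two data matching tags.

    If either attribute value is null, the similarity is 0, as in the
    "gender" example; otherwise an existing measure is applied.
    """
    if value_m is None or value_n is None:
        return 0.0
    return jaccard(value_m, value_n)
```

Any of the other named measures (edit distance, cosine similarity, embedding-based semantic similarity) could be substituted for `jaccard` without changing the null handling.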

S104, determining the attribute number ratio between the two data matching labels based on the attribute features and the number of the attribute features included in each data matching label of the two data matching labels.

In this step, for every two acquired data matching tags, an attribute number ratio between the two data matching tags is determined according to the attribute features and the number of the attribute features in each of the two data matching tags, where the attribute number ratio may include a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio.

S105, constructing similarity feature vectors input into a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.

In this step, after determining the attribute similarity and the attribute number ratio between every two data matching tags in the plurality of data matching tags, based on the attribute similarity and the attribute number ratio between every two data matching tags, that is, based on "the attribute similarity and the attribute number ratio between the data matching tag 1 and the data matching tag 2", "the attribute similarity and the attribute number ratio between the data matching tag 1 and the data matching tag 3", and "the attribute similarity and the attribute number ratio between the data matching tag 2 and the data matching tag 3", a similarity feature vector corresponding to the plurality of data matching tags is constructed, where the similarity feature vector may be used as an input feature and input to a matching model that determines whether an entity in the plurality of data matching tags matches.

The similarity feature vector can be used to train an untrained matching model. Alternatively, after the matching model has been trained, the similarity feature vector can be used to determine whether the entities in the plurality of data matching tags are the same entity; corresponding to the above embodiment, to determine whether "Zhang San" in data matching tag 1 and data matching tag 2 is the same person.

According to the method for constructing the similarity feature vector, at least one attribute feature and an attribute value corresponding to each attribute feature can be determined from each data matching tag in a plurality of acquired data matching tags, and further, for each two data matching tags, the attribute similarity between the attribute features of the same type in the two data matching tags is determined based on the attribute value of each attribute feature in each data matching tag in the two data matching tags; and finally, constructing a similarity feature vector which is input to a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels. Therefore, the data matching labels acquired from different data sources can be effectively converted into the similarity characteristic vectors, and further, the robustness of the matching model is improved, the limitation of the matching model is reduced, and therefore the accuracy of the entity matching result can be improved.

Referring to fig. 2, fig. 2 is a flowchart illustrating another method for constructing a similarity feature vector according to an embodiment of the present disclosure. As shown in fig. 2, the method for constructing a similarity feature vector provided in the embodiment of the present application includes:

S201, obtaining a plurality of data matching tags to be matched.

S202, for each data matching tag, determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tag.

S203, for every two data matching labels, determining the attribute similarity between the attribute features of the same category in the two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label.

S204, determining the attribute number ratio between the two data matching labels based on the attribute features and the number of the attribute features included in each data matching label of the two data matching labels.

S205, for every two data matching labels, determining the statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature.

In this step, for every two data matching labels in the plurality of data matching labels, the statistical similarity and the similarity ratio between the two data matching labels are calculated based on the attribute similarity of the two data matching labels under each attribute feature.

Here, the statistical similarity includes a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity. The similarity ratio is the ratio of the attribute similarity of the two data matching labels under each attribute characteristic in each similarity interval.

Specifically, the maximum attribute similarity is calculated by the following formula:

$$\mathrm{sim}_{max} = \max_{i} \, \mathrm{sim}(V_i)$$

where sim_max is the maximum attribute similarity, and sim(V_i) is the attribute similarity of the two data matching tags under the i-th attribute feature of the same category.

The minimum attribute similarity is calculated by the following formula:

$$\mathrm{sim}_{min} = \min_{i} \, \mathrm{sim}(V_i)$$

where sim_min is the minimum attribute similarity, and sim(V_i) is the attribute similarity of the two data matching tags under the i-th attribute feature of the same category.

The average attribute similarity is calculated by the following formula:

$$\mathrm{sim}_{avg} = \frac{\sum_{i} \mathrm{sim}(V_i)}{\#(V_i \mid V_{im} \neq \mathrm{null} \ \text{and} \ V_{in} \neq \mathrm{null})}$$

where sim_avg is the average attribute similarity, #(V_i | V_im ≠ null and V_in ≠ null) is the number of attribute features that both of the two data matching tags have, and sim(V_i) is the attribute similarity of the two data matching tags under the i-th attribute feature of the same category.

The median attribute similarity is calculated by the following formula:

$$\mathrm{sim}_{med} = \mathrm{sim}(V_c)$$

where sim_med is the median attribute similarity, and sim(V_c) is the attribute similarity of the c-th attribute feature, that is, the attribute similarity in the middle position when the attribute similarities are sorted by value.

Here, when the minimum attribute similarity, the average attribute similarity, and the median attribute similarity are calculated, only the attribute features for which both data matching tags have an attribute value are taken into account. If the attribute value of an attribute feature is null in either of the two data matching tags, the attribute similarity of the two data matching tags under that attribute feature is not considered in these calculations.
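The four statistics can be illustrated with the following minimal sketch. The function name `statistical_similarity` and its two inputs are hypothetical. Two points are assumptions not stated in the embodiment: all four statistics are set to 0 when the two data matching tags share no attribute feature, and for an even number of shared attribute features the median is taken as the mean of the two middle values (the behaviour of Python's `statistics.median`).

```python
import statistics
from typing import Dict, Iterable


def statistical_similarity(sims: Dict[str, float],
                           shared: Iterable[str]) -> Dict[str, float]:
    """Compute max, min, average and median attribute similarity.

    sims:   attribute feature -> sim(V_i), with 0.0 where a value is null.
    shared: attribute features whose value is non-null in both tags.
    """
    shared_sims = [sims[name] for name in shared]
    if not shared_sims:
        # Assumption: with no shared attribute feature, report zeros.
        return {"max": 0.0, "min": 0.0, "avg": 0.0, "med": 0.0}
    return {
        # The maximum is taken over all same-category similarities.
        "max": max(sims.values()),
        # Minimum, average and median ignore attributes with a null value.
        "min": min(shared_sims),
        "avg": sum(shared_sims) / len(shared_sims),
        "med": statistics.median(shared_sims),
    }
```

For example, with similarities of 1.0 for "name", 0.5 for "age", 0.8 for "work unit" and 0 for a "gender" attribute that is null in one tag, the 0 is excluded from the minimum, average and median.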

S206, constructing similarity feature vectors input into a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.

In this step, similarity feature vectors corresponding to the plurality of data matching labels are constructed based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels, where the similarity feature vectors may be input as input features to a matching model that determines whether entities in the plurality of data matching labels are matched.

The descriptions of S201 to S204 may refer to the descriptions of S101 to S104, and the same technical effects can be achieved, which are not described in detail herein.

Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, step S204 includes: determining a first number of attribute features that the two data matching tags have in common, and a second number of attribute features that each of the two data matching tags includes; and, based on the first number and the second number of each data matching tag, determining a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio between the two data matching tags.

In this step, when the attribute count ratio includes the common attribute ratio, the minimum attribute ratio, and the maximum attribute ratio, a first number of attribute features included in both data matching tags is determined first. Corresponding to the above embodiment, for data matching tag 1 and data matching tag 3, the attribute features included in both data matching tag 1 and data matching tag 3 are "name", "age", and "work unit"; the first number for data matching tag 1 and data matching tag 3 is therefore 3.

Then, a second number of attribute features included in each of the two data matching tags is respectively determined, and corresponding to the above embodiment, the attribute features included in the data matching tag 1 are "name", "age", "gender" and "work unit", and then, the second number of attribute features included in the data matching tag 1 is 4; and for the attribute features included in the data matching tag 3 being "name", "age", and "work unit", the second number of attribute features included in the data matching tag 3 is 3.

Finally, based on the determined first number and the second number of each data matching tag, the common attribute ratio, the minimum attribute ratio, and the maximum attribute ratio between the two data matching tags are determined as follows.

The common attribute ratio X₁ is computed from common, m, and n, where common is the first number of attribute features included in both of the two data matching tags, and m and n are the second numbers of attribute features included in each of the two data matching tags, respectively.

$$X_2 = \frac{\mathrm{common}}{\min(m, n)}$$

where X₂ is the minimum attribute ratio, common is the first number of attribute features included in both of the two data matching tags, and min(m, n) is the smaller of the second numbers of attribute features included in the two data matching tags.

$$X_3 = \frac{\mathrm{common}}{\max(m, n)}$$

where X₃ is the maximum attribute ratio, common is the first number of attribute features included in both of the two data matching tags, and max(m, n) is the larger of the second numbers of attribute features included in the two data matching tags.
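The counting step can be illustrated with the sketch below. It assumes that the minimum and maximum attribute ratios are the first number divided by min(m, n) and max(m, n) respectively, which is how the variable descriptions above are read here. The common attribute ratio X₁ is left out because this text does not give its exact expression. The names `present`, `attribute_counts` and `min_max_attribute_ratios` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Tag = Dict[str, Optional[str]]  # hypothetical: attribute feature -> value


def present(tag: Tag) -> Set[str]:
    """Attribute features of a tag that have a non-null value."""
    return {name for name, value in tag.items() if value is not None}


def attribute_counts(tag_m: Tag, tag_n: Tag) -> Tuple[int, int, int]:
    """Return (common, m, n): the first number and the two second numbers."""
    feats_m, feats_n = present(tag_m), present(tag_n)
    return len(feats_m & feats_n), len(feats_m), len(feats_n)


def min_max_attribute_ratios(tag_m: Tag, tag_n: Tag) -> Tuple[float, float]:
    """Return (X2, X3) under the assumed reading common / min and common / max."""
    common, m, n = attribute_counts(tag_m, tag_n)
    if min(m, n) == 0:
        return 0.0, 0.0  # assumption: a tag without attributes gives ratio 0
    return common / min(m, n), common / max(m, n)


tag1 = {"name": "Zhang San", "gender": "male",
        "age": "30 years old", "work unit": "company A"}
tag3 = {"name": "Zhang San", "age": "30 years old", "work unit": "company A"}
```

For data matching tag 1 and data matching tag 3, `attribute_counts(tag1, tag3)` gives a first number of 3 and second numbers of 4 and 3, matching the worked example above, so under the stated assumption the minimum attribute ratio is 3/3 and the maximum attribute ratio is 3/4.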

Further, determining a similarity ratio between the two data matching labels by: determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags; and for each similarity interval, determining the similarity ratio of the two data matching labels in the similarity interval based on the first number and the third number.

In the step, a plurality of similarity intervals are pre-divided, and a third number of attribute similarities in each preset similarity interval is determined according to the attribute similarities of the two data matching labels under the attribute characteristics; furthermore, for each similarity interval, the similarity ratio of the two data matching labels in the similarity interval is determined based on the first number and the third number of the attribute similarities in the similarity interval.

The similarity ratio is calculated by the following formula:

X₄ = #(a ≤ sim(V_i) < b) / common

where X₄ is the similarity ratio within the similarity interval [a, b), #(a ≤ sim(V_i) < b) is the third number, that is, the number of attribute similarities lying within the similarity interval [a, b), and common is the first number of attribute features included in both data matching tags.
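The per-interval similarity ratio can be sketched as follows, assuming X₄ is the third number divided by the first number and that one attribute similarity exists per shared attribute feature. The function name `similarity_ratios`, the interval boundaries and the similarity values are hypothetical and chosen only for illustration.

```python
def similarity_ratios(similarities, intervals):
    """Return, for each interval [a, b), the share of attribute similarities in it."""
    # First number: one similarity per attribute feature shared by both tags.
    common = len(similarities)
    if common == 0:
        # Assumption: with no shared attribute features every ratio is zero.
        return [0.0] * len(intervals)
    return [
        # Third number for [a, b) divided by the first number.
        sum(1 for s in similarities if a <= s < b) / common
        for a, b in intervals
    ]


# Hypothetical intervals; the last upper bound exceeds 1.0 so that a
# similarity of exactly 1.0 is still counted in a half-open interval.
intervals = [(0.0, 0.5), (0.5, 0.8), (0.8, 1.01)]
sims = [1.0, 0.9, 0.6]
x4 = similarity_ratios(sims, intervals)
print(x4)
```

Because the intervals are half-open, how the upper end of the last interval is chosen decides whether a similarity of exactly 1 is counted; the sketch widens the last bound, which is one possible choice rather than the one prescribed by the method.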

The method for constructing a similarity feature vector provided by the embodiments of the present application determines, from each of a plurality of acquired data matching tags, at least one attribute feature and the attribute value corresponding to each attribute feature. For every two data matching tags, the attribute similarity between attribute features of the same category is determined based on the attribute values of the attribute features in each of the two tags, and the statistical similarity and the similarity ratio between the two tags are determined based on their attribute similarities under each attribute feature. Finally, based on the statistical similarity, the similarity ratio and the attribute count ratio between every two data matching tags, a similarity feature vector is constructed for input to a matching model that determines whether the entities in the plurality of data matching tags match. In this way, data matching tags acquired from different data sources can be effectively converted into similarity feature vectors, which improves the robustness of the matching model, reduces its limitations, and thereby improves the accuracy of the entity matching result.
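To illustrate how these quantities might be combined for one pair of data matching tags, the sketch below concatenates the statistical similarity (maximum, minimum, average and median attribute similarity), the per-interval similarity ratios and the attribute count ratios into one vector. The function name `build_feature_vector`, the ordering of the components and the sample values are assumptions; the attribute count ratios are passed in precomputed.

```python
from statistics import mean, median


def build_feature_vector(similarities, intervals, count_ratios):
    """Concatenate statistical similarity, interval ratios and count ratios."""
    if not similarities:
        # Assumption: a pair without shared attribute features is rejected.
        raise ValueError("the two tags share no attribute features")
    # Statistical similarity over the per-attribute similarities.
    stats = [
        max(similarities),
        min(similarities),
        mean(similarities),
        median(similarities),
    ]
    # Similarity ratio for each half-open interval [a, b).
    common = len(similarities)
    interval_ratios = [
        sum(1 for s in similarities if a <= s < b) / common
        for a, b in intervals
    ]
    # Assumed ordering: statistics, then interval ratios, then count ratios.
    return stats + interval_ratios + list(count_ratios)


sims = [1.0, 0.9, 0.6]                      # hypothetical attribute similarities
intervals = [(0.0, 0.5), (0.5, 1.01)]       # hypothetical similarity intervals
vector = build_feature_vector(sims, intervals, count_ratios=[1.0, 0.75])
print(vector)
```

A fixed component order matters in practice, since the matching model receives only positions, not names; any consistent order would serve equally well.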

Referring to fig. 3, fig. 3 is a schematic structural diagram of a device for constructing a similarity feature vector according to an embodiment of the present disclosure. As shown in fig. 3, the construction apparatus 300 includes:

a tag obtaining module 310, configured to obtain a plurality of data matching tags to be matched;

a first determining module 320, configured to determine, for each data matching tag, at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tag;

a second determining module 330, configured to determine, for each two data matching tags, an attribute similarity between attribute features of the same category in the two data matching tags based on an attribute value corresponding to each attribute feature in each data matching tag;

a third determining module 340, configured to determine, based on the attribute features and the number of attribute features included in each of the two data matching tags, an attribute count ratio between the two data matching tags;

a vector construction module 350, configured to construct, based on the attribute similarity and the attribute number ratio between every two data matching tags in the plurality of data matching tags, a similarity feature vector for inputting to a matching model that determines whether an entity in the plurality of data matching tags matches.
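The work of the tag obtaining module and the first and second determining modules can be sketched as below: tags are taken as dictionaries of attribute features and values, and for every two tags a similarity is computed for each attribute feature present in both. The similarity measure is not specified in this passage, so `difflib.SequenceMatcher` from the Python standard library stands in for it as an assumption; the function names and sample tags are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attribute_similarities(tag_a: dict, tag_b: dict) -> dict:
    """Similarity of each attribute feature present in both tags."""
    shared = sorted(tag_a.keys() & tag_b.keys())
    return {
        # Stand-in measure (assumption): character-level sequence ratio.
        name: SequenceMatcher(None, str(tag_a[name]), str(tag_b[name])).ratio()
        for name in shared
    }


def pairwise_similarities(tags: list) -> dict:
    """Map each pair of tag indices (i, j), i < j, to its attribute similarities."""
    return {
        (i, j): attribute_similarities(tags[i], tags[j])
        for i, j in combinations(range(len(tags)), 2)
    }


# Hypothetical data matching tags.
tags = [
    {"name": "Zhang San", "age": "30", "gender": "male"},
    {"name": "Zhang San", "age": "31"},
    {"name": "Li Si", "work unit": "Company A"},
]
pairs = pairwise_similarities(tags)
for pair, sims in pairs.items():
    print(pair, sims)
```

The number of pairs grows quadratically with the number of tags, so a real deployment would likely restrict which pairs are compared; that concern is outside what this passage describes.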

Further, when constructing, based on the attribute similarity and the attribute count ratio between every two data matching tags in the plurality of data matching tags, a similarity feature vector for input to a matching model that determines whether the entities in the plurality of data matching tags match, the vector construction module 350 is configured to:

Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio and a maximum attribute ratio, the third determining module 340, when determining the attribute count ratio between the two data matching tags based on the attribute features included in each of the two data matching tags and the number of the attribute features, is configured to:

Further, the vector construction module 350 is configured to determine a similarity ratio between the two data matching labels by:

The device for constructing a similarity feature vector provided by the embodiments of the present application determines, from each of a plurality of acquired data matching tags, at least one attribute feature and the attribute value corresponding to each attribute feature. For every two data matching tags, the attribute similarity between attribute features of the same category is determined based on the attribute values of the attribute features in each of the two tags, and the attribute count ratio between the two tags is determined from the attribute features included in each tag and the number of those attribute features. Finally, based on the attribute similarity and the attribute count ratio between every two data matching tags, a similarity feature vector is constructed for input to a matching model that determines whether the entities in the plurality of data matching tags match. In this way, data matching tags acquired from different data sources can be effectively converted into similarity feature vectors, which improves the robustness of the matching model, reduces its limitations, and thereby improves the accuracy of the entity matching result.

Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 4, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.

The memory 420 stores machine-readable instructions executable by the processor 410. When the electronic device 400 runs, the processor 410 communicates with the memory 420 through the bus 430, and when the machine-readable instructions are executed by the processor 410, the steps of the method for constructing the similarity feature vector in the method embodiments shown in fig. 1 and fig. 2 can be performed.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for constructing the similarity feature vector in the method embodiments shown in fig. 1 and fig. 2 may be executed.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that the above embodiments are only specific embodiments of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for constructing a similarity feature vector, characterized by comprising the following steps:

acquiring a plurality of data matching tags to be matched;

2. The construction method according to claim 1, wherein constructing, based on the attribute similarity and the attribute count ratio between every two data matching tags in the plurality of data matching tags, the similarity feature vector for input to the matching model that determines whether the entities in the plurality of data matching tags match comprises:

3. The construction method according to claim 2, wherein, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio and a maximum attribute ratio, determining the attribute count ratio between the two data matching tags based on the attribute features included in each of the two data matching tags and the number of the attribute features comprises:

4. The construction method according to claim 3, wherein the similarity ratio between the two data matching tags is determined by:

5. The construction method according to claim 2, wherein the statistical similarity comprises a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity.

6. A similarity feature vector constructing apparatus, comprising:

7. The construction apparatus according to claim 6, wherein the vector construction module, when constructing, based on the attribute similarity and the attribute count ratio between every two data matching tags in the plurality of data matching tags, the similarity feature vector for input to the matching model that determines whether the entities in the plurality of data matching tags match, is configured to:

8. The construction apparatus according to claim 7, wherein, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio and a maximum attribute ratio, the third determining module, when determining the attribute count ratio between the two data matching tags based on the attribute features included in each of the two data matching tags and the number of the attribute features, is configured to:

9. An electronic device, comprising a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate through the bus; and when the machine-readable instructions are executed by the processor, the steps of the method for constructing a similarity feature vector according to any one of claims 1 to 5 are performed.

10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, performs the steps of the method for constructing a similarity feature vector according to any one of claims 1 to 5.