CN112733939A - Similarity feature vector construction method and device, electronic equipment and storage medium - Google Patents

Publication number
CN112733939A
Authority
CN
China
Prior art keywords
attribute
data matching
similarity
labels
ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110037613.2A
Other languages
Chinese (zh)
Inventor
黄艳香
吴信东
白强伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110037613.2A
Publication of CN112733939A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application provides a method and a device for constructing a similarity feature vector, an electronic device, and a storage medium. At least one attribute feature and the attribute value corresponding to each attribute feature are determined from each of a plurality of data matching tags; the attribute similarity between attribute features of the same category in any two data matching tags is determined based on the attribute value of each attribute feature of each data matching tag; and finally, a similarity feature vector, to be input to a matching model that determines whether the entities in the plurality of data matching tags match, is constructed based on the attribute similarity and the attribute number ratio between every two data matching tags. In this way, the plurality of data matching tags can be effectively converted into similarity feature vectors, the limitations of the matching model can be reduced, and the accuracy of the entity matching result can be improved.

Description

Similarity feature vector construction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method and an apparatus for constructing a similarity feature vector, an electronic device, and a storage medium.
Background
As enterprises continue to develop, factors such as changes of administrators, dispersed physical deployment, and system autonomy give rise to data with complex sources (different types of relational databases, data from different departments, and the like) and heterogeneous structures (SQL and NoSQL databases, text files, Hive big data, and the like), so unified management of the data assets of different departments is far from simple. During an enterprise's digital transformation, the integration and fusion of multi-source heterogeneous data is a necessary foundation for building upper-layer applications, and entity matching is a very important part of the data fusion process.
At present, feature vectors constructed by traditional machine learning methods, word embedding methods, and the like often depend on the number of features. When the numbers of features acquired from different data sources differ, using a matching model trained for one particular data source yields matching results that are not accurate enough. In addition, a matching model trained on a single data source cannot perform entity matching across data sources, so such models are highly limited.
Disclosure of Invention
In view of this, an object of the present application is to provide a method and an apparatus for constructing a similarity feature vector, an electronic device, and a storage medium, which can effectively convert data matching tags acquired from different data sources into the similarity feature vector, and further, are helpful for improving the robustness of a matching model and reducing the limitations of the matching model, so as to improve the accuracy of an entity matching result.
The embodiment of the application provides a construction method of a similarity characteristic vector, which comprises the following steps:
acquiring a plurality of data matching labels to be matched;
for each data matching label, determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching label;
for every two data matching labels, determining attribute similarity between attribute features of the same category in the two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label;
determining the attribute number ratio between the two data matching labels based on the attribute features and the number of the attribute features included in each of the two data matching labels;
and constructing a similarity feature vector used for being input into a matching model for determining whether the entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
Further, the constructing a similarity feature vector for inputting to a matching model for determining whether an entity in the plurality of data matching tags matches based on the attribute similarity between every two data matching tags in the plurality of data matching tags and the attribute count ratio includes:
for every two data matching labels, determining statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature;
and constructing a similarity feature vector which is input to a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, the determining the attribute count ratio between the two data matching tags based on the attribute features and the number of the attribute features included in each of the two data matching tags includes:
determining a first number of attribute features that the two data matching tags have in common, and a second number of attribute features that each of the two data matching tags includes;
based on the first number and the second number of each data matching tag, determining a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio between the two data matching tags.
Further, determining a similarity ratio between the two data matching labels by:
determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags;
and for each similarity interval, determining the similarity ratio of the two data matching labels in the similarity interval based on the first number and the third number.
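The two steps above can be sketched in Python as follows. This is only an illustrative sketch: the four equal-width intervals in `ASSUMED_INTERVALS`, the function name `similarity_ratios`, and the choice to count only the similarities of attribute features that both tags have are assumptions, since the application does not fix the preset similarity intervals.

```python
from typing import List, Sequence, Tuple

# Hypothetical preset similarity intervals (assumption, not from the application).
ASSUMED_INTERVALS: List[Tuple[float, float]] = [
    (0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0),
]


def similarity_ratios(
    shared_sims: Sequence[float],
    intervals: Sequence[Tuple[float, float]] = ASSUMED_INTERVALS,
) -> List[float]:
    """Return, per interval, (third number) / (first number).

    shared_sims holds the attribute similarities of the attribute
    features that both tags have, so its length is the first number.
    """
    first_number = len(shared_sims)
    if first_number == 0:
        # Assumption: no shared attribute features gives all-zero ratios.
        return [0.0] * len(intervals)
    last = len(intervals) - 1
    ratios = []
    for idx, (low, high) in enumerate(intervals):
        # Intervals are half-open [low, high); the final one also includes high.
        third_number = sum(
            1 for s in shared_sims
            if low <= s < high or (idx == last and s == high)
        )
        ratios.append(third_number / first_number)
    return ratios


ratios = similarity_ratios([1.0, 1.0, 0.5])
```

Because every similarity falls in exactly one interval, the ratios of one pair of tags sum to 1 whenever the pair shares at least one attribute feature.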
Further, the statistical similarity includes a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity.
The embodiment of the present application further provides a device for constructing a similarity feature vector, where the device for constructing includes:
the tag acquisition module is used for acquiring a plurality of data matching tags to be matched;
the first determining module is used for determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tags aiming at each data matching tag;
the second determining module is used for determining attribute similarity between the attribute features of the same category in each two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label;
a third determining module, configured to determine, based on the attribute features and the number of attribute features included in each of the two data matching tags, an attribute count ratio between the two data matching tags;
and the vector construction module is used for constructing a similarity characteristic vector which is input into a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
Further, when the vector construction module is configured to construct a similarity feature vector for input to a matching model that determines whether an entity in the plurality of data matching tags matches, based on attribute similarity between every two data matching tags in the plurality of data matching tags and an attribute count ratio, the vector construction module is configured to:
for every two data matching labels, determining statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature;
and constructing a similarity feature vector which is input to a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, the third determining module, when configured to determine the attribute count ratio between the two data matching tags based on the attribute features included in each of the two data matching tags and the number of the attribute features, is configured to:
determining a first number of attribute features that the two data matching tags have in common, and a second number of attribute features that each of the two data matching tags includes;
based on the first number and the second number of each data matching tag, determining a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio between the two data matching tags.
Further, the vector construction module is configured to determine a similarity ratio between the two data matching labels by:
determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags;
and for each similarity interval, determining the similarity ratio of the two data matching labels in the similarity interval based on the first number and the third number.
Further, the statistical similarity includes a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity.
An embodiment of the present application further provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate through the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the method for constructing a similarity feature vector described above.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for constructing a similarity feature vector are performed as described above.
With the method, the apparatus, the electronic device, and the storage medium for constructing the similarity feature vector provided in the embodiments of the present application, at least one attribute feature and an attribute value corresponding to each attribute feature are determined from each of the obtained data matching tags. For every two data matching tags, the attribute similarity between attribute features of the same category in the two tags is then determined based on the attribute value of each attribute feature in each of the two tags, and the attribute number ratio between the two tags is determined from the attribute features, and the number of attribute features, included in each tag. Finally, a similarity feature vector, to be input to a matching model that determines whether the entities in the plurality of data matching tags match, is constructed based on the attribute similarity and the attribute number ratio between every two data matching tags. Therefore, the data matching tags acquired from different data sources can be effectively converted into similarity feature vectors, which helps to improve the robustness of the matching model and reduce its limitations, so that the accuracy of the entity matching result can be improved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart of a method for constructing a similarity feature vector according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of another method for constructing a similarity feature vector according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a similarity feature vector constructing apparatus according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
Research shows that, at present, feature vectors constructed by traditional machine learning methods, word embedding methods, and the like often depend on the number of features. When the numbers of features acquired from different data sources differ, using a matching model trained for one particular data source yields matching results that are not accurate enough. In addition, a matching model trained on a single data source cannot perform entity matching across data sources, so such models are highly limited.
Based on this, the embodiment of the application provides a method for constructing the similarity feature vector, which is beneficial to improving the robustness of the matching model and reducing the limitation of the matching model, so that the accuracy of the entity matching result can be improved.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for constructing a similarity feature vector according to an embodiment of the present disclosure. As shown in fig. 1, a method for constructing a similarity feature vector provided in an embodiment of the present application includes:
S101, obtaining a plurality of data matching tags to be matched.
In this step, a plurality of data matching tags to be matched are obtained from different data sources, and each data matching tag contains an entity to be matched. For example, given data matching tag 1, "Zhang San, male, 30 years old, company A", and data matching tag 2, "Zhang San, male, 28 years old, company B", it can be determined whether the entity "Zhang San" in the two tags is the same person, or alternatively whether "company A" and "company B" are the same company.
Here, the number of attribute features in each data matching tag may be the same or different. For example, data matching tag 3 is "Zhang San, 30 years old, company A", so the number of attribute features in data matching tag 3 differs from the number in data matching tag 1 and data matching tag 2.
S102, for each data matching tag, determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tag.
In this step, for each data matching tag of the acquired multiple data matching tags, at least one attribute feature included in the data matching tag and an attribute value corresponding to each attribute feature are identified.
Continuing the above example, for data matching tag 1, "Zhang San, male, 30 years old, company A", the following are identified: attribute feature $A_{11}$ "name" with corresponding attribute value $V_{11}$ "Zhang San"; attribute feature $A_{21}$ "gender" with corresponding attribute value $V_{21}$ "male"; attribute feature $A_{31}$ "age" with corresponding attribute value $V_{31}$ "30 years old"; and attribute feature $A_{41}$ "work unit" with corresponding attribute value $V_{41}$ "company A". Here the first subscript indexes the attribute feature and the second subscript indexes the data matching tag. Similarly, the attribute features and corresponding attribute values in data matching tag 2 and data matching tag 3 can be obtained, which is not described herein again.
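As a minimal sketch of this step, each data matching tag can be turned into a mapping from attribute feature to attribute value. The sketch assumes that a tag arrives as a comma-separated string whose field order is known for its data source; the application does not specify how tags are parsed, so the helper `parse_tag` and the schema lists are hypothetical.

```python
from typing import Dict, List


def parse_tag(raw: str, schema: List[str]) -> Dict[str, str]:
    """Map each attribute feature in `schema` to the value at the same position in `raw`."""
    values = [part.strip() for part in raw.split(",")]
    if len(values) != len(schema):
        raise ValueError("tag does not match the assumed schema")
    return dict(zip(schema, values))


# Data matching tag 1 has four attribute features, tag 3 only three.
tag1 = parse_tag("Zhang San, male, 30 years old, company A",
                 ["name", "gender", "age", "work unit"])
tag3 = parse_tag("Zhang San, 30 years old, company A",
                 ["name", "age", "work unit"])
```

An attribute feature that a tag lacks, such as "gender" in data matching tag 3, is simply absent from the mapping, which corresponds to a null attribute value.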
S103, for every two data matching labels, determining attribute similarity between attribute features of the same category in the two data matching labels based on the attribute values corresponding to the attribute features in each data matching label.
In this step, the attribute similarity between attribute features of the same category is determined for every two of the plurality of data matching tags. Specifically, for every two data matching tags, the attribute similarity $\mathrm{sim}(V_i)$ under each attribute feature of the same category is determined based on the attribute value corresponding to each attribute feature in the two tags, where $\mathrm{sim}(V_i)$ is the attribute similarity of the two data matching tags under the $i$-th attribute feature of the same category.
Continuing the above example, the following are determined respectively: the attribute similarity between attribute features of the same category in data matching tag 1 and data matching tag 2; that in data matching tag 2 and data matching tag 3; and that in data matching tag 1 and data matching tag 3.
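Enumerating "every two data matching tags" amounts to taking all unordered pairs. A short sketch using the standard library is given below; the list `tag_ids` is only an illustrative stand-in for the three example tags.

```python
from itertools import combinations

# Identifiers of the three example data matching tags.
tag_ids = [1, 2, 3]

# All unordered pairs of tags for which similarities are computed.
pairs = list(combinations(tag_ids, 2))
print(pairs)  # [(1, 2), (1, 3), (2, 3)]
```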
Take data matching tag 1 and data matching tag 3 as an example. Data matching tag 1 has attribute feature $A_{11}$ "name" with corresponding attribute value $V_{11}$ "Zhang San", attribute feature $A_{21}$ "gender" with corresponding attribute value $V_{21}$ "male", attribute feature $A_{31}$ "age" with corresponding attribute value $V_{31}$ "30 years old", and attribute feature $A_{41}$ "work unit" with corresponding attribute value $V_{41}$ "company A". Data matching tag 3 has attribute feature $A_{13}$ "name" with corresponding attribute value $V_{13}$ "Zhang San", attribute feature $A_{33}$ "age" with corresponding attribute value $V_{33}$ "30 years old", and attribute feature $A_{43}$ "work unit" with corresponding attribute value $V_{43}$ "company A".
Here, since data matching tag 3 does not contain the attribute feature "gender", the attribute value $V_{23}$ corresponding to attribute feature $A_{23}$ "gender" is null for data matching tag 3. In this case, the similarity between data matching tag 1 and data matching tag 3 under the attribute feature "gender" is 0.
When the attribute values of an attribute feature of the same category are non-null in both data matching tags, the similarity of the two tags under that attribute feature can be calculated with an existing similarity measure, such as the traditional edit distance, Jaccard similarity, cosine similarity, or semantic similarity based on word embedding. In this way, the similarity $\mathrm{sim}(V_1)$ of data matching tag 1 and data matching tag 3 under the attribute feature "name", the similarity $\mathrm{sim}(V_3)$ under the attribute feature "age", and the similarity $\mathrm{sim}(V_4)$ under the attribute feature "work unit" can be calculated.
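The rule above (0 when either attribute value is null, otherwise an existing string similarity) might be sketched as follows. The sketch uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in for the similarity measures named in the text; it is not the edit distance or Jaccard similarity itself, and the function name `attribute_similarities` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict

tag1 = {"name": "Zhang San", "gender": "male",
        "age": "30 years old", "work unit": "company A"}
tag3 = {"name": "Zhang San", "age": "30 years old", "work unit": "company A"}


def attribute_similarities(tag_a: Dict[str, str],
                           tag_b: Dict[str, str]) -> Dict[str, float]:
    """Similarity per attribute feature; 0.0 when the value is null in either tag."""
    result = {}
    for feature in sorted(set(tag_a) | set(tag_b)):
        value_a = tag_a.get(feature)
        value_b = tag_b.get(feature)
        if value_a is None or value_b is None:
            # A missing attribute feature is treated as a null attribute value.
            result[feature] = 0.0
        else:
            # Stand-in similarity in [0, 1]; any measure from the text could be used.
            result[feature] = SequenceMatcher(None, value_a, value_b).ratio()
    return result


sims = attribute_similarities(tag1, tag3)
```

For data matching tag 1 and data matching tag 3 this gives a similarity of 1.0 under "name", "age", and "work unit", and 0.0 under "gender".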
S104, determining the attribute number ratio between the two data matching labels based on the attribute features and the number of the attribute features included in each data matching label of the two data matching labels.
In this step, for every two acquired data matching tags, an attribute number ratio between the two data matching tags is determined according to the attribute features and the number of the attribute features in each of the two data matching tags, where the attribute number ratio may include a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio.
S105, constructing similarity feature vectors input into a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
In this step, after determining the attribute similarity and the attribute number ratio between every two data matching tags in the plurality of data matching tags, based on the attribute similarity and the attribute number ratio between every two data matching tags, that is, based on "the attribute similarity and the attribute number ratio between the data matching tag 1 and the data matching tag 2", "the attribute similarity and the attribute number ratio between the data matching tag 1 and the data matching tag 3", and "the attribute similarity and the attribute number ratio between the data matching tag 2 and the data matching tag 3", a similarity feature vector corresponding to the plurality of data matching tags is constructed, where the similarity feature vector may be used as an input feature and input to a matching model that determines whether an entity in the plurality of data matching tags matches.
The similarity feature vector can be used to train an untrained matching model; alternatively, once the matching model has been trained, the similarity feature vector can be used to determine whether the entities in the plurality of data matching tags are the same entity. Continuing the above example, it is determined whether "Zhang San" in data matching tag 1 and data matching tag 2 is the same person.
According to the method for constructing the similarity feature vector, at least one attribute feature and an attribute value corresponding to each attribute feature can be determined from each of the acquired data matching tags. For every two data matching tags, the attribute similarity between attribute features of the same category in the two tags is then determined based on the attribute value of each attribute feature in each of the two tags, and the attribute number ratio between the two tags is determined from the attribute features, and the number of attribute features, included in each tag. Finally, a similarity feature vector, to be input to a matching model that determines whether the entities in the plurality of data matching tags match, is constructed based on the attribute similarity and the attribute number ratio between every two data matching tags. Therefore, the data matching tags acquired from different data sources can be effectively converted into similarity feature vectors, which helps to improve the robustness of the matching model and reduce its limitations, so that the accuracy of the entity matching result can be improved.
Referring to fig. 2, fig. 2 is a flowchart illustrating another method for constructing a similarity feature vector according to an embodiment of the present disclosure. As shown in fig. 2, the method for constructing a similarity feature vector provided in the embodiment of the present application includes:
S201, obtaining a plurality of data matching tags to be matched.
S202, for each data matching tag, determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tag.
S203, for every two data matching labels, determining the attribute similarity between the attribute features of the same category in the two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label.
S204, determining the attribute number ratio between the two data matching labels based on the attribute features and the number of the attribute features included in each data matching label of the two data matching labels.
S205, for every two data matching labels, determining the statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature.
In this step, for every two data matching labels in the plurality of data matching labels, the statistical similarity and the similarity ratio between the two data matching labels are calculated based on the attribute similarity of the two data matching labels under each attribute feature.
Here, the statistical similarity includes a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity. The similarity ratio is, for each preset similarity interval, the proportion of the attribute similarities between the two data matching tags that fall within that interval.
Specifically, the maximum attribute similarity is calculated by the following formula:
$$\mathrm{sim}_{max} = \max_{i}\ \mathrm{sim}(V_i)$$
where $\mathrm{sim}_{max}$ is the maximum attribute similarity, and $\mathrm{sim}(V_i)$ is the attribute similarity of the two data matching tags under the $i$-th attribute feature of the same category.
The minimum attribute similarity is calculated by the following formula:
$$\mathrm{sim}_{min} = \min_{i}\ \mathrm{sim}(V_i)$$
where $\mathrm{sim}_{min}$ is the minimum attribute similarity, and $\mathrm{sim}(V_i)$ is the attribute similarity of the two data matching tags under the $i$-th attribute feature of the same category.
The average attribute similarity is calculated by the following formula:
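$$\mathrm{sim}_{avg} = \frac{\sum_{i} \mathrm{sim}(V_i)}{\#(V_i,\ V_{im} \neq \mathrm{null}\ \text{and}\ V_{in} \neq \mathrm{null})}$$
where $\mathrm{sim}_{avg}$ is the average attribute similarity, $\#(V_i,\ V_{im} \neq \mathrm{null}\ \text{and}\ V_{in} \neq \mathrm{null})$ is the number of attribute features that both data matching tags have, and $\mathrm{sim}(V_i)$ is the attribute similarity of the two data matching tags under the $i$-th attribute feature of the same category.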
Calculating the median attribute similarity by the following formula:
$$\mathrm{sim}_{med} = \operatorname{median}_{c}\ \mathrm{sim}(V_c)$$
where $\mathrm{sim}_{med}$ is the median attribute similarity, and $\mathrm{sim}(V_c)$ is the attribute similarity under the $c$-th attribute feature.
Here, when the minimum attribute similarity, the average attribute similarity, and the median attribute similarity are calculated, only attribute features whose attribute values are non-null in both data matching tags are taken into account. If the attribute value of an attribute feature is null in either of the two data matching tags, the attribute similarity of the two tags under that attribute feature is ignored in these three calculations.
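The four statistics can be sketched in Python as follows, under the reading that the minimum, average, and median are taken only over attribute features that are non-null in both tags. The function name `statistical_similarity`, the dictionary keys, the example values, and the all-zero result for a pair with no shared attribute feature are assumptions made for illustration.

```python
from statistics import mean, median
from typing import Dict, Set


def statistical_similarity(sims: Dict[str, float],
                           shared: Set[str]) -> Dict[str, float]:
    """Maximum, minimum, average and median attribute similarity for one pair of tags.

    sims   -- attribute similarity per attribute feature (0.0 where a value is null)
    shared -- attribute features whose values are non-null in both tags
    """
    shared_values = [sims[feature] for feature in shared]
    if not shared_values:
        # Assumption: a pair with no shared attribute features gets all zeros.
        return {"max": 0.0, "min": 0.0, "avg": 0.0, "median": 0.0}
    return {
        "max": max(sims.values()),
        # Null attribute features are excluded from the next three statistics.
        "min": min(shared_values),
        "avg": mean(shared_values),
        "median": median(shared_values),
    }


# Illustrative values: "gender" is null in one tag, so its similarity is 0.0.
sims = {"name": 1.0, "gender": 0.0, "age": 1.0, "work unit": 0.5}
stats = statistical_similarity(sims, {"name", "age", "work unit"})
```

In this example the null "gender" feature does not drag the minimum down to 0: the minimum is 0.5, the median is 1.0, and the average is 2.5 / 3.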
S206, constructing similarity feature vectors input into a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
In this step, similarity feature vectors corresponding to the plurality of data matching labels are constructed based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels, where the similarity feature vectors may be input as input features to a matching model that determines whether entities in the plurality of data matching labels are matched.
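One way to picture this step is as a concatenation of the three groups of quantities into a single fixed-length list. The sketch below assumes one vector per pair of tags, four similarity intervals, and a particular ordering of the components; none of these is fixed by the application, and the function name `build_similarity_vector` and all numeric values are illustrative placeholders.

```python
from typing import Dict, List, Sequence

STAT_KEYS = ("max", "min", "avg", "median")   # statistical similarity
COUNT_KEYS = ("common", "min", "max")         # attribute number ratios


def build_similarity_vector(stats: Dict[str, float],
                            interval_ratios: Sequence[float],
                            count_ratios: Dict[str, float]) -> List[float]:
    """Concatenate the components for one pair of tags into one vector."""
    vector = [stats[key] for key in STAT_KEYS]
    vector.extend(interval_ratios)
    vector.extend(count_ratios[key] for key in COUNT_KEYS)
    return vector


# Placeholder inputs for one pair of tags (not derived from the application).
vector = build_similarity_vector(
    stats={"max": 1.0, "min": 0.5, "avg": 0.75, "median": 0.75},
    interval_ratios=[0.0, 0.0, 0.5, 0.5],
    count_ratios={"common": 0.75, "min": 1.0, "max": 0.75},
)
```

The length of the vector (here 4 + 4 + 3 = 11) depends only on the number of statistics, intervals, and ratios, not on how many attribute features each tag contains, which is what allows tags from different data sources to share one matching model.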
The descriptions of S201 to S204 may refer to the descriptions of S101 to S104, and the same technical effects can be achieved, which are not described in detail herein.
Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, step S204 includes: determining a first number of attribute features that each of the two data matching tags has, and a second number of attribute features that each of the two data matching tags includes; based on the first number and the second number of each data matching tag, determining a common attribute fraction, a minimum attribute fraction, and a maximum attribute fraction between the two data matching tags.
In this step, when the attribute count ratio includes the common attribute ratio, the minimum attribute ratio, and the maximum attribute ratio, a first number of attribute features included in both of the two data matching tags is determined before the attribute count ratio between the two data matching tags is determined. Corresponding to the above-described embodiment, for data matching tag 1 and data matching tag 3, the attribute features included in both tags are "name", "age" and "work unit"; the first number of attribute features for data matching tag 1 and data matching tag 3 is therefore 3.
Then, a second number of attribute features included in each of the two data matching tags is determined. Corresponding to the above embodiment, the attribute features included in data matching tag 1 are "name", "age", "gender" and "work unit", so the second number of attribute features included in data matching tag 1 is 4; the attribute features included in data matching tag 3 are "name", "age" and "work unit", so the second number of attribute features included in data matching tag 3 is 3.
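The counting in this example can be reproduced with a short Python sketch. Representing a data matching tag as a dictionary from attribute feature to attribute value, the function name `attribute_counts`, and the attribute values themselves are assumptions for illustration; only the attribute names come from the embodiment.

```python
from typing import Dict, Tuple


def attribute_counts(tag_a: Dict[str, str], tag_b: Dict[str, str]) -> Tuple[int, int, int]:
    """Return (first number, second number of tag_a, second number of tag_b)."""
    # First number: attribute features present in both tags.
    common = len(tag_a.keys() & tag_b.keys())
    # Second numbers: attribute features present in each tag.
    return common, len(tag_a), len(tag_b)


# Attribute values are hypothetical placeholders.
tag1 = {"name": "A", "age": "30", "gender": "F", "work unit": "X"}
tag3 = {"name": "A", "age": "31", "work unit": "X"}

common, m, n = attribute_counts(tag1, tag3)
```

Here `common`, `m` and `n` correspond to the first number (3) and the two second numbers (4 and 3) of the example.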
Finally, based on the determined first number and the second number of each data matching tag, determining a common attribute ratio, a minimum attribute ratio and a maximum attribute ratio between the two data matching tags by the following formulas:
Figure BDA0002894872860000131
wherein X_1 is the common attribute ratio, common is the first number of attribute features included in both of the two data matching tags, and m and n are the second numbers of attribute features included in each of the two data matching tags, respectively.
$$X_2 = \frac{\text{common}}{\min(m, n)}$$
wherein X_2 is the minimum attribute ratio, common is the first number of attribute features included in both of the two data matching tags, and min(m, n) is the smaller of the second numbers of attribute features included in each of the two data matching tags.
$$X_3 = \frac{\text{common}}{\max(m, n)}$$
wherein X_3 is the maximum attribute ratio, common is the first number of attribute features included in both of the two data matching tags, and max(m, n) is the larger of the second numbers of attribute features included in each of the two data matching tags.
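A minimal Python sketch of the minimum and maximum attribute ratios follows. It assumes each ratio is the first number divided by min(m, n) or max(m, n) respectively, as the definitions of the symbols suggest. The function name is hypothetical, and the common attribute ratio is left out because its formula cannot be recovered from the text alone.

```python
from typing import Tuple


def min_max_attribute_ratios(common: int, m: int, n: int) -> Tuple[float, float]:
    """Return (minimum attribute ratio, maximum attribute ratio).

    Assumes X_2 = common / min(m, n) and X_3 = common / max(m, n).
    """
    if min(m, n) <= 0:
        raise ValueError("each tag must include at least one attribute feature")
    return common / min(m, n), common / max(m, n)


# Tag 1 has 4 attribute features, tag 3 has 3, and 3 are shared.
x2, x3 = min_max_attribute_ratios(common=3, m=4, n=3)
```

For the example pair, the minimum attribute ratio is 3/3 and the maximum attribute ratio is 3/4, so a pair in which one tag's attributes are fully contained in the other's scores 1 on the first and less than 1 on the second.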
Further, determining a similarity ratio between the two data matching labels by: determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags; and for each similarity interval, determining the similarity ratio of the two data matching labels in the similarity interval based on the first number and the third number.
In the step, a plurality of similarity intervals are pre-divided, and a third number of attribute similarities in each preset similarity interval is determined according to the attribute similarities of the two data matching labels under the attribute characteristics; furthermore, for each similarity interval, the similarity ratio of the two data matching labels in the similarity interval is determined based on the first number and the third number of the attribute similarities in the similarity interval.
Calculating the similarity ratio by the following formula:
$$X_4 = \frac{\#\left(a \le sim(V_i) < b\right)}{\text{common}}$$
wherein X_4 is the similarity ratio within the similarity interval [a, b), #(a ≤ sim(V_i) < b) is the third number of attribute similarities lying within the similarity interval [a, b), and common is the first number of attribute features included in both of the two data matching tags.
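The interval-based similarity ratio can be sketched in Python as follows. The sketch assumes the ratio for an interval [a, b) is the count of attribute similarities falling in that interval divided by the first number. The function name, the interval boundaries, and the similarity values are hypothetical; the upper bound of the last interval is set slightly above 1 so that a similarity of exactly 1.0 is still counted.

```python
from typing import List, Sequence, Tuple


def similarity_ratios(
    sims: Sequence[float],
    intervals: Sequence[Tuple[float, float]],
    common: int,
) -> List[float]:
    """Return, for each half-open interval [a, b), the share of attribute
    similarities that fall inside it, relative to the first number."""
    if common <= 0:
        raise ValueError("the two tags share no attribute feature")
    return [sum(1 for s in sims if a <= s < b) / common for a, b in intervals]


# Hypothetical intervals and similarities for three shared attributes.
intervals = [(0.0, 0.5), (0.5, 0.8), (0.8, 1.01)]
ratios = similarity_ratios([1.0, 0.5, 0.75], intervals, common=3)
```

When every shared attribute has a similarity and the intervals cover the whole range, the ratios sum to 1, so they describe how the attribute similarities of a pair are distributed across the intervals.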
The method for constructing the similarity feature vector provided by the embodiment of the application can determine at least one attribute feature and an attribute value corresponding to each attribute feature from each data matching tag of a plurality of acquired data matching tags, further determine, for each two data matching tags, attribute similarity between attribute features of the same category in the two data matching tags based on the attribute value of each attribute feature in each data matching tag of the two data matching tags, and determine statistical similarity and similarity ratio between the two data matching tags based on the attribute similarity of the two data matching tags under each attribute feature; and finally, constructing a similarity feature vector used for being input into a matching model for determining whether the entities in the data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the data matching labels. Therefore, the data matching labels acquired from different data sources can be effectively converted into the similarity characteristic vectors, and further, the robustness of the matching model is improved, the limitation of the matching model is reduced, and therefore the accuracy of the entity matching result can be improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a device for constructing a similarity feature vector according to an embodiment of the present disclosure. As shown in fig. 3, the construction apparatus 300 includes:
a tag obtaining module 310, configured to obtain a plurality of data matching tags to be matched;
a first determining module 320, configured to determine, for each data matching tag, at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tag;
a second determining module 330, configured to determine, for each two data matching tags, an attribute similarity between attribute features of the same category in the two data matching tags based on an attribute value corresponding to each attribute feature in each data matching tag;
a third determining module 340, configured to determine, based on the attribute features and the number of attribute features included in each of the two data matching tags, an attribute count ratio between the two data matching tags;
a vector construction module 350, configured to construct, based on the attribute similarity and the attribute number ratio between every two data matching tags in the plurality of data matching tags, a similarity feature vector for inputting to a matching model that determines whether an entity in the plurality of data matching tags matches.
Further, the vector construction module 350, when constructing a similarity feature vector for input to a matching model for determining whether an entity in the plurality of data matching tags matches, based on the attribute similarity between every two data matching tags in the plurality of data matching tags and the attribute count ratio, is configured to:
for every two data matching labels, determining statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature;
and constructing a similarity feature vector which is input to a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
Further, when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio and a maximum attribute ratio, the third determining module 340, when determining the attribute count ratio between the two data matching tags based on the attribute features included in each of the two data matching tags and the number of the attribute features, is configured to:
determining a first number of attribute features that both of the two data matching tags have, and a second number of attribute features that each of the two data matching tags includes;
based on the first number and the second number of each data matching tag, determining a common attribute fraction, a minimum attribute fraction, and a maximum attribute fraction between the two data matching tags.
Further, the vector construction module 350 is configured to determine a similarity ratio between the two data matching labels by:
determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags;
and for each similarity interval, determining the similarity ratio of the two data matching labels in the similarity interval based on the first number and the third number.
Further, the statistical similarity includes a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity.
The device for constructing the similarity feature vector provided by the embodiment of the application can determine at least one attribute feature and an attribute value corresponding to each attribute feature from each data matching tag in the obtained plurality of data matching tags. Further, for every two data matching tags, the attribute similarity between attribute features of the same category in the two data matching tags is determined based on the attribute values of the respective attribute features in each of the two data matching tags; at the same time, the attribute number ratio between the two data matching tags is determined according to the attribute features and the number of attribute features included in each data matching tag. Finally, a similarity feature vector for input to a matching model that determines whether the entities in the plurality of data matching tags match is constructed based on the attribute similarity and the attribute number ratio between every two data matching tags in the plurality of data matching tags. Therefore, the data matching tags acquired from different data sources can be effectively converted into similarity feature vectors, which improves the robustness of the matching model, reduces its limitations, and can thereby improve the accuracy of the entity matching result.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 4, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.
The memory 420 stores machine-readable instructions executable by the processor 410, when the electronic device 400 runs, the processor 410 communicates with the memory 420 through the bus 430, and when the machine-readable instructions are executed by the processor 410, the steps of the method for constructing the similarity feature vector in the embodiment of the method shown in fig. 1 and fig. 2 may be executed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for constructing the similarity feature vector in the method embodiments shown in fig. 1 and fig. 2 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A construction method of a similarity feature vector is characterized by comprising the following steps:
acquiring a plurality of data matching labels to be matched;
for each data matching label, determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching label;
for every two data matching labels, determining attribute similarity between attribute features of the same category in the two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label;
determining the attribute number ratio between the two data matching labels based on the attribute features and the number of the attribute features included in each of the two data matching labels;
and constructing a similarity feature vector used for being input into a matching model for determining whether the entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
2. The constructing method according to claim 1, wherein constructing a similarity feature vector for input to a matching model that determines whether an entity in the plurality of data matching tags matches or not based on the attribute similarity between each two data matching tags in the plurality of data matching tags and an attribute count ratio comprises:
for every two data matching labels, determining statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature;
and constructing a similarity feature vector which is input to a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
3. The building method according to claim 2, wherein when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, the determining the attribute count ratio between the two data matching tags based on the attribute features and the number of attribute features included in each of the two data matching tags includes:
determining a first number of attribute features that both of the two data matching tags have, and a second number of attribute features that each of the two data matching tags includes;
based on the first number and the second number of each data matching tag, determining a common attribute fraction, a minimum attribute fraction, and a maximum attribute fraction between the two data matching tags.
4. The building method of claim 3, wherein the similarity ratio between the two data matching labels is determined by:
determining a third number of attribute similarities in each preset similarity interval based on the attribute similarity between the two data matching tags;
and for each similarity interval, determining the similarity ratio of the two data matching labels in the similarity interval based on the first number and the third number.
5. The method of claim 2, wherein the statistical similarity comprises a maximum attribute similarity, a minimum attribute similarity, an average attribute similarity, and a median attribute similarity.
6. A similarity feature vector constructing apparatus, comprising:
the tag acquisition module is used for acquiring a plurality of data matching tags to be matched;
the first determining module is used for determining at least one attribute feature and an attribute value corresponding to each attribute feature from the data matching tags aiming at each data matching tag;
the second determining module is used for determining attribute similarity between the attribute features of the same category in each two data matching labels based on the attribute value corresponding to each attribute feature in each data matching label;
a third determining module, configured to determine, based on the attribute features and the number of attribute features included in each of the two data matching tags, an attribute count ratio between the two data matching tags;
and the vector construction module is used for constructing a similarity characteristic vector which is input into a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the attribute similarity and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
7. The constructing apparatus of claim 6, wherein the vector constructing module, when being configured to construct the similarity feature vector for input to the matching model for determining whether the entity in the plurality of data matching tags matches based on the attribute similarity and the attribute count ratio between every two data matching tags in the plurality of data matching tags, is configured to:
for every two data matching labels, determining statistical similarity and similarity ratio between the two data matching labels based on the attribute similarity of the two data matching labels under each attribute feature;
and constructing a similarity feature vector which is input to a matching model for determining whether entities in the plurality of data matching labels are matched or not based on the statistical similarity, the similarity ratio and the attribute number ratio between every two data matching labels in the plurality of data matching labels.
8. The building apparatus according to claim 7, wherein when the attribute count ratio includes a common attribute ratio, a minimum attribute ratio, and a maximum attribute ratio, the third determination module, when configured to determine the attribute count ratio between the two data matching tags based on the attribute features included in each of the two data matching tags and the number of the attribute features, is configured to:
determining a first number of attribute features that both of the two data matching tags have, and a second number of attribute features that each of the two data matching tags includes;
based on the first number and the second number of each data matching tag, determining a common attribute fraction, a minimum attribute fraction, and a maximum attribute fraction between the two data matching tags.
9. An electronic device, comprising: processor, memory and bus, the memory stores machine readable instructions executable by the processor, when the electronic device runs, the processor and the memory communicate through the bus, the machine readable instructions are executed by the processor to execute the steps of the method for constructing similarity feature vector according to any one of claims 1 to 5.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, performs the steps of the method for constructing a similarity feature vector according to any one of claims 1 to 5.
CN202110037613.2A 2021-01-12 2021-01-12 Similarity feature vector construction method and device, electronic equipment and storage medium Pending CN112733939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110037613.2A CN112733939A (en) 2021-01-12 2021-01-12 Similarity feature vector construction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112733939A true CN112733939A (en) 2021-04-30

Family

ID=75590496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110037613.2A Pending CN112733939A (en) 2021-01-12 2021-01-12 Similarity feature vector construction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112733939A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination