CN113934866B

CN113934866B - Commodity entity matching method and device based on set similarity

Info

Publication number: CN113934866B
Application number: CN202111546445.6A
Authority: CN
Inventors: 张磊; 王文文; 任毅; 肖明明; 陈富强; 寇嘉敏
Original assignee: Luban Beijing Electronic Commerce Technology Co ltd
Current assignee: Luban Beijing Electronic Commerce Technology Co ltd
Priority date: 2021-12-17
Filing date: 2021-12-17
Publication date: 2022-03-08
Anticipated expiration: 2041-12-17
Also published as: CN113934866A

Abstract

The invention discloses a commodity entity matching method and device based on set similarity, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring a platform knowledge base and a knowledge base to be matched; inputting the platform knowledge base and the knowledge base to be matched into an entity matching model; and outputting an entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model. The method screens entities based on domain knowledge to narrow the matching range, calculates entity pair similarity using an optimized set similarity, and adjusts the ordering of entity pairs using domain rules. This can effectively improve the accuracy of entity alignment in multi-source heterogeneous data, addresses the difficulty of fusing the underlying data of traditional intelligent e-commerce platforms, greatly reduces manual intervention, and can provide a new approach for the sustainable development of e-commerce in traditional industries.

Description

Commodity entity matching method and device based on set similarity

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a commodity entity matching method and device based on set similarity.

Background

In recent years, knowledge graphs have been widely applied in knowledge-intensive scenarios, such as e-commerce business systems that support question answering, search and recommendation, because they organize and manage data well, can store different types of data and complex entity relationships, and support efficient data flow. In the e-commerce field, service scale keeps expanding and more complex data application scenarios appear. Large amounts of data are scattered across various sources and are mostly represented as unstructured text, the demand for data interconnection is growing, and so is the need for a deeper understanding of user requirements. Against this background, many knowledge graphs for the e-commerce field have appeared. For example, the catering and entertainment knowledge graph "Meituan Brain" built by Meituan mines and associates data from various scenarios to enable intelligent search and personalized recommendation of food in business districts; the e-commerce cognitive graph "AliCoCo" built by Alibaba enables more intelligent search and more accurate recommendation of user needs; and JD.com performs interest recall based on its commodity knowledge graph.

When a knowledge graph is constructed for industry e-commerce, especially for traditional industries whose data management practices are more conventional, many business departments are involved, the upstream and downstream industries are large, different knowledge bases draw on different information sources, and manual definition and proofreading vary. As a result, the underlying data of an intelligent e-commerce platform often suffers from duplication, semantic variation, non-standard naming, inconsistent formats and uneven quality, which ultimately makes it difficult to integrate multi-source heterogeneous data during construction. Knowledge fusion is therefore important in knowledge graph construction, and entity alignment is a key technology of knowledge fusion. Entity alignment identifies the entities in heterogeneous knowledge bases that refer to the same real-world object, establishes links among multi-source data and builds a standard, unified knowledge base, thereby achieving effective fusion of the underlying multi-source heterogeneous data.

Entity alignment, which may also be referred to as entity matching, link prediction, object recognition, etc., can be divided into probability-based methods, methods based on supervised and semi-supervised learning, and methods based on unsupervised learning. Entity matching generally requires three steps: data preprocessing, partition indexing and matching degree calculation. Data preprocessing mainly validates data before it enters the algorithm and normalizes its representation to ensure accuracy and validity. Partition indexing divides entities according to their index values to reduce invalid matches. Matching degree calculation computes the similarity of an entity pair to judge whether the two entities match. Common similarity measures can be divided into measures based on string similarity, measures based on spatial structure and measures based on representation learning; among them, the string-based Jaccard similarity is widely used because it is simple and effective.

Entity matching methods based on supervised and semi-supervised learning rely on a large number of manually labeled entity pairs as prior knowledge. They place high demands on the expertise and accuracy of the annotators and do not perform well for multi-source heterogeneous data fusion. Entity matching methods based on unsupervised learning take a long time to cluster accurately and are not suitable for practical application scenarios. Entity matching rules based on a probability model can use the Jaccard similarity directly to judge whether entities match, but they depend on domain information, and effective domain information is lacking in the object trade field targeted by the invention, so they are not suitable either. Therefore, for the multi-source heterogeneity of the underlying data in the e-commerce field of traditional industries, how to use knowledge graph technology to convert the multi-source heterogeneity problem of data into an entity alignment problem, and to realize information matching through entity alignment, is a problem that urgently needs to be solved.

Disclosure of Invention

To address the problem of how to use knowledge graph technology to convert the multi-source heterogeneity problem of data into an entity alignment problem in the knowledge graph field, the invention provides a method that realizes information matching through entity alignment.

In order to solve the above technical problem, the invention provides the following technical solutions:

In one aspect, the invention provides a commodity entity matching method based on set similarity. The method is implemented by an electronic device and includes:

S1, acquiring a platform knowledge base and a knowledge base to be matched.

S2, inputting the platform knowledge base and the knowledge base to be matched into an entity matching model.

S3, outputting an entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model.

Optionally, the entity matching model includes a knowledge base partitioning module, a data preprocessing module, an entity pair matching module, and an entity pair sorting module.

In S3, outputting the entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model includes:

S31, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module to obtain a plurality of groups of entity data sets S = {S1, S2, …, Sn} of the platform knowledge base and a plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁j} of the knowledge base to be matched, where Sk ∈ S and S₁k ∈ S₁; let k = 1.

S32, inputting S₁k into the data preprocessing module to obtain a preprocessed entity data set S₁'k.

S33, inputting the preprocessed entity data set S₁'k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree.

S34, inputting the entity pair matching degree into the entity pair sorting module to obtain a sorted entity pair data set.

S35, if k < j, letting k = k + 1 and returning to S32; if k = j, outputting all sorted entity pair data sets, i.e. the entity matching set.
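The loop over S32 to S35 can be sketched as a small driver function. This is only an illustrative sketch: the function name `run_matching`, the callable parameters and the toy stand-ins in `demo` are assumptions introduced here and are not part of the patent text.

```python
from typing import Callable, List


def run_matching(platform_groups: List[list],
                 supplier_groups: List[list],
                 preprocess: Callable,
                 match_pairs: Callable,
                 sort_pairs: Callable) -> list:
    """Iterate over the groups S1_k of the knowledge base to be matched (k = 1..j)."""
    entity_matching_set = []
    for group in supplier_groups:
        cleaned = preprocess(group)                       # S32: preprocessing
        scored = match_pairs(cleaned, platform_groups)    # S33: pair matching degree
        entity_matching_set.extend(sort_pairs(scored))    # S34: sorting
    return entity_matching_set                            # S35: output after k = j


def demo() -> list:
    # Hypothetical toy modules, used only to show the control flow.
    platform = [[{"name": "i-beam"}], [{"name": "rebar"}]]
    supplier = [[{"name": "I-Beam"}], [{"name": "Rebar"}]]

    def preprocess(group):
        return [{"name": row["name"].lower()} for row in group]

    def match_pairs(group, platform_groups):
        return [(s["name"], p["name"], 1.0 if s["name"] == p["name"] else 0.0)
                for s in group for pg in platform_groups for p in pg]

    def sort_pairs(scored):
        return sorted(scored, key=lambda t: -t[2])[:1]

    return run_matching(platform, supplier, preprocess, match_pairs, sort_pairs)
```

Passing the three modules as callables keeps the sketch independent of how each module is actually realized.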

Optionally, in S31, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base partitioning module to obtain the plurality of groups of entity data sets S = {S1, S2, …, Sn} of the platform knowledge base and the plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁j} of the knowledge base to be matched includes:

inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module, and dividing each of them according to a preset product name dictionary to obtain the plurality of groups of entity data sets S = {S1, S2, …, Sn} of the platform knowledge base and the plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁j} of the knowledge base to be matched.

Each entity data set among the plurality of groups of entity data sets of the platform knowledge base comprises a two-dimensional table of entity data. The table comprises a plurality of rows and a plurality of columns, where the first column f₁ is the entity name column and the second column f₂ through the n-th column f_n are entity attribute columns. Let m < n or m = n. When m < n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m and the other related entity attribute columns f_{m+1} through f_n; when m = n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m.

Each entity data set among the plurality of groups of entity data sets of the knowledge base to be matched comprises a two-dimensional table of entity data. The table comprises a plurality of rows and a plurality of columns, where the first column g₁ is the entity name column and the second column g₂ through the j-th column g_j are entity attribute columns. Let i < j or i = j. When i < j, the entity attribute columns comprise the entity key attribute columns g₂ through g_i and the other related entity attribute columns g_{i+1} through g_j; when i = j, the entity attribute columns comprise the entity key attribute columns g₂ through g_i.

Optionally, in S32, inputting S₁k into the data preprocessing module to obtain the preprocessed entity data set S₁'k includes:

S321, atomizing the entity data set S₁k according to a preset word segmentation dictionary to obtain an atomized entity data set.

S322, removing redundancy from the atomized entity data set to obtain a redundancy-removed entity data set.

S323, performing unit conversion on the redundancy-removed entity data set to obtain an entity data set with the same units as the platform knowledge base, i.e. the preprocessed entity data set S₁'k.

Optionally, in S321, atomizing the entity data set S₁k according to the preset word segmentation dictionary to obtain the atomized entity data set includes:

S3211, obtaining the maximum word segmentation length from the preset word segmentation dictionary, and segmenting the attribute information in the entity attribute columns of the entity data set S₁k by the maximum length matching method to obtain a word-segmented entity data set.

S3212, deleting, according to a preset stop-word dictionary, the words belonging to the stop-word dictionary from the attribute information in the entity attribute columns of the word-segmented entity data set to obtain a deleted entity data set.

S3213, splitting the attribute information in each of the entity attribute columns of the deleted entity data set according to a preset separator to obtain multiple columns of entity attributes, and deleting the original entity attribute column to obtain the atomized entity data set.
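The three atomization steps can be illustrated on a single attribute value. The sketch below assumes forward maximum matching as the "maximum length matching method" and falls back to single characters for text not in the dictionary; the function names, the sample dictionaries and the "/" separator are hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set


def max_match_segment(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)  # S3211: maximum length
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def atomize(value: str, dictionary: Set[str], stop_words: Iterable[str],
            separator: str = "/") -> List[str]:
    """Segment one attribute value, drop stop words, then split it into atomic columns."""
    stop = set(stop_words)
    tokens = [t for t in max_match_segment(value, dictionary) if t not in stop]  # S3212
    columns, current = [], []
    for token in tokens:                                                         # S3213
        if token == separator:
            columns.append("".join(current))
            current = []
        else:
            current.append(token)
    columns.append("".join(current))
    return [c for c in columns if c]
```

With the dictionary `{"spec", "Q235B", "hotrolled"}` and the stop words `{"spec", ":"}`, the value `"spec:Q235B/hotrolled"` is reduced to the two atomic attribute values `"Q235B"` and `"hotrolled"`.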

Optionally, in S322, removing redundancy from the atomized entity data set to obtain the redundancy-removed entity data set includes:

deleting, according to a similarity threshold, the entity attribute columns among the multiple columns of entity attributes of the atomized entity data set that are similar to the original entity attributes; and

deleting the index column of the atomized entity data set, where the index column is a column that does not contain an entity attribute value, to obtain the redundancy-removed entity data set.

Optionally, in S323, performing unit conversion on the redundancy-removed entity data set to obtain the entity data set with the same units as the platform knowledge base includes:

multiplying each piece of attribute information that carries a unit, in the entity attribute columns of the redundancy-removed entity data set, by a conversion constant to obtain an entity data set with the same units as the platform knowledge base, where the denominator of the conversion constant is the dimension unit of the platform knowledge base and the numerator is the dimension unit of the knowledge base to be matched.
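As a worked illustration of the conversion constant, the sketch below expresses both units in a common base (millimetres) so that the ratio "unit of the knowledge base to be matched over unit of the platform knowledge base" becomes a plain number. The unit table `TO_MM` and the function names are assumptions made for this example only.

```python
# Length of one unit expressed in millimetres (hypothetical table for illustration).
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def conversion_constant(source_unit: str, platform_unit: str) -> float:
    """Numerator: unit of the knowledge base to be matched; denominator: platform unit."""
    return TO_MM[source_unit] / TO_MM[platform_unit]


def convert(value: float, source_unit: str, platform_unit: str) -> float:
    """Multiply a supplier attribute value by the conversion constant."""
    return value * conversion_constant(source_unit, platform_unit)
```

For example, a supplier length of 5 cm with a platform that records millimetres gives a constant of 10 and a converted value of 50 mm.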

Optionally, in S33, inputting the preprocessed entity data set S₁'k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree includes:

S331, according to the Jaccard similarity of entity names, ranking the similarity between the entity name of the preprocessed entity data set S₁'k and the entity names in the platform knowledge base, and selecting the data sets in the platform knowledge base whose entity name similarity is higher than a preset threshold to obtain an entity set to be matched.

S332, calculating the Cartesian product of the entity set to be matched and the preprocessed entity data set S₁'k to obtain an entity pair set entitySet1, where entitySet1 = {(E^F, E^M) | E^F ∈ S₁'k, 1 ≤ k ≤ j; E^M ∈ Sk, 1 ≤ k ≤ n}.

S333, extracting the set of entity key attribute columns in the entity pair set, and selecting the entity pairs whose attribute information similarity is higher than a preset threshold to obtain an entity pair set to be matched, entitySet2.

S334, calculating the matching degree of the entity pairs in the entity pair set to be matched to obtain the entity pair matching degree.
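Steps S331 and S332 can be sketched with the standard Jaccard similarity |A ∩ B| / |A ∪ B| and a Cartesian product. The sketch uses the plain, unoptimized Jaccard measure over the characters of the names, which is an assumption; the optimized set similarity of the invention is not reproduced here, and the function names are hypothetical.

```python
from itertools import product
from typing import Iterable, List, Sequence, Tuple


def jaccard(a: Iterable, b: Iterable) -> float:
    """Standard Jaccard similarity of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_platform_groups(supplier_name: str, platform_names: Sequence[str],
                           threshold: float) -> List[str]:
    """S331: rank platform entity names by similarity and keep those above the threshold."""
    scored = sorted(((jaccard(supplier_name, name), name) for name in platform_names),
                    reverse=True)
    return [name for score, name in scored if score > threshold]


def entity_pairs(supplier_rows: Sequence, platform_rows: Sequence) -> List[Tuple]:
    """S332: Cartesian product, each pair ordered as (E^F, E^M)."""
    return list(product(supplier_rows, platform_rows))
```

Filtering by name before forming the Cartesian product keeps the number of pairs far smaller than pairing every supplier entity with every platform entity.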

Optionally, in S34, inputting the entity pair matching degree into the entity pair sorting module to obtain the sorted entity pair data set includes:

S341, calculating the matching threshold of the entity, where the threshold is defined using the length of the attribute information of the entity.

S342, selecting, among the entity pairs matched with the entity, the entity pairs whose matching degree is greater than or equal to the matching threshold.

S343, sorting the selected entity pairs in descending order of entity pair matching degree; when the matching degrees are equal, sorting by the length of the attribute information of the matching object in ascending order; and when the attribute information lengths are equal, sorting by the number of attribute values of the matching object in descending order.

S344, for the entity, taking the first-ranked entity pair as the best match and storing it in the entity pair data set entitySet3 to obtain the sorted entity pair data set.
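The selection and tie-breaking rules of S342 to S344 map naturally onto a composite sort key. In the sketch below the matching threshold is passed in as a parameter because its formula is not reproduced in this text; the `Candidate` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    platform_id: str   # identifier of the matching object (platform entity)
    degree: float      # entity pair matching degree
    attr_length: int   # length of the matching object's attribute information
    attr_count: int    # number of attribute values of the matching object


def best_match(candidates: List[Candidate], threshold: float) -> Optional[Candidate]:
    """Keep pairs at or above the threshold, then rank them and return the first."""
    kept = [c for c in candidates if c.degree >= threshold]              # S342
    # S343: degree descending, attribute length ascending, attribute count descending.
    kept.sort(key=lambda c: (-c.degree, c.attr_length, -c.attr_count))
    return kept[0] if kept else None                                     # S344
```

Returning `None` when no pair reaches the threshold is a design choice of this sketch; the text does not state what happens in that case.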

In another aspect, the invention provides a commodity entity matching device based on set similarity, which is used to implement the commodity entity matching method based on set similarity and includes:

an acquisition unit, configured to acquire the platform knowledge base and the knowledge base to be matched;

an input unit, configured to input the platform knowledge base and the knowledge base to be matched into the entity matching model; and

an output unit, configured to output the entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model.

Optionally, the output unit is further configured to:

Each entity data set among the plurality of groups of entity data sets of the platform knowledge base comprises a two-dimensional table of entity data. The table comprises a plurality of rows and a plurality of columns, where the first column f₁ is the entity name column and the second column f₂ through the n-th column f_n are entity attribute columns. Let m < n or m = n. When m < n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m and the other related entity attribute columns f_{m+1} through f_n; when m = n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m.

Optionally, the output unit is further configured to:

S332, calculating the Cartesian product of the entity set to be matched and the preprocessed entity data set S₁'k to obtain an entity pair set entitySet1, where entitySet1 = {(E^F, E^M) | E^F ∈ S₁'k, 1 ≤ k ≤ j; E^M ∈ Sk, 1 ≤ k ≤ n}.

S333, extracting the set of entity key attribute columns in the entity pair set.

Optionally, the output unit is further configured to:

S341, calculating the matching threshold of the entity, where the threshold is defined using the length of the attribute information of the entity.

S342, selecting the entity pairs matched with the entity.

Sorting by the length of the attribute information, and by the number of attribute values from large to small.

S344, for the entity

In one aspect, an electronic device is provided, where the electronic device includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above commodity entity matching method based on set similarity.

In one aspect, a computer-readable storage medium is provided, where at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the above commodity entity matching method based on set similarity.

The technical solutions provided by the embodiments of the invention have at least the following beneficial effects:

The above scheme targets the real data of an e-commerce platform in the industry e-commerce field. To address the heterogeneity of the underlying data in the e-commerce field of traditional industries, knowledge graph technology is applied to the actual transaction process to construct an industry e-commerce knowledge graph, the problem of matching the same object across the multi-source heterogeneous data underlying the e-commerce platform is converted into an entity alignment problem in the knowledge graph field, and a set similarity entity alignment algorithm based on domain knowledge is provided. The method screens entity pairs based on domain knowledge to narrow the matching range, calculates entity pair similarity using an optimized set similarity, and adjusts the ordering of entity pairs using domain rules. This can effectively improve the accuracy of entity alignment in multi-source heterogeneous data, addresses the difficulty of fusing the underlying data of traditional intelligent e-commerce platforms, greatly reduces manual intervention, and can provide a new approach for the sustainable development of e-commerce in traditional industries.

Drawings

In order to illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a commodity entity matching method based on set similarity according to the present invention;

FIG. 2 is a schematic flow chart of a commodity entity matching method based on set similarity according to the present invention;

FIG. 3 is a block diagram of a commodity entity matching device based on set similarity according to the present invention;

FIG. 4 is a schematic structural diagram of an electronic device according to the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, an embodiment of the invention provides a commodity entity matching method based on set similarity, which is implemented by an electronic device. The processing flow of the method, shown in the flowchart of FIG. 1, may include the following steps:

S11, acquiring a platform knowledge base and a knowledge base to be matched.

S12, inputting the platform knowledge base and the knowledge base to be matched into an entity matching model.

S13, outputting an entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model.

In S13, outputting the entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model includes:

S131, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module to obtain a plurality of groups of entity data sets S = {S1, S2, …, Sn} of the platform knowledge base and a plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁j} of the knowledge base to be matched, where Sk ∈ S and S₁k ∈ S₁; let k = 1.

S132, inputting S₁k into the data preprocessing module to obtain a preprocessed entity data set S₁'k.

S133, inputting the preprocessed entity data set S₁'k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree.

S134, inputting the entity pair matching degree into the entity pair sorting module to obtain a sorted entity pair data set.

S135, if k < j, letting k = k + 1 and returning to S132; if k = j, outputting all sorted entity pair data sets, i.e. the entity matching set.

Optionally, in S131, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base partitioning module to obtain the plurality of groups of entity data sets S = {S1, S2, …, Sn} of the platform knowledge base and the plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁j} of the knowledge base to be matched includes:

Each entity data set among the plurality of groups of entity data sets of the platform knowledge base comprises a two-dimensional table of entity data. The table comprises a plurality of rows and a plurality of columns, where the first column f₁ is the entity name column and the second column f₂ through the n-th column f_n are entity attribute columns. Let m < n or m = n. When m < n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m and the other related entity attribute columns f_{m+1} through f_n; when m = n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m.

Optionally, in S132, inputting S₁k into the data preprocessing module to obtain the preprocessed entity data set S₁'k includes:

S1321, atomizing the entity data set S₁k according to a preset word segmentation dictionary to obtain an atomized entity data set.

S1322, removing redundancy from the atomized entity data set to obtain a redundancy-removed entity data set.

S1323, performing unit conversion on the redundancy-removed entity data set to obtain an entity data set with the same units as the platform knowledge base, i.e. the preprocessed entity data set S₁'k.

Optionally, in S1321, atomizing the entity data set S₁k according to the preset word segmentation dictionary to obtain the atomized entity data set includes:

S13211, obtaining the maximum word segmentation length from the preset word segmentation dictionary, and segmenting the attribute information in the entity attribute columns of the entity data set S₁k by the maximum length matching method to obtain a word-segmented entity data set.

S13212, deleting, according to a preset stop-word dictionary, the words belonging to the stop-word dictionary from the attribute information in the entity attribute columns of the word-segmented entity data set to obtain a deleted entity data set.

S13213, splitting the attribute information in each of the entity attribute columns of the deleted entity data set according to a preset separator to obtain multiple columns of entity attributes, and deleting the original entity attribute column to obtain the atomized entity data set.

Optionally, in S1322, removing redundancy from the atomized entity data set to obtain the redundancy-removed entity data set includes:

Optionally, in S1323, performing unit conversion on the redundancy-removed entity data set to obtain the entity data set with the same units as the platform knowledge base includes:

Optionally, in S133, inputting the preprocessed entity data set S₁'k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree includes:

S1331, according to the Jaccard similarity of entity names, ranking the similarity between the entity name of the preprocessed entity data set S₁'k and the entity names in the platform knowledge base, and selecting the data sets in the platform knowledge base whose entity name similarity is higher than a preset threshold to obtain an entity set to be matched.

S1332, calculating the Cartesian product of the entity set to be matched and the preprocessed entity data set S₁'k to obtain an entity pair set entitySet1, where entitySet1 = {(E^F, E^M) | E^F ∈ S₁'k, 1 ≤ k ≤ j; E^M ∈ Sk, 1 ≤ k ≤ n}.

S1333, extracting the set of entity key attribute columns in the entity pair set.

S1334, calculating the matching degree of the entity pairs in the entity pair set to be matched to obtain the entity pair matching degree.

Optionally, in S134, inputting the entity pair matching degree into the entity pair sorting module to obtain the sorted entity pair data set includes:

S1341, calculating the matching threshold of the entity, where the threshold is defined using the length of the attribute information of the entity.

S1342, selecting the entity pairs matched with the entity.

S1343, sorting the selected entity pairs in descending order of entity pair matching degree; when the matching degrees are equal, sorting by the length of the attribute information of the matching object, and then by the number of attribute values in descending order.

S1344, for the entity

The embodiment of the invention targets the real data of an e-commerce platform in the industry e-commerce field. To address the heterogeneity of the underlying data in the e-commerce field of traditional industries, knowledge graph technology is applied to the actual transaction process to construct an industry e-commerce knowledge graph, the problem of matching the same object across the multi-source heterogeneous data underlying the e-commerce platform is converted into an entity alignment problem in the knowledge graph field, and a set similarity entity alignment algorithm based on domain knowledge is provided. The method screens entity pairs based on domain knowledge to narrow the matching range, calculates entity pair similarity using an optimized set similarity, and adjusts the ordering of entity pairs using domain rules. This can effectively improve the accuracy of entity alignment in multi-source heterogeneous data, addresses the difficulty of fusing the underlying data of traditional intelligent e-commerce platforms, greatly reduces manual intervention, and can provide a new approach for the sustainable development of e-commerce in traditional industries.

As shown in FIG. 2, an embodiment of the invention provides a commodity entity matching method based on set similarity, which is implemented by an electronic device. The processing flow of the method, shown in the flowchart of FIG. 2, may include the following steps:

S21, acquiring a platform knowledge base and a knowledge base to be matched.

The platform knowledge base can be a material entity data set of the e-commerce platform and comprises a two-dimensional table of material entity data of the e-commerce platform, each row in the platform knowledge base can be an e-commerce platform material entity, and each material entity comprises a plurality of columns of specific attribute information.

The knowledge base to be matched can be a commodity entity data set provided by a supplier, and comprises a two-dimensional table of commodity entity data of the supplier, wherein each row corresponds to a commodity entity of the supplier, and each commodity entity comprises a plurality of columns of specific attribute information.

S22, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module, and dividing each of them according to a preset product name dictionary to obtain a plurality of groups of entity data sets S = {S1, S2, …, Sn} of the platform knowledge base and a plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁j} of the knowledge base to be matched.

Here Sk ∈ S and S₁k ∈ S₁; the initial value of k is set to 1.

Each entity data set among the plurality of groups of entity data sets of the platform knowledge base comprises a two-dimensional table of entity data. The table comprises a plurality of rows and a plurality of columns, where the first column f₁ is the entity name column and the second column f₂ through the n-th column f_n are entity attribute columns. Let m < n or m = n. When m < n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m and the other related entity attribute columns f_{m+1} through f_n; when m = n, the entity attribute columns comprise the entity key attribute columns f₂ through f_m.

The product name dictionary contains the entity names of the e-commerce platform and the supplier platform, so that the knowledge bases can be divided during subsequent processing. Its format can be ["entity name 1", "entity name 2", …].

In one possible embodiment, the entity name column is determined from the product name dictionary: for each column, the number of times its elements appear in the product name dictionary is counted, and the column with the highest count is defined as the entity name column. The knowledge base is then divided according to the entity name column, so that entities of the same type in the original data set are grouped into one data set by entity name. This allows the conversion and normalized representation of entity attribute measurement units to be performed uniformly on each type of entity.

For example, the data in the e-commerce platform knowledge base S is divided into several groups of data sets {S1, S2, …, Sn} according to entity names, where each group contains several rows of commodity information with the same entity name. For example, S1 is the data set of commodities whose entity name is "I-beam"; S1 contains several rows of commodity information with the entity name "I-beam" and n columns of attribute information, and can be represented by its column names as S1column = {f₁, f₂, f₃, …, f_m, …, f_n}, where f₁ is the entity name column, {f₂, f₃, …, f_m} are the entity key attribute columns, and {f_{m+1}, …, f_n} are the other related entity attribute information columns.

In the same way, the data in the supplier knowledge base S₁ is divided into {S₁1, S₁2, …, S₁j} according to entity names, each group containing several rows of commodity information with the same entity name. For example, S₁2 is the data set of one class of commodities with the same entity name; S₁2 contains several rows of commodity information with the same entity name and j columns of attribute information, and can be represented by its column names as S₁2column = {g₁, g₂, g₃, …, g_i, …, g_j}, where g₁ is the entity name column, {g₂, g₃, …, g_i} are the entity key attribute columns, and {g_{i+1}, …, g_j} are the other related information attribute columns of the supplier entity.
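The column-counting rule and the grouping by entity name can be illustrated with a minimal sketch. The function names `find_name_column` and `partition_by_name`, the list-of-rows table layout, and the sample rows are assumptions made for illustration only; they are not part of the described method.

```python
from collections import defaultdict


def find_name_column(rows, name_dictionary):
    """Return the index of the column whose cells hit the dictionary most often."""
    names = set(name_dictionary)
    n_cols = len(rows[0])
    hits = [sum(1 for row in rows if row[c] in names) for c in range(n_cols)]
    return max(range(n_cols), key=hits.__getitem__)


def partition_by_name(rows, name_col):
    """Group rows that share the same entity name into one data set."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[name_col]].append(row)
    return dict(groups)


# Hypothetical product name dictionary and knowledge base rows.
product_names = ["I-beam", "hot rolled steel strip"]
table = [
    ["Q235B", "I-beam", "12"],
    ["Q345", "hot rolled steel strip", "3"],
    ["Q235B", "I-beam", "14"],
]

name_col = find_name_column(table, product_names)
groups = partition_by_name(table, name_col)
```

Here the second column is selected as the entity name column because all of its cells appear in the dictionary, and the table is split into one group per entity name.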

S23, atomizing the entity data set S₁k according to a preset word segmentation dictionary to obtain an atomized entity data set.

In one possible embodiment, the step S23 may include the following steps S231 to S233:

S231, obtaining the maximum word segmentation length according to a preset word segmentation dictionary, and performing word segmentation on the attribute information in the entity attribute columns of the entity data set S₁k according to a maximum length matching method to obtain a word-segmented entity data set.

The word segmentation dictionary can be customized according to data requirements. It lists the words that need to be split out of the entity attribute data, up to the maximum length of a split field, which facilitates the word segmentation operation on the attribute data. The format of the segmentation dictionary is ["segmentation 1", "segmentation 2", "segmentation 3", …], for example, ["hot rolled steel strip", "i-steel", "altitude", …].

In a feasible implementation, the maximum word segmentation length is obtained from the word segmentation dictionary, and the entity attribute information of the S₁k data set is segmented using the maximum length matching method. In this method, the length of the longest word in the dictionary is taken as the initial number of characters to take, and the character string is scanned. If a matching word is found, the string is divided by adding separators before and after the matched word (the separators can be customized according to the specific data content and structure), the matched word is removed from the string to be matched, and segmentation continues on the remaining string. If no matching word is found, the number of characters taken is reduced by 1, until the dictionary is hit or only a single character is left.
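The maximum length matching procedure can be sketched as follows. This is an illustrative forward-matching sketch under assumptions: the function name `max_match_segment`, the space separator, and the sample dictionary are hypothetical, and unmatched single characters are kept together as one piece so that separators are only placed around dictionary hits.

```python
def max_match_segment(text, dictionary, sep=" "):
    """Segment text by forward maximum length matching against a dictionary."""
    words = set(dictionary)
    max_len = max(len(w) for w in words)  # maximum word segmentation length
    pieces, buffer, pos = [], "", 0
    while pos < len(text):
        size = min(max_len, len(text) - pos)
        # Shrink the window by one character until the dictionary is hit
        # or a single character is left.
        while size > 1 and text[pos:pos + size] not in words:
            size -= 1
        chunk = text[pos:pos + size]
        if chunk in words:
            if buffer:  # flush unmatched characters collected so far
                pieces.append(buffer)
                buffer = ""
            pieces.append(chunk)
        else:
            buffer += chunk
        pos += size
    if buffer:
        pieces.append(buffer)
    return sep.join(pieces)


segmented = max_match_segment("I-beamQ235B12", ["I-beam", "Q235B"])
```

With this hypothetical dictionary, the string is split into the two dictionary words and the unmatched remainder, each separated by a space.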

S232, deleting words belonging to the stop dictionary from the attribute information in the entity attribute column of the segmented entity data set according to the preset stop dictionary to obtain the deleted entity data set.

In one possible embodiment, since an intercepted character string may contain unnecessary characters, the stop dictionary is used to determine whether the intercepted character string is a stop word; if the character string is in the stop dictionary, it is deleted. The stop dictionary can be customized according to the data. It defines meaningless symbols and unwanted data that would interfere with the results of the subsequent entity alignment operation, so that such data can be deleted during data preprocessing. Its format is ["stop word 1", "stop word 2", "stop word 3", …], for example, ["sixty-five zero", "exit civil", "/", …].

S233, splitting the attribute information in each of the entity attribute columns of the deleted entity data set according to a preset separator to obtain multiple columns of entity attributes, and deleting the original entity attribute column to obtain an atomized entity data set.

In a possible implementation, the field is split according to the customized separator, the split results are saved as "original attribute column name 1", "original attribute column name 2", …, and the original entity attribute column is deleted, so as to obtain the atomized attribute values.

By way of example, suppose the original g₂ attribute column of the S₁k data set is an attribute of an entity, and the content obtained through the above process is "I Q235B 12". Since this group of data is separated by spaces, it is split on spaces to obtain "I", "Q235B" and "12". The resulting data is saved as the attribute columns "g₂₁", "g₂₂" and "g₂₃", and the original g₂ attribute column is deleted.
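Stop-word deletion and the split into new attribute columns can be sketched together. The function name `atomize_column`, the representation of a row as a dictionary keyed by column name, and the sample stop dictionary are assumptions for illustration.

```python
def atomize_column(rows, column, stop_words, sep=" "):
    """Split one attribute column into numbered columns and drop the original."""
    stop = set(stop_words)
    result = []
    for row in rows:
        # Split on the separator and delete tokens found in the stop dictionary.
        tokens = [t for t in row[column].split(sep) if t and t not in stop]
        new_row = {k: v for k, v in row.items() if k != column}
        for idx, token in enumerate(tokens, start=1):
            new_row[f"{column}{idx}"] = token  # e.g. g2 -> g21, g22, g23
        result.append(new_row)
    return result


rows = [{"g1": "I-beam", "g2": "I / Q235B 12"}]
atomized = atomize_column(rows, "g2", stop_words=["/"])
```

The original `g2` value is replaced by the three atomized columns `g21`, `g22` and `g23`, mirroring the example in the text.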

S24, removing redundancy from the atomized entity data set to obtain a redundancy-removed entity data set.

In a possible embodiment, the step S24 may include the following steps S241 to S242:

S241, deleting, according to a similarity threshold, the entity attribute columns that are similar to the original entity attributes among the multiple columns of entity attributes of the atomized entity data set.

In one possible embodiment, a new attribute column obtained by atomizing the S₁k data set may duplicate an original attribute column. Redundant attribute columns are removed by comparing the similarity between attribute columns. The similarity threshold for two columns can be set to 90%: if 90% or more of the data in the two columns is the same, the two columns are taken to describe the same attribute of the entity, and the duplicated attribute column is removed.

To illustrate, if the g₂₁ column holding an atomized attribute and the g₅ column in the original data have 90% or more of their data in common, the two columns describe the same attribute of the entity, and the g₅ column needs to be deleted.
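A minimal sketch of the 90% column comparison follows. It assumes a column-oriented table (a dictionary from column name to a list of values), compares values row by row, and, following the g₂₁/g₅ illustration, drops the original column when it duplicates an atomized one. The function names and sample data are hypothetical.

```python
def column_overlap(col_a, col_b):
    """Fraction of rows in which the two columns hold the same value."""
    same = sum(1 for a, b in zip(col_a, col_b) if a == b)
    return same / len(col_a)


def drop_redundant(table, atomized_cols, original_cols, threshold=0.9):
    """Drop original columns that duplicate an atomized column."""
    kept = dict(table)
    for new in atomized_cols:
        for old in original_cols:
            if old in kept and column_overlap(table[new], table[old]) >= threshold:
                del kept[old]
    return kept


table = {
    "g21": ["I"] * 10,
    "g5": ["I"] * 9 + ["H"],            # 90% identical to g21
    "g6": [str(i) for i in range(10)],  # unrelated column
}
deduplicated = drop_redundant(table, atomized_cols=["g21"], original_cols=["g5", "g6"])
```

Because `g5` agrees with `g21` in 9 of 10 rows, it reaches the threshold and is removed, while `g6` is kept.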

S242, deleting the index columns of the atomized entity data set, an index column being a column that does not contain entity attribute values, to obtain the redundancy-removed entity data set.

In a possible implementation, the data set may contain an index column, which can adversely affect the subsequent entity alignment result. Index columns are identified from the content of each column: if a column has no repeated values and contains none of the values in the key attributes, the column can be judged to be an index column and deleted.

For example, if a column of increasing numeric strings such as [0, 1, 2, 3, …] appears, that column can be judged to be an index column and deleted.
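The index-column test (no repeated values, and no value shared with the key attributes) can be sketched as below. The function name `is_index_column` and the sample values are assumptions for illustration.

```python
def is_index_column(values, key_attribute_values):
    """True if the column has no repeated value and shares no value with the key attributes."""
    unique = set(values)
    no_repeats = len(unique) == len(values)
    no_key_values = unique.isdisjoint(key_attribute_values)
    return no_repeats and no_key_values


key_values = {"I", "Q235B", "12", "14"}
index_like = is_index_column(["0", "1", "2", "3"], key_values)
attribute_like = is_index_column(["12", "14", "12"], key_values)
```

The increasing numeric column is flagged as an index column, whereas the column holding key attribute values is not.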

S25, multiplying the attribute information that carries units, in the entity attribute columns of the redundancy-removed entity data set, by a conversion constant to obtain an entity data set with the same units as the platform knowledge base.

The numerator of the conversion constant is a dimension unit of the platform knowledge base, and the denominator is a dimension unit of the knowledge base to be matched.

In a possible implementation, unit conversion is performed uniformly on the redundancy-removed data set. For data whose units are not uniform, the dimension units of the e-commerce platform are used as the standard. For each attribute value that carries a unit, a dimensionless fraction equal to 1, called a conversion constant, is first constructed, for example

$$\frac{1\ \mathrm{m}}{1000\ \mathrm{mm}} = 1,$$

with the denominator being the dimension unit of the supplier commodity data and the numerator being the dimension unit of the corresponding material on the e-commerce platform. The conversion constant is then multiplied by the supplier attribute value to be converted. For arbitrary data $x$ under a different dimension and a conversion constant $c$, this procedure can be written as

$$x' = x \cdot c.$$

For example, an entity in the supplier data set S₁′k is 2000 mm long and the standard length unit of the e-commerce platform is m; in this case

$$2000\ \mathrm{mm} \times \frac{1\ \mathrm{m}}{1000\ \mathrm{mm}} = 2\ \mathrm{m}.$$

The converted data is 2 m. After the converted data is obtained, the original units in the attribute values are deleted uniformly; for example, once 2000 mm has been converted to 2 m, the original unit mm following 2000 is deleted.

Where a supplier commodity entity attribute value is an interval, for example 100 < h < 300, the numerical values at the interval boundaries are converted into the dimension unit used by the e-commerce platform by means of the conversion constant, and the attribute value is stored in interval format.
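The conversion by a constant, for both single values and interval boundaries, can be sketched as follows. The unit table `TO_METRE`, the function names, and the restriction to length units are assumptions for illustration; exact fractions are used so that the constant behaves as a true ratio of units.

```python
from fractions import Fraction

# Hypothetical unit table: size of each length unit expressed in metres.
TO_METRE = {"mm": Fraction(1, 1000), "cm": Fraction(1, 100), "m": Fraction(1)}


def conversion_constant(supplier_unit, platform_unit):
    """Ratio that turns a value in the supplier unit into the platform unit."""
    return TO_METRE[supplier_unit] / TO_METRE[platform_unit]


def convert_value(number, supplier_unit, platform_unit):
    """Multiply the supplier value by the conversion constant and drop the unit."""
    return float(Fraction(str(number)) * conversion_constant(supplier_unit, platform_unit))


def convert_interval(low, high, supplier_unit, platform_unit):
    """Convert both boundaries of an interval value and keep the interval format."""
    return (
        convert_value(low, supplier_unit, platform_unit),
        convert_value(high, supplier_unit, platform_unit),
    )


length_m = convert_value(2000, "mm", "m")
height_m = convert_interval(100, 300, "mm", "m")
```

The first call reproduces the 2000 mm to 2 m example; the second converts the boundaries of 100 < h < 300 while keeping the value as an interval.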

The preprocessed entity data set S₁′k is obtained through the above S23 to S25.

S26, inputting the preprocessed entity data set S₁′k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree.

In one possible embodiment, the step S26 may include the following steps S261-S264:

S261, sorting, according to the Jaccard similarity of entity names, the similarity between the entity name of the preprocessed entity data set S₁′k and the entity names of the platform knowledge base, and selecting the data sets in the platform knowledge base whose entity name similarity is higher than a preset threshold to obtain an entity set to be matched.

In one possible embodiment, based on the Jaccard similarity of entity names shown in formula (1) below, the data sets of the e-commerce platform data set S are sorted by their entity name similarity to a group of supplier commodity entity data S₁′k. The e-commerce platform data sets whose entity name similarity is higher than the set threshold are selected as the data sets matched with the supplier group S₁′k and are set as the entity set to be matched.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$

where A and B are two sets.
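A minimal sketch of screening platform groups by Jaccard similarity of entity names is given below. Treating each name as a set of characters is an assumption (the text does not say how a name is turned into a set), and the function names, threshold value and sample names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_groups(supplier_name, platform_names, threshold):
    """Platform entity names ranked by similarity, keeping those above the threshold."""
    scored = sorted(
        ((jaccard(supplier_name, name), name) for name in platform_names),
        reverse=True,
    )
    return [name for score, name in scored if score > threshold]


candidates = candidate_groups("I-beam", ["angle steel", "H-beam", "I-beam"], threshold=0.5)
```

With character sets, "I-beam" and "H-beam" share five of seven distinct characters, so both groups pass the assumed threshold of 0.5, while "angle steel" is screened out.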

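S262, calculating the Cartesian product of the entity set to be matched and the preprocessed entity data set S₁′k to obtain the entity pair set entitySet1 = {(E^F, E^M) | E^F ∈ S₁′k, 1 ≤ k ≤ j, E^M ∈ Sk, 1 ≤ k ≤ n}.

S263, extracting the entity key attribute column sets {f₂, f₃, …, f_m} and {g₂, g₃, …, g_i} in the entity pair set, and selecting the entity pairs whose attribute information similarity is higher than a preset threshold to obtain the entity pair set entitySet2 to be matched.

S264, calculating the matching degree of the entity pairs in the entity pair set to be matched to obtain the entity pair matching degree sim(E^F, E^M).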

In one possible embodiment, the matching degree calculation aims to quantify how well an entity pair (E^F, E^M) matches. Using the matching degree value sim(E^F, E^M), different entity pairs can be compared to find the entity pairs that have a matching relationship.

First, count the number of attributes of the supplier entity E^F in S₁′k whose attribute values are the same as entity attribute values of the entity E^M in the e-commerce platform data set, that is, $\sum_{t=1}^{n} \delta(v_t, V^M)$, where $\delta(v_t, V^M)$ indicates whether the attribute value $v_t$ of E^F appears in the set $V^M$ of attribute values of E^M. Its value is given by formula (2): 1 when the value appears and 0 when it does not.

$$\delta(v_t, V^M) = \begin{cases} 1, & v_t \in V^M \\ 0, & v_t \notin V^M \end{cases} \qquad (2)$$

The number of identical attributes is then divided by the number of attributes n of the supplier entity in S₁′k to obtain the matching degree sim(E^F, E^M) of the entity pair, as shown in formula (3):

$$\mathrm{sim}(E^F, E^M) = \frac{1}{n} \sum_{t=1}^{n} \delta(v_t, V^M) \qquad (3)$$
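The entity pair matching degree, computed over the Cartesian product of supplier and platform entities, can be sketched as follows. The function name `matching_degree`, the dictionary-of-lists representation of entities, and the sample attribute values are assumptions for illustration.

```python
from itertools import product


def matching_degree(supplier_values, platform_values):
    """Share of the supplier entity's attribute values found among the platform entity's values."""
    platform_set = set(platform_values)
    hits = sum(1 for value in supplier_values if value in platform_set)
    return hits / len(supplier_values)


# Hypothetical atomized attribute values, keyed by entity identifier.
supplier_entities = {"F1": ["I", "Q235B", "12"]}
platform_entities = {"M1": ["I", "Q235B", "12", "6"], "M2": ["I", "Q345", "14"]}

# Cartesian product of supplier and platform entities, scored pair by pair.
scores = {
    (f_id, m_id): matching_degree(f_vals, m_vals)
    for (f_id, f_vals), (m_id, m_vals) in product(
        supplier_entities.items(), platform_entities.items()
    )
}
```

All three attribute values of `F1` occur in `M1`, giving a matching degree of 1.0, whereas only one of three occurs in `M2`.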

S27, inputting the entity pair matching degree into the entity pair sorting module to obtain a sorted entity pair data set.

In a possible embodiment, the step S27 may include the following steps S271 to S274:

S271, computing the matching threshold of the entity, the threshold being defined in terms of the length of the attribute information of the entity.

S272, selecting the entity pairs whose entity pair matching degree, when matched with the entity, is greater than or equal to the matching threshold.

S273, sorting the selected entity pairs from large to small according to their matching degree values; when the matching degree values are the same, sorting according to the length of the attribute information of the matching object from small to large; and when the lengths of the attribute information are the same, sorting according to the number of attribute values of the matching object from large to small.

S274, for the entity, taking the first-ranked entity pair as the best match and storing it in the entity pair data set entitySet3 to obtain the sorted entity pair data set.
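The selection and ordering of candidate pairs for one entity can be sketched as below. The threshold is passed in as a plain parameter because its formula is not reproduced in this text. The tie-breaking order (attribute information length ascending, then number of attribute values descending) follows the wording of claim 7. The function name, dictionary keys and sample values are hypothetical.

```python
def rank_candidates(candidates, threshold):
    """Keep candidates at or above the threshold and order them best-first."""
    kept = [c for c in candidates if c["sim"] >= threshold]
    # Matching degree descending, then attribute information length ascending,
    # then number of attribute values descending.
    kept.sort(key=lambda c: (-c["sim"], c["info_len"], -c["n_values"]))
    return kept


# Hypothetical candidate matching objects for a single supplier entity.
candidates = [
    {"match": "M1", "sim": 1.0, "info_len": 12, "n_values": 4},
    {"match": "M2", "sim": 1.0, "info_len": 10, "n_values": 3},
    {"match": "M3", "sim": 0.5, "info_len": 8, "n_values": 3},
    {"match": "M4", "sim": 1.0, "info_len": 10, "n_values": 5},
]

ranked = rank_candidates(candidates, threshold=0.6)
best_match = ranked[0]["match"] if ranked else None
```

`M3` falls below the assumed threshold; among the three remaining ties on matching degree, the shorter attribute information wins, and the larger number of attribute values breaks the final tie.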

S28, if k < j, let k = k + 1 and go to S23; if k = j, output all sorted entity pair data sets, i.e. the entity matching set.

In one possible embodiment, the above S23 to S28 perform entity matching for one group S₁k of the commodity entity data sets {S₁1, S₁2, …, S₁j} obtained by dividing the supplier knowledge base. After the entities of this group have been matched, the steps are repeated to match the other commodity entity data sets of the divided supplier knowledge base, until all commodity entities in the supplier knowledge base have been matched.

As shown in fig. 3, an embodiment of the present invention provides a commodity entity matching device 300 based on set similarity, where the device 300 is applied to implement a commodity entity matching method based on set similarity, and the device 300 includes:

the obtaining unit 310 is configured to obtain a platform knowledge base and a knowledge base to be matched.

And the input unit 320 is used for inputting the platform knowledge base and the knowledge base to be matched into the entity matching model.

The output unit 330 is configured to output an entity matching set based on the platform knowledge base, the knowledge base to be matched, and the entity matching model.

The output unit 330 is further configured to:

S33, inputting the preprocessed entity data set S₁′k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree.

Optionally, the output unit 330 is further configured to:

Each group of entity data sets in the plurality of groups of entity data sets of the knowledge base to be matched comprises an entity data two-dimensional table. The entity data two-dimensional table comprises a plurality of rows and a plurality of columns, wherein the first column g₁ is the entity name column, and the second column g₂ to the j-th column g_j are the entity attribute columns, with i < j or i = j. When i < j, the entity attribute columns comprise the entity key attribute columns from the second column g₂ to the i-th column g_i and the other related entity attribute columns from the (i+1)-th column g_{i+1} to the j-th column g_j. When i = j, the entity attribute columns comprise the entity key attribute columns from the second column g₂ to the i-th column g_i.

Optionally, the output unit 330 is further configured to:

S331, sorting, according to the Jaccard similarity of entity names, the similarity between the entity name of the preprocessed entity data set S₁′k and the entity names of the platform knowledge base, and selecting the data sets in the platform knowledge base whose entity name similarity is higher than a preset threshold to obtain an entity set to be matched.

S333, extracting the entity key attribute column sets {f₂, f₃, …, f_m} and {g₂, g₃, …, g_i} in the entity pair set, and selecting the entity pairs whose attribute information similarity is higher than a preset threshold to obtain the entity pair set entitySet2 to be matched.

Optionally, the output unit 330 is further configured to:

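S341, computing the matching threshold of the entity, the threshold being defined in terms of the length of the attribute information of the entity.

S342, selecting the entity pairs whose entity pair matching degree, when matched with the entity, is greater than or equal to the matching threshold.

S343, sorting the selected entity pairs from large to small according to their matching degree values; when the matching degree values are the same, sorting according to the length of the attribute information of the matching object from small to large; and when the lengths of the attribute information are the same, sorting according to the number of attribute values of the matching object from large to small.

S344, for the entity, taking the first-ranked entity pair as the best match and storing it in the entity pair data set entitySet3 to obtain the sorted entity pair data set.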

Fig. 4 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention. The electronic device 400 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402 and is loaded and executed by the processor 401 to implement the following commodity entity matching method based on set similarity:

S1, acquiring a platform knowledge base and a knowledge base to be matched.

In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory including instructions, the instructions being executable by a processor in a terminal to perform the above commodity entity matching method based on set similarity. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A commodity entity matching method based on set similarity is characterized by comprising the following steps:

S1, acquiring a platform knowledge base and a knowledge base to be matched;

S2, inputting the platform knowledge base and the knowledge base to be matched into an entity matching model;

S3, outputting an entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model;

the entity matching model comprises a knowledge base dividing module, a data preprocessing module, an entity pair matching module and an entity pair sequencing module;

in S3, outputting the entity matching set based on the platform knowledge base, the knowledge base to be matched, and the entity matching model includes:

S31, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module to obtain a plurality of groups of entity data sets S = {S1, S2, …, Sk, …, Sn} of the platform knowledge base and a plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁k, …, S₁j} of the knowledge base to be matched; wherein Sk ∈ S, S₁k ∈ S₁, and the initial value of k is set to 1;

S32, inputting the S₁k into the data preprocessing module to obtain a preprocessed entity data set S₁′k;

S33, inputting the preprocessed entity data set S₁′k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree;

S34, inputting the entity pair matching degree into the entity pair sorting module to obtain a sorted entity pair data set;

S35, if k < j, letting k = k + 1 and going to S32; if k = j, outputting all the sorted entity pair data sets, namely the entity matching set;

in S31, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module to obtain the plurality of groups of entity data sets S = {S1, S2, …, Sk, …, Sn} of the platform knowledge base and the plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁k, …, S₁j} of the knowledge base to be matched includes:

inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module, and dividing the platform knowledge base and the knowledge base to be matched respectively according to a preset product name dictionary to obtain the plurality of groups of entity data sets S = {S1, S2, …, Sk, …, Sn} of the platform knowledge base and the plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁k, …, S₁j} of the knowledge base to be matched;

each entity data set in the plurality of groups of entity data sets of the platform knowledge base comprises an entity data two-dimensional table; the entity data two-dimensional table comprises a plurality of rows and a plurality of columns, wherein the first column f₁ is the entity name column; the second column f₂ to the n-th column f_n are the entity attribute columns, with m < n or m = n; when m < n, the entity attribute columns comprise the entity key attribute columns from the second column f₂ to the m-th column f_m and the other related entity attribute columns from the (m+1)-th column f_{m+1} to the n-th column f_n; when m = n, the entity attribute columns comprise the entity key attribute columns from the second column f₂ to the m-th column f_m;

each group of entity data sets in the plurality of groups of entity data sets of the knowledge base to be matched comprises an entity data two-dimensional table; the entity data two-dimensional table comprises a plurality of rows and a plurality of columns, wherein the first column g₁ is the entity name column; the second column g₂ to the j-th column g_j are the entity attribute columns, with i < j or i = j; when i < j, the entity attribute columns comprise the entity key attribute columns from the second column g₂ to the i-th column g_i and the other related entity attribute columns from the (i+1)-th column g_{i+1} to the j-th column g_j; when i = j, the entity attribute columns comprise the entity key attribute columns from the second column g₂ to the i-th column g_i.

2. The method of claim 1, wherein in S32, inputting the S₁k into the data preprocessing module to obtain the preprocessed entity data set S₁′k includes:

S321, atomizing the entity data set S₁k according to a preset word segmentation dictionary to obtain an atomized entity data set;

S322, removing redundancy from the atomized entity data set to obtain a redundancy-removed entity data set;

S323, performing unit conversion on the redundancy-removed entity data set to obtain an entity data set with the same units as the platform knowledge base, thereby obtaining the preprocessed entity data set S₁′k.

3. The method according to claim 2, wherein in S321, atomizing the entity data set S₁k according to a preset word segmentation dictionary to obtain an atomized entity data set comprises:

S3211, obtaining a maximum word segmentation length according to the preset word segmentation dictionary, and performing word segmentation on the attribute information in the entity attribute columns of the entity data set S₁k according to a maximum length matching method to obtain a word-segmented entity data set;

S3212, deleting words belonging to a preset stop dictionary from the attribute information in the entity attribute columns of the word-segmented entity data set to obtain a deleted entity data set;

S3213, splitting the attribute information in each of the entity attribute columns of the deleted entity data set according to a preset separator to obtain multiple columns of entity attributes, and deleting the original entity attribute column to obtain an atomized entity data set.

4. The method according to claim 2, wherein the removing redundancy of the atomized entity data set in S322, and obtaining a redundancy-removed entity data set includes:

deleting entity attribute columns similar to the original entity attributes in the multi-column entity attributes of the atomized entity data set according to the similarity degree threshold;

5. The method according to claim 2, wherein the step of performing unit transformation on the entity data set with the redundancy removed in S323 to obtain an entity data set with the same unit as the platform knowledge base includes:

6. The method according to claim 2, wherein in S33, inputting the preprocessed entity data set S₁′k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree comprises:

S331, sorting, according to the Jaccard similarity of entity names, the similarity between the entity name of the preprocessed entity data set S₁′k and the entity names of the platform knowledge base, and selecting the data sets in the platform knowledge base whose entity name similarity is higher than a preset threshold to obtain an entity set to be matched;

S332, calculating the Cartesian product of the entity set to be matched and the preprocessed entity data set S₁′k to obtain the entity pair set entitySet1;

S333, extracting the entity key attribute column sets {f₂, f₃, …, f_m} and {g₂, g₃, …, g_i} in the entity pair set, and selecting the entity pairs whose attribute information similarity is higher than a preset threshold to obtain the entity pair set entitySet2 to be matched;

S334, calculating the matching degree of the entity pairs in the entity pair set to be matched to obtain the entity pair matching degree sim(E^F, E^M).

7. The method according to claim 6, wherein the inputting the entity pair matching degree into the entity pair sorting module in S34, and the obtaining the sorted entity pair data set comprises:

S341, calculating the matching threshold of the entity, the threshold being defined in terms of the length of the attribute information of the entity;

S342, selecting the entity pairs whose entity pair matching degree, when matched with the entity, is greater than or equal to the matching threshold;

S343, sorting the selected entity pairs from large to small according to their matching degree values; when the matching degree values are the same, sorting according to the length of the attribute information of the matching object from small to large; when the length values of the attribute information are the same, sorting according to the number of attribute values of the matching object from large to small;

S344, for the entity, taking the first-ranked entity pair as the best match and storing it in the entity pair data set entitySet3 to obtain the sorted entity pair data set.

8. A commodity entity matching apparatus based on set similarity, the apparatus comprising:

the acquisition unit is used for acquiring the platform knowledge base and the knowledge base to be matched;

the input unit is used for inputting the platform knowledge base and the knowledge base to be matched into the entity matching model;

the output unit is used for outputting an entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model; the entity matching model comprises a knowledge base dividing module, a data preprocessing module, an entity pair matching module and an entity pair sequencing module;

the outputting the entity matching set based on the platform knowledge base, the knowledge base to be matched and the entity matching model comprises:

S31, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module to obtain a plurality of groups of entity data sets S = {S1, S2, …, Sk, …, Sn} of the platform knowledge base and a plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁k, …, S₁j} of the knowledge base to be matched; wherein Sk ∈ S and S₁k ∈ S₁;

S32, inputting the S₁k into the data preprocessing module to obtain a preprocessed entity data set S₁′k;

S33, inputting the preprocessed entity data set S₁′k and the plurality of groups of entity data sets of the platform knowledge base into the entity pair matching module to obtain the entity pair matching degree;

S35, if k < j, letting k = k + 1 and going to S32; if k = j, outputting all sorted entity pair data sets, i.e. the entity matching set;

in S31, inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module to obtain the plurality of groups of entity data sets S = {S1, S2, …, Sk, …, Sn} of the platform knowledge base and the plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁k, …, S₁j} of the knowledge base to be matched includes:

inputting the platform knowledge base and the knowledge base to be matched into the knowledge base dividing module, and dividing the platform knowledge base and the knowledge base to be matched respectively according to a preset product name dictionary to obtain the plurality of groups of entity data sets S = {S1, S2, …, Sk, …, Sn} of the platform knowledge base and the plurality of groups of entity data sets S₁ = {S₁1, S₁2, …, S₁k, …, S₁j} of the knowledge base to be matched;

each entity data set in the plurality of groups of entity data sets of the platform knowledge base comprises an entity data two-dimensional table; the entity data two-dimensional table comprises a plurality of rows and a plurality of columns, wherein the first column f₁ is the entity name column; the second column f₂ to the n-th column f_n are the entity attribute columns, with m < n or m = n; when m < n, the entity attribute columns comprise the entity key attribute columns from the second column f₂ to the m-th column f_m and the other related entity attribute columns from the (m+1)-th column f_{m+1} to the n-th column f_n; when m = n, the entity attribute columns comprise the entity key attribute columns from the second column f₂ to the m-th column f_m;