CN113763061A

CN113763061A - Method and apparatus for aggregating similar articles

Info

Publication number: CN113763061A
Application number: CN202010494670.9A
Authority: CN
Inventors: 张雄伟; 赫阳; 陶通
Original assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2021-12-07

Abstract

The invention discloses a method and an apparatus for aggregating similar articles, and relates to the technical field of computers. One embodiment of the method comprises: repairing article attribute information; generating an article feature vector from the repaired article attribute information, and performing a nearest neighbor search on the article feature vectors to obtain a first similar article candidate set; processing the articles in the first similar article candidate set based on preset article key attributes to obtain a second similar article candidate set; selecting a third similar article candidate set from the second similar article candidate set based on a similar article discrimination model; and clustering the articles in the third similar article candidate set with a clustering algorithm to aggregate similar articles. The embodiment addresses the quality problem of article information, preserves the timeliness of the algorithm, improves detection and identification precision, and greatly improves the efficiency of similar article retrieval.

Description

Method and apparatus for aggregating similar articles

Technical Field

The invention relates to the technical field of computers, and in particular to a method and an apparatus for aggregating similar articles.

Background

With the development of science and technology and the progress of society, the electronic commerce field is rapidly developed. In the process of consumer shopping, one of the most common needs is to make final purchasing decisions by comparing a plurality of similar goods. In order to improve the click rate of commodities and the transaction rate of the commodities, the electronic commerce platform recommends similar commodities for the user when the user searches for the target commodities and browses the target commodities, so that convenience is provided for the user.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

Existing methods for identifying and detecting similar commodities suffer from insufficient timeliness, a high requirement on the completeness of commodity information, unstable model quality, and low accuracy.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for aggregating similar articles, which can solve the quality problem of article information, ensure timeliness of an algorithm, solve the problem of low detection and identification precision, and greatly improve similar article retrieval efficiency.

To achieve the above object, according to one aspect of the embodiments of the present invention, a method for aggregating similar articles is provided.

A method for aggregating similar articles, comprising: repairing article attribute information; generating an article feature vector from the repaired article attribute information, and performing a nearest neighbor search on the article feature vectors to obtain a first similar article candidate set; processing the articles in the first similar article candidate set based on preset article key attributes to obtain a second similar article candidate set; selecting a third similar article candidate set from the second similar article candidate set based on a similar article discrimination model; and clustering the articles in the third similar article candidate set with a clustering algorithm to aggregate similar articles.

Optionally, the article attribute information is repaired using an unsupervised learning method.

Optionally, the repairing the article attribute information includes: acquiring the attribute of the article to be repaired and a candidate word set of each article attribute from the article attribute information; for each article attribute, respectively calculating the conditional probability of each candidate word of the article attribute under each existing article attribute information based on the existing article attribute information; for each candidate word, multiplying the conditional probability of the candidate word under each existing article attribute information to obtain a result, and taking the result as the score of the candidate word; and taking the candidate word with the highest score as the final value of the article attribute so as to repair the article attribute.

Optionally, obtaining a first candidate set of similar articles by performing a nearest neighbor search on the article feature vector includes: calculating the vector distance between the article pairs through a nearest neighbor search technology based on the article feature vectors; and obtaining similar item pairs by limiting a threshold value so as to obtain a first similar item candidate set.

Optionally, based on a preset item key attribute, processing the items in the first similar item candidate set to obtain a second similar item candidate set includes: and based on preset item key attributes, screening the first similar item candidate set by calculating whether the values of the item key attributes of each similar item pair in the first similar item candidate set are matched to obtain a second similar item candidate set.

Optionally, the similar article discrimination model is trained as follows: selecting and labeling samples from the second similar article candidate set; extracting features of the similar article pairs in the labeled samples; and training the similar article discrimination model by performing machine learning on the extracted features.

According to another aspect of embodiments of the present invention, an apparatus for similar item aggregation is provided.

An apparatus for aggregating similar articles, comprising: an information repair module for repairing article attribute information; a first processing module for generating an article feature vector from the repaired article attribute information and obtaining a first similar article candidate set by performing a nearest neighbor search on the article feature vectors; a second processing module for processing the articles in the first similar article candidate set based on preset article key attributes to obtain a second similar article candidate set; a third processing module for selecting a third similar article candidate set from the second similar article candidate set based on a similar article discrimination model; and a fourth processing module for clustering the articles in the third similar article candidate set with a clustering algorithm to aggregate similar articles.

Optionally, the information repair module is further configured to: acquire, from the article attribute information, the attributes of the article to be repaired and a candidate word set for each such attribute; for each article attribute, calculate, based on the existing article attribute information, the conditional probability of each candidate word of that attribute given each piece of existing article attribute information; for each candidate word, multiply its conditional probabilities given each piece of existing article attribute information and take the result as the score of the candidate word; and take the candidate word with the highest score as the final value of the article attribute, thereby repairing the article attribute.

Optionally, the first processing module is further configured to: calculating the vector distance between the article pairs through a nearest neighbor search technology based on the article feature vectors; and obtaining similar item pairs by limiting a threshold value so as to obtain a first similar item candidate set.

Optionally, the second processing module is further configured to: and based on preset item key attributes, screening the first similar item candidate set by calculating whether the values of the item key attributes of each similar item pair in the first similar item candidate set are matched to obtain a second similar item candidate set.

According to yet another aspect of the embodiments of the present invention, an electronic device for aggregating similar articles is provided.

An electronic device for aggregating similar articles, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for aggregating similar articles provided by the embodiments of the present invention.

According to yet another aspect of embodiments of the present invention, a computer-readable medium is provided.

A computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method of similar item aggregation as provided by embodiments of the present invention.

One embodiment of the above invention has the following advantages or benefits. Article attribute information is repaired; an article feature vector is generated from the repaired article attribute information, and a nearest neighbor search on the article feature vectors yields a first similar article candidate set; the articles in the first similar article candidate set are processed based on preset article key attributes to obtain a second similar article candidate set; a third similar article candidate set is selected from the second similar article candidate set based on a similar article discrimination model; and the articles in the third similar article candidate set are clustered with a clustering algorithm to aggregate similar articles. Repairing the article attribute information addresses the quality problem of article information. Determining similar articles through a nearest neighbor search on the article feature vectors greatly reduces the amount of calculation, which saves computing resources and preserves the timeliness of the algorithm. The final similar article set is obtained by a nearest neighbor search on the features, followed by filtering based on the key attributes of the articles and a further screening based on a machine learning model, which addresses the problem of low detection and identification precision. In addition, the similar article sets are merged into similar article clusters by a clustering algorithm, and the article at the center point of each cluster is taken as the representative of that cluster, which greatly reduces the number of articles and greatly improves the efficiency of similar article retrieval.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main steps of a method of similar article aggregation according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the steps for implementing the repair of an article attribute according to one embodiment of the present invention;

FIG. 3 is a diagram illustrating the steps of implementing a nearest neighbor search on an item vector according to one embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating steps for screening similar objects based on a similar object discrimination model according to an embodiment of the present invention;

FIG. 5 is a schematic block diagram of the main modules of an apparatus for aggregating similar articles according to an embodiment of the present invention;

FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

With the rapid development of the electronic commerce field, the variety and number of commodities have increased greatly, as have the numbers of merchants and consumers. When shopping, one of the most common consumer needs is to compare several similar commodities on factors such as brand, price and reputation before making a final purchasing decision. To improve the click rate and transaction rate of commodities, an e-commerce platform recommends similar commodities when the user searches for a target commodity or browses its commodity detail page, which is convenient for the user and improves the user experience. To efficiently and accurately recommend commodities similar or identical to the user's shopping intention, the e-commerce platform needs to quickly find qualifying commodities in its massive commodity library and then recommend them to the user. This places very high demands on the timeliness and accuracy of similar commodity identification and detection techniques. How to rapidly identify and detect similar commodities in large-scale data has become a hot topic in academia and industry.

The identification and detection of similar commodities are methods and technologies for judging whether a plurality of commodities are the same commodity by utilizing big data correlation technology or manual identification. In the field of electronic commerce, the existing large-scale similar commodity identification and detection modes mainly include the following modes:

1. Similar commodities are aggregated based on key attributes of the commodities: one piece of commodity information is manually designated as key information, similar commodities are then aggregated in the commodity library of the e-commerce platform, and the key attributes within each aggregation unit are guaranteed to be consistent. Alternatively, similar commodities are aggregated based on commodity features: image features of the commodities are extracted, and the commodities are aggregated by calculating feature similarity. This approach is fast and can quickly obtain a similar commodity candidate set from a large-scale commodity library, but its drawbacks are also apparent. It considers only one aspect of the commodity information when aggregating similar commodities and ignores the full set of attributes and the semantic information of the title, so the identification accuracy of the similar commodity set is low and relatively few candidate similar commodities are recalled. Furthermore, it does not account for missing or erroneous commodity attributes. Because commodities are listed by merchants themselves and their attribute information is filled in manually, such defects further reduce the accuracy of the similar commodities found;

2. Similar commodities are aggregated based on nearest neighbor search: the commodities are vectorized through deep learning, and similar commodities are retrieved by nearest neighbor search. This approach is fast, but it faces the following challenges. Nearest neighbor search can usually obtain similar commodities only approximately, so the algorithm cannot guarantee sufficient accuracy. In addition, nearest neighbor search requires an empirical threshold to decide whether commodities are similar; this threshold is difficult to obtain and cannot be applied uniformly to the data of all categories in the commodity library of an e-commerce platform, which lowers accuracy;

3. Similar commodities are searched for by combining natural language processing and computer vision. This approach uses existing computer vision methods and natural language processing techniques to measure the visual similarity and the description text similarity between commodities, and combines the two measures into a final result to find similar commodities. Its advantage is that it integrates the image and text description of a commodity, quantifies commodity similarity comprehensively, and can detect and identify similar commodity sets accurately. However, it places a high requirement on the completeness of commodity information and needs a large number of samples to train the model, which makes it difficult to use in practical application scenarios.

The above conventional methods for detecting and identifying similar commodities have the following problems:

1. Low accuracy, so the results cannot be used directly in search and recommendation scenarios: both aggregation based on commodity key attributes and aggregation based on nearest neighbor search share the problem that the accuracy of the obtained results is low;

2. Insufficient timeliness, so the methods cannot be used for real-time calculation: searching and ranking similar commodity sets in a database based on the image and text information of the commodity being searched involves an extremely large amount of calculation when the number of commodities is massive, so the algorithm takes too long and cannot meet the user's timeliness requirement;

3. A very high requirement on the completeness of commodity information, which practical application scenarios cannot satisfy: on an e-commerce platform, merchants list commodities themselves, fill in the commodity information and upload the related pictures. In practice, because of merchant negligence, commodity information is often incomplete, or the image or text information is missing or wrong, so the deep learning model cannot give an accurate prediction and similar commodities are identified inaccurately;

4. Dependence of the machine learning model on the quality of the training samples: as mentioned in point 3, a large number of commodities on an e-commerce platform have missing or erroneous information, and a model trained on such commodity information is inevitably of unsatisfactory quality.

To solve the above technical problems, the invention provides a method and an apparatus for aggregating similar articles. Commodity attributes are repaired with an unsupervised algorithm, which addresses the quality problem of commodity information. Key attributes are selected from the commodity information and used as rules to filter the similar commodity candidate set, and the rule-filtered set is then screened again with a machine learning model to obtain the final similar commodity set, which addresses the low precision of similar commodity detection and identification. At the same time, the invention continues to use large-scale nearest neighbor search, which preserves the timeliness of the algorithm. In addition, the similar commodity sets are merged into similar commodity clusters by a clustering algorithm, and the commodity at the center point of each cluster is taken as the representative of that cluster, which greatly reduces the number of commodities. In search and commodity detail page recommendation scenarios, the full candidate set of commodities similar to a given commodity can be obtained simply by looking up the commodity cluster to which it belongs, which greatly improves the efficiency of similar commodity retrieval.

In the embodiment of the present invention, the merchandise attributes include, but are not limited to, brand, model, power consumption, size, etc. of the merchandise, and there are different merchandise attributes for different types of merchandise, for example, the clothing-related merchandise attributes include, but are not limited to, sleeve length, collar shape, applicable age, applicable scene, etc. of the merchandise, and the furniture-related merchandise attributes include, but are not limited to, material, classification, color, etc. of the merchandise.

FIG. 1 is a schematic diagram of the main steps of a method for aggregating similar articles according to an embodiment of the present invention. As shown in FIG. 1, the method mainly includes the following steps S101 to S105.

Step S101: repairing the article attribute information;

Step S102: generating an article feature vector from the repaired article attribute information, and performing a nearest neighbor search on the article feature vectors to obtain a first similar article candidate set;

Step S103: processing the articles in the first similar article candidate set based on preset article key attributes to obtain a second similar article candidate set;

Step S104: selecting a third similar article candidate set from the second similar article candidate set based on the similar article discrimination model;

Step S105: clustering the articles in the third similar article candidate set with a clustering algorithm to aggregate similar articles.

The main purpose of step S101 is to repair the article information so that it matches the characteristics of the article itself. This step focuses on repairing the attribute information of the article, using unsupervised learning methods and techniques. Taking commodity information repair in the e-commerce field as an example, one embodiment of the invention repairs commodity attributes with an unsupervised learning method; repair based on a Bayesian model is used below to describe how the commodity information is corrected.

FIG. 2 is a schematic diagram of the implementation steps for repairing the property of an article according to an embodiment of the present invention. As shown in fig. 2, in an embodiment of the present invention, a specific process of repairing the attribute of the article includes the following steps S1011 to S1014.

Step S1011: acquire, from the article attribute information, the attributes of the article to be repaired and a candidate word set for each such attribute. In this embodiment, the commodity attributes to be repaired and their corresponding candidate words are acquired as follows: the candidate word set {c₁, c₂, …, c_n} of a to-be-repaired attribute is obtained from the commodity title. Techniques for obtaining the candidate words include, but are not limited to, regular expressions and word segmentation tools. Taking the candidate word set for model words as an example, every string composed of English letters, digits and special characters can be extracted from the commodity title with a regular expression and used as the commodity's candidate model word set. For example, for the commodity titled 'Sony (SONY) WH-1000XM2 Hi-Res wireless Bluetooth headset intelligent noise reduction headset 1000x second generation black', the model word set extracted by the regular expression is {WH-1000XM2, Hi-Res, 1000x}.
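The candidate extraction in step S1011 can be sketched as follows. This is a minimal illustration, not the regular expression used in the patent: the function name `extract_model_candidates`, the token pattern, and the rule that a model word must contain a digit or a hyphen are all assumptions chosen so that the example title yields the set given in the text.

```python
import re

# Hypothetical pattern (an assumption): runs of ASCII letters and digits,
# optionally joined by hyphens. The patent does not give its actual regex.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def extract_model_candidates(title: str) -> list[str]:
    """Return candidate model words found in a commodity title, in order."""
    candidates = []
    for token in TOKEN_PATTERN.findall(title):
        # Assumed heuristic: a model word contains a digit or a hyphen.
        looks_like_model = "-" in token or any(ch.isdigit() for ch in token)
        if looks_like_model and token not in candidates:
            candidates.append(token)
    return candidates


title = ("Sony (SONY) WH-1000XM2 Hi-Res wireless Bluetooth headset "
         "intelligent noise reduction headset 1000x second generation black")
print(extract_model_candidates(title))  # ['WH-1000XM2', 'Hi-Res', '1000x']
```

On a title in the original language, ordinary words would not match the ASCII pattern at all, so the digit-or-hyphen heuristic mainly serves to drop plain brand strings such as SONY.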

Step S1012: for each article attribute, calculate, based on the existing article attribute information, the conditional probability of each candidate word of that attribute given each piece of existing attribute information. In this embodiment, with the other existing attribute information of the commodity taken as known, the conditional probability of each candidate word of the attribute is calculated. There may be one or more pieces of other attribute information. For example, the attribute information of the above commodity is "brand: Sony, type: noise reduction earphone, wearing mode: head-mounted". Assume that the set of other existing attribute information is {a₁, a₂, …, a_m}. The conditional probability P(c_i | a_j) of candidate word c_i given the known attribute a_j is calculated as:

$$P(c_i \mid a_j) = \frac{P(a_j, c_i)}{P(a_j)}$$

where P(a_j, c_i) is the probability that attribute a_j and candidate word c_i occur together, and P(a_j) is the probability that attribute a_j occurs.

Step S1013: for each candidate word, multiply its conditional probabilities given each piece of existing article attribute information, and take the result as the score of the candidate word. Specifically, for each candidate word of each commodity attribute, the conditional probabilities calculated for that candidate word in step S1012 are multiplied together, and the product is taken as the final score of the candidate word, that is:

$$\mathrm{Score}(c_i) = \prod_{j=1}^{m} P(c_i \mid a_j)$$

Step S1014: take the candidate word with the highest score as the final value of the article attribute, thereby repairing the article attribute. The candidate word with the highest score is selected as the final value of the commodity attribute, which gives the final value of the attribute to be repaired. That is:

$$c^{*} = \arg\max_{c_i \in \{c_1, \dots, c_n\}} \prod_{j=1}^{m} P(c_i \mid a_j)$$

In an embodiment of the present invention, take repairing the commodity attribute "model" as an example. The commodity title is "Sony (SONY) WH-1000XM2 Hi-Res wireless Bluetooth headset intelligent noise reduction headset 1000x second generation black", and its corresponding specification attributes are "brand: Sony, type: noise reduction earphone, wearing mode: head-mounted". In step S1011, the candidate word set for the attribute "model" is selected as {WH-1000XM2, Hi-Res, 1000x}. In step S1012, the attribute set corresponding to the commodity is the above specification attributes. With each piece of attribute information in the specification attributes as a condition, the occurrences of all commodities and their attributes in the commodity set are counted to calculate the probability that each piece of attribute information occurs and the probability that it occurs together with each model word, which gives the conditional probability P(c_i | a_j) of each candidate word of the attribute "model" given each piece of attribute information. Then, in step S1013, the conditional probabilities of each candidate word under all attributes are multiplied to obtain the final score of that candidate word, so that every candidate word has its own score. Finally, in step S1014, the candidate word with the highest score is selected as the final value of the attribute "model", which completes the repair of that attribute.
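Steps S1012 to S1014 can be sketched as follows, assuming the probabilities are estimated by simple counting over a commodity set. The function `repair_attribute`, the dictionary layout, and the four-item toy catalog are hypothetical and exist only for illustration; they are not data from the patent.

```python
from collections import Counter
from math import prod


def repair_attribute(catalog, known_attrs, candidates, target="model"):
    """Pick the candidate c_i that maximizes the product of P(c_i | a_j)."""
    attr_counts = Counter()   # occurrences of each known attribute a_j
    joint_counts = Counter()  # co-occurrences of (a_j, c_i)
    for item in catalog:
        value = item.get(target)
        for name, attr_value in item.items():
            if name == target:
                continue
            attr_counts[(name, attr_value)] += 1
            if value is not None:
                joint_counts[((name, attr_value), value)] += 1

    scores = {}
    for cand in candidates:
        conds = []
        for a in known_attrs.items():
            # P(c_i | a_j) = P(a_j, c_i) / P(a_j); the catalog size cancels out.
            count_a = attr_counts[a]
            conds.append(joint_counts[(a, cand)] / count_a if count_a else 0.0)
        scores[cand] = prod(conds)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy catalog (an assumption, not data from the patent).
catalog = [
    {"brand": "Sony", "type": "noise reduction earphone",
     "wearing mode": "head-mounted", "model": "WH-1000XM2"},
    {"brand": "Sony", "type": "noise reduction earphone",
     "wearing mode": "head-mounted", "model": "WH-1000XM2"},
    {"brand": "Sony", "type": "noise reduction earphone",
     "wearing mode": "in-ear", "model": "WF-1000X"},
    {"brand": "Bose", "type": "noise reduction earphone",
     "wearing mode": "head-mounted", "model": "QC35"},
]
known = {"brand": "Sony", "type": "noise reduction earphone",
         "wearing mode": "head-mounted"}
best, scores = repair_attribute(catalog, known, ["WH-1000XM2", "Hi-Res", "1000x"])
print(best)  # WH-1000XM2
```

In this toy data the winning score is (2/3) x (2/4) x (2/3) = 2/9, and the other two candidates score zero because they never appear as a model value. A candidate that is missing from a single condition gets a zero product, so a practical variant would probably add some form of smoothing; the patent text does not say whether it does.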

After the article attribute information is repaired, step S102 is executed: an article feature vector is generated from the repaired article attribute information, and a first similar article candidate set is obtained by performing a nearest neighbor search on the article feature vectors. The purpose of this step is to generate a candidate set of similar articles based on the feature vectors. The traditional approach is to calculate the similarity of every pair of articles directly and find the similar article set by applying a threshold; if the total number of articles is n, this requires n(n-1)/2 similarity calculations. In an application scenario with a large number of articles, this quadratic growth greatly increases the amount of calculation and wastes computing resources. To reduce the computational load, the present invention employs a large-scale nearest neighbor search technique to generate the candidate set of similar articles.
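To make the n(n-1)/2 figure concrete, the short sketch below evaluates it for a few catalog sizes. The helper name `pairwise_comparisons` and the sizes chosen are illustrative assumptions.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered article pairs among n articles: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (4, 1_000, 1_000_000):
    print(n, pairwise_comparisons(n))
# 4 6
# 1000 499500
# 1000000 499999500000
```

A catalog of one million articles already implies roughly 5 x 10^11 direct comparisons, which is the cost the nearest neighbor search is meant to avoid.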

Fig. 3 is a schematic diagram of the implementation steps of the nearest neighbor search on the item vector according to an embodiment of the present invention. In the embodiment of the present invention, taking the nearest neighbor search of the product vector as an example, the process of performing the nearest neighbor search on the product vector to obtain the first candidate set of similar articles mainly includes the following steps S1021 to S1023.

Step S1021: and generating an article feature vector according to the repaired article information. The commodity information is vectorized by using vectorization technology, wherein the vectorization technology comprises but is not limited to bag-of-words vector, Word2Vec and the like. The commodity information including the title information of the commodity and the attribute information of the commodity is vectorized, and a corresponding vector can represent the commodity.

Step S1022: based on the article feature vectors, calculate the vector distance between article pairs with a nearest neighbor search technique. In this embodiment, the vector distance of each commodity pair is calculated with a nearest neighbor search technique, which includes, but is not limited to, locality sensitive hashing; the vector distance includes, but is not limited to, the Euclidean distance and the Hamming distance. The result of this step is a set of commodity pairs together with the vector distance of each pair.

Step S1023: and obtaining similar item pairs by limiting a threshold value so as to obtain a first similar item candidate set. After the pairs of commodities and the corresponding vector distances are obtained in step S1022, a threshold is determined according to manual observation and experience, and if the distance is smaller than the threshold, the pairs of commodities are considered to be similar, otherwise, the pairs of commodities are not similar. And finally, a candidate set of similar commodities is obtained.

Step S1021 vectorizes the commodities. Because the candidate set of similar commodities is generated from the similarity of commodity titles, this embodiment adopts the bag-of-words vector method. Assuming there are 4 commodities, the commodity titles and their word segmentation results are shown in Table 1.

TABLE 1

The existing commodity titles are segmented with a common word segmentation tool, and a bag-of-words vector is then constructed for each title, yielding a vector {x₁, x₂, …, x_n} that represents the commodity title, for example for the commodity titles shown in Table 1. Next, in step S1022, the vector distance between each pair of commodities is calculated. This embodiment takes the Euclidean distance as an example. Assuming that the title vectors of any pair of commodities are x = {x₁, x₂, …, x_n} and y = {y₁, y₂, …, y_n}, the Euclidean distance formula is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
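As an illustration of steps S1021 and S1022, the following minimal Python sketch builds bag-of-words vectors from already-segmented titles and computes the Euclidean distance between two of them. The function names (`build_vocabulary`, `bag_of_words`, `euclidean`) and the sample titles are hypothetical and are not the contents of Table 1; word segmentation itself is assumed to have been done beforehand.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_titles):
    """Assign a fixed vector index to every distinct token."""
    vocab = {}
    for tokens in tokenized_titles:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def bag_of_words(tokens, vocab):
    """Return the term-count vector of one segmented title."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical segmented titles, for illustration only.
titles = [
    ["red", "dress", "long", "sleeve"],
    ["red", "dress", "short", "sleeve"],
    ["usb", "cable", "1m"],
]
vocab = build_vocabulary(titles)
vectors = [bag_of_words(t, vocab) for t in titles]
```

In this toy data the two dress titles differ in only two tokens, so their distance is much smaller than the distance between a dress title and the cable title.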

In this step, to reduce the amount of calculation, the invention proposes using a large-scale nearest neighbor search technique to calculate the Euclidean distances between commodity title vectors in the commodity set. Taking locality sensitive hashing as an example, each commodity title vector is mapped to a new vector by a specific hash function, vectors whose new vectors are similar are placed in the same set, and the vector distances are then calculated on the basis of these sets, so that the amount of calculation is greatly reduced compared with comparing every pair of vectors. Next, in step S1023, this embodiment selects a distance threshold according to manual observation and experience, filters the result produced in step S1022 with this threshold, and retains all commodity pairs whose distance is below the threshold, so as to obtain the final similar commodity candidate set as the first similar item candidate set.
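The bucketing idea and the threshold filter of step S1023 can be sketched as follows. This is only one possible form of locality sensitive hashing (random projections quantized into buckets of a fixed width), chosen here as an assumption because it suits the Euclidean distance; the source does not specify the hash function. All names and parameter values (`similar_pairs`, `num_tables`, `bucket_width`, and so on) are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def make_signature(dim, num_projections, bucket_width, rng):
    """Build a hash that quantizes random projections: floor((a.v + b) / w)."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(num_projections)]
    offsets = [rng.uniform(0.0, bucket_width) for _ in range(num_projections)]

    def signature(vec):
        return tuple(
            math.floor((sum(a * v for a, v in zip(plane, vec)) + b) / bucket_width)
            for plane, b in zip(planes, offsets)
        )

    return signature


def similar_pairs(vectors, threshold, num_tables=8, num_projections=2,
                  bucket_width=4.0, seed=0):
    """Return {(i, j): distance} for bucket-mates closer than the threshold."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(num_tables):
        signature = make_signature(len(vectors[0]), num_projections,
                                   bucket_width, rng)
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            buckets[signature(vec)].append(idx)
        # Only vectors sharing a bucket are compared exactly.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    result = {}
    for i, j in candidates:
        dist = euclidean(vectors[i], vectors[j])
        if dist < threshold:  # step S1023: keep pairs below the threshold
            result[(i, j)] = dist
    return result


pairs = similar_pairs([[1, 1, 0], [1, 1, 0], [9, 9, 9]], threshold=1.5)
```

Exact distances are computed only inside buckets, so the number of comparisons depends on bucket sizes rather than on the square of the catalog size. The price is that a truly close pair can occasionally be missed, which is why several hash tables are used.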

Then, for the obtained first candidate set of similar items, step S103 will be performed to perform a second processing based on the key attributes of the items to obtain a second candidate set of similar items. Specifically, based on a preset item key attribute, a second similar item candidate set is obtained by calculating whether the value of the item key attribute of each similar item pair in the first similar item candidate set matches to screen the first similar item candidate set.

In the embodiment of the invention, the screening rules used to screen similar items based on item key attributes can be preset. The main purpose of this step is to filter the first similar item candidate set generated in step S102 with manually set rules. The rules are made mainly by finding key attributes, where a key attribute is defined as a commodity attribute that can determine whether a pair of commodities is similar. For clothing commodities, for example, the commodity attribute "applicable age" is a key attribute. Commodities of different kinds and purposes have their own unique attributes that can serve as key attributes. In this method, several key attributes are obtained through manual observation and data statistics, and the similar commodity candidate set is filtered by checking whether the values of the key attributes in each commodity pair are equal or similar, so as to obtain a rule-based similar commodity candidate set.

In this step, the most critical part is finding the key attributes that can determine whether a commodity pair is a pair of similar commodities; these attributes generally refer to the specification attributes of the commodity. Take a clothing commodity such as a dress as an example. The title of the dress is 'Feimayi autumn and winter female skirt a-shaped skirt 2018 new collar long-sleeve spangle embroidery velvet dress 19516 date red XL', and its commodity attributes are 'waist type: middle waist; style: elegant, European and American, classical; thickness: moderate; collar type: stand collar; applicable age: 35-39 years old; type: body-shaping; sleeve length: long sleeve; skirt type: A-line skirt; applicable crowd: young women'. From common sense and experience, dresses can be distinguished by attributes such as waist type, collar type, applicable age, type, and applicable crowd, so this embodiment takes these attributes as key attributes and uses them to filter the first similar item candidate set generated in step S102. If the key attributes of any commodity pair are inconsistent, for example the values of the 'applicable crowd' attribute differ, the pair is considered not to be a similar commodity pair and is removed from the first similar item candidate set. The result is a similar commodity candidate set based on key attribute rules, which serves as the second similar item candidate set.
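The key attribute rule can be sketched in a few lines of Python. The sketch below assumes strict equality of key attribute values (the text also allows "similar" values, which is not modeled here) and treats an attribute missing from both items as matching. The attribute names follow the dress example; the item identifiers, data, and function names are hypothetical.

```python
# Key attributes for the dress category, following the example in the text.
KEY_ATTRIBUTES = ("waist type", "collar type", "applicable age", "applicable crowd")


def key_attributes_match(attrs_a, attrs_b, key_attributes=KEY_ATTRIBUTES):
    """Return True only if every key attribute has the same value in both items."""
    return all(attrs_a.get(key) == attrs_b.get(key) for key in key_attributes)


def filter_by_key_attributes(candidate_pairs, attributes, key_attributes=KEY_ATTRIBUTES):
    """Keep the pairs of the first candidate set whose key attributes agree."""
    return [
        (a, b)
        for a, b in candidate_pairs
        if key_attributes_match(attributes[a], attributes[b], key_attributes)
    ]


# Hypothetical item attributes and first candidate set.
attributes = {
    "sku1": {"waist type": "middle waist", "collar type": "stand collar",
             "applicable age": "35-39", "applicable crowd": "young women"},
    "sku2": {"waist type": "middle waist", "collar type": "stand collar",
             "applicable age": "35-39", "applicable crowd": "young women"},
    "sku3": {"waist type": "middle waist", "collar type": "stand collar",
             "applicable age": "35-39", "applicable crowd": "teenagers"},
}
first_candidates = [("sku1", "sku2"), ("sku1", "sku3")]
second_candidates = filter_by_key_attributes(first_candidates, attributes)
```

Here the pair ("sku1", "sku3") is dropped because the "applicable crowd" values differ, mirroring the example in the text.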

After the second similar item candidate set is obtained, step S104 is performed to select a third similar item candidate set from the second similar item candidate set based on the similar item discrimination model. Starting from the rule-filtered candidate set of similar commodities obtained in step S103, this step extracts features and applies a machine learning algorithm to predict whether a given commodity pair is a pair of similar commodities, so as to obtain the final candidate set of similar commodities.

Fig. 4 is a schematic diagram of the steps for screening similar items based on a similar item discrimination model according to an embodiment of the present invention. As shown in fig. 4, taking the screening of similar commodities based on the similar item discrimination model as an example, the method mainly includes the following steps S1041 to S1044.

Step S1041: selecting samples from the second similar item candidate set and labeling them. The main purpose of this step is to obtain training samples. The second similar item candidate set generated in step S103 is randomly sampled, the sampled pairs are evaluated manually, and a corresponding labeling specification is defined for each specific commodity category during the manual evaluation. The most important part of this step is giving the annotators the corresponding labeling specification. For example, in the technology product category, commodities whose mentioned models are inconsistent should be labeled as non-similar commodities, whereas when attributes such as capacity, color, or style are inconsistent, the annotators should label the commodities according to the actual situation.

Step S1042: extracting features of the similar item pairs from the labeled samples. This step extracts the features of commodity pairs used to train the similar commodity discrimination model. The extracted features include but are not limited to similarity features of the commodity titles, whether the commodity brands are consistent, whether the commodities of the pair belong to the same category, similarity distances of the commodity attributes, commodity image features, and the like. The similarity features of the commodity titles include but are not limited to the distance between the segmented word lists of the titles, the word order features of the segmented titles, and the like. The commodity attribute similarity distances include but are not limited to the Hamming distance of the commodity attributes, the edit distance of the commodity attributes, the degree of overlap of the commodity attribute lists, and the like. The commodity image features include but are not limited to the color histogram descriptor (CHD), histogram of oriented gradients (HOG) features, GIST features, and the like. The extracted features are stored in a feature library for use in subsequent processes.
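A few of the pair features named in step S1042 can be sketched as follows. The Jaccard overlap is used here as one assumed form of "word list distance"; the source does not fix a specific measure. Image features are omitted. The item record layout (`title_tokens`, `brand`, `category`, `attrs`) and all function names are hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Overlap of two token lists, between 0.0 and 1.0."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def attribute_hamming(attrs_a, attrs_b):
    """Number of attribute names whose values differ between the two items."""
    keys = set(attrs_a) | set(attrs_b)
    return sum(1 for k in keys if attrs_a.get(k) != attrs_b.get(k))


def pair_features(item_a, item_b):
    """Feature vector for one commodity pair, in a fixed order."""
    return [
        jaccard(item_a["title_tokens"], item_b["title_tokens"]),
        1.0 if item_a["brand"] == item_b["brand"] else 0.0,
        1.0 if item_a["category"] == item_b["category"] else 0.0,
        float(attribute_hamming(item_a["attrs"], item_b["attrs"])),
    ]


# Hypothetical items, for illustration only.
item_a = {"title_tokens": ["red", "dress"], "brand": "X", "category": "dress",
          "attrs": {"collar type": "stand collar", "sleeve length": "long sleeve"}}
item_b = {"title_tokens": ["red", "skirt"], "brand": "X", "category": "dress",
          "attrs": {"collar type": "stand collar", "sleeve length": "short sleeve"}}
features = pair_features(item_a, item_b)
```

The `edit_distance` helper can be applied to individual attribute values where a graded rather than exact comparison is wanted.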

Step S1043: training the similar item discrimination model by applying machine learning to the extracted features. In this step, a discrimination model for similar commodities is trained from the training sample set obtained in step S1041 and the feature values extracted in step S1042, using a common machine learning algorithm. The machine learning algorithms referred to here include common supervised learning algorithms, such as logistic regression and decision trees, as well as ensemble learning algorithms that combine common supervised learning algorithms. This step yields a discrimination model that can be used to determine whether the two commodities in a commodity pair are similar commodities.

Step S1044: selecting a third similar item candidate set from the second similar item candidate set based on the similar item discrimination model. The trained model is applied to the full similar commodity candidate set: the similar item discrimination model obtained in step S1043 takes the second similar item candidate set produced in step S103 as input and predicts whether each commodity pair is similar, thereby yielding the final similar commodity candidate set as the third similar item candidate set.
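Steps S1043 and S1044 can be illustrated with logistic regression, one of the supervised algorithms the text names. The sketch below is a minimal pure-Python version trained by batch gradient descent; it is an assumption for illustration, not the model actually used, and the toy features, labels, learning rate, and function names are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for k, v in enumerate(x):
                grad_w[k] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_similar(x, weights, bias, cutoff=0.5):
    """Return True if the model scores the pair as similar."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= cutoff


def select_third_candidate_set(pairs, feature_of, weights, bias):
    """Step S1044: keep the pairs that the model predicts to be similar."""
    return [p for p in pairs if predict_similar(feature_of[p], weights, bias)]


# Hypothetical labeled samples: [title overlap, brand match] -> similar (1) or not (0).
train_x = [[0.9, 1.0], [0.8, 1.0], [0.1, 0.0], [0.2, 0.0]]
train_y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(train_x, train_y)

feature_of = {("a", "b"): [0.9, 1.0], ("a", "c"): [0.1, 0.0]}
third = select_third_candidate_set(list(feature_of), feature_of, weights, bias)
```

In practice the features would come from the feature library of step S1042 and the labels from the annotation of step S1041; a decision tree or an ensemble model could be substituted without changing the selection step.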

Finally, the first, second, and third similar item candidate sets obtained so far all express similarity relations between item pairs. To reduce the amount of data that must be searched for similar items when recommending items, the invention performs step S105 to obtain similar item clusters through a clustering algorithm. The main purpose of this step is to integrate the third similar item candidate set obtained in step S104 and organize the similar items into clusters, so that any two items in each cluster are similar items. Based on the final third similar item candidate set obtained in step S104, an unsupervised clustering algorithm groups all mutually similar items into the same cluster, finally yielding a set of similar item clusters. The unsupervised clustering algorithm includes but is not limited to the connected subgraph algorithm, the Gaussian mixture model, and other algorithms. Any one item in a similar item cluster generated by this step can be taken to represent the cluster, which greatly reduces the amount of calculation and is of great significance for commodity recommendation and commodity search on an electronic commerce platform.

For example, in step S104, part of the samples are extracted to obtain a candidate set of similar products as shown in table 2.

TABLE 2

In table 2, each row has two products corresponding to two different product titles, and the product pairs corresponding to each row are similar product pairs, and there is no relationship between the rows. In order to further mine the relationship between the commodities in each row, in step S105, the present invention clusters the existing candidate set of similar commodities by a clustering algorithm, so as to cluster all the commodities having a similar relationship into a plurality of commodity clusters.

Taking Table 2 as an example, this embodiment uses the connected subgraph algorithm to integrate the similar commodity pairs. Each commodity is a vertex of the graph, and the similar commodity pair in each row is an edge connecting two vertices. The connected subgraph algorithm finds all connected subgraphs in the graph, where a connected subgraph is a maximal set of vertices in which any pair of vertices is connected, either directly or through other vertices. Applying the connected subgraph algorithm to Table 2 yields Table 3.

TABLE 3

The first column of Table 3 lists all the commodities in Table 2, and the second column is the cluster center selected by the connected subgraph algorithm. Because any two vertices in a connected subgraph can reach each other through other vertices, any two commodities in the connected subgraph are treated as having a similar relation. One connected subgraph can therefore be used as one similar commodity cluster, and the center of the connected subgraph selected by the algorithm can be used as the representative commodity of that cluster. As shown in Table 3, the commodity cluster center column gives the representative commodity of each cluster, and any two commodities in a similar commodity cluster are similar to each other. Clustering the third similar item candidate set generated in step S104 in this way yields a number of similar commodity clusters, so that the set of all commodities similar to a specific commodity can be found accurately and quickly.
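The connected subgraph step can be sketched with a union-find (disjoint set) structure. The source does not say how the cluster center is chosen, so the sketch assumes, purely for determinism, that the smallest item identifier in each connected subgraph is the representative. The function name and sample pairs are hypothetical and are not the contents of Table 2.

```python
def cluster_similar_items(pairs):
    """Map every item to the representative of its connected subgraph.

    Assumption: the representative is the smallest identifier in the component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller identifier as root so it stays the representative.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)
    return {item: find(item) for item in list(parent)}


# Hypothetical similar pairs (one pair per row, as in the pair table).
similar_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
clusters = cluster_similar_items(similar_pairs)
```

Note that connectivity treats similarity as transitive: "A" and "C" end up in the same cluster through "B" even though the pair ("A", "C") was never listed.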

As described in the above embodiments, combining rules with models alleviates, to some extent, the accuracy problem caused by using a machine learning model alone. In addition, from an engineering perspective, processes such as commodity attribute repair and large-scale nearest neighbor search greatly improve the efficiency of detecting and identifying similar commodities. The similar commodity clusters finally produced are of great significance for subsequent commodity recommendation and commodity search: they substantially reduce the amount of calculation and improve the user experience.

Fig. 5 is a schematic view of the main modules of an apparatus for similar item aggregation according to an embodiment of the present invention. As shown in fig. 5, the apparatus 500 for similar item aggregation according to the embodiment of the present invention mainly includes an information repair module 501, a first processing module 502, a second processing module 503, a third processing module 504, and a fourth processing module 505.

An information repairing module 501, configured to repair the article attribute information;

the first processing module 502 is configured to generate an article feature vector according to the repaired article attribute information, and perform nearest neighbor search on the article feature vector to obtain a first similar article candidate set;

a second processing module 503, configured to process, based on a preset item key attribute, items in the first similar item candidate set to obtain a second similar item candidate set;

a third processing module 504, configured to select a third similar item candidate set from the second similar item candidate set based on a similar item discrimination model;

a fourth processing module 505, configured to cluster the items in the third similar item candidate set through a clustering algorithm to perform similar item aggregation.

In the embodiment of the invention, the article attribute information can be repaired by using an unsupervised learning method. Specifically, the information repairing module 501 may further be configured to:

acquiring the attribute of the article to be repaired and a candidate word set of each article attribute from the article attribute information;

for each article attribute, respectively calculating the conditional probability of each candidate word of the article attribute under each existing article attribute information based on the existing article attribute information;

for each candidate word, multiplying the conditional probability of the candidate word under each existing article attribute information to obtain a result, and taking the result as the score of the candidate word;

and taking the candidate word with the highest score as the final value of the article attribute so as to repair the article attribute.
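The scoring scheme described above can be sketched as follows: for a missing attribute, each candidate word is scored by the product of its conditional probabilities given each attribute value the item already has, and the highest-scoring candidate is chosen. The probabilities are estimated here by counting over a catalog of items with complete attributes, with add-one smoothing so that an unseen combination does not force the score to zero; both the estimation method and the smoothing are assumptions, as are the function names and data.

```python
def conditional_probability(catalog, target, candidate, known_attr, known_value,
                            num_candidates, alpha=1.0):
    """Estimate P(target = candidate | known_attr = known_value) by counting."""
    matching = [item for item in catalog
                if item.get(known_attr) == known_value and target in item]
    hits = sum(1 for item in matching if item[target] == candidate)
    # alpha is an assumed smoothing constant, not taken from the source.
    return (hits + alpha) / (len(matching) + alpha * num_candidates)


def repair_attribute(item, target, candidates, catalog, alpha=1.0):
    """Score each candidate word and return (best candidate, all scores)."""
    scores = {}
    for cand in candidates:
        score = 1.0
        for known_attr, known_value in item.items():
            if known_attr == target:
                continue
            score *= conditional_probability(
                catalog, target, cand, known_attr, known_value,
                len(candidates), alpha)
        scores[cand] = score
    best = max(candidates, key=lambda c: scores[c])
    return best, scores


# Hypothetical catalog and an item whose "season" attribute is missing.
catalog = [
    {"category": "dress", "sleeve": "long", "season": "winter"},
    {"category": "dress", "sleeve": "long", "season": "winter"},
    {"category": "dress", "sleeve": "short", "season": "summer"},
]
item = {"category": "dress", "sleeve": "long"}
best, scores = repair_attribute(item, "season", ["winter", "summer"], catalog)
```

With many existing attributes the product of probabilities becomes very small, so a sum of logarithms could be used instead without changing which candidate ranks highest.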

According to an embodiment of the present invention, the first processing module 502 may be further configured to:

calculating the vector distance between the article pairs through a nearest neighbor search technology based on the article feature vectors;

and obtaining similar item pairs by limiting a threshold value so as to obtain a first similar item candidate set.

According to another embodiment of the present invention, the second processing module 503 may be further configured to:

and based on preset item key attributes, screening the first similar item candidate set by calculating whether the values of the item key attributes of each similar item pair in the first similar item candidate set are matched to obtain a second similar item candidate set.

According to another embodiment of the present invention, the similar article discrimination model can be obtained by training:

selecting and labeling samples from the second similar item candidate set;

extracting the characteristics of similar article pairs of the marked samples;

and training a similar article distinguishing model by performing machine learning on the extracted features.

According to the technical scheme of the embodiment of the invention, the item attribute information is repaired; an item feature vector is then generated from the repaired item attribute information, and nearest neighbor search is performed on the item feature vectors to obtain a first similar item candidate set; the items in the first similar item candidate set are processed based on preset item key attributes to obtain a second similar item candidate set; a third similar item candidate set is selected from the second similar item candidate set based on the similar item discrimination model; and the items in the third similar item candidate set are clustered through a clustering algorithm to perform similar item aggregation. Repairing the item attribute information solves the quality problem of the item information. Determining similar items by nearest neighbor search on the item feature vectors greatly reduces the amount of calculation, which saves computing resources and ensures the timeliness of the algorithm. Obtaining the final similar item set by nearest neighbor search on the features, then filtering based on item key attributes, and then screening based on a machine learning model solves the problem of low detection and identification precision. In addition, the similar item sets are integrated into similar item clusters through a clustering algorithm, and the item at the center of each cluster is taken as the representative of the cluster, which greatly reduces the number of items and greatly improves the efficiency of similar item retrieval.

Fig. 6 illustrates an exemplary system architecture 600 of a method of similar item aggregation or an apparatus of similar item aggregation to which embodiments of the present invention may be applied.

As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wired links, wireless communication links, or fiber optic cables, to name a few.

A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 605 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, target push information or product information, by way of example only) to the terminal device.

It should be noted that the method for similar goods aggregation provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus for similar goods aggregation is generally disposed in the server 605.

It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 7, a block diagram of a computer system 700 suitable for use with a terminal device or server implementing an embodiment of the invention is shown. The terminal device or the server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units or modules described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware. The described units or modules may also be provided in a processor, which may be described as: a processor includes an information repair module, a first processing module, a second processing module, a third processing module, and a fourth processing module. The names of the units or modules do not in some cases constitute a limitation on the units or modules themselves; for example, the information repair module may also be described as a "module for repairing item attribute information".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: repairing the item attribute information; generating an item feature vector according to the repaired item attribute information, and performing nearest neighbor search on the item feature vector to obtain a first similar item candidate set; processing the items in the first similar item candidate set based on a preset item key attribute to obtain a second similar item candidate set; selecting a third similar item candidate set from the second similar item candidate set based on a similar item discrimination model; and clustering the items in the third similar item candidate set through a clustering algorithm to perform similar item aggregation.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for aggregating similar items, comprising:

restoring the article attribute information;

generating an article characteristic vector according to the repaired article attribute information, and performing nearest neighbor search on the article characteristic vector to obtain a first similar article candidate set;

processing the articles in the first similar article candidate set based on a preset article key attribute to obtain a second similar article candidate set;

selecting a third similar object candidate set from the second similar object candidate set based on a similar object discrimination model;

clustering the items in the third similar item candidate set through a clustering algorithm to perform similar item aggregation.

2. The method of claim 1, wherein the item attribute information is repaired using an unsupervised learning method.

3. The method of claim 1 or 2, wherein repairing the item attribute information comprises:

4. The method of claim 1, wherein obtaining a first candidate set of similar items by performing a nearest neighbor search on the item feature vector comprises:

5. The method according to claim 1 or 4, wherein processing the items in the first candidate set of similar items to obtain a second candidate set of similar items based on a preset item key attribute comprises:

6. The method of claim 1, wherein the similar item discrimination model is trained by:

selecting and labeling samples from the second similar item candidate set;

extracting the characteristics of similar article pairs of the marked samples;

7. An apparatus for aggregating similar items, comprising:

the information restoration module is used for restoring the article attribute information;

the first processing module is used for generating an article characteristic vector according to the repaired article attribute information and obtaining a first similar article candidate set by carrying out nearest neighbor search on the article characteristic vector;

the second processing module is used for processing the articles in the first similar article candidate set to obtain a second similar article candidate set based on preset article key attributes;

the third processing module is used for selecting a third similar object candidate set from the second similar object candidate set based on a similar object discrimination model;

and the fourth processing module is used for clustering the articles in the third similar article candidate set through a clustering algorithm so as to carry out similar article aggregation.

8. The apparatus of claim 7, wherein the information restoration module is further configured to:

9. The apparatus of claim 7, wherein the first processing module is further configured to:

10. The apparatus of claim 7, wherein the second processing module is further configured to:

11. An electronic device for product aggregation, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.

12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.