CN112699672B - Method and device for selecting articles

Info

Publication number: CN112699672B
Application number: CN201911013314.4A
Authority: CN
Inventors: 岳俊杰
Original assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2019-10-23
Filing date: 2019-10-23
Publication date: 2024-04-05
Anticipated expiration: 2039-10-23
Also published as: CN112699672A

Abstract

The invention discloses a method and a device for selecting an article, and relates to the technical field of computers. One embodiment of the method comprises the following steps: obtaining a first attribute name set and attribute values of the first attribute name set according to the first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to the second text; determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set; and if the similarity is larger than a first preset value, selecting a target object from the objects described in the first text and the objects described in the second text according to the user attention attribute. The embodiment improves the user experience.

Description

Method and device for selecting articles

Technical Field

The present invention relates to the field of computer technology, and in particular, to a method and apparatus for selecting an article.

Background

Currently, the process of selecting an item includes: determining, based on a neural network method, whether the articles described by different texts are similar, and, once the articles are determined to be similar, selecting a target article from among them.

In the process of implementing the present invention, the inventor finds that at least the following problems exist in the prior art:

The neural network-based method embeds high-dimensional information into a low-dimensional space and represents it with lower-dimensional vectors. No clear theory currently guarantees that the original high-dimensional information is preserved after embedding, or explains how details of that information, such as the attribute values of articles, can still be distinguished. Therefore, with the neural network-based method, the accuracy of determining whether articles are similar is low, which makes the selection of the target article less meaningful: even if a target article is selected, it may not meet the user's requirements, and the user experience is poor.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a method and an apparatus for selecting an article, which can improve user experience.

To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a method of selecting an article.

The method for selecting the object comprises the following steps:

obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text;

determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set;

and if the similarity is larger than a first preset value, selecting a target object from the objects described in the first text and the objects described in the second text according to the user attention attribute.

In one embodiment, the category to which the first text belongs is the same as the category to which the second text belongs;

before obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, the method comprises the following steps:

creating an item attribute library of the category;

obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text, wherein the method comprises the following steps:

and obtaining a first attribute name set and attribute values of the first attribute name set according to a first text by adopting the item attribute library of the category, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text.

In one embodiment, creating the item attribute library of the category includes:

acquiring a target attribute name set of the category, and carrying out normalization processing on the target attribute name set to obtain a key attribute name set;

acquiring attribute values of the key attribute name set according to the key attribute name set;

and obtaining the item attribute library of the category according to the key attribute name set and the attribute value of the key attribute name set.

In one embodiment, using the item attribute library of the category, obtaining a first attribute name set and attribute values of the first attribute name set according to a first text includes:

selecting a first attribute name set from the key attribute name sets according to a first text;

acquiring undetermined attribute values of the first attribute name set from the first text according to the first attribute name set and the attribute values of the key attribute name set;

and performing stop word removal, representation mode unification and attribute value splitting treatment on the undetermined attribute values of the first attribute name set to obtain the attribute values of the first attribute name set.

In one embodiment, obtaining the target attribute name set of the category includes:

Acquiring a plurality of attribute names of the category;

for each attribute name, determining the number of times that the attribute corresponding to the attribute name appears in the titles of all texts under the category;

and obtaining the target attribute name set of the category according to the attribute names corresponding to the attributes with the occurrence times larger than the second preset value.

In one embodiment, determining the similarity of the first text-described item and the second text-described item based on the first set of attribute names, the attribute values of the first set of attribute names, the second set of attribute names, and the attribute values of the second set of attribute names comprises:

determining an intersection of the first set of attribute names and the second set of attribute names, the intersection comprising at least one identical attribute name;

for each identical attribute name, respectively obtaining attribute values of the identical attribute names from attribute values of the first attribute name set and attribute values of the second attribute name set, and carrying out similarity calculation on the obtained attribute values to obtain the similarity of the identical attribute names;

and fusing the obtained similarity, wherein the fusion result is used as the similarity of the first text-described object and the second text-described object.

In one embodiment, performing similarity calculation on the obtained attribute values to obtain the similarity of the same attribute names includes:

obtaining a plurality of positive examples of the category; the positive examples comprise a plurality of articles, and the attribute value of each article is the same;

recombining the articles included in the positive examples to obtain a plurality of negative examples, deleting the negative examples with mutually exclusive relation of the article attribute values from the plurality of negative examples, and obtaining a plurality of negative examples of the category;

training an edit distance algorithm by adopting a plurality of positive examples of the category and a plurality of negative examples of the category to obtain the edit distance algorithm of the category;

and adopting an edit distance algorithm of the category to calculate the similarity of the acquired attribute values, and obtaining the similarity of the same attribute names.

To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided an apparatus for selecting an article.

The device for selecting the articles comprises:

the acquiring unit is used for acquiring a first attribute name set and attribute values of the first attribute name set according to a first text, and acquiring a second attribute name set and attribute values of the second attribute name set according to a second text;

the processing unit is used for determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set;

and the selection unit is used for selecting a target object from the objects described in the first text and the objects described in the second text according to the attention attribute of the user if the similarity is larger than a first preset value.

the acquisition unit is used for:

before a first attribute name set and attribute values of the first attribute name set are obtained according to a first text, creating an item attribute library of the category;

In one embodiment, the acquisition unit is configured to:

acquiring a plurality of attribute names of the category;

In one embodiment, the processing unit is configured to:

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.

An electronic device according to an embodiment of the present invention includes: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the method for selecting the article provided by the embodiment of the invention.

To achieve the above object, according to still another aspect of an embodiment of the present invention, a computer-readable medium is provided.

A computer readable medium of an embodiment of the present invention has stored thereon a computer program which, when executed by a processor, implements a method for selecting an item provided by the embodiment of the present invention.

One embodiment of the above invention has the following advantages or benefits: obtaining a first attribute name set and attribute values of the first attribute name set according to the first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to the second text; determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set; and if the similarity is larger than a first preset value, selecting a target object from the objects described in the first text and the objects described in the second text according to the user attention attribute. By determining whether the attribute values of the articles are similar or not, whether the articles described by different texts are similar or not is determined, accuracy of determining whether the articles are similar or not is improved, the selected target articles are more consistent with the requirements of users, and user experience is improved.

Further effects of the above non-conventional optional implementations are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic illustration of the main flow of a method of selecting an item according to an embodiment of the present invention;

FIG. 2 is one application scenario of a method of selecting an item according to an embodiment of the present invention;

FIG. 3 is an example of attributes in a method of selecting an item according to an embodiment of the present invention;

FIG. 4 is an example of processing a first text in a method of selecting an item according to an embodiment of the present invention;

FIG. 5 is an example of one verification of a method of selecting an item according to an embodiment of the present invention;

FIG. 6 is an example of another verification of a method of selecting an item according to an embodiment of the invention;

FIG. 7 is a schematic view of the main units of an apparatus for selecting an item according to an embodiment of the present invention;

FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 9 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It is noted that embodiments of the invention and features of the embodiments may be combined with each other without conflict.

In the age of information explosion, people want to obtain, from massive amounts of information, content that closely matches their own needs. To meet this demand, various applications have emerged, such as search engines, automatic question-answering systems, document classification and clustering, document searching, and accurate document pushing, and one of the key technologies in these application scenarios is text similarity calculation. Currently, text similarity calculation methods fall into 4 major categories:

1 String-based method

This method starts from the degree of string matching and uses the co-occurrence and repetition of strings as the measure of similarity. Depending on the granularity of computation, the methods can be classified into character-based methods and term-based methods. One class of methods considers only the composition of characters or words, such as edit distance, Hamming distance, cosine similarity, the Dice coefficient, and Euclidean distance; another class also takes character order into account, i.e., strings are required to have both the same character composition and the same character order to be similar, such as longest common substring (Longest Common Substring, LCS) and Jaro-Winkler; yet another class uses the idea of sets, treating a string as a set of words and measuring word co-occurrence through set intersection, such as N-gram, Jaccard, and the overlap coefficient.
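As an illustration of two of the string-based measures named above, the following is a minimal Python sketch of edit distance (character composition) and the Jaccard coefficient (set of words). The function names and the whitespace tokenization are assumptions made for this example and are not taken from the invention.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over whitespace-separated words (assumed tokenization)."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)
```

Both measures look only at the literal text, which is why they cannot recognize synonyms that are written differently.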

2 Corpus-based method

The corpus-based method calculates text similarity using information obtained from a corpus. Corpus-based methods can be divided into bag-of-words-based methods, neural-network-based methods, and search-engine-based methods. The first two take the set of documents to be compared as the corpus, and the last takes the Web as the corpus.

a Bag-of-words-based method

The bag of words model (Bag of Words Model, BOW) is based on the distributional hypothesis, which holds that words occurring in similar contexts have similar semantics. Its basic idea is to represent a document as a combination of words, regardless of the order in which the words appear in the document. Depending on how much semantics is taken into account, bag-of-words-based methods mainly include the vector space model (Vector Space Model, VSM), latent semantic analysis (Latent Semantic Analysis, LSA), probabilistic latent semantic analysis (Probabilistic Latent Semantic Analysis, PLSA), and latent Dirichlet allocation (Latent Dirichlet Allocation, LDA).

b Neural-network-based method

Computing text similarity from word vectors (Word Vector) generated by neural network models has been studied extensively in recent years in the field of natural language processing. Many models and tools for generating word vectors have been proposed, such as Word2Vec and GloVe. A word vector is essentially a low-dimensional real-valued vector trained from unlabeled, unstructured text; this representation places related words closer together and largely avoids the curse of dimensionality and the lack of semantics that bag-of-words models suffer from because they treat words as independent.

c Search-engine-based method

Since Cilibrasi et al. proposed the normalized Google distance (Normalized Google Distance, NGD), search-engine-based methods of computing semantic similarity have become popular. The basic principle is that, given search keywords x and y, the search engine returns the number of web pages f(x) containing x, the number f(y) containing y, and the number f(x, y) containing both x and y, from which the normalized Google distance is calculated.
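For reference, the commonly published NGD definition can be sketched in Python as follows. The parameter n (the total number of pages indexed by the search engine) belongs to that standard definition and is not mentioned in the text above; the function name is an assumption made for this example.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """NGD from page counts: fx and fy are hits for x and y, fxy is hits for both.

    n is the total number of indexed pages (an input the text above does not name).
    All counts are assumed to be positive so that the logarithms are defined.
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

A distance of 0 means the two keywords always occur together; larger values mean they co-occur less often relative to how common each keyword is.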

3 World-knowledge-based (knowledge-based) method

The world-knowledge-based method calculates text similarity using a knowledge base with a well-defined organizational system, and is generally divided into two types: ontology-based and network-knowledge-based. The former generally uses the hypernym-hyponym and coordinate relations between concepts in the ontology structure; if two concepts are semantically similar, there is one and only one path between them. The entries in network knowledge are structured, and the hypernym-hyponym relations among entries are exposed as hyperlinks, so this form of information organization is easier for a computer to process. The paths between concepts or the links between entries become the basis for text similarity calculation.

4 Other methods

Other methods include syntactic analysis and hybrid methods. Syntactic analysis analyzes the syntactic structure of sentences; it is a kind of semantic analysis, but because it does not depend on a corpus or on world knowledge, it is classified under other methods. A hybrid method is a combination of several of the above methods.

Determining whether the items described in different texts are similar covers both SKU-level similarity and SPU-level similarity; the present invention is primarily directed to SKU-level scenarios. Applying the above text similarity calculation methods to this scenario raises the following problems:

1. Main problems of the string-based method:

The string-based method compares text at the literal level, and the text is represented by the original text itself. The method is simple in principle and easy to implement, and it serves as the computational basis of other methods. Its disadvantage is that characters or words are treated as independent knowledge units, without considering the meaning of the words or the relations between them. Take synonyms as an example: although expressed differently, they have the same meaning, and the similarity of such words cannot be accurately calculated by string-based methods. Thus, the user experience is poor.

2. Difficulties of the corpus-based method:

1) The VSM-based method is simple in principle but has two obvious defects. First, it calculates similarity based on the feature items in a text, and when there are many feature items, the resulting high-dimensional sparse matrix makes the computation inefficient. Second, the vector space model assumes that the feature items extracted from a text are independent of each other, which does not match how text expresses semantics. Thus, the accuracy of determining whether the items are similar is low, and the user experience is poor.

2) The neural network-based approach has been described in the background art and will not be described in detail herein.

3) The greatest disadvantage of the search-engine-based similarity calculation method is that the result depends entirely on the query quality of the search engine, so the similarity varies from one search engine to another. Moreover, general-purpose search engines currently do not support searching for one item and returning every merchant's descriptive information about that same item. Thus, this method is not suitable for the above scenario.

3. Drawbacks of the world-knowledge-based method:

Text similarity calculation methods based on network knowledge mostly use page links or hierarchical structures, and can reflect the semantic relations among entries well. However, they have shortcomings: the completeness of information varies greatly from entry to entry, so calculation accuracy cannot be guaranteed; and network knowledge is produced through mass participation, so the text lacks a degree of expertise. Thus, this method is not suitable for the above scenario.

4. Major drawbacks of the other methods:

The key to syntactic analysis is to find the dependency or semantic relations among the parts of a sentence and to consider both word similarity and relation similarity when calculating similarity, so the method captures richer semantics. However, because sentences are complex, frame analysis brings considerable difficulty and workload; current research mainly seeks improvement in two respects: extracting keywords effectively and selecting appropriate semantic frames. A significant difference between item description text and conventional text is that an item description is relatively structured information without a strict grammatical structure. For example, attribute values are arranged almost arbitrarily: exchanging their order does not change the semantics, but it greatly affects the matching result. Thus, this method is not suitable for the above scenario.

In order to solve the problems existing in the prior art, an embodiment of the present invention provides a method for selecting an article, as shown in fig. 1, the method includes:

Step S101, obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text.

Specific embodiments of this step are described in detail below and are not described in detail herein.

Step S102, determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set.

Step S103, if the similarity is greater than a first preset value, selecting a target item from the items described in the first text and the items described in the second text according to the user attention attribute.

In this embodiment, the first preset value may be set as required, for example, 0.8. The user attention attribute may be value, appearance, taste, performance, quality, or the like. The selection of the target article is illustrated with a specific example: of the article described by the first text and the article described by the second text, the one with the lower value is selected as the target article.

It should be appreciated that the first text and the second text are different, but are both text, which may be item detail pages. The article may be a computer, a mobile phone, clothing or food, etc. If the item is a food, the attribute names may be food type, delivery time, service attitude level, food taste, etc.
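The threshold check and selection of step S103 can be sketched as follows. This is a minimal Python illustration, not the claimed implementation: the function name, the dictionary representation of an article, and the rule that the lower value of the attention attribute wins (following the low-value example above) are all assumptions.

```python
from typing import Optional

FIRST_PRESET_VALUE = 0.8  # example threshold given in the text


def select_target(item_a: dict, item_b: dict, similarity: float,
                  attention_attribute: str = "value",
                  threshold: float = FIRST_PRESET_VALUE) -> Optional[dict]:
    """Return the target article, or None when the articles are not similar enough.

    Assumption: a lower value of the attention attribute is preferred.
    """
    if similarity <= threshold:
        return None
    return min((item_a, item_b), key=lambda item: item[attention_attribute])
```

For an attention attribute where higher is better, such as quality, the comparison would need to be reversed.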

In the embodiment of the invention, the category to which the first text belongs is the same as the category to which the second text belongs;

prior to step S101, it includes:

creating an item attribute library of the category;

step S101 may include:

In this embodiment, an item attribute library is created and then used to obtain the first attribute name set, the attribute values of the first attribute name set, the second attribute name set, and the attribute values of the second attribute name set. Whether the attribute values of the articles are similar is thereby determined, and from this whether the articles described by different texts are similar, which improves the accuracy of the similarity determination, makes the selected target article better match the user's requirements, and further improves the user experience.

In an embodiment of the present invention, as shown in fig. 2, creating an item attribute library of the category includes:

In this embodiment, normalizing the target attribute name set includes: if the coincidence proportion of the attribute values of any two target attribute names in the target attribute name set is larger than the preset proportion, the two target attribute names are recognized as the same attribute name, and one target attribute name in the two target attribute names is deleted. The preset ratio may be set according to the need, for example, 0.6.

The number of the target attribute names in the target attribute name set is generally 5, and the target attribute names can be adjusted at any time according to requirements.
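The normalization rule described above (merging two attribute names whose attribute values overlap by more than the preset ratio) can be sketched as follows. The text does not define how the overlap ratio is computed; this sketch assumes it is the size of the intersection divided by the size of the smaller value set, and the function name is hypothetical.

```python
from typing import Dict, List, Set


def normalize_attribute_names(values_by_name: Dict[str, Set[str]],
                              preset_ratio: float = 0.6) -> List[str]:
    """Keep one name out of every pair whose value sets overlap too much.

    Assumption: overlap ratio = |A & B| / min(|A|, |B|).
    """
    kept: List[str] = []
    for name, values in values_by_name.items():
        duplicate = False
        for kept_name in kept:
            other = values_by_name[kept_name]
            smaller = min(len(values), len(other))
            if smaller and len(values & other) / smaller > preset_ratio:
                duplicate = True  # treated as the same attribute name
                break
        if not duplicate:
            kept.append(name)
    return kept
```

In this sketch the name encountered first is the one that is kept; the text does not say which of the two names should be deleted.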

As shown in fig. 3, the set of key attribute names includes color, version, and type. The attribute values of the set of key attribute names include attribute values of colors, attribute values of versions, and attribute values of types. Attribute values of colors include blue, black, white, yellow, coral, and red; the attribute values of the version comprise a memory 64GB, a memory 128GB and a memory 256GB; the attribute values of the types include public edition, mobile 4G shared edition and annual payment repair carefree.

It should be understood that the attribute values of the key attribute name set may be obtained from all the text under the category, or may be obtained by searching on a search engine.

In addition, the item attribute library for each category can be created in the manner provided by the embodiment of the invention.

In the embodiment of the present invention, as shown in fig. 4, using the item attribute library of the category, obtaining a first attribute name set and attribute values of the first attribute name set according to a first text includes:

In this embodiment, when implemented, if the key attribute name exists in the first text, the key attribute name is used as the first attribute name.

As shown in fig. 2 and 4, the item attribute library of the category may include a stop word library, a replacement rule library, a hypernym-hyponym relation library, a synonym library, a mutual exclusion library, and the like.

The stop word library is used for removing stop words; the stop words include words such as "other" and "high-grade".

The hypernym-hyponym relation library and the synonym library are used for unifying the representation. The synonym library unifies synonymous expressions into a single attribute value, for example "polyester fiber". To ensure the effect of the embodiment of the invention, the synonym library should cover attribute values as completely as possible. The hypernym-hyponym relation library unifies attribute values to the lowest level of the expression system. For example, for the attribute air-conditioner type, the attribute value "home appliance" is unified as "air conditioner".

The replacement rule library is used for attribute value splitting. For example, for the attribute mobile phone network type, the attribute value "all-network 4G" is split into "Mobile 4G", "Unicom 4G", and "Telecom 4G".

The mutual exclusion library is used for deleting negative examples in which the article attribute values are mutually exclusive. For example, the mutual exclusion library records a first item and a second item whose attribute values are mutually exclusive; if a negative example includes both the first item and the second item, that negative example is deleted directly.
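The construction of negative examples and their filtering against the mutual exclusion library can be sketched as follows. This is only an illustration under stated assumptions: each positive example is modeled as a list of item identifiers, a negative example is modeled as a pair of items drawn from two different positive examples, and the mutual exclusion library is modeled as a set of item pairs. None of these representations is specified by the text.

```python
from itertools import combinations, product
from typing import FrozenSet, List, Set, Tuple


def build_negative_examples(positives: List[List[str]],
                            mutex: Set[FrozenSet[str]]) -> List[Tuple[str, str]]:
    """Recombine items across positive examples, dropping mutually exclusive pairs."""
    negatives: List[Tuple[str, str]] = []
    for group_a, group_b in combinations(positives, 2):
        for item_a, item_b in product(group_a, group_b):
            # A pair recorded in the mutual exclusion library is deleted directly.
            if frozenset((item_a, item_b)) not in mutex:
                negatives.append((item_a, item_b))
    return negatives
```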

It should be noted that the item attribute library of the category reduces the complexity of determining whether the articles described by different texts are similar.

In this embodiment, a first attribute name set is selected from the key attribute name set according to the first text, undetermined attribute values of the first attribute name set are obtained from the first text, and stop word removal, representation unification, and attribute value splitting are performed on the undetermined attribute values to obtain the attribute values of the first attribute name set. Whether the attribute values of the articles are similar is thereby determined, and from this whether the articles described by different texts are similar, which improves the accuracy of the similarity determination, makes the selected target article better match the user's requirements, and further improves the user experience. In addition, because the representation is unified, the poor user experience of selecting articles based on the string-based method is avoided.
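The three processing steps applied to an undetermined attribute value (stop word removal, representation unification, attribute value splitting) can be sketched as follows. The library contents are small illustrative stand-ins modeled on the examples in the text; the synonym entry mapping "polyester" to "polyester fiber" is hypothetical, as are the function and variable names.

```python
from typing import Dict, List

STOP_WORDS = {"other", "high-grade"}  # examples from the stop word library
SYNONYMS: Dict[str, str] = {"polyester": "polyester fiber"}  # hypothetical entry
REPLACEMENT_RULES: Dict[str, List[str]] = {
    "all-network 4G": ["Mobile 4G", "Unicom 4G", "Telecom 4G"],
}


def clean_attribute_value(pending: str) -> List[str]:
    """Turn one undetermined attribute value into final attribute values."""
    # 1. Stop word removal
    words = [word for word in pending.split() if word not in STOP_WORDS]
    value = " ".join(words)
    # 2. Representation unification (synonym lookup only, in this sketch)
    value = SYNONYMS.get(value, value)
    # 3. Attribute value splitting via replacement rules
    return REPLACEMENT_RULES.get(value, [value])
```

A fuller version would also consult the hypernym-hyponym relation library in step 2; it is omitted here to keep the sketch short.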

In the embodiment of the invention, the article attribute library of the category is adopted, and a second attribute name set and attribute values of the second attribute name set are obtained according to a second text, including:

selecting a second attribute name set from the key attribute name sets according to a second text;

acquiring undetermined attribute values of the second attribute name set from the second text according to the second attribute name set and the attribute values of the key attribute name set;

and performing stop word removal, representation mode unification and attribute value splitting treatment on the undetermined attribute values of the second attribute name set to obtain the attribute values of the second attribute name set.

It should be understood that the attribute values of the second attribute name set are obtained in a manner similar to that of the first attribute name set; for the specific implementation, refer to the previous embodiment, which is not repeated here.

In the embodiment of the invention, the obtaining the target attribute name set of the category comprises the following steps:

acquiring a plurality of attribute names of the category;

In this embodiment, the second preset value is set as required, for example, 50. If necessary, the obtained target attribute name set of the category is rechecked manually, and the key attribute name set is obtained from the rechecked target attribute name set of the category.

In this embodiment, the target attribute name set of the category is obtained through the above process. Because the target attribute name set contains only a small number of target attribute names and these names are related to one another, the problem of poor user experience when articles are selected based on the vector space model is solved.

In the embodiment of the present invention, step S102 may include:

This embodiment is described below with a specific example:

The first set of attribute names includes attribute name one, attribute name two, and attribute name three.

The second set of attribute names includes attribute name one, attribute name two, and attribute name four.

Thus, the intersection includes: attribute name one and attribute name two.

For attribute name one, the attribute value of attribute name one, namely attribute value one, is acquired from the attribute values of the first attribute name set; and the attribute value of attribute name one, namely attribute value two, is acquired from the attribute values of the second attribute name set.

For attribute name two, the attribute value of attribute name two, namely attribute value three, is acquired from the attribute values of the first attribute name set; and the attribute value of attribute name two, namely attribute value four, is acquired from the attribute values of the second attribute name set.

Similarity calculation is performed on attribute value one and attribute value two to obtain the similarity of attribute name one;

similarity calculation is performed on attribute value three and attribute value four to obtain the similarity of attribute name two;

and the similarity of attribute name one and the similarity of attribute name two are fused, and the fusion result is used as the similarity of the article described by the first text and the article described by the second text.
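The example above can be sketched in code as follows. The sketch makes two assumptions that are not part of the patent: the per-attribute similarity uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the trained edit distance algorithm, and the fusion is a plain arithmetic mean used only as a placeholder. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def value_similarity(a: str, b: str) -> float:
    # Stand-in string similarity in [0, 1]; not the patent's trained algorithm.
    return SequenceMatcher(None, a, b).ratio()


def item_similarity(first: dict[str, str], second: dict[str, str]) -> float:
    """Similarity of two items given as attribute-name -> attribute-value maps."""
    # Intersection of the two attribute name sets.
    shared = sorted(first.keys() & second.keys())
    if not shared:
        return 0.0
    # One similarity per identical attribute name.
    per_name = [value_similarity(first[name], second[name]) for name in shared]
    # Placeholder fusion: arithmetic mean (an assumption for illustration).
    return sum(per_name) / len(per_name)
```

Attribute names that appear in only one of the two sets (attribute name three and attribute name four in the example) do not contribute to the result.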

In a specific implementation, a logistic regression classification model or a vector space model may be used to fuse the obtained similarities.
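A minimal sketch of fusion with a logistic regression model is given below. It assumes that one weight per identical attribute name and a bias have already been learned from labeled examples; the function name `fuse_logistic` and the numeric weights in the usage note are hypothetical and are not values from the patent.

```python
import math


def fuse_logistic(similarities: list[float], weights: list[float], bias: float) -> float:
    """Fuse per-attribute similarities into one score in (0, 1)."""
    if len(similarities) != len(weights):
        raise ValueError("one weight is needed per attribute similarity")
    # Weighted sum of the per-attribute similarities plus the bias term.
    z = bias + sum(w * s for w, s in zip(weights, similarities))
    # Logistic (sigmoid) function maps the sum to a probability-like score.
    return 1.0 / (1.0 + math.exp(-z))
```

For instance, with assumed weights [4.0, 4.0] and bias -4.0, two fully matching attributes give a score above 0.98, while two fully mismatching attributes give a score below 0.02.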

When the logistic regression classification model is used for fusion, the accuracy and recall rate of the similarity are as shown in fig. 5; when the vector space model is used for fusion, they are as shown in fig. 6. In figs. 5 and 6, the abscissa represents the similarity and the ordinate represents the accuracy and recall rate. As can be seen from figs. 5 and 6, the accuracy and recall rate of the embodiment of the invention are good; in addition, fusion with the vector space model gives the better effect.

In this embodiment, similarity of the same attribute name is obtained by performing similarity calculation on the attribute values, the obtained similarity is fused, and the fusion result is used as the similarity of the first text-described article and the second text-described article. Therefore, whether the attribute values of the articles are similar or not is determined, whether the articles described by different texts are similar or not is further determined, accuracy of determining whether the articles are similar or not is improved, the selected target articles are more consistent with the requirements of users, and user experience is further improved.

In the embodiment of the invention, similarity calculation is performed on the acquired attribute values to obtain the similarity of the same attribute names, and the method comprises the following steps:

In this embodiment, a plurality of positive examples of the category are labeled manually, and the server to which the embodiment of the present invention is applied obtains the plurality of positive examples of the category through manual input.

Several positive examples of the category are described below with a specific example:

(A_11, A_12, …, B_11, B_12, …);

(A_21, A_22, …, B_21, B_22, …);

…;

(A_m1, A_m2, …, B_m1, B_m2, …).

Each pair of brackets represents one positive example, and all the articles within a positive example are the same article. A represents the JD (Jingdong) platform, B represents other platforms, the first subscript represents the article number, and the second subscript represents the merchant to which the article belongs. It should be noted that the number of articles in different positive examples may be the same or different.

The articles included in the positive examples are cross-recombined. For example, A_11 in (A_11, A_12, …, B_11, B_12, …) and A_21 in (A_21, A_22, …, B_21, B_22, …) are interchanged, resulting in two negative examples. In addition, the crossover may be a random crossover.
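The interchange described above can be sketched as follows, assuming each positive example is stored as a list of article identifiers. The function name `cross_recombine` and the index parameters are hypothetical; for a random crossover the indices could be drawn with `random.randrange`.

```python
def cross_recombine(
    pos_a: list[str], pos_b: list[str], i: int = 0, j: int = 0
) -> tuple[list[str], list[str]]:
    """Swap article i of one positive example with article j of another.

    Each returned list mixes two different articles, so it is a negative example.
    The input lists are left unchanged.
    """
    neg_a = pos_a.copy()
    neg_b = pos_b.copy()
    neg_a[i], neg_b[j] = pos_b[j], pos_a[i]
    return neg_a, neg_b
```

Swapping a single article is enough to break the "all articles are the same" property of both positive examples, which is why one interchange yields two negative examples.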

The deletion of negative examples in which the item attribute values are mutually exclusive is described below with specific examples:

Example 1: a negative example includes articles of different brands. The negative example is deleted by using the mutex library of the category.

Example 2: a negative example includes 3 articles, wherein the target population of 1 article is female and the target population of 2 articles is male. The negative example is deleted by using the mutex library of the category.
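A minimal sketch of this filtering step is given below. It assumes that the mutex library can be represented as a set of attribute names whose differing values are mutually exclusive (such as brand or target population), and that each negative example is a list of attribute-name to attribute-value maps. This representation and the function names are assumptions for illustration only.

```python
def has_mutex_conflict(example: list[dict[str, str]], mutex_attributes: set[str]) -> bool:
    """True if the articles in one example disagree on a mutually exclusive attribute."""
    for name in mutex_attributes:
        values = {article[name] for article in example if name in article}
        if len(values) > 1:
            return True
    return False


def filter_negatives(
    negatives: list[list[dict[str, str]]], mutex_attributes: set[str]
) -> list[list[dict[str, str]]]:
    """Delete negative examples whose article attribute values are mutually exclusive."""
    return [ex for ex in negatives if not has_mutex_conflict(ex, mutex_attributes)]
```

Negative examples that differ only in attributes outside the assumed mutex set (for example, color) are kept.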

In addition, in this embodiment, the classification accuracy of the positive example and the negative example can be improved. For example, the accuracy of the positive example is 33.07%, the accuracy of the negative example is 94.35%, and the accuracy of the positive example is 93.85% and the accuracy of the negative example is 87.69%.

It should be noted that the edit distance algorithm may be replaced by term frequency-inverse document frequency (TF-IDF for short). The reduce framework (an open source framework for entity matching) provides two text similarity methods: one is the edit distance algorithm, and the other is TF-IDF.
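For reference, a plain Levenshtein edit distance and a similarity derived from it can be sketched as below. This is the standard, unweighted form of the algorithm; the patent's category-specific edit distance algorithm is obtained by training on positive and negative examples, which this sketch does not attempt to reproduce. The normalization by the longer string length is an assumption made here to map the distance into [0, 1].

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the distance to a similarity in [0, 1]; identical strings give 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```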

It should be appreciated that determining the similarity of the article described by the first text and the article described by the second text is essentially a problem of calculating the similarity of two text strings.

Because the attribute names and attribute values of an article may be synonymous, missing or ambiguous, and an attribute may even have a plurality of attribute values, the labeling cost of positive examples is high.

The actual number of positive examples, the predicted number of positive examples, the actual number of negative examples, and the predicted number of negative examples are shown in table 1:

TABLE 1 quantitative relationship Table

The accuracy of the positive examples was 0.98% (0.98% ≈ 9,900/(9,900 + 1,000,000)) and the recall rate of the positive examples was 99% (99% = 9,900/10,000), while the accuracy of the negative examples was 99%.
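The arithmetic behind these figures can be checked with a short calculation. Table 1 is not reproduced in this text, so the counts below are assumptions chosen so that the quoted 0.98% and 99% figures are reproduced: 9,900 correctly predicted positive examples out of 10,000 actual positive examples, and about 1,000,000 negative examples wrongly predicted as positive. Here the "accuracy" of the positive examples is computed as precision, that is, the share of predicted positive examples that are actually positive.

```python
def precision(true_positive: int, false_positive: int) -> float:
    # Share of predicted positives that are actually positive.
    return true_positive / (true_positive + false_positive)


def recall(true_positive: int, false_negative: int) -> float:
    # Share of actual positives that were predicted as positive.
    return true_positive / (true_positive + false_negative)


# Assumed counts (not taken from Table 1, which is unavailable here).
tp, fn, fp = 9_900, 100, 1_000_000
positive_precision = precision(tp, fp)  # roughly 0.0098, i.e. about 0.98%
positive_recall = recall(tp, fn)        # 0.99, i.e. 99%
```

Even a 1% error rate on a very large number of negative examples produces far more false positives than there are true positives, which is why the precision collapses.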

As can be seen from the above, in the application scenario of the embodiment of the present invention, the number of negative examples is significantly higher than the number of positive examples, so that the predicted positive examples are of almost no practical use. Therefore, to ensure the user experience, the number of positive examples and the number of negative examples should be balanced. Deleting the negative examples in which the item attribute values are mutually exclusive reduces the number of negative examples and ensures their quality.

In this embodiment, the negative examples in which the item attribute values are mutually exclusive are deleted from the plurality of negative examples to obtain a plurality of negative examples of the category, which reduces the number of negative examples and improves their quality. The edit distance algorithm is trained with the plurality of positive examples of the category and the plurality of negative examples of the category to obtain the edit distance algorithm of the category. Similarity calculation performed with the edit distance algorithm of the category is more accurate, so that the selected target article is more consistent with the user's requirements and the user experience is further improved.

A method of selecting an item is described above in connection with fig. 1-6 and an apparatus for selecting an item is described below in connection with fig. 7.

In order to solve the problems existing in the prior art, an embodiment of the present invention provides an apparatus for selecting an article, as shown in fig. 7, the apparatus includes:

an obtaining unit 701, configured to obtain a first attribute name set and attribute values of the first attribute name set according to a first text, and obtain a second attribute name set and attribute values of the second attribute name set according to a second text;

a processing unit 702, configured to determine a similarity between the first text-described item and the second text-described item according to the first set of attribute names, the attribute value of the first set of attribute names, the second set of attribute names, and the attribute value of the second set of attribute names;

a selecting unit 703, configured to select, according to the user attention attribute, a target item from the items described in the first text and the items described in the second text if the similarity is greater than a first preset value.
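The behaviour of the selecting unit can be sketched as follows. The sketch assumes that the user attention attribute is a numeric attribute such as price and that a lower value is preferred; both are illustrative assumptions, as are the function name `select_target` and its parameters.

```python
from typing import Optional


def select_target(
    first: dict,
    second: dict,
    similarity: float,
    first_preset_value: float,
    attention_attribute: str,
    prefer_lower: bool = True,
) -> Optional[dict]:
    """Pick one of two similar items according to the user attention attribute."""
    # Items that are not similar enough are not compared; no target is selected.
    if similarity <= first_preset_value:
        return None

    def key(item: dict):
        return item[attention_attribute]

    return min(first, second, key=key) if prefer_lower else max(first, second, key=key)
```

For example, with an assumed first preset value of 0.8 and "price" as the attention attribute, two items whose similarity is 0.9 are compared and the cheaper one is returned.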

The obtaining unit 701 is configured to:

In the embodiment of the present invention, the obtaining unit 701 is configured to:

acquiring a plurality of attribute names of the category;

In an embodiment of the present invention, the processing unit 702 is configured to:

It should be understood that the functions performed by the components of the apparatus for selecting an article according to the embodiments of the present invention have been described in detail in the method for selecting an article according to the foregoing embodiments, which is not described herein again.

Fig. 8 illustrates an exemplary system architecture 800 of a method of selecting an item or an apparatus of selecting an item to which embodiments of the present invention may be applied.

As shown in fig. 8, a system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 805 through the network 804 using the terminal devices 801, 802, 803 to receive or send messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 801, 802, 803.

The terminal devices 801, 802, 803 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 805 may be a server providing various services, such as a background management server (by way of example only) that provides support for shopping-type websites browsed by users using the terminal devices 801, 802, 803. The background management server may analyze and otherwise process received data such as a product information query request, and feed the processing result (e.g., target push information or product information, by way of example only) back to the terminal device.

It should be noted that, the method for selecting an item according to the embodiment of the present invention is generally performed by the server 805, and accordingly, the device for selecting an item is generally disposed in the server 805.

It should be understood that the number of terminal devices, networks and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 9, there is illustrated a schematic diagram of a computer system 900 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as needed, so that a computer program read out therefrom is installed into the storage section 908 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 901.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a unit, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present invention may be implemented in software or in hardware. The described units may also be provided in a processor, which may for example be described as: a processor including an acquisition unit, a processing unit, and a selection unit. In some cases the names of these units do not constitute a limitation on the units themselves; for example, the selection unit may also be described as "a unit that, if the similarity is greater than a first preset value, selects a target item from the items described in the first text and the items described in the second text according to the user attention attribute".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments, or may exist alone without being assembled into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: obtaining a first attribute name set and attribute values of the first attribute name set according to a first text, and obtaining a second attribute name set and attribute values of the second attribute name set according to a second text; determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set; and if the similarity is larger than a first preset value, selecting a target object from the objects described in the first text and the objects described in the second text according to the user attention attribute.

According to the technical scheme of the embodiment of the invention, the first attribute name set and the attribute values of the first attribute name set are obtained according to the first text, and the second attribute name set and the attribute values of the second attribute name set are obtained according to the second text; determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set; and if the similarity is larger than a first preset value, selecting a target object from the objects described in the first text and the objects described in the second text according to the user attention attribute. By determining whether the attribute values of the articles are similar or not, whether the articles described by different texts are similar or not is determined, accuracy of determining whether the articles are similar or not is improved, the selected target articles are more consistent with the requirements of users, and user experience is improved.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method of selecting an article, comprising:

determining the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set; which specifically comprises the following steps: determining an intersection of the first set of attribute names and the second set of attribute names, the intersection comprising at least one identical attribute name; for each identical attribute name, respectively obtaining attribute values of the identical attribute name from attribute values of the first attribute name set and attribute values of the second attribute name set, and carrying out similarity calculation on the obtained attribute values to obtain the similarity of the identical attribute name; fusing the obtained similarities, wherein the fusion result is used as the similarity of the first text-described object and the second text-described object;

2. The method of claim 1, wherein the category to which the first text belongs is the same as the category to which the second text belongs;

creating an item attribute library of the category;

3. The method of claim 2, wherein creating the item attribute library of the category comprises:

4. A method according to claim 3, wherein using the item attribute library of the category to obtain a first set of attribute names and attribute values for the first set of attribute names from a first text comprises:

5. The method of claim 3, wherein obtaining the set of target attribute names for the category comprises:

acquiring a plurality of attribute names of the category;

6. The method of claim 5, wherein performing similarity calculation on the obtained attribute values to obtain the similarity of the same attribute names comprises:

7. An apparatus for selecting an article, comprising:

a processing unit, configured to determine the similarity of the first text-described object and the second text-described object according to the first attribute name set, the attribute value of the first attribute name set, the second attribute name set and the attribute value of the second attribute name set; and specifically configured for: determining an intersection of the first set of attribute names and the second set of attribute names, the intersection comprising at least one identical attribute name; for each identical attribute name, respectively obtaining attribute values of the identical attribute name from attribute values of the first attribute name set and attribute values of the second attribute name set, and carrying out similarity calculation on the obtained attribute values to obtain the similarity of the identical attribute name; fusing the obtained similarities, wherein the fusion result is used as the similarity of the first text-described object and the second text-described object;

8. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.

9. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-6.