CN111353838A - Method and device for automatically checking commodity category - Google Patents


Info

Publication number
CN111353838A
Authority
CN
China
Prior art keywords
category
level
commodity
prediction
equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811571500.5A
Other languages
Chinese (zh)
Inventor
余文虎
孙志强
何小锋
刘海锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201811571500.5A
Publication of CN111353838A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0641Shopping interfaces
    • G06Q30/0643Graphical representation of items or shoppers

Abstract

The invention discloses a method and a device for automatically checking commodity categories, and relates to the technical field of computers. One embodiment of the method comprises: acquiring the title and actual category of the commodity; calculating the title of the commodity by using the ith classification model, and determining the probability that the commodity belongs to each i-level category in the ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1; selecting the primary category with the maximum probability from the first-level category set as a primary prediction category; selecting a kth level category comprising a (k-1) level prediction category as a k level prediction category from a kth level category set, and taking the k level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i; and checking the actual category according to the prediction category. The implementation mode can comprehensively consider text semantics and upper-layer categories, automatically check the commodity categories, reduce the labor workload, improve the accuracy and efficiency, and has strong expandability.

Description

Method and device for automatically checking commodity category
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for automatically checking commodity categories.
Background
The commodity category system is one of the most important pieces of infrastructure in electronic commerce, and its quality directly affects downstream applications such as search and recommendation, intention recognition and related personalized services. In the actual operation of an e-commerce website, however, commodities are often hung under the wrong category, so that they are classified under unrelated categories, which degrades the user experience and the commodity conversion rate. At present there are two main ways to detect whether a commodity is hung under the wrong category: in the first, operators judge manually whether the commodity title and the category chosen by the merchant match; in the second, commodities whose titles do not match the configured categories are screened out by rules and then removed manually.
For example, the first method works as follows: given the title of a commodity put on the shelf by a merchant, such as 'tequ 52-degree strong aromatic white spirit 500 ml', and the category 'wine-white spirit' given by the e-commerce platform, the operator draws on practical operating experience and judges the category to be correct. The second method works as follows: the given commodity title is 'tequ 52-degree strong aromatic white spirit 500 ml', the category given by the e-commerce platform is 'operator-mobile phone communication', and the title contains the keyword 'white spirit'; if a rule states that a commodity whose title contains the keyword 'white spirit' cannot be classified under the category 'operator', the computer system determines from the rule that the commodity is hung under the wrong category.
However, for an Internet e-commerce enterprise, hundreds of thousands of commodities are added every day, so the first method, which needs manpower and time to examine whether each commodity matches its category, is costly. The biggest defects of the second method are the limitation and poor extensibility of the rules, whose core is keyword matching: the fewer the rules, the more wrongly hung commodities go undetected; the more the rules, the greater the impact on algorithm performance, the higher the probability of contradictory rules, and the poorer the accuracy. In summary, relying only on manual work or on existing rule-based auditing falls far short of the need to find wrongly hung categories in hundreds of millions of records.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for automatically checking a commodity category, which can comprehensively consider text semantics and upper-layer categories, automatically check the commodity category, reduce workload of manpower, improve accuracy and efficiency, and have strong expandability.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method for automatically checking categories of commodities, including: acquiring the title and actual category of the commodity; calculating the title of the commodity by using an ith classification model, and determining the probability that the commodity belongs to each i-level category in an ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1; selecting the primary category with the maximum probability from the first-level category set as a primary prediction category; selecting a kth level category comprising a (k-1) level prediction category as a k level prediction category from a kth level category set, and taking the k level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i; and checking the actual category according to the prediction category.
Optionally, the method further comprises: and if the k-th level category set does not contain the (k-1) level prediction category, selecting the k-level category with the highest probability from the k-th level category set as the k-level prediction category.
Optionally, the ith classification model is obtained according to the following process: acquiring a sample set, wherein the sample set comprises titles and category information of a plurality of commodities; constructing N category trees according to category information in the sample set, wherein the levels of the N category trees are sequentially increased from 1 to N; taking a category tree with a level i as an i-level category tree, and constructing an i-th level sub-sample set according to the sample set and the i-level category tree; and training the ith-level sub-sample set by using a preset classification algorithm to determine an ith-level classification model.
Optionally, before calculating the title of the commodity by using the ith classification model, the method further comprises: segmenting the title of the commodity to obtain a plurality of words;
converting each of the words into a word vector using a word vector conversion algorithm.
Optionally, checking the actual category according to the predicted category includes: calculating the text similarity between the prediction category and the actual category by using a similarity algorithm; and if the text similarity is larger than or equal to a threshold value, determining that the actual category of the commodity passes verification.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided an apparatus for automatically checking categories of goods, including: the information acquisition module is used for acquiring the title and the actual category of the commodity; the classification module is used for calculating the title of the commodity by using the ith classification model and determining the probability that the commodity belongs to each i-level category in the ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1; the prediction category determining module is used for selecting the primary category with the highest probability from the first-level category set as the primary prediction category; selecting a kth level category comprising a (k-1) level prediction category as a k level prediction category from a kth level category set, and taking the k level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i; and the checking module is used for checking the actual category according to the prediction category.
Optionally, the prediction category determination module is further configured to: and if the k-th level category set does not contain the (k-1) level prediction category, selecting the k-level category with the highest probability from the k-th level category set as the k-level prediction category.
Optionally, the apparatus further comprises a model training module configured to: acquiring a sample set, wherein the sample set comprises titles and category information of a plurality of commodities; constructing N category trees according to category information in the sample set, wherein the levels of the N category trees are sequentially increased from 1 to N; taking a category tree with a level i as an i-level category tree, and constructing an i-th level sub-sample set according to the sample set and the i-level category tree; and training the ith-level sub-sample set by using a preset classification algorithm to determine an ith-level classification model.
Optionally, the apparatus further comprises a word vector conversion module, configured to: segmenting the title of the commodity to obtain a plurality of words; converting each of the words into a word vector using a word vector conversion algorithm.
Optionally, the verification module is further configured to: calculating the text similarity between the prediction category and the actual category by using a similarity algorithm; and if the text similarity is larger than or equal to a threshold value, determining that the actual category of the commodity passes verification.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement the method for automatically checking the commodity category of the embodiment of the invention.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a computer readable medium having a computer program stored thereon, wherein the computer program is configured to implement the method for automatically checking categories of commodities according to the embodiments of the present invention when executed by a processor.
One embodiment of the above invention has the following advantages or benefits: determining the probability that the commodity belongs to each i-level category in the i-level category set by adopting the calculation of the title of the commodity by using the i-level classification model; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1; selecting the primary category with the maximum probability from the first-level category set as a primary prediction category; selecting a kth level category comprising a (k-1) level prediction category as a k level prediction category from a kth level category set, and taking the k level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i; according to the technical means of checking the actual categories according to the prediction categories, text semantics and upper-layer categories can be comprehensively considered, the commodity categories can be automatically checked, the labor workload is reduced, the accuracy and the efficiency are improved, and the expandability is strong.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a method of automatically verifying categories of goods according to an embodiment of the present invention;
FIG. 2 is a schematic illustration of a sub-flow of a method for automatically verifying categories of merchandise according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a category tree used in a method for automatically verifying categories of merchandise according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a classification model for a method of automatically verifying categories of merchandise, according to yet another embodiment of the present invention;
FIG. 5 is a schematic diagram of the main modules of an apparatus for automated verification of a commodity category according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic flow chart of the main steps of a method for automatically verifying categories of commodities according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step S101: the title and actual category of the commodity are obtained.
Specifically, the title and actual category information of the commodity can be obtained through a web crawler, which is a program or script that automatically captures World Wide Web information according to certain rules. As a specific example, the obtained commodity title is "Alvine (AVIVI) pillow core pillow, cassia seed pillow core fiber aromatherapy pillow 45 x 70 cm", and the actual category is "home textile-bedding-flower and grass pillow-fiber pillow".
In practical applications, in order to differentiate and systematize commodities and make it easier for consumers to view the specific content of a commodity on an e-commerce platform, commodities are classified: suitable basic characteristics of the commodities are chosen as classification marks, and the commodities are divided step by step into sub-collections (categories) of progressively smaller scope and more consistent characteristics, such as large categories, medium categories, small categories and fine categories. The lower the category level, the more it reflects the individual characteristics of the commodities; the higher the category level, the more it reflects the characteristics shared by a whole class of commodities. For a category at a given level, its upper-layer categories are all the categories above it. For example, for "home textile-bedding-flower and grass pillow-fiber pillow", "home textile" is the first-level category and is an upper-layer category of the second-level, third-level and fourth-level categories; "bedding" is the second-level category and is an upper-layer category of the third-level and fourth-level categories; "flower and grass pillow" is the third-level category and is an upper-layer category of the fourth-level category; "fiber pillow" is the fourth-level category. For the category "computer, office-stationery/consumable-toner/toner", "toner/toner" is the third-level category, and "computer, office" and "computer, office-stationery/consumable" are its upper-layer categories.
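The relationship between a category path, its level and its upper-layer categories can be sketched in a few lines of Python. This is only an illustration under the assumption that a category path is a string whose levels are joined by "-" and that no category name itself contains "-"; the function names `category_level` and `upper_categories` are hypothetical and do not come from the patent.

```python
def category_level(path, sep="-"):
    """Return the level of a category path, i.e. how many categories it chains."""
    return len(path.split(sep))


def upper_categories(path, sep="-"):
    """Return every upper-layer category path of `path`, highest level first.

    Assumption: levels are joined by `sep` and names never contain `sep`.
    """
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts))]


example_path = "home textile-bedding-flower and grass pillow-fiber pillow"
example_level = category_level(example_path)
example_uppers = upper_categories(example_path)
```

For the four-level path above, `example_uppers` holds the first-, second- and third-level paths, matching the description of "home textile", "bedding" and "flower and grass pillow" as upper-layer categories of "fiber pillow".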
Step S102: calculating the title of the commodity by using an ith classification model, and determining the probability that the commodity belongs to each i-level category in an ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1.
In the present embodiment there are N classification models in total, whose levels increase sequentially from 1 to N; the classification model with level i is referred to as the i-th classification model. For the i-th of the N classification models, the title of the commodity is fed to the model to determine the probability that the commodity belongs to each i-level category in the i-th category set.
In an alternative embodiment, before calculating the title of the good using the ith classification model, the method further comprises:
segmenting the title of the commodity to obtain a plurality of words;
converting each of the words into a word vector using a word vector conversion algorithm.
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain standard. In this embodiment, existing word segmentation techniques, such as a word segmentation method based on character string matching, a word segmentation method based on understanding, or a word segmentation method based on statistics, may be used to segment the title of the commodity.
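As an illustration of the string-matching family of segmentation methods mentioned above, the following is a minimal sketch of forward maximum matching. The patent does not name a specific algorithm, so the choice of forward maximum matching, the toy vocabulary, the `max_len` parameter and the function name `forward_max_match` are all assumptions made for this example.

```python
def forward_max_match(text, vocabulary, max_len=6):
    """Segment `text` greedily, always taking the longest vocabulary word.

    Characters not covered by any vocabulary word are emitted one by one.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy vocabulary for an unspaced string (hypothetical data).
toy_vocabulary = {"baby", "sports", "shoes"}
toy_tokens = forward_max_match("babysportsshoes", toy_vocabulary)
```

Production segmenters usually combine a dictionary with statistical models; the sketch only shows the matching idea.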
A word vector is a dense vector representation of a word, and similar words have similar word vectors. The word vector conversion algorithm converts words in natural language into dense vectors that a computer can process. Preferably, the word vector conversion algorithm is the word2vec algorithm, whose basic idea is to map each word, through training, to a real-valued vector of a predetermined dimension. Compared with the existing bag-of-words model or vector space model, the word2vec algorithm expresses the semantic information of a text better.
As a specific example, a commodity title is "spring and autumn 1-3-6 year old man and woman baby shoes sports shoes man and woman leisure white shoes baby shoes golden 28 yards 17cm inside length". Segmenting this title yields the words "spring and autumn, 1-3-6 years old, man and woman, baby shoes, sports shoes, man and woman, leisure white shoes, children, shoes, golden, 28 yards, 17cm inside length", and each word is then converted into a word vector, such as [ [0.2123, 0.1232, 0.7238, … ], [0.5628, 0.2813, 0.9272, … ], … ].
As shown in fig. 2, the above-mentioned i-th classification model is obtained according to the following procedure:
step S202: a sample set is obtained, wherein the sample set comprises titles and category information of a plurality of commodities.
In an alternative embodiment, after the sample set is obtained, the sample set is divided into a training data set for training the classification model and a validation data set for testing the effectiveness of the classification model, wherein the amount of samples in the training data set is greater than the amount of samples in the validation data set, e.g., the sample set is divided into the training data set and the validation data set in a 9:1 ratio.
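The 9:1 division described above can be sketched as a shuffled split. The shuffling, the fixed seed and the function name `split_samples` are assumptions for illustration; the patent only states the ratio.

```python
import random


def split_samples(samples, train_ratio=0.9, seed=0):
    """Shuffle `samples` and split them into (training set, validation set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder samples stand in for (title, category) records.
train_set, validation_set = split_samples(range(100))
```

With 100 samples and the default ratio, 90 land in the training set and 10 in the validation set.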
As a specific example, part of the sample data in the sample set is shown in table 1.
Table 1:
Figure BDA0001915650290000071
Figure BDA0001915650290000081
As can be seen from Table 1, different commodity titles correspond to different category levels. For example, the title "spring and autumn 1-3-6 year old man and woman baby shoes sports shoes man and woman leisure white shoes baby shoes golden 28 yards 17cm inside length" corresponds to the category "mother and infant---children shoes---sports shoes" (which may also be referred to as the category path); the categories involved are "mother and infant", "children shoes" and "sports shoes", and the level is 3. Likewise, the title "general license plate tray of the license plate frame of the thickened stainless steel silver automobile article license plate frame vehicle management station of the gathering new traffic rule license plate frame" corresponds to the category "automobile articles---automobile decoration---automobile body decoration---other functional small pieces---anti-collision rubber strips"; the categories involved are "automobile articles", "automobile decoration", "automobile body decoration", "other functional small pieces" and "anti-collision rubber strips", and the level is 5.
Step S202: and constructing N category trees according to the category information in the sample set, wherein the hierarchy of the N category trees is sequentially increased from 1 to N.
As known from practical application, the commodity category system is subdivided step by step, so that the commodity category can be represented by a tree structure, namely a category tree is constructed. The tree is a non-linear data structure and is a finite set consisting of a plurality of nodes.
Further, a category tree can be formed by traversing all categories and adding the existing categories to the tree level by level as tree nodes. For example, given the categories "home textile-bedding article-flower and grass pillow-fiber pillow" and "home textile-bedding article-flower and grass pillow-latex pillow", the first-layer node of the tree is the root node, the second-layer node is "home textile", the third-layer node is "bedding article", the fourth-layer node is "flower and grass pillow", and the fifth layer has two nodes, "fiber pillow" and "latex pillow". This structure is drawn as a graph in FIG. 3.
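A category tree of this kind can be sketched with nested dictionaries, where the outermost dictionary plays the role of the root node. The representation, the "-" separator and the names `build_category_tree` and `tree_depth` are assumptions chosen for brevity, not the patent's data structure.

```python
def build_category_tree(paths, sep="-"):
    """Insert every category path into a nested-dict tree, level by level."""
    root = {}
    for path in paths:
        node = root
        for name in path.split(sep):
            # Reuse the child node if it exists, otherwise create it.
            node = node.setdefault(name, {})
    return root


def tree_depth(node):
    """Number of category levels below `node` (the root itself is not counted)."""
    if not node:
        return 0
    return 1 + max(tree_depth(child) for child in node.values())


tree = build_category_tree([
    "home textile-bedding article-flower and grass pillow-fiber pillow",
    "home textile-bedding article-flower and grass pillow-latex pillow",
])
```

Here `tree_depth(tree)` is 4 because the root is implicit; counting the root as the first layer, as the text does, gives the five layers of the example.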
For convenience of expression, a category tree with a hierarchy of i in the N category trees can be referred to as an i-level category tree, i is more than or equal to 1 and less than or equal to N, and i is an integer. As a specific example, if N is 8, the maximum level of the category tree is 8.
Step S203: and taking the category tree with the level i as an i-level category tree, and constructing an i-th level sub-sample set according to the sample set and the i-level category tree.
According to different category levels, sub-sample sets of different category levels, such as a primary sub-sample set, a secondary sub-sample set and an i-level sub-sample set, can be generated according to the sample set.
As a specific example, the commodity title "spring and autumn 1-3-6 year old man and woman baby shoes sports shoes man and woman leisure white shoes baby shoes golden 28 yards 17cm inside length" can be expanded into the following commodity categories: the first-level commodity category is "mother and infant", the second-level commodity category is "mother and infant-children shoes", and the third-level commodity category is "mother and infant-children shoes-sports shoes". Because the category "mother and infant-children shoes-sports shoes" has only three levels, it cannot contribute to the fourth-level sub-sample set; this sample therefore appears only in the first-level, second-level and third-level sub-sample sets and not in the fourth-level sub-sample set.
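The construction of the per-level sub-sample sets can be sketched as follows. The sketch assumes each sample is a `(title, category_path)` pair with levels joined by "-"; the shortened title and the function name `build_subsample_sets` are hypothetical.

```python
def build_subsample_sets(samples, n_levels, sep="-"):
    """Return {level: [(title, path truncated to that level), ...]}.

    A sample whose category path has fewer than `level` levels is simply
    absent from the sub-sample set of that level.
    """
    subsets = {level: [] for level in range(1, n_levels + 1)}
    for title, path in samples:
        parts = path.split(sep)
        for level in range(1, min(len(parts), n_levels) + 1):
            subsets[level].append((title, sep.join(parts[:level])))
    return subsets


# Shortened, hypothetical title standing in for the full one in the text.
sample_set = [("baby sports shoes", "mother and infant-children shoes-sports shoes")]
subsample_sets = build_subsample_sets(sample_set, n_levels=4)
```

The three-level sample lands in the sub-sample sets of levels 1 to 3, and the level-4 set stays empty, as in the example above.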
Illustratively, the primary, secondary, and tertiary subsample sets generated from the sample set shown in table 1 are shown in tables 2, 3, and 4, respectively, and the subsample sets of other levels are not shown due to space limitations.
Table 2:
Figure BDA0001915650290000091
Table 3:
Figure BDA0001915650290000092
Figure BDA0001915650290000101
Table 4:
Figure BDA0001915650290000102
Step S204: the i-th level sub-sample set is trained by using a preset classification algorithm to determine the i-th level classification model.
Specifically, the preset classification algorithm may be FastText, AbLSTM, or the like. The FastText algorithm was proposed by Mikolov et al. in 2016 and consists of three parts: an input layer, a hidden layer and an output layer. For classification, the word sequence generated by segmenting the commodity title is fed into the input layer; the words in the sequence and their N-gram combinations form a plurality of feature vectors, which are mapped to the hidden layer by a linear transformation, and the commodity classification is output through a nonlinear activation function. The network structure of the AbLSTM algorithm mainly comprises an input layer, a word vector conversion layer, a bidirectional LSTM layer (LSTM stands for Long Short-Term Memory, a long short-term memory network) and an output layer. The input layer receives the word sequence generated by segmenting the commodity title, the word vector conversion layer converts the word sequence into the corresponding word vector sequence, the bidirectional LSTM layer performs the hidden-state mapping, and the output layer uses an attention mechanism to weigh the importance of each part to the overall category prediction, so as to obtain the probability of each category.
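To make the N-gram step concrete, the sketch below builds the unigram and bigram features that an input layer of this kind would receive. It is only an illustration of word N-grams, not the FastText implementation, which additionally hashes N-grams into a fixed number of buckets; the function name `word_ngrams` and the shortened word list are hypothetical.

```python
def word_ngrams(words, n=2):
    """Return the words followed by all contiguous word N-grams up to size n."""
    features = list(words)
    for size in range(2, n + 1):
        features += [
            " ".join(words[i:i + size])
            for i in range(len(words) - size + 1)
        ]
    return features


ngram_features = word_ngrams(["baby", "sports", "shoes"])
```

Adding bigrams lets a linear model distinguish word order locally, for example "sports shoes" from "shoes sports", which single words alone cannot express.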
In an alternative embodiment, before performing step S204, the method further comprises: segmenting the commodity titles in each subsample set to obtain a plurality of words; converting each of the words into a word vector using a word vector conversion algorithm. The specific process may refer to the word segmentation method and the word vector conversion algorithm in step S102.
Taking N = 8 as an example, as shown in FIG. 4, 8 category trees are constructed in step S202, the maximum level of the category trees being 8; 8 sub-sample sets (the first-level category training set data and so on in the figure) are generated in step S203; and 8 classification models (the first-level category classification model and so on in the figure) are obtained by the training of step S204.
Step S103: selecting the primary category with the maximum probability from the first-level category set as a primary prediction category;
Step S104: from the k-th level category set, the k-level category that includes the (k-1)-level prediction category is selected as the k-level prediction category, and the k-level prediction category is taken as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i.
In step S104, if the k-th level category set does not contain the (k-1)-level prediction category, the k-level category with the highest probability is selected from the k-th level category set as the k-level prediction category.
The following description takes N = 8 as an example. A total of 8 classification models, namely the first-level classification model, the second-level classification model, and so on up to the eighth-level classification model, are generated through steps S201-S204. The 8 classification models are applied to the commodity title "Alwei (AVIVI) pillow core, cassia seed pillow core fiber aromatherapy pillow 45 × 70 cm"; for convenience of explanation, only the 3 categories with the highest probability in the result set of each classification model are listed, as shown in Table 5 below.
Table 5:
Figure BDA0001915650290000111
Figure BDA0001915650290000121
According to step S103: "home textile" has the maximum probability, 0.823, so "home textile" is selected as the first-level prediction category.
According to step S104:
for the second-level category set, the categories that include the upper-layer result "home textile", namely "home textile-bedding article" and "home textile-home fabric", are selected as second-level prediction categories;
for the third-level category set, the categories that include the upper-layer results "home textile-bedding article" and "home textile-home fabric", namely "home textile-bedding article-flower and grass pillow" and "home textile-home fabric-throw pillow back cushion", are selected as third-level prediction categories;
for the fourth-level category set, the categories that include the upper-layer results "home textile-bedding article-flower and grass pillow" and "home textile-home fabric-throw pillow back cushion", namely "home textile-bedding article-flower and grass pillow-fiber pillow" and "home textile-home fabric-throw pillow back cushion-headrest waist rest", are selected as fourth-level prediction categories;
for the fifth-level category set, no category includes the upper-layer results "home textile-bedding article-flower and grass pillow-fiber pillow" or "home textile-home fabric-throw pillow back cushion-headrest waist rest", so the process falls back to the fourth-level category set and compares the probabilities of these two categories; because the probability of the former is higher, "home textile-bedding article-flower and grass pillow-fiber pillow" is selected as the prediction category of the commodity.
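The level-by-level selection walked through above can be sketched as a short function. It follows one reading of the worked example: keep every k-level candidate whose parent path was kept at level k-1, and when no candidate survives, fall back to the most probable candidate of the previous level. The data layout (a list of `{path: probability}` dictionaries, one per level), the "-" separator and the name `predict_category` are assumptions, and every probability below except 0.823 is invented for illustration.

```python
def predict_category(level_probs, sep="-"):
    """Pick a prediction category from per-level probability tables.

    level_probs[k - 1] maps each k-level category path to its probability.
    """
    # Step S103: the most probable first-level category.
    best = max(level_probs[0], key=level_probs[0].get)
    candidates = {best: level_probs[0][best]}
    # Step S104: keep only categories that extend a surviving upper-layer result.
    for probs in level_probs[1:]:
        extended = {
            path: p
            for path, p in probs.items()
            if path.rsplit(sep, 1)[0] in candidates
        }
        if not extended:
            break  # nothing at this level extends the previous results
        candidates = extended
    return max(candidates, key=candidates.get)


# Hypothetical top-3 results per level (only 0.823 appears in the text).
example_probs = [
    {"home textile": 0.823, "furniture": 0.101, "clothing": 0.042},
    {"home textile-bedding article": 0.61, "home textile-home fabric": 0.27,
     "furniture-bedroom furniture": 0.05},
    {"home textile-bedding article-flower and grass pillow": 0.58,
     "home textile-home fabric-throw pillow back cushion": 0.22,
     "furniture-bedroom furniture-bed": 0.03},
    {"home textile-bedding article-flower and grass pillow-fiber pillow": 0.55,
     "home textile-home fabric-throw pillow back cushion-headrest waist rest": 0.20},
    {"automobile articles-automobile decoration-automobile body decoration-other functional small pieces-anti-collision rubber strips": 0.30},
]
predicted = predict_category(example_probs)
```

With these illustrative numbers the function stops at the fourth level and returns "home textile-bedding article-flower and grass pillow-fiber pillow", mirroring the outcome of the worked example.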
Step S105: and checking the actual category according to the prediction category.
Specifically, the following processes may be included:
calculating the text similarity between the prediction category and the actual category by using a similarity algorithm;
and if the text similarity is larger than or equal to a threshold value, determining that the actual category of the commodity passes verification.
The text similarity may be represented by cosine similarity, edit distance, Hamming distance, Manhattan distance, or Jaro-Winkler (J-W) distance, and the similarity algorithm may correspondingly be a cosine similarity algorithm, an edit distance algorithm, a Hamming distance algorithm, a Manhattan distance algorithm, or a J-W distance algorithm. As a specific example, this embodiment uses the J-W distance algorithm to measure text similarity. The J-W distance is a variant of the Jaro distance. The Jaro distance is a kind of edit distance; it is used in the field of record linkage to link records in heterogeneous data sources that refer to the same entity, and also for spelling correction. The Jaro distance is defined by formula (1) below, and the J-W distance is defined in terms of the Jaro distance by formula (2) below:
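$$d_j = \begin{cases} 0, & m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right), & \text{otherwise} \end{cases} \tag{1}$$

$$d_w = \begin{cases} d_j, & d_j < b_t \\ d_j + l\,p\,(1 - d_j), & \text{otherwise} \end{cases} \tag{2}$$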
where $d_j$ denotes the Jaro distance; $m$ denotes the number of matching characters between the prediction category and the actual category; $|s_1|$ and $|s_2|$ denote the string lengths of the prediction category and the actual category, respectively; $t$ denotes half the number of transposed characters between the prediction category and the actual category; $d_w$ denotes the final text similarity between the prediction category and the actual category; $l$ denotes the length of the common prefix at the start of the two strings, with $l \le 4$; $p$ denotes the scaling factor; and $b_t$ denotes the boost threshold: when the Jaro distance reaches the boost threshold, the boosted Jaro distance is the J-W distance. $p$ scales the contribution of $l$ so that $d_w$ does not exceed 1. In this embodiment, $p = 0.1$ and $b_t = 0.7$.
For example, let the character string $s_1$ be MARTHA and the character string $s_2$ be MARHTA. All six characters "M, A, R, T, H, A" of $s_1$ and $s_2$ match, so $m = 6$, with $|s_1| = 6$ and $|s_2| = 6$. T and H are matched but appear in a different order, which gives two transposed characters, so $t = 2/2 = 1$. Substituting into formula (1) gives $d_j = (6/6 + 6/6 + 5/6)/3 \approx 0.944$. "MAR" is the common prefix at the start of the two strings, so $l = 3$, and substituting into formula (2) gives $d_w = 0.944 + 3 \times 0.1 \times (1 - 0.944) \approx 0.961$. The text similarity between $s_1$ and $s_2$ is therefore 0.961.
In the above example, if the prediction category is "home textile-bedding article-flower and grass pillow-fiber pillow" and the actual category is "home textile-bedding article-flower and grass pillow-latex pillow", then $m = 10$, $|s_1| = 12$, $|s_2| = 12$, $t = 0$, and $l = 4$; substituting into the above formulas gives $d_j = 0.889$ and $d_w = 0.933$. If the threshold is set to 0.9, then because $d_w = 0.933 > 0.9$, the actual category of the commodity passes the verification, indicating that the commodity has not been placed under a wrong category.
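The check of step S105 with the J-W distance may be sketched as follows. This is a minimal Python sketch of the standard Jaro and Jaro-Winkler computation, assuming the customary matching window of max(|s1|, |s2|)/2 - 1 positions, which the text above does not state. The function names, the default arguments, and the two 12-character placeholder strings are hypothetical; the placeholders are built only to have the same match pattern as the category example above (10 matching characters, no transposition, common prefix capped at 4).

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity, formula (1)."""
    if not s1 or not s2:
        return 0.0
    # Assumed matching window (customary choice, not stated in the text).
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that are out of order.
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1,
                 boost_threshold: float = 0.7, max_prefix: int = 4) -> float:
    """Jaro-Winkler similarity, formula (2), with p = 0.1 and b_t = 0.7."""
    dj = jaro(s1, s2)
    if dj < boost_threshold:
        return dj
    prefix = 0  # length l of the common prefix, capped at max_prefix
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


def category_passes(predicted: str, actual: str, threshold: float = 0.9) -> bool:
    """The actual category passes if the similarity reaches the threshold."""
    return jaro_winkler(predicted, actual) >= threshold


# Hypothetical placeholders: 9 shared leading characters, 2 differing
# characters, and 1 shared final character, 12 characters in total.
pred = "ABCDEFGHIxyZ"
act = "ABCDEFGHIuvZ"
dj = round(jaro(pred, act), 3)
dw = round(jaro_winkler(pred, act), 3)
ok = category_passes(pred, act)
```

With these placeholder strings the sketch gives a Jaro distance of about 0.889 and a J-W distance of about 0.933, the values obtained in the category example above, so the check passes at a threshold of 0.9.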
The method for automatically checking the commodity category can comprehensively consider text semantics and upper-layer categories, automatically check the commodity category, reduce the labor workload, improve the accuracy and efficiency, and has strong expandability.
Fig. 5 is a schematic diagram of the main modules of an apparatus 500 for automatically verifying categories of commodities according to an embodiment of the present invention. As shown in Fig. 5, the apparatus 500 includes:
an information obtaining module 501, configured to obtain a title and an actual category of a commodity;
a classification module 502, configured to calculate, by using an ith classification model, a title of the commodity, and determine a probability that the commodity belongs to each i-level category in an ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1;
a prediction category determining module 503, configured to select, from the first-level category set, the first-level category with the highest probability as the first-level prediction category; and to select, from a kth-level category set, a k-level category comprising a (k-1)-level prediction category as a k-level prediction category, and take the k-level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i;
a checking module 504, configured to check the actual category according to the predicted category.
Optionally, the prediction category determining module 503 is further configured to: and if the k-th level category set does not contain the (k-1) level prediction category, selecting the k-level category with the highest probability from the k-th level category set as the k-level prediction category.
Optionally, the apparatus further comprises a model training module configured to: acquiring a sample set, wherein the sample set comprises titles and category information of a plurality of commodities; constructing N category trees according to category information in the sample set, wherein the levels of the N category trees are sequentially increased from 1 to N; taking a category tree with a level i as an i-level category tree, and constructing an i-th level sub-sample set according to the sample set and the i-level category tree; and training the ith-level sub-sample set by using a preset classification algorithm to determine an ith-level classification model.
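The construction of the per-level sub-sample sets may be illustrated by the following minimal Python sketch. It rests on an assumption not spelled out in this passage: that the i-th level sub-sample set pairs each title with the first i levels of its category path, and that samples whose path is shorter than i levels are left out of that set. The function name `build_level_subsets`, the "-" separator, and the sample titles are hypothetical; training with the preset classification algorithm is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A sample is (title, category levels ordered from the top level down).
Sample = Tuple[str, List[str]]


def build_level_subsets(samples: List[Sample],
                        n_levels: int) -> Dict[int, List[Tuple[str, str]]]:
    """Return {i: [(title, i-level category path), ...]} for i = 1..n_levels."""
    subsets: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for title, levels in samples:
        # A sample contributes to every level its category path reaches.
        for i in range(1, min(len(levels), n_levels) + 1):
            subsets[i].append((title, "-".join(levels[:i])))
    return dict(subsets)


# Hypothetical sample set.
samples = [
    ("fiber aromatherapy pillow 45 x 70 cm",
     ["home textile", "bedding article", "flower and grass pillow", "fiber pillow"]),
    ("sofa throw pillow",
     ["home textile", "home fabric"]),
]
subsets = build_level_subsets(samples, n_levels=3)
```

Each `subsets[i]` could then be passed to whatever classification algorithm is chosen in order to obtain the i-th level classification model.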
Optionally, the apparatus further comprises a word vector conversion module, configured to: segmenting the title of the commodity to obtain a plurality of words; converting each of the words into a word vector using a word vector conversion algorithm.
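The two operations of the word vector conversion module may be pictured with the short Python sketch below. It is only a stand-in: whitespace and comma splitting replaces a real word segmentation algorithm, and a small hand-written lookup table replaces a trained word vector model. The function names, the table, and the vector values are hypothetical.

```python
from typing import Dict, List


def segment(title: str) -> List[str]:
    """Placeholder segmentation: lowercase, then split on whitespace and commas."""
    return [w for w in title.lower().replace(",", " ").split() if w]


def to_word_vectors(words: List[str], table: Dict[str, List[float]],
                    dim: int) -> List[List[float]]:
    """Look up each word; words missing from the table map to a zero vector."""
    unknown = [0.0] * dim
    return [list(table.get(w, unknown)) for w in words]


# Hypothetical 2-dimensional word vectors.
table = {"fiber": [0.1, 0.2], "pillow": [0.3, 0.4]}
words = segment("Fiber pillow, 45 cm")
vectors = to_word_vectors(words, table, dim=2)
```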
Optionally, the check module 504 is further configured to: calculating the text similarity between the prediction category and the actual category by using a similarity algorithm; and if the text similarity is larger than or equal to a threshold value, determining that the actual category of the commodity passes verification.
The device can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
Fig. 6 illustrates an exemplary system architecture 600 of a method for automatically verifying categories of goods or an apparatus for automatically verifying categories of goods to which embodiments of the present invention may be applied.
As shown in Fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves as a medium for providing communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. Various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like, may be installed on the terminal devices 601, 602, and 603.
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 601, 602, and 603. The background management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (e.g., target push information and product information) to the terminal device.
It should be noted that the method for automatically verifying the commodity category provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus for automatically verifying the commodity category is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as necessary, so that a computer program read from it can be installed into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a sending module, an obtaining module, a determining module, and a first processing module. The names of these modules do not in some cases constitute a limitation on the unit itself, and for example, the sending module may also be described as a "module that sends a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following:
acquiring the title and actual category of the commodity;
calculating the title of the commodity by using an ith classification model, and determining the probability that the commodity belongs to each i-level category in an ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1;
selecting the primary category with the maximum probability from the first-level category set as a primary prediction category; selecting a kth level category comprising a (k-1) level prediction category as a k level prediction category from a kth level category set, and taking the k level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i;
and checking the actual category according to the prediction category.
The technical scheme of the embodiment of the invention can comprehensively consider text semantics and upper-layer categories, automatically check the commodity categories, reduce the labor workload, improve the accuracy and efficiency and have strong expandability.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for automatically checking categories of commodities is characterized by comprising the following steps:
acquiring the title and actual category of the commodity;
calculating the title of the commodity by using an ith classification model, and determining the probability that the commodity belongs to each i-level category in an ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1;
selecting the primary category with the maximum probability from the first-level category set as a primary prediction category;
selecting a kth level category comprising a (k-1) level prediction category as a k level prediction category from a kth level category set, and taking the k level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i;
and checking the actual category according to the prediction category.
2. The method of claim 1, further comprising:
and if the k-th level category set does not contain the (k-1) level prediction category, selecting the k-level category with the highest probability from the k-th level category set as the k-level prediction category.
3. The method of claim 1, wherein the ith classification model is obtained according to the following procedure:
acquiring a sample set, wherein the sample set comprises titles and category information of a plurality of commodities;
constructing N category trees according to category information in the sample set, wherein the levels of the N category trees are sequentially increased from 1 to N;
taking a category tree with a level i as an i-level category tree, and constructing an i-th level sub-sample set according to the sample set and the i-level category tree;
and training the ith-level sub-sample set by using a preset classification algorithm to determine an ith-level classification model.
4. The method of any of claims 1-3, wherein prior to calculating the title of the commodity using the ith classification model, the method further comprises:
segmenting the title of the commodity to obtain a plurality of words;
converting each of the words into a word vector using a word vector conversion algorithm.
5. The method of claim 4, wherein checking the actual category against the predicted category comprises:
calculating the text similarity between the prediction category and the actual category by using a similarity algorithm;
and if the text similarity is larger than or equal to a threshold value, determining that the actual category of the commodity passes verification.
6. An apparatus for automatically verifying categories of goods, comprising:
the information acquisition module is used for acquiring the title and the actual category of the commodity;
the classification module is used for calculating the title of the commodity by using the ith classification model and determining the probability that the commodity belongs to each i-level category in the ith category set; i is an integer, i is more than or equal to 1 and less than or equal to N, and N is an integer more than or equal to 1;
the prediction category determining module is used for selecting the primary category with the highest probability from the first-level category set as the primary prediction category; selecting a kth level category comprising a (k-1) level prediction category as a k level prediction category from a kth level category set, and taking the k level prediction category as the prediction category of the commodity, wherein k is an integer and is more than 1 and less than or equal to i;
and the checking module is used for checking the actual category according to the prediction category.
7. The apparatus of claim 6, wherein the prediction category determination module is further configured to:
and if the k-th level category set does not contain the (k-1) level prediction category, selecting the k-level category with the highest probability from the k-th level category set as the k-level prediction category.
8. The apparatus of claim 6, further comprising a model training module to:
acquiring a sample set, wherein the sample set comprises titles and category information of a plurality of commodities;
constructing N category trees according to category information in the sample set, wherein the levels of the N category trees are sequentially increased from 1 to N;
taking a category tree with a level i as an i-level category tree, and constructing an i-th level sub-sample set according to the sample set and the i-level category tree;
and training the ith-level sub-sample set by using a preset classification algorithm to determine an ith-level classification model.
9. The apparatus according to any of claims 6-8, wherein the apparatus further comprises a word vector conversion module configured to:
segmenting the title of the commodity to obtain a plurality of words;
converting each of the words into a word vector using a word vector conversion algorithm.
10. The apparatus of claim 9, wherein the verification module is further configured to:
calculating the text similarity between the prediction category and the actual category by using a similarity algorithm;
and if the text similarity is larger than or equal to a threshold value, determining that the actual category of the commodity passes verification.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201811571500.5A 2018-12-21 2018-12-21 Method and device for automatically checking commodity category Pending CN111353838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571500.5A CN111353838A (en) 2018-12-21 2018-12-21 Method and device for automatically checking commodity category

Publications (1)

Publication Number Publication Date
CN111353838A true CN111353838A (en) 2020-06-30

Family

ID=71196932

Country Status (1)

Country Link
CN (1) CN111353838A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination