CN103699523A - Product classification method and device - Google Patents


Info

Publication number
CN103699523A
CN103699523A (application CN201310692950.0A)
Authority
CN
China
Prior art keywords: product, sample, word, feature, image
Prior art date
Legal status
Granted
Application number
CN201310692950.0A
Other languages
Chinese (zh)
Other versions
CN103699523B (en)
Inventor
樊春玲
邓亮
冯良炳
张冠军
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310692950.0A
Publication of CN103699523A
Application granted
Publication of CN103699523B
Legal status: Active


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a product classification method comprising the steps of: extracting product textual features according to a product text describing a product to be classified; extracting product image features according to product images of the product to be classified; generating product features of the product to be classified according to the product textual features and the product image features; and inputting the product features of the product to be classified into a pre-trained product classification model to obtain a classification result. In this method, the product textual features and the product image features of the product to be classified are extracted, and the product features are then generated from both, so that the product can be classified according to these features to obtain the classification result. By comprehensively considering both the textual features and the image features of the product to be classified, the method improves classification accuracy compared with conventional methods that classify products according to their textual information alone. The invention also provides a product classification device.

Description

Product classification method and apparatus
Technical field
The present invention relates to the field of pattern recognition, and in particular to a product classification method and apparatus.
Background technology
With the rapid development of e-commerce, online shopping has gradually become part of netizens' daily behavior. According to the 2012 China Online Shopping Market Analysis Report released by CNNIC in March 2013, the transaction volume of China's online shopping market reached 1,259.4 billion yuan in 2012. Online products are numerous and varied, and e-commerce websites must spend considerable effort on product management to provide users with a good shopping experience.
Product classification is the primary problem of product management, yet at present product categories are mainly calibrated manually. Although there are also methods that classify products according to their textual information, the text of a product cannot fully describe all of its content; if the textual description deviates, the product will be misclassified, and considerable labor cost is needed to correct its category. Existing product classification therefore has poor classification accuracy.
Summary of the invention
Based on this, it is necessary to provide a product classification method and apparatus to address the problem of poor accuracy when classifying products according to their textual information alone.
A product classification method, the method comprising:
extracting product textual features according to a product text describing a product to be classified;
extracting product image features according to a product image of the product to be classified;
generating product features of the product to be classified according to the product textual features and the product image features;
inputting the product features of the product to be classified into a pre-trained product classification model to obtain a classification result.
A product classification apparatus, the apparatus comprising:
a product textual feature extraction module, configured to extract product textual features according to a product text describing a product to be classified;
a product image feature extraction module, configured to extract product image features according to a product image of the product to be classified;
a product feature generation module, configured to generate product features of the product to be classified according to the product textual features and the product image features;
a classification module, configured to input the product features of the product to be classified into a pre-trained product classification model to obtain a classification result.
In the above product classification method and apparatus, the product textual features and product image features of the product to be classified are extracted, and the product features are then generated from both, so that the product can be classified using these product features to obtain a classification result. Because both the textual features and the image features of the product to be classified are comprehensively considered, classification accuracy is improved compared with classifying according to the textual information of the product alone.
Accompanying drawing explanation
Fig. 1 is a schematic flowchart of a product classification method in one embodiment;
Fig. 2 is a schematic flowchart of the step of training a product classification model in one embodiment;
Fig. 3 is a schematic flowchart of the step of extracting product textual features according to a product text describing a product to be classified in one embodiment;
Fig. 4 is a schematic flowchart of the step of extracting product image features according to a product image of a product to be classified in one embodiment;
Fig. 5 is a schematic diagram of segmenting image blocks from a product image, or small image blocks from a sample image, in one embodiment;
Fig. 6 is a schematic diagram of dividing an image block into a plurality of image units, or a small image block into a plurality of sub-units, in one embodiment;
Fig. 7 is a schematic flowchart of the step of extracting sample textual features according to the sample texts of product samples in the training sample set in one embodiment;
Fig. 8 is a schematic flowchart of the step of extracting sample image features according to the sample images of product samples in the training sample set in one embodiment;
Fig. 9 is a schematic diagram of the process of generating product features in a specific application scenario;
Fig. 10 is a schematic diagram of the process of classifying a product to be classified using the trained product classification model to obtain a classification result in a specific application scenario;
Fig. 11 is a structural block diagram of a product classification apparatus in one embodiment;
Fig. 12 is a structural block diagram of a product classification apparatus in another embodiment;
Fig. 13 is a structural block diagram of a product textual feature extraction module in one embodiment;
Fig. 14 is a structural block diagram of a product feature word screening module in one embodiment;
Fig. 15 is a structural block diagram of a product image feature extraction module in one embodiment;
Fig. 16 is a structural block diagram of a sample textual feature extraction module in one embodiment;
Fig. 17 is a structural block diagram of a sample feature word screening module in one embodiment;
Fig. 18 is a structural block diagram of a sample image feature extraction module in one embodiment;
Fig. 19 is a structural block diagram of a product classification apparatus in another embodiment.
Embodiment
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
Unless the context clearly dictates otherwise, the elements and components in the present invention may exist in either singular or plural form, and the present invention does not limit this. Although the steps in the present invention are labeled with numbers, the numbers are not intended to limit the order of the steps; unless the order of steps is expressly stated, or the execution of a certain step requires other steps as a basis, the relative order of the steps is adjustable. It should be appreciated that the term "and/or" used herein covers any and all possible combinations of one or more of the associated listed items.
As shown in Fig. 1, in one embodiment, a product classification method is provided, comprising:
Step 102, extracting product textual features according to a product text describing the product to be classified.
The product text refers to text used to describe the product to be classified, and comprises words, symbols, numerals, etc. Each product text may be stored in a corresponding product text document, so that each product text corresponds to one product text document.
Specifically, the process of extracting product textual features quantizes the feature words extracted from the product text so as to represent the product text, thereby converting the unstructured original product text into structured information that a computer can identify and process for classification. Existing text feature extraction methods may be used to extract the product textual features from the product text, such as principal component analysis (Principal Component Analysis, PCA) or simulated annealing (Simulating Anneal, SA).
Step 104, extracting product image features according to a product image of the product to be classified.
The product image refers to an image containing the product to be classified. Color features (such as a color histogram), texture features or shape features of the product image may be extracted as the product image features.
Step 106, generating the product features of the product to be classified according to the product textual features and the product image features.
After the product textual features and the product image features are obtained, they may be spliced together to obtain the product features of the product to be classified. Specifically, the vector representing the product textual features and the vector representing the product image features may be connected to form the vector representing the product features, thereby realizing feature splicing.
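As a minimal sketch of the feature splicing described above (the feature values here are made up; the actual dimensions come from the extraction steps):

```python
# Feature splicing: concatenate the text-feature vector and the
# image-feature vector into one product-feature vector.

def splice_features(text_features, image_features):
    """Connect the two vectors end to end (feature splicing)."""
    return list(text_features) + list(image_features)

text_vec = [0.12, 0.0, 0.53]        # e.g. product feature word weights
image_vec = [0.4, 0.1, 0.2, 0.3]    # e.g. a gradient histogram
product_vec = splice_features(text_vec, image_vec)
print(len(product_vec))  # 7
```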
Step 108, inputting the product features of the product to be classified into a pre-trained product classification model to obtain a classification result.
Before the product to be classified is classified, the product classification model is obtained in advance by training according to a training sample set. During classification, the product features of the product to be classified are input into the trained product classification model to obtain the classification result.
In one embodiment, the training sample set comprises a plurality of product samples corresponding to preset categories, and each product sample corresponds to a sample text and sample images used to describe it. Thus the training sample set contains product samples of the preset categories, each preset category corresponds to a plurality of product samples, and each product sample corresponds to one sample text and at least one sample image.
As shown in Fig. 2, the product classification method also comprises the step of training the product classification model, comprising steps 202 to 208:
Step 202, extracting sample textual features according to the sample texts of the product samples in the training sample set.
Specifically, the sample textual features may be extracted from the sample text corresponding to each product sample in the training sample set by the same means as extracting the product textual features from the product text of the product to be classified. Existing text feature extraction methods may be used to extract the sample textual features from the sample texts, such as principal component analysis (PCA) or simulated annealing (SA).
Step 204, extracting sample image features according to the sample images of the product samples in the training sample set.
Specifically, the sample image features are extracted from the sample images corresponding to each product sample in the training sample set by the same means as extracting the product image features from the product image of the product to be classified. Color features (such as a color histogram), texture features or shape features of each sample image may be extracted as the sample image features.
Step 206, generating sample features according to the sample textual features and the sample image features.
After the sample textual features and the sample image features are obtained, they may be spliced together to obtain the sample features of each product sample. Specifically, the vector representing the sample textual features and the vector representing the sample image features may be connected to form the vector representing the sample features, thereby realizing feature splicing.
Step 208, training a product classification model based on a support vector machine according to the sample features.
This embodiment adopts the support vector machine (Support Vector Machine, SVM) method to train the product classification model. The basic idea of the SVM method is to establish one or a series of hyperplanes in a high-dimensional space such that the distance from a hyperplane to the nearest training samples is maximal. An SVM-based product classification model can also be obtained using existing SVM training methods. An important task in the SVM method is the selection of the kernel function. When the sample features also carry heterogeneous information, the sample size is very large, the multidimensional data are irregular, or the data are unevenly distributed in the high-order feature space, it is unreasonable to process all samples with a single kernel mapping; a combination of multiple kernel functions, i.e. a multiple kernel learning method, is needed.
There are many ways to combine kernels. This embodiment adopts the multiple kernel learning method UFO-MKL (Ultra-Fast Optimization algorithm for sparse Multi-Kernel Learning, based on sparse coding); the increase in sparsity can reduce redundancy in some cases and improve operational efficiency.
Specifically, let an obtained sample feature be x ∈ X, and let the preset categories be y ∈ Y = {1, 2, …, F}, where F is the total number of preset categories. Define φ_j(x, y), j = 1, …, F, as the function corresponding to the j-th preset category.
Define φ̄(x, y) = [φ_1(x, y), …, φ_F(x, y)] and w̄ = [w_1, …, w_F], where w_j is the hyperplane coefficient corresponding to φ_j(x, y). The norm of w̄ is defined as in formula (1), where ||·||_p is the p-norm of a vector:

||w̄||_{2,p} = || ( ||w_1||_2, ||w_2||_2, ||w_3||_2, …, ||w_F||_2 ) ||_p    formula (1)

The training of the multi-kernel product classification model can be defined as the optimization problem of formula (2):

min_{w̄}  λ ||w̄||²_{2,p} + (1/N) Σ_{t=1}^{N} ℓ(w̄; x_t, y_t)    formula (2)

where λ||w̄||²_{2,p} is the coefficient regularization term, (1/N) Σ_t ℓ(w̄; x_t, y_t) is the classification error loss cost term, and N is the number of product samples in the training set. Here := denotes assignment, λ and α are factor coefficients, p = 2logF/(2logF − 1) is the norm factor, ℓ is a conventional simple cost function, and ∂ℓ is the partial derivative of the cost function. The coefficient solving algorithm based on UFO-MKL is as steps 11) to 18):
11) Initializing the factor coefficients λ, α and the number of iterative loops T;
12) Initializing the coefficients w̄ = 0, the variable θ̄ = 0 and the variable q = 2logF;
13) for t = 1, 2, …, T do;
14) Randomly obtaining a training sample (x_t, y_t);
15) Updating the variable θ̄ according to the partial derivative ∂ℓ of the cost function at (x_t, y_t);
16) Calculating v_j = max(0, ||θ_j||_1 − αt), ∀j = 1, 2, …, F;
17) Updating the coefficients w_j = [v_j θ_j / (tλ||θ_j||_1)] · (v_j / ||v||_q)^{q−2}, ∀j = 1, 2, …, F;
18) end for.
Steps 13) to 18) mean that t takes the values 1, 2, …, T in turn and steps 14) to 17) are executed repeatedly; when t = T and the coefficients w_j have been updated, the loop stops and the algorithm ends. After the hyperplane coefficients w_j corresponding to φ_j(x, y) are obtained, a series of hyperplanes in the high-dimensional space can be established, thereby obtaining the product classification model.
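As an illustration of the loop in steps 11) to 18), the following sketch runs the coefficient update on toy two-class data. The content of the update in step 15) is not fully specified above, so a standard multiclass hinge subgradient is assumed; the data and the values of λ, α and T are invented for the example.

```python
import math
import random

# Sketch of the UFO-MKL-style coefficient solving loop, steps 11)-18).
# Assumption: step 15) accumulates a multiclass hinge subgradient in theta.

def train_coefficients(samples, F, dim, lam=0.1, alpha=0.01, T=200, seed=0):
    rng = random.Random(seed)
    q = 2.0 * math.log(F)                      # step 12): q = 2 log F
    theta = [[0.0] * dim for _ in range(F)]    # accumulated subgradients
    w = [[0.0] * dim for _ in range(F)]        # step 12): coefficients = 0
    for t in range(1, T + 1):                  # step 13)
        x, y = rng.choice(samples)             # step 14): random sample
        # step 15) (assumed): hinge subgradient on the most-violating rival
        scores = [sum(wj[d] * x[d] for d in range(dim)) for wj in w]
        rival = max((j for j in range(F) if j != y), key=lambda j: scores[j])
        if scores[y] - scores[rival] < 1.0:
            for d in range(dim):
                theta[y][d] += x[d]
                theta[rival][d] -= x[d]
        # step 16): v_j = max(0, ||theta_j||_1 - alpha*t)
        norms = [sum(abs(c) for c in tj) for tj in theta]
        v = [max(0.0, nj - alpha * t) for nj in norms]
        vq = sum(vj ** q for vj in v) ** (1.0 / q)   # ||v||_q
        # step 17): w_j = v_j*theta_j/(t*lam*||theta_j||_1)*(v_j/||v||_q)^(q-2)
        for j in range(F):
            if v[j] > 0.0 and norms[j] > 0.0 and vq > 0.0:
                scale = v[j] / (t * lam * norms[j]) * (v[j] / vq) ** (q - 2)
                w[j] = [scale * c for c in theta[j]]
            else:
                w[j] = [0.0] * dim
    return w                                   # step 18): loop finished

# Two linearly separable toy classes in two dimensions
data = [([1.0, 0.1], 0), ([0.9, -0.1], 0), ([0.1, 1.0], 1), ([-0.1, 0.9], 1)]
w = train_coefficients(data, F=2, dim=2)

def predict(x):
    return max(range(2), key=lambda j: sum(w[j][d] * x[d] for d in range(2)))

print([predict(x) for x, _ in data])  # the toy classes separate: [0, 0, 1, 1]
```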
In this embodiment, steps 202 to 208 provide a training method for obtaining the product classification model based on a support vector machine, and using this training method can improve operational efficiency.
In the above product classification method, through steps 102 to 108, the product textual features and product image features of the product to be classified are first extracted, and the product features are then generated according to both, so that the product can be classified using these product features to obtain a classification result. Because both the textual features and the image features of the product to be classified are comprehensively considered, classification accuracy is improved compared with classifying according to the textual information of the product alone; the improvement in classification accuracy also makes automatic product classification possible, saving the labor cost of the product classification process.
In one embodiment, each sample text is stored in a corresponding sample document, so that sample texts and sample documents correspond one to one. As shown in Fig. 3, step 102 specifically comprises steps 302 to 308.
Step 302, performing word segmentation on the product text to obtain candidate words.
Word segmentation is the process of dividing a character sequence into individual words or characters. Specifically, in one embodiment, the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) of the Institute of Computing Technology, Chinese Academy of Sciences, which is based on a multi-layer hidden Markov model, may be adopted to perform Chinese word segmentation on the product text to obtain candidate words. Its segmentation precision reaches 98.45%.
Step 304, screening product feature words from the candidate words according to preset evaluation functions.
Specifically, an evaluation function is constructed to assess each feature in the feature set and score each feature, so that each candidate word obtains an assessment value, also called a weight. All features are then sorted by weight, and a predetermined number of optimal features are extracted as the resulting feature subset.
In one embodiment, step 304 specifically comprises at least one of steps 21) to 25), and preferably comprises all of steps 21) to 25):
21) Calculating the number of times each candidate word occurs in the sample documents, and taking the candidate words whose occurrence count is greater than or equal to a count threshold as product feature words.
In step 21), the preset evaluation function is the term frequency (Term Frequency, TF) function. Specifically, all candidate words are traversed and the number of times each candidate word occurs in the sample documents is obtained; a count threshold (such as 10) is set, the candidate words whose occurrence count is less than the count threshold, and which therefore contribute little to classification, are deleted, and the candidate words whose occurrence count is greater than or equal to the count threshold are chosen as product feature words.
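A minimal sketch of the term-frequency screening in step 21), with pre-segmented toy documents and an illustrative threshold:

```python
from collections import Counter

# Step 21): keep candidate words whose total occurrence count in the
# sample documents reaches a count threshold.

def tf_filter(documents, threshold=2):
    counts = Counter(word for doc in documents for word in doc)
    return {w for w, c in counts.items() if c >= threshold}

docs = [["red", "dress", "cotton"], ["red", "shirt"], ["dress", "silk"]]
print(sorted(tf_filter(docs)))  # ['dress', 'red']
```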
22) Calculating the proportion of the sample documents containing a candidate word to the total number of sample documents, and taking the candidate words whose proportion lies within a preset range as product feature words.
Specifically, the document frequency P_Γ of each candidate word Γ, i.e. the proportion of the sample documents containing the candidate word to the total number of sample documents, is first calculated according to formula (3). Formula (3) is the preset evaluation function, called the document frequency (Document Frequency, DF) function.

P_Γ = n_Γ / n    formula (3)

where n_Γ is the number of sample documents containing the candidate word Γ, and n is the total number of sample documents.
A preset range, such as (0.005, 0.08), is set, and the candidate words Γ whose P_Γ lies within the preset range are screened out as product feature words.
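A minimal sketch of formula (3) and the screening in step 22); the documents and range bounds are illustrative:

```python
# Step 22): keep words whose document frequency (fraction of documents
# containing the word, formula (3)) falls inside a preset range.

def df_filter(documents, low=0.2, high=0.9):
    n = len(documents)
    vocab = {w for doc in documents for w in doc}
    kept = set()
    for w in vocab:
        df = sum(1 for doc in documents if w in doc) / n  # formula (3)
        if low < df < high:
            kept.add(w)
    return kept

docs = [["red", "dress"], ["red", "shirt"], ["red", "dress", "belt"], ["red"]]
print(sorted(df_filter(docs)))  # 'red' (df = 1.0) is excluded as too common
```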
23) Calculating the information gain weight of each candidate word, and taking the candidate words whose information gain weight is greater than an information gain weight threshold as product feature words.
Specifically, the information gain weight IG(Γ_k) of each candidate word Γ_k is first calculated according to formula (4). Formula (4) is the preset evaluation function, called the information gain (Information Gain, IG) function.

IG(Γ_k) = −Σ_{i=1}^{F} P(y_i) log P(y_i) + P(Γ_k) Σ_{i=1}^{F} P(y_i | Γ_k) log P(y_i | Γ_k) + P(Γ̄_k) Σ_{i=1}^{F} P(y_i | Γ̄_k) log P(y_i | Γ̄_k)    formula (4)

where Γ_k denotes the k-th candidate word, y_i denotes a preset category, F denotes the number of preset categories, P(y_i) denotes the probability that a sample document of category y_i occurs in the sample document set (the set formed by all sample documents), P(Γ_k) denotes the probability that a sample document containing the candidate word Γ_k occurs in the sample document set, P(y_i | Γ_k) denotes the conditional probability that a sample document belongs to category y_i when it contains the candidate word Γ_k, and P(y_i | Γ̄_k) denotes the conditional probability that a sample document belongs to category y_i when it does not contain the candidate word Γ_k.
An information gain weight threshold, such as 0.006, is set. After the information gain weight of each candidate word is obtained, the candidate words whose information gain weight is greater than this threshold are chosen as product feature words.
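A minimal sketch of formula (4) on toy labeled documents (natural logarithm; a word that perfectly predicts the category gets the full class entropy as its gain, while an uninformative word gets zero):

```python
import math

# Formula (4): information gain of a candidate word over labeled documents.

def info_gain(documents, labels, word):
    n = len(documents)
    cats = set(labels)
    with_w = [y for d, y in zip(documents, labels) if word in d]
    without_w = [y for d, y in zip(documents, labels) if word not in d]
    p_w = len(with_w) / n

    def cond_term(subset):
        # sum_i P(y_i | subset) * log P(y_i | subset)
        total = 0.0
        for c in cats:
            p = subset.count(c) / len(subset) if subset else 0.0
            if p > 0:
                total += p * math.log(p)
        return total

    base = -sum((labels.count(c) / n) * math.log(labels.count(c) / n)
                for c in cats)
    return base + p_w * cond_term(with_w) + (1 - p_w) * cond_term(without_w)

docs = [["red", "dress"], ["blue", "dress"], ["red", "phone"], ["blue", "phone"]]
labels = ["clothes", "clothes", "electronics", "electronics"]
print(round(info_gain(docs, labels, "dress"), 4))  # perfectly informative word
print(round(info_gain(docs, labels, "red"), 4))    # uninformative word
```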
24) Calculating the mutual information value of each candidate word, and taking the candidate words whose mutual information value is greater than a mutual information threshold as product feature words.
Specifically, the mutual information value MI(Γ_k, y_i) between each candidate word Γ_k and each category y_i is first calculated according to formula (5).

MI(Γ_k, y_i) = log [ P(Γ_k, y_i) / (P(Γ_k) P(y_i)) ]    formula (5)

Formula (5) can also be expressed as formula (6):

MI(Γ_k, y_i) = log P(Γ_k | y_i) − log P(Γ_k)    formula (6)

where P(Γ_k, y_i) is the probability that a sample document containing the candidate word Γ_k and belonging to the preset category y_i occurs in the sample document set, P(Γ_k) is the probability that the candidate word Γ_k occurs in the whole training sample set, P(y_i) is the probability that a sample document of category y_i occurs in the whole sample document set, and P(Γ_k | y_i) is the conditional probability that the candidate word Γ_k occurs in a sample document of category y_i. Formula (5) or (6) is the preset evaluation function, called the mutual information (Mutual Information, MI) function.
A mutual information threshold, such as 1.54, is set, and the candidate words whose mutual information value is greater than the threshold are chosen as product feature words.
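A minimal sketch of formulas (5) and (6) on toy data, estimating the probabilities from document counts (natural logarithm):

```python
import math

# Formula (5): pointwise mutual information between a word and a category.

def mutual_info(documents, labels, word, category):
    n = len(documents)
    p_w = sum(1 for d in documents if word in d) / n
    p_c = labels.count(category) / n
    p_wc = sum(1 for d, y in zip(documents, labels)
               if word in d and y == category) / n
    return math.log(p_wc / (p_w * p_c))  # formula (5)

docs = [["red", "dress"], ["silk", "dress"], ["red", "phone"], ["new", "phone"]]
labels = ["clothes", "clothes", "electronics", "electronics"]
# "dress" appears only in 'clothes' documents, so MI = log(2) > 0
print(round(mutual_info(docs, labels, "dress", "clothes"), 4))  # 0.6931
```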
25) Calculating the correlation between each candidate word and the preset categories according to the probabilities of whether the candidate word occurs in the training sample set and whether it belongs to a preset category, and taking the candidate words whose correlation is greater than a correlation threshold as product feature words.
Specifically, the correlation CHI(Γ_k, y_i) between each candidate word Γ_k and each preset category y_i is first calculated according to formula (7). Formula (7) is the preset evaluation function, called the chi-square (Chi-square, CHI) test function.

CHI(Γ_k, y_i) = n [ P(Γ_k, y_i) × P(Γ̄_k, ȳ_i) − P(Γ̄_k, y_i) × P(Γ_k, ȳ_i) ]² / [ P(Γ_k) × P(y_i) × P(Γ̄_k) × P(ȳ_i) ]    formula (7)

where n is the total number of sample documents in the training sample set, P(Γ_k, y_i) is the probability that a sample document containing the candidate word Γ_k and belonging to the preset category y_i occurs in the sample document set, P(Γ̄_k, ȳ_i) is the probability that a sample document neither containing the candidate word Γ_k nor belonging to the preset category y_i occurs in the sample document set, P(Γ_k, ȳ_i) is the probability that a sample document containing the candidate word Γ_k but not belonging to the preset category y_i occurs in the sample document set, and P(Γ̄_k, y_i) is the probability that a sample document not containing the candidate word Γ_k but belonging to the preset category y_i occurs in the sample document set.
A correlation threshold, such as 10, is set, and the candidate words whose correlation is greater than this threshold are screened out as product feature words.
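A minimal sketch of formula (7) on toy data; for a word that occurs exactly in the documents of one category, the statistic equals the document count n:

```python
# Formula (7): chi-square correlation between a word and a category,
# estimated from joint document probabilities.

def chi_square(documents, labels, word, category):
    n = len(documents)

    def p(has_word, in_cat):
        return sum(1 for d, y in zip(documents, labels)
                   if (word in d) == has_word and (y == category) == in_cat) / n

    p_w = p(True, True) + p(True, False)     # P(word)
    p_c = p(True, True) + p(False, True)     # P(category)
    num = n * (p(True, True) * p(False, False)
               - p(False, True) * p(True, False)) ** 2
    den = p_w * p_c * (1 - p_w) * (1 - p_c)
    return num / den if den else 0.0

docs = [["red", "dress"], ["silk", "dress"], ["red", "phone"], ["new", "phone"]]
labels = ["clothes", "clothes", "electronics", "electronics"]
print(chi_square(docs, labels, "dress", "clothes"))  # perfect correlation: 4.0
```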
Through all of the above steps 21) to 25), five sets of product feature words can be generated, correspondingly generating five kinds of product textual features, which can significantly improve the ability of the product textual features to describe the product to be classified, thereby improving the accuracy of classification.
In one embodiment, before step 304, the method further comprises: filtering out the candidate words included in a preset stop-word list. There may be words or characters among the candidate words, such as modal particles and auxiliary words, that interfere with classification. A stop-word list is therefore set in advance, and words or characters that interfere with classification are added to it; filtering out the candidate words included in the preset stop-word list avoids unnecessary calculation and saves product classification time.
Step 306, calculating product feature word weights according to the frequency of each product feature word in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature word.
Specifically, after the product feature words are screened out through the above steps 21) to 25), the product feature word weight W_i of each set of product feature words (each of steps 21) to 25) generates a corresponding set) is calculated according to formula (8):

W_i = TF_i(Γ, d) × n / DF(Γ)    formula (8)

where W_i is the product feature word weight of the i-th product feature word, TF_i(Γ, d) is the frequency with which the product feature word Γ occurs in sample document d, n denotes the total number of sample documents, and DF(Γ) is the number of documents containing the product feature word Γ.
Step 308, generating the product textual features of the product to be classified according to the product feature word weights.
Specifically, after the product feature word weight of each product feature word obtained in each of steps 21) to 25) is calculated according to formula (8), the product text can be converted into a vector whose dimensions are the product feature words, the attribute value of each dimension being the weight of the corresponding product feature word. Each of steps 21) to 25) yields one vector, i.e. one product textual feature. For one product text, all of steps 21) to 25) yield five vectors, i.e. five kinds of product textual features, thereby obtaining the product textual features of the product to be classified. Adopting five kinds of product textual features can improve the accuracy of product classification.
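A minimal sketch of formula (8) and the vector construction of step 308 (a TF-IDF-like weighting without the logarithm); the feature words and documents are illustrative:

```python
from collections import Counter

# Formula (8) and step 308: weight each feature word by its frequency in
# the document times n/DF, then represent the text as a vector over the
# feature words.

def text_feature_vector(doc, feature_words, all_docs):
    n = len(all_docs)
    tf = Counter(doc)
    vec = []
    for w in feature_words:
        df = sum(1 for d in all_docs if w in d)       # DF(word)
        vec.append(tf[w] * n / df if df else 0.0)     # formula (8)
    return vec

docs = [["red", "dress", "red"], ["red", "shirt"], ["dress", "silk"]]
features = ["red", "dress"]
print(text_feature_vector(docs[0], features, docs))  # [3.0, 1.5]
```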
In this embodiment, through the above steps 302 to 308, product textual features that accurately represent the product text are extracted from the product text of the product to be classified, which is conducive to correctly classifying the product to be classified.
As shown in Fig. 4, in one embodiment, step 104 comprises steps 402 to 408:
Step 402, segmenting a plurality of image blocks of the same size from the product image of the product to be classified, with overlapping parts between adjacent image blocks.
Specifically, the product to be classified corresponds to at least one product image, and image blocks are densely cut from every image, each block being 16 pixels wide and 16 pixels high. As shown in Fig. 5, during cutting, the cutting start point is moved with a step length of 8 in both the horizontal and vertical directions of the product image, so that image blocks adjacent in position have overlapping parts and every image is cut into many image blocks.
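A minimal sketch of the dense cutting in step 402, using a plain list-of-lists grayscale image; the 16-pixel block size and step length of 8 are taken from the text:

```python
# Step 402: densely cut 16x16 blocks with a stride of 8 in each direction,
# so that adjacent blocks overlap by half.

def cut_blocks(image, block=16, stride=8):
    h, w = len(image), len(image[0])
    blocks = []
    for top in range(0, h - block + 1, stride):
        for left in range(0, w - block + 1, stride):
            blocks.append([row[left:left + block]
                           for row in image[top:top + block]])
    return blocks

img = [[(r + c) % 256 for c in range(32)] for r in range(32)]  # 32x32 ramp
print(len(cut_blocks(img)))  # 9 blocks: 3 positions x 3 positions
```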
Step 404, the histogram of gradients feature of extraction image fritter.
Particularly, step 404 comprises step 31)~step 32):
Step 31), each image fritter is divided into formed objects and nonoverlapping a plurality of elementary area.
Particularly, as shown in Figure 6, by image fritter difference 4 deciles on horizontal stroke, longitudinal direction, obtain 16 elementary area C i, i=1,2 ..., 16.
Step 32), on each elementary area, add up the histogram of gradients feature of 8 directions, the histogram of gradients merging features of the corresponding elementary area of each image fritter is got up to obtain to the histogram of gradients feature of each image fritter.
Particularly, first according to formula (9) and formula (10), calculate each elementary area C iin Grad M (a, b) and the direction β (a, b) of each pixel:
M ( a , b ) = ( C i ( a + 1 , b ) - C i ( a - 1 , b ) ) 2 + ( C i ( a , b + 1 ) - C i ( a , b - 1 ) ) 2 Formula (9)
β ( a , b ) = arctan ( C i ( a , b + 1 ) - C i ( a , b - 1 ) C i ( a + 1 , b ) - C i ( a - 1 , b ) ) Formula (10)
Wherein, M (a, b) is each elementary area C iin the Grad of each pixel, β (a, b) is each elementary area C iin the direction of each pixel, a, b are respectively each elementary area C iin horizontal ordinate and the ordinate of each pixel.
Then according to each elementary area C iin the direction β (a, b) of each pixel, by each elementary area C iin the Grad M (a, b) of each pixel be added to vectorial h i, i=1,2 ..., in 16 in corresponding position, thereby obtain the histogram of gradients feature h of elementary area i.Again by the corresponding elementary area C of each image fritter ihistogram of gradients feature h ibe stitched together and obtain the gradient orientation histogram feature feat=(h of image fritter 1, h 2..., h 16).Wherein, feat is the proper vector of one 128 dimension.
In the present embodiment, steps 31) and 32) extract the gradient histogram feature of each image patch, which makes it convenient to generate the product image feature from these patch features.
Step 406: calculate the Euclidean distance between the gradient histogram feature of each image patch and each cluster center in a cluster center set learned in advance, and find and count the cluster center in the set nearest, in Euclidean distance, to the gradient histogram feature of each image patch.
Step 408: generate the product image feature from the cluster centers counted for the gradient histogram features of the image patches and the counting results.
First, the cluster center set must be learned in advance through steps 41) to 44):
Step 41): from the training sample set, choose a preset number of product samples for each preset category.
Specifically, from the product samples of each preset category in the training sample set, the preset number of product samples is chosen per category. For example, if the training sample set contains product samples of F categories and M product samples are chosen for each category, M×F product samples are obtained in total.
Step 42): divide the product sample images corresponding to the chosen product samples into a plurality of image sub-blocks of the same size, with adjacent image sub-blocks having overlapping parts.
Step 43): extract the gradient histogram feature of each image sub-block.
Steps 42) and 43), which segment image sub-blocks from the product sample images of the chosen product samples and extract their gradient histogram features, are essentially the same as steps 402 and 404 above, which segment image patches from the product images of the product to be classified and extract their gradient histogram features; the only difference is the object being processed, so the details are not repeated here.
Step 44): cluster the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
Specifically, steps 41) to 43) yield the set of gradient histogram features of the image sub-blocks, FEAT = {feat_1, feat_2, …, feat_m}, where m is the total number of image sub-blocks. With the preset number of cluster centers set to 1024, the k-means clustering algorithm is applied to the feature set FEAT, yielding 1024 cluster center points, denoted Dict = {d_1, d_2, …, d_1024}; Dict is the learned cluster center set.
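Step 44) can be sketched with a minimal k-means (Lloyd's algorithm) in NumPy. This is an illustrative stand-in, not the patent's implementation: the patent clusters into 1024 centers, while the toy example below uses a small k and random data so it runs quickly; empty clusters simply keep their previous center.

```python
import numpy as np

def kmeans(feats, k, iters=25, seed=0):
    """Minimal k-means over sub-block features: a sketch of step 44)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct features.
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature to its nearest center (Euclidean distance).
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned features.
        for j in range(k):
            members = feats[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers  # the learned dictionary Dict = {d_1, ..., d_k}

# Toy FEAT set: 200 random 128-d sub-block features, clustered into 16 centers.
FEAT = np.random.default_rng(1).random((200, 128))
Dict = kmeans(FEAT, k=16)
```

In the patent's setting FEAT would hold the m sub-block features from steps 41)–43) and k would be 1024.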
Step 404 yields the set of gradient histogram features of the image patches of the product to be classified, Feat = {feat_1, feat_2, …, feat_s}, where s is the total number of image patches.
In steps 406 and 408, specifically, the product image feature is first initialized as an all-zero vector R = [r_1, r_2, …, r_1024] whose length equals the number of elements in the cluster center set. For each gradient histogram feature feat_i in the set Feat, the Euclidean distance to each cluster center point in the cluster center set Dict is calculated according to formula (11), and the cluster center point with the minimum Euclidean distance to feat_i is found.
min_dx = arg min_j ||feat_i − d_j||_2  Formula (11)
where min_dx denotes the position of the cluster center point with the minimum Euclidean distance to the gradient histogram feature feat_i, and d_j ∈ Dict.
Then the cluster center point so found is counted according to formula (12):
r[min_dx] = r[min_dx] + 1  Formula (12)
The operations in steps 406 and 408 above are equivalent to letting the gradient histogram feature feat_i of each image patch vote over the cluster center set Dict. The final vector R = [r_1, r_2, …, r_1024] is the generated product image feature.
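This voting can be sketched as follows, again as an illustrative NumPy sketch with a toy-sized dictionary rather than the patent's 1024 centers.

```python
import numpy as np

def vote_feature(feats, centers):
    """Bag-of-visual-words voting: a sketch of steps 406-408.

    For each patch feature feat_i, find the nearest center (formula (11))
    and increment that position of R (formula (12)).
    """
    R = np.zeros(len(centers), dtype=int)  # all-zero vector, one slot per center
    for f in feats:
        min_dx = np.argmin(np.linalg.norm(centers - f, axis=1))
        R[min_dx] += 1
    return R

rng = np.random.default_rng(2)
Dict = rng.random((8, 128))   # toy dictionary (the patent uses 1024 centers)
Feat = rng.random((30, 128))  # toy patch features
R = vote_feature(Feat, Dict)
```

Since every patch casts exactly one vote, the entries of R sum to the number of patches s, so R is a fixed-length description of the image regardless of how many patches it was cut into.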
As shown in Figure 7, in one embodiment, step 202 comprises steps 702 to 708:
Step 702: segment the sample text into words to obtain words to be selected.
Step 704: screen sample feature words from the words to be selected according to preset evaluation functions.
In one embodiment, step 704 comprises at least one of steps 51) to 55), and preferably comprises all of steps 51) to 55):
Step 51): calculate the number of times each word to be selected occurs in the sample documents, and take words to be selected whose occurrence count meets a count threshold as sample feature words.
Specifically, all the words to be selected are traversed to obtain the number of times each occurs in the sample documents, and a count threshold (e.g., 10) is set. Words to be selected occurring fewer times than the threshold contribute very little to classification and are deleted, while words to be selected occurring at least the threshold number of times are chosen as sample feature words.
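Step 51) can be sketched as follows; a minimal pure-Python sketch, with a toy pre-segmented corpus and a low threshold instead of the 10 suggested above (both the corpus and the threshold are assumptions for the example).

```python
from collections import Counter

def frequency_filter(docs, threshold):
    """Step 51) sketch: keep words occurring at least `threshold` times
    across the segmented sample documents."""
    counts = Counter(word for doc in docs for word in doc)
    return {w for w, n in counts.items() if n >= threshold}

# Toy pre-segmented sample documents.
docs = [["wool", "sweater", "men"],
        ["cotton", "shirt", "men"],
        ["wool", "coat", "men"]]
feature_words = frequency_filter(docs, threshold=2)
```

Here "wool" (2 occurrences) and "men" (3) survive, while the single-occurrence words are dropped as contributing little to classification.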
Step 52): calculate the proportion of sample documents containing each word to be selected among the total number of sample documents, and take words to be selected whose proportion falls within a preset range as sample feature words.
Specifically, the document frequency of each word to be selected, i.e., the proportion of sample documents containing the word among the total number of sample documents, is first calculated according to formula (3). A preset range, such as (0.005, 0.08), is set, and words to be selected whose document frequency falls within this range are screened out as sample feature words.
Step 53): calculate the information gain weight of each word to be selected, and take words to be selected whose information gain weight exceeds an information gain weight threshold as sample feature words.
Specifically, an information gain weight threshold is set, e.g., 0.006. The information gain weight of each word to be selected is calculated according to formula (4), and words to be selected whose information gain weight exceeds this threshold are taken as sample feature words.
Step 54): calculate the mutual information value of each word to be selected, and take words to be selected whose mutual information value exceeds a mutual information threshold as sample feature words.
Specifically, a mutual information threshold is set, e.g., 1.54. The mutual information value of each word to be selected is calculated according to formula (5) or formula (6), and words to be selected whose mutual information value exceeds the threshold are taken as sample feature words.
Step 55): according to the probabilities of whether each word to be selected occurs in the training sample set and whether it belongs to a preset category, calculate the degree of correlation between the word to be selected and the preset category, and take words to be selected whose degree of correlation exceeds a correlation threshold as sample feature words.
Specifically, a correlation threshold is set, e.g., 10. According to the probabilities of whether each word to be selected occurs in the training sample set and whether it belongs to a preset category, the degree of correlation of each word to be selected is calculated by formula (7), and words to be selected whose degree of correlation exceeds the threshold are taken as sample feature words.
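Formula (7) itself is not reproduced in this excerpt; the standard chi-square statistic for word-category correlation is assumed below as a stand-in, since the application scenario later names a chi-square test among the five evaluation functions.

```python
def chi_square(A, B, C, D):
    """Word-category correlation: an assumed stand-in for formula (7).

    Standard chi-square statistic over the 2x2 occurrence table:
        A: documents of the category that contain the word
        B: documents of other categories that contain the word
        C: documents of the category that lack the word
        D: documents of other categories that lack the word
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# A word appearing in most "coat" documents and few others scores high.
score = chi_square(A=90, B=10, C=10, D=190)
```

A word whose occurrence is independent of the category (A·D = C·B) scores 0 and would be rejected by any positive threshold.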
Through all of steps 51) to 55), five groups of sample feature words can be generated, correspondingly yielding five kinds of sample text features, which significantly improves the ability of the sample text features to describe the product samples and thereby improves classification accuracy.
In one embodiment, before step 704 there is a further step of filtering out words to be selected that appear in a preset stop-word list. The words to be selected may include words that interfere with classification, such as modal particles and auxiliary words. A stop-word list is therefore set up in advance and such interfering words are added to it; filtering out words to be selected that appear in the preset stop-word list avoids unnecessary computation and saves time in product classification.
Step 706: calculate sample feature word weights according to the frequency with which each sample feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the sample feature word.
Specifically, after the sample feature words are screened out through steps 51) to 55) above, the sample feature word weights of each group of sample feature words (each of steps 51) to 55) generates one corresponding group) are calculated according to formula (8) above.
Step 708: generate the sample text features of the product samples according to the sample feature word weights.
Specifically, after the sample feature word weights are calculated, each sample text can be converted into a vector whose dimensions correspond to the sample feature words, the value of each dimension being the weight of the corresponding sample feature word. All of steps 51) to 55) together yield five kinds of sample text features, and using all five improves the classification accuracy of the trained classification model.
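The conversion of a document into a feature-word-indexed vector can be sketched as follows. Formula (8) is not shown in this excerpt, so a standard TF-IDF weight is assumed; it is built from exactly the three quantities step 706 names (frequency in the document, total number of documents, number of documents containing the word), but the patent's actual weighting may differ.

```python
import math

def tfidf_vector(doc, docs, feature_words):
    """Convert one segmented document into a feature-word-indexed vector.

    Assumed TF-IDF stand-in for formula (8): weight = tf * log(N / df).
    """
    n_docs = len(docs)
    vec = []
    for w in feature_words:
        tf = doc.count(w)                    # frequency in this document
        df = sum(1 for d in docs if w in d)  # documents containing w
        vec.append(tf * math.log(n_docs / df) if df else 0.0)
    return vec

# Toy pre-segmented sample documents and feature words.
docs = [["wool", "sweater"], ["wool", "coat"], ["cotton", "shirt"]]
vec = tfidf_vector(docs[0], docs, feature_words=["wool", "sweater", "shirt"])
```

Each of the five groups of sample feature words would produce one such vector, giving the five kinds of sample text features described above.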
As shown in Figure 8, in one embodiment, step 204 comprises:
Step 802: segment the sample images of the product samples in the training sample set into a plurality of small image blocks of the same size, with overlapping parts between adjacent small image blocks.
Specifically, for each of the at least one sample image of a product sample, small image blocks 16 pixels wide and 16 pixels high are cut densely from every image. As shown in Figure 5, during cutting the start point is moved in steps of 8 pixels along both the horizontal and vertical directions of the sample image, so adjacent small image blocks have overlapping parts; in this way every image is cut into many small image blocks. The process of segmenting small image blocks is essentially the same as the processes of segmenting image patches and image sub-blocks described above; the only difference is the source image being segmented.
Step 804: extract the gradient histogram feature of each small image block.
Specifically, step 804 comprises steps 61) and 62):
Step 61): divide each small image block into a plurality of non-overlapping sub-units of the same size.
Specifically, as shown in Figure 6, each small image block is divided into 4 equal parts in both the horizontal and vertical directions, yielding 16 sub-units. The process of dividing sub-units is essentially the same as the process of dividing image units described above; the only difference is the object being processed, so it is not repeated here.
Step 62): compute an 8-direction gradient histogram feature on each sub-unit, then concatenate the gradient histogram features of the sub-units of each small image block to obtain the gradient histogram feature of that small image block.
Specifically, the gradient magnitude and direction of each pixel in each sub-unit are first calculated according to formulas (9) and (10) above. Then, according to the direction of each pixel in each sub-unit, the gradient magnitude of that pixel is added to the corresponding position of a vector, thereby obtaining the gradient histogram feature of the sub-unit. The gradient histogram features of the sub-units of each small image block are then concatenated to obtain the gradient histogram feature of the small image block.
Step 806: calculate the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set learned in advance, and find and count the cluster center in the set nearest, in Euclidean distance, to the gradient histogram feature of each small image block.
Step 808: generate the sample image feature from the cluster centers counted for the gradient histogram features of the small image blocks and the counting results.
In steps 806 and 808, specifically, each sample image feature is first initialized as an all-zero vector whose length equals the number of elements in the cluster center set. For each gradient histogram feature in the set of gradient histogram features of the small image blocks of a product sample, the Euclidean distance to each cluster center point in the cluster center set is calculated; the cluster center point with the minimum Euclidean distance to the feature is found, and the count at the corresponding position of the initialized all-zero vector is incremented. The final vector obtained is the generated sample image feature.
The principle of the above product classification method is described below with a concrete application scenario. Suppose the training sample set contains five classes of e-commerce product samples, such as sweaters, T-shirts, coats, trousers, and shirts in men's clothing, with 300 products in each class. Each product sample corresponds to one sample document describing the product sample and at least one sample image, and all the sample documents in the training sample set form the sample document set.
As shown in Figure 9, each sample document is segmented into words to obtain words to be selected, and words to be selected that appear in the stop-word list are filtered out. Sample feature words are then screened from the words to be selected according to five evaluation functions: word frequency, document frequency, information gain, mutual information, and the chi-square test. The sample feature word weight of each sample feature word in each group of sample feature words is then calculated, yielding five one-dimensional vectors according to the sample feature word weights, i.e., five kinds of sample text features.
The sample images of the product samples are segmented into small image blocks, with positionally adjacent small image blocks overlapping. Each small image block is further divided into 16 sub-units, an 8-direction gradient histogram feature is computed on each sub-unit, and the gradient histogram features of the sub-units of each small image block are concatenated to obtain the gradient histogram feature of that block. The Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set learned in advance is then calculated, the cluster center nearest in Euclidean distance to each feature is counted, and the sample image feature, a one-dimensional vector, is generated from the counted cluster centers and counting results.
The sample text feature vectors and the sample image feature vector of each product sample are concatenated to obtain the sample feature of that product sample. A product classification model based on a support vector machine is then trained from the sample features of the product samples.
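The training step can be sketched with a minimal linear SVM (hinge loss, sub-gradient descent) as a stand-in for the patent's SVM model. This is an illustrative sketch only: two toy classes of separable "concatenated" feature vectors replace the real text-plus-image features, binary ±1 labels replace the five categories, and multi-class training (e.g., one-vs-rest) is omitted for brevity.

```python
import numpy as np

def train_linear_svm(X, y, epochs=200, lr=0.01, lam=0.01):
    """Minimal linear SVM trained by sub-gradient descent on the hinge
    loss, as a stand-in for the patent's SVM-based classification model.
    Labels y are +1/-1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:        # margin violated: hinge gradient
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                            # only the regularizer acts
                w -= lr * lam * w
    return w, b

rng = np.random.default_rng(3)
# Toy "sample features": two separable clusters of concatenated vectors.
X = np.vstack([rng.normal(+2, 0.3, (40, 10)), rng.normal(-2, 0.3, (40, 10))])
y = np.array([1] * 40 + [-1] * 40)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
```

Classifying a product then amounts to building its concatenated feature vector and evaluating the sign of the learned decision function.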
The product to be classified corresponds to a product document and at least one product image. The product document is segmented into words to obtain candidate words, and candidate words that appear in the stop-word list are filtered out. Product feature words are then screened from the candidate words according to five evaluation functions: word frequency, document frequency, information gain, mutual information, and the chi-square test. The product feature word weight of each product feature word in each group of product feature words is then calculated, yielding five one-dimensional vectors according to the product feature word weights, i.e., five kinds of product text features.
The product images of the product to be classified are segmented into image patches, with positionally adjacent image patches overlapping. Each image patch is further divided into 16 image units, an 8-direction gradient histogram feature is computed on each image unit, and the gradient histogram features of the image units of each image patch are concatenated to obtain the gradient histogram feature of that patch. The Euclidean distance between the gradient histogram feature of each image patch and each cluster center in the cluster center set learned in advance is then calculated, the cluster center nearest in Euclidean distance to each feature is counted, and the one-dimensional product image feature is generated from the counted cluster centers and counting results.
The product text feature vectors and the product image feature vector are concatenated to obtain the product feature. As shown in Figure 10, the product feature is input into the trained product classification model, which outputs a class label, i.e., the classification result.
As shown in Figure 11, in one embodiment, a product classification apparatus is provided, comprising a product text feature extraction module 1120, a product image feature extraction module 1140, a product feature generation module 1160, and a classification module 1180.
The product text feature extraction module 1120 is configured to extract a product text feature from a product text describing a product to be classified.
The product image feature extraction module 1140 is configured to extract a product image feature from a product image of the product to be classified.
The product feature generation module 1160 is configured to generate a product feature of the product to be classified from the product text feature and the product image feature.
The classification module 1180 is configured to input the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result.
As shown in Figure 12, in one embodiment, the training sample set comprises a plurality of product samples corresponding to preset categories, each product sample corresponding to a sample text describing the product sample and a sample image. The product classification apparatus further comprises a training module 1110, which comprises a sample text feature extraction module 1112, a sample image feature extraction module 1114, a sample feature generation module 1116, and a training execution module 1118.
The sample text feature extraction module 1112 is configured to extract sample text features from the sample texts of the product samples in the training sample set.
The sample image feature extraction module 1114 is configured to extract sample image features from the sample images of the product samples in the training sample set.
The sample feature generation module 1116 is configured to generate sample features from the sample text features and the sample image features.
The training execution module 1118 is configured to train a product classification model based on a support vector machine from the sample features.
In one embodiment, the sample texts are correspondingly stored in sample documents. As shown in Figure 13, the product text feature extraction module 1120 comprises a first word segmentation module 1122, a product feature word screening module 1124, a product feature word weight calculation module 1126, and a product text feature generation module 1128.
The first word segmentation module 1122 is configured to segment the product text into words to obtain candidate words.
The product feature word screening module 1124 is configured to screen product feature words from the candidate words according to preset evaluation functions.
The product feature word weight calculation module 1126 is configured to calculate product feature word weights according to the frequency with which each product feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature word.
The product text feature generation module 1128 is configured to generate the product text feature of the product to be classified according to the product feature word weights.
In one embodiment, the product text feature extraction module 1120 further comprises a candidate word filtering module 1123, configured to filter out candidate words that appear in a preset stop-word list.
As shown in Figure 14, in one embodiment, the product feature word screening module 1124 comprises at least one of a first screening module 1124a, a second screening module 1124b, a third screening module 1124c, a fourth screening module 1124d, and a fifth screening module 1124e.
The first screening module 1124a is configured to calculate the number of times each candidate word occurs in the sample documents and take candidate words whose occurrence count is at least a count threshold as product feature words.
The second screening module 1124b is configured to calculate the proportion of sample documents containing each candidate word among the total number of sample documents and take candidate words whose proportion falls within a preset range as product feature words.
The third screening module 1124c is configured to calculate the information gain weight of each candidate word and take candidate words whose information gain weight exceeds an information gain weight threshold as product feature words.
The fourth screening module 1124d is configured to calculate the mutual information value of each candidate word and take candidate words whose mutual information value exceeds a mutual information threshold as product feature words.
The fifth screening module 1124e is configured to calculate the degree of correlation between each candidate word and a preset category according to the probabilities of whether the candidate word occurs in the training sample set and whether it belongs to the preset category, and take candidate words whose degree of correlation exceeds a correlation threshold as product feature words.
As shown in Figure 15, in one embodiment, the product image feature extraction module 1140 comprises an image patch segmentation module 1142, an image patch feature extraction module 1144, a first statistics and counting module 1146, and a product image feature generation module 1148.
The image patch segmentation module 1142 is configured to segment the product image of the product to be classified into a plurality of image patches of the same size, with overlapping parts between adjacent image patches.
The image patch feature extraction module 1144 is configured to extract the gradient histogram feature of each image patch.
The first statistics and counting module 1146 is configured to calculate the Euclidean distance between the gradient histogram feature of each image patch and each cluster center in the cluster center set learned in advance, and to find and count the cluster center in the set nearest in Euclidean distance to the gradient histogram feature of each image patch.
The product image feature generation module 1148 is configured to generate the product image feature from the cluster centers counted for the gradient histogram features of the image patches and the counting results.
In one embodiment, the image patch feature extraction module 1144 comprises an image unit division module 1144a and a first feature concatenation module 1144b.
The image unit division module 1144a is configured to divide each image patch into a plurality of non-overlapping image units of the same size.
The first feature concatenation module 1144b is configured to compute an 8-direction gradient histogram feature on each image unit and concatenate the gradient histogram features of the image units of each image patch to obtain the gradient histogram feature of that image patch.
As shown in Figure 16, in one embodiment, the sample text feature extraction module 1112 comprises a second word segmentation module 1112a, a sample feature word screening module 1112c, a sample feature word weight calculation module 1112d, and a sample text feature generation module 1112e.
The second word segmentation module 1112a is configured to segment the sample text into words to obtain words to be selected.
The sample feature word screening module 1112c is configured to screen sample feature words from the words to be selected according to preset evaluation functions.
The sample feature word weight calculation module 1112d is configured to calculate sample feature word weights according to the frequency with which each sample feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the sample feature word.
The sample text feature generation module 1112e is configured to generate the sample text features of the product samples according to the sample feature word weights.
In one embodiment, the sample text feature extraction module 1112 further comprises a word-to-be-selected filtering module 1112b, configured to filter out words to be selected that appear in a preset stop-word list.
As shown in Figure 17, in one embodiment, the sample feature word screening module 1112c comprises at least one of a count-based screening module 1112c1, a document-proportion-based screening module 1112c2, an information-gain-weight-based screening module 1112c3, a mutual-information-based screening module 1112c4, and a correlation-based screening module 1112c5.
The count-based screening module 1112c1 is configured to calculate the number of times each word to be selected occurs in the sample documents and take words to be selected whose occurrence count exceeds a count threshold as sample feature words.
The document-proportion-based screening module 1112c2 is configured to calculate the proportion of sample documents containing each word to be selected among the total number of sample documents and take words to be selected whose proportion falls within a preset range as sample feature words.
The information-gain-weight-based screening module 1112c3 is configured to calculate the information gain weight of each word to be selected and take words to be selected whose information gain weight exceeds an information gain weight threshold as sample feature words.
The mutual-information-based screening module 1112c4 is configured to calculate the mutual information value of each word to be selected and take words to be selected whose mutual information value exceeds a mutual information threshold as sample feature words.
The correlation-based screening module 1112c5 is configured to calculate the degree of correlation between each word to be selected and a preset category according to the probabilities of whether the word to be selected occurs in the training sample set and whether it belongs to the preset category, and take words to be selected whose degree of correlation exceeds a correlation threshold as sample feature words.
As shown in Figure 18, in one embodiment, the sample image feature extraction module 1114 comprises a small image block segmentation module 1114a, a small image block feature extraction module 1114b, a second statistics and counting module 1114c, and a sample image feature generation module 1114d.
The small image block segmentation module 1114a is configured to segment the sample images of the product samples in the training sample set into a plurality of small image blocks of the same size, with overlapping parts between adjacent small image blocks.
The small image block feature extraction module 1114b is configured to extract the gradient histogram feature of each small image block.
The second statistics and counting module 1114c is configured to calculate the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set learned in advance, and to find and count the cluster center in the set nearest in Euclidean distance to the gradient histogram feature of each small image block.
The sample image feature generation module 1114d is configured to generate the sample image feature from the cluster centers counted for the gradient histogram features of the small image blocks and the counting results.
In one embodiment, the small image block feature extraction module 1114b comprises a sub-unit division module 1114b1 and a second feature concatenation module 1114b2.
The sub-unit division module 1114b1 is configured to divide each small image block into a plurality of non-overlapping sub-units of the same size.
The second feature concatenation module 1114b2 is configured to compute an 8-direction gradient histogram feature on each sub-unit and concatenate the gradient histogram features of the sub-units of each small image block to obtain the gradient histogram feature of that small image block.
As shown in Figure 19, in one embodiment, the product classification apparatus further comprises a cluster center set acquisition module 1130, which comprises a product sample selection module 1132, an image sub-block segmentation module 1134, a sub-image feature extraction module 1136, and a clustering module 1138.
The product sample selection module 1132 is configured to choose, from the training sample set, a preset number of product samples for each preset category.
The image sub-block segmentation module 1134 is configured to divide the product sample images corresponding to the chosen product samples into a plurality of image sub-blocks of the same size, with adjacent image sub-blocks having overlapping parts.
The sub-image feature extraction module 1136 is configured to extract the gradient histogram feature of each image sub-block.
The clustering module 1138 is configured to cluster the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent claims. It should be pointed out that a person of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (26)

1. A product classification method, the method comprising:
extracting a product text feature from a product text describing a product to be classified;
extracting a product image feature from a product image of the product to be classified;
generating a product feature of the product to be classified from the product text feature and the product image feature; and
inputting the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result.
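Reduced to code, claim 1 is a feature-fusion pipeline: two independently extracted vectors are merged into one product feature and handed to a pre-trained classifier. A minimal sketch (the claim does not say how the two features are combined, so simple concatenation is assumed; `model` stands for any pre-trained classifier exposing a `predict` method — both names are illustrative):

```python
def fuse_features(text_feature, image_feature):
    """Combine text and image feature vectors into one product feature."""
    return list(text_feature) + list(image_feature)

def classify_product(text_feature, image_feature, model):
    """Fuse the two features and obtain a category from a pre-trained model."""
    return model.predict(fuse_features(text_feature, image_feature))
```

Any concrete extraction of the text and image features (claims 3 and 6) can feed these functions unchanged.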
2. The method according to claim 1, characterized in that a training sample set comprises a plurality of product samples corresponding to preset categories, each product sample having a sample text and a sample image describing the product sample; the method further comprises a step of training the product classification model, comprising:
extracting sample text features from the sample texts of the product samples in the training sample set;
extracting sample image features from the sample images of the product samples in the training sample set;
generating sample features from the sample text features and the sample image features; and
training a product classification model based on a support vector machine with the sample features.
3. The method according to claim 2, characterized in that the sample texts are correspondingly stored in sample documents; the extracting a product text feature from a product text describing a product to be classified comprises:
segmenting the product text into words to obtain candidate words;
screening product feature words out of the candidate words according to a preset evaluation function;
computing product feature word weights from the frequency with which each product feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature word; and
generating the product text feature of the product to be classified from the product feature word weights.
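The weight in claim 3 is computed from exactly the three quantities of the classic TF-IDF scheme: the word's frequency, the total number of sample documents, and the number of documents containing the word. The claim does not fix the formula, so the standard tf·log(N/df) form is assumed in this illustrative sketch:

```python
import math

def tfidf_weight(term_freq, total_docs, docs_with_word):
    """TF-IDF-style weight: term frequency scaled by inverse document frequency."""
    # +1 in the denominator guards against words absent from every document.
    return term_freq * math.log(total_docs / (1 + docs_with_word))

def text_feature(feature_words, doc_words, total_docs, doc_freq):
    """Product text feature: one weight per screened product feature word."""
    return [tfidf_weight(doc_words.count(w), total_docs, doc_freq.get(w, 0))
            for w in feature_words]
```

Here `doc_freq` maps each feature word to its sample-document count; the resulting vector is the product text feature of claim 3's last step.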
4. The method according to claim 3, characterized in that, before the screening product feature words out of the candidate words according to a preset evaluation function, the method further comprises:
filtering out the candidate words that appear in a preset stop word list.
5. The method according to claim 3, characterized in that the screening product feature words out of the candidate words according to a preset evaluation function comprises:
computing the number of times each candidate word occurs in the sample documents, and taking the candidate words whose occurrence count is greater than or equal to a count threshold as product feature words; and/or
computing the proportion of the sample documents that contain each candidate word to the total number of sample documents, and taking the candidate words whose proportion lies within a preset range as product feature words; and/or
computing the information gain weight of each candidate word, and taking the candidate words whose information gain weight is greater than an information gain weight threshold as product feature words; and/or
computing the mutual information value of each candidate word, and taking the candidate words whose mutual information value is greater than a mutual information threshold as product feature words; and/or
computing the degree of correlation between each candidate word and the preset categories according to the probabilities of whether the candidate word occurs in the training sample set and whether it belongs to a preset category, and taking the candidate words whose degree of correlation is greater than a correlation threshold as product feature words.
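Two of the five evaluation functions of claim 5 are easy to make concrete: screening by document proportion, and pointwise mutual information between a word and a category. A sketch under assumed count inputs (thresholds and ranges are free parameters in the claim; all names are illustrative):

```python
import math

def screen_by_doc_proportion(candidates, docs, low, high):
    """Keep candidate words whose document proportion lies within [low, high]."""
    total = len(docs)
    kept = []
    for w in candidates:
        proportion = sum(1 for d in docs if w in d) / total
        if low <= proportion <= high:
            kept.append(w)
    return kept

def mutual_information(n_wc, n_w, n_c, n):
    """Pointwise MI of word w and category c from co-occurrence counts:
    log(P(w,c) / (P(w) * P(c))). A tiny epsilon guards zero co-occurrence."""
    p_wc = n_wc / n or 1e-12
    return math.log(p_wc / ((n_w / n) * (n_c / n)))
```

Words passing the chosen test(s) become the product feature words of claim 3.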
6. The method according to claim 1, characterized in that the extracting a product image feature from the product image of the product to be classified comprises:
partitioning the product image of the product to be classified into a plurality of image patches of the same size, with adjacent image patches overlapping;
extracting the gradient histogram features of the image patches;
computing the Euclidean distance between the gradient histogram feature of each image patch and each cluster center in a pre-learned cluster center set, and counting, for each cluster center in the cluster center set, the number of image patches whose gradient histogram feature is nearest to it in Euclidean distance; and
generating the product image feature from the cluster centers and count results obtained for the gradient histogram features of the image patches.
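Steps three and four of claim 6 describe a bag-of-visual-words histogram: each image patch votes for its nearest cluster center, and the per-center vote counts form the product image feature. A minimal sketch (features and centers as plain lists of floats):

```python
def bovw_histogram(patch_features, centers):
    """Count, per cluster center, the patches nearest to it in Euclidean distance."""
    counts = [0] * len(centers)
    for f in patch_features:
        nearest = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(f, centers[i])),
        )
        counts[nearest] += 1
    return counts
```

The histogram length equals the preset number of cluster centers, so images with different patch counts still yield fixed-length features.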
7. The method according to claim 6, characterized in that the extracting the gradient histogram features of the image patches comprises:
dividing each image patch into a plurality of non-overlapping image units of the same size; and
computing an 8-direction gradient histogram feature on each image unit, and concatenating the gradient histogram features of the image units of each image patch to obtain the gradient histogram feature of that image patch.
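The per-unit statistic of claim 7 can be sketched with finite-difference gradients binned into 8 orientations. An illustrative version for one unit, magnitude-weighted as in typical HOG implementations (the claim does not specify the weighting, so that detail is an assumption):

```python
import math

def unit_gradient_histogram(unit):
    """8-direction gradient histogram for one image unit (2-D intensity grid)."""
    hist = [0.0] * 8
    h, w = len(unit), len(unit[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = unit[y][x + 1] - unit[y][x - 1]   # horizontal central difference
            gy = unit[y + 1][x] - unit[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)      # map into [0, 2*pi)
            hist[int(angle / (2 * math.pi / 8)) % 8] += magnitude
    return hist

def patch_feature(units):
    """Concatenate the per-unit histograms into the patch's feature vector."""
    feature = []
    for unit in units:
        feature.extend(unit_gradient_histogram(unit))
    return feature
```

A patch divided into n units thus yields an 8n-dimensional gradient histogram feature.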
8. The method according to claim 2, characterized in that the extracting sample text features from the sample texts of the product samples in the training sample set comprises:
segmenting the sample texts into words to obtain words to be selected;
screening sample feature words out of the words to be selected according to a preset evaluation function;
computing sample feature word weights from the frequency with which each sample feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the sample feature word; and
generating the sample text features of the product samples from the sample feature word weights.
9. The method according to claim 8, characterized in that, before the screening sample feature words out of the words to be selected according to a preset evaluation function, the method further comprises:
filtering out the words to be selected that appear in a preset stop word list.
10. The method according to claim 8, characterized in that the screening sample feature words out of the words to be selected according to a preset evaluation function comprises:
computing the number of times each word to be selected occurs in the sample documents, and taking the words to be selected whose occurrence count is greater than a count threshold as sample feature words; and/or
computing the proportion of the sample documents that contain each word to be selected to the total number of sample documents, and taking the words to be selected whose proportion lies within a preset range as sample feature words; and/or
computing the information gain weight of each word to be selected, and taking the words to be selected whose information gain weight is greater than an information gain weight threshold as sample feature words; and/or
computing the mutual information value of each word to be selected, and taking the words to be selected whose mutual information value is greater than a mutual information threshold as sample feature words; and/or
computing the degree of correlation between each word to be selected and the preset categories according to the probabilities of whether the word occurs in the training sample set and whether it belongs to a preset category, and taking the words to be selected whose degree of correlation is greater than a correlation threshold as sample feature words.
11. The method according to claim 2, characterized in that the extracting sample image features from the sample images of the product samples in the training sample set comprises:
partitioning the sample images of the product samples in the training sample set into a plurality of small image blocks of the same size, with adjacent small image blocks overlapping;
extracting the gradient histogram features of the small image blocks;
computing the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in a pre-learned cluster center set, and counting, for each cluster center in the cluster center set, the number of small image blocks whose gradient histogram feature is nearest to it in Euclidean distance; and
generating the sample image features from the cluster centers and count results obtained for the gradient histogram features of the small image blocks.
12. The method according to claim 11, characterized in that the extracting the gradient histogram features of the small image blocks comprises:
dividing each small image block into a plurality of non-overlapping subunits of the same size; and
computing an 8-direction gradient histogram feature on each subunit, and concatenating the gradient histogram features of the subunits of each small image block to obtain the gradient histogram feature of that small image block.
13. The method according to claim 6 or 11, characterized in that the method further comprises a step of learning the cluster center set, comprising:
selecting a preset number of product samples for each preset category from the training sample set;
dividing the product sample images corresponding to the selected product samples into a plurality of image sub-blocks of the same size, with adjacent image sub-blocks overlapping;
extracting the gradient histogram features of the image sub-blocks; and
clustering the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
14. A product classification device, characterized in that the device comprises:
a product text feature extraction module, configured to extract a product text feature from a product text describing a product to be classified;
a product image feature extraction module, configured to extract a product image feature from a product image of the product to be classified;
a product feature generation module, configured to generate a product feature of the product to be classified from the product text feature and the product image feature; and
a classification module, configured to input the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result.
15. The device according to claim 14, characterized in that a training sample set comprises a plurality of product samples corresponding to preset categories, each product sample having a sample text and a sample image describing the product sample; the device further comprises a training module, comprising:
a sample text feature extraction module, configured to extract sample text features from the sample texts of the product samples in the training sample set;
a sample image feature extraction module, configured to extract sample image features from the sample images of the product samples in the training sample set;
a sample feature generation module, configured to generate sample features from the sample text features and the sample image features; and
a training execution module, configured to train a product classification model based on a support vector machine with the sample features.
16. The device according to claim 15, characterized in that the sample texts are correspondingly stored in sample documents; the product text feature extraction module comprises:
a first word segmentation module, configured to segment the product text into words to obtain candidate words;
a product feature word screening module, configured to screen product feature words out of the candidate words according to a preset evaluation function;
a product feature word weight computing module, configured to compute product feature word weights from the frequency with which each product feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature word; and
a product text feature generation module, configured to generate the product text feature of the product to be classified from the product feature word weights.
17. The device according to claim 16, characterized in that the product text feature extraction module further comprises a candidate word filtering module, configured to filter out the candidate words that appear in a preset stop word list.
18. The device according to claim 16, characterized in that the product feature word screening module comprises at least one of a first screening module, a second screening module, a third screening module, a fourth screening module and a fifth screening module:
the first screening module is configured to compute the number of times each candidate word occurs in the sample documents, and to take the candidate words whose occurrence count is greater than or equal to a count threshold as product feature words;
the second screening module is configured to compute the proportion of the sample documents that contain each candidate word to the total number of sample documents, and to take the candidate words whose proportion lies within a preset range as product feature words;
the third screening module is configured to compute the information gain weight of each candidate word, and to take the candidate words whose information gain weight is greater than an information gain weight threshold as product feature words;
the fourth screening module is configured to compute the mutual information value of each candidate word, and to take the candidate words whose mutual information value is greater than a mutual information threshold as product feature words;
the fifth screening module is configured to compute the degree of correlation between each candidate word and the preset categories according to the probabilities of whether the candidate word occurs in the training sample set and whether it belongs to a preset category, and to take the candidate words whose degree of correlation is greater than a correlation threshold as product feature words.
19. The device according to claim 14, characterized in that the product image feature extraction module comprises:
an image patch segmentation module, configured to partition the product image of the product to be classified into a plurality of image patches of the same size, with adjacent image patches overlapping;
an image patch feature extraction module, configured to extract the gradient histogram features of the image patches;
a first statistics and counting module, configured to compute the Euclidean distance between the gradient histogram feature of each image patch and each cluster center in a pre-learned cluster center set, and to count, for each cluster center in the cluster center set, the number of image patches whose gradient histogram feature is nearest to it in Euclidean distance; and
a product image feature generation module, configured to generate the product image feature from the cluster centers and count results obtained for the gradient histogram features of the image patches.
20. The device according to claim 19, characterized in that the image patch feature extraction module comprises:
an image unit division module, configured to divide each image patch into a plurality of non-overlapping image units of the same size; and
a first feature concatenation module, configured to compute an 8-direction gradient histogram feature on each image unit, and to concatenate the gradient histogram features of the image units of each image patch to obtain the gradient histogram feature of that image patch.
21. The device according to claim 15, characterized in that the sample text feature extraction module comprises:
a second word segmentation module, configured to segment the sample texts into words to obtain words to be selected;
a sample feature word screening module, configured to screen sample feature words out of the words to be selected according to a preset evaluation function;
a sample feature word weight computing module, configured to compute sample feature word weights from the frequency with which each sample feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the sample feature word; and
a sample text feature generation module, configured to generate the sample text features of the product samples from the sample feature word weights.
22. The device according to claim 21, characterized in that the sample text feature extraction module further comprises a word-to-be-selected filtering module, configured to filter out the words to be selected that appear in a preset stop word list.
23. The device according to claim 21, characterized in that the sample feature word screening module comprises at least one of a count-based screening module, a document-proportion-based screening module, an information-gain-based screening module, a mutual-information-based screening module and a correlation-based screening module:
the count-based screening module is configured to compute the number of times each word to be selected occurs in the sample documents, and to take the words to be selected whose occurrence count is greater than a count threshold as sample feature words;
the document-proportion-based screening module is configured to compute the proportion of the sample documents that contain each word to be selected to the total number of sample documents, and to take the words to be selected whose proportion lies within a preset range as sample feature words;
the information-gain-based screening module is configured to compute the information gain weight of each word to be selected, and to take the words to be selected whose information gain weight is greater than an information gain weight threshold as sample feature words;
the mutual-information-based screening module is configured to compute the mutual information value of each word to be selected, and to take the words to be selected whose mutual information value is greater than a mutual information threshold as sample feature words;
the correlation-based screening module is configured to compute the degree of correlation between each word to be selected and the preset categories according to the probabilities of whether the word occurs in the training sample set and whether it belongs to a preset category, and to take the words to be selected whose degree of correlation is greater than a correlation threshold as sample feature words.
24. The device according to claim 15, characterized in that the sample image feature extraction module comprises:
a small image block segmentation module, configured to partition the sample images of the product samples in the training sample set into a plurality of small image blocks of the same size, with adjacent small image blocks overlapping;
a small image block feature extraction module, configured to extract the gradient histogram features of the small image blocks;
a second statistics and counting module, configured to compute the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in a pre-learned cluster center set, and to count, for each cluster center in the cluster center set, the number of small image blocks whose gradient histogram feature is nearest to it in Euclidean distance; and
a sample image feature generation module, configured to generate the sample image features from the cluster centers and count results obtained for the gradient histogram features of the small image blocks.
25. The device according to claim 24, characterized in that the small image block feature extraction module comprises:
a subunit division module, configured to divide each small image block into a plurality of non-overlapping subunits of the same size; and
a second feature concatenation module, configured to compute an 8-direction gradient histogram feature on each subunit, and to concatenate the gradient histogram features of the subunits of each small image block to obtain the gradient histogram feature of that small image block.
26. The device according to claim 19 or 24, characterized in that the device further comprises a cluster center set acquisition module, comprising:
a product sample selection module, configured to select a preset number of product samples for each preset category from the training sample set;
an image sub-block division module, configured to divide the product sample images corresponding to the selected product samples into a plurality of image sub-blocks of the same size, with adjacent image sub-blocks overlapping;
a sub-image feature extraction module, configured to extract the gradient histogram features of the image sub-blocks; and
a clustering module, configured to cluster the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
CN201310692950.0A 2013-12-16 2013-12-16 Product classification method and apparatus Active CN103699523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310692950.0A CN103699523B (en) 2013-12-16 2013-12-16 Product classification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310692950.0A CN103699523B (en) 2013-12-16 2013-12-16 Product classification method and apparatus

Publications (2)

Publication Number Publication Date
CN103699523A true CN103699523A (en) 2014-04-02
CN103699523B CN103699523B (en) 2016-06-29

Family

ID=50361054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310692950.0A Active CN103699523B (en) 2013-12-16 2013-12-16 Product classification method and apparatus

Country Status (1)

Country Link
CN (1) CN103699523B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315663A (en) * 2008-06-25 2008-12-03 中国人民解放军国防科学技术大学 Nature scene image classification method based on area dormant semantic characteristic
US20120314941A1 (en) * 2011-06-13 2012-12-13 Microsoft Corporation Accurate text classification through selective use of image data

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KOEN E.A. VAN DE SANDE et al.: "Segmentation as Selective Search for Object Recognition", 2011 IEEE International Conference on Computer Vision *
SONG LIPING: "Research on feature selection methods in text classification", China Master's Theses Full-text Database, Information Science and Technology *
ZHENG WEI: "Research on feature selection techniques for text classification", China Master's Theses Full-text Database, Information Science and Technology *
LEI QING et al.: "Research on motion representation with local spatio-temporal features in action recognition", Computer Engineering and Applications *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095396A (en) * 2015-07-03 2015-11-25 北京京东尚科信息技术有限公司 Model establishment method, quality assessment method and device
CN107683469A (en) * 2015-12-30 2018-02-09 中国科学院深圳先进技术研究院 A kind of product classification method and device based on deep learning
WO2017113232A1 (en) * 2015-12-30 2017-07-06 中国科学院深圳先进技术研究院 Product classification method and apparatus based on deep learning
CN105824512A (en) * 2016-03-11 2016-08-03 杨晟志 Directory interaction system based on virtual map
CN105824889A (en) * 2016-03-11 2016-08-03 杨晟志 Classification method based on virtual map
CN107346433A (en) * 2016-05-06 2017-11-14 华为技术有限公司 A kind of text data sorting technique and server
CN107346433B (en) * 2016-05-06 2020-09-18 华为技术有限公司 Text data classification method and server
CN106021350A (en) * 2016-05-10 2016-10-12 湖北工程学院 An artwork collection and management method and an artwork collection and management system
CN106250398A (en) * 2016-07-19 2016-12-21 北京京东尚科信息技术有限公司 A kind of complaint classifying content decision method complaining event and device
CN106250398B (en) * 2016-07-19 2020-03-27 北京京东尚科信息技术有限公司 Method and device for classifying and judging complaint content of complaint event
CN107784372B (en) * 2016-08-24 2022-02-22 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN107784372A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Forecasting Methodology, the device and system of destination object attribute
CN106919954A (en) * 2017-03-02 2017-07-04 深圳明创自控技术有限公司 A kind of cloud computing system for commodity classification
CN107133208B (en) * 2017-03-24 2021-08-24 南京柯基数据科技有限公司 Entity extraction method and device
CN107133208A (en) * 2017-03-24 2017-09-05 南京缘长信息科技有限公司 The method and device that a kind of entity is extracted
CN107220875B (en) * 2017-05-25 2020-09-22 黄华 Electronic commerce platform with good service
CN107220875A (en) * 2017-05-25 2017-09-29 深圳众厉电力科技有限公司 It is a kind of to service good e-commerce platform
CN107194739A (en) * 2017-05-25 2017-09-22 上海耐相智能科技有限公司 A kind of intelligent recommendation system based on big data
CN109241379A (en) * 2017-07-11 2019-01-18 北京交通大学 A method of across Modal detection network navy
CN108256549A (en) * 2017-12-13 2018-07-06 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN107977794A (en) * 2017-12-14 2018-05-01 方物语(深圳)科技文化有限公司 Data processing method, device, computer equipment and the storage medium of industrial products
WO2020037762A1 (en) * 2018-08-21 2020-02-27 深圳码隆科技有限公司 Product information identification method and system
CN110852329A (en) * 2019-10-21 2020-02-28 南京航空航天大学 Method for defining product appearance attribute
WO2021087770A1 (en) * 2019-11-05 2021-05-14 深圳市欢太科技有限公司 Picture classification method and apparatus, and storage medium and electronic device
CN111368926A (en) * 2020-03-06 2020-07-03 腾讯科技(深圳)有限公司 Image screening method, device and computer readable storage medium
CN111368926B (en) * 2020-03-06 2021-07-06 腾讯科技(深圳)有限公司 Image screening method, device and computer readable storage medium
CN113837214A (en) * 2020-06-23 2021-12-24 财团法人亚洲大学 Image verification method and product real-time authentication system
CN112101018A (en) * 2020-08-05 2020-12-18 中国工业互联网研究院 Method and system for calculating new words in text based on word frequency matrix eigenvector
CN112101018B (en) * 2020-08-05 2024-03-12 北京工联科技有限公司 Method and system for calculating new words in text based on word frequency matrix feature vector
CN113570427A (en) * 2021-07-22 2021-10-29 上海普洛斯普新数字科技有限公司 System for extracting and identifying on-line or system commodity characteristic information
CN113962773A (en) * 2021-10-22 2022-01-21 广州华多网络科技有限公司 Same-style commodity polymerization method and device, equipment, medium and product thereof

Also Published As

Publication number Publication date
CN103699523B (en) 2016-06-29

Similar Documents

Publication Publication Date Title
CN103699523A (en) Product classification method and device
US11715313B2 (en) Apparatus and methods for extracting data from lineless table using delaunay triangulation and excess edge removal
CN107609121A (en) News text classification method based on LDA and word2vec algorithms
CN102156871B (en) Image classification method based on category correlated codebook and classifier voting strategy
CN109063649B (en) Pedestrian re-identification method based on a Siamese pedestrian-alignment residual network
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN106909946A (en) Multi-modal fusion picking system
CN101329731A (en) Automatic recognition method for mathematical formulas in images
CN105139041A (en) Method and device for recognizing languages based on image
CN101329734A (en) License plate character recognition method based on K-L transform and LS-SVM
CN111652332A (en) Deep-learning handwritten Chinese character recognition method and system based on binary classification
CN102289522A (en) Method of intelligently classifying texts
CN103886077B (en) Short text clustering method and system
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN105574540A (en) Method for learning and automatically classifying pest image features based on unsupervised learning technology
CN105912525A (en) Semi-supervised sentiment classification method based on topic features
CN104142960A (en) Internet data analysis system
CN102004796B (en) Non-blocking hierarchical classification method and device for webpage texts
CN102768732A (en) Face recognition method integrating sparsity-preserving projection and multi-class attribute Bagging
CN101655911A (en) Pattern recognition method based on an immune antibody network
CN103258186A (en) Integrated face recognition method based on image segmentation
CN109472020A (en) Chinese word segmentation method based on feature alignment
CN103324942B (en) Image classification method, apparatus and system
CN104331717A (en) Image classification method integrating feature dictionary construction and visual feature coding
CN106503706B (en) Method for judging the correctness of Chinese character glyph segmentation results

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant