CN116166805B - Commodity coding prediction method and device - Google Patents
- Publication number: CN116166805B (application CN202310174800.4A)
- Authority
- CN
- China
- Prior art keywords
- cosmetic
- feature vector
- frequency
- commodity
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/35—Information retrieval of unstructured textual data; Clustering; Classification
- G06F16/2453—Query optimisation
- G06F16/2474—Sequence data queries, e.g. querying versioned data
- G06F40/216—Parsing using statistical methods
- G06F40/237—Lexical tools
- Y02P90/30—Computing systems specially adapted for manufacturing (climate change mitigation technologies in the production or processing of goods)
Abstract
The application discloses a commodity code prediction method and device, wherein the method comprises the following steps: taking the commodity information of a cosmetic to be declared as a prediction sample, and calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock; predicting the cosmetic category of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock; calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock; and predicting the commodity code of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock. The embodiment of the application can improve the accuracy of classification and the matching degree of commodity codes, thereby saving the time spent querying commodity codes.
Description
Technical Field
The application belongs to the technical field of big data, and particularly relates to a method and a device for predicting commodity codes.
Background
When cosmetics enterprises declare import and export goods, the HS (Harmonized System) code of the goods must be filled in on the customs declaration attached to the goods. The HS code is an international trade commodity classification system code, mainly used by customs personnel to confirm commodity categories, carry out commodity classification management, verify tariff standards and check commodity quality indexes. The HS coding system currently used in China consists of ten digits; usually one commodity corresponds to only one HS code, while one HS code may correspond to more than one commodity. Correctly filling in the HS code can accelerate the customs process, ensure smooth clearance of the goods, and avoid extra cost or delay. If the HS code is wrongly classified, the normal order of customs is disturbed, and in serious cases the enterprise is administratively penalized by customs.
In order to fill in cosmetic commodity codes accurately, enterprise declaration personnel need to know the basics of HS code classification as well as the properties, characteristics and uses of the commodity itself. This requires knowledge accumulated over years, and not everyone can quickly and skilfully classify and distinguish the HS codes of commodities. Currently, there are many websites that can query HS codes by taking a keyword entered by the user and returning all relevant HS codes that contain the keyword. However, the query results are numerous, span different categories, lack hierarchical relationships and have a low matching degree, which increases the time cost for enterprises to query commodity codes.
Content of the application
The embodiment of the application aims to provide a method and a device for predicting commodity codes, which are used for solving the defect of low matching degree of commodity code query in the prior art.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, a method of predicting commodity coding is provided, comprising the steps of:
taking commodity information of cosmetics to be declared as a prediction sample, and calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic class;
predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
predicting the commodity code of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
In a second aspect, there is provided an apparatus for predicting commodity codes, comprising:
the first calculation module is used for taking commodity information of cosmetics to be declared as a prediction sample, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category;
the first prediction module is used for predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the second calculation module is used for calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, and each second-layer word stock corresponds to one commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
and the second prediction module is used for predicting the commodity code to which the cosmetics to be declared belong according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
According to the embodiment of the application, the classification of the cosmetics to be declared and the commodity code to be declared are predicted according to the word frequency-inverse text frequency of each feature vector in the prediction sample and the correlation between the cosmetic attribute and the cosmetic class, so that the classification accuracy and the matching degree of the commodity code can be improved, and the commodity code inquiring time is further saved.
Drawings
FIG. 1 is a flow chart of a method for predicting commodity codes according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an apparatus for predicting commodity codes according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to solve the problems in the prior art, the embodiment of the application provides a method for searching cosmetic HS codes based on an improved naive Bayes classifier, which is improved from the following three aspects:
1. Because the number of commodity codes contained under the cosmetics category is large and a single multi-class model is difficult to establish, the embodiment of the application constructs a two-layer classification model: the first layer predicts the cosmetic category to which the commodity belongs, and the second layer predicts the specific commodity code to which the commodity belongs.
2. TF-IDF values are calculated in place of the conditional probabilities in the naive Bayes model, so that the importance of terms is taken into account during classification.
3. The correlation between attributes and categories is calculated and different attributes are given different weights: the higher the correlation between an attribute and the categories, the greater the importance of the attribute for classification, and therefore the higher the weight given to the attribute.
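The three improvements above can be combined into a single scoring rule. The sketch below is a minimal illustration, under the assumption (not stated explicitly in this summary) that the attribute weights enter as exponents on the TF-IDF terms, i.e. as multipliers in log space; all class names and numbers are hypothetical.

```python
import math

def improved_nb_score(prior, tfidf_values, weights):
    """Log-space score of one class: log prior plus the
    attribute-weighted sum of log TF-IDF values."""
    return math.log(prior) + sum(
        w * math.log(v) for w, v in zip(weights, tfidf_values))

# Hypothetical TF-IDF values of three feature words in two class word stocks.
priors = {"lip": 0.3, "eye": 0.2}
tfidf = {"lip": [0.02, 0.10, 0.05], "eye": [0.01, 0.03, 0.04]}
weights = [0.5, 1.2, 0.8]  # mutual-information weights w_i

best = max(priors, key=lambda c: improved_nb_score(priors[c], tfidf[c], weights))
```

With these numbers the first layer would return the "lip" category, since its weighted score is higher.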
In order to achieve the above object, the embodiment of the present application provides the following technical solutions:
1. and acquiring historical declaration data of cosmetics, and establishing a label extraction model based on commodity information filled in by enterprises.
2. Based on the improved naive Bayes classifier, a first-layer classification model is established, and the cosmetic category to which the commodity belongs is predicted.
2.1 definition of first layer classification model categories
2.2 Acquire historical declaration data of cosmetics as training samples, and establish five word stocks according to cosmetic category.
2.3 Acquire the commodity information of the cosmetics to be declared entered by the enterprise as a prediction sample, and apply the label extraction of step 1 to the prediction sample.
2.4 defines an improved naive bayes classifier:
2.5 calculating the prior probability for each cosmetic class
2.6 Calculate the TF-IDF value, i.e. the word frequency-inverse text frequency, of each feature vector $x_i$ in the prediction sample $X=\{x_1,x_2,\dots,x_n\}$, used to evaluate the importance of words in the word stock.
2.7 Calculate the correlation between the cosmetic attributes and the cosmetic categories, and give each attribute a weight $w_i$ according to the correlation; the higher the correlation, the greater the weight.
2.8 Based on the calculation results of steps 2.5, 2.6 and 2.7, calculate the probability $P(L_j^1 \mid X)$ that the prediction sample $X=\{x_1,x_2,\dots,x_n\}$ belongs to each category, and then select the cosmetic category with the greatest probability as the prediction result of the first-layer classification model.
3. And based on the improved naive Bayes classifier, establishing a second-layer classification model, and predicting the specific commodity code to which the commodity belongs.
3.1 Based on the cosmetic category of the commodity predicted by the first-layer classification model, establish a second-layer classification model under that category and define the second-layer classification model categories $L^2$.
3.2 Based on the training samples and the categories $L^2$ of the second-layer classification model, establish a word stock for each category.
3.3 Repeat steps 2.3 to 2.8 to calculate the maximum posterior probability $P(L_k^2 \mid X)$; the category with the highest probability is the predicted commodity code of the commodity.
The method for predicting commodity codes provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of a method for predicting commodity coding according to an embodiment of the present application is provided, where the method includes the following steps:
step 101, commodity information of cosmetics to be declared is used as a prediction sample, word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock is calculated, each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category.
Step 102, predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category, and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data.
Specifically, the weight coefficient of each feature vector in the prediction sample can be calculated according to the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each cosmetic class j
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,for the prior probability of cosmetic class j, tf ij For the feature vector x i The frequency of occurrence in the first layer word stock corresponding to cosmetic class j; idf (idf) i For the feature vector x i The frequency of the reverse text in the first-layer word stock; w (w) i For the feature vector x i Weight coefficient of (2);
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
In this embodiment, before calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each first layer word stock, historical declaration data of cosmetics may be further obtained as a training sample, and each declaration data in the training sample is classified according to a cosmetic class corresponding to a commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
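Building the first-layer word stocks described above can be sketched as follows; the records, the eight-digit prefix mapping and the category names are hypothetical placeholders for the historical declaration data.

```python
from collections import Counter, defaultdict

# Hypothetical historical declarations: (HS code, labels extracted in step 1).
HISTORY = [
    ("3304100010", ["lipstick", "lip", "moisturizing"]),
    ("3304100090", ["lip", "gloss"]),
    ("3304200010", ["eye", "mascara"]),
]

# Assumed mapping from the first eight HS digits to a cosmetic category.
PREFIX_TO_CATEGORY = {"33041000": "lip", "33042000": "eye"}

def build_first_layer_stocks(history):
    """Group declarations by cosmetic category and count label occurrences."""
    stocks = defaultdict(Counter)
    for hs_code, labels in history:
        category = PREFIX_TO_CATEGORY.get(hs_code[:8])
        if category is not None:
            stocks[category].update(labels)
    return stocks

stocks = build_first_layer_stocks(HISTORY)
```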
Step 103, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs.
And 104, predicting the commodity code to which the cosmetics to be declared belong according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
Specifically, the weight coefficient of each feature vector in the prediction sample can be calculated according to the correlation between the cosmetic attribute and commodity code in the historical declaration data;
the probability that the prediction sample $X=\{x_1,x_2,\dots,x_n\}$ belongs to each commodity code $k$ can then be calculated by the following formula:

$$P(L_k^2 \mid X) = P(L_k^2) \prod_{i=1}^{n} \left( tf_{ik} \times idf_i \right)^{w_i}$$

wherein $\{x_1,x_2,\dots,x_n\}$ are the feature vectors in the prediction sample; $P(L_k^2)$ is the prior probability of commodity code $k$; $tf_{ik}$ is the frequency of occurrence of the feature vector $x_i$ in the second-layer word stock corresponding to commodity code $k$; $idf_i$ is the inverse text frequency of the feature vector $x_i$ in the second-layer word stocks; and $w_i$ is the weight coefficient of the feature vector $x_i$;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
In this embodiment, before calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, the plurality of pieces of declaration data may be further classified according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
According to the embodiment of the application, the classification of the cosmetics to be declared and the commodity code to be declared are predicted according to the word frequency-inverse text frequency of each feature vector in the prediction sample and the correlation between the cosmetic attribute and the cosmetic class, so that the classification accuracy and the matching degree of the commodity code can be improved, and the commodity code inquiring time is further saved.
Further, the technical solution of the embodiment of the present application may be described in detail as follows:
1. the method comprises the steps of acquiring historical declaration data of cosmetics, and establishing a label extraction model based on commodity information filled by enterprises, and specifically comprises the following steps:
1.1 Define the cosmetic attributes $Z=\{z_1,z_2,\dots,z_7\}$, which are respectively commodity type, use object, efficacy, packaging, specification, brand and component.
1.2 Divide the commodity information filled in by the enterprise into several pieces of information through word segmentation and attribute labeling. Specifically, a BERT+CRF model may be used to implement Chinese named entity recognition; the embodiment of the application is not limited in this respect.
1.3 De-duplicate repeated word segmentation results, and extract the commodity type, use object, efficacy, packaging, specification, brand and component attributes.
Taking the commodity information "OLAY cream | use: facial moisturizing and whitening | packaging specification: 50G/bottle | brand: OLAY" as an example, the result of the label extraction model is "OLAY - brand, cream - commodity type, face - use object, moisturizing and whitening - efficacy, G - specification, bottle - packaging".
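As a rough stand-in for the BERT+CRF extractor, a rule-based split of the "|"-delimited declaration text illustrates the attribute labeling step; the field names below are hypothetical, and real declarations would need the trained model.

```python
def extract_labels(text):
    """Split '|'-delimited commodity information into attribute/value
    pairs, keeping the first value seen for each attribute."""
    labels = {}
    for segment in text.split("|"):
        if ":" in segment:
            attr, value = segment.split(":", 1)
            labels.setdefault(attr.strip(), value.strip())
    return labels

sample = "brand: OLAY | commodity type: cream | use object: face | efficacy: moisturizing and whitening"
labels = extract_labels(sample)
```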
2. Based on an improved naive Bayes classifier, a first-layer classification model is established, and the cosmetic category to which the commodity belongs is predicted, specifically comprising the following steps:
2.1 Define the first-layer classification model categories $L^1 = \{L_1^1, L_2^1, \dots, L_5^1\}$, which are respectively lip cosmetics, eye cosmetics, nail cosmetics, powder cosmetics, and other cosmetics or skin care products.
2.2, acquiring historical declaration data of cosmetics as training samples, and respectively establishing word libraries according to the types of the cosmetics, wherein the method specifically comprises the following steps of:
and classifying the data according to HS codes of each claim data to obtain data of 5 cosmetic categories. For example, declaration data of "33041000" 8 bits before HS encoding is classified into cosmetics for lips. And (3) extracting the label of the step (1) for the declaration information of declaration data in each cosmetic category to obtain 5 word libraries.
And 2.3, acquiring commodity information of cosmetics to be declared, which is input by enterprises, as a prediction sample. The commodity information of the cosmetics to be declared, which is input by enterprises, is extracted through the label in the step 1, and a feature vector X= { X is obtained 1 ,x 2 ,...,x n Each feature x corresponds to an attribute z.
2.4 defines an improved naive bayes classifier, calculating the probability of each cosmetic class.
Specifically, based on the principle of naive Bayes classification, the probability that the prediction sample belongs to cosmetic category $j$ is

$$P(L_j^1 \mid X) = \frac{P(X \mid L_j^1)\, P(L_j^1)}{P(X)}$$

Since $P(X)$ is constant for all cosmetic categories, maximizing the posterior probability $P(L_j^1 \mid X)$ can be converted into maximizing $P(X \mid L_j^1)\, P(L_j^1)$. Assuming that the feature vectors are mutually independent gives

$$P(X \mid L_j^1) = \prod_{i=1}^{n} P(x_i \mid L_j^1)$$

and further the naive Bayes classification model

$$P(L_j^1 \mid X) \propto P(L_j^1) \prod_{i=1}^{n} P(x_i \mid L_j^1)$$

wherein $P(L_j^1)$ is the prior probability of each cosmetic category, and $P(x_i \mid L_j^1)$ is the conditional probability, i.e. the probability that the word $x_i$ occurs in the category-$j$ word stock. From the above formula, the higher the frequency of occurrence of the word $x_i$ in the category-$j$ word stock, the greater the probability that the sample belongs to cosmetic category $j$. In reality, however, some frequently occurring common words may contribute little to distinguishing categories; for example, "moisturizing" may occur very frequently in every category, and simply using word frequencies would reduce the accuracy of the classifier. Therefore, the TF-IDF value is used instead of $P(x_i \mid L_j^1)$.
In addition, the attributes in the naive Bayes model are weighted equally, but in the actual classification of cosmetics the importance of each attribute differs; for example, in the first-layer classification model the "use object" attribute plays the most important role in distinguishing cosmetic categories, so different attributes can be given different weights to improve the accuracy of the Bayes model. Let $w_i$ be the weight of feature $x_i$; the improved naive Bayes model is

$$P(L_j^1 \mid X) \propto P(L_j^1) \prod_{i=1}^{n} \left( tf_{ij} \times idf_i \right)^{w_i}$$
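The effect of substituting TF-IDF for the raw word frequency can be seen numerically: a word that appears in nearly every training sample receives an inverse text frequency close to zero and therefore contributes little, regardless of how often it occurs. The counts below are hypothetical; the idf form follows step 2.6.

```python
import math

def idf(total_samples, samples_containing):
    # Inverse text frequency; the +1 avoids a zero denominator (step 2.6).
    return math.log(total_samples / (samples_containing + 1))

TOTAL = 1000
idf_common = idf(TOTAL, 999)  # "moisturizing": appears almost everywhere
idf_rare = idf(TOTAL, 49)     # "lipstick": concentrated in one category
```

The ubiquitous word ends up with an idf of exactly zero here, while the discriminative one keeps a weight of about 3.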
2.5 Calculate the prior probability $P(L_j^1)$ of each cosmetic category, i.e. the proportion of declaration data of cosmetic category $j$ in the training samples to the total declaration data. The specific calculation formula is:

$$P(L_j^1) = \frac{|D_j|}{|D|}$$

wherein $|D|$ is the total number of training samples and $|D_j|$ is the number of training samples whose cosmetic category is $j$.
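The prior is just the category's share of the training data; a minimal sketch with hypothetical category labels:

```python
from collections import Counter

# Hypothetical cosmetic categories of five historical declarations.
train_categories = ["lip", "lip", "eye", "nail", "lip"]

counts = Counter(train_categories)
total_samples = len(train_categories)  # |D|
priors = {cat: n / total_samples for cat, n in counts.items()}  # |D_j| / |D|
```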
2.6 Calculate the TF-IDF value, i.e. the word frequency-inverse text frequency, of each feature vector $x_i$ in the prediction sample $X=\{x_1,x_2,\dots,x_n\}$, used to evaluate the importance of words in the word stock. If a word occurs frequently in one word stock and rarely in the other word stocks, the word is considered to have good category-distinguishing capability. The specific calculation formula is:

$$TF\text{-}IDF_{ij} = tf_{ij} \times idf_i$$

wherein $tf_{ij}$ is the word frequency, i.e. the frequency with which word $i$ occurs in the category-$j$ word stock, and $idf_i$ is the inverse text frequency of word $i$, which is lower the more word stocks the word appears in.
The specific formula for $tf_{ij}$ is:

$$tf_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$$

wherein $n_{ij}$ is the number of training samples in which word $i$ occurs in the category-$j$ word stock, and $\sum_k n_{kj}$ is the total number of training samples in the category-$j$ word stock. If word $i$ does not appear in the training set, the whole probability becomes 0; to solve this zero-probability problem, Laplace smoothing can be used to correct the probability. The corrected $tf_{ij}$ is:

$$tf_{ij} = \frac{n_{ij} + 1}{\sum_k n_{kj} + m}$$

wherein $m$ is the number of word stocks.
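The Laplace correction can be sketched directly from the formula; the counts are hypothetical and m is taken as the number of word stocks, as stated above.

```python
def smoothed_tf(n_ij, stock_total, m):
    """Laplace-corrected word frequency: (n_ij + 1) / (sum_k n_kj + m)."""
    return (n_ij + 1) / (stock_total + m)

M = 5                              # five first-layer word stocks
tf_seen = smoothed_tf(4, 95, M)    # word seen 4 times out of 95
tf_unseen = smoothed_tf(0, 95, M)  # unseen word: small non-zero tf, not 0
```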
The specific formula for $idf_i$ is:

$$idf_i = \log \frac{|D|}{\left|\{j : t_i \in d_j\}\right| + 1}$$

wherein $|D|$ is the total number of training samples, and $|\{j : t_i \in d_j\}|$ is the number of training samples containing word $i$, increased by 1 to avoid a denominator of 0.
2.7 calculating the correlation of the cosmetic properties in the training sample with the cosmetic class. Each attribute is given a weight according to the correlation, the higher the correlation, the greater the weight.
Specifically, calculate the mutual information between each cosmetic attribute $z_1, z_2, \dots, z_7$ and the cosmetic category $L^1$ to measure the correlation between the two event sets. The calculation formula is:

$$I(z_i ; L^1) = \sum_{k} \sum_{j} P(z_{i,k}, L_j^1) \log \frac{P(z_{i,k}, L_j^1)}{P(z_{i,k})\, P(L_j^1)}$$

wherein $P(z_{i,k}, L_j^1)$ is the joint probability distribution of the cosmetic attribute $z_i$ and the cosmetic category $L^1$, and $P(z_{i,k})$ and $P(L_j^1)$ are respectively the marginal probability distributions of the cosmetic attribute $z_i$ and the cosmetic category $L^1$.
Further, $I(z_i ; L^1)$ is taken as the weight coefficient of attribute $z_i$, i.e. $w_i = I(z_i ; L^1)$.
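Mutual information as an attribute weight can be sketched from co-occurrence counts. The pairs below are hypothetical (attribute value, category) observations; an attribute whose values track the category closely receives a larger weight.

```python
import math
from collections import Counter

# Hypothetical (value of the "use object" attribute, cosmetic category) pairs.
pairs = [
    ("lip", "lip cosmetics"), ("lip", "lip cosmetics"),
    ("eye", "eye cosmetics"), ("eye", "eye cosmetics"),
    ("face", "lip cosmetics"), ("face", "eye cosmetics"),
]

def mutual_information(pairs):
    """I(z; L) = sum over observed (value, category) cells of
    P(v, c) * log(P(v, c) / (P(v) * P(c)))."""
    n = len(pairs)
    joint = Counter(pairs)
    p_value = Counter(v for v, _ in pairs)
    p_cat = Counter(c for _, c in pairs)
    mi = 0.0
    for (v, c), count in joint.items():
        p_vc = count / n
        mi += p_vc * math.log(p_vc / ((p_value[v] / n) * (p_cat[c] / n)))
    return mi

w_use_object = mutual_information(pairs)
```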
2.8 Based on the results $tf_{ij}$, $idf_i$ and $w_i$ calculated in steps 2.5, 2.6 and 2.7, and the improved naive Bayes model formula

$$P(L_j^1 \mid X) \propto P(L_j^1) \prod_{i=1}^{n} \left( tf_{ij} \times idf_i \right)^{w_i}$$

calculate the probability that the prediction sample $X=\{x_1,x_2,\dots,x_n\}$ belongs to each category, and then select the cosmetic category with the greatest probability as the prediction result of the first-layer classification model.
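Putting steps 2.5 to 2.8 together, the sketch below scores two hypothetical categories from made-up word-stock counts and document frequencies; the exponent placement of the weights w_i is an assumption consistent with the log-space reading of the model, and all numbers are illustrative.

```python
import math

# Hypothetical statistics from 100 historical declarations.
TOTAL_DOCS = 100
DOC_FREQ = {"lip": 30, "lipstick": 12, "moisturizing": 95,
            "eye": 28, "mascara": 9}
STOCK_COUNTS = {  # word occurrence counts inside each category's word stock
    "lip": {"lip": 8, "lipstick": 5, "moisturizing": 6},
    "eye": {"eye": 9, "mascara": 4, "moisturizing": 7},
}
PRIORS = {"lip": 0.55, "eye": 0.45}
M = len(STOCK_COUNTS)  # number of word stocks

def tfidf(word, category):
    counts = STOCK_COUNTS[category]
    tf = (counts.get(word, 0) + 1) / (sum(counts.values()) + M)  # Laplace
    idf = math.log(TOTAL_DOCS / (DOC_FREQ.get(word, 0) + 1))
    return tf * idf

def predict_category(features, weights):
    """argmax_j of log P(L_j) + sum_i w_i * log(tf_ij * idf_i)."""
    def score(category):
        return math.log(PRIORS[category]) + sum(
            w * math.log(tfidf(x, category))
            for x, w in zip(features, weights))
    return max(PRIORS, key=score)

category = predict_category(["lip", "lipstick", "moisturizing"], [1.4, 0.9, 0.3])
```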
3. Based on the improved naive Bayes classifier, a second-layer classification model is established, and specific commodity codes of the commodity are predicted, wherein the specific commodity codes comprise the following steps:
3.1 Based on the cosmetic category of the commodity predicted by the first-layer classification model, establish a second-layer classification model under that category, and define the second-layer classification model categories $L^2$, wherein lip cosmetics contain 7 codes, eye cosmetics contain 7 codes, nail cosmetics contain 4 codes, powder cosmetics contain 2 codes, and other cosmetics or skin care products contain 9 codes.
3.2 Acquire the historical declaration data of the cosmetics and establish a word stock for each category $L^2$ of the second-layer classification model, i.e. 7 word stocks are built under lip cosmetics, 7 word stocks under eye cosmetics, and so on.
3.3 Repeat steps 2.3 to 2.8 to calculate the maximum posterior probability $P(L_k^2 \mid X)$; the category with the highest probability is the predicted commodity code of the commodity.
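The narrowing effect of the two layers can be sketched as follows: the second layer only scores the commodity codes contained in the category picked by the first layer. The code lists and scores below are hypothetical; real scores would come from the improved classifier of step 2.

```python
# Assumed grouping of ten-digit commodity codes under first-layer categories.
CODES_UNDER = {
    "lip cosmetics": ["3304100010", "3304100020", "3304100030"],
    "eye cosmetics": ["3304200010", "3304200020"],
}

# Hypothetical second-layer log scores produced for one prediction sample.
second_layer_scores = {
    "3304100010": -8.1, "3304100020": -9.4, "3304100030": -7.6,
    "3304200010": -6.0, "3304200020": -6.5,
}

def predict_code(first_layer_category, scores):
    """Step 3.3: argmax restricted to the codes under the predicted category."""
    candidates = CODES_UNDER[first_layer_category]
    return max(candidates, key=lambda code: scores[code])

code = predict_code("lip cosmetics", second_layer_scores)
```

Even though an eye-cosmetics code happens to score higher here, it is never considered once the first layer has chosen lip cosmetics.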
According to the embodiment of the application, by establishing the method for searching the HS codes of the cosmetics based on the improved naive Bayesian classifier, enterprises can return the HS codes with highest matching degree only by inputting commodity information of the cosmetics to be declared, and the time for inquiring the commodity codes by the enterprises is greatly saved; by constructing a two-layer classification model, the first layer predicts the cosmetic category to which the commodity belongs, and the second layer predicts the specific commodity code to which the commodity belongs, so that the classification accuracy is improved; the calculation of the correlation between the attribute and the category is increased, the weight of the attribute with large contribution to the category discrimination is increased, and the weight of the attribute with small contribution to the category discrimination is reduced, so that the influence of low-weight attribute words on the classification result is weakened, and the stability of the classification effect is ensured.
As shown in fig. 2, a schematic structural diagram of an apparatus for predicting commodity codes according to an embodiment of the present application includes:
the first calculation module 210 is configured to calculate, using commodity information of the cosmetics to be declared as a prediction sample, word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, where each feature vector corresponds to an attribute of the cosmetics to be declared, and each first-layer word stock corresponds to a cosmetic category.
The first prediction module 220 is configured to predict, according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category, and the correlation between the cosmetic attribute and the cosmetic category in the historical reporting data, the cosmetic category to which the cosmetic to be reported belongs.
Specifically, the first prediction module 220 is specifically configured to calculate a weight coefficient for each feature vector in the prediction sample according to the correlation between cosmetic attributes and cosmetic categories in the historical declaration data, and to calculate, by the following formula, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each cosmetic category j,

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(j) is the prior probability of cosmetic category j; tf_ij is the frequency of the feature vector x_i in the first-layer word stock corresponding to cosmetic category j; idf_i is the inverse text frequency of the feature vector x_i in the first-layer word stock; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
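The scoring and argmax step described above can be sketched as follows. The exact way the patent combines the prior P(j) with w_i, tf_ij, and idf_i is given only as a formula image, so the additive log-space combination below is an assumption, as are the small example tables in the usage note:

```python
import math

def predict_category(sample_words, priors, tf, idf, weights):
    """Return the category j maximizing log P(j) plus the weighted
    tf-idf evidence of the sample's attribute words. `tf` maps
    (word, category) pairs to term frequencies in that category's
    word stock; `idf` and `weights` map words to idf_i and w_i."""
    best, best_score = None, float("-inf")
    for j, prior in priors.items():
        score = math.log(prior)
        for x in sample_words:
            # w_i * tf_ij * idf_i contribution of each feature word x_i
            score += weights.get(x, 1.0) * tf.get((x, j), 0.0) * idf.get(x, 0.0)
        if score > best_score:
            best, best_score = j, score
    return best
```

The same routine serves the second prediction layer unchanged: pass per-commodity-code priors and the second-layer tf table instead of the per-category ones.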
The second calculating module 230 is configured to calculate a word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, where each second-layer word stock corresponds to a commodity code included in the cosmetic category to which the cosmetic to be declared belongs.
The second prediction module 240 is configured to predict the commodity code to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code, and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
Specifically, the second prediction module 240 is specifically configured to calculate a weight coefficient for each feature vector in the prediction sample according to the correlation between cosmetic attributes and commodity codes in the historical declaration data, and to calculate, by the following formula, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each commodity code k,

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(k) is the prior probability of commodity code k; tf_ik is the frequency of the feature vector x_i in the second-layer word stock corresponding to commodity code k; idf_i is the inverse text frequency of the feature vector x_i in the second-layer word stock; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
Further, the device further comprises:
the first classification module is used for acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to the commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
The second classification module is used for classifying the plurality of pieces of declaration data according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
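The two classification modules above build the first-layer and second-layer word stocks from the same historical declarations. A minimal sketch of that grouping is shown below; real label extraction would use Chinese word segmentation, so the plain `split()` tokenizer, the category names, and the example HS codes are all illustrative assumptions:

```python
from collections import defaultdict

def build_word_stocks(declarations, code_to_category):
    """Group historical declarations into per-category (first-layer) and
    per-commodity-code (second-layer) word stocks. `declarations` is a
    list of (declaration_text, hs_code) pairs; `code_to_category` maps
    each commodity code to its cosmetic category."""
    first_layer = defaultdict(list)
    second_layer = defaultdict(list)
    for text, hs_code in declarations:
        words = text.split()  # stand-in for proper label extraction
        first_layer[code_to_category[hs_code]].extend(words)
        second_layer[hs_code].extend(words)
    return first_layer, second_layer
```

Term and inverse-text frequencies for the prediction formulas can then be computed per word stock from these grouped token lists.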
According to the embodiment of the application, the cosmetic category and the commodity code of the cosmetic to be declared are predicted from the word frequency-inverse text frequency of each feature vector in the prediction sample together with the correlation between cosmetic attributes and cosmetic categories, which improves the classification accuracy and the matching degree of the returned commodity code, and further saves commodity-code query time.
The embodiment of the application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the processes of the above method embodiment for predicting commodity codes and achieves the same technical effects; to avoid repetition, the details are not repeated here. The computer-readable storage medium may be, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, those skilled in the art will appreciate that the method of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, or optical disk) and comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.
Claims (8)
1. A method of predicting commodity codes, comprising the steps of:
taking commodity information of cosmetics to be declared as a prediction sample, and calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic class;
predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
predicting commodity codes of the cosmetics to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data;
the predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data, specifically includes:
according to the correlation between the cosmetic attributes and the cosmetic categories in the historical declaration data, calculating the weight coefficient of each feature vector in the prediction sample;
calculating, by the following formula, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each cosmetic category j,

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(j) is the prior probability of cosmetic category j; tf_ij is the frequency of the feature vector x_i in the first-layer word stock corresponding to cosmetic category j; idf_i is the inverse text frequency of the feature vector x_i in the first-layer word stock; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
2. The method of claim 1, wherein, before the calculating of the word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, the method further comprises:
acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to commodity codes of each declaration data in the training sample;
and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
3. The method of claim 1, wherein, before the calculating of the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, the method further comprises:
classifying the multiple pieces of declaration data according to commodity codes of the multiple pieces of declaration data corresponding to each cosmetic class;
and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
4. The method according to claim 1, wherein predicting the commodity code to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code, and the correlation between the cosmetic attribute and the commodity code in the historical declaration data specifically includes:
according to the correlation between the cosmetic attributes and commodity codes in the historical declaration data, calculating the weight coefficient of each feature vector in the prediction sample;
calculating, by the following formula, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each commodity code k,

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(k) is the prior probability of commodity code k; tf_ik is the frequency of the feature vector x_i in the second-layer word stock corresponding to commodity code k; idf_i is the inverse text frequency of the feature vector x_i in the second-layer word stock; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
5. An apparatus for predicting commodity codes, comprising:
the first calculation module is used for taking commodity information of cosmetics to be declared as a prediction sample, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category;
the first prediction module is used for predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the second calculation module is used for calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, and each second-layer word stock corresponds to one commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
the second prediction module is used for predicting commodity codes of the cosmetics to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data;
the first prediction module is specifically configured to calculate a weight coefficient of each feature vector in the prediction sample according to a correlation between a cosmetic attribute and a cosmetic category in the historical declaration data;
and to calculate, by the following formula, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each cosmetic category j,

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(j) is the prior probability of cosmetic category j; tf_ij is the frequency of the feature vector x_i in the first-layer word stock corresponding to cosmetic category j; idf_i is the inverse text frequency of the feature vector x_i in the first-layer word stock; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
6. The apparatus as recited in claim 5, further comprising:
the first classification module is used for acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to the commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
7. The apparatus as recited in claim 5, further comprising:
the second classification module is used for classifying the plurality of pieces of declaration data according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
8. The apparatus of claim 5, wherein
the second prediction module is specifically configured to calculate a weight coefficient of each feature vector in the prediction sample according to the correlation between the cosmetic attribute and the commodity code in the historical declaration data;
and to calculate, by the following formula, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each commodity code k,

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(k) is the prior probability of commodity code k; tf_ik is the frequency of the feature vector x_i in the second-layer word stock corresponding to commodity code k; idf_i is the inverse text frequency of the feature vector x_i in the second-layer word stock; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310174800.4A CN116166805B (en) | 2023-02-24 | 2023-02-24 | Commodity coding prediction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116166805A CN116166805A (en) | 2023-05-26 |
CN116166805B true CN116166805B (en) | 2023-09-22 |
Family
ID=86413034
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310174800.4A Active CN116166805B (en) | 2023-02-24 | 2023-02-24 | Commodity coding prediction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116166805B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140066486A (en) * | 2012-11-23 | 2014-06-02 | 현대중공업 주식회사 | An method for assigning hs code to materials |
CN107704892A (en) * | 2017-11-07 | 2018-02-16 | 宁波爱信诺航天信息有限公司 | A kind of commodity code sorting technique and system based on Bayesian model |
CN109598517A (en) * | 2017-09-29 | 2019-04-09 | 阿里巴巴集团控股有限公司 | Commodity clearance processing, the processing of object and its class prediction method and apparatus |
CN110858219A (en) * | 2018-08-17 | 2020-03-03 | 菜鸟智能物流控股有限公司 | Logistics object information processing method and device and computer system |
CN112529420A (en) * | 2020-12-14 | 2021-03-19 | 深圳市钛师傅云有限公司 | Intelligent classification method and system for customs commodity codes |
CN113378167A (en) * | 2021-06-30 | 2021-09-10 | 哈尔滨理工大学 | Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20120052636A (en) * | 2010-11-16 | 2012-05-24 | 한국전자통신연구원 | A hscode recommendation service system and method using ontology |
Non-Patent Citations (2)
Title |
---|
Research on business opportunity data classification based on DOA automatic registration and an improved naive Bayes algorithm; Jiang Dengli; China Master's Theses Full-text Database (No. 2); pp. 36-50 *
Improved multinomial naive Bayes classification algorithm and its Python implementation; Chen Cuijuan; Journal of Jingdezhen University; Vol. 36 (No. 3); pp. 92-95 *
Also Published As
Publication number | Publication date |
---|---|
CN116166805A (en) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11669750B2 (en) | System and/or method for generating clean records from imperfect data using model stack(s) including classification model(s) and confidence model(s) | |
CN108391446B (en) | Automatic extraction of training corpus for data classifier based on machine learning algorithm | |
CN106919619B (en) | Commodity clustering method and device and electronic equipment | |
CN104778186B (en) | Merchandise items are mounted to the method and system of standardized product unit | |
US20180158078A1 (en) | Computer device and method for predicting market demand of commodities | |
CN106447066A (en) | Big data feature extraction method and device | |
CN112200601B (en) | Item recommendation method, device and readable storage medium | |
CN106445988A (en) | Intelligent big data processing method and system | |
CN112487199B (en) | User characteristic prediction method based on user purchasing behavior | |
CN110019790B (en) | Text recognition, text monitoring, data object recognition and data processing method | |
CN109766911A (en) | A kind of behavior prediction method | |
CN107247728B (en) | Text processing method and device and computer storage medium | |
CN113656699B (en) | User feature vector determining method, related equipment and medium | |
CN110135769A (en) | Kinds of goods attribute fill method and device, storage medium and electric terminal | |
CN111651981A (en) | Data auditing method, device and equipment | |
CN115062732A (en) | Resource sharing cooperation recommendation method and system based on big data user tag information | |
CN116166805B (en) | Commodity coding prediction method and device | |
CN117436446A (en) | Weak supervision-based agricultural social sales service user evaluation data analysis method | |
Spichakova et al. | Using machine learning for automated assessment of misclassification of goods for fraud detection | |
CN115080741A (en) | Questionnaire survey analysis method, device, storage medium and equipment | |
US20110208738A1 (en) | Method for Determining an Enhanced Value to Keywords Having Sparse Data | |
CN111325419A (en) | Method and device for identifying blacklist user | |
CN113486948B (en) | Clothing commodity gender classification method and device based on text data | |
CN112685635B (en) | Item recommendation method, device, server and storage medium based on classification label | |
CN117853249A (en) | Commodity classification recommending method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||