CN116166805A - Commodity coding prediction method and device - Google Patents

Commodity coding prediction method and device

Info

Publication number
CN116166805A
CN116166805A
Authority
CN
China
Prior art keywords
cosmetic
feature vector
frequency
commodity
prediction
Prior art date
Legal status
Granted
Application number
CN202310174800.4A
Other languages
Chinese (zh)
Other versions
CN116166805B (en)
Inventor
徐梦璇
张丹
熊晓菁
Current Assignee
Beijing Qingmeng Shuhai Technology Co ltd
Original Assignee
Beijing Qingmeng Shuhai Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qingmeng Shuhai Technology Co ltd filed Critical Beijing Qingmeng Shuhai Technology Co ltd
Priority to CN202310174800.4A priority Critical patent/CN116166805B/en
Publication of CN116166805A publication Critical patent/CN116166805A/en
Application granted granted Critical
Publication of CN116166805B publication Critical patent/CN116166805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for predicting commodity codes. The method comprises the following steps: taking the commodity information of cosmetics to be declared as a prediction sample, and calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock; predicting the cosmetic category of the cosmetics to be declared according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock; calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock; and predicting the commodity code of the cosmetics to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock. The method and the device can improve classification accuracy and the matching degree of commodity codes, and thereby save the time spent querying commodity codes.

Description

Commodity coding prediction method and device
Technical Field
The application belongs to the technical field of big data, and particularly relates to a method and a device for predicting commodity codes.
Background
When cosmetics enterprises declare import and export goods, the HS (Harmonized System) code of the goods must be filled in on the customs declaration attached to the goods. The HS code is a set of international trade commodity classification codes, mainly used by customs personnel to confirm the commodity category, carry out commodity classification management, audit tariff standards and check commodity quality indexes. The HS coding system currently used in China consists of ten digits; usually one commodity corresponds to only one HS code, while one HS code may correspond to more than one commodity. Correctly filling in the HS code can accelerate the customs process, ensure smooth clearance of the goods, and avoid extra cost or delay. If the HS code is wrongly classified, the normal order of customs is disturbed, and in serious cases the enterprise will be administratively penalized by the customs.
In order to correctly fill in the commodity code of cosmetics, the declaration personnel of an enterprise need to know the basic knowledge of HS code classification as well as the properties, characteristics and uses of the commodity itself. This requires knowledge accumulated over years, and not every declarant can classify and distinguish the HS codes of commodities quickly and skillfully. Currently, there are many websites that can query HS codes by obtaining a keyword entered by the user and then returning all relevant HS codes that contain the keyword. However, the query results are numerous, span different categories, lack hierarchical relationships and have a low matching degree, which increases the time cost for enterprises to query commodity codes.
Summary of the application
The embodiment of the application aims to provide a method and a device for predicting commodity codes, which are used for solving the defect of low matching degree of commodity code query in the prior art.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, a method of predicting commodity coding is provided, comprising the steps of:
taking commodity information of cosmetics to be declared as a prediction sample, and calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic class;
predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
predicting the commodity code of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
In a second aspect, there is provided an apparatus for predicting commodity codes, comprising:
the first calculation module is used for taking commodity information of cosmetics to be declared as a prediction sample, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category;
the first prediction module is used for predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the second calculation module is used for calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, and each second-layer word stock corresponds to one commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
and the second prediction module is used for predicting the commodity code to which the cosmetics to be declared belong according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
According to the method and the device for predicting the commodity codes, the classification accuracy and the matching degree of the commodity codes can be improved, and the time for inquiring the commodity codes is further saved.
Drawings
FIG. 1 is a flow chart of a method for predicting commodity codes according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an apparatus for predicting commodity codes according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In order to solve the problems in the prior art, the embodiment of the application provides a method for searching cosmetic HS codes based on an improved naive Bayes classifier, which is improved from the following three aspects:
1. because the number of commodity codes contained under the cosmetics category is large, and the establishment difficulty of the multi-classification model is large, the embodiment of the application establishes a two-layer classification model, wherein the first layer predicts the cosmetic category to which the commodity belongs and the second layer predicts the specific commodity code to which the commodity belongs.
2. By calculating TF-IDF values instead of the conditional probabilities in the naive Bayes model, the evaluation of word importance is added to the classification process.
3. By calculating the correlation between the attributes and the categories, different weights are given to different attributes, and the higher the degree of correlation between the attributes and the categories, the greater the importance of the attributes to the categories, and therefore the higher the weight given to the attributes.
In order to achieve the above purpose, the embodiment of the present application provides the following technical solutions:
1. and acquiring historical declaration data of cosmetics, and establishing a label extraction model based on commodity information filled in by enterprises.
2. Based on the improved naive Bayes classifier, a first-layer classification model is established, and the cosmetic category to which the commodity belongs is predicted.
2.1 Define the first-layer classification model categories $L_1=\{l_1,l_2,l_3,l_4,l_5\}$, corresponding to the five cosmetic categories.
2.2, acquiring historical declaration data of cosmetics as training samples, and establishing five word banks according to the types of the cosmetics.
And 2.3, acquiring commodity information of cosmetics to be declared, which is input by enterprises, as a prediction sample. The prediction samples are subjected to label extraction in step 1.
2.4 Define an improved naive Bayes classifier:

$$f(X)=\arg\max_{j} P(L_1=l_j)\prod_{i=1}^{n}\left(tf_{ij}\times idf_i\right)^{w_i}$$
2.5 Calculate the prior probability $P(L_1=l_j)$ of each cosmetic class.
2.6 Calculate the TF-IDF value, i.e. the word frequency-inverse text frequency, of each feature vector $x_i$ in the prediction sample $X=\{x_1,x_2,\dots,x_n\}$, used to evaluate the importance of words in the word stock.
2.7 Calculate the correlation between the cosmetic attributes and the cosmetic categories. Give each attribute a weight $w_i$ according to the correlation; the higher the correlation, the greater the weight.
2.8 Based on the calculation results of steps 2.5, 2.6 and 2.7, calculate the probability $P_j$ that the prediction sample $X=\{x_1,x_2,\dots,x_n\}$ belongs to each category.
Then, the cosmetic class in which the probability is the greatest is selected as the prediction result of the first-layer classification model.
3. And based on the improved naive Bayes classifier, establishing a second-layer classification model, and predicting the specific commodity code to which the commodity belongs.
3.1 Based on the cosmetic category of the commodity predicted by the first-layer classification model, establish a second-layer classification model under that category. Define the second-layer classification model categories $L_2$, i.e. the commodity codes contained under that cosmetic category.
3.2 Based on the training samples and the categories $L_2$ of the second-layer classification model, establish word stocks respectively.
3.3 Repeat steps 2.3 to 2.8 to calculate the posterior probability $P_k$ of each commodity code.
The category with the highest probability is the predicted commodity code of the commodity.
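As a minimal sketch of the two-layer flow above (the data layout, helper names and the tiny smoothing floor `1e-9` are my assumptions for illustration, not part of the patent), the per-class score is the prior times the weighted product of TF-IDF values, applied first over cosmetic categories and then over commodity codes:

```python
# Sketch of the two-layer improved naive Bayes prediction flow.
# Each lexicon maps a feature word to its precomputed {"tf": ..., "idf": ...}.

def score(sample, lexicons, priors, weights):
    """Per class: prior * prod((tf * idf) ** w), per the improved model."""
    scores = {}
    for label, lexicon in lexicons.items():
        p = priors[label]
        for feat in sample:
            tf = lexicon.get(feat, {}).get("tf", 1e-9)   # floor for unseen words
            idf = lexicon.get(feat, {}).get("idf", 1e-9)
            p *= (tf * idf) ** weights.get(feat, 1.0)
        scores[label] = p
    return scores

def predict(sample, layer1, layer2_by_class):
    # First layer: predict the cosmetic category.
    s1 = score(sample, layer1["lexicons"], layer1["priors"], layer1["weights"])
    category = max(s1, key=s1.get)
    # Second layer: predict the commodity code within that category.
    layer2 = layer2_by_class[category]
    s2 = score(sample, layer2["lexicons"], layer2["priors"], layer2["weights"])
    return category, max(s2, key=s2.get)
```

Restricting the second-layer scoring to the word stocks under the predicted category is what keeps each classifier small despite the large total number of commodity codes.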
The method for predicting commodity codes provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of a method for predicting commodity coding according to an embodiment of the present application is provided, where the method includes the following steps:
step 101, commodity information of cosmetics to be declared is used as a prediction sample, word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock is calculated, each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category.
Step 102, predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category, and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data.
Specifically, the weight coefficient of each feature vector in the prediction sample can be calculated according to the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each cosmetic class j
Figure BDA0004100538250000051
/>
Figure BDA0004100538250000052
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,
Figure BDA0004100538250000053
for the prior probability of cosmetic class j, tf ij For the feature vector x i The frequency of occurrence in the first layer word stock corresponding to cosmetic class j; idf (idf) i For the feature vector x i The frequency of the reverse text in the first-layer word stock; w (w) i For the feature vector x i Weight coefficient of (2);
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
In this embodiment, before calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each first layer word stock, historical declaration data of cosmetics may be further obtained as a training sample, and each declaration data in the training sample is classified according to a cosmetic class corresponding to a commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
Step 103, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs.
And 104, predicting the commodity code to which the cosmetics to be declared belong according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
Specifically, the weight coefficient of each feature vector in the prediction sample can be calculated according to the correlation between the cosmetic attribute and commodity code in the historical declaration data;
The probability $P_k$ that the prediction sample $X=\{x_1,x_2,\dots,x_n\}$ belongs to each commodity code $c_k$ is calculated by the following formula:

$$P_k=P(L_2=c_k)\prod_{i=1}^{n}\left(tf_{ik}\times idf_i\right)^{w_i}$$

where $\{x_1,x_2,\dots,x_n\}$ are the feature vectors in the prediction sample, $P(L_2=c_k)$ is the prior probability of commodity code $c_k$, $tf_{ik}$ is the frequency with which feature vector $x_i$ occurs in the second-layer word stock corresponding to commodity code $c_k$, $idf_i$ is the inverse text frequency of feature vector $x_i$ in the second-layer word stocks, and $w_i$ is the weight coefficient of feature vector $x_i$;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
In this embodiment, before calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, the plurality of pieces of declaration data may be further classified according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
According to the method and the device for predicting the commodity codes, the classification accuracy and the matching degree of the commodity codes can be improved, and the time for inquiring the commodity codes is further saved.
Further, the technical solutions of the embodiments of the present application may be described in detail as follows:
1. the method comprises the steps of acquiring historical declaration data of cosmetics, and establishing a label extraction model based on commodity information filled by enterprises, and specifically comprises the following steps:
1.1 Define the cosmetic attributes $Z=\{z_1,z_2,\dots,z_7\}$, which are, respectively, commodity type, usage object, efficacy, packaging, specification, brand, and ingredients.
1.2 Split the commodity information filled in by the enterprise into multiple attribute segments through word segmentation and attribute labeling. Specifically, a BERT+CRF model may be used to implement Chinese named-entity recognition; the embodiment of the application is not limited thereto.
1.3 De-duplicate repeated word-segmentation results, and extract the commodity type, usage object, efficacy, packaging, specification, brand and ingredient attributes.
Taking "OLAY cream | used on: face, moisturizing and whitening | packaging specification: 50G/bottle | brand: OLAY" as an example, after passing through the label extraction model the result is "OLAY-brand, cream-commodity type, face-usage object, moisturizing and whitening-efficacy, G-specification, bottle-packaging".
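A toy stand-in for this label extraction step can be sketched as a dictionary lookup over segmented tokens; the attribute vocabulary below is invented for the example, whereas the embodiment itself uses a BERT+CRF named-entity model:

```python
# Toy label extraction: map each segmented token to a cosmetic attribute via a
# hand-made vocabulary. An illustrative stand-in for the BERT+CRF model.
ATTRIBUTE_VOCAB = {  # token -> attribute (illustrative, not from the patent)
    "OLAY": "brand",
    "cream": "commodity type",
    "face": "usage object",
    "moisturizing": "efficacy",
    "whitening": "efficacy",
    "G": "specification",
    "bottle": "packaging",
}

def extract_labels(tokens):
    """Return (token, attribute) pairs, de-duplicating repeated tokens."""
    seen = set()
    labels = []
    for tok in tokens:
        attr = ATTRIBUTE_VOCAB.get(tok)
        if attr and (tok, attr) not in seen:
            seen.add((tok, attr))
            labels.append((tok, attr))
    return labels
```

For instance, `extract_labels(["OLAY", "cream", "face", "OLAY"])` keeps only the first occurrence of the repeated brand token, mirroring the de-duplication in step 1.3.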
2. Based on an improved naive Bayes classifier, a first-layer classification model is established, and the cosmetic category to which the commodity belongs is predicted, specifically comprising the following steps:
2.1 definition of first layer classification model categories
Figure BDA0004100538250000071
Respectively, cosmetics for lips, cosmetics for eyes, cosmetics for nails, powdery cosmetics, and other cosmetics or cosmetics for skin care.
2.2, acquiring historical declaration data of cosmetics as training samples, and respectively establishing word libraries according to the types of the cosmetics, wherein the method specifically comprises the following steps of:
and classifying the data according to HS codes of each claim data to obtain data of 5 cosmetic categories. For example, declaration data of "33041000" 8 bits before HS encoding is classified into cosmetics for lips. And (3) extracting the label of the step (1) for the declaration information of declaration data in each cosmetic category to obtain 5 word libraries.
2.3 Obtain the commodity information of the cosmetics to be declared entered by the enterprise as the prediction sample. The commodity information is passed through the label extraction of step 1 to obtain the feature vector $X=\{x_1,x_2,\dots,x_n\}$, where each feature $x_i$ corresponds to an attribute $z$.
2.4 defines an improved naive bayes classifier, calculating the probability of each cosmetic class.
Specifically, based on the principle of naive Bayes classification, the probability that the prediction sample belongs to cosmetic class $l_j$ is

$$P(L_1=l_j \mid X)=\frac{P(L_1=l_j)\,P(X \mid L_1=l_j)}{P(X)}$$

Since $P(X)$ is constant for all cosmetic categories, maximizing the posterior probability $P(L_1=l_j \mid X)$ can be converted into maximizing $P(L_1=l_j)\,P(X \mid L_1=l_j)$.

Assuming that the feature vectors are mutually independent gives

$$P(X \mid L_1=l_j)=\prod_{i=1}^{n}P(x_i \mid L_1=l_j)$$

and further the naive Bayes classification model

$$f(X)=\arg\max_{j} P(L_1=l_j)\prod_{i=1}^{n}P(x_i \mid L_1=l_j)$$

where $P(L_1=l_j)$ is the prior probability of each cosmetic class, and $P(x_i \mid L_1=l_j)$ is the conditional probability, i.e. the probability that word $x_i$ occurs in the class-$j$ word stock. From the above formula, the higher the frequency of $x_i$ in the class-$j$ word stock, the greater the probability that the sample belongs to cosmetic class $j$. In reality, however, some frequently occurring common words may contribute little to distinguishing categories; for example, "moisturizing" may occur very frequently in every category, so simply using word frequency may reduce the accuracy of the classifier. Therefore, the TF-IDF value is used instead of $P(x_i \mid L_1=l_j)$.

In addition, the attributes in the naive Bayes model carry equal weight, but in the actual classification of cosmetics the importance of each attribute differs; for example, in the first-layer classification model the usage-object attribute plays the most important role in distinguishing cosmetic categories. Different attributes can therefore be given different weights to improve the classification accuracy of the Bayes model. Let $w_i$ be the weight of feature $x_i$; the improved naive Bayes model is

$$f(X)=\arg\max_{j} P(L_1=l_j)\prod_{i=1}^{n}\left(tf_{ij}\times idf_i\right)^{w_i}$$
2.5 Calculate the prior probability $P(L_1=l_j)$ of each cosmetic class, i.e. the proportion of declaration data with cosmetic class $j$ in the training samples to the total amount of declaration data. The specific calculation formula is:

$$P(L_1=l_j)=\frac{|D_j|}{|D|}$$

where $|D|$ is the total number of training samples and $|D_j|$ is the number of training samples whose cosmetic class is $j$.
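The prior $P(L_1=l_j)=|D_j|/|D|$ follows directly from class counts over the training labels; a brief sketch:

```python
from collections import Counter

def class_priors(train_labels):
    """P(L1 = j) = |D_j| / |D| over the training-sample class labels."""
    total = len(train_labels)
    counts = Counter(train_labels)
    return {label: n / total for label, n in counts.items()}
```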
2.6 Calculate the TF-IDF value, i.e. the word frequency-inverse text frequency, of each feature vector $x_i$ in the prediction sample $X=\{x_1,x_2,\dots,x_n\}$, used to evaluate the importance of words in the word stock. If a word occurs frequently in one word stock and rarely in the other word stocks, the word is considered to have good category-distinguishing ability. The specific calculation formula is:

$$TF\text{-}IDF_{ij}=tf_{ij}\times idf_i$$

where $tf_{ij}$ is the word frequency, i.e. the frequency with which word $i$ occurs in the class-$j$ word stock, and $idf_i$ is the inverse text frequency of word $i$, reflecting how rarely it occurs in the other word stocks.
The specific formula of $tf_{ij}$ is:

$$tf_{ij}=\frac{n_{ij}}{\sum_{k} n_{kj}}$$

where $n_{ij}$ is the number of training samples in which word $i$ occurs in the class-$j$ word stock and $\sum_{k} n_{kj}$ is the total number of training samples in the class-$j$ word stock. If word $i$ does not appear in the training set, the whole probability becomes 0; to solve this zero-probability problem, Laplacian smoothing can be used to correct the probability. The corrected $tf_{ij}$ is:

$$tf_{ij}=\frac{n_{ij}+1}{\sum_{k} n_{kj}+m}$$

where $m$ is the number of word stocks.
The specific formula of $idf_i$ is:

$$idf_i=\log\frac{|D|}{|\{j: t_i \in d_j\}|+1}$$

where $|D|$ is the total number of training samples and $|\{j: t_i \in d_j\}|$ is the number of training samples containing word $i$, increased by 1 to avoid division by 0.
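The Laplace-smoothed $tf_{ij}$ and the $idf_i$ above can be sketched as small helpers (the function and parameter names are mine, not the patent's):

```python
import math

def tf_laplace(n_ij, n_j_total, m):
    """Laplace-smoothed term frequency: (n_ij + 1) / (sum_k n_kj + m)."""
    return (n_ij + 1) / (n_j_total + m)

def idf(total_samples, samples_containing_word):
    """Inverse text frequency: log(|D| / (|{j: t_i in d_j}| + 1))."""
    return math.log(total_samples / (samples_containing_word + 1))

def tf_idf(n_ij, n_j_total, m, total_samples, samples_containing_word):
    """TF-IDF_ij = tf_ij * idf_i with the smoothed term frequency."""
    return tf_laplace(n_ij, n_j_total, m) * idf(total_samples, samples_containing_word)
```

Note that a word absent from a word stock still receives the nonzero frequency $1/(\sum_k n_{kj}+m)$, which is exactly what keeps the product in the classifier from collapsing to zero.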
2.7 Calculate the correlation between the cosmetic attributes in the training samples and the cosmetic classes. Give each attribute a weight according to the correlation; the higher the correlation, the greater the weight.

Specifically, calculate the mutual information $I(z_i;L_1)$ between each cosmetic attribute $z_1,z_2,\dots,z_7$ and the cosmetic class $L_1$ to measure the correlation between the two event sets. The calculation formula is:

$$I(z_i;L_1)=\sum_{k}\sum_{j} P(z_{i,k},l_j)\log\frac{P(z_{i,k},l_j)}{P(z_{i,k})\,P(l_j)}$$

where $P(z_{i,k},l_j)$ is the joint probability distribution of cosmetic attribute $z_i$ and cosmetic class $L_1$, and $P(z_{i,k})$ and $P(l_j)$ are the marginal probability distributions of cosmetic attribute $z_i$ and cosmetic class $L_1$, respectively.

Further, take $I(z_i;L_1)$ as the weight coefficient of attribute $z_i$, i.e. $w_i=I(z_i;L_1)$.
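The mutual-information weight $w_i=I(z_i;L_1)$ can be estimated from co-occurrence counts of attribute values and class labels, assuming empirical (count-based) probabilities; a sketch:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(Z; L) from a list of (attribute_value, class_label) observations."""
    n = len(pairs)
    joint = Counter(pairs)                    # P(z, l) numerators
    pz = Counter(z for z, _ in pairs)         # P(z) numerators
    pl = Counter(l for _, l in pairs)         # P(l) numerators
    mi = 0.0
    for (z, l), c in joint.items():
        p_zl = c / n
        mi += p_zl * math.log(p_zl / ((pz[z] / n) * (pl[l] / n)))
    return mi
```

An attribute whose values track the class perfectly gets a large weight, while an attribute independent of the class gets a weight of 0, matching the intent of step 2.7.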
2.8 Based on $P(L_1=l_j)$, $tf_{ij}$, $idf_i$ and $w_i$ calculated in steps 2.5, 2.6 and 2.7, and the improved naive Bayes model formula, calculate the probability $P_j$ that the prediction sample $X=\{x_1,x_2,\dots,x_n\}$ belongs to each category:

$$P_j=P(L_1=l_j)\prod_{i=1}^{n}\left(tf_{ij}\times idf_i\right)^{w_i}$$

Then select the cosmetic class with the greatest probability as the prediction result of the first-layer classification model.
3. Based on the improved naive Bayes classifier, a second-layer classification model is established, and specific commodity codes of the commodity are predicted, wherein the specific commodity codes comprise the following steps:
3.1 Based on the cosmetic category of the commodity predicted by the first-layer classification model, establish a second-layer classification model under that category. Define the second-layer classification model categories $L_2$, where lip cosmetics contain 7 codes, eye cosmetics contain 7 codes, nail cosmetics contain 4 codes, powder cosmetics contain 2 codes, and other cosmetics or skin-care cosmetics contain 9 codes.
3.2 Acquire historical declaration data of the cosmetics and establish word stocks respectively according to the second-layer classification model categories $L_2$. That is, 7 word stocks are built under lip cosmetics, 7 word stocks are built under eye cosmetics, and so on.
3.3 Repeat steps 2.3 to 2.8 to calculate the posterior probability $P_k$ of each commodity code. The commodity code with the highest probability is the predicted commodity code of the commodity.
According to the embodiment of the application, a method for searching cosmetic HS codes based on an improved naive Bayes classifier is established, so that an enterprise only needs to input the commodity information of the cosmetics to be declared to receive the HS code with the highest matching degree, greatly saving the time enterprises spend querying commodity codes. By constructing a two-layer classification model, in which the first layer predicts the cosmetic category to which the commodity belongs and the second layer predicts the specific commodity code, the classification accuracy is improved. The calculation of the correlation between attributes and categories increases the weight of attributes that contribute much to category discrimination and reduces the weight of attributes that contribute little, so that the influence of low-weight attribute words on the classification result is weakened and the stability of the classification effect is ensured.
As shown in fig. 2, a schematic structural diagram of an apparatus for predicting commodity codes according to an embodiment of the present application includes:
the first calculation module 210 is configured to calculate, using commodity information of the cosmetics to be declared as a prediction sample, word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, where each feature vector corresponds to an attribute of the cosmetics to be declared, and each first-layer word stock corresponds to a cosmetic category.
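As an illustrative sketch only (the patent does not spell out its tf-idf normalisation, so a standard smoothed form is assumed here), the word frequency-inverse text frequency of one feature vector in one word stock might be computed as:

```python
import math

def tf_idf(term, stock, all_stocks):
    # Term frequency: how often the attribute word appears in this
    # category's word stock, normalised by the stock's size.
    tf = stock.count(term) / max(len(stock), 1)
    # Inverse text frequency: the rarer a word is across all word
    # stocks, the larger its idf (smoothed to avoid division by zero).
    containing = sum(1 for s in all_stocks if term in s)
    idf = math.log((1 + len(all_stocks)) / (1 + containing)) + 1
    return tf * idf

# Toy first-layer word stocks, one per cosmetic category (terms are made up).
stocks = {"lip": ["lipstick", "gloss", "lipstick"], "eye": ["mascara", "liner"]}
score = tf_idf("lipstick", stocks["lip"], list(stocks.values()))
```

A term absent from a stock scores zero there, which is what lets the classifier separate categories by their characteristic attribute words.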
The first prediction module 220 is configured to predict the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category, and the correlation between the cosmetic attributes and the cosmetic categories in the historical declaration data.
Specifically, the first prediction module 220 is specifically configured to calculate the weight coefficient of each feature vector in the prediction sample according to the correlation between the cosmetic attributes and the cosmetic categories in the historical declaration data, and to calculate, by the following formulas respectively, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each cosmetic category j:
P(X | j) = ∏_{i=1}^{n} (tf_ij · idf_i)^(w_i)
P(j | X) ∝ P(j) · P(X | j)
wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample, P(j) is the prior probability of cosmetic category j, tf_ij is the frequency of occurrence of the feature vector x_i in the first-layer word stock corresponding to cosmetic category j, idf_i is the inverse text frequency of the feature vector x_i in the first-layer word stocks, and w_i is the weight coefficient of the feature vector x_i; the probabilities that the prediction sample belongs to each cosmetic category are then compared, and the cosmetic category with the highest probability is taken as the prediction result.
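A minimal sketch of the weighted scoring performed by the first prediction module 220; the log-linear combination of prior, tf-idf, and weight coefficient, as well as the smoothing constant, are assumptions for illustration, not taken from the patent:

```python
import math

def predict_category(sample_terms, stocks, priors, weights):
    best, best_score = None, float("-inf")
    for cat, stock in stocks.items():
        # Start from the log prior probability of the category.
        score = math.log(priors[cat])
        for term in sample_terms:
            tf = stock.count(term) / max(len(stock), 1)
            containing = sum(1 for s in stocks.values() if term in s)
            idf = math.log((1 + len(stocks)) / (1 + containing)) + 1
            # The weight coefficient w_i scales the term's contribution;
            # the tiny constant keeps unseen terms from producing log(0).
            score += weights.get(term, 1.0) * math.log(tf * idf + 1e-9)
        if score > best_score:
            best, best_score = cat, score
    return best

# Toy data: word stocks, uniform priors, default weights (all illustrative).
stocks = {"lip": ["lipstick", "gloss"], "eye": ["mascara"]}
result = predict_category(["lipstick"], stocks, {"lip": 0.5, "eye": 0.5}, {})
```

Working in log space avoids numeric underflow when many attribute words are multiplied together.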
The second calculating module 230 is configured to calculate a word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, where each second-layer word stock corresponds to a commodity code included in the cosmetic category to which the cosmetic to be declared belongs.
The second prediction module 240 is configured to predict the commodity code to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code, and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
Specifically, the second prediction module 240 is specifically configured to calculate the weight coefficient of each feature vector in the prediction sample according to the correlation between the cosmetic attributes and the commodity codes in the historical declaration data, and to calculate, by the following formulas respectively, the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each commodity code k:
P(X | k) = ∏_{i=1}^{n} (tf_ik · idf_i)^(w_i)
P(k | X) ∝ P(k) · P(X | k)
wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample, P(k) is the prior probability of commodity code k, tf_ik is the frequency of occurrence of the feature vector x_i in the second-layer word stock corresponding to commodity code k, idf_i is the inverse text frequency of the feature vector x_i in the second-layer word stocks, and w_i is the weight coefficient of the feature vector x_i; the probabilities that the prediction sample belongs to each commodity code are then compared, and the commodity code with the highest probability is taken as the prediction result.
Further, the device further comprises:
the first classification module is used for acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to the commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
The second classification module is used for classifying the plurality of pieces of declaration data according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
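The two classification modules can be sketched roughly as follows; the (code, text) data model and the whitespace tokeniser stand in for the patent's label extraction and are assumptions for illustration:

```python
from collections import defaultdict

def build_word_stocks(declarations, code_to_category, extract_labels):
    first = defaultdict(list)   # cosmetic category -> label terms (layer 1)
    second = defaultdict(list)  # commodity code   -> label terms (layer 2)
    for code, text in declarations:
        labels = extract_labels(text)
        # The same labels feed both layers: grouped by category for the
        # first-layer word stocks, and by commodity code for the second.
        first[code_to_category[code]].extend(labels)
        second[code].extend(labels)
    return dict(first), dict(second)

# Toy historical declaration data; codes and categories are illustrative.
decls = [("A1", "lipstick red"), ("A2", "gloss clear"), ("B1", "mascara black")]
cats = {"A1": "lip", "A2": "lip", "B1": "eye"}
first, second = build_word_stocks(decls, cats, str.split)
```

Because a first-layer stock is the union of its codes' second-layer stocks, both layers stay consistent when new declaration data is added.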
According to the method and apparatus for predicting commodity codes provided by the embodiments of the application, the classification accuracy and the matching degree of commodity codes can be improved, further saving the time spent querying commodity codes.
The embodiment of the present application further provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements each process of the above-mentioned embodiment of the method for predicting commodity codes and can achieve the same technical effects, which are not repeated here. The computer readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or by means of hardware alone, although in many cases the former is preferred. Based on such understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many variations may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, and such variations also fall within the protection of the present application.

Claims (10)

1. A method of predicting commodity codes, comprising the steps of:
taking commodity information of cosmetics to be declared as a prediction sample, and calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic class;
predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
predicting the commodity code of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
2. The method of claim 1, wherein before the calculating of the word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, the method further comprises:
acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to commodity codes of each declaration data in the training sample;
and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
3. The method according to claim 1, wherein predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word bank, the prior probability of each cosmetic category, and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data, specifically includes:
according to the correlation between the cosmetic attributes and the cosmetic categories in the historical declaration data, calculating the weight coefficient of each feature vector in the prediction sample;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each cosmetic class j
Figure FDA0004100538240000027
Figure FDA0004100538240000022
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,
Figure FDA0004100538240000023
for the prior probability of cosmetic class j, tf ij For the feature vector x i First-layer word stock corresponding to cosmetic class jIs a frequency of occurrence in the first and second embodiments; idf (idf) i For the feature vector x i The frequency of the reverse text in the first-layer word stock; w (w) i For the feature vector x i Weight coefficient of (2);
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
4. The method of claim 1, wherein before the calculating of the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, the method further comprises:
classifying the multiple pieces of declaration data according to commodity codes of the multiple pieces of declaration data corresponding to each cosmetic class;
and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
5. The method according to claim 1, wherein predicting the commodity code to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code, and the correlation between the cosmetic attribute and the commodity code in the historical declaration data specifically includes:
according to the correlation between the cosmetic attributes and commodity codes in the historical declaration data, calculating the weight coefficient of each feature vector in the prediction sample;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each commodity code k
Figure FDA0004100538240000028
Figure FDA0004100538240000025
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,
Figure FDA0004100538240000026
the prior probability of k for commodity code tf ik For the feature vector x i The frequency of occurrence in a second-layer word stock corresponding to the commodity code k; idf (idf) i For the feature vector x i The frequency of the reverse text in the second-layer word stock; w (w) i For the feature vector x i Weight coefficient of (2);
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
6. An apparatus for predicting commodity codes, comprising:
the first calculation module is used for taking commodity information of cosmetics to be declared as a prediction sample, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category;
the first prediction module is used for predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the second calculation module is used for calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, and each second-layer word stock corresponds to one commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
and the second prediction module is used for predicting the commodity code to which the cosmetics to be declared belong according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
7. The apparatus as recited in claim 6, further comprising:
the first classification module is used for acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to the commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
8. The apparatus of claim 6, wherein
the first prediction module is specifically configured to calculate a weight coefficient of each feature vector in the prediction sample according to a correlation between a cosmetic attribute and a cosmetic category in the historical declaration data;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each cosmetic class j
Figure FDA0004100538240000047
Figure FDA0004100538240000042
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,
Figure FDA0004100538240000043
for the prior probability of cosmetic class j, tf ij For the feature vector x i The frequency of occurrence in the first layer word stock corresponding to cosmetic class j; idf (idf) i For the feature vector x i The frequency of the reverse text in the first-layer word stock; w (w) i For the feature vector x i Weight coefficient of (2);
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
9. The apparatus as recited in claim 6, further comprising:
the second classification module is used for classifying the plurality of pieces of declaration data according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
10. The apparatus of claim 6, wherein
the second prediction module is specifically configured to calculate a weight coefficient of each feature vector in the prediction sample according to the correlation between the cosmetic attribute and the commodity code in the historical declaration data;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each commodity code k
Figure FDA0004100538240000048
Figure FDA0004100538240000045
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,
Figure FDA0004100538240000046
the prior probability of k for commodity code tf ik For the feature vector x i The frequency of occurrence in a second-layer word stock corresponding to the commodity code k; idf (idf) i For the feature vector x i The frequency of the reverse text in the second-layer word stock; w (w) i Is of special interestSign vector x i Weight coefficient of (2);
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
CN202310174800.4A 2023-02-24 2023-02-24 Commodity coding prediction method and device Active CN116166805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310174800.4A CN116166805B (en) 2023-02-24 2023-02-24 Commodity coding prediction method and device

Publications (2)

Publication Number Publication Date
CN116166805A true CN116166805A (en) 2023-05-26
CN116166805B CN116166805B (en) 2023-09-22

Family

ID=86413034


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120124050A1 (en) * 2010-11-16 2012-05-17 Electronics And Telecommunications Research Institute System and method for hs code recommendation
KR20140066486A (en) * 2012-11-23 2014-06-02 현대중공업 주식회사 An method for assigning hs code to materials
CN107704892A (en) * 2017-11-07 2018-02-16 宁波爱信诺航天信息有限公司 A kind of commodity code sorting technique and system based on Bayesian model
CN109598517A (en) * 2017-09-29 2019-04-09 阿里巴巴集团控股有限公司 Commodity clearance processing, the processing of object and its class prediction method and apparatus
CN110858219A (en) * 2018-08-17 2020-03-03 菜鸟智能物流控股有限公司 Logistics object information processing method and device and computer system
CN112529420A (en) * 2020-12-14 2021-03-19 深圳市钛师傅云有限公司 Intelligent classification method and system for customs commodity codes
CN113378167A (en) * 2021-06-30 2021-09-10 哈尔滨理工大学 Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
蒋登丽 (Jiang Dengli): "Research on Business Opportunity Data Classification Based on DOA Automatic Registration and an Improved Naive Bayes Algorithm", China Master's Theses Full-text Database, no. 2, pages 36-50 *
陈翠娟 (Chen Cuijuan): "An Improved Multinomial Naive Bayes Classification Algorithm and Its Python Implementation", Journal of Jingdezhen University, vol. 36, no. 3, pages 92-95 *

Also Published As

Publication number Publication date
CN116166805B (en) 2023-09-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant