CN116166805A - Commodity coding prediction method and device - Google Patents
- Publication number
- CN116166805A (application CN202310174800.4A)
- Authority
- CN
- China
- Prior art keywords
- cosmetic
- feature vector
- frequency
- commodity
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The application discloses a method and a device for predicting commodity codes. The method comprises the following steps: taking the commodity information of a cosmetic to be declared as a prediction sample, and calculating the word frequency-inverse text frequency (TF-IDF) of each feature vector in the prediction sample over each first-layer word bank; predicting the cosmetic category of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector over each first-layer word bank; calculating the word frequency-inverse text frequency of each feature vector in the prediction sample over each second-layer word bank; and predicting the commodity code of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector over each second-layer word bank. The method and the device can improve classification accuracy and the matching degree of commodity codes, and thereby save time when querying commodity codes.
Description
Technical Field
The application belongs to the technical field of big data, and particularly relates to a method and a device for predicting commodity codes.
Background
When cosmetics enterprises declare import and export goods, the HS (Harmonized System) code of the goods must be filled in on the customs declaration attached to the goods. The HS code is an international trade commodity classification system code, mainly used by customs personnel to confirm commodity categories, carry out commodity classification management, audit tariff standards and check commodity quality indexes. The HS coding system currently used in China consists of ten digits; usually one commodity corresponds to exactly one HS code, while one HS code may correspond to more than one commodity. Filling in the correct HS code can accelerate customs processing, ensure smooth clearance of the goods, and avoid extra cost or delay. If HS codes are wrongly classified, the normal order of customs is disturbed, and serious cases may incur administrative penalties from customs.
In order to fill in cosmetic commodity codes accurately, enterprise declarants need basic knowledge of HS code classification as well as of the properties, characteristics and uses of the commodity itself. This requires knowledge accumulated over years, and not every declarant can quickly and skilfully classify and distinguish every digit of a commodity's HS code. Currently, many websites on the network can query HS codes: they obtain a keyword entered by the user and return all relevant HS codes containing that keyword. However, the query results are numerous, span different categories, lack hierarchical relationships, and have a low matching degree, which increases the time cost for enterprises to query commodity codes.
Content of the application
The embodiment of the application aims to provide a method and a device for predicting commodity codes, which are used for solving the defect of low matching degree of commodity code query in the prior art.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, a method of predicting commodity coding is provided, comprising the steps of:
taking commodity information of cosmetics to be declared as a prediction sample, and calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic class;
predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
predicting the commodity code of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
In a second aspect, there is provided an apparatus for predicting commodity codes, comprising:
the first calculation module is used for taking commodity information of cosmetics to be declared as a prediction sample, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category;
the first prediction module is used for predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the second calculation module is used for calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, and each second-layer word stock corresponds to one commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
and the second prediction module is used for predicting the commodity code to which the cosmetics to be declared belong according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
According to the method and the device for predicting the commodity codes, the classification accuracy and the matching degree of the commodity codes can be improved, and the time for inquiring the commodity codes is further saved.
Drawings
FIG. 1 is a flow chart of a method for predicting commodity codes according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an apparatus for predicting commodity codes according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In order to solve the problems in the prior art, the embodiment of the application provides a method for searching cosmetic HS codes based on an improved naive Bayes classifier, which is improved from the following three aspects:
1. Because the number of commodity codes under the cosmetics category is large and a flat multi-class model is difficult to build, the embodiment of the application establishes a two-layer classification model: the first layer predicts the cosmetic category to which the commodity belongs, and the second layer predicts the specific commodity code to which the commodity belongs.
2. TF-IDF values are calculated in place of the conditional probabilities in the naive Bayes model, adding an evaluation of each term's importance to the classification.
3. By calculating the correlation between attributes and categories, different attributes are given different weights: the higher the correlation between an attribute and the categories, the greater the attribute's importance for classification, and therefore the higher its weight.
In order to achieve the above purpose, the embodiment of the present application provides the following technical solutions:
1. Acquire historical declaration data of cosmetics and establish a label extraction model based on the commodity information filled in by enterprises.
2. Establish a first-layer classification model based on the improved naive Bayes classifier to predict the cosmetic category to which the commodity belongs.
2.2 Acquire historical declaration data of cosmetics as training samples and establish five word banks according to the cosmetic categories.
2.3 Acquire the commodity information of the cosmetic to be declared, entered by the enterprise, as the prediction sample, and apply the label extraction of step 1 to it.
2.4 Define an improved naive Bayes classifier.
2.6 Calculate, for each feature vector x_i in the prediction sample X = {x_1, x_2, ..., x_n}, its TF-IDF value, i.e. the word frequency-inverse text frequency, used to evaluate the importance of words in the word bank.
2.7 Calculate the correlation between the cosmetic attributes and the cosmetic categories, and give each attribute a weight w_i according to its correlation; the higher the correlation, the greater the weight.
2.8 Based on the calculation results of steps 2.5, 2.6 and 2.7, calculate the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each category, then select the cosmetic category with the greatest probability as the prediction result of the first-layer classification model.
3. Establish a second-layer classification model based on the improved naive Bayes classifier to predict the specific commodity code to which the commodity belongs.
3.1 Based on the cosmetic category of the commodity predicted by the first-layer classification model, establish a second-layer classification model under that category and define the second-layer classification model categories L^2.
3.2 Establish a word bank for each category L^2 of the second-layer classification model based on the training samples.
3.3 Repeat steps 2.3 to 2.8 to calculate the maximum posterior probability; the category with the highest probability is the predicted commodity code of the commodity.
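The two-layer prediction flow of steps 2 and 3 can be sketched as follows. This is an illustrative Python sketch, not code from the patent; the scorer functions stand in for the trained first- and second-layer classifiers, and all names are assumptions.

```python
def predict_hs_code(features, layer1_scorers, layer2_scorers):
    """Two-layer prediction: pick the best cosmetic category first,
    then the best commodity code among that category's codes.

    layer1_scorers: {category: scorer}; layer2_scorers: {category: {code: scorer}}.
    A scorer maps a feature set to a (relative) probability score."""
    # First layer: choose the cosmetic category with the highest score.
    category = max(layer1_scorers, key=lambda c: layer1_scorers[c](features))
    # Second layer: only codes under the predicted category compete.
    code_scorers = layer2_scorers[category]
    code = max(code_scorers, key=lambda k: code_scorers[k](features))
    return category, code
```

Restricting the second layer to the codes under the predicted category is what keeps the per-layer class count small, as motivated in improvement 1 above.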
The method for predicting commodity codes provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of a method for predicting commodity coding according to an embodiment of the present application is provided, where the method includes the following steps:
Specifically, the weight coefficient of each feature vector in the prediction sample can be calculated according to the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each cosmetic class j/>
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,for the prior probability of cosmetic class j, tf ij For the feature vector x i The frequency of occurrence in the first layer word stock corresponding to cosmetic class j; idf (idf) i For the feature vector x i The frequency of the reverse text in the first-layer word stock; w (w) i For the feature vector x i Weight coefficient of (2);
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
In this embodiment, before calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each first layer word stock, historical declaration data of cosmetics may be further obtained as a training sample, and each declaration data in the training sample is classified according to a cosmetic class corresponding to a commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
Step 104: predict the commodity code to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each second-layer word bank, the prior probability of each commodity code, and the correlation between the cosmetic attributes and the commodity codes in the historical declaration data.
Specifically, the weight coefficient of each feature vector in the prediction sample can be calculated according to the correlation between the cosmetic attribute and commodity code in the historical declaration data;
The probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each commodity code k is calculated by the following formula:

P(L_k^2 | X) ∝ P(L_k^2) × ∏_{i=1}^{n} (tf_ik × idf_i)^{w_i}

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(L_k^2) is the prior probability of commodity code k; tf_ik is the frequency of occurrence of the feature vector x_i in the second-layer word bank corresponding to commodity code k; idf_i is the inverse text frequency of the feature vector x_i over the second-layer word banks; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
In this embodiment, before calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, the plurality of pieces of declaration data may be further classified according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
According to the method and the device for predicting the commodity codes, the classification accuracy and the matching degree of the commodity codes can be improved, and the time for inquiring the commodity codes is further saved.
Further, the technical solutions of the embodiments of the present application may be described in detail as follows:
1. Acquire historical declaration data of cosmetics and establish a label extraction model based on the commodity information filled in by enterprises, specifically comprising the following steps:
1.1 Define the cosmetic attributes Z = {z_1, z_2, ..., z_7}: commodity type, object of use, efficacy, packaging, specification, brand and composition respectively.
1.2 Divide the commodity information filled in by the enterprise into several pieces of information through word segmentation and attribute labeling. Specifically, a BERT+CRF model may be used to implement Chinese named entity recognition; the embodiments of the present application are not limited in this respect.
1.3 De-duplicate repeated word-segmentation results and extract the commodity type, object of use, efficacy, packaging, specification, brand and composition attributes.
Taking "OLAY rinse water bloom | use: facial moisturizing and whitening | packaging specification: 50G/bottle | brand: OLAY" as an example, the label extraction model outputs "OLAY - brand, cream - commodity type, face - object of use, moisturizing and whitening - efficacy, G - specification, bottle - packaging".
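A minimal sketch of the field-splitting part of such label extraction; the real model uses word segmentation and BERT+CRF, so this simplified version (with illustrative function and key names) only splits the "|"-delimited declaration string into attribute fields:

```python
def extract_fields(info: str) -> dict:
    """Split a '|'-delimited commodity-information string into
    key/value fields; parts without ':' are kept under 'name'."""
    fields = {}
    for part in info.split("|"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
        elif part.strip():
            # Free-text fragment without a label, e.g. the product name.
            fields.setdefault("name", part.strip())
    return fields
```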
2. Based on an improved naive Bayes classifier, a first-layer classification model is established, and the cosmetic category to which the commodity belongs is predicted, specifically comprising the following steps:
2.1 Define the first-layer classification model categories L^1: lip cosmetics, eye cosmetics, nail cosmetics, powder cosmetics, and other beauty or skin-care cosmetics respectively.
2.2 Acquire historical declaration data of cosmetics as training samples and establish word banks according to the cosmetic categories, specifically comprising the following steps:
and classifying the data according to HS codes of each claim data to obtain data of 5 cosmetic categories. For example, declaration data of "33041000" 8 bits before HS encoding is classified into cosmetics for lips. And (3) extracting the label of the step (1) for the declaration information of declaration data in each cosmetic category to obtain 5 word libraries.
2.3 Acquire the commodity information of the cosmetic to be declared, entered by the enterprise, as the prediction sample. Apply the label extraction of step 1 to it to obtain the feature vector X = {x_1, x_2, ..., x_n}, where each feature x corresponds to an attribute z.
2.4 Define an improved naive Bayes classifier and calculate the probability of each cosmetic category.
Specifically, based on the principle of naive Bayes classification, the probability that the prediction sample belongs to cosmetic category j is

P(L_j^1 | X) = P(X | L_j^1) P(L_j^1) / P(X)

Since P(X) is constant for all cosmetic categories, maximizing the posterior probability P(L_j^1 | X) can be converted into maximizing P(X | L_j^1) P(L_j^1). Assuming that the feature vectors are mutually independent gives

P(X | L_j^1) = ∏_{i=1}^{n} P(x_i | L_j^1)

and the naive Bayes classification model is therefore

c = argmax_j P(L_j^1) ∏_{i=1}^{n} P(x_i | L_j^1)

wherein P(L_j^1) is the prior probability of each cosmetic category, and P(x_i | L_j^1) is the conditional probability, i.e. the probability that the word x_i occurs in the category-j word bank. From this formula, the more frequently the word x_i occurs in the category-j word bank, the greater the probability that the sample belongs to cosmetic category j. In reality, however, some frequently occurring common words may contribute little to distinguishing categories: "moisturizing", for example, may occur very frequently in every category, so using word frequency alone may reduce the accuracy of the classifier. Therefore the TF-IDF value is used in place of P(x_i | L_j^1).

In addition, the attributes in the naive Bayes model carry equal weight, but in the actual classification of cosmetics each attribute contributes differently to classification; in the first-layer classification model, for example, the object-of-use attribute plays the most important role in distinguishing cosmetic categories. Different attributes can therefore be given different weights to improve the accuracy of the Bayes model. Letting w_i be the weight of feature x_i, the improved naive Bayes model is

c = argmax_j P(L_j^1) ∏_{i=1}^{n} (tf_ij × idf_i)^{w_i}
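Under the reading that the attribute weights enter as exponents on the TF-IDF terms, the score for one category can be computed in log space for numerical stability. An illustrative Python sketch, with function and parameter names that are not from the patent:

```python
import math

def improved_nb_log_score(prior, tf_idf_pairs, weights):
    """log P(L_j) + sum_i w_i * log(tf_ij * idf_i): the weighted
    TF-IDF product of the improved model, evaluated in log space.
    tf_idf_pairs: list of (tf_ij, idf_i) per feature; weights: w_i.
    Assumes every tf_ij * idf_i is strictly positive."""
    score = math.log(prior)
    for (tf, idf), w in zip(tf_idf_pairs, weights):
        score += w * math.log(tf * idf)
    return score
```

Because log is monotonic, comparing these scores across categories picks the same argmax as comparing the products themselves.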
2.5 Calculate the prior probability of each cosmetic category, i.e. the proportion of declaration data with cosmetic category j among all declaration data in the training samples. The specific calculation formula is:

P(L_j^1) = |D_j| / |D|

wherein |D| is the total number of training samples and |D_j| is the number of training samples with cosmetic category j.
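Step 2.5 is a simple frequency count; a minimal sketch (names illustrative):

```python
from collections import Counter

def class_priors(labels):
    """P(L_j) = |D_j| / |D|: the fraction of training samples
    carrying each category label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}
```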
2.6 Calculate, for each feature vector x_i in the prediction sample X = {x_1, x_2, ..., x_n}, its TF-IDF value, i.e. the word frequency-inverse text frequency, used to evaluate the importance of a word in the word bank. If a word occurs frequently in one word bank and rarely in the other word banks, the word is considered to have good category-distinguishing capability. The specific calculation formula is:

TF-IDF_ij = tf_ij × idf_i

wherein tf_ij denotes the word frequency, i.e. the frequency of occurrence of word i in the category-j word bank, and idf_i denotes the inverse text frequency of word i, reflecting its frequency of occurrence in the other word banks.
The specific formula for tf_ij is:

tf_ij = n_ij / Σ_k n_kj

wherein n_ij is the number of training samples in which word i occurs in the category-j word bank, and Σ_k n_kj is the total number of training samples in the category-j word bank. If word i does not appear in the training set, the whole probability becomes 0; to solve this zero-probability problem, Laplace smoothing can be used to correct the frequency, and the corrected tf_ij is:

tf_ij = (n_ij + 1) / (Σ_k n_kj + m)

wherein m is the number of word banks.
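The Laplace-corrected frequency above can be sketched directly; note that the text only describes m as "the number of word banks", so treating it as a plain smoothing denominator term is a reading of that wording, and the function name is illustrative:

```python
def smoothed_tf(n_ij, total_j, m):
    """(n_ij + 1) / (total_j + m): Laplace-corrected frequency of
    word i in the category-j word bank; never exactly zero, so the
    product over features cannot collapse to 0 for unseen words."""
    return (n_ij + 1) / (total_j + m)
```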
The specific formula for idf_i is:

idf_i = log(|D| / (|{j : t_i ∈ d_j}| + 1))

wherein |D| is the total number of training samples, and |{j : t_i ∈ d_j}| is the number of training samples containing word i, increased by 1 to avoid division by 0.
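A direct transcription of the idf formula above (function and parameter names are illustrative):

```python
import math

def inverse_text_frequency(total_samples, samples_with_word):
    """idf_i = log(|D| / (df_i + 1)); the +1 in the denominator
    avoids division by zero for words absent from the training set."""
    return math.log(total_samples / (samples_with_word + 1))
```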
2.7 Calculate the correlation between the cosmetic attributes in the training samples and the cosmetic categories, and give each attribute a weight according to its correlation; the higher the correlation, the greater the weight.

Specifically, calculate the mutual information between each cosmetic attribute z_1, z_2, ..., z_7 and the cosmetic categories L^1, which measures the correlation between the two sets of events. The calculation formula is:

I(z_i; L^1) = Σ_k Σ_j P(z_{i,k}, L_j^1) log[ P(z_{i,k}, L_j^1) / (P(z_{i,k}) P(L_j^1)) ]

wherein P(z_{i,k}, L_j^1) is the joint probability distribution of the cosmetic attribute z_i and the cosmetic category L^1, and P(z_{i,k}) and P(L_j^1) are the marginal probability distributions of the cosmetic attribute z_i and the cosmetic category L^1 respectively.

Further, I(z_i; L^1) is taken as the weight coefficient of attribute z_i, i.e. w_i = I(z_i; L^1).
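The mutual-information weight of step 2.7 can be estimated from paired (attribute value, category) observations using empirical frequencies; an illustrative sketch:

```python
import math
from collections import Counter

def mutual_information(attr_values, labels):
    """I(Z; L) = sum over (z, l) of p(z, l) * log(p(z, l) / (p(z) p(l))),
    with all probabilities estimated as empirical frequencies from the
    paired samples."""
    n = len(labels)
    joint = Counter(zip(attr_values, labels))
    z_counts = Counter(attr_values)
    l_counts = Counter(labels)
    mi = 0.0
    for (z, l), c in joint.items():
        # p(z,l) / (p(z) p(l)) simplifies to c*n / (count_z * count_l).
        mi += (c / n) * math.log(c * n / (z_counts[z] * l_counts[l]))
    return mi
```

An attribute that perfectly determines the category gets maximal weight; an attribute independent of the category gets weight 0, matching the intent of improvement 3.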
2.8 Based on tf_ij, idf_i and w_i calculated in steps 2.5, 2.6 and 2.7, and the improved naive Bayes model formula

c = argmax_j P(L_j^1) ∏_{i=1}^{n} (tf_ij × idf_i)^{w_i}

calculate the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each category, then select the cosmetic category with the greatest probability as the prediction result of the first-layer classification model.
3. Establish a second-layer classification model based on the improved naive Bayes classifier to predict the specific commodity code to which the commodity belongs, specifically comprising the following steps:
3.1 Based on the cosmetic category of the commodity predicted by the first-layer classification model, establish a second-layer classification model under that category and define the second-layer classification model categories L^2. The lip cosmetics category contains 7 codes, eye cosmetics 7 codes, nail cosmetics 4 codes, powder cosmetics 2 codes, and other beauty or skin-care cosmetics 9 codes.
3.2 Acquire historical declaration data of cosmetics and establish a word bank for each category L^2 of the second-layer classification model, i.e. 7 word banks under lip cosmetics, 7 word banks under eye cosmetics, and so on.
3.3 Repeat steps 2.3 to 2.8 to calculate the maximum posterior probability; the category with the highest probability is the predicted commodity code of the commodity.
According to the embodiment of the application, a method for searching cosmetic HS codes based on an improved naive Bayes classifier is established, so that an enterprise need only input the commodity information of the cosmetic to be declared to receive the HS code with the highest matching degree, greatly saving the time enterprises spend querying commodity codes. By constructing a two-layer classification model in which the first layer predicts the cosmetic category and the second layer predicts the specific commodity code, classification accuracy is improved. The calculation of the correlation between attributes and categories increases the weight of attributes that contribute much to category discrimination and reduces the weight of attributes that contribute little, thereby weakening the influence of low-weight attribute words on the classification result and ensuring the stability of the classification effect.
As shown in fig. 2, a schematic structural diagram of an apparatus for predicting commodity codes according to an embodiment of the present application includes:
the first calculation module 210 is configured to calculate, using commodity information of the cosmetics to be declared as a prediction sample, word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, where each feature vector corresponds to an attribute of the cosmetics to be declared, and each first-layer word stock corresponds to a cosmetic category.
The first prediction module 220 is configured to predict, according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category, and the correlation between the cosmetic attribute and the cosmetic category in the historical reporting data, the cosmetic category to which the cosmetic to be reported belongs.
Specifically, the first prediction module 220 is configured to calculate the weight coefficient of each feature vector in the prediction sample according to the correlation between the cosmetic attributes and the cosmetic categories in the historical declaration data, and to calculate the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each cosmetic category j by the following formula:

P(L_j^1 | X) ∝ P(L_j^1) × ∏_{i=1}^{n} (tf_ij × idf_i)^{w_i}

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(L_j^1) is the prior probability of cosmetic category j; tf_ij is the frequency of occurrence of the feature vector x_i in the first-layer word bank corresponding to cosmetic category j; idf_i is the inverse text frequency of the feature vector x_i over the first-layer word banks; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
The second calculating module 230 is configured to calculate a word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, where each second-layer word stock corresponds to a commodity code included in the cosmetic category to which the cosmetic to be declared belongs.
The second prediction module 240 is configured to predict the commodity code to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code, and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
Specifically, the second prediction module 240 is configured to calculate the weight coefficient of each feature vector in the prediction sample according to the correlation between the cosmetic attributes and the commodity codes in the historical declaration data, and to calculate the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each commodity code k by the following formula:

P(L_k^2 | X) ∝ P(L_k^2) × ∏_{i=1}^{n} (tf_ik × idf_i)^{w_i}

wherein {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample; P(L_k^2) is the prior probability of commodity code k; tf_ik is the frequency of occurrence of the feature vector x_i in the second-layer word bank corresponding to commodity code k; idf_i is the inverse text frequency of the feature vector x_i over the second-layer word banks; and w_i is the weight coefficient of the feature vector x_i;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
Further, the apparatus comprises:
the first classification module is used for acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to the commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
The second classification module is used for classifying the plurality of pieces of declaration data according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
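The grouping step performed by the two classification modules can be sketched as follows; the record fields 'category', 'code', and 'labels' are hypothetical names for one declaration record's cosmetic category, commodity code, and extracted label terms:

```python
from collections import defaultdict

def build_word_stocks(declarations):
    """Group declaration records into the two word-stock layers.

    First layer: cosmetic category -> label terms of all records in
    that category. Second layer: commodity code -> label terms of all
    records under that code. The flat-dict record layout is assumed.
    """
    first_layer = defaultdict(set)   # cosmetic category -> label terms
    second_layer = defaultdict(set)  # commodity code -> label terms
    for rec in declarations:
        first_layer[rec["category"]].update(rec["labels"])
        second_layer[rec["code"]].update(rec["labels"])
    return dict(first_layer), dict(second_layer)
```

Because every commodity code belongs to exactly one cosmetic category, each second-layer word stock is effectively a refinement of one first-layer word stock.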
The method and apparatus for predicting commodity codes provided by the embodiments of the present application can improve the classification accuracy and the matching degree of commodity codes, thereby reducing the time spent querying commodity codes.
The embodiment of the present application further provides a computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above-described commodity code prediction method embodiment and can achieve the same technical effects; to avoid repetition, the details are not described here again. The computer readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.
Claims (10)
1. A method of predicting commodity codes, comprising the steps of:
taking commodity information of cosmetics to be declared as a prediction sample, and calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic class;
predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, wherein each second-layer word stock corresponds to a commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
predicting the commodity code of the cosmetic to be declared according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
2. The method according to claim 1, wherein before the calculating of the word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, the method further comprises:
acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to commodity codes of each declaration data in the training sample;
and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
3. The method according to claim 1, wherein predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word bank, the prior probability of each cosmetic category, and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data, specifically includes:
according to the correlation between the cosmetic attributes and the cosmetic categories in the historical declaration data, calculating the weight coefficient of each feature vector in the prediction sample;
the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each cosmetic class j is calculated by the following formula:

P(j|X) = P(j) · ∏_{i=1}^{n} (tf_ij · idf_i)^{w_i}

where {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample, P(j) is the prior probability of cosmetic class j, tf_ij is the frequency of the feature vector x_i in the first-layer word stock corresponding to cosmetic class j, idf_i is the inverse text frequency of the feature vector x_i in the first-layer word stocks, and w_i is the weight coefficient of the feature vector x_i;
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
4. The method according to claim 1, wherein before the calculating of the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, the method further comprises:
classifying the multiple pieces of declaration data according to commodity codes of the multiple pieces of declaration data corresponding to each cosmetic class;
and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
5. The method according to claim 1, wherein predicting the commodity code to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code, and the correlation between the cosmetic attribute and the commodity code in the historical declaration data specifically includes:
according to the correlation between the cosmetic attributes and commodity codes in the historical declaration data, calculating the weight coefficient of each feature vector in the prediction sample;
the probability that the prediction sample X = {x_1, x_2, ..., x_n} belongs to each commodity code k is calculated by the following formula:

P(k|X) = P(k) · ∏_{i=1}^{n} (tf_ik · idf_i)^{w_i}

where {x_1, x_2, ..., x_n} are the feature vectors in the prediction sample, P(k) is the prior probability of commodity code k, tf_ik is the frequency of the feature vector x_i in the second-layer word stock corresponding to commodity code k, idf_i is the inverse text frequency of the feature vector x_i in the second-layer word stocks, and w_i is the weight coefficient of the feature vector x_i;
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
6. An apparatus for predicting commodity codes, comprising:
the first calculation module is used for taking commodity information of cosmetics to be declared as a prediction sample, calculating word frequency-inverse text frequency of each feature vector in the prediction sample in each first-layer word stock, wherein each feature vector corresponds to one attribute of the cosmetics to be declared, and each first-layer word stock corresponds to one cosmetic category;
the first prediction module is used for predicting the cosmetic category to which the cosmetic to be declared belongs according to the word frequency-inverse text frequency of each feature vector in each first-layer word stock, the prior probability of each cosmetic category and the correlation between the cosmetic attribute and the cosmetic category in the historical declaration data;
the second calculation module is used for calculating the word frequency-inverse text frequency of each feature vector in the prediction sample in each second-layer word stock, and each second-layer word stock corresponds to one commodity code contained in the cosmetic category to which the cosmetic to be declared belongs;
and the second prediction module is used for predicting the commodity code to which the cosmetics to be declared belong according to the word frequency-inverse text frequency of each feature vector in each second-layer word stock, the prior probability of each commodity code and the correlation between the cosmetic attribute and the commodity code in the historical declaration data.
7. The apparatus as recited in claim 6, further comprising:
the first classification module is used for acquiring historical declaration data of cosmetics as a training sample, and classifying each declaration data in the training sample according to the cosmetic category corresponding to the commodity code of each declaration data in the training sample; and respectively extracting labels from the declaration information of the multiple declaration data corresponding to each cosmetic class to obtain a first-layer word stock corresponding to each cosmetic class.
8. The apparatus according to claim 6, wherein:
the first prediction module is specifically configured to calculate a weight coefficient of each feature vector in the prediction sample according to a correlation between a cosmetic attribute and a cosmetic category in the historical declaration data;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each cosmetic class j
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,for the prior probability of cosmetic class j, tf ij For the feature vector x i The frequency of occurrence in the first layer word stock corresponding to cosmetic class j; idf (idf) i For the feature vector x i The frequency of the reverse text in the first-layer word stock; w (w) i For the feature vector x i Weight coefficient of (2);
and comparing the probabilities that the prediction samples belong to the cosmetic categories, and taking the cosmetic category with the highest probability as a prediction result.
9. The apparatus as recited in claim 6, further comprising:
the second classification module is used for classifying the plurality of pieces of declaration data according to commodity codes of the plurality of pieces of declaration data corresponding to each cosmetic class; and respectively extracting labels from the declaration information of each declaration data corresponding to each commodity code to obtain a second-layer word stock corresponding to each commodity code.
10. The apparatus according to claim 6, wherein:
the second prediction module is specifically configured to calculate a weight coefficient of each feature vector in the prediction sample according to the correlation between the cosmetic attribute and the commodity code in the historical declaration data;
the prediction samples x= { X are calculated by the following formulas, respectively 1 ,x 2 ,...,x n Probability of belonging to each commodity code k
Wherein { x 1 ,x 2 ,...,x n Is a plurality of feature vectors in the prediction samples,the prior probability of k for commodity code tf ik For the feature vector x i The frequency of occurrence in a second-layer word stock corresponding to the commodity code k; idf (idf) i For the feature vector x i The frequency of the reverse text in the second-layer word stock; w (w) i Is of special interestSign vector x i Weight coefficient of (2);
and comparing the probability that the prediction sample belongs to each commodity code, and taking the commodity code with the highest probability as a prediction result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310174800.4A CN116166805B (en) | 2023-02-24 | 2023-02-24 | Commodity coding prediction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116166805A true CN116166805A (en) | 2023-05-26 |
CN116166805B CN116166805B (en) | 2023-09-22 |
Family
ID=86413034
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310174800.4A Active CN116166805B (en) | 2023-02-24 | 2023-02-24 | Commodity coding prediction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116166805B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120124050A1 (en) * | 2010-11-16 | 2012-05-17 | Electronics And Telecommunications Research Institute | System and method for hs code recommendation |
KR20140066486A (en) * | 2012-11-23 | 2014-06-02 | 현대중공업 주식회사 | An method for assigning hs code to materials |
CN107704892A (en) * | 2017-11-07 | 2018-02-16 | 宁波爱信诺航天信息有限公司 | A kind of commodity code sorting technique and system based on Bayesian model |
CN109598517A (en) * | 2017-09-29 | 2019-04-09 | 阿里巴巴集团控股有限公司 | Commodity clearance processing, the processing of object and its class prediction method and apparatus |
CN110858219A (en) * | 2018-08-17 | 2020-03-03 | 菜鸟智能物流控股有限公司 | Logistics object information processing method and device and computer system |
CN112529420A (en) * | 2020-12-14 | 2021-03-19 | 深圳市钛师傅云有限公司 | Intelligent classification method and system for customs commodity codes |
CN113378167A (en) * | 2021-06-30 | 2021-09-10 | 哈尔滨理工大学 | Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing |
Non-Patent Citations (2)
Title |
---|
JIANG DENGLI: "Research on Business Opportunity Data Classification Based on DOA Automatic Registration and an Improved Naive Bayes Algorithm", China Master's Theses Full-text Database, no. 2, pages 36-50 *
CHEN CUIJUAN: "An Improved Multinomial Naive Bayes Classification Algorithm and Its Python Implementation", Journal of Jingdezhen University, vol. 36, no. 3, pages 92-95 *
Also Published As
Publication number | Publication date |
---|---|
CN116166805B (en) | 2023-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||