CN110858219A

CN110858219A - Logistics object information processing method and device and computer system

Info

Publication number: CN110858219A
Application number: CN201810943287.XA
Authority: CN
Inventors: 郑恒; 张振华; 李驰
Original assignee: Cainiao Smart Logistics Holding Ltd
Current assignee: Cainiao Smart Logistics Holding Ltd
Priority date: 2018-08-17
Filing date: 2018-08-17
Publication date: 2020-03-03
Also published as: TW202009748A; WO2020034880A1

Abstract

The embodiment of the application discloses a method, a device and a computer system for processing logistics object information, wherein the method comprises the following steps: determining text description information of a target object to be classified, processing the text description information, and determining contained target feature words; generating a characteristic word vector corresponding to the target logistics object according to the inclusion condition of the text description information on each target characteristic word; and inputting the feature word vector into a coding classification model to obtain corresponding classification feature information. Through the embodiment of the application, the automatic classification of the logistics object codes can be realized, and the labor cost is reduced while the error probability is reduced.

Description

Logistics object information processing method and device and computer system

Technical Field

The present application relates to the field of logistics object information processing technologies, and in particular, to a method, an apparatus, and a computer system for processing logistics object information.

Background

Hscode (The standardization System Code, Commodity name and Code coordination System Code, customs Code for short) is core data which must be provided to customs in The customs clearance link, and relates to The tax rate of export and tax refund. HS adopts six-digit coding, and divides all international trade commodities into 22 types and 98 chapters. The chapter is subdivided into categories and subdivisions. The first and second digit of the commodity code represent "chapter", the third and fourth digit represent "order", and the fifth and sixth digit represent "sub-order". The first 6 digits are the HS international standard code, HS has 1241 four-digit tax and 5113 six-digit sub-categories.

In the import and export trade process of cross-border e-commerce and the like, HScode classification needs to be carried out on specific commodities so as to be convenient for customs clearance. The HScode classification is to give HScode to which a commodity belongs according to specific information (character description, picture, etc.) of the commodity and classification of the HScode. Different from the classification of commodities by systems such as common E-commerce and the like, the degree of subdividing the commodities in the HScode classification process is deeper. For example, the same garment type, different materials, different styles and even different weaving methods can correspond to different Hscodes. Therefore, Hscode categorization of commodities is a cumbersome process.

At present, most enterprises in the industry use a manual pre-classification method, but even a customs expert with abundant working experience needs to take about 2-15 minutes for classifying a sku (minimum stock quantity unit of a commodity), and for some extremely complex commodities, the time of several hours or even longer is needed, and the day is at most 200 skus. Moreover, due to the fact that corresponding pre-classification professionals are rare, the pre-classification learning threshold is high and the like, the manual pre-classification mainly has the problems of high classification cost, low classification timeliness, long response time and the like. According to statistics, the cost of classifying one sku is 200-500 RMB at present, and the pre-classification cost of some special products such as electromechanical products is up to more than 1500 RMB. However, in some large cross-border e-commerce trading platforms, the number of skus involved is large, even billions, and this cost is obviously unacceptable. In addition, in the face of the ambitious goal of achieving the time efficiency by the single amount of ten million of the cross-border E-commerce days of B2C and the '72-hour achievement' or even shorter delivery time efficiency, the processing capacity of 200 skus of people per day in the above mode is obviously incapable of responding and meeting the requirement in time. Moreover, manual classification is too dependent on the experience of experts, errors are inevitable in the face of huge daily classification workload, and along with the increase of workload, the error rate is gradually increased, so that customs challenges are caused to enterprise commodity customs declaration commodities, and enterprise qualification is influenced.

Therefore, how to more efficiently classify commodities, reduce the cost and the error probability is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application provides a logistics object information processing method, a logistics object information processing device and a computer system, which can realize automatic classification of logistics objects, reduce labor cost and reduce error probability.

The application provides the following scheme:

a logistics object information processing method comprises the following steps:

determining text description information of a target object to be classified, processing the text description information, and determining contained target feature words;

generating a characteristic word vector corresponding to the target logistics object according to the inclusion condition of the text description information on each target characteristic word;

and inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.

A method of generating a code classification model, comprising:

collecting training samples, wherein each training sample comprises a corresponding relation between known logistics object text description information and codes;

performing word segmentation processing on the text description information in the training sample, and filtering out invalid words to obtain feature words;

summarizing and de-duplicating the feature words obtained from each training sample to obtain a feature word set, and respectively allocating corresponding serial numbers to each feature word;

generating a feature word vector corresponding to each training sample according to the inclusion condition of the feature words on each sequence number in each training sample;

respectively inputting the feature word vectors corresponding to a plurality of training samples associated with the same code into a preset machine learning model for training to obtain a code classification model corresponding to each code; the feature word weight vector corresponding to each code is stored in the code classification model; the feature word weight vector records the discrimination weight value of each feature word pair association code.

A logistics object information processing apparatus comprising:

the target logistics object information determining unit is used for determining text description information of a target logistics object to be classified, processing the text description information and determining contained target feature words;

the characteristic vector generating unit is used for generating a characteristic word vector corresponding to the target logistics object according to the inclusion condition of the text description information on each target characteristic word;

and the classification characteristic information acquisition unit is used for inputting the characteristic word vectors into a coding classification model and acquiring corresponding classification characteristic information.

An apparatus for generating a code classification model, comprising:

the system comprises a sample collection unit, a storage unit and a display unit, wherein the sample collection unit is used for collecting training samples, and each training sample comprises the corresponding relation between the known logistics object text description information and the code;

the characteristic word determining unit is used for performing word segmentation processing on the text description information in the training sample and filtering out invalid words to obtain characteristic words;

the characteristic vocabulary total unit is used for summarizing and de-duplicating the characteristic words obtained from each training sample to obtain a characteristic word set and respectively allocating corresponding serial numbers to each characteristic word;

the characteristic word vector generating unit is used for generating a characteristic word vector corresponding to each training sample according to the inclusion condition of the characteristic words on each sequence number in each training sample;

the training unit is used for inputting the feature word vectors corresponding to a plurality of training samples associated with the same code into a preset machine learning model for training to obtain a code classification model corresponding to each code; the feature word weight vector corresponding to each code is stored in the code classification model; the feature word weight vector records the discrimination weight value of each feature word pair association code.

A computer system, comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:

According to the specific embodiments provided herein, the present application discloses the following technical effects:

according to the embodiment of the application, the coding classification model can be predetermined, so that for the target logistics object to be classified, the text description information can be obtained and processed to determine the contained target feature words, then the feature word vector corresponding to the target logistics object is generated according to the inclusion condition of the text description information on each target feature word, and then the feature word vector can be input into the coding classification model to obtain the corresponding classification feature information. In this way, automatic classification of logistics objects can be achieved without relying on manual classification, and therefore, efficiency and accuracy can be improved.

In an optional implementation scheme, a classification model for each HScode can be obtained through collection and processing of training samples and machine learning training, and specifically can be represented by a feature word weight vector, where a discrimination weight value of each feature word for an associated HScode is recorded. Therefore, when a certain target data object is predicted, the text description information of the target data object can be subjected to word segmentation and other processing, the characteristic words contained in the text description information can be determined, and the characteristic word vector can be generated. Thus, the feature word vector can be input into a classification model which is trained before, so that the probability that the target commodity object is classified into each HScode can be calculated, and suggestion information can be given according to the probability, for example, one or more suggested HScodes can be given, and the like. By the method, the process of HScode classification of the target data object does not depend on experts completely any more, labor cost can be reduced, classification efficiency is improved, the method is not limited by experience and personal ability of the experts, and error rate is reduced.

Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

FIG. 1 is a schematic overall framework diagram provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a system provided by an embodiment of the present application;

FIG. 3 is a flow chart of a prediction method provided by an embodiment of the present application;

FIG. 4 is an interface schematic diagram of a classification tool provided by an embodiment of the present application;

FIG. 5 is a flow chart of a model training method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a first apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a second apparatus provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a computer system provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.

In the embodiment of the application, in order to improve the logistics object classification efficiency and reduce the labor cost, a coding classification model may be established in advance, and for example, the coding classification model may specifically include a logistic regression model, a decision tree model, a neural network model, and the like. If the code classification model is a logistic regression model, the code classification model can be established through machine learning, and then the code classification model automatically classifies the logistics objects (particularly commodity objects and the like) so as to determine the implementation mode of the corresponding codes. Specifically, as shown in fig. 1, some training samples, specifically, the corresponding relationship between the text description information of the known commodity object and the codes such as the HScode, may be collected, and then, the data is processed according to the characteristics of the training samples and the training target, and is input into a specific machine learning model for training, and finally, a specific code classification model is established. Then, the codes of the specific logistics objects can be predicted by using the code classification model. The specific prediction result may be directly used as the encoding classification result, or may also be used as a reference for the encoding classification result, and so on.

In specific implementation, after the code classification model is established, the code classification model can be provided for merchant users, customs Clearance Partners (CP) of a cross-border network sales system, customs departments and the like to be used in the logistics object clearance link, so that manual classification in a traditional mode is replaced or partially replaced by machine classification, the logistics object clearance efficiency is improved, and the enterprise classification cost is reduced. In addition, the technical threshold of using the model is lowered for the convenience of the user. As shown in fig. 2, an interfacing classification tool (which may be an online tool or an application that may be installed locally) may be further developed on the basis of the coding classification model, so that a user may input text description information of a target logistics object to be classified by using a control such as an input box provided in the interface, and the classification tool may automatically process the text description information and call a pre-configured coding classification model to give a final classification suggestion.

Specific implementations are described in detail below.

Example one

First, from the perspective of the foregoing classification tool, an embodiment of the present application provides a method for processing logistics object information, in which a code classification model may be obtained first, and for example, the method may specifically include a logistic regression model, a decision tree model, a neural network model, and the like. For the logistic regression model, a feature word weight vector corresponding to each code may be stored in a specific code classification model. For example, for a customs code HScode, what is specifically stored in the code classification model may be a feature word weight vector corresponding to each HScode; and the feature word weight vector records the discrimination weight value of each feature word pair associated with the HScode.

In the case of the above logistic regression model, for HScode, a specific code classification model may be established in advance, and in one specific implementation, the step of establishing the model may first include:

the method comprises the following steps: collecting training samples, wherein each training sample comprises a corresponding relation between known logistics object text description information and HScode;

in specific implementation, labeled data in the logistics object historical classification records can be collected, for example, the labeled data can include import and export tax regulations of the people's republic of China, historical customs data, expert labeled data and the like.

For example, the information recorded in the import and export tax Law of the people's republic of China may be as shown in Table 1 (only one shown):

TABLE 1

Of course, since the description of the goods recorded in the specific tax law does not usually refer to a certain goods, the historical clearance data in the cross-border network sales system can be used as a supplement in order to supplement the information in the tax law better. For example, a piece of historical clearance data may be as shown in table 2:

TABLE 2

That is to say, the historical clearance data records the corresponding relationship between the name and the like of the specific logistics object and the HScode, so that the data is included in the specific training sample, and the more accurate HScode prediction for the specific logistics object can be facilitated.

In addition, besides the above tax regulations and historical customs data, the method may also be supplemented with expert annotation data, for example, one piece of expert annotation data may be as shown in table 3:

TABLE 3

In summary, the acquisition of training samples can be performed in a variety of ways. Of course, since the collected data is usually some historical data, in practical applications, some HScode changes may be involved, or the hscodes are disabled, split, and the like, so that some historical data may be invalid information for subsequent classification. For this reason, in a preferred embodiment, the training samples may also be subjected to data washing, so as to perform training of the classification model using the remaining valid training samples.

The specific data cleaning process may include modifying the changed HScode, deleting the training samples corresponding to the deactivated HScode, redetermining the HScode in the training samples corresponding to the split HScode, and the like. During specific implementation, new and old HScode mapping relation information can be stored in advance; in this way, after the specific training samples are collected, each training sample may be traversed first to determine whether an old HScode appears in the hscodes in each sample. And replacing the training sample with the old HScode according to the mapping relation, and adding the training sample into the training sample set as an effective training sample. For example, for a certain type of logistics object, the HScode defined in the previous tax rule is 6110110000, and after the tax rule is modified later, the HScode of the type of logistics object is modified to 6110110011, so that the mapping relationship can be saved. After the training samples are collected, if the HScode contained in a certain training sample is found to be 6110110000, the HScode can be modified to 6110110011 according to the stored mapping relation, and then the certain training sample can be made to be valid data.

In addition, a deactivated HScode list can be preserved in advance; in this way, after the training samples are acquired, the hscodes in the training samples can be traversed, and the training samples with deactivated hscodes are deleted. For example, a certain HScode is 6110110027, and after the modification of the tax rule, the category corresponding to the HScode is deleted, and accordingly, the HScode is deactivated. Therefore, such HScode can be recorded, and after a specific training sample is acquired, if the HScode in a certain training sample is found to be 6110110027, the training sample corresponding to the HScode can be deleted.

Moreover, there may be a case of splitting the HScode, for example, in the tax rule of the old version, the HScode corresponding to a certain category is 6110110000, and after the tax rule is modified, the category is refined and split into two sub-categories, which correspond to HScode6110110001 and 6110110002, respectively, and the original HScode6110110000 is no longer used. Therefore, a list of split HScode information can be also pre-stored, wherein each piece of split HScode information comprises an HScode before splitting and a plurality of corresponding hscodes after splitting; therefore, after specific training data are collected, traversing can be carried out on the HScodes in all the data, and the training samples of the HScodes before splitting are extracted, so that the training samples are added into the training sample set as effective training samples after the split HScodes are determined again. For example, where the HScode in a training sample is found to be 6110110000, it can be extracted. And then, determining the disassembled HScode for the training sample again by means of expert confirmation and the like, and then replacing the HScode before the disassembly so that the training sample becomes valid data, and the like.

In addition, in addition to the above data cleaning, manual verification can be performed by performing random sampling on the training samples, so that the quality of the training samples is improved as much as possible, and the accuracy of the finally trained model is improved.

Step two: performing word segmentation processing on the text description information in the training sample, and filtering out invalid words to obtain feature words;

after the training sample is subjected to data washing or the like, the next processing may be performed. Specifically, because the training sample has text description information of the data object, the text description information may be a title of the specific logistics object, or may also be a text description given in tax rules, a declaration element when a merchant declares a customs, and the like. In a word, the corresponding relation between the text description information and the HScode is recorded in each training sample. The purpose of machine learning is to find out regularity information from a plurality of text description information corresponding to the same HScode so as to predict the HScode. In particular, when processing the text description information, word segmentation processing may be firstly included, that is, the text description information is usually a sentence or a paragraph, and the purpose of word segmentation is to divide the text description information into a plurality of words.

For example, the text description information in a certain training sample is: new style wool sweater women shawl overcoat thin knitted sweater short style V neck little sweater loose big sign indicating number sweater in spring and autumn. The word segmentation result obtained after the word segmentation processing can be as follows: spring and autumn/new style/wool/sweater/woman/shawl/coat/thin/sweater/short/V-neck/small sweater/loose/big/sweater. The specific word segmentation processing method may refer to a scheme in the prior art, and is not described herein again.

After the word segmentation is completed, the words which are irrelevant to the classification can be filtered out, and only the effective characteristic words are left. In order to achieve the purpose, named entity recognition can be carried out on the vocabulary obtained by the word segmentation result of the text description information, and the vocabulary irrelevant to the logistics object classification is filtered according to the named entity recognition result. For example, also assume that the textual description information of a certain training sample is: in the new spring and autumn wool sweater female shawl jacket thin sweater short V-neck small sweater loose large-size sweater, after word segmentation and named entity recognition, the obtained result can be as follows:

spring and autumn [ season ]/new style [ shopping guide words ]/wool [ materials ]/sweater [ types ]/woman [ population ]/shawl [ style ]/coat [ types ]/thin [ style ]/knitting [ weaving method ]/short style ]/V collar [ style ]/small sweater [ types ]/loose [ style ]/big-size [ style ]/sweater [ types ]

After the named entity is identified, words which are irrelevant to classification, such as seasons, shopping guide, styles and the like, can be removed, words which are relevant to classification, such as categories, materials, weaving methods and the like are left, and subsequent characteristic processing is facilitated. The words left can be called feature words because the words can better reflect the features of specific logistics objects when HScodes are classified.

Step three: summarizing and de-duplicating the feature words obtained from each training sample to obtain a feature word set, and respectively allocating corresponding serial numbers to each feature word;

after the feature words are obtained, the feature words in each training sample can be summarized and deduplicated to obtain a feature word set, and in addition, in order to express the text description information in each training sample in a vector manner, probability calculation can be performed in a vector calculation manner in the following process, and corresponding serial numbers can be respectively allocated to each feature word. For example, assuming that feature words in each training sample are grouped together and there are ten thousand feature words, the feature words may be numbered from 1 to 10000 respectively. In this way, for each training sample, a corresponding feature word vector may be generated according to the feature word inclusion condition on each sequence number.

Step four: generating a feature word vector corresponding to each training sample according to the inclusion condition of the feature words on each sequence number in each training sample;

as described in step three, after the feature word sets are generated and the respective sequence numbers are obtained, when the text description information in each training sample is expressed, the feature word vector corresponding to each training sample can be generated according to the inclusion condition of each training sample on the feature words on each sequence number. That is, assuming that there are ten thousand feature words, each training sample may correspond to a feature word vector of ten thousand dimensions. The feature words are obtained by performing word segmentation and filtering on each training sample and then summarizing, so that the feature words contained in each training sample are always in the feature word set. That is, the feature words included in each training sample are a subset of the feature word set. For a certain training sample, the value of each element in the feature word vector can be determined according to whether a feature word exists on the corresponding sequence number. For example, the feature words included in a training sample are respectively No. 1, No. 12, No. 23, No. 25, No. 68, No. 1279, and so on, and the element value corresponding to the above sequence number in the feature word vector of the training sample is 1, and the element values on other sequence numbers are 0, so as to express which feature words are included in the training sample. Or, in another implementation, an initial weight may be given to the element value on each sequence number according to information such as an attribute of a specific feature word, and if a feature word on a certain sequence number exists in the training sample, the element value corresponding to the sequence number may be set as the initial weight of the feature word, which represents the importance degree of classifying the commodity category of the HScode corresponding to the feature word. For example, the generated feature word vector may be {1:0.2, 4:0.5, 12:0.6, 1009:0.3, 3801:0.2 … … }, that is, the training sample includes a number 1 feature word, a number 4 feature word, a number 12 feature word, a number 1009 feature word, a number 3801 feature word, and the like, and the initial weights of the respective initial weights are 0.2, 0.5, 0.6, 0.3, 0.2, and the like. In the above example, since there is no feature of another sequence number (for example, 2, 3, 5, 6 … …) in the training sample, the corresponding element value is 0, and the vector is not shown. In particular, for example, the element value of 0 is also present in a particular vector for the purpose of multiplication between vectors.

Of course, in a specific implementation, each training sample corresponds to a ten-thousand-dimensional or even larger-dimensional vector, and when performing calculation, a relatively large amount of calculation resources may be occupied, and since the number of feature words included in each training sample is usually very small relative to the total number of dimensions of the vector, most of the element values in the vector are 0, and thus, the calculation resources may be wasted. For this reason, in an alternative embodiment, hscodes may also be grouped in advance, for example, some commodity categories corresponding to hscodes have stronger similarity, and therefore, may be grouped into a group, and form a large class, and so on. The grouping basis for grouping the HScodes can also be information such as a category system defined in the network sales system, so that the category system defined in the network sales system can be associated with the customs HScodes, and more efficient classification prediction can be conveniently carried out in the subsequent prediction.

For example, the category system defined in the network sales system includes first-level categories such as clothing, daily necessities, household appliances, computer consumables, and the like, each first-level category further includes a plurality of second-level categories, each second-level category further includes third-level categories, and the like, and finally reaches the leaf categories. When grouping the hscodes, the grouping can be performed according to a certain class in a specific class system, and according to different class levels, the number of groups of the divided hscodes is different, and the number of the hscodes included in each group is also different. The selection can be specifically carried out according to actual requirements.

After the grouping is carried out in the mode, the training of the classification model can be carried out by taking each group as a unit, so that the number of training samples in each group is reduced, the total amount of corresponding feature words is reduced, and finally the dimensionality of the feature word vector corresponding to each training sample is reduced, thereby reducing the calculated amount and improving the training efficiency.

Step five: and respectively inputting the feature word vectors corresponding to a plurality of training samples associated with the same HScode into a preset machine learning model for training to obtain a classification model corresponding to each HScode.

After the feature word vectors of the training samples are obtained, the feature word vectors corresponding to a plurality of training samples associated with the same HScode can be input into a preset machine learning model for training, that is, assuming that 1000 training samples corresponding to a certain HScode in the training samples have a total number, the feature vectors corresponding to the 1000 training samples can be input into the machine learning model for training. The specific machine learning model may be various, and for example, may include but is not limited to classification models such as SVM, LR, naive bayes, maximum entropy, and deep learning methods such as lstm + softmax, etc. And after multiple rounds of iteration until the algorithm is converged, obtaining the classification model corresponding to the HScode. The classification model can also be represented by a vector, for example, { f1: w1, f2: w2, f3: w3, f4: w4, f5: w5, f6: w6 … }, where fn represents the sequence number of a specific feature word and wn represents the corresponding weight. That is, for a certain HScode, the training result is used for expression, and the importance degree of the feature word corresponding to each sequence number to the HScode is high.

In a word, after machine learning training, each HScode may correspond to a feature word weight vector, and in the respective feature word weight vectors, the weights corresponding to feature words with the same sequence number may be different. The trained classification model may be stored persistently in a storage medium such as a disk, or, as described above, an interfacing classification tool may be generated from the model and provided to various users.

Of course, other types of classification models, such as decision tree models, neural network models, etc., may be used in addition to the logistic regression model described above. For example, for the decision tree model, a decision process may be made in a multi-tree model based on word features, and based on the splitting threshold value stored in each tree and the features of the feature word vector, which leaf node of each tree the logistics object belongs to is decided, so as to decide the probability that the logistics object is classified into each potential HScode and other codes corresponding to the category. For the neural network model, a specific code classification model can have multiple layers of nonlinear change units, the nonlinear change judging unit of each layer is connected with the nonlinear change unit of the next layer in series, the nonlinear change unit of each layer stores characteristic weights based on characteristic word vectors or characteristic vectors derived from the characteristic word vectors, and the probability that the logistics objects are classified into the corresponding classes of each code is obtained through the interaction of the multiple layers of nonlinear change units. More specific details are not described in detail herein.

In addition, for other codes than the HScode, the code classification model can also be obtained in a similar manner.

The process of establishing the code classification model can be completed in advance, and after the process is completed, the specific model can be used for classifying the target logistics objects to be classified. Specifically, referring to fig. 3, the following steps may be included:

s301: determining text description information of a target object to be classified, processing the text description information, and determining contained target feature words;

from this step, the process of predicting the code of the specific target logistics object by using the code classification model is mainly used. Specifically, text description information of the target logistics object to be classified may be determined first, where the specific text description information may be obtained from information such as a title of the logistics object. In a specific implementation, if the interface tool is provided, as shown in fig. 4, an entry for inputting text description information of the target logistics object may also be provided in the interface, for example, the entry may be an input box or the like. Or, an entry for importing the text description information of a plurality of target logistics objects in batch may be provided, so that a user may sort the text description information of the logistics objects to be classified in advance through an Excel table or the like, and may name the data columns in the table according to predefined field names or the like. Then, the text description information of each logistics object recorded in the table can be imported into the tool through the batch operation entry, and the like. In addition, whether the text description information of the target logistics object is input singly or imported in batch, the specific target logistics object may be a logistics object waiting for customs clearance, for example, text object information such as a title of the target logistics object extracted from a specific cross-border order, and the like.

S302: generating a characteristic word vector corresponding to the target logistics object according to the inclusion condition of the text description information on each target characteristic word;

in the text description information of the target data object, the text description information may be processed, and the specific processing mode may be the same as the processing mode of the text description information in the training sample. For example, word segmentation processing can also be performed to filter out invalid words therein and determine the remaining valid words as target feature words. Then, similarly, a feature word vector corresponding to the target logistics object can be generated according to the inclusion condition of the text description information on each target feature word. Specifically, the feature word vector corresponding to the target logistics object may be generated according to the inclusion condition of the feature words on each sequence number in the text description information of the target logistics object. For example, the text description information of the target logistics object includes a feature word No. 1, a feature word No. 5, a feature word No. 8, a feature word No. 27, and so on, and the element value corresponding to each serial number may be 1, or a preset initial weight value, and the element values corresponding to other serial numbers are 0. Of course, in practical applications, the text description information of the target logistics object to be predicted may include words that are not included in the training process, and such words may be filtered out and do not need to be input into a specific classification model. After the prediction is completed, whether the vocabulary is related to the HScode classification or not can be determined according to named entity information of the vocabulary and the like, if so, the vocabulary can be added into a corresponding characteristic word set as a characteristic word, and the model can be trained again, and the like.

It should be noted here that the dimension of the feature word vector generated for the target logistics object is consistent with the number of feature words in the feature word set during training. For example, if feature words in all training samples are collected together to form a feature word set during training, where the number of the feature words included is N, the feature word vector corresponding to the target logistics object to be predicted may also be an N-dimensional vector. In addition, if the hscodes are grouped during training, and the feature words in the training samples corresponding to the hscodes in each group are summarized, the number of the feature words in each group is also reduced. In this case, specifically, before generating the feature word vector for the target logistics object, a group to which the target logistics object belongs may be determined first, for example, if the hscodes are grouped according to a category system in a certain network sales system, a corresponding HScode group may be determined according to a category to which the target logistics object belongs in the category system in the network sales system. Furthermore, the feature word vector of the current target logistics object can be determined by using the feature word set contained in the HScode group.

S303: and inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.

After the feature word vector corresponding to the target logistics object is determined, the feature word vector can be input into the coding classification model to obtain specific classification feature information. For example, in a specific implementation, for the case of performing HScode classification by using a logistic regression model, the feature word vector may be input into the code classification model, the probability that the target logistics object belongs to the category corresponding to each HScode is determined, and in addition, classification suggestion information may be provided according to the probability. Specifically, the feature word vector of the target logistics object may be multiplied by the feature word weight vector corresponding to each HScode (or may be adjusted by a certain offset value, etc.), so as to obtain a probability value that the target logistics object belongs to the category corresponding to each HScode. If the training is performed in a grouping way, the feature word vectors are input into the classification model, and meanwhile, the group information corresponding to the target logistics object can also be input into the classification model. In this way, only the feature word vector of the target logistics object needs to be operated with the feature word weight vector corresponding to each HScode in the group, and probability calculation does not need to be carried out on all HScodes, so that the calculation amount can be saved.

After the probability that the target logistics object belongs to the corresponding category of each HScode is obtained through calculation, corresponding classification suggestion information can be returned. For example, one or several hscodes with a probability higher than a preset threshold may be returned, so that the user may determine a specific HScode for the target logistics object according to the recommendation result.

In summary, according to the embodiment of the present application, a coding classification model may be predetermined, so that, for a target logistics object to be classified, text description information of the target logistics object may be obtained and processed to determine target feature words included, then, a feature word vector corresponding to the target logistics object is generated according to the inclusion condition of the text description information on each target feature word, and then, the feature word vector may be input into the coding classification model to obtain corresponding classification feature information. In this way, automatic classification of logistics objects can be achieved without relying on manual classification, and therefore, efficiency and accuracy can be improved.

In an optional embodiment, through the collection and processing of training samples and machine learning training, a classification model for each HScode can be obtained, and specifically can be represented by a feature word weight vector, where a discrimination weight value of each feature word for an associated HScode is recorded. Therefore, when a certain target data object is predicted, the text description information of the target data object can be subjected to word segmentation and other processing, the characteristic words contained in the text description information can be determined, and the characteristic word vector can be generated. Thus, the feature word vector can be input into the classification model of the previous training number, so that the probability that the target logistics object is classified into each HScode can be calculated, and suggestion information can be given according to the probability, for example, one or more suggested HScodes can be given, and the like. By the method, the process of classifying the target data object does not depend on experts completely any more, the labor cost can be reduced, the efficiency of classification is improved, and the method is not limited by the experience and personal ability of the experts.

Example two

The second embodiment provides a method for generating a code classification model, and referring to fig. 5, the method may specifically include:

s501: collecting training samples, wherein each training sample comprises a corresponding relation between known logistics object text description information and codes;

the code may specifically refer to the customs code HScode and the like described above.

S502: performing word segmentation processing on the text description information in the training sample, and filtering out invalid words to obtain feature words;

s503: summarizing and de-duplicating the feature words obtained from each training sample to obtain a feature word set, and respectively allocating corresponding serial numbers to each feature word;

s504: generating a feature word vector corresponding to each training sample according to the inclusion condition of the feature words on each sequence number in each training sample;

s505: and respectively inputting the feature word vectors corresponding to a plurality of training samples associated with the same code into a preset machine learning model for training to obtain a classification model corresponding to each code.

For the parts of the second embodiment that are not described in detail, reference may be made to the descriptions of the first embodiment, and details are not repeated here.

Corresponding to the first embodiment, an embodiment of the present application further provides a logistics object information processing apparatus, and referring to fig. 6, the apparatus may specifically include:

the target logistics object information determining unit 601 is configured to determine text description information of a target logistics object to be classified, process the text description information, and determine a target feature word included in the text description information;

the feature vector generating unit 602 is configured to determine text description information of a target object to be classified, process the text description information, and determine a target feature word included in the text description information;

a classification feature information obtaining unit 603, configured to input the feature word vector into a coding classification model, and obtain corresponding classification feature information.

The code classification model comprises a logistic regression model, a decision tree model and a neural network model.

And if the code classification model is a logistic regression model, storing a feature word weight vector corresponding to each code in the code classification model.

Specifically, the encoding package customs encoding hscodes stores feature word weight vectors corresponding to each customs encoding HScode in the encoding classification model; and the feature word weight vector records the discrimination weight value of each feature word pair associated with the HScode.

If the coding classification model is a decision tree model, the coding classification model stores a plurality of tree models, and based on the fact that each tree stores a split threshold and the characteristics of the feature word vectors, the probability that the target logistics object is classified into the corresponding category of each potential code is determined.

If the coding classification model is a neural network model, the coding classification model is provided with a plurality of layers of nonlinear change units, the nonlinear change judging unit of each layer is connected with the nonlinear change unit of the next layer in series, and the nonlinear change unit of each layer stores characteristic weights based on characteristic word vectors or characteristic vectors derived from the characteristic word vectors, so that the probability that the logistics objects are classified into the corresponding classes of each potential code is obtained through the interaction of the plurality of layers of nonlinear change units.

In a specific implementation, the classification feature information obtaining unit may be specifically configured to input the feature word vector into a code classification model, and determine a probability that the target logistics object is classified into a category corresponding to each potential code. And may be further configured to provide classification recommendation information based on the probability.

The code classification model is established in the following way:

the system comprises a sample collection unit, a data acquisition unit and a data processing unit, wherein the sample collection unit is used for collecting training samples, and each training sample comprises the corresponding relation between the known logistics object text description information and the HScode;

and the training unit is used for inputting the feature word vectors corresponding to a plurality of training samples associated with the same HScode into a preset machine learning model for training to obtain a classification model corresponding to each HScode.

In a specific implementation, the feature vector generating unit may be specifically configured to generate a feature word vector corresponding to the target logistics object according to a situation of inclusion of feature words in each sequence number in the text description information of the target logistics object.

When performing model training, the apparatus may further include:

and the data cleaning unit is used for cleaning data of the training samples after the training samples are collected so as to train the classification model by using the residual effective training samples.

Specifically, the data cleansing unit may be specifically configured to:

pre-storing new and old HScode mapping relation information; and replacing the training sample with the old HScode according to the mapping relation, and adding the training sample into the training sample set as an effective training sample.

Alternatively, the data cleansing unit may also be configured to:

pre-storing a deactivated HScode list; deleting training samples in which the deactivated HScode occurs.

Still alternatively, the data cleansing unit may also be configured to:

pre-storing a list of split HScode information, wherein each piece of split HScode information comprises an HScode before splitting and a plurality of corresponding split HScodes;

extracting the training sample of the HScode before splitting so as to determine the split HScode again and then adding the training sample into a training sample set as an effective training sample.

In a specific implementation, the apparatus may further include:

and the vocabulary filtering unit is used for carrying out named entity recognition on the vocabulary obtained by the word segmentation result of the text description information and filtering the vocabulary irrelevant to the logistics object classification according to the named entity recognition result.

In addition, specifically when performing model training, the apparatus may further include:

and the grouping unit is used for grouping the HScodes according to the category information of one level in a category system in a related network sales system to obtain a plurality of groups before summarizing and removing the characteristic words obtained from each training sample, wherein each group comprises a plurality of HScodes so as to summarize and remove the characteristic words by taking each HScode group as a unit, generate a characteristic vector and perform model training.

Specifically, the classification model may further store a correspondence between each group and the HScode;

in the prediction, the apparatus may further include:

the group determination unit is used for determining a corresponding HScode group for the target logistics object according to the category to which the target logistics object belongs under the network sales system;

the prediction unit may specifically be configured to:

and inputting the HScode group corresponding to the target logistics object and the characteristic word vector into the classification model so as to determine the probability that the target logistics object belongs to the category corresponding to each HScode under the group.

Corresponding to the second embodiment, an embodiment of the present application further provides an apparatus for generating a classification model of a customs code, referring to fig. 7, where the apparatus may specifically include:

a sample collection unit 701, configured to collect training samples, where each training sample includes a correspondence between known logistics object text description information and a code;

a feature word determining unit 702, configured to perform word segmentation processing on the text description information in the training sample, and filter out invalid words to obtain feature words;

the feature vocabulary general unit 703 is configured to perform summarization and deduplication processing on the feature words obtained in each training sample to obtain a feature word set, and assign corresponding sequence numbers to each feature word respectively;

a feature word vector generating unit 704, configured to generate a feature word vector corresponding to each training sample according to the inclusion condition of the feature word in each training sample on each sequence number;

the training unit 705 is configured to input feature word vectors corresponding to multiple training samples associated with the same code into a preset machine learning model for training, so as to obtain a code classification model corresponding to each code; the feature word weight vector corresponding to each code is stored in the code classification model; the feature word weight vector records the discrimination weight value of each feature word pair association code.

In addition, corresponding to the first embodiment of the present application, an embodiment of the present application further provides a computer system, including:

one or more processors; and

FIG. 8 illustrates an architecture of a computer system that may include, in particular, a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814, and a memory 820. The processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, and the memory 820 may be communicatively connected by a communication bus 830.

The processor 810 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present Application.

The Memory 820 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random access Memory), a static storage device, a dynamic storage device, or the like. The memory 820 may store an operating system 821 for controlling the operation of the computer system 800, a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 800. In addition, a web browser 823, a data storage management system 824, and a classification processing system 825, among others, may also be stored. The classification processing system 825 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided in the present application is implemented by software or firmware, the relevant program codes are stored in the memory 820 and called for execution by the processor 810.

The input/output interface 813 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The network interface 814 is used for connecting a communication module (not shown in the figure) to realize communication interaction between the device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).

Bus 830 includes a pathway for communicating information between various components of the device, such as processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820.

In addition, the computer system 800 may also obtain information of specific pickup conditions from the virtual resource object pickup condition information database 841 for performing condition judgment, and the like.

It should be noted that although the above-mentioned devices only show the processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, the memory 820, the bus 830, etc., in a specific implementation, the devices may also include other components necessary for normal operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The logistics object information processing method, device and computer system provided by the present application are introduced in detail, and a specific example is applied in the description to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, the specific embodiments and the application range may be changed. In view of the above, the description should not be taken as limiting the application.

Claims

1. A logistics object information processing method is characterized by comprising the following steps:

2. The method of claim 1, wherein the code classification model comprises a logistic regression model, a decision tree model, or a neural network model.

3. The method of claim 2, wherein if the code classification model is a logistic regression model, the code classification model holds a feature word weight vector corresponding to each code.

4. The method according to claim 3, wherein the code packet comprises customs codes HScodes, and the code classification model stores a feature word weight vector corresponding to each customs code HScode; and the feature word weight vector records the discrimination weight value of each feature word pair associated with the HScode.

5. The method according to claim 2, wherein if the coding classification model is a decision tree model, the coding classification model holds a plurality of tree models, and based on the fact that each tree holds a split threshold and the features of the feature word vectors, determines the probability that the target logistics object is classified into each potential coding corresponding category.

6. The method according to claim 2, wherein if the code classification model is a neural network model, the code classification model has multiple layers of non-linear variation units, the non-linear variation units of each layer are connected in series with the non-linear variation units of the next layer, and each layer of non-linear variation units holds feature weights based on or derived from feature word vectors, so as to obtain the probability that the logistics object is classified into each potential code corresponding category through the interaction of the multiple layers of non-linear variation units.

7. The method of claim 1,

the step of inputting the feature word vector into a coding classification model to obtain corresponding classification feature information comprises:

and inputting the characteristic word vector into a code classification model, and determining the probability that the target logistics object is classified into the corresponding class of each potential code.

8. The method of claim 7, further comprising:

and providing classification suggestion information according to the probability.

9. The method according to claim 3 or 7,

the code classification model is established in the following way:

collecting training samples, wherein each training sample comprises a corresponding relation between known logistics object text description information and a customs code HScode;

and respectively inputting the feature word vectors corresponding to a plurality of training samples associated with the same HScode into a preset machine learning model for training to obtain a code classification model corresponding to each HScode.

10. The method of claim 9,

the generating of the feature word vector corresponding to the target logistics object includes:

and generating a characteristic word vector corresponding to the target logistics object according to the inclusion condition of the characteristic words on each sequence number in the text description information of the target logistics object.

11. The method of claim 9,

after the training sample is collected, the method further comprises the following steps:

and performing data cleaning on the training samples so as to train the classification model by using the residual effective training samples.

12. The method of claim 11,

the performing data washing on the training sample comprises:

pre-storing new and old HScode mapping relation information;

and replacing the training sample with the old HScode according to the mapping relation, and adding the training sample into the training sample set as an effective training sample.

13. The method of claim 11,

the performing data washing on the training sample comprises:

pre-storing a deactivated HScode list;

deleting training samples in which the deactivated HScode occurs.

14. The method of claim 11,

the performing data washing on the training sample comprises:

15. The method of claim 9,

filtering out invalid words, including:

and carrying out named entity recognition on the vocabulary obtained from the word segmentation result of the text description information, and filtering the vocabulary which is irrelevant to the logistics object classification according to the named entity recognition result.

16. The method of claim 9,

before the feature words obtained in each training sample are summarized and deduplicated, the method further comprises the following steps:

and grouping the HScodes according to the category information of one level in a category system in a related network sales system to obtain a plurality of groups, wherein each group comprises a plurality of HScodes, so that the feature words are summarized and deduplicated, feature vectors are generated, and the model is trained by taking each HScode group as a unit.

17. The method of claim 16,

the code classification model also stores the corresponding relation between each group and HScode;

the method further comprises the following steps:

determining a corresponding HScode group for the target logistics object according to the category to which the target logistics object belongs under the network sales system category system;

the step of inputting the feature word vectors into the code classification model and determining the probability that the target commodity object belongs to the corresponding category of each HScode comprises the following steps:

and inputting the HScode group corresponding to the target commodity object and the characteristic word vector into the code classification model so as to determine the probability that the target logistics object belongs to the category corresponding to each HScode under the group.

18. A method of generating a code classification model, comprising:

19. A physical distribution object information processing apparatus, comprising:

20. An apparatus for generating a code classification model, comprising:

21. A computer system, comprising:

one or more processors; and