WO2020034880A1 - Logistics object information processing method, device and computer system - Google Patents

Logistics object information processing method, device and computer system

Info

Publication number
WO2020034880A1
WO2020034880A1 (PCT/CN2019/099552; CN2019099552W)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature word
hscode
classification model
logistics object
Prior art date
Application number
PCT/CN2019/099552
Other languages
French (fr)
Chinese (zh)
Inventor
郑恒
张振华
李驰
Original Assignee
菜鸟智能物流控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 菜鸟智能物流控股有限公司 filed Critical 菜鸟智能物流控股有限公司
Publication of WO2020034880A1 publication Critical patent/WO2020034880A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083 - Shipping
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083 - Shipping
    • G06Q10/0838 - Historical data

Definitions

  • HScode (The Harmonization System Code) is the customs commodity code that must be provided to customs during clearance.
  • HS uses a six-digit code to classify all internationally traded goods into 22 categories and 98 chapters. Chapters are subdivided into headings and subheadings. The first and second digits of a commodity code represent the chapter, the third and fourth digits the heading, and the fifth and sixth digits the subheading. The first six digits are the international standard HS code; HS has 1,241 four-digit headings and 5,113 six-digit subheadings.
  • HScode classification of specific commodities is required in order to facilitate customs clearance.
  • HScode classification means determining the HScode to which a commodity belongs based on its specific information (text description, pictures, etc.) and the HScode classification rules.
  • HScode classification subdivides commodities to a much finer degree than ordinary e-commerce category systems: within the same clothing category, different materials, different styles, and even different weaving methods correspond to different HScodes. HScode classification of commodities is therefore a very tedious process.
  • the application provides a logistics object information processing method, device, and computer system, which can automatically classify logistics objects, reducing labor costs and the probability of errors.
  • a logistics object information processing method includes: determining text description information of a target logistics object to be classified and processing it to determine the target feature words it contains; generating a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.
  • a method for generating a coding classification model includes:
  • collecting training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
  • aggregating and deduplicating the feature words obtained from all training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
  • inputting the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code; the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
  • a target logistics object information determining unit, configured to determine text description information of a target logistics object to be classified and process the text description information to determine the target feature words it contains;
  • a feature vector generating unit, configured to generate a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains;
  • a classification feature information acquiring unit, configured to input the feature word vector into a coding classification model to obtain corresponding classification feature information.
  • a device for generating a coding classification model includes:
  • a sample collection unit, configured to collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
  • a feature word aggregating unit, configured to aggregate and deduplicate the feature words obtained from all training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
  • a training unit, configured to input the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code; the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
  • a computer system includes:
  • one or more processors; and
  • a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:
  • the feature word vector is input into a coding classification model to obtain corresponding classification feature information.
  • a coding classification model can be determined in advance.
  • for a target logistics object to be classified, its text description information can be obtained and processed to determine the target feature words it contains.
  • a feature word vector corresponding to the target logistics object is generated according to which target feature words the text description information contains.
  • the feature word vector can then be input into the coding classification model to obtain the corresponding classification feature information.
  • FIG. 1 is a schematic diagram of an overall framework provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a system provided by an embodiment of the present application.
  • FIG. 4 is a schematic interface diagram of a classification tool provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a model training method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a second device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer system according to an embodiment of the present application.
  • the data is processed and input into specific machine learning models for training, and finally a specific coding classification model is established.
  • this code classification model can be used to predict the codes to which specific logistics objects belong.
  • the specific prediction result can be directly used as the encoding classification result, or it can also be used as a reference for the encoding classification result, and so on.
  • after the coding classification model is established, it can be provided to merchant users, the customs clearance partners (CPs) of a cross-border online sales system, and customs departments for use in the customs clearance of logistics objects, so that machine classification replaces or partially replaces traditional manual classification, improving the efficiency of customs clearance and reducing classification costs for enterprises.
  • to make the model easier to use, the technical threshold is lowered by providing an interface-based classification tool, either an online tool or a locally installable application. Users only need to enter the text description information of the target logistics object to be classified through the input box and other controls provided in the interface, and the classification tool automatically processes it and calls the pre-configured coding classification model to give a final classification suggestion.
  • a specific coding classification model may be established in advance.
  • the steps of establishing the model may first include:
  • for example, if the HScode of a type of logistics object was changed from 6110110000 to 6110110011 by a tariff revision, this mapping relationship can be saved. After training samples are collected, if a training sample is found to contain HScode 6110110000, it can be changed to 6110110011 according to the saved mapping, making the training sample valid data.
  • Step 4: Generate a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains;
  • each training sample corresponds to a vector of 10,000 dimensions or more.
  • the number of feature words in a sample is usually very small relative to the total number of dimensions of the vector, so the value of most elements in the vector is 0, which may waste computing resources.
  • HScodes can also be grouped in advance.
  • the commodity categories corresponding to certain HScodes have strong similarities, so they can be divided into a group to form a large category, and so on.
  • the basis for grouping HScodes can also be information such as the category system defined in the online sales system; in this way, the category system defined in the online sales system is associated with the customs HScodes, which also makes it convenient to perform more efficient classification predictions later.
  • the classification model can be trained with each group as a unit. In this way, the number of training samples in each group will be reduced, so the total number of corresponding feature words will also be reduced. In the end, the dimension of the feature word vector corresponding to each training sample will also be reduced, thereby reducing the calculation amount and improving the training efficiency.
  • each HScode can respectively correspond to a feature word weight vector.
  • the weights corresponding to the feature words on the same sequence number may be different.
  • the trained classification model can be persistently stored in a storage medium such as a disk, or, as described above, an interfaced classification tool can be generated based on the model and provided to various users for use.
  • a coding classification model can also be obtained in a similar manner.
  • the above-mentioned process of establishing a coding classification model may be completed in advance. After the completion, a specific model may be used to classify the target logistics object to be classified. Specifically, referring to FIG. 3, the following steps may be included:
  • the text description information of the target logistics object to be classified can be determined, where the specific text description information can be obtained from information such as the title of the logistics object.
  • an interface-based tool is provided, as shown in FIG. 4, an interface for inputting text description information of a target logistics object may also be provided in the interface, for example, it may be an input box.
  • an entry for importing the text description information of multiple target logistics objects in batches can also be provided; users can organize the text description information of the logistics objects to be classified in advance in an Excel table or similar, naming the data columns in the table.
  • a coding classification model can be determined in advance, so that, for a target logistics object to be classified, its text description information can be obtained and processed to determine the target feature words it contains, a feature word vector corresponding to the target logistics object is generated according to which target feature words the text description information contains, and the feature word vector is then input into the coding classification model to obtain the corresponding classification feature information.
  • automatic classification of logistics objects can be achieved without relying on manual classification, so efficiency and accuracy can be improved.
  • S501: Collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
  • a feature vector generating unit 602, configured to determine text description information of a target logistics object to be classified and process the text description information to determine the target feature words it contains;
  • the code includes a customs code, HScode;
  • the coding classification model stores a feature word weight vector corresponding to each customs code (HScode);
  • the feature word weight vector records the discrimination weight value of each feature word for the associated HScode.
  • the coding classification model is a decision tree model;
  • the coding classification model stores multiple tree models; each tree stores the split features and split thresholds over the feature word vector, which are used to determine the probability that the target logistics object is classified into the category corresponding to each potential code.
  • the coding classification model is a neural network model;
  • the coding classification model has multiple layers of non-linear transformation units, each layer connected in series with the next; each layer stores feature weights based on the feature word vector or on feature vectors derived from it, so that the probability that the logistics object is classified into the category corresponding to each potential code is obtained through the interaction of the multiple layers of non-linear transformation units.
  • the classification feature information acquiring unit may be specifically configured to input the feature word vector into the coding classification model and determine the probability that the target logistics object is classified into the category corresponding to each potential code. It can also be used to provide classification recommendation information according to the probability.
  • the coding classification model is established in the following manner:
  • a sample collection unit configured to collect training samples, where each training sample includes a correspondence relationship between known text description information of logistics objects and HScode;
  • a feature word vector generating unit configured to generate a feature word vector corresponding to each training sample according to the inclusion of the feature words on each sequence number in each training sample;
  • a training unit is configured to input feature word vectors corresponding to multiple training samples associated with the same HScode into a preset machine learning model for training, and obtain a classification model corresponding to each HScode.
  • the device may further include:
  • the data cleaning unit may be specifically configured to:
  • the device may further include:
  • a sample collection unit 701 is configured to collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;

Abstract

Disclosed by the embodiments of the present application are a logistics object information processing method, a device and a computer system. The method comprises: determining text description information of a target logistics object to be categorized, processing the text description information, and determining a target characteristic word contained therein; generating a characteristic word vector corresponding to the target logistics object according to the inclusion of each target characteristic word in the text description information; and inputting the characteristic word vector into a code classification model, and acquiring corresponding classification characteristic information. By means of the embodiments of the present application, the automatic classification of a logistics object code may be achieved so as to reduce the probability of an error occurring while reducing labor costs.

Description

Logistics object information processing method, device and computer system
This application claims priority to Chinese patent application No. 201810943287.X, filed on August 17, 2018 and entitled "Logistics Object Information Processing Method, Device, and Computer System", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of logistics object information processing, and in particular to a logistics object information processing method, apparatus, and computer system.
Background
HScode (The Harmonization System Code, the harmonized commodity name and coding system code, referred to as the customs code) is the core data that must be provided to customs during clearance and determines the export tax rate and the tax refund rate. HS uses a six-digit code to classify all internationally traded goods into 22 categories and 98 chapters. Chapters are subdivided into headings and subheadings. The first and second digits of a commodity code represent the chapter, the third and fourth digits the heading, and the fifth and sixth digits the subheading. The first six digits are the international standard HS code; HS has 1,241 four-digit headings and 5,113 six-digit subheadings.
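The six-digit structure described above lends itself to a simple decomposition. The following is a minimal sketch, not part of the original disclosure, of how a code string might be split into chapter, heading, and subheading; the function name and the treatment of digits beyond the sixth as a national extension are illustrative assumptions.

```python
def parse_hscode(hscode: str) -> dict:
    """Split an HS code into chapter, heading and subheading.

    The first two digits give the chapter, digits 1-4 the heading and
    digits 1-6 the subheading; anything beyond the sixth digit is treated
    here as a national extension (e.g. the extra digits of a 10-digit code).
    """
    digits = "".join(ch for ch in hscode if ch.isdigit())
    if len(digits) < 6:
        raise ValueError(f"expected at least 6 digits, got {hscode!r}")
    return {
        "chapter": digits[:2],
        "heading": digits[:4],
        "subheading": digits[:6],
        "national_extension": digits[6:],
    }

# Example with a 10-digit code that appears later in this document:
print(parse_hscode("6110110000"))
# {'chapter': '61', 'heading': '6110', 'subheading': '611011', 'national_extension': '0000'}
```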
In cross-border e-commerce and other import/export trade, specific commodities must be assigned an HScode to facilitate customs clearance. HScode classification means determining the HScode to which a commodity belongs based on its specific information (text description, pictures, etc.) and the HScode classification rules. Unlike the commodity classification used in ordinary e-commerce systems, HScode classification subdivides commodities far more finely: within the same clothing category, different materials, different styles, and even different weaving methods correspond to different HScodes. HScode classification of commodities is therefore a very tedious process.
At present, most enterprises in the industry use manual pre-classification. However, even a customs declaration expert with rich experience needs about 2 to 15 minutes to classify one SKU (stock keeping unit), some extremely complicated commodities take hours or even longer, and one expert can handle at most about 200 SKUs per day. In addition, because pre-classification professionals are scarce and the learning threshold is high, manual pre-classification suffers from high cost, low timeliness, and long response times. According to statistics, classifying one SKU currently costs 200 to 500 RMB, and for some special categories such as electromechanical products the pre-classification cost can exceed 1,500 RMB. For large cross-border e-commerce platforms, however, the number of SKUs involved is huge, up to several billion, and such costs are clearly unacceptable. Moreover, facing tens of millions of B2C cross-border orders per day and delivery targets of 72 hours or less, a throughput of 200 SKUs per person per day obviously cannot respond to and meet demand in time. Furthermore, manual classification relies too heavily on expert experience; given the huge daily workload, mistakes are inevitable, and the error rate gradually increases as the workload grows, so that declared goods are challenged by customs and the enterprise's qualifications are affected.
Therefore, how to classify commodities more efficiently, reducing cost while also reducing the probability of errors, has become a technical problem to be solved by those skilled in the art.
Summary of the Invention
The present application provides a logistics object information processing method, apparatus, and computer system, which can automatically classify logistics objects, reducing labor costs while reducing the probability of errors.
The present application provides the following solutions:
A logistics object information processing method includes:
determining text description information of a target logistics object to be classified and processing the text description information to determine the target feature words it contains;
generating a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and
inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.
A method for generating a coding classification model includes:
collecting training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
performing word segmentation on the text description information in the training samples and filtering out invalid words to obtain feature words;
aggregating and deduplicating the feature words obtained from all training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
generating a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains; and
inputting the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code, where the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
A logistics object information processing apparatus includes:
a target logistics object information determining unit, configured to determine text description information of a target logistics object to be classified and process the text description information to determine the target feature words it contains;
a feature vector generating unit, configured to generate a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and
a classification feature information acquiring unit, configured to input the feature word vector into a coding classification model to obtain corresponding classification feature information.
An apparatus for generating a coding classification model includes:
a sample collecting unit, configured to collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
a feature word determining unit, configured to perform word segmentation on the text description information in the training samples and filter out invalid words to obtain feature words;
a feature word aggregating unit, configured to aggregate and deduplicate the feature words obtained from all training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
a feature word vector generating unit, configured to generate a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains; and
a training unit, configured to input the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code, where the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
A computer system includes:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:
determining text description information of a target logistics object to be classified and processing the text description information to determine the target feature words it contains;
generating a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and
inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
Through the embodiments of the present application, a coding classification model can be determined in advance. For a target logistics object to be classified, its text description information can then be obtained and processed to determine the target feature words it contains; a feature word vector corresponding to the target logistics object is generated according to which target feature words the text description information contains; and the feature word vector is input into the coding classification model to obtain the corresponding classification feature information. In this way, logistics objects are classified automatically rather than manually, improving both efficiency and accuracy.
In an optional implementation, a classification model for each HScode can be obtained by collecting and processing training samples and performing machine learning training. The model can be represented by a feature word weight vector, which records the discrimination weight value of each feature word for the associated HScode. When predicting a target data object, its text description information is segmented and otherwise processed to determine the feature words it contains and to generate a feature word vector. This feature word vector is input into the previously trained classification model, so the probability that the target commodity object is classified into each HScode can be calculated, and recommendation information (for example, one or several suggested HScodes) can be given accordingly. In this way, HScode classification of target data objects no longer depends entirely on experts, labor costs are reduced, and classification efficiency improves without being limited by expert experience and personal ability, reducing the error rate.
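To make the optional implementation above concrete, the following is a rough sketch of how a feature word vector might be scored against per-HScode feature word weight vectors to obtain per-code probabilities and top suggestions. It assumes a logistic-regression-style model without a bias term; the feature ids, codes, and weights in the example are hypothetical, not values from the application.

```python
import math

def score_hscodes(sample_vector, models, top_k=3):
    """Rank candidate HScodes for one logistics object.

    sample_vector : {feature_id: value} sparse feature word vector of the item
    models        : {hscode: {feature_id: discrimination_weight}} per-code weight vectors
    Returns the top_k (hscode, probability) pairs; the sigmoid of the sparse
    dot product stands in for a logistic-regression score (no bias term).
    """
    scored = []
    for hscode, weights in models.items():
        z = sum(value * weights.get(fid, 0.0) for fid, value in sample_vector.items())
        scored.append((hscode, 1.0 / (1.0 + math.exp(-z))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical feature ids (1 = "wool", 4 = "cardigan", 12 = "knitted") and weights.
models = {
    "6110110011": {1: 2.1, 4: 1.7, 12: 0.9},
    "6110300012": {1: -0.4, 4: 1.2, 12: 0.8},
}
print(score_hscodes({1: 0.2, 4: 0.5, 12: 0.6}, models))
```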
Of course, implementing any product of this application does not necessarily require achieving all of the advantages described above at the same time.
Brief Description of the Drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the overall framework provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a system provided by an embodiment of the present application;
FIG. 3 is a flowchart of a prediction method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the interface of a classification tool provided by an embodiment of the present application;
FIG. 5 is a flowchart of a model training method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a first apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer system provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application fall within the protection scope of the present application.
In the embodiments of the present application, in order to improve the efficiency of classifying logistics objects and reduce labor costs, a coding classification model can be established in advance; for example, it may be a logistic regression model, a decision tree model, a neural network model, and so on. If the coding classification model is a logistic regression model, it can be built through machine learning and then used to automatically classify logistics objects (which may specifically be commodity objects and the like) to determine their corresponding codes. Specifically, as shown in FIG. 1, training samples can be collected, which may be known correspondences between the text description information of commodity objects and codes such as HScodes. Then, according to the characteristics of the training samples and the training objective, the data are processed and input into a specific machine learning model for training, and a specific coding classification model is finally established. This coding classification model can then be used to predict the code to which a specific logistics object belongs. The prediction result can be used directly as the coding classification result, or as a reference for the coding classification result, and so on.
In specific implementations, after the coding classification model is established, it can be provided to merchant users, the customs clearance partners (CPs) of a cross-border online sales system, customs departments, and so on, for use in the customs clearance of logistics objects, so that machine classification replaces or partially replaces traditional manual classification, improving the efficiency of customs clearance and reducing the classification cost for enterprises. In addition, to make the model easier to use and lower its technical threshold, as shown in FIG. 2, an interface-based classification tool (either an online tool or an application that can be installed locally) can be further developed on top of the coding classification model. A user only needs to enter the text description information of the target logistics object to be classified through the input box and other controls provided in the interface; the classification tool then automatically processes it, calls the pre-configured coding classification model, and gives a final classification suggestion.
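As a companion to the interface-based tool described above, a batch entry point could look roughly like the sketch below, which reads item titles from a spreadsheet and writes back a suggested code per row. The column names, file formats, and the classify_fn callback are assumptions for illustration; the original only states that batch import (for example via Excel tables) may be supported.

```python
import pandas as pd

def classify_batch(input_path, output_path, classify_fn):
    """Batch front end for the classification tool.

    Reads a table whose 'title' column holds the text description of each
    logistics object, calls classify_fn(title) -> [(hscode, probability), ...]
    for every row, and writes the top suggestion back next to the input.
    """
    frame = pd.read_csv(input_path)  # pd.read_excel(...) for an .xlsx workbook
    top = frame["title"].map(lambda title: classify_fn(title)[0])
    frame["suggested_hscode"] = [code for code, _ in top]
    frame["confidence"] = [prob for _, prob in top]
    frame.to_csv(output_path, index=False)

# classify_batch("items.csv", "items_with_suggestions.csv", my_classify_fn)
```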
The specific implementation is described in detail below.
Embodiment 1
First, from the perspective of the aforementioned classification tool, Embodiment 1 of the present application provides a logistics object information processing method. In this method, a coding classification model is first obtained; for example, it may include a logistic regression model, a decision tree model, a neural network model, and so on. For a logistic regression model, the coding classification model may store a feature word weight vector for each code. For the customs code HScode, for example, the coding classification model may store a feature word weight vector for each HScode, and the feature word weight vector records the discrimination weight value of each feature word for the associated HScode.
In the case of the above logistic regression model, for HScodes, the specific coding classification model may be established in advance. In one specific implementation, the steps of establishing the model may first include:
Step 1: Collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and an HScode.
In specific implementations, labeled data in the historical classification records of logistics objects can be collected, for example the Import and Export Tariff of the People's Republic of China, historical customs clearance data, expert-labeled data, and so on.
For example, the information recorded in the Import and Export Tariff of the People's Republic of China may be as shown in Table 1 (only one entry is shown):
Table 1 (reproduced as an image in the original publication)
Of course, because the commodity descriptions recorded in the tariff usually do not refer to one particular item, historical customs clearance data from the cross-border online sales system can be used as a supplement to better complement the information in the tariff. For example, one item of historical customs clearance data may be as shown in Table 2:
Table 2 (reproduced as an image in the original publication)
That is, the historical customs clearance data records the correspondence between the names and other details of specific logistics objects and their HScodes. Incorporating such data into the training samples therefore helps to predict more accurate HScodes for specific logistics objects.
In addition to the tariff and the historical customs clearance data, expert-labeled data can also be used as a supplement. For example, one item of expert-labeled data may be as shown in Table 3:
Table 3 (reproduced as an image in the original publication)
In short, training samples can be collected in many ways. Of course, because the collected data are usually historical, practical applications may involve HScodes that have been changed, deactivated, or split, so some historical data may already be invalid for subsequent classification. For this reason, in a preferred implementation, the training samples can also be cleaned so that only the remaining valid training samples are used to train the classification model.
The data cleaning process may include modifying HScodes that have changed, deleting training samples corresponding to deactivated HScodes, re-determining the HScodes in training samples corresponding to split HScodes, and so on. In a specific implementation, the mapping between old and new HScodes can be saved in advance. After the training samples are collected, each training sample can be traversed to determine whether its HScode is an old one; a training sample containing an old HScode can be updated to the new HScode according to the mapping and then added to the training sample set as a valid training sample. For example, for a certain type of logistics object, the HScode defined in an earlier tariff was 6110110000; after a tariff revision, the HScode of this type of logistics object was changed to 6110110011, so this mapping relationship can be saved. After the training samples are collected, if a training sample is found to contain HScode 6110110000, it can be changed to 6110110011 according to the saved mapping, making the training sample valid data.
In addition, a list of deactivated HScodes can be saved in advance. After the training samples are collected, the HScode in each training sample can be traversed, and training samples containing a deactivated HScode can be deleted. For example, an HScode 6110110027 may have been deactivated after a tariff revision deleted the corresponding category; this HScode can be recorded, and after the training samples are collected, any training sample whose HScode is 6110110027 can be deleted.
Furthermore, an HScode may have been split. For example, in an old version of the tariff the HScode corresponding to a category was 6110110000; after a revision, the category was subdivided into two subcategories corresponding to HScode 6110110001 and HScode 6110110002, and the original HScode 6110110000 is no longer used. Therefore, a list of split HScode information can also be saved in advance, where each entry includes the HScode before the split and the corresponding HScodes after the split. After the training data are collected, the HScode in each record can be traversed, and training samples containing a pre-split HScode can be extracted so that the post-split HScode can be re-determined before the sample is added to the training sample set as a valid training sample. For example, if the HScode in a training sample is found to be 6110110000, it can be extracted; the post-split HScode can then be re-determined for that training sample, for example through expert confirmation, and the pre-split HScode replaced, making the training sample valid data.
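A compact sketch of the three cleaning rules just described (remapping changed codes, dropping deactivated codes, and setting aside samples whose codes were split for re-labeling) might look like this; the function and parameter names are illustrative, and split cases are simply routed to a review list rather than re-labeled automatically.

```python
def clean_training_samples(samples, code_mapping, deactivated, split_codes):
    """Apply the three cleaning rules to raw (text, hscode) training samples.

    code_mapping : {old_hscode: new_hscode} for codes that were renamed
    deactivated  : set of HScodes that are no longer in use
    split_codes  : {old_hscode: [new_hscode, ...]} for codes split into sub-codes
    Returns (valid_samples, needs_review); split cases are only flagged,
    because the correct sub-code has to be re-confirmed (e.g. by an expert).
    """
    valid, needs_review = [], []
    for text, code in samples:
        if code in deactivated:
            continue  # drop samples whose code was retired
        if code in split_codes:
            needs_review.append((text, code))  # re-label against the sub-codes
            continue
        valid.append((text, code_mapping.get(code, code)))  # remap renamed codes
    return valid, needs_review

valid, review = clean_training_samples(
    [("wool cardigan", "6110110000"), ("retired item", "6110110027")],
    code_mapping={"6110110000": "6110110011"},
    deactivated={"6110110027"},
    split_codes={},
)
# valid == [("wool cardigan", "6110110011")], review == []
```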
In addition to the above data cleaning, the training samples can also be manually verified, for example by random sampling, to improve their quality as much as possible and thus the accuracy of the finally trained model.
Step 2: Perform word segmentation on the text description information in the training samples and filter out invalid words to obtain feature words.
After data cleaning and other processing of the training samples, the next step can be performed. Specifically, each training sample contains text description information of a data object; this may be the title of a specific logistics object, the text description given in the tariff, the declaration elements submitted by the merchant at customs declaration, and so on. In short, each training sample records a correspondence between text description information and an HScode. The purpose of machine learning is to find regularities among the multiple pieces of text description information corresponding to the same HScode, which can then be used to predict HScodes. When processing the text description information, word segmentation is performed first: the text description information is usually a sentence or a passage, and the purpose of segmentation is to divide it into multiple words.
For example, the text description information in one training sample is: 春秋新款羊毛开衫女披肩外套薄针织衫短款V领小开衫宽松大码毛衣 (new spring/autumn wool cardigan, women's shawl coat, thin knitted sweater, short V-neck small cardigan, loose plus-size sweater). The segmentation result may be: 春秋/新款/羊毛/开衫/女/披肩/外套/薄/针织衫/短款/V领/小开衫/宽松/大码/毛衣. For the specific segmentation method, reference can be made to existing solutions, which are not repeated here.
After segmentation, words irrelevant to classification can be filtered out, leaving only valid feature words. For this purpose, named entity recognition can be performed on the words obtained from the segmentation result, and words irrelevant to the classification of the logistics object can be filtered out according to the recognition result. For example, again assuming the text description information of a training sample is 春秋新款羊毛开衫女披肩外套薄针织衫短款V领小开衫宽松大码毛衣, the result after segmentation and named entity recognition may be:
春秋 [season] / 新款 [promotional word] / 羊毛 [material] / 开衫 [category] / 女 [target group] / 披肩 [style] / 外套 [category] / 薄 [style] / 针织 [weaving method] / 短款 [style] / V领 [style] / 小开衫 [category] / 宽松 [style] / 大码 [style] / 毛衣 [category]
After the above named entity recognition, words irrelevant to the classification, such as season, promotional, and style words, can be removed, leaving words related to the classification such as category, material, and weaving method, which facilitates subsequent feature processing. Because the retained words better reflect the characteristics of the specific logistics object for HScode classification, they can be called feature words.
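The segmentation-plus-filtering step could be prototyped roughly as below. This assumes the third-party jieba segmenter and stands in for the named entity recognizer with a small hand-written lexicon, since the application does not prescribe a particular segmentation or NER tool; the entity types and the set of types treated as irrelevant are illustrative.

```python
import jieba  # third-party Chinese segmenter, used here only as an example

# A hand-written lexicon standing in for a real named-entity recognizer.
ENTITY_TYPE = {
    "羊毛": "material", "开衫": "category", "针织": "weaving method",
    "春秋": "season", "新款": "promotional word", "宽松": "style",
}
IRRELEVANT_TYPES = {"season", "promotional word", "style"}  # dropped before training

def extract_feature_words(description):
    """Segment a product title and keep only classification-relevant feature words."""
    kept = []
    for word in jieba.lcut(description):
        if ENTITY_TYPE.get(word) in IRRELEVANT_TYPES:
            continue  # e.g. season or marketing words carry no HScode signal
        kept.append(word)
    return kept

print(extract_feature_words("春秋新款羊毛开衫"))
# expected (if jieba splits as 春秋/新款/羊毛/开衫): ['羊毛', '开衫']
```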
Step 3: Aggregate and deduplicate the feature words obtained from all training samples to obtain a feature word set, and assign a corresponding sequence number to each feature word.
After the feature words are obtained, the feature words in all training samples can be aggregated and deduplicated to obtain a feature word set. In addition, to express the text description information of each training sample as a vector, so that probabilities can later be computed with vector operations, each feature word can be assigned a corresponding sequence number. For example, if the feature words of all training samples together amount to ten thousand, they can be numbered from 1 to 10,000. Then, for each training sample, a corresponding feature word vector only needs to be generated according to which feature words, by sequence number, the sample contains.
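Building the deduplicated feature word set with sequence numbers is straightforward; a minimal sketch, with numbering starting at 1 as in the text above:

```python
def build_vocabulary(samples_feature_words):
    """Aggregate and deduplicate feature words, assigning sequence numbers from 1.

    samples_feature_words : iterable of feature word lists, one per training sample
    Returns {feature_word: sequence_number}.
    """
    vocabulary = {}
    for words in samples_feature_words:
        for word in words:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary

print(build_vocabulary([["羊毛", "开衫"], ["羊毛", "毛衣"]]))
# {'羊毛': 1, '开衫': 2, '毛衣': 3}
```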
Step 4: Generate a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains.
As described in Step 3, after the feature word set is generated and the sequence numbers are assigned, the text description information of each training sample can be expressed by generating a feature word vector for that sample according to which feature words, by sequence number, it contains. That is, if there are ten thousand feature words in total, each training sample corresponds to a ten-thousand-dimensional feature word vector. Because the feature words were obtained by segmenting, filtering, and aggregating the training samples, the feature words contained in each training sample necessarily exist in the feature word set; that is, they are a subset of the feature word set. For a given training sample, the value of each element in its feature word vector can be determined by whether a feature word exists at the corresponding sequence number. For example, if the feature words contained in a training sample are numbers 1, 12, 23, 25, 68, 1279, and so on, the elements at these sequence numbers in the sample's feature word vector can be set to 1 and the elements at other sequence numbers to 0, to express which feature words the sample contains. Alternatively, in another implementation, an initial weight can be assigned to the element at each sequence number based on information such as the attributes of the feature word; if a training sample contains the feature word at a given sequence number, the corresponding element can be set to that feature word's initial weight, representing how important the feature word is for classifying the commodity category corresponding to an HScode. For example, the generated feature word vector may be {1: 0.2, 4: 0.5, 12: 0.6, 1009: 0.3, 3801: 0.2, ...}, meaning the training sample contains feature words 1, 4, 12, 1009, 3801, and so on, with initial weights 0.2, 0.5, 0.6, 0.3, 0.2, and so on, respectively. Note that in this example, because the sample contains no features at the other sequence numbers (for example, 2, 3, 5, 6, ...), the corresponding element values are 0 and are not shown in the vector above; in a concrete implementation, to facilitate operations such as vector multiplication, the elements whose value is 0 also exist in the actual vector.
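A sparse representation of the feature word vector, in the {sequence number: value} form used in the example above, could be built as follows; the optional per-word initial weights are passed in as a dictionary, which is an illustrative choice.

```python
def to_feature_vector(feature_words, vocabulary, initial_weights=None):
    """Build the sparse {sequence_number: value} vector for one sample.

    If initial_weights (a {word: weight} dict) is given, the word's weight is
    used; otherwise presence is encoded as 1.0. Sequence numbers that do not
    appear are implicitly 0, keeping the ten-thousand-dimensional vector sparse.
    """
    vector = {}
    for word in feature_words:
        seq = vocabulary.get(word)
        if seq is None:
            continue  # words outside the vocabulary contribute nothing
        vector[seq] = 1.0 if initial_weights is None else initial_weights.get(word, 1.0)
    return vector

print(to_feature_vector(["羊毛", "开衫"], {"羊毛": 1, "开衫": 4}, {"羊毛": 0.2, "开衫": 0.5}))
# {1: 0.2, 4: 0.5}
```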
Of course, in a concrete implementation each training sample corresponds to a vector of ten thousand or even more dimensions, and the computation may therefore occupy considerable computing resources. Since the number of feature words contained in a single training sample is usually very small compared with the total number of dimensions, most elements of the vector are 0, which may waste computing resources. For this reason, in an optional implementation the HScodes can also be grouped in advance; for example, the commodity categories corresponding to certain HScodes are strongly similar, so those HScodes can be placed in one group to form a broad class, and so on. The basis for grouping the HScodes may also be information such as the category system defined in an online sales system; in this way the category system defined in the online sales system becomes associated with the customs HScodes, which also facilitates more efficient classification prediction later on.
For example, the category system defined in an online sales system includes first-level categories such as clothing, daily necessities, home appliances and computer consumables; each first-level category contains multiple second-level categories, a second-level category may contain third-level categories, and so on, down to the leaf categories. When the HScodes are grouped, the grouping can be based on a particular level of that category system; depending on the level chosen, the number of HScode groups and the number of HScodes contained in each group will differ. The specific choice can be made according to actual requirements.
After grouping in this way, the classification models can be trained group by group. The number of training samples within each group is reduced, so the total number of corresponding feature words is also reduced, and the dimension of the feature word vector of each training sample drops accordingly, which lowers the amount of computation and improves training efficiency.
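The per-group organisation described above can be sketched as follows; hscode_to_group stands in for the mapping from an HScode to one level of the sales catalogue, and the example codes and group names are only assumptions for illustration.

```python
from collections import defaultdict

def group_training_samples(samples, hscode_to_group):
    """Split (text, hscode) training samples into per-group pools, so that each
    group builds its own smaller feature word set and trains its own models."""
    groups = defaultdict(list)
    for text, hscode in samples:
        groups[hscode_to_group.get(hscode, "ungrouped")].append((text, hscode))
    return groups

# Hypothetical mapping from HScodes to a first-level catalogue category.
hscode_to_group = {"610910": "apparel", "620342": "apparel", "851712": "electronics"}
samples = [("men cotton t-shirt", "610910"),
           ("denim trousers", "620342"),
           ("smartphone 64gb", "851712")]
for group, pool in group_training_samples(samples, hscode_to_group).items():
    print(group, len(pool))
```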
Step 5: respectively inputting the feature word vectors corresponding to the multiple training samples associated with the same HScode into a preset machine learning model for training, to obtain the classification model corresponding to each HScode.
After the feature word vector of each training sample has been obtained, the feature word vectors corresponding to the multiple training samples associated with the same HScode can be input into a preset machine learning model for training. That is, if a particular HScode has 1000 associated training samples in total, the feature vectors of those 1000 training samples are input into the machine learning model for training. Various machine learning models can be used, including but not limited to classification models such as SVM, LR, naive Bayes and maximum entropy, as well as deep learning methods such as LSTM + softmax, and so on. After multiple rounds of iteration, once the algorithm converges, the classification model corresponding to that HScode is obtained. This classification model can likewise be represented by a vector, for example {f1: w1, f2: w2, f3: w3, f4: w4, f5: w5, f6: w6, ...}, where fn denotes the sequence number of a specific feature word and wn denotes the corresponding weight. In other words, for a given HScode, the training result expresses how important the feature word at each sequence number is to that HScode.
In short, after machine learning training, each HScode corresponds to a feature word weight vector, and across these vectors the weights of the feature words at the same sequence number may differ. The trained classification model can be persisted to a storage medium such as a disk, or, as described above, an interface-based classification tool can be generated from the model and provided to various users.
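As one possible realisation of the per-HScode training described above, the sketch below assumes scikit-learn's logistic regression as a stand-in for the "LR" model mentioned in the text; the one-vs-rest framing and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_hscode_models(X, hscodes):
    """Train one binary model per HScode and keep its feature word weight vector.

    X       : (n_samples, n_features) matrix of feature word vectors
    hscodes : HScode label of each sample
    Returns a dict: hscode -> (weight vector over feature words, bias).
    """
    hscodes = np.asarray(hscodes)
    models = {}
    for code in np.unique(hscodes):
        y = (hscodes == code).astype(int)              # this HScode vs. all others
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        models[code] = (clf.coef_.ravel(), float(clf.intercept_[0]))
    return models

# Tiny illustrative run: 4 samples, 6 feature words, 2 HScodes.
X = np.array([[1, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)
models = train_per_hscode_models(X, ["610910", "610910", "851712", "851712"])
print({code: w.round(2).tolist() for code, (w, b) in models.items()})
```

The returned weight vectors play the role of the {f1: w1, f2: w2, ...} representation above, and could be persisted to disk with, for example, numpy.save or pickle.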
Of course, in addition to the logistic regression model described above, other types of classification model such as decision tree models and neural network models can also be used. For a decision tree model, the decision process can be carried out over multiple tree models based on the word features: using the split thresholds stored for each tree and the features of the feature word vector, the leaf node of each tree to which the logistics object belongs is determined, and from this the probability that the logistics object is classified into the category corresponding to each potential code such as an HScode is obtained. For a neural network model, the specific code classification model can have multiple layers of non-linear transformation units, with each layer connected in series to the non-linear transformation units of the next layer; each layer stores feature weights based on the feature word vector or on feature vectors derived from it, and the probability that the logistics object is classified into the category corresponding to each code is obtained through the interaction of the multiple layers of non-linear transformation units. More specific details are not described here.
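Purely as an illustration of the alternative model families mentioned above, and again assuming scikit-learn is acceptable, the sketch below swaps in a multiple-tree ensemble or a small multi-layer network; the estimators, hyperparameters and random data are assumptions, not choices prescribed by the application.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier   # multiple-tree model with split thresholds per tree
from sklearn.neural_network import MLPClassifier          # stacked non-linear transformation units

def train_alternative_model(X, y, kind="trees"):
    """Return a fitted tree-ensemble or neural-network classifier over feature word vectors."""
    if kind == "trees":
        model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
    else:
        model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    return model.fit(X, y)

# Random demonstration data only; predict_proba yields per-code probabilities.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(20, 6)).astype(float)
y_demo = rng.choice(["610910", "851712"], size=20)
print(train_alternative_model(X_demo, y_demo, kind="trees").predict_proba(X_demo[:2]))
```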
In addition, for codes other than the HScode, a code classification model can be obtained in a similar manner.
The above process of establishing the code classification model can be completed in advance; once it is complete, the specific model can be used to classify a target logistics object to be classified. Specifically, referring to FIG. 3, the following steps may be included:
S301: determining text description information of a target logistics object to be classified, and processing the text description information to determine the target feature words it contains;
Starting from this step, the process is mainly one of predicting the code of a specific target logistics object using the code classification model described above. Specifically, the text description information of the target logistics object to be classified is first determined; this text description information can be obtained from information such as the title of the logistics object. In a concrete implementation, if an interface-based tool is provided, as shown in FIG. 4, an entry for inputting the text description information of the target logistics object, for example an input box, can be provided in the interface. Alternatively, an entry for importing the text description information of multiple target logistics objects in batches can be provided; in this way, users can organise in advance, for example in an Excel spreadsheet, the text description information of the logistics objects that need to be classified, naming the data columns of the spreadsheet with predefined field names, and then import the text description information of each logistics object recorded in the spreadsheet into the tool through the batch operation entry, and so on. Whether the text description information of a target logistics object is entered singly or imported in batches, the target logistics object may be a logistics object awaiting customs declaration, for example text object information such as the title of a target logistics object extracted from a specific cross-border order.
S302: generating, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object;
After the text description information of the target data object has been obtained, it can be processed in the same way as the text description information in the training samples. For example, word segmentation can likewise be performed, invalid words filtered out, and the remaining valid words determined as the target feature words. Then, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object can likewise be generated. Specifically, the feature word vector corresponding to the target logistics object can be generated according to which feature words, by sequence number, its text description information contains. For example, if the text description information of the target logistics object contains feature words No. 1, 5, 8, 27 and so on, the elements at those sequence numbers can be set to 1, or to the preset initial weights, and the elements at the other sequence numbers to 0. Of course, in practical applications the text description information of the target logistics object to be predicted may contain words that were not collected during training; such words can be filtered out and need not be input into the classification model. After this prediction has been completed, however, it can be determined from information such as the named entity information of such a word whether the word is relevant to HScode classification; if it is, it can be added to the corresponding feature word set as a feature word, and the model can be retrained, and so on.
It should be noted that the dimension of the feature word vector generated here for the target logistics object is consistent with the number of feature words in the feature word set used during training. For example, if during training the feature words of all training samples were pooled into a feature word set containing N feature words, the feature word vector corresponding to the target logistics object to be predicted is also an N-dimensional vector. In addition, if the HScodes were grouped during training and the feature words of the training samples were aggregated per HScode group, the number of feature words within each group is reduced accordingly. In that case, before the feature word vector is generated for the target logistics object, the group to which the target logistics object belongs can first be determined; for example, if the HScodes were grouped according to the category system of an online sales system, the corresponding HScode group can be determined from the category to which the target logistics object belongs under that category system. The feature word set contained in that HScode group is then used to determine the feature word vector of the current target logistics object.
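A hedged sketch of steps S301 and S302 for a single target object follows; whitespace splitting and a small stop-word set stand in for the word segmentation and invalid-word filtering described above, and any word not seen during training is simply skipped.

```python
import numpy as np

def target_feature_vector(title, feature_index, stop_words=frozenset(), initial_weights=None):
    """Turn the title of a target logistics object into a feature word vector
    whose dimension equals the size of the (possibly per-group) feature word set."""
    words = [w for w in title.lower().split() if w not in stop_words]  # stand-in for segmentation + filtering
    vec = np.zeros(len(feature_index), dtype=np.float32)
    for w in words:
        idx = feature_index.get(w)
        if idx is not None:                    # unseen words are dropped, as described above
            vec[idx] = 1.0 if initial_weights is None else initial_weights.get(w, 1.0)
    return vec

feature_index = {"cotton": 0, "shirt": 1, "men": 2, "smartphone": 3}
print(target_feature_vector("Men Cotton Shirt New Arrival", feature_index, stop_words={"new", "arrival"}))
```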
S303: inputting the feature word vector into the code classification model to obtain the corresponding classification feature information.
After the feature word vector corresponding to the target logistics object has been determined, it can be input into the code classification model to obtain the specific classification feature information. For example, in a concrete implementation in which a logistic regression model is used for HScode classification, the feature word vector can be input into the code classification model to determine the probability that the target logistics object belongs to the category corresponding to each HScode, and classification suggestion information can further be provided according to the probabilities. Specifically, the feature word vector of the target logistics object can be multiplied by the feature word weight vector corresponding to each HScode (possibly adjusted by a bias value or the like), so as to obtain the probability value that the target logistics object belongs to the category corresponding to each HScode. If grouped training was performed, the group information corresponding to the target logistics object can be input into the classification model together with the feature word vector; in that case the feature word vector of the target logistics object only needs to be combined with the feature word weight vectors of the HScodes within that group, rather than computing probabilities for all HScodes, which saves computation.
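The scoring step for the logistic-regression variant can be illustrated as below; it assumes the per-HScode (weight vector, bias) pairs produced during training and an optional group restriction, with the logistic function turning the weighted sum into a probability.

```python
import math
import numpy as np

def score_against_hscodes(vec, models, group_codes=None):
    """Probability that a logistics object belongs to the category of each HScode.

    models      : dict hscode -> (feature word weight vector, bias)
    group_codes : optional iterable restricting scoring to the HScodes of one group
    """
    codes = group_codes if group_codes is not None else models.keys()
    probs = {}
    for code in codes:
        w, b = models[code]
        z = float(vec @ w) + b                    # weighted sum of matched feature words plus bias
        probs[code] = 1.0 / (1.0 + math.exp(-z))  # logistic function
    return probs

# Illustrative weights for two HScodes and a vector containing three feature words.
models = {"610910": (np.array([0.9, 0.8, 0.4, -0.7]), -0.2),
          "851712": (np.array([-0.6, -0.5, -0.3, 1.2]), -0.1)}
print(score_against_hscodes(np.array([1.0, 1.0, 1.0, 0.0]), models))
```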
After the probability that the target logistics object belongs to the category corresponding to each HScode has been calculated, the corresponding classification suggestion information can also be returned. For example, the one or several HScodes whose probability is higher than a preset threshold can be returned, so that the user can determine the specific HScode for the target logistics object based on this suggestion.
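One possible way to turn those probabilities into the suggestion information mentioned above, with the threshold and the cut-off on the number of returned codes as illustrative parameters:

```python
def suggest_hscodes(probs, threshold=0.5, top_k=3):
    """Return up to top_k HScodes whose probability exceeds the threshold,
    highest first, as classification suggestions for the user."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(code, p) for code, p in ranked if p >= threshold][:top_k]

print(suggest_hscodes({"610910": 0.87, "620342": 0.41, "851712": 0.08}))
```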
In short, according to the embodiments of the present application, the code classification model can be determined in advance; then, for a target logistics object to be classified, its text description information can be obtained and processed to determine the target feature words it contains, a feature word vector corresponding to the target logistics object can be generated according to which target feature words the text description information contains, and the feature word vector can then be input into the code classification model to obtain the corresponding classification feature information. In this way, logistics objects can be classified automatically without relying on manual classification, so both efficiency and accuracy can be improved.
In an optional embodiment, through the collection and processing of training samples and machine learning training, a classification model for each HScode can be obtained; it can be represented by a feature word weight vector in which the discrimination weight value of each feature word for the associated HScode is recorded. When a prediction is made for a target data object, its text description information can be segmented and otherwise processed to determine the feature words it contains and to generate a feature word vector. This feature word vector can then be input into the previously trained classification models, so that the probability of the target logistics object being classified as each HScode can be calculated and suggestion information, for example one or several suggested HScodes, can be given accordingly. In this way the classification of target data objects no longer depends entirely on experts, which reduces labour costs; the efficiency of classification is also improved and is no longer limited by the experience and personal ability of experts.
Embodiment 2
This Embodiment 2 provides a method for generating a code classification model. Referring to FIG. 5, the method may specifically include:
S501: collecting training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
The code may specifically be the customs code HScode described above, or the like.
S502: performing word segmentation on the text description information in the training samples, and filtering out invalid words to obtain feature words;
S503: aggregating and de-duplicating the feature words obtained from the individual training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
S504: generating, according to which feature words at each sequence number the individual training samples contain, the feature word vector corresponding to each training sample;
S505: respectively inputting the feature word vectors corresponding to the multiple training samples associated with the same code into a preset machine learning model for training, to obtain the classification model corresponding to each code.
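Steps S501 to S505 can be tied together in a compact end-to-end sketch such as the one below; the tokenizer, the stop-word list, the sample data and the use of scikit-learn's logistic regression as the preset machine learning model are all assumptions made for illustration.

```python
import numpy as np
from collections import OrderedDict
from sklearn.linear_model import LogisticRegression

STOP_WORDS = {"new", "hot", "sale", "free", "shipping"}   # stand-in for the invalid-word filter

def tokenize(text):
    """S502: crude stand-in for word segmentation plus invalid-word filtering."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_code_classification_models(samples):
    """samples: list of (text_description, code) pairs; returns (feature_index, models)."""
    # S503: aggregate, de-duplicate and number the feature words.
    feature_index, tokenized = OrderedDict(), []
    for text, code in samples:
        words = tokenize(text)
        tokenized.append((words, code))
        for w in words:
            feature_index.setdefault(w, len(feature_index))
    # S504: one feature word vector per training sample.
    X = np.zeros((len(samples), len(feature_index)), dtype=np.float32)
    codes = []
    for i, (words, code) in enumerate(tokenized):
        for w in words:
            X[i, feature_index[w]] = 1.0
        codes.append(code)
    # S505: one model per code, keeping its feature word weight vector and bias.
    codes, models = np.asarray(codes), {}
    for code in np.unique(codes):
        clf = LogisticRegression(max_iter=1000).fit(X, (codes == code).astype(int))
        models[code] = (clf.coef_.ravel(), float(clf.intercept_[0]))
    return feature_index, models

samples = [("men cotton shirt", "610910"), ("cotton shirt short sleeve", "610910"),
           ("smartphone 64gb dual sim", "851712"), ("android smartphone dual sim", "851712")]
feature_index, models = build_code_classification_models(samples)
print(sorted(models))
```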
For the parts of this Embodiment 2 that are not described in detail, reference may be made to the description of Embodiment 1 above, which is not repeated here.
Corresponding to Embodiment 1, an embodiment of the present application further provides a logistics object information processing apparatus. Referring to FIG. 6, the apparatus may specifically include:
a target logistics object information determining unit 601, configured to determine text description information of a target logistics object to be classified and to process the text description information to determine the target feature words it contains;
a feature vector generating unit 602, configured to generate, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object;
a classification feature information obtaining unit 603, configured to input the feature word vector into the code classification model to obtain the corresponding classification feature information.
The code classification model includes a logistic regression model, a decision tree model, or a neural network model.
If the code classification model is a logistic regression model, the code classification model stores a feature word weight vector corresponding to each code.
Specifically, the code includes the customs code HScode, and the code classification model stores a feature word weight vector corresponding to each customs code HScode; the feature word weight vector records the discrimination weight value of each feature word for the associated HScode.
If the code classification model is a decision tree model, the code classification model stores multiple tree models and, for each tree, stores split thresholds and features of the feature word vector, so as to determine the probability that the target logistics object is classified into the category corresponding to each potential code.
If the code classification model is a neural network model, the code classification model has multiple layers of non-linear transformation units, with each layer connected in series to the non-linear transformation units of the next layer; each layer of non-linear transformation units stores feature weights based on the feature word vector or on feature vectors derived from it, so that the probability that the logistics object is classified into the category corresponding to each potential code is obtained through the interaction of the multiple layers of non-linear transformation units.
In a concrete implementation, the classification feature information obtaining unit may be specifically configured to input the feature word vector into the code classification model and determine the probability that the target logistics object is classified into the category corresponding to each potential code; it may further be configured to provide classification suggestion information according to the probabilities.
The code classification model is established by means of the following units:
a sample collection unit, configured to collect training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and an HScode;
a feature word determining unit, configured to perform word segmentation on the text description information in the training samples and to filter out invalid words to obtain feature words;
a feature word aggregation unit, configured to aggregate and de-duplicate the feature words obtained from the individual training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
a feature word vector generating unit, configured to generate, according to which feature words at each sequence number the individual training samples contain, the feature word vector corresponding to each training sample;
a training unit, configured to respectively input the feature word vectors corresponding to the multiple training samples associated with the same HScode into a preset machine learning model for training, to obtain the classification model corresponding to each HScode.
In a concrete implementation, the feature vector generating unit may be specifically configured to generate the feature word vector corresponding to the target logistics object according to which feature words, by sequence number, the text description information of the target logistics object contains.
When model training is performed, the apparatus may further include:
a data cleaning unit, configured to perform data cleaning on the training samples after they are collected, so that the remaining valid training samples are used to train the classification model.
Specifically, the data cleaning unit may be configured to:
store in advance mapping relationship information between old and new HScodes, and, for a training sample in which an old HScode appears, replace the old HScode with the new HScode according to the mapping relationship before adding the sample to the training sample set as a valid training sample.
Alternatively, the data cleaning unit may also be configured to:
store in advance a list of deactivated HScodes, and delete the training samples in which a deactivated HScode appears.
As a further alternative, the data cleaning unit may also be configured to:
store in advance a list of split HScode information, wherein each item of split HScode information includes the HScode before splitting and the corresponding multiple HScodes after splitting; and
extract the training samples in which a pre-split HScode appears, so that after the post-split HScode has been re-determined, the samples are added to the training sample set as valid training samples.
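The three cleaning strategies handled by the data cleaning unit can be sketched in a single helper; the parameter names and the decision to set aside samples with a pre-split HScode for later re-determination are illustrative assumptions.

```python
def clean_training_samples(samples, old_to_new=None, deactivated=frozenset(), split_codes=None):
    """Apply the three cleaning strategies to (text, hscode) training samples.

    old_to_new  : dict mapping an old HScode to the new HScode that replaces it
    deactivated : set of HScodes that are no longer in use
    split_codes : dict mapping a pre-split HScode to the list of post-split HScodes
    Returns (valid_samples, needs_review); needs_review holds samples whose post-split
    HScode still has to be re-determined before they rejoin the training set.
    """
    old_to_new, split_codes = old_to_new or {}, split_codes or {}
    valid, needs_review = [], []
    for text, code in samples:
        code = old_to_new.get(code, code)      # 1) replace old HScodes with new ones
        if code in deactivated:                # 2) drop samples with deactivated HScodes
            continue
        if code in split_codes:                # 3) set aside samples with split HScodes
            needs_review.append((text, code, split_codes[code]))
            continue
        valid.append((text, code))
    return valid, needs_review

valid, review = clean_training_samples(
    [("old style shirt", "610900"), ("retired item", "999999"), ("split item", "123456")],
    old_to_new={"610900": "610910"},
    deactivated={"999999"},
    split_codes={"123456": ["123461", "123462"]})
print(valid, review)
```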
In a concrete implementation, the apparatus may further include:
a vocabulary filtering unit, configured to perform named entity recognition on the words obtained from the word segmentation result of the text description information, and to filter out, according to the named entity recognition result, the words that are irrelevant to the classification of the logistics object.
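A hedged sketch of the named-entity-based filtering, assuming spaCy and its small English model are installed; the choice of which entity types count as irrelevant to HScode classification is an assumption made purely for illustration.

```python
import spacy

# Assumes the model has been installed beforehand:  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Assumption: person, place and organisation names rarely help decide a commodity category.
IRRELEVANT_ENTITY_TYPES = {"PERSON", "GPE", "ORG"}

def filter_by_named_entities(text):
    """Keep only tokens whose named-entity type is not in the irrelevant set."""
    doc = nlp(text)
    return [t.text for t in doc if t.ent_type_ not in IRRELEVANT_ENTITY_TYPES and not t.is_punct]

print(filter_by_named_entities("Nike running shoes shipped from Shanghai"))
```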
In addition, specifically when model training is performed, the apparatus may further include:
a grouping unit, configured to group the HScodes according to category information of one of the levels under the category system of a related online sales system before the feature words obtained from the individual training samples are aggregated and de-duplicated, to obtain multiple groups, each of which contains multiple HScodes, so that the aggregation and de-duplication of the feature words, the generation of the feature vectors and the model training are performed per HScode group.
Specifically, the classification model may further store the correspondence between each group and the HScodes;
when prediction is performed, the apparatus may further include:
a group determining unit, configured to determine the corresponding HScode group for the target logistics object according to the category to which it belongs under the category system of the online sales system;
the prediction unit may be specifically configured to:
input the HScode group corresponding to the target logistics object and the feature word vector into the classification model, so as to determine the probability that the target logistics object belongs to the category corresponding to each HScode in that group.
Corresponding to Embodiment 2, an embodiment of the present application further provides an apparatus for generating a customs code classification model. Referring to FIG. 7, the apparatus may specifically include:
a sample collection unit 701, configured to collect training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
a feature word determining unit 702, configured to perform word segmentation on the text description information in the training samples and to filter out invalid words to obtain feature words;
a feature word aggregation unit 703, configured to aggregate and de-duplicate the feature words obtained from the individual training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
a feature word vector generating unit 704, configured to generate, according to which feature words at each sequence number the individual training samples contain, the feature word vector corresponding to each training sample;
a training unit 705, configured to respectively input the feature word vectors corresponding to the multiple training samples associated with the same code into a preset machine learning model for training, to obtain the code classification model corresponding to each code; the code classification model stores the feature word weight vector corresponding to each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
In addition, corresponding to Embodiment 1 of the present application, an embodiment of the present application further provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions which, when read and executed by the one or more processors, perform the following operations:
determining text description information of a target logistics object to be classified and processing the text description information to determine the target feature words it contains;
generating, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object;
inputting the feature word vector into a code classification model to obtain the corresponding classification feature information.
FIG. 8 exemplarily shows the architecture of the computer system, which may specifically include a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814 and a memory 820. The processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813 and the network interface 814 may be communicatively connected with the memory 820 through a communication bus 830.
The processor 810 may be implemented as a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the present application.
The memory 820 may be implemented in the form of a ROM (read-only memory), a RAM (random access memory), a static storage device, a dynamic storage device or the like. The memory 820 may store an operating system 821 for controlling the operation of the computer system 800 and a basic input/output system (BIOS) 822 for controlling low-level operations of the computer system 800. In addition, a web browser 823, a data storage management system 824, a classification processing system 825 and the like may also be stored. The classification processing system 825 may be the application program that specifically implements the operations of the foregoing steps in the embodiments of the present application. In short, when the technical solutions provided in the present application are implemented by software or firmware, the relevant program code is stored in the memory 820 and is called and executed by the processor 810.
The input/output interface 813 is used to connect input/output modules to implement information input and output. The input/output modules may be configured in the device as components (not shown in the figure) or may be externally connected to the device to provide corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors and the like, and the output devices may include a display, a speaker, a vibrator, indicator lights and the like.
The network interface 814 is used to connect a communication module (not shown in the figure) to implement communication and interaction between this device and other devices. The communication module may communicate in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, WiFi or Bluetooth).
The bus 830 includes a path for transmitting information between the various components of the device (for example the processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814 and the memory 820).
In addition, the computer system 800 may also obtain information about specific receiving conditions from a virtual resource object receiving condition information database 841 for use in condition judgment, and so on.
It should be noted that although the above device only shows the processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, the memory 820, the bus 830 and so on, in a specific implementation process the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solution of the present application, without necessarily including all the components shown in the figure.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary general-purpose hardware platform. Based on such an understanding, the part of the technical solution of the present application that is essential, or that contributes to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the individual embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made between them, and each embodiment focuses on its differences from the other embodiments. In particular, the system and system embodiments are described relatively simply because they are basically similar to the method embodiments; for the relevant parts, reference may be made to the description of the method embodiments. The system and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The logistics object information processing method, apparatus and computer system provided in the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (21)

  1. A logistics object information processing method, characterized by comprising:
    determining text description information of a target logistics object to be classified, and processing the text description information to determine the target feature words it contains;
    generating, according to which target feature words the text description information contains, a feature word vector corresponding to the target logistics object; and
    inputting the feature word vector into a code classification model to obtain corresponding classification feature information.
  2. The method according to claim 1, characterized in that the code classification model comprises a logistic regression model, a decision tree model, or a neural network model.
  3. The method according to claim 2, characterized in that, if the code classification model is a logistic regression model, the code classification model stores a feature word weight vector corresponding to each code.
  4. The method according to claim 3, characterized in that the code comprises a customs code HScode, and the code classification model stores a feature word weight vector corresponding to each customs code HScode; the feature word weight vector records a discrimination weight value of each feature word for the associated HScode.
  5. The method according to claim 2, characterized in that, if the code classification model is a decision tree model, the code classification model stores multiple tree models and, for each tree, stores split thresholds and features of the feature word vector, so as to determine a probability that the target logistics object is classified into a category corresponding to each potential code.
  6. The method according to claim 2, characterized in that, if the code classification model is a neural network model, the code classification model has multiple layers of non-linear transformation units, each layer of non-linear transformation units being connected in series with the non-linear transformation units of the next layer, and each layer of non-linear transformation units storing feature weights based on the feature word vector or on a feature vector derived from the feature word vector, so that a probability that the logistics object is classified into a category corresponding to each potential code is obtained through the interaction of the multiple layers of non-linear transformation units.
  7. The method according to claim 1, characterized in that
    the inputting the feature word vector into a code classification model to obtain corresponding classification feature information comprises:
    inputting the feature word vector into the code classification model, and determining a probability that the target logistics object is classified into a category corresponding to each potential code.
  8. The method according to claim 7, characterized by further comprising:
    providing classification suggestion information according to the probability.
  9. The method according to claim 3 or 7, characterized in that
    the code classification model is established in the following manner:
    collecting training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a customs code HScode;
    performing word segmentation on the text description information in the training samples, and filtering out invalid words to obtain feature words;
    aggregating and de-duplicating the feature words obtained from the individual training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
    generating, according to which feature words at each sequence number the individual training samples contain, a feature word vector corresponding to each training sample; and
    respectively inputting the feature word vectors corresponding to multiple training samples associated with the same HScode into a preset machine learning model for training, to obtain a code classification model corresponding to each HScode.
  10. The method according to claim 9, characterized in that
    the generating the feature word vector corresponding to the target logistics object comprises:
    generating the feature word vector corresponding to the target logistics object according to which feature words, by sequence number, the text description information of the target logistics object contains.
  11. The method according to claim 9, characterized in that,
    after the collecting training samples, the method further comprises:
    performing data cleaning on the training samples, so that the remaining valid training samples are used to train the classification model.
  12. The method according to claim 11, characterized in that
    the performing data cleaning on the training samples comprises:
    storing in advance mapping relationship information between old and new HScodes; and
    for a training sample in which an old HScode appears, replacing the old HScode with the new HScode according to the mapping relationship, and then adding the sample to the training sample set as a valid training sample.
  13. The method according to claim 11, characterized in that
    the performing data cleaning on the training samples comprises:
    storing in advance a list of deactivated HScodes; and
    deleting the training samples in which a deactivated HScode appears.
  14. The method according to claim 11, characterized in that
    the performing data cleaning on the training samples comprises:
    storing in advance a list of split HScode information, wherein each item of split HScode information includes an HScode before splitting and corresponding multiple HScodes after splitting; and
    extracting the training samples in which the pre-split HScode appears, so that after the post-split HScode has been re-determined, the samples are added to the training sample set as valid training samples.
  15. The method according to claim 9, characterized in that
    the filtering out invalid words comprises:
    performing named entity recognition on the words obtained from the word segmentation result of the text description information, and filtering out, according to the named entity recognition result, words that are irrelevant to the classification of the logistics object.
  16. The method according to claim 9, characterized in that,
    before the aggregating and de-duplicating the feature words obtained from the individual training samples, the method further comprises:
    grouping the HScodes according to category information of one of the levels under a category system of a related online sales system, to obtain multiple groups, each group containing multiple HScodes, so that the aggregation and de-duplication of the feature words, the generation of the feature vectors and the model training are performed per HScode group.
  17. The method according to claim 16, characterized in that
    the code classification model further stores a correspondence between each group and the HScodes;
    the method further comprises:
    determining, according to the category to which the target logistics object belongs under the category system of the online sales system, the corresponding HScode group for the target logistics object; and
    the inputting the feature word vector into the code classification model and determining the probability that the target commodity object belongs to the category corresponding to each HScode comprises:
    inputting the HScode group corresponding to the target commodity object and the feature word vector into the code classification model, so as to determine the probability that the target logistics object belongs to the category corresponding to each HScode in the group.
  18. A method for generating a code classification model, characterized by comprising:
    collecting training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
    performing word segmentation on the text description information in the training samples, and filtering out invalid words to obtain feature words;
    aggregating and de-duplicating the feature words obtained from the individual training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
    generating, according to which feature words at each sequence number the individual training samples contain, a feature word vector corresponding to each training sample; and
    respectively inputting the feature word vectors corresponding to multiple training samples associated with the same code into a preset machine learning model for training, to obtain a code classification model corresponding to each code, wherein the code classification model stores a feature word weight vector corresponding to each code, and the feature word weight vector records a discrimination weight value of each feature word for the associated code.
  19. A logistics object information processing apparatus, characterized by comprising:
    a target logistics object information determining unit, configured to determine text description information of a target logistics object to be classified and to process the text description information to determine the target feature words it contains;
    a feature vector generating unit, configured to generate, according to which target feature words the text description information contains, a feature word vector corresponding to the target logistics object; and
    a classification feature information obtaining unit, configured to input the feature word vector into a code classification model to obtain corresponding classification feature information.
  20. An apparatus for generating a code classification model, characterized by comprising:
    a sample collection unit, configured to collect training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
    a feature word determining unit, configured to perform word segmentation on the text description information in the training samples and to filter out invalid words to obtain feature words;
    a feature word aggregation unit, configured to aggregate and de-duplicate the feature words obtained from the individual training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
    a feature word vector generating unit, configured to generate, according to which feature words at each sequence number the individual training samples contain, a feature word vector corresponding to each training sample; and
    a training unit, configured to respectively input the feature word vectors corresponding to multiple training samples associated with the same code into a preset machine learning model for training, to obtain a code classification model corresponding to each code, wherein the code classification model stores a feature word weight vector corresponding to each code, and the feature word weight vector records a discrimination weight value of each feature word for the associated code.
  21. A computer system, characterized by comprising:
    one or more processors; and
    a memory associated with the one or more processors, the memory being configured to store program instructions which, when read and executed by the one or more processors, perform the following operations:
    determining text description information of a target logistics object to be classified, and processing the text description information to determine the target feature words it contains;
    generating, according to which target feature words the text description information contains, a feature word vector corresponding to the target logistics object; and
    inputting the feature word vector into a code classification model to obtain corresponding classification feature information.
PCT/CN2019/099552 2018-08-17 2019-08-07 Logistics object information processing method, device and computer system WO2020034880A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810943287.XA CN110858219A (en) 2018-08-17 2018-08-17 Logistics object information processing method and device and computer system
CN201810943287.X 2018-08-17

Publications (1)

Publication Number Publication Date
WO2020034880A1 true WO2020034880A1 (en) 2020-02-20

Family

ID=69524695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/099552 WO2020034880A1 (en) 2018-08-17 2019-08-07 Logistics object information processing method, device and computer system

Country Status (3)

Country Link
CN (1) CN110858219A (en)
TW (1) TW202009748A (en)
WO (1) WO2020034880A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022081127A1 (en) * 2020-10-12 2022-04-21 Hewlett-Packard Development Company, L.P. Document language prediction

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613585A (en) * 2021-01-07 2021-04-06 绿湾网络科技有限公司 Method and device for determining article category
CN113343640B (en) * 2021-05-26 2024-02-20 南京大学 Method and device for classifying customs commodity HS codes
CN116166805B (en) * 2023-02-24 2023-09-22 北京青萌数海科技有限公司 Commodity coding prediction method and device
CN116776831B (en) * 2023-03-27 2023-12-22 阿里巴巴(中国)有限公司 Customs, customs code determination, decision tree construction method and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294355A (en) * 2015-05-14 2017-01-04 阿里巴巴集团控股有限公司 A kind of determination method and apparatus of business object attribute
US20170278015A1 (en) * 2016-03-24 2017-09-28 Accenture Global Solutions Limited Self-learning log classification system
CN108182279A (en) * 2018-01-26 2018-06-19 有米科技股份有限公司 Object classification method, device and computer equipment based on text feature
CN108228622A (en) * 2016-12-15 2018-06-29 平安科技(深圳)有限公司 The sorting technique and device of traffic issues

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120052636A (en) * 2010-11-16 2012-05-24 한국전자통신연구원 A hscode recommendation service system and method using ontology
US8498986B1 (en) * 2012-01-31 2013-07-30 Business Objects Software Ltd. Classifying data using machine learning
WO2014036282A2 (en) * 2012-08-31 2014-03-06 The Dun & Bradstreet Corporation System and process of associating import and/or export data with a corporate identifier relating to buying and supplying goods
CN105354194A (en) * 2014-08-19 2016-02-24 上海中怡通信息科技有限公司 Intelligent commodity classifying method and system
WO2016057000A1 (en) * 2014-10-08 2016-04-14 Crimsonlogic Pte Ltd Customs tariff code classification
CN105117426B (en) * 2015-07-31 2019-04-16 重庆龙工场跨境电子商务投资有限公司 A kind of intellectual coded searching method of customs
CN106294684A (en) * 2016-08-06 2017-01-04 上海高欣计算机系统有限公司 The file classification method of term vector and terminal unit
CN106503236B (en) * 2016-10-28 2020-09-11 北京百度网讯科技有限公司 Artificial intelligence based problem classification method and device
CN108334522B (en) * 2017-01-20 2021-12-14 阿里巴巴集团控股有限公司 Method for determining customs code, and method and system for determining type information
CN106897428B (en) * 2017-02-27 2022-08-09 腾讯科技(深圳)有限公司 Text classification feature extraction method and text classification method and device
EP3446241A4 (en) * 2017-06-20 2019-11-06 Accenture Global Solutions Limited Automatic extraction of a training corpus for a data classifier based on machine learning algorithms
CN107301248B (en) * 2017-07-19 2020-07-21 百度在线网络技术(北京)有限公司 Word vector construction method and device of text, computer equipment and storage medium
CN107704892B (en) * 2017-11-07 2019-05-17 宁波爱信诺航天信息有限公司 A kind of commodity code classification method and system based on Bayesian model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294355A (en) * 2015-05-14 2017-01-04 阿里巴巴集团控股有限公司 A kind of determination method and apparatus of business object attribute
US20170278015A1 (en) * 2016-03-24 2017-09-28 Accenture Global Solutions Limited Self-learning log classification system
CN108228622A (en) * 2016-12-15 2018-06-29 平安科技(深圳)有限公司 The sorting technique and device of traffic issues
CN108182279A (en) * 2018-01-26 2018-06-19 有米科技股份有限公司 Object classification method, device and computer equipment based on text feature

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022081127A1 (en) * 2020-10-12 2022-04-21 Hewlett-Packard Development Company, L.P. Document language prediction

Also Published As

Publication number Publication date
TW202009748A (en) 2020-03-01
CN110858219A (en) 2020-03-03

Similar Documents

Publication Publication Date Title
WO2020034880A1 (en) Logistics object information processing method, device and computer system
US11682093B2 (en) Document term recognition and analytics
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
CN111639516B (en) Analysis platform based on machine learning
EP3121738A1 (en) Data storage extract, transform and load operations for entity and time-based record generation
CN113256367B (en) Commodity recommendation method, system, equipment and medium for user behavior history data
US11580119B2 (en) System and method for automatic persona generation using small text components
CN110827112B (en) Deep learning commodity recommendation method and device, computer equipment and storage medium
US11367117B1 (en) Artificial intelligence system for generating network-accessible recommendations with explanatory metadata
CN112884551A (en) Commodity recommendation method based on neighbor users and comment information
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN111966886A (en) Object recommendation method, object recommendation device, electronic equipment and storage medium
CN113722611A (en) Method, device and equipment for recommending government affair service and computer readable storage medium
CN111695024A (en) Object evaluation value prediction method and system, and recommendation method and system
CN112990973A (en) Online shop portrait construction method and system
JP6242540B1 (en) Data conversion system and data conversion method
CN110109902A (en) A kind of electric business platform recommender system based on integrated learning approach
US10795956B1 (en) System and method for identifying potential clients from aggregate sources
CN114997916A (en) Prediction method, system, electronic device and storage medium of potential user
CN111429161A (en) Feature extraction method, feature extraction device, storage medium, and electronic apparatus
CN113570437A (en) Product recommendation method and device
EP3489838A1 (en) Method and apparatus for determining an association
CN112328899B (en) Information processing method, information processing apparatus, storage medium, and electronic device
CN114065063A (en) Information processing method, information processing apparatus, storage medium, and electronic device
CN114942974A (en) E-commerce platform commodity user evaluation emotional tendency classification method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19850044

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19850044

Country of ref document: EP

Kind code of ref document: A1