WO2020034880A1 - Logistics object information processing method, device and computer system - Google Patents

Logistics object information processing method, device and computer system

Info

Publication number
WO2020034880A1
WO2020034880A1 (PCT/CN2019/099552; CN2019099552W)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature word
hscode
classification model
logistics object
Prior art date
Application number
PCT/CN2019/099552
Other languages
French (fr)
Chinese (zh)
Inventor
郑恒
张振华
李驰
Original Assignee
菜鸟智能物流控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 菜鸟智能物流控股有限公司 filed Critical 菜鸟智能物流控股有限公司
Publication of WO2020034880A1 publication Critical patent/WO2020034880A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083 - Shipping
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083 - Shipping
    • G06Q10/0838 - Historical data

Definitions

  • HScode (The Harmonization System Code) is the customs commodity code that must be provided to customs during clearance.
  • HS uses a six-digit code to classify all internationally traded goods into 22 categories and 98 chapters. Chapters are subdivided into headings and subheadings. The first and second digits of a commodity code represent the chapter, the third and fourth digits the heading, and the fifth and sixth digits the subheading. The first six digits are the international standard HS code; HS has 1,241 four-digit headings and 5,113 six-digit subheadings.
  • HScode classification of specific commodities is required in order to facilitate customs clearance.
  • HScode classification means determining the HScode to which a commodity belongs based on its specific information (text description, pictures, etc.) and the HScode classification rules.
  • HScode classification subdivides commodities to a much finer degree than ordinary e-commerce category systems: within the same clothing category, different materials, different styles, and even different weaving methods correspond to different HScodes. HScode classification of commodities is therefore a very tedious process.
  • the application provides a logistics object information processing method, device, and computer system, which can automatically classify logistics objects, reducing labor costs and the probability of errors.
  • a logistics object information processing method includes: determining text description information of a target logistics object to be classified and processing it to determine the target feature words it contains; generating a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.
  • a method for generating a coding classification model includes:
  • collecting training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
  • aggregating and deduplicating the feature words obtained from all training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
  • inputting the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code; the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
  • a target logistics object information determining unit, configured to determine text description information of a target logistics object to be classified and process the text description information to determine the target feature words it contains;
  • a feature vector generating unit, configured to generate a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains;
  • a classification feature information acquiring unit, configured to input the feature word vector into a coding classification model to obtain corresponding classification feature information.
  • a device for generating a coding classification model includes:
  • a sample collection unit, configured to collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
  • a feature word aggregating unit, configured to aggregate and deduplicate the feature words obtained from all training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
  • a training unit, configured to input the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code; the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
  • a computer system includes:
  • one or more processors; and
  • a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:
  • the feature word vector is input into a coding classification model to obtain corresponding classification feature information.
  • a coding classification model can be determined in advance.
  • for a target logistics object to be classified, its text description information can be obtained and processed to determine the target feature words it contains.
  • a feature word vector corresponding to the target logistics object is generated according to which target feature words the text description information contains.
  • the feature word vector can then be input into the coding classification model to obtain the corresponding classification feature information.
  • FIG. 1 is a schematic diagram of an overall framework provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a system provided by an embodiment of the present application.
  • FIG. 4 is a schematic interface diagram of a classification tool provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a model training method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a second device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer system according to an embodiment of the present application.
  • the data is processed and input into specific machine learning models for training, and finally a specific coding classification model is established.
  • this code classification model can be used to predict the codes to which specific logistics objects belong.
  • the specific prediction result can be directly used as the encoding classification result, or it can also be used as a reference for the encoding classification result, and so on.
  • after the coding classification model is established, it can be provided to merchant users, the customs clearance partners (CPs) of a cross-border online sales system, and customs departments for use in the customs clearance of logistics objects, so that machine classification replaces or partially replaces traditional manual classification, improving the efficiency of customs clearance and reducing classification costs for enterprises.
  • to make the model easier to use, the technical threshold is lowered by providing an interface-based classification tool, either an online tool or a locally installable application. Users only need to enter the text description information of the target logistics object to be classified through the input box and other controls provided in the interface, and the classification tool automatically processes it and calls the pre-configured coding classification model to give a final classification suggestion.
  • a specific coding classification model may be established in advance.
  • the steps of establishing the model may first include:
  • for example, if the HScode of a type of logistics object was changed from 6110110000 to 6110110011 by a tariff revision, this mapping relationship can be saved. After training samples are collected, if a training sample is found to contain HScode 6110110000, it can be changed to 6110110011 according to the saved mapping, making the training sample valid data.
  • Step 4: Generate a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains;
  • each training sample corresponds to a vector of 10,000 dimensions or more.
  • the number of feature words in a sample is usually very small relative to the total number of dimensions of the vector, so the value of most elements in the vector is 0, which may waste computing resources.
  • HScodes can also be grouped in advance.
  • the commodity categories corresponding to certain HScodes have strong similarities, so they can be divided into a group to form a large category, and so on.
  • the basis for grouping HScodes can also be information such as the category system defined in the online sales system; in this way, the category system defined in the online sales system is associated with the customs HScodes, which also makes it convenient to perform more efficient classification predictions later.
  • the classification model can be trained with each group as a unit. In this way, the number of training samples in each group will be reduced, so the total number of corresponding feature words will also be reduced. In the end, the dimension of the feature word vector corresponding to each training sample will also be reduced, thereby reducing the calculation amount and improving the training efficiency.
  • each HScode can respectively correspond to a feature word weight vector.
  • the weights corresponding to the feature words on the same sequence number may be different.
  • the trained classification model can be persistently stored in a storage medium such as a disk, or, as described above, an interfaced classification tool can be generated based on the model and provided to various users for use.
  • a coding classification model can also be obtained in a similar manner.
  • the above-mentioned process of establishing a coding classification model may be completed in advance. After the completion, a specific model may be used to classify the target logistics object to be classified. Specifically, referring to FIG. 3, the following steps may be included:
  • the text description information of the target logistics object to be classified can be determined, where the specific text description information can be obtained from information such as the title of the logistics object.
  • an interface-based tool is provided, as shown in FIG. 4, an interface for inputting text description information of a target logistics object may also be provided in the interface, for example, it may be an input box.
  • an entry for importing the text description information of multiple target logistics objects in batches can also be provided; users can organize the text description information of the logistics objects to be classified in advance in an Excel table or similar, naming the data columns in the table.
  • a coding classification model can be determined in advance, so that, for a target logistics object to be classified, its text description information can be obtained and processed to determine the target feature words it contains, a feature word vector corresponding to the target logistics object is generated according to which target feature words the text description information contains, and the feature word vector is then input into the coding classification model to obtain the corresponding classification feature information.
  • automatic classification of logistics objects can be achieved without relying on manual classification, so efficiency and accuracy can be improved.
  • S501: Collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
  • a feature vector generating unit 602, configured to determine text description information of a target logistics object to be classified and process the text description information to determine the target feature words it contains;
  • the code includes a customs code, HScode;
  • the coding classification model stores a feature word weight vector corresponding to each customs code (HScode);
  • the feature word weight vector records the discrimination weight value of each feature word for the associated HScode.
  • the coding classification model is a decision tree model;
  • the coding classification model stores multiple tree models; each tree stores the split features and split thresholds over the feature word vector, which are used to determine the probability that the target logistics object is classified into the category corresponding to each potential code.
  • the coding classification model is a neural network model;
  • the coding classification model has multiple layers of non-linear transformation units, each layer connected in series with the next; each layer stores feature weights based on the feature word vector or on feature vectors derived from it, so that the probability that the logistics object is classified into the category corresponding to each potential code is obtained through the interaction of the multiple layers of non-linear transformation units.
  • the classification feature information acquiring unit may be specifically configured to input the feature word vector into the coding classification model and determine the probability that the target logistics object is classified into the category corresponding to each potential code. It can also be used to provide classification recommendation information according to the probability.
  • the coding classification model is established in the following manner:
  • a sample collection unit configured to collect training samples, where each training sample includes a correspondence relationship between known text description information of logistics objects and HScode;
  • a feature word vector generating unit configured to generate a feature word vector corresponding to each training sample according to the inclusion of the feature words on each sequence number in each training sample;
  • a training unit is configured to input feature word vectors corresponding to multiple training samples associated with the same HScode into a preset machine learning model for training, and obtain a classification model corresponding to each HScode.
  • the device may further include:
  • the data cleaning unit may be specifically configured to:
  • the device may further include:
  • a sample collection unit 701 is configured to collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;

Abstract

Disclosed by the embodiments of the present application are a logistics object information processing method, a device and a computer system. The method comprises: determining text description information of a target logistics object to be categorized, processing the text description information, and determining a target characteristic word contained therein; generating a characteristic word vector corresponding to the target logistics object according to the inclusion of each target characteristic word in the text description information; and inputting the characteristic word vector into a code classification model, and acquiring corresponding classification characteristic information. By means of the embodiments of the present application, the automatic classification of a logistics object code may be achieved so as to reduce the probability of an error occurring while reducing labor costs.

Description

Logistics object information processing method, device and computer system
This application claims priority to Chinese patent application No. 201810943287.X, filed on August 17, 2018 and entitled "Logistics Object Information Processing Method, Device, and Computer System", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of logistics object information processing, and in particular to a logistics object information processing method, apparatus, and computer system.
Background
HScode (The Harmonization System Code, the harmonized commodity name and coding system code, referred to as the customs code) is the core data that must be provided to customs during clearance and determines the export tax rate and the tax refund rate. HS uses a six-digit code to classify all internationally traded goods into 22 categories and 98 chapters. Chapters are subdivided into headings and subheadings. The first and second digits of a commodity code represent the chapter, the third and fourth digits the heading, and the fifth and sixth digits the subheading. The first six digits are the international standard HS code; HS has 1,241 four-digit headings and 5,113 six-digit subheadings.
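The six-digit structure described above lends itself to a simple decomposition. The following is a minimal sketch, not part of the original disclosure, of how a code string might be split into chapter, heading, and subheading; the function name and the treatment of digits beyond the sixth as a national extension are illustrative assumptions.

```python
def parse_hscode(hscode: str) -> dict:
    """Split an HS code into chapter, heading and subheading.

    The first two digits give the chapter, digits 1-4 the heading and
    digits 1-6 the subheading; anything beyond the sixth digit is treated
    here as a national extension (e.g. the extra digits of a 10-digit code).
    """
    digits = "".join(ch for ch in hscode if ch.isdigit())
    if len(digits) < 6:
        raise ValueError(f"expected at least 6 digits, got {hscode!r}")
    return {
        "chapter": digits[:2],
        "heading": digits[:4],
        "subheading": digits[:6],
        "national_extension": digits[6:],
    }

# Example with a 10-digit code that appears later in this document:
print(parse_hscode("6110110000"))
# {'chapter': '61', 'heading': '6110', 'subheading': '611011', 'national_extension': '0000'}
```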
In cross-border e-commerce and other import/export trade, specific commodities must be assigned an HScode to facilitate customs clearance. HScode classification means determining the HScode to which a commodity belongs based on its specific information (text description, pictures, etc.) and the HScode classification rules. Unlike the commodity classification used in ordinary e-commerce systems, HScode classification subdivides commodities far more finely: within the same clothing category, different materials, different styles, and even different weaving methods correspond to different HScodes. HScode classification of commodities is therefore a very tedious process.
At present, most enterprises in the industry use manual pre-classification. However, even a customs declaration expert with rich experience needs about 2 to 15 minutes to classify one SKU (stock keeping unit), some extremely complicated commodities take hours or even longer, and one expert can handle at most about 200 SKUs per day. In addition, because pre-classification professionals are scarce and the learning threshold is high, manual pre-classification suffers from high cost, low timeliness, and long response times. According to statistics, classifying one SKU currently costs 200 to 500 RMB, and for some special categories such as electromechanical products the pre-classification cost can exceed 1,500 RMB. For large cross-border e-commerce platforms, however, the number of SKUs involved is huge, up to several billion, and such costs are clearly unacceptable. Moreover, facing tens of millions of B2C cross-border orders per day and delivery targets of 72 hours or less, a throughput of 200 SKUs per person per day obviously cannot respond to and meet demand in time. Furthermore, manual classification relies too heavily on expert experience; given the huge daily workload, mistakes are inevitable, and the error rate gradually increases as the workload grows, so that declared goods are challenged by customs and the enterprise's qualifications are affected.
Therefore, how to classify commodities more efficiently, reducing cost while also reducing the probability of errors, has become a technical problem to be solved by those skilled in the art.
Summary of the Invention
The present application provides a logistics object information processing method, apparatus, and computer system, which can automatically classify logistics objects, reducing labor costs while reducing the probability of errors.
The present application provides the following solutions:
A logistics object information processing method includes:
determining text description information of a target logistics object to be classified and processing the text description information to determine the target feature words it contains;
generating a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and
inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.
A method for generating a coding classification model includes:
collecting training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
performing word segmentation on the text description information in the training samples and filtering out invalid words to obtain feature words;
aggregating and deduplicating the feature words obtained from all training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
generating a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains; and
inputting the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code, where the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
A logistics object information processing apparatus includes:
a target logistics object information determining unit, configured to determine text description information of a target logistics object to be classified and process the text description information to determine the target feature words it contains;
a feature vector generating unit, configured to generate a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and
a classification feature information acquiring unit, configured to input the feature word vector into a coding classification model to obtain corresponding classification feature information.
An apparatus for generating a coding classification model includes:
a sample collecting unit, configured to collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and a code;
a feature word determining unit, configured to perform word segmentation on the text description information in the training samples and filter out invalid words to obtain feature words;
a feature word aggregating unit, configured to aggregate and deduplicate the feature words obtained from all training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
a feature word vector generating unit, configured to generate a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains; and
a training unit, configured to input the feature word vectors of the training samples associated with the same code into a preset machine learning model for training, to obtain a coding classification model corresponding to each code, where the coding classification model stores a feature word weight vector for each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
A computer system includes:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the following operations:
determining text description information of a target logistics object to be classified and processing the text description information to determine the target feature words it contains;
generating a feature word vector corresponding to the target logistics object according to which target feature words the text description information contains; and
inputting the feature word vector into a coding classification model to obtain corresponding classification feature information.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
Through the embodiments of the present application, a coding classification model can be determined in advance. For a target logistics object to be classified, its text description information can then be obtained and processed to determine the target feature words it contains; a feature word vector corresponding to the target logistics object is generated according to which target feature words the text description information contains; and the feature word vector is input into the coding classification model to obtain the corresponding classification feature information. In this way, logistics objects are classified automatically rather than manually, improving both efficiency and accuracy.
In an optional implementation, a classification model for each HScode can be obtained by collecting and processing training samples and performing machine learning training. The model can be represented by a feature word weight vector, which records the discrimination weight value of each feature word for the associated HScode. When predicting a target data object, its text description information is segmented and otherwise processed to determine the feature words it contains and to generate a feature word vector. This feature word vector is input into the previously trained classification model, so the probability that the target commodity object is classified into each HScode can be calculated, and recommendation information (for example, one or several suggested HScodes) can be given accordingly. In this way, HScode classification of target data objects no longer depends entirely on experts, labor costs are reduced, and classification efficiency improves without being limited by expert experience and personal ability, reducing the error rate.
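To make the optional implementation above concrete, the following is a rough sketch of how a feature word vector might be scored against per-HScode feature word weight vectors to obtain per-code probabilities and top suggestions. It assumes a logistic-regression-style model without a bias term; the feature ids, codes, and weights in the example are hypothetical, not values from the application.

```python
import math

def score_hscodes(sample_vector, models, top_k=3):
    """Rank candidate HScodes for one logistics object.

    sample_vector : {feature_id: value} sparse feature word vector of the item
    models        : {hscode: {feature_id: discrimination_weight}} per-code weight vectors
    Returns the top_k (hscode, probability) pairs; the sigmoid of the sparse
    dot product stands in for a logistic-regression score (no bias term).
    """
    scored = []
    for hscode, weights in models.items():
        z = sum(value * weights.get(fid, 0.0) for fid, value in sample_vector.items())
        scored.append((hscode, 1.0 / (1.0 + math.exp(-z))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical feature ids (1 = "wool", 4 = "cardigan", 12 = "knitted") and weights.
models = {
    "6110110011": {1: 2.1, 4: 1.7, 12: 0.9},
    "6110300012": {1: -0.4, 4: 1.2, 12: 0.8},
}
print(score_hscodes({1: 0.2, 4: 0.5, 12: 0.6}, models))
```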
Of course, implementing any product of this application does not necessarily require achieving all of the advantages described above at the same time.
Brief Description of the Drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the overall framework provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a system provided by an embodiment of the present application;
FIG. 3 is a flowchart of a prediction method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the interface of a classification tool provided by an embodiment of the present application;
FIG. 5 is a flowchart of a model training method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a first apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer system provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application fall within the protection scope of the present application.
In the embodiments of the present application, in order to improve the efficiency of classifying logistics objects and reduce labor costs, a coding classification model can be established in advance; for example, it may be a logistic regression model, a decision tree model, a neural network model, and so on. If the coding classification model is a logistic regression model, it can be built through machine learning and then used to automatically classify logistics objects (which may specifically be commodity objects and the like) to determine their corresponding codes. Specifically, as shown in FIG. 1, training samples can be collected, which may be known correspondences between the text description information of commodity objects and codes such as HScodes. Then, according to the characteristics of the training samples and the training objective, the data are processed and input into a specific machine learning model for training, and a specific coding classification model is finally established. This coding classification model can then be used to predict the code to which a specific logistics object belongs. The prediction result can be used directly as the coding classification result, or as a reference for the coding classification result, and so on.
In specific implementations, after the coding classification model is established, it can be provided to merchant users, the customs clearance partners (CPs) of a cross-border online sales system, customs departments, and so on, for use in the customs clearance of logistics objects, so that machine classification replaces or partially replaces traditional manual classification, improving the efficiency of customs clearance and reducing the classification cost for enterprises. In addition, to make the model easier to use and lower its technical threshold, as shown in FIG. 2, an interface-based classification tool (either an online tool or an application that can be installed locally) can be further developed on top of the coding classification model. A user only needs to enter the text description information of the target logistics object to be classified through the input box and other controls provided in the interface; the classification tool then automatically processes it, calls the pre-configured coding classification model, and gives a final classification suggestion.
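As a companion to the interface-based tool described above, a batch entry point could look roughly like the sketch below, which reads item titles from a spreadsheet and writes back a suggested code per row. The column names, file formats, and the classify_fn callback are assumptions for illustration; the original only states that batch import (for example via Excel tables) may be supported.

```python
import pandas as pd

def classify_batch(input_path, output_path, classify_fn):
    """Batch front end for the classification tool.

    Reads a table whose 'title' column holds the text description of each
    logistics object, calls classify_fn(title) -> [(hscode, probability), ...]
    for every row, and writes the top suggestion back next to the input.
    """
    frame = pd.read_csv(input_path)  # pd.read_excel(...) for an .xlsx workbook
    top = frame["title"].map(lambda title: classify_fn(title)[0])
    frame["suggested_hscode"] = [code for code, _ in top]
    frame["confidence"] = [prob for _, prob in top]
    frame.to_csv(output_path, index=False)

# classify_batch("items.csv", "items_with_suggestions.csv", my_classify_fn)
```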
The specific implementation is described in detail below.
Embodiment 1
First, from the perspective of the aforementioned classification tool, Embodiment 1 of the present application provides a logistics object information processing method. In this method, a coding classification model is first obtained; for example, it may include a logistic regression model, a decision tree model, a neural network model, and so on. For a logistic regression model, the coding classification model may store a feature word weight vector for each code. For the customs code HScode, for example, the coding classification model may store a feature word weight vector for each HScode, and the feature word weight vector records the discrimination weight value of each feature word for the associated HScode.
In the case of the above logistic regression model, for HScodes, the specific coding classification model may be established in advance. In one specific implementation, the steps of establishing the model may first include:
Step 1: Collect training samples, where each training sample includes a known correspondence between the text description information of a logistics object and an HScode.
In specific implementations, labeled data in the historical classification records of logistics objects can be collected, for example the Import and Export Tariff of the People's Republic of China, historical customs clearance data, expert-labeled data, and so on.
For example, the information recorded in the Import and Export Tariff of the People's Republic of China may be as shown in Table 1 (only one entry is shown):
Table 1 (reproduced as an image in the original publication)
Of course, because the commodity descriptions recorded in the tariff usually do not refer to one particular item, historical customs clearance data from the cross-border online sales system can be used as a supplement to better complement the information in the tariff. For example, one item of historical customs clearance data may be as shown in Table 2:
Table 2 (reproduced as an image in the original publication)
That is, the historical customs clearance data records the correspondence between the names and other details of specific logistics objects and their HScodes. Incorporating such data into the training samples therefore helps to predict more accurate HScodes for specific logistics objects.
In addition to the tariff and the historical customs clearance data, expert-labeled data can also be used as a supplement. For example, one item of expert-labeled data may be as shown in Table 3:
Table 3 (reproduced as an image in the original publication)
In short, training samples can be collected in many ways. Of course, because the collected data are usually historical, practical applications may involve HScodes that have been changed, deactivated, or split, so some historical data may already be invalid for subsequent classification. For this reason, in a preferred implementation, the training samples can also be cleaned so that only the remaining valid training samples are used to train the classification model.
The data cleaning process may include modifying HScodes that have changed, deleting training samples corresponding to deactivated HScodes, re-determining the HScodes in training samples corresponding to split HScodes, and so on. In a specific implementation, the mapping between old and new HScodes can be saved in advance. After the training samples are collected, each training sample can be traversed to determine whether its HScode is an old one; a training sample containing an old HScode can be updated to the new HScode according to the mapping and then added to the training sample set as a valid training sample. For example, for a certain type of logistics object, the HScode defined in an earlier tariff was 6110110000; after a tariff revision, the HScode of this type of logistics object was changed to 6110110011, so this mapping relationship can be saved. After the training samples are collected, if a training sample is found to contain HScode 6110110000, it can be changed to 6110110011 according to the saved mapping, making the training sample valid data.
In addition, a list of deactivated HScodes can be saved in advance. After the training samples are collected, the HScode in each training sample can be traversed, and training samples containing a deactivated HScode can be deleted. For example, an HScode 6110110027 may have been deactivated after a tariff revision deleted the corresponding category; this HScode can be recorded, and after the training samples are collected, any training sample whose HScode is 6110110027 can be deleted.
Furthermore, an HScode may have been split. For example, in an old version of the tariff the HScode corresponding to a category was 6110110000; after a revision, the category was subdivided into two subcategories corresponding to HScode 6110110001 and HScode 6110110002, and the original HScode 6110110000 is no longer used. Therefore, a list of split HScode information can also be saved in advance, where each entry includes the HScode before the split and the corresponding HScodes after the split. After the training data are collected, the HScode in each record can be traversed, and training samples containing a pre-split HScode can be extracted so that the post-split HScode can be re-determined before the sample is added to the training sample set as a valid training sample. For example, if the HScode in a training sample is found to be 6110110000, it can be extracted; the post-split HScode can then be re-determined for that training sample, for example through expert confirmation, and the pre-split HScode replaced, making the training sample valid data.
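A compact sketch of the three cleaning rules just described (remapping changed codes, dropping deactivated codes, and setting aside samples whose codes were split for re-labeling) might look like this; the function and parameter names are illustrative, and split cases are simply routed to a review list rather than re-labeled automatically.

```python
def clean_training_samples(samples, code_mapping, deactivated, split_codes):
    """Apply the three cleaning rules to raw (text, hscode) training samples.

    code_mapping : {old_hscode: new_hscode} for codes that were renamed
    deactivated  : set of HScodes that are no longer in use
    split_codes  : {old_hscode: [new_hscode, ...]} for codes split into sub-codes
    Returns (valid_samples, needs_review); split cases are only flagged,
    because the correct sub-code has to be re-confirmed (e.g. by an expert).
    """
    valid, needs_review = [], []
    for text, code in samples:
        if code in deactivated:
            continue  # drop samples whose code was retired
        if code in split_codes:
            needs_review.append((text, code))  # re-label against the sub-codes
            continue
        valid.append((text, code_mapping.get(code, code)))  # remap renamed codes
    return valid, needs_review

valid, review = clean_training_samples(
    [("wool cardigan", "6110110000"), ("retired item", "6110110027")],
    code_mapping={"6110110000": "6110110011"},
    deactivated={"6110110027"},
    split_codes={},
)
# valid == [("wool cardigan", "6110110011")], review == []
```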
In addition to the above data cleaning, the training samples can also be manually verified, for example by random sampling, to improve their quality as much as possible and thus the accuracy of the finally trained model.
Step 2: Perform word segmentation on the text description information in the training samples and filter out invalid words to obtain feature words.
After data cleaning and other processing of the training samples, the next step can be performed. Specifically, each training sample contains text description information of a data object; this may be the title of a specific logistics object, the text description given in the tariff, the declaration elements submitted by the merchant at customs declaration, and so on. In short, each training sample records a correspondence between text description information and an HScode. The purpose of machine learning is to find regularities among the multiple pieces of text description information corresponding to the same HScode, which can then be used to predict HScodes. When processing the text description information, word segmentation is performed first: the text description information is usually a sentence or a passage, and the purpose of segmentation is to divide it into multiple words.
For example, the text description information in one training sample is: 春秋新款羊毛开衫女披肩外套薄针织衫短款V领小开衫宽松大码毛衣 (new spring/autumn wool cardigan, women's shawl coat, thin knitted sweater, short V-neck small cardigan, loose plus-size sweater). The segmentation result may be: 春秋/新款/羊毛/开衫/女/披肩/外套/薄/针织衫/短款/V领/小开衫/宽松/大码/毛衣. For the specific segmentation method, reference can be made to existing solutions, which are not repeated here.
After segmentation, words irrelevant to classification can be filtered out, leaving only valid feature words. For this purpose, named entity recognition can be performed on the words obtained from the segmentation result, and words irrelevant to the classification of the logistics object can be filtered out according to the recognition result. For example, again assuming the text description information of a training sample is 春秋新款羊毛开衫女披肩外套薄针织衫短款V领小开衫宽松大码毛衣, the result after segmentation and named entity recognition may be:
春秋 [season] / 新款 [promotional word] / 羊毛 [material] / 开衫 [category] / 女 [target group] / 披肩 [style] / 外套 [category] / 薄 [style] / 针织 [weaving method] / 短款 [style] / V领 [style] / 小开衫 [category] / 宽松 [style] / 大码 [style] / 毛衣 [category]
After the above named entity recognition, words irrelevant to the classification, such as season, promotional, and style words, can be removed, leaving words related to the classification such as category, material, and weaving method, which facilitates subsequent feature processing. Because the retained words better reflect the characteristics of the specific logistics object for HScode classification, they can be called feature words.
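The segmentation-plus-filtering step could be prototyped roughly as below. This assumes the third-party jieba segmenter and stands in for the named entity recognizer with a small hand-written lexicon, since the application does not prescribe a particular segmentation or NER tool; the entity types and the set of types treated as irrelevant are illustrative.

```python
import jieba  # third-party Chinese segmenter, used here only as an example

# A hand-written lexicon standing in for a real named-entity recognizer.
ENTITY_TYPE = {
    "羊毛": "material", "开衫": "category", "针织": "weaving method",
    "春秋": "season", "新款": "promotional word", "宽松": "style",
}
IRRELEVANT_TYPES = {"season", "promotional word", "style"}  # dropped before training

def extract_feature_words(description):
    """Segment a product title and keep only classification-relevant feature words."""
    kept = []
    for word in jieba.lcut(description):
        if ENTITY_TYPE.get(word) in IRRELEVANT_TYPES:
            continue  # e.g. season or marketing words carry no HScode signal
        kept.append(word)
    return kept

print(extract_feature_words("春秋新款羊毛开衫"))
# expected (if jieba splits as 春秋/新款/羊毛/开衫): ['羊毛', '开衫']
```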
Step 3: Aggregate and deduplicate the feature words obtained from all training samples to obtain a feature word set, and assign a corresponding sequence number to each feature word.
After the feature words are obtained, the feature words in all training samples can be aggregated and deduplicated to obtain a feature word set. In addition, to express the text description information of each training sample as a vector, so that probabilities can later be computed with vector operations, each feature word can be assigned a corresponding sequence number. For example, if the feature words of all training samples together amount to ten thousand, they can be numbered from 1 to 10,000. Then, for each training sample, a corresponding feature word vector only needs to be generated according to which feature words, by sequence number, the sample contains.
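Building the deduplicated feature word set with sequence numbers is straightforward; a minimal sketch, with numbering starting at 1 as in the text above:

```python
def build_vocabulary(samples_feature_words):
    """Aggregate and deduplicate feature words, assigning sequence numbers from 1.

    samples_feature_words : iterable of feature word lists, one per training sample
    Returns {feature_word: sequence_number}.
    """
    vocabulary = {}
    for words in samples_feature_words:
        for word in words:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary

print(build_vocabulary([["羊毛", "开衫"], ["羊毛", "毛衣"]]))
# {'羊毛': 1, '开衫': 2, '毛衣': 3}
```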
Step 4: Generate a feature word vector corresponding to each training sample according to which feature words, by sequence number, the training sample contains.
As described in Step 3, after the feature word set is generated and the sequence numbers are assigned, the text description information of each training sample can be expressed by generating a feature word vector for that sample according to which feature words, by sequence number, it contains. That is, if there are ten thousand feature words in total, each training sample corresponds to a ten-thousand-dimensional feature word vector. Because the feature words were obtained by segmenting, filtering, and aggregating the training samples, the feature words contained in each training sample necessarily exist in the feature word set; that is, they are a subset of the feature word set. For a given training sample, the value of each element in its feature word vector can be determined by whether a feature word exists at the corresponding sequence number. For example, if the feature words contained in a training sample are numbers 1, 12, 23, 25, 68, 1279, and so on, the elements at these sequence numbers in the sample's feature word vector can be set to 1 and the elements at other sequence numbers to 0, to express which feature words the sample contains. Alternatively, in another implementation, an initial weight can be assigned to the element at each sequence number based on information such as the attributes of the feature word; if a training sample contains the feature word at a given sequence number, the corresponding element can be set to that feature word's initial weight, representing how important the feature word is for classifying the commodity category corresponding to an HScode. For example, the generated feature word vector may be {1: 0.2, 4: 0.5, 12: 0.6, 1009: 0.3, 3801: 0.2, ...}, meaning the training sample contains feature words 1, 4, 12, 1009, 3801, and so on, with initial weights 0.2, 0.5, 0.6, 0.3, 0.2, and so on, respectively. Note that in this example, because the sample contains no features at the other sequence numbers (for example, 2, 3, 5, 6, ...), the corresponding element values are 0 and are not shown in the vector above; in a concrete implementation, to facilitate operations such as vector multiplication, the elements whose value is 0 also exist in the actual vector.
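A sparse representation of the feature word vector, in the {sequence number: value} form used in the example above, could be built as follows; the optional per-word initial weights are passed in as a dictionary, which is an illustrative choice.

```python
def to_feature_vector(feature_words, vocabulary, initial_weights=None):
    """Build the sparse {sequence_number: value} vector for one sample.

    If initial_weights (a {word: weight} dict) is given, the word's weight is
    used; otherwise presence is encoded as 1.0. Sequence numbers that do not
    appear are implicitly 0, keeping the ten-thousand-dimensional vector sparse.
    """
    vector = {}
    for word in feature_words:
        seq = vocabulary.get(word)
        if seq is None:
            continue  # words outside the vocabulary contribute nothing
        vector[seq] = 1.0 if initial_weights is None else initial_weights.get(word, 1.0)
    return vector

print(to_feature_vector(["羊毛", "开衫"], {"羊毛": 1, "开衫": 4}, {"羊毛": 0.2, "开衫": 0.5}))
# {1: 0.2, 4: 0.5}
```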
Of course, in a concrete implementation each training sample corresponds to a vector of ten thousand or even more dimensions, and the computation may therefore occupy considerable computing resources. Since the number of feature words contained in a single training sample is usually very small compared with the total number of dimensions, most elements of the vector are 0, which may waste computing resources. For this reason, in an optional implementation the HScodes can also be grouped in advance; for example, the commodity categories corresponding to certain HScodes are strongly similar, so those HScodes can be placed in one group to form a broad class, and so on. The basis for grouping the HScodes may also be information such as the category system defined in an online sales system; in this way the category system defined in the online sales system becomes associated with the customs HScodes, which also facilitates more efficient classification prediction later on.
For example, the category system defined in an online sales system includes first-level categories such as clothing, daily necessities, home appliances and computer consumables; each first-level category contains multiple second-level categories, a second-level category may contain third-level categories, and so on, down to the leaf categories. When the HScodes are grouped, the grouping can be based on a particular level of that category system; depending on the level chosen, the number of HScode groups and the number of HScodes contained in each group will differ. The specific choice can be made according to actual requirements.
After grouping in this way, the classification models can be trained group by group. The number of training samples within each group is reduced, so the total number of corresponding feature words is also reduced, and the dimension of the feature word vector of each training sample drops accordingly, which lowers the amount of computation and improves training efficiency.
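The per-group organisation described above can be sketched as follows; hscode_to_group stands in for the mapping from an HScode to one level of the sales catalogue, and the example codes and group names are only assumptions for illustration.

```python
from collections import defaultdict

def group_training_samples(samples, hscode_to_group):
    """Split (text, hscode) training samples into per-group pools, so that each
    group builds its own smaller feature word set and trains its own models."""
    groups = defaultdict(list)
    for text, hscode in samples:
        groups[hscode_to_group.get(hscode, "ungrouped")].append((text, hscode))
    return groups

# Hypothetical mapping from HScodes to a first-level catalogue category.
hscode_to_group = {"610910": "apparel", "620342": "apparel", "851712": "electronics"}
samples = [("men cotton t-shirt", "610910"),
           ("denim trousers", "620342"),
           ("smartphone 64gb", "851712")]
for group, pool in group_training_samples(samples, hscode_to_group).items():
    print(group, len(pool))
```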
Step 5: respectively inputting the feature word vectors corresponding to the multiple training samples associated with the same HScode into a preset machine learning model for training, to obtain the classification model corresponding to each HScode.
After the feature word vector of each training sample has been obtained, the feature word vectors corresponding to the multiple training samples associated with the same HScode can be input into a preset machine learning model for training. That is, if a particular HScode has 1000 associated training samples in total, the feature vectors of those 1000 training samples are input into the machine learning model for training. Various machine learning models can be used, including but not limited to classification models such as SVM, LR, naive Bayes and maximum entropy, as well as deep learning methods such as LSTM + softmax, and so on. After multiple rounds of iteration, once the algorithm converges, the classification model corresponding to that HScode is obtained. This classification model can likewise be represented by a vector, for example {f1: w1, f2: w2, f3: w3, f4: w4, f5: w5, f6: w6, ...}, where fn denotes the sequence number of a specific feature word and wn denotes the corresponding weight. In other words, for a given HScode, the training result expresses how important the feature word at each sequence number is to that HScode.
In short, after machine learning training, each HScode corresponds to a feature word weight vector, and across these vectors the weights of the feature words at the same sequence number may differ. The trained classification model can be persisted to a storage medium such as a disk, or, as described above, an interface-based classification tool can be generated from the model and provided to various users.
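As one possible realisation of the per-HScode training described above, the sketch below assumes scikit-learn's logistic regression as a stand-in for the "LR" model mentioned in the text; the one-vs-rest framing and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_hscode_models(X, hscodes):
    """Train one binary model per HScode and keep its feature word weight vector.

    X       : (n_samples, n_features) matrix of feature word vectors
    hscodes : HScode label of each sample
    Returns a dict: hscode -> (weight vector over feature words, bias).
    """
    hscodes = np.asarray(hscodes)
    models = {}
    for code in np.unique(hscodes):
        y = (hscodes == code).astype(int)              # this HScode vs. all others
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        models[code] = (clf.coef_.ravel(), float(clf.intercept_[0]))
    return models

# Tiny illustrative run: 4 samples, 6 feature words, 2 HScodes.
X = np.array([[1, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)
models = train_per_hscode_models(X, ["610910", "610910", "851712", "851712"])
print({code: w.round(2).tolist() for code, (w, b) in models.items()})
```

The returned weight vectors play the role of the {f1: w1, f2: w2, ...} representation above, and could be persisted to disk with, for example, numpy.save or pickle.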
Of course, in addition to the logistic regression model described above, other types of classification model such as decision tree models and neural network models can also be used. For a decision tree model, the decision process can be carried out over multiple tree models based on the word features: using the split thresholds stored for each tree and the features of the feature word vector, the leaf node of each tree to which the logistics object belongs is determined, and from this the probability that the logistics object is classified into the category corresponding to each potential code such as an HScode is obtained. For a neural network model, the specific code classification model can have multiple layers of non-linear transformation units, with each layer connected in series to the non-linear transformation units of the next layer; each layer stores feature weights based on the feature word vector or on feature vectors derived from it, and the probability that the logistics object is classified into the category corresponding to each code is obtained through the interaction of the multiple layers of non-linear transformation units. More specific details are not described here.
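Purely as an illustration of the alternative model families mentioned above, and again assuming scikit-learn is acceptable, the sketch below swaps in a multiple-tree ensemble or a small multi-layer network; the estimators, hyperparameters and random data are assumptions, not choices prescribed by the application.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier   # multiple-tree model with split thresholds per tree
from sklearn.neural_network import MLPClassifier          # stacked non-linear transformation units

def train_alternative_model(X, y, kind="trees"):
    """Return a fitted tree-ensemble or neural-network classifier over feature word vectors."""
    if kind == "trees":
        model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
    else:
        model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    return model.fit(X, y)

# Random demonstration data only; predict_proba yields per-code probabilities.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(20, 6)).astype(float)
y_demo = rng.choice(["610910", "851712"], size=20)
print(train_alternative_model(X_demo, y_demo, kind="trees").predict_proba(X_demo[:2]))
```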
In addition, for codes other than the HScode, a code classification model can be obtained in a similar manner.
The above process of establishing the code classification model can be completed in advance; once it is complete, the specific model can be used to classify a target logistics object to be classified. Specifically, referring to FIG. 3, the following steps may be included:
S301: determining text description information of a target logistics object to be classified, and processing the text description information to determine the target feature words it contains;
Starting from this step, the process is mainly one of predicting the code of a specific target logistics object using the code classification model described above. Specifically, the text description information of the target logistics object to be classified is first determined; this text description information can be obtained from information such as the title of the logistics object. In a concrete implementation, if an interface-based tool is provided, as shown in FIG. 4, an entry for inputting the text description information of the target logistics object, for example an input box, can be provided in the interface. Alternatively, an entry for importing the text description information of multiple target logistics objects in batches can be provided; in this way, users can organise in advance, for example in an Excel spreadsheet, the text description information of the logistics objects that need to be classified, naming the data columns of the spreadsheet with predefined field names, and then import the text description information of each logistics object recorded in the spreadsheet into the tool through the batch operation entry, and so on. Whether the text description information of a target logistics object is entered singly or imported in batches, the target logistics object may be a logistics object awaiting customs declaration, for example text object information such as the title of a target logistics object extracted from a specific cross-border order.
S302: generating, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object;
After the text description information of the target data object has been obtained, it can be processed in the same way as the text description information in the training samples. For example, word segmentation can likewise be performed, invalid words filtered out, and the remaining valid words determined as the target feature words. Then, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object can likewise be generated. Specifically, the feature word vector corresponding to the target logistics object can be generated according to which feature words, by sequence number, its text description information contains. For example, if the text description information of the target logistics object contains feature words No. 1, 5, 8, 27 and so on, the elements at those sequence numbers can be set to 1, or to the preset initial weights, and the elements at the other sequence numbers to 0. Of course, in practical applications the text description information of the target logistics object to be predicted may contain words that were not collected during training; such words can be filtered out and need not be input into the classification model. After this prediction has been completed, however, it can be determined from information such as the named entity information of such a word whether the word is relevant to HScode classification; if it is, it can be added to the corresponding feature word set as a feature word, and the model can be retrained, and so on.
It should be noted that the dimension of the feature word vector generated here for the target logistics object is consistent with the number of feature words in the feature word set used during training. For example, if during training the feature words of all training samples were pooled into a feature word set containing N feature words, the feature word vector corresponding to the target logistics object to be predicted is also an N-dimensional vector. In addition, if the HScodes were grouped during training and the feature words of the training samples were aggregated per HScode group, the number of feature words within each group is reduced accordingly. In that case, before the feature word vector is generated for the target logistics object, the group to which the target logistics object belongs can first be determined; for example, if the HScodes were grouped according to the category system of an online sales system, the corresponding HScode group can be determined from the category to which the target logistics object belongs under that category system. The feature word set contained in that HScode group is then used to determine the feature word vector of the current target logistics object.
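A hedged sketch of steps S301 and S302 for a single target object follows; whitespace splitting and a small stop-word set stand in for the word segmentation and invalid-word filtering described above, and any word not seen during training is simply skipped.

```python
import numpy as np

def target_feature_vector(title, feature_index, stop_words=frozenset(), initial_weights=None):
    """Turn the title of a target logistics object into a feature word vector
    whose dimension equals the size of the (possibly per-group) feature word set."""
    words = [w for w in title.lower().split() if w not in stop_words]  # stand-in for segmentation + filtering
    vec = np.zeros(len(feature_index), dtype=np.float32)
    for w in words:
        idx = feature_index.get(w)
        if idx is not None:                    # unseen words are dropped, as described above
            vec[idx] = 1.0 if initial_weights is None else initial_weights.get(w, 1.0)
    return vec

feature_index = {"cotton": 0, "shirt": 1, "men": 2, "smartphone": 3}
print(target_feature_vector("Men Cotton Shirt New Arrival", feature_index, stop_words={"new", "arrival"}))
```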
S303: inputting the feature word vector into the code classification model to obtain the corresponding classification feature information.
After the feature word vector corresponding to the target logistics object has been determined, it can be input into the code classification model to obtain the specific classification feature information. For example, in a concrete implementation in which a logistic regression model is used for HScode classification, the feature word vector can be input into the code classification model to determine the probability that the target logistics object belongs to the category corresponding to each HScode, and classification suggestion information can further be provided according to the probabilities. Specifically, the feature word vector of the target logistics object can be multiplied by the feature word weight vector corresponding to each HScode (possibly adjusted by a bias value or the like), so as to obtain the probability value that the target logistics object belongs to the category corresponding to each HScode. If grouped training was performed, the group information corresponding to the target logistics object can be input into the classification model together with the feature word vector; in that case the feature word vector of the target logistics object only needs to be combined with the feature word weight vectors of the HScodes within that group, rather than computing probabilities for all HScodes, which saves computation.
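The scoring step for the logistic-regression variant can be illustrated as below; it assumes the per-HScode (weight vector, bias) pairs produced during training and an optional group restriction, with the logistic function turning the weighted sum into a probability.

```python
import math
import numpy as np

def score_against_hscodes(vec, models, group_codes=None):
    """Probability that a logistics object belongs to the category of each HScode.

    models      : dict hscode -> (feature word weight vector, bias)
    group_codes : optional iterable restricting scoring to the HScodes of one group
    """
    codes = group_codes if group_codes is not None else models.keys()
    probs = {}
    for code in codes:
        w, b = models[code]
        z = float(vec @ w) + b                    # weighted sum of matched feature words plus bias
        probs[code] = 1.0 / (1.0 + math.exp(-z))  # logistic function
    return probs

# Illustrative weights for two HScodes and a vector containing three feature words.
models = {"610910": (np.array([0.9, 0.8, 0.4, -0.7]), -0.2),
          "851712": (np.array([-0.6, -0.5, -0.3, 1.2]), -0.1)}
print(score_against_hscodes(np.array([1.0, 1.0, 1.0, 0.0]), models))
```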
After the probability that the target logistics object belongs to the category corresponding to each HScode has been calculated, the corresponding classification suggestion information can also be returned. For example, the one or several HScodes whose probability is higher than a preset threshold can be returned, so that the user can determine the specific HScode for the target logistics object based on this suggestion.
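One possible way to turn those probabilities into the suggestion information mentioned above, with the threshold and the cut-off on the number of returned codes as illustrative parameters:

```python
def suggest_hscodes(probs, threshold=0.5, top_k=3):
    """Return up to top_k HScodes whose probability exceeds the threshold,
    highest first, as classification suggestions for the user."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(code, p) for code, p in ranked if p >= threshold][:top_k]

print(suggest_hscodes({"610910": 0.87, "620342": 0.41, "851712": 0.08}))
```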
In short, according to the embodiments of the present application, the code classification model can be determined in advance; then, for a target logistics object to be classified, its text description information can be obtained and processed to determine the target feature words it contains, a feature word vector corresponding to the target logistics object can be generated according to which target feature words the text description information contains, and the feature word vector can then be input into the code classification model to obtain the corresponding classification feature information. In this way, logistics objects can be classified automatically without relying on manual classification, so both efficiency and accuracy can be improved.
In an optional embodiment, through the collection and processing of training samples and machine learning training, a classification model for each HScode can be obtained; it can be represented by a feature word weight vector in which the discrimination weight value of each feature word for the associated HScode is recorded. When a prediction is made for a target data object, its text description information can be segmented and otherwise processed to determine the feature words it contains and to generate a feature word vector. This feature word vector can then be input into the previously trained classification models, so that the probability of the target logistics object being classified as each HScode can be calculated and suggestion information, for example one or several suggested HScodes, can be given accordingly. In this way the classification of target data objects no longer depends entirely on experts, which reduces labour costs; the efficiency of classification is also improved and is no longer limited by the experience and personal ability of experts.
Embodiment 2
This Embodiment 2 provides a method for generating a code classification model. Referring to FIG. 5, the method may specifically include:
S501: collecting training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
The code may specifically be the customs code HScode described above, or the like.
S502: performing word segmentation on the text description information in the training samples, and filtering out invalid words to obtain feature words;
S503: aggregating and de-duplicating the feature words obtained from the individual training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
S504: generating, according to which feature words at each sequence number the individual training samples contain, the feature word vector corresponding to each training sample;
S505: respectively inputting the feature word vectors corresponding to the multiple training samples associated with the same code into a preset machine learning model for training, to obtain the classification model corresponding to each code.
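Steps S501 to S505 can be tied together in a compact end-to-end sketch such as the one below; the tokenizer, the stop-word list, the sample data and the use of scikit-learn's logistic regression as the preset machine learning model are all assumptions made for illustration.

```python
import numpy as np
from collections import OrderedDict
from sklearn.linear_model import LogisticRegression

STOP_WORDS = {"new", "hot", "sale", "free", "shipping"}   # stand-in for the invalid-word filter

def tokenize(text):
    """S502: crude stand-in for word segmentation plus invalid-word filtering."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_code_classification_models(samples):
    """samples: list of (text_description, code) pairs; returns (feature_index, models)."""
    # S503: aggregate, de-duplicate and number the feature words.
    feature_index, tokenized = OrderedDict(), []
    for text, code in samples:
        words = tokenize(text)
        tokenized.append((words, code))
        for w in words:
            feature_index.setdefault(w, len(feature_index))
    # S504: one feature word vector per training sample.
    X = np.zeros((len(samples), len(feature_index)), dtype=np.float32)
    codes = []
    for i, (words, code) in enumerate(tokenized):
        for w in words:
            X[i, feature_index[w]] = 1.0
        codes.append(code)
    # S505: one model per code, keeping its feature word weight vector and bias.
    codes, models = np.asarray(codes), {}
    for code in np.unique(codes):
        clf = LogisticRegression(max_iter=1000).fit(X, (codes == code).astype(int))
        models[code] = (clf.coef_.ravel(), float(clf.intercept_[0]))
    return feature_index, models

samples = [("men cotton shirt", "610910"), ("cotton shirt short sleeve", "610910"),
           ("smartphone 64gb dual sim", "851712"), ("android smartphone dual sim", "851712")]
feature_index, models = build_code_classification_models(samples)
print(sorted(models))
```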
For the parts of this Embodiment 2 that are not described in detail, reference may be made to the description of Embodiment 1 above, which is not repeated here.
Corresponding to Embodiment 1, an embodiment of the present application further provides a logistics object information processing apparatus. Referring to FIG. 6, the apparatus may specifically include:
a target logistics object information determining unit 601, configured to determine text description information of a target logistics object to be classified and to process the text description information to determine the target feature words it contains;
a feature vector generating unit 602, configured to generate, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object;
a classification feature information obtaining unit 603, configured to input the feature word vector into the code classification model to obtain the corresponding classification feature information.
The code classification model includes a logistic regression model, a decision tree model, or a neural network model.
If the code classification model is a logistic regression model, the code classification model stores a feature word weight vector corresponding to each code.
Specifically, the code includes the customs code HScode, and the code classification model stores a feature word weight vector corresponding to each customs code HScode; the feature word weight vector records the discrimination weight value of each feature word for the associated HScode.
If the code classification model is a decision tree model, the code classification model stores multiple tree models and, for each tree, stores split thresholds and features of the feature word vector, so as to determine the probability that the target logistics object is classified into the category corresponding to each potential code.
If the code classification model is a neural network model, the code classification model has multiple layers of non-linear transformation units, with each layer connected in series to the non-linear transformation units of the next layer; each layer of non-linear transformation units stores feature weights based on the feature word vector or on feature vectors derived from it, so that the probability that the logistics object is classified into the category corresponding to each potential code is obtained through the interaction of the multiple layers of non-linear transformation units.
In a concrete implementation, the classification feature information obtaining unit may be specifically configured to input the feature word vector into the code classification model and determine the probability that the target logistics object is classified into the category corresponding to each potential code; it may further be configured to provide classification suggestion information according to the probabilities.
The code classification model is established by means of the following units:
a sample collection unit, configured to collect training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and an HScode;
a feature word determining unit, configured to perform word segmentation on the text description information in the training samples and to filter out invalid words to obtain feature words;
a feature word aggregation unit, configured to aggregate and de-duplicate the feature words obtained from the individual training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
a feature word vector generating unit, configured to generate, according to which feature words at each sequence number the individual training samples contain, the feature word vector corresponding to each training sample;
a training unit, configured to respectively input the feature word vectors corresponding to the multiple training samples associated with the same HScode into a preset machine learning model for training, to obtain the classification model corresponding to each HScode.
In a concrete implementation, the feature vector generating unit may be specifically configured to generate the feature word vector corresponding to the target logistics object according to which feature words, by sequence number, the text description information of the target logistics object contains.
When model training is performed, the apparatus may further include:
a data cleaning unit, configured to perform data cleaning on the training samples after they are collected, so that the remaining valid training samples are used to train the classification model.
Specifically, the data cleaning unit may be configured to:
store in advance mapping relationship information between old and new HScodes, and, for a training sample in which an old HScode appears, replace the old HScode with the new HScode according to the mapping relationship before adding the sample to the training sample set as a valid training sample.
Alternatively, the data cleaning unit may also be configured to:
store in advance a list of deactivated HScodes, and delete the training samples in which a deactivated HScode appears.
As a further alternative, the data cleaning unit may also be configured to:
store in advance a list of split HScode information, wherein each item of split HScode information includes the HScode before splitting and the corresponding multiple HScodes after splitting; and
extract the training samples in which a pre-split HScode appears, so that after the post-split HScode has been re-determined, the samples are added to the training sample set as valid training samples.
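The three cleaning strategies handled by the data cleaning unit can be sketched in a single helper; the parameter names and the decision to set aside samples with a pre-split HScode for later re-determination are illustrative assumptions.

```python
def clean_training_samples(samples, old_to_new=None, deactivated=frozenset(), split_codes=None):
    """Apply the three cleaning strategies to (text, hscode) training samples.

    old_to_new  : dict mapping an old HScode to the new HScode that replaces it
    deactivated : set of HScodes that are no longer in use
    split_codes : dict mapping a pre-split HScode to the list of post-split HScodes
    Returns (valid_samples, needs_review); needs_review holds samples whose post-split
    HScode still has to be re-determined before they rejoin the training set.
    """
    old_to_new, split_codes = old_to_new or {}, split_codes or {}
    valid, needs_review = [], []
    for text, code in samples:
        code = old_to_new.get(code, code)      # 1) replace old HScodes with new ones
        if code in deactivated:                # 2) drop samples with deactivated HScodes
            continue
        if code in split_codes:                # 3) set aside samples with split HScodes
            needs_review.append((text, code, split_codes[code]))
            continue
        valid.append((text, code))
    return valid, needs_review

valid, review = clean_training_samples(
    [("old style shirt", "610900"), ("retired item", "999999"), ("split item", "123456")],
    old_to_new={"610900": "610910"},
    deactivated={"999999"},
    split_codes={"123456": ["123461", "123462"]})
print(valid, review)
```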
In a concrete implementation, the apparatus may further include:
a vocabulary filtering unit, configured to perform named entity recognition on the words obtained from the word segmentation result of the text description information, and to filter out, according to the named entity recognition result, the words that are irrelevant to the classification of the logistics object.
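A hedged sketch of the named-entity-based filtering, assuming spaCy and its small English model are installed; the choice of which entity types count as irrelevant to HScode classification is an assumption made purely for illustration.

```python
import spacy

# Assumes the model has been installed beforehand:  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Assumption: person, place and organisation names rarely help decide a commodity category.
IRRELEVANT_ENTITY_TYPES = {"PERSON", "GPE", "ORG"}

def filter_by_named_entities(text):
    """Keep only tokens whose named-entity type is not in the irrelevant set."""
    doc = nlp(text)
    return [t.text for t in doc if t.ent_type_ not in IRRELEVANT_ENTITY_TYPES and not t.is_punct]

print(filter_by_named_entities("Nike running shoes shipped from Shanghai"))
```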
In addition, specifically when model training is performed, the apparatus may further include:
a grouping unit, configured to group the HScodes according to category information of one of the levels under the category system of a related online sales system before the feature words obtained from the individual training samples are aggregated and de-duplicated, to obtain multiple groups, each of which contains multiple HScodes, so that the aggregation and de-duplication of the feature words, the generation of the feature vectors and the model training are performed per HScode group.
Specifically, the classification model may further store the correspondence between each group and the HScodes;
when prediction is performed, the apparatus may further include:
a group determining unit, configured to determine the corresponding HScode group for the target logistics object according to the category to which it belongs under the category system of the online sales system;
the prediction unit may be specifically configured to:
input the HScode group corresponding to the target logistics object and the feature word vector into the classification model, so as to determine the probability that the target logistics object belongs to the category corresponding to each HScode in that group.
Corresponding to Embodiment 2, an embodiment of the present application further provides an apparatus for generating a customs code classification model. Referring to FIG. 7, the apparatus may specifically include:
a sample collection unit 701, configured to collect training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
a feature word determining unit 702, configured to perform word segmentation on the text description information in the training samples and to filter out invalid words to obtain feature words;
a feature word aggregation unit 703, configured to aggregate and de-duplicate the feature words obtained from the individual training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
a feature word vector generating unit 704, configured to generate, according to which feature words at each sequence number the individual training samples contain, the feature word vector corresponding to each training sample;
a training unit 705, configured to respectively input the feature word vectors corresponding to the multiple training samples associated with the same code into a preset machine learning model for training, to obtain the code classification model corresponding to each code; the code classification model stores the feature word weight vector corresponding to each code, and the feature word weight vector records the discrimination weight value of each feature word for the associated code.
In addition, corresponding to Embodiment 1 of the present application, an embodiment of the present application further provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions which, when read and executed by the one or more processors, perform the following operations:
determining text description information of a target logistics object to be classified and processing the text description information to determine the target feature words it contains;
generating, according to which target feature words the text description information contains, the feature word vector corresponding to the target logistics object;
inputting the feature word vector into a code classification model to obtain the corresponding classification feature information.
FIG. 8 exemplarily shows the architecture of the computer system, which may specifically include a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814 and a memory 820. The processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813 and the network interface 814 may be communicatively connected with the memory 820 through a communication bus 830.
The processor 810 may be implemented as a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the present application.
The memory 820 may be implemented in the form of a ROM (read-only memory), a RAM (random access memory), a static storage device, a dynamic storage device or the like. The memory 820 may store an operating system 821 for controlling the operation of the computer system 800 and a basic input/output system (BIOS) 822 for controlling low-level operations of the computer system 800. In addition, a web browser 823, a data storage management system 824, a classification processing system 825 and the like may also be stored. The classification processing system 825 may be the application program that specifically implements the operations of the foregoing steps in the embodiments of the present application. In short, when the technical solutions provided in the present application are implemented by software or firmware, the relevant program code is stored in the memory 820 and is called and executed by the processor 810.
The input/output interface 813 is used to connect input/output modules to implement information input and output. The input/output modules may be configured in the device as components (not shown in the figure) or may be externally connected to the device to provide corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors and the like, and the output devices may include a display, a speaker, a vibrator, indicator lights and the like.
The network interface 814 is used to connect a communication module (not shown in the figure) to implement communication and interaction between this device and other devices. The communication module may communicate in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, WiFi or Bluetooth).
The bus 830 includes a path for transmitting information between the various components of the device (for example the processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814 and the memory 820).
In addition, the computer system 800 may also obtain information about specific receiving conditions from a virtual resource object receiving condition information database 841 for use in condition judgment, and so on.
It should be noted that although the above device only shows the processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, the memory 820, the bus 830 and so on, in a specific implementation process the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solution of the present application, without necessarily including all the components shown in the figure.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary general-purpose hardware platform. Based on such an understanding, the part of the technical solution of the present application that is essential, or that contributes to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the individual embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made between them, and each embodiment focuses on its differences from the other embodiments. In particular, the system and system embodiments are described relatively simply because they are basically similar to the method embodiments; for the relevant parts, reference may be made to the description of the method embodiments. The system and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The logistics object information processing method, apparatus and computer system provided in the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (21)

  1. A logistics object information processing method, characterized by comprising:
    determining text description information of a target logistics object to be classified, and processing the text description information to determine the target feature words it contains;
    generating, according to which target feature words the text description information contains, a feature word vector corresponding to the target logistics object; and
    inputting the feature word vector into a code classification model to obtain corresponding classification feature information.
  2. The method according to claim 1, characterized in that the code classification model comprises a logistic regression model, a decision tree model, or a neural network model.
  3. The method according to claim 2, characterized in that, if the code classification model is a logistic regression model, the code classification model stores a feature word weight vector corresponding to each code.
  4. The method according to claim 3, characterized in that the code comprises a customs code HScode, and the code classification model stores a feature word weight vector corresponding to each customs code HScode; the feature word weight vector records a discrimination weight value of each feature word for the associated HScode.
  5. The method according to claim 2, characterized in that, if the code classification model is a decision tree model, the code classification model stores multiple tree models and, for each tree, stores split thresholds and features of the feature word vector, so as to determine a probability that the target logistics object is classified into a category corresponding to each potential code.
  6. The method according to claim 2, characterized in that, if the code classification model is a neural network model, the code classification model has multiple layers of non-linear transformation units, each layer of non-linear transformation units being connected in series with the non-linear transformation units of the next layer, and each layer of non-linear transformation units storing feature weights based on the feature word vector or on a feature vector derived from the feature word vector, so that a probability that the logistics object is classified into a category corresponding to each potential code is obtained through the interaction of the multiple layers of non-linear transformation units.
  7. The method according to claim 1, characterized in that
    the inputting the feature word vector into a code classification model to obtain corresponding classification feature information comprises:
    inputting the feature word vector into the code classification model, and determining a probability that the target logistics object is classified into a category corresponding to each potential code.
  8. The method according to claim 7, characterized by further comprising:
    providing classification suggestion information according to the probability.
  9. The method according to claim 3 or 7, characterized in that
    the code classification model is established in the following manner:
    collecting training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a customs code HScode;
    performing word segmentation on the text description information in the training samples, and filtering out invalid words to obtain feature words;
    aggregating and de-duplicating the feature words obtained from the individual training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
    generating, according to which feature words at each sequence number the individual training samples contain, a feature word vector corresponding to each training sample; and
    respectively inputting the feature word vectors corresponding to multiple training samples associated with the same HScode into a preset machine learning model for training, to obtain a code classification model corresponding to each HScode.
  10. The method according to claim 9, characterized in that
    the generating the feature word vector corresponding to the target logistics object comprises:
    generating the feature word vector corresponding to the target logistics object according to which feature words, by sequence number, the text description information of the target logistics object contains.
  11. The method according to claim 9, characterized in that,
    after the collecting training samples, the method further comprises:
    performing data cleaning on the training samples, so that the remaining valid training samples are used to train the classification model.
  12. The method according to claim 11, characterized in that
    the performing data cleaning on the training samples comprises:
    storing in advance mapping relationship information between old and new HScodes; and
    for a training sample in which an old HScode appears, replacing the old HScode with the new HScode according to the mapping relationship, and then adding the sample to the training sample set as a valid training sample.
  13. The method according to claim 11, characterized in that
    the performing data cleaning on the training samples comprises:
    storing in advance a list of deactivated HScodes; and
    deleting the training samples in which a deactivated HScode appears.
  14. The method according to claim 11, characterized in that
    the performing data cleaning on the training samples comprises:
    storing in advance a list of split HScode information, wherein each item of split HScode information includes an HScode before splitting and corresponding multiple HScodes after splitting; and
    extracting the training samples in which the pre-split HScode appears, so that after the post-split HScode has been re-determined, the samples are added to the training sample set as valid training samples.
  15. The method according to claim 9, characterized in that
    the filtering out invalid words comprises:
    performing named entity recognition on the words obtained from the word segmentation result of the text description information, and filtering out, according to the named entity recognition result, words that are irrelevant to the classification of the logistics object.
  16. The method according to claim 9, characterized in that,
    before the aggregating and de-duplicating the feature words obtained from the individual training samples, the method further comprises:
    grouping the HScodes according to category information of one of the levels under a category system of a related online sales system, to obtain multiple groups, each group containing multiple HScodes, so that the aggregation and de-duplication of the feature words, the generation of the feature vectors and the model training are performed per HScode group.
  17. The method according to claim 16, characterized in that
    the code classification model further stores a correspondence between each group and the HScodes;
    the method further comprises:
    determining, according to the category to which the target logistics object belongs under the category system of the online sales system, the corresponding HScode group for the target logistics object; and
    the inputting the feature word vector into the code classification model and determining the probability that the target commodity object belongs to the category corresponding to each HScode comprises:
    inputting the HScode group corresponding to the target commodity object and the feature word vector into the code classification model, so as to determine the probability that the target logistics object belongs to the category corresponding to each HScode in the group.
  18. A method for generating a code classification model, characterized by comprising:
    collecting training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
    performing word segmentation on the text description information in the training samples, and filtering out invalid words to obtain feature words;
    aggregating and de-duplicating the feature words obtained from the individual training samples to obtain a feature word set, and assigning a corresponding sequence number to each feature word;
    generating, according to which feature words at each sequence number the individual training samples contain, a feature word vector corresponding to each training sample; and
    respectively inputting the feature word vectors corresponding to multiple training samples associated with the same code into a preset machine learning model for training, to obtain a code classification model corresponding to each code, wherein the code classification model stores a feature word weight vector corresponding to each code, and the feature word weight vector records a discrimination weight value of each feature word for the associated code.
  19. A logistics object information processing apparatus, characterized by comprising:
    a target logistics object information determining unit, configured to determine text description information of a target logistics object to be classified and to process the text description information to determine the target feature words it contains;
    a feature vector generating unit, configured to generate, according to which target feature words the text description information contains, a feature word vector corresponding to the target logistics object; and
    a classification feature information obtaining unit, configured to input the feature word vector into a code classification model to obtain corresponding classification feature information.
  20. An apparatus for generating a code classification model, characterized by comprising:
    a sample collection unit, configured to collect training samples, wherein each training sample includes a correspondence between known text description information of a logistics object and a code;
    a feature word determining unit, configured to perform word segmentation on the text description information in the training samples and to filter out invalid words to obtain feature words;
    a feature word aggregation unit, configured to aggregate and de-duplicate the feature words obtained from the individual training samples to obtain a feature word set, and to assign a corresponding sequence number to each feature word;
    a feature word vector generating unit, configured to generate, according to which feature words at each sequence number the individual training samples contain, a feature word vector corresponding to each training sample; and
    a training unit, configured to respectively input the feature word vectors corresponding to multiple training samples associated with the same code into a preset machine learning model for training, to obtain a code classification model corresponding to each code, wherein the code classification model stores a feature word weight vector corresponding to each code, and the feature word weight vector records a discrimination weight value of each feature word for the associated code.
  21. A computer system, characterized by comprising:
    one or more processors; and
    a memory associated with the one or more processors, the memory being configured to store program instructions which, when read and executed by the one or more processors, perform the following operations:
    determining text description information of a target logistics object to be classified, and processing the text description information to determine the target feature words it contains;
    generating, according to which target feature words the text description information contains, a feature word vector corresponding to the target logistics object; and
    inputting the feature word vector into a code classification model to obtain corresponding classification feature information.
PCT/CN2019/099552 2018-08-17 2019-08-07 Logistics object information processing method, device and computer system WO2020034880A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810943287.XA CN110858219A (en) 2018-08-17 2018-08-17 Logistics object information processing method and device and computer system
CN201810943287.X 2018-08-17

Publications (1)

Publication Number Publication Date
WO2020034880A1 true WO2020034880A1 (en) 2020-02-20

Family

ID=69524695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/099552 WO2020034880A1 (en) 2018-08-17 2019-08-07 Logistics object information processing method, device and computer system

Country Status (3)

Country Link
CN (1) CN110858219A (en)
TW (1) TW202009748A (en)
WO (1) WO2020034880A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022081127A1 (en) * 2020-10-12 2022-04-21 Hewlett-Packard Development Company, L.P. Document language prediction

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613585A (en) * 2021-01-07 2021-04-06 绿湾网络科技有限公司 Method and device for determining article category
CN113343640B (en) * 2021-05-26 2024-02-20 南京大学 Method and device for classifying customs commodity HS codes
CN116166805B (en) * 2023-02-24 2023-09-22 北京青萌数海科技有限公司 Commodity coding prediction method and device
CN116776831B (en) * 2023-03-27 2023-12-22 阿里巴巴(中国)有限公司 Customs, customs code determination, decision tree construction method and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294355A (en) * 2015-05-14 2017-01-04 阿里巴巴集团控股有限公司 A kind of determination method and apparatus of business object attribute
US20170278015A1 (en) * 2016-03-24 2017-09-28 Accenture Global Solutions Limited Self-learning log classification system
CN108182279A (en) * 2018-01-26 2018-06-19 有米科技股份有限公司 Object classification method, device and computer equipment based on text feature
CN108228622A (en) * 2016-12-15 2018-06-29 平安科技(深圳)有限公司 The sorting technique and device of traffic issues

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120052636A (en) * 2010-11-16 2012-05-24 한국전자통신연구원 A hscode recommendation service system and method using ontology
US8498986B1 (en) * 2012-01-31 2013-07-30 Business Objects Software Ltd. Classifying data using machine learning
WO2014036282A2 (en) * 2012-08-31 2014-03-06 The Dun & Bradstreet Corporation System and process of associating import and/or export data with a corporate identifier relating to buying and supplying goods
CN105354194A (en) * 2014-08-19 2016-02-24 上海中怡通信息科技有限公司 Intelligent commodity classifying method and system
WO2016057000A1 (en) * 2014-10-08 2016-04-14 Crimsonlogic Pte Ltd Customs tariff code classification
CN105117426B (en) * 2015-07-31 2019-04-16 重庆龙工场跨境电子商务投资有限公司 A kind of intellectual coded searching method of customs
CN106294684A (en) * 2016-08-06 2017-01-04 上海高欣计算机系统有限公司 The file classification method of term vector and terminal unit
CN106503236B (en) * 2016-10-28 2020-09-11 北京百度网讯科技有限公司 Artificial intelligence based problem classification method and device
CN108334522B (en) * 2017-01-20 2021-12-14 阿里巴巴集团控股有限公司 Method for determining customs code, and method and system for determining type information
CN106897428B (en) * 2017-02-27 2022-08-09 腾讯科技(深圳)有限公司 Text classification feature extraction method and text classification method and device
EP3446241A4 (en) * 2017-06-20 2019-11-06 Accenture Global Solutions Limited Automatic extraction of a training corpus for a data classifier based on machine learning algorithms
CN107301248B (en) * 2017-07-19 2020-07-21 百度在线网络技术(北京)有限公司 Word vector construction method and device of text, computer equipment and storage medium
CN107704892B (en) * 2017-11-07 2019-05-17 宁波爱信诺航天信息有限公司 A kind of commodity code classification method and system based on Bayesian model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294355A (en) * 2015-05-14 2017-01-04 阿里巴巴集团控股有限公司 A kind of determination method and apparatus of business object attribute
US20170278015A1 (en) * 2016-03-24 2017-09-28 Accenture Global Solutions Limited Self-learning log classification system
CN108228622A (en) * 2016-12-15 2018-06-29 平安科技(深圳)有限公司 The sorting technique and device of traffic issues
CN108182279A (en) * 2018-01-26 2018-06-19 有米科技股份有限公司 Object classification method, device and computer equipment based on text feature

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022081127A1 (en) * 2020-10-12 2022-04-21 Hewlett-Packard Development Company, L.P. Document language prediction

Also Published As

Publication number Publication date
TW202009748A (en) 2020-03-01
CN110858219A (en) 2020-03-03

Similar Documents

Publication Publication Date Title
WO2020034880A1 (en) Logistics object information processing method, device and computer system
US11682093B2 (en) Document term recognition and analytics
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
CN111639516B (en) Analysis platform based on machine learning
EP3121738A1 (en) Data storage extract, transform and load operations for entity and time-based record generation
CN113256367B (en) Commodity recommendation method, system, equipment and medium for user behavior history data
US11580119B2 (en) System and method for automatic persona generation using small text components
CN110827112B (en) Deep learning commodity recommendation method and device, computer equipment and storage medium
US11367117B1 (en) Artificial intelligence system for generating network-accessible recommendations with explanatory metadata
CN112884551A (en) Commodity recommendation method based on neighbor users and comment information
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN111966886A (en) Object recommendation method, object recommendation device, electronic equipment and storage medium
CN113722611A (en) Method, device and equipment for recommending government affair service and computer readable storage medium
CN111695024A (en) Object evaluation value prediction method and system, and recommendation method and system
CN112990973A (en) Online shop portrait construction method and system
JP6242540B1 (en) Data conversion system and data conversion method
CN110109902A (en) A kind of electric business platform recommender system based on integrated learning approach
US10795956B1 (en) System and method for identifying potential clients from aggregate sources
CN114997916A (en) Prediction method, system, electronic device and storage medium of potential user
CN111429161A (en) Feature extraction method, feature extraction device, storage medium, and electronic apparatus
CN113570437A (en) Product recommendation method and device
EP3489838A1 (en) Method and apparatus for determining an association
CN112328899B (en) Information processing method, information processing apparatus, storage medium, and electronic device
CN114065063A (en) Information processing method, information processing apparatus, storage medium, and electronic device
CN114942974A (en) E-commerce platform commodity user evaluation emotional tendency classification method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19850044

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19850044

Country of ref document: EP

Kind code of ref document: A1