CN111985204A

CN111985204A - Customs import and export commodity tax number prediction method

Info

Publication number: CN111985204A
Application number: CN202010744808.6A
Authority: CN
Inventors: 车超; 周成杰; 张强
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2020-11-24
Anticipated expiration: 2040-07-29
Also published as: CN111985204B

Abstract

The invention discloses a method for predicting the tax number of customs import and export commodities, comprising the following steps. Step 1: preprocess the customs import and export commodity text to obtain element names and element contents. Step 2: split the element contents obtained in step 1, then select difference elements using an auxiliary network. Step 3: feed the difference elements obtained in step 2 into a CNN network for feature extraction, while extracting element name features and element content features using a DPCNN network. Step 4: fuse the difference element features, element name features, and element content features obtained in step 3, then perform classification to obtain the commodity tax number. By exploiting customs-specific corpus resources, the method predicts the tax number of customs import and export commodity text even when differences in declared element length dilute the features of short elements, and improves the accuracy of tax number prediction.

Description

Customs import and export commodity tax number prediction method

Technical Field

The invention relates to the technical field of natural language processing, and in particular to a method for predicting the tax number of customs import and export commodities based on a hybrid convolutional neural network and an auxiliary network.

Background

Customs duties are a major source of tax revenue in many countries. At present, China Customs relies mainly on manual review of the tax rates of import and export commodities, which can cover only a small fraction of the massive volume of such commodities. Customs taxation is based mainly on the text information of a commodity; by classifying the commodity text with natural language processing techniques and determining the tax according to the category, tax risk prevention and control can be automated. Tax prediction for commodities can thus be cast as a Chinese text classification problem.

Chinese text classification is the process of automatically classifying and labeling a set of texts (or other entities or objects) according to a given classification system or standard. From a labeled set of training documents it learns a model of the relationship between document features and document categories, and then uses the learned model to determine the category of a new document. Text classification is gradually shifting from knowledge-based methods to statistical and machine-learning-based methods. Many classification models achieve good results on Chinese text classification tasks. Compared with ordinary Chinese text, however, a single customs text is a linear sequence of several elements and has no continuous contextual semantics. To date there has been no attempt to apply artificial intelligence to the classification of customs import and export declaration texts, but the task can be abstracted as a traditional text classification problem. The TextCNN convolutional model proposed by Yoon Kim extracts text features well and classifies text using combinations of those features; the BERT model proposed by Google improves the accuracy of text classification tasks by using large-scale pre-training corpora and a very large number of model parameters. However, because of the domain specificity and particularity of customs text, these general-purpose models perform poorly on the customs commodity classification task.

Disclosure of Invention

The application aims to provide a method for predicting the tax number of customs import and export commodities. By exploiting customs-specific corpus resources, it predicts the tax number of customs import and export commodity text even when differences in declared element length dilute the features of short elements, and improves the accuracy of tax number prediction.

In order to achieve the purpose, the technical scheme of the application is as follows: a customs import and export commodity tax number prediction method specifically comprises the following steps:

step 1: preprocessing the customs import and export commodity text to obtain element names and element contents;

step 2: splitting the element contents obtained in step 1, and then selecting difference elements using an auxiliary network;

step 3: feeding the difference elements obtained in step 2 into a CNN network for feature extraction, while extracting element name features and element content features using a DPCNN network;

step 4: fusing the difference element features, the element name features, and the element content features obtained in step 3, and then performing classification to obtain the commodity tax number.

Further, the specific implementation manner of step 2 is as follows:

step 21, grouping the obtained element contents so that data belonging to the same commodity category form one paragraph;

step 22, counting the commodity subclasses in each paragraph, feeding each paragraph into the auxiliary network according to its number of subclasses, and training the network to classify the commodity subclasses; while training on each paragraph, replacing the content of each element in turn with its element name to obtain a loss value for each element;

step 23, ranking the elements of each paragraph by loss value in descending order and selecting the top 2 as the difference elements.
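The grouping in steps 21 and 22 can be pictured with a short sketch. It assumes each record is a tuple of (major category, subclass, element contents); that layout, the function name `build_paragraphs`, the placeholder labels `cat_1`/`cat_2`, and the third sample record are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

def build_paragraphs(records):
    """Group records by major category (step 21) and count the
    distinct subclasses inside each group (step 22)."""
    paragraphs = defaultdict(list)
    for major, minor, contents in records:
        paragraphs[major].append((minor, contents))
    subclass_counts = {
        major: len({minor for minor, _ in rows})
        for major, rows in paragraphs.items()
    }
    return dict(paragraphs), subclass_counts

# Hypothetical records: (major category, subclass, element contents).
records = [
    ("cat_1", "sub_a", ["pneumatic actuator",
                        "converts pneumatic power into mechanical power"]),
    ("cat_1", "sub_b", ["ram air actuator",
                        "provides pneumatic linear force"]),
    ("cat_2", "sub_a", ["placeholder commodity", "placeholder principle"]),
]
paragraphs, subclass_counts = build_paragraphs(records)
```

Under this reading, the subclass count of each paragraph would set the number of output classes of the auxiliary network trained on it.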

Further, the specific implementation manner of step 3 is as follows:

step 31, feeding the difference elements into a CNN network, extracting features with the convolutional layer, and sparsifying the features with the max-pooling layer;

step 32, feeding the element names into a DPCNN network, extracting features with the convolutional layers, and using the downsampling layers to compress the sequence length and thereby enlarge the receptive field;

step 33, feeding the element contents into an SSCNN network and extracting shallow features with the structured convolutional layer.
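As an illustration of step 31, the following is a minimal pure-Python sketch of one convolution filter followed by max-over-time pooling, the usual pattern in a TextCNN-style network. The toy embeddings, the filter weights, and the ReLU activation are assumptions for illustration; the patent does not specify them.

```python
def conv1d(embeddings, kernel, bias=0.0):
    """Slide one filter of width len(kernel) over a sequence of token vectors."""
    width = len(kernel)
    outputs = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias
        for token_vec, kernel_row in zip(window, kernel):
            total += sum(x * w for x, w in zip(token_vec, kernel_row))
        outputs.append(max(0.0, total))  # ReLU activation (assumed)
    return outputs

def max_over_time(feature_map):
    """Keep only the strongest response of the filter across all positions."""
    return max(feature_map)

# Toy input: 4 tokens with 2-dimensional embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one filter spanning 2 tokens
feature_map = conv1d(tokens, kernel)
pooled = max_over_time(feature_map)
```

Max pooling keeps one value per filter regardless of input length, which is one way to read the "sparsifying" role the text assigns to the pooling layer.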

Further, the specific implementation manner of step 4 is as follows:

step 41, concatenating the difference element features, the element name features, and the element content features;

step 42, feeding the concatenated features into two fully connected layers, a fully connected layer being one in which each node is connected to all nodes of the previous layer and which serves to integrate the extracted features; the output dimension of the first layer is the number of commodity major classes, the output dimension of the second fully connected layer is the number of commodity subclasses, and the predicted major-class number and subclass number are concatenated to obtain the commodity tax number.
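Steps 41 and 42 can be sketched as follows. The sketch assumes that both fully connected layers read the same concatenated vector, one predicting the major class and one the subclass; the wording of step 42 would also admit a stacked arrangement, so this is one possible reading rather than the patent's confirmed design. All weights and the code strings "AAAA"/"BBBB" and "01" to "03" are hypothetical placeholders.

```python
def linear(vec, weights, bias):
    """One fully connected layer: every output node sees every input value."""
    return [sum(x * w for x, w in zip(vec, row)) + b
            for row, b in zip(weights, bias)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def predict_tax_number(diff_feat, name_feat, content_feat,
                       major_layer, minor_layer, major_codes, minor_codes):
    # Step 41: concatenate the three feature vectors.
    fused = diff_feat + name_feat + content_feat
    # Step 42: one layer sized to the major classes, one to the subclasses.
    major_logits = linear(fused, *major_layer)
    minor_logits = linear(fused, *minor_layer)
    # Join the predicted major-class and subclass numbers.
    return major_codes[argmax(major_logits)] + minor_codes[argmax(minor_logits)]

# Hypothetical toy weights as (weight matrix, bias vector) pairs.
major_layer = ([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0])
minor_layer = ([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
               [0.0, 0.0, 0.0])
tax_number = predict_tax_number([1.0], [0.0], [2.0], major_layer, minor_layer,
                                ["AAAA", "BBBB"], ["01", "02", "03"])
```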

By adopting the above technical scheme, the invention achieves the following technical effects: by fusing several convolutional networks, exploiting customs-specific corpus resources, and taking the characteristics of customs text into account, the method addresses the weak distinctiveness of commodities under the same catalogue and the dilution of short element features caused by differences in declared element length, strengthens the independence and importance of shorter content elements within the overall features, and improves the accuracy of customs import and export commodity tax number prediction.

Drawings

FIG. 1 is a flow chart of a method for predicting the commodity tax number of customs import and export.

Detailed Description

The invention is described in further detail below with reference to the drawing and a specific example, which is used to further illustrate the present application.

Example 1

In predicting the tax number of customs import and export commodities, the fact that each element has an inherent boundary should be fully exploited: the degree of semantic fusion between elements needs to be reduced, and sufficiently prominent features need to be extracted, in order to obtain the correct tax number of a commodity. Based on the characteristics of customs text and the problems in the customs import and export commodity tax number prediction task, and referring to FIG. 1, the application provides a customs import and export commodity tax number prediction method. First, the customs import and export commodity declaration texts are preprocessed, the text data is segmented into words, and the element name corresponding to each element content of the declaration text is found by looking up a declaration element catalog. Next, an auxiliary convolutional network is used to find the decisive difference elements of commodities under the same major class, and a hybrid convolutional neural network is used to predict the tax number of the commodity text. The hybrid convolutional neural network applies three kinds of convolution to different parts of the commodity text: an ordinary Convolutional Neural Network (CNN) extracts features of the difference elements, Shallow Structured Convolution (SSCNN) extracts features of the element contents, and Deep Pyramid Convolution (DPCNN) extracts features of the element names. The three kinds of features are concatenated and classified by a fully connected network to obtain the commodity tax number.

The method addresses two problems in customs import and export commodity tax number prediction, namely that commodities in the same category are hard to distinguish and that short element features are diluted when declared element lengths differ too much, and its accuracy is markedly higher than that of other current mainstream deep learning methods.

The present invention is described in detail below with reference to examples and the accompanying drawings so that those skilled in the art can implement the invention by referring to the description.

In this embodiment, PyCharm is used as the development platform and Python as the development language. Real customs data, a corpus of 1,400,000 sentences, is processed. The specific process is as follows:

Step 1: preprocess the customs import and export commodity text to obtain the element names and element contents.
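A minimal sketch of this preprocessing step is shown below. It assumes that element contents in a declaration are separated by "|" (as in the sample strings quoted later in this example) and that the declaration element catalog supplies the element names in the same order; the function name `preprocess` and the catalog entry are hypothetical. Chinese word segmentation, which the description also mentions, is omitted here.

```python
def preprocess(declaration, element_names):
    """Split a declaration text into element contents and pair each
    content with the element name found in the catalog entry."""
    contents = [part.strip() for part in declaration.split("|")]
    if len(contents) != len(element_names):
        raise ValueError("element count does not match the catalog entry")
    return list(zip(element_names, contents))

# Hypothetical catalog entry listing element names in declared order.
catalog_entry = ["commodity name", "principle"]
pairs = preprocess(
    "pneumatic actuator | converts pneumatic power into mechanical power",
    catalog_entry,
)
```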

Step 2: split the element contents obtained in step 1, and then select difference elements using an auxiliary network, specifically as follows:

Step 21: group the element contents obtained in step 1 so that data belonging to the same commodity category form one paragraph; for example, the data:

Step 22: count the commodity subclasses in each paragraph, feed each paragraph into the auxiliary network according to its number of subclasses, and train the network to classify the commodity subclasses. While training on each paragraph, replace the content of each element in turn with its element name to obtain a loss value for each element;

Step 23: rank the elements of each paragraph by loss value in descending order and select the top 2 as the difference elements.

For the two data records, the difference elements computed by the auxiliary network are the commodity name and the principle.
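The masking and ranking in steps 22 and 23 can be sketched as below. The auxiliary network is replaced by a stand-in `toy_loss` function, and the "brand" element with its content "no brand" is invented for illustration; only the ranking logic (replace one element's content with its name, record the loss, keep the 2 largest) follows the text.

```python
def select_difference_elements(element_names, element_contents, loss_fn, top_k=2):
    """Replace each element's content with its name in turn, record the
    resulting loss, and keep the top_k elements with the largest loss."""
    losses = {}
    for i, name in enumerate(element_names):
        masked = list(element_contents)
        masked[i] = name  # content replaced by the element name
        losses[name] = loss_fn(masked)
    ranked = sorted(losses, key=losses.get, reverse=True)
    return ranked[:top_k], losses

# Stand-in for the auxiliary network: the loss grows by a fixed
# (hypothetical) amount when an informative content string is missing.
INFORMATIVENESS = {
    "pneumatic actuator": 3.0,
    "converts pneumatic power into mechanical power": 2.0,
    "no brand": 0.5,
}

def toy_loss(contents):
    return sum(w for text, w in INFORMATIVENESS.items() if text not in contents)

names = ["commodity name", "principle", "brand"]
contents = ["pneumatic actuator",
            "converts pneumatic power into mechanical power",
            "no brand"]
selected, losses = select_difference_elements(names, contents, toy_loss)
```

The intuition is that an element whose removal raises the classification loss the most carries the most discriminative information within its category.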

Step 3: feed the difference elements obtained in step 2 into a CNN network to extract features, while using the DPCNN and SSCNN networks to extract element name features and element content features respectively, specifically as follows:

Step 31: feed the difference elements into a CNN network, extract features with the convolutional layers, and sparsify the features with the max-pooling layer;

Step 32: feed the element names into a DPCNN network, extract features with the convolutional layers, and use the downsampling layers to compress the sequence length and thereby enlarge the receptive field;

Step 33: feed the element contents into an SSCNN network and extract shallow features with the structured convolutional layer.

For example, for the data above, the element names and element contents are kept unchanged and fed into their respective convolutional neural network models to extract features, while the difference elements are fed into the TextCNN model: data A sends "pneumatic actuator | converts pneumatic power into mechanical power", and data B sends "ram air actuator | provides pneumatic linear force".
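The sequence compression described in step 32 can be illustrated with a small sketch of repeated stride-2 max pooling. It assumes a pooling window of 3 with stride 2 and no padding, and it treats each position as a single scalar rather than a feature vector; these settings are assumptions for illustration, since the patent does not give the DPCNN hyperparameters.

```python
def max_pool_stride2(sequence, size=3, stride=2):
    """Downsample by taking the max over windows of `size`, moving `stride`
    positions at a time, which roughly halves the sequence length."""
    return [max(sequence[i:i + size])
            for i in range(0, len(sequence) - size + 1, stride)]

def pyramid_lengths(sequence):
    """Apply the downsampling repeatedly and record each length."""
    lengths = [len(sequence)]
    while len(sequence) >= 3:
        sequence = max_pool_stride2(sequence)
        lengths.append(len(sequence))
    return lengths, sequence

# Toy input: 16 positions with one scalar feature each (hypothetical).
lengths, final = pyramid_lengths(list(range(16)))
```

Because each pooling stage roughly halves the length, a fixed-width convolution applied after it covers about twice as many original positions, which is how the receptive field grows with depth.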

Step 4: fuse the difference element features, the element name features, and the element content features obtained in step 3, and then perform classification.

Step 41: concatenate the difference element features, the element name features, and the element content features;

Step 42: feed the concatenated features into two fully connected layers for classification, where the output dimension of the first layer is the number of commodity major classes and the output dimension of the second fully connected layer is the number of commodity subclasses, and concatenate the predicted major-class number and subclass number to obtain the commodity tax number.

For example, for the data above, the class with the highest probability among all classes is selected as the model's final predicted class.
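Selecting the highest-probability class can be sketched as a softmax followed by an argmax. The use of softmax to turn layer outputs into probabilities is an assumption (the patent only says that the class with the highest probability is chosen), and the logits and labels are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, labels):
    """Return the label with the highest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical scores for three candidate classes.
label, prob = predict([0.5, 2.0, -1.0], ["class_a", "class_b", "class_c"])
```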

Following the steps above, the classification performance is compared with that of the DPCNN, Transformer, BERT, and RoBERTa models. As can be seen from Table 1, the proposed method clearly outperforms the other methods in classification accuracy, precision, and F1 value.

TABLE 1 comparison of classification effect of different models for customs import and export commodities
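For readers who want to reproduce this kind of comparison, the three reported metrics can be computed as in the sketch below. Macro averaging over classes is an assumption, since the source does not state which averaging scheme was used, and the labels are toy values, not results from the patent.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro F1) for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, f1_scores = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        f1_scores.append(f1)
    return (accuracy,
            sum(precisions) / len(labels),
            sum(f1_scores) / len(labels))

# Toy labels for illustration only.
accuracy, macro_precision, macro_f1 = classification_metrics(
    ["a", "a", "b", "b"], ["a", "b", "b", "b"])
```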

Meanwhile, the invention also verifies the influence of different auxiliary networks on the final commodity classification. As shown in Table 2, choosing the TextCNN model as the auxiliary network greatly improves the accuracy of classifying customs import and export commodities.

TABLE 2 influence of different auxiliary networks on classification effect of customs import and export commodities

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A customs import and export commodity tax number prediction method, characterized by comprising the following steps:

step 3: feeding the difference elements obtained in step 2 into a CNN network for feature extraction, while extracting element name features using a DPCNN network and element content features using an SSCNN network;

2. The method for predicting the commodity tax number of customs import and export according to claim 1, wherein the step 2 is implemented in a way that:

3. The method for predicting the commodity tax number of customs import and export according to claim 1, wherein the step 3 is implemented in a manner that:

4. The method for predicting the commodity tax number of customs import and export according to claim 1, wherein the step 4 is implemented in a manner that:

step 42, feeding the concatenated features into two fully connected layers, which serve to integrate the extracted features; the output dimension of the first layer is the number of commodity major classes, the output dimension of the second fully connected layer is the number of commodity subclasses, and the predicted major-class number and subclass number are concatenated to obtain the commodity tax number.