CN111985204B

CN111985204B - Method for predicting tax numbers of customs import and export commodities

Info

Publication number: CN111985204B
Application number: CN202010744808.6A
Authority: CN
Inventors: 张强; 周成杰; 车超
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2023-06-02
Anticipated expiration: 2040-07-29
Also published as: CN111985204A

Abstract

The invention discloses a method for predicting the tax numbers of customs import and export commodities, comprising the following steps. Step 1: preprocess the customs import and export commodity text to obtain element names and element contents. Step 2: split the element contents obtained in step 1, then use an auxiliary network to select differential elements. Step 3: feed the differential elements obtained in step 2 into a CNN for feature extraction, while a DPCNN extracts element name features and an SSCNN extracts element content features. Step 4: fuse the differential element features, element name features and element content features obtained in step 3, then classify to obtain the commodity tax number. By exploiting customs-specific corpus resources, the method predicts tax numbers for customs import and export commodity texts while addressing the dilution of short-element features caused by differences in the lengths of declaration elements, and thereby improves the accuracy of tax number prediction.

Description

Method for predicting tax numbers of customs import and export commodities

Technical Field

The invention relates to the technical field of natural language processing, and in particular to a method for predicting the tax numbers of customs import and export commodities based on a hybrid convolutional neural network and an auxiliary network.

Background

Customs duties are a major source of tax revenue in many countries. At present, China Customs relies mainly on manual work to check the tax rates of import and export commodities, which can cover only a small part of the massive volume of such commodities. Because customs duties are levied mainly on the basis of the text information of commodities, natural language processing can be used to classify the commodity text and determine the tax according to the category, which makes it possible to automate tax risk prevention and control. Tax number prediction for commodities can thus be cast as a Chinese text classification problem.

Chinese text classification is the process of automatically classifying and labeling a set of texts (or other entities or objects) according to a given classification system or standard. A model of the relationship between document features and document categories is learned from a labeled training set, and the learned model is then used to assign categories to new documents. Text classification is gradually moving from knowledge-based methods to methods based on statistics and machine learning. Many classification models achieve good results on Chinese text classification tasks. Compared with ordinary Chinese text, however, a customs import and export declaration text is a linear sequence of several elements and has no continuous contextual semantics. At present there has been no attempt to use artificial intelligence for the task of classifying customs import and export declaration texts. If the task is abstracted as a traditional text classification problem, the TextCNN convolutional model proposed by Yoon Kim can extract text features well and classify texts using combinations of those features, and the BERT model proposed by Google improves the precision of text classification tasks by means of a large-scale pre-training corpus and a very large number of model parameters. However, because customs texts are domain-specific and specialized, these general-purpose models perform poorly on the customs commodity classification task.

Disclosure of Invention

The purpose of the application is to provide a method for predicting the tax numbers of customs import and export commodities. The method exploits customs-specific corpus resources to predict tax numbers from customs import and export commodity texts while addressing the dilution of short-element features caused by differences in the lengths of declaration elements, so that the accuracy of tax number prediction is improved.

To achieve the above purpose, the technical scheme of the application is as follows: a customs import and export commodity tax number prediction method, comprising the following steps:

Step 1: preprocessing the customs import and export commodity text to obtain element names and element contents;

Step 2: splitting the element contents obtained in step 1, and then using an auxiliary network to select differential elements;

Step 3: feeding the differential elements obtained in step 2 into a CNN network for feature extraction, while extracting element name features with a DPCNN network and element content features with an SSCNN network;

Step 4: fusing the differential element features, the element name features and the element content features obtained in step 3, and then performing classification to obtain the commodity tax number.

Further, step 2 is implemented as follows:

Step 21: gathering the element contents of all data belonging to the same commodity major class into one paragraph;

Step 22: counting the commodity minor classes in each paragraph, feeding each paragraph into the auxiliary network according to its minor classes, and training the network to classify the minor classes; when training on each paragraph, replacing the content of each element in turn with its element name to obtain a loss value for each element;

Step 23: sorting the elements of each paragraph by loss value in descending order and selecting the top 2 as the differential elements.
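
Steps 22 and 23 amount to a mask-and-rank loop, which the following minimal sketch illustrates. It is not the patent's implementation: `toy_loss` is a hypothetical stand-in for the loss of the trained auxiliary network, the `brand` element and the second record are invented for illustration, and only the ranking logic (descending loss, top 2) follows the text.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]  # element name -> element content


def mask_element(paragraph: List[Record], name: str) -> List[Record]:
    """Replace the content of one element with its element name in every record."""
    return [{k: (k if k == name else v) for k, v in rec.items()} for rec in paragraph]


def select_differential_elements(
    paragraph: List[Record],
    loss_fn: Callable[[List[Record]], float],
    top_k: int = 2,
) -> List[str]:
    """Rank elements by the loss obtained when each one is masked; keep the top_k."""
    names = list(paragraph[0].keys())
    losses = {name: loss_fn(mask_element(paragraph, name)) for name in names}
    ranked = sorted(names, key=lambda n: losses[n], reverse=True)
    return ranked[:top_k]


def toy_loss(paragraph: List[Record]) -> float:
    """Hypothetical loss: grows when an informative element has been masked."""
    weights = {"commodity name": 3.0, "principle": 2.0, "brand": 0.5}
    return sum(w for rec in paragraph for k, w in weights.items() if rec[k] == k)


# Toy paragraph; the second record is invented for illustration.
paragraph = [
    {"commodity name": "ram air actuator",
     "principle": "provides air pressure linear force",
     "brand": "HONEYWELL"},
    {"commodity name": "example actuator",
     "principle": "example principle",
     "brand": "EXAMPLE"},
]

selected = select_differential_elements(paragraph, toy_loss)
print(selected)
```

The intuition is that an element whose removal raises the loss the most carries the most information for telling minor classes apart, so it is the one worth giving its own feature extractor.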

Further, step 3 is implemented as follows:

Step 31: feeding the differential elements into a CNN network, extracting features with a convolution layer, and sparsifying the features with a max pooling layer;

Step 32: feeding the element names into a DPCNN network, extracting features with convolution layers, and reducing the sequence length with a downsampling layer to enlarge the receptive field;

Step 33: feeding the element contents into an SSCNN network, and extracting shallow features with a structured convolution layer.
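
As an illustration of step 31, the sketch below applies one convolution kernel to a short token sequence and then max-pools over time. It is a plain-Python sketch under stated assumptions: the 2-dimensional embeddings, the kernel values and the ReLU activation are made up for illustration and are not specified in the source.

```python
from typing import List

Vector = List[float]


def conv1d(seq: List[Vector], kernel: List[Vector], bias: float = 0.0) -> List[float]:
    """Slide one kernel (width x embedding dim) over a token sequence, then apply ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        s = bias + sum(
            w * x
            for k_row, x_row in zip(kernel, window)
            for w, x in zip(k_row, x_row)
        )
        out.append(max(0.0, s))
    return out


def max_over_time(feature_map: List[float]) -> float:
    """Keep only the strongest response of the kernel (max pooling)."""
    return max(feature_map)


# Toy embeddings for a 4-token differential element (made up).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one width-2 kernel

feature_map = conv1d(tokens, kernel)
pooled = max_over_time(feature_map)
print(feature_map, pooled)
```

Max pooling keeps one value per kernel regardless of the element's length, which is one way a short element can contribute a feature of the same size as a long one.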

Further, step 4 is implemented as follows:

Step 41: concatenating the differential element features, the element name features and the element content features;

Step 42: feeding the concatenated features into two fully connected layers, in which every node is connected to all nodes of the previous layer and which serve to integrate the extracted features; the output dimension of the first fully connected layer is the number of commodity major classes, the output dimension of the second fully connected layer is the number of commodity minor classes, and the predicted major class number and minor class number are concatenated to obtain the commodity tax number.
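
The shapes involved in steps 41 and 42 can be sketched as follows. This is a sketch under assumptions, not the patent's network: the text does not say whether the two fully connected layers are stacked or parallel, so they are stacked here; the class counts, feature sizes and random (untrained) weights are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]


def init_linear(n_in: int, n_out: int, rng: random.Random) -> Tuple[List[Vector], Vector]:
    """Random weights and zero biases for an untrained fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Every output node is connected to every input node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


rng = random.Random(0)
n_major, n_minor = 5, 12  # hypothetical class counts

# Toy feature vectors standing in for the CNN, DPCNN and SSCNN outputs.
diff_feat = [0.2, 0.7, 0.1]
name_feat = [0.5, 0.4]
content_feat = [0.9, 0.3, 0.6, 0.8]

# Step 41: concatenate the three feature vectors.
fused = diff_feat + name_feat + content_feat

# Step 42: the first layer outputs one score per major class,
# the second layer outputs one score per minor class.
w1, b1 = init_linear(len(fused), n_major, rng)
w2, b2 = init_linear(n_major, n_minor, rng)
major_scores = linear(fused, w1, b1)
minor_scores = linear(major_scores, w2, b2)

print(len(fused), len(major_scores), len(minor_scores))
```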

By adopting the above technical scheme, the invention achieves the following technical effects: by integrating several convolutional networks, exploiting customs-specific corpus resources and taking the characteristics of customs texts into account, it addresses both the poor distinguishability of commodities within the same category and the dilution of short-element features caused by differences in the lengths of declaration elements; it strengthens the independence and importance of shorter content elements within the overall features; and it improves the accuracy of customs import and export commodity tax number prediction.

Drawings

FIG. 1 is a flow chart of a method for predicting tax numbers of customs import and export commodities.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings and specific embodiments, which are taken as examples to further describe the present application.

Example 1

In predicting the tax numbers of customs import and export commodities, each declaration element is inherently delimited from the others, so the degree of semantic fusion between elements should be reduced, and sufficiently salient features must be extracted to obtain the correct commodity tax number. Based on these characteristics of customs texts and the problems in the tax number prediction task, and referring to FIG. 1, the application provides a customs import and export commodity tax number prediction method. First, the customs import and export commodity declaration texts are preprocessed, the text data is segmented into words, and the element name corresponding to each piece of element content is looked up in the declaration element catalog. Next, an auxiliary convolutional network is used to find the decisive differential elements of commodities under the same major class, and a hybrid convolutional neural network is used to predict tax numbers from the commodity texts. The hybrid convolutional neural network processes different parts of the commodity text with three kinds of convolution: an ordinary convolutional neural network (Convolutional Neural Network, CNN) extracts features from the differential elements, a shallow structured convolutional neural network (Shallow Structured Convolutional Neural Network, SSCNN) extracts features from the element contents, and a deep pyramid convolutional neural network (Deep Pyramid Convolutional Neural Network, DPCNN) extracts features from the element names. After the three feature extractions, a fully connected network performs the classification and yields the commodity tax number. The method effectively addresses two problems in customs import and export commodity tax number prediction, namely that commodities within the same major class are hard to distinguish and that short-element features are diluted when the declaration elements differ greatly in length, and its accuracy is markedly higher than that of other current mainstream deep learning methods.

The present invention is described in detail below with reference to examples and drawings, so that a person of ordinary skill in the art can practice it with reference to this description.

In this embodiment, PyCharm is used as the development platform and Python as the development language. The method is run on a corpus of 1,400,000 sentences of real customs data. The specific process is as follows:

Step 1: preprocessing the customs import and export commodity text to obtain element names and element contents.

Step 2: splitting the element contents obtained in step 1, and then using an auxiliary network to select differential elements, specifically as follows:

Step 21: gathering the element contents obtained in step 1 for all data belonging to the same commodity major class into one paragraph; for example, the data:

Data B: "8412310090 ram air actuator |4|3| provides air pressure linear force |aircraft power system |HONEYWELL| 676000141"
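
A record such as Data B can be split into a tax number and named elements, and records can then be grouped into paragraphs by major class, roughly as sketched below. The element-name catalog is hypothetical (only "commodity name" and "principle" appear in this description; the other names are placeholders), and taking the first four digits of the tax number as the major class is an assumption, since the text does not state where the major and minor class numbers are split.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical declaration-element catalog for this commodity.
ELEMENT_NAMES = ["commodity name", "element 2", "element 3",
                 "principle", "use", "brand", "model"]

Parsed = Tuple[str, Dict[str, str]]


def parse_declaration(line: str, element_names: List[str]) -> Parsed:
    """Split 'TAXNUMBER content|content|...' into a tax number and named elements."""
    tax_number, _, rest = line.strip().partition(" ")
    contents = [field.strip() for field in rest.split("|")]
    if len(contents) != len(element_names):
        raise ValueError("element count does not match the catalog")
    return tax_number, dict(zip(element_names, contents))


def group_by_major_class(records: List[Parsed], major_digits: int = 4) -> Dict[str, List[Parsed]]:
    """Gather records that share the same major class into one paragraph."""
    paragraphs = defaultdict(list)
    for tax_number, elements in records:
        paragraphs[tax_number[:major_digits]].append((tax_number, elements))
    return dict(paragraphs)


data_b = ("8412310090 ram air actuator |4|3| provides air pressure linear force "
          "|aircraft power system |HONEYWELL| 676000141")

tax_number, elements = parse_declaration(data_b, ELEMENT_NAMES)
paragraphs = group_by_major_class([(tax_number, elements)])
print(tax_number, elements["commodity name"], list(paragraphs))
```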

Step 22: counting the commodity minor classes in each paragraph, feeding each paragraph into the auxiliary network according to its minor classes, and training the network to classify the minor classes. When training on each paragraph, the content of each element is replaced in turn with its element name to obtain a loss value for each element;

Step 23: sorting the elements of each paragraph by loss value in descending order and selecting the top 2 as the differential elements.

For the two pieces of data above, the differential elements computed by the auxiliary network are the commodity name and the principle.

Step 3: feeding the differential elements obtained in step 2 into a CNN network to extract features, while extracting element name features and element content features with the DPCNN and SSCNN networks respectively, specifically as follows:

Step 31: feeding the differential elements into a CNN network, extracting features with a convolution layer, and sparsifying the features with a max pooling layer;

Step 32: feeding the element names into a DPCNN network, extracting features with convolution layers, and reducing the sequence length layer by layer with a downsampling layer to enlarge the receptive field;

Step 33: feeding the element contents into an SSCNN network, and extracting shallow features with a structured convolution layer.

For example, for the data above, the element names and element contents are fed unchanged into their respective convolutional neural network models for feature extraction, while the differential elements are fed into the TextCNN model for feature extraction: for data A, the commodity name "pneumatic actuator" together with its principle (conversion into mechanical power), and for data B, "ram air actuator|provides air pressure linear force".
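
The effect of the downsampling layer in step 32 can be checked with a short calculation. The sketch below assumes width-3 convolutions with stride 1 and a width-3 pooling layer with stride 2, as in the commonly described DPCNN design; the patent does not state these sizes, so they are assumptions.

```python
from typing import List, Tuple

Layer = Tuple[int, int]  # (kernel width, stride)


def receptive_field(layers: List[Layer]) -> int:
    """Number of input tokens seen by one output position after the given layers."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


def pooled_length(n: int, kernel: int = 3, stride: int = 2) -> int:
    """Sequence length after an unpadded pooling layer."""
    return (n - kernel) // stride + 1


# One block: two width-3 convolutions followed by pooling.
block_downsampled = [(3, 1), (3, 1), (3, 2)]  # pooling with stride 2
block_plain = [(3, 1), (3, 1), (3, 1)]        # same widths, no downsampling

rf_downsampled = receptive_field(block_downsampled * 2)
rf_plain = receptive_field(block_plain * 2)
print(rf_downsampled, rf_plain, pooled_length(32))
```

With the same number of layers, the stride-2 variant sees more input tokens per output position, while each pooling step roughly halves the sequence the following layers must process.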

Step 4: fusing the differential element features, the element name features and the element content features obtained in step 3, and then performing classification.

Step 41: concatenating the differential element features, the element name features and the element content features;

Step 42: feeding the concatenated features into two fully connected layers for classification, where the output dimension of the first layer is the number of commodity major classes and the output dimension of the second layer is the number of commodity minor classes, and concatenating the predicted major class number and minor class number to obtain the commodity tax number.

For example, for the data above, the probabilities of all classes are computed, and the class with the highest probability is selected as the final predicted class of the model.
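
Selecting the most probable class and assembling the tax number can be sketched as follows. The scores and candidate labels are invented for illustration (only 8412310090 comes from the example data), and splitting the tax number into a 4-digit major class number and a 6-digit minor class number is an assumption.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_probable(scores: List[float], labels: List[str]) -> Tuple[str, float]:
    """Return the label with the highest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical candidate labels and scores.
major_labels = ["8411", "8412", "8413"]
minor_labels = ["100000", "310090", "390000"]
major_scores = [0.3, 2.1, -0.5]
minor_scores = [-1.0, 1.7, 0.4]

major, p_major = most_probable(major_scores, major_labels)
minor, p_minor = most_probable(minor_scores, minor_labels)
tax_number = major + minor  # concatenate major and minor class numbers
print(tax_number)
```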

Following the above steps, the classification performance of the method is compared with that of the DPCNN, Transformer, BERT and RoBERTa models. As can be seen from Table 1, the method proposed by the present invention is significantly superior to the other methods in classification accuracy, precision and F1 score.
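
For reference, the three metrics named above can be computed as in the sketch below. It assumes macro averaging over classes, which the text does not specify, and uses made-up toy labels rather than any values from Table 1.

```python
from typing import List, Tuple


def classification_metrics(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Return (accuracy, macro precision, macro F1)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, f1s = [], []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        f1s.append(f1)
    n = len(precisions)
    return accuracy, sum(precisions) / n, sum(f1s) / n


# Toy labels, invented for illustration only.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]

accuracy, macro_precision, macro_f1 = classification_metrics(y_true, y_pred)
print(round(accuracy, 3), round(macro_precision, 3), round(macro_f1, 3))
```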

Table 1. Comparison of the classification performance of different models on customs import and export commodities

The invention also verifies the influence of different auxiliary networks on the final commodity classification. As shown in Table 2, using the TextCNN model as the auxiliary network greatly improves the classification accuracy for customs import and export commodities.

Table 2. Influence of different auxiliary networks on the classification of customs import and export commodities

While the invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A customs import and export commodity tax number prediction method is characterized by comprising the following steps:

step 3: feeding the differential elements obtained in step 2 into a CNN network for feature extraction, while extracting element name features with a DPCNN network and element content features with an SSCNN network;

step 4: fusing the differential element features, the element name features and the element content features obtained in step 3, and then performing classification to obtain the commodity tax number;

the specific implementation mode of the step 2 is as follows:

2. The customs import and export commodity tax number prediction method according to claim 1, wherein step 3 is implemented as follows:

3. The customs import and export commodity tax number prediction method according to claim 1, wherein step 4 is implemented as follows:

step 42, feeding the concatenated features into two fully connected layers to integrate the extracted features; the output dimension of the first fully connected layer is the number of commodity major classes, the output dimension of the second fully connected layer is the number of commodity minor classes, and the predicted major class number and minor class number are concatenated to obtain the commodity tax number.