CN111476027A

CN111476027A - Big data based anti-smuggling case information extraction method

Info

Publication number: CN111476027A
Application number: CN202010263448.8A
Authority: CN
Inventors: 邱明月; 吴育宝; 王新猛
Original assignee: Nanjing Forest Police College
Current assignee: Nanjing Forest Police College
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2020-07-31

Abstract

The invention discloses a big data-based method for extracting anti-smuggling case information. It provides an anti-smuggling intelligence extraction model oriented to big data and verifies the model's practical effect on real case texts. First, an intelligence element expression model of the anti-smuggling case is constructed from anti-smuggling case information. Then, the various smuggling intelligence elements are extracted automatically using natural language processing and a deep learning algorithm model. Finally, the extraction effect is verified in a big data environment by combining the document chain, fund chain, and goods chain of evidence. Drawing on element models and natural language processing, and using smuggling-related case information as the data source, the method constructs a smuggling intelligence extraction model for the big data environment, forms an intelligence service method that extracts smuggling intelligence automatically, and validates it experimentally. Automatic extraction effectively saves manpower and material resources.

Description

Big data based anti-smuggling case information extraction method

Technical Field

The invention belongs to the technical field of data processing based on computational models, and particularly relates to a big data-based anti-smuggling case information extraction method.

Background

In the big data era, criminals increasingly use information technology when carrying out smuggling and other illegal activities. Transaction-related information is stored across many kinds of media, so the data are large in volume and hidden in storage belonging to different domains. The data sources mainly include customs business data, internal public security data, other administrative data, and internet resources.

Smuggling crimes generally involve textual materials such as customs declarations, waybills, manifests, bills of lading, and sales documents, as well as the basic personal information, call records, financial transactions, and related leads of the suspects. Faced with massive, complex case information of many data types spanning long periods, anti-smuggling officers must quickly and efficiently extract the information relevant to a case from the related materials and promptly assemble its evidence chains, such as the fund flow, cargo flow, and document flow, so that the case can be sorted out and solved quickly. Each of these steps places high demands on the ability of customs to extract smuggling intelligence in the big data era, and information extraction has become a bottleneck in traditional anti-smuggling intelligence work.

Based on the above analysis, it is necessary to construct, against the background of big data technology, an anti-smuggling case information extraction method suited to multi-level, wide-view, large-scale anti-smuggling case data.

Disclosure of Invention

To address the information extraction bottleneck in traditional customs anti-smuggling work, the invention provides a big data-oriented anti-smuggling intelligence extraction model and verifies the model's practical effect on real case texts.

To achieve this purpose, the technical scheme adopted by the invention is a big data-based anti-smuggling case information extraction method comprising the following steps:

S1: constructing an intelligence element expression model of the anti-smuggling case based on anti-smuggling case information;

S2: automatically extracting the various smuggling intelligence elements by means of natural language processing and a deep learning algorithm model;

S3: verifying the smuggling intelligence extraction effect in a big data environment by combining the document chain, fund chain, and goods chain of evidence.

The step S1 further includes the following steps:

S11: element expression of the anti-smuggling case

An element expression model of the anti-smuggling case is constructed based on the case characteristics and data sources of anti-smuggling cases.

S12: character vectorization

The anti-smuggling case text is annotated with elements according to the composition and relations of the element expression model to obtain a labeled data set. Based on the labeled training set and the unlabeled case text corpus, a word vector generation tool is then trained without supervision on the word-segmented text.

Preferably, the word vector generation tool is word2vec, which is trained to reconstruct the linguistic contexts of words.

The modeling process of the deep learning algorithm model in the step S2 specifically includes the following steps:

S21: dilated convolution

Applying a convolutional neural network layer to a sequence labeling problem means performing a convolution operation on the sequence vectors, which is equivalent to applying an affine transformation to the input sequence. For a subsequence vector x_t of the sequence, the output is defined as

$$c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k} \tag{1}$$

where c_t is the output at position t, W_c is the affine transformation, r is the convolution radius, and ⊕ is the vector concatenation operation. Dilated convolution adds a dilation width to the filter of an ordinary convolution. Given an input sequence vector, it skips the inputs lying inside the dilation width, so the effective input becomes wider and more input data are covered. When dilated convolution is used for sequence labeling, the output for the subsequence vector x_t is defined as

$$c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k\sigma} \tag{2}$$

where σ is the dilation width. When σ = 1, dilated convolution is identical to ordinary convolution. When σ > 1, the dilation width enlarges the receptive field, and the positions skipped because of the dilation are still brought into the convolution operation as the filter translates along the sequence;

S22: iterated dilated convolutional neural network

Simply stacking more layers exposes the output to a risk of overfitting. The iterated dilated convolutional neural network avoids this by applying the same dilated convolution block multiple times: each iteration takes the output of the last layer of the previous iteration as input, and the same parameters are reused in every iteration;

S23: conditional random field

The conditional random field defines a series of binary feature functions, comprising transition feature functions and state feature functions. The former capture the dependencies between output variables; the latter capture the influence of the input features. The weighted sum of all feature functions gives the score of each label category, a normalization factor turns the scores into probabilities, and the category with the maximum probability is taken as the current label. Given an observation sequence X = (x₁, x₂, …, x_n), the linear-chain conditional random field P(Y | X) gives the conditional probability that the predicted sequence Y = (y₁, y₂, …, y_n) takes the value y, which in simplified form is

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{k=1}^{K} w_k f_k(y, x)\right) \tag{3}$$

where Z(x) is the normalization factor, exp is the exponential function, w_k is the weight of the feature function f_k(y, x), and K is the number of defined feature functions;

S24: iterated dilated convolutional neural network combined with a conditional random field

The character vectors and word vectors obtained through preprocessing serve as the input of the model. The input vectors pass through an ordinary convolution layer and then enter the iterated dilated convolution network, whose iterated blocks share one set of parameters. The network layer finally outputs a score for each class at each position of the sequence, and these score sequences are the input of the CRF layer. The CRF layer evaluates each label sequence using transition feature functions, which account for the dependencies between output variables, and thereby further optimizes the classification result obtained by the iterated dilated convolution.

Over the whole of step S24, named entities are first protected from being split by word segmentation in the preprocessing stage, and each entity is input to the model as an independent word vector. An iterated dilated convolutional neural network is then used to extract more robust features, dropout randomly discards some connections of the network to counter the overfitting that repeated iteration of the network layer may cause, and finally the conditional random field further corrects the result obtained by the network layer.
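As a non-limiting illustration of how a CRF layer can correct per-position network scores, the following minimal sketch decodes the best label sequence with the Viterbi algorithm. The label names, the score values, and the function name `viterbi` are assumptions made for this example only; the description does not specify a decoding procedure or a tagging scheme.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list of dicts, one per position, mapping label -> network score
    transitions: dict mapping (previous_label, current_label) -> transition score
                 (missing pairs count as 0.0)
    """
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            step[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = step
        backpointers.append(pointers)
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a two-token span (illustrative values only).
labels = ["O", "B", "I"]
emissions = [
    {"O": 1.0, "B": 0.8, "I": 0.0},
    {"O": 0.2, "B": 0.1, "I": 1.5},
]
# An "I" label directly after "O" is penalized; "B" followed by "I" is rewarded.
transitions = {("O", "I"): -5.0, ("B", "I"): 1.0}

greedy = [max(scores, key=scores.get) for scores in emissions]
decoded = viterbi(emissions, transitions, labels)
print(greedy)   # ['O', 'I']
print(decoded)  # ['B', 'I']
```

Taking the per-position maximum yields the inconsistent sequence O, I, whereas decoding with the transition scores yields B, I, which is the kind of correction the CRF layer is described as providing.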

In step S3, the extracted intelligence is comprehensively evaluated and verified along the five dimensions of intelligence evaluation: reliability, effectiveness, timeliness, repeatability, and urgency.

Compared with the prior art, the invention has the following beneficial technical effects:

(1) Taking the difficulties of customs anti-smuggling work in the big data era as its starting point, the invention provides a big data-oriented smuggling intelligence extraction model that addresses the bottleneck in traditional customs anti-smuggling intelligence work, and verifies the model's practical effect on real case texts.

(2) The invention extracts intelligence automatically based on an expression of anti-smuggling intelligence elements within a spatio-temporal framework, effectively saving manpower and material resources. By applying big data analysis to anti-smuggling work, it eases the case-handling difficulties of front-line officers in anti-smuggling intelligence departments.

(3) Drawing on element models and natural language processing, and using smuggling-related case information as the data source, the invention constructs a smuggling intelligence extraction model for the big data environment, forms an intelligence service method for the automatic extraction of smuggling intelligence, and validates it experimentally.

Drawings

FIG. 1 is a flow chart of the deep conditional random field model;

FIG. 2 is the intelligence element expression model;

FIG. 3 is a diagram of ordinary convolution and dilated convolution;

FIG. 4 is the deep conditional random field information extraction model.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings.

Different public security data sources clearly complement one another in richness of knowledge, abstraction of expression, cognitive habits, and other respects. In terms of data form, smuggling data include text, voice, image, video, and audio. Extraction of case and event information from structured data is by now relatively mature and easy to implement. By contrast, extraction from unstructured text, voice, image, video, and audio data requires further research. To capture more context information in case texts, the invention works mainly with text data and constructs a deep conditional random field information extraction model driven by smuggling intelligence elements. The modeling process is shown in FIG. 1 and specifically includes:

S11: element expression of the anti-smuggling case

At present, the national public security information system usually collects, processes, and manages information according to five elements: people, affairs, things, organizations, and places. Public security work typically needs to answer several basic questions: Who is involved in the incident? When did the incident occur? Where did it occur? What did the person involved do? Why did they do it? What items are involved? What consequences did the case or event cause? Public security intelligence can thus be summarized into five basic elements with the case at the core: time, place, people, articles, and events. Time and place are the basic conditions for the existence and evolution of the other three elements (people, articles, and events), and the elements have attribute, behavior, state, and process characteristics. Note that the people element includes organizations and virtual identities. Relations between case and event elements fall into two types: concept relations and feature relations. A concept relation is a semantic relation among the concepts of different elements in the same classification system, such as an equivalence, upper-level, lower-level, same-level, or correlation relation. A feature relation is a relation between the features of different elements, such as a temporal, spatial, attribute, state, or process relation.

Generally, according to the types, characteristics, and logical relations of the elements, the intelligence elements of cases and events can be divided into three layers: a concept layer, an element layer, and an element relation layer. The element layer can be divided into three sub-layers: basic characteristics (time, space, attribute, behavior), state characteristics, and process characteristics. The element relation layer can be divided into two sub-layers: concept relations and feature relations. This hierarchical division yields case element semantic units at different levels. The intelligence element expression model comprises an intelligence element submodel and an intelligence element relation submodel, as shown in FIG. 2.
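As a non-limiting illustration, the layered element model described above could be held in a small data structure such as the following sketch. The class names `Element` and `Relation`, their field names, and the sample values are hypothetical and chosen only for this example; the element types, sub-layers, and relation kinds are those listed in the description.

```python
from dataclasses import dataclass, field

# The five basic element types and the two relation sub-layers from the description.
ELEMENT_TYPES = ("time", "place", "person", "article", "event")
CONCEPT_RELATIONS = ("equivalence", "upper-level", "lower-level", "same-level", "correlation")
FEATURE_RELATIONS = ("temporal", "spatial", "attribute", "state", "process")


@dataclass
class Element:
    """One intelligence element: a concept plus its element-layer features."""
    element_type: str
    concept: str
    basic: dict = field(default_factory=dict)    # time, space, attribute, behavior
    state: dict = field(default_factory=dict)    # state characteristics
    process: list = field(default_factory=list)  # process characteristics

    def __post_init__(self):
        if self.element_type not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {self.element_type!r}")


@dataclass
class Relation:
    """A link between two elements in the element relation layer."""
    source: Element
    target: Element
    kind: str

    def sublayer(self):
        if self.kind in CONCEPT_RELATIONS:
            return "concept relation"
        if self.kind in FEATURE_RELATIONS:
            return "feature relation"
        raise ValueError(f"unknown relation kind: {self.kind!r}")


# Hypothetical sample values, for illustration only.
event = Element("event", "undeclared import", basic={"behavior": "concealment"})
place = Element("place", "port", basic={"space": "berth 3"})
relation = Relation(event, place, "spatial")
print(relation.sublayer())  # feature relation
```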

S12: character vectorization

To obtain high-quality word vector features, grassroots police officers first annotate the intelligence elements in the data, with the element types taken from the intelligence element expression model of this embodiment, yielding a labeled data set. The word2vec tool is then trained without supervision on the word-segmented text, using the labeled training set together with the unlabeled case text corpus.
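As a non-limiting illustration of the annotation step, the sketch below converts annotated element spans over a word-segmented sentence into per-token labels. The BIO tagging scheme, the function name `bio_tags`, and the sample sentence are assumptions made for this example; the description states only that element types are annotated, not how the labels are encoded. Word vector training with word2vec is not shown.

```python
def bio_tags(tokens, spans):
    """Turn annotated spans into one BIO tag per token.

    tokens: list of word-segmented tokens
    spans:  list of (start, end, element_type), token indices with end exclusive
    """
    tags = ["O"] * len(tokens)
    for start, end, element_type in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end)}")
        tags[start] = f"B-{element_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{element_type}"
    return tags


# Invented sample sentence and annotation, for illustration only.
tokens = ["On", "2019-03-02", "Zhang", "San", "shipped", "frozen", "meat", "via", "Port", "X"]
spans = [(1, 2, "TIME"), (2, 4, "PERSON"), (5, 7, "ARTICLE"), (8, 10, "PLACE")]
tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```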

S21: dilated convolution

Dilated convolution has been applied to image segmentation tasks, where it resolves the conflict that pooling enlarges the receptive field but loses information, and it does so without extra computation. FIG. 3(a) shows an ordinary convolution with a 3 × 3 kernel and a 3 × 3 receptive field. FIG. 3(b) shows a dilated convolution with a 3 × 3 kernel and a dilation width of 2, whose receptive field grows to 5 × 5. Because dilated convolution enlarges the receptive field without losing information, the dilated convolutional neural network is applied to element extraction to handle the long-range sequence dependencies that element extraction requires.
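The receptive field sizes quoted for FIG. 3 follow from a simple relation: a kernel of size k with dilation width d spans d·(k − 1) + 1 positions along each axis. A minimal sketch of this arithmetic, with the hypothetical helper name `effective_kernel_size`, is:

```python
def effective_kernel_size(kernel_size, dilation):
    """Span, along one axis, covered by a kernel with the given dilation width."""
    return dilation * (kernel_size - 1) + 1


print(effective_kernel_size(3, 1))  # 3  (ordinary 3 x 3 convolution)
print(effective_kernel_size(3, 2))  # 5  (3 x 3 kernel, dilation width 2)
```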

Applying a convolutional neural network layer to a sequence labeling problem essentially means performing a convolution operation on the sequence vectors. Unlike the two-dimensional convolution used for images, this is equivalent to applying an affine transformation to the input sequence. For a subsequence vector x_t of the sequence, the output is defined as

$$c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k} \tag{1}$$

where c_t is the output at position t, W_c is the affine transformation, r is the convolution radius, and ⊕ is the vector concatenation operation. Dilated convolution adds a dilation width to the filter of an ordinary convolution. Given an input sequence vector, it skips the inputs lying inside the dilation width, so the effective input becomes wider and more input data are covered. When dilated convolution is used for sequence labeling, the output for the subsequence vector x_t is defined as

$$c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k\sigma} \tag{2}$$

where σ is the dilation width. When σ = 1, dilated convolution is identical to ordinary convolution. When σ > 1, the dilation width enlarges the receptive field, and the positions skipped because of the dilation are still brought into the convolution operation as the filter translates along the sequence. Dilated convolution can therefore capture more context information from the text than ordinary convolution.
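As a non-limiting illustration of dilated convolution over a sequence, the sketch below concatenates the vectors found at a stride of σ around each position and applies a linear map to the concatenation. The function names, the zero padding at the sequence edges, the omission of a bias term, and the toy weights are assumptions made for this example.

```python
def dilated_window(seq, t, radius, dilation):
    """Concatenate the vectors at positions t + k*dilation for k = -radius..radius.

    Positions outside the sequence are zero-padded (an assumption of this sketch).
    """
    dim = len(seq[0])
    window = []
    for k in range(-radius, radius + 1):
        pos = t + k * dilation
        window.extend(seq[pos] if 0 <= pos < len(seq) else [0.0] * dim)
    return window


def dilated_conv(seq, weights, radius, dilation):
    """Apply the linear map `weights` (rows of length (2*radius+1)*dim) at every position."""
    outputs = []
    for t in range(len(seq)):
        window = dilated_window(seq, t, radius, dilation)
        outputs.append([sum(w * x for w, x in zip(row, window)) for row in weights])
    return outputs


seq = [[1.0], [2.0], [3.0], [4.0], [5.0]]   # five one-dimensional input vectors
weights = [[1.0, 1.0, 1.0]]                 # toy filter that sums its window

print(dilated_conv(seq, weights, radius=1, dilation=1))
# [[3.0], [6.0], [9.0], [12.0], [9.0]]
print(dilated_conv(seq, weights, radius=1, dilation=2))
# [[4.0], [6.0], [9.0], [6.0], [8.0]]
```

With a dilation width of 1 each output sums adjacent inputs, as in ordinary convolution. With a dilation width of 2 the same three-weight filter reaches two positions to either side, skipping the inputs in between.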

S22: iterative dilation convolutional neural network

Simply stacking more layers exposes the output to a risk of overfitting, which the iterated dilated convolutional neural network is designed to avoid. It applies the same dilated convolution block multiple times; each iteration takes the output of the last layer of the previous iteration as input and reuses the same parameters. This widens the effective input width while strengthening the generalization ability of the model.
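As a non-limiting illustration of parameter sharing across iterations, the sketch below applies one block repeatedly and compares the parameter count of an iterated network with that of an equally deep stacked network. The function names, the toy block, and the figure of 100 parameters per layer are hypothetical and serve only to show the bookkeeping.

```python
def iterate_block(block, values, iterations):
    """Feed the output of each pass back into the same block (same parameters every pass)."""
    for _ in range(iterations):
        values = block(values)
    return values


def parameter_count(layers_per_block, iterations, params_per_layer, shared):
    """Parameters needed when blocks share weights versus when each pass has its own."""
    distinct_blocks = 1 if shared else iterations
    return distinct_blocks * layers_per_block * params_per_layer


# Toy block standing in for a dilated convolution block.
def double(values):
    return [2 * v for v in values]


print(iterate_block(double, [1], 4))                        # [16]
print(parameter_count(3, 4, 100, shared=True))              # 300
print(parameter_count(3, 4, 100, shared=False))             # 1200
```

Under these assumed numbers, four iterations of a three-layer block give the depth of twelve layers while training only the parameters of three.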

S23: conditional random field

The conditional random field defines a series of binary feature functions, comprising transition feature functions and state feature functions. The former capture the dependencies between output variables; the latter capture the influence of the input features. The weighted sum of all feature functions gives the score of each label category, a normalization factor turns the scores into probabilities, and the category with the maximum probability is taken as the current label. Given an observation sequence X = (x₁, x₂, …, x_n), the linear-chain conditional random field P(Y | X) gives the conditional probability that the predicted sequence Y = (y₁, y₂, …, y_n) takes the value y, which in simplified form is

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{k=1}^{K} w_k f_k(y, x)\right) \tag{3}$$

where Z(x) is the normalization factor, obtained by summing the exponentiated score over all candidate label sequences; exp is the exponential function; w_k is the weight of the feature function f_k(y, x); and K is the number of defined feature functions.
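As a non-limiting illustration of the conditional probability of a linear-chain conditional random field, the sketch below scores a label sequence as a sum of weighted state and transition features and normalizes by brute-force enumeration of every candidate sequence. The function names, the two-label set, and the score values are assumptions made for this example; enumeration is practical only for very short sequences.

```python
import itertools
import math


def sequence_score(labels, emissions, transitions):
    """Weighted sum of state features (emissions) and transition features."""
    score = sum(emissions[t][y] for t, y in enumerate(labels))
    score += sum(transitions.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return score


def conditional_probability(labels, emissions, transitions, label_set):
    """exp(score) divided by the normalization factor Z(x)."""
    z = sum(
        math.exp(sequence_score(candidate, emissions, transitions))
        for candidate in itertools.product(label_set, repeat=len(emissions))
    )
    return math.exp(sequence_score(labels, emissions, transitions)) / z


# Hypothetical scores for a two-token input (illustrative values only).
label_set = ("O", "B")
emissions = [{"O": 1.0, "B": 0.5}, {"O": 0.2, "B": 1.5}]
transitions = {("B", "B"): -1.0}

probabilities = {
    candidate: conditional_probability(candidate, emissions, transitions, label_set)
    for candidate in itertools.product(label_set, repeat=2)
}
best = max(probabilities, key=probabilities.get)
print(best)                         # ('O', 'B')
print(sum(probabilities.values()))  # sums to 1 up to rounding
```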

The character vectors and word vectors obtained through preprocessing serve as the input of the model. The input vectors pass through an ordinary convolution layer and then enter the iterated dilated convolution network. The network in FIG. 4 consists of a block of 3 dilated convolution layers that is iterated 4 times, and the iterated blocks share one set of parameters.
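As a non-limiting illustration of how much context such a network can cover, the sketch below computes the receptive field of a block of three dilated layers iterated four times. The dilation widths (1, 1, 2) and the convolution radius of 1 are assumptions made for this example; the description gives the number of layers and iterations but not these values.

```python
def receptive_field(dilations, radius, iterations):
    """Number of input positions that can influence one output position.

    Each layer with dilation width d and convolution radius r widens the
    receptive field by 2 * r * d positions.
    """
    width = 1
    for _ in range(iterations):
        for dilation in dilations:
            width += 2 * radius * dilation
    return width


print(receptive_field((1, 1, 2), radius=1, iterations=1))  # 9
print(receptive_field((1, 1, 2), radius=1, iterations=4))  # 33
```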

S3: analysis and verification of smuggling information

The invention combines the five dimensions of intelligence evaluation (Table 1) to comprehensively evaluate and analyze the reliability, effectiveness, timeliness, repeatability, and urgency of the intelligence.

Table 1. Intelligence evaluation and analysis matrix

Intelligence evaluation feature	A	B	C
Practicality	Highly targeted	Has some potential	Not applicable
Authenticity	True and accurate content	Fairly reliable	Unreliable
Timeliness	Advance warning	Timely warning	No warning, lagging
Repeatability	Never appeared	Occasionally appears	Always occurs
Urgency	Severe warning	Moderate warning	Light warning
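As a non-limiting illustration, the evaluation matrix can be encoded as a lookup table that maps a grade per dimension to its descriptor. The dictionary name, the function name `describe`, and the sample grades are hypothetical; the description does not define how the five grades are combined into an overall result, so no aggregation is shown.

```python
EVALUATION_MATRIX = {
    "practicality": {"A": "highly targeted", "B": "has some potential", "C": "not applicable"},
    "authenticity": {"A": "true and accurate content", "B": "fairly reliable", "C": "unreliable"},
    "timeliness": {"A": "advance warning", "B": "timely warning", "C": "no warning, lagging"},
    "repeatability": {"A": "never appeared", "B": "occasionally appears", "C": "always occurs"},
    "urgency": {"A": "severe warning", "B": "moderate warning", "C": "light warning"},
}


def describe(grades):
    """Map a grade (A, B, or C) for every dimension to its descriptor in the matrix."""
    missing = set(EVALUATION_MATRIX) - set(grades)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return {dim: EVALUATION_MATRIX[dim][grades[dim]] for dim in EVALUATION_MATRIX}


# Hypothetical grades for one extracted piece of intelligence.
sample = {"practicality": "A", "authenticity": "B", "timeliness": "B",
          "repeatability": "C", "urgency": "A"}
for dimension, descriptor in describe(sample).items():
    print(f"{dimension}: {descriptor}")
```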

It should be understood that the above description of specific embodiments is not intended to limit the invention, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A big data-based anti-smuggling case information extraction method is characterized by comprising the following steps:

2. The big data-based anti-smuggling case information extraction method as claimed in claim 1, wherein step S1 comprises the following steps:

S11: element expression of the anti-smuggling case

constructing an element expression model of the anti-smuggling case based on the case characteristics and data sources of anti-smuggling cases;

S12: character vectorization

annotating the anti-smuggling case text with elements according to the composition and relations of the element expression model to obtain a labeled data set, and performing unsupervised training on the word-segmented text with a word vector generation tool, based on the labeled data set and the unlabeled case text corpus.

3. The big data-based anti-smuggling case information extraction method as claimed in claim 2, wherein the word vector generation tool is word2vec.

4. The big data-based anti-smuggling case information extraction method according to claim 1, wherein the modeling process of the deep learning algorithm model in step S2 specifically comprises the following steps:

S21: dilated convolution

(1) In the formula: r is the convolution radius;

⊕ is the vector concatenation operation; dilated convolution adds a dilation width to the filter of an ordinary convolution; given an input sequence vector, dilated convolution skips the inputs lying inside the dilation width, so the effective input becomes wider and more input data are covered; when dilated convolution is used for sequence labeling, the output for the subsequence vector x_t is defined as

S22: iterated dilated convolutional neural network

S23: conditional random field

5. The big data-based anti-smuggling case information extraction method as claimed in claim 4, characterized in that, over the whole of step S24, named entities are first protected from being split by word segmentation in the preprocessing stage and each entity is input to the model as an independent word vector, an iterated dilated convolutional neural network is then used to extract more robust features, dropout is then used to randomly discard some connections of the network to counter the overfitting that repeated iteration of the network layer may cause, and finally a conditional random field is used to further correct the result obtained by the network layer.

6. The big data-based anti-smuggling case intelligence extraction method as claimed in claim 1, wherein in the step S3, the anti-smuggling intelligence extraction effect verification is a comprehensive evaluation verification of five dimensions of reliability, effectiveness, timeliness, repeatability and urgency of intelligence evaluation.