CN111414749A - Social text dependency syntactic analysis system based on deep neural network - Google Patents

Social text dependency syntactic analysis system based on deep neural network

Info

Publication number
CN111414749A
CN111414749A (application CN202010193329.XA)
Authority
CN
China
Prior art keywords
module
training
social
text
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010193329.XA
Other languages
Chinese (zh)
Other versions
CN111414749B (en)
Inventor
刘宇鹏
张晓晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN202010193329.XA
Publication of CN111414749A
Application granted
Publication of CN111414749B
Active legal status
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A social text dependency syntactic analysis system based on a deep neural network relates to the technical field of computer information processing and aims to solve the problem of sparse social text data in the prior art. The system comprises a social text crawling module, a preprocessing module, a base bilinear attention module, a stacked bilinear attention module and a joint decoding and training module. The social text crawling module is used for crawling social texts from social media websites; the preprocessing module is used for filtering the obtained social texts and generating initialization word vectors; the base bilinear attention module is used for pre-training with regular texts; the stacked bilinear attention module is used for predicting social texts; the joint decoding and training module is used for calculating an empirical risk function for the stacked bilinear attention module, adjusting parameters by back-propagating gradients to fit the training objective, and finally using GPU parallel computation to accelerate model training.

Description

Social text dependency syntactic analysis system based on deep neural network
Technical Field
The invention relates to the technical field of computer information processing, and in particular to a social text dependency syntactic analysis system based on a deep neural network.
Background
Dependency analysis is a fundamental and important task in natural language processing, and many applications require dependency analysis of sentences to provide syntactic results to the corresponding downstream task. The dependency syntactic structure of a sentence is identified by means of the powerful computing capability of a computer. Dependency syntax trees are broadly divided into two categories by structure, projective (Projective) and non-projective (Non-projective) dependency structures, and by decoding algorithm into graph-based (Graph-based) and transition-based (Transition-based) dependency algorithms. Deep neural networks partly overcome the gradient vanishing and gradient explosion problems of traditional neural networks, have developed rapidly in recent years, and have made great progress in many application fields of natural language processing. Deep neural methods have the following advantages: 1. the model is independent of the scale of the task, so that as long as the parameters are specified, tasks with data of any scale can be learned; 2. unlike traditional dependency analysis, which requires separate feature extraction, feature extraction and the training of the dependency analyzer are performed together, and this joint-model (Joint) approach overcomes the error-propagation defect of the traditional pipeline (Pipeline) model; 3. compared with traditional methods, higher performance is achieved, and the approach is used for many tasks. Many research institutions and scientific organizations have therefore focused on deep learning models.
Unlike dependency analysis of regular text, dependency analysis of social text faces particular problems: for example, the training corpus is small, and special words and special dependency relationships occur.
Disclosure of Invention
The purpose of the invention is to provide a social text dependency syntactic analysis system based on a deep neural network, aiming at the problem of sparse social text data in the prior art.
The technical scheme adopted by the invention to solve the technical problem is as follows:
A social text dependency syntactic analysis system based on a deep neural network, comprising: a social text crawling module, a preprocessing module, a base bilinear attention module, a stacked bilinear attention module and a joint decoding and training module;
the social text crawling module is used for crawling social texts from social media websites;
the preprocessing module is used for filtering the obtained social texts and generating initialization word vectors;
the base bilinear attention module is used for pre-training with regular texts;
the stacked bilinear attention module is used for predicting social texts;
the joint decoding and training module is used for calculating an empirical risk function for the stacked bilinear attention module, adjusting parameters by back-propagating gradients to fit the training objective, and finally using GPU parallel computation to accelerate model training.
Further, the social text crawling module executes the following steps:
Firstly, a web crawler is written in Python using Scrapy, Scrapy is configured (including setting the crawl time interval and the proxy), and then the relevant text content of the web page is located and extracted.
Further, the specific steps of filtering in the preprocessing module are as follows:
Firstly, a language model is trained with the language model tool KenLM on the English regular-text corpus Gigaword, and then the language model is used to score the downloaded social texts, which are filtered with a threshold.
Further, the specific steps of generating the initialization word vectors in the preprocessing module are as follows:
Firstly, the GloVe tool is trained on the word-segmented regular texts and social texts to generate the sentence word vectors {e_1, e_2, …, e_L} of the regular text and the sentence word vectors {e'_1, e'_2, …, e'_L} of the social text, where L represents the length of the sentence requiring dependency analysis.
Further, the base bilinear attention module performs the following steps:
Firstly, a bidirectional long short-term memory module is used to model the sentence, then a self-attention module is used to generate the dependency influence of the other words on the current word, then a multi-layer perceptron module is used to refine the generated word feature vectors, and finally a bilinear attention module generates an objective function over the dependency relationships among regular-text words for training.
Further, the stacked bilinear attention module performs the following steps:
Firstly, the refined word feature vectors of the base model are output, as part of the input, to a stacked neural network with the same structure as the base model, and the dependency relationships of the social text are then predicted.
Further, the joint decoding and training module performs the steps of:
Firstly, the base bilinear attention module and the stacked bilinear attention module are combined to form the whole deep dependency analysis network; then a beam search algorithm is used for decoding; the model is then trained by back-propagating gradients and iterating continuously until convergence; and finally a GPU is used to accelerate training in parallel.
The invention has the beneficial effects that:
the method uses a stacked neural network structure, uses regular text in a base neural network for pre-training to overcome the problem of sparse social text data, uses a global objective function for training and decoding to better consider global information, adds a self-attention mechanism on the basis of the original bidirectional L STM to better model the relationship among words, and uses the base layer and the stacked head and tail word feature vectors to better balance two layers of learning results when calculating the stacked neural network.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a block diagram of the base bilinear attention module on regular text;
FIG. 3 is a schematic diagram of a stacked bilinear attention module;
FIG. 4 is an exemplary diagram of a social text parsing tree.
Detailed Description
The first embodiment is as follows: this embodiment is described specifically with reference to FIG. 1. The social text dependency parsing system based on a deep neural network according to this embodiment comprises: a social text crawling module, a preprocessing module, a base bilinear attention module, a stacked bilinear attention module and a joint decoding and training module;
the social text crawling module is used for crawling social texts from social media websites;
the preprocessing module is used for filtering the obtained social texts and generating initialization word vectors;
the base bilinear attention module is used for pre-training with regular texts;
the stacked bilinear attention module is used for predicting social texts;
the joint decoding and training module is used for calculating an empirical risk function for the stacked bilinear attention module, adjusting parameters by back-propagating gradients to fit the training objective, and finally using GPU parallel computation to accelerate model training.
A. A social text crawling step: as a further description of the present invention, step A comprises the following steps:
A1, web page obtaining step: a web crawler is written in Python using Scrapy, including configuring the crawler, the main crawling module and data storage;
A2, text extraction step: the relevant content of the web page is extracted using the Python-based Goose library;
B. a preprocessing step: the text is filtered with a filtering algorithm, the filtered text is word-segmented, and initial word vectors are generated with a word vector training tool; as a further description of the present invention, step B comprises the following steps:
B1, text filtering step: the social texts are filtered using a language model tool;
B2, word segmentation and word vector training step: the selected texts are word-segmented and initial word vectors are trained;
C. a base bilinear attention step: the sequence is modeled with a bidirectional long short-term memory (LSTM) module, the influence of the other words on the current word is generated with a self-attention (Self-attention) module, the generated word feature vectors are refined with a multi-layer perceptron module, and finally bilinear attention (Bi-linear attention) generates an objective function over the dependency relationships among regular-text words for training; the base bilinear attention module is shown in FIG. 2. As a further explanation of the present invention, step C comprises the following steps:
C1, bidirectional long short-term memory step: in each word-related unit, the current word or the historical information is memorized or forgotten, so that both long-term and short-term memory can be handled;
C2, self-attention step: the self-attention mechanism models the soft alignment among words, compensating for the fact that the bidirectional LSTM only considers contextual information, and thus describes the relations between words better;
C3, multi-layer perceptron step: head and tail dependency vectors of the current word are generated through multi-layer nonlinear transformations, reflecting the feature description of the current word acting as a head or as a tail;
C4, bilinear attention step: the relationship between two words is computed through a bilinear attention mechanism, reflecting the dependency score between the current word and the other words;
D. a stacked bilinear attention step: the refined word feature vectors of the base model (Base Model) are output, as part of the input, to a stacked neural network with the same structure as the base model (a bidirectional long short-term memory module, a self-attention module, a multi-layer perceptron module, and finally bilinear attention generating an objective function over the dependency relationships among social-text words for training); the stacked bilinear attention is shown in FIG. 3. Step D comprises the following steps:
D1, stacked bidirectional long short-term memory step: a bidirectional long short-term memory layer is built on top of the feature vectors output by the base layer, considering not only the regular-text feature vectors of the base layer but also the word vectors of the current social text; owing to the particularity of social dependency analysis, special word vectors are used to represent ROOT (the root node of the dependency relationship) and EMP (a word with no dependency relationship), so as to capture the special dependency phenomena in social text where the head word is ROOT or there is no head word;
D2, stacked self-attention step: a self-attention layer is added on top of the stacked bidirectional long short-term memory step to describe the relations among social-text words, compensating for the fact that the bidirectional LSTM only considers local contextual information;
D3, stacked multi-layer perceptron step: head and tail word feature vectors are generated for the social-text words;
D4, stacked bilinear attention step: the relationship between two words is computed through a bilinear attention mechanism, reflecting the dependency score between the current word and the other words; besides the current head and tail word feature vectors, the head and tail word feature vectors generated by the base model are also included, so that feature information can be drawn from the base model;
E. a joint decoding and training step: the base bilinear attention module is trained first, a new module with the same structure as the base bilinear attention module is stacked on the trained result (the stacked module is not yet trained), and the stacked neural network is used for joint decoding during decoding. As a further explanation of the present invention, step E comprises the following steps:
E1, joint decoding step: steps A, B, C and D are combined to form the whole deep dependency analysis network, the objective function value is calculated, GPU parallel training is adopted to accelerate obtaining the dependency result of a given social-text sentence, a global beam search algorithm is adopted for decoding, and the previously generated dependency results are taken into account;
E2, back propagation step: the parameters are updated according to the calculated gradients and iterated until convergence.
Fig. 1 shows a block diagram of the system of the present invention, which is set forth in detail below:
step A1: the crawler was written using the individual components of Scapy. Defining data needing to be captured and post-processed by using a project module; configuring the script by using the configuration module file so as to modify a user-agent, set a crawling time interval, set an agent, configure various middleware and the like; the pipeline module is used for storing data needing to be processed in the later period, so that the crawling and the processing of the data are separated; the crawler is customized using a crawler module.
Step A2: messy characters and pictures on the web page are removed, and only the cleanly typeset text portion is retained; the relevant text content of the web page is located and extracted.
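A minimal sketch of this extraction step is given below; it assumes the goose3 fork of the Python Goose library, and the URL is an illustrative placeholder.

from goose3 import Goose

# Extract the main article text from a crawled page, discarding navigation,
# pictures and other clutter; only the cleaned text body is kept.
g = Goose()
article = g.extract(url="https://example-social-site.com/post/123")  # placeholder URL
print(article.title)
print(article.cleaned_text)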
Step B1: a language model is trained with the language model tool KenLM on the English regular-text corpus Gigaword; the language model is then used to calculate a score for each downloaded social text (the score reflects the fluency of the language), and texts with low scores are filtered out with a threshold.
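A minimal sketch of this filtering step is given below, assuming the kenlm Python bindings and a model file trained on Gigaword; the model path, the per-word normalization and the threshold value are illustrative assumptions rather than values stated in the patent.

import kenlm

model = kenlm.Model("gigaword.arpa")  # assumed path to the trained language model
THRESHOLD = -2.5                      # assumed per-word log10-probability cutoff

def keep(sentence):
    """Score a tokenized sentence and keep it only if it is fluent enough."""
    n_words = max(len(sentence.split()), 1)
    # model.score returns the total log10 probability of the sentence.
    per_word = model.score(sentence, bos=True, eos=True) / n_words
    return per_word >= THRESHOLD

# Stand-in for the texts produced by the crawling step.
crawled_social_texts = ["so happy todayyy :)", "asdf jkl qwerty zzz"]
filtered = [s for s in crawled_social_texts if keep(s)]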
Step B2: compared with regular texts, social texts exhibit some special linguistic phenomena, such as @-mentions, emoticons (Emoticon), URLs, #topics (Hashtag), retweets (Retweet) and abbreviations (Abbreviation); these are kept as independent tokens during word segmentation, while punctuation is separated as for regular text. The GloVe tool is then trained on the word-segmented regular texts and social texts to generate the sentence word-vector representation {e_1, e_2, …, e_L} of the regular text and the sentence word-vector representation {e'_1, e'_2, …, e'_L} of the social text, where L denotes the length of the sentence requiring dependency analysis.
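A minimal sketch of such a social-text tokenizer is given below; the regular expressions for the special tokens are illustrative assumptions, not the patent's actual word segmenter.

import re

# @-mentions, hashtags, URLs and simple emoticons are kept as single tokens;
# ordinary words and individual punctuation marks are split as in regular text.
TOKEN_PATTERN = re.compile(
    r"""(?:@\w+)              # @-mention
      | (?:\#\w+)             # hashtag
      | (?:https?://\S+)      # URL
      | (?:[:;=][\-]?[\)\(DPp])   # basic emoticon such as :) or :-(
      | (?:\w+(?:'\w+)?)      # ordinary word or abbreviation
      | (?:[^\w\s])           # punctuation kept as its own token
    """,
    re.VERBOSE,
)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("RT @alice: great result :) see https://example.com #nlp"))
# ['RT', '@alice', ':', 'great', 'result', ':)', 'see', 'https://example.com', '#nlp']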
Step C1: a bidirectional LSTM with peepholes is used (the long-term memory is taken into account when calculating the gates). It contains three gates: a forget gate f_t (controlling the long-term memory), an input gate i_t (controlling the short-term memory of the current word) and an output gate o_t (controlling the memory vector after weighted averaging). The basic process is described as follows:
Forget gate: f_t = σ(W_f·[C_{t-1}, h_{t-1}, e_t] + b_f)
Input gate: i_t = σ(W_i·[C_{t-1}, h_{t-1}, e_t] + b_i)
Output gate: o_t = σ(W_o·[C_{t-1}, h_{t-1}, e_t] + b_o)
where σ is the sigmoid function, taking values in [0,1] and acting as a weighting function; W_f, W_i, W_o are parameter matrices; C_{t-1} is the long-term memory passed from the previous time step; h_{t-1} is the hidden state vector of the previous time step; e_t is the pre-trained feature vector of the regular-text word; [,] denotes vector concatenation; b_f, b_i, b_o are bias vectors; the initial vectors h_0 and h_{L+1} are randomly initialized; and L denotes the length of the text.
Short-term memory vector of the current word: C̃_t = tanh(W_C·[h_{t-1}, e_t] + b_C)
Long-term memory vector of the current word: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
where tanh is the hyperbolic tangent function and ⊙ denotes the element-wise (Hadamard) product; the long-term memory vector C_t at the current time step is the weighted average of the long-term memory vector C_{t-1} of the previous time step and the short-term memory vector C̃_t of the current word, with the forget gate f_t and the input gate i_t as the weight vectors.
Forward hidden state vector of the current word: h_t^→ = o_t ⊙ tanh(C_t)
The backward hidden state vector h_t^← is generated in the same way as the forward hidden state vector h_t^→, except that the gate functions use the hidden state vector h_{t+1} of the next time step instead of the hidden state vector h_{t-1} of the previous time step. The complete hidden state vector h_t = [h_t^→; h_t^←] is obtained by concatenating the forward and backward hidden state vectors.
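A minimal PyTorch sketch of this sentence encoder is given below; the embedding and hidden dimensions are illustrative assumptions, and torch.nn.LSTM does not implement peephole connections, so it only approximates the peephole LSTM described above.

import torch
import torch.nn as nn

EMB_DIM, HIDDEN_DIM, L = 100, 200, 12   # assumed dimensions and sentence length

encoder = nn.LSTM(input_size=EMB_DIM, hidden_size=HIDDEN_DIM,
                  num_layers=1, batch_first=True, bidirectional=True)

e = torch.randn(1, L, EMB_DIM)          # stand-in for pre-trained word vectors e_1..e_L
h, _ = encoder(e)                       # forward and backward states concatenated per word
print(h.shape)                          # torch.Size([1, 12, 400])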
Step C2: the multi-head self-attention mechanism is described as follows:
Query vector: q_t = W_q·h_t
Keyword vector: k_t = W_k·h_t
Value vector: v_t = W_v·h_t
where h_t ∈ R^{d_model} (d_model is the dimension of the model vector) is the output of the previous step, and W_q, W_k, W_v ∈ R^{d_k×d_model} (d_k is the dimension of the query, keyword and value vectors) are parameter matrices; they apply linear transformations to the feature vector to generate the query, keyword and value vectors (different parameter matrices produce different representations of the same feature vector).
Attention weight: α_{tj} = softmax_j(q_t·k_j / √d_k)
where softmax_j denotes the probability normalized over the column index j, and √d_k adjusts the result according to the dimensionality.
Attention generation vector: c_t = Σ_j α_{tj}·v_j
which is a weighted average over the value vectors v_j.
Single-head attention generation matrix: C^h = [c_1; c_2; …; c_L]
where the single-head attention generation matrix C^h ∈ R^{L×d_k} is obtained by stacking the attention vectors c_t.
Multi-head attention generation matrix: C = [C^1, …, C^H]
Self-attention feature matrix: S = C·W_S
where H is the number of heads; each head uses its own parameter matrices and forms its own attention generation matrix C^h; the matrices generated by all heads are concatenated into C ∈ R^{L×(H·d_k)}, which is then linearly transformed with the parameter matrix W_S ∈ R^{(H·d_k)×d_model} to produce the self-attention feature matrix S ∈ R^{L×d_model}. The self-attention feature vector s_t of each word is one row of the matrix S.
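A minimal PyTorch sketch of this multi-head self-attention is given below; the dimensions and the number of heads are illustrative assumptions that follow the formulas above.

import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, d_k):
        super().__init__()
        self.d_k, self.H = d_k, n_heads
        self.W_q = nn.Linear(d_model, n_heads * d_k, bias=False)  # query projections
        self.W_k = nn.Linear(d_model, n_heads * d_k, bias=False)  # keyword projections
        self.W_v = nn.Linear(d_model, n_heads * d_k, bias=False)  # value projections
        self.W_s = nn.Linear(n_heads * d_k, d_model, bias=False)  # output projection W_S

    def forward(self, h):
        # h: (L, d_model) hidden states from the bidirectional LSTM.
        L = h.size(0)
        q = self.W_q(h).view(L, self.H, self.d_k).transpose(0, 1)   # (H, L, d_k)
        k = self.W_k(h).view(L, self.H, self.d_k).transpose(0, 1)
        v = self.W_v(h).view(L, self.H, self.d_k).transpose(0, 1)
        alpha = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d_k), dim=-1)
        c = alpha @ v                                               # per-head attention vectors
        c = c.transpose(0, 1).reshape(L, self.H * self.d_k)         # concatenate the heads
        return self.W_s(c)                                          # S: (L, d_model)

s = MultiHeadSelfAttention(d_model=400, n_heads=8, d_k=50)(torch.randn(12, 400))
print(s.shape)  # torch.Size([12, 400])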
Step C3: representations of the head and tail feature vectors are generated using a multi-layer perceptron (Multi-layer Perceptron):
Head word feature vector: r_t^(head) = MLP^(head)(s_t)
Tail word feature vector: r_t^(dep) = MLP^(dep)(s_t)
MLP^(head) and MLP^(dep) denote multi-layer nonlinear transformations (using the hyperbolic tangent function tanh); the two functions differ in the parameter matrices they use.
Step C4: the bilinear attention model uses a biaffine (Bi-affine) function to calculate the dependency score between the head and tail word feature vectors.
The dependency score is: score(i, j) = r_i^(head)·U·r_j^(dep) + w_head·r_i^(head) + w_dep·r_j^(dep)
where U is the transformation matrix between the two vectors, and w_head, w_dep are the head and tail parameter vectors.
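A minimal PyTorch sketch of the refinement and biaffine scoring is given below; the dimensions are illustrative assumptions, and the variable names mirror the symbols used above.

import torch
import torch.nn as nn

d_model, d_r, L = 400, 128, 12          # assumed dimensions and sentence length

mlp_head = nn.Sequential(nn.Linear(d_model, d_r), nn.Tanh())   # MLP^(head)
mlp_dep  = nn.Sequential(nn.Linear(d_model, d_r), nn.Tanh())   # MLP^(dep)
U        = nn.Parameter(torch.randn(d_r, d_r))                 # bilinear term
w_head   = nn.Parameter(torch.randn(d_r))                      # linear head term
w_dep    = nn.Parameter(torch.randn(d_r))                      # linear tail term

s = torch.randn(L, d_model)             # self-attention feature vectors s_1..s_L (stand-in)
r_head, r_dep = mlp_head(s), mlp_dep(s)
# score[i, j]: score of word i being the head of word j.
score = r_head @ U @ r_dep.T + (r_head @ w_head)[:, None] + (r_dep @ w_dep)[None, :]
print(score.shape)  # torch.Size([12, 12])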
Step D1: similar to step C1, a bidirectional LSTM with peepholes is used; the basic process is described as follows:
Forget gate: f'_t = σ(W'_f·[C'_{t-1}, h'_{t-1}, s_t, e'_t] + b'_f)
Input gate: i'_t = σ(W'_i·[C'_{t-1}, h'_{t-1}, s_t, e'_t] + b'_i)
Output gate: o'_t = σ(W'_o·[C'_{t-1}, h'_{t-1}, s_t, e'_t] + b'_o)
where s_t is the corresponding inter-word relationship vector generated by the base layer through the self-attention mechanism. The difference from step C1 is that the input considers not only the feature vector s_t of the previous layer but also the current social-text word feature vector e'_t (rather than the regular-text word feature vector e_t), and different parameter matrices W'_f, W'_i, W'_o and bias vectors b'_f, b'_i, b'_o are used.
Short-term memory vector of the current word: C̃'_t = tanh(W'_C·[h'_{t-1}, s_t, e'_t] + b'_C)
Calculating the short-term memory vector of the current word likewise takes the feature vector s_t of the previous layer into account.
Step D2: similar to step C2, except that the input vectors change from the hidden state vectors h_1, …, h_L to the stacked hidden state vectors h'_1, …, h'_L, producing the stacked self-attention feature vectors s'_t.
Step D3: similar to step C3, except that the head and tail vectors used here take into account not only those generated in the current layer but also those generated in the base layer:
Head vector: r'_t^(head) = MLP'^(head)(s'_t)
Tail vector: r'_t^(dep) = MLP'^(dep)(s'_t)
Combined head vector: r̄_t^(head) = r_t^(head) + r'_t^(head)
Combined tail vector: r̄_t^(dep) = r_t^(dep) + r'_t^(dep)
where + denotes addition of the corresponding dimensions.
Step D4: similar to step C4; the combined head and tail vectors are used to calculate the dependency score between two words.
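A minimal sketch of the vector combination in steps D3 and D4 is given below; the base-layer and stacked-layer head/tail vectors are stand-ins computed as in steps C3 and C4, and the element-wise addition mirrors the combination formulas above.

import torch

L, d_r = 12, 128                        # illustrative sentence length and vector size
r_head_base = torch.randn(L, d_r)       # r_t^(head) from the base model
r_dep_base = torch.randn(L, d_r)        # r_t^(dep) from the base model
r_head_stack = torch.randn(L, d_r)      # r'_t^(head) from the stacked layer
r_dep_stack = torch.randn(L, d_r)       # r'_t^(dep) from the stacked layer

# Corresponding dimensions are added to obtain the combined head and tail vectors,
# which then feed the stacked biaffine scorer exactly as in step C4.
r_head_combined = r_head_base + r_head_stack
r_dep_combined = r_dep_base + r_dep_stack
print(r_head_combined.shape, r_dep_combined.shape)  # torch.Size([12, 128]) torch.Size([12, 128])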
Step E1: the objective function adopts a maximum-margin ranking function (a structured hinge loss). The calculation formula is as follows:
J(Θ) = (1/N)·Σ_{i=1}^{N} L(x_i, y_i; Θ) + λ·‖Θ‖₂²
where the training data set D = {(x_i, y_i)}_{i=1}^{N} contains N pairs of input sentences x_i and gold annotated analysis trees y_i; the sentence-level loss L(x_i, y_i; Θ) is the margin violation of the highest-scoring candidate tree, with the margin given by c, the weighted Hamming distance between the candidate tree and the gold tree; ‖Θ‖₂² denotes the squared 2-norm of a parameter matrix or vector; λ is a weighting factor that balances the regularization term ‖Θ‖₂² against the objective L(x_i, y_i; Θ) to prevent overfitting; 1/N averages the loss L(x_i, y_i; Θ) over all sentences; and Θ is the set of parameters containing all parameters of the neural network during training.
The joint decoding process is divided into two parts: decoding of the pre-trained base model, whose basic formula searches for the N-best results, and decoding of the stacked model, which is performed on the basis of the base model.
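A minimal sketch of the sentence-level structured hinge loss is given below; the candidate trees are represented only by their total scores and weighted Hamming costs, and the beam search over trees is omitted, so the numbers are illustrative assumptions.

import torch

def structured_hinge(score_gold, candidate_scores, hamming_costs):
    """Margin violation: max over candidates of (score + cost) minus the gold score."""
    violation = torch.max(candidate_scores + hamming_costs) - score_gold
    return torch.clamp(violation, min=0.0)

# Toy usage: three candidate trees with their total scores and weighted Hamming costs.
loss = structured_hinge(score_gold=torch.tensor(7.2),
                        candidate_scores=torch.tensor([6.9, 7.5, 5.0]),
                        hamming_costs=torch.tensor([2.0, 1.0, 3.0]))
print(loss)  # tensor(1.7000): the first candidate violates the margin the most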
FIG. 4 shows the result of the dependency parsing (ROOT denotes the root word; a word without edges denotes an unselected special word; for two words connected by an edge, the arrow points to the head word and the other end is the tail word).
Step E2: mini-batch updating combines the fast convergence of stochastic updating with the stability of full-batch updating. Adam (Adaptive Moment Estimation) adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient. The advantage of Adam is that, after bias correction, the learning rate of each iteration stays within a definite range, so the parameters remain relatively stable.
g_t = ∇J(W_t)
m_0 = 0, n_0 = 0
m_t = μ·m_{t-1} + (1−μ)·g_t
n_t = ν·n_{t-1} + (1−ν)·g_t²
m̂_t = m_t / (1 − μ^t)
n̂_t = n_t / (1 − ν^t)
W_{t+1} = W_t − η·m̂_t / (√n̂_t + ε)
where g_t denotes the gradient of the objective function J with respect to the parameter W_t at time step t (W_t may be a matrix or a vector, depending on the particular parameter). The algorithm updates an exponential moving average of the gradient (m_t) and an exponential moving average of the squared gradient (n_t), where the hyperparameters μ, ν ∈ [0,1] control the exponential decay rates of these moving averages. The moving averages estimate the first moment (the mean) and the second raw moment (the uncentered variance) of the gradient. However, because the moving averages m_0, n_0 are initialized as zero vectors, the moment estimates are biased toward zero, particularly during the initial time steps and especially when the decay rates are small (i.e. μ, ν approach 1). This initialization bias is easily counteracted, yielding the bias-corrected estimates m̂_t and n̂_t, which estimate the first and second moments of the gradient without bias. The exponential decay rate of the first moment is μ = 0.9, the exponential decay rate of the second moment is ν = 0.999, the smoothing parameter is ε = 1e-08, and the learning rate is η = 0.001; the training parameters are sampled from a uniform distribution over the interval [−0.1, 0.1]; dropout is set to 0.5; and the mini-batch size is set to 10. Multiplication between a vector or matrix and a scalar denotes the product of each element with the scalar; ⊙ denotes the product of corresponding elements between vectors or matrices; division of a vector or matrix by a scalar divides each element by the scalar, and division of a vector or matrix by a vector or matrix denotes division of corresponding elements.
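A minimal NumPy sketch of this Adam update is given below, using the hyperparameters listed above (μ = 0.9, ν = 0.999, ε = 1e-08, η = 0.001); the quadratic toy objective and its gradient are illustrative assumptions.

import numpy as np

mu, nu, eps, eta = 0.9, 0.999, 1e-08, 0.001

def adam_step(W, m, n, g, t):
    """One Adam update of parameter W given gradient g at time step t (t >= 1)."""
    m = mu * m + (1 - mu) * g                 # exponential moving average of the gradient
    n = nu * n + (1 - nu) * g ** 2            # exponential moving average of the squared gradient
    m_hat = m / (1 - mu ** t)                 # bias-corrected first moment
    n_hat = n / (1 - nu ** t)                 # bias-corrected second moment
    W = W - eta * m_hat / (np.sqrt(n_hat) + eps)
    return W, m, n

# Toy usage: minimize J(W) = ||W||^2 / 2, whose gradient is simply W.
W = np.random.uniform(-0.1, 0.1, size=5)      # parameters initialized as in the patent
m, n = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 1001):
    g = W                                     # gradient of the toy objective
    W, m, n = adam_step(W, m, n, g, t)
print(np.round(W, 4))                         # all entries approach 0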
The recurrent part of the deep network in this patent adopts the BPTT (Back-Propagation Through Time) algorithm, which is basically the same as the traditional back-propagation algorithm, except that the internal parameters of each hidden unit and the connection parameters between hidden units are shared across time steps, so the gradients of every time step must be accumulated when updating these parameters.
It should be noted that the detailed description is only intended to illustrate and explain the technical solution of the present invention, and does not thereby limit the scope of protection of the claims. It is intended that all such modifications and variations be included within the scope of the invention as defined by the following claims and the description.

Claims (7)

1. A social text dependency syntactic analysis system based on a deep neural network, characterized by comprising: a social text crawling module, a preprocessing module, a base bilinear attention module, a stacked bilinear attention module and a joint decoding and training module;
the social text crawling module is used for crawling social texts from social media websites;
the preprocessing module is used for filtering the crawled social texts and generating initialization word vectors;
the base bilinear attention module is used for pre-training with regular texts;
the stacked bilinear attention module is used for predicting social texts;
the joint decoding and training module is used for performing joint decoding and training on the base bilinear attention module and the stacked bilinear attention module, adjusting parameters by back-propagating gradients to fit the training objective, and finally using GPU (Graphics Processing Unit) parallel computation to accelerate decoding and training of the model.
2. The deep neural network-based social text dependency parsing system of claim 1, wherein the social text crawling module performs the steps of:
firstly, a web crawler is written in Python using Scrapy, Scrapy is configured (including setting the crawl time interval and the proxy), and then the relevant text content of the web page is located and extracted.
3. The deep neural network-based social text dependency parsing system of claim 2, wherein the specific steps of filtering in the preprocessing module are:
firstly, a language model is trained with the language model tool KenLM on the English regular-text corpus Gigaword, and then the language model is used to score the downloaded social texts, which are filtered with a threshold.
4. The deep neural network-based social text dependency parsing system of claim 2, wherein the specific steps of generating the initialization word vectors in the preprocessing module are:
firstly, the GloVe tool is trained on the word-segmented regular texts and social texts to generate the sentence word vectors {e_1, e_2, …, e_L} of the regular text and the sentence word vectors {e'_1, e'_2, …, e'_L} of the social text, where L represents the length of the sentence requiring dependency analysis.
5. The deep neural network-based social text dependency parsing system of claim 3, wherein the base bilinear attention module performs the steps of:
firstly, a bidirectional long short-term memory module is used to model the sentence, then a self-attention module is used to generate the dependency influence of the other words on the current word, then a multi-layer perceptron module is used to refine the generated word feature vectors, and finally a bilinear attention module generates an objective function over the dependency relationships among regular-text words for training.
6. The deep neural network-based social text dependency parsing system of claim 5, wherein the stacked bilinear attention module performs the steps of:
firstly, the refined word feature vectors of the base model are output, as part of the input, to a stacked neural network with the same structure as the base model, and the dependency relationships of the social text are then predicted.
7. The deep neural network-based social text dependency parsing system of claim 6, wherein the joint decoding and training module performs the steps of:
firstly, the base bilinear attention module and the stacked bilinear attention module are combined to form the whole deep dependency analysis network; then a beam search algorithm is used for decoding; the model is then trained by back-propagating gradients and iterating continuously until convergence; and finally a GPU is used to accelerate training in parallel.
CN202010193329.XA 2020-03-18 2020-03-18 Social text dependency syntactic analysis system based on deep neural network Active CN111414749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010193329.XA CN111414749B (en) 2020-03-18 2020-03-18 Social text dependency syntactic analysis system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010193329.XA CN111414749B (en) 2020-03-18 2020-03-18 Social text dependency syntactic analysis system based on deep neural network

Publications (2)

Publication Number Publication Date
CN111414749A true CN111414749A (en) 2020-07-14
CN111414749B (en) 2022-06-21

Family

ID=71491131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010193329.XA Active CN111414749B (en) 2020-03-18 2020-03-18 Social text dependency syntactic analysis system based on deep neural network

Country Status (1)

Country Link
CN (1) CN111414749B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984845A (en) * 2020-08-17 2020-11-24 江苏百达智慧网络科技有限公司 Website wrongly-written character recognition method and system
CN112347269A (en) * 2020-11-11 2021-02-09 重庆邮电大学 Method for recognizing argument pairs based on BERT and Att-BilSTM
CN112667940A (en) * 2020-10-15 2021-04-16 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
WO2021147404A1 (en) * 2020-07-30 2021-07-29 平安科技(深圳)有限公司 Dependency relationship classification method and related device
CN113254636A (en) * 2021-04-27 2021-08-13 上海大学 Remote supervision entity relationship classification method based on example weight dispersion
CN113901847A (en) * 2021-09-16 2022-01-07 昆明理工大学 Neural machine translation method based on source language syntax enhanced decoding
CN116090450A (en) * 2022-11-28 2023-05-09 荣耀终端有限公司 Text processing method and computing device


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180129931A1 (en) * 2016-11-04 2018-05-10 Salesforce.Com, Inc. Quasi-recurrent neural network based encoder-decoder model
CN109034368A (en) * 2018-06-22 2018-12-18 北京航空航天大学 A kind of complex device Multiple Fault Diagnosis Method based on DNN
CN110162749A (en) * 2018-10-22 2019-08-23 哈尔滨工业大学(深圳) Information extracting method, device, computer equipment and computer readable storage medium
CN109598387A (en) * 2018-12-14 2019-04-09 华东师范大学 Forecasting of Stock Prices method and system based on two-way cross-module state attention network model
CN109885670A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of interaction attention coding sentiment analysis method towards topic text
CN110276439A (en) * 2019-05-08 2019-09-24 平安科技(深圳)有限公司 Time Series Forecasting Methods, device and storage medium based on attention mechanism
CN110879940A (en) * 2019-11-21 2020-03-13 哈尔滨理工大学 Machine translation method and system based on deep neural network
CN111818329A (en) * 2020-06-24 2020-10-23 天津大学 Video quality evaluation method based on stack type adaptive encoder
CN112084769A (en) * 2020-09-14 2020-12-15 深圳前海微众银行股份有限公司 Dependency syntax model optimization method, device, equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIMOTHY DOZAT: "Deep Biaffine Attention for Neural Dependency Parsing", Computation and Language *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021147404A1 (en) * 2020-07-30 2021-07-29 平安科技(深圳)有限公司 Dependency relationship classification method and related device
CN111984845A (en) * 2020-08-17 2020-11-24 江苏百达智慧网络科技有限公司 Website wrongly-written character recognition method and system
CN111984845B (en) * 2020-08-17 2023-10-31 江苏百达智慧网络科技有限公司 Website wrongly written word recognition method and system
CN112667940A (en) * 2020-10-15 2021-04-16 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
CN112667940B (en) * 2020-10-15 2022-02-18 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
CN112347269A (en) * 2020-11-11 2021-02-09 重庆邮电大学 Method for recognizing argument pairs based on BERT and Att-BilSTM
CN113254636A (en) * 2021-04-27 2021-08-13 上海大学 Remote supervision entity relationship classification method based on example weight dispersion
CN113901847A (en) * 2021-09-16 2022-01-07 昆明理工大学 Neural machine translation method based on source language syntax enhanced decoding
CN113901847B (en) * 2021-09-16 2024-05-24 昆明理工大学 Neural machine translation method based on source language syntax enhancement decoding
CN116090450A (en) * 2022-11-28 2023-05-09 荣耀终端有限公司 Text processing method and computing device

Also Published As

Publication number Publication date
CN111414749B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN111414749B (en) Social text dependency syntactic analysis system based on deep neural network
Shen et al. Disan: Directional self-attention network for rnn/cnn-free language understanding
US11620515B2 (en) Multi-task knowledge distillation for language model
Liu et al. Multi-timescale long short-term memory neural network for modelling sentences and documents
US10776581B2 (en) Multitask learning as question answering
Neubig Neural machine translation and sequence-to-sequence models: A tutorial
CN108733742B (en) Global normalized reader system and method
Wu et al. On multiplicative integration with recurrent neural networks
Zhao et al. Attention-Based Convolutional Neural Networks for Sentence Classification.
US10339440B2 (en) Systems and methods for neural language modeling
US11568266B2 (en) Systems and methods for mutual learning for topic discovery and word embedding
CN110879940B (en) Machine translation method and system based on deep neural network
WO2019083812A1 (en) Generating dual sequence inferences using a neural network model
Trask et al. Modeling order in neural word embeddings at scale
Li et al. A method of emotional analysis of movie based on convolution neural network and bi-directional LSTM RNN
Bajaj et al. Metro: Efficient denoising pretraining of large scale autoencoding language models with model generated signals
Chen et al. Deep neural networks for multi-class sentiment classification
Heigold et al. Neural morphological tagging from characters for morphologically rich languages
US20230351149A1 (en) Contrastive captioning neural networks
CN113157919A (en) Sentence text aspect level emotion classification method and system
Zhang et al. Feedforward sequential memory neural networks without recurrent feedback
Aggarwal et al. Recurrent neural networks
Cao et al. Stacked residual recurrent neural network with word weight for text classification
Wu et al. An empirical exploration of skip connections for sequential tagging
Artemov et al. Informational neurobayesian approach to neural networks training. Opportunities and prospects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant