CN111930952A - Method, system, equipment and storage medium for long text cascade classification
- Publication number: CN111930952A
- Application number: CN202010991960.4A
- Authority: CN (China)
- Prior art keywords: interval, long text, vector, semantic, text
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/30: Handling natural language data; Semantic analysis
Abstract
The invention relates to a long text cascade classification method comprising the following steps. S1: preprocess the input long text data with a sliding window mechanism and divide it into a plurality of intervals. S2: semantically encode the text of each interval to obtain a local semantic vector for each interval. S3: extract keywords from the interval texts, encode them into keyword vectors, and splice the keyword vectors with the local semantic vectors to obtain the overall semantic vector of the long text. S4: reduce the dimension of the overall semantic vector, and compute the class label probability distribution of the reduced vector with a classifier. S5: train a classification model of the long text corpus according to steps S1-S4. The technical scheme provided by the invention can improve the performance of the underlying classification model to promote the development of other intelligent services in the vertical field, thereby improving user experience with, and stickiness of, intelligent products.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method, a system, equipment and a storage medium for long text cascade classification.
Background
With the continuous popularization of artificial intelligence technology, industries in the vertical fields are developing rapidly, and novel AI products keep emerging.
Products in the legal vertical field, such as case retrieval, regulation retrieval, and intelligent consulting robots, all require the support of intelligent semantic analysis technology. This technology comprises a number of algorithm models, such as element extraction, relation extraction, and complex event extraction models in the legal field, and all of these models rely on a classification model as their underlying support. The performance of this underlying layer is closely related to the application performance of the upper-layer models. However, part of the open corpora in vertical fields such as law take the form of long text; taking adjudication documents as an example, the minimum document length is between 500 and 1,000 words.
For such long texts, existing methods in vertical fields shorten the text length by truncation, segmentation, or summarization, and these methods generally suffer from two defects:
(1) Every term in a document interacts with every other term during computation, resulting in high complexity, especially when the document length exceeds a threshold.
(2) The different segment texts obtained by splitting cannot interact across segments, so a large amount of key information may be lost.
These two defects greatly limit the performance of existing classification models in vertical fields and the application range of the upper-layer models.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method, a system, a device, and a storage medium for long text cascade classification, which can improve the performance of the underlying classification model to promote the development of other intelligent services in the vertical field, and further improve user experience with, and stickiness of, intelligent products.
The technical scheme of the invention is as follows:
A long text cascade classification method comprises the following steps:
S1: preprocessing input long text data by using a sliding window mechanism, and dividing the long text data into a plurality of intervals;
S2: semantic coding is carried out on the generated texts in each interval, and a local semantic vector of each interval is obtained;
S3: extracting keywords from the interval text, coding the keywords into keyword vectors, and respectively performing vector splicing on the keyword vectors and the local semantic vectors to obtain an overall semantic vector of the long text;
S4: reducing the dimension of the overall semantic vector, and calculating the probability distribution of class labels of the overall semantic vector after dimension reduction by using a classifier;
S5: according to the steps S1-S4, a classification model of the long text corpus is trained.
Preferably, the preprocessing in step S1 includes the specific steps of:
S1.1: the long text is segmented according to the specified interval length d to obtain the first interval s1;
S1.2: on the basis of step S1.1, subsequent interval segmentation of the remaining text is performed according to the specified interval length d and the step length overlap, obtaining the intervals s2 to sk; the text is divided into k intervals in total.
Preferably, the specific steps of acquiring the local semantic vectors in step S2 are:
S2.1: performing token preprocessing on the texts of all the obtained intervals;
S2.2: performing semantic coding, through an encoder, on the token set of each interval obtained after token preprocessing, to obtain the corresponding interval semantic vector;
S2.3: extracting the corresponding interval global feature from each interval semantic vector to form the local semantic vector of the whole long text on each interval.
Preferably, the encoder is a BERT encoder. In the BERT mechanism, the vector corresponding to the '[CLS]' token contains the global feature of the interval; step S2.3 extracts the interval global feature corresponding to '[CLS]' to form the local semantic vector of the whole long text on each interval.
Preferably, the specific process of acquiring the overall semantic vector in step S3 is as follows:
S3.1: extracting the topic keywords corresponding to each interval of the long text by using an LDA algorithm based on Gibbs sampling, and obtaining the keyword vector θ corresponding to the topic keywords through an encoder;
S3.2: splicing and summing the topic keyword vectors corresponding to each interval according to the following formula:

θi = θi1 + θi2 + ... + θij

where j is the number of keywords extracted from the ith interval;
and splicing the local semantic vector on each interval with its keyword vector according to the following formula:

si = [Vi ; θi]

S3.3: an attention mechanism is introduced to strengthen the dependence of the long text on the topic keyword features of each interval, so as to enhance the overall semantic expression of the classification features; the specific calculation formulas are:

ui = tanh(Wi · si + bi)
ai = exp(ui) / Σm exp(um)
f = Σi ai · si

where Wi and bi represent the weight matrix and the bias learned during training; ui, acting as a scoring function, measures the importance of the topic keyword features of the ith interval si; ai is the weight of the interval vector si computed by the softmax formula; and f, the attention-weighted vector over the intervals si, is the overall semantic vector of the long text.
Preferably, the specific calculation processes of the dimension reduction and the probability distribution in step S4 are:
S4.1: condensing and reducing the dimension of the overall semantic vector f of the long text through two fully connected layers:

f1 = σ(Wf1 · f + bf1)
f2 = Wf2 · f1 + bf2

where σ is an activation function, converting the overall semantic vector f of the long text into the low-dimensional condensed vector f2;
S4.2: for the low-dimensional condensed vector f2, calculating the probability distribution of the class labels by using softmax:

p = softmax(f2)
the invention provides a long text cascade classification system, which utilizes a long text cascade classification method based on a sliding window mechanism and is characterized by comprising the following steps:
the sliding window module: used for cutting the input long text to form a plurality of sections;
the coding module: the local semantic coding is carried out on the text in each interval to obtain a local semantic vector;
a global interaction module: acquiring an integral semantic vector of the long text according to the local semantic vector and the text in each interval;
a classification module: and the method is used for classifying the whole semantic vector to obtain the probability distribution of the class labels.
The invention also provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the long text cascade classification method when executing the computer program.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the long text cascade classification method.
The invention has the following beneficial effects:
1. the technical scheme promotes the development of other intelligent services in the vertical field by improving the performance of the underlying classification model, thereby improving user experience with, and stickiness of, intelligent products;
2. the technical scheme introduces an attention mechanism to strengthen the dependence of the long text on the interval topic keyword features, further enhancing the overall semantic expression of the classification features;
3. the system provided by the invention trains on long text corpora through the sliding window module, the encoding module, the global interaction module, and the classification module, which improves the model's ability to capture the local semantic features and global dependency features of long texts and in turn promotes the performance of upstream tasks.
Drawings
Fig. 1 is an overall architecture diagram of a method provided in an embodiment of the invention.
Fig. 2 is a diagram illustrating a sliding window module according to an embodiment of the present invention.
Fig. 3 is a detailed flowchart of keyword extraction in the global interaction module provided in an embodiment of the invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention provides a long text cascade classification method, including the following steps:
1. and preprocessing the input long text data.
1) First segmentation.
The long text is segmented according to the specified interval length d to obtain the first interval s1. Taking the legal field as an example, as shown in fig. 2, the first interval obtained is: [Zhang Xiulian, because of family trivia, .... creditor's rights and debts].
2) Continued segmentation.
On the basis of the first segmentation, subsequent interval segmentation of the input text is performed according to the specified interval length d and the step length overlap, obtaining all the intervals.
Taking the legal field as an example, as shown in fig. 2, continued segmentation yields the second interval s2, which begins with the guarantee-related text; the distance between the position of the first character of s2 and the position of the first character of s1 ('Zhang') is exactly the step length overlap.
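The sliding-window preprocessing can be sketched in Python as follows; the function name, the character-level slicing, and the treatment of overlap as the stride are illustrative assumptions consistent with the description above, not code from the patent:

```python
def sliding_window_split(text: str, d: int, overlap: int) -> list[str]:
    """Cut `text` into intervals of length d, advancing by the step length `overlap`."""
    # Step 1): the first interval s1 is the leading d characters.
    intervals = [text[:d]]
    # Step 2): each subsequent interval starts `overlap` characters
    # after the previous one, until the remaining text is exhausted.
    start = overlap
    while start < len(text):
        intervals.append(text[start:start + d])
        start += overlap
    return intervals  # k intervals in total

# Example: a 1200-character document with d=500 and overlap=400 yields 3 intervals.
```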
2. Perform semantic coding on the texts of all the intervals obtained in step 1.
1) Token preprocessing.
Token preprocessing is performed on the texts of all the obtained intervals. Since BERT is a bidirectional encoder with global semantic association features, the BERT mechanism is used here for illustration; other BERT-like models can be used as well.
Taking the legal field as an example, after the token operation the interval text becomes a character-level token sequence that begins with '[CLS]' and ends with '[SEP]', e.g. ['[CLS]', 'Zhang', 'Xiu', 'Lian', 'because', 'of', 'family', 'trivia', ..., '[SEP]'].
2) Independent semantic encoding of each interval.
The token set of each interval obtained after token preprocessing is semantically encoded by a BERT (Bidirectional Encoder Representations from Transformers) encoder to obtain the corresponding interval semantic vector. For example, the intervals s1 and s2 are encoded by BERT to obtain the semantic vectors V1 and V2.
3) Local semantic extraction.
In the BERT mechanism, the vector corresponding to '[CLS]' contains the global feature of the interval, so for each interval semantic vector V1, V2, ......, Vk, the interval global feature corresponding to '[CLS]' is extracted to form the local semantic vector of the whole long text on each interval.
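A minimal sketch of this interval encoding with the HuggingFace transformers library; the bert-base-chinese checkpoint is an assumption (the patent only specifies a BERT encoder), and the '[CLS]' output is taken as each interval's local semantic vector:

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; any BERT-like encoder would fit the description.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_intervals(intervals: list[str]) -> torch.Tensor:
    """Return one local semantic vector per interval: the '[CLS]' output vector."""
    vectors = []
    for text in intervals:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = encoder(**inputs)
        # Position 0 of the last hidden state is the '[CLS]' token,
        # which carries the interval's global feature.
        vectors.append(outputs.last_hidden_state[0, 0])
    return torch.stack(vectors)  # shape: (k, hidden_size)
```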
3. Acquire the overall semantic vector of the long text from the local semantic vectors.
1) Keyword extraction.
The topic keywords corresponding to each interval of the long text are extracted by an LDA algorithm based on Gibbs sampling. For example, fig. 2 illustrates that the topic keywords extracted from the first interval of the legal long text are ['family trivia', 'guarantee']. The keyword vector θ corresponding to the topic keywords is then obtained through a BERT encoder.
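A sketch of per-interval topic keyword extraction; the PyPI lda package (which implements collapsed Gibbs sampling, matching the description) and sklearn's CountVectorizer are assumptions, and Chinese text would additionally need word segmentation (e.g. jieba) before vectorization:

```python
import numpy as np
import lda  # PyPI "lda" package: LDA via collapsed Gibbs sampling
from sklearn.feature_extraction.text import CountVectorizer

def interval_topic_keywords(intervals: list[str], n_topics: int = 5, top_n: int = 2):
    """Fit LDA over the interval texts and return top_n topic keywords per interval."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(intervals).toarray().astype(np.int64)
    model = lda.LDA(n_topics=n_topics, n_iter=500, random_state=0)
    model.fit(counts)
    vocab = np.array(vectorizer.get_feature_names_out())
    keywords = []
    for doc_topic in model.doc_topic_:
        topic = doc_topic.argmax()  # dominant topic of this interval
        best = np.argsort(model.topic_word_[topic])[::-1][:top_n]
        keywords.append(list(vocab[best]))  # highest-probability words of that topic
    return keywords
```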
2) Vector splicing.
The topic keyword vectors corresponding to each interval are spliced and summed according to the following formula:

θi = θi1 + θi2 + ... + θij

where j is the number of keywords extracted from the ith interval.
Each interval's local semantic vector is then spliced with its keyword vector:

si = [Vi ; θi]
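In tensor form this splicing is a single concatenation; a sketch assuming V holds the k local semantic vectors and theta the k summed keyword vectors:

```python
import torch

k, hidden = 3, 768
V = torch.randn(k, hidden)      # local semantic vectors V1..Vk (dummy values)
theta = torch.randn(k, hidden)  # summed keyword vectors θ1..θk (dummy values)

# s_i = [V_i ; θ_i] for all intervals at once:
s = torch.cat([V, theta], dim=-1)  # shape: (k, 2 * hidden)
```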
3) Attention interaction calculation.
Not all words contribute equally to the semantic expression of a sentence; likewise, the texts of the different intervals contribute differently to the overall semantic expression of the long text. An attention mechanism is therefore introduced to strengthen the dependence of the long text on the topic keyword features of each interval and further enhance the overall semantic expression of the classification features. The specific calculation is as follows:

ui = tanh(Wi · si + bi)
ai = exp(ui) / Σm exp(um)
f = Σi ai · si

where Wi and bi represent the weight matrix and the bias learned during training; ui, analogous to a scoring function, measures the importance of the topic keyword features of the ith interval si; ai is the weight of the interval vector si computed by the softmax formula; and f is the final attention-weighted vector over the intervals si.
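A minimal PyTorch sketch of this attention weighting over the spliced interval vectors; the module layout is an illustrative assumption built from the formulas above:

```python
import torch
import torch.nn as nn

class IntervalAttention(nn.Module):
    """Attention over k interval vectors s_i, yielding the overall vector f."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # realizes u_i = tanh(W_i · s_i + b_i)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        u = torch.tanh(self.score(s))   # (k, 1) importance scores u_i
        a = torch.softmax(u, dim=0)     # (k, 1) softmax weights a_i
        return (a * s).sum(dim=0)       # (dim,) weighted sum f

# Usage: f = IntervalAttention(dim=2 * 768)(s) for spliced vectors s of shape (k, 2*768).
```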
4. Perform dimension reduction and condensation on the overall semantic vector of the long text, and obtain the corresponding category label.
1) Two fully connected layers.
After the overall semantic vector f of the long text is obtained, it is condensed and dimension-reduced according to the following formulas:

f1 = σ(Wf1 · f + bf1)
f2 = Wf2 · f1 + bf2

where σ is an activation function; the two weight matrices Wf1 and Wf2 convert f into the low-dimensional condensed vector f2.
2) Softmax class probability distribution calculation.
The class label probability distribution of the low-dimensional vector f2 is calculated as:

p = softmax(f2)
taking the legal field as an example, if the labels corresponding to the legal long text are 4 types of divorce disputes, divorce property disputes, labor and wound disputes and house property trade disputes, the probability distribution p available for the example text of fig. 2 is { 'divorce dispute': 0.2, 'divorce property disputes': 0.6, 'labor and industrial injury dispute': 0.1, 'house property trade dispute': 0.1}. According to the maximum probability principle, the finally obtained prediction category of the example text is divorce property disputes.
5. Model training is completed according to the method of the preceding four steps to obtain the classification model of the long text corpus. This improves the model's ability to capture the local semantic features and global dependency features of long text corpora, thereby promoting the performance of upstream tasks.
The method is not limited to the legal field; it is equally applicable to other fields involving long texts, such as medicine and finance.
Based on the above long text cascade classification method, the embodiment of the present invention also provides a long text cascade classification system comprising a sliding window module for sliding-window segmentation of the input long text, an encoding module for local semantic encoding of the windowed texts, a global interaction module for interacting global information, and a classification module for classifying the global information, as shown in fig. 1.
The sliding window module cuts the input long text into a plurality of interval texts by using the sliding window mechanism; the encoding module encodes the local semantics of the segmented interval texts to obtain interval semantic features; the global interaction module integrates global attention over the interval semantic features to obtain global features; and the classification module classifies the global features to obtain the predicted category, as sketched below.
The invention also provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the long text cascade classification method when executing the computer program.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the long text cascade classification method.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions of some technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (9)
1. A long text cascade classification method is characterized by comprising the following steps:
S1: preprocessing input long text data by using a sliding window mechanism, and dividing the long text data into a plurality of intervals;
S2: semantic coding is carried out on the generated texts in each interval, and a local semantic vector of each interval is obtained;
S3: extracting keywords from the interval text, coding the keywords into keyword vectors, and respectively performing vector splicing on the keyword vectors and the local semantic vectors to obtain an overall semantic vector of the long text;
S4: reducing the dimension of the overall semantic vector, and calculating the probability distribution of class labels of the overall semantic vector after dimension reduction by using a classifier;
S5: according to the steps S1-S4, a classification model of the long text corpus is trained.
2. The method for classifying long text cascades of claim 1, wherein the specific steps of preprocessing in step S1 are:
S1.1: the long text is segmented according to the specified interval length d to obtain the first interval s1;
S1.2: on the basis of step S1.1, subsequent interval segmentation of the remaining text is performed according to the specified interval length d and the step length overlap, obtaining the intervals s2 to sk; the text is divided into k intervals in total.
3. The method for classifying long texts in cascade according to claim 1, wherein the step S2 of obtaining local semantic vectors includes the following specific steps:
S2.1: performing token preprocessing on the texts of all the obtained intervals;
S2.2: performing semantic coding, through an encoder, on the token set of each interval obtained after token preprocessing, to obtain the corresponding interval semantic vector;
S2.3: extracting the corresponding interval global feature from each interval semantic vector to form the local semantic vector of the whole long text on each interval.
4. The method of claim 3, wherein the encoder is a BERT encoder; in the BERT mechanism, the vector corresponding to the '[CLS]' token contains the global feature of the interval, and step S2.3 extracts the interval global feature corresponding to '[CLS]' to form the local semantic vector of the whole long text on each interval.
5. The method for classifying long text cascades of claim 1, wherein the specific process of obtaining the overall semantic vector in step S3 is as follows:
S3.1: extracting the topic keywords corresponding to each interval of the long text by using an LDA algorithm based on Gibbs sampling, and obtaining the keyword vector θ corresponding to the topic keywords through an encoder;
S3.2: splicing and summing the topic keyword vectors corresponding to each interval according to the following formula:

θi = θi1 + θi2 + ... + θij

where j is the number of keywords extracted from the ith interval;
and splicing the local semantic vector on each interval with its keyword vector according to the following formula:

si = [Vi ; θi]

S3.3: an attention mechanism is introduced to strengthen the dependence of the long text on the topic keyword features of each interval, so as to enhance the overall semantic expression of the classification features; the specific calculation formulas are:

ui = tanh(Wi · si + bi)
ai = exp(ui) / Σm exp(um)
f = Σi ai · si

where Wi and bi represent the weight matrix and the bias learned during training; ui, acting as a scoring function, measures the importance of the topic keyword features of the ith interval si; ai is the weight of the interval vector si computed by the softmax formula; and f, the attention-weighted vector over the intervals si, is the overall semantic vector of the long text.
6. The method for classifying long text cascades according to claim 1, wherein the specific calculation processes of dimension reduction and probability distribution in the step S4 are as follows:
S4.1: condensing and reducing the dimension of the overall semantic vector f of the long text through two fully connected layers:

f1 = σ(Wf1 · f + bf1)
f2 = Wf2 · f1 + bf2

where σ is an activation function, converting the overall semantic vector f of the long text into the low-dimensional condensed vector f2;
S4.2: for the low-dimensional condensed vector f2, calculating the probability distribution of the class labels by using softmax:

p = softmax(f2)
7. A long text cascade classification system using the long text classification method based on the sliding window mechanism of any one of claims 1 to 6, comprising:
a sliding window module, for cutting the input long text into a plurality of intervals;
an encoding module, for performing local semantic encoding on the text of each interval to obtain local semantic vectors;
a global interaction module, for obtaining the overall semantic vector of the long text from the local semantic vectors and the texts of the intervals;
a classification module, for classifying the overall semantic vector to obtain the probability distribution of the class labels.
8. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the computer program is executed by the processor.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010991960.4A | 2020-09-21 | 2020-09-21 | Method, system, equipment and storage medium for long text cascade classification |

Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111930952A | 2020-11-13 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20201113 |