CN114416981A - Long text classification method, device, equipment and storage medium

Long text classification method, device, equipment and storage medium

Info

Publication number
CN114416981A
Authority
CN
China
Prior art keywords
text
long
feature vector
classification
blocks
Prior art date
Legal status
Pending
Application number
CN202111677818.3A
Other languages
Chinese (zh)
Inventor
王得贤
李长亮
Current Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202111677818.3A
Publication of CN114416981A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Abstract

An embodiment of the invention provides a method, a device, equipment, and a storage medium for classifying long texts. The method includes: acquiring a long text to be classified and dividing it into a plurality of text blocks; encoding each text block to obtain a corresponding feature vector; fusing the feature vectors of the text blocks to obtain a target feature vector for the long text; and, based on the target feature vector, performing classification with a preset classification model to obtain the classification result of the long text, where the preset classification model is trained on the target feature vectors corresponding to sample texts and the classification results corresponding to those sample texts. Embodiments of the invention can improve the efficiency and accuracy of text classification.

Description

Long text classification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for classifying long texts.
Background
With the development of artificial intelligence, Natural Language Processing (NLP) is widely used in many scenarios, such as sentiment analysis, text similarity calculation, review opinion extraction, text classification, and lexical analysis.
Long text classification is an important application of natural language processing. Traditional long text classification methods require the feature information of the long text to be classified to be extracted manually; the manually extracted feature information is then fed into a common classification algorithm to classify the long text, for example a Support Vector Machine (SVM) classification algorithm, a Logistic Regression (LR) classification algorithm, or an eXtreme Gradient Boosting (XGBoost) classification algorithm.
In this conventional approach of classifying manually extracted feature information with a common classification algorithm, manually extracting the feature information of the long text to be classified takes a long time and the extracted feature information may be incomplete, so the efficiency of text classification is low and the accuracy of classification is affected.
Disclosure of Invention
Embodiments of the invention aim to provide a method, a device, equipment, and a storage medium for classifying long texts, so as to improve the efficiency and accuracy of text classification. The specific technical solutions are as follows:
in a first aspect, an embodiment of the present invention provides a method for classifying a long text, where the method includes:
acquiring a long text to be classified, and dividing the long text into a plurality of text blocks;
coding each text block to obtain a feature vector corresponding to the text block;
fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
based on the target feature vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
Optionally, the segmenting the long text into a plurality of text blocks includes:
and dividing the long text into a plurality of text blocks with preset lengths.
Optionally, the segmenting the long text into a plurality of text blocks includes:
and according to the specified characters, dividing the long text into a plurality of text blocks with the length not exceeding the preset length.
Optionally, the text blocks do not intersect with each other.
Optionally, the encoding, for each text block, the text block to obtain a feature vector corresponding to the text block includes:
and aiming at each text block, coding the text block by using a pre-training model to obtain a feature vector corresponding to the text block.
Optionally, the obtaining a classification result of the long text by performing classification processing using a preset classification model based on the target feature vector includes:
and inputting the target characteristic vector into the preset classification model for classification processing to obtain a classification result of the long text.
Optionally, the performing fusion processing on the feature vectors corresponding to the text blocks to obtain the target feature vector corresponding to the long text includes:
splicing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
or determining an average value vector of the feature vectors corresponding to the text blocks as a target feature vector corresponding to the long text;
or determining the accumulated sum vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text.
Optionally, the inputting the target feature vector into the preset classification model for classification processing to obtain a classification result of the long text includes:
processing the target feature vector by using a feature processing module in the preset classification model to obtain a processed feature vector;
performing dimensionality reduction on the processed feature vector by using a full connection layer in the preset classification model to obtain a dimensionality reduced feature vector;
and carrying out normalization processing on the feature vectors subjected to dimension reduction by utilizing a Softmax function layer in the preset classification model to obtain a classification result of the long text.
In a second aspect, an embodiment of the present invention provides an apparatus for classifying a long text, where the apparatus includes:
the text acquisition module is used for acquiring a long text to be classified and dividing the long text into a plurality of text blocks;
the text coding module is used for coding each text block to obtain a feature vector corresponding to the text block;
the feature fusion module is used for carrying out fusion processing on the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
the text classification module is used for performing classification processing by using a preset classification model based on the target feature vector to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
In a third aspect, an embodiment of the present invention provides an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another via the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the method for classifying a long text according to the first aspect when executing a program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements the steps of the method for classifying a long text according to the first aspect.
Embodiments of the present invention further provide a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of any of the methods for classifying long texts described above.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a method, a device, equipment and a storage medium for classifying a long text, which are used for dividing the long text to be classified into a plurality of text blocks with certain lengths, namely converting the long text to be classified into a short text for processing, then coding the text blocks aiming at each text block to obtain a feature vector corresponding to the text block, performing fusion processing on the feature vectors corresponding to the text blocks to obtain a target feature vector corresponding to the long text to be classified, and further performing classification processing by using a preset classification model based on the target feature vector to obtain a classification result of the long text to be classified. In the embodiment of the invention, the long text characteristic information to be classified does not need to be manually extracted, the problems of incomplete extraction of the characteristic information and long time consumption are avoided, and the efficiency and the accuracy of text classification are improved.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to illustrate the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments and the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that a person skilled in the art can derive other embodiments from these drawings.
Fig. 1 is a schematic flowchart of a method for classifying long texts according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of another method for classifying long texts according to the embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for classifying a long text according to another embodiment of the present invention;
fig. 4 is a schematic diagram illustrating an architecture of an embodiment of classifying a long text according to the present invention;
fig. 5 is a schematic structural diagram of a device for classifying long texts according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that a person of ordinary skill in the art can derive from the embodiments given herein fall within the scope of the present invention.
In the conventional approach, a common classification algorithm is applied to manually extracted feature information; manually extracting the feature information of a long text to be classified takes a long time, and the extracted feature information may be incomplete, so the efficiency of text classification is low and the accuracy of classification suffers. To solve these problems, embodiments of the present invention provide a method, a device, equipment, and a storage medium for classifying long texts.
The method for classifying the long text provided by the embodiment of the invention comprises the following steps:
acquiring a long text to be classified, and dividing the long text into a plurality of text blocks;
coding each text block to obtain a feature vector corresponding to the text block;
fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
based on the target feature vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
The method for classifying long texts provided by the embodiment of the invention divides the long text to be classified into a plurality of text blocks of a certain length, i.e. converts the long text into short texts for processing; each text block is then encoded to obtain its corresponding feature vector, the feature vectors of the text blocks are fused to obtain the target feature vector of the long text to be classified, and, based on the target feature vector, classification is performed with a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing times and improves the efficiency and accuracy of text classification.
The following describes a method for classifying long texts according to an embodiment of the present invention in detail:
the method for classifying the long text provided by the embodiment of the invention can be applied to electronic equipment, and the electronic equipment can be client equipment or server equipment and the like.
Method example 1
As shown in fig. 1, an embodiment of the present invention provides a method for classifying a long text, where the method may include the following steps:
s101, acquiring a long text to be classified, and dividing the long text into a plurality of text blocks.
In the embodiment of the invention, the long text to be classified is obtained by the electronic device. The long text to be classified may be a text whose length exceeds a preset length, where the preset length may be the maximum text length that can be handled by a text processing model capable of, for example, extracting features from text, and the text length is determined by the number of characters and/or words the text contains. In the field of natural language processing, some text processing models place a limit on the length of text they can process; for example, a text processing model based on the Transformer structure can process at most 512 words at a time, where the Transformer is a feature extractor based on self-attention. The text length is determined differently for texts in different languages: for example, the length of a Chinese text may be determined by the number of characters it contains, while the length of an English text may be determined by the number of words it contains.
In the embodiment of the invention, a long text whose length exceeds the maximum text length the text processing model can handle is divided into a plurality of text blocks, so that the text processing model can process the divided text blocks.
As an optional implementation of the embodiment of the present invention, the long text to be classified is segmented such that the resulting text blocks are mutually disjoint.
In the embodiment of the invention, the long text to be classified may be divided into a plurality of mutually disjoint text blocks. In one implementation, the length of the long text to be classified may be determined first, and the long text is then divided in order into mutually disjoint text blocks of a set length, or divided at designated characters into mutually disjoint text blocks whose length does not exceed the set length. The set length may be any value not exceeding the preset length and can be chosen by a person skilled in the art according to actual requirements.
Taking a text processing model based on the Transformer structure as an example, the maximum length of text it can process at a time is 512 words, so the long text may be divided into text blocks of at most 512 words. Illustratively, with the words of the long text in order, the first text block may comprise the 1st to 510th words of the long text and the second text block the 511th to 1000th words, and so on; or the first text block may comprise the 1st to 255th words and the second text block the 256th to 510th words, and so on.
Because the long text to be classified is divided into mutually disjoint text blocks, no characters are shared between text blocks, which avoids repeated encoding of characters and semantic overlap between different text blocks during subsequent encoding.
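For illustration only, the following is a minimal sketch of such fixed-length, non-overlapping segmentation in Python; the function name and the choice of character-level lengths (as for Chinese text) are assumptions, not part of the embodiment.

```python
def split_fixed_length(text: str, block_len: int = 510) -> list[str]:
    """Divide a long text into consecutive, mutually disjoint blocks of
    at most block_len characters. 510 is chosen here so that a block plus
    BERT's [CLS]/[SEP] special tokens stays within the 512 limit."""
    return [text[i:i + block_len] for i in range(0, len(text), block_len)]

# Example: a 1200-character text yields blocks of 510, 510, and 180 characters.
```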
And S102, coding each text block to obtain a feature vector corresponding to the text block.
After the long text to be classified is divided into a plurality of text blocks, each text block may be encoded to obtain its corresponding feature vector, i.e. the vector representation obtained by encoding the text block.
As an optional implementation manner in the embodiment of the present invention, an implementation process of encoding each text block to obtain a feature vector corresponding to the text block may include:
and aiming at each text block, coding the text block by using a pre-training model to obtain a feature vector corresponding to the text block.
In the embodiment of the invention, each text block obtained by segmentation may be input into a pre-trained model for encoding to obtain the feature vector corresponding to the text block. The pre-trained model may be a model with a Transformer structure, such as BERT (Bidirectional Encoder Representations from Transformers), the RoBERTa pre-trained model, the RoBERTa-large Chinese pre-trained model, and the like.
Illustratively, take the BERT pre-trained model, a typical bidirectional encoding model, as the pre-trained model. The BERT model extracts text features based on the Transformer; it is more efficient than CNN/RNN models and can capture bidirectional context information and longer-distance dependencies in the true sense. Specifically, each text block may be input into a Transformer-based pre-trained model for encoding to obtain the feature vector corresponding to the text block.
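As a sketch of this encoding step, assuming the Hugging Face transformers library and taking the final-layer [CLS] vector as the block's feature vector (one common choice; the embodiment does not mandate it):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode_block(block: str) -> torch.Tensor:
    """Encode one text block into a 1 x 768 feature vector."""
    inputs = tokenizer(block, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Use the final-layer [CLS] representation as the block's feature vector.
    return outputs.last_hidden_state[:, 0, :]
```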
As an optional implementation of the embodiment of the present invention, the pre-trained model may also be a Transformer-structure model fine-tuned on the text blocks contained in the long text.
In the embodiment of the invention, the text blocks may be encoded with a pre-trained model of the Transformer structure, or with a Transformer-structure pre-trained model that has been fine-tuned on the text blocks contained in the long text to be classified.
Encoding a text block with a pre-trained model of the Transformer structure may, for example, use the BERT pre-trained model, i.e. Bidirectional Encoder Representations from Transformers. BERT is an open-source pre-trained language model that can preprocess most texts in the natural language processing field. To improve the accuracy of encoding the text blocks, the BERT pre-trained model may be further fine-tuned using the long text to be classified and a specific training task, so that the fine-tuned model extracts more accurate feature information for the long text to be classified. The specific training task may be, for example, named entity recognition or text classification.
The Transformer-structure pre-trained model fine-tuned on the text blocks contained in the long text to be classified may be obtained as follows: after a training task is set, the parameters of the Transformer-structure pre-trained model are used as the initialization parameters of the model, the model is trained on the set training task, and its parameters are adjusted continuously during training, yielding the fine-tuned pre-trained model.
Illustratively, let the training task be text classification and the Transformer-structure pre-trained model be the BERT pre-trained model. The fine-tuned model may be obtained as follows: a label is set for the long text to be classified, and the same label is set for each of the text blocks obtained by dividing it, where the label may indicate the type of the text. Then, with the parameters of the BERT pre-trained model as initialization parameters, each text block contained in the long text and its corresponding label are input into the BERT pre-trained model for training, and the parameters of the model are fine-tuned according to the model loss. Training ends when the loss reaches a set threshold, yielding the fine-tuned BERT pre-trained model, i.e. a Transformer-structure pre-trained model fine-tuned on the text blocks contained in the long text to be classified. The fine-tuned BERT pre-trained model is then used to encode the text blocks to obtain the feature vector corresponding to each text block.
In the embodiment of the invention, encoding the text blocks with a Transformer-structure pre-trained model fine-tuned on the text blocks contained in the long text makes the encoding more accurate, so that the resulting feature vectors more accurately represent the semantic feature information of the corresponding text blocks.
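A sketch of the fine-tuning procedure described above, assuming the Hugging Face transformers library; the number of labels and the hyperparameters are placeholders, not values from the patent:

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=5)   # num_labels: assumed class count
optimizer = AdamW(model.parameters(), lr=2e-5)

def fine_tune_step(blocks: list[str], label: int) -> float:
    """One fine-tuning step on the blocks of a single long text: every
    block inherits the label of the long text it was split from."""
    inputs = tokenizer(blocks, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    labels = torch.full((len(blocks),), label)
    loss = model(**inputs, labels=labels).loss   # built-in cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```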
And S103, fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text.
After each text block is encoded to obtain the feature vector corresponding to each text block, the feature vectors corresponding to each text block may be subjected to fusion processing to obtain a target feature vector capable of representing semantic feature information of the long text to be classified.
As an optional implementation manner in the embodiment of the present invention, performing fusion processing on feature vectors corresponding to text blocks to obtain target feature vectors corresponding to long texts includes:
splicing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long texts;
or determining the average value vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text;
or determining the accumulated sum vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text.
The feature vectors corresponding to the text blocks may be concatenated end to end in the order in which the text blocks appear in the long text to be classified. For example, if the feature vector of one text block has dimension 1 × 768, the target feature vector obtained by concatenating the feature vectors of N text blocks has dimension N × 768.
Fusing the feature vectors corresponding to the text blocks avoids representing the long text to be classified by the feature information of a single text block, so the obtained feature information of the long text is more complete.
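A sketch of the three fusion options in PyTorch (the function name is an assumption for illustration):

```python
import torch

def fuse(block_vectors: list[torch.Tensor], mode: str = "concat") -> torch.Tensor:
    """Fuse the per-block feature vectors (each of shape (1, 768)) into
    one target feature vector for the long text."""
    stacked = torch.cat(block_vectors, dim=0)        # (N, 768), in block order
    if mode == "concat":
        return stacked                               # concatenation: N x 768
    if mode == "mean":
        return stacked.mean(dim=0, keepdim=True)     # average vector: 1 x 768
    if mode == "sum":
        return stacked.sum(dim=0, keepdim=True)      # accumulated sum: 1 x 768
    raise ValueError(f"unknown fusion mode: {mode}")
```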
And S104, based on the target characteristic vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text.
The preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
As an optional implementation of the embodiment of the present invention, the preset classification model may include: a neural network model, a long short-term memory network model, a Transformer-based classification model, or the like.
The neural network model, the long short-term memory network model, or the Transformer-based classification model may each be trained on the target feature vectors corresponding to sample texts and the classification results corresponding to those sample texts; the loss function used during training may be, for example, a cross-entropy loss.
Further, the implementation process of performing classification processing by using a preset classification model based on the target feature vector to obtain a classification result of the long text may include:
and inputting the target characteristic vector into a preset classification model for classification processing to obtain a classification result of the long text.
Illustratively, the classification result may indicate the type of the long text to be classified, for example a thesis, a resume, or a contract, or the field to which the long text belongs, such as news or entertainment.
As an optional implementation manner of the embodiment of the present invention, the inputting the target feature vector into a preset classification model for classification processing to obtain a classification result of the long text may include:
processing the target feature vector by using a feature processing module in a preset classification model to obtain a processed feature vector;
performing dimensionality reduction on the processed feature vector by using a full connection layer in a preset classification model to obtain a dimensionality reduced feature vector;
and carrying out normalization processing on the feature vectors subjected to dimension reduction by utilizing a Softmax function layer in a preset classification model to obtain a classification result of the long text.
The preset classification model may include a feature processing module, such as a convolutional network or a residual module, a fully connected layer, and a Softmax function layer. The feature processing module processes the target feature vector and refines the feature information of the long text to obtain a processed feature vector; the fully connected layer performs dimensionality reduction on the processed feature vector to obtain a dimension-reduced feature vector; and the Softmax function layer normalizes the dimension-reduced feature vector to obtain the classification result of the long text.
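The following is one possible shape such a model could take in PyTorch; the choice of a 1-D convolution as the feature processing module, the pooling step, and all sizes are assumptions for illustration, not the patent's specified architecture:

```python
import torch
import torch.nn as nn

class LongTextClassifier(nn.Module):
    """Feature processing module + fully connected layer + Softmax layer,
    operating on an N x 768 target feature vector built by concatenation."""

    def __init__(self, hidden: int = 768, num_classes: int = 5):
        super().__init__()
        self.feature = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, num_classes)   # dimensionality reduction
        self.softmax = nn.Softmax(dim=-1)          # normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.feature(x.t().unsqueeze(0))       # (1, 768, N): refine features
        h = h.mean(dim=-1)                         # pool over blocks -> (1, 768)
        return self.softmax(self.fc(h))            # (1, num_classes) probabilities
```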
The method for classifying long texts provided by the embodiment of the invention divides the long text to be classified into a plurality of text blocks of a certain length, i.e. converts the long text into short texts for processing; each text block is then encoded to obtain its corresponding feature vector, the feature vectors of the text blocks are fused to obtain the target feature vector of the long text to be classified, and, based on the target feature vector, classification is performed with a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing times and improves the efficiency and accuracy of text classification.
For example, the preset classification model is a neural network model, and the training process of the neural network model may be:
step one, constructing an initial neural network model.
And step two, inputting the target characteristic vector corresponding to the sample text and the classification result corresponding to the sample text into the initial neural network model.
The process of obtaining the target feature vector corresponding to the sample text can be implemented by referring to the process of obtaining the target feature vector corresponding to the long text to be classified, and the details of the embodiment of the present invention are not repeated herein.
And thirdly, obtaining a prediction classification result corresponding to the sample text by using the initial neural network model.
And fourthly, calculating a loss function based on the predicted classification result corresponding to the sample text and the true classification result corresponding to the sample text, where the loss function is a cross-entropy loss.
Illustratively, the loss function may be expressed as:
$$L = -\frac{1}{T}\sum_{i=1}^{T}\sum_{c=1}^{M} y_{ic}\,\log(p_{ic})$$
where L denotes the loss, T the number of samples, i the index of the i-th sample, c the index of the c-th classification result, and M the number of classification results; $y_{ic}$ is an indicator function that takes the value 1 if the true classification result of the i-th sample is c and 0 otherwise, and $p_{ic}$ is the predicted probability that the i-th sample belongs to classification result c.
And step five, performing minimization processing on the loss function to obtain a minimized loss function.
And step six, determining the weight parameters of each module in the initial neural network model according to the minimized loss function.
And seventhly, updating the parameters in the initial neural network model based on the weight parameters, and training to obtain the neural network model.
During training, the weight parameters of each module in the neural network model are adjusted according to the loss function obtained in each iteration, and the procedure then returns to the step of inputting the target feature vectors corresponding to the sample texts and the classification results corresponding to the sample texts into the neural network model for training, until a preset stopping condition is met and the trained neural network model is obtained. The preset stopping condition may be reaching a preset number of iterations, or the loss reaching a preset loss threshold.
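Steps one through seven might look as follows in PyTorch; this is a sketch under the assumption that the model outputs class probabilities (as in the classifier sketch above), so the cross-entropy is computed via NLLLoss on their logarithm:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, samples, epochs: int = 10, lr: float = 1e-3):
    """Minimal training loop: forward pass, cross-entropy loss, parameter
    update, repeated until the preset stopping condition (here: a fixed
    number of epochs)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()                 # on log-probabilities, this
                                             # equals the cross-entropy above
    for _ in range(epochs):
        for target_vec, label in samples:    # (target feature vector, class)
            probs = model(target_vec)        # predicted classification result
            loss = criterion(torch.log(probs), torch.tensor([label]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```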
Method example 2
As shown in fig. 2, another method for classifying long texts is provided in the embodiments of the present invention, and the method may include the following steps:
s201, obtaining a long text to be classified, and dividing the long text into a plurality of text blocks with preset lengths.
In the embodiment of the invention, the long text to be classified is obtained by the electronic device. The long text to be classified may be a text whose length exceeds a preset length, where the preset length may be the maximum text length that can be handled by a text processing model capable of, for example, extracting features from text, and the text length is determined by the number of characters and/or words the text contains. In the field of natural language processing, some text processing models place a limit on the length of text they can process; for example, a text processing model based on the Transformer structure can process at most 512 words at a time, where the Transformer is a feature extractor based on self-attention. The text length is determined differently for texts in different languages: for example, the length of a Chinese text may be determined by the number of characters it contains, while the length of an English text may be determined by the number of words it contains.
In the embodiment of the present invention, for a long text that exceeds the maximum text length that can be processed by the text processing model, the long text may be segmented into a plurality of text blocks with preset lengths, so that the text processing model can process the segmented text blocks with preset lengths.
For example, a text processing model based on the Transformer structure can process text of at most 512 words. A long text of more than 512 words may therefore be divided into text blocks of 512 words each, with the remaining fewer than 512 words at the end forming one final block, so that no text block is longer than 512 words.
As an optional implementation of the embodiment of the present invention, the long text to be classified is segmented such that the resulting text blocks are mutually disjoint.
In the embodiment of the invention, the long text to be classified may be divided into a plurality of mutually disjoint text blocks of a preset length. Illustratively, with the words of the long text in order, the long text is divided into mutually disjoint blocks of 512 words: the first text block comprises the 1st to 512th words of the long text, the second text block the 513th to 1024th words, and so on. Alternatively, the long text may be divided into mutually disjoint blocks of 256 words: the first text block comprises the 1st to 256th words, the second text block the 257th to 512th words, and so on.
Because the long text to be classified is divided into mutually disjoint text blocks, no characters are shared between text blocks, which avoids repeated encoding of characters and semantic overlap between different text blocks during subsequent encoding.
S202, aiming at each text block, coding the text block to obtain a feature vector corresponding to the text block.
As an optional implementation manner in the embodiment of the present invention, an implementation process of encoding each text block to obtain a feature vector corresponding to the text block may include:
and aiming at each text block, coding the text block by using a pre-training model to obtain a feature vector corresponding to the text block.
As an optional implementation of the embodiment of the present invention, the pre-trained model may be a Transformer-structure model fine-tuned on the text blocks contained in the long text.
And S203, fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text.
As an optional implementation manner in the embodiment of the present invention, performing fusion processing on feature vectors corresponding to text blocks to obtain target feature vectors corresponding to long texts includes:
splicing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long texts;
or determining the average value vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text;
or determining the accumulated sum vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text.
And S204, based on the target characteristic vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text.
The preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
As an optional implementation of the embodiment of the present invention, the preset classification model may include: a neural network model, a long short-term memory network model, a Transformer-based classification model, or the like.
The implementation process of performing classification processing by using a preset classification model based on the target feature vector to obtain the classification result of the long text may include:
and inputting the target characteristic vector into a preset classification model for classification processing to obtain a classification result of the long text.
As an optional implementation manner of the embodiment of the present invention, the inputting the target feature vector into a preset classification model for classification processing to obtain a classification result of the long text may include:
processing the target feature vector by using a feature processing module in a preset classification model to obtain a processed feature vector;
performing dimensionality reduction on the processed feature vector by using a full connection layer in a preset classification model to obtain a dimensionality reduced feature vector;
and carrying out normalization processing on the feature vectors subjected to dimension reduction by utilizing a Softmax function layer in a preset classification model to obtain a classification result of the long text.
Specifically, the implementation processes of the steps S202 to S204 may refer to the implementation processes of the steps S102 to S104, and the embodiment of the present invention is not described herein again.
The method for classifying long texts provided by the embodiment of the invention divides the long text to be classified into a plurality of text blocks of a certain length, i.e. converts the long text into short texts for processing; each text block is then encoded to obtain its corresponding feature vector, the feature vectors of the text blocks are fused to obtain the target feature vector of the long text to be classified, and, based on the target feature vector, classification is performed with a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing times and improves the efficiency and accuracy of text classification.
Method example 3
As shown in fig. 3, another method for classifying a long text is provided in an embodiment of the present invention, and the method may include the following steps:
s301, obtaining a long text to be classified, and dividing the long text into a plurality of text blocks with the length not exceeding a preset length according to specified characters.
In the embodiment of the invention, the long text to be classified is obtained by the electronic device. The long text to be classified may be a text whose length exceeds a preset length, where the preset length may be the maximum text length that can be handled by a text processing model capable of, for example, extracting features from text, and the text length is determined by the number of characters and/or words the text contains. In the field of natural language processing, some text processing models place a limit on the length of text they can process; for example, a text processing model based on the Transformer structure can process at most 512 words at a time, where the Transformer is a feature extractor based on self-attention. The text length is determined differently for texts in different languages: for example, the length of a Chinese text may be determined by the number of characters it contains, while the length of an English text may be determined by the number of words it contains.
In the embodiment of the invention, for a long text with a length exceeding the maximum text length capable of being processed by the text processing model, the long text to be classified can be segmented into a plurality of text blocks with lengths not exceeding the preset length according to the specified characters, so that the text processing model can process the text blocks with lengths not exceeding the preset length obtained after segmentation.
For example, based on a text processing model of a Transformer structure, which can process a maximum length of text of 512 words, for a long text with a length greater than 512 words, the long text can be divided into a plurality of text blocks according to specified characters, and each text block has a length not greater than 512 words.
The designated characters may include, but are not limited to, punctuation marks, emoticons, letters, numbers, and the like. Punctuation may include, but is not limited to, periods, exclamation marks, and the like, representing the end of a sentence.
For example, in the embodiment of the present invention, the long text to be classified may be divided into a plurality of text blocks having lengths not exceeding a preset length according to punctuation marks such as periods and exclamation marks representing the ends of sentences.
As an optional implementation manner of the embodiment of the present invention, according to the specified characters, the long text is divided into a plurality of text blocks whose lengths do not exceed the preset length, and the text blocks may be mutually disjoint.
In the embodiment of the invention, the long text to be classified may be divided into a plurality of mutually disjoint text blocks whose length does not exceed the preset length. Illustratively, with the words of the long text in order, the long text is divided at periods, exclamation marks, and other sentence-ending punctuation into mutually disjoint blocks of at most 512 words: the first text block comprises the 1st to 300th words of the long text, the second text block the 301st to 800th words, and so on.
Because the long text to be classified is divided into mutually disjoint text blocks, no characters are shared between text blocks, which avoids repeated encoding of characters and semantic overlap between different text blocks during subsequent encoding.
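A sketch of such delimiter-based segmentation; the delimiter set and the function name are assumptions for illustration:

```python
import re

def split_on_delimiters(text: str, max_len: int = 510) -> list[str]:
    """Split the text after sentence-ending punctuation, then greedily
    pack whole sentences into mutually disjoint blocks of at most
    max_len characters."""
    sentences = re.split(r"(?<=[。！？.!?])", text)
    blocks, current = [], ""
    for sent in sentences:
        while len(sent) > max_len:           # over-long sentence: hard split
            if current:
                blocks.append(current)
                current = ""
            blocks.append(sent[:max_len])
            sent = sent[max_len:]
        if current and len(current) + len(sent) > max_len:
            blocks.append(current)
            current = ""
        current += sent
    if current:
        blocks.append(current)
    return blocks
```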
S302, aiming at each text block, the text block is coded to obtain a feature vector corresponding to the text block.
As an optional implementation manner in the embodiment of the present invention, an implementation process of encoding each text block to obtain a feature vector corresponding to the text block may include:
and aiming at each text block, coding the text block by using a pre-training model to obtain a feature vector corresponding to the text block.
As an optional implementation of the embodiment of the present invention, the pre-trained model may be a Transformer-structure model fine-tuned on the text blocks contained in the long text.
And S303, fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text.
As an optional implementation manner in the embodiment of the present invention, performing fusion processing on feature vectors corresponding to text blocks to obtain target feature vectors corresponding to long texts includes:
splicing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long texts;
or determining the average value vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text;
or determining the accumulated sum vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text.
And S304, based on the target characteristic vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text.
The preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
As an optional implementation of the embodiment of the present invention, the preset classification model may include: a neural network model, a long short-term memory network model, a Transformer-based classification model, or the like.
The implementation process of performing classification processing by using a preset classification model based on the target feature vector to obtain the classification result of the long text may include:
and inputting the target characteristic vector into a preset classification model for classification processing to obtain a classification result of the long text.
As an optional implementation manner of the embodiment of the present invention, the inputting the target feature vector into a preset classification model for classification processing to obtain a classification result of the long text may include:
processing the target feature vector by using a feature processing module in a preset classification model to obtain a processed feature vector;
performing dimensionality reduction on the processed feature vector by using a full connection layer in a preset classification model to obtain a dimensionality reduced feature vector;
and carrying out normalization processing on the feature vectors subjected to dimension reduction by utilizing a Softmax function layer in a preset classification model to obtain a classification result of the long text.
Specifically, the implementation processes of the steps S302 to S304 may refer to the implementation processes of the steps S102 to S104, and the embodiment of the present invention is not described herein again.
The method for classifying long texts provided by the embodiment of the invention divides the long text to be classified into a plurality of text blocks of a certain length, i.e. converts the long text into short texts for processing; each text block is then encoded to obtain its corresponding feature vector, the feature vectors of the text blocks are fused to obtain the target feature vector of the long text to be classified, and, based on the target feature vector, classification is performed with a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing times and improves the efficiency and accuracy of text classification.
Exemplarily, as shown in fig. 4, fig. 4 is an architecture diagram of an implementation of classifying a long text according to an embodiment of the present invention.
For the long text to be classified, the long text may be divided into a plurality of mutually disjoint text blocks, or into a plurality of mutually disjoint text blocks of a preset length, or, at designated characters, into a plurality of mutually disjoint text blocks whose length does not exceed a preset length, such as text block 1, text block 2, …, text block N in fig. 4.
Further, each text block is encoded with the pre-trained model to obtain the feature vector corresponding to each text block, such as the feature vector of text block 1, the feature vector of text block 2, …, the feature vector of text block N in fig. 4.
Further, the feature vectors corresponding to the text blocks are concatenated to obtain the target feature vector of the long text, which is input into the preset classification model for classification to obtain the classification result of the long text to be classified.
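Putting the earlier sketches together, the overall flow of fig. 4 could be exercised as follows; all helper names come from the sketches above, not from the embodiment itself, and the input file and class count are assumed:

```python
# Assumed input: the long text read from some source.
long_text = open("document.txt", encoding="utf-8").read()

blocks = split_fixed_length(long_text)            # or split_on_delimiters(long_text)
vectors = [encode_block(b) for b in blocks]       # one 1 x 768 vector per block
target = fuse(vectors, mode="concat")             # target feature vector (N x 768)

classifier = LongTextClassifier(num_classes=5)    # trained weights assumed loaded
probs = classifier(target)                        # class probabilities
print("predicted class:", probs.argmax(dim=-1).item())
```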
The embodiment of the present invention has two advantages. On the one hand, in the prior art the feature information of a long text must be extracted manually, which takes a long time and may yield incomplete feature information, so the efficiency of text classification is low and its accuracy suffers. The long text classification method adopted by the embodiment of the invention instead divides the long text to be classified into text blocks of a certain length, i.e. converts the long text into short texts for processing, then encodes each text block to obtain its feature vector and fuses the feature vectors of the text blocks to obtain the target feature vector of the long text to be classified, so that no manual feature extraction is needed.
On the other hand, directly truncating the long text, i.e. retaining only a preset-length portion at the beginning or in the middle of the long text, easily loses text and part of its semantics. The method adopted by the embodiment of the invention divides the long text to be classified into text blocks of a certain length, encodes each text block to obtain its feature vector, and fuses the feature vectors into the target feature vector of the long text to be classified. No part of the long text needs to be cut off, so no textual information is lost and the semantics of every text block in the long text are preserved; the extracted feature information of the long text is therefore more comprehensive, and the efficiency and accuracy of text classification are higher.
Corresponding to the method embodiment, the embodiment of the invention also provides a corresponding device embodiment.
As shown in fig. 5, an embodiment of the present invention provides an apparatus for classifying a long text, where the apparatus may include:
the text obtaining module 501 is configured to obtain a long text to be classified, and divide the long text into a plurality of text blocks;
a text encoding module 502, configured to encode each text block to obtain a feature vector corresponding to the text block;
the feature fusion module 503 is configured to perform fusion processing on the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
the text classification module 504 is configured to perform classification processing by using a preset classification model based on the target feature vector to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
The apparatus provided by the embodiment of the invention divides the long text to be classified into a plurality of text blocks of a certain length, i.e. converts the long text into short texts for processing; each text block is then encoded to obtain its corresponding feature vector, the feature vectors of the text blocks are fused to obtain the target feature vector of the long text to be classified, and, based on the target feature vector, classification is performed with a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing times and improves the efficiency and accuracy of text classification.
It should be noted that the apparatus according to the embodiment of the present invention is an apparatus corresponding to the method for classifying a long text shown in fig. 1, and all embodiments of the method for classifying a long text shown in fig. 1 are applicable to the apparatus and all can achieve the same beneficial effects.
Optionally, the text obtaining module 501 includes: a first text segmentation module;
the first text segmentation module is used for segmenting the long text into a plurality of text blocks with preset lengths.
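For illustration only, the following is a minimal Python sketch of such fixed-length segmentation. The function name split_fixed and the default block length of 510 characters (leaving room for special tokens in a typical encoder) are assumptions, not taken from the patent.

```python
def split_fixed(text: str, block_len: int = 510) -> list[str]:
    """Split a long text into consecutive, non-overlapping blocks of
    block_len characters; the final block may be shorter."""
    return [text[i:i + block_len] for i in range(0, len(text), block_len)]
```

For example, split_fixed("a" * 1200, block_len=510) yields three blocks of 510, 510 and 180 characters, which together reproduce the original text without loss.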
Optionally, the text obtaining module 501 includes: a second text segmentation module;
the second text segmentation module is used for segmenting the long text, according to the specified characters, into a plurality of text blocks each with a length not exceeding the preset length.
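A hedged sketch of this delimiter-based variant follows. The delimiter set (Chinese sentence-ending punctuation) and the assumption that any single sentence fits within max_len are illustrative choices; the patent only speaks of specified characters and a preset maximum length.

```python
import re

def split_by_chars(text: str, max_len: int = 510,
                   delimiters: str = "。！？") -> list[str]:
    """Split text at the specified delimiter characters, then pack whole
    sentences into blocks whose length does not exceed max_len."""
    pieces = re.split(f"([{delimiters}])", text)
    # Re-attach each delimiter to the sentence it terminates.
    sentences = ["".join(pair) for pair in zip(pieces[0::2],
                                               pieces[1::2] + [""])]
    blocks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) <= max_len:
            current += sentence
        else:
            if current:
                blocks.append(current)
            current = sentence  # assumes one sentence never exceeds max_len
    if current:
        blocks.append(current)
    return blocks
```

Because blocks end at the specified characters, no sentence is split across two blocks, which keeps each block semantically self-contained.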
Optionally, the text blocks do not intersect with each other.
Optionally, the text encoding module 502 is specifically configured to:
for each text block, encoding the text block by using a pre-training model to obtain the feature vector corresponding to the text block.
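As one possible realization, the sketch below encodes a block with a publicly available pre-trained encoder. The patent does not name a concrete pre-training model; bert-base-chinese and the use of the [CLS] hidden state as the block's feature vector are assumptions made here purely for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_block(block: str) -> torch.Tensor:
    """Encode one text block into a single feature vector."""
    inputs = tokenizer(block, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Take the [CLS] position of the last hidden layer as the block vector.
    return outputs.last_hidden_state[0, 0]
```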
Optionally, the text classification module 504 is specifically configured to:
inputting the target feature vector into the preset classification model for classification processing to obtain the classification result of the long text.
Optionally, the feature fusion module 503 is specifically configured to:
splicing the feature vectors corresponding to the text blocks to obtain the target feature vector corresponding to the long text;
or determining the average value vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text;
or determining the accumulated sum vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text.
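The three fusion alternatives just listed can be sketched in a few lines; the function name fuse and the PyTorch types are assumptions for illustration.

```python
import torch

def fuse(block_vectors: list[torch.Tensor], mode: str = "mean") -> torch.Tensor:
    """Fuse per-block feature vectors (each of shape (hidden,)) into one
    target feature vector for the long text."""
    stacked = torch.stack(block_vectors)   # shape: (num_blocks, hidden)
    if mode == "concat":
        return stacked.flatten()           # splicing: (num_blocks * hidden,)
    if mode == "mean":
        return stacked.mean(dim=0)         # average value vector: (hidden,)
    if mode == "sum":
        return stacked.sum(dim=0)          # accumulated sum vector: (hidden,)
    raise ValueError(f"unknown fusion mode: {mode}")
```

Note that only the mean and sum variants yield a target vector of fixed dimension; with splicing, the classification model's input dimension depends on the number of blocks.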
Optionally, the text classification module 504 is specifically configured to:
processing the target feature vector by using a feature processing module in a preset classification model to obtain a processed feature vector;
performing dimensionality reduction on the processed feature vector by using a full connection layer in a preset classification model to obtain a dimensionality reduced feature vector;
performing normalization processing on the dimension-reduced feature vector by using a Softmax function layer in the preset classification model to obtain the classification result of the long text.
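A minimal sketch of such a classification head follows. The hidden sizes and the choice of a linear layer with ReLU as the feature processing module are assumptions; the patent fixes only the three stages (feature processing, fully connected dimensionality reduction, Softmax normalization).

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # Feature processing module (assumed here to be linear + ReLU).
        self.feature_processor = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU())
        # Fully connected layer reducing to one score per class.
        self.fc = nn.Linear(hidden_dim, num_classes)
        # Softmax function layer normalizing scores to probabilities.
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, target_vector: torch.Tensor) -> torch.Tensor:
        processed = self.feature_processor(target_vector)
        reduced = self.fc(processed)
        return self.softmax(reduced)
```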
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604;
a memory 603 for storing a computer program;
the processor 601 is configured to implement the following steps when executing the program stored in the memory 603:
acquiring a long text to be classified, and dividing the long text into a plurality of text blocks;
coding each text block to obtain a feature vector corresponding to the text block;
fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
based on the target feature vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
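Putting the four steps together, a hedged end-to-end sketch (reusing the hypothetical helpers split_fixed, encode_block, fuse and ClassificationHead introduced above) might look as follows; mean fusion is chosen so that the target vector's dimension does not depend on the number of blocks.

```python
def classify_long_text(text: str, head: "ClassificationHead") -> int:
    blocks = split_fixed(text)                   # 1. divide into text blocks
    vectors = [encode_block(b) for b in blocks]  # 2. encode each block
    target = fuse(vectors, mode="mean")          # 3. fuse into target vector
    probs = head(target)                         # 4. classify
    return int(probs.argmax())                   # index of predicted class
```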
The electronic device provided by the embodiment of the invention divides the long text to be classified into a plurality of text blocks of a certain length; that is, it converts the long text to be classified into short texts for processing. It then encodes each text block to obtain the feature vector corresponding to the text block, fuses the feature vectors corresponding to the text blocks to obtain the target feature vector corresponding to the long text to be classified, and, based on the target feature vector, performs classification processing using a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing time and improves the efficiency and accuracy of text classification.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment provided by the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps:
acquiring a long text to be classified, and dividing the long text into a plurality of text blocks;
coding each text block to obtain a feature vector corresponding to the text block;
fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
based on the target feature vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
The computer-readable storage medium provided by the embodiment of the present invention divides the long text to be classified into a plurality of text blocks of a certain length; that is, it converts the long text to be classified into short texts for processing. It then encodes each text block to obtain the feature vector corresponding to the text block, fuses the feature vectors corresponding to the text blocks to obtain the target feature vector corresponding to the long text to be classified, and, based on the target feature vector, performs classification processing using a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing time and improves the efficiency and accuracy of text classification.
In yet another embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of:
acquiring a long text to be classified, and dividing the long text into a plurality of text blocks;
coding each text block to obtain a feature vector corresponding to the text block;
fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
based on the target feature vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
The computer program product provided by the embodiment of the invention divides the long text to be classified into a plurality of text blocks of a certain length; that is, it converts the long text to be classified into short texts for processing. It then encodes each text block to obtain the feature vector corresponding to the text block, fuses the feature vectors corresponding to the text blocks to obtain the target feature vector corresponding to the long text to be classified, and, based on the target feature vector, performs classification processing using a preset classification model to obtain the classification result of the long text to be classified. In the embodiment of the invention, the feature information of the long text to be classified does not need to be extracted manually, which avoids incomplete feature extraction and long processing time and improves the efficiency and accuracy of text classification.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)).
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device/electronic apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to some descriptions of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (11)

1. A method for classifying long text, the method comprising:
acquiring a long text to be classified, and dividing the long text into a plurality of text blocks;
coding each text block to obtain a feature vector corresponding to the text block;
fusing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
based on the target feature vector, carrying out classification processing by using a preset classification model to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
2. The method of claim 1, wherein the segmenting the long text into a plurality of text blocks comprises:
dividing the long text into a plurality of text blocks of a preset length.
3. The method of claim 1, wherein the segmenting the long text into a plurality of text blocks comprises:
according to the specified characters, dividing the long text into a plurality of text blocks each with a length not exceeding the preset length.
4. A method according to any of claims 1-3, wherein the text blocks do not intersect each other.
5. The method according to any one of claims 1 to 3, wherein said encoding, for each text block, the text block to obtain the feature vector corresponding to the text block comprises:
for each text block, encoding the text block by using a pre-training model to obtain the feature vector corresponding to the text block.
6. The method according to any one of claims 1 to 3, wherein the obtaining the classification result of the long text by performing classification processing using a preset classification model based on the target feature vector comprises:
inputting the target feature vector into the preset classification model for classification processing to obtain the classification result of the long text.
7. The method according to any one of claims 1 to 3, wherein the fusing the feature vectors corresponding to the text blocks to obtain the target feature vector corresponding to the long text comprises:
splicing the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
or determining an average value vector of the feature vectors corresponding to the text blocks as a target feature vector corresponding to the long text;
or determining the accumulated sum vector of the feature vectors corresponding to the text blocks as the target feature vector corresponding to the long text.
8. The method according to claim 6, wherein the inputting the target feature vector into the preset classification model for classification processing to obtain the classification result of the long text comprises:
processing the target feature vector by using a feature processing module in the preset classification model to obtain a processed feature vector;
performing dimensionality reduction on the processed feature vector by using a full connection layer in the preset classification model to obtain a dimensionality reduced feature vector;
performing normalization processing on the dimension-reduced feature vectors by using a Softmax function layer in the preset classification model to obtain the classification result of the long text.
9. An apparatus for classifying long texts, the apparatus comprising:
the text acquisition module is used for acquiring a long text to be classified and dividing the long text into a plurality of text blocks;
the text coding module is used for coding each text block to obtain a feature vector corresponding to the text block;
the feature fusion module is used for carrying out fusion processing on the feature vectors corresponding to the text blocks to obtain target feature vectors corresponding to the long text;
the text classification module is used for performing classification processing by using a preset classification model based on the target feature vector to obtain a classification result of the long text; the preset classification model is obtained by training according to the target feature vector corresponding to the sample text and the classification result corresponding to the sample text.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 8 when executing a program stored in the memory.
11. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-8.
CN202111677818.3A 2021-12-31 2021-12-31 Long text classification method, device, equipment and storage medium Pending CN114416981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111677818.3A CN114416981A (en) 2021-12-31 2021-12-31 Long text classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111677818.3A CN114416981A (en) 2021-12-31 2021-12-31 Long text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114416981A true CN114416981A (en) 2022-04-29

Family

ID=81271958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111677818.3A Pending CN114416981A (en) 2021-12-31 2021-12-31 Long text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114416981A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702787A (en) * 2023-08-07 2023-09-05 四川隧唐科技股份有限公司 Long text entity identification method, device, computer equipment and medium

Similar Documents

Publication Publication Date Title
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN111291195B (en) Data processing method, device, terminal and readable storage medium
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110223675B (en) Method and system for screening training text data for voice recognition
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN111611807B (en) Keyword extraction method and device based on neural network and electronic equipment
CN112507190B (en) Method and system for extracting keywords of financial and economic news
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN112287672A (en) Text intention recognition method and device, electronic equipment and storage medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN113553412A (en) Question and answer processing method and device, electronic equipment and storage medium
CN111914564A (en) Text keyword determination method and device
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN114416981A (en) Long text classification method, device, equipment and storage medium
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN110874408B (en) Model training method, text recognition device and computing equipment
CN114611521B (en) Entity identification method, device, equipment and storage medium
CN114090885B (en) Product title core word extraction method, related device and computer program product
CN114492404A (en) Long text processing method, device, equipment and storage medium
CN114611625A (en) Language model training method, language model training device, language model data processing method, language model data processing device, language model data processing equipment, language model data processing medium and language model data processing product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination