CN113821637A - Long text classification method and device, computer equipment and readable storage medium - Google Patents

Long text classification method and device, computer equipment and readable storage medium

Info

Publication number
CN113821637A
CN113821637A (application number CN202111041158.XA)
Authority
CN
China
Prior art keywords
text
preset
classified
model
short
Prior art date
Legal status
Pending
Application number
CN202111041158.XA
Other languages
Chinese (zh)
Inventor
张盼盼
邓积杰
林星
白兴安
徐扬
Current Assignee
Beijing Weiboyi Technology Co ltd
Original Assignee
Beijing Weiboyi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Weiboyi Technology Co ltd
Priority to CN202111041158.XA
Publication of CN113821637A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a long text classification method and apparatus, a computer device and a readable storage medium, and relates to the technical field of text classification. The method comprises the following steps: acquiring a text to be classified; segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified; sequentially inputting the plurality of short texts into a preset fine-tuning model to obtain a plurality of word vector sequences corresponding to the plurality of short texts, wherein the preset fine-tuning model is obtained by fine-tuning a preset BERT model with a training text in advance; generating a plurality of feature vectors corresponding to the plurality of word vector sequences; and obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors. Because the preset BERT model is fine-tuned with the training text in advance, the resulting fine-tuning model can capture context information and recognize polysemous words, so that the features of the text to be classified are extracted accurately and the text to be classified is classified accurately.

Description

Long text classification method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of text classification technologies, and in particular, to a method and an apparatus for classifying long texts, a computer device, and a readable storage medium.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence, and text classification is an important application of natural language processing, mainly used in fields such as sentiment classification and user-review classification. Before the 1990s, automatic text classification mainly relied on knowledge engineering, that is, manual classification by professionals, which was costly, time-consuming and labor-intensive. After the 1990s, researchers began to apply various statistical and machine learning methods to automatic text classification, such as support vector machines (SVM), naïve Bayes, k-nearest neighbors (KNN) and logistic regression (LR). In recent years, with the rapid development of deep learning and various neural network models, text classification methods based on deep learning have attracted close attention and research in academia and industry, and recurrent neural networks (RNN), convolutional neural networks (CNN) and the like are widely used in text classification.
However, in current text classification methods, the input is often a static word or character vector that cannot change with its context, so the information it covers is relatively limited; and the feature extraction models used are the CNN and RNN models of deep learning, which lack fine-grained adjustment for the different levels of importance of the input information stream along the input dimension, so the accuracy of text classification is low.
Disclosure of Invention
In view of the above, the present application mainly aims to solve the technical problem of low accuracy of the existing classification method.
Therefore, a first objective of the present application is to provide a long text classification method, which performs classification based on fine-tuning model processing, and can improve the accuracy of classifying texts to be classified.
A second object of the present application is to provide a long text classification apparatus.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a long text classification method, including:
acquiring a text to be classified;
segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text;
generating a plurality of feature vectors corresponding to the plurality of word vector sequences;
and obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In order to achieve the above object, a second embodiment of the present application provides a long text classification apparatus, including:
the text acquisition module is used for acquiring texts to be classified;
the segmentation module is connected with the text acquisition module and is used for segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
the sequence obtaining module is connected with the segmentation module and used for sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text;
the generating module is connected with the sequence acquiring module and used for generating a plurality of feature vectors corresponding to the word vector sequences;
and the classification module is connected with the generation module and is used for acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In order to achieve the above object, a third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the long text classification method according to the first aspect of the present application.
To achieve the above object, a non-transitory computer-readable storage medium is provided in a fourth aspect of the present application, and a computer program is stored thereon, and when being executed by a processor, the computer program implements the long text classification method according to the first aspect of the present application.
In summary, the embodiment fine-tunes the preset BERT model with the training text, thereby improving the accuracy of classifying the text to be classified. According to the technical scheme of the present application, after the text to be classified is obtained, it is first segmented according to the preset length into a plurality of short texts corresponding to the text to be classified, and a plurality of word vector sequences corresponding to the plurality of short texts are then obtained through the preset fine-tuning model, where the preset fine-tuning model is obtained by fine-tuning the preset BERT model with the training text. After the preset BERT model has been fine-tuned with the training text, it can capture context information and recognize polysemous words, so feature extraction is performed well and the accuracy of text classification is improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a long text classification method provided in embodiment 1 of the present application;
fig. 2 is a flowchart of a long text classification method provided in embodiment 2 of the present application;
fig. 3 is a flowchart of a long text classification method provided in embodiment 3 of the present application;
fig. 4 is a first schematic structural diagram of a long text classification apparatus provided in embodiment 4 of the present application;
fig. 5 is a second schematic structural diagram of a long text classification apparatus provided in embodiment 4 of the present application; and
fig. 6 is a third schematic structural diagram of a long text classification apparatus provided in embodiment 4 of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
Example 1
Fig. 1 is a flowchart of a long text classification method according to an embodiment of the present application.
As shown in fig. 1, a long text classification method provided in an embodiment of the present application includes the following steps:
and step 110, acquiring a text to be classified.
In this embodiment of the application, before performing step 120, the length of the text to be classified may be determined, so as to determine whether the text to be classified is a long text.
Specifically, in the embodiment of the present application, if the number of characters of the text to be classified is greater than or equal to a preset number threshold, the text to be classified may be determined to be a long text, and processing continues with step 120 of the long text classification method provided in the embodiment of the present application; if the number of characters of the text to be classified is smaller than the preset number threshold, the text to be classified may be determined to be a short text, and it is then classified by a text classification method that is common in the field.
And step 120, segmenting the text to be classified according to the preset length to obtain a plurality of short texts corresponding to the text to be classified.
And step 130, sequentially inputting the plurality of short texts to a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the plurality of short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance.
In this embodiment, after the plurality of short texts are sequentially input into the preset fine-tuning model in step 130, the word vector at the [CLS] position of the last layer may be taken as the word vector sequence of the corresponding short text.
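For illustration only, the following is a minimal sketch (not part of the claimed embodiments) of segmenting a text by a preset length and taking the last-layer [CLS] vector of each short text, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint as a stand-in for the preset fine-tuning model; the names preset_length, split_text and encode_chunks are illustrative.

```python
# Sketch: segment a long text by a preset length and take the last-layer
# [CLS] vector of each short text as its word vector sequence.
# Checkpoint and helper names are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")  # stand-in for the fine-tuned model
model.eval()

def split_text(text, preset_length=128):
    # Reserve two positions for [CLS] and [SEP], which the tokenizer adds.
    content_len = preset_length - 2
    return [text[i:i + content_len] for i in range(0, len(text), content_len)]

def encode_chunks(text, preset_length=128):
    cls_vectors = []
    with torch.no_grad():
        for chunk in split_text(text, preset_length):
            inputs = tokenizer(chunk, return_tensors="pt",
                               truncation=True, max_length=preset_length)
            outputs = model(**inputs)
            # The vector at position 0 of the last layer is the [CLS] vector.
            cls_vectors.append(outputs.last_hidden_state[:, 0])
    return torch.cat(cls_vectors, dim=0)  # (num_short_texts, hidden_size)
```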
Step 140 generates a plurality of feature vectors corresponding to the plurality of word vector sequences.
In this embodiment, the process of generating the plurality of feature vectors in step 140 may specifically be: performing depth coding on the plurality of word vector sequences by using a preset LSTM network to obtain the plurality of feature vectors corresponding to the plurality of word vector sequences.
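For illustration only, a minimal sketch of this depth coding step, assuming PyTorch and treating the per-short-text [CLS] vectors from the previous sketch as the word vector sequences; the sizes hidden_size=768 and lstm_size=256 are assumptions.

```python
import torch
import torch.nn as nn

class LstmDepthEncoder(nn.Module):
    """Depth-code the word vector sequences of one document with a preset LSTM (sketch)."""
    def __init__(self, hidden_size=768, lstm_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=hidden_size, hidden_size=lstm_size,
                            batch_first=True)

    def forward(self, word_vectors):
        # word_vectors: (num_short_texts, hidden_size) for a single document
        outputs, _ = self.lstm(word_vectors.unsqueeze(0))  # add a batch dimension
        return outputs.squeeze(0)  # (num_short_texts, lstm_size) feature vectors
```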
And 150, acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In the embodiment of the present application, the feature vectors generated in step 140 may be sequentially input into a fully connected layer for dimension-reduction processing; probability classification is then performed on the dimension-reduced feature vectors by using softmax, and a probability prediction vector is output, so that the classification result corresponding to the text to be classified is determined according to the maximum probability value in the probability prediction vector.
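For illustration only, a minimal sketch of the fully connected dimension reduction and softmax classification, assuming PyTorch; the number of classes and the mean-pooling of the feature vectors before the fully connected layer are assumptions not fixed by this embodiment.

```python
import torch
import torch.nn as nn

class SoftmaxClassifier(nn.Module):
    """Fully connected dimension reduction followed by softmax classification (sketch)."""
    def __init__(self, lstm_size=256, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(lstm_size, num_classes)  # dimension-reduction layer

    def forward(self, feature_vectors):
        # feature_vectors: (num_short_texts, lstm_size); pooling choice is an assumption
        pooled = feature_vectors.mean(dim=0)
        probs = torch.softmax(self.fc(pooled), dim=-1)  # probability prediction vector
        return int(torch.argmax(probs))  # class with the maximum probability value
```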
According to the technical scheme provided by the embodiment of the present application, segmenting the text to be classified according to the preset length solves the technical problem in the prior art that the text to be classified is limited by the fixed length of the BERT model. After the text to be classified is segmented according to the preset length, the fine-tuning model obtained by fine-tuning the preset BERT model with the training text in advance can effectively capture the context information of the text to be classified and recognize the polysemous words in the text; that is, the plurality of word vector sequences corresponding to the plurality of short texts can be obtained accurately through the preset fine-tuning model, the plurality of feature vectors corresponding to the word vector sequences are then generated, and the classification result corresponding to the text to be classified is finally obtained according to the feature vectors.
In summary, segmenting the text to be classified according to the preset length means that the text to be classified is not limited by the fixed length of the BERT model; at the same time, the preset BERT model is fine-tuned with the training text, so the fine-tuned preset fine-tuning model can capture context information, recognize polysemous words and accurately extract the features of the text to be classified, thereby classifying the text to be classified accurately.
Example 2
Fig. 2 is a flowchart of a long text classification method according to an embodiment of the present application.
As shown in fig. 2, a long text classification method provided in the embodiment of the present application includes the following steps:
step 210, obtaining a text to be classified. The specific process is similar to step 110 shown in fig. 1, and is not described in detail here.
Step 220, preprocessing the text to be classified.
And step 230, segmenting the preprocessed text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified.
And 240, sequentially inputting the plurality of short texts to a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the plurality of short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance.
Step 250, a plurality of feature vectors corresponding to the plurality of word vector sequences are generated.
And step 260, obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In this embodiment, the specific implementation process of steps 230 to 260 is similar to that of steps 120 to 150 shown in fig. 1, and is not repeated here.
Compared with embodiment 1, in the embodiment of the present application the text to be classified is preprocessed before it is segmented according to the predetermined length. The preprocessing in the embodiment of the present application includes filtering out punctuation marks, numbers, link addresses, stop words, spaces and illegal characters, and restoring abbreviations to their full forms; it is emphasized that the preprocessing in the embodiment of the present application includes, but is not limited to, one or more of the preprocessing methods described above.
Specifically, punctuation marks, numbers, link addresses, stop words, spaces, special characters, illegal characters and the like contribute little to text analysis. Because the embodiment of the present application segments the text to be classified according to the preset length, filtering out such content that does not help text analysis keeps it from occupying unnecessary memory and further improves the segmentation speed of the text to be classified; the embodiment of the present application therefore removes punctuation marks, numbers, link addresses, stop words, spaces, special characters, illegal characters and the like.
In addition, for abbreviations, the embodiment of the present application may complete them, that is, restore each abbreviation to its full form. For example, abbreviations such as We'll, don't, I'm, I've and He's may be completed by using a custom regular-expression matching method; the embodiment of the present application includes, but is not limited to, completing abbreviations in the above manner, and the alternatives are not listed one by one here.
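For illustration only, a minimal sketch of such preprocessing with custom regular-expression matching; the contraction table, the placeholder stop-word list and the exact filtering rules are assumptions.

```python
import re

# Hypothetical contraction table; extend as needed.
CONTRACTIONS = {
    r"we'll": "we will", r"don't": "do not", r"i'm": "i am",
    r"i've": "i have", r"he's": "he is",
}
STOP_WORDS = {"the", "a", "of"}  # placeholder stop-word list

def preprocess(text):
    text = text.lower()
    for pattern, full_form in CONTRACTIONS.items():             # restore abbreviations
        text = re.sub(pattern, full_form, text)
    text = re.sub(r"https?://\S+", " ", text)                   # filter link addresses
    text = re.sub(r"\d+", " ", text)                            # filter numbers
    text = re.sub(r"[^\w\s\u4e00-\u9fa5]", " ", text)           # filter punctuation / illegal characters
    tokens = [w for w in text.split() if w not in STOP_WORDS]   # filter stop words
    return " ".join(tokens)                                     # collapse spaces
```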
Example 3
Fig. 3 is a flowchart of a long text classification method according to an embodiment of the present application.
As shown in fig. 3, a long text classification method provided in the embodiment of the present application includes the following steps:
In steps 310 to 320, the text to be classified and the corresponding plurality of short texts are obtained. The process is similar to steps 110 to 120 shown in fig. 1; in particular, after the text to be classified is obtained, the text to be classified may also be preprocessed, which is similar to embodiment 2 of the present application and is not described in detail here.
Step 330, obtaining a preset fine tuning model.
In this embodiment, the process of obtaining the preset fine-tuning model through step 330 may include: acquiring the training text and a classification label corresponding to the training text; segmenting the training text according to the preset length to obtain a first short text; determining the vector coding, sentence coding and position coding corresponding to the first short text; generating an input vector according to the vector coding, sentence coding and position coding; inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text; and fine-tuning the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine-tuning model. Generating the input vector according to the vector coding, sentence coding and position coding may specifically be adding the vector coding, the sentence coding and the position coding to obtain the input vector.
In particular, in order for the segmented texts to still keep their position order, an additional single sentence vector may be introduced so that the position information is preserved more completely; this overcomes the defect that the position coding only takes effect within a single segment and cannot run through the whole text. Specifically, while the vector coding, sentence coding and position coding corresponding to the first short text are determined, the corresponding single sentence vector is also determined; an input vector is then generated according to the vector coding, sentence coding, position coding and single sentence vector, and the processes of obtaining the word vector sequence, fine-tuning the BERT model and so on are executed. Generating the input vector according to the vector coding, sentence coding, position coding and single sentence vector may specifically be adding these parameters; a sinusoidal position vector may be used as the single sentence vector.
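For illustration only, a minimal sketch of forming the input vector by adding the vector coding, sentence coding, position coding and a sinusoidal single sentence vector, assuming PyTorch; the vocabulary size, hidden size and the exact sinusoidal formula are assumptions.

```python
import torch
import torch.nn as nn

class InputVectorBuilder(nn.Module):
    """Input vector = vector coding + sentence coding + position coding + single sentence vector (sketch)."""
    def __init__(self, vocab_size=21128, hidden=768, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)    # vector coding
        self.sentence_emb = nn.Embedding(2, hidden)          # sentence coding
        self.position_emb = nn.Embedding(max_len, hidden)    # position coding
        self.hidden = hidden

    def single_sentence_vector(self, segment_index):
        # A sinusoidal vector keyed by the short text's index in the document,
        # so that the segmented texts still keep their position order.
        i = torch.arange(self.hidden, dtype=torch.float)
        angle = segment_index / (10000.0 ** ((i - i % 2) / self.hidden))
        return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))

    def forward(self, token_ids, segment_index):
        positions = torch.arange(token_ids.size(0))
        sentence_ids = torch.zeros_like(token_ids)
        return (self.token_emb(token_ids)
                + self.sentence_emb(sentence_ids)
                + self.position_emb(positions)
                + self.single_sentence_vector(segment_index))  # added element-wise
```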
For better understanding of the embodiments of the present application, the process of obtaining the preset fine-tuning model in the embodiments of the present application will now be explained by the following examples, which are detailed as follows:
Assume that the acquired training text is "Xiaoming works in Harbin, and so do I" (a sentence of 12 characters in the original Chinese), the classification label is "1", and the predetermined length is 8.
The short text set L generated from the training text is [ "Xiaoming works in Harbin", "and so do I" ], where the short texts in the short text set are obtained by segmenting the training text according to the predetermined length; the text content of each short text therefore occupies a length of 6.
In the embodiment of the present application, the extra symbols [CLS] and [SEP] are respectively placed at the beginning and the end of each short text in the short text set, and each of these separators occupies a length of 1; alternatively, a prefix is placed at the beginning of each short text in the short text set and a suffix at its end, and each prefix and each suffix occupies a length of 1. In other words, the sum of the length occupied by the text content of each short text and the lengths occupied by the extra symbols placed at its beginning and end is equal to the predetermined length 8; likewise, the sum of the length occupied by the text content of each short text and the lengths occupied by the prefix placed at its beginning and the suffix placed at its end is equal to the predetermined length 8.
In the above embodiment, the extra symbols placed at the beginning and the end of each short text, or the prefix and suffix placed at its beginning and end, are used to determine where each short text begins and ends. In addition, the preset length in the embodiment of the present application may also be 20, 50, 100 and so on; the size of the predetermined length is determined according to the attributes of the training text. In this way, the length of the text to be classified is not limited by the fixed-length input of the preset BERT model.
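For illustration only, a short sketch of the arithmetic in this example; the Chinese sentence is an assumed reconstruction of the translated training text, and a predetermined length of 8 leaves 6 positions of content once [CLS] and [SEP] each take one.

```python
text = "小明在哈尔滨工作，我也是"   # assumed original of "Xiaoming works in Harbin, and so do I"
preset_length = 8
content_len = preset_length - 2       # [CLS] and [SEP] each occupy a length of 1
chunks = [text[i:i + content_len] for i in range(0, len(text), content_len)]
print(chunks)  # ['小明在哈尔滨', '工作，我也是'], two short texts of 6 characters each
```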
In the above embodiment, the short text set includes two short texts, namely a first short text and a second short text, where the first short text is "Xiaoming works in Harbin" and the second short text is "and so do I".
After the training text is segmented according to the preset length to obtain the short text set including the first short text and the second short text, the first short text, namely "Xiaoming works in Harbin", is determined from the short text set, and the vector coding, sentence coding and position coding (and, at the same time, the single sentence vector) corresponding to the first short text are obtained; subsequently, an input vector is generated from the vector coding, sentence coding and position coding (and the single sentence vector); the input vector is input into the preset BERT model for processing to obtain the word vector sequence "x1" corresponding to the first short text; and the preset BERT model is fine-tuned according to the word vector sequence "x1" and the classification label "1" corresponding to the first short text, so as to obtain the preset fine-tuning model.
In the method for acquiring the preset fine-tuning model, the training text is first segmented according to the preset length to generate the first short text; the word vector sequence corresponding to the first short text is then determined; and finally the preset BERT model is fine-tuned according to the word vector sequence and the classification label to obtain the fine-tuning model. In practical applications, segmenting the text to be classified according to the preset length ensures that the text to be classified is not limited by the fixed length of the BERT model; at the same time, the preset BERT model is fine-tuned with the training text, and the fine-tuned preset fine-tuning model can capture context information, recognize polysemous words and accurately extract the features of the text to be classified, thereby classifying the text to be classified accurately.
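For illustration only, a minimal sketch of one fine-tuning step on a short text and its classification label, assuming the HuggingFace BertForSequenceClassification head (which classifies from the [CLS] position) as a stand-in for fine-tuning the preset BERT model; the checkpoint, optimizer and learning rate are assumptions.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def fine_tune_step(short_text, label, preset_length=8):
    inputs = tokenizer(short_text, return_tensors="pt",
                       truncation=True, max_length=preset_length)
    outputs = model(**inputs, labels=torch.tensor([label]))
    outputs.loss.backward()   # adjust the preset BERT model toward the classification label
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# e.g. fine_tune_step("小明在哈尔滨", 1) for the first short text with label "1"
```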
Example 4
Fig. 4 is a schematic structural diagram of a long text classification device according to an embodiment of the present application.
As shown in fig. 4, a long text classification apparatus provided in an embodiment of the present application includes:
the text obtaining module 410 is configured to obtain a text to be classified.
The segmentation module 420 is connected with the text acquisition module and is used for segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
a sequence obtaining module 430, connected to the segmentation module, configured to sequentially input the short texts into a preset fine tuning model, so as to obtain a plurality of word vector sequences corresponding to the short texts, where the preset fine tuning model is obtained by performing fine tuning on a preset BERT model by using a training text in advance;
a generating module 440, connected to the sequence obtaining module, configured to generate a plurality of feature vectors corresponding to the plurality of word vector sequences;
and the classification module 450 is connected with the generation module and is used for acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In this embodiment, the process of implementing long text classification through the above modules is similar to that provided in embodiment 1 of the present application, and is not described in detail here.
The generating module is specifically configured to: perform depth coding on the plurality of word vector sequences by using a preset LSTM network to obtain the plurality of feature vectors corresponding to the plurality of word vector sequences.
Further, as shown in fig. 5, the long text classification apparatus provided in this embodiment further includes:
the preprocessing module 460 is respectively connected with the text acquisition module and the segmentation module, and is used for preprocessing the text to be classified, wherein the preprocessing includes one or more of filtering punctuation marks, supplementing abbreviations, filtering spaces and filtering illegal characters;
the segmentation module 420 is specifically configured to segment the preprocessed text to be classified according to a predetermined length.
In this embodiment, when the long text classification apparatus further includes the preprocessing module, the process of implementing long text classification is similar to that provided in embodiment 2 of the present application, and is not described in detail here.
Further, as shown in fig. 6, the long text classification apparatus provided in this embodiment further includes:
and a model obtaining module 470, connected to the sequence obtaining module, for obtaining the preset fine tuning model.
Specifically, the model obtaining module 470 includes:
the text acquisition unit is used for acquiring the training text and the classification label corresponding to the training text;
the segmentation unit is connected with the text acquisition unit and used for segmenting the training text according to the preset length to obtain a first short text;
the determining unit is connected with the dividing unit and used for determining the vector code, the sentence code and the position code corresponding to the first short text;
the vector generating unit is connected with the determining unit and used for generating an input vector according to the vector code, the sentence code and the position code;
the sequence obtaining unit is connected with the vector generating unit and used for inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text;
and the training unit is respectively connected with the text acquisition unit and the sequence acquisition unit and is used for finely adjusting the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine adjustment model.
In particular, the vector generating unit is specifically configured to: add the vector code, the sentence code and the position code to obtain the input vector.
In this embodiment, when the long text classification apparatus further includes the model obtaining module 470, the process of implementing long text classification is similar to that provided in embodiment 3 of the present application, and is not described in detail here.
In summary, the scheme of the embodiment of the present application uses the segmentation module so that the text to be classified is not limited by the fixed length of the BERT model, and uses the sequence obtaining module so that context information can be captured, polysemous words can be recognized, the features of the text to be classified can be accurately extracted, and the text to be classified can be accurately classified.
In summary, in practical applications of this embodiment, after the text to be classified is processed by the preprocessing module, the segmentation module ensures that the text to be classified is not limited by the fixed length of the BERT model; at the same time, the sequence obtaining module of the embodiment of the present application uses the preset fine-tuning model, obtained by fine-tuning the preset BERT model with the training text, to capture context information and recognize polysemous words, so that the features of the text to be classified are extracted accurately and the text to be classified is classified accurately. The generating module then performs depth coding on the plurality of word vector sequences by using an LSTM network to obtain the plurality of feature vectors corresponding to the word vector sequences, so that various kinds of long text can be classified accurately. The approach of combining the BERT model with the LSTM network can also be extended to other fixed-length pre-training models, after which a neural network or similar model is connected to classify long texts, and the classification accuracy can be improved accordingly.
In order to implement the foregoing embodiments, the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the long text classification method according to the foregoing embodiments is implemented.
In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the long text classification method described in the above embodiments.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (14)

1. A method for classifying long texts, comprising:
acquiring a text to be classified;
segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance;
generating a plurality of feature vectors corresponding to the plurality of word vector sequences;
and obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors.
2. The method of classifying long texts according to claim 1, wherein before the segmenting of the text to be classified according to the predetermined length, the method further comprises:
preprocessing the text to be classified, wherein the preprocessing comprises one or more of filtering punctuation marks, filling abbreviations, filtering blank spaces and filtering illegal characters;
the segmenting the text to be classified according to the preset length specifically comprises the following steps: and segmenting the preprocessed text to be classified according to a preset length.
3. The method for classifying long texts according to claim 1, wherein before the step of sequentially inputting the plurality of short texts into a preset fine-tuning model, the method further comprises:
and acquiring the preset fine tuning model.
4. The method of classifying long texts according to claim 3, wherein the obtaining the preset fine-tuning model comprises:
acquiring the training text and a classification label corresponding to the training text;
segmenting the training text according to the preset length to obtain a first short text;
determining vector coding, sentence coding and position coding corresponding to the first short text;
generating an input vector according to the vector code, sentence code and position code;
inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text;
and finely adjusting the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine adjustment model.
5. The method of classifying long text according to claim 4, wherein generating an input vector based on the vector coding, sentence coding, and position coding comprises:
adding the vector code, the sentence code and the position code to obtain the input vector.
6. The method of classifying long text according to any one of claims 1-5, wherein the generating a plurality of feature vectors corresponding to the plurality of word vector sequences comprises:
carrying out depth coding on the word vector sequences by utilizing a preset LSTM network to obtain a plurality of feature vectors corresponding to the word vector sequences.
7. A long text classification apparatus, comprising:
the text acquisition module is used for acquiring texts to be classified;
the segmentation module is connected with the text acquisition module and is used for segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
the sequence obtaining module is connected with the segmentation module and is used for sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance;
the generating module is connected with the sequence acquiring module and used for generating a plurality of feature vectors corresponding to the word vector sequences;
and the classification module is connected with the generation module and is used for acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
8. The long text classification apparatus of claim 7, further comprising:
the preprocessing module is respectively connected with the text acquisition module and the segmentation module and is used for preprocessing the text to be classified, and the preprocessing comprises one or more of punctuation mark filtering, abbreviation supplementing, space filtering and illegal character filtering;
the segmentation module is specifically used for segmenting the preprocessed text to be classified according to a preset length.
9. The long text classification apparatus of claim 7, further comprising:
and the model acquisition module is connected with the sequence acquisition module and is used for acquiring the preset fine tuning model.
10. The apparatus for classifying long texts according to claim 9, wherein the model obtaining module comprises:
the text acquisition unit is used for acquiring the training text and the classification label corresponding to the training text;
the segmentation unit is connected with the text acquisition unit and used for segmenting the training text according to the preset length to obtain a first short text;
the determining unit is connected with the dividing unit and used for determining the vector code, the sentence code and the position code corresponding to the first short text;
the vector generating unit is connected with the determining unit and used for generating an input vector according to the vector code, the sentence code and the position code;
the sequence obtaining unit is connected with the vector generating unit and used for inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text;
and the training unit is respectively connected with the text acquisition unit and the sequence acquisition unit and is used for finely adjusting the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine adjustment model.
11. The apparatus for classifying long texts according to claim 10, wherein the vector generation unit is specifically configured to:
adding the vector code, the sentence code and the position code to obtain the input vector.
12. The apparatus according to any of claims 7 to 11, wherein the generating module is specifically configured to:
carrying out depth coding on the word vector sequences by utilizing a preset LSTM network to obtain a plurality of feature vectors corresponding to the word vector sequences.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of classifying long text as claimed in any one of claims 1 to 6 when executing the computer program.
14. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the long text classification method according to any one of claims 1-6.
CN202111041158.XA 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium Pending CN113821637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111041158.XA CN113821637A (en) 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111041158.XA CN113821637A (en) 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113821637A true CN113821637A (en) 2021-12-21

Family

ID=78922047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111041158.XA Pending CN113821637A (en) 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113821637A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200251100A1 (en) * 2019-02-01 2020-08-06 International Business Machines Corporation Cross-domain multi-task learning for text classification
CN112070139A (en) * 2020-08-31 2020-12-11 三峡大学 Text classification method based on BERT and improved LSTM
CN112084337A (en) * 2020-09-17 2020-12-15 腾讯科技(深圳)有限公司 Training method of text classification model, and text classification method and equipment
CN113220890A (en) * 2021-06-10 2021-08-06 长春工业大学 Deep learning method combining news headlines and news long text contents based on pre-training

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677691A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114677691B (en) * 2022-04-06 2023-10-03 北京百度网讯科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN114579752A (en) * 2022-05-09 2022-06-03 中国人民解放军国防科技大学 Long text classification method and device based on feature importance and computer equipment
CN114579752B (en) * 2022-05-09 2023-05-26 中国人民解放军国防科技大学 Feature importance-based long text classification method and device and computer equipment

Similar Documents

Publication Publication Date Title
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN110196894B (en) Language model training method and language model prediction method
CN107832353B (en) False information identification method for social media platform
CN108427738B (en) Rapid image retrieval method based on deep learning
CN109816032B (en) Unbiased mapping zero sample classification method and device based on generative countermeasure network
CN111159454A (en) Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN111914644A (en) Dual-mode cooperation based weak supervision time sequence action positioning method and system
CN103984959A (en) Data-driven and task-driven image classification method
CN112905795A (en) Text intention classification method, device and readable medium
CN113821637A (en) Long text classification method and device, computer equipment and readable storage medium
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
CN111695527A (en) Mongolian online handwriting recognition method
CN113326380B (en) Equipment measurement data processing method, system and terminal based on deep neural network
CN112966691A (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN112734803B (en) Single target tracking method, device, equipment and storage medium based on character description
CN112163429B (en) Sentence correlation obtaining method, system and medium combining cyclic network and BERT
CN112966088B (en) Unknown intention recognition method, device, equipment and storage medium
CN114821271B (en) Model training method, image description generation device and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN114722822B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium
CN112131879A (en) Relationship extraction system, method and device
CN115438658A (en) Entity recognition method, recognition model training method and related device
CN112100368B (en) Method and device for identifying dialogue interaction intention
CN112836670A (en) Pedestrian action detection method and device based on adaptive graph network
CN114626378A (en) Named entity recognition method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination