CN113821637A - Long text classification method and device, computer equipment and readable storage medium - Google Patents

Long text classification method and device, computer equipment and readable storage medium

Info

Publication number
CN113821637A
CN113821637A (application number CN202111041158.XA)
Authority
CN
China
Prior art keywords
text
preset
classified
model
short
Prior art date
Legal status
Pending
Application number
CN202111041158.XA
Other languages
Chinese (zh)
Inventor
张盼盼
邓积杰
林星
白兴安
徐扬
Current Assignee
Beijing Weiboyi Technology Co ltd
Original Assignee
Beijing Weiboyi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Weiboyi Technology Co ltd
Priority to CN202111041158.XA
Publication of CN113821637A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a long text classification method and apparatus, a computer device and a readable storage medium, and relates to the technical field of text classification. The method comprises the following steps: acquiring a text to be classified; segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified; sequentially inputting the plurality of short texts into a preset fine-tuning model to obtain a plurality of word vector sequences corresponding to the plurality of short texts, wherein the preset fine-tuning model is obtained by fine-tuning a preset BERT model with a training text in advance; generating a plurality of feature vectors corresponding to the plurality of word vector sequences; and obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors. Because the preset BERT model is fine-tuned with the training text in advance, the resulting fine-tuning model can capture context information and recognize polysemous words, so that the features of the text to be classified are extracted accurately and the text to be classified is classified accurately.

Description

Long text classification method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of text classification technologies, and in particular, to a method and an apparatus for classifying long texts, a computer device, and a readable storage medium.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence, and text classification is an important application of natural language processing, mainly used in fields such as sentiment classification and user-review classification. Before the 1990s, automatic text classification mainly relied on knowledge engineering, that is, manual classification by professionals, which was costly, time-consuming and labor-intensive. After the 1990s, researchers began to apply various statistical and machine learning methods to automatic text classification, such as support vector machines (SVM), naïve Bayes, k-nearest neighbors (KNN) and logistic regression (LR). In recent years, with the rapid development of deep learning and various neural network models, text classification methods based on deep learning have attracted close attention and research in academia and industry, and recurrent neural networks (RNN), convolutional neural networks (CNN) and the like are widely used in text classification.
However, in current text classification methods, the input is often a static word or character vector that cannot change with its context, so the information it covers is relatively limited; and the feature extraction models used are the CNN and RNN models of deep learning, which lack fine-grained adjustment for the different levels of importance of the input information stream along the input dimension, so the accuracy of text classification is low.
Disclosure of Invention
In view of the above, the present application mainly aims to solve the technical problem of low accuracy of the existing classification method.
Therefore, a first objective of the present application is to provide a long text classification method, which performs classification based on fine-tuning model processing, and can improve the accuracy of classifying texts to be classified.
A second object of the present application is to provide a long text classification apparatus.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a long text classification method, including:
acquiring a text to be classified;
segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text;
generating a plurality of feature vectors corresponding to the plurality of word vector sequences;
and obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In order to achieve the above object, a second embodiment of the present application provides a long text classification apparatus, including:
the text acquisition module is used for acquiring texts to be classified;
the segmentation module is connected with the text acquisition module and is used for segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
the sequence obtaining module is connected with the segmentation module and used for sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text;
the generating module is connected with the sequence acquiring module and used for generating a plurality of feature vectors corresponding to the word vector sequences;
and the classification module is connected with the generation module and is used for acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In order to achieve the above object, a third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the long text classification method according to the first aspect of the present application.
To achieve the above object, a non-transitory computer-readable storage medium is provided in a fourth aspect of the present application, and a computer program is stored thereon, and when being executed by a processor, the computer program implements the long text classification method according to the first aspect of the present application.
In summary, the embodiment fine-tunes the preset BERT model with the training text, thereby improving the accuracy of classifying the text to be classified. According to the technical scheme of the present application, after the text to be classified is obtained, it is first segmented according to the preset length into a plurality of short texts corresponding to the text to be classified, and a plurality of word vector sequences corresponding to the plurality of short texts are then obtained through the preset fine-tuning model, where the preset fine-tuning model is obtained by fine-tuning the preset BERT model with the training text. After the preset BERT model has been fine-tuned with the training text, it can capture context information and recognize polysemous words, so feature extraction is performed well and the accuracy of text classification is improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a long text classification method provided in embodiment 1 of the present application;
fig. 2 is a flowchart of a long text classification method provided in embodiment 2 of the present application;
fig. 3 is a flowchart of a long text classification method provided in embodiment 3 of the present application;
fig. 4 is a first schematic structural diagram of a long text classification apparatus provided in embodiment 4 of the present application;
fig. 5 is a second schematic structural diagram of a long text classification apparatus provided in embodiment 4 of the present application; and
fig. 6 is a third schematic structural diagram of a long text classification apparatus provided in embodiment 4 of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
Example 1
Fig. 1 is a flowchart of a long text classification method according to an embodiment of the present application.
As shown in fig. 1, a long text classification method provided in an embodiment of the present application includes the following steps:
and step 110, acquiring a text to be classified.
In this embodiment of the application, before performing step 120, the length of the text to be classified may be determined, so as to determine whether the text to be classified is a long text.
Specifically, in the embodiment of the present application, if the number of characters of the text to be classified is greater than or equal to a preset number threshold, the text to be classified may be determined to be a long text, and processing continues with step 120 of the long text classification method provided in the embodiment of the present application; if the number of characters of the text to be classified is smaller than the preset number threshold, the text to be classified may be determined to be a short text, and it is then classified by a text classification method that is common in the field.
And step 120, segmenting the text to be classified according to the preset length to obtain a plurality of short texts corresponding to the text to be classified.
And step 130, sequentially inputting the plurality of short texts to a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the plurality of short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance.
In this embodiment, after the plurality of short texts are sequentially input into the preset fine-tuning model in step 130, the word vector at the [CLS] position of the last layer may be taken as the word vector sequence of the corresponding short text.
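For illustration only, the following is a minimal sketch (not part of the claimed embodiments) of segmenting a text by a preset length and taking the last-layer [CLS] vector of each short text, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint as a stand-in for the preset fine-tuning model; the names preset_length, split_text and encode_chunks are illustrative.

```python
# Sketch: segment a long text by a preset length and take the last-layer
# [CLS] vector of each short text as its word vector sequence.
# Checkpoint and helper names are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")  # stand-in for the fine-tuned model
model.eval()

def split_text(text, preset_length=128):
    # Reserve two positions for [CLS] and [SEP], which the tokenizer adds.
    content_len = preset_length - 2
    return [text[i:i + content_len] for i in range(0, len(text), content_len)]

def encode_chunks(text, preset_length=128):
    cls_vectors = []
    with torch.no_grad():
        for chunk in split_text(text, preset_length):
            inputs = tokenizer(chunk, return_tensors="pt",
                               truncation=True, max_length=preset_length)
            outputs = model(**inputs)
            # The vector at position 0 of the last layer is the [CLS] vector.
            cls_vectors.append(outputs.last_hidden_state[:, 0])
    return torch.cat(cls_vectors, dim=0)  # (num_short_texts, hidden_size)
```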
Step 140 generates a plurality of feature vectors corresponding to the plurality of word vector sequences.
In this embodiment, the process of generating the plurality of feature vectors in step 140 may specifically be: performing depth coding on the plurality of word vector sequences by using a preset LSTM network to obtain the plurality of feature vectors corresponding to the plurality of word vector sequences.
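For illustration only, a minimal sketch of this depth coding step, assuming PyTorch and treating the per-short-text [CLS] vectors from the previous sketch as the word vector sequences; the sizes hidden_size=768 and lstm_size=256 are assumptions.

```python
import torch
import torch.nn as nn

class LstmDepthEncoder(nn.Module):
    """Depth-code the word vector sequences of one document with a preset LSTM (sketch)."""
    def __init__(self, hidden_size=768, lstm_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=hidden_size, hidden_size=lstm_size,
                            batch_first=True)

    def forward(self, word_vectors):
        # word_vectors: (num_short_texts, hidden_size) for a single document
        outputs, _ = self.lstm(word_vectors.unsqueeze(0))  # add a batch dimension
        return outputs.squeeze(0)  # (num_short_texts, lstm_size) feature vectors
```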
And 150, acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In the embodiment of the present application, the feature vectors generated in step 140 may be sequentially input into a fully connected layer for dimension-reduction processing; probability classification is then performed on the dimension-reduced feature vectors by using softmax, and a probability prediction vector is output, so that the classification result corresponding to the text to be classified is determined according to the maximum probability value in the probability prediction vector.
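For illustration only, a minimal sketch of the fully connected dimension reduction and softmax classification, assuming PyTorch; the number of classes and the mean-pooling of the feature vectors before the fully connected layer are assumptions not fixed by this embodiment.

```python
import torch
import torch.nn as nn

class SoftmaxClassifier(nn.Module):
    """Fully connected dimension reduction followed by softmax classification (sketch)."""
    def __init__(self, lstm_size=256, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(lstm_size, num_classes)  # dimension-reduction layer

    def forward(self, feature_vectors):
        # feature_vectors: (num_short_texts, lstm_size); pooling choice is an assumption
        pooled = feature_vectors.mean(dim=0)
        probs = torch.softmax(self.fc(pooled), dim=-1)  # probability prediction vector
        return int(torch.argmax(probs))  # class with the maximum probability value
```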
According to the technical scheme provided by the embodiment of the present application, segmenting the text to be classified according to the preset length solves the technical problem in the prior art that the text to be classified is limited by the fixed length of the BERT model. After the text to be classified is segmented according to the preset length, the fine-tuning model obtained by fine-tuning the preset BERT model with the training text in advance can effectively capture the context information of the text to be classified and recognize the polysemous words in the text; that is, the plurality of word vector sequences corresponding to the plurality of short texts can be obtained accurately through the preset fine-tuning model, the plurality of feature vectors corresponding to the word vector sequences are then generated, and the classification result corresponding to the text to be classified is finally obtained according to the feature vectors.
In summary, segmenting the text to be classified according to the preset length means that the text to be classified is not limited by the fixed length of the BERT model; at the same time, the preset BERT model is fine-tuned with the training text, so the fine-tuned preset fine-tuning model can capture context information, recognize polysemous words and accurately extract the features of the text to be classified, thereby classifying the text to be classified accurately.
Example 2
Fig. 2 is a flowchart of a long text classification method according to an embodiment of the present application.
As shown in fig. 2, a long text classification method provided in the embodiment of the present application includes the following steps:
step 210, obtaining a text to be classified. The specific process is similar to step 110 shown in fig. 1, and is not described in detail here.
Step 220, preprocessing the text to be classified.
And step 230, segmenting the preprocessed text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified.
And 240, sequentially inputting the plurality of short texts to a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the plurality of short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance.
Step 250, a plurality of feature vectors corresponding to the plurality of word vector sequences are generated.
And step 260, obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In this embodiment, the specific implementation process of steps 230 to 260 is similar to that of steps 120 to 150 shown in fig. 1, and is not repeated here.
Compared with embodiment 1, in the embodiment of the present application the text to be classified is preprocessed before it is segmented according to the predetermined length. The preprocessing in the embodiment of the present application includes filtering out punctuation marks, numbers, link addresses, stop words, spaces and illegal characters, and restoring abbreviations to their full forms; it is emphasized that the preprocessing in the embodiment of the present application includes, but is not limited to, one or more of the preprocessing methods described above.
Specifically, punctuation marks, numbers, link addresses, stop words, spaces, special characters, illegal characters and the like contribute little to text analysis. Because the embodiment of the present application segments the text to be classified according to the preset length, filtering out such content that does not help text analysis keeps it from occupying unnecessary memory and further improves the segmentation speed of the text to be classified; the embodiment of the present application therefore removes punctuation marks, numbers, link addresses, stop words, spaces, special characters, illegal characters and the like.
In addition, for abbreviations, the embodiment of the present application may complete them, that is, restore each abbreviation to its full form. For example, abbreviations such as We'll, don't, I'm, I've and He's may be completed by using a custom regular-expression matching method; the embodiment of the present application includes, but is not limited to, completing abbreviations in the above manner, and the alternatives are not listed one by one here.
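For illustration only, a minimal sketch of such preprocessing with custom regular-expression matching; the contraction table, the placeholder stop-word list and the exact filtering rules are assumptions.

```python
import re

# Hypothetical contraction table; extend as needed.
CONTRACTIONS = {
    r"we'll": "we will", r"don't": "do not", r"i'm": "i am",
    r"i've": "i have", r"he's": "he is",
}
STOP_WORDS = {"the", "a", "of"}  # placeholder stop-word list

def preprocess(text):
    text = text.lower()
    for pattern, full_form in CONTRACTIONS.items():             # restore abbreviations
        text = re.sub(pattern, full_form, text)
    text = re.sub(r"https?://\S+", " ", text)                   # filter link addresses
    text = re.sub(r"\d+", " ", text)                            # filter numbers
    text = re.sub(r"[^\w\s\u4e00-\u9fa5]", " ", text)           # filter punctuation / illegal characters
    tokens = [w for w in text.split() if w not in STOP_WORDS]   # filter stop words
    return " ".join(tokens)                                     # collapse spaces
```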
Example 3
Fig. 3 is a flowchart of a long text classification method according to an embodiment of the present application.
As shown in fig. 3, a long text classification method provided in the embodiment of the present application includes the following steps:
In steps 310 to 320, the text to be classified and the corresponding plurality of short texts are obtained. The process is similar to steps 110 to 120 shown in fig. 1; in particular, after the text to be classified is obtained, the text to be classified may also be preprocessed, which is similar to embodiment 2 of the present application and is not described in detail here.
Step 330, obtaining a preset fine tuning model.
In this embodiment, the process of obtaining the preset fine-tuning model through step 330 may include: acquiring the training text and a classification label corresponding to the training text; segmenting the training text according to the preset length to obtain a first short text; determining the vector coding, sentence coding and position coding corresponding to the first short text; generating an input vector according to the vector coding, sentence coding and position coding; inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text; and fine-tuning the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine-tuning model. Generating the input vector according to the vector coding, sentence coding and position coding may specifically be adding the vector coding, the sentence coding and the position coding to obtain the input vector.
In particular, in order for the segmented texts to still keep their position order, an additional single sentence vector may be introduced so that the position information is preserved more completely; this overcomes the defect that the position coding only takes effect within a single segment and cannot run through the whole text. Specifically, while the vector coding, sentence coding and position coding corresponding to the first short text are determined, the corresponding single sentence vector is also determined; an input vector is then generated according to the vector coding, sentence coding, position coding and single sentence vector, and the processes of obtaining the word vector sequence, fine-tuning the BERT model and so on are executed. Generating the input vector according to the vector coding, sentence coding, position coding and single sentence vector may specifically be adding these parameters; a sinusoidal position vector may be used as the single sentence vector.
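For illustration only, a minimal sketch of forming the input vector by adding the vector coding, sentence coding, position coding and a sinusoidal single sentence vector, assuming PyTorch; the vocabulary size, hidden size and the exact sinusoidal formula are assumptions.

```python
import torch
import torch.nn as nn

class InputVectorBuilder(nn.Module):
    """Input vector = vector coding + sentence coding + position coding + single sentence vector (sketch)."""
    def __init__(self, vocab_size=21128, hidden=768, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)    # vector coding
        self.sentence_emb = nn.Embedding(2, hidden)          # sentence coding
        self.position_emb = nn.Embedding(max_len, hidden)    # position coding
        self.hidden = hidden

    def single_sentence_vector(self, segment_index):
        # A sinusoidal vector keyed by the short text's index in the document,
        # so that the segmented texts still keep their position order.
        i = torch.arange(self.hidden, dtype=torch.float)
        angle = segment_index / (10000.0 ** ((i - i % 2) / self.hidden))
        return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))

    def forward(self, token_ids, segment_index):
        positions = torch.arange(token_ids.size(0))
        sentence_ids = torch.zeros_like(token_ids)
        return (self.token_emb(token_ids)
                + self.sentence_emb(sentence_ids)
                + self.position_emb(positions)
                + self.single_sentence_vector(segment_index))  # added element-wise
```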
For better understanding of the embodiments of the present application, the process of obtaining the preset fine-tuning model in the embodiments of the present application will now be explained by the following examples, which are detailed as follows:
Assume that the acquired training text is "Xiaoming works in Harbin, and so do I" (a sentence of 12 characters in the original Chinese), the classification label is "1", and the predetermined length is 8.
The short text set L generated from the training text is [ "Xiaoming works in Harbin", "and so do I" ], where the short texts in the short text set are obtained by segmenting the training text according to the predetermined length; the text content of each short text therefore occupies a length of 6.
In the embodiment of the present application, the extra symbols [CLS] and [SEP] are respectively placed at the beginning and the end of each short text in the short text set, and each of these separators occupies a length of 1; alternatively, a prefix is placed at the beginning of each short text in the short text set and a suffix at its end, and each prefix and each suffix occupies a length of 1. In other words, the sum of the length occupied by the text content of each short text and the lengths occupied by the extra symbols placed at its beginning and end is equal to the predetermined length 8; likewise, the sum of the length occupied by the text content of each short text and the lengths occupied by the prefix placed at its beginning and the suffix placed at its end is equal to the predetermined length 8.
In the above embodiment, the extra symbols placed at the beginning and the end of each short text, or the prefix and suffix placed at its beginning and end, are used to determine where each short text begins and ends. In addition, the preset length in the embodiment of the present application may also be 20, 50, 100 and so on; the size of the predetermined length is determined according to the attributes of the training text. In this way, the length of the text to be classified is not limited by the fixed-length input of the preset BERT model.
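For illustration only, a short sketch of the arithmetic in this example; the Chinese sentence is an assumed reconstruction of the translated training text, and a predetermined length of 8 leaves 6 positions of content once [CLS] and [SEP] each take one.

```python
text = "小明在哈尔滨工作，我也是"   # assumed original of "Xiaoming works in Harbin, and so do I"
preset_length = 8
content_len = preset_length - 2       # [CLS] and [SEP] each occupy a length of 1
chunks = [text[i:i + content_len] for i in range(0, len(text), content_len)]
print(chunks)  # ['小明在哈尔滨', '工作，我也是'], two short texts of 6 characters each
```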
In the above embodiment, the short text set includes two short texts, namely a first short text and a second short text, where the first short text is "Xiaoming works in Harbin" and the second short text is "and so do I".
After the training text is segmented according to the preset length to obtain the short text set including the first short text and the second short text, the first short text, namely "Xiaoming works in Harbin", is determined from the short text set, and the vector coding, sentence coding and position coding (and, at the same time, the single sentence vector) corresponding to the first short text are obtained; subsequently, an input vector is generated from the vector coding, sentence coding and position coding (and the single sentence vector); the input vector is input into the preset BERT model for processing to obtain the word vector sequence "x1" corresponding to the first short text; and the preset BERT model is fine-tuned according to the word vector sequence "x1" and the classification label "1" corresponding to the first short text, so as to obtain the preset fine-tuning model.
In the method for acquiring the preset fine-tuning model, the training text is first segmented according to the preset length to generate the first short text; the word vector sequence corresponding to the first short text is then determined; and finally the preset BERT model is fine-tuned according to the word vector sequence and the classification label to obtain the fine-tuning model. In practical applications, segmenting the text to be classified according to the preset length ensures that the text to be classified is not limited by the fixed length of the BERT model; at the same time, the preset BERT model is fine-tuned with the training text, and the fine-tuned preset fine-tuning model can capture context information, recognize polysemous words and accurately extract the features of the text to be classified, thereby classifying the text to be classified accurately.
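For illustration only, a minimal sketch of one fine-tuning step on a short text and its classification label, assuming the HuggingFace BertForSequenceClassification head (which classifies from the [CLS] position) as a stand-in for fine-tuning the preset BERT model; the checkpoint, optimizer and learning rate are assumptions.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def fine_tune_step(short_text, label, preset_length=8):
    inputs = tokenizer(short_text, return_tensors="pt",
                       truncation=True, max_length=preset_length)
    outputs = model(**inputs, labels=torch.tensor([label]))
    outputs.loss.backward()   # adjust the preset BERT model toward the classification label
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# e.g. fine_tune_step("小明在哈尔滨", 1) for the first short text with label "1"
```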
Example 4
Fig. 4 is a schematic structural diagram of a long text classification device according to an embodiment of the present application.
As shown in fig. 4, a long text classification apparatus provided in an embodiment of the present application includes:
the text obtaining module 410 is configured to obtain a text to be classified.
The segmentation module 420 is connected with the text acquisition module and is used for segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
a sequence obtaining module 430, connected to the segmentation module, configured to sequentially input the short texts into a preset fine tuning model, so as to obtain a plurality of word vector sequences corresponding to the short texts, where the preset fine tuning model is obtained by performing fine tuning on a preset BERT model by using a training text in advance;
a generating module 440, connected to the sequence obtaining module, configured to generate a plurality of feature vectors corresponding to the plurality of word vector sequences;
and the classification module 450 is connected with the generation module and is used for acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
In this embodiment, the process of implementing long text classification through the above modules is similar to that provided in embodiment 1 of the present application, and is not described in detail here.
The generating module is specifically configured to: perform depth coding on the plurality of word vector sequences by using a preset LSTM network to obtain the plurality of feature vectors corresponding to the plurality of word vector sequences.
Further, as shown in fig. 5, the long text classification apparatus provided in this embodiment further includes:
the preprocessing module 460 is respectively connected with the text acquisition module and the segmentation module, and is used for preprocessing the text to be classified, wherein the preprocessing includes one or more of filtering punctuation marks, supplementing abbreviations, filtering spaces and filtering illegal characters;
the segmentation module 420 is specifically configured to segment the preprocessed text to be classified according to a predetermined length.
In this embodiment, when the long text classification apparatus further includes the preprocessing module, the process of implementing long text classification is similar to that provided in embodiment 2 of the present application, and is not described in detail here.
Further, as shown in fig. 6, the long text classification apparatus provided in this embodiment further includes:
and a model obtaining module 470, connected to the sequence obtaining module, for obtaining the preset fine tuning model.
Specifically, the model obtaining module 470 includes:
the text acquisition unit is used for acquiring the training text and the classification label corresponding to the training text;
the segmentation unit is connected with the text acquisition unit and used for segmenting the training text according to the preset length to obtain a first short text;
the determining unit is connected with the dividing unit and used for determining the vector code, the sentence code and the position code corresponding to the first short text;
the vector generating unit is connected with the determining unit and used for generating an input vector according to the vector code, the sentence code and the position code;
the sequence obtaining unit is connected with the vector generating unit and used for inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text;
and the training unit is respectively connected with the text acquisition unit and the sequence acquisition unit and is used for finely adjusting the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine adjustment model.
In particular, the vector generating unit is specifically configured to: add the vector code, the sentence code and the position code to obtain the input vector.
In this embodiment, when the long text classification apparatus further includes the model obtaining module 470, the process of implementing long text classification is similar to that provided in embodiment 3 of the present application, and is not described in detail here.
In summary, the scheme of the embodiment of the present application uses the segmentation module so that the text to be classified is not limited by the fixed length of the BERT model, and uses the sequence obtaining module so that context information can be captured, polysemous words can be recognized, the features of the text to be classified can be accurately extracted, and the text to be classified can be accurately classified.
In summary, in practical applications of this embodiment, after the text to be classified is processed by the preprocessing module, the segmentation module ensures that the text to be classified is not limited by the fixed length of the BERT model; at the same time, the sequence obtaining module of the embodiment of the present application uses the preset fine-tuning model, obtained by fine-tuning the preset BERT model with the training text, to capture context information and recognize polysemous words, so that the features of the text to be classified are extracted accurately and the text to be classified is classified accurately. The generating module then performs depth coding on the plurality of word vector sequences by using an LSTM network to obtain the plurality of feature vectors corresponding to the word vector sequences, so that various kinds of long text can be classified accurately. The approach of combining the BERT model with the LSTM network can also be extended to other fixed-length pre-training models, after which a neural network or similar model is connected to classify long texts, and the classification accuracy can be improved accordingly.
In order to implement the foregoing embodiments, the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the long text classification method according to the foregoing embodiments is implemented.
In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the long text classification method described in the above embodiments.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (14)

1. A method for classifying long texts, comprising:
acquiring a text to be classified;
segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance;
generating a plurality of feature vectors corresponding to the plurality of word vector sequences;
and obtaining a classification result corresponding to the text to be classified according to the plurality of feature vectors.
2. The method of classifying long texts according to claim 1, wherein before the segmenting of the text to be classified according to the predetermined length, the method further comprises:
preprocessing the text to be classified, wherein the preprocessing comprises one or more of filtering punctuation marks, filling abbreviations, filtering blank spaces and filtering illegal characters;
the segmenting the text to be classified according to the preset length specifically comprises the following steps: and segmenting the preprocessed text to be classified according to a preset length.
3. The method for classifying long texts according to claim 1, wherein before the step of sequentially inputting the plurality of short texts into a preset fine-tuning model, the method further comprises:
and acquiring the preset fine tuning model.
4. The method of classifying long texts according to claim 3, wherein the obtaining the preset fine-tuning model comprises:
acquiring the training text and a classification label corresponding to the training text;
segmenting the training text according to the preset length to obtain a first short text;
determining vector coding, sentence coding and position coding corresponding to the first short text;
generating an input vector according to the vector code, sentence code and position code;
inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text;
and finely adjusting the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine adjustment model.
5. The method of classifying long text according to claim 4, wherein generating an input vector based on the vector coding, sentence coding, and position coding comprises:
adding the vector code, the sentence code and the position code to obtain the input vector.
6. The method of classifying long text according to any one of claims 1-5, wherein the generating a plurality of feature vectors corresponding to the plurality of word vector sequences comprises:
carrying out depth coding on the word vector sequences by utilizing a preset LSTM network to obtain a plurality of feature vectors corresponding to the word vector sequences.
7. A long text classification apparatus, comprising:
the text acquisition module is used for acquiring texts to be classified;
the segmentation module is connected with the text acquisition module and is used for segmenting the text to be classified according to a preset length to obtain a plurality of short texts corresponding to the text to be classified;
the sequence obtaining module is connected with the segmentation module and is used for sequentially inputting the short texts into a preset fine tuning model to obtain a plurality of word vector sequences corresponding to the short texts, wherein the preset fine tuning model is obtained by fine tuning a preset BERT model by using a training text in advance;
the generating module is connected with the sequence acquiring module and used for generating a plurality of feature vectors corresponding to the word vector sequences;
and the classification module is connected with the generation module and is used for acquiring a classification result corresponding to the text to be classified according to the plurality of feature vectors.
8. The long text classification apparatus of claim 7, further comprising:
the preprocessing module is respectively connected with the text acquisition module and the segmentation module and is used for preprocessing the text to be classified, and the preprocessing comprises one or more of punctuation mark filtering, abbreviation supplementing, space filtering and illegal character filtering;
the segmentation module is specifically used for segmenting the preprocessed text to be classified according to a preset length.
9. The long text classification apparatus of claim 7, further comprising:
and the model acquisition module is connected with the sequence acquisition module and is used for acquiring the preset fine tuning model.
10. The apparatus for classifying long texts according to claim 9, wherein the model obtaining module comprises:
the text acquisition unit is used for acquiring the training text and the classification label corresponding to the training text;
the segmentation unit is connected with the text acquisition unit and used for segmenting the training text according to the preset length to obtain a first short text;
the determining unit is connected with the dividing unit and used for determining the vector code, the sentence code and the position code corresponding to the first short text;
the vector generating unit is connected with the determining unit and used for generating an input vector according to the vector code, the sentence code and the position code;
the sequence obtaining unit is connected with the vector generating unit and used for inputting the input vector into the preset BERT model to obtain a word vector sequence corresponding to the first short text;
and the training unit is respectively connected with the text acquisition unit and the sequence acquisition unit and is used for finely adjusting the preset BERT model according to the word vector sequence corresponding to the first short text and the classification label to obtain the preset fine adjustment model.
11. The apparatus for classifying long texts according to claim 10, wherein the vector generation unit is specifically configured to:
adding the vector code, the sentence code and the position code to obtain the input vector.
12. The apparatus according to any of claims 7 to 11, wherein the generating module is specifically configured to:
carrying out depth coding on the word vector sequences by utilizing a preset LSTM network to obtain a plurality of feature vectors corresponding to the word vector sequences.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of classifying long text as claimed in any one of claims 1 to 6 when executing the computer program.
14. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the long text classification method according to any one of claims 1-6.
CN202111041158.XA 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium Pending CN113821637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111041158.XA CN113821637A (en) 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111041158.XA CN113821637A (en) 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113821637A true CN113821637A (en) 2021-12-21

Family

ID=78922047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111041158.XA Pending CN113821637A (en) 2021-09-07 2021-09-07 Long text classification method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113821637A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200251100A1 (en) * 2019-02-01 2020-08-06 International Business Machines Corporation Cross-domain multi-task learning for text classification
CN112070139A (en) * 2020-08-31 2020-12-11 三峡大学 Text classification method based on BERT and improved LSTM
CN112084337A (en) * 2020-09-17 2020-12-15 腾讯科技(深圳)有限公司 Training method of text classification model, and text classification method and equipment
CN113220890A (en) * 2021-06-10 2021-08-06 长春工业大学 Deep learning method combining news headlines and news long text contents based on pre-training

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677691A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114677691B (en) * 2022-04-06 2023-10-03 北京百度网讯科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN114579752A (en) * 2022-05-09 2022-06-03 中国人民解放军国防科技大学 Long text classification method and device based on feature importance and computer equipment
CN114579752B (en) * 2022-05-09 2023-05-26 中国人民解放军国防科技大学 Feature importance-based long text classification method and device and computer equipment

Similar Documents

Publication Publication Date Title
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN110196894B (en) Language model training method and language model prediction method
CN107832353B (en) False information identification method for social media platform
CN108427738B (en) Rapid image retrieval method based on deep learning
CN109816032B (en) Unbiased mapping zero sample classification method and device based on generative countermeasure network
CN111159454A (en) Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN111914644A (en) Dual-mode cooperation based weak supervision time sequence action positioning method and system
CN103984959A (en) Data-driven and task-driven image classification method
CN112905795A (en) Text intention classification method, device and readable medium
CN113821637A (en) Long text classification method and device, computer equipment and readable storage medium
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
CN111695527A (en) Mongolian online handwriting recognition method
CN113326380B (en) Equipment measurement data processing method, system and terminal based on deep neural network
CN112966691A (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN112734803B (en) Single target tracking method, device, equipment and storage medium based on character description
CN112163429B (en) Sentence correlation obtaining method, system and medium combining cyclic network and BERT
CN112966088B (en) Unknown intention recognition method, device, equipment and storage medium
CN114821271B (en) Model training method, image description generation device and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN114722822B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium
CN112131879A (en) Relationship extraction system, method and device
CN115438658A (en) Entity recognition method, recognition model training method and related device
CN112100368B (en) Method and device for identifying dialogue interaction intention
CN112836670A (en) Pedestrian action detection method and device based on adaptive graph network
CN114626378A (en) Named entity recognition method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination