CN113672728A - Text classification method, device, terminal and medium based on classification network model - Google Patents
- Publication number
- CN113672728A (Application No. CN202110877266.4A)
- Authority
- CN
- China
- Prior art keywords: classification, text, initial model, model, training
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
- G06F16/355—Class or cluster creation or modification
Abstract
The invention discloses a text classification method, device, terminal and readable storage medium based on a classification network model. The method comprises the following steps: obtaining sample files of multiple types, and converting each sample file into a multi-source information text comprising a body text, a suffix text, a title text and other information texts; calling a classification initial model constructed on the basis of a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information texts to generate labeled training data, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model; and classifying and identifying a text to be classified on the basis of the classification network model to determine the text category to which it belongs. Because the classification initial model constructed from the multi-source information fusion and feature extraction classification network is trained on multi-source information texts, the classification network model generated for classification classifies texts accurately.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text classification method, a text classification device, a text classification terminal and a text classification medium based on a classification network model.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Text classification in the field of natural language processing is an important and meaningful task in practice, and many companies and enterprises determine the category of internal text by using the text classification.
At present, many neural network algorithms exist for text classification tasks, but these algorithms merely symbolize words: the order of words in the text is lost, semantic relations among words are not considered, and the global features of the text are difficult to extract, so classification of the segmented text is inaccurate. Moreover, owing to the particularity of internal text data, writing habits differ between enterprises and companies, and even different people within the same enterprise/company may write the same type of text in different styles. Traditional neural networks are weak at processing such non-normalized internal text, which further reduces the accuracy of text classification.
Therefore, how to accurately classify each text of an enterprise/company is a technical problem to be solved at present.
Disclosure of Invention
The invention mainly aims to provide a text classification method, a text classification device, a text classification terminal and a text classification medium based on a classification network model, and aims to solve the technical problem of accurately classifying each text of an enterprise/company in the prior art.
In order to achieve the above object, the present invention provides a text classification method based on a classification network model, which comprises:
obtaining multiple types of sample files, and converting each sample file into a multi-source information text, wherein the multi-source information text at least comprises a body text, a suffix text, a title text and other information texts;
calling a classification initial model constructed based on a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information text, generating training data with labels, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model;
and classifying and identifying the texts to be classified based on the classification network model, and determining the text classes to which the texts to be classified belong.
Optionally, the preprocessing includes word segmentation and cleaning, and the step of preprocessing the multi-source information text, generating labeled training data, and transmitting the labeled training data to the classification initial model includes:
performing word segmentation on each multi-source information text according to a preset word segmentation rule to obtain a word group corresponding to each multi-source information text, and filtering and cleaning the segmented words in each word group to obtain a selected word group;
encoding each selected word group based on a preset length to generate a word embedding vector corresponding to each multi-source information text;
constructing category embedding vectors corresponding to a preset number of categories, setting the dimension of each category embedding vector to the dimension of the word embedding vectors, generating labeled training data from the word embedding vectors and the category embedding vectors, and transmitting the labeled training data to the classification initial model.
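The three preprocessing steps above — segmentation, filtering/cleaning, and fixed-length encoding — can be sketched minimally in Python. The whitespace tokenizer, stop-word list, padding token and vocabulary handling below are illustrative assumptions, not part of the disclosure:

```python
STOP_WORDS = {"the", "a", "of"}  # hypothetical cleaning filter
PAD_ID = 0                       # hypothetical padding token id

def segment(text):
    """Whitespace segmentation stands in for the preset word-segmentation rule."""
    return text.lower().split()

def clean(tokens):
    """Filter and clean the segmented words to obtain the selected word group."""
    return [t for t in tokens if t not in STOP_WORDS]

def encode(tokens, vocab, max_len):
    """Encode the selected word group to a fixed preset length (truncate or pad)."""
    ids = [vocab.setdefault(t, len(vocab) + 1) for t in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```

With a preset length of 8, a short title text cleans down to its content words, receives fresh vocabulary ids, and is padded to the fixed length, giving a word embedding input of uniform shape per multi-source information text.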
Optionally, the step of generating the labeled training data and transmitting the labeled training data to the classification initial model, training the classification initial model, and obtaining the classification network model includes:
after generating the training data with the labels, transmitting the training data to the classification initial model, training the classification initial model, and calculating a cross entropy classification loss function value of the classification initial model;
and judging whether the classification initial model reaches a convergence condition or not according to the cross entropy classification loss function value, finishing the training of the classification initial model if the classification initial model reaches the convergence condition, and generating the classification initial model into a classification network model based on target model parameters obtained by training.
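The training-until-convergence logic above reduces to a loop that stops once the cross-entropy classification loss value meets the convergence condition. The threshold value and the simulated training step below are assumptions for illustration:

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy classification loss for one sample (probs sum to 1)."""
    return -math.log(probs[label])

def train_until_converged(train_step, threshold=0.01, max_steps=1000):
    """Repeat training passes until the loss value reaches the convergence condition."""
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = train_step()      # one pass of training; returns the loss value
        if loss < threshold:     # convergence condition on the loss value
            return step, loss
    return max_steps, loss       # safety stop after max_steps
```

Once the loop exits via the convergence branch, the parameters at that step play the role of the target model parameters from which the classification network model is generated.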
Optionally, after the step of judging whether the classification initial model reaches the convergence condition according to the cross entropy classification loss function value, the method further includes:
if the classification initial model does not reach the convergence condition, optimizing the model parameters of the classification initial model based on a preset optimization mechanism;
and executing the step of training the classification initial model based on the optimized model parameters until the classification initial model reaches a convergence condition.
Optionally, the step of training the classification initial model and calculating the cross-entropy classification loss function value of the classification initial model includes:
processing the training data based on the classification initial model, and generating type information vectors in a feature extraction classification layer of the classification initial model;
transmitting the type information vector to a full-connection layer of the classification initial model, and processing the type information vector by the full-connection layer to obtain a training result of the classification initial model;
and calculating a cross entropy classification loss function of the classification initial model based on the training result to obtain the cross entropy classification loss function value.
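A toy-dimension sketch of this last stage: the type information vector is projected by a fully connected layer to one score per class, softmax turns the scores into probabilities, and the cross-entropy loss follows. The weights and dimensions are assumptions for illustration:

```python
import math

def full_connection(vec, weights, biases):
    """Fully connected layer: one score per class from the type information vector."""
    return [sum(v * w for v, w in zip(vec, row)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Normalize the class scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_for(vec, weights, biases, label):
    """Cross-entropy classification loss for one labeled training sample."""
    probs = softmax(full_connection(vec, weights, biases))
    return -math.log(probs[label])
```

The loss is smallest when the fully connected layer already assigns the labeled class the highest score, which is exactly the signal the optimization mechanism pushes the model parameters toward.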
Optionally, before the step of preprocessing the multi-source information text, the method further includes:
and acquiring a public data set, transmitting the public data in the public data set to the classification initial model, and pre-training the classification initial model to update the classification initial model.
Optionally, the step of generating the labeled training data and transmitting the labeled training data to the classification initial model, training the classification initial model, and obtaining the classification network model includes:
after generating the training data with the labels, transmitting the training data to the classification initial model, training the classification initial model, and counting the training times of the classification initial model;
judging whether the training times are matched with preset iteration times or not, finishing the training of the classification initial model if the training times are matched with the preset iteration times, and generating the classification initial model into a classification network model based on target model parameters obtained by training;
and if the training times are not matched with the preset iteration times, optimizing the model parameters of the classification initial model based on a preset optimization mechanism, and executing the step of training the classification initial model based on the optimized model parameters.
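This alternative stopping rule — matching a counted number of training passes against a preset iteration count rather than testing convergence — reduces to a bounded loop. The single scalar parameter and the toy optimizer below are stand-ins, not the patent's mechanism:

```python
def train_fixed_iterations(train_step, optimize, preset_iterations):
    """Train and optimize until the training count matches the preset iteration count."""
    params = 0.0                         # illustrative single model parameter
    for _ in range(preset_iterations):   # counted training passes
        loss = train_step(params)        # one pass of training on the labeled data
        params = optimize(params, loss)  # preset optimization mechanism
    return params                        # target model parameters after the final pass
```

After the preset number of iterations, the returned parameters are treated as the target model parameters from which the classification network model is generated.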
Further, to achieve the above object, the present invention further provides a text classification device based on a classification network model, wherein the text classification device based on the classification network model comprises:
the conversion module is used for obtaining a plurality of types of sample files and converting each sample file into a multi-source information text, wherein the multi-source information text at least comprises a body text, a suffix text, a title text and other information texts;
the calling module is used for calling a classification initial model constructed based on a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information text, generating training data with labels, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model;
and the determining module is used for classifying and identifying the texts to be classified based on the classification network model and determining the text categories to which the texts to be classified belong.
Further, in order to achieve the above object, the present invention further provides a text classification terminal based on a classification network model, where the text classification terminal based on the classification network model includes: a memory, a processor and a control program stored on the memory and executable on the processor, the control program when executed by the processor implementing the steps of the method for classifying a text based on a classification network model as described above.
Further, to achieve the above object, the present invention also provides a readable storage medium, on which a control program is stored, which when executed by a processor, implements the steps of the text classification method based on the classification network model as described above.
The text classification method, device, terminal and medium based on the classification network model first obtain multiple types of sample files from multiple enterprises/companies and convert each sample file into a multi-source information text containing a body text, a suffix text, a title text and other information texts; then call a classification initial model constructed from the multi-source information fusion and feature extraction classification network, preprocess the converted multi-source information texts to generate labeled training data, transmit the training data to the classification initial model, and train the classification initial model to obtain a classification network model; and then, when a text classification requirement exists, classify and identify the text to be classified through the trained classification network model to determine the text category to which it belongs. Multi-source information fusion makes full use of the suffix, title and other information beyond the body text, is applicable to different document styles, and helps improve classification accuracy; meanwhile, the feature extraction classification network can rely entirely on an attention mechanism to model the global dependencies between input and output, avoiding the difficulty of extracting global text features and further improving classification accuracy. Therefore, training the classification initial model constructed from the multi-source information fusion and feature extraction classification network on multi-source information texts, and generating the classification network model to classify texts, classifies the texts accurately.
Drawings
FIG. 1 is a schematic structural diagram of a hardware operating environment according to an embodiment of a text classification terminal based on a classification network model;
FIG. 2 is a flowchart illustrating a text classification method based on a classification network model according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a text classification method based on a classification network model according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a text classification method based on a classification network model according to a third embodiment of the present invention;
FIG. 5 is a flowchart illustrating a fourth embodiment of a text classification method based on a classification network model according to the present invention;
FIG. 6 is a schematic diagram of a process of classifying texts by a classification network model in the text classification method based on the classification network model according to the present invention;
FIG. 7 is a schematic diagram of data processed by the full-connection layer of the classification network model in the text classification method based on the classification network model according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a text classification terminal based on a classification network model.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a server hardware operating environment according to an embodiment of the text classification terminal based on a classification network model of the present invention.
As shown in fig. 1, the text classification terminal based on the classification network model may include a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the hardware configuration of the classification network model-based text classification terminal shown in fig. 1 does not constitute a limitation of the classification network model-based text classification terminal, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a control program. The operating system is a program for managing and controlling the text classification terminal and software resources based on the classification network model, and supports the operation of a network communication module, a user interface module, a control program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the hardware structure of the text classification terminal based on the classification network model shown in fig. 1, the network interface 1004 is mainly used for connecting a background server to perform data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; the processor 1001 may call the control program stored in the memory 1005 and perform the following operations:
obtaining multiple types of sample files, and converting each sample file into a multi-source information text, wherein the multi-source information text at least comprises a body text, a suffix text, a title text and other information texts;
calling a classification initial model constructed based on a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information text, generating training data with labels, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model;
and classifying and identifying the texts to be classified based on the classification network model, and determining the text classes to which the texts to be classified belong.
Further, the preprocessing comprises word segmentation and cleaning, the preprocessing is performed on the multi-source information text, and the step of generating the training data with labels and transmitting the training data to the classification initial model comprises the following steps:
performing word segmentation on each multi-source information text according to a preset word segmentation rule to obtain a word group corresponding to each multi-source information text, and filtering and cleaning the segmented words in each word group to obtain a selected word group;
encoding each selected word group based on a preset length to generate a word embedding vector corresponding to each multi-source information text;
constructing category embedding vectors corresponding to a preset number of categories, setting the dimension of each category embedding vector to the dimension of the word embedding vectors, generating labeled training data from the word embedding vectors and the category embedding vectors, and transmitting the labeled training data to the classification initial model.
Further, the step of generating the labeled training data and transmitting the labeled training data to the classification initial model, training the classification initial model, and obtaining the classification network model includes:
after generating the training data with the labels, transmitting the training data to the classification initial model, training the classification initial model, and calculating a cross entropy classification loss function value of the classification initial model;
and judging whether the classification initial model reaches a convergence condition or not according to the cross entropy classification loss function value, finishing the training of the classification initial model if the classification initial model reaches the convergence condition, and generating the classification initial model into a classification network model based on target model parameters obtained by training.
Further, after the step of judging whether the classification initial model reaches a convergence condition according to the cross entropy classification loss function value, the processor 1001 may call the control program stored in the memory 1005 and perform the following operations:
if the classification initial model does not reach the convergence condition, optimizing the model parameters of the classification initial model based on a preset optimization mechanism;
and executing the step of training the classification initial model based on the optimized model parameters until the classification initial model reaches a convergence condition.
Further, the step of training the classification initial model and calculating the cross-entropy classification loss function value of the classification initial model includes:
processing the training data based on the classification initial model, and generating type information vectors in a feature extraction classification layer of the classification initial model;
transmitting the type information vector to a full-connection layer of the classification initial model, and processing the type information vector by the full-connection layer to obtain a training result of the classification initial model;
and calculating a cross entropy classification loss function of the classification initial model based on the training result to obtain the cross entropy classification loss function value.
Further, before the step of preprocessing the multi-source information text, the processor 1001 may call the control program stored in the memory 1005 and perform the following operations:
and acquiring a public data set, transmitting the public data in the public data set to the classification initial model, and pre-training the classification initial model to update the classification initial model.
Further, the step of generating the labeled training data and transmitting the labeled training data to the classification initial model, training the classification initial model, and obtaining the classification network model includes:
after generating the training data with the labels, transmitting the training data to the classification initial model, training the classification initial model, and counting the training times of the classification initial model;
judging whether the training times are matched with preset iteration times or not, finishing the training of the classification initial model if the training times are matched with the preset iteration times, and generating the classification initial model into a classification network model based on target model parameters obtained by training;
and if the training times are not matched with the preset iteration times, optimizing the model parameters of the classification initial model based on a preset optimization mechanism, and executing the step of training the classification initial model based on the optimized model parameters.
The implementation of the text classification terminal based on the classification network model of the present invention is basically the same as the following embodiments of the text classification method based on the classification network model, and is not described herein again.
The invention provides a text classification method based on a classification network model, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the text classification method based on the classification network model.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown. Specifically, the text classification method based on the classification network model in the embodiment includes:
step S10, obtaining multiple types of sample files, and converting each sample file into a multi-source information text, wherein the multi-source information text at least comprises a text, a suffix text, a title text and other information texts;
the text classification method of the embodiment is suitable for the text classification terminal, and the text classification terminal can be a mobile terminal such as a mobile phone, a tablet computer, a notebook computer and the like, and can also be a fixed terminal such as a desktop computer, an intelligent television and the like. A classification network model constructed and trained by a multi-source information fusion and feature extraction classification network is deployed at a text classification terminal to identify internal texts of companies and enterprises and institutions so as to perform classification management on the texts. The divided categories can be set according to requirements, for example, the divided categories include financial and financial reports text, rules and regulations text, reward and punishment reporting text, and appointments text.
Specifically, multiple types of sample files are acquired from multiple companies and enterprises. The types include, but are not limited to, doc, docx, ppt, pptx, xls, xlsx, png, jpg, tiff, dwg and pdf, and each sample file contains readable text content. The readable text content of each sample file is parsed according to the different information types contained in the sample file, and the sample file is converted into a multi-source information text, i.e. a text file containing readable characters. The information types at least include body information, title information, suffix information, author information, time information and other text information; according to these information types, the sample file is parsed and converted into a multi-source information text at least comprising a body text, a suffix text, a title text and other information texts for later model training.
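The conversion above can be pictured as building one small record per sample file. The field names and the way the readable content is supplied are assumptions for illustration — the disclosure only requires that body, suffix, title and other information be separated:

```python
from pathlib import Path

def to_multi_source_text(filename, body, title, other=""):
    """Convert one sample file's readable content into a multi-source information text."""
    return {
        "body": body,                                  # body text
        "suffix": Path(filename).suffix.lstrip("."),   # suffix text, e.g. "docx"
        "title": title,                                # title text
        "other": other,                                # other information text
    }
```

Every record then has the same four-part shape regardless of the original file type, which is what lets a single preprocessing pipeline handle doc, pdf, png and the rest uniformly.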
Step S20, calling a classification initial model constructed based on a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information text, generating training data with labels, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model;
further, a classification initial model is constructed in advance according to the multi-source information fusion and feature extraction classification network; the multi-source information fusion is a multi-level and multi-aspect processing process of various information, including detection, combination and estimation of multi-source data, so as to improve the precision of state and characteristic estimation; a feature extraction classification network, i.e. transformations, is a network that performs the function of feature extraction and classification, and contains a plurality of sub-transformation modules. In this embodiment, the preset classification initial model is called, and the labeled training data generated by preprocessing the multi-source information text is transmitted to the classification initial model, so as to train the classification initial model, and obtain a classification network model for text classification.
Understandably, to ensure that the classification network model classifies accurately, a large amount of labeled training data would be needed to train the classification initial model. However, labeling a large amount of training data usually consumes considerable manpower and material resources; labeling efficiency is low and errors occur easily. To avoid such problems, this embodiment provides a mechanism in which the model is first pre-trained on public data and then adjusted with relatively few labeled training data. Specifically, before the step of preprocessing the multi-source information text, the method includes:
step a, obtaining a public data set, transmitting the public data in the public data set to the classification initial model, and pre-training the classification initial model to update the classification initial model.
Further, the public data set is obtained over the network, and the structure of each item of public data in the set is adjusted to match the network structure of the classification initial model. The adjusted public data is transmitted to the classification initial model, which is pre-trained to obtain model parameters and is thereby updated. Afterwards, the updated initial model continues to be trained with the labeled training data generated from the multi-source information text, yielding the classification network model finally used for text classification.
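As a rough illustration of this pre-train-then-adjust mechanism, the sketch below first trains a stand-in model on a large "public" data set and then continues training on a small labeled set. The logistic-regression stand-in, the data shapes and the learning rate are all assumptions for illustration, not the patent's actual Transformer network:

```python
import numpy as np

def train(w, X, y, lr=0.5, epochs=100):
    """One training phase: logistic-regression updates standing in for the real network."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        w = w - lr * X.T @ (p - y) / len(y)  # gradient step on the cross-entropy loss
    return w

rng = np.random.default_rng(0)
# Phase 1: pre-train on a large public data set
X_pub = rng.normal(size=(200, 4))
y_pub = (X_pub[:, 0] > 0).astype(float)
w = train(np.zeros(4), X_pub, y_pub)
# Phase 2: continue training on a small, manually labeled in-domain set
X_lab = rng.normal(size=(20, 4))
y_lab = (X_lab[:, 0] > 0).astype(float)
w = train(w, X_lab, y_lab)
```

The pre-training phase supplies the bulk of the parameter updates, so the second phase needs far fewer labeled samples, which is the labor-saving point of the mechanism.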
Step S30, classifying and identifying the text to be classified based on the classification network model, and determining the text category to which the text to be classified belongs.
Further, when there is a need to classify a text, the text is passed to the text classification terminal as the text to be classified. The text classification terminal classifies and identifies the text to be classified based on the trained classification network model to obtain a category identifier, and determines the text category to which the text belongs from the text category corresponding to each category identifier. For example, if a company divides its texts into 13 categories, namely category A, category B, category C, ..., category M, with corresponding category identifiers 1, 2, 3, ..., 13, and the category identifier obtained through the classification network model is 3, then the text to be classified is determined to belong to category C. Afterwards, the text to be classified is managed according to the management rules corresponding to its text category, such as whether encryption is required and at what encryption level. In this way, orderly management of various files is achieved while ensuring their accurate classification.
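The identifier-to-category lookup in the example above can be sketched as follows; the 13 categories and identifiers 1 through 13 follow the example, and the dictionary and function names are hypothetical:

```python
# Hypothetical mapping of category identifiers (1..13) to categories A..M
CATEGORY_BY_ID = {i + 1: chr(ord("A") + i) for i in range(13)}

def category_of(class_id: int) -> str:
    """Return the text category corresponding to a model-produced category identifier."""
    return CATEGORY_BY_ID[class_id]
```

With identifier 3 returned by the classification network model, the lookup yields category C, as in the example.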
Referring to fig. 6, fig. 6 shows a specific process of text recognition in a specific embodiment. A common-format file serves as the text to be classified and is converted by the conversion module into a readable text containing body, suffix, title and other multi-source information, i.e., a multi-source information text. The multi-source information text is then preprocessed and encoded to obtain word embedding vectors of sizes n1 × 256, n2 × 256, n3 × 256, ..., ni × 256. Thereafter, a category embedding vector of size n_class × 256, matching the number of categories, is constructed; the word embedding vectors and the category embedding vector form the data to be processed and are transmitted to the classification network model. Features are extracted in the feature extraction classification layer (Transformers) to generate a type information vector, which is processed by the fully-connected layer FC of the classification network model to output the category to which the common-format file belongs.
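A minimal shape-level sketch of the fig. 6 pipeline is given below. The single self-attention pass, the token counts and the random weights are placeholders for the real Transformer layers and learned parameters, so only the tensor shapes, not the numbers, are meaningful:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_CLASS = 256, 13                 # embedding width and number of categories

# Word embeddings from the four text sources (body, suffix, title, other);
# the token counts n1..n4 are made up for the sketch
words = np.vstack([rng.normal(size=(n, D)) for n in (5, 1, 3, 2)])
cls = rng.normal(size=(N_CLASS, D))  # class-token embeddings (random stand-ins)
x = np.vstack([cls, words])          # data to be processed

def attention(x):
    """Single self-attention pass standing in for the Transformer layers."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

type_info = attention(x)[:N_CLASS]   # type information vectors at the class tokens
W_fc = rng.normal(size=(D,))         # fully-connected layer: one score per class token
logits = type_info @ W_fc            # one-dimensional output, length N_CLASS
predicted = int(np.argmax(logits))   # index of the category the file is assigned to
```

The class tokens attend over the word tokens, so after the attention pass each class token carries category-relevant information, matching the patent's description of the type information vector.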
The text classification method based on the classification network model first obtains multiple types of sample files from multiple enterprises/companies and converts each sample file into a multi-source information text containing a body text, a suffix text, a title text and other information texts. Then a classification initial model constructed from the multi-source information fusion and feature extraction classification network is called, the converted multi-source information texts are preprocessed to generate labeled training data, which is transmitted to the classification initial model, and the classification initial model is trained to obtain a classification network model. When a text classification requirement arises, the text to be classified is classified and identified through the trained classification network model to determine the text category to which it belongs. The multi-source information fusion makes extensive use of suffix, title and other information beyond the body text, is applicable to different types of document styles, and helps improve classification accuracy; meanwhile, the feature extraction classification network can rely entirely on an attention mechanism to model the global dependency between input and output, avoiding the difficulty of extracting global text features and further improving classification accuracy. Therefore, training the classification initial model constructed from the multi-source information fusion and feature extraction classification network on the multi-source information texts, and generating the classification network model to classify texts, achieves accurate text classification.
Further, referring to fig. 3, a second embodiment of the text classification method based on the classification network model is provided based on the first embodiment of the text classification method based on the classification network model.
The second embodiment of the text classification method based on the classification network model is different from the first embodiment of the text classification method based on the classification network model in that the preprocessing comprises word segmentation and cleaning, and the preprocessing of the multi-source information text, the generation of the training data with tags and the transmission of the training data to the classification initial model comprise the following steps:
step S21, performing word segmentation processing on each multi-source information text according to a preset word segmentation rule to obtain a word segmentation group corresponding to each multi-source information text, and filtering and cleaning the words in each word segmentation group to obtain a selected word segmentation group;
step S22, coding each selected phrase based on a preset length, and generating a word embedding vector corresponding to each multi-source information text;
step S23, constructing category embedding vectors corresponding to a preset number of categories, and generating the word embedding vectors and the category embedding vectors as tagged training data to be transmitted to the classification initial model after setting the dimensions of the category embedding vectors as the dimensions of the word embedding vectors.
In the embodiment, each item of multi-source text information is preprocessed and then encoded to form a word embedding vector and a category embedding vector representing a category, the word embedding vector and the category embedding vector form training data with labels, and a classification initial model updated by public data is trained.
Specifically, the preprocessing includes at least word segmentation and cleaning. For word segmentation, each multi-source information text is segmented according to a preset word segmentation rule: the sentences of each multi-source information text are divided into a plurality of words, forming a word segmentation group for that text. Each word segmentation group is then filtered and cleaned to remove meaningless words, such as connective words ('and', 'of') and modal particles, to obtain the selected word group. One multi-source information text corresponds to one selected word group, and the semantics of each word in the selected word group represent the textual meaning of that multi-source information text.
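A minimal sketch of the word segmentation and cleaning step, assuming a simple alphanumeric segmentation rule and a tiny illustrative stop-word list (the patent's preset segmentation rule and real stop-word lists would be far richer):

```python
import re

STOPWORDS = {"and", "the", "of", "a"}  # illustrative stop-word list, an assumption

def preprocess(text: str) -> list[str]:
    """Segment the text into words, then filter out meaningless connective words,
    returning the selected word group for one multi-source information text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

Each input text yields one selected word group whose remaining words carry the text's meaning, which is the property the subsequent encoding step relies on.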
Furthermore, the selected word groups are built into a word bank, and the preset length is set in advance according to requirements. Words up to the preset length are selected from each selected word group and encoded to obtain the word vector of each selected word group as the word embedding vector of the corresponding multi-source information text. As shown in fig. 6, in this specific embodiment the word embedding vector is 256-dimensional and can be represented as n_t × 256.
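The fixed-length encoding could be sketched as below, assuming a hypothetical preset length of 8 tokens and a random stand-in table in place of the learned 256-dimensional embeddings:

```python
import numpy as np

D, MAX_LEN = 256, 8   # embedding width and an assumed preset length

vocab = {}            # word bank built up from the selected word groups

def encode(words):
    """Encode one selected word group as a MAX_LEN x D word-embedding matrix,
    truncating long groups and padding short ones (padding with index 0 is a
    simplification of whatever padding scheme the embodiment actually uses)."""
    rng = np.random.default_rng(42)
    table = rng.normal(size=(1000, D))  # stand-in for the learned embedding table
    ids = [vocab.setdefault(w, len(vocab)) for w in words][:MAX_LEN]
    ids += [0] * (MAX_LEN - len(ids))
    return table[ids]
```

Every text thus maps to a matrix of the same n_t × 256 shape, so batches of differently sized word groups can be fed to the network uniformly.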
Further, the number of categories divided in advance according to requirements is used as the preset category number, and category embedding vectors equal in number to the preset category number are constructed and named class Tokens, which can be expressed as an n_class × 256 matrix. The category embedding vector is a learnable embedding vector: it is initialized at the start of training and, as training proceeds, learns a weight encoding that specifically reflects the category it represents. The dimension of the category embedding vector is set to be the same as that of the word embedding vector; the word embedding vector and the category embedding vector are combined into training data to be labeled, a label is added, and the resulting labeled training data is transmitted to the classification initial model to train it.
In the embodiment, the selected word group generated by preprocessing the word segmentation and cleaning of the multi-source information text more accurately reflects the content of the multi-source information text and reflects the type of the text from which the multi-source information text comes, so that the accuracy of training the classification initial model according to the training data to be labeled formed by the selected word group is ensured, and the training is favorable for generating the classification network model to accurately classify the text.
Further, referring to fig. 4, a third embodiment of the text classification method based on the classification network model is provided based on the first or second embodiment of the text classification method based on the classification network model of the present invention.
The third embodiment of the text classification method based on the classification network model differs from the first or second embodiment in that the step of transmitting the generated labeled training data to the classification initial model, training the classification initial model and obtaining the classification network model comprises the following steps:
step S24, after generating the training data with labels, transmitting the training data to the classification initial model, training the classification initial model, and calculating the cross entropy classification loss function value of the classification initial model;
in this embodiment, whether the trained classification initial model has reached the convergence condition that ends training is judged through the cross entropy classification loss function. Specifically, after the multi-source information text is used to generate the labeled training data, the training data is transmitted to the classification initial model, which processes it to train the classification model. After each round of training, the function value of the cross entropy classification loss function of the classification initial model is calculated, and classification accuracy is evaluated through this value: the smaller the function value, the more accurate the classification; conversely, a larger function value indicates less accurate classification.
Further, the step of training the classification initial model and calculating the cross-entropy classification loss function value of the classification initial model includes:
step S241, processing the training data based on the classification initial model, and generating type information vectors in a feature extraction classification layer of the classification initial model;
step S242, transmitting the type information vector to a full connection layer of the classification initial model, and processing the type information vector by the full connection layer to obtain a training result of the classification initial model;
step S243, calculating a cross entropy classification loss function of the classification initial model based on the training result, and obtaining the cross entropy classification loss function value.
Furthermore, the classification initial model formed based on the multi-source information fusion and feature extraction classification network comprises a feature extraction classification layer (Transformers). The classification initial model processes the training data and generates a type information vector at the feature extraction classification layer, where the type information vector contains information representing the categories and has the same dimension as the word embedding vector.
Further, the type information vector is transmitted to the fully-connected layer of the classification initial model, and the fully-connected layer processes the type information vector and outputs a one-dimensional processing result, as shown in fig. 7. The output processing result is a training result obtained by the classification of the classification initial model to the training data, and a cross entropy classification function of the classification initial model is calculated according to the training result to obtain a cross entropy classification loss function value representing whether the classification processing is accurate or not.
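The cross entropy classification loss computed from the fully-connected layer's output can be sketched as follows, assuming that the one-dimensional output is a vector of logits passed through a softmax:

```python
import math

def softmax(logits):
    """Convert the fully-connected layer's one-dimensional output into probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Cross entropy classification loss for one sample: -log p(true class)."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 0.5, 0.1])
loss_correct = cross_entropy(probs, 0)  # small: the model favours the right class
loss_wrong = cross_entropy(probs, 2)    # large: the model is confident and wrong
```

The contrast between the two loss values illustrates the evaluation rule in the text: the smaller the function value, the more accurate the classification.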
Step S25, judging whether the classification initial model reaches the convergence condition according to the cross entropy classification loss function value, finishing the training of the classification initial model if the classification initial model reaches the convergence condition, and generating the classification initial model into a classification network model based on the target model parameters obtained by training.
Further, a preset value representing convergence of the model is preset, and the preset value is used as a convergence condition. And determining whether the classification initial model reaches a convergence condition or not by comparing the magnitude relation between the cross entropy classification loss function value and the preset value. Specifically, if the cross entropy classification loss function value is determined to be smaller than the preset value through comparison, the classification accuracy of the classification initial model is high, and the convergence condition is determined to be reached. At this time, training of the classification initial model is completed, model parameters obtained through training are fixed as target model parameters for processing text classification, the classification initial model is generated into a classification network model, and the classification network model performs text classification based on the target model parameters.
Further, for the case that the classification initial model does not reach the convergence condition, the step of judging whether the classification initial model reaches the convergence condition according to the cross entropy classification loss function value includes:
step S251, if the classification initial model does not reach the convergence condition, optimizing the model parameters of the classification initial model based on a preset optimization mechanism;
step S252, based on the optimized model parameters, executing a step of training the classification initial model until the classification initial model reaches a convergence condition.
Furthermore, if the cross entropy classification loss function value is determined to be not less than the preset value through comparison, the classification accuracy of the classification initial model is not high, and the classification initial model is judged not to reach the convergence condition. At this time, the model parameters of the classification initial model are optimized according to a preset optimization mechanism preset in the classification initial model. And controlling the classification initial model to classify the training data based on the optimized model parameters, so as to train the classification initial model, calculate a new cross entropy classification loss function value and compare the new cross entropy classification loss function value with a preset value. And circulating in this way, once the cross entropy classification loss function value is smaller than the preset value, judging that the classification initial model reaches the convergence condition, and generating the classification initial model into a classification network model for classifying the text.
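The optimize-retrain-recompare loop described above can be sketched as below; the `step` callback, standing in for one optimization-plus-loss-evaluation round, is an assumed interface rather than anything named in the patent:

```python
def train_until_converged(step, threshold, max_steps=1000):
    """Repeat: optimize the model parameters, recompute the cross entropy loss,
    and stop as soon as the loss drops below the preset value (convergence).
    `step` is a caller-supplied function returning the current loss."""
    loss = float("inf")
    for i in range(max_steps):
        loss = step()
        if loss < threshold:
            return i + 1, loss  # converged: rounds taken and final loss
    return max_steps, loss      # safety cap in case convergence is never reached
```

For example, a run whose loss values are 0.9, 0.5, 0.2, 0.05 against a preset value of 0.1 converges on the fourth round.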
In this embodiment, whether the classification initial model meets the convergence condition is determined by calculating the cross entropy classification loss function value, and only after the convergence condition is reached, the classification initial model is generated into a classification network model used for text classification, which is beneficial to ensuring that the generated classification network model accurately classifies texts.
Further, referring to fig. 5, a fourth embodiment of the text classification method based on the classification network model is provided based on the first, second or third embodiments of the text classification method based on the classification network model of the present invention.
The fourth embodiment of the text classification method based on the classification network model differs from the first, second or third embodiment in that the step of transmitting the generated labeled training data to the classification initial model, training the classification initial model and obtaining the classification network model comprises the following steps:
step S26, after generating the labeled training data, transmitting the training data to the classification initial model, training the classification initial model, and counting the training times of the classification initial model;
step S27, judging whether the training times are matched with preset iteration times or not, finishing the training of the classification initial model if the training times are matched with the preset iteration times, and generating the classification initial model into a classification network model based on target model parameters obtained by training;
and step S28, if the training times are not matched with the preset iteration times, optimizing the model parameters of the classification initial model based on a preset optimization mechanism, and executing the step of training the classification initial model based on the optimized model parameters.
In this embodiment, whether the trained classification initial model reaches the condition of training end is determined by setting the maximum iteration number. Specifically, the maximum iteration times is used as preset iteration times, after the multi-source information text is generated into the training data with the labels, the training data is transmitted to the classification initial model, the classification initial model is used for processing, the classification model is trained, and the training times of the classification initial model are counted. And then, comparing the counted training times with the preset iteration training times, and judging whether the training times are matched with the preset iteration times by judging whether the training times are greater than or equal to the preset iteration times. If the training times are larger than or equal to the preset iteration times, judging that the training times are matched with the iteration times, otherwise, judging that the training times are not matched with the iteration times.
Further, for the condition that the training times are matched with the preset iteration times, training of the classification initial model is completed, model parameters of the classification initial model obtained through training are fixed as target model parameters for processing text classification, the classification initial model is generated into a classification network model, and the classification network model performs text classification based on the target model parameters.
Furthermore, for the case that the training times are less than the preset iteration times and the training times are not matched with the preset iteration times, the model parameters of the classification initial model are optimized according to a preset optimization mechanism preset in the classification initial model. And controlling the classification initial model to classify the training data based on the optimized model parameters, so as to train the classification initial model, and comparing the accumulated statistical training times with the preset iteration times. And circulating in the way, once the training times are more than or equal to the preset iteration times, judging that the training times are matched with the preset iteration times, and generating a classification initial model into a classification network model for classifying the text.
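The iteration-count stopping rule can be sketched in the same style; `step` again stands in for one optimization round and is an assumed interface:

```python
def train_fixed_iterations(step, preset_iterations):
    """Train until the counted number of training rounds matches the preset
    iteration count, then fix the parameters obtained as the target parameters.
    `step` takes the current parameters and returns the optimized ones."""
    count = 0
    params = None
    while count < preset_iterations:  # "not matched": keep optimizing
        params = step(params)
        count += 1
    return count, params              # matched: training is complete
```

Compared with the convergence-based rule of the third embodiment, this rule guarantees termination after a fixed amount of training regardless of how the loss behaves.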
In the embodiment, whether the classification initial model reaches the condition of training completion is judged by counting the iterative training times, and only after the condition of training completion is reached, the classification initial model is generated into the classification network model used for text classification, so that the full training of the classification network model is ensured, and the text can be accurately classified by the generated classification network model.
The embodiment of the present invention further provides a text classification device based on a classification network model, where the text classification device based on the classification network model includes:
the conversion module is used for obtaining a plurality of types of sample files and converting each sample file into a multi-source information text, wherein the multi-source information text at least comprises a body text, a suffix text, a title text and other information texts;
the calling module is used for calling a classification initial model constructed based on a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information text, generating training data with labels, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model;
and the determining module is used for classifying and identifying the texts to be classified based on the classification network model and determining the text categories to which the texts to be classified belong.
Further, the preprocessing includes word segmentation and cleaning, and the calling module is further configured to:
performing word segmentation processing on each multi-source information text according to a preset word segmentation rule to obtain a word segmentation group corresponding to each multi-source information text, and filtering and cleaning the word segmentation in each word segmentation group to obtain a selected word segmentation group;
coding each selected phrase group based on a preset length to generate a word embedding vector corresponding to each multi-source information text;
constructing a category embedding vector corresponding to a preset category number, setting the dimension of the category embedding vector as the dimension of the word embedding vector, generating the word embedding vector and the category embedding vector into labeled training data, and transmitting the labeled training data to the classification initial model.
Further, the invoking module is further configured to:
after generating the training data with the labels, transmitting the training data to the classification initial model, training the classification initial model, and calculating a cross entropy classification loss function value of the classification initial model;
and judging whether the classification initial model reaches a convergence condition or not according to the cross entropy classification loss function value, finishing the training of the classification initial model if the classification initial model reaches the convergence condition, and generating the classification initial model into a classification network model based on target model parameters obtained by training.
Further, the invoking module is further configured to:
if the classification initial model does not reach the convergence condition, optimizing the model parameters of the classification initial model based on a preset optimization mechanism;
and executing the step of training the classification initial model based on the optimized model parameters until the classification initial model reaches a convergence condition.
Further, the invoking module is further configured to:
and acquiring a public data set, transmitting the public data in the public data set to the classification initial model, and pre-training the classification initial model to update the classification initial model.
Further, the invoking module is further configured to:
after generating the training data with the labels, transmitting the training data to the classification initial model, training the classification initial model, and counting the training times of the classification initial model;
judging whether the training times are matched with preset iteration times or not, finishing the training of the classification initial model if the training times are matched with the preset iteration times, and generating the classification initial model into a classification network model based on target model parameters obtained by training;
and if the training times are not matched with the preset iteration times, optimizing the model parameters of the classification initial model based on a preset optimization mechanism, and executing the step of training the classification initial model based on the optimized model parameters.
The specific implementation of the text classification device based on the classification network model is basically the same as that of the text classification method based on the classification network model, and is not described herein again.
The embodiment of the invention also provides a readable storage medium. The readable storage medium has stored thereon a control program which, when executed by the processor, implements the steps of the method for classifying a text based on a classification network model as described above.
The readable storage medium of the present invention may be a computer readable storage medium, and the specific implementation manner of the readable storage medium of the present invention is basically the same as that of each embodiment of the text classification method based on the classification network model, and will not be described herein again.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive. Those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein.
Claims (10)
1. A text classification method based on a classification network model is characterized by comprising the following steps:
obtaining multiple types of sample files, and converting each sample file into a multi-source information text, wherein the multi-source information text at least comprises a body text, a suffix text, a title text and other information texts;
calling a classification initial model constructed based on a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information text, generating training data with labels, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model;
and classifying and identifying the texts to be classified based on the classification network model, and determining the text classes to which the texts to be classified belong.
2. The text classification method based on the classification network model according to claim 1, wherein the preprocessing comprises word segmentation and cleaning, and the step of preprocessing the multi-source information text to generate labeled training data transmitted to the classification initial model comprises:
performing word segmentation processing on each multi-source information text according to a preset word segmentation rule to obtain a word segmentation group corresponding to each multi-source information text, and filtering and cleaning the word segmentation in each word segmentation group to obtain a selected word segmentation group;
coding each selected phrase group based on a preset length to generate a word embedding vector corresponding to each multi-source information text;
constructing a category embedding vector corresponding to a preset category number, setting the dimension of the category embedding vector as the dimension of the word embedding vector, generating the word embedding vector and the category embedding vector into labeled training data, and transmitting the labeled training data to the classification initial model.
3. The text classification method based on the classification network model according to claim 1, wherein the step of generating labeled training data transmitted to the classification initial model, training the classification initial model and obtaining the classification network model comprises:
after generating the training data with the labels, transmitting the training data to the classification initial model, training the classification initial model, and calculating a cross entropy classification loss function value of the classification initial model;
and judging whether the classification initial model reaches a convergence condition or not according to the cross entropy classification loss function value, finishing the training of the classification initial model if the classification initial model reaches the convergence condition, and generating the classification initial model into a classification network model based on target model parameters obtained by training.
4. The text classification method based on the classification network model according to claim 3, wherein the step of judging whether the classification initial model reaches the convergence condition based on the cross entropy classification loss function value comprises:
if the classification initial model does not reach the convergence condition, optimizing the model parameters of the classification initial model based on a preset optimization mechanism;
and executing the step of training the classification initial model based on the optimized model parameters until the classification initial model reaches a convergence condition.
5. The text classification method based on the classification network model according to claim 3, wherein the step of training the classification initial model and calculating the cross entropy classification loss function value of the classification initial model comprises:
processing the training data based on the classification initial model, and generating a type information vector at a feature extraction classification layer of the classification initial model;
transmitting the type information vector to a fully connected layer of the classification initial model, and processing the type information vector through the fully connected layer to obtain a training result of the classification initial model; and
calculating the cross-entropy classification loss function of the classification initial model based on the training result to obtain the cross-entropy classification loss function value.
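As a concrete illustration of the forward pass in claim 5 (feature extraction classification layer → type information vector → fully connected layer → cross-entropy loss), a minimal NumPy sketch follows. The layer sizes, the tanh activation, and the softmax output are assumptions made for the example; the claim does not fix them.

```python
import numpy as np

def forward_and_loss(x, W_feat, W_fc, label):
    """One sample through the two layers of claim 5: returns class
    probabilities and the cross-entropy classification loss value."""
    # feature extraction classification layer -> type information vector
    type_vec = np.tanh(W_feat @ x)
    # fully connected layer -> class scores (the training result)
    logits = W_fc @ type_vec
    # softmax to probabilities, stabilised by subtracting the max logit
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # cross-entropy classification loss for the true label
    return probs, -np.log(probs[label])
```

The probabilities sum to one by construction, and the loss is the negative log-probability assigned to the true class.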
6. The classification-network-based text classification method according to any one of claims 1-5, wherein the step of preprocessing the multi-source information text is preceded by:
acquiring a public data set, transmitting the public data in the public data set to the classification initial model, and pre-training the classification initial model to update the classification initial model.
7. The text classification method based on a classification network model according to claim 1, wherein the step of transmitting the labeled training data to the classification initial model and training the classification initial model to obtain the classification network model comprises:
after the labeled training data are generated, transmitting the training data to the classification initial model, training the classification initial model, and counting the number of training iterations of the classification initial model;
determining whether the training count matches a preset iteration count; if so, ending the training of the classification initial model and generating the classification network model from the classification initial model based on target model parameters obtained by the training; and
if the training count does not match the preset iteration count, optimizing the model parameters of the classification initial model based on a preset optimization mechanism, and executing the step of training the classification initial model based on the optimized model parameters.
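Claim 7 replaces the loss-based convergence test of claims 3 and 4 with a fixed iteration budget. A minimal pure-Python sketch, with `train_step`, `optimize`, and `num_iters` as illustrative placeholders:

```python
def train_fixed_iterations(params, train_step, optimize, num_iters=100):
    """Train until the counted number of training passes matches
    the preset iteration count (the stopping rule of claim 7)."""
    for count in range(1, num_iters + 1):
        loss = train_step(params)      # one training pass over the data
        if count == num_iters:         # training count matches preset count
            return params              # target model parameters end the training
        # otherwise: apply the preset optimization mechanism and train again
        params = optimize(params, loss)
    return params
```

This variant trades an explicit convergence check for a predictable running time; the budget has to be chosen large enough for the parameters to settle.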
8. A text classification device based on a classification network model is characterized in that the text classification device based on the classification network model comprises:
the conversion module is used for obtaining a plurality of types of sample files and converting each sample file into a multi-source information text, wherein the multi-source information text at least comprises a body text, a suffix text, a title text and other information texts;
the calling module is used for calling a classification initial model constructed based on a multi-source information fusion and feature extraction classification network, preprocessing the multi-source information text, generating training data with labels, transmitting the training data to the classification initial model, and training the classification initial model to obtain a classification network model;
the determining module is used for classifying and identifying a text to be classified based on the classification network model, and determining the text category to which the text to be classified belongs.
9. A text classification terminal based on a classification network model, characterized in that the terminal comprises a memory, a processor, and a control program stored in the memory and executable on the processor, wherein the control program, when executed by the processor, implements the steps of the text classification method based on a classification network model according to any one of claims 1-7.
10. A readable storage medium, having stored thereon a control program which, when executed by a processor, implements the steps of the text classification method based on a classification network model according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110877266.4A CN113672728A (en) | 2021-07-31 | 2021-07-31 | Text classification method, device, terminal and medium based on classification network model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113672728A (en) | 2021-11-19 |
Family
ID=78541115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110877266.4A Pending CN113672728A (en) | 2021-07-31 | 2021-07-31 | Text classification method, device, terminal and medium based on classification network model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113672728A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160344770A1 (en) * | 2013-08-30 | 2016-11-24 | Rakesh Verma | Automatic Phishing Email Detection Based on Natural Language Processing Techniques |
CN110110075A (en) * | 2017-12-25 | 2019-08-09 | 中国电信股份有限公司 | Web page classification method, device and computer readable storage medium |
KR20210004058A (en) * | 2019-07-03 | 2021-01-13 | 인하대학교 산학협력단 | A Novel Healthcare Monitoring Method and Apparatus Using Wearable Sensors and Social Networking Data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112784578B (en) | Legal element extraction method and device and electronic equipment | |
CN112955893A (en) | Automatic hyperlink of document | |
CN111783394A (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN112632226B (en) | Semantic search method and device based on legal knowledge graph and electronic equipment | |
CN113486178B (en) | Text recognition model training method, text recognition method, device and medium | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
CN111782793A (en) | Intelligent customer service processing method, system and equipment | |
CN117873487B (en) | GVG-based code function annotation generation method | |
CN113837307A (en) | Data similarity calculation method and device, readable medium and electronic equipment | |
CN117725458A (en) | Method and device for obtaining threat information sample data generation model | |
AU2019290658B2 (en) | Systems and methods for identifying and linking events in structured proceedings | |
CN112270189A (en) | Question type analysis node generation method, question type analysis node generation system and storage medium | |
CN116226711A (en) | Case dispute focus identification method and device based on multi-feature fusion | |
CN113688234A (en) | Text classification management method and device, terminal and readable storage medium | |
CN115238645A (en) | Asset data identification method and device, electronic equipment and computer storage medium | |
CN112133308B (en) | Method and device for classifying multiple tags of speech recognition text | |
CN113672728A (en) | Text classification method, device, terminal and medium based on classification network model | |
CN112100336A (en) | Method and device for identifying preservation time of file and storage medium | |
Yang et al. | Network Configuration Entity Extraction Method Based on Transformer with Multi-Head Attention Mechanism. | |
Liu et al. | Practical skills of business english correspondence writing based on data mining algorithm | |
Yu et al. | A knowledge-graph based text summarization scheme for mobile edge computing | |
CN117421405A (en) | Language model fine tuning method, device, equipment and medium for financial service | |
Jin | Bayesian Classification Algorithm in Recognition of Insurance Tax Documents | |
CN118839247A (en) | Industrial product main data entity alignment method based on BERT model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
20230707 | TA01 | Transfer of patent application right | Address after: Room 635, No. 1198 Hulin Road, Huangpu District, Guangzhou City, Guangdong Province, 510700 (office only); Applicant after: Guangzhou Yongzhe Information Technology Co.,Ltd.; Address before: 510700 room 635, No. 1198, Hulin Road, Huangpu District, Guangzhou, Guangdong; Applicant before: Guangzhou Yonglian Information Technology Co.,Ltd. |