CN114547305A

CN114547305A - Text classification system based on natural language processing

Info

Publication number: CN114547305A
Application number: CN202210172720.0A
Authority: CN
Inventors: 韩天; 张竹; 江晓林; 任明远; 董长春
Original assignee: Jinhua Institute Of Higher Learning Office Of Leading Group For Preparation Of Jinhua Institute Of Technology
Current assignee: Jinhua Institute Of Higher Learning Office Of Leading Group For Preparation Of Jinhua Institute Of Technology
Priority date: 2022-02-24
Filing date: 2022-02-24
Publication date: 2022-05-27

Abstract

The invention discloses a text classification system based on natural language processing, comprising a data acquisition module, a data preprocessing module, a data post-processing module, a text classification module, a classification result verification module and a visualization module, wherein the data acquisition module acquires the text to be classified. The system preprocesses the original text data in a manner suited to natural language processing, which facilitates uniform handling of that data. By weighting key information, it reduces the influence of word frequency on the classification result, improves classification accuracy, and addresses the inability of traditional algorithms to reflect word position information. The classification model combines a convolutional neural network with a support vector machine classifier and adds an attention mechanism: the convolutional neural network extracts features from the text data, and support-vector-machine-based classification replaces the normalized exponential function (softmax), whose generalization capability is insufficient, thereby simplifying the model parameters and improving the efficiency and accuracy of text classification.

Description

Text classification system based on natural language processing

Technical Field

The invention relates to the technical field of text classification, in particular to a text classification system based on natural language processing.

Background

Text classification refers to the automatic classification and labeling of a set of texts by a computer according to a given classification scheme or standard. A relation model between document features and document categories is learned from a labeled set of training documents, and the learned model is then used to determine the category of new documents. With the development of science and technology, text classification has gradually shifted from knowledge-based methods to methods based on statistics and machine learning. Text classification generally comprises text representation, classifier selection and training, and evaluation of and feedback on the classification result; text representation can be further divided into text preprocessing, indexing and statistics, and feature extraction.

Natural language processing refers to technology that enables interaction between machines and the natural language humans use to communicate; through natural language processing, a computer can read and understand natural language. Research in the field began with early work on machine translation. Although natural language processing involves operations on several levels, including speech, grammar, semantics and pragmatics, its basic task is, put simply, to segment the corpus to be processed into words on the basis of an ontology dictionary, word frequency statistics, contextual semantic analysis and the like, forming semantically rich lexical units with the smallest part-of-speech unit as the base unit. Natural language processing is mainly applied in machine translation, public opinion monitoring, automatic summarization and opinion extraction.

Traditional text classification relies mostly on manual operation of a computer, which is time-consuming and labor-intensive and cannot guarantee the classification quality; with the explosive growth of text document data, manual operation can no longer meet the demands of text classification work. However, research on applying natural language processing technology to text classification is not yet mature. The importance of a word in a text is mostly measured only by how often the word occurs, and cannot be evaluated according to where the word appears in the article. This reduces classification accuracy, keeps classification efficiency low, and fails to meet the needs of classification tasks that require precision. In addition, the classification results of existing text classification systems are hard to read and do not give users an intuitive view. The present invention therefore provides a text classification system based on natural language processing to solve these problems in the prior art.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a text classification system based on natural language processing, which preprocesses original text data to facilitate uniform processing of the original text data, reduces the influence of word frequency factors on classification results by performing weighting processing on key information, improves the accuracy of text classification, and solves the problem that a conventional algorithm cannot reflect word position information.

To achieve this object, the invention adopts the following technical scheme: a text classification system based on natural language processing comprises a data acquisition module, a data preprocessing module, a data post-processing module, a text classification module, a classification result verification module and a visualization module. The data acquisition module acquires the original text data to be classified and sends it to the data preprocessing module. The data preprocessing module comprises a data screening unit, a formatting unit and a normalization unit. The data post-processing module comprises a text word segmentation unit, which decomposes the preprocessed data into word-segmented text data, and an information weight unit, which applies weight processing to the word-segmented text data to produce a text data set. The text classification module comprises a text classification model for classifying the text data and a model training unit for training the text classification model; the text classification model is built on a convolutional neural network and a support vector machine classifier and incorporates an attention mechanism. The classification result verification module tests and verifies the text classification result of the text classification module, and the visualization module visually displays both the text classification result and the verification result of the classification result verification module.

The further improvement lies in that: the data screening unit screens the original text data, removing invalid text data and retaining valid text data, wherein the invalid text data comprises missing-value data, abnormal-value data, inconsistent-value data and repeated text data.

The further improvement lies in that: the formatting unit converts the valid text data retained by the data screening unit into a uniform format; the normalization unit splits the uniformly formatted text data into sentences and creates a normalization tag for each sentence, yielding normalized text data and completing the preprocessing of the original text data.

The further improvement lies in that: the text word segmentation unit segments the preprocessed text data into words and removes inflectives and stop words to obtain word-segmented text data; the information weight unit assigns different weights to words according to the positions at which they appear in the word-segmented text data, so that key information is weighted; the words are then mapped to their corresponding word vectors using one-hot encoding or word embedding, yielding the text data set.

The further improvement lies in that: the model training unit trains the text classification model using a machine-learning-based or deep-learning-based model algorithm, and after training is complete the text data set is input to the text classification model for classification.

The further improvement lies in that: the convolutional neural network in the text classification model comprises an input layer, a hidden layer and an output layer, the hidden layer comprising a convolutional layer, a pooling layer, an attention layer and a fully connected layer; the input layer takes in the text data set, the convolutional layer and the pooling layer perform feature extraction, with the attention layer applying the attention mechanism during extraction, and the fully connected layer performs the text classification.

The further improvement lies in that: the classification result verification module comprises a data set dividing unit and a classification result analysis unit, the data set dividing unit randomly divides the text data set into a training set and a testing set, then the training set and the testing set are input into the text classification model for training and testing, and the classification result analysis unit compares and analyzes the testing result and the text classification result and verifies the accuracy of text classification.

The further improvement lies in that: the data set dividing unit shuffles the text data set after setting a random seed and divides it into a training set and a testing set at a ratio of 9:1 or 8:2.

The further improvement lies in that: the visualization module comprises a data conversion unit and a data visualization unit; the data conversion unit converts the text classification result of the text classification module and the verification result of the classification result verification module into visual data, and the data visualization unit outputs the visual data to an external display for presentation to the user.

The beneficial effects of the invention are as follows. The invention performs screening, format unification and normalization on the original text data in a manner suited to natural language processing, which facilitates uniform handling of that data. Word segmentation, key-information weighting and vectorization decompose the text data into basic processing units, making feature extraction during classification more convenient and reducing the cost of subsequent processing. Weighting key information reduces the influence of word frequency on the classification result, improves classification accuracy, and addresses the inability of traditional algorithms to reflect word position information. The convolutional neural network is combined with a support vector machine classifier and an attention mechanism is added to the model, so that the convolutional neural network extracts features from the text data while support-vector-machine-based classification replaces the normalized exponential function (softmax), whose generalization capability is insufficient; this simplifies the model parameters and improves the efficiency and accuracy of text classification. Finally, the visualization module improves the readability of the text classification result so that it can be viewed more intuitively.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of the system architecture of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

Referring to FIG. 1, this embodiment provides a text classification system based on natural language processing, comprising a data acquisition module, a data preprocessing module, a data post-processing module, a text classification module, a classification result verification module and a visualization module. The data acquisition module acquires the original text data to be classified and sends it to the data preprocessing module. The data preprocessing module comprises a data screening unit, a formatting unit and a normalization unit, which screen, unify the format of, and normalize the original text data in a manner suited to natural language processing, facilitating uniform handling of that data. The data post-processing module comprises a text word segmentation unit, which decomposes the preprocessed data into word-segmented text data, and an information weight unit, which applies weight processing to the word-segmented text data to produce a text data set. The text classification module comprises a text classification model for classifying the text data and a model training unit for training the text classification model. The text classification model is built on a convolutional neural network and a support vector machine classifier and incorporates an attention mechanism: the convolutional neural network extracts features from the text data, and support-vector-machine-based classification replaces the normalized exponential function (softmax), whose generalization capability is insufficient, which simplifies the model parameters while improving the efficiency and accuracy of text classification. The classification result verification module tests and verifies the text classification result of the text classification module. The visualization module visually displays the text classification result and the verification result of the classification result verification module, improving the readability of the classification result so that it can be viewed more intuitively.
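The replacement of the normalized exponential function by support-vector-machine-based classification can be illustrated by comparing the two output-layer losses. The following is a minimal sketch only: the description does not state which SVM formulation is used, so the multi-class hinge loss shown here, the `margin` parameter and all function names are assumptions made for illustration.

```python
import math

def softmax_cross_entropy(scores, target):
    """Loss of a softmax output layer for one example (for comparison)."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]

def multiclass_hinge(scores, target, margin=1.0):
    """SVM-style loss (assumed formulation): penalize every wrong class
    whose score comes within `margin` of the target class score."""
    return sum(
        max(0.0, margin + s - scores[target])
        for j, s in enumerate(scores)
        if j != target
    )

def predict(scores):
    """Either output layer predicts the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the hinge loss drops to exactly zero once the target class leads every other class by the margin, whereas the softmax loss stays positive for any finite scores.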

The data screening unit screens the original text data, removing invalid text data and retaining valid text data, wherein the invalid text data comprises missing-value data, abnormal-value data, inconsistent-value data and repeated text data.
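The screening step might be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the embodiment: the record layout (dictionaries with `text` and `label` keys), the length bounds used to flag abnormal values, and the rule that one text carrying conflicting labels counts as an inconsistent value are all hypothetical.

```python
from collections import defaultdict

def screen_records(records, min_len=2, max_len=10000):
    """Return only the valid records (screening rules are assumptions)."""
    # Collect the labels seen for each text, to detect inconsistent values.
    labels_by_text = defaultdict(set)
    for rec in records:
        text = (rec.get("text") or "").strip()
        if text:
            labels_by_text[text].add(rec.get("label"))

    valid, seen = [], set()
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text:                                # missing value
            continue
        if not (min_len <= len(text) <= max_len):   # abnormal value (assumed rule)
            continue
        if len(labels_by_text[text]) > 1:           # inconsistent value (assumed rule)
            continue
        if text in seen:                            # repeated text
            continue
        seen.add(text)
        valid.append({"text": text, "label": rec.get("label")})
    return valid
```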

The formatting unit converts the valid text data retained by the data screening unit into a uniform format. The normalization unit splits the uniformly formatted text data into sentences and creates a normalization tag for each sentence, yielding normalized text data and completing the preprocessing of the original text data.
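Formatting and sentence-level normalization might look like the sketch below. The description does not define the uniform format or the content of the normalization tag, so both are assumptions here: the uniform format is taken to be Unicode NFKC text with collapsed whitespace, and the tag is taken to be a document identifier plus a sentence index. The function names are hypothetical.

```python
import re
import unicodedata

def to_uniform_format(text):
    """Unify character forms and whitespace (assumed meaning of 'uniform format')."""
    text = unicodedata.normalize("NFKC", text)  # fold full-width/compatibility characters
    return re.sub(r"\s+", " ", text).strip()

def split_and_tag(doc_id, text):
    """Split text into sentences and attach a tag (assumed layout) to each."""
    uniform = to_uniform_format(text)
    # Split right after sentence-ending punctuation (Latin or CJK),
    # consuming any whitespace that follows it.
    parts = re.split(r"(?<=[.!?。！？])\s*", uniform)
    sentences = [p for p in parts if p]
    return [
        {"doc_id": doc_id, "sent_id": i, "text": s}
        for i, s in enumerate(sentences)
    ]
```

The punctuation-based split is deliberately naive; it would also break on decimal points and abbreviations, which a production sentence splitter would need to handle.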

The text word segmentation unit segments the preprocessed text data into words and removes inflectives and stop words to obtain word-segmented text data. The information weight unit assigns different weights to words according to the positions at which they appear in the word-segmented text data, so that key information is weighted, and then uses one-hot encoding to map the words to their corresponding word vectors, yielding the text data set. Word segmentation, key-information weighting and vectorization decompose the text data into basic processing units, which makes feature extraction during classification more convenient and reduces the cost of subsequent processing. Weighting key information reduces the influence of word frequency on the classification result, improves classification accuracy, and addresses the inability of traditional algorithms to reflect word position information.
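Position-dependent weighting combined with one-hot encoding can be sketched as below. The description gives no weight values and does not say how the weight is attached to the vector, so the position categories, the numeric weights, and the choice to scale each one-hot vector by its weight are assumptions. The whitespace tokenizer and stop-word list are stand-ins; Chinese text would need a real word segmenter.

```python
STOP_WORDS = {"the", "a", "of", "is"}  # illustrative stop-word list (assumption)
POSITION_WEIGHTS = {"title": 3.0, "first": 2.0, "body": 1.0}  # assumed values

def segment(text):
    """Whitespace tokenizer with stop-word removal (stand-in for a real segmenter)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocab(token_lists):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for w in tokens:
            vocab.setdefault(w, len(vocab))
    return vocab

def weighted_one_hot(tokens, position, vocab):
    """Map each token to a one-hot vector scaled by its position weight."""
    weight = POSITION_WEIGHTS[position]
    vectors = []
    for w in tokens:
        vec = [0.0] * len(vocab)
        vec[vocab[w]] = weight
        vectors.append(vec)
    return vectors
```

With these assumed weights, a word in the title contributes three times the magnitude of the same word in the body, regardless of how often it occurs.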

The model training unit trains the text classification model using a machine-learning-based model algorithm; after training is complete, the text data set is input to the text classification model for classification.

The convolutional neural network in the text classification model comprises an input layer, a hidden layer and an output layer, the hidden layer comprising a convolutional layer, a pooling layer, an attention layer and a fully connected layer. The input layer takes in the text data set; the convolutional layer and the pooling layer perform feature extraction, with the attention layer applying the attention mechanism during extraction; and the fully connected layer performs the text classification.
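The interaction of convolution, pooling and attention might be pictured with the single-filter sketch below. The description does not specify the form of the attention layer, so scoring each position by its own activation and pooling with the resulting softmax weights is an assumption, as are the function names and the `scale` parameter.

```python
import math

def conv1d(seq, kernel, bias=0.0):
    """Slide a (width, dim) kernel over a (length, dim) sequence; ReLU output."""
    width = len(kernel)
    out = []
    for i in range(len(seq) - width + 1):
        s = bias
        for k in range(width):
            s += sum(x * w for x, w in zip(seq[i + k], kernel[k]))
        out.append(max(0.0, s))  # ReLU activation
    return out

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, scale=1.0):
    """Weight each position by the softmax of its activation, then sum
    (an assumed attention form, used here in place of plain pooling)."""
    weights = softmax([scale * f for f in features])
    pooled = sum(w * f for w, f in zip(weights, features))
    return pooled, weights
```

The pooled value lies between the mean and the maximum of the feature map; a larger `scale` moves it toward max pooling, while `scale=0` reduces it to average pooling.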

The classification result verification module comprises a data set dividing unit and a classification result analysis unit, wherein the data set dividing unit randomly divides the text data set into a training set and a testing set, then inputs the training set and the testing set into a text classification model for training and testing, and the classification result analysis unit compares and analyzes the testing result and the text classification result and verifies the accuracy of text classification.
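The comparison performed by the classification result analysis unit can be sketched as an accuracy computation over the testing set. The description names accuracy as the quantity verified but gives no formula, so the helper names and the additional pair counts are illustrative assumptions.

```python
from collections import Counter

def accuracy(predicted, expected):
    """Fraction of test samples whose predicted label matches the reference label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

def confusion_counts(predicted, expected):
    """Count each (expected, predicted) pair, including correct ones."""
    return Counter(zip(expected, predicted))
```

The pair counts show which categories are confused with one another, which a single accuracy figure does not reveal.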

The data set dividing unit shuffles the text data set after setting a random seed and divides it into a training set and a testing set at a ratio of 9:1.
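A seeded shuffle followed by a ratio split can be sketched as follows. The seed value and the function name are assumptions; the description only states that a random seed is set and that the ratio is 9:1.

```python
import random

def split_dataset(samples, train_ratio=0.9, seed=42):
    """Shuffle a copy of `samples` with a fixed seed, then cut it at `train_ratio`."""
    rng = random.Random(seed)   # local generator: global random state is untouched
    shuffled = list(samples)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Because the seed is fixed, repeated runs yield the same partition, so verification results stay comparable between runs; passing `train_ratio=0.8` gives an 8:2 split.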

The visualization module comprises a data conversion unit and a data visualization unit. The data conversion unit converts the text classification result of the text classification module and the verification result of the classification result verification module into visualization data, and the data visualization unit outputs the visualization data to an external display for presentation to the user.
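The data conversion step might be pictured as turning a list of predicted categories into rows that a display component can draw. The description does not specify the form of the visualization data, so the text bar chart, the row layout and the function names below are assumptions made for illustration.

```python
from collections import Counter

def to_chart_rows(labels, width=20):
    """Convert predicted labels into (label, count, bar) rows, largest first."""
    counts = Counter(labels)
    peak = max(counts.values())
    rows = []
    for label, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        bar = "#" * max(1, round(width * n / peak))  # bar length relative to the largest class
        rows.append((label, n, bar))
    return rows

def render(rows):
    """Format the rows as aligned text lines for a plain display."""
    pad = max(len(label) for label, _, _ in rows)
    return "\n".join(f"{label:<{pad}} {bar} {n}" for label, n, bar in rows)
```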

Example two

Referring to FIG. 1, this embodiment provides a text classification system based on natural language processing, comprising a data acquisition module, a data preprocessing module, a data post-processing module, a text classification module, a classification result verification module and a visualization module. The data acquisition module acquires the original text data to be classified and sends it to the data preprocessing module. The data preprocessing module comprises a data screening unit, a formatting unit and a normalization unit, which screen, unify the format of, and normalize the original text data in a manner suited to natural language processing, facilitating uniform handling of that data. The data post-processing module comprises a text word segmentation unit, which decomposes the preprocessed data into word-segmented text data, and an information weight unit, which applies weight processing to the word-segmented text data to produce a text data set. The text classification module comprises a text classification model for classifying the text data and a model training unit for training the text classification model. The text classification model is built on a convolutional neural network and a support vector machine classifier and incorporates an attention mechanism: the convolutional neural network extracts features from the text data, and support-vector-machine-based classification replaces the normalized exponential function (softmax), whose generalization capability is insufficient, which simplifies the model parameters while improving the efficiency and accuracy of text classification. The classification result verification module tests and verifies the text classification result of the text classification module. The visualization module visually displays the text classification result and the verification result of the classification result verification module, improving the readability of the classification result so that it can be viewed more intuitively.

The model training unit trains the text classification model by using a deep learning-based model algorithm, and after the training of the text classification model is finished, a text data set is input for text classification.

The data set dividing unit shuffles the text data set after setting a random seed and divides it into a training set and a testing set at a ratio of 8:2.

When original text data is classified, the data acquisition module first collects the original text data to be classified. The data preprocessing module then screens the collected data, unifies its format and normalizes it. The data post-processing module performs word segmentation, key-information weighting and vectorization on the preprocessed text data to obtain a text data set. The text classification model trained by the model training unit then classifies the text data set, and the classification result verification module verifies the classification result. Finally, the visualization module visually displays the text classification result of the text classification module and the verification result of the classification result verification module to the user.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A text classification system based on natural language processing, characterized in that: the system comprises a data acquisition module, a data preprocessing module, a data post-processing module, a text classification module, a classification result verification module and a visualization module, wherein the data acquisition module acquires original text data to be classified and sends the original text data to the data preprocessing module, the data preprocessing module comprises a data screening unit, a formatting unit and a normalization unit, the data post-processing module comprises a text word segmentation unit and an information weight unit, the text word segmentation unit is used for decomposing the preprocessed data into word segmentation text data, the information weight unit is used for processing the weight of the word segmentation text data into a text data set, the text classification module comprises a text classification model used for classifying the text data and a model training unit used for training the text classification model, the text classification model is constructed based on a convolutional neural network and a support vector machine classifier and incorporates an attention mechanism, the classification result verification module tests and verifies the text classification result of the text classification module, and the visualization module visually displays the text classification result of the text classification module and the verification result of the classification result verification module.

2. A natural language processing based text classification system according to claim 1, characterized in that: the data screening unit screens the original text data, removing invalid text data and retaining valid text data, wherein the invalid text data comprises missing-value data, abnormal-value data, inconsistent-value data and repeated text data.

3. A natural language processing based text classification system according to claim 2, characterized in that: the formatting unit converts the valid text data retained by the data screening unit into a uniform format to obtain text data with the uniform format, and the normalization unit splits the text data with the uniform format into sentences and creates a normalization tag for each split sentence to obtain normalized text data and complete the preprocessing of the original text data.

4. A natural language processing based text classification system according to claim 1, characterized in that: the text word segmentation unit performs word segmentation on the preprocessed text data and removes inflectives and stop words in the text data to obtain word segmentation text data, the information weight unit assigns different weights to words appearing at different positions of the word segmentation text data so that key information in the word segmentation text data is weighted, and the words in the word segmentation text data are then mapped to their corresponding word vectors by using one-hot encoding or word embedding to obtain a text data set.

5. A natural language processing based text classification system according to claim 1, characterized in that: the model training unit trains the text classification model by using a model algorithm based on machine learning or a model algorithm based on deep learning, and after training is complete the text data set is input to the text classification model for text classification.

6. A natural language processing based text classification system according to claim 1, characterized in that: the convolutional neural network in the text classification model comprises an input layer, a hidden layer and an output layer, wherein the hidden layer comprises a convolutional layer, a pooling layer, an attention layer and a fully connected layer, the input layer takes in the text data set, the convolutional layer and the pooling layer perform feature extraction, the attention layer applies an attention mechanism during the extraction, and the fully connected layer performs the text classification.

7. A natural language processing based text classification system according to claim 1, characterized in that: the classification result verification module comprises a data set dividing unit and a classification result analysis unit, the data set dividing unit randomly divides the text data set into a training set and a testing set, then the training set and the testing set are input into the text classification model for training and testing, and the classification result analysis unit compares and analyzes the testing result and the text classification result and verifies the accuracy of text classification.

8. A natural language processing based text classification system according to claim 7, characterized in that: the data set dividing unit shuffles the text data set after setting a random seed and divides the text data set into a training set and a testing set at a ratio of 9:1 or 8:2.

9. A natural language processing based text classification system according to claim 1, characterized in that: the visualization module comprises a data conversion unit and a data visualization unit, the data conversion unit converts the text classification result of the text classification module and the verification result of the classification result verification module into visual data, and the data visualization unit outputs the visual data to an external display and displays the visual data to a user.