CN111460148A - Text classification method and device, terminal equipment and storage medium - Google Patents

Text classification method and device, terminal equipment and storage medium

Info

Publication number
CN111460148A
CN111460148A (application CN202010230787.6A)
Authority
CN
China
Prior art keywords
text
target
model
training
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010230787.6A
Other languages
Chinese (zh)
Inventor
赵洋
包荣鑫
王宇
魏世胜
朱继刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Valueonline Technology Co ltd
Original Assignee
Shenzhen Valueonline Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Valueonline Technology Co ltd filed Critical Shenzhen Valueonline Technology Co ltd
Priority to CN202010230787.6A priority Critical patent/CN111460148A/en
Publication of CN111460148A publication Critical patent/CN111460148A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Abstract

The application is applicable to the technical field of text classification, and provides a text classification method, apparatus, terminal device, and storage medium. The method includes: acquiring a target text to be classified; counting word vector data of a plurality of target words in the target text; extracting feature information of the word vector data; and classifying the target text according to the feature information, so that the accuracy of classifying the field to which the target text belongs is improved.

Description

Text classification method and device, terminal equipment and storage medium
Technical Field
The present application belongs to the technical field of text classification, and in particular, to a text classification method, apparatus, terminal device, and storage medium.
Background
At present, with the rapid development of the Internet, news spreads ever faster. Faced with an explosively growing volume of news across complicated domains, users often fix, even on a smart device, the domains in which they receive news so that they can quickly browse the news that interests them. However, when facing a huge amount of news, or news with long texts, existing classification methods have difficulty accurately identifying the domain to which a news item belongs, and the accuracy of news classification is low.
Disclosure of Invention
The embodiments of the present application provide a text classification method, apparatus, terminal device, and storage medium, which can solve the problem that existing classification methods have difficulty accurately identifying the field to which news belongs, so that the accuracy of news classification is low.
In a first aspect, an embodiment of the present application provides a text classification method, including:
acquiring a target text to be classified;
counting word vector data of a plurality of target words in the target text;
extracting feature information of the word vector data;
and classifying the target text according to the characteristic information.
In an embodiment, the obtaining of the target text to be classified includes:
acquiring a text to be classified, wherein the text comprises a text title and a text body;
and adopting a preset regular expression to carry out data cleaning on the text title and the text to obtain a target text.
In one embodiment, the counting word vector data of a plurality of target words in the target text includes:
acquiring a plurality of target words located in a target area of the target text;
and generating word vector data of the target words according to the position information of the target words in a preset word vector library.
In an embodiment, the extracting feature information of the word vector data includes:
mapping the word vector data into a target matrix with preset dimensionality;
inputting the target matrix into a first model, wherein the first model comprises a convolution layer;
and extracting characteristic information of the target matrix through the convolution layer of the first model.
In one embodiment, the first model is trained by:
acquiring training data, wherein the training data comprises training word vector data of a plurality of training texts, and the training texts respectively have labeled text category information;
inputting the training word vector data into an initial first model for training to obtain an initial text category of the training text;
determining the training loss of the training text according to the text type information and the initial text type;
iteratively updating model parameters of the initial first model according to the training loss;
if the training loss is converged in the iterative updating process, finishing training the initial first model, and taking the current initial first model as a trained first model;
if the training loss is not converged in the iterative updating process, adjusting the model parameters of the initial first model, and returning to the step of inputting the training word vector data to the initial first model for training to obtain the initial text category of the training text until the training loss is converged.
In an embodiment, the ending of training the initial first model if the training loss converges in the iterative update process, and taking the current initial first model as the trained first model includes:
judging whether the training loss output value is continuously changed in the iterative updating process;
and if the training loss output value is not continuously changed in the iterative updating process, judging that the training loss is converged, finishing training the initial first model, and taking the current initial first model as the trained first model.
In an embodiment, the classifying the target text according to the feature information includes:
inputting the feature information into the first model, and obtaining a first text category of the target text output by the first model, wherein the first text category comprises a target field category or a non-target field category;
and if the first text type of the target text is a target field type, inputting the characteristic information of the target text into a second model to obtain a second text type of the target text, wherein the second text type is a sub-classification of the target field type, the second model is obtained by training in the same training mode as the first model, and the network structure of the second model is the same as that of the first model.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:
the first acquisition module is used for acquiring a target text to be classified;
the statistical module is used for counting word vector data of a plurality of target words in the target text;
the extraction module is used for extracting the characteristic information of the word vector data;
and the classification module is used for classifying the target text according to the characteristic information.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the text classification method according to any one of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text classification method according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the text classification method according to any one of the first aspect.
Compared with the prior art, the embodiments of the present application have the following advantages: word vector data of target words in an acquired target text to be classified is counted to preliminarily represent the text information of the target text; further feature information is then extracted from the word vector data; and the target text is classified according to the extracted feature information, improving the accuracy of classifying the field to which the target text belongs.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a first implementation of a text classification method provided in an embodiment of the present application;
fig. 2 is a flowchart illustrating a second implementation of a text classification method according to an embodiment of the present application;
fig. 3 is a flowchart illustrating a third implementation of a text classification method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a fourth implementation of a text classification method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a first model network structure in a text classification method provided in an embodiment of the present application;
FIG. 6 is a schematic flowchart of a first model training in a text classification method according to an embodiment of the present disclosure;
fig. 7 is a schematic flowchart of a fifth implementation of a text classification method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The text classification method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook and the like, and the embodiment of the application does not limit the specific types of the terminal devices.
Fig. 1 shows a flowchart of an implementation of the text classification method provided in an embodiment of the present application, detailed as follows:
s101, obtaining a target text to be classified.
In application, the text to be classified includes, but is not limited to, English text, Chinese text, or text in other languages, and the text format includes, but is not limited to, news, papers, periodicals, and the like. In this embodiment, for convenience of description, the text to be classified is taken to be Chinese text in the news format. The target text may be text that needs to be classified and is stored in advance in a database of the terminal device, or text pushed to the terminal device by a server; this is not limited here.
And S102, counting word vector data of a plurality of target words in the target text.
In application, the plurality of target words may be all the words in the target text, or a plurality of words in a specific area of the target text; this is not limited here. In a specific application, a target text with more than 500 characters can be regarded as a long text, and the target words can be limited to the first 500 characters of the target text. A target word may be a single character of the target text, for example the character "I"; or several consecutive characters, for example the two consecutive characters forming "us", represented by one feature vector when the target text is vectorized; or an entire sentence represented by one feature vector. That is, the word vector data corresponds to the feature vector of each target word and can represent characters, words, sentences, or N-grams (language-model units).
In application, because Chinese word segmentation suffers from ambiguity, inaccuracy, and similar problems, this embodiment uses single character-level words as target words: each character is replaced by its corresponding feature vector, and the feature vectors of all characters in the target text are counted, so that the whole target text is represented by feature vectors of a fixed dimension. In application, a user may set up a word vector library in advance and store it in the terminal device; this library already contains the word vector data of each target word.
And S103, extracting the characteristic information of the word vector data.
In application, the word vector data corresponds to single character-level words: the target text is first transmitted to the terminal device character by character and then initially vectorized to obtain the word vector data. In specific use, a given target word has semantic and syntactic similarities and relations with the other words and contributes to the target text to a different degree; if the target text were classified using the word vector data alone, the classification accuracy would be biased. Therefore, further feature information must be extracted from the word vector data. For example, the obtained word vector data is mapped by an embedding layer into a randomly initialized matrix, each target word (one character) corresponding to a randomly generated multi-dimensional matrix of a target dimension, and feature information is then obtained through convolutional feature extraction.
And S104, classifying the target text according to the characteristic information.
In application, the feature information represents the text information of the target text. The obtained feature information can be input into a trained classification model for prediction to obtain a predicted value, and the target text is classified according to the predicted value. For example, the classes may be medical and non-medical: if the predicted value obtained when the terminal device inputs the feature information into the classification model is 0.95, exceeding the preset medical-class threshold of 0.9, the target text is judged to belong to the medical class.
In the embodiment, the text information of the target text is preliminarily represented by acquiring the text to be classified and counting the word vector data of the target words in the target text, then the word vector data is further extracted as the characteristic information, and the target text is classified according to the extracted characteristic information, so that the accuracy of classifying the field to which the target text belongs is improved.
Referring to fig. 2, in an embodiment, S101 includes:
s201, obtaining a text to be classified, wherein the text comprises a text title and a text body.
In application, if the text to be classified is a news text or a paper, it generally consists of a text title and a text body. Normally, the title of a news text or paper serves as the "eye" of the text and carries its main content, so when predicting the class of the target text, the title and the body are given different degrees of importance in order to improve the accuracy of the prediction. In addition, to keep the text title and the body distinguishable, a space or a special symbol may be inserted between them when they are combined; this is not limited here.
S202, data cleaning is carried out on the text titles and the texts by adopting a preset regular expression, and a target text is obtained.
In application, news texts often carry HTML tags and suffer from missing data, garbled characters, and similar problems; in addition, irrelevant information in a news text also influences the classification result and reduces the accuracy of the obtained predicted value, which is why the title and body are cleaned with a preset regular expression.
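By way of illustration only, the following is a minimal Python sketch of such a cleaning step; the patterns and the clean_text helper are assumptions made for this example, not the particular regular expression used in this embodiment.

```python
import re

def clean_text(title: str, body: str) -> str:
    """Hypothetical cleaning step: join title and body, then strip noise."""
    text = title + " " + body                    # a space keeps title and body distinguishable
    text = re.sub(r"<[^>]+>", "", text)          # remove HTML tags carried over from the page
    text = re.sub(r"&[a-zA-Z#0-9]+;", "", text)  # remove HTML entities such as &nbsp;
    text = re.sub(r"\s+", " ", text)             # collapse whitespace and control characters
    return text.strip()
```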
In this embodiment, by distinguishing the text title from the body, their degrees of importance in prediction are separated, improving the accuracy of target-text prediction. Cleaning the title and body and deleting the irrelevant information in the text further improves that accuracy.
Referring to fig. 3, in an embodiment, S102 includes:
s301, acquiring a plurality of target words located in a target area of the target text.
In application, the target text consists of text content composed of a number of words (characters), and different target texts contain different numbers of characters. Therefore, in order to make the predicted values for target texts of different lengths more reliable, a plurality of target words in a target area of the target text can be selected for word vector statistics. For example, for a news-class target text, the main content generally appears in the earlier paragraphs, so the title and content of the news can be spliced into one text whose first 500 characters are taken as the target words representing the target text. In a target text of fewer than 500 characters, the remaining positions can be represented by a special vector, which may be the number 0 or a special character; this is not limited here.
S302, generating word vector data of the target words according to the position information of the target words in a preset word vector library.
In application, the preset word vector library may be a dictionary generated by counting the frequency of occurrence of all characters in the training data and keeping a preset number of the most frequent ones. The preset number may be defined by the user, for example 5000 characters; each character corresponds in sequence to a number, so all characters correspond to the fixed numbers 1, 2, 3, and so on. The position information is the position number of each target word in the word vector library; for example, the word vector datum corresponding to the first character in the word vector library is 1.
In a specific application, the text to be classified is news: the first 500 characters of the news are taken as the plurality of target words in the target area, and a 500-dimensional vector matrix (the word vector data) is generated from the characters' numbers; for example, each character of a headline such as "smart cloud marketing" is replaced by its position number in the word vector library.
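The following Python sketch illustrates this position-number encoding under the assumptions above (a 5000-character vocabulary, a 500-character window, and 0 as the padding value); build_vocab and encode are illustrative names, not part of this embodiment.

```python
from collections import Counter

def build_vocab(train_texts, vocab_size=5000):
    # keep the vocab_size most frequent characters seen in the training data
    counts = Counter(ch for text in train_texts for ch in text)
    # position numbers start at 1; 0 is reserved as the special padding value
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common(vocab_size))}

def encode(title: str, body: str, vocab, max_len=500):
    text = (title + body)[:max_len]             # splice title and body, keep the first 500 chars
    ids = [vocab.get(ch, 0) for ch in text]     # characters outside the vocabulary map to 0 here
    return ids + [0] * (max_len - len(ids))     # pad texts shorter than 500 characters
```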
In this embodiment, by acquiring a plurality of target words located in a target region of a target text and generating word vector data according to position information of the target words in a word vector library, a representation method of the word vector data of the text is simplified, and a terminal device can conveniently acquire the word vector data of the target text.
Referring to fig. 4, in an embodiment, S103 includes:
s401, mapping the word vector data into a target matrix with preset dimensionality.
In application, for any text to be classified, whether Chinese or English and however many words or characters it contains, semantics and syntax have certain similarities and connections. If the word vector data of the target words were used directly for prediction, the accuracy of the result could be biased, so the target words must be mapped into a continuous vector space, each target word being mapped to a vector over the real numbers so that relations form between the target words. For example: "good story" and "very great story" have the same meaning. Suppose a pre-built vocabulary V = {very, good, so, thing, great, story}, and define each word's vector as a 6-dimensional feature vector whose element at the index of the corresponding word in V marks that word's presence as a boolean: 0 for absence, 1 for presence. Then good = [0, 1, 0, 0, 0, 0], great = [0, 0, 0, 0, 1, 0], very = [1, 0, 0, 0, 0, 0], and so on for the six vectors. Visualizing and mapping each word's vector into 6-dimensional space, each word occupies one dimension independently of the others (with no projection along the other dimensions); this is the target matrix of the preset dimension. The degree of difference between "good" and "great" in the 6-dimensional space is then the same as that between "very" and "good"; training with a neural network model can introduce, say, a dependency of "great" on "good", so that dependencies form between words. In application, the preset dimension may be set by the user or pre-stored in the terminal device; this is not limited here.
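The one-hot representation in the example above can be written out as follows; the English stand-in vocabulary is an assumption of this sketch, since the original example was translated from Chinese.

```python
import numpy as np

vocab = ["very", "good", "so", "thing", "great", "story"]  # illustrative 6-word vocabulary V
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(len(vocab))  # 6-dimensional feature vector, all words marked absent (0)
    v[index[word]] = 1.0      # mark the word itself as present (1)
    return v

print(one_hot("good"))  # [0. 1. 0. 0. 0. 0.]
print(one_hot("very"))  # [1. 0. 0. 0. 0. 0.]
```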
S402, inputting the target matrix into a first model, wherein the first model comprises a convolution layer.
And S403, extracting characteristic information of the target matrix through the convolution layer of the first model.
In application, the specific network structure of the first model between the input layer and the output layer is as follows: a convolution layer, a pooling layer, a first fully-connected layer, a second fully-connected layer, and a classification layer. The convolution layer performs a one-dimensional convolution operation on the target matrix and extracts its feature information. Specifically, the word vector data contains 500 characters and is mapped to a target matrix of the preset dimension: for example, with the word vector datum of each target word mapped to a randomly generated 64-dimensional vector, the word vector data is mapped into a 500 × 64 target matrix (see the portion of fig. 5 corresponding to the "multichannel sentence vectorization representation"). The target matrix can then be convolved in the convolution layer with 256 convolution kernels, extracting 256 features, i.e. the feature information of the target matrix (see the portion of fig. 5 in which the target matrix is processed by "a convolution layer having multiple kernels"). The feature information dimension may be 256 × 64, the convolution kernel size may be 7 × 7 or 3 × 3, and the convolution stride may be 1, 2, or another value; none of these is limiting.
In a specific application, the pooling layer is a max-pooling layer: it performs a max-pooling operation on the feature information, selecting the maximum-valued features to form new feature information whose dimension may be 256 × 1, thereby reducing the error of the feature extraction (see the portion of fig. 5 after processing by the "maximum pooling layer"). The new feature information is then input into the first fully-connected layer, a random-deactivation (dropout) layer for regularization that zeroes the weights of some of the learning parameters during training of the first model so as to prevent overfitting. The new feature information processed by the first fully-connected layer is input into the second fully-connected layer, a classification (softmax) layer that integrates it into one output value (see the portion of fig. 5 showing the "corresponding dropout and softmax fully-connected layers"). It can be understood that the 256 × 1-dimensional feature information is reduced to 2 × 1-dimensional feature information, i.e. a 2 × 1 matrix, and the value computed from this matrix is the probability with which the target text is classified.
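Putting the layers described above together, a sketch of the first model might look as follows in Keras; the framework and exact arguments are assumptions of this sketch, while the sizes (5000-character vocabulary, 500-character input, 64-dimensional embedding, 256 kernels, 2-way softmax) follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # 500 characters in, each mapped to a randomly initialized 64-dim vector (500 x 64 target matrix)
    layers.Embedding(input_dim=5001, output_dim=64, input_length=500),  # 5000 chars + padding id 0
    layers.Conv1D(filters=256, kernel_size=7, strides=1, activation="relu"),  # single convolution layer
    layers.GlobalMaxPooling1D(),           # max pooling -> 256 x 1 new feature information
    layers.Dropout(0.5),                   # first fully-connected stage: dropout regularization
    layers.Dense(2, activation="softmax")  # second fully-connected stage: 2 x 1 softmax output
])
```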
In this embodiment, target texts are classified with a neural network structure having only one convolution layer. Compared with the multi-layer convolutional networks used in image recognition, this is a clear structural advantage: with a single convolution layer, the overall structure of the model is simple, and the efficiency of model training and operation is also clearly better.
Referring to fig. 6, in an embodiment, the first model is obtained by training as follows:
s601, obtaining training data, wherein the training data comprises training word vector data of a plurality of training texts, and the training texts respectively have labeled text category information.
In application, the training data is used to train the first model. Specifically, the selected training texts are processed as in S101-S102 to obtain the training word vector data of each training text, and each training text is labeled with text category information, i.e. its real text category. The training texts may be historical news texts acquired by a server or the terminal device, or texts selected by the user and stored in the terminal device.
S602, inputting the training word vector data to an initial first model for training to obtain an initial text category of the training text.
S603, determining the training loss of the training text according to the text type information and the initial text type.
In application, inputting the training word vector data into the initial first model for training is a forward-propagation process: the data passes through the input layer of the initial first model, the network structure described above, and the output layer to produce the final result. After the forward propagation finishes, the initial text category that the initial first model assigns to the input training text is obtained. The training loss of the current training text is then calculated from the initial text category and the real text category (the labeled text category information). The formula for calculating the training loss may be: A = (y'_i - y_i)^2, where y'_i is the initial text category of the i-th input training text and y_i is the labeled text category information of the i-th input training text.
And S604, iteratively updating the model parameters of the initial first model according to the training loss.
In application, the model parameters are specifically the learning parameters w and bias vectors b in the initial first model. Specifically, from the training loss value, the error contribution of each layer's learning parameters to the total loss can be determined in reverse; the error of the current layer is obtained from this contribution and multiplied by the negative learning rate to give the error value Δw of the current layer's learning parameters and the error value Δb of its bias vector, so that the new learning parameter is w + Δw and the new bias vector is b + Δb. For example, for the second fully-connected layer (softmax), the calculation may be y' = softmax(W_v · v + b_v), where softmax is the activation function, y' is the output value of the initial text category, W_v is the learning parameter of the current fully-connected layer, and b_v is its bias vector; back-propagation then updates the learning parameters and bias vectors according to the training loss A. Alternatively, the model parameters are optimized with an optimizer: for example, an adaptive moment estimation (Adam) optimizer automatically differentiates the training-loss output and iteratively updates the model parameters. This is not limiting.
And S605, if the training loss is converged in the iterative updating process, finishing training the initial first model, and taking the current initial first model as the trained first model.
And S606, if the training loss is not converged in the iterative updating process, adjusting the model parameters of the initial first model, and returning to execute the step of inputting the training word vector data to the initial first model for training to obtain the initial text category of the training text until the training loss is converged.
In application, the convergence of the initial first model is determined from the training loss obtained during the iterative updating. Specifically, when the training loss is smaller than a preset value, or after a certain number of iterations, the initial first model may be judged to have converged; training of the initial first model then ends, and the current initial first model is taken as the trained first model. Otherwise, training steps S602-S606 are repeated on the training word vector data, with the original model parameters of the initial first model updated during back-propagation in each iteration, i.e. updated iteratively.
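A sketch of training steps S602-S606 under the descriptions above (a squared-error loss and the Adam optimizer); the convergence test shown is one simple instance of the criteria in the text, and the names are illustrative.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()  # squared-error training loss A = (y' - y)^2

def train(model, dataset, max_epochs=100, patience=10, tol=0.1):
    """dataset yields (x_batch, y_batch) pairs; y_batch holds one-hot category vectors."""
    losses = []
    for epoch in range(max_epochs):
        for x_batch, y_batch in dataset:                  # S602: forward propagation
            with tf.GradientTape() as tape:
                y_pred = model(x_batch, training=True)
                loss = loss_fn(y_batch, y_pred)           # S603: training loss
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))  # S604: update w, b
        losses.append(float(loss))
        # S605/S606: end training once the loss stops changing over `patience` epochs
        if len(losses) >= patience and max(losses[-patience:]) - min(losses[-patience:]) <= tol:
            break
    return model
```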
In other applications, the training data may further include validation texts and test texts. The training texts directly participate in training the first model and adjusting its model parameters; the validation texts are used to judge the training effect and when to stop; and the test texts are used to evaluate the classification ability of the trained first model. The training, validation, and test texts do not intersect: a text that belongs to the training set appears in neither the validation set nor the test set. The training texts, test texts, and validation texts may be divided in the proportion 0.9 : 0.05 : 0.05.
Specific amounts can be as shown in Table 1 below.
Table 1: (provided as an image in the original publication; not reproduced here.)
In application, accuracy, recall, and the F1 value are important indicators for evaluating the effect of a model. Accuracy is the proportion of all data that is predicted correctly. Recall is the proportion of the data that is actually positive that is correctly predicted as positive. The F1 value is a value computed from the accuracy and the recall; the larger these three indicators, the better.
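For reference, the three indicators can be computed for one class as in the following sketch; the function name is illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class accuracy (precision), recall, and F1 over paired label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. precision_recall_f1(labels, predictions, positive="medical")
```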
In application, the accuracy, recall, and F1 value of the first model trained through the above steps are shown in Table 2 below; the overall accuracy of the first model on the test texts is 95.70%.
Table 2:
name of classification Rate of accuracy Recall rate F1 value
Medical treatment 0.94 0.97 0.95
Others 0.97 0.95 0.96
In this embodiment, the word vector data of all training texts is input into the initial first model for forward-propagation training to obtain the training loss of the training texts, and back-propagation training is then performed according to the training loss to update the model parameters of the initial first model, improving the accuracy of text classification by the first model.
In one embodiment, S605 includes:
and judging whether the training loss output value is continuously changed in the iteration updating process.
And if the training loss output value is not continuously changed in the iterative updating process, judging that the training loss is converged, finishing training the initial first model, and taking the current initial first model as the trained first model.
In application, each iterative-update pass during model training yields a training-loss output value. If the training-loss output values obtained over a preset number of consecutive passes remain unchanged, or if the fluctuation between adjacent training-loss output values over a preset number of consecutive passes stays within a preset value, it can be determined that the training-loss output value is no longer continuously changing during the iterative updating. The preset number may be 10 times, and the preset fluctuation range may be 0 to 0.1; this is not limiting.
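A sketch of this convergence test, using the preset count of 10 and fluctuation range of 0.1 mentioned above as defaults; the function name is illustrative.

```python
def loss_converged(loss_outputs, preset_times=10, preset_range=0.1):
    """True when the last `preset_times` loss outputs no longer change,
    i.e. every adjacent pair fluctuates within `preset_range`."""
    if len(loss_outputs) < preset_times:
        return False
    window = loss_outputs[-preset_times:]
    return all(abs(a - b) <= preset_range for a, b in zip(window, window[1:]))
```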
In this embodiment, convergence of the initial first model is determined by detecting that the training-loss output value no longer changes over consecutive iterations, yielding the first model and improving the accuracy with which the first model classifies the target text.
In one embodiment, S104 further includes:
and inputting the characteristic information into the first model, and obtaining a first text category of the target text output by the first model, wherein the first text category comprises a target field category or a non-target field category.
In application, the first model processes the input feature information, and the output predicted value is the probability with which the first model judges the target text to be of the target field class; when this probability exceeds a threshold, the target text is judged to be of the target field class. For example, the first text category includes "medical" and "non-medical": if the predicted value obtained when the first model predicts from the feature information is 0.95, exceeding the preset "medical" threshold of 0.9, the target text is judged to belong to the "medical" class; otherwise it is judged to belong to the "non-medical" class.
And if the first text type of the target text is a target field type, inputting the characteristic information of the target text into a second model to obtain a second text type of the target text, wherein the second text type is a sub-classification of the target field type, the second model is obtained by training in the same training mode as the first model, and the network structure of the second model is the same as that of the first model.
In application, the second model is trained in the same way as the first model (see S601-S606 for details), and its network structure is the same as that of the first model. Thus, on the basis of guaranteed accuracy, two models can be obtained by training only one neural network structure, which reduces memory usage and speeds up deployment of the network structure. In addition, designing only one convolution layer into the neural network structure further improves deployment efficiency, reduces the computation required during model training, and increases training speed.
In application, the second text category is a sub-classification of the target field class, of which there may be several. For example, because the medical field has no uniform standard for domain division and definition, this embodiment divides medical-field news into six categories: "pharmaceutical manufacturing industry", "medical service", "pharmaceutical business", "medical instrument", "traditional Chinese medicine", and "health science popularization". This classification standard basically covers most news in the medical field while keeping the probability of overlap between news categories minimal. To avoid one news item belonging to several categories at once, this embodiment also assigns finer-grained labels within each sub-category. After comprehensively weighing reasonableness, comprehensiveness, and non-intersection, the application summarizes the labels of the sub-classifications, which are used for annotating the data set and verifying the classification results. The specific classification can be shown in Table 3 below:
Table 3: (provided as an image in the original publication; it lists the six sub-classifications and their fine-grained labels.)
In other applications, the training data of the second model likewise includes validation texts and test texts, each playing the same role as when training the first model, which is not repeated here. All six sub-classifications and their corresponding labels were screened manually. The training, validation, and test texts can be taken from news published on the Internet by a number of authoritative medical news organizations; the text data of the six sub-classifications is divided into non-intersecting training, test, and validation texts, with the data volume of each shown in Table 4 below:
Table 4:
Classification | Training texts | Validation texts | Test texts
Pharmaceutical manufacturing industry | 2349 | 262 | 262
Medical services | 2508 | 279 | 279
Pharmaceutical business | 2371 | 264 | 264
Medical instrument | 2409 | 268 | 268
Traditional Chinese medicine | 2222 | 247 | 247
Health science popularization | 1939 | 216 | 216
In application, the second model trained through the above steps has the experimental results of accuracy, recall rate and F1 value as shown in table 5 below, and the overall accuracy of the second model on the test text is 88.49%.
Table 5:
name of classification Rate of accuracy Recall rate F1 value
Pharmaceutical manufacturing industry 0.93 0.91 0.92
Medical services 0.90 0.86 0.88
Pharmaceutical business 0.85 0.87 0.86
Medical instrument 0.90 0.92 0.91
Traditional Chinese medicine 0.93 0.91 0.92
Health science popularization 0.93 0.92 0.92
For example, for the second model, the target field class corresponds to more than two sub-classifications; suppose the sub-classifications include "medical instrument", "traditional Chinese medicine", and "medical service". The word vector data of the target text can be input into the second model, which has been pre-trained with different learning parameters w and bias vectors b, i.e. each sub-classification corresponds to one learning parameter and one bias vector. Thus, for the word vector data of a target text, the probability value of each predicted sub-classification can be predicted from the different learning parameters and bias vectors, corresponding respectively to the probability of "medical instrument" (a1), the probability of "traditional Chinese medicine" (a2), and the probability of "medical service" (a3). If a1 = 0.67, a2 = 0.2, and a3 = 0.13, the sub-classification corresponding to the largest of the three values is selected as the final second text category, i.e. the current target text is judged to be of the "medical instrument" class. Correspondingly, if the second model is in the training stage, the classification loss can be calculated from the real second text category and the predicted sub-classification, and the learning parameters and bias parameters then updated iteratively. For example, if the preset probabilities for the real second text category of the target text are 1 for "medical instrument", 0 for "traditional Chinese medicine", and 0 for "medical service", the classification loss is obtained by computing the predicted probabilities a1, a2, and a3 against these corresponding probabilities. In application, a sub-classified target text can be classified further by the above steps: after determining which sub-classification the target text belongs to, the feature information is input into the classification model corresponding to that label, which is not described in detail.
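The overall two-stage decision can be sketched as follows, assuming hypothetical first_model and second_model callables that return probabilities, with the 0.9 threshold and the six sub-classifications from the examples above.

```python
import numpy as np

SUB_CLASSES = ["pharmaceutical manufacturing", "medical service", "pharmaceutical business",
               "medical instrument", "traditional Chinese medicine", "health science popularization"]

def classify(features, first_model, second_model, threshold=0.9):
    p_target = float(first_model(features))     # first model: probability of the target field class
    if p_target < threshold:
        return "non-target field"               # first text category: non-target field class
    probs = np.asarray(second_model(features))  # second model: one probability per sub-classification
    return SUB_CLASSES[int(np.argmax(probs))]   # pick the sub-classification with the largest a_i
```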
In this embodiment, the network structure of the second model is the same as that of the first model. On the basis of guaranteeing the accuracy and recall of target-text classification, two models are obtained by training only one neural network structure, reducing memory usage and speeding up deployment; in addition, only one convolution layer is designed into the neural network structure, reducing the computation required during model training and increasing training speed.
In a specific embodiment, referring to S701-S708 of fig. 7, the target text to be classified is news. The terminal device acquires the text content of the news, namely the text title and the text body, and preprocesses them: a regular expression is applied to clean the title and body, yielding the target text. Word vector data of a plurality of target words in the target text is then obtained, and the corresponding feature information is extracted to represent the news. The feature information is input into the first model for preliminary classification; if the news is judged to belong to the medical class, the feature information is input into the second model for sub-classification. If the news is judged to belong to the non-medical class, domain classification can likewise be achieved for the target text from its feature information according to the above steps, for example by training other text classification models for domain classification as in S601-S606, which is not detailed here; in this way, news in any field can be effectively classified.
As shown in fig. 8, the present embodiment further provides a text classification apparatus 800, including:
the first obtaining module 810 is configured to obtain a target text to be classified.
A statistic module 820, configured to count word vector data of a plurality of target words in the target text.
And the extracting module 830 is configured to extract feature information of the word vector data.
A classification module 840, configured to classify the target text according to the feature information.
In an embodiment, the first obtaining module 810 is further configured to:
acquiring a text to be classified, wherein the text comprises a text title and a text body;
and adopting a preset regular expression to carry out data cleaning on the text title and the text to obtain a target text.
In one embodiment, the statistics module 820 is further configured to:
acquiring a plurality of target words located in a target area of the target text;
and generating word vector data of the target words according to the position information of the target words in a preset word vector library.
In an embodiment, the extraction module 830 is further configured to:
mapping the word vector data into a target matrix with preset dimensionality;
inputting the target matrix into a first model, wherein the first model comprises a convolution layer;
and extracting characteristic information of the target matrix through the convolution layer of the first model.
In an embodiment, the text classification apparatus 800 may also be used for network model training, including:
the second obtaining module is used for obtaining training data, the training data comprises training word vector data of a plurality of training texts, and the training texts respectively have labeled text category information.
And the training module is used for inputting the training word vector data to an initial first model for training to obtain an initial text category of the training text.
The determining module is used for determining the training loss of the training text according to the text category information and the initial text category;
an updating module for iteratively updating the model parameters of the initial first model according to the training loss;
a finishing module, configured to finish training the initial first model if the training loss is converged in the iterative update process, and use the current initial first model as a trained first model;
and the iteration module is used for adjusting the model parameters of the initial first model if the training loss is not converged in the iterative updating process, and returning to execute the step of inputting the training word vector data to the initial first model for training to obtain the initial text category of the training text until the training loss is converged.
In one embodiment, the ending module is further configured to:
judging whether the training loss output value is continuously changed in the iterative updating process;
and if the training loss output value is not continuously changed in the iterative updating process, judging that the training loss is converged, finishing training the initial first model, and taking the current initial first model as the trained first model.
In an embodiment, the classification module 840 is further configured to:
inputting the feature information into the first model, and obtaining a first text category of the target text output by the first model, wherein the first text category comprises a target field category or a non-target field category;
and if the first text type of the target text is a target field type, inputting the characteristic information of the target text into a second model to obtain a second text type of the target text, wherein the second text type is a sub-classification of the target field type, the second model is obtained by training in the same training mode as the first model, and the network structure of the second model is the same as that of the first model.
An embodiment of the present application further provides a terminal device, where the terminal device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps in the above-mentioned method embodiments may be implemented.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
Fig. 9 is a schematic diagram of a terminal device 90 according to an embodiment of the present application. As shown in fig. 9, the terminal device 90 of this embodiment includes: a processor 903, a memory 901 and a computer program 902 stored in said memory 901 and executable on said processor 903. The processor 903 implements the steps in the above-described method embodiments, such as the steps S101 to S104 shown in fig. 1, when executing the computer program 902. Alternatively, the processor 903 realizes the functions of each module/unit in the above device embodiments when executing the computer program 902.
Illustratively, the computer program 902 may be divided into one or more modules/units, which are stored in the memory 901 and executed by the processor 903, to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 902 in the terminal device 90. For example, the computer program 902 may be divided into a first obtaining module, a statistical module, an extracting module and a classifying module, and each module has the following specific functions:
the first obtaining module is used for obtaining the target text to be classified.
And the counting module is used for counting word vector data of a plurality of target words in the target text.
And the extraction module is used for extracting the characteristic information of the word vector data.
And the classification module is used for classifying the target text according to the characteristic information.
The terminal device 90 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 903, a memory 901. Those skilled in the art will appreciate that fig. 9 is merely an example of a terminal device 90 and does not constitute a limitation of the terminal device 90 and may include more or fewer components than shown, or some components may be combined, or different components, for example, the terminal device may also include input-output devices, network access devices, buses, etc.
The Processor 903 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 901 may be an internal storage unit of the terminal device 90, such as a hard disk or a memory of the terminal device 90. The memory 901 may also be an external storage device of the terminal device 90, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the terminal device 90. In one embodiment, the memory 901 may include both an internal storage unit and an external storage device of the terminal device 90. The memory 901 is used to store the computer program and the other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, and in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of text classification, comprising:
acquiring a target text to be classified;
counting word vector data of a plurality of target words in the target text;
extracting feature information of the word vector data;
and classifying the target text according to the characteristic information.
2. The text classification method according to claim 1, wherein the obtaining of the target text to be classified comprises:
acquiring a text to be classified, wherein the text comprises a text title and a text body;
and adopting a preset regular expression to carry out data cleaning on the text title and the text to obtain a target text.
3. The text classification method according to claim 1 or 2, wherein the counting word vector data of a plurality of target words in the target text comprises:
acquiring a plurality of target words located in a target area of the target text;
and generating word vector data of the target words according to the position information of the target words in a preset word vector library.
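A sketch of the word vector lookup in claim 3, assuming the preset word vector library is an embedding matrix indexed through a vocabulary; the truncation length and the zero-vector fallback for out-of-vocabulary words are assumptions.

```python
import numpy as np

def words_to_vectors(words, vocab, embeddings, max_words=200):
    """Generate word vector data for target words from a preset word vector
    library: `vocab` maps a word to its row (position) in `embeddings`."""
    vectors = np.zeros((max_words, embeddings.shape[1]), dtype=np.float32)
    for i, word in enumerate(words[:max_words]):
        row = vocab.get(word)          # position information in the library
        if row is not None:
            vectors[i] = embeddings[row]
    return vectors
```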
4. The text classification method according to claim 3, wherein the extracting feature information of the word vector data comprises:
mapping the word vector data into a target matrix with preset dimensionality;
inputting the target matrix into a first model, wherein the first model comprises a convolution layer;
and extracting feature information of the target matrix through the convolution layer of the first model.
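One possible realization of the convolution layer of the first model in claim 4, sketched with PyTorch; the kernel widths, channel count, and max-pooling are assumptions, since the application fixes only that the first model comprises a convolution layer.

```python
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    """Extract feature information from a (max_words x dim) target matrix
    through convolution; kernel sizes and channel count are assumed values."""
    def __init__(self, dim: int = 128, channels: int = 64,
                 kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, channels, k) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)  # (batch, max_words, dim) -> (batch, dim, max_words)
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)  # concatenated feature information
```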
5. The text classification method according to claim 4, wherein the first model is trained by the following steps:
acquiring training data, wherein the training data comprises training word vector data of a plurality of training texts, and the training texts respectively have labeled text category information;
inputting the training word vector data into an initial first model for training to obtain an initial text category of the training text;
determining the training loss of the training text according to the text category information and the initial text category;
iteratively updating model parameters of the initial first model according to the training loss;
if the training loss converges during the iterative updating, ending the training of the initial first model and taking the current initial first model as the trained first model;
and if the training loss does not converge during the iterative updating, adjusting the model parameters of the initial first model and returning to the step of inputting the training word vector data into the initial first model for training to obtain the initial text category of the training text, until the training loss converges.
6. The text classification method according to claim 5, wherein the ending the training of the initial first model if the training loss converges during the iterative updating, and taking the current initial first model as the trained first model, comprises:
judging whether the output value of the training loss continues to change during the iterative updating;
and if the output value of the training loss no longer changes during the iterative updating, determining that the training loss has converged, ending the training of the initial first model, and taking the current initial first model as the trained first model.
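A hedged sketch of the training procedure of claims 5 and 6 in PyTorch: parameters are updated iteratively, and convergence is declared when the output value of the training loss stops changing. The tolerance, patience, learning rate, and cross-entropy loss are assumptions not fixed by the claims.

```python
import torch
import torch.nn as nn

def train_first_model(model, loader, epochs=50, lr=1e-3, tol=1e-4, patience=3):
    """Train the initial first model until the training loss converges.
    Convergence test (claim 6): the epoch loss no longer changes, here
    approximated by `patience` consecutive epochs within `tol`; all
    hyperparameters are assumed values."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # assumed training-loss definition
    previous, stable = float("inf"), 0
    for _ in range(epochs):
        epoch_loss = 0.0
        for vectors, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(vectors), labels)  # initial category vs. label
            loss.backward()
            optimizer.step()  # iteratively update the model parameters
            epoch_loss += loss.item()
        if abs(previous - epoch_loss) < tol:
            stable += 1
            if stable >= patience:  # loss output value no longer changes
                break
        else:
            stable = 0
        previous = epoch_loss
    return model
```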
7. The text classification method according to any one of claims 4 to 6, wherein the classifying the target text according to the feature information comprises:
inputting the feature information into the first model, and obtaining a first text category of the target text output by the first model, wherein the first text category is a target field category or a non-target field category;
and if the first text category of the target text is the target field category, inputting the feature information of the target text into a second model to obtain a second text category of the target text, wherein the second text category is a sub-classification of the target field category, the second model is obtained by training in the same manner as the first model, and the network structure of the second model is the same as that of the first model.
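The two-stage decision of claim 7 can be sketched as follows; the class index of the target field category and the use of -1 as a placeholder for texts without a sub-classification are illustrative assumptions.

```python
import torch

@torch.no_grad()
def classify(features, first_model, second_model, target_field_idx=1):
    """First model: target field vs. non-target field; only target-field
    texts are passed to the second model (same network structure) for
    sub-classification. Class index 1 for the target field is assumed."""
    first_category = first_model(features).argmax(dim=1)
    second_category = torch.full_like(first_category, -1)  # -1: no sub-class
    is_target = first_category == target_field_idx
    if is_target.any():
        second_category[is_target] = second_model(features[is_target]).argmax(dim=1)
    return first_category, second_category
```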
8. A text classification apparatus, comprising:
the first acquisition module is used for acquiring a target text to be classified;
the statistical module is used for counting word vector data of a plurality of target words in the target text;
the extraction module is used for extracting the feature information of the word vector data;
and the classification module is used for classifying the target text according to the feature information.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202010230787.6A 2020-03-27 2020-03-27 Text classification method and device, terminal equipment and storage medium Pending CN111460148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010230787.6A CN111460148A (en) 2020-03-27 2020-03-27 Text classification method and device, terminal equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111460148A true CN111460148A (en) 2020-07-28

Family

ID=71681551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010230787.6A Pending CN111460148A (en) 2020-03-27 2020-03-27 Text classification method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111460148A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160283583A1 (en) * 2014-03-14 2016-09-29 Tencent Technology (Shenzhen) Company Limited Method, apparatus, and storage medium for text information processing
CN109726391A (en) * 2018-12-11 2019-05-07 中科恒运股份有限公司 The method, apparatus and terminal of emotional semantic classification are carried out to text
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Di: "Personalized Web-Based Learning Services Based on the Semantic Web" (基于语义网的个性化网络学习服务), Wuhan University Press, pages 126-155 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084785A (en) * 2020-07-30 2020-12-15 中国民用航空上海航空器适航审定中心 Airworthiness text feature extraction and evaluation method, system, device and storage medium
CN113392640A (en) * 2020-10-13 2021-09-14 腾讯科技(深圳)有限公司 Title determining method, device, equipment and storage medium
CN113392640B (en) * 2020-10-13 2024-01-23 腾讯科技(深圳)有限公司 Title determination method, device, equipment and storage medium
CN112765348A (en) * 2021-01-08 2021-05-07 重庆创通联智物联网有限公司 Short text classification model training method and device
CN112668318A (en) * 2021-03-15 2021-04-16 常州微亿智造科技有限公司 Work author identification method based on time sequence
CN113254635A (en) * 2021-04-14 2021-08-13 腾讯科技(深圳)有限公司 Data processing method, device and storage medium
CN113326347A (en) * 2021-05-21 2021-08-31 四川省人工智能研究院(宜宾) Syntactic information perception author attribution method
CN113326347B (en) * 2021-05-21 2021-10-08 四川省人工智能研究院(宜宾) Syntactic information perception author attribution method
CN113535951A (en) * 2021-06-21 2021-10-22 深圳大学 Method, device, terminal equipment and storage medium for information classification
CN114969725A (en) * 2022-04-18 2022-08-30 中移互联网有限公司 Target command identification method and device, electronic equipment and readable storage medium
CN114997338A (en) * 2022-07-19 2022-09-02 成都数之联科技股份有限公司 Project classification and classification model training method, device, medium and equipment
CN116975299A (en) * 2023-09-22 2023-10-31 腾讯科技(深圳)有限公司 Text data discrimination method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111460148A (en) Text classification method and device, terminal equipment and storage medium
CN111949787B (en) Automatic question-answering method, device, equipment and storage medium based on knowledge graph
CN111125334B (en) Search question-answering system based on pre-training
CN107122346B (en) The error correction method and device of a kind of read statement
US11093854B2 (en) Emoji recommendation method and device thereof
CN108304468B (en) Text classification method and text classification device
CN112347778B (en) Keyword extraction method, keyword extraction device, terminal equipment and storage medium
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN110032639B (en) Method, device and storage medium for matching semantic text data with tag
CN108959312A (en) A kind of method, apparatus and terminal that multi-document summary generates
CN109241526B (en) Paragraph segmentation method and device
CA2882280A1 (en) System and method for matching data using probabilistic modeling techniques
US9348901B2 (en) System and method for rule based classification of a text fragment
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN111460115A (en) Intelligent man-machine conversation model training method, model training device and electronic equipment
CN110674301A (en) Emotional tendency prediction method, device and system and storage medium
CN114077661A (en) Information processing apparatus, information processing method, and computer readable medium
CN115098556A (en) User demand matching method and device, electronic equipment and storage medium
CN113641833B (en) Service demand matching method and device
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
Arif et al. Word sense disambiguation for Urdu text by machine learning
CN114117057A (en) Keyword extraction method of product feedback information and terminal equipment
CN111611379A (en) Text information classification method, device, equipment and readable storage medium
CN111382265A (en) Search method, apparatus, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination