CN110413773A - Intelligent text classification method, device and computer readable storage medium - Google Patents


Info

Publication number
CN110413773A
Authority
CN
China
Prior art keywords
text
word
classification
data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910540265.3A
Other languages
Chinese (zh)
Other versions
CN110413773B (en)
Inventor
郑子欧
刘京华
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910540265.3A priority Critical patent/CN110413773B/en
Publication of CN110413773A publication Critical patent/CN110413773A/en
Priority to PCT/CN2019/117341 priority patent/WO2020253043A1/en
Application granted granted Critical
Publication of CN110413773B publication Critical patent/CN110413773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to artificial intelligence technology and discloses an intelligent text classification method, comprising: receiving text data and a tag set; performing part-of-speech tagging on the text data; performing fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set; performing word vectorization on the word segmentation sequence set to obtain a word vectorization data set; inputting the word vectorization data set and the tag set into a classification model for training to obtain a training value, the classification model exiting training when the training value is smaller than a preset threshold; and receiving a text input by a user, performing the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result. The present invention also proposes an intelligent text classification device and a computer readable storage medium. The present invention can achieve accurate text classification.

Description

Intelligent text classification method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an intelligent text classification method and device and a computer readable storage medium.
Background
Text classification is an important part of text processing, and its applications are wide-ranging, for example spam filtering, news classification, part-of-speech tagging, and the like. At present, the content of different texts is generally classified by labeling keywords. This classification method ignores the discourse information in the text and, owing to its lack of consideration of part of speech, is incomplete and coarse, so its accuracy is low.
Disclosure of Invention
The invention provides an intelligent text classification method, an intelligent text classification device and a computer readable storage medium, and mainly aims to provide accurate classification results for a user when the user inputs texts.
In order to achieve the above object, the present invention provides an intelligent text classification method, which comprises:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
Optionally, the part-of-speech tagging comprises:
firstly, marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length (for example, two characters) and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number (an irrational, non-repeating decimal), θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
Optionally, the classification model comprises a convolutional neural network, an activation function and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer; and
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers to carry out convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to a full connection layer;
and the fully connected layer receives the dimensionality reduction data set, obtains a prediction classification set in combination with the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, and judges the magnitude relation between the loss value and a preset threshold value, exiting training once the loss value is smaller than the preset threshold value.
In addition, in order to achieve the above object, the present invention further provides an intelligent text classification apparatus, which includes a memory and a processor, wherein the memory stores a text classification program operable on the processor, and the text classification program implements the following steps when executed by the processor:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
Optionally, the part-of-speech tagging comprises:
firstly, marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length (for example, two characters) and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number, θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a text classification program stored thereon, the text classification program being executable by one or more processors to implement the steps of the intelligent text classification method as described above.
The invention provides an intelligent text classification method, an intelligent text classification device, and a computer readable storage medium. The method performs part-of-speech tagging according to the text content, which effectively converts the text data into part-of-speech data; the word vectorization operation then exposes the features of the text data to the computer for analysis without loss; and repeated training of the classification model effectively improves the robustness and accuracy of the classification of the text data. Therefore, the invention can provide accurate classification results for users.
Drawings
Fig. 1 is a schematic flowchart of an intelligent text classification method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of an intelligent text classification apparatus according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of a text classification program in the intelligent text classification apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an intelligent text classification method. Fig. 1 is a schematic flow chart of an intelligent text classification method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the intelligent text classification method includes:
and S1, receiving the text data and the label set, and performing part-of-speech tagging on the text data.
Preferably, the text data set includes text data of various subjects, such as finance, fiction, education, real estate, and sports, and the tag set records the label of each piece of text data in the text data set, for example recording text data A as sports and text data B as real estate.
In a preferred embodiment of the present invention, the part-of-speech tagging first marks the nouns and verbs in the text data according to a preset part-of-speech mark template, where the part-of-speech mark template is a recognizer built on the characteristics of nouns and verbs and can determine nouns and verbs by recognizing the characteristics of a word. For example, for the sentences [I particularly like eating apples], [playing basketball benefits fitness] and [the enemy yielded in the end], the words [I], [apple], [basketball], [fitness], [enemy] and [end] are marked as nouns according to the part-of-speech mark template, and [like], [eat], [play], [benefit] and [yield] are marked as verbs;
next, words whose length is greater than a preset length, such as two characters, and which contain the particle "的" (de) or "地" (de) are searched for in the text data, and it is judged whether the words immediately preceding and following such a word are nouns or verbs. If the preceding or following word is a noun or verb, the word containing "的" or "地" is an adjective or adverb. For example, in [angry people fiercely beat the hateful thief], the nouns and verbs [people], [beat] and [thief] are first identified according to the part-of-speech mark template; the words longer than two characters containing "的" or "地" are then recognized, namely [angry (愤怒的)], [fiercely (狠狠地)] and [hateful (可恨的)]; since a noun or verb stands next to each of them, they are marked as adjectives or adverbs. Preferably, the annotation attaches a mark symbol, such as [angry_adj people_n fiercely_adv beat_v hateful_adj thief_n].
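Purely as an illustration of the heuristic just described, a minimal Python sketch is given below; the helper name mark_de_words, the tag names and the token list are hypothetical, not part of the claimed implementation:

    # Minimal sketch of the "的"/"地" heuristic: an unmarked word longer than two
    # characters that contains "的" or "地" is marked adjective/adverb when a
    # neighbouring word is a noun or a verb. All data is illustrative.
    def mark_de_words(tokens, tags, min_len=2):
        """tokens: list of words; tags: parallel list with 'n', 'v' or None."""
        for i, word in enumerate(tokens):
            if tags[i] is not None or len(word) <= min_len:
                continue
            if "的" in word or "地" in word:
                prev_tag = tags[i - 1] if i > 0 else None
                next_tag = tags[i + 1] if i + 1 < len(tokens) else None
                if prev_tag in ("n", "v") or next_tag in ("n", "v"):
                    # "的" words modify nouns (adjective); "地" words modify verbs (adverb)
                    tags[i] = "adj" if "的" in word else "adv"
        return tags

    tokens = ["愤怒的", "人们", "狠狠地", "打", "可恨的", "小偷"]  # angry / people / fiercely / beat / hateful / thief
    tags = [None, "n", None, "v", None, "n"]                      # nouns and verbs pre-marked by the template
    print(mark_de_words(tokens, tags))                            # ['adj', 'n', 'adv', 'v', 'adj', 'n']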
S2, performing fine-grained word segmentation on the text data according to the part of speech tagging to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In a preferred embodiment of the present invention, the fine-grained word segmentation means that the words not marked as nouns, verbs, adjectives or adverbs are removed from the text data, and the word segmentation sequence set is then obtained based on the mark symbols. Preferably, the removed words are called the non-feature word set, such as English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words (common function words), and the like. For example, the sentence [a pouring rain in the morning drenched the earth into almost wet mud], annotated as [pouring_adj morning_n rain_n earth_n drench_v become_v wet_adj mud_n], keeps exactly these marked words after the fine-grained word segmentation, and stripping the mark symbols then yields the word segmentation sequence set [pouring morning rain earth drench become wet mud].
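A minimal Python sketch of this filtering step follows, under the assumption that step S1 has produced (word, tag) pairs; the data and names are illustrative:

    # Fine-grained word segmentation as described: drop every token that is not
    # tagged noun/verb/adjective/adverb, then strip the mark symbols to obtain
    # the word segmentation sequence set. All data is illustrative.
    FEATURE_TAGS = {"n", "v", "adj", "adv"}

    def fine_grained_segment(annotated):
        """annotated: list of (word, tag) pairs; tag is None for unmarked words."""
        kept = [(w, t) for w, t in annotated if t in FEATURE_TAGS]  # remove non-feature words
        return [w for w, _ in kept]                                 # mark symbols stripped

    annotated = [("a", None), ("pouring", "adj"), ("morning", "n"), ("rain", "n"),
                 ("earth", "n"), ("drench", "v"), ("become", "v"),
                 ("wet", "adj"), ("mud", "n"), (".", None)]
    print(fine_grained_segment(annotated))
    # ['pouring', 'morning', 'rain', 'earth', 'drench', 'become', 'wet', 'mud']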
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the conditional probability model is subjected to accumulation summation operation to obtain a log-likelihood function, the log-likelihood function is maximized to solve an optimal solution, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb of the word segmentation sequence set (also called a feature word), e is Euler's number (an irrational, non-repeating decimal), θ is the vector of a non-leaf node on the Huffman path of ω (see the conditional probability model below), X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization; the log-likelihood function is then maximized in the following steps.
The conditional probability model p(ω | V(ω_i)) is:

p(ω | V(ω_i)) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^{1−d_j^ω} · [1 − σ(X_ω^T θ_{j−1}^ω)]^{d_j^ω}

where l_ω represents the number of nodes contained on the path of ω in the Huffman coding. Concerning the Huffman binary tree: a tree is a nonlinear data structure formed by data elements (also called nodes) organized according to branch relations, and a set of several trees is called a forest; a binary tree is an ordered tree in which each node has at most two subtrees, called the left and right subtrees; a binary tree whose weighted path length is minimal is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is expressed by its Huffman coding, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the j-th node on the path p^ω (the root node has no code), the codes d_2^ω, ..., d_{l_ω}^ω together form the code of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)-th non-leaf node on the path p^ω; since the word ω is a leaf node, it has no corresponding vector.
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} Σ_{j=2}^{l_ω} { (1 − d_j^ω) · log σ(X_ω^T θ_{j−1}^ω) + d_j^ω · log[1 − σ(X_ω^T θ_{j−1}^ω)] }

where C is the word stock comprising all nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function uses the partial derivative:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulated sum operation. V(ω_i) is continuously optimized on the basis of this partial derivative; the optimization step is:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is a set learning rate, and the word vectorization data set V(ω) is obtained on this basis.
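The formulas above correspond to hierarchical-softmax training as popularized by word2vec; the following numpy sketch of one update step is offered under that reading, with dimensions, Huffman codes and helper names all being illustrative assumptions:

    import numpy as np

    # One hierarchical-softmax update for a feature word ω: X_ω is the sum of the
    # context vectors V(ω_i); each non-leaf node j on ω's Huffman path carries a
    # vector θ and a code d ∈ {0, 1}. All data below is illustrative.
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))             # the classification probability model

    def hs_update(context_vecs, thetas, codes, eta=0.025):
        x = context_vecs.sum(axis=0)                # X_ω: accumulated sum over the context
        grad_x = np.zeros_like(x)
        for j, (theta, d) in enumerate(zip(thetas, codes)):
            q = sigmoid(x @ theta)                  # σ(X_ω^T θ_{j-1}^ω)
            g = eta * (1 - d - q)                   # η times the gradient of one ζ term
            grad_x += g * theta                     # accumulate η · ∂ζ/∂X_ω
            thetas[j] = theta + g * x               # update the non-leaf node vector
        return context_vecs + grad_x                # V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

    rng = np.random.default_rng(0)
    ctx = rng.normal(scale=0.1, size=(2, 8))        # two context word vectors of dimension 8
    nodes = [rng.normal(scale=0.1, size=8) for _ in range(3)]
    print(hs_update(ctx, nodes, codes=[0, 1, 1]).shape)   # (2, 8): updated word vectors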
And S3, inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training.
Preferably, the classification model includes a convolutional neural network, an activation function, and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers, and one fully connected layer.
Inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
preferably, after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers for convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to the full connection layer.
Further, the fully connected layer receives the dimensionality reduction data set, obtains a prediction classification set in combination with the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, and judges the magnitude relation between the loss value and a preset threshold value; the classification model exits training once the loss value is smaller than the preset threshold value.
The convolution operation in the preferred embodiment of the present invention is:

ω' = (ω − k + 2p) / s + 1

where ω' is the size of the output data, ω is the size of the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in the matrix data to replace the whole matrix.
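As a worked check of the reconstructed output-size formula (with illustrative numbers): an input of width ω = 100 convolved with a kernel of size k = 3 at stride s = 1 and zero-padding p = 1 gives ω' = (100 − 3 + 2·1)/1 + 1 = 100, i.e., the width is preserved, and a subsequent max pooling of width 2 then halves it to 50.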
the activation function is:
where y is the prediction classification set and e is an infinite acyclic decimal.
In the preferred embodiment of the present invention, the loss value T is the cross-entropy:

T = −(1/n) Σ_{t=1}^{n} y_t · log(μ_t)

where n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
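A minimal numpy sketch of the activation and loss computation as reconstructed above; the softmax and cross-entropy forms, and all numbers, are illustrative assumptions rather than the patent's verified implementation:

    import numpy as np

    # Softmax over the fully connected layer's outputs, then cross-entropy against
    # one-hot labels; training exits once the loss drops below the 0.01 threshold.
    def softmax(x):
        z = np.exp(x - x.max(axis=1, keepdims=True))   # subtract the row max for stability
        return z / z.sum(axis=1, keepdims=True)

    def cross_entropy(labels, preds, eps=1e-12):
        n = labels.shape[0]                            # data size of the prediction set
        return -np.sum(labels * np.log(preds + eps)) / n

    logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.8, 0.4]])   # fully connected outputs
    labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # one-hot tag set
    T = cross_entropy(labels, softmax(logits))
    print(T, T < 0.01)   # exit training only when T is smaller than the preset threshold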
And S4, receiving a text input by a user, carrying out word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
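A minimal Python sketch of this inference step; the helper stand-ins are hypothetical, with word_vectorize reusing the S2 pipeline and model standing for the trained classifier from S3:

    import numpy as np

    def classify(text, word_vectorize, model, labels):
        """Illustrative S4: vectorize the user text, let the model judge, output the result."""
        vec = word_vectorize(text)               # the same word vectorization operation as S2
        scores = model(vec)                      # the classification model's output
        return labels[int(np.argmax(scores))]    # the classification result

    # Toy stand-ins just to make the sketch executable:
    print(classify("a user text",
                   word_vectorize=lambda t: np.ones(3),
                   model=lambda v: v * np.array([0.1, 0.7, 0.2]),
                   labels=["sports", "finance", "education"]))   # -> finance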
The invention also provides an intelligent text classification device. Fig. 2 is a schematic diagram of the internal structure of the intelligent text classification device according to an embodiment of the present invention.
In the present embodiment, the intelligent text classification device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, a tablet computer or a portable computer, or a server. The intelligent text classification device 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the intelligent text classification device 1, such as a hard disk of the intelligent text classification device 1. The memory 11 may also be an external storage device of the intelligent text classification device 1 in other embodiments, such as a plug-in hard disk provided on the intelligent text classification device 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and so on. Further, the memory 11 may also include both an internal storage unit of the intelligent text classification apparatus 1 and an external storage device. The memory 11 may be used not only to store application software installed in the intelligent text classification device 1 and various types of data, such as codes of the text classification program 01, but also to temporarily store data that has been output or is to be output.
Processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip, executes program code or processes data stored in memory 11, such as executing text classifier 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface may also comprise a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the intelligent text classification device 1 and for displaying a visual user interface.
While fig. 2 only shows the intelligent text classification device 1 with the components 11-14 and the text classification program 01, it will be understood by those skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the intelligent text classification device 1, which may comprise fewer or more components than shown, combine some components, or arrange the components differently.
In the embodiment of the apparatus 1 shown in fig. 2, a text classification program 01 is stored in the memory 11; the processor 12, when executing the text classification program 01 stored in the memory 11, implements the following steps:
the method comprises the steps of receiving text data and a tag set, and performing part-of-speech tagging on the text data.
Preferably, the text data set includes text data of various subjects, such as finance, fiction, education, real estate, and sports, and the tag set records the label of each piece of text data in the text data set, for example recording text data A as sports and text data B as real estate.
In a preferred embodiment of the present invention, the part-of-speech tagging first marks the nouns and verbs in the text data according to a preset part-of-speech mark template, where the part-of-speech mark template is a recognizer built on the characteristics of nouns and verbs and can determine nouns and verbs by recognizing the characteristics of a word. For example, for the sentences [I particularly like eating apples], [playing basketball benefits fitness] and [the enemy yielded in the end], the words [I], [apple], [basketball], [fitness], [enemy] and [end] are marked as nouns according to the part-of-speech mark template, and [like], [eat], [play], [benefit] and [yield] are marked as verbs;
next, words whose length is greater than a preset length, such as two characters, and which contain the particle "的" (de) or "地" (de) are searched for in the text data, and it is judged whether the words immediately preceding and following such a word are nouns or verbs. If the preceding or following word is a noun or verb, the word containing "的" or "地" is an adjective or adverb. For example, in [angry people fiercely beat the hateful thief], the nouns and verbs [people], [beat] and [thief] are first identified according to the part-of-speech mark template; the words longer than two characters containing "的" or "地" are then recognized, namely [angry (愤怒的)], [fiercely (狠狠地)] and [hateful (可恨的)]; since a noun or verb stands next to each of them, they are marked as adjectives or adverbs. Preferably, the annotation attaches a mark symbol, such as [angry_adj people_n fiercely_adv beat_v hateful_adj thief_n].
And secondly, performing fine-grained word segmentation on the text data according to the part of speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In a preferred embodiment of the present invention, the fine-grained word segmentation means that the words not marked as nouns, verbs, adjectives or adverbs are removed from the text data, and the word segmentation sequence set is then obtained based on the mark symbols. Preferably, the removed words are called the non-feature word set, such as English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words (common function words), and the like. For example, the sentence [a pouring rain in the morning drenched the earth into almost wet mud], annotated as [pouring_adj morning_n rain_n earth_n drench_v become_v wet_adj mud_n], keeps exactly these marked words after the fine-grained word segmentation, and stripping the mark symbols then yields the word segmentation sequence set [pouring morning rain earth drench become wet mud].
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the conditional probability model is subjected to accumulation summation operation to obtain a log-likelihood function, the log-likelihood function is maximized to solve an optimal solution, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb of the word segmentation sequence set (also called a feature word), e is Euler's number (an irrational, non-repeating decimal), θ is the vector of a non-leaf node on the Huffman path of ω (see the conditional probability model below), X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization; the log-likelihood function is then maximized in the following steps.
The conditional probability model p(ω | V(ω_i)) is:

p(ω | V(ω_i)) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^{1−d_j^ω} · [1 − σ(X_ω^T θ_{j−1}^ω)]^{d_j^ω}

where l_ω represents the number of nodes contained on the path of ω in the Huffman coding. Concerning the Huffman binary tree: a tree is a nonlinear data structure formed by data elements (also called nodes) organized according to branch relations, and a set of several trees is called a forest; a binary tree is an ordered tree in which each node has at most two subtrees, called the left and right subtrees; a binary tree whose weighted path length is minimal is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is expressed by its Huffman coding, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the j-th node on the path p^ω (the root node has no code), the codes d_2^ω, ..., d_{l_ω}^ω together form the code of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)-th non-leaf node on the path p^ω; since the word ω is a leaf node, it has no corresponding vector.
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} Σ_{j=2}^{l_ω} { (1 − d_j^ω) · log σ(X_ω^T θ_{j−1}^ω) + d_j^ω · log[1 − σ(X_ω^T θ_{j−1}^ω)] }

where C is the word stock comprising all nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function uses the partial derivative:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulated sum operation. V(ω_i) is continuously optimized on the basis of this partial derivative; the optimization step is:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is a set learning rate, and the word vectorization data set V(ω) is obtained on this basis.
Inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training.
Preferably, the classification model includes a convolutional neural network, an activation function, and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers, and one fully connected layer.
Inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
preferably, after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers for convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to the full connection layer.
Further, the fully connected layer receives the dimensionality reduction data set, obtains a prediction classification set in combination with the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, and judges the magnitude relation between the loss value and a preset threshold value; the classification model exits training once the loss value is smaller than the preset threshold value.
The convolution operation in the preferred embodiment of the present invention is:

ω' = (ω − k + 2p) / s + 1

where ω' is the size of the output data, ω is the size of the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in the matrix data to replace the whole matrix.
the activation function is:
where y is the prediction classification set and e is an infinite acyclic decimal.
In the preferred embodiment of the present invention, the loss value T is the cross-entropy:

T = −(1/n) Σ_{t=1}^{n} y_t · log(μ_t)

where n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
And step four, receiving a text input by a user, carrying out word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
Alternatively, in other embodiments, the text classification program may be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention, where the module referred to in the present invention refers to a series of computer program instruction segments capable of performing specific functions for describing the execution process of the text classification program in the intelligent text classification device.
For example, referring to fig. 3, a schematic diagram of program modules of a text classification program in an embodiment of the intelligent text classification device of the present invention is shown, in this embodiment, the text classification program may be divided into a word tagging module 10, a word vectorization conversion module 20, a model training module 30, and a text classification result output module 40, which exemplarily:
the part of speech tagging module 10 is configured to: receiving text data and a tag set, and performing part-of-speech tagging on the text data.
The word vectorization conversion module 20 is configured to: and performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word vectorization data set.
The model training module 30 is configured to: and inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model exits from training.
The text classification result output module 40 is configured to: receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
The functions or operation steps of the above-mentioned part of speech tagging module 10, word vectorization conversion module 20, model training module 30, and text classification result output module 40 when executed are substantially the same as those of the above-mentioned embodiments, and are not described herein again.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which a text classification program is stored, where the text classification program is executable by one or more processors to implement the following operations:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
The embodiment of the computer-readable storage medium of the present invention is substantially the same as that of the foregoing embodiments of the intelligent text classification device and method, and will not be described herein again.
It should be noted that the above numbering of the embodiments of the present invention is merely for description and does not represent the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. An intelligent text classification method, characterized in that the method comprises:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
2. The intelligent text classification method according to claim 1, characterized in that the part-of-speech tagging comprises:
marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word in the text data are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
3. The intelligent text classification method according to claim 1 or 2, characterized in that the word vectorization process comprises:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
4. The intelligent text classification method of claim 3, wherein the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number, θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
5. The intelligent text classification method of claim 4, wherein the classification model comprises a convolutional neural network, an activation function, and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers, and one fully-connected layer; and
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers to carry out convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to a full connection layer;
and the full-connection layer receives the dimensionality reduction data set, calculates a prediction classification set by combining the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, judges the size relation between the loss value and a preset threshold value, and quits training until the loss value is smaller than the preset threshold value.
6. An intelligent text classification apparatus, comprising a memory and a processor, the memory having stored thereon a text classification program operable on the processor, the text classification program when executed by the processor implementing the steps of:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
7. The intelligent text classification device according to claim 6, wherein the part-of-speech tagging comprises:
marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word in the text data are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
8. The intelligent text classification apparatus according to claim 6 or 7, wherein the word vectorization process comprises:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
9. The intelligent text classification device of claim 8, wherein the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number, θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
10. A computer-readable storage medium having stored thereon a text classification program executable by one or more processors to perform the steps of the intelligent text classification method of any one of claims 1 to 5.
CN201910540265.3A 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium Active CN110413773B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium
PCT/CN2019/117341 WO2020253043A1 (en) 2019-06-20 2019-11-12 Intelligent text classification method and apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110413773A true CN110413773A (en) 2019-11-05
CN110413773B CN110413773B (en) 2023-09-22

Family

ID=68359559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910540265.3A Active CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110413773B (en)
WO (1) WO2020253043A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275091A (en) * 2020-01-16 2020-06-12 平安科技(深圳)有限公司 Intelligent text conclusion recommendation method and device and computer readable storage medium
CN111339300A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Text classification method and device
CN111626047A (en) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 Intelligent text error correction method and device, electronic equipment and readable storage medium
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN112434153A (en) * 2020-12-16 2021-03-02 中国计量大学上虞高等研究院有限公司 Junk information filtering method based on ELMo and convolutional neural network
CN112507663A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Text-based judgment question generation method and device, electronic equipment and storage medium
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883191B (en) * 2021-02-05 2023-03-24 山东麦港数据系统有限公司 Agricultural entity automatic identification classification method and device
CN113342981A (en) * 2021-06-30 2021-09-03 中国工商银行股份有限公司 Demand document classification method and device based on machine learning
CN116912845B (en) * 2023-06-16 2024-03-19 广东电网有限责任公司佛山供电局 Intelligent content identification and analysis method and device based on NLP and AI

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161067A1 (en) * 2009-12-29 2011-06-30 Dynavox Systems, Llc System and method of using pos tagging for symbol assignment
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471933B (en) * 2018-10-11 2024-05-07 平安科技(深圳)有限公司 Text abstract generation method, storage medium and server
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161067A1 (en) * 2009-12-29 2011-06-30 Dynavox Systems, Llc System and method of using pos tagging for symbol assignment
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics
CN111275091A (en) * 2020-01-16 2020-06-12 平安科技(深圳)有限公司 Intelligent text conclusion recommendation method and device and computer readable storage medium
CN111275091B (en) * 2020-01-16 2024-05-10 平安科技(深圳)有限公司 Text conclusion intelligent recommendation method and device and computer readable storage medium
CN111339300A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Text classification method and device
CN111339300B (en) * 2020-02-28 2023-08-22 中国工商银行股份有限公司 Text classification method and device
CN111626047A (en) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 Intelligent text error correction method and device, electronic equipment and readable storage medium
CN112434153A (en) * 2020-12-16 2021-03-02 中国计量大学上虞高等研究院有限公司 Junk information filtering method based on ELMo and convolutional neural network
CN112507663A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Text-based judgment question generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2020253043A1 (en) 2020-12-24
CN110413773B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN110413773B (en) Intelligent text classification method, device and computer readable storage medium
CN110851596B (en) Text classification method, apparatus and computer readable storage medium
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN110347835B (en) Text clustering method, electronic device and storage medium
CN110909548B (en) Chinese named entity recognition method, device and computer readable storage medium
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
CN110442857B (en) Emotion intelligent judging method and device and computer readable storage medium
CN110362723B (en) Topic feature representation method, device and storage medium
CN110032632A (en) Intelligent customer service answering method, device and storage medium based on text similarity
US9336299B2 (en) Acquisition of semantic class lexicons for query tagging
CN107807968B (en) Question answering device and method based on Bayesian network and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN108460011A (en) A kind of entitative concept mask method and system
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN112101031B (en) Entity identification method, terminal equipment and storage medium
CN110427480B (en) Intelligent personalized text recommendation method and device and computer readable storage medium
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
US20220318515A1 (en) Intelligent text cleaning method and apparatus, and computer-readable storage medium
CN108509423A (en) A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN111783471A (en) Semantic recognition method, device, equipment and storage medium of natural language
CN113821605A (en) Event extraction method
CN110866042A (en) Intelligent table query method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant