CN110413773B - Intelligent text classification method, device and computer readable storage medium - Google Patents

Info

Publication number
CN110413773B
CN110413773B
Authority
CN
China
Prior art keywords: word, text, classification, data, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910540265.3A
Other languages
Chinese (zh)
Other versions
CN110413773A (en)
Inventor
郑子欧
刘京华
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date 2019-06-20
Publication date 2023-09-22
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910540265.3A priority Critical patent/CN110413773B/en
Publication of CN110413773A publication Critical patent/CN110413773A/en
Priority to PCT/CN2019/117341 priority patent/WO2020253043A1/en
Application granted granted Critical
Publication of CN110413773B publication Critical patent/CN110413773B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to artificial intelligence technology and discloses an intelligent text classification method comprising the following steps: receiving text data and a tag set, and performing part-of-speech tagging on the text data; performing fine-grained word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word vectorization data set; inputting the word vectorization data set and the tag set into a classification model for training to obtain a training value, and exiting training of the classification model when the training value is smaller than a preset threshold; receiving text input by a user, performing word vectorization on the text to obtain text word vectors, inputting the text word vectors into the classification model for judgment, and outputting the classification result. The invention also provides an intelligent text classification device and a computer readable storage medium. The invention can realize an accurate text classification function.

Description

Intelligent text classification method, device and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to an intelligent text classification method, apparatus, and computer readable storage medium.
Background
Text classification is an important part of text processing and is very widely used, for example in spam filtering, news classification, part-of-speech tagging and the like. To classify the content of different texts, current methods usually rely on tagging keywords. Such classification methods ignore the discourse information in the text and, because they lack any consideration of parts of speech, classify the text incompletely and coarsely, resulting in low accuracy.
Disclosure of Invention
The invention provides an intelligent text classification method, an intelligent text classification device and a computer readable storage medium, which mainly aim to provide accurate classification results for users when the users input texts.
In order to achieve the above object, the present invention provides an intelligent text classification method, including:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
Optionally, the part-of-speech tagging includes:
marking nouns and verbs in the text data according to a preset part-of-speech marking template;
searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de);
judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs obtained by the part-of-speech tagging, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i.
Optionally, the classification model comprises a convolutional neural network, an activation function and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer; and
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
after the convolutional neural network receives the word vectorization data set, inputting the word vectorization data set into the nineteen convolutional layers and nineteen pooling layers for convolution operations and maximum pooling operations to obtain a reduced-dimension data set, and inputting the reduced-dimension data set into the fully connected layer;
the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
In addition, in order to achieve the above object, the present invention also provides an intelligent text classification apparatus, which includes a memory and a processor, wherein a text classification program capable of being executed on the processor is stored in the memory, and the text classification program, when executed by the processor, performs the steps of:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
Optionally, the part-of-speech tagging includes:
marking nouns and verbs in the text data according to a preset part-of-speech marking template;
searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de);
judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs obtained by the part-of-speech tagging, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a text classification program executable by one or more processors to implement the steps of the intelligent text classification method as described above.
The invention provides an intelligent text classification method, an intelligent text classification device and a computer readable storage medium. Part-of-speech tagging according to the text content effectively converts the text data into part-of-speech data; the word vectorization operation then allows a computer to read and analyse the features of the text data without loss; and repeated training of the classification model effectively improves the robustness and accuracy of text classification. The invention can therefore provide accurate classification results for users.
Drawings
FIG. 1 is a flow chart of an intelligent text classification method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of an intelligent text classification apparatus according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a text classification program in an intelligent text classification apparatus according to an embodiment of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides an intelligent text classification method. Referring to fig. 1, a flow chart of an intelligent text classification method according to an embodiment of the invention is shown. The method may be performed by an apparatus, which may be implemented in software and/or hardware.
In this embodiment, the intelligent text classification method includes:
s1, receiving text data and a tag set, and marking the parts of speech of the text data.
Preferably, the text data set includes text data on various subjects, such as finance, fiction, education, real estate, sports and the like, and the tag set records the tag of each text data in the text data set, for example recording that text data A is of the sports class and text data B is of the real estate class.
In a preferred embodiment of the present invention, the part-of-speech tagging first tags the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template refers to a recognizer of noun and verb features, which can determine nouns and verbs by recognizing the features of words. For example, given the sentences [I especially like to eat apples], [Playing basketball is good for fitness] and [The enemy surrendered last time], the nouns [apple], [basketball], [fitness], [enemy] and [time] and the verbs [eat], [play] and [surrender] are tagged according to the part-of-speech tagging template;
words longer than two characters and containing 的 (de) or 地 (de) are then searched for in the text data, and it is judged whether the words before and after each such word are nouns or verbs. If the preceding and following words are nouns or verbs, the word longer than two characters and containing 的 or 地 is tagged as an adjective or adverb. For example, for [The angry people strongly beat the hateful thief], [people], [beat] and [thief] are first recognized according to the part-of-speech tagging template, and [angry], [strongly] and [hateful] are recognized as words longer than two characters containing 的 or 地; since their neighbouring words [people], [beat] and [thief] are nouns or verbs, [angry], [strongly] and [hateful] are tagged as adjectives or adverbs. Preferably, the tagging may take the form of annotation symbols, such as [angry_adj people_n strongly_adv beat_v hateful_adj thief_n].
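This heuristic can be sketched as follows; the sketch is a minimal illustration that assumes a hypothetical dictionary-backed template (the patent's template is a feature-based recognizer) and pre-split tokens:

```python
# Minimal sketch of the tagging heuristic above. NOUN_VERB_TEMPLATE is a
# hypothetical dictionary stand-in for the preset part-of-speech tagging
# template described in the patent.
NOUN_VERB_TEMPLATE = {"人们": "n", "打": "v", "小偷": "n"}

def pos_tag(tokens):
    """Tag nouns/verbs from the template, then tag words longer than two
    characters containing 的 or 地 as adjective/adverb when a neighbouring
    word is a noun or verb."""
    tags = [NOUN_VERB_TEMPLATE.get(tok) for tok in tokens]
    for i, tok in enumerate(tokens):
        if len(tok) > 2 and ("的" in tok or "地" in tok):
            prev_tag = tags[i - 1] if i > 0 else None
            next_tag = tags[i + 1] if i + 1 < len(tokens) else None
            if prev_tag in ("n", "v") or next_tag in ("n", "v"):
                tags[i] = "adj" if "的" in tok else "adv"  # 的 -> adj, 地 -> adv
    return list(zip(tokens, tags))

# pos_tag(["愤怒的", "人们", "强烈地", "打", "可恶的", "小偷"])
# -> [("愤怒的", "adj"), ("人们", "n"), ("强烈地", "adv"),
#     ("打", "v"), ("可恶的", "adj"), ("小偷", "n")]
```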
S2, carrying out fine-granularity word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and carrying out word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In the preferred embodiment of the present invention, the fine-grained word segmentation refers to removing from the text data the words that are not tagged as nouns, verbs, adjectives or adverbs, and obtaining the word segmentation sequence set based on the annotation symbols. Preferably, the removed words are called heterogeneous words, and include all English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words and the like, the stop words including function words such as "in". For example, [pouring_adj morning_n rain_n ground_n soak_v become_v wet_adj mud_n] (a pouring rain in the morning soaked the ground into wet mud) is fine-grain segmented to retain only the tagged words, and the word segmentation sequence set is then derived by stripping the annotation symbols, giving [pouring, morning, rain, ground, soak, become, wet, mud].
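Continuing the sketch above, the fine-grained segmentation step then reduces to a filter over the tagged tokens; fine_grained_segment is a hypothetical helper:

```python
TAGS_TO_KEEP = ("n", "v", "adj", "adv")

def fine_grained_segment(tagged_tokens):
    """Remove heterogeneous words (tokens not tagged as noun, verb,
    adjective or adverb: letters, numerals, punctuation, stop words),
    then strip the annotation symbols."""
    return [tok for tok, tag in tagged_tokens if tag in TAGS_TO_KEEP]

# fine_grained_segment(pos_tag(tokens)) yields the word segmentation sequence.
```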
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the accumulation and summation operation is carried out on the conditional probability model to obtain a log likelihood function, the optimal solution is solved by maximizing the log likelihood function, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i; the log-likelihood function is then maximized.
The conditional probability model p(ω|X_ω) is:

p(ω|X_ω) = ∏_{j=2}^{l_ω} p(d_j^ω | X_ω, θ_{j−1}^ω) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^(1−d_j^ω) · [1 − σ(X_ω^T θ_{j−1}^ω)]^(d_j^ω)

wherein l_ω denotes the number of nodes contained in the Huffman code of ω, defined with reference to a Huffman binary tree: a tree is a nonlinear data structure formed by organizing data elements (also called nodes) according to branching relationships, and a set of several trees is called a forest. A binary tree is an ordered tree with at most two subtrees per node, referred to as the left and right subtrees; a binary tree with minimum weighted path length is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is represented by its Huffman code, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the jth node on the path p_ω (the root node is not coded), so the sequence d_2^ω … d_{l_ω}^ω is the coding of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)th non-leaf node on the path p_ω (the word ω itself is a leaf node and has no corresponding vector).
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} log p(ω|X_ω)

wherein C is the word stock, which comprises all the nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function proceeds by gradient ascent, using:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulation-summation result X_ω. The V(ω_i) are continuously optimized based on this partial derivative, the optimization process being:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is the set learning rate; the word vectorization data set V(ω) is obtained on this basis.
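One stochastic step of this optimization can be sketched as follows, assuming the standard hierarchical-softmax gradients consistent with the derivation above; cbow_hs_step is a hypothetical name and all inputs are numpy arrays:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_step(context_vecs, huffman_code, node_vecs, eta=0.025):
    """One gradient-ascent step on the log-likelihood for a single word.
    context_vecs: the 2c context vectors V(omega_i);
    huffman_code: the codes d_2 .. d_{l_omega} of the target word (0/1 ints);
    node_vecs: the non-leaf node vectors theta_1 .. theta_{l_omega - 1};
    eta: the set learning rate."""
    x = np.sum(context_vecs, axis=0)        # X_omega: accumulation-summation
    grad_x = np.zeros_like(x)
    for d, theta in zip(huffman_code, node_vecs):
        q = sigmoid(x @ theta)              # sigma(X_omega^T theta)
        g = eta * (1 - d - q)               # gradient coefficient
        grad_x += g * theta                 # accumulate d(zeta)/d(X_omega)
        theta += g * x                      # update the node vector in place
    for v in context_vecs:                  # propagate to every V(omega_i)
        v += grad_x
```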
S3, inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training.
Preferably, the classification model comprises a convolutional neural network, an activation function and a loss function. Wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer.
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
preferably, after the convolutional neural network receives the word vectorization data set, the word vectorization data set is input to the nineteenth layer convolutional layer and the nineteenth layer pooling layer to perform convolutional operation and maximum pooling operation, a reduced-dimension data set is obtained, and the reduced-dimension data set is input to a full-connection layer.
Further, the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
The convolution operation in the preferred embodiment of the present invention is:

ω′ = (ω − k + 2p) / s + 1

wherein ω′ is the output data size, ω is the input data size, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in each data matrix to replace the whole matrix;
the activation function is a softmax of the form:

y_j = e^(z_j) / Σ_i e^(z_i)

where y is the prediction classification set, z denotes the outputs of the fully connected layer, and e is Euler's number (an infinite non-repeating decimal).
In the preferred embodiment of the present invention, the loss value T is calculated from the prediction classification set and the tag set, wherein n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
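A minimal PyTorch sketch of such a classifier and its training loop is given below; the channel width, kernel size, pooling configuration and the cross-entropy loss are illustrative assumptions (the patent's loss formula is not recoverable from the text), and TextCNN and train are hypothetical names:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Nineteen convolution + max-pooling pairs followed by one fully
    connected layer; the softmax is applied inside the loss."""
    def __init__(self, embed_dim=128, n_classes=6, channels=32):
        super().__init__()
        layers, in_ch = [], embed_dim
        for _ in range(19):                       # nineteen conv/pool pairs
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool1d(kernel_size=2, ceil_mode=True)]
            in_ch = channels
        self.features = nn.Sequential(*layers)    # yields the reduced-dimension data
        self.fc = nn.Linear(channels, n_classes)  # the fully connected layer

    def forward(self, x):                         # x: (batch, embed_dim, seq_len)
        h = self.features(x).mean(dim=2)          # collapse the remaining length
        return self.fc(h)                         # class scores

def train(model, loader, threshold=0.01, lr=1e-3):
    """Train until the loss value falls below the preset threshold."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()               # illustrative stand-in for T
    loss = torch.tensor(float("inf"))
    while loss.item() >= threshold:
        for vecs, labels in loader:               # word vectors and tag set
            opt.zero_grad()
            loss = loss_fn(model(vecs), labels)
            loss.backward()
            opt.step()
```

Ceil-mode pooling is used so that the sequence length stays valid (it bottoms out at 1) across all nineteen pairs.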
S4, receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
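Under the same assumptions, this step reduces to feeding the word vectors of the user text through the trained model and taking the highest-scoring tag; classify below is a hypothetical helper:

```python
def classify(model, text_word_vecs, class_names):
    """text_word_vecs: tensor of shape (embed_dim, seq_len) built from the
    trained word vectors; class_names: the tags, e.g. ("sports", "real estate")."""
    model.eval()
    with torch.no_grad():
        scores = model(text_word_vecs.unsqueeze(0))   # add a batch dimension
    return class_names[int(scores.argmax(dim=1))]
```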
The invention further provides an intelligent text classification device. Referring to fig. 2, an internal structure diagram of an intelligent text classification apparatus according to an embodiment of the invention is shown.
In this embodiment, the intelligent text classification apparatus 1 may be a PC (Personal Computer), or a terminal device such as a smart phone, a tablet computer, a portable computer, or a server. The intelligent text classification apparatus 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the intelligent text classification apparatus 1, such as a hard disk of the intelligent text classification apparatus 1. The memory 11 may in other embodiments also be an external storage device of the intelligent text classification apparatus 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the intelligent text classification apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the intelligent text classification apparatus 1. The memory 11 may be used not only for storing application software installed in the intelligent text classification apparatus 1 and various types of data, such as codes of the text classification program 01, but also for temporarily storing data that has been output or is to be output.
Processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code or processing data stored in memory 11, such as for executing text classification program 01 or the like.
The communication bus 13 is used to enable connection communication between these components.
The network interface 14 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), typically used to establish a communication connection between the apparatus 1 and other electronic devices.
Optionally, the device 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or a display unit, as appropriate, for displaying information processed in the intelligent text classification apparatus 1 and for displaying a visual user interface.
Fig. 2 shows only the intelligent text classification apparatus 1 with the components 11-14 and the text classification program 01. It will be understood by those skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the intelligent text classification apparatus 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the device 1 shown in fig. 2, a text classification program 01 is stored in the memory 11; the processor 12 performs the following steps when executing the text classification program 01 stored in the memory 11:
step one, receiving text data and a tag set, and marking the parts of speech of the text data.
Preferably, the text data set includes text data on various subjects, such as finance, fiction, education, real estate, sports and the like, and the tag set records the tag of each text data in the text data set, for example recording that text data A is of the sports class and text data B is of the real estate class.
In a preferred embodiment of the present invention, the part-of-speech tagging first tags the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template refers to a recognizer of noun and verb features, which can determine nouns and verbs by recognizing the features of words. For example, given the sentences [I especially like to eat apples], [Playing basketball is good for fitness] and [The enemy surrendered last time], the nouns [apple], [basketball], [fitness], [enemy] and [time] and the verbs [eat], [play] and [surrender] are tagged according to the part-of-speech tagging template;
words longer than two characters and containing 的 (de) or 地 (de) are then searched for in the text data, and it is judged whether the words before and after each such word are nouns or verbs. If the preceding and following words are nouns or verbs, the word longer than two characters and containing 的 or 地 is tagged as an adjective or adverb. For example, for [The angry people strongly beat the hateful thief], [people], [beat] and [thief] are first recognized according to the part-of-speech tagging template, and [angry], [strongly] and [hateful] are recognized as words longer than two characters containing 的 or 地; since their neighbouring words [people], [beat] and [thief] are nouns or verbs, [angry], [strongly] and [hateful] are tagged as adjectives or adverbs. Preferably, the tagging may take the form of annotation symbols, such as [angry_adj people_n strongly_adv beat_v hateful_adj thief_n].
And secondly, carrying out fine-granularity word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and carrying out word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In the preferred embodiment of the present invention, the fine-grained word segmentation refers to removing from the text data the words that are not tagged as nouns, verbs, adjectives or adverbs, and obtaining the word segmentation sequence set based on the annotation symbols. Preferably, the removed words are called heterogeneous words, and include all English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words and the like, the stop words including function words such as "in". For example, [pouring_adj morning_n rain_n ground_n soak_v become_v wet_adj mud_n] (a pouring rain in the morning soaked the ground into wet mud) is fine-grain segmented to retain only the tagged words, and the word segmentation sequence set is then derived by stripping the annotation symbols, giving [pouring, morning, rain, ground, soak, become, wet, mud].
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the accumulation and summation operation is carried out on the conditional probability model to obtain a log likelihood function, the optimal solution is solved by maximizing the log likelihood function, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i; the log-likelihood function is then maximized.
The conditional probability model p(ω|X_ω) is:

p(ω|X_ω) = ∏_{j=2}^{l_ω} p(d_j^ω | X_ω, θ_{j−1}^ω) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^(1−d_j^ω) · [1 − σ(X_ω^T θ_{j−1}^ω)]^(d_j^ω)

wherein l_ω denotes the number of nodes contained in the Huffman code of ω, defined with reference to a Huffman binary tree: a tree is a nonlinear data structure formed by organizing data elements (also called nodes) according to branching relationships, and a set of several trees is called a forest. A binary tree is an ordered tree with at most two subtrees per node, referred to as the left and right subtrees; a binary tree with minimum weighted path length is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is represented by its Huffman code, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the jth node on the path p_ω (the root node is not coded), so the sequence d_2^ω … d_{l_ω}^ω is the coding of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)th non-leaf node on the path p_ω (the word ω itself is a leaf node and has no corresponding vector).
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} log p(ω|X_ω)

wherein C is the word stock, which comprises all the nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function proceeds by gradient ascent, using:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulation-summation result X_ω. The V(ω_i) are continuously optimized based on this partial derivative, the optimization process being:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is the set learning rate; the word vectorization data set V(ω) is obtained on this basis.
And thirdly, inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model.
Preferably, the classification model comprises a convolutional neural network, an activation function and a loss function. Wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer.
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
preferably, after the convolutional neural network receives the word vectorization data set, the word vectorization data set is input to the nineteenth layer convolutional layer and the nineteenth layer pooling layer to perform convolutional operation and maximum pooling operation, a reduced-dimension data set is obtained, and the reduced-dimension data set is input to a full-connection layer.
Further, the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
The convolution operation in the preferred embodiment of the present invention is:

ω′ = (ω − k + 2p) / s + 1

wherein ω′ is the output data size, ω is the input data size, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in each data matrix to replace the whole matrix;
the activation function is a softmax of the form:

y_j = e^(z_j) / Σ_i e^(z_i)

where y is the prediction classification set, z denotes the outputs of the fully connected layer, and e is Euler's number (an infinite non-repeating decimal).
In the preferred embodiment of the present invention, the loss value T is calculated from the prediction classification set and the tag set, wherein n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
And step four, receiving a text input by a user, carrying out word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
Alternatively, in other embodiments, the text classification program may be divided into one or more modules, and one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to perform the present invention, where the modules refer to a series of instruction segments of a computer program capable of performing a specific function, for describing the execution of the text classification program in the intelligent text classification apparatus.
For example, referring to fig. 3, a schematic program module of a text classification program 01 in an embodiment of the intelligent text classification apparatus according to the present invention is shown, where the text classification program 01 may be divided into a part-of-speech tagging module 10, a word vectorization conversion module 20, a model training module 30, and a text classification result output module 40, which are exemplary:
the part-of-speech tagging module 10 is configured to: and receiving text data and a tag set, and marking the parts of speech of the text data.
The word vectorization conversion module 20 is configured to: and carrying out fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and carrying out word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
The model training module 30 is configured to: and inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model.
The text classification result output module 40 is configured to: receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
The functions or operation steps implemented when the program modules, such as the part-of-speech tagging module 10, the word vectorization conversion module 20, the model training module 30, and the text classification result output module 40, are substantially the same as those of the above embodiments, and are not described herein again.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium having stored thereon a text classification program executable by one or more processors to implement the following operations:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
The computer-readable storage medium of the present invention is substantially the same as the above-described embodiments of the intelligent text classification apparatus and method, and will not be described in detail herein.
It should be noted that the foregoing reference numerals of the embodiments of the present invention are merely for description and do not represent the relative merits of the embodiments. The terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (4)

1. An intelligent text classification method, characterized in that the method comprises:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result;
wherein the part-of-speech tagging comprises: tagging nouns and verbs in the text data according to a preset part-of-speech tagging template; searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de); and judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb;
the word vectorization processing includes: establishing a classification probability model based on the word segmentation sequence set; constructing a conditional probability model based on the classification probability model; performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function; maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set;
the classification probability modelThe method comprises the following steps:
wherein X is the word segmentation sequence set, omega is the characteristic word in the word segmentation sequence set, the characteristic word comprises nouns, verbs, adjectives and adverbs marked by the parts of speech, e is infinite non-circulating decimal,is X ω Transposed matrix of X ω An accumulation and summation operation for ω, the accumulation and summation operation being:
wherein c is the data number, V (omega) i ) For the ith feature word omega i Is used to vector data.
2. The intelligent text classification method of claim 1, wherein the classification model comprises a convolutional neural network, an activation function and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer; and
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
after the convolutional neural network receives the word vectorization data set, inputting the word vectorization data set into the nineteen convolutional layers and nineteen pooling layers for convolution operations and maximum pooling operations to obtain a reduced-dimension data set, and inputting the reduced-dimension data set into the fully connected layer;
the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
3. An intelligent text classification apparatus comprising a memory and a processor, said memory having stored thereon a text classification program operable on said processor, said text classification program when executed by said processor performing the steps of:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result;
wherein the part-of-speech tagging comprises: tagging nouns and verbs in the text data according to a preset part-of-speech tagging template; searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de); and judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb;
the word vectorization processing includes: establishing a classification probability model based on the word segmentation sequence set; constructing a conditional probability model based on the classification probability model; performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function; maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set;
the classification probability modelThe method comprises the following steps:
wherein X is the word segmentation sequence set, omega is the characteristic word in the word segmentation sequence set, the characteristic word comprises nouns, verbs, adjectives and adverbs marked by the parts of speech, e is infinite non-circulating decimal,is X ω Transposed matrix of X ω An accumulation and summation operation for ω, the accumulation and summation operation being:
wherein c is the data number, V (omega) i ) For the ith feature word omega i Is used to vector data.
4. A computer readable storage medium having stored thereon a text classification program executable by one or more processors to implement the steps of the intelligent text classification method of claim 1 or 2.
CN201910540265.3A 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium Active CN110413773B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium
PCT/CN2019/117341 WO2020253043A1 (en) 2019-06-20 2019-11-12 Intelligent text classification method and apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110413773A CN110413773A (en) 2019-11-05
CN110413773B true CN110413773B (en) 2023-09-22

Family

ID=68359559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910540265.3A Active CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110413773B (en)
WO (1) WO2020253043A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics
CN111275091B (en) * 2020-01-16 2024-05-10 平安科技(深圳)有限公司 Text conclusion intelligent recommendation method and device and computer readable storage medium
CN111339300B (en) * 2020-02-28 2023-08-22 中国工商银行股份有限公司 Text classification method and device
CN112434153A (en) * 2020-12-16 2021-03-02 中国计量大学上虞高等研究院有限公司 Junk information filtering method based on ELMo and convolutional neural network
CN112883191B (en) * 2021-02-05 2023-03-24 山东麦港数据系统有限公司 Agricultural entity automatic identification classification method and device
CN116912845B (en) * 2023-06-16 2024-03-19 广东电网有限责任公司佛山供电局 Intelligent content identification and analysis method and device based on NLP and AI

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161067A1 (en) * 2009-12-29 2011-06-30 Dynavox Systems, Llc System and method of using pos tagging for symbol assignment
CN109471933B (en) * 2018-10-11 2024-05-07 平安科技(深圳)有限公司 Text abstract generation method, storage medium and server
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Also Published As

Publication number Publication date
WO2020253043A1 (en) 2020-12-24
CN110413773A (en) 2019-11-05


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant