CN110413773B - Intelligent text classification method, device and computer readable storage medium - Google Patents

Info

Publication number
CN110413773B
CN110413773B
Authority
CN
China
Prior art keywords: word, text, classification, data, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910540265.3A
Other languages
Chinese (zh)
Other versions
CN110413773A (en)
Inventor
郑子欧
刘京华
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date 2019-06-20
Publication date 2023-09-22
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910540265.3A priority Critical patent/CN110413773B/en
Publication of CN110413773A publication Critical patent/CN110413773A/en
Priority to PCT/CN2019/117341 priority patent/WO2020253043A1/en
Application granted granted Critical
Publication of CN110413773B publication Critical patent/CN110413773B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to artificial intelligence technology and discloses an intelligent text classification method comprising the following steps: receiving text data and a tag set, and performing part-of-speech tagging on the text data; performing fine-grained word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word vectorization data set; inputting the word vectorization data set and the tag set into a classification model for training to obtain a training value, and exiting training of the classification model when the training value is smaller than a preset threshold; receiving text input by a user, performing word vectorization on the text to obtain text word vectors, inputting the text word vectors into the classification model for judgment, and outputting the classification result. The invention also provides an intelligent text classification device and a computer readable storage medium. The invention can realize an accurate text classification function.

Description

Intelligent text classification method, device and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to an intelligent text classification method, apparatus, and computer readable storage medium.
Background
Text classification is an important part of text processing and is very widely used, for example in spam filtering, news classification, part-of-speech tagging and the like. To classify the content of different texts, current methods usually rely on tagging keywords. Such classification methods ignore the discourse information in the text and, because they lack any consideration of parts of speech, classify the text incompletely and coarsely, resulting in low accuracy.
Disclosure of Invention
The invention provides an intelligent text classification method, an intelligent text classification device and a computer readable storage medium, which mainly aim to provide accurate classification results for users when the users input texts.
In order to achieve the above object, the present invention provides an intelligent text classification method, including:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
Optionally, the part-of-speech tagging includes:
marking nouns and verbs in the text data according to a preset part-of-speech marking template;
searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de);
judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs obtained by the part-of-speech tagging, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i.
Optionally, the classification model comprises a convolutional neural network, an activation function and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer; and
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
after the convolutional neural network receives the word vectorization data set, inputting the word vectorization data set into the nineteen convolutional layers and nineteen pooling layers for convolution operations and maximum pooling operations to obtain a reduced-dimension data set, and inputting the reduced-dimension data set into the fully connected layer;
the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
In addition, in order to achieve the above object, the present invention also provides an intelligent text classification apparatus, which includes a memory and a processor, wherein a text classification program capable of being executed on the processor is stored in the memory, and the text classification program, when executed by the processor, performs the steps of:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
Optionally, the part-of-speech tagging includes:
marking nouns and verbs in the text data according to a preset part-of-speech marking template;
searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de);
judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs obtained by the part-of-speech tagging, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a text classification program executable by one or more processors to implement the steps of the intelligent text classification method as described above.
The invention provides an intelligent text classification method, an intelligent text classification device and a computer readable storage medium. Part-of-speech tagging according to the text content effectively converts the text data into part-of-speech data; the word vectorization operation then allows a computer to read and analyse the features of the text data without loss; and repeated training of the classification model effectively improves the robustness and accuracy of text classification. The invention can therefore provide accurate classification results for users.
Drawings
FIG. 1 is a flow chart of an intelligent text classification method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of an intelligent text classification apparatus according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a text classification program in an intelligent text classification apparatus according to an embodiment of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides an intelligent text classification method. Referring to fig. 1, a flow chart of an intelligent text classification method according to an embodiment of the invention is shown. The method may be performed by an apparatus, which may be implemented in software and/or hardware.
In this embodiment, the intelligent text classification method includes:
s1, receiving text data and a tag set, and marking the parts of speech of the text data.
Preferably, the text data set includes text data on various subjects, such as finance, fiction, education, real estate, sports and the like, and the tag set records the tag of each text data in the text data set, for example recording that text data A is of the sports class and text data B is of the real estate class.
In a preferred embodiment of the present invention, the part-of-speech tagging first tags the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template refers to a recognizer of noun and verb features, which can determine nouns and verbs by recognizing the features of words. For example, given the sentences [I especially like to eat apples], [Playing basketball is good for fitness] and [The enemy surrendered last time], the nouns [apple], [basketball], [fitness], [enemy] and [time] and the verbs [eat], [play] and [surrender] are tagged according to the part-of-speech tagging template;
words longer than two characters and containing 的 (de) or 地 (de) are then searched for in the text data, and it is judged whether the words before and after each such word are nouns or verbs. If the preceding and following words are nouns or verbs, the word longer than two characters and containing 的 or 地 is tagged as an adjective or adverb. For example, for [The angry people strongly beat the hateful thief], [people], [beat] and [thief] are first recognized according to the part-of-speech tagging template, and [angry], [strongly] and [hateful] are recognized as words longer than two characters containing 的 or 地; since their neighbouring words [people], [beat] and [thief] are nouns or verbs, [angry], [strongly] and [hateful] are tagged as adjectives or adverbs. Preferably, the tagging may take the form of annotation symbols, such as [angry_adj people_n strongly_adv beat_v hateful_adj thief_n].
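This heuristic can be sketched as follows; the sketch is a minimal illustration that assumes a hypothetical dictionary-backed template (the patent's template is a feature-based recognizer) and pre-split tokens:

```python
# Minimal sketch of the tagging heuristic above. NOUN_VERB_TEMPLATE is a
# hypothetical dictionary stand-in for the preset part-of-speech tagging
# template described in the patent.
NOUN_VERB_TEMPLATE = {"人们": "n", "打": "v", "小偷": "n"}

def pos_tag(tokens):
    """Tag nouns/verbs from the template, then tag words longer than two
    characters containing 的 or 地 as adjective/adverb when a neighbouring
    word is a noun or verb."""
    tags = [NOUN_VERB_TEMPLATE.get(tok) for tok in tokens]
    for i, tok in enumerate(tokens):
        if len(tok) > 2 and ("的" in tok or "地" in tok):
            prev_tag = tags[i - 1] if i > 0 else None
            next_tag = tags[i + 1] if i + 1 < len(tokens) else None
            if prev_tag in ("n", "v") or next_tag in ("n", "v"):
                tags[i] = "adj" if "的" in tok else "adv"  # 的 -> adj, 地 -> adv
    return list(zip(tokens, tags))

# pos_tag(["愤怒的", "人们", "强烈地", "打", "可恶的", "小偷"])
# -> [("愤怒的", "adj"), ("人们", "n"), ("强烈地", "adv"),
#     ("打", "v"), ("可恶的", "adj"), ("小偷", "n")]
```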
S2, carrying out fine-granularity word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and carrying out word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In the preferred embodiment of the present invention, the fine-grained word segmentation refers to removing from the text data the words that are not tagged as nouns, verbs, adjectives or adverbs, and obtaining the word segmentation sequence set based on the annotation symbols. Preferably, the removed words are called heterogeneous words, and include all English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words and the like, the stop words including function words such as "in". For example, [pouring_adj morning_n rain_n ground_n soak_v become_v wet_adj mud_n] (a pouring rain in the morning soaked the ground into wet mud) is fine-grain segmented to retain only the tagged words, and the word segmentation sequence set is then derived by stripping the annotation symbols, giving [pouring, morning, rain, ground, soak, become, wet, mud].
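Continuing the sketch above, the fine-grained segmentation step then reduces to a filter over the tagged tokens; fine_grained_segment is a hypothetical helper:

```python
TAGS_TO_KEEP = ("n", "v", "adj", "adv")

def fine_grained_segment(tagged_tokens):
    """Remove heterogeneous words (tokens not tagged as noun, verb,
    adjective or adverb: letters, numerals, punctuation, stop words),
    then strip the annotation symbols."""
    return [tok for tok, tag in tagged_tokens if tag in TAGS_TO_KEEP]

# fine_grained_segment(pos_tag(tokens)) yields the word segmentation sequence.
```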
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the accumulation and summation operation is carried out on the conditional probability model to obtain a log likelihood function, the optimal solution is solved by maximizing the log likelihood function, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i; the log-likelihood function is then maximized.
The conditional probability model p(ω|X_ω) is:

p(ω|X_ω) = ∏_{j=2}^{l_ω} p(d_j^ω | X_ω, θ_{j−1}^ω) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^(1−d_j^ω) · [1 − σ(X_ω^T θ_{j−1}^ω)]^(d_j^ω)

wherein l_ω denotes the number of nodes contained in the Huffman code of ω, defined with reference to a Huffman binary tree: a tree is a nonlinear data structure formed by organizing data elements (also called nodes) according to branching relationships, and a set of several trees is called a forest. A binary tree is an ordered tree with at most two subtrees per node, referred to as the left and right subtrees; a binary tree with minimum weighted path length is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is represented by its Huffman code, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the jth node on the path p_ω (the root node is not coded), so the sequence d_2^ω … d_{l_ω}^ω is the coding of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)th non-leaf node on the path p_ω (the word ω itself is a leaf node and has no corresponding vector).
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} log p(ω|X_ω)

wherein C is the word stock, which comprises all the nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function proceeds by gradient ascent, using:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulation-summation result X_ω. The V(ω_i) are continuously optimized based on this partial derivative, the optimization process being:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is the set learning rate; the word vectorization data set V(ω) is obtained on this basis.
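One stochastic step of this optimization can be sketched as follows, assuming the standard hierarchical-softmax gradients consistent with the derivation above; cbow_hs_step is a hypothetical name and all inputs are numpy arrays:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_step(context_vecs, huffman_code, node_vecs, eta=0.025):
    """One gradient-ascent step on the log-likelihood for a single word.
    context_vecs: the 2c context vectors V(omega_i);
    huffman_code: the codes d_2 .. d_{l_omega} of the target word (0/1 ints);
    node_vecs: the non-leaf node vectors theta_1 .. theta_{l_omega - 1};
    eta: the set learning rate."""
    x = np.sum(context_vecs, axis=0)        # X_omega: accumulation-summation
    grad_x = np.zeros_like(x)
    for d, theta in zip(huffman_code, node_vecs):
        q = sigmoid(x @ theta)              # sigma(X_omega^T theta)
        g = eta * (1 - d - q)               # gradient coefficient
        grad_x += g * theta                 # accumulate d(zeta)/d(X_omega)
        theta += g * x                      # update the node vector in place
    for v in context_vecs:                  # propagate to every V(omega_i)
        v += grad_x
```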
S3, inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training.
Preferably, the classification model comprises a convolutional neural network, an activation function and a loss function. Wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer.
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
preferably, after the convolutional neural network receives the word vectorization data set, the word vectorization data set is input to the nineteenth layer convolutional layer and the nineteenth layer pooling layer to perform convolutional operation and maximum pooling operation, a reduced-dimension data set is obtained, and the reduced-dimension data set is input to a full-connection layer.
Further, the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
The convolution operation in the preferred embodiment of the present invention is:

ω′ = (ω − k + 2p) / s + 1

wherein ω′ is the output data size, ω is the input data size, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in each data matrix to replace the whole matrix;
the activation function is a softmax of the form:

y_j = e^(z_j) / Σ_i e^(z_i)

where y is the prediction classification set, z denotes the outputs of the fully connected layer, and e is Euler's number (an infinite non-repeating decimal).
In the preferred embodiment of the present invention, the loss value T is calculated from the prediction classification set and the tag set, wherein n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
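A minimal PyTorch sketch of such a classifier and its training loop is given below; the channel width, kernel size, pooling configuration and the cross-entropy loss are illustrative assumptions (the patent's loss formula is not recoverable from the text), and TextCNN and train are hypothetical names:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Nineteen convolution + max-pooling pairs followed by one fully
    connected layer; the softmax is applied inside the loss."""
    def __init__(self, embed_dim=128, n_classes=6, channels=32):
        super().__init__()
        layers, in_ch = [], embed_dim
        for _ in range(19):                       # nineteen conv/pool pairs
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool1d(kernel_size=2, ceil_mode=True)]
            in_ch = channels
        self.features = nn.Sequential(*layers)    # yields the reduced-dimension data
        self.fc = nn.Linear(channels, n_classes)  # the fully connected layer

    def forward(self, x):                         # x: (batch, embed_dim, seq_len)
        h = self.features(x).mean(dim=2)          # collapse the remaining length
        return self.fc(h)                         # class scores

def train(model, loader, threshold=0.01, lr=1e-3):
    """Train until the loss value falls below the preset threshold."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()               # illustrative stand-in for T
    loss = torch.tensor(float("inf"))
    while loss.item() >= threshold:
        for vecs, labels in loader:               # word vectors and tag set
            opt.zero_grad()
            loss = loss_fn(model(vecs), labels)
            loss.backward()
            opt.step()
```

Ceil-mode pooling is used so that the sequence length stays valid (it bottoms out at 1) across all nineteen pairs.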
S4, receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
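Under the same assumptions, this step reduces to feeding the word vectors of the user text through the trained model and taking the highest-scoring tag; classify below is a hypothetical helper:

```python
def classify(model, text_word_vecs, class_names):
    """text_word_vecs: tensor of shape (embed_dim, seq_len) built from the
    trained word vectors; class_names: the tags, e.g. ("sports", "real estate")."""
    model.eval()
    with torch.no_grad():
        scores = model(text_word_vecs.unsqueeze(0))   # add a batch dimension
    return class_names[int(scores.argmax(dim=1))]
```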
The invention further provides an intelligent text classification device. Referring to fig. 2, an internal structure diagram of an intelligent text classification apparatus according to an embodiment of the invention is shown.
In this embodiment, the intelligent text classification apparatus 1 may be a PC (Personal Computer), or a terminal device such as a smart phone, a tablet computer, a portable computer, or a server. The intelligent text classification apparatus 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the intelligent text classification apparatus 1, such as a hard disk of the intelligent text classification apparatus 1. The memory 11 may in other embodiments also be an external storage device of the intelligent text classification apparatus 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the intelligent text classification apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the intelligent text classification apparatus 1. The memory 11 may be used not only for storing application software installed in the intelligent text classification apparatus 1 and various types of data, such as codes of the text classification program 01, but also for temporarily storing data that has been output or is to be output.
Processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code or processing data stored in memory 11, such as for executing text classification program 01 or the like.
The communication bus 13 is used to enable connection communication between these components.
The network interface 14 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), typically used to establish a communication connection between the apparatus 1 and other electronic devices.
Optionally, the device 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or a display unit, as appropriate, for displaying information processed in the intelligent text classification apparatus 1 and for displaying a visual user interface.
Fig. 2 shows only the intelligent text classification apparatus 1 with the components 11-14 and the text classification program 01. It will be understood by those skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the intelligent text classification apparatus 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the device 1 shown in fig. 2, a text classification program 01 is stored in the memory 11; the processor 12 performs the following steps when executing the text classification program 01 stored in the memory 11:
step one, receiving text data and a tag set, and marking the parts of speech of the text data.
Preferably, the text data set includes text data on various subjects, such as finance, fiction, education, real estate, sports and the like, and the tag set records the tag of each text data in the text data set, for example recording that text data A is of the sports class and text data B is of the real estate class.
In a preferred embodiment of the present invention, the part-of-speech tagging first tags the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template refers to a recognizer of noun and verb features, which can determine nouns and verbs by recognizing the features of words. For example, given the sentences [I especially like to eat apples], [Playing basketball is good for fitness] and [The enemy surrendered last time], the nouns [apple], [basketball], [fitness], [enemy] and [time] and the verbs [eat], [play] and [surrender] are tagged according to the part-of-speech tagging template;
words longer than two characters and containing 的 (de) or 地 (de) are then searched for in the text data, and it is judged whether the words before and after each such word are nouns or verbs. If the preceding and following words are nouns or verbs, the word longer than two characters and containing 的 or 地 is tagged as an adjective or adverb. For example, for [The angry people strongly beat the hateful thief], [people], [beat] and [thief] are first recognized according to the part-of-speech tagging template, and [angry], [strongly] and [hateful] are recognized as words longer than two characters containing 的 or 地; since their neighbouring words [people], [beat] and [thief] are nouns or verbs, [angry], [strongly] and [hateful] are tagged as adjectives or adverbs. Preferably, the tagging may take the form of annotation symbols, such as [angry_adj people_n strongly_adv beat_v hateful_adj thief_n].
And secondly, carrying out fine-granularity word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and carrying out word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In the preferred embodiment of the present invention, the fine-grained word segmentation refers to removing from the text data the words that are not tagged as nouns, verbs, adjectives or adverbs, and obtaining the word segmentation sequence set based on the annotation symbols. Preferably, the removed words are called heterogeneous words, and include all English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words and the like, the stop words including function words such as "in". For example, [pouring_adj morning_n rain_n ground_n soak_v become_v wet_adj mud_n] (a pouring rain in the morning soaked the ground into wet mud) is fine-grain segmented to retain only the tagged words, and the word segmentation sequence set is then derived by stripping the annotation symbols, giving [pouring, morning, rain, ground, soak, become, wet, mud].
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the accumulation and summation operation is carried out on the conditional probability model to obtain a log likelihood function, the optimal solution is solved by maximizing the log likelihood function, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ(X_ω^T θ) is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

wherein X is the word segmentation sequence set, ω is a feature word in the word segmentation sequence set, the feature words comprising the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, e is Euler's number (an infinite non-repeating decimal), X_ω^T is the transpose of X_ω, and X_ω is the accumulation-summation operation for ω:

X_ω = Σ_{i=1}^{2c} V(ω_i)

wherein c is the context window size and V(ω_i) is the word vector of the ith feature word ω_i; the log-likelihood function is then maximized.
The conditional probability model p(ω|X_ω) is:

p(ω|X_ω) = ∏_{j=2}^{l_ω} p(d_j^ω | X_ω, θ_{j−1}^ω) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^(1−d_j^ω) · [1 − σ(X_ω^T θ_{j−1}^ω)]^(d_j^ω)

wherein l_ω denotes the number of nodes contained in the Huffman code of ω, defined with reference to a Huffman binary tree: a tree is a nonlinear data structure formed by organizing data elements (also called nodes) according to branching relationships, and a set of several trees is called a forest. A binary tree is an ordered tree with at most two subtrees per node, referred to as the left and right subtrees; a binary tree with minimum weighted path length is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is represented by its Huffman code, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the jth node on the path p_ω (the root node is not coded), so the sequence d_2^ω … d_{l_ω}^ω is the coding of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)th non-leaf node on the path p_ω (the word ω itself is a leaf node and has no corresponding vector).
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} log p(ω|X_ω)

wherein C is the word stock, which comprises all the nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function proceeds by gradient ascent, using:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulation-summation result X_ω. The V(ω_i) are continuously optimized based on this partial derivative, the optimization process being:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is the set learning rate; the word vectorization data set V(ω) is obtained on this basis.
And thirdly, inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model.
Preferably, the classification model comprises a convolutional neural network, an activation function and a loss function. Wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer.
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
preferably, after the convolutional neural network receives the word vectorization data set, the word vectorization data set is input to the nineteenth layer convolutional layer and the nineteenth layer pooling layer to perform convolutional operation and maximum pooling operation, a reduced-dimension data set is obtained, and the reduced-dimension data set is input to a full-connection layer.
Further, the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
The convolution operation in the preferred embodiment of the present invention is:

ω′ = (ω − k + 2p) / s + 1

wherein ω′ is the output data size, ω is the input data size, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in each data matrix to replace the whole matrix;
the activation function is a softmax of the form:

y_j = e^(z_j) / Σ_i e^(z_i)

where y is the prediction classification set, z denotes the outputs of the fully connected layer, and e is Euler's number (an infinite non-repeating decimal).
In the preferred embodiment of the present invention, the loss value T is calculated from the prediction classification set and the tag set, wherein n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
And step four, receiving a text input by a user, carrying out word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
Alternatively, in other embodiments, the text classification program may be divided into one or more modules, and one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to perform the present invention, where the modules refer to a series of instruction segments of a computer program capable of performing a specific function, for describing the execution of the text classification program in the intelligent text classification apparatus.
For example, referring to fig. 3, a schematic program module of a text classification program 01 in an embodiment of the intelligent text classification apparatus according to the present invention is shown, where the text classification program 01 may be divided into a part-of-speech tagging module 10, a word vectorization conversion module 20, a model training module 30, and a text classification result output module 40, which are exemplary:
the part-of-speech tagging module 10 is configured to: and receiving text data and a tag set, and marking the parts of speech of the text data.
The word vectorization conversion module 20 is configured to: and carrying out fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and carrying out word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
The model training module 30 is configured to: and inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model.
The text classification result output module 40 is configured to: receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
The functions or operation steps implemented when the program modules, such as the part-of-speech tagging module 10, the word vectorization conversion module 20, the model training module 30, and the text classification result output module 40, are substantially the same as those of the above embodiments, and are not described herein again.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium having stored thereon a text classification program executable by one or more processors to implement the following operations:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model to judge and output a classification result.
The computer-readable storage medium of the present invention is substantially the same as the above-described embodiments of the intelligent text classification apparatus and method, and will not be described in detail herein.
It should be noted that the foregoing reference numerals of the embodiments of the present invention are merely for description and do not represent the relative merits of the embodiments. The terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (4)

1. An intelligent text classification method, characterized in that the method comprises:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result;
wherein the part-of-speech tagging comprises: tagging nouns and verbs in the text data according to a preset part-of-speech tagging template; searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de); and judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb;
the word vectorization processing includes: establishing a classification probability model based on the word segmentation sequence set; constructing a conditional probability model based on the classification probability model; performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function; maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set;
the classification probability modelThe method comprises the following steps:
wherein X is the word segmentation sequence set, omega is the characteristic word in the word segmentation sequence set, the characteristic word comprises nouns, verbs, adjectives and adverbs marked by the parts of speech, e is infinite non-circulating decimal,is X ω Transposed matrix of X ω An accumulation and summation operation for ω, the accumulation and summation operation being:
wherein c is the data number, V (omega) i ) For the ith feature word omega i Is used to vector data.
2. The intelligent text classification method of claim 1, wherein the classification model comprises a convolutional neural network, an activation function and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer; and
Inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training, including:
after the convolutional neural network receives the word vectorization data set, inputting the word vectorization data set into the nineteen convolutional layers and nineteen pooling layers for convolution operations and maximum pooling operations to obtain a reduced-dimension data set, and inputting the reduced-dimension data set into the fully connected layer;
the fully connected layer receives the reduced-dimension data set and calculates a prediction classification set in combination with the activation function, inputs the prediction classification set and the tag set into the loss function to calculate a loss value, judges the magnitude relation between the loss value and a preset threshold, and exits training once the loss value is smaller than the preset threshold.
3. An intelligent text classification apparatus comprising a memory and a processor, said memory having stored thereon a text classification program operable on said processor, said text classification program when executed by said processor performing the steps of:
receiving text data and a tag set, and marking the parts of speech of the text data;
performing fine-granularity word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training and obtaining a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, performing word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result;
wherein the part-of-speech tagging comprises: tagging nouns and verbs in the text data according to a preset part-of-speech tagging template; searching for words in the text data which are longer than two characters and contain 的 (de) or 地 (de); and judging whether the words before and after each word longer than two characters and containing 的 or 地 are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than two characters and containing 的 or 地 as an adjective or adverb;
the word vectorization processing includes: establishing a classification probability model based on the word segmentation sequence set; constructing a conditional probability model based on the classification probability model; performing accumulation and summation operation on the conditional probability model to obtain a log-likelihood function; maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set;
the classification probability modelThe method comprises the following steps:
wherein X is the word segmentation sequence set, omega is the characteristic word in the word segmentation sequence set, the characteristic word comprises nouns, verbs, adjectives and adverbs marked by the parts of speech, e is infinite non-circulating decimal,is X ω Transposed matrix of X ω An accumulation and summation operation for ω, the accumulation and summation operation being:
wherein c is the data number, V (omega) i ) For the ith feature word omega i Is used to vector data.
4. A computer readable storage medium having stored thereon a text classification program executable by one or more processors to implement the steps of the intelligent text classification method of claim 1 or 2.
CN201910540265.3A 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium Active CN110413773B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium
PCT/CN2019/117341 WO2020253043A1 (en) 2019-06-20 2019-11-12 Intelligent text classification method and apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110413773A CN110413773A (en) 2019-11-05
CN110413773B true CN110413773B (en) 2023-09-22

Family

ID=68359559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910540265.3A Active CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110413773B (en)
WO (1) WO2020253043A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics
CN111275091B (en) * 2020-01-16 2024-05-10 平安科技(深圳)有限公司 Text conclusion intelligent recommendation method and device and computer readable storage medium
CN111339300B (en) * 2020-02-28 2023-08-22 中国工商银行股份有限公司 Text classification method and device
CN112434153A (en) * 2020-12-16 2021-03-02 中国计量大学上虞高等研究院有限公司 Junk information filtering method based on ELMo and convolutional neural network
CN112883191B (en) * 2021-02-05 2023-03-24 山东麦港数据系统有限公司 Agricultural entity automatic identification classification method and device
CN116912845B (en) * 2023-06-16 2024-03-19 广东电网有限责任公司佛山供电局 Intelligent content identification and analysis method and device based on NLP and AI

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161067A1 (en) * 2009-12-29 2011-06-30 Dynavox Systems, Llc System and method of using pos tagging for symbol assignment
CN109471933B (en) * 2018-10-11 2024-05-07 平安科技(深圳)有限公司 Text abstract generation method, storage medium and server
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Also Published As

Publication number Publication date
WO2020253043A1 (en) 2020-12-24
CN110413773A (en) 2019-11-05


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant