CN110413773A - Intelligent text classification method, device and computer readable storage medium - Google Patents


Info

Publication number
CN110413773A
Authority
CN
China
Prior art keywords
text
word
classification
data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910540265.3A
Other languages
Chinese (zh)
Other versions
CN110413773B (en)
Inventor
郑子欧
刘京华
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910540265.3A priority Critical patent/CN110413773B/en
Publication of CN110413773A publication Critical patent/CN110413773A/en
Priority to PCT/CN2019/117341 priority patent/WO2020253043A1/en
Application granted granted Critical
Publication of CN110413773B publication Critical patent/CN110413773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to artificial intelligence technology and discloses an intelligent text classification method, comprising: receiving text data and a tag set; performing part-of-speech tagging on the text data; performing fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set; performing word vectorization on the word segmentation sequence set to obtain a word vectorization data set; inputting the word vectorization data set and the tag set into a classification model for training to obtain a training value, the classification model exiting training when the training value is smaller than a preset threshold; and receiving a text input by a user, performing the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result. The present invention also proposes an intelligent text classification device and a computer readable storage medium. The present invention can achieve accurate text classification.

Description

Intelligent text classification method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an intelligent text classification method and device and a computer readable storage medium.
Background
Text classification is an important part of text processing, and its applications are wide-ranging, for example spam filtering, news classification, part-of-speech tagging, and the like. At present, the content of different texts is generally classified by labeling keywords. This classification method ignores the discourse information in the text and, owing to its lack of consideration of part of speech, is incomplete and coarse, so its accuracy is low.
Disclosure of Invention
The invention provides an intelligent text classification method, an intelligent text classification device and a computer readable storage medium, and mainly aims to provide accurate classification results for a user when the user inputs texts.
In order to achieve the above object, the present invention provides an intelligent text classification method, which comprises:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
Optionally, the part-of-speech tagging comprises:
firstly, marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length (for example, two characters) and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number (an irrational, non-repeating decimal), θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
Optionally, the classification model comprises a convolutional neural network, an activation function and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers and one fully connected layer; and
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers to carry out convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to a full connection layer;
and the fully connected layer receives the dimensionality reduction data set, obtains a prediction classification set in combination with the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, and judges the magnitude relation between the loss value and a preset threshold value, exiting training once the loss value is smaller than the preset threshold value.
In addition, in order to achieve the above object, the present invention further provides an intelligent text classification apparatus, which includes a memory and a processor, wherein the memory stores a text classification program operable on the processor, and the text classification program implements the following steps when executed by the processor:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
Optionally, the part-of-speech tagging comprises:
firstly, marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length (for example, two characters) and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
Optionally, the word vectorization processing includes:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
Optionally, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number, θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a text classification program stored thereon, the text classification program being executable by one or more processors to implement the steps of the intelligent text classification method as described above.
The invention provides an intelligent text classification method, an intelligent text classification device, and a computer readable storage medium. The method performs part-of-speech tagging according to the text content, which effectively converts the text data into part-of-speech data; the word vectorization operation then exposes the features of the text data to the computer for analysis without loss; and repeated training of the classification model effectively improves the robustness and accuracy of the classification of the text data. Therefore, the invention can provide accurate classification results for users.
Drawings
Fig. 1 is a schematic flowchart of an intelligent text classification method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of an intelligent text classification apparatus according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of a text classification program in the intelligent text classification apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an intelligent text classification method. Fig. 1 is a schematic flow chart of an intelligent text classification method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the intelligent text classification method includes:
and S1, receiving the text data and the label set, and performing part-of-speech tagging on the text data.
Preferably, the text data set includes text data of various subjects, such as finance, fiction, education, real estate, and sports, and the tag set records the label of each piece of text data in the text data set, for example recording text data A as sports and text data B as real estate.
In a preferred embodiment of the present invention, the part-of-speech tagging first marks the nouns and verbs in the text data according to a preset part-of-speech mark template, where the part-of-speech mark template is a recognizer built on the characteristics of nouns and verbs and can determine nouns and verbs by recognizing the characteristics of a word. For example, for the sentences [I particularly like eating apples], [playing basketball benefits fitness] and [the enemy yielded in the end], the words [I], [apple], [basketball], [fitness], [enemy] and [end] are marked as nouns according to the part-of-speech mark template, and [like], [eat], [play], [benefit] and [yield] are marked as verbs;
next, words whose length is greater than a preset length, such as two characters, and which contain the particle "的" (de) or "地" (de) are searched for in the text data, and it is judged whether the words immediately preceding and following such a word are nouns or verbs. If the preceding or following word is a noun or verb, the word containing "的" or "地" is an adjective or adverb. For example, in [angry people fiercely beat the hateful thief], the nouns and verbs [people], [beat] and [thief] are first identified according to the part-of-speech mark template; the words longer than two characters containing "的" or "地" are then recognized, namely [angry (愤怒的)], [fiercely (狠狠地)] and [hateful (可恨的)]; since a noun or verb stands next to each of them, they are marked as adjectives or adverbs. Preferably, the annotation attaches a mark symbol, such as [angry_adj people_n fiercely_adv beat_v hateful_adj thief_n].
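Purely as an illustration of the heuristic just described, a minimal Python sketch is given below; the helper name mark_de_words, the tag names and the token list are hypothetical, not part of the claimed implementation:

    # Minimal sketch of the "的"/"地" heuristic: an unmarked word longer than two
    # characters that contains "的" or "地" is marked adjective/adverb when a
    # neighbouring word is a noun or a verb. All data is illustrative.
    def mark_de_words(tokens, tags, min_len=2):
        """tokens: list of words; tags: parallel list with 'n', 'v' or None."""
        for i, word in enumerate(tokens):
            if tags[i] is not None or len(word) <= min_len:
                continue
            if "的" in word or "地" in word:
                prev_tag = tags[i - 1] if i > 0 else None
                next_tag = tags[i + 1] if i + 1 < len(tokens) else None
                if prev_tag in ("n", "v") or next_tag in ("n", "v"):
                    # "的" words modify nouns (adjective); "地" words modify verbs (adverb)
                    tags[i] = "adj" if "的" in word else "adv"
        return tags

    tokens = ["愤怒的", "人们", "狠狠地", "打", "可恨的", "小偷"]  # angry / people / fiercely / beat / hateful / thief
    tags = [None, "n", None, "v", None, "n"]                      # nouns and verbs pre-marked by the template
    print(mark_de_words(tokens, tags))                            # ['adj', 'n', 'adv', 'v', 'adj', 'n']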
S2, performing fine-grained word segmentation on the text data according to the part of speech tagging to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In a preferred embodiment of the present invention, the fine-grained word segmentation means that the words not marked as nouns, verbs, adjectives or adverbs are removed from the text data, and the word segmentation sequence set is then obtained based on the mark symbols. Preferably, the removed words are called the non-feature word set, such as English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words (common function words), and the like. For example, the sentence [a pouring rain in the morning drenched the earth into almost wet mud], annotated as [pouring_adj morning_n rain_n earth_n drench_v become_v wet_adj mud_n], keeps exactly these marked words after the fine-grained word segmentation, and stripping the mark symbols then yields the word segmentation sequence set [pouring morning rain earth drench become wet mud].
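A minimal Python sketch of this filtering step follows, under the assumption that step S1 has produced (word, tag) pairs; the data and names are illustrative:

    # Fine-grained word segmentation as described: drop every token that is not
    # tagged noun/verb/adjective/adverb, then strip the mark symbols to obtain
    # the word segmentation sequence set. All data is illustrative.
    FEATURE_TAGS = {"n", "v", "adj", "adv"}

    def fine_grained_segment(annotated):
        """annotated: list of (word, tag) pairs; tag is None for unmarked words."""
        kept = [(w, t) for w, t in annotated if t in FEATURE_TAGS]  # remove non-feature words
        return [w for w, _ in kept]                                 # mark symbols stripped

    annotated = [("a", None), ("pouring", "adj"), ("morning", "n"), ("rain", "n"),
                 ("earth", "n"), ("drench", "v"), ("become", "v"),
                 ("wet", "adj"), ("mud", "n"), (".", None)]
    print(fine_grained_segment(annotated))
    # ['pouring', 'morning', 'rain', 'earth', 'drench', 'become', 'wet', 'mud']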
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the conditional probability model is subjected to accumulation summation operation to obtain a log-likelihood function, the log-likelihood function is maximized to solve an optimal solution, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb of the word segmentation sequence set (also called a feature word), e is Euler's number (an irrational, non-repeating decimal), θ is the vector of a non-leaf node on the Huffman path of ω (see the conditional probability model below), X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization; the log-likelihood function is then maximized in the following steps.
The conditional probability model p(ω | V(ω_i)) is:

p(ω | V(ω_i)) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^{1−d_j^ω} · [1 − σ(X_ω^T θ_{j−1}^ω)]^{d_j^ω}

where l_ω represents the number of nodes contained on the path of ω in the Huffman coding. Concerning the Huffman binary tree: a tree is a nonlinear data structure formed by data elements (also called nodes) organized according to branch relations, and a set of several trees is called a forest; a binary tree is an ordered tree in which each node has at most two subtrees, called the left and right subtrees; a binary tree whose weighted path length is minimal is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is expressed by its Huffman coding, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the j-th node on the path p^ω (the root node has no code), the codes d_2^ω, ..., d_{l_ω}^ω together form the code of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)-th non-leaf node on the path p^ω; since the word ω is a leaf node, it has no corresponding vector.
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} Σ_{j=2}^{l_ω} { (1 − d_j^ω) · log σ(X_ω^T θ_{j−1}^ω) + d_j^ω · log[1 − σ(X_ω^T θ_{j−1}^ω)] }

where C is the word stock comprising all nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function uses the partial derivative:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulated sum operation. V(ω_i) is continuously optimized on the basis of this partial derivative; the optimization step is:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is a set learning rate, and the word vectorization data set V(ω) is obtained on this basis.
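The formulas above correspond to hierarchical-softmax training as popularized by word2vec; the following numpy sketch of one update step is offered under that reading, with dimensions, Huffman codes and helper names all being illustrative assumptions:

    import numpy as np

    # One hierarchical-softmax update for a feature word ω: X_ω is the sum of the
    # context vectors V(ω_i); each non-leaf node j on ω's Huffman path carries a
    # vector θ and a code d ∈ {0, 1}. All data below is illustrative.
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))             # the classification probability model

    def hs_update(context_vecs, thetas, codes, eta=0.025):
        x = context_vecs.sum(axis=0)                # X_ω: accumulated sum over the context
        grad_x = np.zeros_like(x)
        for j, (theta, d) in enumerate(zip(thetas, codes)):
            q = sigmoid(x @ theta)                  # σ(X_ω^T θ_{j-1}^ω)
            g = eta * (1 - d - q)                   # η times the gradient of one ζ term
            grad_x += g * theta                     # accumulate η · ∂ζ/∂X_ω
            thetas[j] = theta + g * x               # update the non-leaf node vector
        return context_vecs + grad_x                # V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

    rng = np.random.default_rng(0)
    ctx = rng.normal(scale=0.1, size=(2, 8))        # two context word vectors of dimension 8
    nodes = [rng.normal(scale=0.1, size=8) for _ in range(3)]
    print(hs_update(ctx, nodes, codes=[0, 1, 1]).shape)   # (2, 8): updated word vectors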
And S3, inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training.
Preferably, the classification model includes a convolutional neural network, an activation function, and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers, and one fully connected layer.
Inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
preferably, after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers for convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to the full connection layer.
Further, the fully connected layer receives the dimensionality reduction data set, obtains a prediction classification set in combination with the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, and judges the magnitude relation between the loss value and a preset threshold value; the classification model exits training once the loss value is smaller than the preset threshold value.
The convolution operation in the preferred embodiment of the present invention is:

ω' = (ω − k + 2p) / s + 1

where ω' is the size of the output data, ω is the size of the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in the matrix data to replace the whole matrix.
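As a worked check of the reconstructed output-size formula (with illustrative numbers): an input of width ω = 100 convolved with a kernel of size k = 3 at stride s = 1 and zero-padding p = 1 gives ω' = (100 − 3 + 2·1)/1 + 1 = 100, i.e., the width is preserved, and a subsequent max pooling of width 2 then halves it to 50.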
the activation function is:
where y is the prediction classification set and e is an infinite acyclic decimal.
In the preferred embodiment of the present invention, the loss value T is the cross-entropy:

T = −(1/n) Σ_{t=1}^{n} y_t · log(μ_t)

where n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
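A minimal numpy sketch of the activation and loss computation as reconstructed above; the softmax and cross-entropy forms, and all numbers, are illustrative assumptions rather than the patent's verified implementation:

    import numpy as np

    # Softmax over the fully connected layer's outputs, then cross-entropy against
    # one-hot labels; training exits once the loss drops below the 0.01 threshold.
    def softmax(x):
        z = np.exp(x - x.max(axis=1, keepdims=True))   # subtract the row max for stability
        return z / z.sum(axis=1, keepdims=True)

    def cross_entropy(labels, preds, eps=1e-12):
        n = labels.shape[0]                            # data size of the prediction set
        return -np.sum(labels * np.log(preds + eps)) / n

    logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.8, 0.4]])   # fully connected outputs
    labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # one-hot tag set
    T = cross_entropy(labels, softmax(logits))
    print(T, T < 0.01)   # exit training only when T is smaller than the preset threshold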
And S4, receiving a text input by a user, carrying out word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
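A minimal Python sketch of this inference step; the helper stand-ins are hypothetical, with word_vectorize reusing the S2 pipeline and model standing for the trained classifier from S3:

    import numpy as np

    def classify(text, word_vectorize, model, labels):
        """Illustrative S4: vectorize the user text, let the model judge, output the result."""
        vec = word_vectorize(text)               # the same word vectorization operation as S2
        scores = model(vec)                      # the classification model's output
        return labels[int(np.argmax(scores))]    # the classification result

    # Toy stand-ins just to make the sketch executable:
    print(classify("a user text",
                   word_vectorize=lambda t: np.ones(3),
                   model=lambda v: v * np.array([0.1, 0.7, 0.2]),
                   labels=["sports", "finance", "education"]))   # -> finance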
The invention also provides an intelligent text classification device. Fig. 2 is a schematic diagram of the internal structure of the intelligent text classification device according to an embodiment of the present invention.
In the present embodiment, the intelligent text classification device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, a tablet computer or a portable computer, or a server. The intelligent text classification device 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the intelligent text classification device 1, such as a hard disk of the intelligent text classification device 1. The memory 11 may also be an external storage device of the intelligent text classification device 1 in other embodiments, such as a plug-in hard disk provided on the intelligent text classification device 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and so on. Further, the memory 11 may also include both an internal storage unit of the intelligent text classification apparatus 1 and an external storage device. The memory 11 may be used not only to store application software installed in the intelligent text classification device 1 and various types of data, such as codes of the text classification program 01, but also to temporarily store data that has been output or is to be output.
Processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip, executes program code or processes data stored in memory 11, such as executing text classifier 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface may also comprise a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the intelligent text classification device 1 and for displaying a visual user interface.
While fig. 2 only shows the intelligent text classification device 1 with the components 11-14 and the text classification program 01, it will be understood by those skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the intelligent text classification device 1, which may comprise fewer or more components than shown, combine some components, or arrange the components differently.
In the embodiment of the apparatus 1 shown in fig. 2, a text classification program 01 is stored in the memory 11; the processor 12, when executing the text classification program 01 stored in the memory 11, implements the following steps:
the method comprises the steps of receiving text data and a tag set, and performing part-of-speech tagging on the text data.
Preferably, the text data set includes text data of various subjects, such as finance, fiction, education, real estate, and sports, and the tag set records the label of each piece of text data in the text data set, for example recording text data A as sports and text data B as real estate.
In a preferred embodiment of the present invention, the part-of-speech tagging first marks the nouns and verbs in the text data according to a preset part-of-speech mark template, where the part-of-speech mark template is a recognizer built on the characteristics of nouns and verbs and can determine nouns and verbs by recognizing the characteristics of a word. For example, for the sentences [I particularly like eating apples], [playing basketball benefits fitness] and [the enemy yielded in the end], the words [I], [apple], [basketball], [fitness], [enemy] and [end] are marked as nouns according to the part-of-speech mark template, and [like], [eat], [play], [benefit] and [yield] are marked as verbs;
next, words whose length is greater than a preset length, such as two characters, and which contain the particle "的" (de) or "地" (de) are searched for in the text data, and it is judged whether the words immediately preceding and following such a word are nouns or verbs. If the preceding or following word is a noun or verb, the word containing "的" or "地" is an adjective or adverb. For example, in [angry people fiercely beat the hateful thief], the nouns and verbs [people], [beat] and [thief] are first identified according to the part-of-speech mark template; the words longer than two characters containing "的" or "地" are then recognized, namely [angry (愤怒的)], [fiercely (狠狠地)] and [hateful (可恨的)]; since a noun or verb stands next to each of them, they are marked as adjectives or adverbs. Preferably, the annotation attaches a mark symbol, such as [angry_adj people_n fiercely_adv beat_v hateful_adj thief_n].
And secondly, performing fine-grained word segmentation on the text data according to the part of speech tags to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set.
In a preferred embodiment of the present invention, the fine-grained word segmentation means that the words not marked as nouns, verbs, adjectives or adverbs are removed from the text data, and the word segmentation sequence set is then obtained based on the mark symbols. Preferably, the removed words are called the non-feature word set, such as English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words (common function words), and the like. For example, the sentence [a pouring rain in the morning drenched the earth into almost wet mud], annotated as [pouring_adj morning_n rain_n earth_n drench_v become_v wet_adj mud_n], keeps exactly these marked words after the fine-grained word segmentation, and stripping the mark symbols then yields the word segmentation sequence set [pouring morning rain earth drench become wet mud].
Further, a classification probability model is established based on the word segmentation sequence set, a conditional probability model is established based on the classification probability model, the conditional probability model is subjected to accumulation summation operation to obtain a log-likelihood function, the log-likelihood function is maximized to solve an optimal solution, and the optimal solution is the word vectorization data set.
Preferably, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb of the word segmentation sequence set (also called a feature word), e is Euler's number (an irrational, non-repeating decimal), θ is the vector of a non-leaf node on the Huffman path of ω (see the conditional probability model below), X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization; the log-likelihood function is then maximized in the following steps.
The conditional probability model p(ω | V(ω_i)) is:

p(ω | V(ω_i)) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^{1−d_j^ω} · [1 − σ(X_ω^T θ_{j−1}^ω)]^{d_j^ω}

where l_ω represents the number of nodes contained on the path of ω in the Huffman coding. Concerning the Huffman binary tree: a tree is a nonlinear data structure formed by data elements (also called nodes) organized according to branch relations, and a set of several trees is called a forest; a binary tree is an ordered tree in which each node has at most two subtrees, called the left and right subtrees; a binary tree whose weighted path length is minimal is called a Huffman binary tree. Here ω is a leaf node, the weight of each leaf node is expressed by its Huffman coding, and the invention represents words by different arrangements of the codes 0 and 1: d_j^ω denotes the Huffman code corresponding to the j-th node on the path p^ω (the root node has no code), the codes d_2^ω, ..., d_{l_ω}^ω together form the code of the word ω, and θ_{j−1}^ω denotes the vector corresponding to the (j−1)-th non-leaf node on the path p^ω; since the word ω is a leaf node, it has no corresponding vector.
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈C} Σ_{j=2}^{l_ω} { (1 − d_j^ω) · log σ(X_ω^T θ_{j−1}^ω) + d_j^ω · log[1 − σ(X_ω^T θ_{j−1}^ω)] }

where C is the word stock comprising all nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function uses the partial derivative:

∂ζ/∂X_ω = Σ_{j=2}^{l_ω} [1 − d_j^ω − σ(X_ω^T θ_{j−1}^ω)] · θ_{j−1}^ω

where ∂ζ/∂X_ω denotes the partial derivative of the log-likelihood function with respect to the accumulated sum operation. V(ω_i) is continuously optimized on the basis of this partial derivative; the optimization step is:

V(ω_i) := V(ω_i) + η · ∂ζ/∂X_ω

where η is a set learning rate, and the word vectorization data set V(ω) is obtained on this basis.
Inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training.
Preferably, the classification model includes a convolutional neural network, an activation function, and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers, and one fully connected layer.
Inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
preferably, after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers for convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to the full connection layer.
Further, the fully connected layer receives the dimensionality reduction data set, obtains a prediction classification set in combination with the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, and judges the magnitude relation between the loss value and a preset threshold value; the classification model exits training once the loss value is smaller than the preset threshold value.
The convolution operation in the preferred embodiment of the present invention is:

ω' = (ω − k + 2p) / s + 1

where ω' is the size of the output data, ω is the size of the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding applied to the data matrix. The pooling operation may be a maximum pooling operation, which selects the largest value in the matrix data to replace the whole matrix.
the activation function is:
where y is the prediction classification set and e is an infinite acyclic decimal.
In the preferred embodiment of the present invention, the loss value T is the cross-entropy:

T = −(1/n) Σ_{t=1}^{n} y_t · log(μ_t)

where n is the data size of the prediction classification set, y_t is the tag set, μ_t is the prediction classification set, and the preset threshold is typically set at 0.01.
And step four, receiving a text input by a user, carrying out word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
Alternatively, in other embodiments, the text classification program may be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention, where the module referred to in the present invention refers to a series of computer program instruction segments capable of performing specific functions for describing the execution process of the text classification program in the intelligent text classification device.
For example, referring to fig. 3, a schematic diagram of program modules of a text classification program in an embodiment of the intelligent text classification device of the present invention is shown, in this embodiment, the text classification program may be divided into a word tagging module 10, a word vectorization conversion module 20, a model training module 30, and a text classification result output module 40, which exemplarily:
the part of speech tagging module 10 is configured to: receiving text data and a tag set, and performing part-of-speech tagging on the text data.
The word vectorization conversion module 20 is configured to: and performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word vectorization data set.
The model training module 30 is configured to: and inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model exits from training.
The text classification result output module 40 is configured to: receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
The functions or operation steps of the above-mentioned part of speech tagging module 10, word vectorization conversion module 20, model training module 30, and text classification result output module 40 when executed are substantially the same as those of the above-mentioned embodiments, and are not described herein again.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which a text classification program is stored, where the text classification program is executable by one or more processors to implement the following operations:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
The embodiment of the computer-readable storage medium of the present invention is substantially the same as that of the foregoing embodiments of the intelligent text classification device and method, and will not be described herein again.
It should be noted that the above numbering of the embodiments of the present invention is merely for description and does not represent the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. An intelligent text classification method, characterized in that the method comprises:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
2. The intelligent text classification method according to claim 1, characterized in that the part-of-speech tagging comprises:
marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word in the text data are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
3. The intelligent text classification method according to claim 1 or 2, characterized in that the word vectorization process comprises:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
4. The intelligent text classification method of claim 3, wherein the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number, θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
5. The intelligent text classification method of claim 4, wherein the classification model comprises a convolutional neural network, an activation function, and a loss function, wherein the convolutional neural network comprises nineteen convolutional layers, nineteen pooling layers, and one fully-connected layer; and
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, exiting the training of the classification model, including:
after receiving the word vectorization data set, the convolutional neural network inputs the word vectorization data set to the nineteen convolutional layers and the nineteen pooling layers to carry out convolution operation and maximum pooling operation to obtain a dimension reduction data set, and inputs the dimension reduction data set to a full connection layer;
and the full-connection layer receives the dimensionality reduction data set, calculates a prediction classification set by combining the activation function, inputs the prediction classification set and the label set into the loss function to calculate a loss value, judges the size relation between the loss value and a preset threshold value, and quits training until the loss value is smaller than the preset threshold value.
6. An intelligent text classification apparatus, comprising a memory and a processor, the memory having stored thereon a text classification program operable on the processor, the text classification program when executed by the processor implementing the steps of:
receiving text data and a tag set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech labels to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set to obtain a word vectorization data set;
inputting the word vectorization data set and the label set into a classification model for training to obtain a training value, and when the training value is smaller than a preset threshold value, the classification model quits training;
receiving a text input by a user, carrying out the word vectorization operation on the text to obtain a text word vector, inputting the text word vector into the classification model for judgment, and outputting a classification result.
7. The intelligent text classification device according to claim 6, wherein the part-of-speech tagging comprises:
marking nouns and verbs in the text data according to a preset part-of-speech mark template;
searching the text data for words whose length is greater than a preset length and which contain the particle "的" (de) or "地" (de);
and judging whether the words immediately preceding and following such a word in the text data are nouns or verbs, and if the preceding and following words are nouns or verbs, marking the word containing "的" or "地" as an adjective or adverb.
8. The intelligent text classification apparatus according to claim 6 or 7, wherein the word vectorization process comprises:
establishing a classification probability model based on the word segmentation sequence set;
constructing a conditional probability model based on the classification probability model;
performing accumulation summation operation on the conditional probability model to obtain a log-likelihood function;
and maximizing the log-likelihood function to solve an optimal solution, wherein the optimal solution is the word vectorization data set.
9. The intelligent text classification device of claim 8, wherein the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^{−X_ω^T θ})

where X is the word segmentation sequence set, ω is a noun, verb, adjective or adverb labeled by the part-of-speech tagging (also called a feature word), e is Euler's number, θ is the vector of a non-leaf node on the Huffman path of ω, X_ω^T is the transpose of X_ω, and X_ω is the accumulated sum operation over ω:

X_ω = Σ_{i=1}^{c} V(ω_i)

where c is the number of data items in the word segmentation sequence set and V(ω_i) is the assumed word vectorization data set after word vectorization.
10. A computer-readable storage medium having stored thereon a text classification program executable by one or more processors to perform the steps of the intelligent text classification method of any one of claims 1 to 5.
CN201910540265.3A 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium Active CN110413773B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium
PCT/CN2019/117341 WO2020253043A1 (en) 2019-06-20 2019-11-12 Intelligent text classification method and apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110413773A true CN110413773A (en) 2019-11-05
CN110413773B CN110413773B (en) 2023-09-22

Family

ID=68359559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910540265.3A Active CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110413773B (en)
WO (1) WO2020253043A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275091A (en) * 2020-01-16 2020-06-12 平安科技(深圳)有限公司 Intelligent text conclusion recommendation method and device and computer readable storage medium
CN111339300A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Text classification method and device
CN111626047A (en) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 Intelligent text error correction method and device, electronic equipment and readable storage medium
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN112434153A (en) * 2020-12-16 2021-03-02 中国计量大学上虞高等研究院有限公司 Junk information filtering method based on ELMo and convolutional neural network
CN112507663A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Text-based judgment question generation method and device, electronic equipment and storage medium
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883191B (en) * 2021-02-05 2023-03-24 山东麦港数据系统有限公司 Agricultural entity automatic identification classification method and device
CN113342981A (en) * 2021-06-30 2021-09-03 中国工商银行股份有限公司 Demand document classification method and device based on machine learning
CN116912845B (en) * 2023-06-16 2024-03-19 广东电网有限责任公司佛山供电局 Intelligent content identification and analysis method and device based on NLP and AI

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161067A1 (en) * 2009-12-29 2011-06-30 Dynavox Systems, Llc System and method of using pos tagging for symbol assignment
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471933B (en) * 2018-10-11 2024-05-07 平安科技(深圳)有限公司 Text abstract generation method, storage medium and server
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161067A1 (en) * 2009-12-29 2011-06-30 Dynavox Systems, Llc System and method of using pos tagging for symbol assignment
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108763539A (en) * 2018-05-31 2018-11-06 华中科技大学 A kind of file classification method and system based on parts of speech classification
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics
CN111275091A (en) * 2020-01-16 2020-06-12 平安科技(深圳)有限公司 Intelligent text conclusion recommendation method and device and computer readable storage medium
CN111275091B (en) * 2020-01-16 2024-05-10 平安科技(深圳)有限公司 Text conclusion intelligent recommendation method and device and computer readable storage medium
CN111339300A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Text classification method and device
CN111339300B (en) * 2020-02-28 2023-08-22 中国工商银行股份有限公司 Text classification method and device
CN111626047A (en) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 Intelligent text error correction method and device, electronic equipment and readable storage medium
CN112434153A (en) * 2020-12-16 2021-03-02 中国计量大学上虞高等研究院有限公司 Junk information filtering method based on ELMo and convolutional neural network
CN112507663A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Text-based judgment question generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2020253043A1 (en) 2020-12-24
CN110413773B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN110413773B (en) Intelligent text classification method, device and computer readable storage medium
CN110851596B (en) Text classification method, apparatus and computer readable storage medium
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN110347835B (en) Text clustering method, electronic device and storage medium
CN110909548B (en) Chinese named entity recognition method, device and computer readable storage medium
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
CN110442857B (en) Emotion intelligent judging method and device and computer readable storage medium
CN110362723B (en) Topic feature representation method, device and storage medium
CN110032632A (en) Intelligent customer service answering method, device and storage medium based on text similarity
US9336299B2 (en) Acquisition of semantic class lexicons for query tagging
CN107807968B (en) Question answering device and method based on Bayesian network and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN108460011A (en) A kind of entitative concept mask method and system
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN112101031B (en) Entity identification method, terminal equipment and storage medium
CN110427480B (en) Intelligent personalized text recommendation method and device and computer readable storage medium
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
US20220318515A1 (en) Intelligent text cleaning method and apparatus, and computer-readable storage medium
CN108509423A (en) A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN111783471A (en) Semantic recognition method, device, equipment and storage medium of natural language
CN113821605A (en) Event extraction method
CN110866042A (en) Intelligent table query method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant