WO2020253043A1 - Intelligent text classification method and apparatus, and computer-readable storage medium - Google Patents


Info

Publication number
WO2020253043A1
WO2020253043A1 (PCT/CN2019/117341 · CN2019117341W)
Authority
WO
WIPO (PCT)
Prior art keywords
word
classification
text
data
training
Prior art date
Application number
PCT/CN2019/117341
Other languages
French (fr)
Chinese (zh)
Inventor
郑子欧
刘京华
汪伟
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020253043A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 — Clustering; Classification
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an intelligent text classification method, device and computer-readable storage medium.
  • Text classification is a very important part of text processing, and its applications are also very extensive, such as: spam filtering, news classification, part-of-speech tagging, etc.
  • Current approaches to classifying texts by content usually rely on labeled keywords.
  • Such a classification method ignores discourse-level information in the text, and because it takes no account of part of speech, the resulting classification is neither comprehensive nor fine-grained, leading to low accuracy.
  • This application provides an intelligent text classification method, device, and computer-readable storage medium, the main purpose of which is to provide users with accurate classification results when they input text.
  • an intelligent text classification method includes:
  • text data and a label set are received, and part-of-speech tagging is performed on the text data; fine-grained word segmentation is performed on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set, and word vectorization is performed on it to obtain a word-vectorized data set; the word-vectorized data set and the label set are input into a classification model for training to obtain a training value, and the classification model exits training when the training value is less than a preset threshold; and text input by the user is received, the word vectorization operation is performed on the text to obtain text word vectors, and the text word vectors are input into the classification model, which judges them and outputs a classification result.
  • the present application also provides an intelligent text classification device, which includes a memory and a processor.
  • the memory stores a text classification program that can run on the processor.
  • when the text classification program is executed by the processor, the steps of the intelligent text classification method described above are implemented.
  • the present application also provides a computer-readable storage medium having a text classification program stored thereon; the text classification program can be executed by one or more processors to implement the steps of the intelligent text classification method described above.
  • the intelligent text classification method, device and computer-readable storage medium proposed in this application perform part-of-speech tagging based on text content, which effectively converts text data into part-of-speech data.
  • word vectorization further conveys the features of the text data, without loss, to a computer for analysis.
  • repeated training of the classification model effectively improves the robustness and accuracy of text classification, so this application can provide users with accurate classification results.
  • FIG. 1 is a schematic flowchart of an intelligent text classification method provided by an embodiment of this application
  • FIG. 2 is a schematic diagram of the internal structure of an intelligent text classification device provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of modules of a text classification program in an intelligent text classification device provided by an embodiment of the application.
  • This application provides an intelligent text classification method.
  • FIG. 1 is a schematic flowchart of an intelligent text classification method provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the intelligent text classification method includes:
  • S1. Receive text data and a label set, and perform part-of-speech tagging on the text data.
  • the text data set includes text data on various subjects, such as finance, fiction, education, real estate and sports.
  • the label set records the label of each piece of text data in the text data set, e.g., text data A is labeled sports and text data B real estate.
  • the part-of-speech tagging first annotates the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template is a recognizer trained on the features of nouns and verbs.
  • the part-of-speech tagging template identifies nouns and verbs from the characteristics of words.
  • words longer than a preset length that contain "的" or "地" and are adjacent to a noun or verb are adjectives or adverbs, as in [愤怒的人们狠狠地厮打可恨的小偷] (angry people fiercely fight the hateful thief).
  • the tagging may take a form that includes tagging symbols, such as [愤怒的 adj 人们 n 狠狠地 adv 厮打 v 可恨的 adj 小偷 n].
  • the fine-grained word segmentation removes the words in the text data that are not tagged as nouns, verbs, adjectives or adverbs, and obtains a word segmentation sequence set based on the tagging symbols.
  • the removed words are called the heteromorphic word set, for example all Latin letters, Arabic numerals, Chinese numerals, punctuation marks and stop words.
  • the stop words include words such as "了" and "于".
  • a classification probability model is established on the word segmentation sequence set, a conditional probability model is constructed from it, and a cumulative summation over the conditional probability model yields a log-likelihood function; maximizing the log-likelihood function solves for the optimal solution, which is the word-vectorized data set.
  • the classification probability model σ is:

    σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

  • where X is the word segmentation sequence set; ω denotes the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, also called feature words; e is the base of the natural logarithm; X_ω^T is the transpose of X_ω; and X_ω is the cumulative sum over the context of ω:

    X_ω = Σ_{i=1}^{c} V(ω_i)

  • where c is the number of items in the word segmentation sequence set and V(ω_i) is the word vector of ω_i, assumed already vectorized here; it is actually obtained later by maximizing the log-likelihood function.
  • the conditional probability model p(ω|V(ω_i)) is:

    p(ω|V(ω_i)) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^(1−d_j^ω) · [1 − σ(X_ω^T θ_{j−1}^ω)]^(d_j^ω)

  • where l_ω is the number of nodes on the Huffman path of ω. Regarding the Huffman coding and the Huffman binary tree: a tree is a non-linear data structure in which data elements (also called nodes) are organized by branch relationships, and a collection of trees is called a forest. A binary tree is an ordered tree in which each node has at most two subtrees, called the left subtree and the right subtree. A binary tree with the smallest path length is called a Huffman binary tree. The word ω is therefore a leaf node, and the weight of each leaf node is expressed through its Huffman code; this application represents words by different arrangements of the digits 0 and 1. d_j^ω denotes the Huffman code of the j-th node on the path p^ω (the root node has no code), d^ω is the code of the word ω, and θ_{j−1}^ω is the vector corresponding to the (j−1)-th non-leaf node on the path p^ω; since the word ω is a leaf node, it has no corresponding vector.
  • the log-likelihood function ζ is:

    ζ = Σ_{ω∈D} Σ_{j=2}^{l_ω} { (1 − d_j^ω) · log σ(X_ω^T θ_{j−1}^ω) + d_j^ω · log[1 − σ(X_ω^T θ_{j−1}^ω)] }

  • where D is the vocabulary, which includes all the nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
  • maximizing the log-likelihood function uses ∂ζ/∂X_ω, the partial derivative of the log-likelihood function with respect to the transposed cumulative-sum vector; V(ω_i) is continuously optimized based on this partial derivative:

    V(ω_i) := V(ω_i) + η · Σ_{j=2}^{l_ω} ∂ζ(ω, j)/∂X_ω

  • where η is the set learning rate; the word-vectorized data set V(ω) is obtained from the above.
  • the classification model of the present application includes a convolutional neural network, an activation function and a loss function.
  • the convolutional neural network includes nineteen convolutional layers, nineteen pooling layers and one fully connected layer.
  • inputting the word-vectorized data set and the label set into the classification model for training, obtaining a training value, and exiting training when the training value is less than a preset threshold includes the following.
  • after the convolutional neural network receives the word-vectorized data set, it passes the data through the nineteen convolutional layers and nineteen pooling layers, performing convolution and max-pooling operations to obtain a dimensionality-reduced data set, which is input to the fully connected layer.
  • the fully connected layer receives the dimensionality-reduced data set and computes a predicted classification set using the activation function; the predicted classification set and the label set are input into the loss function to compute a loss value, and the loss value is compared with the preset threshold; when the loss value is less than the preset threshold, the classification model exits training.
  • ⁇ ' is the output data
  • is the input data
  • k is the size of the convolution kernel
  • s is the stride of the convolution operation
  • p is the data zero-filling matrix.
  • the pooling operation can select the maximum pooling operation, The maximum pooling operation is to select the largest value in the matrix data in the matrix to replace the entire matrix;
  • the activation function is:
  • n is the data size of the prediction classification set
  • y t is the label set
  • ⁇ t is the prediction classification set
  • the preset threshold is generally set at 0.01.
  • This application also provides an intelligent text classification device.
  • FIG. 2 is a schematic diagram of the internal structure of an intelligent text classification device provided by an embodiment of this application.
  • the intelligent text classification device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, tablet computer or portable computer, or a server.
  • the intelligent text classification device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 11 may be an internal storage unit of the intelligent text classification device 1, for example, the hard disk of the intelligent text classification device 1.
  • the memory 11 may also be an external storage device of the intelligent text classification device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card equipped on the intelligent text classification device 1.
  • the memory 11 may also include both an internal storage unit of the intelligent text classification device 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the intelligent text classification device 1, such as the code of the text classification program 01, etc., but also to temporarily store data that has been output or will be output.
  • the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the text classification program 01.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display can also be called a display screen or a display unit as appropriate, and is used to display the information processed in the intelligent text classification device 1 and to display a visualized user interface.
  • FIG. 2 only shows the intelligent text classification device 1 with components 11-14 and the text classification program 01.
  • those skilled in the art will understand that this structure does not constitute a limitation on the intelligent text classification device 1; it may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the text classification program 01 is stored in the memory 11; when the processor 12 executes the text classification program 01 stored in the memory 11, the following steps are implemented:
  • Step 1: Receive text data and a label set, and perform part-of-speech tagging on the text data.
  • the text data set includes text data on various subjects, such as finance, fiction, education, real estate and sports.
  • the label set records the label of each piece of text data in the text data set, e.g., text data A is labeled sports and text data B real estate.
  • the part-of-speech tagging first annotates the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template is a recognizer trained on the features of nouns and verbs.
  • the part-of-speech tagging template identifies nouns and verbs from the characteristics of words.
  • words longer than a preset length that contain "的" or "地" and are adjacent to a noun or verb are adjectives or adverbs, as in [愤怒的人们狠狠地厮打可恨的小偷] (angry people fiercely fight the hateful thief).
  • the tagging may take a form that includes tagging symbols, such as [愤怒的 adj 人们 n 狠狠地 adv 厮打 v 可恨的 adj 小偷 n].
  • Step 2: Perform fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set, and perform word vectorization on the word segmentation sequence set to obtain a word-vectorized data set.
  • the fine-grained word segmentation removes the words in the text data that are not tagged as nouns, verbs, adjectives or adverbs, and obtains a word segmentation sequence set based on the tagging symbols.
  • the removed words are called the heteromorphic word set, for example all Latin letters, Arabic numerals, Chinese numerals, punctuation marks and stop words.
  • the stop words include words such as "了" and "于".
  • a classification probability model is established on the word segmentation sequence set, a conditional probability model is constructed from it, and a cumulative summation over the conditional probability model yields a log-likelihood function; maximizing the log-likelihood function solves for the optimal solution, which is the word-vectorized data set.
  • the classification probability model σ is:

    σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

  • where X is the word segmentation sequence set; ω denotes the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, also called feature words; e is the base of the natural logarithm; X_ω^T is the transpose of X_ω; and X_ω is the cumulative sum over the context of ω:

    X_ω = Σ_{i=1}^{c} V(ω_i)

  • where c is the number of items in the word segmentation sequence set and V(ω_i) is the word vector of ω_i, assumed already vectorized here; it is actually obtained later by maximizing the log-likelihood function.
  • the conditional probability model p(ω|V(ω_i)) is:

    p(ω|V(ω_i)) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^(1−d_j^ω) · [1 − σ(X_ω^T θ_{j−1}^ω)]^(d_j^ω)

  • where l_ω is the number of nodes on the Huffman path of ω. Regarding the Huffman coding and the Huffman binary tree: a tree is a non-linear data structure in which data elements (also called nodes) are organized by branch relationships, and a collection of trees is called a forest. A binary tree is an ordered tree in which each node has at most two subtrees, called the left subtree and the right subtree. A binary tree with the smallest path length is called a Huffman binary tree. The word ω is therefore a leaf node, and the weight of each leaf node is expressed through its Huffman code; this application represents words by different arrangements of the digits 0 and 1. d_j^ω denotes the Huffman code of the j-th node on the path p^ω (the root node has no code), d^ω is the code of the word ω, and θ_{j−1}^ω is the vector corresponding to the (j−1)-th non-leaf node on the path p^ω; since the word ω is a leaf node, it has no corresponding vector.
  • the log-likelihood function ζ is:

    ζ = Σ_{ω∈D} Σ_{j=2}^{l_ω} { (1 − d_j^ω) · log σ(X_ω^T θ_{j−1}^ω) + d_j^ω · log[1 − σ(X_ω^T θ_{j−1}^ω)] }

  • where D is the vocabulary, which includes all the nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
  • maximizing the log-likelihood function uses ∂ζ/∂X_ω, the partial derivative of the log-likelihood function with respect to the transposed cumulative-sum vector; V(ω_i) is continuously optimized based on this partial derivative:

    V(ω_i) := V(ω_i) + η · Σ_{j=2}^{l_ω} ∂ζ(ω, j)/∂X_ω

  • where η is the set learning rate; the word-vectorized data set V(ω) is obtained from the above.
  • Step 3: Input the word-vectorized data set and the label set into a classification model for training and obtain a training value.
  • when the training value is less than a preset threshold, the classification model exits training.
  • the classification model of the present application includes a convolutional neural network, an activation function and a loss function.
  • the convolutional neural network includes nineteen convolutional layers, nineteen pooling layers and one fully connected layer.
  • inputting the word-vectorized data set and the label set into the classification model for training, obtaining a training value, and exiting training when the training value is less than a preset threshold includes the following.
  • after the convolutional neural network receives the word-vectorized data set, it passes the data through the nineteen convolutional layers and nineteen pooling layers, performing convolution and max-pooling operations to obtain a dimensionality-reduced data set, which is input to the fully connected layer.
  • the fully connected layer receives the dimensionality-reduced data set and computes a predicted classification set using the activation function; the predicted classification set and the label set are input into the loss function to compute a loss value, and the loss value is compared with the preset threshold; when the loss value is less than the preset threshold, the classification model exits training.
  • ⁇ ' is the output data
  • is the input data
  • k is the size of the convolution kernel
  • s is the stride of the convolution operation
  • p is the data zero-filling matrix.
  • the pooling operation can select the maximum pooling operation, The maximum pooling operation is to select the largest value in the matrix data in the matrix to replace the entire matrix;
  • the activation function is:
  • n is the data size of the prediction classification set
  • y t is the label set
  • ⁇ t is the prediction classification set
  • the preset threshold is generally set at 0.01.
  • Step 4: Receive the text input by the user, perform the word vectorization operation on the text to obtain text word vectors, and input the text word vectors into the classification model, which judges them and outputs the classification result.
  • the text classification program may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete this application.
  • the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions, and is used to describe the execution process of the text classification program in the intelligent text classification device.
  • FIG. 3 is a schematic diagram of the program modules of the text classification program in an embodiment of the intelligent text classification device of this application.
  • the text classification program can be divided into a part-of-speech tagging module 10, a word vectorization conversion module 20, a model training module 30 and a text classification result output module 40. Exemplarily:
  • the part-of-speech tagging module 10 is configured to receive text data and a label set, and perform part-of-speech tagging on the text data.
  • the word vectorization conversion module 20 is configured to perform fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set, and perform word vectorization on the word segmentation sequence set to obtain a word-vectorized data set.
  • the model training module 30 is configured to input the word-vectorized data set and the label set into a classification model for training and obtain a training value; when the training value is less than a preset threshold, the classification model exits training.
  • the text classification result output module 40 is configured to receive text input by a user, perform the word vectorization operation on the text to obtain text word vectors, and input the text word vectors into the classification model, which judges them and outputs the classification result.
  • the functions or operation steps implemented when the part-of-speech tagging module 10, the word vectorization conversion module 20, the model training module 30 and the text classification result output module 40 are executed are substantially the same as those of the foregoing embodiment, and are not repeated here.
  • an embodiment of the present application also proposes a computer-readable storage medium having a text classification program stored thereon; the text classification program can be executed by one or more processors to implement the operations of the intelligent text classification method described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to artificial intelligence technology. Disclosed is an intelligent text classification method, comprising: receiving text data and a tag set, and performing part-of-speech tagging on the text data; performing fine-grained word segmentation on the text data according to the part-of-speech tagging in order to obtain a word segmentation sequence set, and performing word vectorization processing on the word segmentation sequence set in order to obtain a word vectorization data set; inputting the word vectorization data set and the tag set into a classification model for training and obtaining a training value, wherein the classification model quits the training when the training value is less than a preset threshold value; and receiving a text input by a user, performing the word vectorization operation on the text to obtain text word vectors, and inputting the text word vectors into the classification model for determination, and outputting a classification result. Further provided are an intelligent text classification apparatus and a computer-readable storage medium. The present application can realize an accurate text classification function.

Description

Intelligent text classification method, device and computer-readable storage medium
This application claims priority under the Paris Convention to the Chinese patent application with application number CN201910540265.3, filed on June 20, 2019 and titled "智能文本分类方法、装置及计算机可读存储介质" (Intelligent text classification method, device and computer-readable storage medium); the entire content of that Chinese patent application is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to an intelligent text classification method and device and a computer-readable storage medium.
Background
Text classification is a very important part of text processing, and its applications are extensive: spam filtering, news classification, part-of-speech tagging, and so on. Current approaches to classifying texts by content usually rely on labeled keywords. Such a classification method ignores discourse-level information in the text, and because it takes no account of part of speech, the resulting classification is neither comprehensive nor fine-grained, leading to low accuracy.
Summary of the invention
This application provides an intelligent text classification method, device and computer-readable storage medium whose main purpose is to provide users with accurate classification results when they input text.
To achieve the above purpose, the intelligent text classification method provided by this application includes:
receiving text data and a label set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word-vectorized data set;
inputting the word-vectorized data set and the label set into a classification model for training and obtaining a training value, where the classification model exits training when the training value is less than a preset threshold;
receiving text input by the user, performing the word vectorization operation on the text to obtain text word vectors, and inputting the text word vectors into the classification model, which judges them and outputs a classification result.
In addition, to achieve the above purpose, this application also provides an intelligent text classification device, which includes a memory and a processor. The memory stores a text classification program that can run on the processor, and when the text classification program is executed by the processor, the following steps are implemented:
receiving text data and a label set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word-vectorized data set;
inputting the word-vectorized data set and the label set into a classification model for training and obtaining a training value, where the classification model exits training when the training value is less than a preset threshold;
receiving text input by the user, performing the word vectorization operation on the text to obtain text word vectors, and inputting the text word vectors into the classification model, which judges them and outputs a classification result.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium having a text classification program stored thereon; the text classification program can be executed by one or more processors to implement the steps of the intelligent text classification method described above.
The intelligent text classification method, device and computer-readable storage medium proposed in this application perform part-of-speech tagging based on text content, which effectively converts text data into part-of-speech data; word vectorization further conveys the features of the text data, without loss, to a computer for analysis; and repeated training of the classification model effectively improves the robustness and accuracy of text classification. This application can therefore provide users with accurate classification results.
Brief description of the drawings
FIG. 1 is a schematic flowchart of an intelligent text classification method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the internal structure of an intelligent text classification device provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the modules of the text classification program in an intelligent text classification device provided by an embodiment of this application.
The realization of the purpose of this application and its functional characteristics and advantages will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides an intelligent text classification method. Referring to FIG. 1, it is a schematic flowchart of an intelligent text classification method provided by an embodiment of this application. The method may be executed by a device, and the device may be implemented by software and/or hardware.
In this embodiment, the intelligent text classification method includes:
S1. Receive text data and a label set, and perform part-of-speech tagging on the text data.
Preferably, the text data set includes text data on various subjects, such as finance, fiction, education, real estate and sports; the label set records the label of each piece of text data in the text data set, e.g., text data A is labeled sports and text data B real estate.
In a preferred embodiment of this application, the part-of-speech tagging first annotates the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template is a recognizer trained on the features of nouns and verbs; it identifies nouns and verbs from the characteristics of words. For example, for [我特别的喜欢吃苹果] (I especially like eating apples), [打篮球有益于健身] (playing basketball is good for fitness) and [敌人在最后的时间屈服了] (the enemy yielded at the last moment), the template marks [我 苹果], [篮球 健身] and [敌人 时间] as nouns, and [喜欢 吃], [打 有益] and [屈服] as verbs.
Next, words longer than a preset length, e.g. two characters, that contain "的" or "地" are searched for in the text data, and it is judged whether the words before and after them in the text data are nouns or verbs. If so, the word longer than the preset length containing "的" or "地" is an adjective or adverb. For example, in [愤怒的人们狠狠地厮打可恨的小偷] (angry people fiercely fight the hateful thief), [人们], [厮打] and [小偷] are first identified from the part-of-speech tagging template, and the words longer than two characters containing "的" or "地" are identified as [愤怒的], [狠狠地] and [可恨的]; since each is adjacent to a noun or verb such as [人们], [厮打] or [小偷], they are adjectives or adverbs and are tagged accordingly. Preferably, the tagging may take a form that includes tagging symbols, such as [愤怒的 adj 人们 n 狠狠地 adv 厮打 v 可恨的 adj 小偷 n].
S2. Perform fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set, and perform word vectorization on the word segmentation sequence set to obtain a word-vectorized data set.
In a preferred embodiment of this application, the fine-grained word segmentation removes the words in the text data that are not tagged as nouns, verbs, adjectives or adverbs, and obtains the word segmentation sequence set based on the tagging symbols. Preferably, the removed words are called the heteromorphic word set, for example all Latin letters, Arabic numerals, Chinese numerals, punctuation marks and stop words, where the stop words include words such as "了" and "于". For example, [一个磅礴大雨的 adj 上午 n, 大雨 n 都把土地 n 冲湿 v 了, 变成 v 了湿乎乎的 adj 泥 n] becomes [磅礴大雨的 adj 上午 n 大雨 n 土地 n 冲湿 v 变成 v 湿乎乎的 adj 泥 n] after the fine-grained word segmentation, and the word segmentation sequence set [磅礴大雨的 上午 大雨 土地 冲湿 变成 湿乎乎的 泥] is then obtained based on the tagging symbols.
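The filtering step can likewise be sketched; the stop-word list below is an illustrative subset rather than the application's actual list, and the tagged-token format follows the sketch above:

```python
import re

STOPWORDS = {"了", "于", "一个", "都", "把"}  # illustrative subset only

def fine_grained_filter(tagged_tokens):
    """Keep only tokens tagged n/v/adj/adv; drop the 'heteromorphic'
    words (Latin letters, digits, Chinese numerals, stop words) and
    return the remaining word sequence in order."""
    kept = []
    for tok, tag in tagged_tokens:
        if tag not in ("n", "v", "adj", "adv"):
            continue  # untagged tokens (punctuation, particles, ...) are removed
        if tok in STOPWORDS or re.fullmatch(r"[A-Za-z0-9零一二三四五六七八九十百千万]+", tok):
            continue
        kept.append(tok)
    return kept

tagged = [("一个", None), ("磅礴大雨的", "adj"), ("上午", "n"), ("大雨", "n"),
          ("都", None), ("把", None), ("土地", "n"), ("冲湿", "v"),
          ("了", None), ("泥", "n")]
print(fine_grained_filter(tagged))
# ['磅礴大雨的', '上午', '大雨', '土地', '冲湿', '泥']
```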
Further, a classification probability model is established on the word segmentation sequence set, a conditional probability model is constructed from the classification probability model, and a cumulative summation over the conditional probability model yields a log-likelihood function; maximizing the log-likelihood function solves for the optimal solution, which is the word-vectorized data set.
Preferably, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

where X is the word segmentation sequence set; ω denotes the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, also called feature words; e is the base of the natural logarithm; X_ω^T is the transpose of X_ω; and X_ω is the cumulative sum over the context of ω:

X_ω = Σ_{i=1}^{c} V(ω_i)
where c is the number of items in the word segmentation sequence set and V(ω_i) is the word vector of ω_i, assumed already vectorized here; it is actually obtained later by maximizing the log-likelihood function.
The conditional probability model p(ω|V(ω_i)) is:

p(ω|V(ω_i)) = ∏_{j=2}^{l_ω} [σ(X_ω^T θ_{j−1}^ω)]^(1−d_j^ω) · [1 − σ(X_ω^T θ_{j−1}^ω)]^(d_j^ω)
where l_ω is the number of nodes on the Huffman path of ω. Regarding the Huffman coding and the Huffman binary tree: a tree is a non-linear data structure in which data elements (also called nodes) are organized by branch relationships, and a collection of trees is called a forest. A binary tree is an ordered tree in which each node has at most two subtrees, called the left subtree and the right subtree. A binary tree with the smallest path length is called a Huffman binary tree. The word ω is therefore a leaf node, and the weight of each leaf node is expressed through its Huffman code; this application represents words by different arrangements of the digits 0 and 1. d_j^ω denotes the Huffman code of the j-th node on the path p^ω (the root node has no code), d^ω is the code of the word ω, and θ_{j−1}^ω is the vector corresponding to the (j−1)-th non-leaf node on the path p^ω; since the word ω is a leaf node, it has no corresponding vector.
Preferably, the log-likelihood function ζ is:

ζ = Σ_{ω∈D} Σ_{j=2}^{l_ω} { (1 − d_j^ω) · log σ(X_ω^T θ_{j−1}^ω) + d_j^ω · log[1 − σ(X_ω^T θ_{j−1}^ω)] }

where D is the vocabulary, which includes all the nouns, verbs, adjectives and adverbs in the word segmentation sequence set.
Further, maximizing the log-likelihood function uses ∂ζ/∂X_ω, the partial derivative of the log-likelihood function with respect to the transposed cumulative-sum vector. V(ω_i) is continuously optimized based on this partial derivative, and the optimization process is:

V(ω_i) := V(ω_i) + η · Σ_{j=2}^{l_ω} ∂ζ(ω, j)/∂X_ω

where η is the set learning rate. The word-vectorized data set V(ω) is obtained from the above.
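This update is the standard CBOW hierarchical-softmax gradient step, so it can be sketched compactly. The sketch assumes V (word vectors) and theta (inner-node vectors) are NumPy arrays and that the Huffman codes and inner-node paths have been precomputed, e.g. with the helper above; initializing V with small random values and theta with zeros is one common choice, not the application's:

```python
import numpy as np

def sgd_step(center, context, V, theta, code, path, eta=0.025):
    """One update of V(w_i) <- V(w_i) + eta * sum_j dζ(w,j)/dX_w.
    V: word vectors; theta: inner-node vectors; code[center]: Huffman
    bits d_j (root excluded); path[center]: inner-node ids of θ_{j-1}."""
    x = sum(V[w] for w in context)        # X_w, the cumulative sum
    grad_x = np.zeros_like(x)
    for d, node in zip(code[center], path[center]):
        sigma = 1.0 / (1.0 + np.exp(-x @ theta[node]))
        g = 1.0 - d - sigma               # dζ(w,j)/d(x·θ)
        grad_x += g * theta[node]         # accumulates dζ/dX_w
        theta[node] += eta * g * x        # inner-node vectors learn too
    for w in context:
        V[w] += eta * grad_x              # V(w_i) <- V(w_i) + η·Σ_j dζ/dX_w

# Illustrative initialization: V = 0.01 * np.random.randn(vocab, dim);
# theta = np.zeros((vocab - 1, dim)); code/path from huffman_codes().
```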
S3. Input the word-vectorized data set and the label set into a classification model for training and obtain a training value; when the training value is less than a preset threshold, the classification model exits training.
Preferably, the classification model of this application includes a convolutional neural network, an activation function and a loss function, where the convolutional neural network includes nineteen convolutional layers, nineteen pooling layers and one fully connected layer.
Inputting the word-vectorized data set and the label set into the classification model for training, obtaining a training value, and exiting training when the training value is less than the preset threshold includes the following.
Preferably, after the convolutional neural network receives the word-vectorized data set, it passes the data through the nineteen convolutional layers and nineteen pooling layers, performing convolution and max-pooling operations to obtain a dimensionality-reduced data set, which is input to the fully connected layer.
Further, the fully connected layer receives the dimensionality-reduced data set and computes a predicted classification set using the activation function; the predicted classification set and the label set are input into the loss function to compute a loss value, and the loss value is compared with the preset threshold; when the loss value is less than the preset threshold, the classification model exits training.
The convolution operation in a preferred embodiment of this application is:

ω′ = (ω − k + 2p)/s + 1

where ω′ is the output data, ω is the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding matrix. The pooling operation may be max pooling, which selects the largest value of the matrix data within a matrix to replace the entire matrix.
The activation function is the softmax function:

μ_t = e^(z_t) / Σ_j e^(z_j)

where μ is the predicted classification set, z is the output of the fully connected layer, and e is the base of the natural logarithm.
The loss value T in a preferred embodiment of this application is:

T = (1/n) · Σ_{t=1}^{n} (y_t − μ_t)²

where n is the data size of the predicted classification set, y_t is the label set, μ_t is the predicted classification set, and the preset threshold is generally set to 0.01.
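To make the architecture concrete, here is a sketch in PyTorch under stated assumptions: the nineteen convolutional layers, nineteen pooling layers and single fully connected layer follow the application, while the channel width of 128, kernel size k = 3, stride s = 1, zero padding p = 1, the ReLU between layers and the guard that stops pooling once the sequence is exhausted are illustrative choices the application does not specify. With k = 3, s = 1, p = 1 the size formula above gives ω′ = (ω − 3 + 2)/1 + 1 = ω, so each convolution preserves length and only the pooling layers reduce it:

```python
import torch
import torch.nn as nn

class PatentTextCNN(nn.Module):
    """Sketch of the described classifier: 19 conv layers, 19 pooling
    layers, 1 fully connected layer. Width/kernel/activation choices
    are illustrative, not taken from the application."""
    def __init__(self, embed_dim=100, num_classes=5, width=128):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim if i == 0 else width, width,
                       kernel_size=3, stride=1, padding=1)   # k=3, s=1, p=1
             for i in range(19)])
        self.pool = nn.MaxPool1d(2)
        self.fc = nn.Linear(width, num_classes)

    def forward(self, x):            # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)        # Conv1d expects (batch, dim, seq)
        for conv in self.convs:
            x = torch.relu(conv(x))
            if x.size(-1) > 1:       # guard: 19 halvings exhaust short texts
                x = self.pool(x)
        x = x.max(dim=-1).values     # max over the remaining positions
        return torch.softmax(self.fc(x), dim=-1)  # predicted class probabilities

model = PatentTextCNN()
probs = model(torch.randn(2, 300, 100))  # two texts of 300 word vectors each
print(probs.shape)                        # torch.Size([2, 5])
```

Under the reconstruction above, training would compare this softmax output with the labels through the squared-error loss until the loss value drops below the 0.01 threshold.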
S4. Receive the text input by the user, perform the word vectorization operation on the text to obtain text word vectors, and input the text word vectors into the classification model, which judges them and outputs the classification result.
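Assembled end to end, step S4 is then a short pipeline. In the sketch below, tag_and_filter() and the word_vec lookup table are hypothetical helpers standing in for steps S1-S2 and the trained word vectors; only the flow is taken from the application:

```python
import numpy as np
import torch

def classify(text, model, word_vec, labels):
    """S4 as a pipeline sketch: vectorize the user's text with the
    trained word vectors, then let the classification model judge it.
    tag_and_filter() and word_vec are hypothetical stand-ins for the
    outputs of steps S1-S2 and the trained vector table."""
    tokens = tag_and_filter(text)                       # hypothetical helper
    vecs = np.stack([word_vec[t] for t in tokens if t in word_vec])
    x = torch.as_tensor(vecs, dtype=torch.float32).unsqueeze(0)  # (1, seq, dim)
    probs = model(x)                                    # e.g. PatentTextCNN above
    idx = int(probs.argmax(dim=-1))
    return labels[idx], float(probs[0, idx])
```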
This application also provides an intelligent text classification device. Referring to FIG. 2, it is a schematic diagram of the internal structure of an intelligent text classification device provided by an embodiment of this application.
In this embodiment, the intelligent text classification device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, tablet computer or portable computer, or a server. The intelligent text classification device 1 at least includes a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 11 may be an internal storage unit of the intelligent text classification device 1, for example its hard disk. In other embodiments, the memory 11 may also be an external storage device of the intelligent text classification device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card equipped on the intelligent text classification device 1. Further, the memory 11 may include both an internal storage unit of the intelligent text classification device 1 and an external storage device. The memory 11 can be used not only to store application software installed in the intelligent text classification device 1 and various kinds of data, such as the code of the text classification program 01, but also to temporarily store data that has been output or will be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the text classification program 01.
The communication bus 13 is used to realize connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc. The display may also be appropriately called a display screen or display unit, and is used to display the information processed in the intelligent text classification device 1 and to display a visualized user interface.
FIG. 2 only shows the intelligent text classification device 1 with components 11-14 and the text classification program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not constitute a limitation on the intelligent text classification device 1; it may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
In the embodiment of the device 1 shown in FIG. 2, the text classification program 01 is stored in the memory 11; when the processor 12 executes the text classification program 01 stored in the memory 11, the following steps are implemented:
Step 1: Receive text data and a label set, and perform part-of-speech tagging on the text data.
Preferably, the text data set includes text data on various subjects, such as finance, fiction, education, real estate and sports; the label set records the label of each piece of text data in the text data set, e.g., text data A is labeled sports and text data B real estate.
In a preferred embodiment of this application, the part-of-speech tagging first annotates the nouns and verbs in the text data according to a preset part-of-speech tagging template, where the part-of-speech tagging template is a recognizer trained on the features of nouns and verbs; it identifies nouns and verbs from the characteristics of words. For example, for [我特别的喜欢吃苹果], [打篮球有益于健身] and [敌人在最后的时间屈服了], the template marks [我 苹果], [篮球 健身] and [敌人 时间] as nouns, and [喜欢 吃], [打 有益] and [屈服] as verbs.
Next, words longer than a preset length, e.g. two characters, that contain "的" or "地" are searched for in the text data, and it is judged whether the words before and after them in the text data are nouns or verbs. If so, the word longer than the preset length containing "的" or "地" is an adjective or adverb. For example, in [愤怒的人们狠狠地厮打可恨的小偷], [人们], [厮打] and [小偷] are first identified from the part-of-speech tagging template, and the words longer than two characters containing "的" or "地" are identified as [愤怒的], [狠狠地] and [可恨的]; since each is adjacent to a noun or verb, they are adjectives or adverbs and are tagged accordingly. Preferably, the tagging may take a form that includes tagging symbols, such as [愤怒的 adj 人们 n 狠狠地 adv 厮打 v 可恨的 adj 小偷 n].
Step 2: Perform fine-grained word segmentation on the text data according to the part-of-speech tagging to obtain a word segmentation sequence set, and perform word vectorization on the word segmentation sequence set to obtain a word-vectorized data set.
In a preferred embodiment of this application, the fine-grained word segmentation removes the words in the text data that are not tagged as nouns, verbs, adjectives or adverbs, and obtains the word segmentation sequence set based on the tagging symbols. Preferably, the removed words are called the heteromorphic word set, for example all Latin letters, Arabic numerals, Chinese numerals, punctuation marks and stop words, where the stop words include words such as "了" and "于". For example, [一个磅礴大雨的 adj 上午 n, 大雨 n 都把土地 n 冲湿 v 了, 变成 v 了湿乎乎的 adj 泥 n] becomes [磅礴大雨的 adj 上午 n 大雨 n 土地 n 冲湿 v 变成 v 湿乎乎的 adj 泥 n] after the fine-grained word segmentation, and the word segmentation sequence set [磅礴大雨的 上午 大雨 土地 冲湿 变成 湿乎乎的 泥] is then obtained based on the tagging symbols.
Further, a classification probability model is established on the word segmentation sequence set, a conditional probability model is constructed from the classification probability model, and a cumulative summation over the conditional probability model yields a log-likelihood function; maximizing the log-likelihood function solves for the optimal solution, which is the word-vectorized data set.
Preferably, the classification probability model σ is:

σ(X_ω^T θ) = 1 / (1 + e^(−X_ω^T θ))

where X is the word segmentation sequence set; ω denotes the nouns, verbs, adjectives and adverbs of the word segmentation sequence set, also called feature words; e is the base of the natural logarithm; X_ω^T is the transpose of X_ω; and X_ω is the cumulative sum over the context of ω:

X_ω = Σ_{i=1}^{c} V(ω_i)
where c is the number of items in the word segmentation sequence set and V(ω_i) is the word vector of ω_i, assumed already vectorized here; it is actually obtained later by maximizing the log-likelihood function.
The conditional probability model $p(\omega \mid V(\omega_i))$ is:

$$p(\omega \mid V(\omega_i)) = \prod_{j=2}^{l_\omega} \left[\sigma(X_\omega^\top \theta_{j-1}^{\omega})\right]^{1-d_j^{\omega}} \cdot \left[1-\sigma(X_\omega^\top \theta_{j-1}^{\omega})\right]^{d_j^{\omega}}$$
where $l_\omega$ is the number of nodes on the Huffman path of ω. Regarding Huffman coding and the Huffman binary tree: a tree is a non-linear data structure in which data elements (also called nodes) are organized according to branching relations, and a collection of trees is called a forest. A binary tree is an ordered tree in which each node has at most two subtrees, called the left subtree and the right subtree. A binary tree whose path length is minimal is called a Huffman binary tree; ω is therefore a leaf node, and the weight of each leaf node is expressed by its Huffman code. This application uses different arrangements of 0/1 codes to represent words. $d_j^{\omega}$ denotes the Huffman code corresponding to the j-th node on the path $p^{\omega}$ (the root node has no code), $d^{\omega}$ is the code of the word ω, and $\theta_{j-1}^{\omega}$ denotes the vector corresponding to the (j-1)-th non-leaf node on the path $p^{\omega}$; because the word ω is a leaf node, it has no corresponding vector.
Preferably, the log-likelihood function ζ is:

$$\zeta = \sum_{\omega \in \mathcal{C}} \log p(\omega \mid V(\omega_i))$$

where $\mathcal{C}$ is the thesaurus, which includes all the nouns, verbs, adjectives, and adverbs in the word segmentation sequence set.
Further, the log-likelihood function is maximized by gradient ascent, where $\frac{\partial \zeta}{\partial X_\omega^\top}$ denotes the partial derivative of the log-likelihood function with respect to the transpose of the cumulative summation. $V(\omega_i)$ is continuously optimized based on this partial derivative; the optimization process is:

$$V(\omega_i) := V(\omega_i) + \eta \, \frac{\partial \zeta}{\partial X_\omega^\top}$$

where η is a set learning rate. Based on the above, the word-vectorized data set V(ω) is obtained.
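Putting the pieces together, one optimization pass can be sketched as below. This follows the standard hierarchical-softmax CBOW update that the formulas above describe and assumes a precomputed Huffman path (codes $d_j^{\omega}$ and non-leaf node vectors $\theta_{j-1}^{\omega}$) for the target word; all names and sizes are assumptions for illustration, not the patent's code.

```python
import numpy as np

def sigmoid(z):
    # σ(z) = 1 / (1 + e^{-z}), the classification probability model above
    return 1.0 / (1.0 + np.exp(-z))

def cbow_hs_step(context_vectors, path_thetas, path_codes, eta=0.025):
    """One gradient-ascent step on the log-likelihood for a single word ω.

    context_vectors: c x dim array of V(ω_i)
    path_thetas:     non-leaf node vectors θ_{j-1}^ω along the path p^ω
    path_codes:      Huffman codes d_j^ω in {0, 1} (root excluded)
    """
    X_w = context_vectors.sum(axis=0)      # X_ω = Σ V(ω_i)
    grad_X = np.zeros_like(X_w)
    for theta, d in zip(path_thetas, path_codes):
        q = sigmoid(X_w @ theta)           # σ(X_ω^T θ_{j-1}^ω)
        g = eta * (1 - d - q)              # η times ∂ζ/∂(X_ω^T θ) for this node
        grad_X += g * theta                # accumulate η ∂ζ/∂X_ω
        theta += g * X_w                   # update the node vector in place
    context_vectors += grad_X              # V(ω_i) := V(ω_i) + η ∂ζ/∂X_ω
    return context_vectors, path_thetas

rng = np.random.default_rng(0)
ctx = rng.standard_normal((4, 8)) * 0.01           # four context vectors V(ω_i)
thetas = [rng.standard_normal(8) * 0.01 for _ in range(3)]
codes = [0, 1, 1]                                  # Huffman codes along p^ω (assumed)
cbow_hs_step(ctx, thetas, codes)
```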
Step 3: Input the word-vectorized data set and the label set into a classification model for training to obtain a training value; when the training value is less than a preset threshold, the classification model exits training.
Preferably, in this application the classification model includes a convolutional neural network, an activation function, and a loss function, where the convolutional neural network includes nineteen convolutional layers, nineteen pooling layers, and one fully connected layer.
Inputting the word-vectorized data set and the label set into the classification model for training to obtain a training value, and exiting training when the training value is less than the preset threshold, includes:
Preferably, after the convolutional neural network receives the word-vectorized data set, it inputs the data set into the nineteen convolutional layers and nineteen pooling layers for convolution and max-pooling operations to obtain a dimensionality-reduced data set, which is then input into the fully connected layer.
Further, the fully connected layer receives the dimensionality-reduced data set and computes a predicted classification set in combination with the activation function; the predicted classification set and the label set are input into the loss function to compute a loss value, and the loss value is compared with the preset threshold until the loss value is less than the preset threshold, at which point the classification model exits training.
In a preferred embodiment of this application, the convolution operation is:

$$\omega' = \frac{\omega - k + 2p}{s} + 1$$

where ω′ is the output data size, ω is the input data size, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding of the data. The pooling operation may be max pooling, which selects the largest value in the matrix data to replace the whole matrix.
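As a quick numeric check of this size relation (the concrete values here are illustrative, not from the source):

```python
def conv_out_len(w, k, s, p):
    # ω' = (ω - k + 2p) / s + 1, the convolution output size used above
    return (w - k + 2 * p) // s + 1

print(conv_out_len(w=50, k=3, s=1, p=1))  # 50: a padded stride-1 convolution keeps length
print(conv_out_len(w=50, k=3, s=2, p=1))  # 25: stride 2 roughly halves it
```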
The activation function is the softmax function:

$$y_j = \frac{e^{x_j}}{\sum_{i} e^{x_i}}$$

where y is the predicted classification set, x denotes the outputs of the fully connected layer, and e is Euler's number (an infinite non-repeating decimal).
In a preferred embodiment of this application, the loss value T is:

$$T = \frac{1}{n} \sum_{t=1}^{n} (y_t - \mu_t)^2$$

where n is the data size of the predicted classification set, $y_t$ is the label set, $\mu_t$ is the predicted classification set, and the preset threshold is generally set to 0.01.
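For intuition, a compact PyTorch sketch of this training-exit logic follows. It is an illustration under stated assumptions, not the patent's implementation: the nineteen conv/pool stages use padded, stride-1 kernels so a short toy sequence survives the depth, ReLU nonlinearities between layers are an addition (the source names only the softmax activation and the loss), and the sizes, optimizer, and data are invented for the demo.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Toy stand-in for the 19-conv / 19-pool / one-FC-layer model."""
    def __init__(self, dim=64, n_classes=4, depth=19):
        super().__init__()
        blocks = []
        for _ in range(depth):
            blocks += [nn.Conv1d(dim, dim, kernel_size=3, padding=1),  # length-preserving
                       nn.ReLU(),                                      # added nonlinearity
                       nn.MaxPool1d(kernel_size=3, stride=1, padding=1)]
        self.features = nn.Sequential(*blocks)
        self.fc = nn.Linear(dim, n_classes)          # the fully connected layer

    def forward(self, x):                            # x: (batch, dim, seq_len)
        h = self.features(x).max(dim=2).values       # global max pool -> (batch, dim)
        return self.fc(h)

model = TextCNN()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
threshold = 0.01                                     # preset threshold from the text

x = torch.randn(32, 64, 50)                         # word-vectorised batch (illustrative)
y = torch.randint(0, 4, (32,))                      # label set (illustrative)
onehot = nn.functional.one_hot(y, 4).float()        # y_t as one-hot targets

for step in range(500):
    optimiser.zero_grad()
    probs = model(x).softmax(dim=1)                 # softmax activation -> μ_t
    loss = ((onehot - probs) ** 2).mean()           # squared-error loss value T
    loss.backward()
    optimiser.step()
    if loss.item() < threshold:                     # training value below threshold:
        break                                       # the classification model exits training
```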
Step 4: Receive text input by a user, perform the word vectorization operation on the text to obtain a text word vector, and input the text word vector into the classification model to determine and output the classification result.
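At inference time this step reduces to a few lines; reusing `model` from the training sketch above, with `vectorise` standing in as an assumed helper for the word vectorization operation and `LABELS` as invented class names:

```python
import torch

LABELS = ["news", "spam", "sports", "finance"]   # illustrative label names

def classify(text, vectorise, model):
    # vectorise(text) is assumed to return a (1, dim, seq_len) tensor of
    # text word vectors; the model then yields the classification result.
    with torch.no_grad():
        probs = model(vectorise(text)).softmax(dim=1)
    return LABELS[int(probs.argmax(dim=1))]
```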
Optionally, in other embodiments, the text classification program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function, used to describe the execution process of the text classification program in the intelligent text classification apparatus.
For example, referring to FIG. 4, a schematic diagram of the program modules of the text classification program in an embodiment of the intelligent text classification apparatus of this application: in this embodiment, the text classification program may be divided into a part-of-speech tagging module 10, a word vectorization conversion module 20, a model training module 30, and a text classification result output module 40. Illustratively:
The part-of-speech tagging module 10 is configured to receive text data and a label set, and perform part-of-speech tagging on the text data.
The word vectorization conversion module 20 is configured to perform fine-grained word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and perform word vectorization on the word segmentation sequence set to obtain a word-vectorized data set.
The model training module 30 is configured to input the word-vectorized data set and the label set into a classification model for training to obtain a training value; when the training value is less than a preset threshold, the classification model exits training.
The text classification result output module 40 is configured to receive text input by a user, perform the word vectorization operation on the text to obtain a text word vector, and input the text word vector into the classification model to determine and output the classification result.
The functions or operation steps implemented when the program modules such as the part-of-speech tagging module 10, the word vectorization conversion module 20, the model training module 30, and the text classification result output module 40 are executed are substantially the same as those in the foregoing embodiments and are not repeated here.
In addition, an embodiment of this application further provides a computer-readable storage medium on which a text classification program is stored, the text classification program being executable by one or more processors to implement the following operations:
receiving text data and a label set, and performing part-of-speech tagging on the text data;
performing fine-grained word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word-vectorized data set;
inputting the word-vectorized data set and the label set into a classification model for training to obtain a training value, and when the training value is less than a preset threshold, the classification model exiting training;
receiving text input by a user, performing the word vectorization operation on the text to obtain a text word vector, and inputting the text word vector into the classification model to determine and output the classification result.
The specific implementation of the computer-readable storage medium of this application is substantially the same as the above embodiments of the intelligent text classification apparatus and method and is not repeated here.
It should be noted that the serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments. The terms "include", "comprise", or any other variant thereof herein are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc), including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. An intelligent text classification method, characterized in that the method comprises:
    receiving text data and a label set, and performing part-of-speech tagging on the text data;
    performing fine-grained word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word-vectorized data set;
    inputting the word-vectorized data set and the label set into a classification model for training to obtain a training value, and when the training value is less than a preset threshold, the classification model exiting training; and
    receiving text input by a user, performing the word vectorization operation on the text to obtain a text word vector, and inputting the text word vector into the classification model to determine and output a classification result.
  2. The intelligent text classification method according to claim 1, characterized in that the part-of-speech tagging comprises:
    tagging the nouns and verbs in the text data according to a preset part-of-speech tagging template;
    searching the text data for words whose length exceeds a preset length and which contain "的" or "地"; and
    determining whether the words before and after such a word in the text data are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than the preset length (e.g., two characters) containing "的" or "地" as an adjective or adverb.
  3. The intelligent text classification method according to claim 1 or 2, characterized in that the word vectorization processing comprises:
    establishing a classification probability model based on the word segmentation sequence set;
    constructing a conditional probability model based on the classification probability model;
    performing a cumulative summation operation on the conditional probability model to obtain a log-likelihood function; and
    maximizing the log-likelihood function to solve for an optimal solution, the optimal solution being the word-vectorized data set.
  4. The intelligent text classification method according to claim 3, characterized in that the classification probability model $\sigma(X_\omega^\top)$ is:

    $$\sigma(X_\omega^\top) = \frac{1}{1 + e^{-X_\omega^\top}}$$

    where X is the word segmentation sequence set, ω denotes the part-of-speech-tagged nouns, verbs, adjectives, and adverbs (also called feature words), e is Euler's number, $X_\omega^\top$ is the transpose of $X_\omega$, and $X_\omega$ is the cumulative summation over ω:

    $$X_\omega = \sum_{i=1}^{c} V(\omega_i)$$

    where c is the number of data items in the word segmentation sequence set and $V(\omega_i)$ is the word-vectorized data set assumed to have already been vectorized.
  5. The intelligent text classification method according to claim 4, characterized in that the classification model includes a convolutional neural network, an activation function, and a loss function, the convolutional neural network including nineteen convolutional layers, nineteen pooling layers, and one fully connected layer; and
    inputting the word-vectorized data set and the label set into the classification model for training to obtain a training value, and exiting training when the training value is less than a preset threshold, comprises:
    after the convolutional neural network receives the word-vectorized data set, inputting the word-vectorized data set into the nineteen convolutional layers and nineteen pooling layers for convolution and max-pooling operations to obtain a dimensionality-reduced data set, and inputting the dimensionality-reduced data set into the fully connected layer; and
    the fully connected layer receiving the dimensionality-reduced data set, computing a predicted classification set in combination with the activation function, inputting the predicted classification set and the label set into the loss function to compute a loss value, and comparing the loss value with the preset threshold until the loss value is less than the preset threshold, at which point the classification model exits training.
  6. The intelligent text classification method according to claim 5, characterized in that the convolution operation is:

    $$\omega' = \frac{\omega - k + 2p}{s} + 1$$

    where ω′ is the output data, ω is the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding of the data.
  7. The intelligent text classification method according to claim 5, characterized in that the loss function is:

    $$T = \frac{1}{n} \sum_{t=1}^{n} (y_t - \mu_t)^2$$

    where T is the loss value, n is the data size of the predicted classification set, $y_t$ is the label set, and $\mu_t$ is the predicted classification set.
  8. An intelligent text classification apparatus, characterized in that the apparatus includes a memory and a processor, the memory storing a text classification program runnable on the processor, and the text classification program, when executed by the processor, implementing the following steps:
    receiving text data and a label set, and performing part-of-speech tagging on the text data;
    performing fine-grained word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word-vectorized data set;
    inputting the word-vectorized data set and the label set into a classification model for training to obtain a training value, and when the training value is less than a preset threshold, the classification model exiting training; and
    receiving text input by a user, performing the word vectorization operation on the text to obtain a text word vector, and inputting the text word vector into the classification model to determine and output a classification result.
  9. The intelligent text classification apparatus according to claim 8, characterized in that the part-of-speech tagging comprises:
    tagging the nouns and verbs in the text data according to a preset part-of-speech tagging template;
    searching the text data for words whose length exceeds a preset length and which contain "的" or "地"; and
    determining whether the words before and after such a word in the text data are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than the preset length (e.g., two characters) containing "的" or "地" as an adjective or adverb.
  10. The intelligent text classification apparatus according to claim 8 or 9, characterized in that the word vectorization processing comprises:
    establishing a classification probability model based on the word segmentation sequence set;
    constructing a conditional probability model based on the classification probability model;
    performing a cumulative summation operation on the conditional probability model to obtain a log-likelihood function; and
    maximizing the log-likelihood function to solve for an optimal solution, the optimal solution being the word-vectorized data set.
  11. The intelligent text classification apparatus according to claim 10, characterized in that the classification probability model $\sigma(X_\omega^\top)$ is:

    $$\sigma(X_\omega^\top) = \frac{1}{1 + e^{-X_\omega^\top}}$$

    where X is the word segmentation sequence set, ω denotes the part-of-speech-tagged nouns, verbs, adjectives, and adverbs (also called feature words), e is Euler's number, $X_\omega^\top$ is the transpose of $X_\omega$, and $X_\omega$ is the cumulative summation over ω:

    $$X_\omega = \sum_{i=1}^{c} V(\omega_i)$$

    where c is the number of data items in the word segmentation sequence set and $V(\omega_i)$ is the word-vectorized data set assumed to have already been vectorized.
  12. The intelligent text classification apparatus according to claim 11, characterized in that the classification model includes a convolutional neural network, an activation function, and a loss function, the convolutional neural network including nineteen convolutional layers, nineteen pooling layers, and one fully connected layer; and
    inputting the word-vectorized data set and the label set into the classification model for training to obtain a training value, and exiting training when the training value is less than a preset threshold, comprises:
    after the convolutional neural network receives the word-vectorized data set, inputting the word-vectorized data set into the nineteen convolutional layers and nineteen pooling layers for convolution and max-pooling operations to obtain a dimensionality-reduced data set, and inputting the dimensionality-reduced data set into the fully connected layer; and
    the fully connected layer receiving the dimensionality-reduced data set, computing a predicted classification set in combination with the activation function, inputting the predicted classification set and the label set into the loss function to compute a loss value, and comparing the loss value with the preset threshold until the loss value is less than the preset threshold, at which point the classification model exits training.
  13. The intelligent text classification apparatus according to claim 12, characterized in that the convolution operation is:

    $$\omega' = \frac{\omega - k + 2p}{s} + 1$$

    where ω′ is the output data, ω is the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding of the data.
  14. The intelligent text classification apparatus according to claim 12, characterized in that the loss function is:

    $$T = \frac{1}{n} \sum_{t=1}^{n} (y_t - \mu_t)^2$$

    where T is the loss value, n is the data size of the predicted classification set, $y_t$ is the label set, and $\mu_t$ is the predicted classification set.
  15. A computer-readable storage medium, characterized in that a text classification program is stored on the computer-readable storage medium, the text classification program being executable by one or more processors to implement the following steps:
    receiving text data and a label set, and performing part-of-speech tagging on the text data;
    performing fine-grained word segmentation on the text data according to the part-of-speech tags to obtain a word segmentation sequence set, and performing word vectorization on the word segmentation sequence set to obtain a word-vectorized data set;
    inputting the word-vectorized data set and the label set into a classification model for training to obtain a training value, and when the training value is less than a preset threshold, the classification model exiting training; and
    receiving text input by a user, performing the word vectorization operation on the text to obtain a text word vector, and inputting the text word vector into the classification model to determine and output a classification result.
  16. The computer-readable storage medium according to claim 15, characterized in that the part-of-speech tagging comprises:
    tagging the nouns and verbs in the text data according to a preset part-of-speech tagging template;
    searching the text data for words whose length exceeds a preset length and which contain "的" or "地"; and
    determining whether the words before and after such a word in the text data are nouns or verbs, and if the preceding and following words are nouns or verbs, tagging the word longer than the preset length (e.g., two characters) containing "的" or "地" as an adjective or adverb.
  17. The computer-readable storage medium according to claim 15 or 16, characterized in that the word vectorization processing comprises:
    establishing a classification probability model based on the word segmentation sequence set;
    constructing a conditional probability model based on the classification probability model;
    performing a cumulative summation operation on the conditional probability model to obtain a log-likelihood function; and
    maximizing the log-likelihood function to solve for an optimal solution, the optimal solution being the word-vectorized data set.
  18. The computer-readable storage medium according to claim 17, characterized in that the classification probability model $\sigma(X_\omega^\top)$ is:

    $$\sigma(X_\omega^\top) = \frac{1}{1 + e^{-X_\omega^\top}}$$

    where X is the word segmentation sequence set, ω denotes the part-of-speech-tagged nouns, verbs, adjectives, and adverbs (also called feature words), e is Euler's number, $X_\omega^\top$ is the transpose of $X_\omega$, and $X_\omega$ is the cumulative summation over ω:

    $$X_\omega = \sum_{i=1}^{c} V(\omega_i)$$

    where c is the number of data items in the word segmentation sequence set and $V(\omega_i)$ is the word-vectorized data set assumed to have already been vectorized.
  19. The computer-readable storage medium according to claim 18, characterized in that the classification model includes a convolutional neural network, an activation function, and a loss function, the convolutional neural network including nineteen convolutional layers, nineteen pooling layers, and one fully connected layer; and
    inputting the word-vectorized data set and the label set into the classification model for training to obtain a training value, and exiting training when the training value is less than a preset threshold, comprises:
    after the convolutional neural network receives the word-vectorized data set, inputting the word-vectorized data set into the nineteen convolutional layers and nineteen pooling layers for convolution and max-pooling operations to obtain a dimensionality-reduced data set, and inputting the dimensionality-reduced data set into the fully connected layer; and
    the fully connected layer receiving the dimensionality-reduced data set, computing a predicted classification set in combination with the activation function, inputting the predicted classification set and the label set into the loss function to compute a loss value, and comparing the loss value with the preset threshold until the loss value is less than the preset threshold, at which point the classification model exits training.
  20. The computer-readable storage medium according to claim 19, characterized in that the convolution operation is:

    $$\omega' = \frac{\omega - k + 2p}{s} + 1$$

    where ω′ is the output data, ω is the input data, k is the size of the convolution kernel, s is the stride of the convolution operation, and p is the zero-padding of the data.
PCT/CN2019/117341 2019-06-20 2019-11-12 Intelligent text classification method and apparatus, and computer-readable storage medium WO2020253043A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910540265.3A CN110413773B (en) 2019-06-20 2019-06-20 Intelligent text classification method, device and computer readable storage medium
CN201910540265.3 2019-06-20

Publications (1)

Publication Number Publication Date
WO2020253043A1 true WO2020253043A1 (en) 2020-12-24

Family

ID=68359559

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117341 WO2020253043A1 (en) 2019-06-20 2019-11-12 Intelligent text classification method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110413773B (en)
WO (1) WO2020253043A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics
CN111275091B (en) * 2020-01-16 2024-05-10 平安科技(深圳)有限公司 Text conclusion intelligent recommendation method and device and computer readable storage medium
CN111339300B (en) * 2020-02-28 2023-08-22 中国工商银行股份有限公司 Text classification method and device
CN112434153A (en) * 2020-12-16 2021-03-02 中国计量大学上虞高等研究院有限公司 Junk information filtering method based on ELMo and convolutional neural network


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161067A1 (en) * 2009-12-29 2011-06-30 Dynavox Systems, Llc System and method of using pos tagging for symbol assignment
CN103207855B (en) * 2013-04-12 2019-04-26 广东工业大学 For the fine granularity sentiment analysis system and method for product review information
CN107085581B (en) * 2016-02-16 2020-04-07 腾讯科技(深圳)有限公司 Short text classification method and device
CN107180023B (en) * 2016-03-11 2022-01-04 科大讯飞股份有限公司 Text classification method and system
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108763539B (en) * 2018-05-31 2020-11-10 华中科技大学 Text classification method and system based on part-of-speech classification
CN109086267B (en) * 2018-07-11 2022-07-26 南京邮电大学 Chinese word segmentation method based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN109471933A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A kind of generation method of text snippet, storage medium and server
CN110413773A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883191A (en) * 2021-02-05 2021-06-01 山东麦港数据系统有限公司 Agricultural entity automatic identification classification method and device
CN113342981A (en) * 2021-06-30 2021-09-03 中国工商银行股份有限公司 Demand document classification method and device based on machine learning
CN116912845A (en) * 2023-06-16 2023-10-20 广东电网有限责任公司佛山供电局 Intelligent content identification and analysis method and device based on NLP and AI
CN116912845B (en) * 2023-06-16 2024-03-19 广东电网有限责任公司佛山供电局 Intelligent content identification and analysis method and device based on NLP and AI

Also Published As

Publication number Publication date
CN110413773A (en) 2019-11-05
CN110413773B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
WO2020253043A1 (en) Intelligent text classification method and apparatus, and computer-readable storage medium
WO2021068339A1 (en) Text classification method and device, and computer readable storage medium
CN110347835B (en) Text clustering method, electronic device and storage medium
WO2020237856A1 (en) Smart question and answer method and apparatus based on knowledge graph, and computer storage medium
US11893345B2 (en) Inducing rich interaction structures between words for document-level event argument extraction
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
WO2021068329A1 (en) Chinese named-entity recognition method, device, and computer-readable storage medium
CN107180023B (en) Text classification method and system
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
CN108460011B (en) Entity concept labeling method and system
WO2020253042A1 (en) Intelligent sentiment judgment method and device, and computer readable storage medium
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
WO2020252919A1 (en) Resume identification method and apparatus, and computer device and storage medium
WO2021151271A1 (en) Method and apparatus for textual question answering based on named entities, and device and storage medium
CN112101031B (en) Entity identification method, terminal equipment and storage medium
WO2020258481A1 (en) Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium
WO2021000391A1 (en) Text intelligent cleaning method and device, and computer-readable storage medium
CN113722483B (en) Topic classification method, device, equipment and storage medium
Guan et al. Tag-based Weakly-supervised Hashing for Image Retrieval.
WO2021068565A1 (en) Table intelligent query method and apparatus, electronic device and computer readable storage medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
WO2020248366A1 (en) Text intention intelligent classification method and device, and computer-readable storage medium
CN113051356A (en) Open relationship extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19933333

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19933333

Country of ref document: EP

Kind code of ref document: A1