
Text classification method, device, computer equipment and medium

Info

Publication number
CN112445914A
Authority
CN
China
Prior art keywords
text
classified
item set
word vector
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011389826.3A
Other languages
Chinese (zh)
Inventor
赵婧
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011389826.3A priority Critical patent/CN112445914A/en
Publication of CN112445914A publication Critical patent/CN112445914A/en
Priority to PCT/CN2021/084218 priority patent/WO2022116444A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F 16/35: Clustering; Classification
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The application provides a text classification method, apparatus, computer device, and medium, which perform classification prediction according to the word vector matrix of a high-utility item set and improve the accuracy of text classification. The text classification method includes: acquiring a text to be classified, and performing item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases; vectorizing each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified; and inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified. In addition, the application also relates to blockchain technology, and the text to be classified can be stored in a blockchain.

Description

Text classification method, device, computer equipment and medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a text classification method, apparatus, computer device, and medium.
Background
With the rapid development of the internet and the arrival of the big data era, text classification has become a hot research problem in the field of natural language processing.
Existing text classification methods generally predict text categories with a deep learning algorithm. In the prediction process, the deep learning algorithm depends on the selected text features: the text is converted into word vectors, and the word vectors are used to determine the distance relations between the text features. However, the deep learning algorithm cannot eliminate the interference of synonyms with text classification, which reduces the accuracy of the classification.
Therefore, how to improve the accuracy of text classification becomes an urgent problem to be solved.
Disclosure of Invention
The application provides a text classification method, a text classification apparatus, a computer device, and a medium. A high-utility item set containing a plurality of strongly associated phrases is obtained by performing item set mining on the text to be classified, so that classification prediction can be performed according to the word vector matrix of the high-utility item set, improving the accuracy of text classification.
In a first aspect, the present application provides a text classification method, including:
acquiring a text to be classified, and performing item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases;
vectorizing each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified;
and inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
In a second aspect, the present application further provides a text classification apparatus, including:
an item set mining module, configured to acquire a text to be classified and perform item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases;
a vectorization module, configured to vectorize each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified;
and a classification prediction module, configured to input the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
In a third aspect, the present application further provides a computer device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to execute the computer program and to implement the text classification method as described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the text classification method as described above.
The application discloses a text classification method, apparatus, computer device, and medium. By performing item set mining on the text to be classified, a high-utility item set comprising a plurality of strongly associated phrases is obtained for the text to be classified, and subsequent classification based on this item set overcomes the interference of synonyms with text classification. Each phrase in the high-utility item set is vectorized to obtain a word vector matrix corresponding to the text to be classified, and the word vector matrix is input into the text classification model for classification prediction, improving the prediction accuracy of the text category.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram of a text classification method provided by an embodiment of the present application;
FIG. 2 is a diagram illustrating a prediction process for text classification according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram of sub-steps of item set mining for text to be classified provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of sub-steps of determining utility values for a set of items provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart diagram of sub-steps of classification prediction according to a word vector matrix provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart diagram illustrating sub-steps of a training process of a text classification model provided by an embodiment of the present application;
fig. 7 is a schematic block diagram of a text classification apparatus provided in an embodiment of the present application;
fig. 8 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a text classification method, apparatus, computer device, and medium. The text classification method can be applied to a server or a terminal: a high-utility item set containing a plurality of strongly associated phrases is obtained by performing item set mining on the text to be classified, so that classification prediction can be performed according to the word vector matrix of the high-utility item set, improving the accuracy of text classification.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a smart phone, a tablet computer, a notebook computer, a desktop computer and the like.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
As shown in fig. 1, the text classification method includes steps S10 through S30.
Step S10, acquiring a text to be classified, and performing item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases.
For example, the text to be classified may be a text file uploaded to the server or the terminal by the user, a text file stored in a local disk of the server or the terminal, or a text file stored in a node of the blockchain.
In some embodiments, a text selection operation of a user on a text file may be received, and the selected text file is determined as a text to be classified according to the text selection operation.
It should be noted that item set mining refers to mining groups of strongly associated phrases in the text to be classified as high-utility item sets, wherein a high-utility item set comprises at least two phrases.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a text classification prediction process according to an embodiment of the present disclosure. As shown in fig. 2, item set mining is first performed on the text to be classified to obtain the corresponding high-utility item set; each phrase in the high-utility item set is then vectorized to obtain the word vector matrix corresponding to the text to be classified; finally, the word vector matrix is input into the text classification model for classification prediction, obtaining the text category corresponding to the text to be classified.
Referring to fig. 3, fig. 3 is a schematic flowchart of the sub-steps of performing item set mining on the text to be classified in step S10 to obtain the high-utility item set corresponding to the text to be classified, which includes steps S101 to S103.
Step S101, performing word segmentation processing on the text to be classified to obtain a plurality of word groups corresponding to the text to be classified.
For example, the text to be classified may include at least one sentence. It can be understood that the word segmentation processing of the text to be classified refers to the word segmentation of each sentence in the text to be classified.
In some embodiments, performing word segmentation on the text to be classified to obtain a plurality of word groups corresponding to the text to be classified may include: and based on a preset word segmentation library, performing word segmentation processing on each sentence in the text to be classified to obtain a plurality of word groups corresponding to the text to be classified.
For example, the preset word segmentation library may be the jieba library. It should be noted that the jieba library can analyze the association probabilities between Chinese characters to form Chinese phrases, and can also segment according to user-defined phrases. Illustratively, the jieba library provides a precise mode, a full mode, and a search engine mode, each implemented by a different function: the precise mode by the lcut(s) function, the full mode by the lcut(s, cut_all=True) function, and the search engine mode by the lcut_for_search(s) function.
In the embodiment of the application, each sentence in the text to be classified can be subjected to word segmentation through the jieba library, so that a plurality of phrases corresponding to the text to be classified are obtained.
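For reference, a minimal sketch of the three jieba modes named above; it assumes the open-source jieba package is installed, and the sample sentence is illustrative:

```python
import jieba  # pip install jieba

sentence = "文本分类是自然语言处理的研究热点"

print(jieba.lcut(sentence))                 # precise mode: lcut(s)
print(jieba.lcut(sentence, cut_all=True))   # full mode: lcut(s, cut_all=True)
print(jieba.lcut_for_search(sentence))      # search engine mode: lcut_for_search(s)
```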
In some embodiments, after performing word segmentation on the text to be classified to obtain a plurality of phrases corresponding to the text to be classified, the method may further include: filtering the plurality of phrases based on a preset stop word library to obtain a plurality of filtered phrases.
For example, the preset stop word library may be created in advance and stored in a local disk or a database. It can be understood that the stop word library is used to filter out low-value words in a text or sentence, i.e. words that occur frequently but contribute little to the text or sentence. For example, low-value words may include, but are not limited to, "some," "all," "one aspect," "general," "in accordance with," "such as," "has," "thereby," and the like.
In the embodiment of the application, after the plurality of phrases corresponding to the text to be classified are obtained, the stop word library may be invoked to filter them, obtaining the filtered phrases; illustratively, low-value phrases among the plurality of phrases are deleted by means of the stop word library.
By filtering the segmented phrases of the text to be classified based on the preset stop word library, low-value phrases can be deleted, avoiding their influence on the prediction of the text category.
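A minimal sketch of such filtering follows; the stop-word entries are illustrative stand-ins for the preset stop word library, which in practice would be loaded from a local disk or database:

```python
# Illustrative stop words; a real stop word library would be far larger.
stop_words = {"的", "了", "一个", "某些"}

def filter_phrases(phrases, stop_words):
    """Delete low-value phrases that appear in the stop word library."""
    return [p for p in phrases if p not in stop_words]

print(filter_phrases(["文本", "的", "分类", "了"], stop_words))  # ['文本', '分类']
```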
And S102, combining the phrases to obtain a plurality of item sets corresponding to the text to be classified.
Illustratively, at least two of the plurality of phrases may be combined. For example, for the phrases A, B, C, and D, the resulting item sets may include (AB), (AC), (AD), (BC), (BD), (CD), (ABC), (ABD), (ACD), (BCD), and (ABCD).
By combining the plurality of phrases of the text to be classified, a plurality of item sets each comprising at least two phrases can be obtained, and the high-utility item sets can subsequently be determined according to the utility values corresponding to the item sets.
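For concreteness, a minimal sketch of this combination step using Python's standard itertools; the four phrases mirror the phrase A, B, C, D example above:

```python
from itertools import combinations

def build_item_sets(phrases):
    """Enumerate every item set of at least two distinct phrases."""
    items = sorted(set(phrases))
    return [frozenset(c)
            for r in range(2, len(items) + 1)
            for c in combinations(items, r)]

item_sets = build_item_sets(["A", "B", "C", "D"])
print(len(item_sets))  # 11 item sets: (AB), (AC), ..., (ABCD)
```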
Step S103, determining the utility value of each item set with respect to the text to be classified, and determining the item sets whose utility values are not less than a preset utility threshold as the high-utility item sets corresponding to the text to be classified.
It should be noted that the utility value indicates how often an item set appears in the text to be classified: the more frequently the item set appears, the larger the corresponding utility value.
Referring to fig. 4, fig. 4 is a schematic flowchart of the sub-step of determining the utility value of the text to be classified corresponding to each item set in step S103, and specifically may include the following steps S1031 to S1033.
Step S1031, determining the number of times of occurrence of each phrase in each sentence of the text to be classified as a first utility value of each phrase corresponding to each sentence.
Illustratively, the text to be classified may be represented as a set of sentences, i.e. D = {T1, T2, …, Tn}, where D denotes the text to be classified, T denotes a sentence, and n denotes the number of sentences. Each sentence may contain k phrases, i.e. Td = {i1, i2, …, ik}, with 1 ≤ d ≤ n.
Illustratively, the first utility value of each phrase for each sentence may be denoted U(ik, Td), defined as the number of times the phrase ik appears in the sentence Td, i.e. U(ik, Td) = Count(ik, Td), where Count(ik, Td) denotes the number of occurrences of ik in Td.
For example, for a phrase A, if the number of times phrase A appears in a certain sentence is 1, the first utility value corresponding to phrase A may be determined to be 1, denoted (A, 1).
Step S1032, determining, for each sentence, the sum of the first utility values of the phrases in each item set as the second utility value of that item set for the sentence.
Illustratively, the second utility value of each item set for each sentence may be denoted U(X, Td), defined as the sum of the first utility values U(ik, Td) of the phrases ik in the item set X for the sentence Td, i.e.
U(X, Td) = Σ U(ik, Td), where the sum runs over all phrases ik in the item set X.
Step S1033, determining the sum of the second utility values of each item set over the sentences as the utility value of that item set with respect to the text to be classified.
For example, the utility value of each item set with respect to the text to be classified may be denoted U(X, D), defined as the sum of the second utility values of the item set X for the sentences Td, i.e.
U(X, D) = Σ U(X, Td), where the sum runs over all sentences Td in the text to be classified D.
In the embodiment of the present application, the computation of the utility value of an item set with respect to the text to be classified is explained taking Table 1 as an example.
TABLE 1

Number | Sentence | First utility values
T1     | CACECE   | (A,1), (C,3), (E,2)
T2     | ABAFEF   | (A,2), (B,1), (E,1), (F,2)
T3     | DBDFD    | (B,1), (D,3), (F,1)
T4     | BDCDBE   | (B,2), (C,1), (D,2), (E,1)
In Table 1, the text to be classified includes four sentences T1, T2, T3, and T4. In sentence T1, the first utility value corresponding to phrase A is U({A}, T1) = 1, and the first utility value corresponding to phrase C is U({C}, T1) = 3.
Illustratively, for the item set AC, the second utility value in sentence T1 is U({AC}, T1) = 4, and the utility value of the item set AC in the text to be classified is U({AC}) = 4.
For example, for the item set AE, the second utility value in sentence T1 is U({AE}, T1) = 3 and the second utility value in sentence T2 is U({AE}, T2) = 3, so the utility value of the item set in the text to be classified may be determined to be U({AE}) = 6.
In some embodiments, after the utility value of each item set with respect to the text to be classified is determined, the item sets whose utility values are not less than a preset utility threshold are determined as the high-utility item sets corresponding to the text to be classified.
It can be understood that a large utility value indicates that the phrases in the item set appear together in the text to be classified many times; when the utility value corresponding to an item set is not less than the preset utility threshold, the phrases in the item set are regarded as strongly associated words.
For example, the preset utility threshold may be set according to actual conditions, and the specific value is not limited herein. The preset utility threshold may be denoted Q; if the utility value of an item set X satisfies U(X) ≥ Q, the item set X may be determined to be a high-utility item set.
For example, if the utility value of the item set AC is not less than the preset utility threshold Q, the item set AC may be determined to be a high-utility item set; likewise for the item set AE. If the utility value of the item set BC is smaller than the preset utility threshold Q, the item set BC is not taken as a high-utility item set.
By determining the utility value of each item set with respect to the text to be classified, the item sets whose utility values are not less than the preset utility threshold can be screened out, yielding high-utility item sets that each contain a plurality of strongly associated phrases. When classification prediction is subsequently performed according to the word vector matrix corresponding to a high-utility item set, the interference of synonyms with text classification can be eliminated, improving the prediction accuracy of the text classification.
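As a concrete illustration, the following sketch implements the utility definitions above and reproduces the Table 1 values; the single-letter phrases, the candidate item sets, and the threshold Q = 4 are assumptions chosen to match the worked examples. Note that the worked values (U({AC}) = 4 rather than 5) imply that a sentence contributes to U(X, Td) only when it contains every phrase of the item set X, and the sketch follows that reading:

```python
def first_utility(phrase, sentence):
    """U(ik, Td): number of occurrences of the phrase in the sentence."""
    return sentence.count(phrase)

def second_utility(item_set, sentence):
    """U(X, Td): sum of first utility values, counted only if X occurs in Td."""
    if all(p in sentence for p in item_set):
        return sum(first_utility(p, sentence) for p in item_set)
    return 0

def utility(item_set, text):
    """U(X, D): sum of second utility values over all sentences."""
    return sum(second_utility(item_set, s) for s in text)

D = ["CACECE", "ABAFEF", "DBDFD", "BDCDBE"]      # sentences T1..T4 from Table 1
print(utility({"A", "C"}, D))                    # 4, matching U({AC}) above
print(utility({"A", "E"}, D))                    # 6, matching U({AE}) above

Q = 4                                            # assumed preset utility threshold
candidates = [{"A", "C"}, {"A", "E"}, {"B", "C"}]
high_utility = [X for X in candidates if utility(X, D) >= Q]
print(high_utility)                              # AC and AE pass; BC is excluded
```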
Step S20, vectorizing each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified.
In some embodiments, vectorizing each phrase in the high-utility item set to obtain the word vector matrix corresponding to the text to be classified may include: obtaining a word vector model from a blockchain; and inputting each phrase into the word vector model for vectorization to obtain the word vector matrix corresponding to the text to be classified.
For example, in the embodiment of the present application, the word vector model may be trained in advance to obtain a trained word vector model. It is emphasized that, to further ensure the privacy and security of the trained word vector model, it may also be stored in a node of a blockchain. When each phrase in the high-utility item set is to be vectorized, the word vector model can be retrieved from the blockchain node to vectorize each phrase, obtaining the word vector matrix corresponding to the text to be classified.
In the word vector matrix, each row may represent the word vector corresponding to one phrase.
Illustratively, the word vector model may include, but is not limited to, the Word2Vec (word to vector) model, the GloVe (Global Vectors for Word Representation) model, and the BERT (Bidirectional Encoder Representations from Transformers) model.
By vectorizing each phrase in the high-utility item set, the word vector matrix corresponding to the text to be classified can be obtained, which can subsequently be input into the text classification model for classification prediction.
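As an illustrative sketch only (the toy corpus, vector size, and item set are assumptions; in the embodiment the trained word vector model would instead be fetched from a blockchain node), gensim's Word2Vec implementation can vectorize the phrases of a mined item set into a word vector matrix:

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["文本", "分类", "模型"], ["文本", "向量", "矩阵"]]   # toy training corpus
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1)

item_set = ["文本", "分类"]                          # a mined high-utility item set
matrix = np.stack([model.wv[p] for p in item_set])   # one word vector per row
print(matrix.shape)                                  # (2, 100)
```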
Step S30, inputting the word vector matrix into the text classification model for classification prediction to obtain the text category corresponding to the text to be classified.
Illustratively, the text classification model is a trained text classification model. The text classification model may include, but is not limited to, a Convolutional Neural Network (CNN), a Hierarchical Attention Network (HAN), and/or a Recurrent Neural Network (RNN).
The word vector matrix is input into the trained text classification model for classification prediction, so that the prediction accuracy of the text category corresponding to the text to be classified can be improved.
In the embodiment of the present application, a text classification model is taken as an example of a convolutional neural network, and a prediction process of text classification is described in detail. Illustratively, a convolutional neural network may include convolutional layers, pooling layers, fully-connected layers, and normalization layers.
Referring to fig. 5, fig. 5 is a schematic flowchart of the sub-step of inputting the word vector matrix into the text classification model in step S30 to perform classification prediction to obtain a text category corresponding to the text to be classified, and may specifically include the following steps S301 to S303.
Step S301, inputting the word vector matrix into the convolutional layer for convolution processing to obtain a feature image corresponding to the word vector matrix.
The convolution processing refers to extracting high-level features in the word vector matrix.
For example, a preset convolution filter may be used to perform feature extraction on the word vector matrix to obtain the corresponding feature image. The number of convolution kernels of the convolution filter, the size of each convolution kernel, and the convolution stride may be set according to actual conditions, and the specific values are not limited herein.
Illustratively, a convolution operation is performed on the word vector matrix using n filters with different window sizes, obtaining a feature image s = (y1, y2, …, yn) corresponding to the word vector matrix.
Step S302, inputting the feature image into the pooling layer for pooling to obtain a pooled feature image.
It should be noted that pooling replaces a certain area of the feature image with a single value, such as its maximum or average. If the maximum is used, it is called max pooling (Max-pooling); if the mean value is used, it is called mean pooling (Mean-pooling). Pooling can reduce the image size and provides translation and rotation invariance, because the output value is computed from a region of the image and is not sensitive to translation or rotation. In the embodiment of the present application, max pooling can be used to pool the feature images.
For example, the formula for maximum pooling can be expressed as:
q=max(s)
Illustratively, the feature image s = (y1, y2, …, yn) is input into the pooling layer for pooling, obtaining the pooled feature image.
Step S303, inputting the pooled feature image into the fully connected layer for fully connected processing, and normalizing the result of the fully connected processing through the normalization layer to obtain the text category corresponding to the text to be classified.
It should be noted that the fully connected layer (FC) functions as the "classifier" of the whole convolutional neural network: it connects all the features of the preceding layer and sends the output values to the normalization layer.
For example, the feature vector output by the fully connected layer may be normalized by the normalization layer in the convolutional neural network, and the output is the class probability distribution corresponding to the text to be classified. Illustratively, the normalization layer may output the class probability distribution via a softmax function.
Illustratively, the expression of the softmax function is:
p_j = exp(q_j) / Σ_i exp(q_i)
where q denotes the feature vector output by the fully connected layer, j denotes the j-th element of the feature vector q, and p_j is the probability of the j-th category c.
For example, the class probability distribution includes each category and the class probability corresponding to it.
Categories may include, but are not limited to, insurance, medical, financial, travel, sports, scientific, and agricultural, among others.
In the embodiment of the present application, the category corresponding to the largest class probability may be determined as the text category corresponding to the text to be classified. For example, if the class probability distribution includes the class probabilities corresponding to categories 1 to 4 as 0.20, 0.02, 0.08, and 0.70, the 4th category can be determined as the text category corresponding to the text to be classified.
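The following PyTorch sketch mirrors the prediction path of steps S301 to S303: convolution with n filters of different window sizes, max pooling q = max(s), a fully connected layer, and softmax normalization. The embedding size, filter count, window sizes, and the seven categories are assumptions for illustration; following common PyTorch practice, the softmax normalization layer is applied to the model's logits at prediction time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, embed_dim=100, num_classes=7,
                 num_filters=64, window_sizes=(2, 3, 4)):
        super().__init__()
        # One convolution per window size, sliding over rows of the word vector matrix.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (w, embed_dim)) for w in window_sizes)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, x):                        # x: (batch, num_phrases, embed_dim)
        x = x.unsqueeze(1)                       # add a channel dimension
        # Convolution, then max pooling q = max(s) over each feature image.
        pooled = [F.relu(conv(x)).squeeze(3).max(dim=2).values
                  for conv in self.convs]
        features = torch.cat(pooled, dim=1)      # input to the fully connected layer
        return self.fc(features)                 # logits, one per category

matrix = torch.randn(1, 5, 100)                  # word vector matrix of 5 phrases
probs = F.softmax(TextCNN()(matrix), dim=1)      # normalization layer: class probabilities
print(probs.argmax(dim=1))                       # category with the largest probability
```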
Referring to fig. 6, fig. 6 is a schematic flow chart illustrating sub-steps of a training process of a text classification model according to an embodiment of the present application. As shown in fig. 6, the training process of the text classification model may specifically include the following steps S401 to S404.
Step S401, word vector matrices of the high-utility item sets corresponding to a preset number of original texts are obtained, category labeling is performed on each word vector matrix according to the real categories corresponding to the original texts, and the category-labeled word vector matrices are used as training samples.
In the embodiment of the application, the initial text classification model can be trained to obtain the trained text classification model. Wherein the initial text classification model may be a convolutional neural network.
Illustratively, a preset number of original texts can be collected, and item set mining is performed on the original texts to obtain the high-utility item sets corresponding to the original texts; each phrase in the high-utility item sets is then vectorized to obtain the word vector matrices corresponding to the original texts.
For a specific item set mining process and vectorization of phrases, reference may be made to the detailed description of the above embodiments, and specific implementation processes are not described herein again.
Illustratively, the original texts may be texts of a plurality of different categories.
In some embodiments, category labeling may be performed on each word vector matrix according to the real category of the corresponding original text, so as to obtain category-labeled word vector matrices, which are then used as training samples; each category-labeled word vector matrix carries its real category.
Illustratively, real categories may include, but are not limited to, insurance, medical, financial, travel, sports, scientific, and agricultural, among others.
By training the initial text classification model with the word vector matrices of the high-utility item sets of the original texts as training samples, the trained text classification model can predict the category of a text more accurately by combining the plurality of strongly associated phrases in a high-utility item set; meanwhile, the interference of synonyms with text classification is eliminated, improving the text classification effect.
And S402, inputting the training samples into the text classification model for classification training to obtain the prediction classes corresponding to the training samples.
Illustratively, a training sample is input into the text classification model, processed sequentially by the convolutional layer, the pooling layer, the fully connected layer, and the normalization layer in the text classification model, and the prediction category corresponding to the training sample is output.
Step S403, based on a preset loss function, calculating a loss function value according to the prediction category corresponding to the training sample and the real category corresponding to the training sample.
It should be noted that the loss function is used to evaluate the degree of difference between the predicted value and the actual value of the model, and the smaller the loss function is, the better the performance of the model is.
Exemplary loss functions may include, but are not limited to, the 0-1 loss function, the absolute value loss function, the logarithmic loss function, the squared loss function, and the exponential loss function. In the embodiment of the present application, the preset loss function may be the logarithmic loss function, with which the loss function value of each training round is calculated from the prediction categories and the real categories corresponding to the training samples.
Step S404, based on a preset gradient descent algorithm, adjusting the parameters in the text classification model according to the loss function value and performing the next round of training; when the obtained loss function value is smaller than the preset loss threshold, the training ends and the trained text classification model is obtained.
For example, the parameters of the text classification model may be adjusted according to the loss function value based on a gradient descent algorithm, so that the loss function value of the text classification model reaches a minimum value.
The gradient descent algorithm may include, but is not limited to, a batch gradient descent method, a random gradient descent method, a small batch gradient descent method, and the like.
In some embodiments, if the loss function value is less than or equal to the preset loss threshold, the training ends. If the loss function value is larger than the preset loss threshold, the parameters in the text classification model are adjusted according to the gradient descent algorithm and the next round of training is performed, with the loss function value calculated for each round; when the calculated loss function value is smaller than the preset loss threshold or no longer decreases, the training ends and the trained text classification model is obtained.
The preset loss threshold may be set according to an actual situation, and a specific value is not limited herein.
In some embodiments, to further ensure the privacy and security of the trained text classification model, it may also be stored in a node of a blockchain, from which it can be obtained when needed.
By training the initial text classification model based on the loss function and the gradient descent algorithm, the text classification model converges rapidly, improving the prediction accuracy of the trained text classification model.
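A minimal training-loop sketch under the same assumptions, using the TextCNN sketched above, the logarithmic loss (CrossEntropyLoss over the logits), plain gradient descent (SGD), and an assumed preset loss threshold:

```python
import torch
import torch.nn as nn

def train(model, samples, labels, loss_threshold=0.05, max_rounds=100):
    criterion = nn.CrossEntropyLoss()       # logarithmic loss on predicted categories
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent
    for _ in range(max_rounds):
        optimizer.zero_grad()
        loss = criterion(model(samples), labels)
        if loss.item() < loss_threshold:    # preset loss threshold reached: stop
            break
        loss.backward()                     # adjust parameters, run the next round
        optimizer.step()
    return model

# Hypothetical usage: eight labeled word vector matrices as training samples.
# model = train(TextCNN(), torch.randn(8, 5, 100), torch.randint(0, 7, (8,)))
```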
According to the text classification method provided by this embodiment, filtering the segmented phrases of the text to be classified against the preset stop word library deletes low-value phrases and avoids their influence on the prediction of the text category. Combining the phrases of the text to be classified yields a plurality of item sets each comprising at least two phrases, and the item sets whose utility values with respect to the text to be classified are not less than the preset utility threshold are screened out as high-utility item sets containing a plurality of strongly associated phrases. Performing classification prediction according to the word vector matrix corresponding to a high-utility item set eliminates the interference of synonyms with text classification. Vectorizing each phrase in the high-utility item set yields the word vector matrix corresponding to the text to be classified, and inputting this word vector matrix into the trained text classification model for classification prediction improves the prediction accuracy of the text category.
Referring to fig. 7, fig. 7 is a schematic block diagram of a text classification apparatus 1000 according to an embodiment of the present application, which is used for executing the foregoing text classification method. Wherein, the text classification device can be configured in a server or a terminal.
As shown in fig. 7, the text classification apparatus 1000 includes: an item set mining module 1001, a vectorization module 1002, and a classification prediction module 1003.
The item set mining module 1001 is configured to acquire a text to be classified and perform item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases.
The vectorization module 1002 is configured to vectorize each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified.
The classification prediction module 1003 is configured to input the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 8, the computer device includes a processor and a memory connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor, causes the processor to perform any of the text classification methods.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a text to be classified, and performing item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases; vectorizing each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified; and inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
In one embodiment, when implementing item set mining on the text to be classified to obtain the high-utility item set corresponding to the text to be classified, the processor is configured to implement:
performing word segmentation on the text to be classified to obtain a plurality of phrases corresponding to the text to be classified; combining the phrases to obtain a plurality of item sets corresponding to the text to be classified; and determining the utility value of each item set corresponding to the text to be classified, and determining the item sets whose utility values are not less than a preset utility threshold as the high-utility item sets corresponding to the text to be classified.
In one embodiment, when implementing word segmentation processing on the text to be classified to obtain a plurality of word groups corresponding to the text to be classified, the processor is configured to implement:
and performing word segmentation processing on each sentence in the text to be classified based on a preset word segmentation library to obtain a plurality of word groups corresponding to the text to be classified.
In one embodiment, after implementing word segmentation processing on the text to be classified to obtain a plurality of word groups corresponding to the text to be classified, the processor is further configured to implement:
and filtering the plurality of phrases based on a preset stop word stock to obtain a plurality of filtered phrases.
In one embodiment, when implementing the determination of the utility value of each item set with respect to the text to be classified, the processor is configured to implement:
determining the number of times each phrase in each item set appears in each sentence of the text to be classified as the first utility value of that phrase for that sentence; determining, for each sentence, the sum of the first utility values of the phrases in each item set as the second utility value of that item set for that sentence; and determining the sum of the second utility values of each item set over the sentences as the utility value of that item set corresponding to the text to be classified.
In one embodiment, when implementing vectorization of each phrase in the high-utility item set to obtain the word vector matrix corresponding to the text to be classified, the processor is configured to implement:
obtaining a word vector model from a blockchain; and inputting each phrase into the word vector model for vectorization to obtain the word vector matrix corresponding to the text to be classified.
In one embodiment, the text classification model includes a convolutional layer, a pooling layer, a fully-connected layer, and a normalization layer; when the processor inputs the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified, the processor is used for realizing that:
inputting the word vector matrix into the convolutional layer for convolution processing to obtain a feature image corresponding to the word vector matrix;
inputting the feature image into the pooling layer for pooling to obtain a pooled feature image;
inputting the pooled feature image into the fully connected layer for fully connected processing, and normalizing the result of the fully connected processing through the normalization layer to obtain the text category corresponding to the text to be classified.
In one embodiment, before the word vector matrix is input into the text classification model for classification prediction, the processor is further configured to implement:
obtaining word vector matrices of the high-utility item sets corresponding to a preset number of original texts, performing category labeling on each word vector matrix according to the real categories corresponding to the original texts, and taking the category-labeled word vector matrices as training samples; inputting the training samples into the text classification model for classification training to obtain the prediction categories corresponding to the training samples; calculating, based on a preset loss function, a loss function value according to the prediction categories and the real categories corresponding to the training samples; and adjusting the parameters in the text classification model according to the loss function value based on a preset gradient descent algorithm and performing the next round of training, until the obtained loss function value is smaller than the preset loss threshold, at which point the training ends and the trained text classification model is obtained.
The embodiment of the application further provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program comprises program instructions, and the processor executes the program instructions to implement any text classification method provided by the embodiment of the application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital Card (SD Card), a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated using cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of text classification, comprising:
acquiring a text to be classified, and performing item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases;
vectorizing each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified;
and inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
2. The method for classifying texts according to claim 1, wherein the performing item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified comprises:
performing word segmentation processing on the text to be classified to obtain a plurality of word groups corresponding to the text to be classified;
combining the phrases to obtain a plurality of item sets corresponding to the text to be classified;
and determining the utility value of each item set corresponding to the text to be classified, and determining the item set with the corresponding utility value not less than a preset utility threshold value as the high-utility item set corresponding to the text to be classified.
3. The method according to claim 2, wherein the performing word segmentation processing on the text to be classified to obtain a plurality of word groups corresponding to the text to be classified includes:
performing word segmentation processing on each sentence in the text to be classified based on a preset word segmentation library to obtain a plurality of word groups corresponding to the text to be classified;
after the word segmentation processing is performed on the text to be classified to obtain a plurality of word groups corresponding to the text to be classified, the method further includes:
and filtering the plurality of phrases based on a preset stop word stock to obtain a plurality of filtered phrases.
4. The method of claim 2, wherein the determining a utility value for each of the item sets corresponding to the text to be classified comprises:
determining the number of times each phrase in each item set appears in each sentence of the text to be classified as the first utility value of that phrase for that sentence;
determining, for each sentence, the sum of the first utility values of the phrases in each item set as the second utility value of that item set for that sentence;
and determining the sum of the second utility values of each item set over the sentences as the utility value of that item set corresponding to the text to be classified.
5. The method according to claim 1, wherein the vectorizing each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified includes:
obtaining a word vector model from a blockchain;
and inputting each phrase into the word vector model for vectorization to obtain the word vector matrix corresponding to the text to be classified.
6. The text classification method according to claim 1, wherein the text classification model includes a convolutional layer, a pooling layer, a fully-connected layer, and a normalization layer; inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified, wherein the text category comprises:
inputting the word vector matrix into the convolutional layer for convolution processing to obtain a feature image corresponding to the word vector matrix;
inputting the feature image into the pooling layer for pooling to obtain a pooled feature image;
inputting the pooled feature image into the fully connected layer for fully connected processing, and normalizing the result of the fully connected processing through the normalization layer to obtain the text category corresponding to the text to be classified.
7. The method of claim 1, wherein before inputting the word vector matrix into a text classification model for classification prediction, the method further comprises:
obtaining word vector matrices of high-utility item sets corresponding to a preset number of original texts, performing category labeling on each word vector matrix according to the real categories corresponding to the original texts, and taking the category-labeled word vector matrices as training samples;
inputting the training samples into the text classification model for classification training to obtain prediction classes corresponding to the training samples;
based on a preset loss function, calculating a loss function value according to the prediction category corresponding to the training sample and the real category corresponding to the training sample;
and adjusting parameters in the text classification model according to the loss function value based on a preset gradient descent algorithm and performing the next round of training, until the obtained loss function value is smaller than a preset loss threshold, at which point the training ends and the trained text classification model is obtained.
8. A text classification apparatus, comprising:
an item set mining module, configured to acquire a text to be classified and perform item set mining on the text to be classified to obtain a high-utility item set corresponding to the text to be classified, wherein the high-utility item set comprises at least two phrases;
a vectorization module, configured to vectorize each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified;
and a classification prediction module, configured to input the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory for storing a computer program;
the processor for executing the computer program and implementing the text classification method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the text classification method according to any one of claims 1 to 7.
CN202011389826.3A 2020-12-01 2020-12-01 Text classification method, device, computer equipment and medium Pending CN112445914A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011389826.3A CN112445914A (en) 2020-12-01 2020-12-01 Text classification method, device, computer equipment and medium
PCT/CN2021/084218 WO2022116444A1 (en) 2020-12-01 2021-03-31 Text classification method and apparatus, and computer device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011389826.3A CN112445914A (en) 2020-12-01 2020-12-01 Text classification method, device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN112445914A true CN112445914A (en) 2021-03-05

Family

ID=74740461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011389826.3A Pending CN112445914A (en) 2020-12-01 2020-12-01 Text classification method, device, computer equipment and medium

Country Status (2)

Country Link
CN (1) CN112445914A (en)
WO (1) WO2022116444A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022116444A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Text classification method and apparatus, and computer device and medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473095B (en) * 2023-12-27 2024-03-29 合肥工业大学 Short text classification method and system based on theme enhancement word representation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593454A (en) * 2013-11-21 2014-02-19 中国科学院深圳先进技术研究院 Mining method and system for microblog text classification
CN108334605B (en) * 2018-02-01 2020-06-16 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN109189925B (en) * 2018-08-16 2020-01-17 华南师范大学 Word vector model based on point mutual information and text classification method based on CNN
CN110851598B (en) * 2019-10-30 2023-04-07 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111708888B (en) * 2020-06-16 2023-10-24 腾讯科技(深圳)有限公司 Classification method, device, terminal and storage medium based on artificial intelligence
CN112445914A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Text classification method, device, computer equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴玉佳 (Wu Yujia), "基于高效用神经网络的文本分类方法" [Text classification method based on a high-utility neural network], 电子学报 (Acta Electronica Sinica), vol. 48, no. 2, pp. 279-284 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022116444A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Text classification method and apparatus, and computer device and medium

Also Published As

Publication number Publication date
WO2022116444A1 (en) 2022-06-09

Similar Documents

Publication Publication Date Title
CN110347835B (en) Text clustering method, electronic device and storage medium
CN107180023B (en) Text classification method and system
Xu et al. Investigation on the Chinese text sentiment analysis based on convolutional neural networks in deep learning.
CN111951805A (en) Text data processing method and device
WO2020211720A1 (en) Data processing method and pronoun resolution neural network training method
CN110633577B (en) Text desensitization method and device
CN107844533A (en) A kind of intelligent Answer System and analysis method
CN107943792B (en) Statement analysis method and device, terminal device and storage medium
CN112256822A (en) Text search method and device, computer equipment and storage medium
EP3620982B1 (en) Sample processing method and device
CN112418320B (en) Enterprise association relation identification method, device and storage medium
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
Estevez-Velarde et al. AutoML strategy based on grammatical evolution: A case study about knowledge discovery from text
CN112445914A (en) Text classification method, device, computer equipment and medium
CN114330343A (en) Part-of-speech-aware nested named entity recognition method, system, device and storage medium
CN112784884A (en) Medical image classification method, system, medium and electronic terminal
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
US20220374682A1 (en) Supporting Database Constraints in Synthetic Data Generation Based on Generative Adversarial Networks
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN110717407A (en) Human face recognition method, device and storage medium based on lip language password
CN112835798A (en) Cluster learning method, test step clustering method and related device
CN115357720B (en) BERT-based multitasking news classification method and device
CN115456421A (en) Work order dispatching method and device, processor and electronic equipment
CN113011153B (en) Text correlation detection method, device, equipment and storage medium
CN112215006B (en) Organization named entity normalization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination