WO2022116444A1 - Text classification method and apparatus, computer device, and medium - Google Patents

Text classification method and apparatus, computer device, and medium

Info

Publication number
WO2022116444A1
WO2022116444A1 · PCT/CN2021/084218 · CN2021084218W
Authority
WO
WIPO (PCT)
Prior art keywords
text
classified
utility
phrases
word vector
Prior art date
Application number
PCT/CN2021/084218
Other languages
English (en)
French (fr)
Inventor
赵婧
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022116444A1 publication Critical patent/WO2022116444A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a text classification method, apparatus, computer equipment and medium.
  • Existing text classification methods generally use deep learning algorithms to predict text categories.
  • in predicting a text category, a deep learning algorithm depends heavily on the selected text features:
  • the text is converted into word vectors, and the word vectors are used to determine the distance relationships between the text features.
  • however, the inventors realized that deep learning algorithms cannot eliminate the interference of synonyms on text classification, which reduces the accuracy of text classification.
  • the present application provides a text classification method, device, computer equipment and medium.
  • by performing itemset mining on the text to be classified, a high-utility itemset containing multiple strongly related words can be obtained.
  • the word vector matrix of the high-utility itemset is then used for classification prediction, which improves the accuracy of text classification.
  • the present application provides a text classification method, the method comprising:
  • the word vector matrix is input into a text classification model for classification prediction, and a text category corresponding to the text to be classified is obtained.
  • the present application also provides a text classification device, the device comprising:
  • an itemset mining module configured to obtain the text to be classified, perform itemset mining on the text to be classified, and obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases;
  • a vectorization module used to vectorize each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified;
  • a classification prediction module configured to input the word vector matrix into a text classification model for classification prediction, and obtain a text category corresponding to the text to be classified.
  • the present application also provides a computer device, the computer device comprising a memory and a processor;
  • the memory is used to store a computer program;
  • the processor is configured to execute the computer program and implement the following steps when executing the computer program:
  • the word vector matrix is input into a text classification model for classification prediction, and a text category corresponding to the text to be classified is obtained.
  • the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements the following steps:
  • the word vector matrix is input into a text classification model for classification prediction, and a text category corresponding to the text to be classified is obtained.
  • the present application discloses a text classification method, device, computer equipment and medium.
  • a high-utility item set containing multiple strongly related words corresponding to the text to be classified can be obtained.
  • the high-utility item set of strongly related words is used for text classification, which solves the problem of synonyms interfering with text classification; by vectorizing each phrase in the high-utility item set, the word vector matrix corresponding to the text to be classified can be obtained;
  • the word vector matrix is then input into the text classification model for classification prediction, which improves the prediction accuracy of text categories.
  • FIG. 1 is a schematic flowchart of a text classification method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a prediction process for text classification provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of sub-steps of performing itemset mining on text to be classified according to an embodiment of the present application
  • FIG. 4 is a schematic flowchart of sub-steps for determining the utility value of an item set provided by an embodiment of the present application
  • FIG. 5 is a schematic flowchart of a sub-step of performing classification prediction according to a word vector matrix provided by an embodiment of the present application
  • FIG. 6 is a schematic flowchart of sub-steps of a training process of a text classification model provided by an embodiment of the present application
  • FIG. 7 is a schematic block diagram of a text classification apparatus provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
  • Embodiments of the present application provide a text classification method, apparatus, computer device, and medium.
  • the text classification method can be applied to a server or a terminal.
  • by performing itemset mining on the text to be classified,
  • a high-utility itemset containing multiple strongly related words can be obtained, and classification prediction can be performed according to the word vector matrix of the high-utility itemset, so as to improve the accuracy of text classification.
  • the server may be an independent server or a server cluster.
  • Terminals can be electronic devices such as smart phones, tablet computers, notebook computers, and desktop computers.
  • the text classification method includes steps S10 to S30.
  • Step S10 Obtain the text to be classified, perform itemset mining on the text to be classified, and obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases.
  • the text to be classified may be a text file uploaded by a user to a server or terminal, or a text file stored in a local disk of the server or terminal, or a text file stored in a node of the blockchain.
  • the user's text selection operation on the text file may be received, and the selected text file is determined as the text to be classified according to the text selection operation.
  • itemset mining refers to mining strongly correlated phrases in the text to be classified as a high-utility itemset, where the high-utility itemset includes at least two phrases.
  • FIG. 2 is a schematic diagram of a prediction process of text classification provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of sub-steps of performing itemset mining on the text to be classified in step S10 to obtain a high-utility itemset corresponding to the text to be classified, which may specifically include the following steps S101 to S103 .
  • Step S101 Perform word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified.
  • the text to be classified may include at least one sentence. It can be understood that, performing word segmentation processing on the text to be classified refers to performing word segmentation on each sentence in the text to be classified.
  • performing word segmentation processing on the text to be classified to obtain a plurality of phrases corresponding to the text to be classified may include: performing word segmentation processing on each sentence in the text to be classified based on a preset word segmentation library, to obtain the multiple phrases corresponding to the text to be classified.
  • the preset word segmentation library may be the jieba library.
  • the jieba library uses a Chinese lexicon to analyze the association probability between Chinese characters, as well as the association probability of Chinese-character phrases, and can also segment according to user-defined phrases.
  • the jieba library may include a precise mode, a full mode and a search engine mode, and the different modes are implemented by different functions.
  • the precise mode can be implemented by the lcut(s) function, and the full mode by the lcut(s, cut_all=True) function;
  • the search engine mode can be implemented by the lcut_for_search(s) function.
  • each sentence in the text to be classified can be processed by word segmentation using the jieba library, to obtain multiple phrases corresponding to the text to be classified.
  • the method may further include: filtering the multiple phrases based on a preset stop word database, and obtaining the filtered multiple phrases.
  • the preset stop word database can be pre-created and stored in a local disk or database. Understandably, a stop word database is used to filter out low-value words in a text or sentence.
  • low-value words refer to words that occur very frequently but have little effect on the text or sentence.
  • low-value words may include, but are not limited to, "some", "everything", "on the one hand", "generally", "roughly", "ah", "according to", "for example", "thereby", "as well as", "and", etc.
  • the stop word database may be invoked to filter the multiple phrases to obtain the filtered phrases.
  • exemplarily, the low-value phrases among the multiple phrases are deleted through the stop word database.
  • the low-value phrases can be deleted to avoid the impact of the low-value phrases on the prediction of the text category.
  • Step S102 combining the multiple phrases to obtain multiple item sets corresponding to the text to be classified.
  • At least two of the plurality of phrases may be combined.
  • the phrases A, B, C, D can be combined, and the resulting itemsets can include (AB), (AC), (AD), (BC), (BD), (CD), (ABC), (ABD), (ACD), (BCD) and (ABCD).
  • multiple itemsets containing at least two phrases can be obtained, and subsequently high-utility itemsets can be determined according to the utility values corresponding to the itemsets.
  • Step S103 Determine the utility value of each item set corresponding to the text to be classified, and determine the item set whose corresponding utility value is not less than a preset utility threshold as the high-utility item set corresponding to the text to be classified .
  • the utility value is used to represent the number of times the itemset appears in the text to be classified. For example, the higher the number of times an item set appears in the text to be classified, the greater the utility value corresponding to the item set.
  • FIG. 4 is a schematic flowchart of the sub-steps of determining the utility value of each item set corresponding to the text to be classified in step S103 , which may specifically include the following steps S1031 to S1033 .
  • Step S1031 Determine the number of times each of the phrases in each item set appears in each sentence of the text to be classified as the first utility value of each of the phrases corresponding to each of the sentences.
  • for phrase A, if the number of times phrase A appears in a certain sentence is 1, it can be determined that the first utility value of phrase A for that sentence is 1, which can be expressed as (A, 1).
  • Step S1032 Determine the sum of the first utility values of the phrases in each item set corresponding to each of the sentences as the second utility value of each of the item sets corresponding to each of the sentences.
  • the second utility value of each item set for each sentence can be expressed as U(X, T_d), where U(X, T_d) can be defined as the sum of the first utility values U(i_k, T_d) of the phrases in itemset X in sentence T_d, i.e., U(X, T_d) = Σ_{i_k ∈ X} U(i_k, T_d).
  • Step S1033 Determine the sum of the second utility values of each of the item sets corresponding to each of the sentences as the utility value of each of the item sets corresponding to the text to be classified.
  • the utility value of each item set corresponding to the text to be classified can be expressed as U(X, D), where U(X, D) can be defined as the sum of the second utility values U(X, T_d) of itemset X over the sentences T_d, i.e., U(X, D) = Σ_{d=1}^{n} U(X, T_d).
  • the utility value of the item set corresponding to the text to be classified is described by taking Table 1 as an example.
  • the text to be classified includes four sentences T1, T2, T3 and T4.
  • an item set whose corresponding utility value is not less than a preset utility threshold is determined as a high-utility item set corresponding to the text to be classified.
  • a higher utility value indicates that the phrases in the item set appear together more frequently in the text to be classified.
  • if the utility value corresponding to the item set is not less than the preset utility threshold, it means that the phrases in the item set are strongly related words.
  • the preset utility threshold may be set according to the actual situation, and the specific value is not limited herein.
  • the preset utility threshold can be expressed as Q. For example, if the utility value of itemset X satisfies U(X) ≥ Q, it can be determined that itemset X is a high-utility itemset.
  • if the utility value of itemset AC is greater than the preset utility threshold Q, it can be determined that itemset AC is a high-utility itemset. If the utility value of itemset AE is greater than the preset utility threshold Q, it can be determined that itemset AE is a high-utility itemset. If the utility value of itemset BC is smaller than the preset utility threshold Q, itemset BC is not regarded as a high-utility itemset.
  • the itemsets whose utility value is not less than the preset utility threshold can be filtered out, so that a high-utility itemset containing multiple strongly related words can be obtained;
  • when classification prediction is subsequently performed according to the word vector matrix corresponding to the high-utility itemset, the interference of synonyms on text classification can be eliminated, and the prediction accuracy of text classification can be improved.
  • Step S20 Vectorize each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified.
  • vectorizing each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified may include: obtaining a word vector model from the blockchain; inputting each phrase into the word vector model for vectorization to obtain the word vector matrix corresponding to the text to be classified.
  • the word vector model may be trained in advance to obtain a trained word vector model. It should be emphasized that, in order to further ensure the privacy and security of the above trained word vector model, the above trained word vector model can also be stored in a node of a blockchain. When vectorizing each phrase in the high-utility item set, the word vector model can be called from the nodes of the blockchain to vectorize each phrase, and the word vector matrix corresponding to the text to be classified can be obtained.
  • each row can represent a word vector corresponding to a phrase.
  • the word vector model may include, but is not limited to, the word2vec (word vector) model, the glove (Global vectors for word representation) model, and the BERT (Bidirectional Encoder Representations from Transformer) model, and so on.
  • a word vector matrix corresponding to the text to be classified can be obtained, and then the word vector matrix can be input into the text classification model for classification prediction.
  • Step S30 Input the word vector matrix into a text classification model to perform classification prediction, and obtain a text category corresponding to the text to be classified.
  • the text classification model is a trained text classification model.
  • the text classification model may include, but is not limited to, a convolutional neural network (Convolutional Neural Network, CNN), a HAN model, a recurrent neural network (Recurrent Neural Network, RNN), and the like.
  • the prediction accuracy of the text category corresponding to the text to be classified can be improved.
  • the prediction process of text classification is described in detail by taking the text classification model as a convolutional neural network as an example.
  • the convolutional neural network may include a convolutional layer, a pooling layer, a fully connected layer, and a normalization layer.
  • FIG. 5 is a schematic flowchart of the sub-steps of inputting the word vector matrix into the text classification model for classification prediction in step S30, and obtaining the text category corresponding to the text to be classified. Specifically, it may include the following steps S301 to S303 .
  • Step S301 inputting the word vector matrix into the convolution layer for convolution processing to obtain a feature image corresponding to the word vector matrix.
  • the convolution process refers to the extraction of high-level features in the word vector matrix.
  • a preset convolution filter may be used to perform feature extraction on the training samples to obtain feature images corresponding to the training samples.
  • the number of convolution kernels of the convolution filter, the size of each convolution kernel, and the convolution step size can be set according to actual conditions, and the specific values are not limited here.
  • Step S302 Input the feature image into the pooling layer to perform pooling processing, and obtain the feature image after the pooling process.
  • pooling replaces a region of the image with a single value, such as the maximum or the average; using the maximum value is max-pooling, and using the average value is called mean-pooling.
  • the pooling operation can reduce the image size and achieve translation and rotation invariance. This is because the output value is calculated from a region of the image and is not sensitive to translation and rotation. In this embodiment of the present application, maximum pooling may be used to perform pooling processing on feature images.
  • the feature image s = (y_1, y_2, …, y_n) is input into the pooling layer for pooling processing, and the pooled feature image is obtained.
  • Step S303 Input the pooled feature image into the fully-connected layer for full-connection processing, and normalize the result of the full-connection processing through the normalization layer to obtain the text to be classified the corresponding text category.
  • the fully connected layer (Fully Connected layers, FC) acts as the "classifier" of the whole convolutional neural network: it connects all the features of the previous layer and sends the output value to the normalization layer.
  • the feature vector output by the fully connected layer may be normalized according to the normalization layer in the convolutional neural network, and the output is the category probability distribution corresponding to the text to be classified.
  • the normalization layer can output the class probability distribution through the softmax function.
  • the expression of the softmax function is: P(c) = e^{q_c} / Σ_j e^{q_j}
  • c represents the category
  • q represents the feature vector output by the fully connected layer
  • j indexes the j-th element of the feature vector q.
  • the category probability distribution may include category probabilities and the categories corresponding to those probabilities.
  • the categories may include but are not limited to categories such as insurance, medical care, finance, tourism, sports, technology, and agriculture.
  • the category corresponding to the maximum category probability may be determined as the text category corresponding to the text to be classified. For example, if the category probability distribution includes the category probabilities corresponding to the first to fourth categories: 0.20, 0.02, 0.08, 0.70, it can be determined that the fourth category is the text category corresponding to the text to be classified.
  • FIG. 6 is a schematic flowchart of sub-steps of a training process of a text classification model provided by an embodiment of the present application. As shown in FIG. 6 , the training process of the text classification model specifically includes steps S401 to S404.
  • Step S401 Obtain word vector matrices of high-utility itemsets corresponding to a preset number of original texts, perform category labeling on each word vector matrix according to the true category of the corresponding original text, and use the category-labeled word vector matrices as training samples.
  • an initial text classification model may be trained to obtain a trained text classification model.
  • the initial text classification model can be a convolutional neural network.
  • a preset number of original texts can be collected, and itemset mining can be performed on the original texts to obtain high-utility itemsets corresponding to the original texts; then each phrase in the high-utility itemsets can be vectorized to obtain the corresponding word vector matrix.
  • the original text may be a number of different categories of text.
  • each word vector matrix may be class-labeled according to the real class corresponding to the original text, to obtain a class-labeled word vector matrix. Then the word vector matrix after category labeling is used as a training sample. Among them, the word vector matrix after category labeling carries the real category.
  • real categories may include, but are not limited to, insurance, medical, financial, travel, sports, technology, and agriculture categories.
  • by combining the multiple strongly related words in the high-utility item set, the trained text classification model can predict the category to which a text belongs more accurately; at the same time, the interference of synonyms on text classification is eliminated, thereby improving the effect of text classification.
  • Step S402 Input the training sample into the text classification model for classification training, and obtain the predicted category corresponding to the training sample.
  • the training samples are input into the text classification model, and are sequentially processed through the convolution layer, pooling layer, fully connected layer and normalization layer in the text classification model, and the predicted category corresponding to the training samples is output.
  • Step S403 based on a preset loss function, calculate a loss function value according to the predicted category corresponding to the training sample and the real category corresponding to the training sample.
  • the loss function is used to evaluate the degree to which the predicted value of the model is different from the actual value. The smaller the loss function, the better the performance of the model.
  • the loss function may include, but is not limited to, a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a squared loss function, an exponential loss function, and the like.
  • the preset loss function may be a logarithmic loss function. Through the logarithmic loss function, the loss function value of each round of training is calculated according to the predicted category corresponding to the training sample and the real category corresponding to the training sample.
  • Step S404 Based on a preset gradient descent algorithm, adjust the parameters in the text classification model according to the loss function value and perform the next round of training, until the obtained loss function value is less than the preset loss threshold, at which point the training ends and the trained text classification model is obtained.
  • the parameters of the text classification model may be adjusted according to the value of the loss function based on the gradient descent algorithm, so that the value of the loss function of the text classification model reaches the minimum value.
  • the gradient descent algorithm may include, but is not limited to, batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
  • if the loss function value is less than or equal to the preset loss threshold, the training ends. If the loss function value is greater than the preset loss threshold, the parameters in the text classification model are adjusted according to the gradient descent algorithm, the next round of training is performed, and the loss function value of each round is calculated; when the calculated loss function value is less than the preset loss threshold or no longer decreases, the training ends and the trained text classification model is obtained.
  • the preset loss threshold can be set according to the actual situation, and the specific value is not limited here.
  • the above-mentioned trained text classification model may also be stored in a node of a blockchain.
  • the trained text classification model needs to be used, it can be obtained from the nodes of the blockchain.
  • by training the initial text classification model based on the loss function and the gradient descent algorithm, the text classification model converges quickly and the prediction accuracy of the trained text classification model is improved.
  • in the text classification method provided by the above embodiments, by filtering the segmented phrases of the text to be classified based on the preset stop word database, the low-value phrases can be deleted, avoiding their impact on the prediction of the text category;
  • by combining the multiple phrases of the text to be classified, multiple itemsets containing at least two phrases can be obtained, and high-utility itemsets can subsequently be determined according to the utility values corresponding to the itemsets;
  • by determining the utility value of each itemset corresponding to the text to be classified, the itemsets whose utility values are not less than the preset utility threshold can be selected, so as to obtain a high-utility itemset containing multiple strongly related words; when classification prediction is subsequently performed according to the word vector matrix corresponding to the high-utility itemset,
  • the interference of synonyms on text classification can be eliminated and the prediction accuracy of text classification improved; by vectorizing each phrase in the high-utility item set, the word vector matrix corresponding to the text to be classified can be obtained and then input into the text classification model for classification prediction.
  • FIG. 7 is a schematic block diagram of a text classification apparatus 1000 further provided by an embodiment of the present application, and the text classification apparatus is configured to perform the aforementioned text classification method.
  • the text classification apparatus may be configured in a server or a terminal.
  • the text classification apparatus 1000 includes: an itemset mining module 1001 , a vectorization module 1002 and a classification prediction module 1003 .
  • the itemset mining module 1001 is configured to obtain the text to be classified, perform itemset mining on the text to be classified, and obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases .
  • the vectorization module 1002 is configured to vectorize each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified.
  • the classification prediction module 1003 is configured to input the word vector matrix into a text classification model for classification prediction, and obtain the text category corresponding to the text to be classified.
  • the above-mentioned apparatus can be implemented in the form of a computer program that can be executed on a computer device as shown in FIG. 8 .
  • FIG. 8 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
  • the computer device can be a server or a terminal.
  • the computer device includes a processor and a memory connected through a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, it enables the processor to execute any of the text classification methods.
  • the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • when the processor performs itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, the processor is configured to implement:
  • when the processor performs word segmentation processing on the text to be classified to obtain multiple phrases corresponding to the text to be classified, the processor is configured to implement:
  • word segmentation processing is performed on each sentence in the text to be classified based on a preset word segmentation library, to obtain the multiple phrases corresponding to the text to be classified.
  • after the processor performs word segmentation processing on the text to be classified and obtains multiple phrases corresponding to the text to be classified, the processor is further configured to implement:
  • the multiple phrases are filtered based on a preset stop word database, to obtain the filtered multiple phrases.
  • when the processor determines the utility value of each item set corresponding to the text to be classified, the processor is configured to implement:
  • the number of times each phrase in each item set appears in each sentence of the text to be classified is determined as the first utility value of that phrase for each sentence;
  • the sum of the first utility values of the phrases in each item set for each sentence is determined as the second utility value of that item set for the sentence;
  • the sum of the second utility values of each item set over the sentences is determined as the utility value of that item set corresponding to the text to be classified.
  • when the processor vectorizes each phrase in the high-utility item set to obtain a word vector matrix corresponding to the text to be classified, the processor is configured to implement:
  • the word vector model is obtained from the blockchain; each of the phrases is input into the word vector model for vectorization, and the word vector matrix corresponding to the text to be classified is obtained.
  • the text classification model includes a convolution layer, a pooling layer, a fully connected layer, and a normalization layer; when the processor inputs the word vector matrix into the text classification model for classification prediction to obtain the text category corresponding to the text to be classified, the processor is configured to implement:
  • the pooled feature image is input into the fully connected layer for full-connection processing, and the result of the full-connection processing is normalized through the normalization layer, to obtain the text category corresponding to the text to be classified.
  • before the processor inputs the word vector matrix into the text classification model for classification prediction, the processor is further configured to implement:
  • obtaining word vector matrices of high-utility itemsets corresponding to a preset number of original texts, performing category labeling on each word vector matrix according to the true category of the corresponding original text, and using the category-labeled word vector matrices as training samples; inputting the training samples into the text classification model for classification training to obtain the predicted categories corresponding to the training samples; calculating, based on a preset loss function, a loss function value from the predicted categories corresponding to the training samples and the true categories corresponding to the training samples; and, based on a preset gradient descent algorithm, adjusting the parameters in the text classification model according to the loss function value and performing the next round of training, until the obtained loss function value is less than the preset loss threshold, at which point the training ends and the trained text classification model is obtained.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, the computer program includes program instructions, and when the processor executes the program instructions, the processor implements any one of the text classification methods provided in the embodiments of the present application.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart memory card (Smart Media Card, SMC), a secure digital card (Secure Digital Card, SD Card), a flash memory card (Flash Card), etc.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, etc.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-tampering) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

This application relates to the field of artificial intelligence: by performing itemset mining on a text to be classified, a high-utility itemset containing multiple strongly related words is obtained, and classification prediction can be performed according to the word vector matrix of the high-utility itemset, which improves the accuracy of text classification. The application relates in particular to a text classification method, apparatus, computer device, and medium. The text classification method includes: obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, where the high-utility itemset includes at least two phrases; vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified; and inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified. In addition, the application relates to blockchain technology; the text to be classified can be stored in a blockchain.

Description

Text classification method and apparatus, computer device, and medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 1, 2020, with application number 202011389826.3 and entitled "Text classification method and apparatus, computer device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a text classification method, apparatus, computer device, and medium.
Background
With the rapid development of the Internet and the arrival of the big-data era, text classification has become a hot research topic in natural language processing.
Existing text classification methods generally predict text categories with deep learning algorithms. In predicting a text category, a deep learning algorithm depends heavily on the selected text features: the text is converted into word vectors, and the word vectors are used to determine the distance relationships between the text features. However, the inventors realized that deep learning algorithms cannot eliminate the interference of synonyms in text classification, which reduces the accuracy of text classification.
How to improve the accuracy of text classification has therefore become a problem in urgent need of a solution.
Technical Problem
In view of this, this application provides a text classification method, apparatus, computer device, and medium. By performing itemset mining on a text to be classified, a high-utility itemset containing multiple strongly related words is obtained, and classification prediction can be performed according to the word vector matrix of the high-utility itemset, which improves the accuracy of text classification.
Technical Solution
In a first aspect, this application provides a text classification method, the method comprising:
obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases;
vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
In a second aspect, this application further provides a text classification apparatus, the apparatus comprising:
an itemset mining module, configured to obtain a text to be classified and perform itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases;
a vectorization module, configured to vectorize each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
a classification prediction module, configured to input the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
In a third aspect, this application further provides a computer device, the computer device comprising a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases;
vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the following steps:
obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases;
vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
Beneficial Effects
This application discloses a text classification method, apparatus, computer device, and medium. By performing itemset mining on the text to be classified, a high-utility itemset containing multiple strongly related words corresponding to the text to be classified can be obtained, and text classification can subsequently be performed on this high-utility itemset, which solves the problem of synonyms interfering with text classification; by vectorizing each phrase in the high-utility itemset, the word vector matrix corresponding to the text to be classified can be obtained; and by inputting the word vector matrix into the text classification model for classification prediction, the prediction accuracy of text categories is improved.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a text classification method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a prediction process of text classification provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of the sub-steps of performing itemset mining on the text to be classified provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of the sub-steps of determining the utility value of an itemset provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of the sub-steps of performing classification prediction according to the word vector matrix provided by an embodiment of this application;
FIG. 6 is a schematic flowchart of the sub-steps of the training process of a text classification model provided by an embodiment of this application;
FIG. 7 is a schematic block diagram of a text classification apparatus provided by an embodiment of this application;
FIG. 8 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
The flowcharts shown in the drawings are only illustrations; they need not include all contents and operations/steps, nor be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
It should be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of this application provide a text classification method, apparatus, computer device, and medium. The text classification method can be applied to a server or a terminal; by performing itemset mining on the text to be classified, a high-utility itemset containing multiple strongly related words is obtained, and classification prediction can be performed according to the word vector matrix of the high-utility itemset, improving the accuracy of text classification.
The server may be an independent server or a server cluster. The terminal may be an electronic device such as a smartphone, tablet computer, notebook computer, or desktop computer.
Some embodiments of this application are described in detail below with reference to the drawings. In the absence of conflict, the following embodiments and the features in the embodiments may be combined with one another.
As shown in FIG. 1, the text classification method includes steps S10 to S30.
Step S10: Obtain a text to be classified, and perform itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases.
Exemplarily, the text to be classified may be a text file uploaded by a user to the server or terminal, a text file stored on a local disk of the server or terminal, or a text file stored in a node of a blockchain.
In some embodiments, a user's text selection operation on a text file may be received, and the selected text file is determined as the text to be classified according to the selection operation.
It should be noted that itemset mining refers to mining strongly correlated phrases in the text to be classified as a high-utility itemset, where the high-utility itemset includes at least two phrases.
Please refer to FIG. 2, which is a schematic diagram of a prediction process of text classification provided by an embodiment of this application. As shown in FIG. 2, itemset mining is first performed on the text to be classified to obtain the corresponding high-utility itemset; each phrase in the high-utility itemset is then vectorized to obtain the word vector matrix corresponding to the text to be classified; finally, the word vector matrix is input into the text classification model for classification prediction to obtain the text category corresponding to the text to be classified.
Please refer to FIG. 3, which is a schematic flowchart of the sub-steps of performing itemset mining on the text to be classified in step S10 to obtain the corresponding high-utility itemset, which may specifically include the following steps S101 to S103.
Step S101: Perform word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified.
Exemplarily, the text to be classified may include at least one sentence. It can be understood that performing word segmentation on the text to be classified means segmenting each sentence in the text.
In some embodiments, performing word segmentation on the text to be classified to obtain multiple phrases may include: performing word segmentation on each sentence in the text to be classified based on a preset word segmentation library to obtain the multiple phrases corresponding to the text.
Exemplarily, the preset word segmentation library may be the jieba library. It should be noted that the jieba library uses a Chinese lexicon to analyze the association probability between Chinese characters and the association probability of Chinese-character phrases, and can also segment according to user-defined phrases. Exemplarily, the jieba library may include a precise mode, a full mode, and a search engine mode, implemented by different functions: the precise mode by the lcut(s) function; the full mode by the lcut(s, cut_all=True) function; the search engine mode by the lcut_for_search(s) function.
In this embodiment, each sentence in the text to be classified can be segmented with the jieba library to obtain the multiple phrases corresponding to the text to be classified.
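For illustration only, a minimal Python sketch of the three jieba modes named above; the sample sentence and the user-dictionary path are assumptions, not text from the application:

    import jieba

    sentence = "文本分类是自然语言处理领域的一个热点研究问题"  # illustrative sample

    precise = jieba.lcut(sentence)               # precise mode: non-overlapping segmentation
    full = jieba.lcut(sentence, cut_all=True)    # full mode: every possible word
    search = jieba.lcut_for_search(sentence)     # search engine mode: re-segments long words

    # User-defined phrases can also be registered, as the description notes:
    # jieba.load_userdict("user_dict.txt")       # hypothetical path, one phrase per line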
In some embodiments, after performing word segmentation on the text to be classified to obtain the multiple phrases, the method may further include: filtering the multiple phrases based on a preset stop-word library to obtain the filtered multiple phrases.
Exemplarily, the preset stop-word library may be created in advance and stored on a local disk or in a database. Understandably, the stop-word library is used to remove low-value words from a text or sentence, that is, words that occur very frequently but contribute little to the text or sentence. For example, low-value words may include, but are not limited to, "一些" ("some"), "一切" ("everything"), "一方面" ("on the one hand"), "一般" ("generally"), "上下" ("roughly"), "啊" ("ah"), "按照" ("according to"), "比如" ("for example"), "了" (aspect particle), "从而" ("thereby"), "以及" and "和" ("and"), and so on.
In this embodiment, after the multiple phrases corresponding to the text to be classified are obtained, the stop-word library can be invoked to filter the phrases, deleting the low-value phrases among them.
By filtering the segmented phrases of the text to be classified based on the preset stop-word library, low-value phrases can be deleted, avoiding their impact on the prediction of the text category.
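A hedged sketch of such filtering; the stop-word set below is a small assumed sample, not the application's preset library:

    STOP_WORDS = {"一些", "一切", "一方面", "一般", "上下", "啊",
                  "按照", "比如", "了", "从而", "以及", "和"}

    def filter_phrases(phrases):
        """Drop phrases that appear in the preset stop-word list."""
        return [p for p in phrases if p not in STOP_WORDS]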
Step S102: Combine the multiple phrases to obtain multiple itemsets corresponding to the text to be classified.
Exemplarily, at least two of the multiple phrases may be combined. For example, given phrases A, B, C, and D, the resulting itemsets may include (AB), (AC), (AD), (BC), (BD), (CD), (ABC), (ABD), (ACD), (BCD), and (ABCD).
By combining the multiple phrases of the text to be classified, multiple itemsets containing at least two phrases can be obtained, and high-utility itemsets can subsequently be determined according to the utility values of the itemsets.
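A minimal sketch of this combination step; enumerate_itemsets is a hypothetical helper name:

    from itertools import combinations

    def enumerate_itemsets(phrases):
        """Yield every itemset containing at least two phrases."""
        unique = sorted(set(phrases))
        for size in range(2, len(unique) + 1):
            yield from combinations(unique, size)

    # list(enumerate_itemsets(["A", "B", "C", "D"])) yields the 11 itemsets
    # (AB), (AC), ..., (ABCD) listed in the example above.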
Step S103: Determine the utility value of each itemset with respect to the text to be classified, and determine an itemset whose utility value is not less than a preset utility threshold as the high-utility itemset corresponding to the text to be classified.
It should be noted that the utility value represents the number of times the itemset appears in the text to be classified: the more often an itemset appears in the text, the greater its utility value.
Please refer to FIG. 4, which is a schematic flowchart of the sub-steps of determining the utility value of each itemset in step S103, which may specifically include the following steps S1031 to S1033.
Step S1031: Determine the number of times each phrase in each itemset appears in each sentence of the text to be classified as the first utility value of that phrase for the sentence.
Exemplarily, the text to be classified consists of a set of sentences, D = {T_1, T_2, …, T_n}, where D is the text to be classified, T is a sentence, and n is the number of sentences. Each sentence may contain k phrases, i.e., T_d = {i_1, i_2, …, i_k}, 1 ≤ d ≤ n.
Exemplarily, the first utility value of each phrase for each sentence can be expressed as U(i_k, T_d), defined as the number of times phrase i_k appears in sentence T_d, i.e., U(i_k, T_d) = Count(i_k, T_d), where Count(i_k, T_d) denotes the number of occurrences of phrase i_k in sentence T_d.
Exemplarily, for phrase A, if phrase A appears once in a certain sentence, its first utility value for that sentence is 1, which can be expressed as (A, 1).
Step S1032: Determine the sum of the first utility values of the phrases in each itemset for each sentence as the second utility value of that itemset for the sentence.
Exemplarily, the second utility value of each itemset for each sentence can be expressed as U(X, T_d), defined as the sum of the first utility values U(i_k, T_d) of the phrases in itemset X in sentence T_d, i.e.,
U(X, T_d) = Σ_{i_k ∈ X} U(i_k, T_d)
Step S1033: Determine the sum of the second utility values of each itemset over the sentences as the utility value of that itemset with respect to the text to be classified.
Exemplarily, the utility value of each itemset with respect to the text to be classified can be expressed as U(X, D), defined as the sum of the second utility values U(X, T_d) of itemset X over the sentences T_d, i.e.,
U(X, D) = Σ_{d=1}^{n} U(X, T_d)
In this embodiment, the utility value of an itemset with respect to the text to be classified is illustrated with Table 1.
Table 1
No.   Sentence   First utility values
T1    CACECE     (A,1), (C,3), (E,2)
T2    ABAFEF     (A,2), (B,1), (E,1), (F,2)
T3    DBDFD      (B,1), (D,3), (F,1)
T4    BDCDBE     (B,2), (C,1), (D,2), (E,1)
In Table 1, the text to be classified includes four sentences T1, T2, T3, and T4. In sentence T1, the first utility value of phrase A is U({A}, T1) = 1, and the first utility value of phrase C is U({C}, T1) = 3.
Exemplarily, for itemset AC, the second utility value in sentence T1 is U({AC}, T1) = 4, and the utility value of itemset AC in the text to be classified is U({AC}) = 4.
Exemplarily, for itemset AE, the second utility value in sentence T1 is U({AE}, T1) = 3 and in sentence T2 is U({AE}, T2) = 3, so the utility value of the itemset in the text to be classified is U({AE}) = 6.
In some implementations, after the utility value of each itemset with respect to the text to be classified is determined, an itemset whose utility value is not less than the preset utility threshold is determined as a high-utility itemset corresponding to the text to be classified.
It can be understood that the utility value indicates how often the phrases in an itemset appear together in the text to be classified; when the utility value of an itemset is not less than the preset utility threshold, the phrases in the itemset are strongly related words.
Exemplarily, the preset utility threshold, denoted Q, may be set according to the actual situation, and its specific value is not limited here. For example, if the utility value of itemset X satisfies U(X) ≥ Q, itemset X can be determined to be a high-utility itemset.
Exemplarily, if the utility value of itemset AC is greater than the preset utility threshold Q, itemset AC can be determined to be a high-utility itemset; if the utility value of itemset AE is greater than Q, itemset AE can be determined to be a high-utility itemset; if the utility value of itemset BC is less than Q, itemset BC is not taken as a high-utility itemset.
By determining the utility value of each itemset with respect to the text to be classified, the itemsets whose utility values are not less than the preset utility threshold can be selected, yielding high-utility itemsets containing multiple strongly related words; when classification prediction is subsequently performed according to the word vector matrix of a high-utility itemset, the interference of synonyms on text classification can be eliminated and the prediction accuracy of text classification improved.
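A sketch of steps S1031 to S1033 reproducing the numbers of Table 1. One assumption is made to match the worked examples (U({AC}) = 4, U({AE}) = 6): a sentence contributes a second utility value only when it contains every phrase of the itemset; the threshold Q = 4 is likewise an assumed value:

    sentences = ["CACECE", "ABAFEF", "DBDFD", "BDCDBE"]  # T1..T4; each "phrase" is one letter

    def first_utility(phrase, sentence):
        """U(i_k, T_d): occurrences of a phrase in one sentence."""
        return sentence.count(phrase)

    def second_utility(itemset, sentence):
        """U(X, T_d): sum of first utility values, counted only when the
        sentence contains every phrase of the itemset (assumed reading)."""
        if all(first_utility(p, sentence) > 0 for p in itemset):
            return sum(first_utility(p, sentence) for p in itemset)
        return 0

    def utility(itemset, doc):
        """U(X, D): sum of second utility values over all sentences."""
        return sum(second_utility(itemset, s) for s in doc)

    assert utility(("A", "C"), sentences) == 4   # matches U({AC}) = 4 above
    assert utility(("A", "E"), sentences) == 6   # matches U({AE}) = 3 + 3 = 6 above

    Q = 4  # assumed preset utility threshold
    high = [x for x in [("A", "C"), ("A", "E"), ("B", "C")] if utility(x, sentences) >= Q]
    # -> [("A", "C"), ("A", "E")]; ("B", "C") has U = 3 and is excluded, as in the example above.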
Step S20: Vectorize each phrase in the high-utility itemset to obtain the word vector matrix corresponding to the text to be classified.
In some embodiments, vectorizing each phrase in the high-utility itemset to obtain the word vector matrix may include: obtaining a word vector model from a blockchain; inputting each phrase into the word vector model for vectorization to obtain the word vector matrix corresponding to the text to be classified.
Exemplarily, in this embodiment, the word vector model may be trained in advance to obtain a trained word vector model. It should be emphasized that, to further ensure the privacy and security of the trained word vector model, it may also be stored in a node of a blockchain. When vectorizing each phrase in the high-utility itemset, the word vector model can be called from the blockchain node to vectorize each phrase and obtain the word vector matrix corresponding to the text to be classified.
In the word vector matrix, each row can represent the word vector of one phrase.
Exemplarily, the word vector model may include, but is not limited to, the word2vec model, the glove (Global Vectors for Word Representation) model, the BERT (Bidirectional Encoder Representations from Transformers) model, and so on.
By vectorizing each phrase in the high-utility itemset, the word vector matrix corresponding to the text to be classified can be obtained and subsequently input into the text classification model for classification prediction.
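A hedged sketch of this vectorization step using gensim's word2vec implementation (one of the models named above); the toy corpus, dimensions, and helper name are assumptions, not values from the application:

    import numpy as np
    from gensim.models import Word2Vec

    corpus = [["保险", "理赔", "条款"], ["金融", "利率", "贷款"]]  # toy training corpus
    model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

    def vectorize_itemset(itemset, model):
        """One row per phrase: stack the word vectors into the matrix."""
        return np.stack([model.wv[phrase] for phrase in itemset])

    matrix = vectorize_itemset(["保险", "理赔"], model)  # shape (2, 100)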
Step S30: Input the word vector matrix into a text classification model for classification prediction to obtain the text category corresponding to the text to be classified.
Exemplarily, the text classification model is a trained text classification model, which may include, but is not limited to, a convolutional neural network (CNN), a HAN model, a recurrent neural network (RNN), and the like.
By inputting the word vector matrix into the trained text classification model for classification prediction, the prediction accuracy of the text category corresponding to the text to be classified can be improved.
In this embodiment, the prediction process of text classification is described in detail taking a convolutional neural network as the text classification model. Exemplarily, the convolutional neural network may include a convolutional layer, a pooling layer, a fully connected layer, and a normalization layer.
Please refer to FIG. 5, which is a schematic flowchart of the sub-steps of inputting the word vector matrix into the text classification model for classification prediction in step S30, which may specifically include the following steps S301 to S303.
Step S301: Input the word vector matrix into the convolutional layer for convolution to obtain a feature image corresponding to the word vector matrix.
It should be noted that convolution refers to extracting high-level features from the word vector matrix.
Exemplarily, a preset convolution filter may be used to extract features from the training samples to obtain the corresponding feature images. The number of convolution kernels of the filter, the size of each kernel, and the convolution stride may be set according to the actual situation, and the specific values are not limited here.
Exemplarily, n filters with different window sizes are convolved with the word vector matrix to obtain the corresponding feature image s = (y_1, y_2, …, y_n).
Step S302: Input the feature image into the pooling layer for pooling to obtain the pooled feature image.
It should be noted that pooling replaces a region of the image with a single value, such as the maximum or the average: using the maximum gives max-pooling, and using the average gives mean-pooling. Pooling reduces the image size and provides invariance to translation and rotation, because the output value is computed from a whole region of the image and is not sensitive to translation or rotation. In this embodiment, max-pooling may be used to pool the feature images.
Exemplarily, the max-pooling computation can be expressed as:
q = max(s)
Exemplarily, the feature image s = (y_1, y_2, …, y_n) is input into the pooling layer for pooling to obtain the pooled feature image.
Step S303: Input the pooled feature image into the fully connected layer for full-connection processing, and normalize the result of the full-connection processing through the normalization layer to obtain the text category corresponding to the text to be classified.
It should be noted that the fully connected (FC) layer acts as the "classifier" of the convolutional neural network: it connects all the features of the previous layer and sends the output value to the normalization layer.
Exemplarily, the feature vector output by the fully connected layer may be normalized by the normalization layer of the convolutional neural network, whose output is the category probability distribution corresponding to the text to be classified. Exemplarily, the normalization layer can output the category probability distribution through the softmax function.
Exemplarily, the expression of the softmax function is:
P(c) = e^{q_c} / Σ_j e^{q_j}
where c denotes the category, q denotes the feature vector output by the fully connected layer, and j indexes the elements of the feature vector q.
Exemplarily, the category probability distribution may include category probabilities and the categories corresponding to them.
The categories may include, but are not limited to, insurance, medical care, finance, tourism, sports, technology, agriculture, and the like.
In this embodiment, the category with the maximum category probability may be determined as the text category corresponding to the text to be classified. For example, if the category probability distribution includes probabilities 0.20, 0.02, 0.08, and 0.70 for the first to fourth categories, the fourth category can be determined as the text category corresponding to the text to be classified.
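A hedged PyTorch sketch of steps S301 to S303: convolution with filters of different window sizes, max-pooling (q = max(s)), a fully connected layer, and softmax. All layer sizes and the seven-category output are assumptions for illustration, not claimed values:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextCNN(nn.Module):
        def __init__(self, embed_dim=100, num_classes=7, num_filters=64, windows=(2, 3, 4)):
            super().__init__()
            # one Conv1d per window size, sliding over the phrase dimension (step S301)
            self.convs = nn.ModuleList(
                nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in windows
            )
            self.fc = nn.Linear(num_filters * len(windows), num_classes)  # step S303

        def forward(self, word_vector_matrix):
            # (batch, num_phrases, embed_dim) -> (batch, embed_dim, num_phrases)
            x = word_vector_matrix.transpose(1, 2)
            features = [F.relu(conv(x)) for conv in self.convs]  # feature images (S301)
            pooled = [f.max(dim=2).values for f in features]     # q = max(s)     (S302)
            q = torch.cat(pooled, dim=1)
            return F.softmax(self.fc(q), dim=1)                  # category probabilities

    model = TextCNN()
    probs = model(torch.randn(1, 5, 100))  # a 5-phrase itemset of 100-dim vectors
    predicted = probs.argmax(dim=1)        # index of the maximum category probability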
Please refer to FIG. 6, which is a schematic flowchart of the sub-steps of the training process of the text classification model provided by an embodiment of this application. As shown in FIG. 6, the training process of the text classification model specifically includes steps S401 to S404.
Step S401: Obtain word vector matrices of high-utility itemsets corresponding to a preset number of original texts, label each word vector matrix with a category according to the true category of the corresponding original text, and use the category-labeled word vector matrices as training samples.
In this embodiment, an initial text classification model may be trained to obtain the trained text classification model. The initial text classification model may be a convolutional neural network.
Exemplarily, a preset number of original texts may be collected and itemset mining performed on them to obtain the corresponding high-utility itemsets; each phrase in the high-utility itemsets is then vectorized to obtain the word vector matrices corresponding to the original texts.
For the specific itemset mining process and phrase vectorization, see the detailed description of the above embodiments; the specific implementation is not repeated here.
Exemplarily, the original texts may be texts of a number of different categories.
In some embodiments, each word vector matrix may be labeled with a category according to the true category of the corresponding original text, and the category-labeled word vector matrices are then used as training samples. The category-labeled word vector matrices carry the true categories.
Exemplarily, the true categories may include, but are not limited to, insurance, medical care, finance, tourism, sports, technology, agriculture, and the like.
By using the word vector matrices of the high-utility itemsets of the original texts as training samples for the initial text classification model, the combination of multiple strongly related words in the high-utility itemsets allows the trained model to predict the category of a text more accurately, and also eliminates the interference of synonyms in text classification, thereby improving the classification results.
Step S402: Input the training samples into the text classification model for classification training to obtain the predicted categories corresponding to the training samples.
Exemplarily, the training samples are input into the text classification model and processed in turn by its convolutional layer, pooling layer, fully connected layer, and normalization layer, which outputs the predicted categories corresponding to the training samples.
Step S403: Based on a preset loss function, calculate a loss function value from the predicted categories corresponding to the training samples and the true categories corresponding to the training samples.
It should be noted that the loss function evaluates the degree to which the model's predictions differ from the true values; the smaller the loss function, the better the model's performance usually is.
Exemplarily, the loss function may include, but is not limited to, the 0-1 loss, absolute-value loss, logarithmic loss, squared loss, exponential loss, and so on. In this embodiment, the preset loss function may be the logarithmic loss; with it, the loss function value of each training round is calculated from the predicted and true categories of the training samples.
Step S404: Based on a preset gradient descent algorithm, adjust the parameters in the text classification model according to the loss function value and perform the next round of training, until the obtained loss function value is less than a preset loss threshold, at which point the training ends and the trained text classification model is obtained.
Exemplarily, the parameters of the text classification model may be adjusted according to the loss function value based on a gradient descent algorithm, so that the loss function value of the model reaches its minimum.
The gradient descent algorithm may include, but is not limited to, batch gradient descent, stochastic gradient descent, mini-batch gradient descent, and so on.
In some implementations, if the loss function value is less than or equal to the preset loss threshold, training ends. If the loss function value is greater than the preset loss threshold, the parameters of the text classification model are adjusted according to the gradient descent algorithm, the next round of training is performed, and the loss function value of each round is calculated; when the calculated loss function value is less than the preset loss threshold or no longer decreases, training ends and the trained text classification model is obtained.
The preset loss threshold may be set according to the actual situation, and the specific value is not limited here.
In some embodiments, to further ensure the privacy and security of the trained text classification model, it may also be stored in a node of a blockchain, from which it can be obtained when needed.
Training the initial text classification model based on the loss function and gradient descent algorithm makes the model converge quickly and improves the prediction accuracy of the trained text classification model.
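A hedged sketch of the training loop of steps S402 to S404, reusing the TextCNN sketch above; NLLLoss on log-probabilities plays the role of the logarithmic loss, and the optimizer, learning rate, and threshold are assumed values:

    import torch
    import torch.nn as nn

    def train(model, samples, labels, loss_threshold=0.05, max_rounds=100):
        """samples: (N, num_phrases, embed_dim) word vector matrices; labels: (N,) true categories."""
        criterion = nn.NLLLoss()                                  # logarithmic loss
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # a gradient descent variant
        for _ in range(max_rounds):
            optimizer.zero_grad()
            probs = model(samples)                      # predicted category distribution
            loss = criterion(torch.log(probs), labels)  # predicted vs. true category
            loss.backward()                             # gradients for parameter adjustment
            optimizer.step()                            # next round with adjusted parameters
            if loss.item() < loss_threshold:            # stop below the preset loss threshold
                break
        return model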
In the text classification method provided by the above embodiments, by filtering the segmented phrases of the text to be classified based on a preset stop-word library, low-value phrases can be deleted, avoiding their impact on the prediction of the text category; by combining the multiple phrases of the text to be classified, multiple itemsets containing at least two phrases can be obtained, and high-utility itemsets can subsequently be determined according to the utility values of the itemsets; by determining the utility value of each itemset with respect to the text to be classified, the itemsets whose utility values are not less than the preset utility threshold can be selected, yielding high-utility itemsets containing multiple strongly related words; when classification prediction is subsequently performed according to the word vector matrix of a high-utility itemset, the interference of synonyms on text classification can be eliminated and the prediction accuracy improved; by vectorizing each phrase in the high-utility itemset, the word vector matrix corresponding to the text to be classified can be obtained and then input into the text classification model for classification prediction; inputting the word vector matrix into the trained text classification model for classification prediction improves the prediction accuracy of the text category corresponding to the text to be classified.
Please refer to FIG. 7, which is a schematic block diagram of a text classification apparatus 1000 further provided by an embodiment of this application; the apparatus is configured to perform the aforementioned text classification method and may be configured in a server or a terminal.
As shown in FIG. 7, the text classification apparatus 1000 includes an itemset mining module 1001, a vectorization module 1002, and a classification prediction module 1003.
The itemset mining module 1001 is configured to obtain a text to be classified and perform itemset mining on it to obtain the corresponding high-utility itemset, wherein the high-utility itemset includes at least two phrases.
The vectorization module 1002 is configured to vectorize each phrase in the high-utility itemset to obtain the word vector matrix corresponding to the text to be classified.
The classification prediction module 1003 is configured to input the word vector matrix into a text classification model for classification prediction to obtain the text category corresponding to the text to be classified.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and modules described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The apparatus described above may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 8.
Please refer to FIG. 8, which is a schematic structural block diagram of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal.
As shown in FIG. 8, the computer device includes a processor and a memory connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when executed by the processor, the computer program enables the processor to perform any of the text classification methods.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset includes at least two phrases; vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified; inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
In one embodiment, when performing itemset mining on the text to be classified to obtain the corresponding high-utility itemset, the processor is configured to implement:
performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text; combining the multiple phrases to obtain multiple itemsets corresponding to the text; determining the utility value of each itemset with respect to the text to be classified, and determining an itemset whose utility value is not less than a preset utility threshold as the high-utility itemset corresponding to the text.
In one embodiment, when performing word segmentation on the text to be classified to obtain the multiple phrases, the processor is configured to implement:
performing word segmentation on each sentence in the text to be classified based on a preset word segmentation library, to obtain the multiple phrases corresponding to the text.
In one embodiment, after performing word segmentation on the text to be classified and obtaining the multiple phrases, the processor is further configured to implement:
filtering the multiple phrases based on a preset stop-word library, to obtain the filtered multiple phrases.
In one embodiment, when determining the utility value of each itemset with respect to the text to be classified, the processor is configured to implement:
determining the number of times each phrase in each itemset appears in each sentence of the text to be classified as the first utility value of that phrase for the sentence; determining the sum of the first utility values of the phrases in each itemset for each sentence as the second utility value of that itemset for the sentence; determining the sum of the second utility values of each itemset over the sentences as the utility value of that itemset with respect to the text to be classified.
In one embodiment, when vectorizing each phrase in the high-utility itemset to obtain the word vector matrix corresponding to the text to be classified, the processor is configured to implement:
obtaining a word vector model from a blockchain; inputting each phrase into the word vector model for vectorization, to obtain the word vector matrix corresponding to the text to be classified.
In one embodiment, the text classification model includes a convolutional layer, a pooling layer, a fully connected layer, and a normalization layer; when inputting the word vector matrix into the text classification model for classification prediction to obtain the text category corresponding to the text to be classified, the processor is configured to implement:
inputting the word vector matrix into the convolutional layer for convolution, to obtain the feature image corresponding to the word vector matrix;
inputting the feature image into the pooling layer for pooling, to obtain the pooled feature image;
inputting the pooled feature image into the fully connected layer for full-connection processing, and normalizing the result of the full-connection processing through the normalization layer, to obtain the text category corresponding to the text to be classified.
In one embodiment, before inputting the word vector matrix into the text classification model for classification prediction, the processor is further configured to implement:
obtaining word vector matrices of high-utility itemsets corresponding to a preset number of original texts, labeling each word vector matrix with a category according to the true category of the corresponding original text, and using the category-labeled word vector matrices as training samples; inputting the training samples into the text classification model for classification training to obtain the predicted categories corresponding to the training samples; calculating, based on a preset loss function, a loss function value from the predicted categories and the true categories corresponding to the training samples; and, based on a preset gradient descent algorithm, adjusting the parameters in the text classification model according to the loss function value and performing the next round of training, until the obtained loss function value is less than the preset loss threshold, at which point the training ends and the trained text classification model is obtained.
The embodiments of this application further provide a computer-readable storage medium storing a computer program; the computer program includes program instructions, and by executing the program instructions the processor implements any of the text classification methods provided by the embodiments of this application.
The computer-readable storage medium may be non-volatile or volatile. It may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device; it may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital card (SD Card), or a flash card equipped on the computer device.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created from the use of blockchain nodes, and so on.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-tampering) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
The above are only specific implementations of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by this application, and these modifications or substitutions shall fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A text classification method, comprising:
    obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset comprises at least two phrases;
    vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
    inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
  2. The text classification method according to claim 1, wherein the performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified comprises:
    performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified;
    combining the multiple phrases to obtain multiple itemsets corresponding to the text to be classified;
    determining a utility value of each itemset with respect to the text to be classified, and determining an itemset whose utility value is not less than a preset utility threshold as the high-utility itemset corresponding to the text to be classified.
  3. The text classification method according to claim 2, wherein the performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified comprises:
    performing word segmentation on each sentence in the text to be classified based on a preset word segmentation library, to obtain the multiple phrases corresponding to the text to be classified;
    and after the performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified, the method further comprises:
    filtering the multiple phrases based on a preset stop-word library, to obtain the filtered multiple phrases.
  4. The text classification method according to claim 2, wherein the determining a utility value of each itemset with respect to the text to be classified comprises:
    determining the number of times each phrase in each itemset appears in each sentence of the text to be classified as a first utility value of the phrase for the sentence;
    determining the sum of the first utility values of the phrases in each itemset for each sentence as a second utility value of the itemset for the sentence;
    determining the sum of the second utility values of each itemset over the sentences as the utility value of the itemset with respect to the text to be classified.
  5. The text classification method according to claim 1, wherein the vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified comprises:
    obtaining a word vector model from a blockchain;
    inputting each phrase into the word vector model for vectorization, to obtain the word vector matrix corresponding to the text to be classified.
  6. The text classification method according to claim 1, wherein the text classification model comprises a convolutional layer, a pooling layer, a fully connected layer, and a normalization layer; and the inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified comprises:
    inputting the word vector matrix into the convolutional layer for convolution, to obtain a feature image corresponding to the word vector matrix;
    inputting the feature image into the pooling layer for pooling, to obtain the pooled feature image;
    inputting the pooled feature image into the fully connected layer for full-connection processing, and normalizing the result of the full-connection processing through the normalization layer, to obtain the text category corresponding to the text to be classified.
  7. The text classification method according to claim 1, wherein before the inputting the word vector matrix into a text classification model for classification prediction, the method further comprises:
    obtaining word vector matrices of high-utility itemsets corresponding to a preset number of original texts, labeling each word vector matrix with a category according to the true category of the corresponding original text, and using the category-labeled word vector matrices as training samples;
    inputting the training samples into the text classification model for classification training, to obtain predicted categories corresponding to the training samples;
    calculating, based on a preset loss function, a loss function value from the predicted categories corresponding to the training samples and the true categories corresponding to the training samples;
    adjusting, based on a preset gradient descent algorithm, parameters in the text classification model according to the loss function value and performing a next round of training, until the obtained loss function value is less than a preset loss threshold, whereupon the training ends and the trained text classification model is obtained.
  8. A text classification apparatus, comprising:
    an itemset mining module, configured to obtain a text to be classified and perform itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset comprises at least two phrases;
    a vectorization module, configured to vectorize each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
    a classification prediction module, configured to input the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
  9. A computer device, comprising a memory and a processor;
    the memory is configured to store a computer program;
    the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
    obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset comprises at least two phrases;
    vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
    inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
  10. The computer device according to claim 9, wherein when performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, the processor is configured to implement:
    performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified;
    combining the multiple phrases to obtain multiple itemsets corresponding to the text to be classified;
    determining a utility value of each itemset with respect to the text to be classified, and determining an itemset whose utility value is not less than a preset utility threshold as the high-utility itemset corresponding to the text to be classified.
  11. The computer device according to claim 10, wherein when performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified, the processor is configured to implement:
    performing word segmentation on each sentence in the text to be classified based on a preset word segmentation library, to obtain the multiple phrases corresponding to the text to be classified;
    and after performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified, the processor is further configured to implement:
    filtering the multiple phrases based on a preset stop-word library, to obtain the filtered multiple phrases.
  12. The computer device according to claim 10, wherein when determining the utility value of each itemset with respect to the text to be classified, the processor is configured to implement:
    determining the number of times each phrase in each itemset appears in each sentence of the text to be classified as a first utility value of the phrase for the sentence;
    determining the sum of the first utility values of the phrases in each itemset for each sentence as a second utility value of the itemset for the sentence;
    determining the sum of the second utility values of each itemset over the sentences as the utility value of the itemset with respect to the text to be classified.
  13. The computer device according to claim 9, wherein when vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified, the processor is configured to implement:
    obtaining a word vector model from a blockchain;
    inputting each phrase into the word vector model for vectorization, to obtain the word vector matrix corresponding to the text to be classified.
  14. The computer device according to claim 9, wherein the text classification model comprises a convolutional layer, a pooling layer, a fully connected layer, and a normalization layer; and when inputting the word vector matrix into the text classification model for classification prediction to obtain the text category corresponding to the text to be classified, the processor is configured to implement:
    inputting the word vector matrix into the convolutional layer for convolution, to obtain a feature image corresponding to the word vector matrix;
    inputting the feature image into the pooling layer for pooling, to obtain the pooled feature image;
    inputting the pooled feature image into the fully connected layer for full-connection processing, and normalizing the result of the full-connection processing through the normalization layer, to obtain the text category corresponding to the text to be classified.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the following steps:
    obtaining a text to be classified, and performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, wherein the high-utility itemset comprises at least two phrases;
    vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified;
    inputting the word vector matrix into a text classification model for classification prediction to obtain a text category corresponding to the text to be classified.
  16. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, causes the processor, when performing itemset mining on the text to be classified to obtain a high-utility itemset corresponding to the text to be classified, to implement:
    performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified;
    combining the multiple phrases to obtain multiple itemsets corresponding to the text to be classified;
    determining a utility value of each itemset with respect to the text to be classified, and determining an itemset whose utility value is not less than a preset utility threshold as the high-utility itemset corresponding to the text to be classified.
  17. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, causes the processor, when performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified, to implement:
    performing word segmentation on each sentence in the text to be classified based on a preset word segmentation library, to obtain the multiple phrases corresponding to the text to be classified;
    and after performing word segmentation on the text to be classified to obtain multiple phrases corresponding to the text to be classified, further to implement:
    filtering the multiple phrases based on a preset stop-word library, to obtain the filtered multiple phrases.
  18. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, causes the processor, when determining the utility value of each itemset with respect to the text to be classified, to implement:
    determining the number of times each phrase in each itemset appears in each sentence of the text to be classified as a first utility value of the phrase for the sentence;
    determining the sum of the first utility values of the phrases in each itemset for each sentence as a second utility value of the itemset for the sentence;
    determining the sum of the second utility values of each itemset over the sentences as the utility value of the itemset with respect to the text to be classified.
  19. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, causes the processor, when vectorizing each phrase in the high-utility itemset to obtain a word vector matrix corresponding to the text to be classified, to implement:
    obtaining a word vector model from a blockchain;
    inputting each phrase into the word vector model for vectorization, to obtain the word vector matrix corresponding to the text to be classified.
  20. The computer-readable storage medium according to claim 15, wherein the text classification model comprises a convolutional layer, a pooling layer, a fully connected layer, and a normalization layer; and the computer program, when executed by the processor, causes the processor, when inputting the word vector matrix into the text classification model for classification prediction to obtain the text category corresponding to the text to be classified, to implement:
    inputting the word vector matrix into the convolutional layer for convolution, to obtain a feature image corresponding to the word vector matrix;
    inputting the feature image into the pooling layer for pooling, to obtain the pooled feature image;
    inputting the pooled feature image into the fully connected layer for full-connection processing, and normalizing the result of the full-connection processing through the normalization layer, to obtain the text category corresponding to the text to be classified.
PCT/CN2021/084218 2020-12-01 2021-03-31 Text classification method and apparatus, computer device, and medium WO2022116444A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011389826.3A CN112445914A (zh) 2020-12-01 2020-12-01 Text classification method and apparatus, computer device, and medium
CN202011389826.3 2020-12-01

Publications (1)

Publication Number Publication Date
WO2022116444A1 (zh)

Family

ID=74740461

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084218 WO2022116444A1 (zh) 2020-12-01 2021-03-31 文本分类方法、装置、计算机设备和介质

Country Status (2)

Country Link
CN (1) CN112445914A (zh)
WO (1) WO2022116444A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445914A (zh) Text classification method and apparatus, computer device, and medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593454A (zh) * 2013-11-21 2014-02-19 中国科学院深圳先进技术研究院 Mining method and system for microblog text classification
WO2019149200A1 (zh) * 2018-02-01 2019-08-08 腾讯科技(深圳)有限公司 Text classification method, computer device, and storage medium
CN109189925A (zh) * 2018-08-16 2019-01-11 华南师范大学 Word vector model based on pointwise mutual information and CNN-based text classification method
CN110851598A (zh) * 2019-10-30 2020-02-28 深圳价值在线信息科技股份有限公司 Text classification method and apparatus, terminal device, and storage medium
CN111708888A (zh) * 2020-06-16 2020-09-25 腾讯科技(深圳)有限公司 Artificial-intelligence-based classification method, apparatus, terminal, and storage medium
CN112445914A (zh) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Text classification method and apparatus, computer device, and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473095A (zh) * 2023-12-27 2024-01-30 合肥工业大学 Short text classification method and system based on topic-enhanced word representation
CN117473095B (zh) * 2023-12-27 2024-03-29 合肥工业大学 Short text classification method and system based on topic-enhanced word representation

Also Published As

Publication number Publication date
CN112445914A (zh) 2021-03-05

Similar Documents

Publication Publication Date Title
WO2021189974A1 Model training method, text classification method, apparatus, computer device, and medium
WO2022007823A1 Text data processing method and apparatus
US11227118B2 Methods, devices, and systems for constructing intelligent knowledge base
WO2019153551A1 Article classification method, apparatus, computer device, and storage medium
US10795922B2 Authorship enhanced corpus ingestion for natural language processing
WO2019136993A1 Text similarity calculation method, apparatus, computer device, and storage medium
US9318027B2 Caching natural language questions and results in a question and answer system
WO2020211720A1 Data processing method and pronoun resolution neural network training method
WO2021139262A1 Document keyword aggregation method, apparatus, computer device, and readable storage medium
CN110162771B Event trigger word recognition method and apparatus, and electronic device
WO2020224106A1 Neural-network-based text classification method and system, and computer device
WO2022116444A1 Text classification method, apparatus, computer device, and medium
CN113139134B Method and apparatus for predicting the popularity of user-generated content in a social network
CN114330343B Part-of-speech-aware nested named entity recognition method, system, device, and storage medium
WO2021008037A1 Text classification method based on an A-BiLSTM neural network, storage medium, and computer device
WO2019085332A1 Financial data analysis method, application server, and computer-readable storage medium
WO2014073206A1 Information processing device and information processing method
CN110569289A Big-data-based column data processing method, device, and medium
WO2023060633A1 Semantically enhanced relation extraction method and apparatus, computer device, and storage medium
US20220374682A1 Supporting Database Constraints in Synthetic Data Generation Based on Generative Adversarial Networks
WO2022116443A1 Sentence discrimination method, apparatus, device, and storage medium
WO2022174499A1 Text prosodic boundary prediction method, apparatus, device, and storage medium
CN115878761B Event context generation method, device, and medium
CN110222179B Contact-list text classification method and apparatus, and electronic device
CN114547257B Similar-case matching method and apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21899488; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21899488; Country of ref document: EP; Kind code of ref document: A1)