CN111078887A - Text classification method and device

Publication number: CN111078887A (application CN201911326228.9A; granted as CN111078887B)
Original language: Chinese (zh)
Inventors: 蒋卓, 赵建强, 黄剑, 张辉极
Assignee: Xiamen Meiya Pico Information Co., Ltd.
Legal status: Granted, Active

Classifications

    • G06F16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06N3/045 Neural networks; Architecture; Combinations of networks
    • G06N3/08 Neural networks; Learning methods
Abstract

The embodiment of the application discloses a text classification method and device. One embodiment of the method comprises: acquiring a text to be classified; performing word segmentation on the text to be classified to obtain a word list; performing tone division on the characters in the text to be classified to obtain a tone combination list; determining a word vector for each word in the word list and a tone vector for each tone combination in the tone combination list; and inputting the obtained word vectors and tone vectors into a pre-trained text classification model to obtain a label representing the category of the text to be classified. This implementation combines word vectors and tone vectors, extracting the semantic and intonation features of the text along the two dimensions of words and tones; using these features effectively compensates for the shortcomings of character/word-level features, thereby improving the accuracy of text classification.

Description

Text classification method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a text classification method and device.
Background
Text classification, also known as text categorization, is one of the classic tasks of natural language processing. The purpose of the task is to assign predefined labels to text. The process of text classification is generally divided into two stages: feature extraction and label classification. In the first stage, features such as specific word combinations (for example, bigrams and trigrams), word frequencies, or term frequency-inverse document frequency (TF-IDF) statistics can be extracted with the help of a machine learning model; in the second stage, the information provided by these features allows the computer to form a relatively objective understanding and judgment of the attributes of the text. The traditional text classification task is performed under the guidance of this framework.
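A minimal sketch of this traditional two-stage pipeline, with scikit-learn as an illustrative stand-in (the corpus, labels, and library choice are assumptions for demonstration, not part of the patent):

```python
# Stage 1 (feature extraction) and stage 2 (label classification) of the
# traditional framework, on a toy English corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, works very well", "terrible, broke after a day"]
labels = [1, 0]  # predefined tags, e.g. positive / negative

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # unigram/bigram/trigram TF-IDF
    LogisticRegression(),                 # classify the feature vectors
)
clf.fit(texts, labels)
print(clf.predict(["works great"]))  # predicted tag for a new text
```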
With the development of deep learning, the end-to-end approach of deep learning has had a great impact on traditional text classification methods. Currently there are many deep-learning-based models. For example, the TextCNN model uses convolution kernels with multiple window sizes to extract semantic features at different levels, which works well for text classification. The FastText model computes word vectors for the n-grams of the words in the whole text, averages these word vectors to obtain a text representation, and then classifies it directly. More recently, research has tended toward pre-training text on large-scale corpora combined with deeper neural network structures, as typified by models such as BERT, ULMFiT, and ERNIE.
Disclosure of Invention
The embodiment of the application provides an improved text classification method and device.
In a first aspect, an embodiment of the present application provides a text classification method, where the method includes: acquiring a text to be classified; performing word segmentation on a text to be classified to obtain a word list; performing tone division on characters in a text to be classified to obtain a tone combination list; determining a word vector of each word in the word list and determining a tone vector of each tone combination in the tone combination list; and inputting the obtained word vector and the tone vector into a pre-trained text classification model to obtain a label for representing the category of the text to be classified.
In some embodiments, determining a word vector for each word in the list of words and determining a tone vector for each tone combination in the list of tone combinations comprises: determining a word identifier corresponding to each word in the word list from a preset dictionary; determining word vectors respectively corresponding to each word identifier from a preset word vector set; determining a tone combination identifier corresponding to a tone combination in a tone combination list from a preset tone dictionary; and determining a tone vector corresponding to each tone combination from a preset tone vector set.
In some embodiments, the dictionary and word vector set are obtained in advance according to the following steps: segmenting words of texts in a preset corpus to obtain a word list of each text; deleting stop words in each word list, deleting words with word frequency smaller than a preset word frequency threshold, and collecting all the remaining words to obtain a dictionary; and training the first neural network model by using the words in the dictionary through a machine learning method to obtain word vectors corresponding to each word in the dictionary, and combining the obtained word vectors into a word vector set.
In some embodiments, the tone dictionary and the tone vector set are obtained in advance according to the following steps: determining the tones of the characters included in each text in a corpus to obtain a tone sequence for each text; sequentially extracting a preset number of adjacent tones from each tone sequence to obtain a tone combination list corresponding to each text, and collecting all tone combinations to obtain a tone dictionary; and training a second neural network model through a machine learning method using the tone combinations in the tone dictionary to obtain tone vectors corresponding to each tone combination in the tone dictionary, and combining the obtained tone vectors into a tone vector set.
In some embodiments, the text classification model includes a word vector convolutional neural network, a tone vector convolutional neural network; inputting the obtained word vector and the tone vector into a pre-trained text classification model to obtain a label for representing the category of the text to be classified, wherein the label comprises the following steps: inputting the obtained word vectors into a word vector convolution neural network to obtain word characteristic data; inputting the obtained tone vector into a tone vector convolution neural network to obtain tone characteristic data; smoothing the word vector and the tone vector respectively to obtain a semantic average feature vector and a tone average feature vector; and classifying by using the word characteristic data, the tone characteristic data, the semantic average characteristic vector and the tone average characteristic vector to obtain a label representing the category of the text to be classified.
In some embodiments, smoothing the word vector and the tone vector to obtain a semantic average feature vector and a tone average feature vector includes: determining the mean value of elements at the same position in each obtained word vector to obtain a semantic average feature vector; and determining the average value of the elements at the same position in each tone vector to obtain the tone average characteristic vector.
In some embodiments, the text classification model is obtained by training in advance according to the following steps: obtaining a sample text set, wherein each sample text in the sample text set corresponds to a pre-labeled label; for each sample text in the sample text set, performing word segmentation on the sample text to obtain a sample word list corresponding to the sample text, and determining a sample word vector of each sample word in the sample word list; performing tone division on characters in the sample text to obtain a sample tone combination list, and determining a sample tone vector of each sample tone combination in the sample tone combination list; and taking a sample word vector and a sample tone vector corresponding to a sample text in the sample text set as input, taking a label corresponding to the input sample word vector and the sample tone vector as expected output, and training to obtain a text classification model.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including: the acquisition module is used for acquiring texts to be classified; the word segmentation module is used for segmenting words of the text to be classified to obtain a word list; the dividing module is used for carrying out tone division on the characters in the text to be classified to obtain a tone combination list; the determining module is used for determining a word vector of each word in the word list and determining a tone vector of each tone combination in the tone combination list; and the classification module is used for inputting the obtained word vectors and the tone vectors into a pre-trained text classification model to obtain labels for representing the categories of the texts to be classified.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by one or more processors, cause the one or more processors to carry out a method as described in any one of the implementations of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
According to the text classification method and device provided by the embodiments of the application, a word list is obtained by segmenting the text to be classified, and a tone combination list is obtained by performing tone division on the characters in the text to be classified; the word vector of each word in the word list and the tone vector of each tone combination in the tone combination list are then determined; finally, the obtained word vectors and tone vectors are input into a pre-trained text classification model to obtain a label representing the category of the text to be classified. In this way, word vectors and tone vectors are combined, and the semantic and intonation features of the text are extracted along the two dimensions of words and tones; using these features effectively compensates for the shortcomings of character/word-level features and improves the accuracy of text classification.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a text classification method according to the present application;
FIG. 3 is a flow diagram of yet another embodiment of a text classification method according to the present application;
FIG. 4 is a schematic diagram of a structure of a text classification model according to the application;
FIG. 5 is a flow diagram of yet another embodiment of a text classification method according to the present application;
FIG. 6 is a schematic diagram of an embodiment of a text classification device according to the application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Summary of the application
The preprocessing required by a text classification task also varies with the language family. Unlike Latin-script languages such as English, the Chinese text classification task requires word segmentation during preprocessing; because there are no clear boundaries between Chinese words, the segmentation results often introduce considerable noise. Conversely, if features are extracted only at the character granularity, the semantics they capture are deficient. The extracted features therefore have certain defects at both the character and the word granularity. The direct effect of these defects is that when a network model with a relatively simple structure, such as TextCNN or FastText, is used for a Chinese text classification task, the performance of the model is limited.
Models such as BERT can achieve the best results on various tasks after pre-training and fine-tuning, but their limitations are obvious. First, training such a model places almost prohibitive demands on hardware resources, and the training process is very difficult to reproduce. Second, in the process of transferring the model to a downstream task, the excessive number of parameters leaves the improvement and optimization of the model without clear guidance and increases the difficulty of parameter tuning. Finally, even when fine-tuning succeeds, the process consumes a large amount of time and is inefficient. In view of these limitations, such models are also difficult to apply and popularize. Aiming at these problems of the mainstream algorithms, the present application combines the characteristics of Chinese text to provide a new text classification method.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which the text classification method of embodiments of the present application may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The server 103 may be a server that provides various services, such as a text processing server that processes text uploaded by the terminal apparatus 101. The text processing server may process the received text and obtain a processing result (e.g., a label of the text).
It should be noted that the text classification method provided in the embodiment of the present application may also be executed by the terminal device 101 or the server 103, and accordingly, the text classification apparatus may be disposed in the terminal device 101 or the server 103.
It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, networks, intermediate devices, and servers, as required by the implementation. In the case where the text to be classified does not need to be acquired from a remote location, the system architecture may not include a network and may include only a server or a terminal device.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for text classification according to the present application is shown. The method comprises the following steps:
step 201, obtaining a text to be classified.
In this embodiment, an execution subject of the text classification method (e.g., a terminal device or a server shown in fig. 1) may acquire the text to be classified from a remote location or from a local location. The text to be classified may be various types of text, such as news, comments, literary works, and the like.
Step 202, performing word segmentation on the text to be classified to obtain a word list.
In this embodiment, the executing body may perform word segmentation on the text to be classified to obtain a word list. Specifically, the executing body may perform word segmentation using an existing Chinese word segmentation method, which may include, but is not limited to, at least one of the following: dictionary-based methods, statistics-based methods, rule-based methods, and the like.
As an example, if the text includes the sentence "the environment is closely related to the survival of humans", the word list obtained after word segmentation includes words such as: "environment", "and", "human", "survival", "closely related".
And step 203, performing tone division on the characters in the text to be classified to obtain a tone combination list.
In this embodiment, the executing body may perform tone division on the text to be classified to obtain a tone combination list. Specifically, the executing body may determine the tone of each character in the text to be classified to obtain the tone sequence of the text. Here, a tone is the basic pitch category of a Chinese character, of which there are five: neutral (轻声), level (平), rising (上), departing (去), and entering (入). The tones in the tone sequence are then divided to obtain a tone combination list composed of a plurality of tone combinations. As an example, following the 3-grams idea, each tone combination includes three adjacent tones, so the tone combination list may include combinations such as (rising, entering, rising), (entering, rising, neutral), and so on.
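A minimal sketch of this tone-division step, assuming the pypinyin library as the source of per-character tones (the patent does not name a tool, and Mandarin pinyin tone numbers stand in for the five tone categories here):

```python
from pypinyin import lazy_pinyin, Style

def tone_sequence(text):
    # Style.TONE3 appends the tone number (1-4) to each syllable;
    # syllables without a digit are neutral tone, mapped to 0 here.
    syllables = lazy_pinyin(text, style=Style.TONE3)
    return [int(s[-1]) if s[-1].isdigit() else 0 for s in syllables]

def tone_ngrams(tones, n=3):
    # Adjacent windows of n tones, in order (the 3-grams idea above).
    return [tuple(tones[i:i + n]) for i in range(len(tones) - n + 1)]

tones = tone_sequence("环境与人类的生存息息相关")  # illustrative sentence
print(tone_ngrams(tones, n=3))  # the text's tone combination list
```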
Step 204, determine a word vector for each word in the word list, and determine a tone vector for each tone combination in the tone combination list.
In this embodiment, the execution body may determine a word vector of each word in the word list, and determine a tone vector of each tone combination in the tone combination list. Specifically, as an example, the execution main body may determine a word vector corresponding to each word in the word list from a preset set of word vectors, and determine a tone vector corresponding to each tone combination in the tone combination list from a preset set of tone vectors.
The tone vector is a structured representation of tones; it reflects the syllabic regularities of the text and the coherence of its intonation. By introducing tones, these syllabic regularities and intonation-coherence features can be embedded into the underlying semantics of the words, thereby improving the task performance.
In some optional implementations of this embodiment, step 204 may be performed as follows:
first, a word identification corresponding to each word in the word list is determined from a preset dictionary.
The dictionary may include a plurality of words, each corresponding to a word Identification (ID). The dictionary may be obtained in advance in various ways, such as manually set, or extracted from a predetermined corpus.
Then, word vectors respectively corresponding to each word identifier are determined from a preset word vector set.
Wherein the set of word vectors comprises a plurality of word vectors, each word vector corresponding to a word identity. In general, the set of word vectors may be in the form of a mapping matrix, where each row of data of the mapping matrix is a word vector.
And then, determining a tone combination identifier corresponding to the tone combination in the tone combination list from a preset tone dictionary.
The tone dictionary may include a plurality of tone combinations, each tone combination corresponding to a tone combination Identification (ID). The tone dictionary can be obtained in advance through various ways, such as manual setting, or tone extraction according to characters in a preset corpus.
And finally, determining the tone vector corresponding to each tone combination from a preset tone vector set.
The tone vector set comprises a plurality of tone vectors, and each tone vector corresponds to one tone combination identifier. In general, the tone vector set may be in the form of a mapping matrix, where each row of data of the mapping matrix is a tone vector.
In this implementation, by using the dictionary and the tone dictionary, word vectors and tone vectors can be determined rapidly, which improves the efficiency of text classification.
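A minimal sketch of this lookup flow; the tiny dictionaries, dimensions, and random matrices below are illustrative placeholders for the preset dictionary, tone dictionary, and vector sets:

```python
import numpy as np

word_dict = {"环境": 0, "人类": 1, "生存": 2}  # word -> word id
tone_dict = {(2, 4, 3): 0, (4, 3, 2): 1}       # tone combination -> id

word_vectors = np.random.rand(len(word_dict), 128)  # one row per word id
tone_vectors = np.random.rand(len(tone_dict), 32)   # one row per combo id

def embed(tokens, vocab, matrix):
    ids = [vocab[t] for t in tokens if t in vocab]  # token -> identifier
    return matrix[ids]                              # identifier -> vector row

W = embed(["环境", "生存"], word_dict, word_vectors)  # word vector matrix
T = embed([(2, 4, 3)], tone_dict, tone_vectors)       # tone vector matrix
print(W.shape, T.shape)  # (2, 128) (1, 32)
```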
In some optional implementations of this embodiment, the dictionary and the word vector set may be obtained in advance according to the following steps:
firstly, performing word segmentation on texts in a preset corpus to obtain a word list of each text. The word segmentation method may be the same as the method described in step 202, and is not described herein again.
Then, the stop words in each word list are deleted, words whose word frequency is smaller than a preset word frequency threshold are deleted, and all the remaining words are collected to obtain the dictionary. A stop word is a word included in a preset stop word set; as an example, the stop word set may include uncommon words as well as non-standard words such as Internet slang.
And finally, training the first neural network model by using the words in the dictionary through a machine learning method to obtain word vectors corresponding to each word in the dictionary, and combining the obtained word vectors into a word vector set. The first neural network model may include various structures of neural networks, such as a convolutional neural network, a cyclic neural network, and the like.
Generally, when training the first neural network model, each number in a word vector mapping matrix may be initialized randomly, each row of the mapping matrix corresponds to a word in a dictionary, then parameters of the first neural network model are adjusted by a machine learning method, each number in the word vector mapping matrix is updated continuously, and after training is finished, the obtained word vector mapping matrix is a word vector set.
In this implementation, preprocessing the texts in the corpus before determining the word vector of each word makes the word vectors in the dictionary more targeted, which helps improve both the efficiency and the accuracy of text classification.
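A sketch of this preparation step. gensim's Word2Vec stands in for the unspecified first neural network model, and jieba for the unspecified segmenter; the two-sentence corpus, stop-word set, and frequency threshold are placeholders:

```python
from collections import Counter
import jieba
from gensim.models import Word2Vec

corpus = ["环境与人类的生存息息相关", "深度学习推动了文本分类的发展"]
stop_words = {"与", "的", "了"}
min_freq = 1  # preset word-frequency threshold

tokenized = [list(jieba.cut(text)) for text in corpus]
counts = Counter(w for words in tokenized for w in words)
dictionary = {w for w, c in counts.items()
              if w not in stop_words and c >= min_freq}

# Train embeddings on the filtered token lists; model.wv plays the role
# of the word vector set, keyed by the words of the dictionary.
filtered = [[w for w in words if w in dictionary] for words in tokenized]
model = Word2Vec(sentences=filtered, vector_size=128, min_count=1)
print(model.wv["环境"].shape)  # (128,)
```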
In some optional implementation manners of this embodiment, the tone dictionary may be obtained in advance according to the following steps:
first, the tone of a word included in each text in the corpus is determined, and a tone sequence of each text is obtained.
Then, a preset number of adjacent tones are sequentially extracted from each tone sequence to obtain the tone combination list corresponding to each text, and all tone combinations are collected to obtain the tone dictionary. The preset number can be set arbitrarily. For example, following the idea of n-grams, the value set of n can be set to {3, 4, 5}, and a tone combination list may then contain combinations such as (neutral, level, rising), (level, rising, departing), (neutral, level, rising, departing), and so on; each tone combination list contains all the tone n-gram combinations occurring in the corresponding text. It should be noted that all tone combinations need to be deduplicated to obtain the tone dictionary.
Finally, a second neural network model is trained through a machine learning method using the tone combinations in the tone dictionary to obtain a tone vector corresponding to each tone combination in the tone dictionary, and the obtained tone vectors are combined into a tone vector set.
The second neural network model may include various structures of neural networks, such as convolutional neural networks, cyclic neural networks, and the like.
Generally, when training the second neural network model, each number in a tone vector mapping matrix may be initialized at random, with each row of the mapping matrix corresponding to one tone combination in the tone dictionary; parameters of the second neural network model are then adjusted by a machine learning method, continually updating each number in the tone vector mapping matrix, and after training ends the resulting tone vector mapping matrix is the tone vector set. It should be noted that the second neural network model and the first neural network model may be two separate neural network models or may be integrated into a single neural network model.
When the tone dictionary is used, the tone sequence of the text to be classified can be traversed and segmented according to each value of n to form a tone n-grams list, and the tone n-grams are then converted into a list of tone ids through the tone dictionary. For example (using the five tone categories above), if the tone sequence of a text is (rising, entering, rising, neutral, departing), then the 3-grams list is [(rising, entering, rising), (entering, rising, neutral), (rising, neutral, departing)], the 4-grams list is [(rising, entering, rising, neutral), (entering, rising, neutral, departing)], and the 5-grams list is [(rising, entering, rising, neutral, departing)]; the final tone n-grams list is the union of the three. Each tone combination in the tone n-grams list is matched against the tone dictionary to obtain the tone vector corresponding to each tone combination.
In this implementation, tone sequences are obtained from the corpus, a preset number of adjacent tones are extracted from the tone sequences to obtain the tone combination lists, and a neural network is used to determine the tone vector corresponding to each tone combination. A tone dictionary covering a wide variety of tone combinations can thus be established comprehensively, which helps improve the accuracy of text classification.
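A sketch of the tone-dictionary construction just described, with n in {3, 4, 5} and toy tone sequences (0 for neutral, 1-4 for the four tones) as placeholders:

```python
def all_tone_ngrams(tones, sizes=(3, 4, 5)):
    # All adjacent n-gram tone combinations for each n in the value set.
    grams = []
    for n in sizes:
        grams.extend(tuple(tones[i:i + n])
                     for i in range(len(tones) - n + 1))
    return grams

tone_sequences = [[3, 4, 3, 0, 4], [2, 4, 1, 1, 3, 2]]  # toy corpus

# Collect and deduplicate all combinations into a combination -> id map.
tone_dictionary = {}
for seq in tone_sequences:
    for combo in all_tone_ngrams(seq):
        tone_dictionary.setdefault(combo, len(tone_dictionary))
print(len(tone_dictionary))
```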
Step 205, inputting the obtained word vector and the obtained tone vector into a pre-trained text classification model to obtain a label for representing the category of the text to be classified.
In this embodiment, the executing body may input the obtained word vectors and tone vectors into a pre-trained text classification model to obtain a label representing the category of the text to be classified. In general, the word vectors and tone vectors input to the text classification model may each be converted into matrix form. Each row of the word vector matrix corresponds to one word, and the number of columns in each row is the dimension of the word vector; each row of the tone vector matrix corresponds to one tone combination, and the number of columns in each row is the dimension of the tone vector.
The text classification model is used for representing the corresponding relation between word vectors and tone vectors of the text and labels of the text. As an example, the text classification model may include convolutional neural networks of various structures, and the convolutional neural networks may obtain semantic features representing text semantics and intonation features representing text intonation changes according to the input word vector matrix and the input intonation vector matrix, and classify the semantic features and the intonation features to obtain labels of the text.
Optionally, after obtaining the tag, the tag may be output in various manners, for example, displaying the tag at a corresponding position of a display, or managing and storing the tag and the text to be classified in a target memory (for example, a memory included in the execution main body or a memory of an electronic device communicatively connected to the execution main body).
According to the method provided by this embodiment of the application, a word list is obtained by segmenting the text to be classified, and a tone combination list is obtained by dividing the tones of the text to be classified; the word vector of each word in the word list and the tone vector of each tone combination in the tone combination list are determined; finally, the obtained word vectors and tone vectors are input into the pre-trained text classification model to obtain a label representing the category of the text to be classified. Word vectors and tone vectors are thereby combined, and the semantic and intonation features of the text are extracted along the two dimensions of words and tones; using these features effectively compensates for the shortcomings of character/word-level features and improves the accuracy of text classification.
With further reference to FIG. 3, shown is a flow diagram of yet another embodiment of a text classification method according to the present application. In the embodiment, the text classification model comprises a word vector convolution neural network and a tone vector convolution neural network. As shown in fig. 3, based on the embodiment shown in fig. 2, step 205 may include the following steps:
step 2051, the obtained word vectors are input into a word vector convolution neural network to obtain word feature data.
In this embodiment, the execution main body may input the obtained word vector into a word vector convolution neural network to obtain word feature data. The word vector convolutional neural network may include, for example, convolutional layers, pooling layers, feature layers, and the like. The convolution layer is used for performing convolution operation on input word vectors to obtain a feature map, the pooling layer is used for performing dimension reduction on the feature map, and the feature layer is used for generating word feature data according to data output by the pooling layer.
And step 2052, inputting the obtained tone vector into a tone vector convolution neural network to obtain tone characteristic data.
In this embodiment, the executing entity may input the obtained tone vector into a tone vector convolution neural network to obtain tone feature data. The tone vector convolutional neural network is similar to the word vector convolutional neural network, and is not described herein again.
As an example, the obtained word vectors and tone vectors can be expressed as:

W_i = [w_i1, w_i2, …, w_ij, …, w_in], 1 ≤ i ≤ m, 1 ≤ j ≤ n
T_i = [t_i1, t_i2, …, t_iz, …, t_ig], 1 ≤ i ≤ m, 1 ≤ z ≤ g

where W_i denotes all word vectors of the i-th text (i.e., the text to be classified) after word segmentation and w_ij is the word vector of the j-th word of the i-th text; m is the number of texts, n is the number of word vectors, and g is the number of tone vectors; T_i denotes all tone vectors of the i-th text after tone division, and t_iz is the z-th tone vector of the i-th text. Assuming the lengths of the word vectors and tone vectors are P and Q respectively, W_i ∈ R^(N×P) can be represented as an N×P matrix and T_i ∈ R^(G×Q) as a G×Q matrix.

Convolution kernels c_h with multiple window sizes are used as local receptive fields to extract features from the different representations of the text (separate kernels extract features from the word representation and from the tone representation of the text), where h denotes the window size of the kernel. With a stride of 1, the convolution process can be expressed as:

f_s = relu(c_h · e_j:j+h + b), 1 ≤ s ≤ n - h + 1

where relu indicates that the convolution uses the ReLU activation function for nonlinear activation, b is the bias term of the convolution (which, in consideration of network training efficiency, may be preset to 0), e_j:j+h is the text vector of the local region that the convolution kernel can perceive, and f_s is the local feature extracted by the convolution kernel. The extracted feature map is:

F_h = [f_1, …, f_s, …, f_(n-h+1)], 1 ≤ s ≤ n - h + 1
In order to retain the features of the text to the greatest extent while reducing the data dimension as much as possible, feature aggregation is performed on the feature maps of the text using max pooling: the maximum value in each feature map is selected as the final characterization of feature extraction, and the features extracted by convolution kernels of different windows are then spliced together to obtain the output of the convolutional neural network, i.e. a feature vector of the form [max(F_h1), max(F_h2), …] over the different window sizes h1, h2, ….
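A sketch, in PyTorch, of the multi-window convolution and max pooling described above, applied to one text's word vector matrix (the same structure applies to the tone vector matrix); window sizes, channel counts, and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, P = 20, 128            # N words, word vectors of length P
W = torch.randn(1, P, N)  # (batch, channels, sequence) layout for Conv1d

convs = nn.ModuleList([
    nn.Conv1d(in_channels=P, out_channels=64, kernel_size=h)
    for h in (2, 3, 4)    # multi-window convolution kernels c_h
])

pooled = []
for conv in convs:
    F_h = F.relu(conv(W))                 # feature map of length N - h + 1
    pooled.append(F_h.max(dim=2).values)  # max pooling over the feature map
features = torch.cat(pooled, dim=1)       # splice features of all windows
print(features.shape)                     # torch.Size([1, 192])
```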
and step 2053, respectively smoothing the word vectors and the tone vectors to obtain semantic average feature vectors and tone average feature vectors.
In this embodiment, the execution body may perform smoothing on the word vector and the tone vector, respectively, to obtain a semantic average feature vector and a tone average feature vector.
In some optional implementations of this embodiment, the executing body may determine the semantic average feature vector and the intonation average feature vector according to the following steps:
firstly, determining the average value of elements at the same position in each obtained word vector to obtain a semantic average feature vector.
Then, determining the average value of the elements at the same position in each obtained tone vector to obtain the tone average feature vector.
Specifically, the semantic average feature vector and the tone average feature vector may be determined as shown in the following equation (1):

w_avg = (1/N) Σ_{j=1}^{N} w_j,    t_avg = (1/G) Σ_{z=1}^{G} t_z    (1)

where w_avg is the semantic average feature vector, t_avg is the tone average feature vector, w_j is a word vector, t_z is a tone vector, N is the number of word vectors, and G is the number of tone vectors.
It should be understood that the method shown in equation (1) is only an example. Besides the method shown in equation (1), other methods may be adopted to smooth the word vectors and tone vectors; for example, a weighted sum may be taken over the elements at the same position in each vector.
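A minimal sketch of the smoothing of equation (1), taking the element-wise mean over random placeholder vectors:

```python
import numpy as np

W = np.random.rand(20, 128)  # N word vectors of length P
T = np.random.rand(15, 32)   # G tone vectors of length Q

w_avg = W.mean(axis=0)  # mean of elements at the same position
t_avg = T.mean(axis=0)  # tone average feature vector
print(w_avg.shape, t_avg.shape)  # (128,) (32,)
```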
And step 2054, classifying by using the word feature data, the tone feature data, the semantic average feature vector and the tone average feature vector to obtain a label representing the category of the text to be classified.
In this embodiment, the execution main body may perform classification by using word feature data, tone feature data, semantic average feature vector, and tone average feature vector, to obtain a tag representing a category of the text to be classified.
Continuing with the above example, after feature extraction the final features of the text consist of four parts: the semantic average feature vector w_avg, the tone average feature vector t_avg, the word features extracted by the word vector convolutional neural network, and the tone features extracted by the tone vector convolutional neural network. All of these text features can be combined as the basis for classification.

The classification is built on a fully connected layer. During training, nodes of the fully connected layer are randomly discarded (dropout) to prevent overfitting, and the category to which the text belongs is then scored using a normalized exponential (softmax) function:

ŷ = softmax(U · x + b_u)

where ŷ represents the category of the finally predicted text, x is the combined feature vector, U is the parameter matrix of the classification layer, and b_u is the bias term of the classification layer.
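A sketch of this classification layer in PyTorch; the four feature parts, their dimensions, and the four-category output are illustrative placeholders:

```python
import torch
import torch.nn as nn

w_cnn = torch.randn(1, 192)  # word features from the word vector CNN
t_cnn = torch.randn(1, 96)   # tone features from the tone vector CNN
w_avg = torch.randn(1, 128)  # semantic average feature vector
t_avg = torch.randn(1, 32)   # tone average feature vector

x = torch.cat([w_cnn, t_cnn, w_avg, t_avg], dim=1)  # combine all features

dropout = nn.Dropout(p=0.5)           # randomly discard nodes in training
classifier = nn.Linear(x.size(1), 4)  # parameter matrix U and bias b_u

logits = classifier(dropout(x))
probs = torch.softmax(logits, dim=1)  # normalized exponential scoring
print(probs)
```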
Referring now to fig. 4, a schematic diagram of a structure of a text classification model provided according to the present embodiment is shown. As shown in fig. 4, each word in the word list is subjected to word mapping (i.e. lookup from the dictionary) according to the word id to obtain a word vector; and performing tone mapping (namely searching from a tone dictionary) on each tone combination in the tone combination list to obtain a tone vector. And respectively inputting the word vectors into the word vector smoothing module and the word vector convolution neural network, respectively inputting the tone vectors into the tone vector smoothing module and the tone vector convolution neural network, outputting four kinds of characteristic data, connecting the four kinds of characteristic data, and classifying the connected data to obtain the label of the text to be classified.
As can be seen from fig. 3, the method provided in this embodiment obtains information characterizing word context and information characterizing tone coherence through word vector smoothing and tone vector smoothing. The smoothing aggregates the context information contained in the word vectors and the tonal regularities contained in the tone vectors into average feature vectors of the semantics and intonation of the text, while reducing as much as possible the noise introduced by the different vector representations of the text. Combined with the features output by the word vector convolutional neural network and the tone vector convolutional neural network, this improves the accuracy of text classification.
With further reference to FIG. 5, shown is a flow diagram of yet another embodiment of a text classification method in accordance with the present application. As shown in fig. 5, the text classification model may be obtained by training in advance according to the following steps:
step 501, a sample text set is obtained, wherein each sample text in the sample text set corresponds to a pre-labeled label. The sample text collection may include, among other things, various text, such as news, reviews, literary works, and so forth.
Step 502, for each sample text in the sample text set, performing the following steps:
step 5021, performing word segmentation on the sample text to obtain a sample word list corresponding to the sample text, and determining a sample word vector of each sample word in the sample word list. The method for segmenting the sample text and the method for determining the word vector may be the same as the method described in the embodiment corresponding to fig. 2, and are not described herein again.
Step 5022, tone division is carried out on characters in the sample text to obtain a sample tone combination list, and a sample tone vector of each sample tone combination in the sample tone combination list is determined. The method for dividing the tones and the method for determining the tone vectors may be the same as the method described in the embodiment corresponding to fig. 2, and are not described herein again.
Step 503, taking the sample word vector and the sample tone vector corresponding to the sample text in the sample text set as input, taking the input sample word vector and the label corresponding to the sample tone vector as expected output, and training to obtain a text classification model.
Specifically, the executing body that trains the text classification model may, through a machine learning method, train an initial model (for example, the model having the structure shown in fig. 4) by taking the sample word vectors and sample tone vectors corresponding to the same sample text obtained in step 502 as input and the label corresponding to the input sample word vectors and sample tone vectors as the expected output, obtaining an actual output for each piece of training input data. The actual output is the data actually output by the initial model and is used to represent the probability of the category to which the sample text belongs. Then, the executing body may use gradient descent and back propagation to adjust the parameters of the initial model based on the actual output and the expected output, take the model obtained after each parameter adjustment as the initial model for the next round of training, and end training when a preset training end condition is met, thereby obtaining the text classification model.
Here, the execution subject may train the initial model by using a batch training method, or may train the initial model by using a random training method, which is not limited in the embodiment of the present application.
It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the loss value calculated using a predetermined loss function (e.g., a cross entropy loss function) is less than a predetermined loss value threshold.
As an example, the direction of learning can be optimized by minimizing the cross entropy between the class distribution predicted for the text and the true distribution, using algorithms such as SGD, MBGD, or Adam as the optimization algorithm for model learning. The loss function of the model can be expressed as the cross entropy:

L = -Σ_i y_i · log(ŷ_i)

where y_i and ŷ_i are the true label of the document and the label predicted by the model, respectively. Through iterative training, the trained text classification model is finally obtained.
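A sketch of this training loop in PyTorch, assuming `model` is the full text classification model and `batches` yields prepared (sample word vectors, sample tone vectors, label) tensors; the optimizer choice and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

def train(model, batches, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()  # cross entropy of predicted vs true
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for word_vecs, tone_vecs, labels in batches:
            optimizer.zero_grad()
            logits = model(word_vecs, tone_vecs)  # actual output
            loss = criterion(logits, labels)      # loss vs expected output
            loss.backward()                       # back propagation
            optimizer.step()                      # parameter adjustment
    return model
```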
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present application provides an embodiment of a text classification apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 6, the text classification apparatus 600 of the present embodiment includes: an obtaining module 601, configured to obtain a text to be classified; a word segmentation module 602, configured to perform word segmentation on a text to be classified to obtain a word list; a dividing module 603, configured to perform tone division on the text to be classified to obtain a tone combination list; a determining module 604, configured to determine a word vector of each word in the word list, and determine a tone vector of each tone combination in the tone combination list; and the classification module 605 is configured to input the obtained word vectors and tone vectors into a pre-trained text classification model, so as to obtain a label for representing a category of the text to be classified.
In this embodiment, the obtaining module 601 may obtain the text to be classified from a remote location or a local location. The text to be classified may be various types of text, such as news, comments, literary works, and the like.
In this embodiment, the word segmentation module 602 may perform word segmentation on the text to be classified to obtain a word list. Specifically, the word segmentation module 602 may perform word segmentation using an existing Chinese word segmentation method, which may include, but is not limited to, at least one of the following: dictionary-based methods, statistics-based methods, rule-based methods, and the like.
As an example, if the text includes the sentence "the environment is closely related to the survival of humans", the word list obtained after word segmentation includes words such as: "environment", "and", "human", "survival", "closely related".
In this embodiment, the dividing module 603 may perform tone division on the text to be classified to obtain a tone combination list. Specifically, the dividing module 603 may determine the tone of each character in the text to be classified to obtain the tone sequence of the text. Here, a tone is the basic pitch category of a Chinese character, of which there are five: neutral (轻声), level (平), rising (上), departing (去), and entering (入). The tones in the tone sequence are then divided to obtain a tone combination list composed of a plurality of tone combinations. As an example, following the 3-grams idea, each tone combination includes three adjacent tones, so the tone combination list may include combinations such as (rising, entering, rising), (entering, rising, neutral), and so on.
In this embodiment, the determining module 604 may determine a word vector for each word in the word list and determine a tone vector for each tone combination in the list of tone combinations. Specifically, as an example, the determining module 604 may determine a word vector corresponding to each word in the word list from a preset set of word vectors, and determine a tone vector corresponding to each tone combination in the tone combination list from a preset set of tone vectors.
The tone vector is a structured representation of tones; it reflects the syllabic regularities of the text and the coherence of its intonation. By introducing tones, these syllabic regularities and intonation-coherence features can be embedded into the underlying semantics of the words, thereby improving the task performance.
In this embodiment, the classification module 605 may input the obtained word vectors and tone vectors into a pre-trained text classification model to obtain a label representing the category of the text to be classified. In general, the word vectors and tone vectors input to the text classification model may each be converted into matrix form. Each row of the word vector matrix corresponds to one word, and the number of columns in each row is the dimension of the word vector; each row of the tone vector matrix corresponds to one tone combination, and the number of columns in each row is the dimension of the tone vector.
The text classification model is used for representing the corresponding relation between word vectors and tone vectors of the text and labels of the text. As an example, the text classification model may include convolutional neural networks of various structures, and the convolutional neural networks may obtain semantic features representing text semantics and intonation features representing text intonation changes according to the input word vector matrix and the input intonation vector matrix, and classify the semantic features and the intonation features to obtain labels of the text.
Optionally, after obtaining the tag, the tag may be output in various manners, such as displaying the tag at a corresponding position on a display, or managing and storing the tag and the text to be classified in a target memory (e.g., a memory included in the apparatus or a memory of an electronic device communicatively connected to the apparatus).
In some optional implementations of this embodiment, the determining module 604 may include: a first determining unit (not shown in the figure) for determining a word identifier corresponding to each word in the word list from a preset dictionary; a second determining unit (not shown in the figure) configured to determine, from a preset word vector set, word vectors respectively corresponding to each word identifier; a third determining unit (not shown in the figure) for determining a tone combination identifier corresponding to a tone combination in the tone combination list from a preset tone dictionary; a fourth determining unit (not shown in the figure) for determining a tone vector corresponding to each tone combination from a preset set of tone vectors.
In some optional implementations of this embodiment, the dictionary and the word vector set are obtained in advance according to the following steps: segmenting words of texts in a preset corpus to obtain a word list of each text; deleting stop words in each word list, deleting words with word frequency smaller than a preset word frequency threshold, and collecting all the remaining words to obtain a dictionary; and training the first neural network model by using the words in the dictionary through a machine learning method to obtain word vectors corresponding to each word in the dictionary, and combining the obtained word vectors into a word vector set.
In some optional implementations of this embodiment, the tone dictionary and the tone vector set are obtained in advance according to the following steps: determining the tones of the characters included in each text in the corpus to obtain a tone sequence for each text; sequentially extracting a preset number of adjacent tones from each tone sequence to obtain the tone combination list corresponding to each text, and collecting all tone combinations to obtain the tone dictionary; and training a second neural network model through a machine learning method using the tone combinations in the tone dictionary to obtain a tone vector corresponding to each tone combination in the tone dictionary, and combining the obtained tone vectors into a tone vector set.
In some optional implementations of this embodiment, the text classification model may include a word vector convolutional neural network, a tone vector convolutional neural network; and the classification module 605 may include: a first obtaining unit (not shown in the figure) for inputting the obtained word vector into a word vector convolution neural network to obtain word feature data; a second obtaining unit (not shown in the figure) for inputting the obtained tone vector into the tone vector convolution neural network to obtain tone feature data; a third obtaining unit (not shown in the figure), configured to perform smoothing on the word vector and the tone vector, respectively, to obtain a semantic average feature vector and a tone average feature vector; and a classification unit (not shown in the figure) for classifying by using the word feature data, the tone feature data, the semantic average feature vector and the tone average feature vector to obtain a label representing a category of the text to be classified.
In some optional implementation manners of this embodiment, the third obtaining unit may include: a first determining subunit (not shown in the figure), configured to determine a mean value of elements at the same position in each obtained word vector, so as to obtain a semantic average feature vector; and a second determining subunit (not shown in the figure) configured to determine a mean of elements at the same position in each obtained tone vector, so as to obtain a tone average feature vector.
In some optional implementations of this embodiment, the text classification model may be obtained by training in advance according to the following steps: obtaining a sample text set, wherein each sample text in the sample text set corresponds to a pre-labeled label; for each sample text in the sample text set, performing word segmentation on the sample text to obtain a sample word list corresponding to the sample text, and determining a sample word vector of each sample word in the sample word list; performing tone division on characters in the sample text to obtain a sample tone combination list, and determining a sample tone vector of each sample tone combination in the sample tone combination list; and taking a sample word vector and a sample tone vector corresponding to a sample text in the sample text set as input, taking a label corresponding to the input sample word vector and the sample tone vector as expected output, and training to obtain a text classification model.
The device provided by the above embodiment of the application obtains a word list by performing word segmentation on the text to be classified, obtains a tone combination list by performing tone division on the characters in the text to be classified, determines the word vector of each word in the word list and the tone vector of each tone combination in the tone combination list, and finally inputs the obtained word vectors and tone vectors into a pre-trained text classification model to obtain a label representing the category of the text to be classified. Word vectors and tone vectors are thereby combined, and the semantic and intonation features of the text are extracted along the two dimensions of words and tones, which effectively compensates for the shortcomings of character/word-level features and improves the accuracy of text classification.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU)701, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable storage medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable storage medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter cases, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should further be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or by hardware. The described modules may also be provided in a processor, which may, for example, be described as: a processor comprising an acquisition module, a word segmentation module, a division module, a determination module, and a classification module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves; for example, the acquisition module may also be described as a "module for acquiring a text to be classified".
As another aspect, the present application also provides a computer readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist alone without being assembled into the electronic device. The computer readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text to be classified; perform word segmentation on the text to be classified to obtain a word list; perform tone division on the characters in the text to be classified to obtain a tone combination list; determine a word vector for each word in the word list and a tone vector for each tone combination in the tone combination list; and input the obtained word vectors and tone vectors into a pre-trained text classification model to obtain a label representing the category of the text to be classified.
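By way of illustration and not limitation, the stored-program flow above can be sketched in Python roughly as follows. The sketch assumes jieba for word segmentation and pypinyin for tone annotation, and a classify_model object with a predict method; none of these names come from the application itself.

    import jieba                          # third-party word segmenter (an assumption)
    from pypinyin import pinyin, Style    # third-party pinyin/tone annotator (an assumption)

    def classify(text, word_vectors, tone_vectors, classify_model, n=2):
        # word list via word segmentation
        words = jieba.lcut(text)
        # one tone digit per character: '1'-'4' for the four Mandarin tones;
        # neutral-tone and non-tonal characters map to '0' in this sketch
        tones = [s[0][-1] if s[0][-1].isdigit() else '0'
                 for s in pinyin(text, style=Style.TONE3)]
        # tone combination list: n adjacent tones per combination
        combos = [''.join(tones[i:i + n]) for i in range(len(tones) - n + 1)]
        # look up pre-computed vectors, then classify
        w_vecs = [word_vectors[w] for w in words if w in word_vectors]
        t_vecs = [tone_vectors[c] for c in combos if c in tone_vectors]
        return classify_model.predict(w_vecs, t_vecs)   # label for the text category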
The above description is only a preferred embodiment of the present application and an illustration of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention disclosed herein is not limited to the particular combination of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, solutions in which the above features are interchanged with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method of text classification, the method comprising:
acquiring a text to be classified;
performing word segmentation on the text to be classified to obtain a word list;
performing tone division on the characters in the text to be classified to obtain a tone combination list;
determining a word vector for each word in the word list and determining a tone vector for each tone combination in the tone combination list;
and inputting the obtained word vector and the tone vector into a pre-trained text classification model to obtain a label for representing the category of the text to be classified.
2. The method of claim 1, wherein determining a word vector for each word in the word list and determining a tone vector for each tone combination in the tone combination list comprises:
determining a word identifier corresponding to each word in the word list from a preset dictionary;
determining word vectors respectively corresponding to each word identifier from a preset word vector set;
determining a tone combination identifier corresponding to each tone combination in the tone combination list from a preset tone dictionary;
and determining a tone vector corresponding to each tone combination identifier from a preset tone vector set.
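A minimal sketch of the two-stage lookup recited in claim 2 follows; the dictionary-to-identifier mapping, the vector set, and the unk_id fallback are illustrative assumptions, and the same pattern would apply to tone combinations against the tone dictionary and tone vector set:

    def look_up_vectors(tokens, dictionary, vector_set, unk_id=0):
        # dictionary: token -> integer identifier (a word or a tone combination)
        # vector_set: integer identifier -> pre-trained vector
        ids = [dictionary.get(t, unk_id) for t in tokens]
        return [vector_set[i] for i in ids]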
3. The method of claim 2, wherein the dictionary and the word vector set are obtained in advance by:
segmenting words of texts in a preset corpus to obtain a word list of each text;
deleting stop words from each word list, deleting words whose word frequency is lower than a preset word frequency threshold, and collecting all remaining words to obtain the dictionary;
and training, by a machine learning method, a first neural network model using the words in the dictionary to obtain a word vector corresponding to each word in the dictionary, the obtained word vectors being combined into the word vector set.
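One possible realization of claim 3 is sketched below; gensim's Word2Vec stands in for the "first neural network model", and the stop-word list, frequency threshold, and vector dimensionality are illustrative assumptions:

    from collections import Counter
    import jieba
    from gensim.models import Word2Vec   # one possible "first neural network model"

    def build_dictionary_and_word_vectors(corpus, stop_words, min_freq=5, dim=100):
        token_lists = [jieba.lcut(text) for text in corpus]
        freq = Counter(w for tokens in token_lists for w in tokens)
        # drop stop words and low-frequency words; the remainder is the dictionary
        vocab = {w for w, c in freq.items() if c >= min_freq and w not in stop_words}
        filtered = [[w for w in tokens if w in vocab] for tokens in token_lists]
        model = Word2Vec(sentences=filtered, vector_size=dim, min_count=1)
        dictionary = {w: i for i, w in enumerate(sorted(vocab))}
        word_vector_set = {dictionary[w]: model.wv[w] for w in vocab if w in model.wv}
        return dictionary, word_vector_set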
4. The method of claim 3, wherein the tone dictionary and the tone vector set are obtained in advance by:
determining the tone of characters included in each text in the corpus to obtain a tone sequence of each text;
sequentially extracting a preset number of adjacent tones from each tone sequence to obtain a tone combination list corresponding to each text, and collecting all tone combinations to obtain the tone dictionary;
and training, by a machine learning method, a second neural network model using the tone combinations in the tone dictionary to obtain a tone vector corresponding to each tone combination in the tone dictionary, the obtained tone vectors being combined into the tone vector set.
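For claim 4, tone sequences and sliding-window tone combinations might be derived as follows; pypinyin and the window length n are assumptions, and the tone vectors could then be trained over the combination sequences in the same way as the word vectors above:

    from pypinyin import pinyin, Style

    def tone_sequence(text):
        # one digit per character: '1'-'4' for the four Mandarin tones, '0' otherwise
        return [s[0][-1] if s[0][-1].isdigit() else '0'
                for s in pinyin(text, style=Style.TONE3)]

    def tone_combinations(tones, n=2):
        # a preset number n of adjacent tones per combination, e.g. '31', '14', ...
        return [''.join(tones[i:i + n]) for i in range(len(tones) - n + 1)]

    def build_tone_dictionary(corpus, n=2):
        # tone dictionary: every combination observed anywhere in the corpus
        combos = {c for text in corpus
                  for c in tone_combinations(tone_sequence(text), n)}
        return {c: i for i, c in enumerate(sorted(combos))}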
5. The method of claim 1, wherein the text classification model comprises a word vector convolutional neural network and a tone vector convolutional neural network; and
the inputting of the obtained word vectors and tone vectors into the pre-trained text classification model to obtain a label representing the category of the text to be classified comprises:
inputting the obtained word vectors into the word vector convolutional neural network to obtain word feature data;
inputting the obtained tone vectors into the tone vector convolutional neural network to obtain tone feature data;
smoothing the word vectors and the tone vectors respectively to obtain a semantic average feature vector and a tone average feature vector;
and performing classification using the word feature data, the tone feature data, the semantic average feature vector, and the tone average feature vector to obtain a label representing the category of the text to be classified.
6. The method according to claim 5, wherein smoothing the word vectors and the tone vectors to obtain a semantic average feature vector and a tone average feature vector comprises:
determining the mean of the elements at the same position across the obtained word vectors to obtain the semantic average feature vector;
and determining the mean of the elements at the same position across the tone vectors to obtain the tone average feature vector.
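One plausible PyTorch reading of claims 5 and 6 is sketched below: one convolutional branch per channel, element-wise means of the raw vector sequences as the two average feature vectors, and a linear classifier over their concatenation. The filter count, kernel size, and pooling choice are assumptions, not taken from the application:

    import torch
    import torch.nn as nn

    class DualChannelTextClassifier(nn.Module):
        def __init__(self, word_dim, tone_dim, n_classes, n_filters=64, kernel=3):
            super().__init__()
            # word vector convolutional neural network
            self.word_conv = nn.Sequential(
                nn.Conv1d(word_dim, n_filters, kernel), nn.ReLU(),
                nn.AdaptiveMaxPool1d(1))
            # tone vector convolutional neural network
            self.tone_conv = nn.Sequential(
                nn.Conv1d(tone_dim, n_filters, kernel), nn.ReLU(),
                nn.AdaptiveMaxPool1d(1))
            self.fc = nn.Linear(2 * n_filters + word_dim + tone_dim, n_classes)

        def forward(self, word_vecs, tone_vecs):
            # word_vecs: (batch, word_seq_len, word_dim); tone_vecs analogous
            w_feat = self.word_conv(word_vecs.transpose(1, 2)).squeeze(-1)  # word feature data
            t_feat = self.tone_conv(tone_vecs.transpose(1, 2)).squeeze(-1)  # tone feature data
            w_avg = word_vecs.mean(dim=1)   # semantic average feature vector (claim 6)
            t_avg = tone_vecs.mean(dim=1)   # tone average feature vector (claim 6)
            return self.fc(torch.cat([w_feat, t_feat, w_avg, t_avg], dim=-1))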
7. The method according to any one of claims 1 to 6, wherein the text classification model is trained in advance by the following steps:
obtaining a sample text set, wherein each sample text in the sample text set corresponds to a pre-labeled label;
for each sample text in the sample text set, performing word segmentation on the sample text to obtain a sample word list corresponding to the sample text, and determining a sample word vector of each sample word in the sample word list; performing tone division on characters in the sample text to obtain a sample tone combination list, and determining a sample tone vector of each sample tone combination in the sample tone combination list;
and training the text classification model by taking the sample word vectors and sample tone vectors corresponding to the sample texts in the sample text set as input, and taking the labels corresponding to the input sample word vectors and sample tone vectors as expected output.
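A minimal training loop matching claim 7, assuming the model sketched above and a data loader that yields (sample word vectors, sample tone vectors, label) triples; cross-entropy and Adam are illustrative choices the application does not specify:

    import torch
    import torch.nn as nn

    def train_text_classifier(model, loader, epochs=5, lr=1e-3):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()   # compares prediction with the expected label
        for _ in range(epochs):
            for word_vecs, tone_vecs, labels in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(word_vecs, tone_vecs), labels)
                loss.backward()
                optimizer.step()
        return model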
8. An apparatus for classifying text, the apparatus comprising:
the acquisition module is used for acquiring texts to be classified;
the word segmentation module is used for segmenting words of the text to be classified to obtain a word list;
the dividing module is used for carrying out tone division on the characters in the text to be classified to obtain a tone combination list;
a determining module for determining a word vector for each word in the word list and determining a tone vector for each tone combination in the tone combination list;
and the classification module is used for inputting the obtained word vectors and the tone vectors into a pre-trained text classification model to obtain labels for representing the classes of the texts to be classified.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
CN201911326228.9A 2019-12-20 2019-12-20 Text classification method and device Active CN111078887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911326228.9A CN111078887B (en) 2019-12-20 2019-12-20 Text classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911326228.9A CN111078887B (en) 2019-12-20 2019-12-20 Text classification method and device

Publications (2)

Publication Number Publication Date
CN111078887A (en) 2020-04-28
CN111078887B (en) 2022-04-29

Family

ID=70316362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911326228.9A Active CN111078887B (en) 2019-12-20 2019-12-20 Text classification method and device

Country Status (1)

Country Link
CN (1) CN111078887B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119443A (en) * 2018-01-18 2019-08-13 中国科学院声学研究所 A kind of sentiment analysis method towards recommendation service
US20190325897A1 (en) * 2018-04-21 2019-10-24 International Business Machines Corporation Quantifying customer care utilizing emotional assessments
CN109543036A (en) * 2018-11-20 2019-03-29 四川长虹电器股份有限公司 Text Clustering Method based on semantic similarity
CN109697232A (en) * 2018-12-28 2019-04-30 四川新网银行股份有限公司 A kind of Chinese text sentiment analysis method based on deep learning
CN110147550A (en) * 2019-04-23 2019-08-20 南京邮电大学 Pronunciation character fusion method neural network based

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭威彤, 杨鸿武, 宋继华, et al.: "面向方言语音合成的文本分析研究" [Text analysis for dialect-oriented speech synthesis], 《计算机工程》 [Computer Engineering] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563379A (en) * 2020-05-12 2020-08-21 厦门市美亚柏科信息股份有限公司 Text recognition method and device based on Chinese word vector model and storage medium
CN111563379B (en) * 2020-05-12 2022-12-02 厦门市美亚柏科信息股份有限公司 Text recognition method and device based on Chinese word vector model and storage medium
CN112100385A (en) * 2020-11-11 2020-12-18 震坤行网络技术(南京)有限公司 Single label text classification method, computing device and computer readable storage medium
CN112100385B (en) * 2020-11-11 2021-02-09 震坤行网络技术(南京)有限公司 Single label text classification method, computing device and computer readable storage medium

Also Published As

Publication number Publication date
CN111078887B (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN110347835B (en) Text clustering method, electronic device and storage medium
CN109214386B (en) Method and apparatus for generating image recognition model
US20190163742A1 (en) Method and apparatus for generating information
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN107861954B (en) Information output method and device based on artificial intelligence
CN111078887B (en) Text classification method and device
CN109408824B (en) Method and device for generating information
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN107766498B (en) Method and apparatus for generating information
WO2022007823A1 (en) Text data processing method and device
CN112507190B (en) Method and system for extracting keywords of financial and economic news
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
CN112633004A (en) Text punctuation deletion method and device, electronic equipment and storage medium
CN111191242A (en) Vulnerability information determination method and device, computer readable storage medium and equipment
CN113408507B (en) Named entity identification method and device based on resume file and electronic equipment
CN114579876A (en) False information detection method, device, equipment and medium
US20220319493A1 (en) Learning device, learning method, learning program, retrieval device, retrieval method, and retrieval program
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN112188311B (en) Method and apparatus for determining video material of news
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN111723188A (en) Sentence display method and electronic equipment based on artificial intelligence for question-answering system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant