CN109033307B

CN109033307B - CRP clustering-based word multi-prototype vector representation and word sense disambiguation method

Info

Publication number: CN109033307B
Application number: CN201810783010.5A
Authority: CN
Inventors: 李国佳; 郭鸿奇; 杨喜亮; 王国卿; 杨振中
Original assignee: North China University of Water Resources and Electric Power
Current assignee: North China University of Water Resources and Electric Power
Priority date: 2018-07-17
Filing date: 2018-07-17
Publication date: 2021-08-31
Anticipated expiration: 2038-07-17
Also published as: CN109033307A

Abstract

The invention discloses a CRP clustering-based word multi-prototype vector representation and word sense disambiguation method, which comprises two steps. Step one: the texts in a massive text corpus are cleaned and preprocessed to obtain plain text; the context window representations of each target polysemous word in the text corpus are clustered based on a CRP algorithm; the target polysemous words in the text corpus are labelled according to their cluster categories; and training on the labelled text corpus yields the multi-prototype vector representation of the polysemous words. Step two: a target short text is preprocessed to obtain a short text word sequence; the target polysemous words in the word sequence are identified; the similarity between the context window representation of each target polysemous word and the centroid of each cluster corresponding to that word in the text corpus is calculated; and the word vector corresponding to the cluster category with the maximum similarity is taken as the word vector representation of the specific sense of the polysemous word in the context, thereby disambiguating the polysemous word. The invention solves the problem of representing multiple word senses and the problem of ambiguity recognition in word sense representation.

Description

CRP clustering-based word multi-prototype vector representation and word sense disambiguation method

Technical Field

The invention relates to the field of natural language processing, in particular to a CRP clustering-based word multi-prototype vector representation and word sense disambiguation method.

Background

Among the many tasks in the field of natural language processing, the fundamental problem is how to represent linguistic symbols in an encoded form that a machine can process. Linguistic symbols are mapped to representations: words, sentences and texts are expressed as continuous low-dimensional vectors, which gives a semantic vectorized representation of words, sentences and texts. Such representations are widely used in tasks such as information retrieval, short text classification, named entity recognition, sentiment analysis, recommendation engines and automatic text summarization.

Words are the most basic constituent units of language, and vectorized word representations are widely used in natural language processing tasks. A simple word vector representation is the one-hot representation. Its disadvantages are that the vector dimension equals the total number of words, which leads to the curse of dimensionality; that it cannot describe semantic relations between words; and that it cannot reflect the different senses of a polysemous word.

A word vector representation (word embedding or word representation) is a fixed-length, low-dimensional real-valued vector. By training on massive text, a single vector representation is learned for each word, with the property that similar or related words lie closer together. However, because some words are polysemous, the same word form may express different meanings in different contexts, while most conventional word vector representations assign each word only one vector and therefore cannot effectively express the different senses of a polysemous word. Each sense of a polysemous word should correspond to its own vector representation.

The word multi-prototype vector representation assigns one word vector to each sense of a polysemous word and can improve the accuracy of word representation. To obtain vector representations of the different senses of a word, a clustering-based model performs word sense induction by clustering the word's contexts: the contexts of the original word are either clustered directly or clustered after semantic mapping using cross-language knowledge, and training then yields the word vector corresponding to the specific sense of the word in each context.

Methods that obtain word vectors of polysemous words using a k-means clustering algorithm and neural network language model training require the parameter k (the number of clusters) to be chosen separately for each polysemous word according to its number of senses. With the word multi-prototype vector representation based on CRP clustering, the number of clusters does not need to be specified before training, which matches the practical situation that different polysemous words have different numbers of senses in context.

High-quality word sense representations capture rich semantic and syntactic information and help word sense disambiguation; in turn, high-quality word sense disambiguation allows better word sense representations to be learned. There are two main types of word sense disambiguation method: methods based on an external knowledge base and methods based on a corpus. Knowledge-base methods distinguish and recognize the specific sense of a polysemous word by relying on the explanations or descriptions of its senses in an external knowledge base (such as WordNet or HowNet), but building an external knowledge base or dictionary consumes a large amount of manpower and material resources. Corpus-based methods take the corpus as the knowledge resource and determine the specific sense of a word in a given context through automatic or semi-automatic learning, thereby achieving word sense disambiguation.

For polysemous words in a sentence, the method uses a text corpus: based on the learned multi-prototype vector representation of a word, the given word sense disambiguation method obtains the specific sense of the word in its context, which improves the effectiveness of word and sentence representation.

As Internet technology and mobile applications become part of daily life, people use mobile terminals to exchange information and communicate, generating massive amounts of data such as news headlines, microblog posts, product or service descriptions on shopping platforms, forum comments, intelligent interactive applications and social conversation messages. Using machines to effectively process and understand massive short text data on the Internet has become an important research difficulty and hotspot in the fields of natural language processing and machine learning.

In similarity calculation of information retrieval, the word multi-prototype vector representation and word sense disambiguation method can distinguish the specific word sense of the polysemous word in a retrieval object, and the accuracy of word representation and calculation is improved. An effective word semantic representation and word sense disambiguation method is provided for short text retrieval in the field of information retrieval, and technical support is provided for semantic calculation.

Disclosure of Invention

The present invention aims to overcome the problems in the prior art by providing a word multi-prototype vector representation and word sense disambiguation method based on CRP clustering. The word multi-prototype vector representation assigns one word vector to each sense of a polysemous word, which solves the problem of representing polysemy in word representation, and the word sense disambiguation method based on the word multi-prototype vector representation solves the problem of ambiguity recognition in word sense representation.

The technical scheme of the invention is as follows: the CRP clustering-based word multi-prototype vector representation and word sense disambiguation method comprises the following steps:

step S1, cleaning and preprocessing the texts in the massive text corpus to obtain plain text, clustering the context window representations of the target polysemous words in the text corpus based on a CRP algorithm, labelling the target polysemous words in the text corpus according to their cluster categories, and training on the labelled text corpus to obtain the multi-prototype vector representation of the polysemous words;

step S2, preprocessing the target short text to obtain the word sequence of the short text, identifying the target polysemous words in the word sequence, calculating the similarity between the context window representation of each target polysemous word and the centroid of each cluster corresponding to that word in the text corpus, taking the word vector corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense of the polysemous word in the context, and thereby disambiguating the polysemous word.

In step S1, cleaning and preprocessing the texts in the massive text corpus to obtain plain text includes: deleting texts whose number of words is less than a preset threshold; uniformly converting traditional Chinese characters into simplified characters; replacing Chinese and English abbreviations in the text corpus with Chinese words using a Chinese-English abbreviation dictionary; segmenting the texts in the text corpus into words; removing stop words; deleting all characters other than Chinese characters and digits; counting word frequencies; presetting an upper limit threshold for the word frequency of high-frequency words; selecting the words whose number of occurrences in the text corpus is greater than a preset lower limit threshold to build the word list; and building a polysemous word list based on a polysemous word dictionary.
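As an illustration only, part of this preprocessing could be sketched in Python as follows. The sketch assumes that word segmentation, traditional-to-simplified conversion and abbreviation replacement have already been applied by external tools; the function name `build_vocabulary` and the default threshold values are hypothetical and are not taken from the invention.

```python
import re
from collections import Counter

# Tokens consisting only of Chinese characters (CJK Unified Ideographs) or digits.
CHINESE_OR_DIGIT = re.compile(r"[\u4e00-\u9fff0-9]+")

def build_vocabulary(segmented_texts, stop_words, min_words=3, min_count=2):
    """Filter pre-segmented texts and build a word list.

    segmented_texts: list of token lists (segmentation is assumed to be done).
    min_words, min_count: hypothetical values for the preset thresholds.
    Returns (cleaned_texts, word_counts, vocabulary).
    """
    cleaned = []
    for tokens in segmented_texts:
        if len(tokens) < min_words:  # delete texts that are too short
            continue
        kept = [t for t in tokens
                if t not in stop_words and CHINESE_OR_DIGIT.fullmatch(t)]
        cleaned.append(kept)
    # Count word frequencies over the cleaned corpus.
    counts = Counter(t for tokens in cleaned for t in tokens)
    # Keep words occurring more often than the lower limit threshold.
    vocabulary = {w for w, c in counts.items() if c > min_count}
    return cleaned, counts, vocabulary
```

The upper limit threshold for high-frequency words and the polysemous word list are left out of this sketch, since the text above does not fix how they are applied.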

The context window representation of the target polysemous word in step S1 is obtained by averaging the word vectors of the words in the word's context, and the specific calculation formula is as follows:

$$\mathrm{veC} = \frac{1}{|\mathrm{Context}|} \sum_{w_i \in \mathrm{Context}} \mathrm{vec}(w_i) \qquad (1)$$

where veC is the context window representation of the word, w_i is the i-th word in the word's context window word set Context, and vec(w_i) is the initial word vector of the word w_i.
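A minimal sketch of this averaging, assuming the initial word vectors are available as a dictionary of NumPy arrays, is given below. The function name, the symmetric `window` parameter and the choice to exclude the target word itself from its context are assumptions made for illustration.

```python
import numpy as np

def context_window_representation(tokens, position, vectors, window=5):
    """Average the initial word vectors of the words around tokens[position].

    vectors: dict mapping word -> 1-D numpy array (initial word vectors).
    window:  number of words taken on each side (hypothetical default).
    Returns None when no context word has a vector.
    """
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    # Assumption: the target word is not part of its own context set.
    context = [w for w in left + right if w in vectors]
    if not context:
        return None
    return np.mean([vectors[w] for w in context], axis=0)
```

Words without an initial vector are simply skipped here; for an unweighted mean this gives the same result as substituting the mean of the remaining context vectors for them.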

Clustering the context window representations of the target polysemous word in the text corpus based on the CRP algorithm in step S1 includes the following steps:

step S101, obtaining the context window representation of all samples of the polysemous words in the text corpus;

step S102, obtaining an initial clustering centroid of a CRP clustering algorithm, taking a random sample as the initial clustering centroid of the CRP clustering, or performing initial clustering on the context window representation of the ambiguous word based on a k-means algorithm, and taking the clustering centroid containing the most samples as the initial clustering centroid;

step S103, for the context window representations of all samples of the polysemous word, calculating the similarity between each sample and the centroid of every existing cluster, and obtaining the maximum similarity Smax of the i-th sample, attained at the centroid of the t-th cluster; if Smax is greater than a preset threshold alpha, assigning the i-th sample to the t-th cluster, increasing the number of samples in the t-th cluster by 1, and recalculating the centroid of the t-th cluster; otherwise, generating a new cluster, in which case the total number of clusters K increases by 1, the number of samples in the new cluster is 1, and the centroid of the new cluster is sample i;

and step S104, obtaining samples in each cluster, the centroid of each cluster and the total number of clusters.
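Steps S102 to S104 can be sketched as a single pass over the samples, as below. This is an illustrative sketch rather than the definitive procedure: it assumes cosine similarity as the similarity measure and an incremental mean as the centroid update, and the function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def crp_threshold_clustering(samples, alpha, initial_centroid=None):
    """Cluster context window representations with a similarity threshold.

    samples: list of 1-D vectors; alpha: preset similarity threshold.
    initial_centroid: optional starting centroid (for example from k-means);
    by default the first sample is used.
    Returns (labels, centroids, counts).
    """
    samples = [np.asarray(s, dtype=float) for s in samples]
    start = samples[0] if initial_centroid is None else initial_centroid
    centroids = [np.array(start, dtype=float)]
    counts = [0]  # the starting centroid holds no samples yet
    labels = []
    for x in samples:
        sims = [cosine(x, c) for c in centroids]
        t = int(np.argmax(sims))
        if sims[t] > alpha:
            # Join cluster t and update its centroid as a running mean.
            counts[t] += 1
            centroids[t] += (x - centroids[t]) / counts[t]
        else:
            # Open a new cluster whose centroid is the sample itself.
            centroids.append(x.copy())
            counts.append(1)
            t = len(centroids) - 1
        labels.append(t)
    return labels, centroids, counts
```

With a running mean, a supplied starting centroid is replaced by the first sample that joins its cluster, and that cluster stays empty if no sample ever exceeds the threshold against it; other update rules are equally compatible with the steps above.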

Training on the labelled text corpus to obtain the multi-prototype vector representation of the polysemous word in step S1 includes the following steps:

step S201, marking all samples of the target polysemous words in the text corpus according to the belonged clustering clusters, wherein different clustering clusters represent different word senses of the target words;

step S202, executing a word vector representation training process based on a neural network language model on the marked cluster, and obtaining multi-prototype vector representation of words expressing specific word senses in different contexts.

The word sense disambiguation of the ambiguous word in step S2 includes the following steps:

step S301, preprocessing the target short text to obtain a word sequence of the short text, and identifying polysemous words in the word sequence according to the multi-prototype vector representation of the words;

step S302, carrying out word sense disambiguation on the polysemous words, calculating the similarity between the context window representation of the words in the short text word sequence and the centroids of all clustering clusters corresponding to the words in the text corpus, and extracting word vector representation corresponding to the clustering cluster category with the maximum similarity as word vector representation of the polysemous words expressing specific word senses in the context.

In step S2, preprocessing the target short text to obtain the word sequence of the short text includes: removing stop words and converting traditional characters into simplified characters; replacing English abbreviations in the target short text with Chinese words using a Chinese-English abbreviation dictionary; performing word segmentation on the short text; and replacing characters other than Chinese characters and digits with special symbols.

The invention has the beneficial effects that: the embodiment of the invention provides a word multi-prototype vector representation and word sense disambiguation method based on CRP clustering, which adopts a clustering algorithm based on CRP to cluster context window representation of target words in a text corpus, trains on a marked clustering cluster to obtain word vector representation of polysemous words, improves the accuracy of polysemous word vector representation, and solves the problem of word polysemous representation in word representation. For the polysemous words in the sentence, the polysemous vector representation of the words is utilized, the similarity between the context window representation of the polysemous words and the centroid of the word cluster in the training sample is calculated, the word vector representation corresponding to the cluster with the maximum similarity is used as the word vector representation of the polysemous words with specific semantics in the context, and the ambiguity of the polysemous words is eliminated.

The invention provides a CRP clustering-based word multi-prototype vector representation method, which uses a CRP-based clustering algorithm to cluster the context representations of all samples of a target polysemous word, where each resulting cluster represents one sense of the target word, and obtains the word multi-prototype vector representation by training on the labelled corpus. The multi-prototype vector representation of a word can be used to distinguish the different senses of a polysemous word, which solves the problem of representing word senses.

The invention clusters the context window representations of all samples of the target polysemous words by adopting a CRP-based clustering algorithm, the CRP algorithm does not need to appoint the clustering number in advance, the obtained clustering cluster number can effectively express the number of different word senses of the polysemous words, the actual problem that the word sense numbers of the different polysemous words are inconsistent is solved, the word context window representations are used as the judgment standard for the words belonging to the same clustering cluster, and the calculation process is simple.

The word sense disambiguation method based on the word multi-prototype vector representation can identify the multi-sense words in the sentence, obtain the word vector representation of the specific word sense of the words in the context, and eliminate the ambiguity of the multi-sense words in different context. Calculating the similarity between the context window representation of the target polysemous word and the centroid of each cluster corresponding to the word in the text corpus, representing the word vector corresponding to the cluster category with the maximum similarity as the word vector representation of the polysemous word with the specific word meaning in the context, and carrying out word meaning disambiguation on the polysemous word.

Drawings

FIG. 1 is an overall flow diagram of word multi-prototype vector representation and word sense disambiguation based on CRP clustering;

FIG. 2 is a flow diagram of CRP clustering of the context window representations of polysemous words;

FIG. 3 is a training flow diagram of the word multi-prototype vector representation based on CRP clustering;

FIG. 4 is a flow diagram of word sense disambiguation based on the word multi-prototype vector representation;

FIG. 5 shows the noun word sense disambiguation results based on the word multi-prototype vector representation;

FIG. 6 shows the verb word sense disambiguation results based on the word multi-prototype vector representation.

Detailed Description

An embodiment of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the embodiment.

The invention discloses a CRP clustering-based word multi-prototype vector representation and word sense disambiguation method, as shown in figure 1, the basic idea of the invention is to construct multi-prototype vector representation of words on the basis of CRP clustering word context representation, identify polysemous words in sentences or short texts, eliminate ambiguity of the polysemous words, obtain word vector representation of specific word senses of the words in the contexts, and the multi-prototype representation of the word vector can more accurately represent different semantics of the words in the context. The method comprises the following specific steps:

In step S1, the texts in the massive text corpus are cleaned and preprocessed to obtain plain text: texts whose number of words is less than a preset threshold are deleted from the published or collected text corpus; traditional Chinese characters in the text corpus are converted into simplified characters; English abbreviations in the text corpus are replaced with Chinese words using a custom dictionary; a word segmentation system is then used to segment the text; characters other than Chinese characters and digits are removed by regular expression matching; stop words are removed and word frequencies are counted; an upper limit threshold is preset for the word frequency of high-frequency words; and finally the word list is built from the words whose number of occurrences in the text corpus is greater than a preset lower limit threshold.

In step S1, context window representations of all samples of the ambiguous word in the corpus of text are obtained, the window size is set to a positive integer, and each context window representation is calculated by weighting the word vectors of the words in the window;

in step S1, as shown in fig. 2, the context window representation of all samples of the target ambiguous word is clustered based on the CRP algorithm, specifically:

1. performing initial clustering on the context window representation of the polysemous words based on a k-means algorithm to obtain each cluster and a centroid thereof;

2. taking the cluster centroid containing the most number of samples as an initial cluster centroid of the CRP clustering algorithm;

3. for the context window representations of all samples of the polysemous word, calculating the similarity between each sample and the centroid of every existing cluster, and obtaining the maximum similarity Smax of the i-th sample, attained at the centroid of the t-th cluster;

4. if Smax is greater than a preset threshold alpha, assigning the i-th sample to the t-th cluster, increasing the number of samples in the t-th cluster by 1, and recalculating the centroid of the t-th cluster; otherwise, generating a new cluster, in which case the total number of clusters K increases by 1, the number of samples in the new cluster is 1, and the centroid of the new cluster is sample i;

5. the samples in each cluster, the centroid of the cluster, and the total number of clusters are obtained.

Wherein, the 1 st and 2 nd steps can be simplified into that the first sample or a random sample is taken as the initial clustering centroid of the CRP clustering.

In step S1, as shown in fig. 3, training obtains a multi-prototype vector representation of the word, specifically:

1. acquiring all context window representations of the target ambiguous words in the text corpus;

2. clustering the context window representation of the polysemous words based on a CRP algorithm to obtain a clustering cluster represented by word context;

3. for target polysemous words, searching corresponding positions in the original text corpus according to the target words and the context thereof, and performing corresponding category marking on the target text corpus according to the cluster to which the sample belongs, wherein different cluster clusters represent different semantics of the target words;

4. executing steps 1, 2 and 3 for each polysemous word, and marking the cluster categories in the target text corpus;

5. And training on the marked text corpus based on a CBOW model to obtain the multi-prototype vector representation of the words.
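The labelling in steps 3 and 4 can be illustrated with the short sketch below. It assumes that each occurrence of a target word is identified by its sentence index and token index and that a sense is marked by appending the cluster number to the word, in the manner of "apple 1" and "apple 2"; the function name and data layout are hypothetical.

```python
def label_corpus(sentences, assignments):
    """Mark each occurrence of a target polysemous word with its cluster id.

    sentences:   list of token lists (the cleaned text corpus).
    assignments: dict mapping (sentence_index, token_index) -> cluster id,
                 taken from the clustering of the context windows.
    Returns a labelled copy; the input corpus is left unchanged.
    """
    labelled = [list(tokens) for tokens in sentences]
    for (s, i), cluster_id in assignments.items():
        # Assumption: a sense is written as the word followed by its cluster id.
        labelled[s][i] = f"{labelled[s][i]}{cluster_id}"
    return labelled
```

The labelled sentences can then be passed to any CBOW trainer, so that each labelled sense receives its own vector. One option, named here only as an example, is gensim's `Word2Vec` class with `sg=0`.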

In step S2, as shown in fig. 4, polysemous word identification and word sense disambiguation based on the word multi-prototype vector representation specifically include:

1. Preprocessing the target short text, specifically: removing stop words and converting traditional characters into simplified characters; replacing English abbreviations in the target sentence with Chinese words using a Chinese-English abbreviation dictionary; performing word segmentation on the short text; and replacing characters other than Chinese characters and digits with special symbols, to obtain the word sequence of the short text.

2. Identifying the polysemous words in the sentence. Polysemous words in the word sequence are identified from the word multi-prototype vector representations: a polysemous word is one that has two or more word vector representations.

3. Computing the context window representation of each polysemous word. The context window representation is the weighted average of the word vectors of the context words. For a polysemous word appearing among the context words, the word vector corresponding to the cluster in which that word occurs most often in the text corpus is used in the calculation, and an unrecognized word is represented by the average of the word vectors in the context window.

4. Ordering the polysemous words by their number of senses and disambiguating them in sequence, from words with fewer senses to words with more senses.

5. And calculating the similarity between the context window representation of the polysemous words in the short text sequence and the centroid of the training sample cluster, and using the word vector representation corresponding to the cluster with the maximum similarity as the word vector representation of the polysemous words.
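Steps 2 to 4 of this list can be sketched as follows. The sketch assumes `sense_vectors` maps each known word to a list of vectors (one per sense, so ordinary words have a list of length one) and `sense_counts` maps each word to the training-corpus frequency of each sense; both names, and the use of an unweighted mean in place of the weighted average, are assumptions for illustration.

```python
import numpy as np

def find_polysemous(tokens, sense_vectors):
    """Return (position, word) pairs for words with two or more sense
    vectors, ordered from fewer senses to more senses."""
    hits = [(i, w) for i, w in enumerate(tokens)
            if len(sense_vectors.get(w, [])) >= 2]
    return sorted(hits, key=lambda h: len(sense_vectors[h[1]]))

def context_vector(tokens, position, sense_vectors, sense_counts, window=5):
    """Average the context word vectors around tokens[position].

    A polysemous context word contributes the vector of its most frequent
    sense; unknown words are skipped, which for an unweighted mean equals
    representing them by the average of the other context vectors.
    """
    lo = max(0, position - window)
    hi = min(len(tokens), position + 1 + window)
    vecs = []
    for i in range(lo, hi):
        w = tokens[i]
        if i == position or w not in sense_vectors:
            continue
        most_frequent = int(np.argmax(sense_counts[w]))
        vecs.append(sense_vectors[w][most_frequent])
    return np.mean(vecs, axis=0) if vecs else None
```

Because Python's sort is stable, words with the same number of senses keep their order of appearance in the sentence.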

According to the idea that the semantics of the words are determined by the contexts of the words, the specific semantics of the polysemous words in the contexts are obtained by calculating the similarity between the context window representation of the polysemous words and the text corpus cluster centroids corresponding to the word vectors of the polysemous words, and the word vector representation corresponding to the maximum similarity is used as the word vector representation of the specific semantics of the polysemous words in the contexts, and the specific calculation method comprises the following steps:

$$\mathrm{vec}(w) = \{\, \mathrm{vec}_k(w) \mid \mathrm{Sim}(\mathrm{veC}, \mathrm{vec}_k(w)) = \max_j \mathrm{Sim}(\mathrm{veC}, \mathrm{vec}_j(w)) \,\} \qquad (2)$$

where vec(w) is the word vector representation of the specific sense of the polysemous word w in the context window; vec_j(w) is the word vector representation of the centroid of the text corpus cluster corresponding to the j-th sense of the polysemous word w; max_j Sim(veC, vec_j(w)) is the maximum of the similarities between the context window representation veC of the polysemous word w and each vec_j(w); and the k-th word vector, which corresponds to this maximum, is taken as the word vector of the specific sense of the word w.
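Equation (2) amounts to an argmax over the senses of a word. A minimal sketch, assuming cosine similarity for Sim and assuming that the cluster centroids and the trained sense vectors of a word are stored in the same order, is shown below; the function name and argument layout are hypothetical.

```python
import numpy as np

def disambiguate(context_vec, centroids, sense_vectors):
    """Select the sense whose cluster centroid is most similar to the context.

    centroids:     cluster centroids of the word, one per sense.
    sense_vectors: trained sense vectors of the word, in the same order.
    Returns (k, sense_vectors[k], similarity) for the best sense k.
    """
    context_vec = np.asarray(context_vec, dtype=float)
    sims = []
    for c in centroids:
        c = np.asarray(c, dtype=float)
        denom = np.linalg.norm(context_vec) * np.linalg.norm(c)
        # Cosine similarity; 0.0 if either vector is all zeros.
        sims.append(float(context_vec @ c / denom) if denom else 0.0)
    k = int(np.argmax(sims))
    return k, sense_vectors[k], sims[k]
```

When two senses tie, `np.argmax` returns the first of them; the text above does not specify a tie-breaking rule.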

Interpretation of terms: CRP: the Chinese Restaurant Process is a typical Dirichlet process mixture model. Its advantage is that the number of mixture components does not need to be specified manually, which makes it suitable for clustering problems in natural language processing.
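For reference, in the standard Chinese Restaurant Process the next customer joins an existing table with probability proportional to the number of customers already seated there and opens a new table with probability proportional to a concentration parameter. The sketch below computes these probabilities; the name `concentration` denotes the standard CRP parameter and is distinct from the similarity threshold alpha used in the clustering steps of this method, which decide between joining and opening a cluster by a similarity threshold rather than by sampling.

```python
def crp_seating_probabilities(table_sizes, concentration):
    """Seating probabilities for the next customer in a standard CRP.

    table_sizes:   number of customers at each existing table.
    concentration: positive concentration parameter of the process.
    Returns one probability per existing table, followed by the
    probability of opening a new table.
    """
    total = sum(table_sizes) + concentration
    probs = [size / total for size in table_sizes]
    probs.append(concentration / total)
    return probs
```

Because a new table always has non-zero probability, the number of clusters grows with the data instead of being fixed in advance.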

Multi-prototype word vector representations of a polysemous word in different contexts. In Table 1, unlabelled words such as "apple" correspond to ordinary word vectors, which do not distinguish between senses. Labelled word senses correspond to word multi-prototype vector representations; for example, "apple 2" denotes the second sense of the word "apple", namely the agricultural product. The word vector of "apple 1" corresponds to Apple the IT company, whereas "apple 2" denotes the fruit. The word multi-prototype vector representation can therefore capture the semantic information that distinguishes the senses of a word.

TABLE 1 CRP method-based words or word senses of closest words

An embodiment of the word sense disambiguation method based on the word multi-prototype vector representation.

The polysemous word sense disambiguation test dataset comes from the Chinese corpus of SemEval-2007 Task 5. The test dataset contains 40 polysemous words, each with at least two senses. The number of senses differs from word to word: it is generally 2 to 4, and the word with the most senses, "out", has 9. For example, the word "TCM" has two senses, "practitioner of Chinese medicine" and "traditional Chinese medical science", that is, "TCM doctor" and "TCM medicine", and each sense has a different number of text instances.

In the word sense disambiguation test, for each given polysemous word in the test set, the method extracts the polysemous word and its context representation from each text instance and calculates the similarity between the context representation and the text corpus cluster centroid corresponding to each word vector in the word's multi-prototype vector representation, thereby obtaining the multi-prototype word vector for the polysemous word and its cluster category. The sense category expressed by that word vector is then compared with the reference label of the test set to judge whether the disambiguation result is correct.

The noun word sense disambiguation results based on the word multi-prototype vector representation are shown in FIG. 5. The verb word sense disambiguation results based on the word multi-prototype vector representation are shown in FIG. 6.

In information retrieval, the word vector multi-prototype representation and word sense disambiguation method can identify the specific semantics of the polysemous words in the context of the retrieval object, improve the accuracy of word representation, make calculation more reasonable and make the retrieval result more accurate.

In an information retrieval application, in order to recall more similar results to a retrieval word sequence or keyword, similarity (sentence similarity, word similarity) is used to identify similar word sequences or keywords. The similarity of words can be measured by the cosine value of the included angle between two word vectors.
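The cosine measure mentioned here can be computed for two whole sets of word vectors at once. The following sketch is illustrative; the function name `pairwise_cosine` and the row-per-word matrix layout are assumptions.

```python
import numpy as np

def pairwise_cosine(a_vectors, b_vectors):
    """Cosine similarity between every row of a_vectors (m x d) and every
    row of b_vectors (n x d); returns an m x n matrix."""
    a = np.asarray(a_vectors, dtype=float)
    b = np.asarray(b_vectors, dtype=float)
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Guard against zero vectors to avoid division by zero.
    a_norm[a_norm == 0] = 1.0
    b_norm[b_norm == 0] = 1.0
    return (a / a_norm) @ (b / b_norm).T
```

Entry (i, j) of the result is the cosine of the angle between the i-th vector of the first set and the j-th vector of the second set.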

For example, the multi-prototype word vector representation of the polysemous word "accounting" consists of "accounting 1" and "accounting 2". "Accounting 1" has the sense of "accounting" or "accounting for profit and loss", and "accounting 2" has the sense of "autumn accounting", that is, squaring accounts with someone after a loss or failure. The similarity between "accounting 1" and the words "settlement" and "reply" is 0.66 and 0.11 respectively, and the similarity between "accounting 2" and the same two words is 0.14 and 0.72 respectively. The similarity between "accounting 1" and "accounting 2" is 0.25, so the different senses of the word "accounting" differ substantially in similarity.

In information retrieval, when the retrieval object is a sentence, sentence similarity can be used to measure the similarity between the retrieval object and the retrieval target. The retrieval target is preprocessed to obtain its word sequence, whose number of words is denoted m; the polysemous words in the word sequence are identified, and the word vector representation of each word in the sequence is obtained and recorded as the set D. The sentence to be retrieved is preprocessed in the same way to obtain its word sequence, whose number of words is denoted n; the polysemous words in the word sequence are identified, and the word vector representation of each word in the sequence is obtained and recorded as the set S.

The similarity sim(D_i, S_j) between each word D_i in the set D and each word S_j in the set S is calculated, and the m most similar word pairs are extracted. The similarity Sim(D, S) between the retrieval object S and the retrieval target D is then obtained from the sentence similarity calculation formula:

wherein,

and representing the sum of the similarity of m most similar word pairs, wherein m is the number of words in the set D, and n is the number of words in the set S.

For example, the retrieval target contains the polysemous word "account". The retrieval target is the sentence {account for others}, and the retrieval objects are sentence 1 {account for you by a talker's house} and sentence 2 {cause them to recognize harm in the account}. After preprocessing, the word sequence sets D = {account for others}, S1 = {account for you by a talker's house} and S2 = {cause them to recognize harm in the account} are obtained. The polysemous words in the word sequence sets D, S1 and S2 are identified, and a word vector representation of each word is obtained. The similarities between the words of the retrieval target D and the words of the retrieval objects S1 and S2 are shown in Table 2.

TABLE 2 Similarity between the words in the retrieval target and the words in the retrieval objects

According to equation (3), Sim(D, S1) is 0.62 and Sim(D, S2) is 0.39, so the retrieval target D matches sentence S1 better, which is closer to the real context and gives a more accurate retrieval result.

The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims

1. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method is characterized by comprising the following steps of:

the context window representation of the target ambiguous word in the clustered corpus based on the CRP algorithm in step S1 includes the following steps:

step S104, obtaining samples in each cluster, the centroid of each cluster and the total number of clusters;

step S2, preprocessing the target short text to obtain a word sequence of the short text, identifying the target polysemous words in the word sequence, calculating the similarity between the context window representation of the target polysemous words and the centroid of each cluster corresponding to the words in the text corpus, representing the word vector corresponding to the cluster category with the maximum similarity as the word vector representation of the polysemous words with specific word senses in the context, and disambiguating the word senses of the polysemous words;

2. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein cleaning and preprocessing the texts in the massive text corpus to obtain plain text in step S1 comprises: deleting texts whose number of words is less than a preset threshold; uniformly converting traditional Chinese characters into simplified characters; replacing Chinese and English abbreviations in the text corpus with Chinese words using a Chinese-English abbreviation dictionary; segmenting the texts in the text corpus into words; removing stop words; deleting all characters other than Chinese characters and digits; counting word frequencies; presetting an upper limit threshold for the word frequency of high-frequency words; selecting the words whose number of occurrences in the text corpus is greater than a preset lower limit threshold to build the word list; and building a polysemous word list based on a polysemous word dictionary.

3. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein the context window representation of the target polysemous word in step S1 is obtained by averaging the word vectors of the words in the word's context, and the specific calculation formula is:

4. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein training on the labelled text corpus in step S1 to obtain the multi-prototype vector representation of the polysemous word comprises the following steps:

5. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein preprocessing the target short text to obtain the word sequence of the short text in step S2 comprises: removing stop words and converting traditional characters into simplified characters; replacing English abbreviations in the target short text with Chinese words using a Chinese-English abbreviation dictionary; performing word segmentation on the short text; and replacing characters other than Chinese characters and digits with special symbols.