CN116561319A - Text clustering method, text clustering device and text clustering system

Info

Publication number: CN116561319A
Application number: CN202310666521.XA
Authority: CN (China)
Prior art keywords: target, word, text, words, target text
Legal status: Pending
Other languages: Chinese (zh)
Inventors: Sun Yue (孙悦), Li Shaobo (李少波)
Current and original assignee: Postal Savings Bank of China Ltd
Application filed by Postal Savings Bank of China Ltd

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/35 — Clustering; Classification
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/22 — Matching criteria, e.g. proximity measures
    • G06F 18/23 — Clustering techniques
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • Y02D — Climate change mitigation technologies in information and communication technologies (ICT)
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a text clustering method, a text clustering device and a text clustering system. In this scheme, word vectors are represented by combining a Word2vec model with the TF-IDF algorithm, which strengthens the distinction between different texts: the semantic advantages of word vectors are retained while the influence of each word on its text is added in. The weighted word vectors are then used as the input of the WMD algorithm, which serves as the similarity measure in text clustering, thereby improving the accuracy of text clustering.

Description

Text clustering method, text clustering device and text clustering system
Technical Field
The application relates to the technical field of text clustering, in particular to a text clustering method, a text clustering device, a computer-readable storage medium and a text clustering system.
Background
With increasing information propagation speed and the continuous expansion of cyberspace, the amount of information on the internet grows exponentially, and the volume of unstructured text data keeps rising. To make effective use of the valuable information contained in this text data, clustering algorithms build text clusters according to features shared among the texts so that subsequent information processing and analysis can be carried out; this is a key step in extracting valuable information. However, when common distance measures are used to calculate text similarity during clustering, the clustering accuracy is low.
Disclosure of Invention
The main object of the present application is to provide a text clustering method, a text clustering device, a computer readable storage medium and a text clustering system, so as to at least solve the problem of low accuracy of text clustering in the prior art.
To achieve the above object, according to one aspect of the present application, there is provided a text clustering method, including: obtaining a plurality of original texts, and preprocessing each original text to obtain a plurality of target texts, wherein the preprocessing includes at least one of the following: word segmentation processing and stop-word removal processing, and wherein each target text includes a plurality of words; converting the words in each target text into word vectors by using a Word2vec model, and determining weight values of the words by using the TF-IDF algorithm, wherein a weight value indicates the importance degree of a word in its target text; and taking the word vectors and the weight values as the input of the WMD algorithm, determining the similarity between any two target texts, and clustering the target texts according to the similarity.
Optionally, converting the words in each target text into word vectors by using a Word2vec model, and determining weight values of the words by using the TF-IDF algorithm, includes: constructing a Word2vec model, wherein the Word2vec model is trained using multiple sets of training data acquired in different historical time periods, each set including historical words and the historical word vectors corresponding to those words; inputting a word into the Word2vec model to obtain the word vector, output by the Word2vec model, corresponding to the word; acquiring the word frequency and inverse document frequency of the word in the target text, wherein the inverse document frequency represents the importance of the word; and determining the weight value of the word from the word frequency and the inverse document frequency by the TF-IDF algorithm, wherein the smaller the product of the word frequency and the inverse document frequency, the smaller the weight value of the word and the lower its importance degree, and the larger the product, the larger the weight value and the higher its importance degree.
Optionally, determining the weight value of the word by using the TF-IDF algorithm further includes: according to the target formula

$$W(f,m)=\frac{TF(f,m)\times\log\frac{N}{n_i}}{\sqrt{\sum_{f\in m}\left[TF(f,m)\times\log\frac{N}{n_i}\right]^{2}}}$$

determining the weight value of the word, where f denotes a word, m denotes a target text, W(f,m) denotes the weight value of word f in target text m, TF(f,m) denotes the number of times word f appears in target text m, N denotes the total number of target texts, and n_i denotes the number of target texts containing word f.
Optionally, determining the similarity between any two target texts by taking the word vectors and the weight values as the input of the WMD algorithm includes: performing TF-IDF & Word2vec vectorized representation on a first target text according to a plurality of first word vectors in the first target text and the first TF-IDF values corresponding to those word vectors, to obtain a first target set comprising a plurality of first target word vectors and weight values, where the first word vectors and first TF-IDF values serve as input of the WMD algorithm; performing TF-IDF & Word2vec vectorized representation on a second target text according to a plurality of second word vectors in the second target text and the second TF-IDF values corresponding to those word vectors, to obtain a second target set comprising a plurality of second target word vectors and weight values, where the second word vectors and second TF-IDF values serve as input of the WMD algorithm; and, in calculating the similarity between the first target text and the second target text using the WMD algorithm, calculating the cosine distance between a first target word vector and a second target word vector as the transfer cost, performing weight allocation using the TF-IDF values, and taking the minimum of the sum of products of transfer costs and allocated weights as the distance between the first target text and the second target text, this distance being the similarity between the two texts.
Optionally, in calculating the similarity between the first target text and the second target text using the WMD algorithm, calculating the cosine distance between the first target word vector and the second target word vector as the transfer cost, performing weight allocation using the TF-IDF values, and taking the minimum of the sum of products of transfer costs and allocated weights as the distance between the two texts, the method further includes: obtaining a corpus comprising a plurality of dictionary words; selecting dictionary words from the corpus in turn as the central dictionary word; calculating the cosine distance between each non-central dictionary word in the corpus and the central dictionary word, where the cosine distance is the cosine of the angle between the word vector of the non-central dictionary word and the word vector of the central dictionary word; and storing the non-central dictionary words whose cosine distances fall within a target range into an unrelated set of the central dictionary word, and the non-central dictionary words whose cosine distances fall outside the target range into a related set of the central dictionary word.
Optionally, in calculating the similarity between the first target text and the second target text using the WMD algorithm, calculating the cosine distance between the first target word vector and the second target word vector as the transfer cost, and before performing weight allocation using the TF-IDF values, the method further includes: in the case that some of the second target word vectors fall in the related set of the first target word vector and others fall in its unrelated set, calculating the cosine distances of the second target word vectors in the related set of the first target word vector as the transfer costs for their weight allocation, and performing weight allocation using the TF-IDF values; calculating a second average value of the distances from the second target word vectors in the unrelated set of the first target word vector to the first target word vector as the transfer cost for their weight allocation, and performing weight allocation using the TF-IDF values; and calculating the sum of first data and second data to obtain the WMD distance between the first target text and the second target text, where the first data is the minimum of the sum of products of the cosine distances and the weights of the second target word vectors, and the second data is the product of the second average value and the weights of the second target word vectors.
Optionally, clustering the target texts according to the similarity includes: when the first clustering is performed with one target text, taking that target text as the first text cluster; when the N-th clustering is performed with the N-th target text, comparing the similarity of the N-th target text with each target text in the formed text clusters, where N ≥ 2; and classifying the N-th target text into a text cluster when the similarity is greater than or equal to a similarity threshold, or creating a new text cluster from the N-th target text when the similarity is less than the threshold.
According to another aspect of the present application, there is provided a text clustering apparatus, including: a first acquisition unit for acquiring a plurality of original texts and preprocessing each original text to obtain a plurality of target texts, wherein the preprocessing includes at least one of the following: word segmentation processing and stop-word removal processing, and wherein each target text includes a plurality of words; a first processing unit for converting the words in each target text into word vectors by using a Word2vec model and determining weight values of the words by using the TF-IDF algorithm, wherein a weight value indicates the importance degree of a word in its target text; and a second processing unit for taking the word vectors and the weight values as the input of the WMD algorithm, determining the similarity between any two target texts, and clustering the target texts according to the similarity.
According to still another aspect of the present application, there is provided a computer-readable storage medium including a stored program, where, when the program runs, it controls the device in which the computer-readable storage medium is located to execute any one of the text clustering methods.
According to yet another aspect of the present application, there is provided a text clustering system, including: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any one of the text clustering methods.
By applying this technical scheme, word vectors are represented by combining a Word2vec model with the TF-IDF algorithm, which strengthens the distinction between different texts: the semantic advantages of word vectors are retained while the influence of each word on its text is added in. The weighted word vectors are then used as the input of the WMD algorithm, which serves as the similarity measure in text clustering, thereby improving the accuracy of text clustering.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
Fig. 1 shows a hardware block diagram of a mobile terminal performing a text clustering method according to an embodiment of the present application;
FIG. 2 shows a flow diagram of a method of clustering text provided in accordance with an embodiment of the present application;
FIG. 3 shows a schematic flow chart of preprocessing an original text;
FIG. 4 shows a spatial schematic of Euclidean distance and cosine distance;
FIG. 5 shows a schematic diagram of a model of WMD calculation document distance;
FIG. 6 shows a schematic diagram of a word embedding distance range in a corpus;
FIG. 7 illustrates a schematic distribution of distances between central dictionary words and non-central dictionary words of a corpus;
FIG. 8 shows a flow diagram for clustering target text according to similarity;
fig. 9 shows a block diagram of a text clustering device according to an embodiment of the present application.
Wherein the above figures include the following reference numerals:
102. a processor; 104. a memory; 106. a transmission device; 108. and an input/output device.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
At present, the main reasons for the low accuracy of text clustering are as follows:

1) Text representation model: the commonly used Word2vec text representation model can be trained to convert each feature word into a real-valued vector of fixed dimension. However, a document is a set of words; obtaining word vectors is only an intermediate step, and representing the document with those vectors is the final goal. Common approaches include averaging all the word vectors of a document or clustering the word vectors. Because different words contribute differently to a document, such methods neglect the influence of individual words on the document, so the text representation is inaccurate and the subsequent clustering effect suffers;

2) Using the WMD (Word Mover's Distance) similarity algorithm directly for text similarity measurement yields limited accuracy. When the algorithm defines the transfer cost between words, it uses only the word frequency of the feature words in the text; it is therefore difficult to suppress the influence of high-frequency words on the document, and keyword factors are not considered, leading to low clustering accuracy;

3) When the WMD algorithm calculates text similarity, each word in one text is compared with all words in the other text to compute transfer costs, and weights are then allocated according to those costs, so the computational complexity is high and the efficiency low.
As described in the background art, in order to solve the above problem, embodiments of the present application provide a text clustering method, a text clustering device, a computer readable storage medium, and a text clustering system.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking the operation on a mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of a mobile terminal of a text clustering method according to an embodiment of the present invention. As shown in fig. 1, a mobile terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, wherein the mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a display method of device information in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In the present embodiment, a text clustering method running on a mobile terminal, a computer terminal, or a similar computing device is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps may be performed in an order different from that described herein.
Fig. 2 is a flow chart of a clustering method of texts according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:
Step S201, obtaining a plurality of original texts, and preprocessing each original text to obtain a plurality of target texts, wherein the preprocessing includes at least one of the following: word segmentation processing and stop-word removal processing, and wherein each target text includes a plurality of words;
specifically, the text corpus cannot be used as initial data for vectorization modeling, and the standard data format can ensure the smooth proceeding of the subsequent steps. Therefore, the text information can be further processed, including word segmentation processing on the text, and stop word processing is removed.
Word segmentation is a method of reorganizing consecutive words into word sequences following a specific rule. According to the structure of English sentences, the space between the words divides the words in the sentences, and no word division processing is needed for the sentences. In the text, no segmenter exists between words, so that the word segmentation technology is needed to preprocess Chinese. Chinese segmentation is the fundamental task of natural language processing. Chinese segmentation requires that each sentence be segmented into words in a sequence. Since the complexity of Chinese in terms of semantics and grammar makes it difficult for a computer to understand Chinese effectively, the text can be segmented using jieba segmentation. jieba is a third party Chinese word segmentation library of Python, and mainly supports 3 word segmentation modes of an accurate mode, a full mode and a search engine mode. The principle is that the Chinese word segmentation library is used for comparing the content of the segmented words with the word segmentation library, and the segmentation combination with the maximum probability is found in the graph structure and the dynamic programming.
After chinese segmentation, the document (text) will become a collection containing multiple words. Word elements with weak semantic information and no real meaning or some special symbols with no influence on document semantics in a set are called Stop words. Such as the words "woolen", "yo", "bar", etc., the structural terms "ground", "having" etc., and the conjunctions "only", "and" etc. Although these stop words occur more frequently in sentences, there is little information. The presence of stop words occupies system memory space, increasing the amount of index and thus reducing the speed of processing. Therefore, the words should be deleted in the document, so that the speed of processing the document is increased and the subsequent analysis is simpler and more convenient. The word or character to be deleted is typically placed in the stop vocabulary, and the result after chinese word segmentation is checked for presence in the stop vocabulary and if so, deleted. After word segmentation, reading a local stop word list, and storing the processed text information. Specifically, as shown in fig. 3, the process of preprocessing an original text is shown, firstly, the original text is obtained, noise reduction processing is performed on the original text (some preset words are deleted, preliminary data are cleaned), word segmentation processing is performed on the original text to obtain a plurality of words, one word is sequentially read, whether the word is in a deactivated word list is determined, if the word is not in the deactivated word list, one word is read again, if the word is in the deactivated word list, the word is deleted until the words in the text are preprocessed, and the preprocessing process is ended.
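For illustration, a minimal preprocessing sketch in Python follows, assuming jieba is installed; the stop-word file name "stopwords.txt" and the sample sentence are hypothetical placeholders, not taken from the patent.

```python
import jieba

def load_stopwords(path="stopwords.txt"):
    # One stop word or symbol per line in the local stop-word list.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(raw_text, stopwords):
    # Precise-mode segmentation, then drop stop words and whitespace tokens.
    return [w for w in jieba.lcut(raw_text) if w.strip() and w not in stopwords]

stopwords = load_stopwords()
target_text = preprocess("这是一段待聚类的原始文本", stopwords)
```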
Step S202, converting the words in each target text into Word vectors by using a Word2vec model, and determining weight values of the words by using a TF-IDF algorithm, wherein the weight values are the importance degrees of the words in the target text;
Specifically, the Word2vec model can exploit context information: through training, each word is converted into a real-valued vector of fixed dimension (i.e. a word vector), and semantically similar words lie close together in the vector space. This overcomes the high-dimensional sparsity of word-vector matrices and the neglect of semantics, grammar and context. The TF-IDF algorithm is used to calculate the weight values of words in texts and to strengthen the distinction between different texts. Therefore, the word vectors output by the Word2vec model are given a weighted representation using the TF-IDF algorithm, which solves the problem that the text representation model is not accurate enough.
And step S203, using the word vector and the weight value as the input of a WMD algorithm, determining the similarity between any two target texts, and clustering the target texts according to the similarity.
Specifically, to improve the efficiency of text similarity calculation and the accuracy of clustering, the WMD algorithm can be used to compute the similarity between texts; to improve efficiency and accuracy further, the WMD algorithm itself can be improved and the improved version used instead. The clustering algorithm may be any available one, such as the Single-Pass incremental clustering algorithm.
According to this embodiment, word vectors are represented by combining a Word2vec model with the TF-IDF algorithm, which strengthens the distinction between different texts: the semantic advantages of word vectors are retained while the influence of each word on its text is added in. The weighted word vectors are then used as the input of the WMD algorithm, which serves as the similarity measure in text clustering, thereby improving the accuracy of text clustering.
After the original texts are preprocessed, each becomes a collection of words — a target text. To apply this set of structured target texts effectively in subsequent cluster analysis, the texts must be vectorized. In a specific implementation, converting the words in each target text into word vectors by using a Word2vec model and determining the weight values of the words by using the TF-IDF algorithm can be realized through the following steps: construct a Word2vec model, trained using multiple sets of training data acquired in different historical time periods, each set comprising historical words and the historical word vectors corresponding to those words; input a word into the Word2vec model to obtain the word vector, output by the model, corresponding to that word; acquire the word frequency and inverse document frequency of the word in the target text, where the inverse document frequency represents the importance of the word; and determine the weight value of the word from the word frequency and inverse document frequency by the TF-IDF algorithm, where the smaller the product of the word frequency and the inverse document frequency, the smaller the weight value and the lower the importance degree, and the larger the product, the larger the weight value and the higher the importance degree.
In the scheme, words in the target text are converted into Word vectors with the same dimension by using a Word2vec model, and weights of the words in the target text can be calculated, so that the target text can be represented by using the Word vectors and the weight values. The Word vector obtained by the Word2Vec model is weighted and expressed by using a TF-IDF algorithm, and then TF-IDF is used as weight for weight distribution in WMD, namely, a TF-IDF & Word2Vec text expression model is used as input of a WMD distance similarity algorithm, so that the difference of words is objectively reflected, the excessive influence of high-frequency words on the document is avoided, and the accuracy of text clustering is further improved.
The inverse document frequency (IDF) is a measure of the general importance of a word; its value is inversely proportional to how common the word is. It is calculated by dividing the total number of documents in the corpus by the number of documents containing the word, and taking the logarithm of the resulting quotient.
The gensim library for Python is a powerful NLP toolkit; it includes both a Word2vec model and a TF-IDF model, which can be called directly in a program to train and load models as required. The initial formula of the TF-IDF algorithm can be written

$$W(f,m)=TF(f,m)\times\log\frac{N}{n_i}$$

where f denotes a word, m denotes a target text, W(f,m) denotes the weight value of word f in target text m, TF(f,m) denotes the number of times word f appears in target text m, N denotes the total number of target texts, and n_i denotes the number of target texts containing word f.
To further ensure that the word weight values obtained with the TF-IDF algorithm are accurate, the weight value of the word is determined according to the target formula

$$W(f,m)=\frac{TF(f,m)\times\log\frac{N}{n_i}}{\sqrt{\sum_{f\in m}\left[TF(f,m)\times\log\frac{N}{n_i}\right]^{2}}}$$

where f denotes a word, m denotes a target text, W(f,m) denotes the weight value of word f in target text m, TF(f,m) denotes the number of times word f appears in target text m, N denotes the total number of target texts, and n_i denotes the number of target texts containing word f.
In this scheme, normalization is applied on top of the initial TF-IDF formula to obtain the target formula, so that using the target formula ensures more accurate word weight values.
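As an illustration, a small Python sketch of the target formula follows, assuming `texts` is the list of segmented target texts produced by the preprocessing step; the helper names are ours, not the patent's.

```python
import math

def idf(f, texts):
    # log(N / n_i): N target texts in total, n_i of them containing word f.
    return math.log(len(texts) / sum(1 for t in texts if f in t))

def tfidf_weight(f, m, texts):
    raw = m.count(f) * idf(f, texts)            # TF(f, m) * log(N / n_i)
    # Normalize by the L2 norm of the raw weights of all words in text m.
    norm = math.sqrt(sum((m.count(g) * idf(g, texts)) ** 2 for g in set(m)))
    return raw / norm
```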
The TF-IDF & Word2Vec model described above represents text as follows: the input is the original text M = {m_1, m_2, m_3, ..., m_n} and the output is the set of target text vectors. The specific steps are:

Step 1: input the original texts and preprocess them, so that each text becomes a set of words, i.e. m_i = {f_i1, f_i2, f_i3, ..., f_ij};

Step 2: train a Word2vec model on the processed text set; for each feature word f_ij this yields an N-dimensional word vector V(f_ij);

Step 3: obtain the weight W(f_ij) of each word in the corpus using the TF-IDF formula;

Step 4: from the results of Step 2 and Step 3, each text vector is denoted m_i = {(V(f_i1), W(f_i1)), (V(f_i2), W(f_i2)), ...}.
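A minimal gensim-based sketch of Steps 1–4 follows; the toy corpus, vector size and window are illustrative only, and gensim's built-in TF-IDF weighting is used as a stand-in for the target formula above.

```python
from gensim.models import TfidfModel, Word2Vec
from gensim.corpora import Dictionary

texts = [["银行", "存款", "利率"], ["文本", "聚类", "算法"]]  # toy preprocessed corpus

w2v = Word2Vec(sentences=texts, vector_size=100, window=5, min_count=1)
dictionary = Dictionary(texts)
tfidf = TfidfModel(dictionary=dictionary)

def vectorize(text):
    # Pair each word's vector V(f_ij) with its TF-IDF weight W(f_ij).
    weights = dict(tfidf[dictionary.doc2bow(text)])
    return [(w2v.wv[w], weights.get(dictionary.token2id[w], 0.0)) for w in text]

m1 = vectorize(texts[0])  # [(V, W), (V, W), ...] as in Step 4
```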
The function of text clustering is to aggregate disordered original data into different clusters: texts within the same cluster are close to one another, i.e. highly similar, while texts in different clusters have very low similarity. The quality of the similarity measure therefore matters greatly for the quality of the clustering result. Common similarity measures include the cosine distance and the Euclidean distance; both are introduced below to explain the advantages of the WMD algorithm adopted in this scheme.
Assume there are two words W_1 and W_2, and that training a Word2vec model yields two p-dimensional word vectors V(W_1) = (x_1, x_2, x_3, ..., x_p)^T and V(W_2) = (y_1, y_2, y_3, ..., y_p)^T. Two common ways of measuring the distance between word vectors are:

(1) Cosine distance, measured by the cosine of the angle between the vectors; the first formula is

$$\cos\theta=\frac{\sum_{i=1}^{p}x_i y_i}{\sqrt{\sum_{i=1}^{p}x_i^{2}}\,\sqrt{\sum_{i=1}^{p}y_i^{2}}}$$

(2) Euclidean distance, the absolute distance between two points in a coordinate system, used to measure the difference between two vectors; the second formula is

$$d=\sqrt{\sum_{i=1}^{p}(x_i-y_i)^{2}}$$
To better illustrate the difference between the two similarity measures, Fig. 4 shows the Euclidean distance and cosine distance between points P and Q in three-dimensional space.
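A small numpy sketch of the two measures follows; note that "cosine distance" here, as in the first formula, is the cosine of the angle itself.

```python
import numpy as np

def cosine(v1, v2):
    # First formula: cosine of the angle between the two vectors.
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def euclidean(v1, v2):
    # Second formula: absolute distance between the two points.
    return float(np.linalg.norm(np.asarray(v1) - np.asarray(v2)))
```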
A document (also called a text) is a collection of words; obtaining word vectors is only an intermediate step, and representing the document with them is the final goal. Common approaches include averaging all the word vectors of the document or clustering the word vectors — using the average of the word vectors in place of the whole document and then computing document distances the way word-vector distances are computed. Although this accounts for the document's semantics to some extent, different words contribute differently to the document, so replacing the document vector with an average of word vectors ignores the effect of individual words on the document itself. Consequently, whether the Euclidean distance or the cosine distance is used for text similarity measurement, the influence of individual words on the document is ignored, the calculation result is inaccurate, and the subsequent clustering effect suffers. This scheme therefore uses the Word Mover's Distance (WMD) text similarity measure, which considers the contribution of all words to the text, improving calculation accuracy and hence text clustering accuracy.
In its similarity calculation, the Word Mover's Distance considers the contribution of every word in a document; it can be understood simply as the minimum total distance required to move all words of one document onto all words of another. Its theoretical basis is the Earth Mover's Distance (EMD), also known as the bulldozer distance. EMD is widely used in speech signal processing and image processing; Kusner et al. applied it to the NLP field, and experiments showed the algorithm to be quite effective. Word vectors trained by Word2vec obey additive relations, and building on this, when calculating the distance between two documents P1 and P2, the basic idea of WMD is that for any word w_i in document P1, the word w'_j in document P2 with the minimum distance to w_i can be found — i.e. moving w_i to w'_j has the smallest distance, or equivalently the smallest transfer cost. The distance between the two documents is then the total distance of transferring all words of P1 onto all words of P2, also called the total cost. This document distance is computed like EMD, directly and without hyperparameters. Suppose the WMD distance between documents P1 and P2 is to be calculated; the model is shown in Fig. 5, and the algorithm proceeds as follows:
Assume the Word2vec-trained word vectors are m-dimensional and the corpus dictionary contains n words, so the word-vector matrix is X ∈ R^{m×n}. Let t_{w(i)} be the number of occurrences of word w(i) in document P; the third formula, giving the weight of word w(i) in P, is

$$D_i=\frac{t_{w(i)}}{\sum_{j=1}^{n}t_{w(j)}}$$

From the third formula it can be seen that the weights of feature words (i.e. words) in the WMD algorithm are computed from word frequencies alone.
Assume documents P1 and P2 are represented by a Word2vec model. The distance between the i-th word in P1 and the j-th word in P2 — the transfer cost mentioned above — is calculated with the Euclidean distance; the fourth formula is

$$c(i,j)=\lVert x_i-x_j\rVert_2$$
Define a transfer matrix T ∈ R^{n×n}, where T_{ij} denotes the weight allocation from word i in P1 to word j in P2. The smaller the distance between word i and word j, the more similar the two words and the greater the weight transferred from i to j. The total transfer cost from P1 to P2 can be expressed as the fifth formula

$$\sum_{i,j=1}^{n}T_{ij}\,c(i,j)$$

The WMD of the documents is the minimum of the total transfer cost between the words of the two documents; that is, the distance calculation becomes a linear programming problem minimizing the fifth formula. Two constraints are imposed to avoid the extreme case where one word in P1 corresponds to all words of P2, where D_i denotes the weight of word i in P1 and D'_j the weight of word j in P2. The sixth formula, the WMD calculation, is

$$\mathrm{WMD}(P1,P2)=\min_{T\ge 0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j)\qquad\text{s.t.}\quad\sum_{j=1}^{n}T_{ij}=D_i,\quad\sum_{i=1}^{n}T_{ij}=D'_j$$
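As an illustration, the sixth-formula linear program can be solved with scipy; this is a minimal sketch under the assumption that d1 and d2 are the word-weight vectors of the two documents and C is the pairwise transfer-cost matrix c(i, j).

```python
import numpy as np
from scipy.optimize import linprog

def wmd(d1, d2, C):
    """Minimize sum_ij T_ij * c(i, j) subject to the row/column constraints."""
    n1, n2 = len(d1), len(d2)
    c = np.asarray(C).reshape(-1)            # flatten T row-major: index i*n2 + j
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):                      # sum_j T_ij = d1[i]
        A_eq[i, i * n2:(i + 1) * n2] = 1.0
    for j in range(n2):                      # sum_i T_ij = d2[j]
        A_eq[n1 + j, j::n2] = 1.0
    b_eq = np.concatenate([d1, d2])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```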
Assume the Word2vec-trained word vectors are d-dimensional. After the words w in document D and the words w' in document D' are represented by the Word2vec model, the word vectors are V(w_i), V(w'_j) ∈ R^d. Transforming the WMD distance of documents D and D' according to the sixth formula gives the seventh formula

$$\mathrm{WMD}(D,D')=\min_{T\ge 0}\sum_{i,j}T_{ij}\,c(i,j)\qquad\text{s.t.}\quad\sum_{j}T_{ij}=D_i,\quad\sum_{i}T_{ij}=D'_j$$

where c(i,j) = ||V(w_i) − V(w'_j)||_2. However, the seventh formula has the following shortcomings:
(1) Every word in document D must have its transfer cost computed against all words in document D', after which weights are allocated according to those costs, so the computational complexity is high.

(2) When constraining the transfer amounts, WMD scales word importance by word frequency alone; it is therefore difficult to suppress the influence of high-frequency words on the document, and keyword factors are not considered.
To further improve the efficiency and accuracy of the WMD algorithm, the words may be divided into a related set and an unrelated set according to their distribution in the embedding space. In some embodiments, determining the similarity between any two target texts by taking the word vectors and weight values as the input of the WMD algorithm includes: performing TF-IDF & Word2vec vectorized representation on a first target text according to a plurality of first word vectors in the first target text and the first TF-IDF values corresponding to those word vectors, to obtain a first target set comprising a plurality of first target word vectors and weight values, where the first word vectors and first TF-IDF values serve as input of the WMD algorithm; performing TF-IDF & Word2vec vectorized representation on a second target text according to a plurality of second word vectors in the second target text and the second TF-IDF values corresponding to those word vectors, to obtain a second target set comprising a plurality of second target word vectors and weight values, where the second word vectors and second TF-IDF values serve as input of the WMD algorithm; and, in calculating the similarity between the first target text and the second target text using the WMD algorithm, calculating the cosine distance between a first target word vector and a second target word vector as the transfer cost, performing weight allocation using the TF-IDF values, and taking the minimum of the sum of products of transfer costs and allocated weights as the distance between the two texts, this distance being their similarity.

In this scheme, the similarity between texts is calculated by first computing the transfer costs between word vectors and then combining the transfer costs with the weights, which improves the algorithm's operating efficiency. TF-IDF values are used as the weights for weight allocation in WMD — that is, the TF-IDF & Word2Vec text representation model serves as the input of the WMD algorithm and the texts are vectorized with TF-IDF & Word2Vec — so the differences between words are reflected objectively and excessive influence of high-frequency words on the document is avoided.
In the Word2vec model, semantically similar words lie close together in the embedding space, i.e. their word vectors are similar. Semantically, for a given word w, only a few of the tens of thousands of words in the corpus are closely related — spatially near w — while the rest are comparatively far away, so the words can be partitioned according to this distance relationship. In a specific implementation, in calculating the similarity between the first target text and the second target text using the WMD algorithm, calculating the cosine distance between the first target word vector and the second target word vector as the transfer cost, performing weight allocation using the TF-IDF values, and taking the minimum of the sum of products of transfer costs and allocated weights as the distance between the two texts, the method further includes the following steps: obtain a corpus comprising a plurality of dictionary words; select dictionary words from the corpus in turn as the central dictionary word; calculate the cosine distance between each non-central dictionary word in the corpus and the central dictionary word, the cosine distance being the cosine of the angle between their word vectors; and store the non-central dictionary words whose cosine distances fall within a target range into an unrelated set of the central dictionary word, and those whose cosine distances fall outside the target range into a related set of the central dictionary word.

In this scheme, the non-central dictionary words corresponding to each central dictionary word can first be partitioned by cosine distance, so that once the non-central dictionary words are divided into different sets, the similarities between the non-central and central dictionary words in the different sets can be calculated subsequently.
For example, if words are ordered by their word-vector distance, the words "region A", "city B" and "Chinese" should rank before "guitar" and "water cup": "guitar" and "water cup" are farther from "region A" because they have no obvious semantic relationship to it. Six words — "world", "weekend", "region A", "family", "culture" and "entertainment" — were selected at random as central dictionary words, and the cosine distances between each of them and the non-central dictionary words in the corpus were calculated in the embedding space, i.e. the word transfer costs c(i, j) in WMD. The results, sorted from small to large, are shown in Fig. 6. For these six central dictionary words, only a small fraction of words lie close by; the distances of most words from the central dictionary word fall in the interval [0.6, 0.8], with the inflection point as the boundary.
To avoid chance, 1000 central dictionary words are randomly chosen to calculate their cosine distances from non-central dictionary words in vector space, i.e., the transition costs c (i, j) mentioned above. Fig. 7 shows the distance distribution between words, and it can be seen that the distances of the remaining words from the central dictionary word are highly concentrated in the vicinity of the interval [0.6,0.8], and behave similarly to a normal distribution.
To further improve the efficiency and accuracy of the WMD algorithm, the words may be divided into a related set and an unrelated set according to their distribution in the embedding space. In some embodiments, in calculating the similarity between the first target text and the second target text using the WMD algorithm, calculating the cosine distance between the first target word vector and the second target word vector as the transfer cost, and before performing weight allocation using the TF-IDF values, the method further includes: when some of the second target word vectors in the second target set fall in the related set of the first target word vector and others fall in its unrelated set, calculating the cosine distances of the second target word vectors in the related set as the transfer costs for their weight allocation, and performing weight allocation using the TF-IDF values; calculating a second average value of the distances from the second target word vectors in the unrelated set to the first target word vector as the transfer cost for their weight allocation, and performing weight allocation using the TF-IDF values; and calculating the sum of first data and second data to obtain the WMD distance between the first target text and the second target text, where the first data is the minimum of the sum of products of the cosine distances and the weights of the second target word vectors, and the second data is the product of the second average value and the weights of the second target word vectors.
With this method, the similarity between texts can be calculated based on the WMD algorithm: by determining the distribution of words in the embedding space, the words are divided into a related set and an unrelated set, and the WMD distance is computed over the two sets together. This greatly reduces the number of distance calculations and weight allocations between words, shortens the algorithm's computation time, improves its operating efficiency, and thus improves the efficiency of text clustering.
Specifically, based on the data obtained in fig. 6 and fig. 7, in order to simplify the WMD calculation process, the present solution designs a new WMD algorithm, which is specifically described as follows:
given a word, in the corpus dictionary, the remaining words, except word wCan be divided into two groups RE (w) and URE (w). RE (w) includes words related to word w, URE (w) includes words unrelated to w; averaging the w distance to each word in URE (w), using C avg To represent. The distances from w to all words in the uncorrelated set are approximately used for C avg Instead of.
As can be seen from fig. 6, the number of words in RE (w) should be smaller than the number corresponding to the inflection point, and the number of words in URE (w) should be larger than the number corresponding to the inflection point, with the number corresponding to the inflection point as a limit. From fig. 7, it can be determined that the distances from w to the words in the irrelevant set are approximately normally distributed, so it is reasonable to use the distance average in the irrelevant set to represent the word transfer cost in this scheme.
The parameter r(w) determines the number of words in the related set RE(w) of word w: the remaining words are sorted in ascending order of their distance to w in the embedding space, the first r(w) words are placed in the related set RE(w), and the rest in URE(w). For convenience in what follows, the modified WMD algorithm is named RE-WMD. When RE-WMD calculates the distance between documents D and D', the word transfer cost is

$$c'(i,j)=\begin{cases}c(i,j), & (w_i,w'_j)\in C\\ C_{avg}, & \text{otherwise}\end{cases}$$
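A sketch of building RE(w), URE(w) and C_avg follows, reusing the cosine helper and the gensim model from the sketches above; it assumes "cosine distance" is taken as 1 − cosine similarity so that smaller means more related, and uses an illustrative r(w) = 50 — both are assumptions, not values from the patent.

```python
import numpy as np

def split_related(w, vocab, w2v, r_w=50):
    # Distance from w to every other dictionary word (1 - cosine similarity).
    others = [v for v in vocab if v != w]
    dist = {v: 1.0 - cosine(w2v.wv[w], w2v.wv[v]) for v in others}
    ranked = sorted(others, key=dist.get)           # ascending distance to w
    RE, URE = set(ranked[:r_w]), set(ranked[r_w:])  # related / unrelated sets
    C_avg = float(np.mean([dist[v] for v in URE]))  # average unrelated distance
    return RE, URE, C_avg
```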
where C is the collection of all words' related sets RE(w), i.e. it contains the related words of every word. The scheme above divides the remaining words into a related set and an unrelated set according to their distance to any word w in the corpus; for the transfer cost between w and a word in the unrelated set, the average C_avg of all such distances is used instead. Applying this to WMD improves the algorithm's efficiency. The key idea of RE-WMD is that, while calculating document similarity, it first judges whether the word pair w and w' to be calculated exists in the related set C: if so, the cosine distance is calculated as the transfer cost; otherwise C_avg is used directly as the transfer cost between the two words. Combining Figs. 6 and 7, only some word pairs require cosine-distance calculation, so a large amount of computation is avoided and the algorithm's efficiency improves. The RE-WMD calculation formula is therefore

$$\mathrm{RE\text{-}WMD}(D,D')=\min_{T\ge 0}\sum_{i,j}T_{ij}\,c'(i,j)\qquad\text{s.t.}\quad\sum_{j}T_{ij}=D_i,\quad\sum_{i}T_{ij}=D'_j$$
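Under the same assumptions, a minimal RE-WMD sketch follows; it swaps the approximated costs c'(i, j) into the linear program from the earlier WMD sketch. Here `RE` is a dict mapping each word to its related set (e.g. built by calling split_related for every dictionary word) — an assumed data layout for illustration.

```python
import numpy as np

def re_wmd(words1, words2, d1, d2, RE, w2v, C_avg):
    # c'(i, j): cosine distance for related pairs, C_avg for all others.
    C = np.array([[1.0 - cosine(w2v.wv[wi], w2v.wv[wj]) if wj in RE[wi] else C_avg
                   for wj in words2] for wi in words1])
    return wmd(d1, d2, C)  # reuse the linear program from the WMD sketch
```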
specifically, the improved word shift distance WMD algorithm replaces the traditional distance similarity measurement algorithm to serve as a clustering standard. Whether the Euclidean distance or the cosine distance is used for text similarity measurement, the influence of a single word on a document is ignored, so that a calculation result is inaccurate, and a subsequent clustering effect is influenced. Therefore, the word shift distance WMD text similarity measurement algorithm is used in the method, and the calculation accuracy is improved.
In an embodiment of the present application, clustering the target texts according to the similarity includes: when the first clustering is performed with one target text, taking that target text as the first text cluster; when the N-th clustering is performed with the N-th target text, comparing the similarity of the N-th target text with each target text in the formed text clusters, where N ≥ 2; and classifying the N-th target text into a text cluster when the similarity is greater than or equal to a similarity threshold, or creating a new text cluster from the N-th target text when the similarity is less than the threshold.

In this scheme, since no text cluster exists at the first clustering, the first target text is taken as the first text cluster, and later clusters likewise each begin from a single target text. When the N-th target text is processed, it is classified into the text cluster with the greatest similarity, provided that similarity exceeds the similarity threshold; if the similarity is lower, the N-th target text founds a new text cluster. Because the similarity is calculated with the Word2vec & TF-IDF-WMD algorithm, the obtained similarity is more accurate, which ensures higher clustering accuracy in this embodiment.
Specifically, a Single-Pass incremental clustering algorithm can be used to cluster the target texts. Single-Pass, also called the single-channel algorithm, is widely applied in text clustering and is an efficient, simple, unsupervised clustering algorithm. As shown in Fig. 8, its flow is as follows:

The clustering algorithm starts by obtaining the n preprocessed target texts, which form a corpus M = {m_1, m_2, m_3, ..., m_n}. When clustering begins, no text cluster has been generated yet, so the algorithm first checks whether any text cluster exists. If not, the first target text m_1 is taken as the first text cluster — i.e. a text cluster is created with m_1 as its center. If text clusters exist, then when the i-th text is processed, m_i is compared for similarity with each target text in the formed clusters, i.e. the similarity between m_i and each m_j is calculated; among the previous i−1 target texts, the text m_k with the maximum similarity is found, and it is determined whether this similarity exceeds the similarity threshold T_C. If it is greater than the threshold, m_i is classified into the corresponding text cluster; otherwise a new text cluster is created. This process loops until all target texts have participated in clustering; when no unprocessed text remains in the corpus, clustering ends.
The embodiment of the application also provides a text clustering device, which can be used to execute the text clustering method. The device implements the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The following describes a text clustering device provided in an embodiment of the present application.
Fig. 9 is a block diagram of a text clustering device according to an embodiment of the present application. As shown in fig. 9, the apparatus includes:
a first obtaining unit 10, configured to obtain a plurality of original texts and preprocess each original text to obtain a plurality of target texts, where the preprocessing includes at least one of the following: word segmentation processing and stop-word removal processing, and where each target text includes a plurality of words;
the first processing unit 20, configured to convert the words in each target text into word vectors using a Word2vec model, and determine weight values of the words using a TF-IDF algorithm, where a weight value represents the importance of a word in the target text;
the second processing unit 30 is configured to determine a similarity between any two of the target texts by using the word vector and the weight value as input of the WMD algorithm, and cluster the target texts according to the similarity.
According to this embodiment, word vectors are represented by combining the Word2vec model with the TF-IDF algorithm, which enhances the distinction between different texts: the advantages of word vectors are retained while the influence of each word on the text is added. The resulting word vectors are used as the input of the WMD algorithm, which serves as the similarity measure in text clustering, improving the accuracy of text clustering.
After the original texts are preprocessed, the data set takes the form of word sequences, i.e., the target texts. To apply this set of structured target texts to subsequent cluster analysis, the texts must be vectorized. In a specific implementation, the first processing unit includes a building module, a first processing module, an acquisition module, and a first determination module. The building module is used to build the Word2vec model, which is obtained by training with multiple sets of training data, each set including historical words and the historical word vectors corresponding to those words, with different sets acquired in different historical time periods. The first processing module is used to input the words into the Word2vec model and obtain, as the model's output, the word vectors corresponding to the words. The acquisition module is used to acquire the word frequency and inverse document frequency of each word in the target text, where the inverse document frequency represents the importance of the word. The first determination module is configured to determine the weight value of the word from the word frequency and the inverse document frequency using the TF-IDF algorithm: the smaller the product of word frequency and inverse document frequency, the smaller the weight value and the lower the importance; the larger the product, the larger the weight value and the higher the importance.
In this scheme, the words in the target text are converted by the Word2vec model into word vectors of the same dimension, and the weight of each word in the target text is calculated, so the target text can be represented by the word vectors and weight values. The word vectors obtained from the Word2vec model are weighted with the TF-IDF algorithm, and TF-IDF then serves as the weight for weight distribution in WMD, i.e., the TF-IDF & Word2vec text representation model is used as the input of the WMD distance similarity algorithm. This objectively reflects the differences between words, prevents high-frequency words from exerting excessive influence on the document, and further improves the accuracy of text clustering.
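As a sketch of how such a representation might be assembled with off-the-shelf tools, the snippet below trains a small Word2vec model with gensim and computes TF-IDF weights with scikit-learn; the toy corpus, hyperparameters, and helper names are assumptions for illustration, not the patent's trained models.

```python
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-tokenised target texts (word segmentation and stop-word removal done).
docs = [["open", "account", "online"],
        ["reset", "card", "password"],
        ["open", "card", "account"]]

# Word2vec maps every word to a vector of the same dimension.
w2v = Word2Vec(sentences=docs, vector_size=100, min_count=1, seed=1)

# TF-IDF over the same corpus; the documents are already tokenised.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_

def represent(doc_index, doc):
    """Return (word vector, TF-IDF weight) pairs for one target text."""
    return [(w2v.wv[word], matrix[doc_index, vocab[word]])
            for word in set(doc)]

pairs = represent(0, docs[0])  # representation fed to the WMD stage
```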
In order to further ensure that the weight value of the word obtained with the TF-IDF algorithm is accurate, the first processing unit includes a second determining module, which determines the weight value according to the target formula

$$W(f,m) = \frac{TF(f,m)\cdot\log\left(\frac{N}{N_i}+0.01\right)}{\sqrt{\sum_{f\in m}\left[TF(f,m)\cdot\log\left(\frac{N}{N_i}+0.01\right)\right]^{2}}}$$

where f represents the word, m represents the target text, W(f, m) represents the weight value of the word f in the target text m, TF(f, m) represents the number of times the word f appears in the target text m, N represents the total number of target texts, and N_i represents the number of target texts containing the word f.
In this scheme, normalization is performed on the basis of the initial TF-IDF formula to obtain the target formula, so adopting the target formula ensures that the obtained weight values are more accurate.
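A direct transcription of the reconstructed target formula is sketched below; the smoothing constant and the choice to sum over the distinct words of m are assumptions carried over from the reconstruction above.

```python
import math

def weight(word, doc, docs):
    """W(f, m): normalised TF-IDF weight of `word` in target text `doc`."""
    N = len(docs)                                   # total number of target texts
    def tfidf(f):
        tf = doc.count(f)                           # TF(f, m)
        n_f = sum(1 for d in docs if f in d)        # N_i: texts containing f
        return tf * math.log(N / n_f + 0.01)        # smoothed IDF (assumed constant)
    denom = math.sqrt(sum(tfidf(f) ** 2 for f in set(doc)))
    return tfidf(word) / denom if denom else 0.0
```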
In order to further improve the efficiency and accuracy of the WMD algorithm, the words may be divided into a related set and an unrelated set according to their distribution in the embedding space. In some embodiments, the second processing unit includes a second processing module, a third processing module, and a fourth processing module. The second processing module is configured to build a TF-IDF & Word2vec vectorized representation of the first target text from the plurality of first word vectors in the first target text and the first TF-IDF values corresponding to the first word vectors, obtaining a first target set that includes a plurality of first target word vectors and weight values; the first word vectors and first TF-IDF values are used as inputs of the WMD algorithm. The third processing module is configured to build a TF-IDF & Word2vec vectorized representation of the second target text from the plurality of second word vectors in the second target text and the second TF-IDF values corresponding to the second word vectors, obtaining a second target set that includes a plurality of second target word vectors and weight values; the second word vectors and second TF-IDF values are likewise used as inputs of the WMD algorithm. The fourth processing module is configured, in the process of calculating the similarity between the first target text and the second target text using the WMD algorithm, to calculate the cosine distance between the first target word vector and the second target word vector as the transfer cost, perform weight distribution using the TF-IDF values, and take the minimum of the sum of products of transfer costs and distributed weights as the distance between the first target text and the second target text, where the distance represents the similarity between the two texts.
In this scheme, the transfer costs between word vectors are calculated first, and the distance-based similarity between texts is then computed from the transfer costs and weights, which improves the operating efficiency of the algorithm. TF-IDF is used as the weight for weight distribution in WMD, i.e., the TF-IDF & Word2vec text representation model serves as the input of the WMD algorithm and the texts are vectorized with TF-IDF & Word2vec, which objectively reflects the differences between words and prevents high-frequency words from exerting excessive influence on the document.
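The full WMD requires solving an optimal-transport problem; as an illustrative stand-in, the sketch below computes the widely used "relaxed" WMD lower bound, in which each word's whole weight moves to its cheapest counterpart. The cosine transfer cost and TF-IDF weighting follow the scheme above, while the relaxation itself is a simplification, not the patent's exact computation.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def relaxed_wmd(vecs_a, weights_a, vecs_b, weights_b):
    """Lower bound on the WMD distance between two target texts."""
    wa = np.asarray(weights_a, dtype=float)
    wb = np.asarray(weights_b, dtype=float)
    wa, wb = wa / wa.sum(), wb / wb.sum()           # normalise TF-IDF weights
    # Transfer cost: cosine distance between every pair of word vectors.
    cost = np.array([[cosine_distance(u, v) for v in vecs_b] for u in vecs_a])
    # Each word's weight moves entirely to the closest word on the other side.
    a_to_b = float(np.sum(wa * cost.min(axis=1)))
    b_to_a = float(np.sum(wb * cost.min(axis=0)))
    return max(a_to_b, b_to_a)                      # the tighter of the two bounds
```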
In the Word2vec model, semantically similar words lie close together in the embedded space, i.e., their word vectors are similar. Semantically, for a given word w, only a few of the many thousands of words in the corpus are closely related to it, i.e., only those few lie spatially close to w while the rest lie far away, so the words can be divided according to this distance relationship. In a specific implementation, the device further includes a second obtaining unit, a selecting unit, a calculating unit, and a storage unit. The second obtaining unit is configured to obtain the corpus before the step of calculating the cosine distance between the first target word vector and the second target word vector as the transfer cost, distributing weights using the TF-IDF values, and taking the minimum of the sum of products of transfer costs and distributed weights as the distance between the first target text and the second target text in the WMD similarity calculation; the corpus includes a plurality of dictionary words. The selecting unit is used to select, in turn, one dictionary word from the corpus as the central dictionary word. The calculating unit is used to calculate the cosine distance between each non-central dictionary word in the corpus and the central dictionary word, where the cosine distance is the cosine of the angle between the word vector of the non-central dictionary word and the word vector of the central dictionary word. The storage unit is used to store the non-central dictionary words whose cosine distances fall within a target range into the unrelated set of the central dictionary word, and the non-central dictionary words whose cosine distances fall outside the target range into the related set of the central dictionary word.
In this scheme, the non-central dictionary words corresponding to each central dictionary word can be divided in advance according to cosine distance, so that once the non-central dictionary words have been assigned to the different sets, the similarity between the non-central dictionary words and the central dictionary word in each set can be calculated subsequently.
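A sketch of this pre-partitioning step is given below, assuming `wv` is a mapping from words to vectors (e.g. gensim KeyedVectors); the boundary of the target range is an illustrative assumption.

```python
import numpy as np

def partition(center, vocabulary, wv, target_range=(-1.0, 0.3)):
    """Split vocabulary into the center word's related / unrelated sets."""
    lo, hi = target_range
    related, unrelated = set(), set()
    c = wv[center]
    for word in vocabulary:
        if word == center:
            continue
        v = wv[word]
        cos = float(np.dot(c, v) / (np.linalg.norm(c) * np.linalg.norm(v)))
        # Convention from the scheme above: cosine values inside the target
        # range mark a word as unrelated to the center word.
        (unrelated if lo <= cos <= hi else related).add(word)
    return related, unrelated
```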
In order to further improve the efficiency and accuracy of the WMD algorithm, the words may be divided into a related set and an unrelated set according to their distribution in the embedding space. In some embodiments, the device further includes a third processing unit, which operates, in the process of calculating the similarity between the first target text and the second target text using the WMD algorithm, before the cosine distance between the first target word vector and the second target word vector is calculated as the transfer cost and weights are distributed using the TF-IDF values. When some of the second target word vectors in the second target set fall in the related set of the first target word vectors and some fall in the unrelated set of the first target word vectors: the cosine distances of the second target word vectors in the related set are calculated as the transfer costs for weight distribution of those vectors, and weights are distributed using the TF-IDF values; a second average value of the distances from the second target word vectors in the unrelated set to the first target word vectors is calculated as the transfer cost for weight distribution of those vectors, and weights are distributed using the TF-IDF values; and the sum of first data and second data is calculated to obtain the WMD distance between the first target text and the second target text, where the first data is the minimum of the sum of the products of the cosine distances and the weights of the second target word vectors, and the second data is the product of the second average value and the weights of the second target word vectors.
According to this method, the similarity between texts can be calculated on the basis of the WMD algorithm: by determining the distribution of the words in the embedded space, the words are divided into a related set and an unrelated set, and the WMD distance is calculated over both sets together. This greatly reduces the number of distance calculations and weight distributions between words, shortens the calculation time of the algorithm, improves its operating efficiency, and thereby improves the efficiency of text clustering.
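The following sketch shows the accelerated distance suggested by this scheme: exact cosine transfer costs for word pairs in the related set, and one shared average cost (the "second average value") for pairs in the unrelated set. The mask and average are assumed precomputed from the partition above; all names are illustrative.

```python
import numpy as np

def fast_wmd(vecs_a, vecs_b, weights_b, related_mask, avg_unrelated):
    """Approximate WMD of text B against text A using the set partition.

    related_mask[j] is True when word j of text B lies in the related set
    of text A's words; avg_unrelated is the precomputed mean distance for
    unrelated pairs."""
    total = 0.0
    for j, (v, w) in enumerate(zip(vecs_b, weights_b)):
        if related_mask[j]:
            # Exact transfer cost: cheapest cosine distance to text A's words.
            costs = [1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
                     for u in vecs_a]
            total += w * min(costs)      # contributes to the "first data"
        else:
            total += w * avg_unrelated   # contributes to the "second data"
    return total
```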
In one embodiment of the present application, the second processing unit includes a fifth processing module, a sixth processing module, and a seventh processing module. The fifth processing module is configured to take one target text as the first text cluster when the first clustering is performed with that target text; the sixth processing module is configured to compare the similarity of the Nth target text with each target text in the formed text clusters when the Nth clustering is performed with the Nth target text, where N is greater than or equal to 2; and the seventh processing module is configured to classify the Nth target text into a text cluster when the similarity is greater than or equal to a similarity threshold, and to create a new text cluster from the Nth target text when the similarity is less than the similarity threshold.
In this scheme, no text cluster exists at the time of the first clustering, so the first target text serves as the first text cluster, and each later text cluster is likewise seeded by a single target text. When the Nth target text is processed, it is classified into the text cluster with the highest similarity, provided that similarity exceeds the similarity threshold; if the similarity is lower, the Nth target text becomes a new text cluster. Because the similarity is calculated with the Word2vec & TF-IDF-WMD algorithm, the obtained similarity is more accurate, which ensures higher clustering accuracy in this embodiment.
The text clustering device includes a processor and a memory. The first obtaining unit, the first processing unit, the second processing unit, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions. The above modules may all be located in the same processor; alternatively, they may be distributed across different processors in any combination.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels can be provided, and the texts can be clustered accurately by adjusting kernel parameters.
The memory may include forms found in computer-readable media such as volatile memory, e.g., random access memory (RAM), and/or nonvolatile memory, e.g., read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a computer-readable storage medium including a stored program, wherein the program, when run, controls the device on which the computer-readable storage medium is located to execute the text clustering method.
The embodiment of the invention provides a processor for running a program, wherein the text clustering method is executed when the program runs.
The embodiment of the invention provides equipment including a processor, a memory, and a program stored on the memory and runnable on the processor, wherein the processor implements the steps of the text clustering method when executing the program. The device here may be a server, a PC, a PAD, a mobile phone, etc.
The present application also provides a computer program product adapted, when executed on a data processing device, to execute a program that performs at least the steps of the text clustering method.
The application also provides a text clustering system including one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for executing the text clustering method of any of the above embodiments.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices; in some cases, the steps shown or described may be performed in a different order than shown or described herein; or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
From the above description, it can be seen that the above embodiments of the present application achieve the following technical effects:
1) According to the text clustering method of the above embodiments, the Word2vec model is combined with the TF-IDF algorithm to represent word vectors, which enhances the distinction between different texts: the advantages of word vectors are retained while the influence of each word on the text is added. The resulting word vectors are used as the input of the WMD algorithm, which serves as the similarity measurement algorithm in text clustering, improving the accuracy of text clustering.
2) According to the text clustering device of the above embodiments, the Word2vec model is likewise combined with the TF-IDF algorithm to represent word vectors, enhancing the distinction between different texts; the advantages of word vectors are retained, the influence of each word on the text is added, the resulting word vectors are used as the input of the WMD algorithm, and the WMD algorithm serves as the similarity measurement algorithm in text clustering, improving the accuracy of text clustering.
The foregoing description covers only the preferred embodiments of the present application and is not intended to limit it; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for clustering text, comprising:
obtaining a plurality of original texts, and preprocessing each original text to obtain a plurality of target texts, wherein the preprocessing includes at least one of the following: word segmentation processing and stop-word removal processing, and wherein the target text includes a plurality of words;
converting the words in each target text into Word vectors by using a Word2vec model, and determining weight values of the words by using a TF-IDF algorithm, wherein the weight values are the importance degrees of the words in the target text;
and taking the word vector and the weight value as the input of a WMD algorithm, determining the similarity between any two target texts, and clustering the target texts according to the similarity.
2. The method of claim 1, wherein converting the words in each of the target texts into Word vectors using a Word2vec model, and determining weight values for the words using a TF-IDF algorithm comprises:
constructing a Word2vec model, wherein the Word2vec model is obtained by training with a plurality of sets of training data, each set of training data including historical words and historical word vectors corresponding to the historical words, and different sets of training data are acquired in different historical time periods;
inputting the words into the Word2vec model to obtain, as the output of the Word2vec model, the word vectors corresponding to the words;
acquiring word frequency and inverse document frequency of the word in the target text, wherein the inverse document frequency is used for representing the importance of the word;
and determining the weight value of the word from the word frequency and the inverse document frequency using a TF-IDF algorithm, wherein the smaller the product of the word frequency and the inverse document frequency, the smaller the weight value of the word and the lower its importance, and the larger the product of the word frequency and the inverse document frequency, the larger the weight value of the word and the higher its importance.
3. The method of claim 1, wherein determining the weight value of the word using the TF-IDF algorithm further comprises:
according to the target formula

$$W(f,m) = \frac{TF(f,m)\cdot\log\left(\frac{N}{N_i}+0.01\right)}{\sqrt{\sum_{f\in m}\left[TF(f,m)\cdot\log\left(\frac{N}{N_i}+0.01\right)\right]^{2}}}$$

determining the weight value of the word, wherein f represents the word, m represents the target text, W(f, m) represents the weight value of the word f in the target text m, TF(f, m) represents the number of times the word f appears in the target text m, N represents the total number of target texts, and N_i represents the number of target texts containing the word f.
4. The method of claim 1, wherein determining the similarity between any two of the target texts using the word vectors and the weight values as inputs to a WMD algorithm comprises:
according to a plurality of first Word vectors in a first target text and first TF-IDF values corresponding to the first Word vectors, performing TF-IDF & Word2vec vectorization representation on the first target text to obtain a first target set, wherein the first target set comprises a plurality of first target Word vectors and weight values, and the first Word vectors and the first TF-IDF values are used as input of a WMD algorithm;
performing TF-IDF & Word2vec vectorization representation on a second target text according to a plurality of second Word vectors in the second target text and second TF-IDF values corresponding to the second Word vectors to obtain a second target set, wherein the second target set comprises a plurality of second target Word vectors and weight values, and the second Word vectors and the second TF-IDF values are used as input of the WMD algorithm;
and in the process of calculating the similarity between the first target text and the second target text using the WMD algorithm, calculating the cosine distance between the first target word vector and the second target word vector as the transfer cost, performing weight distribution using the TF-IDF values, and taking the minimum of the sum of products of transfer costs and distributed weights as the distance between the first target text and the second target text, wherein the distance represents the similarity between the first target text and the second target text.
5. The method of claim 4, wherein, before the cosine distance between the first target word vector and the second target word vector is calculated as the transfer cost, weights are distributed using the TF-IDF values, and the minimum of the sum of products of transfer costs and distributed weights is taken as the distance between the first target text and the second target text in the process of calculating the similarity of the two texts using the WMD algorithm, the method further comprises:
obtaining a corpus, wherein the corpus comprises a plurality of dictionary words;
sequentially selecting one dictionary word from the corpus as a central dictionary word;
respectively calculating cosine distances between non-central dictionary words and the central dictionary words in the corpus, wherein the cosine distances are cosine values of included angles between word vectors corresponding to the non-central dictionary words and word vectors corresponding to the central dictionary words in the corpus;
storing the non-central dictionary words whose cosine distances fall within a target range into an unrelated set of the central dictionary word, and storing the non-central dictionary words whose cosine distances fall outside the target range into a related set of the central dictionary word.
6. The method of claim 5, wherein, in the process of calculating the similarity between the first target text and the second target text using the WMD algorithm, before the cosine distance between the first target word vector and the second target word vector is calculated as the transfer cost and weights are distributed using the TF-IDF values, the method further comprises:
when some of the second target word vectors fall in the related set of the first target word vectors and some fall in the unrelated set of the first target word vectors: calculating the cosine distances of the second target word vectors in the related set as the transfer costs for weight distribution of those second target word vectors, and performing weight distribution using the TF-IDF values; calculating a second average value of the distances from the second target word vectors in the unrelated set to the first target word vectors as the transfer cost for weight distribution of those second target word vectors, and performing weight distribution using the TF-IDF values; and calculating the sum of first data and second data to obtain the WMD distance between the first target text and the second target text, wherein the first data is the minimum of the sum of the products of the plurality of cosine distances and the weights of the second target word vectors, and the second data is the product of the second average value and the weights of the second target word vectors.
7. The method of claim 1, wherein clustering the target text according to the similarity comprises:
taking one target text as a first text cluster under the condition that one target text is used for carrying out first clustering;
under the condition that the Nth clustering is performed with the Nth target text, comparing the similarity of the Nth target text with each target text in the formed text clusters, wherein N is greater than or equal to 2;
and classifying the Nth target text into the text cluster under the condition that the similarity is greater than or equal to a similarity threshold value, and creating a new text cluster according to the Nth target text under the condition that the similarity is less than the similarity threshold value.
8. A text clustering device, comprising:
the first acquisition unit, configured to acquire a plurality of original texts and preprocess each original text to obtain a plurality of target texts, wherein the preprocessing includes at least one of the following: word segmentation processing and stop-word removal processing, and wherein the target text includes a plurality of words;
the first processing unit is used for converting the words in each target text into Word vectors by using a Word2vec model, and determining weight values of the words by using a TF-IDF algorithm, wherein the weight values are the importance degrees of the words in the target text;
And the second processing unit is used for taking the word vector and the weight value as the input of a WMD algorithm, determining the similarity between any two target texts, and clustering the target texts according to the similarity.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, controls a device in which the computer-readable storage medium is located to perform the text clustering method according to any one of claims 1 to 7.
10. A text clustering system, comprising: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the text clustering method of any one of claims 1 to 7.
CN202310666521.XA 2023-06-06 2023-06-06 Text clustering method, text clustering device and text clustering system Pending CN116561319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310666521.XA CN116561319A (en) 2023-06-06 2023-06-06 Text clustering method, text clustering device and text clustering system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310666521.XA CN116561319A (en) 2023-06-06 2023-06-06 Text clustering method, text clustering device and text clustering system

Publications (1)

Publication Number Publication Date
CN116561319A true CN116561319A (en) 2023-08-08

Family

ID=87494691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310666521.XA Pending CN116561319A (en) 2023-06-06 2023-06-06 Text clustering method, text clustering device and text clustering system

Country Status (1)

Country Link
CN (1) CN116561319A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012979A (en) * 2024-04-10 2024-05-10 济南宝林信息技术有限公司 Intelligent acquisition and storage system for common surgical operation

Similar Documents

Publication Publication Date Title
CN109635150B (en) Text generation method, device and storage medium
Aliniya et al. A novel combinatorial merge-split approach for automatic clustering using imperialist competitive algorithm
CN109284675B (en) User identification method, device and equipment
CN110363049B (en) Method and device for detecting, identifying and determining categories of graphic elements
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
CN109902156B (en) Entity retrieval method, storage medium and electronic device
CN110728313B (en) Classification model training method and device for intention classification recognition
CN112200296B (en) Network model quantization method and device, storage medium and electronic equipment
US11694111B2 (en) Learning device and learning method
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
US20200020321A1 (en) Speech recognition results re-ranking device, speech recognition results re-ranking method, and program
WO2020149897A1 (en) A deep learning model for learning program embeddings
CN116561319A (en) Text clustering method, text clustering device and text clustering system
CN113535912B (en) Text association method and related equipment based on graph rolling network and attention mechanism
CN114444668A (en) Network quantization method, network quantization system, network quantization apparatus, network quantization medium, and image processing method
CN110895703B (en) Legal document case recognition method and device
US11755671B2 (en) Projecting queries into a content item embedding space
Su et al. Semantically guided projection for zero-shot 3D model classification and retrieval
CN110083828A (en) A kind of Text Clustering Method and device
CN115858780A (en) Text clustering method, device, equipment and medium
Wicht et al. Keyword spotting with convolutional deep belief networks and dynamic time warping
CN111737469A (en) Data mining method and device, terminal equipment and readable storage medium
Nagovitsin et al. DGAC: dialogue graph auto construction based on data with a regular structure
CN115146596B (en) Recall text generation method and device, electronic equipment and storage medium
Loong et al. Image‐based structural analysis for education purposes: A proof‐of‐concept study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination