CN111488429A

CN111488429A - Short text clustering system based on search engine and short text clustering method thereof

Info

Publication number: CN111488429A
Application number: CN202010194422.2A
Authority: CN
Inventors: 赵粉玉; 徐鹏波; 陈尚武
Original assignee: Hangzhou Xujian Science And Technology Co ltd
Current assignee: Hangzhou Xujian Science And Technology Co ltd
Priority date: 2020-03-19
Filing date: 2020-03-19
Publication date: 2020-08-04

Abstract

The invention provides a short text clustering system based on a search engine and a short text clustering method thereof. The system comprises a data preprocessing module, a search engine data matching module, a short text similarity calculation module and a data processing module. The data preprocessing module preprocesses the text data according to actual service conditions, where the sample data are short texts. The invention addresses the drawbacks of current clustering approaches, namely low calculation speed, difficulty in controlling the clustering effect on short texts, and the inability to cluster a given piece of text data in real time. It can place a piece of text data into its similar data set in real time, and it achieves efficient clustering by exploiting the high-concurrency characteristics of a search engine database.

Description

Short text clustering system based on search engine and short text clustering method thereof

Technical Field

The invention relates to the technical field of big data processing, in particular to a short text clustering system based on a search engine and a short text clustering method thereof.

Background

Common short text clustering methods include partition-based algorithms, represented by k-means, and hierarchy-based algorithms, represented by hierarchical clustering. The k-means algorithm requires the number of clusters to be determined in advance, and a suitable value is difficult to choose when the data set is large. Hierarchical algorithms require a condition for stopping splitting to be determined, and their calculation speed is low. Short text clustering based on unsupervised learning performs poorly when there is a large amount of noisy data, and it cannot place a given piece of data into its similar data set in real time.

Disclosure of Invention

Various unsupervised short text clustering methods, such as k-means and DBSCAN, are maturing. However, current clustering approaches suffer from low calculation speed, difficulty in controlling the clustering effect on short texts, and the inability to cluster a given piece of text data in real time. To solve these technical problems, the invention provides a short text clustering system based on a search engine and a short text clustering method thereof.

In order to achieve the aim, the invention provides a short text clustering system based on a search engine, which comprises a data preprocessing module (1), a search engine data matching module (2), a short text similarity calculation module (3) and a data processing module (4);

Short text generally refers to text of short length, in principle no more than 300 characters, such as microblog posts, news topics, opinion comments, mobile phone short messages and document summaries.

The data preprocessing module (1) is used for preprocessing text data according to actual service conditions, wherein the sample data refers to the short text;

the search engine data matching module (2) performs a fuzzy search of the database, according to corresponding rules, using the text processed by the data preprocessing module (1), and returns the first n results; the rules can be customized for the service concerned. For example, when clustering news, the places mentioned in the text can be extracted and stored in a place field in the database; at retrieval time a multi-field fuzzy search is performed on both the text and the place, so that data with similar places and texts is returned, which can improve clustering accuracy;

the short text similarity calculation module (3) calculates the similarity between each sentence returned by the search engine data matching module (2) and the input text by using a short text similarity calculation method;

the data processing module (4) places the data with similarity greater than a certain threshold value into the corresponding field in the search engine table according to the rule.
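
As a minimal sketch of the multi-field retrieval rule described for module (2), the following Python builds an Elasticsearch-style request body that matches on both the text and an extracted place. The field names `text` and `place`, the helper name `build_multi_field_query` and the choice of a `bool` query with `must` clauses are illustrative assumptions, not details given in this disclosure.

```python
def build_multi_field_query(text, place=None, n=3):
    """Build a search body that matches on text and, when given, on place.

    Assumption: the index stores the short text in a field named "text"
    and the extracted place in a field named "place".
    """
    must = [{"match": {"text": text}}]
    if place:
        # Require the place field to match as well, so that returned
        # records are similar in both place and text.
        must.append({"match": {"place": place}})
    return {"size": n, "query": {"bool": {"must": must}}}


body = build_multi_field_query("gas explosion accident", place="Shanxi")
```

Requiring both clauses is only one possible rule; a service could instead treat the place match as optional if it prefers recall over precision.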

The invention also provides a short text clustering method based on the search engine, which comprises the following procedures:

Step (1), the data preprocessing module (1) processes the input text data and removes stop words from the short text, that is, words with no actual meaning; it also removes format marks and garbled characters and, depending on the actual situation, optionally removes English text, numbers, emoticons and special stop words set by the actual application.

For example, after processing, the sentence '# how the Shanxi Daizhongda unclear gas # goes back, unclear gas appears around the Shanxi Daizda colorless and odorless, all people choking to start to cough, @ Shanxi environmental protection public sentiment gateway department surveys Shanzhong and Shanxi agriculture university' becomes 'Shanxi Daizhongda unclear gas around the Shanxi Daizda colorless and odorless choking all people starts to cough department surveys Shanzhong and Shanxi agriculture university'.
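
The preprocessing of step (1) might be sketched as follows. This is an illustration only: the stop-word list, the regular expressions and the function name `preprocess` are assumptions, and whitespace splitting stands in for the Chinese word segmentation a real deployment would need.

```python
import re

# Hypothetical stop-word list; a real system would load a service-specific one.
STOP_WORDS = {"the", "a", "an", "is", "and", "of"}


def preprocess(text: str) -> str:
    """Remove mentions, URLs, format marks and stop words from a short text."""
    text = re.sub(r"@\S+", " ", text)          # drop @mentions
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^\w\s]", " ", text)       # drop '#', punctuation, emoticons
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("#Gas leak# reported near the campus, @EnvAgency is investigating!")
```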

Step (2), the search engine data matching module (2) selects a search engine database, such as Elasticsearch or Solr. When the search engine (for example, Elasticsearch) processes a full-text search, it first analyzes the query string and then constructs a query from the resulting word segmentation; the search results form a result set sorted in descending order of score, and in general the higher the score, the more similar the two sentences. The word segmenter in the search engine is set to ik_smart, ik_max_word or another Chinese word segmenter, and the stop words and dictionaries in the search engine are added, deleted and modified as required;

the short text is fuzzily searched in the search engine database, either by a direct search or by a curl command, and the first n similar sentences are returned; n can be adjusted according to the final effect and a value of about 3 is suggested, that is, the first 3 records are returned, which are the 3 records in the database most similar to the short text;
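
To make the retrieval of step (2) concrete, the sketch below assembles a full-text search request that returns the first n hits, together with an equivalent curl command line. The host `http://localhost:9200`, the index name `short_texts`, the field name `text` and the helper `build_search_request` are assumptions chosen for illustration; the code only builds the request and does not contact a server.

```python
import json
import shlex


def build_search_request(host, index, field, text, n=3):
    """Return (url, body, curl_command) for a top-n full-text match query."""
    url = f"{host}/{index}/_search"
    # "size" limits the result set to the n highest-scoring records.
    body = {"size": n, "query": {"match": {field: text}}}
    curl = "curl -s -X GET {} -H 'Content-Type: application/json' -d {}".format(
        shlex.quote(url), shlex.quote(json.dumps(body, ensure_ascii=False))
    )
    return url, body, curl


url, body, curl = build_search_request(
    "http://localhost:9200", "short_texts", "text", "gas explosion accident"
)
```

The hits returned by such a request are already ordered by score, so the first n entries can be passed directly to the similarity calculation of step (3).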

step (3), the short text similarity calculation module (3) segments each of the first n sentences returned by the search engine data matching module (2) into words and removes noise, converts the words into a word vector list through a word vector space model, and calculates the similarity between sentences as the cosine similarity between vectors;

the word vector space model is obtained by segmenting a Wikipedia corpus or another large corpus with a Chinese word segmentation tool, removing stop words, and then training with word2vec from the gensim toolkit; the word vector space model represents words as vectors;

cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals; the closer the cosine value is to 1, the closer the angle is to 0 degrees, i.e. the more similar the two vectors are.

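The cosine value between the two vectors is obtained by the following formula, where A and B are the word vector lists obtained by converting the two sentences through the word vector space model:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}} \, \sqrt{\sum_{i} B_i^{2}}};$$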
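
The similarity calculation of step (3) can be sketched in plain Python as below. The disclosure does not state how a sentence's word vector list is reduced to a single vector, so averaging the word vectors in `sentence_vector` is an assumption, as are the function names and the toy two-dimensional vectors; in practice the vectors would come from the trained word2vec model.

```python
import math


def sentence_vector(tokens, word_vectors):
    """Average the vectors of known tokens (an assumed aggregation rule)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None  # no token of the sentence is in the vocabulary
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a trained word vector space model.
toy_vectors = {"gas": [1.0, 0.0], "explosion": [0.0, 1.0]}
vec = sentence_vector(["gas", "explosion", "unknown"], toy_vectors)
```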

step (4), the data processing module (4) sets up a similar data set field in the database and, using the formula of the short text similarity calculation module (3), calculates in a loop the similarity between the sentence to be matched and each sentence in the returned match list; if the similarity is smaller than the set threshold, the sentence to be matched is added directly to the database; if the similarity is greater than the set threshold and the sentence to be matched is shorter than the similar sentence, it is added to the similar data set field of the similar sentence; if the similarity is greater than the set threshold and the sentence to be matched is longer than the similar sentence, it replaces the similar sentence, and the similar sentence is added to its similar data set field;

the threshold is generally set to 0.8; if the first returned entry is similar and the second entry is also similar, the second entry is added to the similar data set field of the first entry and then deleted, and subsequent sentences are processed in the same way;

that is, if the list of sentences similar to the sentence to be matched, s, is [a, b, c], s undergoes the following operations with each sentence in the list (taking a as an example): the similarity between s and a is calculated; if the similarity is smaller than the threshold, s is added directly to the database; if the similarity is greater than the threshold and s is shorter than a, s is added to the similar field of a; if the similarity is greater than the threshold and s is longer than a, a is replaced by s and a is added to the similar field of the replacing record; here s, a, b and c are each a single piece of short text data;
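
One possible reading of the decision rules of step (4) is sketched below. The record layout (`text` plus a `similar` list), the function name `assign`, the treatment of equal lengths as "not longer", and the word-overlap `jaccard` function (a toy stand-in for the word-vector cosine similarity) are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Toy word-overlap similarity, used here instead of word vectors."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def assign(s, candidates, similarity, threshold=0.8):
    """Place sentence s among the candidate records returned by the search.

    Returns (home, deleted): the record that now holds s, and the records
    that were merged into it and should be deleted from the database.
    """
    home, deleted = None, []
    for rec in candidates:
        if similarity(s, rec["text"]) < threshold:
            continue  # not similar: this candidate is left untouched
        if home is None:
            home = rec
            if len(s) <= len(rec["text"]):
                rec["similar"].append(s)            # s joins the longer record
            else:
                rec["similar"].append(rec["text"])  # s replaces the record text
                rec["text"] = s
        else:
            # Further similar records are merged into the first one and deleted.
            home["similar"].append(rec["text"])
            home["similar"].extend(rec["similar"])
            deleted.append(rec)
    if home is None:
        home = {"text": s, "similar": []}  # nothing similar: new record
    return home, deleted


records = [{"text": "gas explosion accident in coal mine area", "similar": []}]
home, deleted = assign("gas explosion accident in coal mine", records, jaccard)
```

In this sketch the caller is expected to write `home` back to the search engine table and remove the `deleted` records, which corresponds to the update and deletion described above.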

the word segmentation tools referred to herein include, but are not limited to, LTP, NLPIR, jieba and HanLP, and each is used with a custom dictionary and a stop-word dictionary set as required.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

The invention addresses the drawbacks of current clustering approaches, namely low calculation speed, difficulty in controlling the clustering effect on short texts, and the inability to cluster a given piece of text data in real time. It can place a piece of text data into its similar data set in real time, and it achieves efficient clustering by exploiting the high-concurrency characteristics of a search engine database.

Drawings

FIG. 1 is a block diagram of a short text clustering system based on a search engine according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, the present invention provides a specific embodiment of a short text clustering system based on a search engine, which includes a data preprocessing module (1), a search engine data matching module (2), a short text similarity calculation module (3), and a data processing module (4);

As shown in fig. 1, the present invention further provides a specific embodiment of a short text clustering method based on a search engine, which includes the following steps:

For example, the existing database is as follows:

The short text is 'extra-large gas explosion accident in the west two mining area of xx coal mine', and text preprocessing yields 'gas explosion accident in the west two mining area of xx coal mine'. Fuzzy matching of the short text against the text field of the table returns two records, with id 1 and id 3. The similarity between the short text and the text field of each of these records is then calculated. The similarity with the record of id 1 is greater than the threshold and the short text has fewer words, so the short text is added to the similar data set of id 1. The similarity with the record of id 3 is smaller than the threshold, so no processing is performed. The record with id 1 is updated, giving the following result:

The short text clustering method mainly consists of operating Elasticsearch and calculating the similarity between texts. Elasticsearch is a distributed, RESTful search and data analysis engine that supports near-real-time processing of massive data, and the similarity between texts is calculated in milliseconds, so the short text clustering is efficient.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

The principle and embodiments of the present invention have been described herein by way of specific examples, which are provided only to help understand the method and the core idea of the present invention, and the above is only a preferred embodiment of the present invention, and it should be noted that there are objectively infinite specific structures due to the limited character expressions, and it will be apparent to those skilled in the art that a plurality of modifications, decorations or changes can be made without departing from the principle of the present invention, and the above technical features can also be combined in a suitable manner; such modifications, variations, combinations, or adaptations of the invention using its spirit and scope, as defined by the claims, may be directed to other uses and embodiments.

Claims

1. A short text clustering system based on a search engine is characterized by comprising a data preprocessing module (1), a search engine data matching module (2), a short text similarity calculation module (3) and a data processing module (4);

the search engine data matching module (2) performs a fuzzy search of the database using the text processed by the data preprocessing module (1) and returns the first n results;

the data processing module (4) puts the data with similarity larger than a certain threshold value into the corresponding field in the search engine table.

2. The search engine-based short text clustering system according to claim 1, wherein the short text is in a text form with a short text length of no more than 300 characters, such as microblog, news topic, opinion comment, short message service, and document summary.

3. A short text clustering method based on a search engine is characterized by comprising the following processes:

step (1), the data preprocessing module (1) processes the input text data and removes stop words from the short text;

step (2), the search engine data matching module (2) selects a search engine database, and the search engine is used for processing full-text search;

step (4), the data processing module (4) sets up a similar data set field in the database and, using the formula of the short text similarity calculation module (3), calculates in a loop the similarity between the sentence to be matched and each sentence in the returned match list; if the similarity is smaller than the set threshold, the sentence to be matched is added directly to the database; if the similarity is greater than the set threshold and the sentence to be matched is shorter than the similar sentence, it is added to the similar data set field of the similar sentence; and if the similarity is greater than the set threshold and the sentence to be matched is longer than the similar sentence, it replaces the similar sentence, and the similar sentence is added to its similar data set field.

4. The method for clustering short texts based on a search engine as claimed in claim 3, wherein in step (2), when the search engine processes the full-text search, it first analyzes the query string and then constructs a query from the resulting word segmentation; the search results form a result set sorted in descending order of score, and the higher the score, the more similar the two sentences; the word segmenter in the search engine is set to ik_smart, ik_max_word or another Chinese word segmenter, and the stop words and dictionaries in the search engine are added, deleted and modified as required;

the short text is fuzzily searched in the search engine database by a direct search or by a curl command, and the first n similar sentences are returned, where n is adjusted according to the final effect.

5. The method for clustering short texts based on a search engine according to claim 3, wherein in step (3), the word vector space model is obtained by segmenting a Wikipedia corpus or another large corpus with a Chinese word segmentation tool, removing stop words, and then training with word2vec from the gensim toolkit, and the word vector space model represents words as vectors;

cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals; the closer the cosine value is to 1, the closer the angle is to 0 degrees, namely the more similar the two vectors are; the cosine value between the two vectors is obtained by the following formula, where A and B are the word vector lists obtained by converting the two sentences through the word vector space model:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}} \, \sqrt{\sum_{i} B_i^{2}}}.$$