CN102184256A - Clustering method and system aiming at massive similar short texts

Publication number: CN102184256A
Authority: CN (China)
Prior art keywords: text, short, short text, trunk, texts
Prior art date: 2011-06-02
Legal status: Pending
Application number: CN2011101473403A
Other languages: Chinese (zh)
Inventors: 白俊良 (Bai Junliang), 陈光 (Chen Guang)
Current assignee: Beijing University of Posts and Telecommunications
Original assignee: Beijing University of Posts and Telecommunications
Priority date: 2011-06-02
Filing date: 2011-06-02
Publication date: 2011-09-14
Application filed by: Beijing University of Posts and Telecommunications

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a clustering method and system for massive similar short texts, and belongs to the field of duplicate short-text detection in information technology. Because of the inherent characteristics of short texts, applying traditional duplicate-text analysis methods to them gives unsatisfactory results. By adopting a duplicate-analysis method based on the main content (trunk) of the short text, combined with related word groups, the invention can detect not only exact-duplicate texts but also texts with extremely high similarity. The method and system have high processing speed and efficiency and handle massive data well. The method removes redundant short texts, greatly reduces the processing scale of the system, and can to a certain extent discover hot short texts; the method and system therefore help to discover social hotspots.

Description

A clustering method and system for massive similar short texts
One. Technical Field
Information technology.
Two. Background Art
Against the background of informatization as a worldwide trend of development, the Internet is characterized by extremely broad application, the largest scale of development, and close relevance to people's daily lives. On the one hand, the Internet has created enormous economic and social benefits and allows people to receive instant, up-to-date information. At the same time, as the network becomes ubiquitous, the amount of online information keeps growing; this not only poses a severe challenge to the acquisition, storage, and real-time analysis of such massive information by computers, but also makes it harder for people to search for information accurately and reliably. On the other hand, the Internet has also brought negative effects: harmful content such as pornography and reactionary material spreads widely on the network, spam runs rampant, copyrighted films, music, and software are distributed in violation of intellectual property rights, users are defrauded through the network, and Internet-related violence occurs. Therefore, in the process of building an information society, improving the level of information content security and the ability to detect the various kinds of harmful information on the Internet is an important link in network information technology and a solid foundation for the smooth construction of the information society.
With the convergence of telecommunication, broadcasting, and Internet networks ("triple play"), text forms on the next-generation Internet are becoming diversified: ordinary web pages account for a smaller and smaller proportion, while content such as microblogs, WAP pages, comments, and short messages (SMS) gradually increases in share. As with ordinary web pages, this class of text also contains a large amount of identical or highly similar content. For example:
[1] Beijing certificate services: diplomas, ID cards, seal engraving. QQ731787311
[2] Beijing cer,tificates: dip.loma, ID. card, seal en,graving. QQ7317@87@311
[3] I send a blessing message so the bachelor may laugh to his heart's content. No matter whether the festival is big or small, may it be happy and lively. May all troubles be blown away by the wind, and may everything go as you wish, free of worry!
[4] <Blessing message> I send so the <bachelor> may laugh to his heart's content. No matter whether the <festival> is big or small, may it be <happy> and lively. May <all troubles> be blown away by the wind, and may everything go as you wish, free of worry!
[5] Timely snow drifts, the winter plum blossoms proudly, and the ox lows to herald an early spring. Gongs and drums sound, firecrackers crackle, and laughter fills the land. May friendship be firm, and may good fortune arrive today. Good health, wealth, and great ox-year luck to you! --- Sincerely yours, Zhang San
[6] Timely snow drifts, the winter plum blossoms proudly, and the ox lows to herald an early spring. Gongs and drums sound, firecrackers crackle, and laughter fills the land. May friendship be firm, and may good fortune arrive today. Good health, wealth, and great ox-year luck to you! --- Sincerely yours, Li Si
Comparing example 1 and example 2 shows that punctuation marks and special symbols have been inserted into the message at improper positions; this is done by illegal advertisers sending advertisement SMS in order to evade the operator's advertisement filter. Comparing example 3 and example 4 shows that the forwarder has bracketed the keywords to be emphasized. Comparing example 5 and example 6 shows that the body of the message is identical, while different forwarders have each appended their own name at the end. Although such messages have been changed to some extent, their main body remains the same.
Another class consists of messages that mobile-phone users compose themselves about the same or similar topics, such as festival blessing messages or messages exchanged about some public event. These are original messages; although the wording differs, they concern the same topic and therefore show great similarity.
Three. Summary of the Invention
1. Technical problem to be solved by the invention (object of the invention)
Redundancy is especially severe in short-text corpora. In SMS corpora, it comes mainly from mass-sent spam, from the mass sending and forwarding of joke and blessing messages, and from the everyday stock phrases that appear in large numbers in junk messages. In BBS or news-comment corpora, it comes mainly from the massive reposting of and replies to hot posts. In instant messages, humorous messages, blessing messages, and everyday expressions occur so frequently that they cause a large amount of redundancy. Microsoft once analyzed an Internet corpus consisting of 150 million web pages and found that 6% of the pages were exact duplicates; the proportion of exact duplicates among short texts is far higher than that of the Internet corpus. In addition to short texts with completely identical content, short-text corpora contain an even larger number of short texts whose content is approximately identical: they obviously describe the same event in an almost identical way, differing only in punctuation or in a few characters added at the beginning or end of the message. Microsoft's statistics put the proportion of near-duplicates in the Internet corpus at 29.2%, and the proportion of near-duplicates in short-text corpora is much higher still. The existence of exact-duplicate and near-duplicate short texts wastes disk space; detecting and removing redundant short texts can greatly reduce the processing scale of the system, and can also, to a certain extent, reveal hot short texts and thus assist in discovering social hotspots.
Most traditional duplicate-text detection algorithms are designed to determine whether two texts are exact duplicates; they cannot solve the duplicate-detection problem of the similar short texts described above.
Traditional duplicate-text analysis methods are not suitable for the duplicate analysis of short texts. Traditional text-relevance analysis mainly adopts the vector space model or a probability model. In the vector space model, the words in a text are used as features to represent the text, and the similarity between feature vectors measures the relevance of texts. However, texts such as SMS messages and microblogs are too short, which makes the feature vectors too sparse; the computed similarity cannot meet the requirements of similarity analysis, and the results are in particular unacceptable at the semantic level. The probability model suffers from a similar problem: for texts as short as SMS messages, most features end up as smoothed probability estimates and do not reflect the information in the real data. The results are therefore unsatisfactory and cannot solve the duplicate-detection problem of similar short texts. This invention adopts a duplicate-analysis method based on the content trunk of the text, combined with related word groups, and solves this problem well.
2. Complete technical solution provided by the invention (the inventive scheme)
2.1 Duplicate-analysis method based on the short-text content trunk
This algorithm removes highly similar texts according to the consistency of their content trunks. Whether in the probability model or in the vector space model, relevance analysis is based on the word frequencies in the text. At the same time, if two short texts (for example SMS messages or microblogs) are similar, a large number of identical or semantically close words will necessarily appear in both. Therefore we extract the content trunk of the text and use it for the relevance analysis of short-text samples. The scheme comprises the following steps:
1) Preprocessing
This step improves text quality and comprises the following sub-steps (a minimal sketch follows the list):
a) Text filtering (removing texts that are too short or carry no information)
b) Text pruning (removing prefixes, suffixes, and special symbols in the text that only interfere with analysis)
c) Text encoding conversion
d) Text content normalization (unifying traditional and simplified Chinese characters, upper and lower case letters, full-width and half-width characters, and the various numbering formats)
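By way of illustration only (the patent does not prescribe any particular implementation), a minimal Python sketch of this preprocessing step might look as follows; the length threshold, the affix pattern, and the source encoding are assumptions, and traditional-to-simplified character conversion would additionally require a mapping table (e.g. an OpenCC-style dictionary) that is omitted here.

```python
import re
import unicodedata
from typing import Optional

MIN_LENGTH = 10                                  # hypothetical threshold for "too short"
NOISE = re.compile(r'^[\s>*#\-]+|[\s>*#\-]+$')   # hypothetical interfering prefixes/suffixes

def preprocess(raw: bytes, encoding: str = "gbk") -> Optional[str]:
    """Filter, prune, re-encode, and normalize one short text; return None if it is discarded."""
    text = raw.decode(encoding, errors="ignore")         # c) encoding conversion
    text = NOISE.sub("", text)                           # b) prune interfering affixes and symbols
    text = unicodedata.normalize("NFKC", text).lower()   # d) full-width -> half-width, case folding
    if len(text) < MIN_LENGTH:                           # a) drop texts that are too short
        return None
    return text
```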
2) Word segmentation
This step cuts the complete text content into words, each tagged with its part of speech.
3) Trunk extraction
This step keeps only the verbs, nouns, and numerals; words of all other parts of speech are discarded. Synonyms and near-synonyms with the same meaning are then replaced by a single representative word (semantic normalization).
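A minimal sketch of steps 2) and 3), assuming the jieba segmenter's part-of-speech tags and a hypothetical synonym table (the patent names neither a segmenter nor a concrete synonym dictionary):

```python
import jieba.posseg as pseg         # assumption: any POS-tagging segmenter could be used instead

SYNONYMS = {"手机": "电话"}          # hypothetical synonym table: near-synonyms -> one representative word
KEEP_FLAGS = ("v", "n", "m")        # verbs, nouns, numerals (tag prefixes in the jieba tag set)

def extract_trunk(text: str) -> list:
    """Segment the text, keep only verbs, nouns, and numerals, and normalize synonyms."""
    words = []
    for token in pseg.cut(text):                  # yields (word, part-of-speech flag) pairs
        if token.flag.startswith(KEEP_FLAGS):     # keep v*/n*/m* tags only
            words.append(SYNONYMS.get(token.word, token.word))
    return words                                  # the original word order is preserved
```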
4) Similarity computation
After the trunk has been extracted, we assume that the more words two texts have in common (with the word order unchanged), the more similar they are.
This step therefore inserts the text trunk into a HASH table and, according to the mapping relation, divides texts into two kinds: related and unrelated.
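Under the simplifying assumption that two texts are related exactly when their trunks coincide, the HASH-table lookup can be sketched as follows (extract_trunk is the sketch above; all names are illustrative):

```python
from collections import defaultdict

trunk_table = defaultdict(list)     # hash table: trunk string -> texts sharing that trunk

def classify(text: str) -> bool:
    """Return True if the text maps onto an already-seen trunk, i.e. it is a related (repeated) text."""
    trunk = "".join(extract_trunk(text))    # joining keeps the word order
    related = trunk in trunk_table
    trunk_table[trunk].append(text)
    return related
```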
5) Similar-text clustering
This step groups the related documents into one class, thereby forming a number of "related text" classes, and selects the keywords with the highest frequency (keyword repetition rate) to represent each class.
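Selecting the highest-frequency keyword of a class might, under the same assumptions, be sketched as:

```python
from collections import Counter

def label_cluster(texts) -> str:
    """Label one "related text" class with its most frequent trunk word."""
    counts = Counter()
    for text in texts:
        counts.update(extract_trunk(text))   # trunk words of every member of the class
    word, _ = counts.most_common(1)[0]
    return word
```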
Four. Description of the Drawings
Fig. 1: Flowchart of the strongly-related duplicate-text detection algorithm
Fig. 2: Architecture diagram of the distributed processing scheme
Fig. 3: Time-sequence synchronization scheme for short-text data
Fig. 4: Server-side deployment diagram
Fig. 5: Flowchart of text processing at each processing node
Five. Embodiments
In order to process massive network data, the above scheme must be deployed in a distributed manner. Each distributed processing node obtains data from the short-text data source and extracts the short-text trunk; it then communicates with the HASH database server and looks up the trunk in the HASH database to determine whether the short text is a repeat. If it is, the node updates the count of that short-text class in its local TokyoCabinet HASH table and passes the result on to subsequent processing. To improve processing speed, each processing node additionally uses two buffer structures, BUFFER_DEQUE and DB_DEQUE, as a two-level cache of the duplicate-text class information held on the HASH server.
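A sketch of the per-node processing loop under these assumptions (the HASH-server client, the downstream queue, and the plain dictionary standing in for the local TokyoCabinet table are all hypothetical):

```python
local_counts = {}    # stand-in for the node's local TokyoCabinet HASH table

def process(text, hash_server, downstream):
    """Handle one incoming short text at a processing node."""
    trunk = "".join(extract_trunk(text))
    if hash_server.contains(trunk):                            # hypothetical client call
        local_counts[trunk] = local_counts.get(trunk, 0) + 1   # update the class count locally
        downstream.put((trunk, text))                          # hand the result to later stages
    else:
        hash_server.insert(trunk)                              # hypothetical client call
```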
1. Notes on this framework
1) Why a cache is placed at each processing node
To guarantee high read performance of the HASH server, it is essential to keep the amount of data in the HASH database within bounds (below the level of hundreds of millions of records), so a cache is set up at each processing node.
On the other hand, every deletion of a record locks the database file, and other requests must wait; a "centralized deletion strategy" or "batch deletion strategy" therefore cannot be adopted. Instead, each processing node is responsible for deleting from the HASH server database the records it has handled itself, which spreads the deletion operations out and avoids long waits (delays in answering database operations). In addition, short texts are kept in time order in the cache, so when "outdated" short-text records are deleted, the short-text class to be deleted can be found with O(1) time complexity.
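The O(1) deletion of outdated records relies only on the cache being ordered by time; a sketch (ignoring, for the moment, the distinction between repeated and non-repeated classes introduced below):

```python
from collections import deque
import time

buffer_deque = deque()   # (first_seen_timestamp, trunk) pairs in arrival order, oldest at the head

def delete_expired(short_cycle, hash_server):
    """Remove outdated classes; the oldest class is always at the head, so each removal is O(1)."""
    now = time.time()
    while buffer_deque and now - buffer_deque[0][0] > short_cycle:
        _, trunk = buffer_deque.popleft()
        hash_server.delete(trunk)    # each node deletes only the records it handled itself
```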
2) Why a two-level cache is used
The final application is usually concerned with time-sensitive events: a short-text class in which no repetition is found within a short time window (called the "short cycle") is regarded as a short text of no interest. Such short texts account for the overwhelming majority, for example the messages we send in daily life.
Even a short-text class in which repetition is found within the short cycle becomes "outdated" after a further period of time (called the "long cycle") and is then likewise regarded as of no interest; it would be meaningless, for example, to keep discussing the financial crisis now.
To keep the number of records stored in the HASH database as small as possible, short-text class records are treated differently according to the above reasoning. The buffer structure Buffer_Deque stores all short-text records within the "short cycle", both repeated and non-repeated. The buffer structure DB_Deque stores the repeated short texts within the "long cycle".
While processing the short-text stream, short-text records that exceed the short cycle without any repetition having been found are deleted in time from the HASH server and from Buffer_Deque; records that exceed the short cycle but in which repetition has been found are moved into DB_Deque; and short-text class records that exceed the long cycle are deleted in time from DB_Deque and from the HASH database.
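The movement rules between the two cache levels can be sketched as follows, reusing buffer_deque, local_counts, and the hypothetical hash_server client from the earlier sketches:

```python
db_deque = deque()   # (first_seen_timestamp, trunk) of repeated classes within the long cycle

def age_out(now, short_cycle, long_cycle, hash_server):
    """Apply the two-level cache rules as short-text classes age."""
    # Classes leaving the short cycle: drop if never repeated, otherwise move into DB_Deque.
    while buffer_deque and now - buffer_deque[0][0] > short_cycle:
        ts, trunk = buffer_deque.popleft()
        if local_counts.get(trunk, 0) > 0:     # repetition was found within the short cycle
            db_deque.append((ts, trunk))
        else:                                  # never repeated: of no interest, delete everywhere
            hash_server.delete(trunk)
    # Classes leaving the long cycle are deleted from DB_Deque and from the HASH database.
    while db_deque and now - db_deque[0][0] > long_cycle:
        _, trunk = db_deque.popleft()
        hash_server.delete(trunk)
```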
3) Data synchronization between the two caches and the HASH database
All short-text records within the "short cycle" are kept synchronized between the buffer structure Buffer_Deque and the HASH database; the repeated short-text records within the "long cycle" are kept synchronized between the buffer structure DB_Deque and the HASH database.
4) Why the TokyoCabinet HASH table on each processing node is used for short-text counting
Recording the count of each short-text class directly in the HASH database (centralized counting) looks simpler and avoids counting errors, but it has the following problems:
a) The count results must periodically be written to the analysis-results database (an Oracle database or the like). This requires locking the database table and the HASH database for a relatively long time, during which no processing node can access the HASH database to obtain the result of a similarity duplicate check, and the instantaneous load on the Oracle database is also high.
b) For every short-text class whose count exceeds 3, centralized counting adds an extra database write operation, which increases the load on the HASH server.
Distributed short-text counting avoids the above problems: it reduces the number of accesses to the HASH database, and writing the data to the database in a dispersed manner reduces the database's instantaneous load.
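Distributed counting with a periodic, dispersed flush to the analysis-results database might be sketched as follows (the analysis_db client call is hypothetical):

```python
def flush_counts(analysis_db):
    """Periodically push this node's local class counts to the analysis-results database."""
    for trunk, count in list(local_counts.items()):
        if count:
            analysis_db.add_count(trunk, count)   # hypothetical call into the results database
            local_counts[trunk] = 0               # keep counting locally between flushes
```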
2. Data storage involved in this framework
1) The HASH database is installed on the HASH database server and stores the trunks of repeated short texts.
2) TokyoCabinet is installed on each processing node and stores the count information of the short-text classes.
3) The buffer structure Buffer_Deque stores all short texts within the short cycle. Buffer_Deque consists of two hash structures, buffer_queue and buffer_index.
Two hash structures are used because buffer_queue is keyed by the short-text trunk, so whether a given short-text class already exists can be queried quickly, while buffer_index is keyed by the sending time of the short text, so the classes that have exceeded the "short cycle" can be identified quickly. The short-text classes in buffer_queue and buffer_index therefore have to be kept synchronized (see the sketch after this list).
4) DB_Deque stores all short-text classes found to be repeated within the long cycle.
The short-text classes in the DB_Deque queue are arranged in ascending time order, so deleting data by time threshold only requires reading from the head of the queue.
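A minimal sketch of the Buffer_Deque structure: a trunk-keyed dictionary paired with a time-ordered queue so that both existence queries and expiry are fast (the patent describes buffer_index as a hash keyed by sending time; a deque in ascending time order serves the same purpose here, and all names are illustrative):

```python
from collections import deque

class BufferDeque:
    """Short-cycle cache: trunk-keyed lookup plus time-ordered expiry (illustrative only)."""

    def __init__(self):
        self.buffer_queue = {}        # keyed by trunk: O(1) "does this class exist?" queries
        self.buffer_index = deque()   # (sending_time, trunk) in ascending time order

    def add(self, trunk, timestamp, record):
        self.buffer_queue[trunk] = record
        self.buffer_index.append((timestamp, trunk))   # the two structures stay synchronized

    def pop_expired(self, now, short_cycle):
        """Yield and remove every class older than the short cycle, reading only from the head."""
        while self.buffer_index and now - self.buffer_index[0][0] > short_cycle:
            _, trunk = self.buffer_index.popleft()
            record = self.buffer_queue.pop(trunk, None)
            if record is not None:
                yield trunk, record
```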
3. Structure of the HASH server
At the HASH server, requests are processed in the order in which they are received. The server consists of three main parts: a main thread, a global queue, and a group of worker threads. The main thread listens for connection requests on the network interface and places the received requests into the global queue; the worker threads then take requests from the head of the queue, query the HASH database, and return the query results to the client.
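A rough sketch of this main-thread / global-queue / worker-thread arrangement, using Python's standard threading and queue modules (the wire format, port, and hash_db query call are assumptions, not part of the patent):

```python
import queue
import socket
import threading

requests = queue.Queue()   # the global queue shared by the main thread and the workers

def main_thread(port=9000):
    """Listen on the network interface and enqueue incoming connections in arrival order."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("0.0.0.0", port))
    server.listen()
    while True:
        conn, _ = server.accept()
        requests.put(conn)

def worker(hash_db):
    """Take a request from the head of the queue, query the HASH database, and return the result."""
    while True:
        conn = requests.get()
        trunk = conn.recv(4096).decode("utf-8", errors="ignore")
        found = hash_db.contains(trunk)            # hypothetical database query
        conn.sendall(b"1" if found else b"0")
        conn.close()

# e.g. start one listener and a small worker pool:
# threading.Thread(target=main_thread, daemon=True).start()
# for _ in range(8):
#     threading.Thread(target=worker, args=(hash_db,), daemon=True).start()
```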

Claims (6)

1. A duplicate-detection method for massive similar short texts based on the content trunk, comprising: preprocessing the text; segmenting the complete text content into words tagged with their parts of speech; extracting the trunk of the text, in which only the verbs, nouns, and numerals in the text are kept, words of other parts of speech are discarded, and synonyms and near-synonyms with the same meaning are then replaced by a single representative word (semantic normalization); computing text similarity, in which, after the trunk has been extracted, it is assumed that the more words two texts have in common (with the word order unchanged) the more similar they are; and grouping related documents into one class, thereby forming a number of "related text" classes, and selecting the several keywords with the highest frequency (keyword repetition rate) to represent each class.
2. The content-trunk-based duplicate-detection method for massive similar short texts according to claim 1, characterized in that preprocessing the text comprises filtering and pruning the text, namely removing texts that are too short or carry no information as well as the prefixes, suffixes, and special symbols in the text that interfere with analysis.
3. The content-trunk-based duplicate-detection method for massive similar short texts according to claim 1, characterized in that preprocessing the text comprises converting the text encoding and normalizing the text content, namely unifying traditional and simplified Chinese characters, upper and lower case letters, full-width and half-width characters, and the various numbering formats.
4. The content-trunk-based duplicate-detection method for massive similar short texts according to claim 1, characterized in that, during the similarity computation, the text trunk is inserted into a HASH table and texts are divided into two kinds, related and unrelated, according to the mapping relation.
5. A distributed architecture for massive similar short texts comprising duplicate-detection and repetition-degree statistics functions, in which each distributed processing node obtains data from the short-text data source, extracts the short-text trunk, communicates with the HASH database server, looks up the trunk in the HASH database to determine whether the short text is a repeat, and, if it is, updates the count of that short-text class in the local TokyoCabinet and passes the result on to subsequent processing.
6. The distributed architecture for massive similar short texts comprising duplicate-detection and repetition-degree statistics functions according to claim 5, characterized in that, while each distributed processing node obtains data from the short-text data source and extracts the short-text trunk, it uses BUFFER_DEQUE and DB_DEQUE on the processing node as a two-level cache of the duplicate-text class information held on the HASH server.
CN2011101473403A, filed 2011-06-02 (priority date 2011-06-02) — Clustering method and system aiming at massive similar short texts — status: Pending — published as CN102184256A (en)

Priority Applications (1)

CN2011101473403A — priority date 2011-06-02, filing date 2011-06-02: Clustering method and system aiming at massive similar short texts


Publications (1)

CN102184256A (en) — published 2011-09-14

Family

ID=44570433



Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
US9424253B2 (en) 2012-03-07 2016-08-23 International Business Machines Corporation Domain specific natural language normalization
CN103324604A (en) * 2012-03-07 2013-09-25 国际商业机器公司 Domain specific natural language normalization method and system
US9122673B2 (en) 2012-03-07 2015-09-01 International Business Machines Corporation Domain specific natural language normalization
CN103324604B (en) * 2012-03-07 2016-04-27 国际商业机器公司 For the standardized method and system of the specific natural language in territory
CN103049524A (en) * 2012-12-20 2013-04-17 中国科学技术信息研究所 Method for automatically clustering synonym search results according to lexical meanings
CN103049524B (en) * 2012-12-20 2016-01-06 中国科学技术信息研究所 Synonym result for retrieval presses meaning of a word automatic clustering method
CN103177125A (en) * 2013-04-17 2013-06-26 镇江诺尼基智能技术有限公司 Method for realizing fast-speed short text bi-cluster
CN103177125B (en) * 2013-04-17 2016-04-27 镇江诺尼基智能技术有限公司 One short text double focusing fast class methods
CN103729422A (en) * 2013-12-23 2014-04-16 武汉传神信息技术有限公司 Information fragment associative output method and system
CN103744883A (en) * 2013-12-23 2014-04-23 武汉传神信息技术有限公司 Method and system for rapidly selecting information fragments
CN103744884A (en) * 2013-12-23 2014-04-23 武汉传神信息技术有限公司 Method and system for collating information fragments
CN104317883B (en) * 2014-10-21 2017-11-21 北京国双科技有限公司 Network text processing method and processing device
CN104317883A (en) * 2014-10-21 2015-01-28 北京国双科技有限公司 Web text processing method and web text processing device
CN105843818A (en) * 2015-01-15 2016-08-10 富士通株式会社 Training device, training method, determining device, and recommendation device
CN106919549A (en) * 2015-12-24 2017-07-04 阿里巴巴集团控股有限公司 Method and device for business processing
CN106933901B (en) * 2015-12-31 2020-07-17 北京大学 Data integration method and system
CN106933901A (en) * 2015-12-31 2017-07-07 北京大学 data integrating method and system
CN106202057B (en) * 2016-08-30 2019-07-12 东软集团股份有限公司 The recognition methods of similar news information and device
CN106202057A (en) * 2016-08-30 2016-12-07 东软集团股份有限公司 The recognition methods of similar news information and device
CN106383814A (en) * 2016-09-13 2017-02-08 电子科技大学 Word segmentation method of English social media short text
CN106407019A (en) * 2016-11-23 2017-02-15 青岛海信移动通信技术股份有限公司 Database processing method of mobile terminal and mobile terminal thereof
CN106407020A (en) * 2016-11-23 2017-02-15 青岛海信移动通信技术股份有限公司 Database processing method of mobile terminal and mobile terminal thereof
CN106682082B (en) * 2016-11-23 2021-03-26 青岛海信移动通信技术股份有限公司 Writing method and device for database
CN107330127A (en) * 2017-07-21 2017-11-07 湘潭大学 A kind of Similar Text detection method retrieved based on textual image
CN107330127B (en) * 2017-07-21 2020-06-05 湘潭大学 Similar text detection method based on text picture retrieval
CN109472008A (en) * 2018-11-20 2019-03-15 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus and electronic equipment
CN112597284A (en) * 2021-03-08 2021-04-02 中邮消费金融有限公司 Company name matching method and device, computer equipment and storage medium
CN112597284B (en) * 2021-03-08 2021-06-15 中邮消费金融有限公司 Company name matching method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN102184256A (en) Clustering method and system aiming at massive similar short texts
CN106980692B (en) Influence calculation method based on microblog specific events
JP6007088B2 (en) Question answering program, server and method using a large amount of comment text
CN100478961C (en) New word of short-text discovering method and system
CN109241274A (en) text clustering method and device
CN101820398A (en) Instant messenger for dynamically managing messaging group and method thereof
WO2008014702A1 (en) Method and system of extracting new words
WO2007143914A1 (en) Method, device and inputting system for creating word frequency database based on web information
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
Man Feature extension for short text categorization using frequent term sets
CN103313248A (en) Method and device for identifying junk information
CN105404677B (en) A kind of search method based on tree structure
CN105608232A (en) Bug knowledge modeling method based on graphic database
CN105279159B (en) The reminding method and device of contact person
CN112905800A (en) Public character public opinion knowledge graph and XGboost multi-feature fusion emotion early warning method
Devika et al. A semantic graph-based keyword extraction model using ranking method on big social data
CN106502990A (en) A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN111782970B (en) Data analysis method and device
US9547701B2 (en) Method of discovering and exploring feature knowledge
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN105426490B (en) A kind of indexing means based on tree structure
CN111400617A (en) Social robot detection data set extension method and system based on active learning
Lim et al. ClaimFinder: A Framework for Identifying Claims in Microblogs.
JP6173958B2 (en) Program, apparatus and method for searching using a plurality of hash tables

Legal Events

C06 / PB01: Publication
DD01: Delivery of document by public notice (Addressee: Chen Guang; Document name: Notification of Publication of the Application for Invention)
C02 / WD01: Invention patent application deemed withdrawn after publication (patent law 2001); application publication date: 2011-09-14