CN106569989A

CN106569989A - De-weighting method and apparatus for short text

Info

Publication number: CN106569989A
Application number: CN201610915522.3A
Authority: CN
Inventors: 李苗苗
Original assignee: Beijing Intelligent Housekeeper Technology Co Ltd
Current assignee: Beijing Intelligent Housekeeper Technology Co Ltd
Priority date: 2016-10-20
Filing date: 2016-10-20
Publication date: 2017-04-19

Abstract

An embodiment of the invention discloses a deduplication method for short text. The method comprises: obtaining text string information of the short text; performing word segmentation on the text string and obtaining keywords of the text string from the segmentation information; obtaining a text substring according to the weights corresponding to the keywords, wherein the text substring comprises a threshold number of keywords; and removing duplicate items from the text substring. By extracting the keywords of the text string, the technical scheme generalizes the original text string, which improves the generalization capability and efficiency of deduplication; the amount of computation is also small, and deduplication among multiple text strings is achieved.

Description

A deduplication method and device for short text

Technical field

Embodiments of the present invention relate to the technical field of text processing, and more particularly to a deduplication method and device for short text.

Background art

Text deduplication refers to removing identical characters, identical words, or semantically similar components from text strings. With the continuous development of Internet technology, a large number of short message streams have appeared. These messages are enormous in number but generally very short, and such information is referred to as short text. Specifically, short text is text of very short length, typically within 200 characters, for example the common SMS messages sent over mobile communication networks, instant messages sent through instant messaging software, comments on blogs, comments on Internet news, and so on.

Current text deduplication methods are mainly text hashing and similarity comparison. Text hashing is divided into consistency hashing and locality-sensitive hashing. Consistency hashing has no generalization ability, and its matching condition is too strict; locality-sensitive hashing is better suited to relatively long texts such as web pages. Similarity comparison requires pairwise comparison, so its computational cost is too high for massive text collections. Because short texts are generally very short, their sample features are very sparse and it is difficult to extract effective language features accurately. Short text is also highly time-sensitive and extremely large in volume, so processing it demands higher efficiency than processing long text. Short text is concise in expression, contains many misspellings, non-standard usages, and much noise, offers limited usable information, and suffers from severe word sparsity, so applying a long-text deduplication method directly to short text gives degraded results.

Summary of the invention

In view of this, the present invention proposes a deduplication method and device for short text, which solve problems in text deduplication such as an overly strict matching condition and improve the generalization ability and efficiency of short text deduplication.

In a first aspect, an embodiment of the present invention provides a deduplication method for short text, the method including: obtaining text string information of the short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the segmentation information of the text string; obtaining a text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords; and removing duplicate items from the text substring.

In a second aspect, an embodiment of the present invention provides a deduplication device for short text, the device including: an acquiring unit, configured to obtain text string information of the short text; an extraction unit, connected to the acquiring unit and configured to perform word segmentation on the text string and obtain keywords of the text string according to the segmentation information of the text string; a processing unit, connected to the extraction unit and configured to obtain a text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords; and an operating unit, connected to the processing unit and configured to remove duplicate items from the text substring.

In the embodiments of the present invention, generalization operations such as word segmentation and keyword extraction are applied to the text string of the short text, a text substring is obtained according to the weight information of the keywords, and the duplicate items in the text substring are removed. This generalizes the original text string, improves the generalization ability and efficiency of deduplication, requires little computation, and achieves deduplication within one text string or among multiple text strings.

Description of the drawings

Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 is a flow chart of a deduplication method for short text in Embodiment One of the present invention;

Fig. 2 is a flow chart of a deduplication method for short text in Embodiment Two of the present invention;

Fig. 3 is a flow chart of a deduplication method for short text in Embodiment Three of the present invention;

Fig. 4 is a structure diagram of a deduplication device for short text in Embodiment Four of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content. In addition, the examples given in the following embodiments serve only to explain the principles of the embodiments of the present invention and are not intended to limit them; the specific values in these examples may differ according to the application environment and the parameters of the device or component.

The method and device for short text deduplication in the embodiments of the present invention can run on a terminal installed with an operating system such as Windows (an operating system platform developed by Microsoft), Android (an operating system platform for portable mobile smart devices developed by Google), iOS (an operating system platform for portable mobile smart devices developed by Apple), or Windows Phone (an operating system platform for portable mobile smart devices developed by Microsoft). The terminal can be any one of a desktop computer, notebook computer, mobile phone, palmtop computer, tablet computer, digital camera, digital video camera, and the like.

Embodiment one

Fig. 1 is a flow chart of a method for short text deduplication in Embodiment One of the present invention. The method is used for deduplicating short text and can be performed by a device with document processing capability; the device can be implemented in software and/or hardware and is typically a user terminal device such as a mobile phone or computer. In this embodiment, generalization refers to the relationship between a general description of an element and a specific description of it: the specific description is built on the general description and extends it. To generalize an element means to operate on it so that it becomes more general. The method for short text deduplication in this embodiment includes: step S110, step S120, step S130, and step S140.

Step S110: obtain the text string information of the short text.

Specifically, the user inputs the text string to be processed, and the information of the text string is obtained. Optionally, the information of the text string can include, but is not limited to, the name of the text string, its content, its length, and the semantics of each word in it. Optionally, the name of the text string can be S.

Step S120: perform word segmentation on the text string, and obtain the keywords of the text string according to the segmentation information of the text string.

Specifically, word segmentation is performed on the text string. Word segmentation is a basic step of information processing; its main task is to have the computer automatically split a sentence and identify the independent words. Optionally, the segmentation algorithm can be the shortest path method. The shortest path method computes the shortest path from one node to all other nodes; its main characteristic is that it expands outward layer by layer from the starting point until it reaches the end point. Optionally, for the text string S "I will go to Industrial and Commercial Bank", the segmentation result using the shortest path method is: I / will / go to / Industrial and Commercial Bank. The segmentation information of the text string is then processed to obtain the keyword information of the text string. The keyword information can include, but is not limited to, the verbs, nouns, and adjectives that express the actual meaning of the text string. For the text string S "I will go to Industrial and Commercial Bank", the keyword information is: I / go to / Industrial and Commercial Bank.
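The shortest path idea can be sketched as follows, under stated assumptions. The sketch treats each character boundary as a node, adds an edge for every dictionary word (plus every single character as a fallback), and picks the path with the fewest edges. The function name `shortest_path_segment`, the toy dictionary, and the Chinese rendering of the example sentence are illustrative assumptions and do not come from the patent text.

```python
def shortest_path_segment(text, dictionary):
    """Split text into the fewest words, using dictionary words where possible."""
    n = len(text)
    # best[i] = (fewest words covering text[:i], start index of the last word)
    best = [(0, 0)] + [(float("inf"), 0)] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            # An edge start -> end exists for a dictionary word, or for a
            # single character so that some path always reaches the end.
            if piece in dictionary or end - start == 1:
                cost = best[start][0] + 1
                if cost < best[end][0]:
                    best[end] = (cost, start)
    # Walk back from the end node to recover the words on the shortest path.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]


# Hypothetical dictionary; the sentence is an assumed Chinese form of the example.
toy_dictionary = {"工商", "银行", "工商银行"}
segments = shortest_path_segment("我要去工商银行", toy_dictionary)
```

A production segmenter would normally weight the edges by word frequency instead of counting every word as cost 1; the uniform cost here is a simplification.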

Step S130: obtain a text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords.

Specifically, the weight of each keyword is calculated and a threshold number is preset. Using the weight of each keyword as the selection criterion, the threshold number of keywords is selected as the text substring.
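A minimal sketch of this selection step, assuming the weights are already available as a mapping from keyword to weight and that "selecting by weight" means keeping the highest-weighted keywords. The name `select_top_keywords` and the sample weights are hypothetical.

```python
def select_top_keywords(weights, threshold_number):
    """Keep the threshold_number highest-weight keywords in their original order."""
    ranked = sorted(weights, key=weights.get, reverse=True)[:threshold_number]
    chosen = set(ranked)
    # Iterating the dict preserves the order in which keywords were inserted.
    return [keyword for keyword in weights if keyword in chosen]


# Hypothetical weights for three keywords.
sample_weights = {"i": 0.1, "go": 0.4, "bank": 0.9}
top_two = select_top_keywords(sample_weights, 2)
```

Keeping the original keyword order, rather than the ranked order, means that two strings with the same surviving keywords produce the same substring.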

Step S140: remove the duplicate items of the text substring.

Specifically, after the text string has undergone a series of generalization operations such as word segmentation and keyword extraction, the corresponding text substring is obtained, and the duplicate items in the text substring are then removed. The duplicate items can include, but are not limited to, identical characters or words and semantically similar characters or words in the text string. Optionally, a consistency hashing algorithm and a locality-sensitive hashing algorithm are applied together. A consistency hashing algorithm, such as Message-Digest Algorithm 5 (MD5) or the murmur hash algorithm, operates on the generalized text string, and the hash value it generates is the unique identifier of the text string. A locality-sensitive hashing algorithm, such as SimHash, further computes similarity from the generated hash values by Hamming distance to decide whether texts are identical or similar. In information coding, the Hamming distance is the number of positions at which the corresponding bits of two valid codewords differ; two text strings whose Hamming distance is less than 3 are considered identical. Applying the two algorithms together, the deduplication operation is performed on the text substrings according to the generated hash values and the Hamming distance.
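The combined use of an exact hash and a locality-sensitive hash might look like the sketch below. It assumes an unweighted 64-bit SimHash whose per-token hashes are taken from MD5, which is one common construction and not necessarily the one intended by the patent. All function names are hypothetical; the distance limit of 3 follows the text above.

```python
import hashlib


def exact_fingerprint(substring):
    """MD5 digest used as the identifier of a generalized text substring."""
    return hashlib.md5(substring.encode("utf-8")).hexdigest()


def simhash(tokens, bits=64):
    """Unweighted SimHash: each token votes +1 or -1 on every bit position."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(tokens_a, tokens_b, max_distance=3):
    """Exact match first, then near match by Hamming distance below the limit."""
    if exact_fingerprint(" ".join(tokens_a)) == exact_fingerprint(" ".join(tokens_b)):
        return True
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) < max_distance
```

Checking the exact fingerprint first keeps the common case cheap; the SimHash comparison only matters for substrings that differ slightly.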

Embodiment two

Fig. 2 is a flow chart of a method for short text deduplication in Embodiment Two of the present invention. On the basis of Embodiment One, this embodiment further describes steps S120, S130, and S140. In step S120, obtaining the keywords of the text string according to the segmentation information of the text string includes: removing the stop words from the segmentation information and performing normalization. In step S130, the factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword, and the text substring including a threshold number of keywords includes: removing the keywords in the text string whose weight is less than a preset weight threshold; or selecting the threshold number of keywords in the text string according to the weights corresponding to the keywords; and joining two or more keywords in the text string into a phrase with a preset separator or separator string. In step S140, removing the duplicate items of the text substring includes: if there is one text substring, removing the duplicate items within that text substring; if there are two or more text substrings, removing the duplicate items between the text substrings. Specifically, the method for short text deduplication in this embodiment includes: step S210, step S220, step S230, step S240, and step S250.

Step S210: obtain the text string information of the short text.

Step S220: perform word segmentation on the text string, remove the stop words from the segmentation information, and perform normalization to obtain the keywords of the text string.

Specifically, in information retrieval, certain characters or words are automatically filtered out before or after natural language data (or text) is processed, in order to save storage space and improve search efficiency; these characters or words are called stop words. Stop words are entered manually rather than generated automatically, and together they form a stop word list, which is used to remove stop words. The stop word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary words and function words, such as "eh" and "oh". Stop word removal and normalization are applied to the segmentation result of the text string to obtain its keyword information. Preferably, for the text string "I like to drink latte", the stop word it contains is removed and only the remaining words are kept. Normalization includes, but is not limited to: unifying full-width and half-width characters into half-width, unifying uppercase and lowercase letters into lowercase, unifying numerals into Arabic numerals, and reducing English word forms to their roots. Optionally, "I like Shakespeare's 【Hamlet】" is normalized to "I like shakespeare's [hamlet]"; variant written forms of "Industrial and Commercial Bank" are normalized to a single form; "二〇〇八" is normalized to "2008"; and "does, do, doing, and did" are normalized to "do".
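Part of this step can be sketched with the Python standard library, as an illustration only. The sketch assumes that Unicode NFKC normalization is an acceptable way to fold full-width letters and digits into half-width ones and covers lowercasing and stop word removal; it does not convert Chinese numerals or reduce English words to their roots. The stop word list and function names are hypothetical.

```python
import unicodedata

# Hypothetical stop word list; a real list is curated by hand.
STOP_WORDS = {"eh", "oh", ",", "."}


def normalize_token(token):
    """Fold full-width forms to half-width (NFKC) and lowercase the result."""
    return unicodedata.normalize("NFKC", token).lower()


def extract_keywords(tokens, stop_words=STOP_WORDS):
    """Normalize every token, then drop empty tokens and stop words."""
    normalized = (normalize_token(token) for token in tokens)
    return [token for token in normalized if token and token not in stop_words]


keywords = extract_keywords(["I", "oh", "like", "Ｌａｔｔｅ", ","])
```

Normalizing before the stop word lookup means the stop word list only needs to hold half-width, lowercase entries.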

Step S230: remove the keywords in the text string whose weight is less than a preset weight threshold; or select the threshold number of keywords in the text string according to the weights corresponding to the keywords, to obtain the text substring. The factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword.

Specifically, the text substring is obtained by processing the text string, with a weight threshold Q and a threshold number G set in advance. After the text string is segmented, its stop words removed, and the result normalized, the keywords of the text string are obtained. The factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword. The term frequency (TF) of a keyword is the frequency with which the word occurs in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the total number of documents used for the statistics and n is the number of documents containing the word. IDF measures the discriminating power of the word: the greater the discriminating power, the lower the degree of duplication between two text strings. IDF is obtained from statistics over massive external resources. TF-IDF is a statistical method for assessing how important a word is to one document in a document collection or corpus. Optionally, the weight of each keyword is obtained as TF*IDF from the term frequency and inverse document frequency of each keyword of the text string. According to these weights, the keywords whose weight is less than the preset weight threshold Q are removed and the remaining keywords form the text substring; or the threshold number G of keywords in the text string is selected as the text substring.
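The weight calculation and the threshold-Q alternative can be sketched as below. The sketch takes TF as the keyword's count divided by the number of keywords in the string and uses IDF = log(t/n) as defined above. The document counts would come from an external corpus; the numbers used here are invented for illustration, and the function names are hypothetical. Every keyword is assumed to appear in the document-count table with n of at least 1.

```python
import math
from collections import Counter


def tf_idf_weights(keywords, doc_freq, total_docs):
    """Weight of each keyword as TF * IDF, with IDF = log(t / n)."""
    counts = Counter(keywords)
    weights = {}
    for word, count in counts.items():
        tf = count / len(keywords)
        idf = math.log(total_docs / doc_freq[word])
        weights[word] = tf * idf
    return weights


def drop_below_threshold(weights, q):
    """Keep only the keywords whose weight is at least the threshold Q."""
    return [word for word, weight in weights.items() if weight >= q]


# Invented corpus statistics: "go" appears in every document, "bank" in 1%.
doc_freq = {"go": 1000, "bank": 10}
weights = tf_idf_weights(["go", "bank", "bank"], doc_freq, total_docs=1000)
kept = drop_below_threshold(weights, q=0.5)
```

A word that occurs in every document gets IDF = log(1) = 0 and therefore weight 0, so it is the first to be dropped by any positive threshold.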

Step S240: join two or more keywords in the text string into a phrase with a preset separator or separator string.

Specifically, two or more keywords in the text string are obtained, and the phrase formed by joining them with a preset separator or separator string is used as the text substring. The preset separator or separator string can include, but is not limited to, a space, an enumeration comma, and the like. Optionally, when the content of the text string is "Tian An Men Ye company", the result after segmentation, stop word removal, and normalization is the keywords "Tian An" and "Men Ye", and after the preset separator is added the two keywords remain separated. If the preset separator or separator string is not added, the system automatically recognizes the more common "Tiananmen", which changes the meaning of the text string.
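The reason for the separator can be shown with a small sketch: without it, two different keyword lists can collapse into the same character sequence. The function name `join_keywords` and the romanized keywords are hypothetical stand-ins for the example above.

```python
def join_keywords(keywords, separator=" "):
    """Join keywords into one text substring with the preset separator."""
    return separator.join(keywords)


# Two different segmentations of the same characters (romanized stand-ins).
door_company = ["tianan", "menye"]
landmark = ["tiananmen", "ye"]

# With no separator the word boundaries are lost and the two lists collide.
collide = join_keywords(door_company, "") == join_keywords(landmark, "")
# With a separator the boundaries survive, so the substrings stay distinct.
distinct = join_keywords(door_company) != join_keywords(landmark)
```

Because the hash in the later deduplication step is computed over the joined substring, preserving the boundaries keeps differently segmented strings from being treated as duplicates.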

Step S250: remove the duplicate items of the text substring. If there is one text substring, remove the duplicate items within that text substring; if there are two or more text substrings, remove the duplicate items between the text substrings.

Specifically, the duplicate removal operation is performed on the text substring obtained after the preset separator or separator string has been added between the keywords. If there is one text substring, the duplicate items within it are removed. If there are two or more text substrings, steps S210 to S240 are performed on each text string separately, and the duplicate items between the resulting text substrings are then removed. Removing duplicate items can include, but is not limited to, removing keywords whose hash values are identical or similar, with similarity computed by Hamming distance to decide whether texts are identical or similar.
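The two cases can be sketched as follows, using exact MD5 matching only; near-duplicate matching by Hamming distance is left out to keep the sketch short. The function names and sample strings are hypothetical.

```python
import hashlib


def dedupe_within(keywords):
    """One substring: drop repeated keywords, keeping the first occurrence."""
    seen, result = set(), []
    for keyword in keywords:
        if keyword not in seen:
            seen.add(keyword)
            result.append(keyword)
    return result


def dedupe_between(substrings):
    """Several substrings: keep the first substring for each distinct MD5 value."""
    seen, unique = set(), []
    for substring in substrings:
        key = hashlib.md5(substring.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(substring)
    return unique


within = dedupe_within(["go", "bank", "go"])
between = dedupe_between(["i go bank", "i go bank", "i like latte"])
```

Storing only the hash values in the `seen` set keeps memory use bounded by the digest size and avoids pairwise comparison of the substrings.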

In this embodiment of the present invention, the text substring is obtained by operations such as segmenting the text string, removing stop words, normalizing, and adding a preset separator or separator string. For a single text substring, the duplicate items within it are removed; for two or more text substrings, the duplicate items between them are removed. A series of generalization operations is performed before the hash value is calculated, which further broadens the degree of generalization of the algorithm, improves deduplication efficiency, and achieves deduplication of one or more texts.

Embodiment three

Fig. 3 is a flow chart of a deduplication method for short text in Embodiment Three of the present invention. On the basis of Embodiments One and Two, and as a preferred embodiment, this embodiment describes the deduplication operation between two text strings. Specifically, the method for short text deduplication in this embodiment includes: step S310, step S320, step S330, step S340, step S350, step S360, and step S370.

Step S310: obtain the information of the first text string and the second text string.

Step S320: perform word segmentation on the first text string to obtain the segmentation information of the first text string, and perform word segmentation on the second text string to obtain the segmentation information of the second text string.

Step S330: perform stop word removal and normalization on the segmentation result of the first text string to obtain the keyword information of the first text string; and perform stop word removal and normalization on the segmentation result of the second text string to obtain the keyword information of the second text string.

Specifically, in information retrieval, certain characters or words are automatically filtered out before or after natural language data (or text) is processed, in order to save storage space and improve search efficiency; these characters or words are called stop words. Stop words are entered manually rather than generated automatically, and together they form a stop word list, which is used to remove stop words. Preferably, the stop word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary words and function words, such as "eh" and "oh". Stop word removal and normalization are applied to the segmentation result of the first text string and to the segmentation result of the second text string, to obtain the keyword information of the first text string and of the second text string. Normalization includes, but is not limited to: unifying full-width and half-width characters into half-width, unifying uppercase and lowercase letters into lowercase, unifying numerals into Arabic numerals, and reducing English word forms to their roots. Optionally, "I like Shakespeare's 【Hamlet】" is normalized to "I like shakespeare's [hamlet]"; variant written forms of "Industrial and Commercial Bank" are normalized to a single form; "二〇〇八" is normalized to "2008"; and "does, do, doing, and did" are normalized to "do".

Optionally, for the first text string S1 "I will go to Industrial and Commercial Bank" and the second text string S2 "I go to Industrial and Commercial Bank", the segmentation results using the shortest path method are S1: I / will / go to / Industrial and Commercial Bank, and S2: I / go to / Industrial and Commercial Bank. After stop word removal and normalization are applied to the first text string S1 and the second text string S2, the results are S1: I / will / go to / Industrial and Commercial Bank, and S2: I / go to / Industrial and Commercial Bank.

Step S340: obtain the first weight of each keyword of the first text string according to the term frequency and inverse document frequency of each keyword of the first text string; and obtain the second weight of each keyword of the second text string according to the term frequency and inverse document frequency of each keyword of the second text string.

Specifically, the term frequency (TF) of a keyword is the frequency with which the word occurs in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the total number of documents used for the statistics and n is the number of documents containing the word. IDF measures the discriminating power of the word: the greater the discriminating power, the lower the degree of duplication between two text strings. IDF is obtained from statistics over massive external resources. TF-IDF is a statistical method for assessing how important a word is to one document in a document collection or corpus. Using TF*IDF, the first weight of each keyword of the first text string is obtained from the term frequency and inverse document frequency of each keyword of the first text string, and the second weight of each keyword of the second text string is obtained from the term frequency and inverse document frequency of each keyword of the second text string.

Step S350: remove the keywords of the first text string whose first weight is less than the preset weight threshold, the remaining keywords forming the first text substring; or select the threshold number of keywords according to the weight of each keyword as the first text substring. Remove the keywords of the second text string whose second weight is less than the preset weight threshold, the remaining keywords forming the second text substring; or select the threshold number of keywords according to the weight of each keyword as the second text substring.

Specifically, a weight threshold Q is preset. The keywords in the first text string whose first weight is less than Q are removed, and the remaining keywords form the first text substring; or a threshold number G is preset and G keywords are selected according to their weights. Likewise, the keywords in the second text string whose second weight is less than Q are removed, and the remaining keywords form the second text substring; or G keywords are selected according to their weights.

Step S360: obtain two or more words in the keyword information of the first text string, and use the phrase formed by joining them with a preset separator or separator string as the first text substring; obtain two or more words in the keyword information of the second text string, and use the phrase formed by joining them with a preset separator or separator string as the second text substring.

Specifically, the keyword information of the first text string is obtained; it includes two or more words, and the phrase formed by joining these words with a preset separator or separator string is used as the first text substring. The keyword information of the second text string is obtained; it includes two or more words, and the phrase formed by joining these words with a preset separator or separator string is used as the second text substring. The preset separator or separator string can include, but is not limited to, a space, an enumeration comma, and the like. Optionally, when the content of the first text string or the second text string is "Tian An Men Ye company", the result after segmentation, stop word removal, and normalization is the keywords "Tian An" and "Men Ye", and after the preset separator is added the two keywords remain separated. If the preset separator or separator string is not added, the system automatically recognizes the more common "Tiananmen", which changes the meaning of the text string.

Step S370: perform the deduplication operation on the first text substring and the second text substring.

This embodiment of the present invention provides a preferred scheme for deduplication between two text strings. Before the hash value is calculated, a series of analysis and processing steps is applied to the text strings; the text strings are finally represented as hash values, on which the deduplication operation is performed. This resolves the overly strict matching of traditional hashing algorithms, generalizes the original strings, and improves deduplication efficiency.

Embodiment four

Fig. 4 is a structure diagram of a device for short text deduplication in Embodiment Four of the present invention. The device is suitable for performing the short text deduplication method provided in Embodiments One to Three of the present invention, and specifically includes: an acquiring unit 410, an extraction unit 420, a processing unit 430, and an operating unit 440.

The acquiring unit 410 is configured to obtain the text string information of the short text;

The extraction unit 420 is connected to the acquiring unit 410 and is configured to perform word segmentation on the text string and obtain the keywords of the text string according to the segmentation information of the text string;

The processing unit 430 is connected to the extraction unit 420 and is configured to obtain a text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords;

The operating unit 440 is connected to the processing unit 430 and is configured to remove the duplicate items of the text substring.

Further, in the processing unit 430, the factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword, and the processing unit 430 is specifically configured to: remove the keywords in the text string whose weight is less than a preset weight threshold; or select the threshold number of keywords in the text string according to the weights corresponding to the keywords; and join two or more keywords in the text string into a phrase with a preset separator or separator string.

Further, the extraction unit 420 is specifically configured to: remove the stop words from the segmentation information and perform normalization.

Further, the operating unit 440 is specifically configured to: if there is one text substring, remove the duplicate items within that text substring; if there are two or more text substrings, remove the duplicate items between the text substrings.
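The chain of four connected units could be modelled in software as below. This is a sketch under assumptions: the class name `DedupDevice`, the field names, and the stand-in functions passed to it are hypothetical, and each unit is reduced to a plain callable. The patent allows the device to be implemented in software and/or hardware.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DedupDevice:
    acquire: Callable[[], List[str]]           # acquiring unit 410
    extract: Callable[[str], List[str]]        # extraction unit 420
    process: Callable[[List[str]], str]        # processing unit 430
    operate: Callable[[List[str]], List[str]]  # operating unit 440

    def run(self) -> List[str]:
        """Pass each text string through the units in their connected order."""
        strings = self.acquire()
        substrings = [self.process(self.extract(text)) for text in strings]
        return self.operate(substrings)


# Stand-in units: whitespace splitting replaces real word segmentation, and
# order-preserving exact matching replaces hash-based deduplication.
device = DedupDevice(
    acquire=lambda: ["I go bank", "i go bank", "I like latte"],
    extract=lambda text: text.lower().split(),
    process=lambda keywords: " ".join(keywords),
    operate=lambda substrings: list(dict.fromkeys(substrings)),
)
result = device.run()
```

Passing the units in as callables mirrors the "connected to" relationships in the text: each unit consumes only the output of the unit before it.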

Obviously, those skilled in the art will understand that the above product can perform the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; without departing from the inventive concept, it can include more equivalent embodiments, and its scope is determined by the scope of the appended claims.

Claims

1. A deduplication method for short text, characterized by comprising:

obtaining the text string information of the short text;

performing word segmentation on the text string, and obtaining the keywords of the text string according to the segmentation information of the text string;

obtaining a text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords;

removing the duplicate items of the text substring.

2. The deduplication method for short text according to claim 1, characterized in that the text substring including a threshold number of keywords includes:

removing the keywords in the text string whose weight is less than a preset weight threshold; or,

selecting the threshold number of keywords in the text string according to the weights corresponding to the keywords.

3. The deduplication method for short text according to claim 1, characterized in that obtaining the keywords of the text string according to the segmentation information of the text string includes:

removing the stop words from the segmentation information, and performing normalization.

4. The deduplication method for short text according to claim 1, characterized in that obtaining the text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords, further includes:

the factors affecting the keyword weight including at least the term frequency and/or the inverse document frequency of each keyword.

5. The deduplication method for short text according to claim 1, characterized in that obtaining the text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords, further includes:

joining two or more keywords in the text string into a phrase with a preset separator or separator string.

6. The deduplication method for short text according to claim 1, characterized in that removing the duplicate items of the text substring includes:

if there is one text substring, removing the duplicate items within that text substring; if there are two or more text substrings, removing the duplicate items between the text substrings.

7. A deduplication device for short text, characterized by comprising:

an acquiring unit, configured to obtain the text string information of the short text;

an extraction unit, connected to the acquiring unit and configured to perform word segmentation on the text string and obtain the keywords of the text string according to the segmentation information of the text string;

a processing unit, connected to the extraction unit and configured to obtain a text substring according to the weights corresponding to the keywords, the text substring including a threshold number of keywords;

an operating unit, connected to the processing unit and configured to remove the duplicate items of the text substring.

8. The short text deduplication device according to claim 7, characterized in that, in the processing unit, the factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword, and the processing unit is specifically configured to:

select the threshold number of keywords in the text string according to the weights corresponding to the keywords;

9. The short text deduplication device according to claim 7, characterized in that the extraction unit is specifically configured to:

remove the stop words from the segmentation information, and perform normalization.

10. The short text deduplication device according to claim 7, characterized in that the operating unit is specifically configured to: