CN106569989A - Deduplication method and apparatus for short text - Google Patents


Info

Publication number
CN106569989A
Authority
CN
China
Prior art keywords
text
key word
string
substring
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610915522.3A
Other languages
Chinese (zh)
Inventor
李苗苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Intelligent Housekeeper Technology Co Ltd
Original Assignee
Beijing Intelligent Housekeeper Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Intelligent Housekeeper Technology Co Ltd filed Critical Beijing Intelligent Housekeeper Technology Co Ltd
Priority to CN201610915522.3A priority Critical patent/CN106569989A/en
Publication of CN106569989A publication Critical patent/CN106569989A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

An embodiment of the invention discloses a deduplication method for short text. The method comprises: obtaining text string information of the short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information of the text string; obtaining a text substring according to the weights corresponding to the keywords, wherein the text substring comprises a threshold number of keywords; and removing duplicate items of the text substring. In the technical scheme provided by the embodiment, extracting the keywords of the text string generalizes the original text string, which improves the generalization capability and efficiency of deduplication; the amount of computation is also small, and deduplication can be performed among multiple text strings.

Description

A deduplication method and device for short text
Technical field
Embodiments of the present invention relate to the technical field of text processing, and in particular to a deduplication method and device for short text.
Background
Text deduplication refers to removing identical characters, identical words, or semantically similar content from text strings. With the continuous development of Internet technology, a large volume of short message streams has appeared. These messages are enormous in number but generally very short, and such messages are referred to as short text. Specifically, short text is text of very short length, typically within 200 characters, for example the common SMS messages sent over mobile communication networks, instant messages sent by instant messaging software, comments on weblogs, comments on Internet news, and so on.
Current text deduplication methods are mainly text hashing methods and similarity comparison methods. Text hashing is divided into consistent hashing and locality-sensitive hashing. Consistent hashing has no generalization, and its matching condition is too strict; locality-sensitive hashing is better suited to relatively long texts such as web pages. Similarity comparison requires pairwise comparison, so its amount of computation is too large to handle massive volumes of text. Because short texts are generally very short, their sample features are very sparse and it is difficult to extract effective linguistic features accurately. Short text is also highly time-sensitive and extremely large in quantity, so processing it demands higher efficiency than processing long text. Short text is concise in expression, contains many misspellings, nonstandard usages and much noise, offers limited usable information, and has severely sparse vocabulary, so directly applying a long-text deduplication method to short text gives degraded results.
Summary of the invention
In view of this, the present invention proposes a deduplication method and device for short text, to solve problems such as overly strict matching conditions in text deduplication and to improve the generalization capability and efficiency of short-text deduplication.
In a first aspect, an embodiment of the present invention provides a deduplication method for short text, the method comprising: obtaining text string information of the short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information of the text string; obtaining a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords; and removing duplicate items of the text substring.
In a second aspect, an embodiment of the present invention provides a deduplication device for short text, the device comprising: an acquiring unit, configured to obtain text string information of the short text; an extraction unit, connected to the acquiring unit, configured to perform word segmentation on the text string and obtain keywords of the text string according to the word segmentation information of the text string; a processing unit, connected to the extraction unit, configured to obtain a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords; and an operating unit, connected to the processing unit, configured to remove duplicate items of the text substring.
In the embodiments of the present invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of the short text, a text substring is obtained according to the weight information of the keywords, and the duplicate items in the text substring are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, requires little computation, and achieves deduplication within one text string or among multiple text strings.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flow chart of a deduplication method for short text in Embodiment one of the present invention;
Fig. 2 is a flow chart of a deduplication method for short text in Embodiment two of the present invention;
Fig. 3 is a flow chart of a deduplication method for short text in Embodiment three of the present invention;
Fig. 4 is a structure diagram of a deduplication device for short text in Embodiment four of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content. It should further be noted that, for ease of explanation, the following embodiments show examples related to the present invention; these examples serve only to illustrate the principles of the embodiments and are not intended to limit them, and the specific numerical values in these examples may vary with the application environment and with the parameters of the device or component.
The method and device for short-text deduplication in the embodiments of the present invention can run in a terminal on which an operating system is installed, such as Windows (an operating system platform developed by Microsoft), Android (an operating system platform for portable mobile smart devices developed by Google), iOS (an operating system platform for portable mobile smart devices developed by Apple), or Windows Phone (an operating system platform for portable mobile smart devices developed by Microsoft). The terminal may be any one of a desktop computer, a notebook computer, a mobile phone, a palmtop computer, a tablet computer, a digital camera, a digital video camera, and the like.
Embodiment one
Fig. 1 is a flow chart of a method for short-text deduplication in Embodiment one of the present invention. The method is used for the deduplication of short text and can be performed by a device with a document processing function; the device can be implemented in software and/or hardware and is typically a user terminal device such as a mobile phone or a computer. In this embodiment, generalization refers to the relationship between a general description of an element and a specific description: the specific description is built on the general description and extends it. To generalize means to operate on an element so as to make it more general. The method for short-text deduplication in this embodiment includes: step S110, step S120, step S130 and step S140.
Step S110: obtain the text string information of the short text.
Specifically, the user inputs the text string to be processed, and the information of the text string is obtained. Optionally, the information of the text string may include, but is not limited to, the name of the text string, the content of the text string, the length of the text string, and the semantics of each word in the text string. Optionally, the name of the text string may be S.
Step S120: perform word segmentation on the text string, and obtain the keywords of the text string according to the word segmentation information of the text string.
Specifically, word segmentation is performed on the text string. Word segmentation is a basic step of information processing; its main task is to have the computer automatically split a sentence and identify the independent words. Optionally, the segmentation algorithm may be the shortest-path method. The shortest-path method computes the shortest path from one node to all other nodes; its main characteristic is that it expands outward layer by layer from the starting point until it reaches the end point. Optionally, for the text string S "I will go to the Industrial and Commercial Bank", the segmentation result using the shortest-path method is: I / will / go to / Industrial and Commercial Bank. The word segmentation information of the text string is then processed to obtain the keyword information of the text string. The keyword information may include, but is not limited to, the verbs, nouns and adjectives that express the actual meaning of the text string. For the text string S "I will go to the Industrial and Commercial Bank", the keyword information is: I / go to / Industrial and Commercial Bank.
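The shortest-path segmentation mentioned in step S120 can be sketched as follows. This is a minimal illustration rather than the segmenter used in the patent: it treats every dictionary word (and, as a fallback, every single character) as an edge in a graph over character positions and picks the path with the fewest words. The function name `shortest_path_segment`, the toy dictionary, and the space-free English input are all assumptions made for the example.

```python
def shortest_path_segment(text, dictionary):
    """Split `text` into the fewest dictionary words (single characters as fallback)."""
    n = len(text)
    max_len = max((len(word) for word in dictionary), default=1)
    # best[i] = (number of words, previous cut position) for the prefix text[:i]
    best = [(0, -1)] + [(float("inf"), -1)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Every single character counts as an edge, so a path always exists.
            if piece in dictionary or end - start == 1:
                cost = best[start][0] + 1
                if cost < best[end][0]:
                    best[end] = (cost, start)
    # Follow the back-pointers from the end of the string to recover the words.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]


# Hypothetical toy dictionary; a real system would use a large lexicon.
TOY_DICTIONARY = {"i", "will", "go", "to", "the", "bank"}
print(shortest_path_segment("iwillgotothebank", TOY_DICTIONARY))
# prints ['i', 'will', 'go', 'to', 'the', 'bank']
```

Because every edge has the same cost, the dynamic program above finds the same result as a breadth-first expansion from the start of the string, which matches the layer-by-layer description in the text.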
Step S130: obtain a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords.
Specifically, the weight of each keyword is calculated and a threshold number is preset; using the weight corresponding to each keyword as the selection criterion, the threshold number of keywords are selected as the text substring.
Step S140: remove the duplicate items of the text substring.
Specifically, after a series of generalization operations such as word segmentation and keyword extraction have been performed on the text string, the corresponding text substring is obtained, and the duplicate items in the text substring are then removed. The duplicate items may include, but are not limited to, identical characters or words, and semantically similar characters or words, in the text string. Optionally, a consistent hashing algorithm and a locality-sensitive hashing algorithm are applied in combination. A consistent hashing algorithm, such as Message-Digest Algorithm 5 (MD5) or the murmur hash algorithm, is applied to the generalized text string, and the generated hash string value is the unique identifier of the text string. A locality-sensitive hashing algorithm, such as the SimHash algorithm, further computes similarity from the generated hash string values by Hamming distance to determine whether texts are identical or similar. In information coding, the Hamming distance is the number of bit positions at which two valid codewords differ; two text strings whose Hamming distance is less than 3 are considered identical. By applying the consistent hashing algorithm and the locality-sensitive hashing algorithm in combination, the text substrings are deduplicated according to the generated hash string values and Hamming distances.
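The combination of an exact hash and a locality-sensitive hash described in step S140 might look like the sketch below. It is only one possible reading of the text: MD5 serves as the exact fingerprint, a simple 64-bit SimHash is built from MD5-derived token hashes, and two substrings count as duplicates when their Hamming distance is less than 3. All function names, the 64-bit width, and the whitespace tokenization of the substring are assumptions of this example.

```python
import hashlib


def exact_fingerprint(substring):
    """MD5 hex digest used as the exact identifier of a text substring."""
    return hashlib.md5(substring.encode("utf-8")).hexdigest()


def simhash(tokens, bits=64):
    """Simple unweighted SimHash: a per-bit majority vote over token hashes."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions at which two integers differ."""
    return bin(a ^ b).count("1")


def deduplicate(substrings, max_distance=3):
    """Keep the first occurrence of each exact or near-duplicate substring."""
    kept, seen_exact, seen_simhashes = [], set(), []
    for substring in substrings:
        fingerprint = exact_fingerprint(substring)
        if fingerprint in seen_exact:
            continue  # exact duplicate
        signature = simhash(substring.split())
        if any(hamming_distance(signature, other) < max_distance
               for other in seen_simhashes):
            continue  # near duplicate under the Hamming-distance rule
        seen_exact.add(fingerprint)
        seen_simhashes.append(signature)
        kept.append(substring)
    return kept


print(deduplicate(["i go bank", "i go bank", "latte"]))
```

The linear scan over `seen_simhashes` keeps the sketch short; a production system would normally index SimHash signatures so that near-duplicate lookup does not compare against every stored value.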
In this embodiment of the present invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of the short text, a text substring is obtained according to the weight information of the keywords, and the duplicate items in the text substring are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, requires little computation, and achieves deduplication within one text string or among multiple text strings.
Embodiment two
Fig. 2 is a flow chart of a method for short-text deduplication in Embodiment two of the present invention. On the basis of Embodiment one, this embodiment further describes step S120, step S130 and step S140. In step S120, obtaining the keywords of the text string according to the word segmentation information of the text string includes: removing the stop words in the word segmentation information, and performing normalization. In step S130, the factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword, and the text substring comprising a threshold number of keywords includes: removing the keywords in the text string whose weight is less than a preset weight threshold; or, selecting a threshold number of keywords in the text string according to the weights corresponding to the keywords; and joining two or more keywords in the text string into a phrase using a preset separator or separator string. In step S140, removing the duplicate items of the text substring includes: if there is one text substring, removing the duplicate items within the text substring; if there are two or more text substrings, removing the duplicate items between the text substrings. Specifically, the method for short-text deduplication in this embodiment includes: step S210, step S220, step S230, step S240 and step S250.
Step S210: obtain the text string information of the short text.
Step S220: perform word segmentation on the text string, remove the stop words in the word segmentation information, and perform normalization to obtain the keywords of the text string.
Specifically, in information retrieval, to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed; these characters or words are called stop words. Stop words are all entered manually rather than generated automatically; the entered stop words form a stop-word list, and stop words are removed using this list. The stop-word list includes, but is not limited to, punctuation marks, mathematical symbols, and auxiliary words and function words in Chinese, such as ", , eh, oh, " and so on. Stop-word removal and normalization are each applied to the word segmentation result of the text string to obtain the keyword information of the text string. Preferably, for the text string "I like drinking latte", the stop word " " in it is removed, and the result is "I like drinking latte". Normalization includes, but is not limited to: unifying full-width and half-width characters into half-width, unifying uppercase and lowercase letters into lowercase, unifying numerals into Arabic numerals, and unifying English word forms into their root. Optionally, "I like Shakespeare's【Hamlet】" is normalized to "I like shakespeare's [hamlet]"; two variant written forms of "Industrial and Commercial Bank" are normalized to a single form; "二〇〇八" is normalized to "2008"; and "does, do, doing and did" are normalized to "do".
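A minimal sketch of the stop-word removal and normalization in step S220 is given below. It assumes Unicode NFKC normalization as a stand-in for the full-width to half-width conversion, a small hand-made table for Chinese numerals, and a tiny lookup table in place of a real English lemmatizer; the stop-word list, both tables, and the function names are hypothetical and exist only for illustration.

```python
import unicodedata

# Hypothetical stop-word list; the text says such lists are compiled manually.
STOP_WORDS = {"the", "a", "of", ",", ".", "!"}

# Digit-by-digit mapping for Chinese numerals written positionally (e.g. years).
CHINESE_DIGITS = {"〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
                  "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

# Toy lemma table standing in for a real English lemmatizer (assumption).
LEMMAS = {"does": "do", "doing": "do", "did": "do"}


def normalize(token):
    """Full-width to half-width, lowercase, numerals to Arabic, word form to root."""
    token = unicodedata.normalize("NFKC", token).lower()
    # Convert only tokens made up entirely of Chinese digits, so ordinary
    # words that happen to contain a digit character are left untouched.
    if token and all(ch in CHINESE_DIGITS for ch in token):
        token = "".join(CHINESE_DIGITS[ch] for ch in token)
    return LEMMAS.get(token, token)


def extract_keywords(tokens):
    """Normalize each segmented token, then drop stop words and empty tokens."""
    normalized = (normalize(token) for token in tokens)
    return [token for token in normalized if token and token not in STOP_WORDS]


print(normalize("ＡＢＣ"), normalize("二〇〇八"), normalize("Doing"))
print(extract_keywords(["The", "Bank", "，", "did"]))
```

Normalizing before the stop-word lookup means a full-width comma is matched by the half-width entry in the list, so the list does not need one entry per character variant.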
Step S230: remove the keywords in the text string whose weight is less than a preset weight threshold; or, select a threshold number of keywords in the text string according to the weights corresponding to the keywords, to obtain the text substring. The factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword.
Specifically, the text substring is obtained by processing the text string, with a weight threshold Q and a threshold number G set in advance. After word segmentation, stop-word removal and normalization have been performed on the text string, the keywords of the text string are obtained. The factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword. The term frequency (TF) of a keyword represents how often the word occurs in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the total number of documents used for the statistics and n is the number of documents containing the word. IDF measures the discriminating power of the word: the greater the discriminating power, the lower the degree of repetition between two text strings. IDF is obtained from statistics over a massive external corpus. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. Optionally, TF*IDF is used to obtain the weight of each keyword from the term frequency and inverse document frequency of each keyword of the text string. According to the weight of each keyword, the keywords in the text string whose weight is less than the preset weight threshold Q are removed and the remaining keywords are used as the text substring; or, the threshold number G of keywords in the text string are selected as the text substring.
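The weighting and selection in step S230 can be illustrated with the following sketch. It uses IDF = log(t/n) as given in the text and takes TF as the relative frequency of the keyword in the text string, which is an assumption, since the text does not say whether TF is a raw count or a ratio. The document-frequency table stands in for the external corpus statistics, and the function names and the values of Q and G are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(keywords, document_frequency, total_documents):
    """Weight each keyword by TF * IDF, with IDF = log(t / n)."""
    counts = Counter(keywords)
    weights = {}
    for word, count in counts.items():
        tf = count / len(keywords)
        # Assumption: a word missing from the table is treated as appearing
        # in a single document, which avoids division by zero.
        n = document_frequency.get(word, 1)
        weights[word] = tf * math.log(total_documents / n)
    return weights


def select_by_weight_threshold(keywords, weights, q):
    """Drop keywords whose weight is below the preset weight threshold Q."""
    return [word for word in keywords if weights[word] >= q]


def select_top_g(keywords, weights, g):
    """Keep the G highest-weighted keywords, preserving their original order."""
    top = set(sorted(weights, key=weights.get, reverse=True)[:g])
    return [word for word in keywords if word in top]


keywords = ["i", "go", "bank"]
document_frequency = {"i": 1000, "go": 100, "bank": 10}  # made-up statistics
weights = tf_idf_weights(keywords, document_frequency, total_documents=1000)
print(select_by_weight_threshold(keywords, weights, q=0.5))
print(select_top_g(keywords, weights, g=1))
```

With these made-up statistics the pronoun appears in every document and receives weight zero, so both selection rules discard it, which is the generalizing effect the step is aiming for.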
Step S240: join two or more keywords in the text string into a phrase using a preset separator or separator string.
Specifically, two or more keywords in the text string are obtained, and the phrase formed by joining them with a preset separator or separator string is used as the text substring. The preset separator or separator string may include, but is not limited to, a space, a Chinese enumeration comma, and so on. Optionally, when the content of the text string is "Tian An Men Ye" (Tian An door industry), word segmentation, stop-word removal and normalization yield the keywords "Tian An" and "Men Ye", and after the preset separator is added the two keywords remain separated as "Tian An" and "Men Ye". If no preset separator or separator string is added, the system would automatically recognize the more common "Tiananmen", changing the semantics of the text string.
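The effect of the separator in step S240 can be shown in a few lines. In this sketch the romanized tokens and the function name `join_keywords` are illustrative assumptions; the point is only that two different keyword sequences collapse to the same string when concatenated directly, but stay distinct once a separator is inserted.

```python
def join_keywords(keywords, separator=" "):
    """Join keywords into a phrase with a preset separator."""
    return separator.join(keywords)


door_company = ["tianan", "menye"]   # "Tian An" + "door industry"
landmark = ["tiananmen", "ye"]       # the more common reading

# Without a separator the two segmentations are indistinguishable.
print("".join(door_company) == "".join(landmark))
# With the preset separator the word boundaries are preserved.
print(join_keywords(door_company), "|", join_keywords(landmark))
```

Since the later hashing step only sees the joined string, keeping the boundaries in the string is what lets the hash distinguish the two readings.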
Step S250: remove the duplicate items of the text substring. If there is one text substring, remove the duplicate items within the text substring; if there are two or more text substrings, remove the duplicate items between the text substrings.
Specifically, duplicate removal is performed on the text substring obtained after the preset separator or separator string has been added between the keywords. If there is one text substring, the duplicate items within that text substring are removed. If there are two or more text substrings, steps S210 to S240 are performed on each text string separately, and the duplicate items between the two or more text substrings are removed. Removing duplicate items may include, but is not limited to, removing keywords whose hash values are identical or similar, with similarity computed by Hamming distance to decide whether texts are identical or similar.
In this embodiment of the present invention, the text substring is obtained through operations such as word segmentation, stop-word removal, normalization, and adding a preset separator or separator string to the text string. For a single text substring, the duplicate items within it are removed; for two or more text substrings, the duplicate items between them are removed. Performing a series of generalization operations before the hash values are calculated further broadens the degree of generalization of the algorithm, improves deduplication efficiency, and achieves deduplication of one or more texts.
Embodiment three
Fig. 3 shows a deduplication method for short text in Embodiment three of the present invention. On the basis of Embodiment one and Embodiment two, this embodiment describes, as a preferred embodiment, the deduplication between two text strings. Specifically, the method for short-text deduplication in this embodiment includes: step S310, step S320, step S330, step S340, step S350, step S360 and step S370.
Step S310: obtain the information of a first text string and a second text string.
Step S320: perform word segmentation on the first text string to obtain the word segmentation information of the first text string, and perform word segmentation on the second text string to obtain the word segmentation information of the second text string.
Step S330: perform stop-word removal and normalization on the word segmentation of the first text string to obtain the keyword information of the first text string; and perform stop-word removal and normalization on the word segmentation of the second text string to obtain the keyword information of the second text string.
Specifically, in information retrieval, to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed; these characters or words are called stop words. Stop words are all entered manually rather than generated automatically; the entered stop words form a stop-word list, and stop words are removed using this list. Preferably, the stop-word list includes, but is not limited to, punctuation marks, mathematical symbols, and auxiliary words and function words in Chinese, such as ",, eh, oh, " and so on. Stop-word removal and normalization are applied to the word segmentation result of the first text string and to that of the second text string respectively, to obtain the keyword information of the first text string and the second text string. Normalization includes, but is not limited to: unifying full-width and half-width characters into half-width, unifying uppercase and lowercase letters into lowercase, unifying numerals into Arabic numerals, and unifying English word forms into their root. Optionally, "I like Shakespeare's【Hamlet】" is normalized to "I like shakespeare's [hamlet]"; two variant written forms of "Industrial and Commercial Bank" are normalized to a single form; "二〇〇八" is normalized to "2008"; and "does, do, doing and did" are normalized to "do".
Optionally, the first text string S1 is "I will go to the Industrial and Commercial Bank" and the second text string S2 is "I go to the Industrial and Commercial Bank". The segmentation result using the shortest-path method is, for the first text string S1: I / will / go to / Industrial and Commercial Bank, and for the second text string S2: I / go to / Industrial and Commercial Bank. The result of stop-word removal and normalization is then, for the first text string S1: I / will / go to / Industrial and Commercial Bank, and for the second text string S2: I / go to / Industrial and Commercial Bank.
Step S340: obtain a first weight of each keyword of the first text string according to the term frequency and inverse document frequency of each keyword of the first text string; and obtain a second weight of each keyword of the second text string according to the term frequency and inverse document frequency of each keyword of the second text string.
Specifically, the term frequency (TF) of a keyword represents how often the word occurs in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the total number of documents used for the statistics and n is the number of documents containing the word. IDF measures the discriminating power of the word: the greater the discriminating power, the lower the degree of repetition between two text strings. IDF is obtained from statistics over a massive external corpus. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. TF*IDF is used to obtain the first weight of each keyword of the first text string from the term frequency and inverse document frequency of each keyword of the first text string, and to obtain the second weight of each keyword of the second text string from the term frequency and inverse document frequency of each keyword of the second text string.
Step S350: remove the keywords of the first text string whose first weight is less than a preset weight threshold, and use the remaining keywords as the first text substring; or, select a threshold number of keywords according to the weight of each keyword as the first text substring. Remove the keywords of the second text string whose second weight is less than the preset weight threshold, and use the remaining keywords as the second text substring; or, select a threshold number of keywords according to the weight of each keyword as the second text substring.
Specifically, a weight threshold Q is set in advance, the keywords in the first text string whose first weight is less than Q are removed, and the remaining keywords are used as the first text substring; or, a threshold number G is set in advance and G keywords are selected according to weight. The keywords in the second text string whose second weight is less than Q are removed, and the remaining keywords are used as the second text substring; or, G keywords are selected according to weight.
Step S360: obtain two or more words in the keyword information of the first text string, and use the phrase formed by joining them with a preset separator or separator string as the first text substring; obtain two or more words in the keyword information of the second text string, and use the phrase formed by joining them with a preset separator or separator string as the second text substring.
Specifically, the keyword information of the first text string is obtained; the keyword information includes two or more words, and the phrase formed by joining the two or more words in the keywords of the first text string with a preset separator or separator string is used as the first text substring. The keyword information of the second text string is obtained; the keyword information includes two or more words, and the phrase formed by joining the two or more words in the keywords of the second text string with a preset separator or separator string is used as the second text substring. The preset separator or separator string may include, but is not limited to, a space, a Chinese enumeration comma, and so on. Optionally, when the content of the first text string or the second text string is "Tian An Men Ye" (Tian An door industry), word segmentation, stop-word removal and normalization yield the keywords "Tian An" and "Men Ye", and after the preset separator is added the two keywords remain separated as "Tian An" and "Men Ye". If no preset separator or separator string is added, the system would automatically recognize the more common "Tiananmen", changing the semantics of the text string.
Step S370: perform deduplication on the first text substring and the second text substring.
This embodiment of the present invention provides a preferred scheme for deduplication between two text strings. A series of analysis and processing steps is applied to the text strings before the hash values are calculated, and the text strings are finally represented as hash values for deduplication. This resolves the overly strict matching of traditional hashing algorithms, generalizes the original strings, and improves deduplication efficiency.
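The flow of steps S310 to S370 for two text strings can be condensed into the sketch below. It is a simplified illustration under several assumptions: the input is already separable on whitespace (standing in for word segmentation), the stop-word list and the keyword weights are made up, the threshold number G is 3, and MD5 is used as the only hash. The names `text_substring` and `is_duplicate` are hypothetical.

```python
import hashlib

STOP_WORDS = {"will", "to", "the"}  # hypothetical stop-word list


def text_substring(text, weights, g=3, separator=" "):
    """Segment, drop stop words, keep the G heaviest keywords, join with a separator."""
    keywords = [t for t in text.lower().split() if t not in STOP_WORDS]
    # dict.fromkeys removes repeats while keeping first-seen order, so that
    # ties in weight are resolved deterministically.
    unique = list(dict.fromkeys(keywords))
    top = set(sorted(unique, key=lambda w: weights.get(w, 0.0), reverse=True)[:g])
    return separator.join(w for w in keywords if w in top)


def is_duplicate(first, second, weights):
    """Two strings are duplicates when their generalized substrings hash equally."""
    def fingerprint(text):
        substring = text_substring(text, weights)
        return hashlib.md5(substring.encode("utf-8")).hexdigest()
    return fingerprint(first) == fingerprint(second)


weights = {"i": 0.1, "go": 0.5, "bank": 1.5, "cafe": 1.5}  # made-up weights
print(text_substring("I will go to the bank", weights))
print(is_duplicate("I will go to the bank", "I go to the bank", weights))
print(is_duplicate("I will go to the bank", "I go to the cafe", weights))
```

Hashing the raw strings would treat the first pair as different; hashing the generalized substrings treats them as the same, which is the relaxation of the exact-hash condition that this embodiment describes.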
Example IV
Fig. 4 is a kind of structure chart of the device for short text duplicate removal in the embodiment of the present invention four.The device is applied to The short text De-weight method that the embodiment of the present invention one is provided into embodiment three is performed, the device is specifically included:Acquiring unit 410th, extraction unit 420, processing unit 430 and operating unit 440.
Acquiring unit 410, for obtaining the text string information of short text;
Extraction unit 420, is connected with acquiring unit 410, for carrying out participle to the text string, according to the text string Participle information obtain the key word of the text string;
Processing unit 430, is connected with extraction unit 420, for obtaining text according to the corresponding weight of the key word String, the text substring include the key word of threshold number;
Operating unit 440, is connected with processing unit 440, for removing the duplicate keys of the text substring.
Further, in processing unit 440, affect the frequency of the factor of the keyword weight at least including each key word And/or reverse document frequency, processing unit 440 specifically for:The weight of key word in the text string is removed less than default power The key word of weight threshold value;Or, according to the corresponding weight of the key word, choose the key of threshold number in the text string Word.Two or more key words in the text string are linked to be by phrase by default separator or segmentation string.
Further, extraction unit 420 specifically for:The stop words in the participle information is removed, and is normalized Process.
Further, operating unit 440 specifically for:If the text substring is one, text is removed Duplicate keys in string;If the text substring is two or more, the duplicate keys between the text substring are removed.
In the embodiment of the present invention, carry out participle, extract the extensive operations such as key word by the text string to short text, and root Text substring is obtained according to the weight information of key word, the duplicate keys in text substring is removed, has been reached to the extensive of original text string Effect, improves the generalization ability and efficiency of duplicate removal, and amount of calculation is little, realizes in a text string or duplicate removal between multiple text strings Effect.
Obviously, those skilled in the art will understand that the above product can perform the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to that method.
Note that the above describes only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the appended claims.

Claims (10)

1. A deduplication method for short text, characterized by comprising:
Obtaining the text string information of a short text;
Segmenting the text string into words, and obtaining the keywords of the text string from the word segmentation information of the text string;
Obtaining a text substring according to the weights of the keywords, the text substring containing a threshold number of keywords;
Removing duplicate items from the text substring.
2. The deduplication method for short text according to claim 1, characterized in that the text substring containing a threshold number of keywords comprises:
Removing from the text string the keywords whose weight is below a preset weight threshold; or,
Selecting a threshold number of keywords from the text string according to the weights of the keywords.
3. The deduplication method for short text according to claim 1, characterized in that obtaining the keywords of the text string from the word segmentation information of the text string comprises:
Removing the stop words from the word segmentation information and normalizing the result.
4. The deduplication method for short text according to claim 1, characterized in that obtaining a text substring according to the weights of the keywords, the text substring containing a threshold number of keywords, further comprises:
The factors that affect keyword weight include at least the term frequency and/or the inverse document frequency of each keyword.
5. The deduplication method for short text according to claim 1, characterized in that obtaining a text substring according to the weights of the keywords, the text substring containing a threshold number of keywords, further comprises:
Joining two or more keywords in the text string into a phrase by a preset separator or delimiter string.
6. The deduplication method for short text according to claim 1, characterized in that removing duplicate items from the text substring comprises:
If there is one text substring, removing the duplicate items within that text substring; if there are two or more text substrings, removing the duplicate items between the text substrings.
7. A deduplication device for short text, characterized by comprising:
An acquiring unit, configured to obtain the text string information of a short text;
An extraction unit, connected to the acquiring unit, configured to segment the text string into words and obtain the keywords of the text string from the word segmentation information of the text string;
A processing unit, connected to the extraction unit, configured to obtain a text substring according to the weights of the keywords, the text substring containing a threshold number of keywords;
An operating unit, connected to the processing unit, configured to remove duplicate items from the text substring.
8. The short text deduplication device according to claim 7, characterized in that, in the processing unit, the factors that affect keyword weight include at least the term frequency and/or the inverse document frequency of each keyword, and the processing unit is specifically configured to:
Remove from the text string the keywords whose weight is below a preset weight threshold; or,
Select a threshold number of keywords from the text string according to the weights of the keywords;
Join two or more keywords in the text string into a phrase by a preset separator or delimiter string.
9. The short text deduplication device according to claim 7, characterized in that the extraction unit is specifically configured to:
Remove the stop words from the word segmentation information and normalize the result.
10. The short text deduplication device according to claim 7, characterized in that the operating unit is specifically configured to:
If there is one text substring, remove the duplicate items within that text substring; if there are two or more text substrings, remove the duplicate items between the text substrings.
CN201610915522.3A 2016-10-20 2016-10-20 De-weighting method and apparatus for short text Pending CN106569989A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610915522.3A CN106569989A (en) 2016-10-20 2016-10-20 De-weighting method and apparatus for short text

Publications (1)

Publication Number Publication Date
CN106569989A true CN106569989A (en) 2017-04-19

Family

ID=58533112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610915522.3A Pending CN106569989A (en) 2016-10-20 2016-10-20 De-weighting method and apparatus for short text

Country Status (1)

Country Link
CN (1) CN106569989A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694658A (en) * 2009-10-20 2010-04-14 浙江大学 Method for constructing webpage crawler based on repeated removal of news
CN102289523A (en) * 2011-09-20 2011-12-21 北京金和软件股份有限公司 Method for intelligently extracting text labels
CN102682085A (en) * 2012-04-18 2012-09-19 北京十分科技有限公司 Method for removing duplicated web page
CN103646029A (en) * 2013-11-04 2014-03-19 北京中搜网络技术股份有限公司 Similarity calculation method for blog articles
CN104156452A (en) * 2014-08-18 2014-11-19 中国人民解放军国防科学技术大学 Method and device for generating webpage text summarization
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN105893551A (en) * 2016-03-31 2016-08-24 上海智臻智能网络科技股份有限公司 Method and device for processing data and knowledge graph
CN105989033A (en) * 2015-02-03 2016-10-05 北京中搜网络技术股份有限公司 Information duplication eliminating method based on information fingerprints

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龚静: "《中文文本聚类研究》", 31 March 2012, 北京:中国传媒大学出版社 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066623A (en) * 2017-05-12 2017-08-18 湖南中周至尚信息技术有限公司 A kind of article merging method and device
CN107977347A (en) * 2017-12-04 2018-05-01 海南云江科技有限公司 A kind of topic De-weight method and computing device
CN107977347B (en) * 2017-12-04 2021-12-21 海南云江科技有限公司 Topic duplication removing method and computing equipment
CN108536676A (en) * 2018-03-28 2018-09-14 广州华多网络科技有限公司 Data processing method, device, electronic equipment and storage medium
CN108536676B (en) * 2018-03-28 2020-10-13 广州华多网络科技有限公司 Data processing method and device, electronic equipment and storage medium
CN110032730A (en) * 2019-02-18 2019-07-19 阿里巴巴集团控股有限公司 A kind of processing method of text data, device and equipment
CN110032730B (en) * 2019-02-18 2023-09-05 创新先进技术有限公司 Text data processing method, device and equipment
CN110348539A (en) * 2019-07-19 2019-10-18 知者信息技术服务成都有限公司 Short text correlation method of discrimination
CN110348539B (en) * 2019-07-19 2021-05-07 知者信息技术服务成都有限公司 Short text relevance judging method
WO2021109850A1 (en) * 2019-12-03 2021-06-10 世强先进(深圳)科技股份有限公司 Method and system for deduplicating and storing pdf files

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170419
