CN106569989A - Deduplication method and apparatus for short text
- Publication number
- CN106569989A, CN201610915522.3A
- Authority
- CN
- China
- Prior art keywords
- text
- key word
- string
- substring
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
An embodiment of the invention discloses a deduplication method for short text. The method comprises: obtaining text string information of the short text; performing word segmentation on the text string and obtaining keywords of the text string from its word segmentation information; obtaining a text substring according to the weights corresponding to the keywords, wherein the text substring comprises a threshold number of keywords; and removing duplicate items from the text substring. In the technical solution of this embodiment, extracting the keywords generalizes the original text string, which improves the generalization capability and efficiency of deduplication; the amount of computation is also small, and deduplication across multiple text strings is achieved.
Description
Technical field
Embodiments of the present invention relate to the technical field of text processing, and in particular to a deduplication method and apparatus for short text.
Background art
Text deduplication refers to removing identical characters, identical words, or semantically similar components from a text string. With the continuous development of Internet technology, a large volume of short message streams has appeared. These messages are enormous in number but generally very short, and such information is called short text information. Specifically, a short text is a very short text, typically within 200 characters, such as a common SMS message sent over a mobile communication network, an instant message sent by instant messaging software, a comment on a weblog, or a comment on Internet news.
Current text deduplication methods are mainly text hashing methods and similarity comparison methods. Text hashing is divided into consistency hashing and locality-sensitive hashing: consistency hashing has no generalization and its matching condition is too strict, while locality-sensitive hashing is better suited to relatively long texts such as web pages. Similarity comparison requires pairwise comparison, so its computational cost is too high for massive volumes of text. Because short texts are generally very short, their sample features are very sparse and effective language features are hard to extract accurately. Short texts are also highly real-time and extremely numerous, so processing them demands higher efficiency than processing long texts. In addition, short text is terse, contains many misspellings, non-standard usages, and noise, offers limited usable information, and has severely sparse vocabulary. Applying a long-text deduplication method directly to short text therefore gives degraded results.
Summary of the invention
In view of this, the present invention proposes a deduplication method and apparatus for short text, which solve problems such as overly strict matching conditions in text deduplication and improve the generalization capability and efficiency of short-text deduplication.
In a first aspect, an embodiment of the present invention provides a deduplication method for short text, the method comprising: obtaining text string information of the short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information of the text string; obtaining a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords; and removing duplicate items from the text substring.
In a second aspect, an embodiment of the present invention provides a deduplication apparatus for short text, the apparatus comprising: an acquisition unit, configured to obtain text string information of the short text; an extraction unit, connected to the acquisition unit and configured to perform word segmentation on the text string and obtain keywords of the text string according to the word segmentation information of the text string; a processing unit, connected to the extraction unit and configured to obtain a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords; and an operation unit, connected to the processing unit and configured to remove duplicate items from the text substring.
In the embodiments of the present invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of a short text, a text substring is obtained according to the weight information of the keywords, and duplicate items in the text substring are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, keeps the amount of computation small, and achieves deduplication within one text string or across multiple text strings.
Description of the drawings
Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flow chart of a deduplication method for short text in Embodiment one of the present invention;
Fig. 2 is a flow chart of a deduplication method for short text in Embodiment two of the present invention;
Fig. 3 is a flow chart of a deduplication method for short text in Embodiment three of the present invention;
Fig. 4 is a structural diagram of a deduplication apparatus for short text in Embodiment four of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content. Further, for ease of explanation, the following embodiments give examples related to the present invention. These examples serve only to illustrate the principles of the embodiments and are not intended to limit them, and the specific values in these examples may vary with the application environment and with the parameters of the apparatus or its components.
The method and apparatus for short-text deduplication of the embodiments of the present invention can run on a terminal installed with an operating system such as Windows (an operating system platform developed by Microsoft), Android (an operating system platform for portable mobile smart devices developed by Google), iOS (an operating system platform for portable mobile smart devices developed by Apple), or Windows Phone (an operating system platform for portable mobile smart devices developed by Microsoft). The terminal can be any one of a desktop computer, a notebook computer, a mobile phone, a palmtop computer, a tablet computer, a digital camera, a digital video camera, and the like.
Embodiment one
Fig. 1 is a flow chart of a method for short-text deduplication in Embodiment one of the present invention. The method is used for deduplicating short text and can be performed by an apparatus with a document processing function. The apparatus can be implemented in software and/or hardware and is typically a user terminal device such as a mobile phone or a computer. In this embodiment, generalization refers to the relationship between a general description of an element and a specific description of it: the specific description is built on the general description and extends it. To generalize is to operate on an element so that it becomes more general. The method for short-text deduplication in this embodiment comprises step S110, step S120, step S130, and step S140.
Step S110: obtain the text string information of the short text.
Specifically, the user inputs the text string to be processed, and the information of the text string is obtained. Optionally, the information of the text string may include, but is not limited to, the name of the text string, the content of the text string, the length of the text string, and the semantics of each word in the text string. Optionally, the name of the text string may be S.
Step S120: perform word segmentation on the text string, and obtain the keywords of the text string according to the word segmentation information of the text string.
Specifically, word segmentation is performed on the text string. Word segmentation is a basic step in information processing; its main task is for a computer to split a sentence automatically and identify the independent words. Optionally, the segmentation algorithm may be the shortest-path method. The shortest-path method computes the shortest path from one node to all other nodes; its main feature is that it expands outward layer by layer from the starting point until it reaches the end point. Optionally, for the text string S "I will go to the Industrial and Commercial Bank", the shortest-path segmentation result is "I / will / go to / Industrial and Commercial Bank". The word segmentation information of the text string is then processed to obtain the keyword information of the text string. The keyword information may include, but is not limited to, the verbs, nouns, and adjectives that express the actual meaning of the text string. For the text string S above, the keyword information is "I / go to / Industrial and Commercial Bank".
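The shortest-path idea in step S120 can be illustrated with a small dynamic-programming sketch. This is only a minimal illustration under stated assumptions, not the segmenter used by the patent: the function name `shortest_path_segment`, the toy dictionary, and the rule that any single character counts as a fallback word are all hypothetical. Each character boundary is treated as a graph node, each dictionary word as an edge, and the segmentation with the fewest words is the shortest path from the start to the end of the string.

```python
from typing import List, Set


def shortest_path_segment(text: str, dictionary: Set[str]) -> List[str]:
    """Split `text` into the fewest words.

    A word is any entry of `dictionary`; any single character is also
    accepted as a fallback so that a segmentation always exists.
    """
    n = len(text)
    # best[i] = minimum number of words needed to cover text[:i]
    best = [0] + [n + 1] * n
    # back[i] = start index of the last word in the best cover of text[:i]
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            is_word = piece in dictionary or end - start == 1
            if is_word and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = start
    # Follow the back pointers from the end of the text to the start.
    words: List[str] = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Hypothetical toy dictionary: "goto" is preferred over "go" + "to"
# because it gives a path with fewer edges.
print(shortest_path_segment("gotobank", {"go", "to", "goto", "bank"}))
```

A production segmenter would normally weight edges by word probability rather than counting every word as one step; the unit cost here is chosen only to keep the sketch short.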
Step S130: obtain a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords.
Specifically, the weight of each keyword is calculated and a threshold number is preset. Using the weight of each keyword as the criterion, the threshold number of keywords are selected as the text substring.
Step S140: remove the duplicate items of the text substring.
Specifically, after a series of generalization operations such as word segmentation and keyword extraction has been performed on the text string, the corresponding text substring is obtained, and the duplicate items in the text substring are then removed. The duplicate items may include, but are not limited to, identical characters or words in the text string and semantically similar characters or words. Optionally, a consistency hash algorithm and a locality-sensitive hash algorithm are applied together. A consistency hash algorithm, such as Message-Digest Algorithm 5 (MD5) or the murmur hash algorithm, operates on the generalized text string, and the generated hash value is the unique identifier of the text string. A locality-sensitive hash algorithm, such as the SimHash algorithm, further computes a similarity from the Hamming distance between the generated hash values to decide whether two texts are identical or similar. In information coding, the Hamming distance is the number of positions at which two valid codewords differ; two text strings are considered identical when their Hamming distance is less than 3. Applying the consistency hash algorithm and the locality-sensitive hash algorithm together, the text substring is deduplicated according to the generated hash values and the Hamming distance.
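The combination of an exact hash and a locality-sensitive hash in step S140 can be sketched as follows. This is a minimal sketch under assumptions, not the patent's implementation: the function names, the 64-bit fingerprint width, the unweighted SimHash variant, and the use of MD5 as the per-keyword hash inside SimHash are all illustrative choices. Only the "Hamming distance less than 3" rule comes from the text.

```python
import hashlib
from typing import List

BITS = 64  # assumed fingerprint width


def exact_fingerprint(substring: str) -> str:
    """MD5 hex digest used as the exact identifier of a text substring."""
    return hashlib.md5(substring.encode("utf-8")).hexdigest()


def simhash(keywords: List[str]) -> int:
    """Unweighted 64-bit SimHash over a list of keywords."""
    counts = [0] * BITS
    for word in keywords:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i in range(BITS):
        if counts[i] > 0:
            value |= 1 << i
    return value


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(kw_a: List[str], kw_b: List[str], max_distance: int = 3) -> bool:
    """Exact match first, then near-duplicate check via Hamming distance < 3."""
    if exact_fingerprint(" ".join(kw_a)) == exact_fingerprint(" ".join(kw_b)):
        return True
    return hamming_distance(simhash(kw_a), simhash(kw_b)) < max_distance
```

Checking the exact fingerprint first keeps the common case cheap; the SimHash comparison is only needed when the generalized strings differ.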
In this embodiment of the present invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of a short text, a text substring is obtained according to the weight information of the keywords, and duplicate items in the text substring are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, keeps the amount of computation small, and achieves deduplication within one text string or across multiple text strings.
Embodiment two
Fig. 2 is a flow chart of a method for short-text deduplication in Embodiment two of the present invention. On the basis of Embodiment one, this embodiment further elaborates steps S120, S130, and S140. In step S120, obtaining the keywords of the text string according to the word segmentation information of the text string comprises: removing the stop words from the word segmentation information and performing normalization. In step S130, the factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword, and the text substring comprising a threshold number of keywords comprises: removing from the text string the keywords whose weight is less than a preset weight threshold; or selecting the threshold number of keywords in the text string according to the weights corresponding to the keywords; and joining two or more keywords in the text string into a phrase using a preset separator or delimiter string. In step S140, removing the duplicate items of the text substring comprises: if there is one text substring, removing the duplicate items within that text substring; if there are two or more text substrings, removing the duplicate items between the text substrings.
Specifically, the method for short-text deduplication in this embodiment comprises step S210, step S220, step S230, step S240, and step S250.
Step S210, obtains the text string information of short text.
Step S220: perform word segmentation on the text string, remove the stop words from the word segmentation information, and perform normalization to obtain the keywords of the text string.
Specifically, in information retrieval, certain characters or words are filtered out automatically before or after natural language data (or text) is processed, in order to save storage space and improve search efficiency. These characters or words are called stop words. Stop words are entered manually rather than generated automatically, and together they form a stop-word list that is used to remove stop words. The stop-word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary and function words, such as "eh" and "oh". Stop-word removal and normalization are applied to the word segmentation result of the text string to obtain the keyword information of the text string. For example, for the text string "I like drinking latte", the stop word it contains is removed and the remaining words are kept as the result. Normalization includes, but is not limited to, unifying full-width and half-width characters to half-width, unifying upper-case and lower-case letters to lower case, unifying numerals to Arabic numerals, and unifying English word forms to their root. Optionally, "I like Shakespeare's 【Hamlet】" is normalized to "I like shakespeare's [hamlet]"; variant written forms of "Industrial and Commercial Bank" are normalized to a single form; "二〇〇八" (the year written in Chinese numerals) is normalized to "2008"; and "does", "do", "doing", and "did" are normalized to "do".
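The stop-word removal and normalization in step S220 can be sketched with the Python standard library. This is a minimal sketch under assumptions: the stop-word set, the tiny lemma table, and the function names are hypothetical stand-ins for the manually built stop-word list and the morphological normalizer the text describes. Unicode NFKC normalization is used here as one way to fold full-width letters, digits, and punctuation to half-width; it does not convert every CJK bracket style.

```python
import unicodedata
from typing import Iterable, List

# Hypothetical, manually maintained stop-word list.
STOP_WORDS = {"eh", "oh", ",", ".", "!", "?"}

# Chinese numeral characters mapped to Arabic digits (positional forms only).
CHINESE_DIGITS = "〇一二三四五六七八九"
DIGIT_TABLE = str.maketrans(CHINESE_DIGITS, "0123456789")

# Hypothetical miniature lemma table standing in for a real stemmer.
LEMMAS = {"does": "do", "doing": "do", "did": "do"}


def normalize_token(token: str) -> str:
    """Fold width and case, convert digit-only Chinese numerals, reduce to a root."""
    token = unicodedata.normalize("NFKC", token)  # full-width -> half-width
    token = token.lower()
    # Only translate tokens made up entirely of Chinese digit characters,
    # so ordinary words that contain such a character are left alone.
    if token and all(ch in CHINESE_DIGITS for ch in token):
        token = token.translate(DIGIT_TABLE)
    return LEMMAS.get(token, token)


def extract_keywords(tokens: Iterable[str]) -> List[str]:
    """Normalize each segmented token, then drop stop words and empty tokens."""
    normalized = (normalize_token(t) for t in tokens)
    return [t for t in normalized if t and t not in STOP_WORDS]
```

Normalizing before the stop-word lookup means a full-width comma or an upper-case "OH" is still matched by the half-width, lower-case entries in the list.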
Step S230: remove from the text string the keywords whose weight is less than a preset weight threshold; or, according to the weights corresponding to the keywords, select the threshold number of keywords in the text string, to obtain the text substring. The factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword.
Specifically, the text substring is obtained by processing the text string, with a weight threshold Q and a threshold number G set in advance. After word segmentation, stop-word removal, and normalization, the keywords of the text string are obtained. The term frequency (TF) of a keyword is the frequency with which the word occurs in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the total number of documents used for the statistics and n is the number of documents containing the word. IDF measures how discriminative the word is: the greater the discrimination, the lower the degree of repetition between two text strings. IDF is obtained from statistics over a massive external corpus. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. Optionally, the weight of each keyword is computed as TF*IDF from its term frequency and inverse document frequency. According to these weights, either the keywords whose weight is less than the preset weight threshold Q are removed from the text string and the remaining keywords form the text substring, or the threshold number G of keywords in the text string are selected as the text substring.
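The TF*IDF weighting and the two selection rules of step S230 can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, TF is taken as the count divided by the number of keywords in the string, the document frequencies in the usage example are invented, and an unseen word is assumed to occur in one document. The formula IDF = log(t/n) and the thresholds Q and G follow the text.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(keywords: List[str], doc_freq: Dict[str, int],
                   total_docs: int) -> Dict[str, float]:
    """Weight each keyword by TF * IDF, with IDF = log(t / n)."""
    counts = Counter(keywords)
    total = len(keywords)
    weights: Dict[str, float] = {}
    for word, count in counts.items():
        tf = count / total
        n = doc_freq.get(word, 1)  # assumption: unseen words occur in one document
        weights[word] = tf * math.log(total_docs / n)
    return weights


def select_by_weight_threshold(keywords: List[str], weights: Dict[str, float],
                               q: float) -> List[str]:
    """Drop keywords whose weight is below Q; keep the rest in original order."""
    return [w for w in keywords if weights[w] >= q]


def select_top_g(keywords: List[str], weights: Dict[str, float],
                 g: int) -> List[str]:
    """Keep the G highest-weighted keywords, preserving original order."""
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    keep = set(ranked[:g])
    return [w for w in keywords if w in keep]


# Hypothetical corpus statistics for illustration only.
kws = ["i", "go", "bank"]
w = tf_idf_weights(kws, {"i": 1000, "go": 100, "bank": 10}, total_docs=1000)
print(select_by_weight_threshold(kws, w, q=0.5))
print(select_top_g(kws, w, g=1))
```

A word that appears in every document gets IDF = log(1) = 0, so it is the first to be dropped by either rule, which matches the role of IDF as a measure of discrimination.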
Step S240: join two or more keywords in the text string into a phrase using a preset separator or delimiter string.
Specifically, two or more keywords in the text string are obtained, and the phrase formed by joining them with a preset separator or delimiter string is used as the text substring. The preset separator or delimiter string may include, but is not limited to, a space or an enumeration comma. Optionally, when the content of the text string is the company name "Tian An Men Ye", word segmentation, stop-word removal, and normalization yield keywords that keep the boundary between "Tian An" and "Men Ye", and adding the preset separator preserves that boundary. If no separator or delimiter string is added, the system tends to recognize the more common word "Tiananmen" in the joined characters, which changes the meaning of the text string.
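The reason for the separator in step S240 can be shown in a few lines. This is a minimal sketch: the function name `join_keywords` and the romanized keyword lists are hypothetical, chosen to mirror the company-name example. Two different keyword lists collapse to the same characters when concatenated directly, but stay distinct when a separator marks the word boundaries.

```python
from typing import List


def join_keywords(keywords: List[str], separator: str = " ") -> str:
    """Join the selected keywords into one text substring, keeping word boundaries."""
    return separator.join(keywords)


# Hypothetical romanized keyword lists with different word boundaries.
company = ["tianan", "menye"]
landmark = ["tiananmen", "ye"]

# Without a separator the two substrings are indistinguishable.
print(join_keywords(company, "") == join_keywords(landmark, ""))
# With the separator the boundary survives, so later hashing sees two different strings.
print(join_keywords(company) == join_keywords(landmark))
```

Any separator works as long as it cannot occur inside a keyword; a space is used here because the normalized keywords in this sketch contain none.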
Step S250: remove the duplicate items of the text substring. If there is one text substring, remove the duplicate items within that text substring; if there are two or more text substrings, remove the duplicate items between the text substrings.
Specifically, duplicate items are removed from the text substring obtained after the preset separator or delimiter string has been added between the keywords. If there is one text substring, the duplicate items within it are removed. If there are two or more text substrings, steps S210 to S240 are performed on each text string separately, and the duplicate items between the two or more text strings are then removed. Removing duplicate items may include, but is not limited to, removing keywords whose hash values are identical or similar, with the similarity computed from the Hamming distance to decide whether two texts are identical or similar.
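The two cases of step S250 can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, only exact duplicates are handled (via MD5 fingerprints), and the first occurrence is the one that is kept. Near-duplicate detection by Hamming distance would be layered on top of this.

```python
import hashlib
from typing import Dict, List


def dedupe_within(keywords: List[str]) -> List[str]:
    """Single substring: drop repeated keywords, keeping first occurrences in order."""
    return list(dict.fromkeys(keywords))


def dedupe_across(substrings: List[str]) -> List[str]:
    """Several substrings: keep one substring per MD5 fingerprint, first one wins."""
    seen: Dict[str, str] = {}
    for substring in substrings:
        key = hashlib.md5(substring.encode("utf-8")).hexdigest()
        seen.setdefault(key, substring)
    return list(seen.values())


print(dedupe_within(["go", "bank", "go"]))
print(dedupe_across(["i go bank", "i go bank", "i like latte"]))
```

Because the substrings have already been generalized by segmentation, stop-word removal, and normalization, texts that differ only in surface form reach this step as identical strings and share a fingerprint.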
In this embodiment of the present invention, the text substring is obtained by performing word segmentation, stop-word removal, and normalization on the text string and adding a preset separator or delimiter string. For a single text substring, the duplicate items within the text substring are removed; for two or more text substrings, the duplicate items between the text substrings are removed. Because a series of generalization operations is performed before the hash value is calculated, the degree of generalization of the algorithm is further broadened, deduplication efficiency is improved, and deduplication of one or more texts is achieved.
Embodiment three
Fig. 3 shows a deduplication method for short text in Embodiment three of the present invention. On the basis of Embodiment one and Embodiment two, this embodiment describes, as a preferred embodiment, the deduplication operation between two text strings. Specifically, the method for short-text deduplication in this embodiment comprises step S310, step S320, step S330, step S340, step S350, step S360, and step S370.
Step S310, obtains the information of the first text string and the second text string.
Step S320: perform word segmentation on the first text string to obtain the word segmentation information of the first text string, and perform word segmentation on the second text string to obtain the word segmentation information of the second text string.
Step S330: apply stop-word removal and normalization to the segmented words of the first text string to obtain the keyword information of the first text string; and apply stop-word removal and normalization to the segmented words of the second text string to obtain the keyword information of the second text string.
Specifically, in information retrieval, certain characters or words are filtered out automatically before or after natural language data (or text) is processed, in order to save storage space and improve search efficiency. These characters or words are called stop words. Stop words are entered manually rather than generated automatically, and together they form a stop-word list that is used to remove stop words. Preferably, the stop-word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary and function words, such as "eh" and "oh". Stop-word removal and normalization are applied to the word segmentation result of the first text string and to that of the second text string, to obtain the keyword information of the first text string and the second text string. Normalization includes, but is not limited to, unifying full-width and half-width characters to half-width, unifying upper-case and lower-case letters to lower case, unifying numerals to Arabic numerals, and unifying English word forms to their root. Optionally, "I like Shakespeare's 【Hamlet】" is normalized to "I like shakespeare's [hamlet]"; variant written forms of "Industrial and Commercial Bank" are normalized to a single form; "二〇〇八" (the year written in Chinese numerals) is normalized to "2008"; and "does", "do", "doing", and "did" are normalized to "do".
Optionally, the first text string S1 is "I will go to the Industrial and Commercial Bank" and the second text string S2 is "I go to the Industrial and Commercial Bank". Using the shortest-path method, the word segmentation results are S1: "I / will / go to / Industrial and Commercial Bank" and S2: "I / go to / Industrial and Commercial Bank". After stop-word removal and normalization are applied to the first text string S1 and the second text string S2, the results are S1: "I / will / go to / Industrial and Commercial Bank" and S2: "I / go to / Industrial and Commercial Bank".
Step S340: obtain the first weight of each keyword of the first text string according to the term frequency and inverse document frequency of each keyword of the first text string; and obtain the second weight of each keyword of the second text string according to the term frequency and inverse document frequency of each keyword of the second text string.
Specifically, the term frequency (TF) of a keyword is the frequency with which the word occurs in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the total number of documents used for the statistics and n is the number of documents containing the word. IDF measures how discriminative the word is: the greater the discrimination, the lower the degree of repetition between two text strings. IDF is obtained from statistics over a massive external corpus. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. Using TF*IDF, the first weight of each keyword of the first text string is obtained from the term frequency and inverse document frequency of each keyword of the first text string, and the second weight of each keyword of the second text string is obtained from the term frequency and inverse document frequency of each keyword of the second text string.
Step S350: remove the keywords of the first text string whose first weight is less than the preset weight threshold, and use the remaining keywords as the first text substring; or, according to the weight of each keyword, select the threshold number of keywords as the first text substring. Remove the keywords of the second text string whose second weight is less than the preset weight threshold, and use the remaining keywords as the second text substring; or, according to the weight of each keyword, select the threshold number of keywords as the second text substring.
Specifically, a weight threshold Q is set in advance. The keywords of the first text string whose first weight is less than Q are removed, and the remaining keywords are used as the first text substring; or a threshold number G is set in advance and G keywords are selected according to weight. Likewise, the keywords of the second text string whose second weight is less than Q are removed, and the remaining keywords are used as the second text substring; or G keywords are selected according to weight.
Step S360: obtain two or more words in the keyword information of the first text string, and use the phrase formed by joining them with a preset separator or delimiter string as the first text substring; obtain two or more words in the keyword information of the second text string, and use the phrase formed by joining them with a preset separator or delimiter string as the second text substring.
Specifically, the keyword information of the first text string is obtained; it includes two or more words, and the phrase formed by joining two or more of these words with a preset separator or delimiter string is used as the first text substring. The keyword information of the second text string is obtained; it includes two or more words, and the phrase formed by joining two or more of these words with a preset separator or delimiter string is used as the second text substring. The preset separator or delimiter string may include, but is not limited to, a space or an enumeration comma. Optionally, when the content of the first or second text string is the company name "Tian An Men Ye", word segmentation, stop-word removal, and normalization yield keywords that keep the boundary between "Tian An" and "Men Ye", and adding the preset separator preserves that boundary. If no separator or delimiter string is added, the system tends to recognize the more common word "Tiananmen" in the joined characters, which changes the meaning of the text string.
Step S370, carries out deduplication operation to the first text substring and the second text substring.
This embodiment of the present invention provides a preferred scheme for deduplication between two text strings. Before the hash value is calculated, a series of analysis and processing steps is applied to the text string; the text string is then represented as a hash value, and the deduplication operation is finally performed. This resolves the overly strict matching of traditional hash algorithms, generalizes the original string, and improves deduplication efficiency.
Embodiment four
Fig. 4 is a structural diagram of an apparatus for short-text deduplication in Embodiment four of the present invention. The apparatus is suitable for performing the short-text deduplication methods provided in Embodiments one to three of the present invention, and specifically comprises: an acquisition unit 410, an extraction unit 420, a processing unit 430, and an operation unit 440.
The acquisition unit 410 is configured to obtain the text string information of the short text;
The extraction unit 420 is connected to the acquisition unit 410 and is configured to perform word segmentation on the text string and obtain the keywords of the text string according to the word segmentation information of the text string;
The processing unit 430 is connected to the extraction unit 420 and is configured to obtain a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords;
The operation unit 440 is connected to the processing unit 430 and is configured to remove the duplicate items of the text substring.
Further, in the processing unit 430, the factors affecting the keyword weight include at least the term frequency and/or the inverse document frequency of each keyword, and the processing unit 430 is specifically configured to: remove from the text string the keywords whose weight is less than a preset weight threshold; or, according to the weights corresponding to the keywords, select the threshold number of keywords in the text string; and join two or more keywords in the text string into a phrase using a preset separator or delimiter string.
Further, the extraction unit 420 is specifically configured to: remove the stop words from the word segmentation information and perform normalization.
Further, the operation unit 440 is specifically configured to: if there is one text substring, remove the duplicate items within that text substring; if there are two or more text substrings, remove the duplicate items between the text substrings.
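The chain of four connected units can be pictured as a small pipeline object. This is a minimal sketch under assumptions, not the apparatus itself: the class name `DedupDevice`, the callable signatures, and the trivial lambdas in the usage example are all hypothetical placeholders for real segmentation, weighting, and hashing logic. Only the unit numbering and the order of the connections follow the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DedupDevice:
    """Four units wired in sequence: 410 -> 420 -> 430 -> 440."""
    acquire: Callable[[], List[str]]            # acquisition unit 410
    extract: Callable[[str], List[str]]         # extraction unit 420
    process: Callable[[List[str]], str]         # processing unit 430
    operate: Callable[[List[str]], List[str]]   # operation unit 440

    def run(self) -> List[str]:
        strings = self.acquire()
        substrings = [self.process(self.extract(s)) for s in strings]
        return self.operate(substrings)


# Placeholder units for illustration only.
device = DedupDevice(
    acquire=lambda: ["I go bank", "i go bank", "I like latte"],
    extract=lambda s: s.lower().split(),
    process=lambda keywords: " ".join(keywords),
    operate=lambda subs: list(dict.fromkeys(subs)),
)
print(device.run())
```

Passing each unit in as a callable keeps the connections explicit and lets any single stage be swapped, for example replacing the placeholder `operate` with a hash-based deduplicator.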
In this embodiment of the present invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of a short text, a text substring is obtained according to the weight information of the keywords, and duplicate items in the text substring are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, keeps the amount of computation small, and achieves deduplication within one text string or across multiple text strings.
Obviously, those skilled in the art will understand that the above product can perform the method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to the method performed.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A deduplication method for short text, characterized by comprising:
obtaining text string information of a short text;
performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information of the text string;
obtaining a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords; and
removing duplicate items of the text substring.
2. The deduplication method for short text according to claim 1, characterized in that the text substring comprising a threshold number of keywords comprises:
removing the keywords in the text string whose weight is below a preset weight threshold; or
selecting a threshold number of keywords in the text string according to the weights corresponding to the keywords.
3. The deduplication method for short text according to claim 1, characterized in that obtaining the keywords of the text string according to the word segmentation information of the text string comprises:
removing the stop words from the word segmentation information and performing normalization.
4. The deduplication method for short text according to claim 1, characterized in that obtaining a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords, further comprises:
the factors affecting the keyword weight include at least the term frequency and/or inverse document frequency of each keyword.
5. The deduplication method for short text according to claim 1, characterized in that obtaining a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords, further comprises:
joining two or more keywords in the text string into a phrase using a preset separator or delimiter string.
6. The deduplication method for short text according to claim 1, characterized in that removing the duplicate items of the text substring comprises:
if there is one text substring, removing the duplicate items within the text substring; if there are two or more text substrings, removing the duplicate items among the text substrings.
7. A deduplication device for short text, characterized by comprising:
an acquiring unit, configured to obtain text string information of a short text;
an extraction unit, connected to the acquiring unit, configured to perform word segmentation on the text string and obtain keywords of the text string according to the word segmentation information of the text string;
a processing unit, connected to the extraction unit, configured to obtain a text substring according to the weights corresponding to the keywords, the text substring comprising a threshold number of keywords; and
an operating unit, connected to the processing unit, configured to remove duplicate items of the text substring.
8. The short text deduplication device according to claim 7, characterized in that, in the processing unit, the factors affecting the keyword weight include at least the term frequency and/or inverse document frequency of each keyword, and the processing unit is specifically configured to:
remove the keywords in the text string whose weight is below a preset weight threshold; or
select a threshold number of keywords in the text string according to the weights corresponding to the keywords;
join two or more keywords in the text string into a phrase using a preset separator or delimiter string.
9. The short text deduplication device according to claim 7, characterized in that the extraction unit is specifically configured to:
remove the stop words from the word segmentation information and perform normalization.
10. The short text deduplication device according to claim 7, characterized in that the operating unit is specifically configured to:
if there is one text substring, remove the duplicate items within the text substring; if there are two or more text substrings, remove the duplicate items among the text substrings.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610915522.3A CN106569989A (en) | 2016-10-20 | 2016-10-20 | De-weighting method and apparatus for short text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106569989A (en) | 2017-04-19 |
Family
ID=58533112
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610915522.3A (Pending; published as CN106569989A (en)) | 2016-10-20 | 2016-10-20 | De-weighting method and apparatus for short text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106569989A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066623A (en) * | 2017-05-12 | 2017-08-18 | 湖南中周至尚信息技术有限公司 | A kind of article merging method and device |
CN107977347A (en) * | 2017-12-04 | 2018-05-01 | 海南云江科技有限公司 | A kind of topic De-weight method and computing device |
CN108536676A (en) * | 2018-03-28 | 2018-09-14 | 广州华多网络科技有限公司 | Data processing method, device, electronic equipment and storage medium |
CN110032730A (en) * | 2019-02-18 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A kind of processing method of text data, device and equipment |
CN110348539A (en) * | 2019-07-19 | 2019-10-18 | 知者信息技术服务成都有限公司 | Short text correlation method of discrimination |
WO2021109850A1 (en) * | 2019-12-03 | 2021-06-10 | 世强先进(深圳)科技股份有限公司 | Method and system for deduplicating and storing pdf files |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101694658A (en) * | 2009-10-20 | 2010-04-14 | 浙江大学 | Method for constructing webpage crawler based on repeated removal of news |
CN102289523A (en) * | 2011-09-20 | 2011-12-21 | 北京金和软件股份有限公司 | Method for intelligently extracting text labels |
CN102682085A (en) * | 2012-04-18 | 2012-09-19 | 北京十分科技有限公司 | Method for removing duplicated web page |
CN103646029A (en) * | 2013-11-04 | 2014-03-19 | 北京中搜网络技术股份有限公司 | Similarity calculation method for blog articles |
CN104156452A (en) * | 2014-08-18 | 2014-11-19 | 中国人民解放军国防科学技术大学 | Method and device for generating webpage text summarization |
CN105468713A (en) * | 2015-11-19 | 2016-04-06 | 西安交通大学 | Multi-model fused short text classification method |
CN105893551A (en) * | 2016-03-31 | 2016-08-24 | 上海智臻智能网络科技股份有限公司 | Method and device for processing data and knowledge graph |
CN105989033A (en) * | 2015-02-03 | 2016-10-05 | 北京中搜网络技术股份有限公司 | Information duplication eliminating method based on information fingerprints |
Title |
---|
龚静: "《中文文本聚类研究》", 31 March 2012, 北京:中国传媒大学出版社 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066623A (en) * | 2017-05-12 | 2017-08-18 | 湖南中周至尚信息技术有限公司 | A kind of article merging method and device |
CN107977347A (en) * | 2017-12-04 | 2018-05-01 | 海南云江科技有限公司 | A kind of topic De-weight method and computing device |
CN107977347B (en) * | 2017-12-04 | 2021-12-21 | 海南云江科技有限公司 | Topic duplication removing method and computing equipment |
CN108536676A (en) * | 2018-03-28 | 2018-09-14 | 广州华多网络科技有限公司 | Data processing method, device, electronic equipment and storage medium |
CN108536676B (en) * | 2018-03-28 | 2020-10-13 | 广州华多网络科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN110032730A (en) * | 2019-02-18 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A kind of processing method of text data, device and equipment |
CN110032730B (en) * | 2019-02-18 | 2023-09-05 | 创新先进技术有限公司 | Text data processing method, device and equipment |
CN110348539A (en) * | 2019-07-19 | 2019-10-18 | 知者信息技术服务成都有限公司 | Short text correlation method of discrimination |
CN110348539B (en) * | 2019-07-19 | 2021-05-07 | 知者信息技术服务成都有限公司 | Short text relevance judging method |
WO2021109850A1 (en) * | 2019-12-03 | 2021-06-10 | 世强先进(深圳)科技股份有限公司 | Method and system for deduplicating and storing pdf files |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hidayatullah et al. | Pre-processing tasks in Indonesian Twitter messages | |
CN106569989A (en) | De-weighting method and apparatus for short text | |
CN107168954B (en) | Text keyword generation method and device, electronic equipment and readable storage medium | |
WO2021227831A1 (en) | Method and apparatus for detecting subject of cyber threat intelligence, and computer storage medium | |
WO2015179643A1 (en) | Systems and methods for generating summaries of documents | |
CN109635297A (en) | A kind of entity disambiguation method, device, computer installation and computer storage medium | |
CN103150382A (en) | Automatic short text semantic concept expansion method and system based on open knowledge base | |
CN110874532A (en) | Method and device for extracting keywords of feedback information | |
Litvak et al. | Degext: a language-independent keyphrase extractor | |
Singh et al. | Sentiment analysis using lexicon based approach | |
Gupta et al. | SMPOST: parts of speech tagger for code-mixed indic social media text | |
CN106776678A (en) | Search engine optimization technology is realized in new keyword optimization | |
CN106528726A (en) | Keyword optimization-based search engine optimization realization technology | |
Sarkar | Part-of-speech tagging for code-mixed indian social media text at icon 2015 | |
Krishna et al. | Analysis of customer opinion using machine learning and NLP techniques | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
CN111985212A (en) | Text keyword recognition method and device, computer equipment and readable storage medium | |
Thushara et al. | An analysis on different document keyword extraction methods | |
Makinist et al. | Preparation of improved Turkish dataset for sentiment analysis in social media | |
Mehmood et al. | Contributions to the study of bi-lingual roman urdu sms spam filtering | |
Richardson et al. | Topic models: A tutorial with R | |
CN111159996B (en) | Short text set similarity comparison method and system based on text fingerprint algorithm | |
CN109947947B (en) | Text classification method and device and computer readable storage medium | |
TWI534640B (en) | Chinese network information monitoring and analysis system and its method | |
CN113792546A (en) | Corpus construction method, apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170419 |