Detailed Description of the Embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Referring to Fig. 1, Fig. 1 shows a flow diagram of the feature word deduplication method provided by an embodiment of the present application.
As shown in Fig. 1, the method comprises:
Step 110: obtain the phrase set associated with the current feature word in a feature word set.
In this embodiment, at least one feature word is extracted from a text or article, and the phrase set associated with that feature word can be obtained from it. The phrase set may be determined, for example, by means of an associated-word dictionary. The associated-word dictionary may be, for example, a synonym dictionary or another dictionary that establishes an association with feature words. Taking synonym-dictionary expansion as an example, a feature word is expanded through the synonym dictionary into a synonym set, which may include at least one phrase with the same semantics as the current feature word, each phrase containing at least two characters. Alternatively, the synonym set associated with the feature word may be obtained from other synonym-determination relationships. In this embodiment, "same semantics" covers both identical and similar meanings, for example synonyms and near-synonyms.
In this embodiment, a phrase may be, for example, a Chinese word or a multi-word English phrase. Each phrase contains at least two characters. Examples of Chinese phrases are 北京 (Beijing), 北京市 (Beijing City) and 首都 (capital). The phrase 北京, for instance, contains the two characters 北 and 京. These phrases are different expressions of the same meaning. Because monosyllabic (single-character) words do not help when extracting abstracts from Chinese texts or articles, this embodiment may additionally remove monosyllabic words from the text or article.
Step 120: based on the ASCII codes that correspond one-to-one with the phrases, calculate the sum of a specified portion of each phrase to obtain a first sum-value set.
In this embodiment, the ASCII codes that correspond one-to-one with the phrases are looked up and used to calculate the sum of the specified portion of each phrase. The specified portion to be summed may be determined, for example, by a specified parameter. The specified parameter may be, for example, a calculation-count identifier indicating the number of the current calculation round.
The ASCII code corresponding to a phrase may be looked up, for example, in an ASCII code table, or obtained by calling a pre-established interface that returns the ASCII codes of common phrases. In this embodiment, the lookup result for each phrase can be expressed as the ASCII code subset corresponding to that phrase. For example, for the phrase {北京}, the ASCII code table yields the subset {a1, a2}; for the phrase {北京市}, it yields {a1, a2, a3}; and for the phrase {首都}, it yields {b1, b2}.
In this embodiment, the sum of the specified portion of each phrase is calculated from the ASCII codes corresponding to the phrases, giving the first sum-value set. For example, while solving for the first sum-value set, the specified parameter is read to determine the calculation range it defines. The ASCII code values that correspond to the value of the specified parameter are then selected from each ASCII code subset and summed. In this embodiment, the number of elements in the first sum-value set is fixed, but each element depends on the specified parameter, for example on the calculation-count identifier.
The calculation of the first sum-value set is described below using the synonym phrase set {北京, 北京市, 首都} introduced above.
In the first calculation, the specified parameter m, for example the calculation-count identifier, is read. When m = 1, the initial value is obtained by assignment or by adding to zero: the ASCII code value of the first character of each phrase is taken as that phrase's summed result in the first sum-value set. The first characters of the phrases are {北}, {北} and {首}, with ASCII code values {a1}, {a1} and {b1} respectively, so the first sum-value set from the first calculation can be expressed as {a1, a1, b1}; that is, it holds the sum of the specified portion of the ASCII codes of each phrase. If the result of the first calculation satisfies the judgment condition, the deduplicated feature word is obtained; otherwise the step of calculating the first sum-value set must be repeated.
Alternatively, the sum of all ASCII code values may be calculated. For example, the sum for {北京} is a1 + a2, the sum for {北京市} is a1 + a2 + a3, and the sum for {首都} is b1 + b2; that is, the full sum of the ASCII code subset of each phrase is calculated separately.
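By way of illustration only, the sum calculation of step 120 might be sketched in Python as follows. The sketch assumes that Unicode code points (Python's built-in `ord`) stand in for the per-character values that the description calls ASCII codes, since ASCII itself does not cover Chinese characters. The function names `char_codes`, `partial_sum` and `full_sum` are hypothetical and do not appear in the application.

```python
def char_codes(phrase: str) -> list[int]:
    """Return one code value per character of the phrase (assumption: ord)."""
    return [ord(ch) for ch in phrase]


def partial_sum(phrase: str, m: int) -> int:
    """Sum the code values of the first m characters (all of them if shorter)."""
    return sum(char_codes(phrase)[:m])


def full_sum(phrase: str) -> int:
    """Sum the code values of every character in the phrase."""
    return sum(char_codes(phrase))


phrases = ["北京", "北京市", "首都"]
first_round = [partial_sum(p, 1) for p in phrases]  # corresponds to {a1, a1, b1}
whole_sums = [full_sum(p) for p in phrases]         # {a1+a2, a1+a2+a3, b1+b2}
```

Slicing with `[:m]` means a phrase shorter than m simply contributes all of its characters, which matches the behaviour described for phrases that have no character at the requested position.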
Step 130: determine the deduplicated feature word by judging the number of occurrences of the minimum value in the first sum-value set.
In this embodiment, the deduplicated feature word is determined by judging whether the minimum value in the first sum-value set occurs exactly once. If the minimum value is unique, the phrase corresponding to it is determined to be the deduplicated feature word. For example, in the finally calculated first sum-value set {a1 + a2, a1 + a2 + a3, b1 + b2}, the minimum value may be a1 + a2, whose corresponding phrase is 北京. {北京} can then serve as the feature word of the synonym set after deduplication. Once every other feature word has been processed through the above steps in turn, a new feature word set is determined for later use in fields such as word cloud construction or text mining.
If the minimum value is not unique, for example in the first sum-value set {a1, a1, b1} from the first calculation, where the minimum value may be a1 but occurs twice, it cannot yet be determined which phrase is the deduplicated feature word. In that case, this embodiment repeatedly calculates the sum of the specified portion of the ASCII codes of each phrase until the minimum value is judged to be unique, and only then determines the phrase corresponding to that minimum value to be the deduplicated feature word.
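The uniqueness judgment of step 130 might, as one possible sketch, be written as below. The helper name `unique_minimum_index` and the sample values are hypothetical and chosen only to show the two outcomes.

```python
def unique_minimum_index(sums):
    """Return the index of the minimum if it occurs exactly once, else None."""
    smallest = min(sums)
    if sums.count(smallest) == 1:
        return sums.index(smallest)
    return None


tied = unique_minimum_index([5, 5, 9])       # minimum 5 occurs twice
decided = unique_minimum_index([7, 12, 9])   # minimum 7 occurs once
```

Returning `None` for a tie leaves it to the caller to start another calculation round, which keeps the judgment separate from the parameter update.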
In this embodiment, determining the deduplicated feature word by judging the number of occurrences of the minimum value in the first sum-value set may include:
judging whether the minimum value in the first sum-value set is unique;
if the minimum value is judged not to be unique, updating the specified parameter, and then
returning to the step of calculating the sum of the specified portion of each phrase based on the ASCII codes corresponding to the phrases, until the minimum value is judged to be unique;
if the minimum value is judged to be unique, determining the phrase corresponding to the minimum value to be the deduplicated feature word.
In this embodiment, when the minimum value in the first sum-value set is judged not to be unique, a new round of solving for the first sum-value set is triggered automatically. In this round, the specified parameter is first updated, for example by increasing the value of the calculation-count identifier m by 1; the process then returns to the step of calculating the sum of the specified portion of the ASCII codes.
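The judge, update and return cycle described above could be sketched as the following loop. This is a sketch under stated assumptions rather than the application's implementation: `ord` stands in for the ASCII lookup, the input is assumed to be a non-empty list of non-empty phrases, and the fallback used when no round produces a unique minimum is an assumption, because the description does not say what happens in that case. The name `deduplicate` is hypothetical.

```python
def deduplicate(phrases):
    """Pick one representative phrase by repeating the prefix-sum comparison."""
    max_len = max(len(p) for p in phrases)   # N: longest phrase length
    m = 1                                    # calculation-count identifier
    while m <= max_len:
        sums = [sum(ord(ch) for ch in p[:m]) for p in phrases]
        smallest = min(sums)
        if sums.count(smallest) == 1:        # the minimum value is unique
            return phrases[sums.index(smallest)]
        m += 1                               # update the specified parameter
    # Assumption (not in the source): if every round ends in a tie, for
    # example when the set holds identical phrases, keep the first minimum.
    return phrases[sums.index(smallest)]


result = deduplicate(["北京", "北京市", "首都"])
```

For the example set, the first two rounds tie between 北京 and 北京市, and the third round separates them, so the loop ends with 北京.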
The step of calculating the sum of the specified portion of each phrase based on the ASCII codes corresponding to the phrases may further include:
reading the specified parameter;
selecting, from each ASCII code subset, the ASCII code values corresponding to the specified parameter and summing them, where each ASCII code subset contains the ASCII code values corresponding one-to-one with the characters of the phrase, the value of the specified parameter is a positive integer less than or equal to N, N is the maximum phrase length in the phrase set, and the phrase length is the total number of characters a phrase contains.
Taking the set {北京, 北京市, 首都} as an example, the phrase 北京市 has length 3, so N is 3. The value of the specified parameter is a positive integer less than or equal to N and may therefore be 1, 2 or 3.
For example, in the first calculation the specified parameter is 1, that is, m = 1. From the ASCII codes of the phrases {北京}, {北京市} and {首都}, the ASCII code values corresponding to the specified parameter are selected and summed; that is, the ASCII code value at the first position of each phrase, in character order, is taken, giving the first sum-value set {a1, a1, b1}.
In the second calculation the specified parameter is 2. The ASCII code values at the first and second positions of each phrase are selected and summed, giving the first sum-value set {a1 + a2, a1 + a2, b1 + b2}.
In the third calculation the specified parameter is 3. The ASCII code values at the first, second and third positions of each phrase are selected and summed. Because the phrases {北京} and {首都} have no third position, their elements in the third first sum-value set are the results of the second calculation, so the first sum-value set from the third calculation is {a1 + a2, a1 + a2 + a3, b1 + b2}.
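The three calculation rounds above can be traced with a short script. As before, this is only a sketch: `ord` is assumed as the source of the per-character code values, and the variable names are illustrative.

```python
phrases = ["北京", "北京市", "首都"]
N = max(len(p) for p in phrases)   # maximum phrase length in the set

# One first sum-value set per value of the specified parameter m = 1..N.
rounds = {}
for m in range(1, N + 1):
    rounds[m] = [sum(ord(ch) for ch in p[:m]) for p in phrases]

for m, sums in rounds.items():
    count = sums.count(min(sums))
    print(f"m={m}: sums={sums}, occurrences of minimum={count}")
```

The trace shows the minimum occurring twice for m = 1 and m = 2 and once for m = 3, and shows that the elements for the two-character phrases do not change between the second and third rounds.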
By calculating sums of the ASCII codes corresponding to the phrases and deduplicating step by step to obtain the deduplicated feature word, this embodiment reduces computational complexity relative to existing feature word deduplication techniques and simplifies the deduplication steps.
To ensure consistency among the feature words extracted from multiple articles, and to improve how well article keywords summarize an article, the present application also provides a further feature word deduplication method. Referring to Fig. 2, Fig. 2 shows a flow diagram of the feature word deduplication method provided by another embodiment of the present application.
As shown in Fig. 2, the method optionally includes:
Step 210: perform word segmentation on the raw data; and
Step 220: screen the result of the word segmentation using a stop-word dictionary to obtain a segmented-word set.
Step 230: obtain at least one feature word set, each feature word set including at least one feature word extracted from the segmented-word set according to the TF-IDF algorithm.
Step 240: expand the feature word according to a pre-built associated-word dictionary to obtain the phrase set associated with the feature word.
Step 250: obtain the phrase set associated with the current feature word in the feature word set.
Step 260: based on the ASCII codes that correspond one-to-one with the phrases, calculate the sum of a specified portion of each phrase to obtain a first sum-value set.
Step 270: determine the deduplicated feature word by judging the number of occurrences of the minimum value in the first sum-value set.
Steps 250 to 270 may be implemented in the same way as steps 110 to 130; see the description of steps 110 to 130.
This embodiment combines several big data techniques, including natural language processing, text analysis, TF-IDF, machine learning and statistical analysis. It is not limited to feature word extraction from a single text; in multi-text feature word extraction it better highlights the ability of the extracted feature words to summarize the texts.
In this embodiment, the raw data is obtained by web crawling; for example, it may be raw data obtained from a predetermined number of texts using web crawler technology. The predetermined number may be, for example, 20000, and the texts may be selected from, without limitation, accessible academic articles, forum posts, web articles and the like.
After the raw data is obtained, word segmentation may be performed on it using, for example, a Chinese word segmentation tool together with a custom segmentation dictionary. The Chinese word segmentation tool may be, for example, the Jieba segmentation tool. The custom segmentation dictionary may be, for example, one built by a dictionary-based, statistics-based or rule-based method, or one built from domain-specific rules or an existing annotated set.
After word segmentation, words without practical meaning, such as auxiliary words and adjectives, are removed using the stop-word dictionary. The segmentation result may also be screened further, for example by deleting monosyllabic words, to obtain the segmented-word set. Because monosyllabic words do not help abstract extraction from Chinese text, removing them further reduces the resources occupied by invalid data.
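The screening of step 220 might be sketched as follows. The token list and the stop-word set are hypothetical stand-ins for the output of a segmentation tool and for a real stop-word dictionary, and a monosyllabic word is assumed to be a token of exactly one character.

```python
def screen_tokens(tokens, stop_words):
    """Drop stop words and single-character tokens from a segmented text."""
    return [t for t in tokens if t not in stop_words and len(t) >= 2]


tokens = ["北京", "的", "天气", "很", "好", "北京市"]  # hypothetical segmenter output
stop_words = {"的", "很"}                             # hypothetical stop-word dictionary
segment_set = screen_tokens(tokens, stop_words)
```

Here 的 and 很 are removed as stop words and 好 is removed because it has only one character, leaving 北京, 天气 and 北京市.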
In this embodiment, the segmented words in the segmented-word set are weighted using the term frequency-inverse document frequency (TF-IDF) algorithm to obtain a weight for each word. The words are then sorted by weight, and the n words with the highest weights are extracted as the feature word set. The value of n can be adjusted as needed and may be any positive integer.
The idea of the TF-IDF algorithm is that if a word or phrase occurs with high frequency (TF) in one article but rarely in other articles, it is considered to have good class-distinguishing ability and to be suitable for classification. In other words, TF-IDF tends to filter out common words and retain important ones.
Term frequency (TF) is the number of times a given word occurs in a document. The term frequency is usually normalized to prevent it from being biased toward longer documents.
Inverse document frequency (IDF) measures the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient. The fewer documents contain a word, the larger its IDF, indicating that the word has good class-distinguishing ability. If m documents of a certain class contain the word and k documents of other classes contain it, the total number of documents containing the word is n = m + k; when m is large, n is also large, and the IDF value given by the IDF formula is small, which indicates that the word's class-distinguishing ability is weak.
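A minimal sketch of the weighting described above is given below, assuming that TF is normalized by document length and that IDF is the natural logarithm of the total number of documents divided by the number of documents containing the word, with no smoothing. The function names `tf_idf` and `top_n` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return, per document, a dict mapping each word to its TF-IDF weight."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))          # count each word once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def top_n(weight_map, n):
    """Return the n words with the highest weights."""
    return sorted(weight_map, key=weight_map.get, reverse=True)[:n]


docs = [["北京", "天气", "北京"], ["天气", "预报"], ["天气", "首都"]]
weights = tf_idf(docs)
features = top_n(weights[0], 1)
```

In this sample, 天气 appears in every document, so its IDF and weight are zero, while 北京 appears only in the first document and is selected as that document's top feature word.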
Using the TF-IDF algorithm, this embodiment extracts important feature words from different texts or articles while filtering out common words, and the extracted important feature words form a feature word set. At least one feature word set is thereby obtained.
A feature word in the at least one feature word set (that is, the current feature word) is expanded according to the pre-built associated-word dictionary to obtain the phrase set associated with that feature word. The pre-built associated-word dictionary is a collection of phrases or words that have a particular relationship with feature words. For example, it may be a synonym dictionary or another dictionary that has a relationship with feature words. The pre-built synonym dictionary includes at least one phrase correspondence, and the phrase correspondences are constructed according to a pre-established matrix relationship. The matrix relationship may be established, for example, by forming word pairs from two words with the same semantics and extending the word pairs vertically or horizontally. For instance, the two phrases of a word pair may be arranged in two columns, with further phrases sharing the semantics of each phrase added vertically; or the two phrases may be arranged in two rows, with further phrases sharing the semantics of each phrase added horizontally. The matrix relationship may thus be, for example, two columns by multiple rows or two rows by multiple columns.
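One way to picture the two-column, multi-row matrix relationship is as a list of word pairs that is indexed for lookup, as in the sketch below. The data layout, the function names `build_association_index` and `expand`, and the sample rows are assumptions made for illustration; the application does not specify a storage format.

```python
def build_association_index(rows):
    """Map every phrase to the set of phrases that share a row with it."""
    index = {}
    for row in rows:
        group = set(row)
        for phrase in row:
            index.setdefault(phrase, set()).update(group)
    return index


def expand(feature_word, index):
    """Return the phrase set associated with a feature word (itself included)."""
    return index.get(feature_word, {feature_word})


rows = [("北京", "北京市"), ("北京", "首都")]   # hypothetical synonym word pairs
index = build_association_index(rows)
phrase_set = expand("北京", index)
```

This simple index only links phrases that appear together in a row, so it is not transitive: 北京市 and 首都 are linked to 北京 but not directly to each other. A fuller implementation might merge rows that share a phrase.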
In this embodiment, expanding each feature word with its associated words improves both the consistency of the feature words extracted from multiple articles or texts and the ability of the feature words to summarize the text; this deduplication approach also occupies little storage space.
It should be noted that although the operations of the method are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the operations shown must be performed to achieve the desired result. On the contrary, the steps depicted in the flow diagrams may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step, and/or one step may be split into multiple steps.
With further reference to Fig. 3, Fig. 3 shows a schematic structural diagram of the feature word deduplication device 300 provided by an embodiment of the present application.
As shown in Fig. 3, the device 300 includes:
A first acquisition unit 310, configured to obtain the phrase set associated with the current feature word in a feature word set.
In this embodiment, at least one feature word is extracted from a text or article, and the phrase set associated with that feature word can be obtained from it. The phrase set may be determined, for example, by means of an associated-word dictionary. The associated-word dictionary may be, for example, a synonym dictionary or another dictionary that establishes an association with feature words. Taking synonym-dictionary expansion as an example, a feature word is expanded through the synonym dictionary into a synonym set, which may include at least one phrase with the same semantics as the current feature word, each phrase containing at least two characters. Alternatively, the synonym set associated with the feature word may be obtained from other synonym-determination relationships. In this embodiment, "same semantics" covers both identical and similar meanings, for example synonyms and near-synonyms.
In this embodiment, a phrase may be, for example, a Chinese word or a multi-word English phrase. Each phrase contains at least two characters. Examples of Chinese phrases are 北京 (Beijing), 北京市 (Beijing City) and 首都 (capital). The phrase 北京, for instance, contains the two characters 北 and 京. These phrases are different expressions of the same meaning. Because monosyllabic (single-character) words do not help when extracting abstracts from Chinese texts or articles, this embodiment may additionally remove monosyllabic words from the text or article.
A computing unit 320, configured to calculate, based on the ASCII codes that correspond one-to-one with the phrases, the sum of a specified portion of each phrase to obtain a first sum-value set.
In this embodiment, the ASCII codes that correspond one-to-one with the phrases are looked up and used to calculate the sum of the specified portion of each phrase. The specified portion to be summed may be determined, for example, by a specified parameter. The specified parameter may be, for example, a calculation-count identifier indicating the number of the current calculation round.
The ASCII code corresponding to a phrase may be looked up, for example, in an ASCII code table, or obtained by calling a pre-established interface that returns the ASCII codes of common phrases. In this embodiment, the lookup result for each phrase can be expressed as the ASCII code subset corresponding to that phrase. For example, for the phrase {北京}, the ASCII code table yields the subset {a1, a2}; for the phrase {北京市}, it yields {a1, a2, a3}; and for the phrase {首都}, it yields {b1, b2}.
In this embodiment, the sum of the specified portion of each phrase is calculated from the ASCII codes corresponding to the phrases, giving the first sum-value set. For example, while solving for the first sum-value set, the specified parameter is read to determine the calculation range it defines. The ASCII code values that correspond to the value of the specified parameter are then selected from each ASCII code subset and summed. In this embodiment, the number of elements in the first sum-value set is fixed, but each element depends on the specified parameter, for example on the calculation-count identifier.
The calculation of the first sum-value set is described below using the synonym phrase set {北京, 北京市, 首都} introduced above.
In the first calculation, the specified parameter m, for example the calculation-count identifier, is read. When m = 1, the initial value is obtained by assignment or by adding to zero: the ASCII code value of the first character of each phrase is taken as that phrase's summed result in the first sum-value set. The first characters of the phrases are {北}, {北} and {首}, with ASCII code values {a1}, {a1} and {b1} respectively, so the first sum-value set from the first calculation can be expressed as {a1, a1, b1}; that is, it holds the sum of the specified portion of the ASCII codes of each phrase. If the result of the first calculation satisfies the judgment condition, the deduplicated feature word is obtained; otherwise the step of calculating the first sum-value set must be repeated.
Alternatively, the sum of all ASCII code values may be calculated. For example, the sum for {北京} is a1 + a2, the sum for {北京市} is a1 + a2 + a3, and the sum for {首都} is b1 + b2; that is, the full sum of the ASCII code subset of each phrase is calculated separately.
A determination unit 330, configured to determine the deduplicated feature word by judging the number of occurrences of the minimum value in the first sum-value set.
In this embodiment, the deduplicated feature word is determined by judging whether the minimum value in the first sum-value set occurs exactly once. If the minimum value is unique, the phrase corresponding to it is determined to be the deduplicated feature word. For example, in the finally calculated first sum-value set {a1 + a2, a1 + a2 + a3, b1 + b2}, the minimum value may be a1 + a2, whose corresponding phrase is 北京. {北京} can then serve as the feature word of the synonym set after deduplication. Once every other feature word has been processed through the above steps in turn, a new feature word set is determined for later use in fields such as word cloud construction or text mining.
If the minimum value is not unique, for example in the first sum-value set {a1, a1, b1} from the first calculation, where the minimum value may be a1 but occurs twice, it cannot yet be determined which phrase is the deduplicated feature word. In that case, this embodiment repeatedly calculates the sum of the specified portion of the ASCII codes of each phrase until the minimum value is judged to be unique, and only then determines the phrase corresponding to that minimum value to be the deduplicated feature word.
In this embodiment, the determination unit may include:
a judging submodule, configured to judge whether the minimum value in the first sum-value set is unique;
an update submodule, configured to update the specified parameter if the minimum value is judged not to be unique;
a return submodule, configured to then return to calculating the sum of the specified portion of each phrase based on the ASCII codes corresponding to the phrases, until the minimum value is judged to be unique;
a determining submodule, configured to determine, if the minimum value is judged to be unique, the phrase corresponding to the minimum value to be the deduplicated feature word.
In this embodiment, when the minimum value in the first sum-value set is judged not to be unique, a new round of solving for the first sum-value set is triggered automatically. In this round, the specified parameter is first updated, for example by increasing the value of the calculation-count identifier m by 1; the process then returns to the step of calculating the sum of the specified portion of the ASCII codes.
The computing unit may further include:
a reading submodule, configured to read the specified parameter;
a summing submodule, configured to select, from each ASCII code subset, the ASCII code values corresponding to the specified parameter and sum them, where each ASCII code subset contains the ASCII code values corresponding one-to-one with the characters of the phrase, the value of the specified parameter is a positive integer less than or equal to N, N is the maximum phrase length in the phrase set, and the phrase length is the total number of characters a phrase contains.
Taking the set {北京, 北京市, 首都} as an example, the phrase 北京市 has length 3, so N is 3. The value of the specified parameter is a positive integer less than or equal to N and may therefore be 1, 2 or 3.
For example, in the first calculation the specified parameter is 1, that is, m = 1. From the ASCII codes of the phrases {北京}, {北京市} and {首都}, the ASCII code values corresponding to the specified parameter are selected and summed; that is, the ASCII code value at the first position of each phrase, in character order, is taken, giving the first sum-value set {a1, a1, b1}.
In the second calculation the specified parameter is 2. The ASCII code values at the first and second positions of each phrase are selected and summed, giving the first sum-value set {a1 + a2, a1 + a2, b1 + b2}.
In the third calculation the specified parameter is 3. The ASCII code values at the first, second and third positions of each phrase are selected and summed. Because the phrases {北京} and {首都} have no third position, their elements in the third first sum-value set are the results of the second calculation, so the first sum-value set from the third calculation is {a1 + a2, a1 + a2 + a3, b1 + b2}.
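Purely as an illustration of how the units of device 300 relate to one another, they might be composed in software as follows. The class name, method names and the in-memory association index are hypothetical; `ord` is again assumed in place of the ASCII lookup, phrases are assumed to be non-empty, and the fallback for a tie that is never resolved is an assumption not found in the description.

```python
class FeatureWordDedupDevice:
    """Illustrative composition of units 310, 320 and 330."""

    def __init__(self, association_index):
        # Hypothetical lookup: feature word -> set of associated phrases.
        self.association_index = association_index

    def acquire(self, feature_word):
        """First acquisition unit 310: obtain the associated phrase set."""
        group = self.association_index.get(feature_word, {feature_word})
        return sorted(group)   # fixed order so results are reproducible

    def compute(self, phrases, m):
        """Computing unit 320: sum the first m character codes per phrase."""
        return [sum(ord(ch) for ch in p[:m]) for p in phrases]

    def determine(self, phrases):
        """Determination unit 330: repeat until the minimum is unique."""
        longest = max(len(p) for p in phrases)
        for m in range(1, longest + 1):
            sums = self.compute(phrases, m)
            smallest = min(sums)
            if sums.count(smallest) == 1:
                return phrases[sums.index(smallest)]
        return phrases[sums.index(smallest)]   # assumed fallback for ties

    def run(self, feature_word):
        return self.determine(self.acquire(feature_word))


device = FeatureWordDedupDevice({"首都": {"北京", "北京市", "首都"}})
result = device.run("首都")
```

Starting from the feature word 首都, the device expands it to the three-phrase set and settles on 北京, the same outcome as in the worked example.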
By calculating sums of the ASCII codes corresponding to the phrases and deduplicating step by step to obtain the deduplicated feature word, this embodiment reduces computational complexity relative to existing feature word deduplication techniques and simplifies the deduplication steps.
To ensure consistency among feature words with the same meaning extracted from multiple articles, and to improve how well article keywords summarize an article, the present application also provides a further feature word deduplication device. Referring to Fig. 4, Fig. 4 shows a schematic structural block diagram of the feature word deduplication device provided by an embodiment of the present application.
As shown in Fig. 4, the device optionally includes:
A word segmentation unit 410, configured to perform word segmentation on the raw data; and
A screening unit 420, configured to screen the result of the word segmentation using a stop-word dictionary to obtain a segmented-word set.
A second acquisition unit 430, configured to obtain at least one feature word set, each feature word set including at least one feature word extracted from the segmented-word set according to the TF-IDF algorithm.
An expansion unit 440, configured to expand the feature word according to a pre-built associated-word dictionary to obtain the phrase set associated with the feature word.
A first acquisition unit 450, configured to obtain the phrase set associated with the current feature word in the feature word set.
A computing unit 460, configured to calculate, based on the ASCII codes that correspond one-to-one with the phrases, the sum of a specified portion of each phrase to obtain a first sum-value set.
A determination unit 470, configured to determine the deduplicated feature word by judging the number of occurrences of the minimum value in the first sum-value set.
The functions implemented by units 450 to 470 may use the same implementations as steps 110 to 130; see the description of steps 110 to 130.
This embodiment combines several big data techniques, including natural language processing, text analysis, TF-IDF, machine learning and statistical analysis. It is not limited to feature word extraction from a single text; in multi-text feature word extraction it better highlights the ability of the extracted feature words to summarize the texts.
In the embodiments of the present application, the raw data is obtained by web crawling; for example, it may be raw data obtained from a predetermined number of texts using web crawler technology. The predetermined number may be, for example, 20,000. The texts may be selected from, without limitation, retrievable academic articles, forum articles, web page articles, and the like.
After the raw data is obtained, word segmentation may be performed on it, for example using a Chinese word segmentation tool and a custom segmentation dictionary. The Chinese word segmentation tool may be, for example, the Jieba segmentation tool. The custom segmentation dictionary may be, for example, a segmentation dictionary built by a dictionary-based method, a statistics-based method, or a rule-based method, or a segmentation dictionary built according to the rules of a specific field or an existing annotated set.
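To make the dictionary-based approach concrete, the following Python sketch uses forward maximum matching, one common dictionary-based segmentation technique. It is only an assumed illustration: it is not how the Jieba tool works internally, and the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary entries are invented for this example.

```python
def forward_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Split `text` greedily, always taking the longest dictionary word at the cursor."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)  # a single character is the fallback
                i += size
                break
    return tokens


# Toy custom dictionary, assumed for illustration only.
custom_dictionary = {"特征词", "去重", "方法"}
print(forward_max_match("特征词去重方法", custom_dictionary))
```

Characters that match no dictionary entry come out as single-character tokens, which a later screening step can remove.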
After word segmentation, words without practical meaning, such as auxiliary words and adjectives, are removed using the stop-word dictionary. The result of the word segmentation may, for example, be screened further by deleting monosyllabic (single-character) words, to obtain the segmented word set. In Chinese abstract extraction, monosyllabic words do not substantially help to extract the abstract, so they can also be removed to reduce the resources occupied by invalid data.
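The screening step can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `screen_tokens` and the sample stop words are invented, and a "monosyllabic word" is approximated as a token of length one.

```python
def screen_tokens(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Drop stop words and single-character tokens, keeping the original order."""
    return [t for t in tokens if t not in stop_words and len(t) > 1]


# Sample data, assumed for illustration only.
stop_words = {"的", "是", "非常"}
tokens = ["我们", "的", "特征词", "是", "非常", "重要"]
print(screen_tokens(tokens, stop_words))
```

Keeping the stop-word dictionary in a set makes each membership check constant time on average, which matters when the raw data spans many texts.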
In the embodiments of the present application, the words in the segmented word set are weighted by the term frequency-inverse document frequency (TF-IDF) algorithm to obtain a weight for each word. The words are then sorted by weight, and the n words with the highest weights are extracted as the feature word set. The value of n can be adjusted according to actual needs and may be, for example, any positive integer.
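Selecting the n highest-weighted words can be sketched in Python as follows. The sketch assumes the weights have already been computed and are held in a dictionary; the function name `top_n_feature_words` and the sample weights are invented for this example.

```python
import heapq


def top_n_feature_words(weights: dict[str, float], n: int) -> list[str]:
    """Return the n words with the highest weights, highest first."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # heapq.nlargest avoids sorting the full vocabulary when n is small.
    return heapq.nlargest(n, weights, key=weights.get)


# Sample weights, assumed for illustration only.
weights = {"feature": 0.42, "word": 0.17, "deduplication": 0.35, "the": 0.01}
print(top_n_feature_words(weights, 2))
```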
The TF-IDF algorithm is based on the following idea: if a word or phrase occurs with a high term frequency (TF) in one article and rarely occurs in other articles, the word or phrase is considered to have good class discrimination ability and is suitable for classification. That is, TF-IDF tends to filter out common words and retain important words.
Term frequency (TF) refers to the number of times a given word occurs in a document. The value is usually normalized to prevent a bias toward longer documents.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word can be obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient. The fewer documents contain the word, the larger its IDF, which indicates that the word has good class discrimination ability. If the number of documents containing the word in a certain class of documents is m, and the total number of documents containing the word in the other classes is k, then the number of all documents containing the word is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which indicates that the class discrimination ability of the word is weak.
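The two quantities above can be sketched in Python as follows. This is a minimal illustration, assuming length-normalized term frequency and a natural logarithm for IDF (the text does not fix the logarithm base or the normalization); the function names and the toy corpus are invented for this example.

```python
import math
from collections import Counter


def term_frequency(word: str, document: list[str]) -> float:
    """Occurrences of `word` divided by document length (assumed normalization)."""
    if not document:
        return 0.0
    return Counter(document)[word] / len(document)


def inverse_document_frequency(word: str, corpus: list[list[str]]) -> float:
    """log(total documents / documents containing `word`), natural log assumed."""
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)


def tf_idf(word: str, document: list[str], corpus: list[list[str]]) -> float:
    """Weight of `word` in `document`: TF multiplied by IDF."""
    return term_frequency(word, document) * inverse_document_frequency(word, corpus)


# Toy corpus of already segmented documents, assumed for illustration only.
corpus = [["a", "b"], ["a", "c"], ["a", "d"], ["a", "e"]]
print(tf_idf("a", corpus[0], corpus), tf_idf("b", corpus[0], corpus))
```

In the toy corpus, the word "a" occurs in every document, so its IDF is log(4/4) = 0 and its weight vanishes, whereas "b" occurs in one document out of four and keeps a positive weight. This matches the observation that a large count of documents containing the word yields a small IDF.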
The embodiments of the present application use the TF-IDF algorithm to extract important feature words from different texts or articles while filtering out common words, and form feature word sets from the extracted important feature words, thereby obtaining at least one feature word set.
A feature word (i.e., the current feature word) in the at least one feature word set is expanded according to the pre-constructed associated-word dictionary to obtain the phrase set associated with that feature word. The pre-constructed associated-word dictionary is a collection of phrases or words that have a certain specific relationship with the feature words. For example, the associated-word dictionary may be a synonym dictionary or another dictionary of words related to the feature words. The pre-constructed synonym dictionary includes at least one phrase correspondence, and the phrase correspondence is built according to a pre-established matrix relationship. The matrix relationship may be established, for example, by forming word pairs from two words with the same meaning and then extending the word pairs vertically or horizontally. For example, two phrases form a word pair arranged in two columns, and phrases with the same meaning as each of the two phrases are added in the vertical direction. Alternatively, two phrases form a word pair arranged in two rows, and phrases with the same meaning as each of the two phrases are added in the horizontal direction. The matrix relationship may be, for example, two columns by multiple rows, or two rows by multiple columns.
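A two-column, multi-row matrix relationship and the expansion of a feature word against it can be sketched in Python as follows. The layout (one word pair per row), the choice to include the feature word itself in the resulting phrase set, and the names `synonym_matrix` and `expand_feature_word` are assumptions made for this illustration.

```python
# Each row is one word pair with the same meaning (two columns, many rows).
# Sample entries, assumed for illustration only.
synonym_matrix = [
    ("computer", "PC"),
    ("computer", "calculating machine"),
    ("automobile", "car"),
]


def expand_feature_word(word: str, matrix: list[tuple[str, str]]) -> set[str]:
    """Collect the phrases paired with `word` in either column of the matrix."""
    phrases = {word}  # assumption: the phrase set includes the feature word itself
    for left, right in matrix:
        if word == left:
            phrases.add(right)
        elif word == right:
            phrases.add(left)
    return phrases


print(expand_feature_word("computer", synonym_matrix))
```

A word that has no row in the matrix expands to a phrase set containing only itself, so later steps can treat every feature word uniformly.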
In the embodiments of the present application, performing associated-word expansion on each feature word improves the consistency of the feature words extracted from multiple articles or texts, as well as the ability of the feature words to summarize the text. This effective deduplication approach also occupies little storage space.
It should be understood that the units or modules described in devices 300-400 correspond to the steps of the methods described with reference to FIGS. 1-2. Thus, the operations and features described above for the methods also apply to devices 300-400 and the units included in them, and are not repeated here. Devices 300-400 may be implemented in advance in a browser or other security application of an electronic device, or may be loaded into the browser or security application of the electronic device by downloading or similar means. The corresponding units in devices 300-400 may cooperate with units in the electronic device to implement the solutions of the embodiments of the present application.
Referring now to FIG. 5, it shows a schematic structural diagram of a computer system 500 suitable for implementing a terminal device or server of the embodiments of the present application.
As shown in FIG. 5, the computer system 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores the various programs and data required for the operation of the system 500. The CPU 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to FIG. 1 or FIG. 2 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the method of FIG. 1 or FIG. 2. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the boxes may occur in an order different from that noted in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and any combination of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be provided in a processor; for example, this may be described as: a processor including a first acquisition unit, a calculation unit, and a determination unit. In some cases, the names of these units or modules do not limit the units or modules themselves. For example, the first acquisition unit may also be described as "a unit for obtaining the synonym set associated with the current feature word in the feature word set".
In another aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the device described in the above embodiments, or it may exist separately without being assembled into a device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the feature word deduplication method described in the present application.
The above description covers only the preferred embodiments of the present application and explains the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features. It should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the present application.