CN109062898A - Characteristic word duplication eliminating method, device and equipment and storage medium thereof - Google Patents

Characteristic word duplication eliminating method, device and equipment and storage medium thereof

Info

Publication number
CN109062898A
CN109062898A (application CN201810852217.3A)
Authority
CN
China
Prior art keywords
phrase
word
value
feature
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810852217.3A
Other languages
Chinese (zh)
Inventor
李利明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongjun New Energy Co ltd
Original Assignee
Hanergy Mobile Energy Holdings Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanergy Mobile Energy Holdings Group Co Ltd filed Critical Hanergy Mobile Energy Holdings Group Co Ltd
Priority to CN201810852217.3A priority Critical patent/CN109062898A/en
Publication of CN109062898A publication Critical patent/CN109062898A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/12: Use of codes for handling textual entities
    • G06F 40/126: Character encoding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing

Abstract

The application discloses a feature word deduplication method, device, equipment, and storage medium. The method comprises the following steps: acquiring a phrase set associated with a current feature word in a feature word set; calculating the sums of specified portions of the phrases based on ASCII codes in one-to-one correspondence with the phrases to obtain a first sum-value set; and determining the deduplicated feature word by judging the count of the minimum value in the first sum-value set. According to the technical solution of the embodiments of the application, feature words with the same meaning are deduplicated by calculating sums of ASCII codes, which reduces the computational complexity of current feature word deduplication methods, saves computation space, and significantly improves the ability of the current feature words to summarize the text.

Description

Feature word deduplication method, device, equipment, and storage medium
Technical field
The present application relates generally, but is not limited, to the field of semantic analysis technology, and in particular to a feature word deduplication method, device, equipment, and storage medium.
Background technique
In natural language processing, the smallest unit of meaning in natural language is the phrase, or word. Generally speaking, extracting individual phrases as feature words can summarize the main content of a text well while reducing the complexity of text processing. The prior art offers many algorithms for extracting feature words from text, such as the term frequency-inverse document frequency (TF-IDF) method and information gain algorithms.
As technology develops, the feature words extracted from multiple texts may contain several expressions of the same meaning, which leads to feature word redundancy. Current feature word deduplication techniques include, for example, refining feature words by computing information entropy, or using principal component analysis to map the vector space formed by word vectors into a high-dimensional orthogonal space and then selecting the feature dimensions with large variance contributions. However, these deduplication techniques carry a degree of subjectivity and cannot keep the feature words extracted from multiple texts satisfactorily consistent.
In addition, the computational complexity of existing feature mapping methods is too high.
Summary of the invention
In view of the above defects or deficiencies in the prior art, it is desirable to provide a technical solution for feature word deduplication that at least reduces computational complexity.
In a first aspect, an embodiment of the present application provides a feature word deduplication method, which comprises:
obtaining a phrase set associated with a current feature word in a feature word set;
calculating the sum of a specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set; and
determining the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
In a second aspect, an embodiment of the present application provides a feature word deduplication device, which comprises:
a first acquisition unit, configured to obtain a phrase set associated with a current feature word in a feature word set;
a computing unit, configured to calculate the sum of a specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set; and
a determination unit, configured to determine the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
In a third aspect, an embodiment of the present application provides a computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method described in the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method described in the embodiments of the present application.
In the feature word deduplication solution provided by the embodiments of the present application, the sums of specified portions of the ASCII codes of the phrases associated with a feature word are calculated, and the phrases associated with the feature word are deduplicated by judging the count of the minimum value among the results. This reduces the computational complexity of current feature word deduplication methods, saves computation space, and significantly improves the ability of the current feature words to summarize the text.
Further, the embodiments of the present application also use a pre-built association-word dictionary to keep the feature words extracted from multiple texts highly consistent, which provides an accuracy guarantee for later text-mining applications such as word cloud construction and text gist extraction.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent by reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 shows a schematic flow chart of a feature word deduplication method provided by an embodiment of the present application;
Fig. 2 shows a schematic flow chart of a feature word deduplication method provided by another embodiment of the present application;
Fig. 3 shows a schematic structural diagram of a feature word deduplication device provided by an embodiment of the present application;
Fig. 4 shows a schematic structural diagram of a feature word deduplication device provided by another embodiment of the present application;
Fig. 5 shows a schematic structural diagram of a computer system suitable for implementing the embodiments of the present application.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention and are not limitations of the invention. It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that the embodiments of the present application and the features in the embodiments may be combined with each other provided there is no conflict. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Please refer to Fig. 1, which shows a schematic flow chart of a feature word deduplication method provided by an embodiment of the present application.
As shown in Fig. 1, the method comprises:
Step 110: obtain the phrase set associated with the current feature word in the feature word set.
In an embodiment of the present application, at least one feature word is extracted from a text or article, and the phrase set associated with that feature word can be obtained. The phrase set can be determined, for example, by means of an association-word dictionary. The association-word dictionary can be, for example, a synonym dictionary or another dictionary that establishes an association with the feature word. Taking expansion via a synonym dictionary as an example, the feature word is expanded through the synonym dictionary into a synonym set, which may include at least one phrase with the same meaning as the current feature word, each phrase containing at least two characters. Alternatively, a synonym set associated with the feature word can also be obtained from other synonymy relations. In the embodiments of the present application, "same meaning" can be understood as covering both identical and similar meanings, such as synonyms and near-synonyms.
In the embodiments of the present application, a phrase can be, for example, a Chinese word or a multi-word phrase expressed in English, and each phrase contains at least two characters. Take the Chinese phrases 北京 (Beijing), 北京市 (Beijing City), and 首都 (capital) as an example: the phrase 北京 contains the two characters 北 and 京. These phrases represent different expressions of the same meaning. Since single-character words are of no practical help when extracting a summary from a text or article, the embodiments of the present application can further remove the single-character words in the text or article.
Step 120: calculate the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set.
In an embodiment of the present application, the sum of the specified portion of each phrase is calculated by looking up the ASCII codes in one-to-one correspondence with the phrase. The specified portion can be determined, for example, by a specified parameter. The specified parameter can be, for example, a calculation-count identifier used to indicate the number of the current calculation.
The ASCII codes in one-to-one correspondence with a phrase can be looked up, for example, in an ASCII code table, or obtained by calling a pre-established ASCII code acquisition interface for common phrases. In the embodiments of the present application, the lookup result for each phrase can be expressed as the ASCII codes corresponding to that phrase. For example, for the phrase {北京}, the corresponding ASCII code subset found in the ASCII code table is {a1, a2}; for the phrase {北京市}, the corresponding subset is {a1, a2, a3}; and for the phrase {首都}, the corresponding subset is {b1, b2}.
In the embodiments of the present application, the first sum-value set is obtained by calculating, for each phrase, the sum of the specified portion of its one-to-one ASCII codes. For example, in the process of solving the first sum-value set, the specified parameter is read to determine the computation range that it defines: the ASCII code values corresponding to the value of the specified parameter are selected from the ASCII codes for the summation. In the embodiments of the present application, the number of elements in the first sum-value set is fixed, but each element in the first sum-value set depends on the specified parameter, for example on the calculation-count identifier.
The process of calculating the first sum-value set is described by taking the above synonym phrase set {北京, 北京市, 首都} as an example.
In the first calculation, the specified parameter m, for example the calculation-count identifier, is read. When m = 1, which may be realized by assignment or by initializing to zero and incrementing, the ASCII code corresponding to the first character of each phrase is taken as the summation result in the first sum-value set. The first characters of these phrases are 北, 北, and 首, and the corresponding ASCII code values are {a1}, {a1}, and {b1}, so the first sum-value set obtained in the first calculation can be expressed as {a1, a1, b1}; that is, the sum of the specified portion of the ASCII codes of each phrase is calculated. If the result of the first calculation satisfies the judgment condition, the deduplicated feature word can be obtained; otherwise, the step of calculating the first sum-value set needs to be repeated.
It is also possible to calculate the sums over all the ASCII codes, for example the sum a1 + a2 corresponding to {北京}, the sum a1 + a2 + a3 corresponding to {北京市}, and the sum b1 + b2 corresponding to {首都}, i.e., to calculate the full sum of the ASCII code subset corresponding to each phrase.
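As an illustrative sketch only (not part of the original filing), the computation of the first sum-value set for a given value of the specified parameter m can be expressed as follows; Python's ord() is used here as a stand-in for the ASCII-code lookup described above, and the function and variable names are assumptions.

```python
def first_sum_value_set(phrases, m):
    """Sketch of step 120: sum the character codes of the first m characters
    of each phrase. ord() stands in for the ASCII-code table lookup; for
    Chinese characters it actually returns the Unicode code point."""
    sums = []
    for phrase in phrases:
        prefix = phrase[:m]  # portion of the phrase covered by the specified parameter m
        sums.append(sum(ord(ch) for ch in prefix))
    return sums

# Synonym group from the example: 北京 (Beijing), 北京市 (Beijing City), 首都 (capital)
print(first_sum_value_set(["北京", "北京市", "首都"], 1))  # first-character sums {a1, a1, b1}
```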
Step 130: determine the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
In the embodiments of the present application, whether the count of the minimum value in the first sum-value set is unique determines the deduplicated feature word. If the minimum value in the first sum-value set occurs only once, the phrase associated with that minimum value is determined to be the deduplicated feature word. For example, in the finally calculated first sum-value set {a1+a2, a1+a2+a3, b1+b2}, the minimum value may be a1+a2, and the phrase corresponding to this minimum value is 北京; {北京} can then serve as the deduplicated feature word of the synonym set. After all other feature words have been deduplicated one by one through the above steps, a new feature word set can be determined for later use in fields such as word cloud construction or text mining.
If the count of the minimum value is not unique, for example in the first sum-value set {a1, a1, b1} obtained in the first calculation, the minimum value may be a1, but it occurs twice; in that case, it cannot be determined which phrase is the deduplicated feature word. The embodiments of the present application therefore also calculate the sum of the specified portion of the ASCII codes of each phrase in a loop, and only when the count of the minimum value is judged to be unique is the phrase associated with that minimum value determined to be the deduplicated feature word.
In the embodiments of the present application, determining the deduplicated feature word by judging the count of the minimum value in the first sum-value set may comprise:
judging whether the count of the minimum value in the first sum-value set is unique;
if it is determined that the count of the minimum value is not unique, updating the specified parameter; and then
returning to calculating the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, until the count of the minimum value is judged to be unique;
if it is determined that the count of the minimum value is unique, determining that the phrase corresponding to the minimum value is the deduplicated feature word.
In the embodiments of the present application, when the count of the minimum value in the first sum-value set is determined to be not unique, a new round of solving the first sum-value set is triggered automatically. In this process, the specified parameter is first updated, for example by increasing the value of the calculation-count identifier m by 1; then the process returns to the step of calculating the sum of the specified portion of the ASCII codes.
The step of calculating the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase may also comprise:
reading the specified parameter;
selecting, from each set of ASCII codes, the ASCII code values corresponding to the specified parameter and summing them, where the ASCII codes comprise the ASCII code values in one-to-one correspondence with the characters of the phrase, the value of the specified parameter is a positive integer less than or equal to N, N is the maximum phrase length in the phrase set, and the phrase length is the total number of characters that a phrase contains.
Taking the above set {北京, 北京市, 首都} as an example, the phrase length of 北京市 is 3, so the value of N is 3, and the value of the specified parameter is a positive integer less than or equal to N, for example 1, 2, or 3. N denotes the maximum phrase length in the phrase set.
For example, in the first calculation the value of the specified parameter is 1, i.e., m = 1; the ASCII code value corresponding to the specified parameter is selected from the ASCII codes of the phrases {北京}, {北京市}, and {首都} and summed, i.e., the ASCII code value at the first position of each phrase in character order is selected and summed, and the resulting first sum-value set is {a1, a1, b1}.
In the second calculation, the value of the specified parameter is 2; the ASCII code values corresponding to the specified parameter are selected and summed, i.e., the ASCII code values at the first and second positions in character order are selected and summed, and the resulting first sum-value set is {a1+a2, a1+a2, b1+b2}.
In the third calculation, the value of the specified parameter is 3; the ASCII code values at the first, second, and third positions in character order are selected and summed. Since the phrases {北京} and {首都} have no third position, their elements in the first sum-value set obtained in the third calculation are the results of the second calculation, so the first sum-value set obtained in the third calculation is {a1+a2, a1+a2+a3, b1+b2}.
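A minimal sketch of the full loop of steps 120-130, under the same assumptions as above: the specified parameter m starts at 1 and grows until the minimum of the sum-value set is unique, and a phrase shorter than m simply contributes its full-length sum. The tie-breaking behaviour when the minimum is still not unique at m = N is not fixed by the filing and is an assumption here.

```python
def deduplicate_synonym_group(phrases):
    """Sketch of steps 120-130: grow the specified parameter m until the
    minimum prefix-code sum is unique, then keep the matching phrase."""
    n_max = max(len(p) for p in phrases)  # N, the maximum phrase length in the set
    sums = []
    for m in range(1, n_max + 1):
        sums = [sum(ord(ch) for ch in p[:m]) for p in phrases]
        if sums.count(min(sums)) == 1:  # unique minimum: deduplicated feature word found
            return phrases[sums.index(min(sums))]
    # Assumption: if the minimum is still tied at m = N, keep the first tied phrase.
    return phrases[sums.index(min(sums))]

print(deduplicate_synonym_group(["北京", "北京市", "首都"]))
```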
In the embodiments of the present application, deduplication proceeds step by step by calculating the sums of the ASCII codes corresponding to the phrases, and the deduplicated feature words are obtained. Compared with existing feature word deduplication techniques, this greatly reduces computational complexity and simplifies the deduplication steps.
In order to guarantee the consistency of the feature words extracted from multiple articles and to improve how well the article keywords summarize the articles, the present application also provides a feature word deduplication method. Please refer to Fig. 2, which shows a schematic flow chart of a feature word deduplication method provided by another embodiment of the present application.
As shown in Fig. 2, the method optionally comprises:
Step 210: perform word segmentation on the raw data; and
Step 220: filter the result of the above word segmentation using a stop-word dictionary to obtain a segmented word set.
Step 230: obtain at least one feature word set, each feature word set comprising at least one feature word extracted from the segmented word set according to the TF-IDF algorithm.
Step 240: expand the feature words according to a pre-built association-word dictionary to obtain the phrase sets associated with the feature words.
Step 250: obtain the phrase set associated with the current feature word in the feature word set.
Step 260: calculate the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set.
Step 270: determine the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
Steps 250-270 can use the same embodiments as steps 110-130; refer to the description of steps 110-130.
The embodiments of the present application combine multiple big-data technologies such as natural language processing, text analysis, TF-IDF, machine learning, and statistical analysis. They are not limited to feature word extraction from a single text, and in applications that extract feature words from multiple texts they highlight even more the ability of the extracted feature words to summarize the texts.
In the embodiments of the present application, the raw data is obtained by web crawling, for example raw data obtained from a predetermined number of texts using web crawler technology. The predetermined number can be, for example, 20,000, and the selection of texts is not limited to retrievable academic articles, forum articles, web articles, and the like.
After the raw data is obtained, word segmentation can be performed on it using, for example, a Chinese word segmentation tool and a custom segmentation dictionary. The Chinese word segmentation tool can be, for example, the Jieba segmentation tool; the custom segmentation dictionary can be a segmentation lexicon built by dictionary-based, statistics-based, or rule-based methods, or a segmentation dictionary built according to domain-specific rules or an existing annotated corpus.
After word segmentation, words without practical meaning such as auxiliary words and adjectives are removed using the stop-word dictionary, and the segmentation result can be further filtered, for example by deleting single-character words, to obtain the segmented word set. Since single-character words are of no practical help to summary extraction from Chinese texts, they can additionally be removed to reduce the resources occupied by invalid data.
In the embodiments of the present application, the words in the segmented word set are weighted by the term frequency-inverse document frequency (TF-IDF) algorithm to obtain the weight of each word; the words are then sorted by weight, and the top n words with the highest weights are extracted as the feature word set. The value of n can be adjusted according to actual needs and can be any positive integer.
The TF-IDF algorithm holds that if a word or phrase appears with a high frequency TF in one article but rarely appears in other articles, the word or phrase is considered to have good class-discrimination ability and is suitable for classification. In other words, TF-IDF tends to filter out common words and retain important words.
Term frequency (TF) refers to the number of times a given word appears in a document. The term frequency value is usually normalized to prevent it from being biased towards longer articles.
Inverse document frequency (IDF) is a measure of the general importance of a word. The inverse document frequency of a particular word can be obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient. The fewer the documents containing the word, the larger its IDF, indicating that the word has good class-discrimination ability. If the number of documents containing the word in a certain class is m, and the total number of documents in other classes containing the word is k, then the total number of documents containing the word is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, indicating that the class-discrimination ability of the word is not strong.
The embodiments of the present application use the TF-IDF algorithm to extract important feature words from different texts or articles while filtering out common words, and the extracted important feature words form the feature word sets, so that at least one feature word set is obtained.
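A minimal TF-IDF sketch of step 230, assuming each document is already a list of segmented words; the logarithm base, the absence of smoothing, and the names used here are choices not fixed by the filing.

```python
import math
from collections import Counter

def top_n_feature_words(tokenized_docs, n):
    """Sketch of step 230: weight each word by TF-IDF within its document and
    keep the n highest-weighted words as that document's feature word set."""
    doc_total = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # document frequency of each word
    feature_sets = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        scores = {w: (tf[w] / len(doc)) * math.log(doc_total / df[w]) for w in tf}
        feature_sets.append(sorted(scores, key=scores.get, reverse=True)[:n])
    return feature_sets

docs = [["北京", "首都", "旅游"], ["上海", "旅游", "港口"]]
print(top_n_feature_words(docs, 2))
```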
A feature word (i.e., the current feature word) in at least one feature word set is expanded according to the pre-built association-word dictionary to obtain the phrase set associated with that feature word. The pre-built association-word dictionary collects phrases or words that have a certain specific relationship with the feature words; for example, it can be a synonym dictionary or another dictionary that relates to the feature words. The pre-built synonym dictionary includes at least one phrase correspondence, and the phrase correspondence is built according to a pre-established matrix relationship. The pre-established matrix relationship can, for example, take two words with the same meaning as a word pair and extend the word pairs vertically or horizontally to build the matrix: two phrases form a word pair occupying two columns, and phrases with the same meaning as either phrase of the pair are extended vertically; or two phrases form a word pair occupying two rows, and phrases with the same meaning as either phrase of the pair are extended horizontally. The matrix relationship can thus be, for example, a two-column multi-row or two-row multi-column relationship.
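As a rough sketch of the pre-built association-word dictionary and the expansion of step 240, a two-column word-pair matrix could be represented and queried as below; the entries and the lookup logic are illustrative assumptions, not the dictionary actually used by the filing.

```python
# Two-column word-pair matrix: each row pairs two phrases with the same meaning,
# extended row by row as described for the pre-established matrix relationship.
synonym_pairs = [
    ("北京", "北京市"),
    ("北京", "首都"),
]

def expand_feature_word(feature_word, pairs):
    """Sketch of step 240: collect every phrase paired with the feature word."""
    phrases = {feature_word}
    for left, right in pairs:
        if feature_word in (left, right):
            phrases.update((left, right))
    return sorted(phrases)

print(expand_feature_word("北京", synonym_pairs))  # ['北京', '北京市', '首都']
```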
In the embodiments of the present application, expanding each feature word with associated words improves the consistency of the feature words extracted from multiple articles or texts and their ability to summarize the texts, and this effective deduplication approach also occupies little space.
It should be noted that although the operations of the method of the present invention are described in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the operations shown must be performed to achieve the desired result. On the contrary, the steps depicted in the flow charts can be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution.
Referring further to Fig. 3, Fig. 3 shows a schematic structural diagram of a feature word deduplication device 300 provided by an embodiment of the present application.
As shown in Fig. 3, the device 300 comprises:
a first acquisition unit 310, configured to obtain the phrase set associated with the current feature word in the feature word set.
In an embodiment of the present application, at least one feature word is extracted from a text or article, and the phrase set associated with that feature word can be obtained. The phrase set can be determined, for example, by means of an association-word dictionary, which can be a synonym dictionary or another dictionary that establishes an association with the feature word. Taking expansion via a synonym dictionary as an example, the feature word is expanded through the synonym dictionary into a synonym set, which may include at least one phrase with the same meaning as the current feature word, each phrase containing at least two characters. Alternatively, a synonym set associated with the feature word can also be obtained from other synonymy relations. In the embodiments of the present application, "same meaning" can be understood as covering both identical and similar meanings, such as synonyms and near-synonyms.
In the embodiments of the present application, a phrase can be, for example, a Chinese word or a multi-word phrase expressed in English, and each phrase contains at least two characters. Take the Chinese phrases 北京 (Beijing), 北京市 (Beijing City), and 首都 (capital) as an example: the phrase 北京 contains the two characters 北 and 京. These phrases represent different expressions of the same meaning. Since single-character words are of no practical help when extracting a summary from a text or article, the embodiments of the present application can further remove the single-character words in the text or article.
a computing unit 320, configured to calculate the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set.
In an embodiment of the present application, the sum of the specified portion of each phrase is calculated by looking up the ASCII codes in one-to-one correspondence with the phrase. The specified portion can be determined, for example, by a specified parameter, which can be, for example, a calculation-count identifier used to indicate the number of the current calculation.
The ASCII codes in one-to-one correspondence with a phrase can be looked up, for example, in an ASCII code table, or obtained by calling a pre-established ASCII code acquisition interface for common phrases. In the embodiments of the present application, the lookup result for each phrase can be expressed as the ASCII codes corresponding to that phrase. For example, for the phrase {北京}, the corresponding ASCII code subset found in the ASCII code table is {a1, a2}; for the phrase {北京市}, the corresponding subset is {a1, a2, a3}; and for the phrase {首都}, the corresponding subset is {b1, b2}.
In the embodiments of the present application, the first sum-value set is obtained by calculating, for each phrase, the sum of the specified portion of its one-to-one ASCII codes. For example, in the process of solving the first sum-value set, the specified parameter is read to determine the computation range that it defines: the ASCII code values corresponding to the value of the specified parameter are selected from the ASCII codes for the summation. In the embodiments of the present application, the number of elements in the first sum-value set is fixed, but each element in the first sum-value set depends on the specified parameter, for example on the calculation-count identifier.
The process of calculating the first sum-value set is described by taking the above synonym phrase set {北京, 北京市, 首都} as an example.
In the first calculation, the specified parameter m, for example the calculation-count identifier, is read. When m = 1, which may be realized by assignment or by initializing to zero and incrementing, the ASCII code corresponding to the first character of each phrase is taken as the summation result in the first sum-value set. The first characters of these phrases are 北, 北, and 首, and the corresponding ASCII code values are {a1}, {a1}, and {b1}, so the first sum-value set obtained in the first calculation can be expressed as {a1, a1, b1}; that is, the sum of the specified portion of the ASCII codes of each phrase is calculated. If the result of the first calculation satisfies the judgment condition, the deduplicated feature word can be obtained; otherwise, the step of calculating the first sum-value set needs to be repeated.
It is also possible to calculate the sums over all the ASCII codes, for example the sum a1 + a2 corresponding to {北京}, the sum a1 + a2 + a3 corresponding to {北京市}, and the sum b1 + b2 corresponding to {首都}, i.e., to calculate the full sum of the ASCII code subset corresponding to each phrase.
a determination unit 330, configured to determine the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
In the embodiments of the present application, whether the count of the minimum value in the first sum-value set is unique determines the deduplicated feature word. If the minimum value in the first sum-value set occurs only once, the phrase associated with that minimum value is determined to be the deduplicated feature word. For example, in the finally calculated first sum-value set {a1+a2, a1+a2+a3, b1+b2}, the minimum value may be a1+a2, and the phrase corresponding to this minimum value is 北京; {北京} can then serve as the deduplicated feature word of the synonym set. After all other feature words have been deduplicated one by one through the above steps, a new feature word set can be determined for later use in fields such as word cloud construction or text mining.
If the count of the minimum value is not unique, for example in the first sum-value set {a1, a1, b1} obtained in the first calculation, the minimum value may be a1, but it occurs twice; in that case, it cannot be determined which phrase is the deduplicated feature word. The embodiments of the present application therefore also calculate the sum of the specified portion of the ASCII codes of each phrase in a loop, and only when the count of the minimum value is judged to be unique is the phrase associated with that minimum value determined to be the deduplicated feature word.
In the embodiments of the present application, the determination unit may comprise:
a judging submodule, configured to judge whether the count of the minimum value in the first sum-value set is unique;
an updating submodule, configured to update the specified parameter if it is determined that the count of the minimum value is not unique; and then
a returning submodule, configured to return to calculating the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, until the count of the minimum value is judged to be unique;
a determining submodule, configured to determine, if the count of the minimum value is unique, that the phrase corresponding to the minimum value is the deduplicated feature word.
In the embodiments of the present application, when the count of the minimum value in the first sum-value set is determined to be not unique, a new round of solving the first sum-value set is triggered automatically. In this process, the specified parameter is first updated, for example by increasing the value of the calculation-count identifier m by 1; then the process returns to the step of calculating the sum of the specified portion of the ASCII codes.
The computing unit may also comprise:
a reading submodule, configured to read the specified parameter;
a summing submodule, configured to select, from each set of ASCII codes, the ASCII code values corresponding to the specified parameter and sum them, where the ASCII codes comprise the ASCII code values in one-to-one correspondence with the characters of the phrase, the value of the specified parameter is a positive integer less than or equal to N, N is the maximum phrase length in the phrase set, and the phrase length is the total number of characters that a phrase contains.
Taking the above set {北京, 北京市, 首都} as an example, the phrase length of 北京市 is 3, so the value of N is 3, and the value of the specified parameter is a positive integer less than or equal to N, for example 1, 2, or 3. N denotes the maximum phrase length in the phrase set.
For example, in the first calculation the value of the specified parameter is 1, i.e., m = 1; the ASCII code value corresponding to the specified parameter is selected from the ASCII codes of the phrases {北京}, {北京市}, and {首都} and summed, i.e., the ASCII code value at the first position of each phrase in character order is selected and summed, and the resulting first sum-value set is {a1, a1, b1}.
In the second calculation, the value of the specified parameter is 2; the ASCII code values at the first and second positions in character order are selected and summed, and the resulting first sum-value set is {a1+a2, a1+a2, b1+b2}.
In the third calculation, the value of the specified parameter is 3; the ASCII code values at the first, second, and third positions in character order are selected and summed. Since the phrases {北京} and {首都} have no third position, their elements in the first sum-value set obtained in the third calculation are the results of the second calculation, so the first sum-value set obtained in the third calculation is {a1+a2, a1+a2+a3, b1+b2}.
In the embodiments of the present application, deduplication proceeds step by step by calculating the sums of the ASCII codes corresponding to the phrases, and the deduplicated feature words are obtained. Compared with existing feature word deduplication techniques, this greatly reduces computational complexity and simplifies the deduplication steps.
In order to guarantee the consistency of the feature words with the same meaning extracted from multiple articles and to improve how well the article keywords summarize the articles, the present application also provides a feature word deduplication device. Please refer to Fig. 4, which shows a schematic structural block diagram of a feature word deduplication device provided by an embodiment of the present application.
As shown in Fig. 4, the device optionally comprises:
a word segmentation processing unit 410, configured to perform word segmentation on the raw data; and
a screening unit 420, configured to filter the result of the above word segmentation using a stop-word dictionary to obtain a segmented word set;
a second acquisition unit 430, configured to obtain at least one feature word set, each feature word set comprising at least one feature word extracted from the segmented word set according to the TF-IDF algorithm;
an expansion unit 440, configured to expand the feature words according to a pre-built association-word dictionary to obtain the phrase sets associated with the feature words;
a first acquisition unit 450, configured to obtain the phrase set associated with the current feature word in the feature word set;
a computing unit 460, configured to calculate the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set; and
a determination unit 470, configured to determine the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
The functions implemented by units 450-470 use the same embodiments as steps 110-130; refer to the description of steps 110-130.
The embodiments of the present application combine multiple big-data technologies such as natural language processing, text analysis, TF-IDF, machine learning, and statistical analysis. They are not limited to feature word extraction from a single text, and in applications that extract feature words from multiple texts they highlight even more the ability of the extracted feature words to summarize the texts.
In the embodiments of the present application, the raw data is obtained by web crawling, for example raw data obtained from a predetermined number of texts using web crawler technology. The predetermined number can be, for example, 20,000, and the selection of texts is not limited to retrievable academic articles, forum articles, web articles, and the like.
After the raw data is obtained, word segmentation can be performed on it using, for example, a Chinese word segmentation tool and a custom segmentation dictionary. The Chinese word segmentation tool can be, for example, the Jieba segmentation tool; the custom segmentation dictionary can be a segmentation lexicon built by dictionary-based, statistics-based, or rule-based methods, or a segmentation dictionary built according to domain-specific rules or an existing annotated corpus.
After word segmentation, words without practical meaning such as auxiliary words and adjectives are removed using the stop-word dictionary, and the segmentation result can be further filtered, for example by deleting single-character words, to obtain the segmented word set. Since single-character words are of no practical help to summary extraction from Chinese texts, they can additionally be removed to reduce the resources occupied by invalid data.
In the embodiments of the present application, the words in the segmented word set are weighted by the term frequency-inverse document frequency (TF-IDF) algorithm to obtain the weight of each word; the words are then sorted by weight, and the top n words with the highest weights are extracted as the feature word set. The value of n can be adjusted according to actual needs and can be any positive integer.
The TF-IDF algorithm holds that if a word or phrase appears with a high frequency TF in one article but rarely appears in other articles, the word or phrase is considered to have good class-discrimination ability and is suitable for classification. In other words, TF-IDF tends to filter out common words and retain important words.
Term frequency (TF) refers to the number of times a given word appears in a document. The term frequency value is usually normalized to prevent it from being biased towards longer articles.
Inverse document frequency (IDF) is a measure of the general importance of a word. The inverse document frequency of a particular word can be obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient. The fewer the documents containing the word, the larger its IDF, indicating that the word has good class-discrimination ability. If the number of documents containing the word in a certain class is m, and the total number of documents in other classes containing the word is k, then the total number of documents containing the word is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, indicating that the class-discrimination ability of the word is not strong.
The embodiments of the present application use the TF-IDF algorithm to extract important feature words from different texts or articles while filtering out common words, and the extracted important feature words form the feature word sets, so that at least one feature word set is obtained.
A feature word (i.e., the current feature word) in at least one feature word set is expanded according to the pre-built association-word dictionary to obtain the phrase set associated with that feature word. The pre-built association-word dictionary collects phrases or words that have a certain specific relationship with the feature words; for example, it can be a synonym dictionary or another dictionary that relates to the feature words. The pre-built synonym dictionary includes at least one phrase correspondence, and the phrase correspondence is built according to a pre-established matrix relationship. The pre-established matrix relationship can, for example, take two words with the same meaning as a word pair and extend the word pairs vertically or horizontally to build the matrix: two phrases form a word pair occupying two columns, and phrases with the same meaning as either phrase of the pair are extended vertically; or two phrases form a word pair occupying two rows, and phrases with the same meaning as either phrase of the pair are extended horizontally. The matrix relationship can thus be, for example, a two-column multi-row or two-row multi-column relationship.
In the embodiments of the present application, expanding each feature word with associated words improves the consistency of the feature words extracted from multiple articles or texts and their ability to summarize the texts, and this effective deduplication approach also occupies little space.
It should be understood that the units or modules recited in devices 300-400 correspond to the steps of the methods described with reference to Figs. 1-2. Therefore, the operations and features described above for the methods are equally applicable to devices 300-400 and the units they contain, and are not repeated here. Devices 300-400 can be pre-implemented in a browser or other security application of an electronic device, or can be loaded into the browser or security application of the electronic device by downloading or other means. The corresponding units in devices 300-400 can cooperate with units in the electronic device to implement the solutions of the embodiments of the present application.
Referring now to Fig. 5, it shows a schematic structural diagram of a computer system 500 suitable for implementing the terminal device or server of the embodiments of the present application.
As shown in Fig. 5, the computer system 500 includes a central processing unit (CPU) 501, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the system 500. The CPU 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom can be installed into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to Figs. 1-2 can be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the methods of Figs. 1-2. In such embodiments, the computer program can be downloaded and installed from a network through the communication section 509 and/or installed from the removable medium 511.
The flow charts and block diagrams in the drawings illustrate the architectures, functions, and operations that may be implemented by systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flow chart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules described in the embodiments of the present application can be implemented in software or in hardware. The described units or modules can also be provided in a processor; for example, it can be described as: a processor comprising a first acquisition unit, a computing unit, and a determination unit. The names of these units or modules do not in certain cases constitute a limitation on the units or modules themselves; for example, the first acquisition unit can also be described as "a unit for obtaining the synonym set associated with the current feature word in the feature word set".
As another aspect, the present application also provides a computer-readable storage medium, which can be the computer-readable storage medium included in the device of the above embodiments, or can exist independently without being assembled into the device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the feature word deduplication method described in the present application.
The above description is only the preferred embodiments of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and also covers, without departing from the above inventive concept, other technical solutions formed by any combination of the above technical features or their equivalent features, for example technical solutions formed by replacing the above features with technical features disclosed in (but not limited to) the present application that have similar functions.

Claims (14)

1. A feature word deduplication method, characterized in that the method comprises:
obtaining a phrase set associated with a current feature word in a feature word set;
calculating the sum of a specified portion of each phrase based on ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set; and
determining the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
2. The method according to claim 1, characterized in that calculating the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase comprises:
reading a specified parameter; and
selecting, from each set of ASCII codes, the ASCII code values corresponding to the specified parameter and summing them, wherein the ASCII codes comprise the ASCII code values in one-to-one correspondence with the characters of the phrase, the value of the specified parameter is a positive integer less than or equal to N, N is the maximum phrase length in the phrase set, and the phrase length is the total number of characters that the phrase contains.
3. The method according to claim 1 or 2, characterized in that determining the deduplicated feature word by judging the count of the minimum value in the first sum-value set comprises:
judging whether the count of the minimum value in the first sum-value set is unique;
if it is determined that the count of the minimum value is not unique, updating the specified parameter, and then returning to calculating the sum of the specified portion of each phrase based on the ASCII codes in one-to-one correspondence with the phrase, until the count of the minimum value is judged to be unique; and
if it is determined that the count of the minimum value is unique, determining that the phrase corresponding to the minimum value is the deduplicated feature word.
4. The method according to claim 1, characterized in that the method further comprises:
obtaining at least one feature word set, each feature word set comprising at least one feature word extracted from a segmented word set according to a term frequency-inverse document frequency (TF-IDF) algorithm; and
expanding the feature words according to a pre-built association-word dictionary to obtain the phrase sets associated with the feature words.
5. The method according to claim 4, characterized in that the association-word dictionary comprises at least one phrase correspondence, and the phrase correspondence is built according to a pre-established matrix relationship.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
performing word segmentation on raw data; and
filtering the result of the word segmentation using a stop-word dictionary to obtain the segmented word set.
7. A feature word deduplication device, characterized in that the device comprises:
a first acquisition unit, configured to obtain a phrase set associated with a current feature word in a feature word set;
a computing unit, configured to calculate the sum of a specified portion of each phrase based on ASCII codes in one-to-one correspondence with the phrase, to obtain a first sum-value set; and
a determination unit, configured to determine the deduplicated feature word by judging the count of the minimum value in the first sum-value set.
8. The device according to claim 7, characterized in that the computing unit comprises:
a reading submodule, configured to read a specified parameter; and
a summing submodule, configured to select, from each set of ASCII codes, the ASCII code values corresponding to the specified parameter and sum them, wherein the ASCII codes comprise the ASCII code values in one-to-one correspondence with the characters of the phrase, the value of the specified parameter is a positive integer less than or equal to N, N is the maximum phrase length in the phrase set, and the phrase length is the total number of characters that the phrase contains.
9. The apparatus according to claim 7 or 8, wherein the determination unit comprises:
a judging submodule, configured to judge whether the number of minimum values in the first sum value set is unique;
an updating submodule, configured to update the specified parameter if the number of minimum values is determined not to be unique;
a returning submodule, configured to return to calculating the sum of the specified portions of the phrases based on the ASCII codes in one-to-one correspondence with the phrases, until the number of minimum values is judged to be unique; and
a determining submodule, configured to determine, if the number of minimum values is determined to be unique, the phrase corresponding to the minimum value as the deduplicated feature word.
10. The apparatus according to claim 7, further comprising:
a second acquisition unit, configured to obtain at least one feature word set, each feature word set comprising at least one feature word extracted from a word segmentation set according to a term frequency-inverse document frequency (TF-IDF) algorithm; and
an expanding unit, configured to expand the feature word according to a pre-built associated-word dictionary to obtain the phrase set associated with the current feature word.
11. The apparatus according to claim 9, wherein the associated-word dictionary comprises at least one phrase correspondence, and the phrase correspondence is constructed according to a pre-established matrix relationship.
12. The apparatus according to claim 10 or 11, further comprising:
a word segmentation unit, configured to perform word segmentation on raw data; and
a filtering unit, configured to filter the result of the word segmentation using a stop-word dictionary to obtain the word segmentation set.
13. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 6.
14. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN201810852217.3A 2018-07-27 2018-07-27 Characteristic word duplication eliminating method, device and equipment and storage medium thereof Pending CN109062898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810852217.3A CN109062898A (en) 2018-07-27 2018-07-27 Characteristic word duplication eliminating method, device and equipment and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810852217.3A CN109062898A (en) 2018-07-27 2018-07-27 Characteristic word duplication eliminating method, device and equipment and storage medium thereof

Publications (1)

Publication Number Publication Date
CN109062898A true CN109062898A (en) 2018-12-21

Family

ID=64831434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810852217.3A Pending CN109062898A (en) 2018-07-27 2018-07-27 Characteristic word duplication eliminating method, device and equipment and storage medium thereof

Country Status (1)

Country Link
CN (1) CN109062898A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411568A (en) * 2010-09-20 2012-04-11 苏州同程旅游网络科技有限公司 Chinese word segmentation method based on travel industry feature word stock
US20160188554A1 (en) * 2014-12-30 2016-06-30 Chengnan Liu Method for generating random content for an article
CN106528508A (en) * 2016-10-27 2017-03-22 乐视控股(北京)有限公司 Repeated text judgment method and apparatus
CN108132930A (en) * 2017-12-27 2018-06-08 曙光信息产业(北京)有限公司 Feature Words extracting method and device
CN108304384A (en) * 2018-01-29 2018-07-20 上海名轩软件科技有限公司 Word-breaking method and apparatus

Similar Documents

Publication Publication Date Title
CN107798136B (en) Entity relation extraction method and device based on deep learning and server
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US10360294B2 (en) Methods and systems for efficient and accurate text extraction from unstructured documents
US9483460B2 (en) Automated formation of specialized dictionaries
CN110377886A (en) Project duplicate checking method, apparatus, equipment and storage medium
WO2009154570A1 (en) System and method for aligning and indexing multilingual documents
CN110032650B (en) Training sample data generation method and device and electronic equipment
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN111241389A (en) Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
Lepage Analogies between binary images: Application to chinese characters
CN110147425A Keyword extraction method and device, computer equipment, and storage medium
CN107220307A (en) Web search method and device
CN115017315A (en) Leading edge theme identification method and system and computer equipment
Babatunde et al. Automatic table recognition and extraction from heterogeneous documents
JP5869948B2 (en) Passage dividing method, apparatus, and program
CN111681731A (en) Method for automatically marking colors of inspection report
Yahya et al. Arabic text categorization based on Arabic Wikipedia
Elbarougy et al. A proposed natural language processing preprocessing procedures for enhancing arabic text summarization
CN110457707A Method, device, electronic equipment, and readable storage medium for extracting notional word keywords
CN113722472B (en) Technical literature information extraction method, system and storage medium
CN113449063B (en) Method and device for constructing document structure information retrieval library
CN109062898A (en) Characteristic word duplication eliminating method, device and equipment and storage medium thereof
Milanova et al. LOCALE: A Rule-based Location Named-entity Recognition Method for Latin Text.
KR20070118154A (en) Information processing device and method, and program recording medium
CN113468339A (en) Label extraction method, system, electronic device and medium based on knowledge graph

Legal Events

Date Code Title Description

PB01 Publication

SE01 Entry into force of request for substantive examination

TA01 Transfer of patent application right
Effective date of registration: 20201221
Address after: No.31 Yanqi street, Yanqi Economic Development Zone, Huairou District, Beijing
Applicant after: Beijing Huihong Technology Co.,Ltd.
Address before: Room 107, building 2, Olympic Village street, Chaoyang District, Beijing
Applicant before: HANERGY MOBILE ENERGY HOLDING GROUP Co.,Ltd.

TA01 Transfer of patent application right
Effective date of registration: 20211109
Address after: No.31 Yanqi street, Yanqi Economic Development Zone, Huairou District, Beijing
Applicant after: Dongjun new energy Co.,Ltd.
Address before: No.31 Yanqi street, Yanqi Economic Development Zone, Huairou District, Beijing
Applicant before: Beijing Huihong Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication
Application publication date: 20181221