CN109948141A - Method and apparatus for extracting feature words - Google Patents
- Publication number: CN109948141A (application CN201711391968.1A)
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and apparatus for extracting feature words, and relates to the field of information technology. One specific embodiment of the method includes: obtaining a target text and determining the part-of-speech feature of each word in the target text; and determining the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule. This embodiment provides a new approach to extracting the feature words of a target text: the feature words are extracted by part-of-speech features, which preserves the multidimensional features of the target text and improves the accuracy of the text features.
Description
Technical field
The present invention relates to the field of information technology, and in particular to a method and apparatus for extracting feature words.
Background
Today's networks are flooded with text data. Users want to receive the information that is relevant to them (for example, news, policies, and guidelines issued by the state) so that they do not miss useful information and incur greater losses. For example, with text recommendation, a worker can quickly learn of newly issued state documents relevant to his or her work and set up the related work projects in time, avoiding the economic loss caused by missed information.
Schemes in the prior art usually compute the ratio of the number of times a word occurs in a text to the total number of texts, for example with algorithms such as document frequency (DF) and TF*IDF (Term Frequency-Inverse Document Frequency), in order to extract the keywords of a text. The first few of these keywords are then combined with the title to form labels that characterize the text, and feature-relevance calculation is finally performed on these labels to implement text recommendation.
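The prior-art keyword ranking described above can be illustrated with a minimal TF*IDF sketch. The function name `tf_idf_keywords`, the toy corpus, and the use of the natural logarithm are assumptions made for illustration; the source does not specify a particular variant of the formula.

```python
import math
from collections import Counter

def tf_idf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF*IDF and return the top_k."""
    n_docs = len(docs)
    # Document frequency (DF): number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

# Hypothetical, already segmented documents.
docs = [
    ["cloud", "computing", "network", "cloud"],
    ["network", "policy", "news"],
    ["network", "news", "guide"],
]
top_words = tf_idf_keywords(docs, 0, top_k=2)
```

Because the score depends on how often a word occurs, an important word that appears only once in a terse text ranks no higher than any other single-occurrence word, which is the weakness the following paragraphs point out.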
In the course of realizing the present invention, the inventor found that the prior art has at least the following problems:
(1) the schemes of the prior art extract only a small number of feature words, which cannot adequately characterize the multidimensional features of a text;
(2) in more general (abstract) texts, many important words occur only a few times, so the prior art cannot identify them well, which hinders the extraction of text features.
Summary of the invention
In view of this, embodiments of the present invention provide a method and apparatus for extracting feature words that can at least solve the problem that the small number of feature words extracted by the prior art leaves text features missing.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for extracting feature words is provided, including: obtaining a target text and determining the part-of-speech feature of each word in the target text; and determining the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.
Optionally, determining the feature words of the target text according to the part-of-speech feature of each word and the preset feature-word part-of-speech extraction rule includes:
when, among four adjacent words, the part of speech of the first word is adjective or noun and it occurs zero or more times, the part of speech of the second word is nominal phrase and it occurs zero times or once, the part of speech of the third word is adjective or noun and it occurs zero or more times, and the part of speech of the fourth word is noun, determining that the four adjacent words are feature words of the target text; or
when, among a first group of three adjacent words, the part of speech of the first word is adjective or noun and it occurs one or more times, or is nominal phrase and it occurs at most once, the part of speech of the second word is adjective or noun and it occurs zero or more times, and the part of speech of the third word is noun, determining that the first group of three adjacent words are feature words of the target text; or
when, among a second group of three adjacent words, the part of speech of the first word is adjective, noun, or verb, the part of speech of the second word is verb or noun and it occurs at most once, and the part of speech of the third word is noun or verb, determining that the second group of three adjacent words are feature words of the target text; or
when the part of speech of a word is academic term, determining that the academic term is a feature word of the target text; or
when a word is a sensitive word, determining that the nearest adjacent word located after the sensitive word is a feature word of the target text.
Optionally, the method of the embodiment of the present invention further includes: analyzing the syntactic structure of each sentence in the target text and determining the sentence feature of each sentence, where the sentence feature includes determiners and a center word; and, according to a preset feature-word sentence-feature extraction rule, extracting the center word and the nearest adjacent determiner located before the center word, and determining that the combined word formed by the extracted center word and determiner is a feature word of the target text.
Optionally, the method of the embodiment of the present invention further includes: when the target text is a general text or the number of extracted feature words is less than a predetermined quantity threshold, obtaining from a preset feature dictionary words similar to the extracted feature words; and calculating a first similarity between each obtained word and the feature words, and, when the first similarity is greater than or equal to a first predetermined similarity threshold, determining that the obtained word is a feature word of the target text.
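The dictionary-based expansion step above can be sketched as follows. The source does not say how the first similarity is computed, so this sketch substitutes a character-level ratio from Python's `difflib.SequenceMatcher` purely as a stand-in; the function name, the default thresholds, and the sample dictionary are all hypothetical.

```python
from difflib import SequenceMatcher

def expand_feature_words(features, dictionary, min_count=3, sim_threshold=0.6):
    """Add dictionary words that are similar enough to an extracted feature word.

    Expansion only runs when fewer than min_count feature words were extracted.
    """
    if len(features) >= min_count:
        return list(features)
    expanded = list(features)
    for candidate in dictionary:
        if candidate in expanded:
            continue
        # Stand-in "first similarity": best character-level ratio against
        # any already extracted feature word (assumption, not from the source).
        best = max(
            (SequenceMatcher(None, candidate, f).ratio() for f in features),
            default=0.0,
        )
        if best >= sim_threshold:
            expanded.append(candidate)
    return expanded

expanded = expand_feature_words(
    ["nanotechnology"], ["nanotechnologies", "biology", "nanotech"]
)
```

A semantic similarity (for example, one based on word embeddings) would fit the intent of the passage better than string overlap; the structure of the threshold check would stay the same.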
Optionally, the method of the embodiment of the present invention further includes calculating the noise value L of each feature word according to the formula

$$L = x \cdot \text{C-value}(a) + y \cdot \mathrm{SCP}(w_1, \ldots, w_n)$$

and extracting the feature words whose noise value is greater than or equal to a predetermined noise value threshold as the feature words of the target text. Here, C-value(a) is the noise value obtained by term-based filtering of the feature word, SCP(w_1, ..., w_n) is the noise value obtained by unit-based filtering of the feature word, w_1, ..., w_n is the word string of the feature word, x and y are the weights of C-value(a) and SCP(w_1, ..., w_n) respectively, and a is the word string.
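The weighted combination in the formula above can be worked through in a few lines. This sketch assumes the C-value and SCP scores have already been computed elsewhere and only shows the weighting and the threshold test; the function names, the equal weights, and the sample scores are hypothetical.

```python
def noise_value(c_value, scp, x=0.5, y=0.5):
    """L = x * C-value(a) + y * SCP(w_1, ..., w_n)."""
    return x * c_value + y * scp

def filter_feature_words(scored, threshold, x=0.5, y=0.5):
    """Keep feature words whose noise value L reaches the threshold.

    scored maps each feature word to a (C-value, SCP) pair.
    """
    return [
        word
        for word, (c_value, scp) in scored.items()
        if noise_value(c_value, scp, x, y) >= threshold
    ]

# Hypothetical precomputed scores.
scored = {
    "cloud computing": (4.0, 0.8),  # L = 0.5 * 4.0 + 0.5 * 0.8 = 2.4
    "of the": (1.0, 0.1),           # L = 0.5 * 1.0 + 0.5 * 0.1 = 0.55
}
kept = filter_feature_words(scored, threshold=2.0)
```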
Optionally, after the feature words of the target text are determined, the method further includes: receiving a text to be tested and obtaining the feature words of the text to be tested; and calculating a second similarity between the feature words of the target text and the feature words of the text to be tested, and, when the second similarity is greater than or equal to a second predetermined similarity threshold, determining that the target text is similar to the text to be tested.
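The comparison of two texts by their feature words can be sketched as a set-overlap check. The source does not name the measure used for the second similarity, so the Jaccard index here is an assumption chosen for simplicity, as are the function names and the 0.5 threshold.

```python
def jaccard_similarity(features_a, features_b):
    """Share of feature words common to both texts (Jaccard index)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def texts_are_similar(features_a, features_b, threshold=0.5):
    """True when the second similarity reaches the threshold."""
    return jaccard_similarity(features_a, features_b) >= threshold

# Hypothetical feature words of a target text and a text to be tested.
target = ["cloud computing", "data center", "virtualization technology"]
tested = ["cloud computing", "virtualization technology", "network"]
similarity = jaccard_similarity(target, tested)  # 2 shared / 4 distinct
```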
To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for extracting feature words is provided, including: a determining module, configured to obtain a target text and determine the part-of-speech feature of each word in the target text; and an extraction module, configured to determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.
Optionally, the extraction module is configured to:
when, among four adjacent words, the part of speech of the first word is adjective or noun and it occurs zero or more times, the part of speech of the second word is nominal phrase and it occurs zero times or once, the part of speech of the third word is adjective or noun and it occurs zero or more times, and the part of speech of the fourth word is noun, determine that the four adjacent words are feature words of the target text; or
when, among a first group of three adjacent words, the part of speech of the first word is adjective or noun and it occurs one or more times, or is nominal phrase and it occurs at most once, the part of speech of the second word is adjective or noun and it occurs zero or more times, and the part of speech of the third word is noun, determine that the first group of three adjacent words are feature words of the target text; or
when, among a second group of three adjacent words, the part of speech of the first word is adjective, noun, or verb, the part of speech of the second word is verb or noun and it occurs at most once, and the part of speech of the third word is noun or verb, determine that the second group of three adjacent words are feature words of the target text; or
when the part of speech of a word is academic term, determine that the academic term is a feature word of the target text; or
when a word is a sensitive word, determine that the nearest adjacent word located after the sensitive word is a feature word of the target text.
Optionally, the apparatus of the embodiment of the present invention further includes a word module, configured to analyze the syntactic structure of each sentence in the target text and determine the sentence feature of each sentence, where the sentence feature includes determiners and a center word. The extraction module is further configured to extract, according to a preset feature-word sentence-feature extraction rule, the center word and the nearest adjacent determiner located before the center word, and to determine that the combined word formed by the extracted center word and determiner is a feature word of the target text.
Optionally, the apparatus of the embodiment of the present invention further includes an expansion module, configured to: when the target text is a general text or the number of extracted feature words is less than a predetermined quantity threshold, obtain from a preset feature dictionary words similar to the extracted feature words; and calculate a first similarity between each obtained word and the feature words, and, when the first similarity is greater than or equal to a first predetermined similarity threshold, determine that the obtained word is a feature word of the target text.
Optionally, the apparatus of the embodiment of the present invention further includes a filtering module, configured to calculate the noise value L of each feature word according to the formula

$$L = x \cdot \text{C-value}(a) + y \cdot \mathrm{SCP}(w_1, \ldots, w_n)$$

and to extract the feature words whose noise value is greater than or equal to a predetermined noise value threshold as the feature words of the target text. Here, C-value(a) is the noise value obtained by term-based filtering of the feature word, SCP(w_1, ..., w_n) is the noise value obtained by unit-based filtering of the feature word, w_1, ..., w_n is the word string of the feature word, x and y are the weights of C-value(a) and SCP(w_1, ..., w_n) respectively, and a is the word string.
Optionally, the apparatus of the embodiment of the present invention further includes a similarity module, configured to: receive a text to be tested and obtain the feature words of the text to be tested; and calculate a second similarity between the feature words of the target text and the feature words of the text to be tested, and, when the second similarity is greater than or equal to a second predetermined similarity threshold, determine that the target text is similar to the text to be tested.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device for extracting feature words is provided.
The electronic device of the embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the above methods for extracting feature words.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided on which a computer program is stored, where the program, when executed by a processor, implements any of the above methods for extracting feature words.
According to the scheme provided by the present invention, an embodiment of the above invention has the following advantages or beneficial effects: it provides a new approach to extracting the feature words of a target text. By analyzing the part-of-speech feature of each word in the text and extracting feature words from the target text according to a predetermined feature-word part-of-speech extraction rule, it extracts feature words that better characterize the target text and reflect its multidimensional features, which resolves the scarcity of text feature words in the prior art.
Further effects of the above non-conventional optional approaches are explained below in conjunction with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the main flow of a method for extracting feature words according to an embodiment of the present invention;
Fig. 2 is a schematic flow diagram of an optional method for extracting feature words according to an embodiment of the present invention;
Fig. 3 is a schematic flow diagram of another optional method for extracting feature words according to an embodiment of the present invention;
Fig. 4 is a schematic flow diagram of another optional method for extracting feature words according to an embodiment of the present invention;
Fig. 5 is a schematic flow diagram of another optional method for extracting feature words according to an embodiment of the present invention;
Fig. 6 is a schematic flow diagram of another optional method for extracting feature words according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the main modules of an apparatus for extracting feature words according to an embodiment of the present invention;
Fig. 8 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 9 is a schematic structural diagram of a computer system suitable for implementing a mobile device or server of an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
The terms used in the present invention are explained below:
Part-of-speech feature: part of speech is an important feature of a word and an important link between words and sentences. Determining the part-of-speech feature is the process of assigning a part of speech, or lexical category, to each word, including but not limited to noun, verb, adjective, preposition, numeral, article, and nominal phrase.
Specific text: the sentences are vivid and the wording is rich, for example vernacular writing and personal information.
General text: the opposite of specific text; the wording is terse and the sentences are abstract, for example classical Chinese writing, national regulations, and legal documents.
Determiner-center relationship: the relationship formed by a determiner followed by its center word.
Term: the likelihood that a word string occurs as a feature word, where a word string is a string of words combined according to some classification scheme.
Unit: the likelihood that a word string occurs as a single word.
Otsu algorithm: uses the gray-level characteristics of an image to divide the image into two parts, background and target. The larger the between-class variance between background and target, the greater the difference between the two parts that make up the image. Misclassifying part of the target as background, or part of the background as target, reduces this difference. Therefore, the segmentation that maximizes the between-class variance has the smallest probability of misclassification.
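The Otsu definition above can be made concrete with a short sketch that searches for the threshold maximizing the between-class variance. It is written for a flat list of integer gray levels, as in the image setting of the definition; how the patent applies the algorithm to feature words is not shown in this passage, and the function name and sample values are hypothetical.

```python
def otsu_threshold(values, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Values <= t form the background class and values > t the target class.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background values yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every value is already in the background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to the constant factor 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated groups of hypothetical gray levels.
pixels = [10, 12, 11, 13, 200, 205, 198, 202]
threshold = otsu_threshold(pixels)
```

Every threshold between the two groups gives the same variance, and the strict comparison keeps the first of them.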
It should be noted that the embodiments of the present invention are applicable to extracting feature words from both Chinese and non-Chinese target texts; for non-Chinese text, the present invention uses English as the example. In addition, the target text in the embodiments of the present invention may be a text such as news, an article, a policy, or an article summary, and the feature words may be words or phrases that characterize the target text.
Fig. 1 shows the main flow chart of a method for extracting feature words provided by an embodiment of the present invention, which includes the following steps:
S101: Obtain a target text and determine the part-of-speech feature of each word in the target text.
S102: Determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.
In the above embodiment, for step S101, part of speech is an important feature of a word and an important link between words and sentences. Determining the part-of-speech feature is the process of assigning a part of speech, or lexical category, to each word. Because this step is critical for the later processing, the part-of-speech tagging must produce sufficiently accurate results.
Further, before the part-of-speech feature of a word is determined, the target text may also be segmented into words. Most alphabetic languages such as English use spaces as separators between words, whereas Chinese, which inherits the characteristics of ancient Chinese, has no obvious separation between words. In ancient Chinese, apart from proper nouns such as personal and place names, one character is usually one word, so segmentation is unnecessary for ancient texts. Modern Chinese, however, has many two-character and multi-character words, so the text must be segmented before further analysis.
In addition, in the prior art, processing technology for Chinese lags behind that for non-Chinese languages, and many non-Chinese processing methods cannot be applied directly to Chinese, precisely because Chinese processing must first perform word segmentation. For Chinese, therefore, segmentation is the premise of text analysis, and part-of-speech tagging is performed on the segmented result. Specifically, a segmenter, string matching, statistics-based methods, or other approaches known to those skilled in the art may be used.
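One of the string-matching approaches mentioned above can be sketched as forward maximum matching against a word list. This is a minimal illustration, not the segmenter used by the patent; the function name, the tiny lexicon, and the Chinese phrase (an assumed rendering of "network virtualization technology for cloud computing data centers") are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text by repeatedly taking the longest lexicon word at the cursor.

    A single character is always accepted so that unknown text still advances.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical lexicon and input, for illustration only.
lexicon = {"面向", "云计算", "数据", "中心", "网络", "虚拟化", "技术"}
tokens = forward_max_match("面向云计算数据中心的网络虚拟化技术", lexicon)
```

The resulting tokens would then be passed to a part-of-speech tagger; statistical segmenters generally handle ambiguous boundaries better than this greedy rule.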
Further, after the target text is obtained and before the part-of-speech feature of each word is determined, the target text may also be preprocessed, for example by checking whether it contains wrongly written characters, whether its sentences read smoothly, and whether its wording is accurate.
For step S102: in non-Chinese text, the feature words of a target text are mostly single words. In Chinese text, the feature words are mostly noun phrases, mostly 2 to 8 characters long, and words such as "的", "is", and "a little" do not appear in feature words. For both Chinese and non-Chinese text, feature words can be extracted from the target text in the following ways.
(1) {(Adj|Noun)+ | [(Adj|Noun)*(NounPrep)?](Adj|Noun)*}Noun, where NounPrep denotes a nominal phrase, Adj an adjective, and Noun a noun; ? means zero or one occurrence, * means zero or more occurrences, and + means one or more occurrences.
Specific cases include but are not limited to:
1) {(Adj|Noun)+(Adj|Noun)*}Noun: the part of speech of the first word is adjective or noun and it occurs one or more times, the part of speech of the second word is adjective or noun and it occurs zero or more times, and the part of speech of the third word is noun.
For example, for "new-generation cross-node technology cloud computing basic theory", segmentation gives the following part of speech for each word: new-adj, generation-n, cross-vn, node-n, technology-n, cloud computing-prep, basic-n, theory-n. Analysis gives:
"cross" (gerund), "node": this has the form noun (first noun) + noun (last noun), and the second noun can be regarded as occurring zero times;
likewise, "node", "technology": this has the form noun (first noun) + noun (last noun), corresponding to [(Adj|Noun)+]Noun;
"cross", "node", "technology": this has the form noun + noun + noun. It can be read as the first noun occurring twice and the last noun once, or as the first noun, the second noun, and the last noun each occurring once; the two readings have the same effect and are not distinguished here.
2) {[(Adj|Noun)*(NounPrep)?](Adj|Noun)*}Noun, which has three parts: (Adj|Noun)*(NounPrep)?, meaning that the part of speech of the first word is adjective or noun and it occurs zero or more times, and the part of speech of the second word is nominal phrase and it occurs zero times or once; (Adj|Noun)*, meaning that the part of speech of the third word is adjective or noun and it occurs zero or more times; and Noun, meaning that the part of speech of the fourth word is noun.
Taking "new-generation cross-node technology cloud computing basic theory" as the example again, analysis gives:
"cross", "node", "technology", "cloud computing", "basic", "theory": this has the form noun + noun + noun + nominal phrase + noun + noun. The first (Adj|Noun)* in [(Adj|Noun)*(NounPrep)?] can be regarded as occurring three times, and each of the remaining parts as occurring once;
"cross" (gerund), "node", "technology": this has the form noun + noun + noun. The first (Adj|Noun)* in [(Adj|Noun)*(NounPrep)?] can be regarded as a noun occurring once with NounPrep absent; alternatively, the final (Adj|Noun)* in {[(Adj|Noun)*(NounPrep)?](Adj|Noun)*} can be regarded as occurring twice, with the other parts occurring zero times;
"cloud computing", "basic", "theory": this has the form nominal phrase + noun + noun, corresponding to the first (Adj|Noun)* in [(Adj|Noun)*(NounPrep)?] occurring zero times;
"technology": this is a single noun, corresponding to {[(Adj|Noun)*(NounPrep)?](Adj|Noun)*} before the final noun occurring zero times.
In addition, for non-Chinese text such as "The phone shell is good", "phone" (noun) and "shell" (noun) are extracted as feature words.
(2) [(Adj|Noun|v)+][(v|Noun)?](Noun|v), where v denotes a verb. The pattern has three parts: (Adj|Noun|v)+, (v|Noun)?, and (Noun|v). That is, when among three adjacent words the part of speech of the first word is adjective, noun, or verb and it occurs at least once, the part of speech of the second word is verb or noun and it occurs zero times or once, and the part of speech of the third word is noun or verb, the three adjacent words are determined to be feature words of the target text.
For example, in the Chinese sentence "I am working hard to learn cycling", "learn cycling" is a feature word; in non-Chinese text, "Good papermaking technology" is a feature word.
(3) When the part of speech of a word is academic term, the academic term is determined to be a feature word of the target text. For example, in Chinese, the tag gm indicates a mathematics-related term and gp a physics-related term; in non-Chinese text, "physics" denotes a physics-specific term and can be used as a feature word.
(4) When a word is a sensitive word, the nearest adjacent word located after the sensitive word is determined to be a feature word of the target text. Texts often contain words that signal what follows; these can be called sensitive words, for example: research, build, realize, explore, optimize, design, and break through. A feature word usually follows a sensitive word, so the region after a sensitive word can be called the sensitive region. For example, in "research nanotechnology", "research" is the sensitive word, so "nano" is taken as a feature word; "nanotechnology" can also be taken as a feature word.
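Pattern (1) above can be sketched as a regular expression over a string of one-letter part-of-speech codes. This is an illustrative encoding under stated assumptions: the tag-to-code mapping (with the gerund tag vn treated as a noun, as in the analysis above), the function name, and the choice to return only leftmost, non-overlapping matches are not taken from the source, which also lists shorter sub-spans as valid matches.

```python
import re

# Assumed mapping from segmenter tags to one-letter codes.
TAG_CODES = {"adj": "A", "n": "N", "vn": "N", "prep": "P"}

# {(Adj|Noun)+ | [(Adj|Noun)*(NounPrep)?](Adj|Noun)*}Noun over the codes.
PATTERN_1 = re.compile(r"(?:[AN]+|[AN]*P?[AN]*)N")

def match_noun_phrases(tagged):
    """Return the word spans whose tag sequence matches pattern (1)."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in PATTERN_1.finditer(codes)
    ]

chinese_example = [
    ("new", "adj"), ("generation", "n"), ("cross", "vn"), ("node", "n"),
    ("technology", "n"), ("cloud computing", "prep"), ("basic", "n"),
    ("theory", "n"),
]
english_example = [
    ("The", "art"), ("phone", "n"), ("shell", "n"), ("is", "v"), ("good", "adj"),
]
chinese_phrases = match_noun_phrases(chinese_example)
english_phrases = match_noun_phrases(english_example)
```

Pattern (2) could be encoded the same way by adding a code for verbs; enumerating every sub-span, as the worked analysis does, would need an explicit loop over start and end positions instead of `finditer`.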
Specifically, segmenting "network virtualization technology for cloud computing data centers" with a segmenter gives: toward-v, cloud computing-gc (computer term), data-n, center-n, 的-uj (segmenters differ in how they tag "的"; it can be understood as the tag "u" that segmenters usually use, where u denotes an auxiliary word), network-n, virtualization-vn, technology-n. In each pair, the first item is the word and the second is its part of speech.
Here "data" and "center" are both nouns, and so the string they form, like the string formed by "virtualization" and "technology", satisfies pattern (1); "cloud computing" is an academic term and satisfies pattern (3), so it can be determined to be a feature word.
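Rules (3) and (4) described above can be sketched as two small lookups over the tagged words. The tag set, the sensitive-word list, and the function names below are assumptions assembled from the examples in the text, not a definitive list.

```python
# Academic tags mentioned in the text: gm (mathematics), gp (physics),
# gc (computer terms). Treating exactly these as academic is an assumption.
ACADEMIC_TAGS = {"gm", "gp", "gc"}
SENSITIVE_WORDS = {"research", "build", "realize", "explore", "optimize", "design"}

def academic_terms(tagged):
    """Rule (3): keep every word whose tag marks it as an academic term."""
    return [word for word, tag in tagged if tag in ACADEMIC_TAGS]

def words_after_sensitive(tagged):
    """Rule (4): keep the word immediately following each sensitive word."""
    return [
        tagged[i + 1][0]
        for i in range(len(tagged) - 1)
        if tagged[i][0] in SENSITIVE_WORDS
    ]

segmented = [
    ("toward", "v"), ("cloud computing", "gc"), ("data", "n"), ("center", "n"),
    ("的", "uj"), ("network", "n"), ("virtualization", "vn"), ("technology", "n"),
]
academic = academic_terms(segmented)
after_sensitive = words_after_sensitive(
    [("research", "v"), ("nanotechnology", "n")]
)
```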
The method of the above embodiment provides a new approach to extracting the feature words of a target text. The more feature words are extracted, the better the features of the target text are represented; if certain words are discarded, even words that contribute little to the text's features, some text features will inevitably be lost, because those words are still part of the text. By analyzing the part-of-speech feature of each word in the text and extracting feature words from the target text according to a predetermined feature-word part-of-speech extraction rule, the method extracts feature words that better characterize the target text and reflect its multidimensional features, which resolves the scarcity of text feature words in the prior art.
Fig. 2 shows the main flow chart of an optional method for extracting feature words provided by an embodiment of the present invention, which includes the following steps:
S201: Obtain a target text, analyze the syntactic structure of each sentence in the target text, and determine the sentence feature of each sentence, where the sentence feature includes determiners and a center word.
S202: According to a preset feature-word sentence-feature extraction rule, extract the center word and the nearest adjacent determiner located before the center word, and determine that the combined word formed by the extracted center word and determiner is a feature word of the target text.
In the above embodiment, for step S201, a given sentence in the target text can be segmented by a segmenter and the qualifying relationships between its words analyzed, specifically the determiners and the center word of the sentence, to obtain the syntactic structure of the sentence. Sentences may be delimited by full stops, or by commas or semicolons; the present invention places no restriction on this.
For step S202, a sentence may contain several determiners. To keep extraction simple while still capturing the feature words that best reflect the text, only the determiner nearest to the centre word is selected and combined with the centre word to form a combined word, which serves as a feature word of the target text.
For example, in the sentence "the cross-node technology of the new generation of cloud computing servers", "technology" is the centre word, and "new generation", "cloud computing", "server", "cross" and "node" are determiners. Analysis shows that "node" and "technology" stand in a modifying relation with no other word between them, so "node technology" is taken as the feature word of this sentence.
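The selection in step S202 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a parser has already labelled each token, and the `(word, role)` input format and the function name `extract_combined_words` are hypothetical.

```python
# Hypothetical input format: each token is (word, role), where role is
# "determiner", "centre" or "other" as assigned by some syntactic parser.
def extract_combined_words(tokens):
    """Pair each centre word with the determiner immediately before it."""
    combined = []
    for i in range(1, len(tokens)):
        word, role = tokens[i]
        prev_word, prev_role = tokens[i - 1]
        # Only the adjacent determiner is used, so nothing sits between the pair.
        if role == "centre" and prev_role == "determiner":
            combined.append(f"{prev_word} {word}")
    return combined


sentence = [
    ("new generation", "determiner"),
    ("cloud computing", "determiner"),
    ("server", "determiner"),
    ("cross", "determiner"),
    ("node", "determiner"),
    ("technology", "centre"),
]
print(extract_combined_words(sentence))  # ['node technology']
```

Only the determiner directly before the centre word is paired with it; the more distant determiners are ignored, as in the example sentence.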
The method provided by the above embodiment offers another way of extracting the feature words of a target text: by analyzing the syntactic structure of each sentence, combined words whose parts stand in a determiner-centre relation with no other word in between are extracted as feature words of the target text. This embodiment obtains feature words that better characterize the target text, with higher accuracy, and addresses the scarcity of extracted feature words in the prior art.
Referring to Fig. 3, which shows the main flow chart of another optional method for extracting feature words provided by an embodiment of the present invention, the method includes the following steps:
S301: obtain the target text.
S302: determine the part-of-speech feature of each word in the target text.
S302': analyze the syntactic structure of each sentence in the target text, and determine the sentence feature of each sentence; wherein the sentence feature includes determiners and a centre word.
S303: according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule, determine the first feature words of the target text.
S303': according to a preset feature-word sentence-feature extraction rule, extract the centre word and the determiner that is located before the centre word and is adjacent and nearest to it, and determine that the combined word formed by the extracted centre word and determiner is a second feature word of the target text.
S304: deduplicate the extracted first feature words and second feature words of the target text, and determine the feature words of the target text.
In the above embodiment, for steps S302 and S303, refer to the description of steps S101 and S102 shown in Fig. 1; for steps S302' and S303', refer to the description of steps S201 and S202 shown in Fig. 2. Details are not repeated here.
In the above embodiment, for step S301, the segmenter may perform sentence splitting and word segmentation on the target text at the same time, or one after the other; the present invention places no restriction on this.
It should be noted that, in the method provided by the above embodiment, part-of-speech feature extraction may be performed before sentence feature extraction, sentence feature extraction may be performed before part-of-speech feature extraction, or the two may be performed at the same time; the present invention places no restriction on this.
In addition, the first feature words extracted from the target text by part-of-speech features and the second feature words extracted by sentence features may partly coincide. For example, "node technology" satisfies the description of extraction rule mode (1) shown in Fig. 1 and also satisfies the determiner + centre word description shown in Fig. 2. A deduplication operation can therefore be performed to avoid displaying the same feature word more than once, which also reduces the service load on the terminal and the server.
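The deduplication in step S304 can be illustrated with a short sketch. The function name `merge_feature_words` is hypothetical, and preserving first-seen order is an assumption; the text only requires that duplicates be removed.

```python
def merge_feature_words(first_words, second_words):
    """Merge two feature-word lists, dropping duplicates but keeping first-seen order."""
    # dict.fromkeys keeps insertion order and discards repeated keys.
    return list(dict.fromkeys([*first_words, *second_words]))


first = ["node technology", "data center"]    # from part-of-speech rules
second = ["node technology", "cloud server"]  # from determiner + centre word
print(merge_feature_words(first, second))
# ['node technology', 'data center', 'cloud server']
```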
The method provided by the above embodiment offers a further way of extracting the feature words of a target text. Compared with the approaches shown in Fig. 1 and Fig. 2, this embodiment extracts more feature words, with less repetition and higher accuracy, and addresses the prior-art problem that too few extracted feature words leave the target text short of text features.
Referring to Fig. 4, which shows a flow diagram of another optional method for extracting feature words according to an embodiment of the present invention, the method includes the following steps:
S401: obtain the target text, and determine the part-of-speech feature of each word in the target text.
S402: according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule, extract feature words from the target text.
S403: when the target text is a general text, or the number of extracted feature words is less than a predetermined quantity threshold, obtain words similar to the extracted feature words from a preset feature dictionary.
S404: calculate a first similarity between each obtained word and the feature words, and when the first similarity is greater than or equal to a first predetermined similarity threshold, determine that the obtained word is a feature word of the target text.
In the above embodiment, for steps S401 and S402, refer to the description of steps S101 and S102 shown in Fig. 1; details are not repeated here.
In the above embodiment, for step S403, when the target text is a general text, the methods shown in Figs. 1 to 3 above can still be used to extract feature words, but the number of feature words that can be extracted may be small.
Therefore, when the target text is determined to be a general text (for example, classical Chinese writing), or the number of extracted feature words is less than the predetermined quantity threshold, feature word expansion can be performed to increase the number of feature words.
Specifically, word2vec (word vectors) can be used to search the preset feature dictionary for words similar to the extracted feature words, which are then used for feature word expansion. The tool can be trained in advance, for example on a large corpus; its training process is outside the scope of the present invention.
For step S404, because the feature dictionary contains a large number of words, some of the words obtained with word2vec may have little relevance to the target text. The expanded words can therefore be screened, specifically according to their similarity to the feature words.
A similarity threshold (for example, 80%) can be preset. The similarity between the input feature word and each word returned by word2vec is calculated, and only words whose similarity is greater than or equal to the similarity threshold are determined to be feature words of the target text.
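The screening in step S404 can be sketched as below. This is an illustration under stated assumptions rather than the patent's implementation: word vectors are supplied as a plain dictionary (standing in for a trained word2vec model), cosine similarity is assumed as the similarity measure, and the names `cosine` and `expand_feature_words` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_feature_words(feature_word, vectors, threshold=0.8):
    """Return dictionary words whose similarity to feature_word is >= threshold."""
    query = vectors[feature_word]
    return [
        word
        for word, vec in vectors.items()
        if word != feature_word and cosine(query, vec) >= threshold
    ]


# Toy 2-dimensional vectors, invented for illustration only.
vectors = {
    "cloud computing": [1.0, 0.0],
    "distributed computing": [0.9, 0.1],
    "tumor cell": [0.0, 1.0],
}
print(expand_feature_words("cloud computing", vectors))  # ['distributed computing']
```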
In addition, the expanded words may also be screened by removing words that cannot be found in the target text, reducing the cases in which a user searches the target text for a word and finds nothing.
The method provided by the above embodiment offers a way of expanding the number of feature words of a target text and effectively addresses the scarcity of extracted feature words in the prior art. An expanded word is determined to be a feature word of the target text only when its similarity to a feature word reaches the predetermined similarity threshold.
Referring to Fig. 5, which shows a flow diagram of another optional method for extracting feature words according to an embodiment of the present invention, the method includes the following steps:
S501: obtain the target text, and determine the part-of-speech feature of each word in the target text.
S502: according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule, extract feature words from the target text.
S503: according to the formula
$$L = x \cdot \text{C-value}(a) + y \cdot \mathrm{SCP}(w_1, \ldots, w_n)$$
calculate the noise value L of each feature word, and extract the feature words whose noise value is greater than or equal to a predetermined noise value threshold as the feature words of the target text; wherein C-value(a) is the noise value obtained by term filtering of the feature word, $\mathrm{SCP}(w_1, \ldots, w_n)$ is the noise value obtained by unit filtering of the feature word, $w_1, \ldots, w_n$ is the feature word string, x and y are the weights of C-value(a) and $\mathrm{SCP}(w_1, \ldots, w_n)$ respectively, and a is the word string.
In the above embodiment, for steps S501 and S502, refer to the description of steps S101 and S102 shown in Fig. 1; details are not repeated here.
The feature words extracted by the approaches of Figs. 1 to 3 may still contain considerable noise. To improve the accuracy of the extracted feature words, the obtained feature words can be screened, specifically in the following ways:
(1) Term filtering
There are many ways to perform this filtering, for example the C-value method, specifically:
wherein the first case applies when word string a has no parent string other than itself, and the second case applies otherwise; $|a|$ is the length of the word string, $f(a)$ is its frequency, and $T_a$ is the set of words containing a.
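The C-value formula itself is not reproduced in the text. As a hedged sketch, the code below uses the commonly cited C-value definition from the term-extraction literature, which matches the two cases and the symbols described here: log2|a| * f(a) when a is not nested in any longer candidate, and log2|a| * (f(a) - average frequency of the candidates in T_a) otherwise. Whether the patent uses exactly this variant is an assumption, and the names `c_value` and `contains` are hypothetical.

```python
import math


def contains(longer, shorter):
    """True if tuple `shorter` occurs as a contiguous run inside tuple `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(a, freq):
    """C-value of word string `a` (a tuple of words), given candidate frequencies.

    freq maps every candidate word string (tuple of words) to its frequency f(.).
    """
    # T_a: the longer candidates that contain a.
    t_a = [b for b in freq if len(b) > len(a) and contains(b, a)]
    length_factor = math.log2(len(a))  # note: 0 for a single-word string
    if not t_a:
        # First case: a is not nested in any other candidate.
        return length_factor * freq[a]
    # Second case: discount by the average frequency of the strings containing a.
    nested = sum(freq[b] for b in t_a) / len(t_a)
    return length_factor * (freq[a] - nested)


freq = {
    ("node", "technology"): 5,
    ("cross", "node", "technology"): 2,
}
print(c_value(("node", "technology"), freq))  # 3.0
```

In this toy example "node technology" occurs 5 times, 2 of them inside "cross node technology", so its score is log2(2) * (5 - 2) = 3.0.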
(2) Unit filtering
Unit filtering is performed by judging how tightly the words within a word string are bound to one another, which can be measured by the SCP (Symmetrical Conditional Probability) value. The MSCP (Macro Symmetrical Conditional Probability) formula, an improvement on SCP, is as follows:
wherein $w_1, \ldots, w_n$ denotes the candidate feature word string, $w_i$ denotes a word that makes up the candidate feature word, and $F(w_1, \ldots, w_n)$ denotes the weighted term frequency of the candidate feature word. The weighted term frequency of a word string a is calculated as:
wherein $S_a$ denotes the set of region categories in which a appears, and b denotes one of the designated regions, for example the sensitive region; $f_b(a)$ denotes the frequency with which word string a appears in region b, and weight(b) denotes the weight given to a candidate feature word in that text region. For example, a feature word appearing in the title may be given a weight of 10, one in the sensitive region a weight of 6, and one elsewhere a weight of 3. These figures are empirical values that serve only to distinguish the different regions.
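The weighted term frequency formula is not reproduced in the text. From the symbol definitions, a natural reading is a sum over the regions in which a appears of the per-region frequency times the region weight; the sketch below assumes that form. The function name `weighted_frequency` and the region keys are hypothetical, while the weights 10, 6 and 3 are the empirical values quoted above.

```python
# Empirical region weights quoted in the description.
REGION_WEIGHTS = {"title": 10, "sensitive": 6, "other": 3}


def weighted_frequency(region_counts, weights=REGION_WEIGHTS):
    """Sum f_b(a) * weight(b) over every region b in which word string a appears.

    region_counts maps a region name to the number of occurrences of a there.
    """
    return sum(weights[region] * count for region, count in region_counts.items())


# A word string seen once in the title, twice in the sensitive region, 4 times elsewhere.
print(weighted_frequency({"title": 1, "sensitive": 2, "other": 4}))  # 34
```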
When calculating SCP values, some feature words contain words such as "technology" or "theory", which makes the calculated SCP value rather low. For example, for "information technology", because "technology" occurs many times in the text, the calculated Avp is large and the SCP value becomes very small. In view of this, when calculating the SCP value of a feature word in a document, such words (for example theory, application, technology, method, status, experiment) can be left out of the calculation without actually being removed from the feature word.
The two filtering methods above can be used alone or in combination. When they are combined, weights can be used to balance them, which filters feature words more effectively than using either method alone.
In the above embodiment, for step S503, the two filtering methods above can be combined to calculate the noise value used to filter the feature words, specifically:
$$L = x \cdot \text{C-value}(a) + y \cdot \mathrm{SCP}(w_1, \ldots, w_n)$$
wherein L is the noise value, and x and y can be preset, for example to 0.7 and 0.3 respectively.
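The weighted combination can be sketched as follows, assuming the C-value and SCP scores of each candidate have already been computed by some other means. The function names, the candidate scores and the threshold of 1.0 are hypothetical; only the weights 0.7 and 0.3 come from the text.

```python
def noise_value(c_val, scp, x=0.7, y=0.3):
    """L = x * C-value(a) + y * SCP(w1, ..., wn)."""
    return x * c_val + y * scp


def filter_feature_words(scores, threshold, x=0.7, y=0.3):
    """Keep candidates whose noise value L is >= threshold.

    scores maps each candidate feature word to a (c_value, scp) pair.
    """
    return [
        word
        for word, (c_val, scp) in scores.items()
        if noise_value(c_val, scp, x, y) >= threshold
    ]


# Invented scores for illustration only.
scores = {
    "node technology": (3.0, 0.8),      # L = 2.1 + 0.24
    "related information": (0.5, 0.1),  # L = 0.35 + 0.03
}
print(filter_feature_words(scores, threshold=1.0))  # ['node technology']
```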
For feature words, especially combined words made up of two words such as word X and word Y, the more strongly word X influences word Y, or word Y influences word X, the larger the noise value of the resulting combined word. Denoising can therefore keep only the feature words whose noise value is greater than or equal to a predetermined noise value threshold and determine them to be the text features of the target text. The noise value threshold can be determined using the Otsu algorithm.
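How the Otsu algorithm is applied to noise values is not detailed in the text. The sketch below is one plausible reading: it treats the noise values as a one-dimensional sample and picks the split that maximizes the between-class variance, returning the smallest value of the upper class so that a `>=` comparison keeps that class. The function name `otsu_threshold` and the sample values are hypothetical.

```python
def otsu_threshold(values):
    """Return the cut over a 1-D sample that maximizes between-class variance."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    best_cut, best_var = ordered[0], -1.0
    low_sum = 0.0
    for i in range(1, n):
        low_sum += ordered[i - 1]
        if ordered[i] == ordered[i - 1]:
            continue  # equal values cannot be separated by a threshold
        w0, w1 = i / n, (n - i) / n       # class proportions
        mu0 = low_sum / i                 # mean of the lower class
        mu1 = (total - low_sum) / (n - i) # mean of the upper class
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_cut = between, ordered[i]
    return best_cut


noise_values = [0.1, 0.2, 0.3, 2.0, 2.2, 2.4]
print(otsu_threshold(noise_values))  # 2.0
```

With this sample the low group (0.1 to 0.3) and the high group (2.0 to 2.4) are separated, and every value greater than or equal to 2.0 would be kept.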
The method provided by the above embodiment offers a way of filtering the feature words of a target text. Feature words with higher noise values are preferred as the text features of the target text, so the feature words that pass the filter are more accurate, which improves the accuracy of the text features.
Referring to Fig. 6, which shows a flow diagram of another optional method for extracting feature words according to an embodiment of the present invention, the method includes the following steps:
S601: obtain the target text, and determine the part-of-speech feature of each word in the target text.
S602: according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule, determine the feature words of the target text.
S603: receive a text to be tested, and obtain the feature words of the text to be tested.
S604: calculate a second similarity between the feature words of the target text and the feature words of the text to be tested, and when the second similarity is greater than or equal to a second predetermined similarity threshold, determine that the target text is similar to the text to be tested.
In the above embodiment, for steps S601 and S602, refer to the description of steps S101 and S102 shown in Fig. 1; details are not repeated here.
In the above embodiment, for step S603, the network is flooded with texts. When text A is recommended to a user, a similar text B can be recommended at the same time, to reduce the chance of missing useful text information and to improve the user's working efficiency.
Before recommending, however, it must be determined whether text A is similar to text B. The feature words of each text can be obtained, and whether the two match is judged from the similarity between their feature words.
For step S604, a text can be divided into the following regions: title, sensitive region, and other; wherein the title includes first-level titles, second-level titles, and so on.
In the matching process, an algorithm is used to count how many feature words of text A and text B are identical; specifically, a DFA (Deterministic Finite Automaton) algorithm can be used for the calculation. The resulting count is divided by the number of feature words of text B, and this ratio is compared against the threshold to judge the similarity between the two. When the similarity between the two texts is greater than or equal to the second similarity threshold, the two texts are determined to be similar. Similar texts can be recommended together.
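The ratio described above can be sketched as follows. For brevity the sketch counts shared feature words with a set intersection rather than a DFA, which is a substitution made here for illustration and not the algorithm the text names; the function names, feature words and the threshold of 0.5 are hypothetical.

```python
def overlap_ratio(features_a, features_b):
    """Share of text B's feature words that also occur among text A's."""
    unique_b = set(features_b)
    if not unique_b:
        return 0.0
    shared = set(features_a) & unique_b
    return len(shared) / len(unique_b)


def is_similar(features_a, features_b, threshold):
    """Texts count as similar when the ratio is >= the second similarity threshold."""
    return overlap_ratio(features_a, features_b) >= threshold


text_a = ["nanotechnology", "nanoparticle", "tumor cell"]
text_b = ["nanotechnology", "tumor cell", "drug delivery", "imaging"]
print(overlap_ratio(text_a, text_b))    # 0.5
print(is_similar(text_a, text_b, 0.5))  # True
```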
Take a specific text and a general text as an example. The feature words of the specific text are divided into three levels: A1 title, A2 sensitive region, and A3 other, and the weights given to them can be preset, for example X=10, Y=6, Z=4 respectively. The general text has fewer feature words, so the title and the sensitive region can be merged as B1, with the rest as B2, and the feature words extracted from the general text are expanded. Matching by level then yields A1B1, A1B2, and so on. Because the feature words of the general text are also handled by level, weights can be preset for them as well, for example α=0.6 and β=0.4, giving αA1B1+βA1B2. Following the levels of the specific text's feature words, three parts are obtained: (αA1B1+βA1B2)*X, (αA2B1+βA2B2)*Y, and (αA3B1+βA3B2)*Z. Finally the three parts are summed, and the sum is determined to be the similarity between the two texts.
For example, the specific text is a researcher's profile, whose title feature word is "nanotechnology" (A1), whose sensitive-region feature word is "nanoparticle research" (A2), and whose feature word at other positions is "tumor cell" (A3);
the general text is a guide to nanotechnology, whose title feature word is "nanotechnology" (B1), whose sensitive-region feature word is "nanoparticle research" (B1), and whose feature word at other positions is "tumor cell" (B2).
The similarity between the two is calculated as:
Sum = [0.6*A1B1 + 0.4*A1B2]*10 + [0.6*A2B1 + 0.4*A2B2]*6 + [0.6*A3B1 + 0.4*A3B2]*4.
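The region-weighted sum can be worked through in code. The text does not state how each term AiBj is scored; the sketch below assumes it is the number of feature words shared by region Ai of the specific text and region Bj of the general text. The function names and dictionary layout are hypothetical, while the weights and feature words follow the example above.

```python
def region_match(a_words, b_words):
    """Assumed score for AiBj: number of feature words the two regions share."""
    return len(set(a_words) & set(b_words))


def weighted_similarity(specific, general, a_weights, b_weights):
    """Sum over Ai of weight(Ai) * sum over Bj of weight(Bj) * AiBj."""
    total = 0.0
    for a_region, a_weight in a_weights.items():
        inner = sum(
            b_weight * region_match(specific[a_region], general[b_region])
            for b_region, b_weight in b_weights.items()
        )
        total += a_weight * inner
    return total


specific = {  # researcher's profile
    "A1": ["nanotechnology"],
    "A2": ["nanoparticle research"],
    "A3": ["tumor cell"],
}
general = {  # nanotechnology guide; title and sensitive region merged into B1
    "B1": ["nanotechnology", "nanoparticle research"],
    "B2": ["tumor cell"],
}
a_weights = {"A1": 10, "A2": 6, "A3": 4}  # X, Y, Z
b_weights = {"B1": 0.6, "B2": 0.4}        # alpha, beta

print(round(weighted_similarity(specific, general, a_weights, b_weights), 2))  # 11.2
```

Under this assumption A1B1, A2B1 and A3B2 each equal 1 and the other terms are 0, so the sum is 0.6*10 + 0.6*6 + 0.4*4 = 11.2.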
It should be noted that the method provided by the above embodiment is equally applicable when both texts are specific texts or both are general texts; the present invention uses the similarity calculation between a specific text and a general text only as an illustration.
Further, the method provided by the above embodiment is equally applicable to recommending a target text to a user. Feature words that characterize the user can likewise be extracted from the user's attributes; these feature words may be obtained from information filled in by the user. The feature words of the user's attributes can be treated as a specific text.
The main purpose of text recommendation is to recommend texts that have some correlation with the user, based on the user's existing attributes, so as to achieve personalized intelligent recommendation. When a text is recommended to a user, texts similar to it can be recommended as well, which reduces the risk of missing information and makes it easier for the user to obtain relevant information.
The method provided by the above embodiment offers a way of determining whether two texts match in similarity. The feature words are divided by region in order to calculate the similarity between the feature words of the two texts, which improves the accuracy of the calculated similarity between texts.
The method provided by the embodiments of the present invention offers a new approach to extracting the feature words of a target text. Feature words in the target text can be extracted and filtered through a multi-strategy approach based on part-of-speech features and/or sentence features, which preserves the multidimensional features of the target text and improves the accuracy of the text features. It also enables the calculation of feature-word similarity between texts, and between a text and a user, during recommendation, which broadens the scope of recommendation and improves its accuracy.
Referring to Fig. 7, which shows a schematic diagram of the main modules of a device 700 for extracting feature words provided by an embodiment of the present invention, the device includes:
a determining module 701, configured to obtain a target text and determine the part-of-speech feature of each word in the target text;
an extraction module 702, configured to determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.
In the device of the embodiment of the present invention, the extraction module 702 is configured to:
when, among four adjacent words, the part of speech of the first word is adjective or noun with zero or more occurrences, the part of speech of the second word is nominal phrase with zero or one occurrence, the part of speech of the third word is adjective or noun with zero or more occurrences, and the part of speech of the fourth word is noun, determine that the four adjacent words are a feature word of the target text; or
when, in a first group of three adjacent words, the part of speech of the first word is adjective or noun with one or more occurrences, the part of speech of the second word is adjective or noun with zero or more occurrences, and the part of speech of the third word is noun, determine that the first group of three adjacent words is a feature word of the target text; or
when, in a second group of three adjacent words, the part of speech of the first word is adjective, noun or verb, the part of speech of the second word is verb or noun with at most one occurrence, and the part of speech of the third word is noun or verb, determine that the second group of three adjacent words is a feature word of the target text; or
when the part of speech of a word is academic word, determine that the academic word is a feature word of the target text; or
when the part of speech of a word is sensitive word, determine that the word located after the sensitive word and adjacent and nearest to it is a feature word of the target text.
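Because each rule is phrased in terms of part-of-speech tags and occurrence counts, it maps naturally onto a regular expression over a string of tags. The sketch below illustrates only the first rule and is not the patent's implementation: the one-letter tag scheme (`a` adjective, `n` noun, `p` nominal phrase, `v` verb, `x` other), the names `RULE_1` and `extract_by_pos`, and the pre-tagged input are all hypothetical.

```python
import re

# First rule: (adjective|noun)* (nominal phrase)? (adjective|noun)* noun,
# written over hypothetical one-letter tags so each tag lines up with one word.
RULE_1 = re.compile(r"[an]*p?[an]*n")


def extract_by_pos(tagged_words):
    """Return the word sequences whose tag sequence matches RULE_1.

    tagged_words is a list of (word, tag) pairs produced by some POS tagger.
    """
    words = [word for word, _ in tagged_words]
    tags = "".join(tag for _, tag in tagged_words)
    # Match offsets in the tag string are also indexes into the word list.
    return [" ".join(words[m.start():m.end()]) for m in RULE_1.finditer(tags)]


tagged = [
    ("build", "v"),
    ("data", "n"),
    ("center", "n"),
    ("with", "x"),
    ("virtualization", "n"),
    ("technology", "n"),
]
print(extract_by_pos(tagged))  # ['data center', 'virtualization technology']
```

Since every element except the final noun is optional, a lone noun also matches this pattern as written; a real implementation might require a minimum length.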
The device of the embodiment of the present invention further includes a sentence module 703, configured to: analyze the syntactic structure of each sentence in the target text, and determine the sentence feature of each sentence, wherein the sentence feature includes determiners and a centre word; and, according to a preset feature-word sentence-feature extraction rule, extract the centre word and the determiner that is located before the centre word and is adjacent and nearest to it, and determine that the combined word formed by the extracted centre word and determiner is a feature word of the target text.
The device of the embodiment of the present invention further includes an expansion module 704, configured to: when the target text is a general text, or the number of extracted feature words is less than a predetermined quantity threshold, obtain words similar to the extracted feature words from a preset feature dictionary; and calculate a first similarity between each obtained word and the feature words, and when the first similarity is greater than or equal to a first predetermined similarity threshold, determine that the obtained word is a feature word of the target text.
The device of the embodiment of the present invention further includes a filtering module 705, configured to, according to the formula
$$L = x \cdot \text{C-value}(a) + y \cdot \mathrm{SCP}(w_1, \ldots, w_n)$$
calculate the noise value L of each feature word, and extract the feature words whose noise value is greater than or equal to a predetermined noise value threshold as the feature words of the target text; wherein C-value(a) is the noise value obtained by term filtering of the feature word, $\mathrm{SCP}(w_1, \ldots, w_n)$ is the noise value obtained by unit filtering of the feature word, $w_1, \ldots, w_n$ is the feature word string, x and y are the weights of C-value(a) and $\mathrm{SCP}(w_1, \ldots, w_n)$ respectively, and a is the word string.
The device of the embodiment of the present invention further includes a similarity module 706, configured to: receive a text to be tested, and obtain the feature words of the text to be tested; and calculate a second similarity between the feature words of the target text and the feature words of the text to be tested, and when the second similarity is greater than or equal to a second predetermined similarity threshold, determine that the target text is similar to the text to be tested.
The device provided by the embodiments of the present invention is a device for extracting the feature words of a target text. Through a multi-strategy approach based on part-of-speech features and/or sentence features, it extracts and filters the feature words in the target text, which preserves the multidimensional features of the target text and improves the accuracy of the text features. It also enables the calculation of feature-word similarity between texts, and between a text and a user, during recommendation, which broadens the scope of recommendation and improves its accuracy.
In addition, the specific implementation of the device for extracting feature words in the embodiments of the present invention has already been described in detail in the method for extracting feature words above, so the repeated content is not described again here.
Fig. 8 shows an exemplary system architecture 800 to which the method for extracting feature words or the device for extracting feature words of the embodiments of the present invention can be applied.
As shown in Fig. 8, the system architecture 800 may include terminal devices 801, 802 and 803, a network 804 and a server 805. The network 804 serves as the medium providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user can use the terminal devices 801, 802, 803 to interact with the server 805 through the network 804, to receive or send messages and so on. Various communication client applications can be installed on the terminal devices 801, 802, 803, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (examples only).
The terminal devices 801, 802, 803 can be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, laptop computers and desktop computers.
The server 805 can be a server that provides various services, for example a back-end management server (example only) that supports shopping websites browsed by users with the terminal devices 801, 802, 803. The back-end management server can analyze and otherwise process received data such as information query requests, and feed the processing results (for example, targeted push information or product information; examples only) back to the terminal devices.
It should be noted that the method for extracting feature words provided by the embodiments of the present invention is generally executed by the server 805; correspondingly, the device for extracting feature words is generally arranged in the server 805.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 8 are merely illustrative. There can be any number of terminal devices, networks and servers, depending on implementation needs.
Referring to Fig. 9, which shows a schematic structural diagram of a computer system 900 suitable for implementing a terminal device of the embodiments of the present invention. The terminal device shown in Fig. 9 is only an example and should not impose any restriction on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores the various programs and data required for the operation of the system 900. The CPU 901, the ROM 902 and the RAM 903 are connected to one another by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse and the like; an output section 907 including a cathode ray tube (CRT) or liquid crystal display (LCD) and a loudspeaker and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flow charts may be implemented as a computer software program. For example, the disclosed embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium in the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program, which can be used by or in connection with an instruction execution system, apparatus or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
The flow charts and block diagrams in the drawings illustrate the possible architecture, functions and operations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flow chart or block diagram can represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flow chart, and any combination of boxes in a block diagram or flow chart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention can be implemented in software or in hardware. The described modules can also be arranged in a processor, which can for example be described as: a processor including a determining module and an extraction module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the extraction module can also be described as a "feature word extraction module".
As another aspect, the present invention also provides a computer-readable medium. The
computer-readable medium may be included in the device described in the above embodiments, or it may
exist separately without being assembled into the device. The computer-readable medium carries one or
more programs which, when executed by the device, cause the device to:
obtain a target text, and determine the part-of-speech feature of each word in the target text;
determine the feature words of the target text according to the part-of-speech feature of each word and
a preset feature-word part-of-speech extraction rule.
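The two steps above can be illustrated with a minimal sketch. It assumes the text has already been segmented and tagged by some part-of-speech tagger, and it implements only one example rule (a run of adjectives or nouns that ends in a noun, in the spirit of the rules listed in the claims). The tag names `ADJ`, `NOUN`, and `VERB`, the one-letter codes, and the function name `extract_feature_words` are hypothetical choices for this illustration, not part of the patent.

```python
import re

# Hypothetical one-letter codes for the part-of-speech tags assumed here.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "VERB": "V"}

# Example rule only: one or more adjectives/nouns followed by a final noun.
RULE = re.compile(r"[AN]+N")


def extract_feature_words(tagged):
    """Return the phrases whose tag sequence matches RULE.

    `tagged` is a list of (word, tag) pairs produced by any POS tagger.
    """
    # Encode each tag as one character so regex offsets align with token indexes.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    phrases = []
    for match in RULE.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


tagged = [("fast", "ADJ"), ("charging", "NOUN"), ("battery", "NOUN"),
          ("is", "VERB"), ("good", "ADJ")]
print(extract_feature_words(tagged))  # ['fast charging battery']
```

Encoding each tag as a single character lets an ordinary regular expression express "zero or more", "one or more", and "at most once" constraints over adjacent words; other rules would be added as further patterns.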
The technical solution according to the embodiments of the present invention provides a new approach to
extracting feature words from a target text: the feature words in the target text can be extracted and
filtered using multiple strategies based on part-of-speech features and/or sentence features. This
preserves the multidimensional features of the target text and improves the accuracy of the text
features. It also allows feature-word similarity to be calculated between texts, and between texts and
users, during recommendation, which improves the recommendation range and the recommendation accuracy.
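The feature-word similarity calculation mentioned above can be sketched as follows. The text does not specify which similarity measure is used, so this sketch assumes Jaccard similarity over the two sets of feature words; the function names and the default threshold of 0.5 are likewise hypothetical.

```python
def jaccard_similarity(words_a, words_b):
    """Jaccard similarity of two feature-word collections (assumed measure)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0  # No feature words on either side: treat as not similar.
    return len(set_a & set_b) / len(set_a | set_b)


def is_similar(words_a, words_b, threshold=0.5):
    """Two texts count as similar when the similarity reaches the threshold."""
    return jaccard_similarity(words_a, words_b) >= threshold


target = ["battery life", "screen", "camera"]
candidate = ["battery life", "camera", "price"]
print(jaccard_similarity(target, candidate))  # 0.5 (2 shared words out of 4)
print(is_similar(target, candidate))          # True
```

Any other set or vector similarity could be substituted; only the "greater than or equal to a predetermined threshold" comparison comes from the text.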
The above specific embodiments do not limit the scope of protection of the present invention. Those
skilled in the art should understand that, depending on design requirements and other factors, various
modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent
substitution, or improvement made within the spirit and principles of the present invention shall be
included within the scope of protection of the present invention.
Claims (14)
1. A method for extracting feature words, characterized by comprising:
obtaining a target text, and determining the part-of-speech feature of each word in the target text;
determining the feature words of the target text according to the part-of-speech feature of each word and
a preset feature-word part-of-speech extraction rule.
2. The method according to claim 1, characterized in that determining the feature words of the target
text according to the part-of-speech feature of each word and the preset feature-word part-of-speech
extraction rule comprises:
when, among four adjacent words, the part of speech of the first word is adjective or noun and the word
occurs zero or more times, the part of speech of the second word is noun phrase and the word occurs zero
times or once, the part of speech of the third word is adjective or noun and the word occurs zero or more
times, and the part of speech of the fourth word is noun, determining that the four adjacent words are a
feature word of the target text; or
when, in a first group of three adjacent words, the part of speech of the first word is adjective or
noun and the word occurs one or more times, the part of speech of the second word is adjective or noun
and the word occurs zero or more times, and the part of speech of the third word is noun, determining
that the first group of three adjacent words is a feature word of the target text; or
when, in a second group of three adjacent words, the part of speech of the first word is adjective, noun,
or verb, the part of speech of the second word is verb or noun and the word occurs at most once, and the
part of speech of the third word is noun or verb, determining that the second group of three adjacent
words is a feature word of the target text; or
when the part of speech of a word is academic word, determining that the academic word is a feature word
of the target text; or
when the part of speech of a word is sensibility word, determining that the nearest adjacent word
located after the sensibility word is a feature word of the target text.
3. The method according to claim 1, characterized by further comprising:
analyzing the syntactic structure of each sentence in the target text, and determining the sentence
feature of each sentence, wherein the sentence feature comprises a determiner and a center word;
according to a preset feature-word sentence-feature extraction rule, extracting the center word and the
nearest adjacent determiner located before the center word, and determining the combined word formed by
the extracted center word and determiner as a feature word of the target text.
4. The method according to claim 1 or 3, characterized by further comprising:
when the target text is a general text, or the number of extracted feature words is less than a
predetermined quantity threshold, obtaining words similar to the extracted feature words from a preset
feature dictionary;
calculating a first similarity between each obtained word and the feature words, and when the first
similarity is greater than or equal to a first predetermined similarity threshold, determining that the
obtained word is a feature word of the target text.
5. The method according to claim 1 or 3, characterized by further comprising:
calculating the noise value L of each feature word according to the formula
$$L = x \cdot \text{C-value}(a) + y \cdot \mathrm{SCP}(w_1, \ldots, w_n)$$
and extracting the feature words whose noise value is greater than or equal to a predetermined noise
value threshold as the feature words of the target text; wherein C-value(a) is the noise value obtained
by term-based filtering of the feature words, SCP(w_1, ..., w_n) is the noise value obtained by
unit-based filtering of the feature words, w_1, ..., w_n is the word string of the feature word, x and y
are the weights of C-value(a) and SCP(w_1, ..., w_n) respectively, and a is a word string.
6. The method according to claim 1, characterized in that, after determining the feature words of the
target text, the method further comprises:
receiving a target text to be tested, and obtaining the feature words of the target text to be tested;
calculating a second similarity between the feature words of the target text and the feature words of
the target text to be tested, and when the second similarity is greater than or equal to a second
predetermined similarity threshold, determining that the target text is similar to the target text to be
tested.
7. A device for extracting feature words, characterized by comprising:
a determining module, configured to obtain a target text and determine the part-of-speech feature of each
word in the target text;
an extraction module, configured to determine the feature words of the target text according to the
part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.
8. The device according to claim 7, characterized in that the extraction module is configured to:
when, among four adjacent words, the part of speech of the first word is adjective or noun and the word
occurs zero or more times, the part of speech of the second word is noun phrase and the word occurs zero
times or once, the part of speech of the third word is adjective or noun and the word occurs zero or more
times, and the part of speech of the fourth word is noun, determine that the four adjacent words are a
feature word of the target text; or
when, in a first group of three adjacent words, the part of speech of the first word is adjective or
noun and the word occurs one or more times, the part of speech of the second word is adjective or noun
and the word occurs zero or more times, and the part of speech of the third word is noun, determine that
the first group of three adjacent words is a feature word of the target text; or
when, in a second group of three adjacent words, the part of speech of the first word is adjective, noun,
or verb, the part of speech of the second word is verb or noun and the word occurs at most once, and the
part of speech of the third word is noun or verb, determine that the second group of three adjacent
words is a feature word of the target text; or
when the part of speech of a word is academic word, determine that the academic word is a feature word of
the target text; or
when the part of speech of a word is sensibility word, determine that the nearest adjacent word located
after the sensibility word is a feature word of the target text.
9. The device according to claim 7, characterized in that the device further comprises a sentence
module, configured to:
analyze the syntactic structure of each sentence in the target text, and determine the sentence feature
of each sentence, wherein the sentence feature comprises a determiner and a center word;
according to a preset feature-word sentence-feature extraction rule, extract the center word and the
nearest adjacent determiner located before the center word, and determine the combined word formed by
the extracted center word and determiner as a feature word of the target text.
10. The device according to claim 7 or 9, characterized in that the device further comprises an
expansion module, configured to:
when the target text is a general text, or the number of extracted feature words is less than a
predetermined quantity threshold, obtain words similar to the extracted feature words from a preset
feature dictionary;
calculate a first similarity between each obtained word and the feature words, and when the first
similarity is greater than or equal to a first predetermined similarity threshold, determine that the
obtained word is a feature word of the target text.
11. The device according to claim 7 or 9, characterized in that the device further comprises a
filtering module, configured to:
calculate the noise value L of each feature word according to the formula
$$L = x \cdot \text{C-value}(a) + y \cdot \mathrm{SCP}(w_1, \ldots, w_n)$$
and extract the feature words whose noise value is greater than or equal to a predetermined noise value
threshold as the feature words of the target text; wherein C-value(a) is the noise value obtained by
term-based filtering of the feature words, SCP(w_1, ..., w_n) is the noise value obtained by unit-based
filtering of the feature words, w_1, ..., w_n is the word string of the feature word, x and y are the
weights of C-value(a) and SCP(w_1, ..., w_n) respectively, and a is a word string.
12. The device according to claim 7, characterized in that the device further comprises a similarity
module, configured to:
receive a target text to be tested, and obtain the feature words of the target text to be tested;
calculate a second similarity between the feature words of the target text and the feature words of the
target text to be tested, and when the second similarity is greater than or equal to a second
predetermined similarity threshold, determine that the target text is similar to the target text to be
tested.
13. An electronic device, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more
processors implement the method according to any one of claims 1 to 6.
14. A computer-readable medium on which a computer program is stored, characterized in that the program,
when executed by a processor, implements the method according to any one of claims 1 to 6.
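The weighted noise value used for filtering in claims 5 and 11 can be sketched as below. The sketch assumes the C-value and SCP scores of each candidate have already been computed elsewhere and only shows the weighted combination L = x * C-value + y * SCP and the threshold filter. The function names, the example scores, the weights x = y = 0.5, and the threshold are hypothetical values chosen for illustration.

```python
def noise_value(c_value, scp, x=0.5, y=0.5):
    """Weighted combination L = x * C-value + y * SCP (weights are assumptions)."""
    return x * c_value + y * scp


def filter_feature_words(scores, threshold, x=0.5, y=0.5):
    """Keep the candidates whose noise value L reaches the threshold.

    `scores` maps each candidate feature word to a precomputed
    (c_value, scp) pair; how those scores are computed is out of scope here.
    """
    kept = []
    for word, (c_value, scp) in scores.items():
        if noise_value(c_value, scp, x, y) >= threshold:
            kept.append(word)
    return kept


# Hypothetical precomputed scores for two candidates.
scores = {"battery life": (4.0, 0.8), "the thing": (1.0, 0.2)}
print(filter_feature_words(scores, threshold=1.0))  # ['battery life']
```

With these example numbers, "battery life" scores 0.5 * 4.0 + 0.5 * 0.8 = 2.4 and is kept, while "the thing" scores 0.6 and is dropped.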
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711391968.1A CN109948141A (en) | 2017-12-21 | 2017-12-21 | A kind of method and apparatus for extracting Feature Words |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711391968.1A CN109948141A (en) | 2017-12-21 | 2017-12-21 | A kind of method and apparatus for extracting Feature Words |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109948141A true CN109948141A (en) | 2019-06-28 |
Family
ID=67005024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711391968.1A Pending CN109948141A (en) | 2017-12-21 | 2017-12-21 | A kind of method and apparatus for extracting Feature Words |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948141A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909122A (en) * | 2019-10-10 | 2020-03-24 | 重庆金融资产交易所有限责任公司 | Information processing method and related equipment |
CN110956018A (en) * | 2019-11-22 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Training method of text processing model, text processing method, text processing device and storage medium |
CN110990493A (en) * | 2019-11-21 | 2020-04-10 | 国网宁夏电力有限公司电力科学研究院 | Modeling method, system and application method of electric energy quality ontology model |
CN112381038A (en) * | 2020-11-26 | 2021-02-19 | 中国船舶工业系统工程研究院 | Image-based text recognition method, system and medium |
CN112417130A (en) * | 2020-11-19 | 2021-02-26 | 贝壳技术有限公司 | Word screening method and device, computer readable storage medium and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104298665A (en) * | 2014-10-16 | 2015-01-21 | 苏州大学 | Identification method and device of evaluation objects of Chinese texts |
CN104360993A (en) * | 2014-11-19 | 2015-02-18 | 广州极盛信息科技开发有限公司 | Method for extracting needed content from text |
CN105159927A (en) * | 2015-08-04 | 2015-12-16 | 北京金山安全软件有限公司 | Method and device for selecting subject term of target text and terminal |
CN106156204A (en) * | 2015-04-23 | 2016-11-23 | 深圳市腾讯计算机系统有限公司 | The extracting method of text label and device |
CN106250365A (en) * | 2016-07-21 | 2016-12-21 | 成都德迈安科技有限公司 | The extracting method of item property Feature Words in consumer reviews based on text analyzing |
JP2017091436A (en) * | 2015-11-17 | 2017-05-25 | 株式会社Nttドコモ | Feature word selection device |
Non-Patent Citations (1)
Title |
---|
林岚岚: "基于语法模式的评论特征词提取", 《广东水利电力职业技术学院学报》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909122A (en) * | 2019-10-10 | 2020-03-24 | 重庆金融资产交易所有限责任公司 | Information processing method and related equipment |
CN110909122B (en) * | 2019-10-10 | 2023-10-03 | 湖北华中电力科技开发有限责任公司 | Information processing method and related equipment |
CN110990493A (en) * | 2019-11-21 | 2020-04-10 | 国网宁夏电力有限公司电力科学研究院 | Modeling method, system and application method of electric energy quality ontology model |
CN110990493B (en) * | 2019-11-21 | 2023-05-23 | 国网宁夏电力有限公司电力科学研究院 | Modeling method, system and application method of electric energy quality ontology model |
CN110956018A (en) * | 2019-11-22 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Training method of text processing model, text processing method, text processing device and storage medium |
CN110956018B (en) * | 2019-11-22 | 2023-04-18 | 腾讯科技(深圳)有限公司 | Training method of text processing model, text processing method, text processing device and storage medium |
CN112417130A (en) * | 2020-11-19 | 2021-02-26 | 贝壳技术有限公司 | Word screening method and device, computer readable storage medium and electronic equipment |
CN112381038A (en) * | 2020-11-26 | 2021-02-19 | 中国船舶工业系统工程研究院 | Image-based text recognition method, system and medium |
CN112381038B (en) * | 2020-11-26 | 2024-04-19 | 中国船舶工业系统工程研究院 | Text recognition method, system and medium based on image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109948141A (en) | A kind of method and apparatus for extracting Feature Words | |
Hai et al. | Identifying features in opinion mining via intrinsic and extrinsic domain relevance | |
US11556572B2 (en) | Systems and methods for coverage analysis of textual queries | |
Zagibalov et al. | Automatic seed word selection for unsupervised sentiment classification of Chinese text | |
US8280902B2 (en) | High precision search system and method | |
CN111897970A (en) | Text comparison method, device and equipment based on knowledge graph and storage medium | |
CN104239373B (en) | Add tagged method and device for document | |
JP2017010514A (en) | Search engine and method for implementing the same | |
CN110347428A (en) | A kind of detection method and device of code similarity | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
CN111160007B (en) | Search method and device based on BERT language model, computer equipment and storage medium | |
CN102609424B (en) | Method and equipment for extracting assessment information | |
CN107798622A (en) | A kind of method and apparatus for identifying user view | |
Wang et al. | Visual analytics and information extraction of geological content for text-based mineral exploration reports | |
CN110020312A (en) | The method and apparatus for extracting Web page text | |
KR20200137924A (en) | Real-time keyword extraction method and device in text streaming environment | |
CN108073708A (en) | Information output method and device | |
CN112989235A (en) | Knowledge base-based internal link construction method, device, equipment and storage medium | |
KR20210121921A (en) | Method and device for extracting key keywords based on keyword joint appearance network | |
Jeon et al. | Making a graph database from unstructured text | |
CN109902152A (en) | Method and apparatus for retrieving information | |
CN112926297A (en) | Method, apparatus, device and storage medium for processing information | |
CN111737607A (en) | Data processing method, data processing device, electronic equipment and storage medium | |
US20230004715A1 (en) | Method and apparatus for constructing object relationship network, and electronic device | |
CN111126073A (en) | Semantic retrieval method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190628 |