CN109948141A

CN109948141A - Method and apparatus for extracting feature words

Info

Publication number: CN109948141A
Application number: CN201711391968.1A
Authority: CN
Inventors: 古迎志
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2019-06-28

Abstract

The invention discloses a method and apparatus for extracting feature words, and relates to the field of information technology. One specific embodiment of the method includes: obtaining a target text and determining the part-of-speech feature of each word in the target text; and determining the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule. This embodiment provides a new approach to extracting the feature words of a target text: the feature words are extracted by part-of-speech feature, which preserves the multi-dimensional features of the target text and improves the accuracy of the text features.

Description

Method and apparatus for extracting feature words

Technical field

The present invention relates to the field of information technology, and in particular to a method and apparatus for extracting feature words.

Background

Networks today are flooded with text data, and users very much want to receive the information that is relevant to them (for example, news, policies, and guides issued by the state) so that useful information is not missed and larger losses are avoided. For example, through text recommendation, a worker can quickly learn of newly issued national documents relevant to his or her work and start the related projects in time, avoiding the economic loss caused by missed information.

The schemes provided by the prior art usually calculate the ratio of the number of times a word occurs in the texts to the total number of texts, using algorithms such as document frequency (Document Frequency, DF) and TF*IDF (Term Frequency-Inverse Document Frequency), to extract the keywords of a text. The first few of these keywords are then combined with the title to form labels that characterize the text, and these labels are finally used for feature-correlation calculation to realize text recommendation.
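The frequency-based keyword extraction described above can be illustrated with a minimal sketch. It assumes the common TF * log(N / DF) weighting; the function name `tf_idf_keywords` and the toy documents are hypothetical and are not taken from the prior-art schemes themselves.

```python
import math
from collections import Counter


def tf_idf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF * IDF and return the top_k."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy, already-segmented documents (illustrative data only).
docs = [
    ["cloud", "computing", "node", "cloud"],
    ["policy", "news", "node"],
    ["policy", "guide", "news"],
]
print(tf_idf_keywords(docs, 0, top_k=2))
```

Words that are important but rare inside a single text receive a low term frequency under this weighting, which is the weakness discussed next.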

In the course of realizing the present invention, the inventor found that the prior art has at least the following problems:

(1) the schemes provided by the prior art extract only a small number of feature words and cannot characterize the multi-dimensional features of a text well;

(2) in a text of a more general, summarizing nature, many important words occur only rarely, so the prior art cannot identify these words well, which hinders the extraction of text features.

Summary of the invention

In view of this, embodiments of the present invention provide a method and apparatus for extracting feature words, which can at least solve the prior-art problem that too few feature words are extracted, leaving text features missing.

To achieve the above object, according to one aspect of the embodiments of the present invention, a method for extracting feature words is provided, including: obtaining a target text and determining the part-of-speech feature of each word in the target text; and determining the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.

Optionally, determining the feature words of the target text according to the part-of-speech feature of each word and the preset feature-word part-of-speech extraction rule includes:

when, among four adjacent words, the part of speech of the first word is adjective or noun with zero or more occurrences, the part of speech of the second word is nominal phrase with zero or one occurrence, the part of speech of the third word is adjective or noun with zero or more occurrences, and the part of speech of the fourth word is noun, determining that the four adjacent words are feature words of the target text; or

when, among a first group of three adjacent words, the part of speech of the first word is adjective or noun with one or more occurrences, or is nominal phrase with at most one occurrence, the part of speech of the second word is adjective or noun with zero or more occurrences, and the part of speech of the third word is noun, determining that the first group of three adjacent words are feature words of the target text; or

when, among a second group of three adjacent words, the part of speech of the first word is adjective, noun or verb, the part of speech of the second word is verb or noun with at most one occurrence, and the part of speech of the third word is noun or verb, determining that the second group of three adjacent words are feature words of the target text; or

when the part of speech of a word is academic word, determining that the academic word is a feature word of the target text; or

when the part of speech of a word is sensitive word, determining that the nearest adjacent word located after the sensitive word is a feature word of the target text.

Optionally, the method for the embodiment of the present invention further includes analyzing the syntactic structure of each sentence in the target text, really The sentence feature of fixed each sentence；Wherein, the sentence feature includes determiner and centre word；According to preset Feature Words sentence Feature extraction rule extracts the centre word and before the centre word and nearest restriction adjacent with the centre word Word determines that portmanteau word combined by the extracted centre word and determiner is the Feature Words of the target text.

Optionally, the method for the embodiment of the present invention further include: when the target text is generality text or extracted When the quantity of Feature Words is less than predetermined quantity threshold value, word similar with extracted Feature Words is obtained in preset feature dictionary Language；The first similarity for calculating acquired word and the Feature Words, when first similarity is more than or equal to first When predetermined similarity threshold, determine that acquired word is the Feature Words of the target text.

Optionally, the method for the embodiment of the present invention further include:

According to formula

L=x* [C-value (a)]+y*SCP (w₁,...,w_n)

The noise figure L of each Feature Words is calculated, the Feature Words for extracting noise figure more than or equal to predetermined noise value threshold value are The Feature Words of target text；Wherein, C-value (a) is the obtained noise figure of term filtering characteristic word, SCP (w₁,..., w_n) it is the obtained noise figure of unit filtering characteristic word, w₁,...,w_nIt is characterized word word string, x, y respectively represent C-value (a)、SCP(w₁,...,w_n) weight, a is word string.
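The weighted combination and threshold filter above can be sketched as follows. This is a minimal illustration that assumes the C-value and SCP scores of each candidate have already been computed elsewhere; the function names, the weights x = y = 0.5, the threshold and the scores are all hypothetical.

```python
def noise_value(c_value, scp, x=0.5, y=0.5):
    """L = x * C-value(a) + y * SCP(w1, ..., wn)."""
    return x * c_value + y * scp


def filter_by_noise(scored, threshold, x=0.5, y=0.5):
    """Keep the candidates whose noise value L reaches the threshold.

    `scored` maps each candidate feature word to a (C-value, SCP) pair.
    """
    return [
        word
        for word, (c_value, scp) in scored.items()
        if noise_value(c_value, scp, x, y) >= threshold
    ]


# Illustrative scores only; they are not taken from the source.
scored = {"node technology": (4.0, 0.5), "of the": (0.5, 0.25)}
print(filter_by_noise(scored, threshold=1.0))
```

Keeping x and y as parameters lets the relative influence of the term-based score and the unit-based score be tuned without changing the filter itself.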

Optionally, after determining the feature words of the target text, the method further includes: receiving a text to be tested and obtaining the feature words of the text to be tested; and calculating a second similarity between the feature words of the target text and the feature words of the text to be tested, and, when the second similarity is greater than or equal to a second predetermined similarity threshold, determining that the target text is similar to the text to be tested.
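The comparison of two texts by their feature words can be sketched as below. The passage above does not specify how the second similarity is computed, so this sketch substitutes the Jaccard coefficient of the two feature-word sets purely as an assumption; the function names and the threshold of 0.5 are hypothetical.

```python
def feature_similarity(features_a, features_b):
    """Jaccard similarity of two feature-word collections (an assumed measure)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_similar(features_a, features_b, threshold=0.5):
    """Two texts count as similar when the similarity reaches the threshold."""
    return feature_similarity(features_a, features_b) >= threshold


target = ["node technology", "cloud computing", "basis theory"]
candidate = ["node technology", "cloud computing", "server"]
print(feature_similarity(target, candidate), is_similar(target, candidate))
```

Any other set or vector similarity could be used in its place; only the threshold comparison follows the description.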

To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for extracting feature words is provided, including: a determining module, configured to obtain a target text and determine the part-of-speech feature of each word in the target text; and an extraction module, configured to determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.

Optionally, the extraction module is used for:

Optionally, the apparatus of the embodiment of the present invention further includes a word module, configured to analyze the syntactic structure of each sentence in the target text and determine the sentence feature of each sentence, where the sentence feature includes determiners and a centre word; the extraction module is further configured to extract, according to a preset feature-word sentence-feature extraction rule, the centre word and the nearest determiner located before and adjacent to the centre word, and to determine that the combined word formed by the extracted centre word and determiner is a feature word of the target text.

Optionally, the apparatus of the embodiment of the present invention further includes an expansion module, configured to: when the target text is a generality text, or the number of extracted feature words is less than a predetermined quantity threshold, obtain words similar to the extracted feature words from a preset feature dictionary; and calculate a first similarity between each obtained word and the feature words, and, when the first similarity is greater than or equal to a first predetermined similarity threshold, determine that the obtained word is a feature word of the target text.

Optionally, the apparatus of the embodiment of the present invention further includes a filtering module, configured to, according to the formula

L = x * C-value(a) + y * SCP(w₁, ..., wₙ)

Optionally, the apparatus of the embodiment of the present invention further includes a similarity module, configured to: receive a text to be tested and obtain the feature words of the text to be tested; and calculate a second similarity between the feature words of the target text and the feature words of the text to be tested, and, when the second similarity is greater than or equal to a second predetermined similarity threshold, determine that the target text is similar to the text to be tested.

To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device for extracting feature words is provided.

The electronic device of the embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the above methods for extracting feature words.

To achieve the above object, according to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements any of the above methods for extracting feature words.

According to the scheme provided by the present invention, an embodiment of the above invention has the following advantages or beneficial effects: it provides a new approach to extracting the feature words of a target text; by analyzing the part-of-speech feature of each word in the text and extracting feature words from the target text according to a predetermined feature-word part-of-speech extraction rule, it extracts feature words that better characterize the target text, reflects the multi-dimensional features of the target text, and remedies the scarcity of text feature words in the prior art.

Further effects of the above non-conventional optional implementations are described below in conjunction with specific embodiments.

Brief description of the drawings

The drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on it. In the drawings:

Fig. 1 is a schematic diagram of the main flow of a method for extracting feature words according to an embodiment of the present invention;

Fig. 2 is a schematic flow diagram of an optional method for extracting feature words according to an embodiment of the present invention;

Fig. 3 is a schematic flow diagram of another optional method for extracting feature words according to an embodiment of the present invention;

Fig. 4 is a schematic flow diagram of yet another optional method for extracting feature words according to an embodiment of the present invention;

Fig. 5 is a schematic flow diagram of a further optional method for extracting feature words according to an embodiment of the present invention;

Fig. 6 is a schematic flow diagram of a still further optional method for extracting feature words according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the main modules of an apparatus for extracting feature words according to an embodiment of the present invention;

Fig. 8 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;

Fig. 9 is a schematic structural diagram of a computer system suitable for implementing a mobile device or server of an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present invention are described below with reference to the drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

The terms used in the present invention are explained below:

Part-of-speech feature: part of speech is an important feature of a word and an important link between words and sentences. Determining the part-of-speech feature is the process of assigning a part of speech, or lexical category, to each word, including but not limited to noun, verb, adjective, preposition, numeral, article, and nominal phrase.

Specific text: the sentences are more vivid and the vocabulary richer, for example, vernacular writing or personal profiles.

Generality text: the opposite of a specific text; the wording is more terse and the sentences more abstract, for example, classical Chinese writing, national statutes, or legal documents.

Determiner-centre relationship: the relationship formed by a determiner + a centre word.

Term: the likelihood that a word string occurs as a feature word, where a word string is a series of words combined according to some classification scheme.

Unit: the likelihood that a word string occurs as a single word.

Otsu algorithm: uses the grey-level characteristics of an image to divide the image into two parts, background and target. The larger the between-class variance between background and target, the greater the difference between the two parts that make up the image; misclassifying part of the target as background, or part of the background as target, reduces this difference. The segmentation that maximizes the between-class variance therefore minimizes the probability of misclassification.
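The Otsu criterion defined above can be illustrated on a plain list of numbers, which may stand for grey levels or for any other scores. This is a minimal sketch with a hypothetical function name; it tries every distinct value as a cut and keeps the one that maximizes the between-class variance w0 * w1 * (mu0 - mu1)^2. It expects a non-empty list.

```python
def otsu_threshold(values):
    """Return the cut that maximises the between-class variance.

    Values <= cut form one class and values > cut form the other.
    """
    candidates = sorted(set(values))
    best_cut, best_var = candidates[0], -1.0
    n = len(values)
    # The largest value is skipped as a cut: it would leave one class empty.
    for cut in candidates[:-1]:
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        w0, w1 = len(low) / n, len(high) / n
        mu0, mu1 = sum(low) / len(low), sum(high) / len(high)
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_cut, best_var = cut, between
    return best_cut


# Two visibly separated groups of illustrative values.
print(otsu_threshold([1, 2, 2, 3, 10, 11, 12]))
```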

It should be noted that the embodiments of the present invention are applicable to feature word extraction for both Chinese and non-Chinese target texts; for non-Chinese text, the present invention uses English as the example. In addition, the target text in the embodiments of the present invention may be news, an article, a policy, an article summary, or similar text; the feature words may be words or phrases that characterize the target text.

Referring to Fig. 1, which shows the main flow chart of a method for extracting feature words provided by an embodiment of the present invention, the method includes the following steps:

S101: obtain a target text and determine the part-of-speech feature of each word in the target text.

S102: determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.

In the above embodiment, for step S101, part of speech is an important feature of a word and an important link between words and sentences. Determining the part-of-speech feature is the process of assigning a part of speech or lexical category to each word. This is very important for later processing, so the part-of-speech features must be determined with sufficient accuracy.

Further, word segmentation of the target text may be performed before the part-of-speech features of the words are determined. Most alphabetic languages, such as English, can use spaces as separators between words, whereas Chinese, inheriting the characteristics of ancient Chinese, has no obvious separation between words. In ancient Chinese, apart from proper nouns such as personal names and place names, a single character is usually a word, so segmentation may be unnecessary for classical texts. Modern Chinese, however, has many two-character and multi-character words, so the text must be segmented before further analysis.

In addition, in the prior art, Chinese processing techniques lag behind those for non-Chinese languages, and many non-Chinese processing methods cannot be applied directly to Chinese, precisely because Chinese processing must first perform segmentation. For Chinese, therefore, segmentation is the premise of text analysis, and part-of-speech tagging is performed on the segmented text. Specifically, a segmenter, a string-matching method, a statistics-based method, or other approaches known to those skilled in the art may be used.
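One string-matching approach to segmentation is forward maximum matching, sketched below as a minimal illustration. The function name, the toy vocabulary and the maximum word length are hypothetical, and the Chinese phrase is an assumed rendering used only to exercise the code; a production segmenter would also handle ambiguity and out-of-vocabulary words.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment text by repeatedly taking the longest vocabulary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


# Toy vocabulary (illustrative only).
vocabulary = {"面向", "云计算", "数据", "中心", "网络", "虚拟化", "技术"}
print(forward_max_match("面向云计算数据中心的网络虚拟化技术", vocabulary))
```

Part-of-speech tagging would then be applied to the resulting word list.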

Further, after the target text is obtained and before the part-of-speech feature of each word is determined, the target text may also be pre-processed, for example, by checking whether it contains wrongly written characters, whether its sentences read smoothly, and whether its wording is accurate.

For step S102, in non-Chinese text the feature words of the target text are mostly single words. In Chinese text the feature words are mostly noun phrases, mostly 2 to 8 characters long, and words such as "of", "is" and "some" do not appear in feature words. For both Chinese and non-Chinese text, feature words can be extracted from the target text in the following ways.

(1) {(Adj|Noun)+ | [(Adj|Noun)* (NounPrep)?] (Adj|Noun)*} Noun; where NounPrep denotes a nominal phrase, Adj an adjective and Noun a noun; ? indicates zero or one occurrence, * zero or more occurrences, and + one or more occurrences.

Specific cases include, but are not limited to:

1) {(Adj|Noun)+ (Adj|Noun)*} Noun; that is, the part of speech of the first word is adjective or noun with one or more occurrences, the part of speech of the second word is adjective or noun with zero or more occurrences, and the part of speech of the third word is noun.

For example, for the text "new-generation cross-node technology cloud computing basic theory", segmentation and analysis give the following part of speech for each word: new-adj, generation-n, across-vn, node-n, technology-n, cloud computing-prep, basis-n, theory-n.

Analysis gives:

"across" (gerund), "node": this has the form noun (first noun) + noun (last noun), and the second noun can be regarded as occurring zero times;

likewise, "node", "technology": this has the form noun (first noun) + noun (last noun), corresponding to [(Adj|Noun)+] Noun;

"across", "node", "technology": this has the form noun + noun + noun. It can be read as the first noun occurring twice and the last noun once, or equally as the first noun, the second noun and the last noun each occurring once; the two readings have the same effect and are not distinguished here.

2) {[(Adj|Noun)* (NounPrep)?] (Adj|Noun)*} Noun, which has three parts: (Adj|Noun)* (NounPrep)?, meaning that the part of speech of the first word is adjective or noun with zero or more occurrences and the part of speech of the second word is nominal phrase with zero or one occurrence; (Adj|Noun)*, meaning that the part of speech of the third word is adjective or noun with zero or more occurrences; and Noun, meaning that the part of speech of the fourth word is noun.

Taking the same text, "new-generation cross-node technology cloud computing basic theory", as the example, analysis gives:

"across", "node", "technology", "cloud computing", "basis", "theory": this has the form noun + noun + noun + nominal phrase + noun + noun; the first (Adj|Noun)* in [(Adj|Noun)* (NounPrep)?] can be regarded as occurring three times and each remaining part as occurring once;

"across" (gerund), "node", "technology": this has the form noun + noun + noun; it can be regarded as the first (Adj|Noun)* in [(Adj|Noun)* (NounPrep)?] being a noun that occurs once with NounPrep absent, or as the final (Adj|Noun)* occurring twice with the other parts occurring zero times;

"cloud computing", "basis", "theory": this has the form nominal phrase + noun + noun, corresponding to the first (Adj|Noun)* in [(Adj|Noun)* (NounPrep)?] occurring zero times;

"technology": this has the form of a single noun, corresponding to {[(Adj|Noun)* (NounPrep)?] (Adj|Noun)*} before the final noun occurring zero times.
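Mode (1) can be checked mechanically by encoding each part-of-speech tag as one letter and applying an ordinary regular expression to every contiguous span of words. The sketch below is illustrative only: the letter codes, the treatment of the gerund tag vn as a noun, and the function name `match_spans` are assumptions rather than part of the described method.

```python
import re

# Hypothetical one-letter codes for the tags used in the example above.
TAG_CODE = {"adj": "A", "n": "N", "vn": "N", "prep": "P", "v": "V"}

# Mode (1): {(Adj|Noun)+ | [(Adj|Noun)* (NounPrep)?] (Adj|Noun)*} Noun
MODE_1 = re.compile(r"(?:[AN]+|[AN]*P?[AN]*)N")


def match_spans(tagged, pattern):
    """Return every contiguous word span whose tag string fits the pattern."""
    codes = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    spans = []
    for start in range(len(tagged)):
        for end in range(start + 1, len(tagged) + 1):
            if pattern.fullmatch(codes, start, end):
                spans.append(tuple(word for word, _ in tagged[start:end]))
    return spans


example = [
    ("new", "adj"), ("generation", "n"), ("across", "vn"), ("node", "n"),
    ("technology", "n"), ("cloud computing", "prep"), ("basis", "n"),
    ("theory", "n"),
]
spans = match_spans(example, MODE_1)
print(("node", "technology") in spans)
print(("cloud computing", "basis", "theory") in spans)
```

Enumerating all spans mirrors the analysis above, in which overlapping combinations such as "across, node" and "node, technology" are both accepted; an implementation that wants only the longest phrase could keep the maximal spans instead.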

In addition, for non-Chinese text, in "The phone shell is good", "phone" (noun) and "shell" (noun) are extracted as feature words.

(2) [(Adj|Noun|v)+] [(v|Noun)?] (Noun|v); where v denotes a verb. The pattern has three parts: (Adj|Noun|v)+, (v|Noun)? and (Noun|v). That is, when, among three adjacent words, the part of speech of the first word is adjective, noun or verb with at least one occurrence, the part of speech of the second word is verb or noun with zero or one occurrence, and the part of speech of the third word is noun or verb, the three adjacent words are determined to be feature words of the target text.

For example, in the Chinese sentence "I work hard to learn cycling", "learn cycling" is a feature word; in non-Chinese text, "Good papermaking technology" is a feature word.
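Mode (2) can be sketched in the same spirit, this time returning the leftmost, longest match in each run of words rather than every span. The letter codes, the tags assigned to the example words (r for pronoun, d for adverb) and the function name are assumptions made for illustration.

```python
import re

# Hypothetical one-letter codes; any other tag is encoded as "O".
TAG_CODE = {"adj": "A", "n": "N", "vn": "N", "v": "V"}

# Mode (2): [(Adj|Noun|v)+] [(v|Noun)?] (Noun|v)
MODE_2 = re.compile(r"[ANV]+[VN]?[NV]")


def extract_mode_2(tagged):
    """Return the phrases whose tag sequence matches mode (2)."""
    codes = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in MODE_2.finditer(codes)
    ]


english = [("Good", "adj"), ("papermaking", "n"), ("technology", "n")]
chinese_gloss = [("I", "r"), ("hard", "d"), ("learn", "v"), ("cycling", "n")]
print(extract_mode_2(english))
print(extract_mode_2(chinese_gloss))
```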

(3) When the part of speech of a word is academic word, the academic word is determined to be a feature word of the target text. For example, for Chinese, the tag gm indicates a mathematics-related word and gp a physics-related word; for non-Chinese, "physics" denotes a physics-specific term. All of these can be used as feature words.

(4) When the part of speech of a word is sensitive word, the nearest adjacent word located after the sensitive word is determined to be a feature word of the target text. Texts often contain such words, which may be called sensitive words, for example, research, building, realize, explore, optimization, design, and break through. A feature word usually follows a sensitive word, so the region after a sensitive word may be called the sensitive region. For example, in "research nanotechnology", "research" is the sensitive word, so "nano" may be taken as the feature word, or "nanotechnology" may be taken as the feature word.
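Rule (4) amounts to looking one position to the right of each sensitive word. The following minimal sketch assumes the text is already segmented and uses the example words listed above as a hypothetical sensitive-word set; the function name is also an assumption.

```python
# Sensitive words taken from the examples above (an illustrative set).
SENSITIVE_WORDS = {
    "research", "building", "realize", "explore",
    "optimization", "design", "break through",
}


def words_after_sensitive(tokens, sensitive=SENSITIVE_WORDS):
    """Return the word that directly follows each sensitive word."""
    return [
        tokens[i + 1]
        for i, token in enumerate(tokens[:-1])
        if token in sensitive
    ]


print(words_after_sensitive(["research", "nanotechnology"]))
```

A sensitive word in the final position has no following word and therefore contributes nothing.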

Specifically, segmenting "network virtualization technology for cloud computing data centers" with a segmenter gives: facing-v, cloud computing-gc (computer term), data-n, center-n, of-uj (different segmenters tag "of" differently; it corresponds to the part of speech "u" used by most segmenters, where u is an auxiliary word), network-n, virtualization-vn, technology-n. In each pair, the first item is the word and the second item is its part of speech.

Here "data" and "center" are both nouns, and the strings formed by "data, center" and by "virtualization, technology" both satisfy mode (1); "cloud computing" is an academic word, satisfies mode (3), and can be determined to be a feature word.
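Mode (3) reduces to a lookup on the part-of-speech tag. The sketch below applies it to the tagged example above; it assumes that the academic tags form the set gm, gp and gc, which is inferred from the examples in this description and may differ between segmenters.

```python
# Assumed academic tags: gm (mathematics), gp (physics), gc (computer term).
ACADEMIC_TAGS = {"gm", "gp", "gc"}


def academic_words(tagged):
    """Return the words whose tag marks them as academic vocabulary."""
    return [word for word, tag in tagged if tag in ACADEMIC_TAGS]


tagged_example = [
    ("facing", "v"), ("cloud computing", "gc"), ("data", "n"),
    ("center", "n"), ("of", "uj"), ("network", "n"),
    ("virtualization", "vn"), ("technology", "n"),
]
print(academic_words(tagged_example))
```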

The method provided by the above embodiment offers a new approach to extracting the feature words of a target text. The more feature words are extracted, the better they reflect the features of the target text; discarding certain words, even words that contribute little to the text features, inevitably loses some text features, because those words are still part of the text. By analyzing the part-of-speech feature of each word in the text and extracting feature words from the target text according to a predetermined feature-word part-of-speech extraction rule, the method extracts feature words that better characterize the target text, reflects its multi-dimensional features, and remedies the scarcity of text feature words in the prior art.

Referring to Fig. 2, which shows the main flow chart of an optional method for extracting feature words provided by an embodiment of the present invention, the method includes the following steps:

S201: obtain a target text, analyze the syntactic structure of each sentence in the target text, and determine the sentence feature of each sentence, where the sentence feature includes determiners and a centre word.

S202: according to a preset feature-word sentence-feature extraction rule, extract the centre word and the nearest determiner located before and adjacent to the centre word, and determine that the combined word formed by the extracted centre word and determiner is a feature word of the target text.

In the above embodiment, for step S201, a given sentence in the target text can be segmented by a segmenter and the qualifying relationships between its words analyzed, specifically identifying the determiners and the centre word in the sentence, to obtain the syntactic structure of the sentence. Sentences may be delimited by full stops, or by commas or semicolons; the present invention places no restriction on this.

For step S202, a sentence may contain multiple determiners. To keep extraction simple while still capturing the features of the text, only the determiner nearest to the centre word may be selected and combined with the centre word to form a combined word, which serves as a feature word of the target text.

For example, in the sentence "the cross-node technology of the new-generation cloud computing server", "technology" is the centre word and "new generation", "cloud computing", "server", "across" and "node" are determiners. Analysis shows that "node" and "technology" are in a determiner + centre word relationship with no other word between them, so "node technology" is taken as the feature word of this sentence.
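The selection of the nearest determiner can be sketched on top of a syntactic parse. The example below assumes the parse is already available as a list of (dependent index, head index, relation) arcs; the arcs are written by hand for the sentence above, and the relation label "ATT" and the function name are hypothetical choices, not the output of any particular parser.

```python
def nearest_determiner_pairs(tokens, arcs, relation="ATT"):
    """Combine each centre word with the determiner directly before it.

    `arcs` holds (dependent_index, head_index, relation_label) triples.
    """
    pairs = []
    for dependent, head, label in arcs:
        # Keep only determiners that sit immediately before their centre word.
        if label == relation and dependent == head - 1:
            pairs.append(tokens[dependent] + " " + tokens[head])
    return pairs


tokens = ["new generation", "cloud computing", "server", "across", "node", "technology"]
# Hand-written parse: every earlier word modifies "technology" (index 5).
arcs = [(0, 5, "ATT"), (1, 5, "ATT"), (2, 5, "ATT"), (3, 5, "ATT"), (4, 5, "ATT")]
print(nearest_determiner_pairs(tokens, arcs))
```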

The method provided by the above embodiment offers another way of extracting the feature words of a target text: by analyzing the syntactic structure of each sentence, it extracts as feature words of the target text the combined words that are in a determiner + centre word relationship with no other word in between. This embodiment obtains feature words that better characterize the target text, with higher accuracy, and addresses the scarcity of extracted feature words in the prior art.

Referring to Fig. 3, which shows the main flow chart of another optional method for extracting feature words provided by an embodiment of the present invention, the method includes the following steps:

S301: obtain a target text.

S302: determine the part-of-speech feature of each word in the target text.

S302': analyze the syntactic structure of each sentence in the target text and determine the sentence feature of each sentence, where the sentence feature includes determiners and a centre word.

S303: determine the first feature words of the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.

S303': according to a preset feature-word sentence-feature extraction rule, extract the centre word and the nearest determiner located before and adjacent to the centre word, and determine that the combined word formed by the extracted centre word and determiner is a second feature word of the target text.

S304: perform a deduplication operation on the extracted first feature words and second feature words of the target text, and determine the feature words of the target text.

In the above embodiment, steps S302 and S303 are as described for steps S101 and S102 shown in Fig. 1, and steps S302' and S303' are as described for steps S201 and S202 shown in Fig. 2; details are not repeated here.

In the above embodiment, for step S301, a segmenter can perform sentence splitting and word segmentation on the target text at the same time, or one after the other; the present invention places no restriction on this.

It should be noted that, in the method provided by the above embodiment, part-of-speech feature extraction may be performed before sentence feature extraction, sentence feature extraction may be performed before part-of-speech feature extraction, or the two may be performed at the same time; the present invention places no restriction on this.

In addition, the first feature words extracted from the target text by part-of-speech feature and the second feature words extracted by sentence feature may partly coincide. For example, "node technology" satisfies extraction mode (1) of the method shown in Fig. 1 and also satisfies the determiner + centre word description of Fig. 2. A deduplication operation can therefore be performed to avoid showing one feature word multiple times, which also reduces the load on the terminal and the server.
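The deduplication of step S304 can be sketched in one line. This minimal illustration assumes the feature words are plain strings and that the order of first appearance should be kept; the function name is hypothetical.

```python
def merge_feature_words(first, second):
    """Merge two feature-word lists, dropping repeats and keeping order."""
    # dict preserves insertion order, so the first occurrence of a word wins.
    return list(dict.fromkeys([*first, *second]))


first_words = ["node technology", "cloud computing"]
second_words = ["node technology", "server technology"]
print(merge_feature_words(first_words, second_words))
```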

The method provided by the above embodiment offers yet another way of extracting the feature words of a target text. Compared with the approaches shown in Fig. 1 and Fig. 2, this embodiment extracts more feature words, with less repetition and higher accuracy, and solves the prior-art problem that too few feature words are extracted, leaving the features of the target text missing.

Referring to Fig. 4, which shows a schematic flow diagram of another optional method for extracting feature words according to an embodiment of the present invention, the method includes the following steps:

S401: obtain a target text and determine the part-of-speech feature of each word in the target text.

S402: extract feature words from the target text according to the part-of-speech feature of each word and a preset feature-word part-of-speech extraction rule.

S403: when the target text is a generality text, or the number of extracted feature words is less than a predetermined quantity threshold, obtain words similar to the extracted feature words from a preset feature dictionary.

S404: calculate a first similarity between each obtained word and the feature words, and, when the first similarity is greater than or equal to a first predetermined similarity threshold, determine that the obtained word is a feature word of the target text.

In the above embodiment, steps S401 and S402 are as described for steps S101 and S102 shown in Fig. 1; details are not repeated here.

In the above embodiment, for step S403, when the target text is a generality text, the methods shown in Fig. 1 to Fig. 3 can still be used to extract feature words, but fewer feature words may be extracted.

Therefore, when the target text is determined to be a generality text (for example, classical Chinese), or the number of extracted feature words is less than the predetermined quantity threshold, feature word expansion can be performed to increase the number of feature words.

Specifically, word2vec (word vectors) can be used to search the preset feature dictionary to obtain words similar to the feature words for expansion. The feature dictionary can be trained in advance, for example, on a large corpus; the present invention does not concern its training process.

For step S404, because the feature dictionary contains a large number of words, some of the words obtained with word2vec may have little relevance to the target text. The expanded words can therefore be screened, specifically according to their similarity to the feature words.

A similarity threshold (for example, 80%) can be preset; the similarity between each input feature word and each word returned by word2vec is calculated, and only the words whose similarity is greater than or equal to the similarity threshold are selected and determined to be feature words of the target text.

Alternatively, it is also possible to be screened to the word expanded, the word not inquired in target text is removed, is reduced The case where user is inquired in target text less than word,
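The expansion and screening described for steps S403 and S404 can be sketched as follows. This is an illustration under stated assumptions: the feature dictionary is represented as a plain mapping from word to vector, cosine similarity stands in for the word2vec similarity, and the names `cosine`, `expand_feature_words` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_feature_words(feature_words, dictionary_vectors,
                         threshold=0.8, target_text=None):
    """Return dictionary words whose best similarity to any feature word
    reaches the threshold; optionally keep only words found in the text."""
    known = [w for w in feature_words if w in dictionary_vectors]
    expanded = []
    for candidate, vector in dictionary_vectors.items():
        if candidate in feature_words:
            continue
        best = max((cosine(vector, dictionary_vectors[w]) for w in known),
                   default=0.0)
        if best < threshold:
            continue  # too weakly related to the extracted feature words
        if target_text is not None and candidate not in target_text:
            continue  # optional screen: the word must occur in the target text
        expanded.append(candidate)
    return expanded


# Toy two-dimensional vectors, invented for the example.
vectors = {
    "nanoparticle": [1.0, 0.0],
    "nanomaterial": [0.9, 0.1],
    "football": [0.0, 1.0],
}
print(expand_feature_words(["nanoparticle"], vectors))
```

With a trained word2vec model, the dictionary lookup would normally be delegated to the model's own nearest-neighbour query; the explicit loop is used here only to keep the sketch self-contained.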

The method provided by the above embodiment offers a way of expanding the number of feature words of the target text, effectively solving the prior-art problem that too few feature words are extracted; an expanded word is determined as a feature word of the target text only when its similarity to the feature words reaches the predetermined similarity threshold.

Referring to Fig. 5, a schematic flow chart of another optional method for extracting feature words according to an embodiment of the present invention is shown, which includes the following steps:

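S501: Obtain the target text and determine the part-of-speech feature of each word in the target text.

S502: Extract feature words from the target text according to the part-of-speech feature of each word and a preset feature word part-of-speech extraction rule.

S503: According to the formula

L = x * [C-value(a)] + y * SCP(w_1, ..., w_n)

calculate the noise value L of each feature word, and extract the feature words whose noise value is greater than or equal to a predetermined noise threshold as the feature words of the target text.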

In the above embodiment, for steps S501 and S502, refer to the description of steps S101 and S102 shown in Fig. 1; details are not repeated here.

The feature words extracted by the approaches of Fig. 1 to Fig. 3 may still contain a good deal of noise. To improve the accuracy of the extracted feature words, the obtained feature words can be screened in some way; specifically, the following approaches can be used:

(1) Term filtering

Term filtering can be done in several ways, for example with the C-value method, specifically:

Here, the first case corresponds to a word string a whose only parent string is itself, and the second case corresponds to all other situations; |a| is the length of the word string, f(a) is its word frequency, and T_a denotes the set of word strings that contain a.
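The two cases described above match the commonly used form of the C-value measure, which the following sketch assumes: log2|a| * f(a) when a is not contained in any longer candidate, and log2|a| * (f(a) - mean of f(b) over b in T_a) otherwise. Whether this is exactly the formula intended here is an assumption; the function name `c_value`, the choice of counting |a| in words, and the sample frequencies are hypothetical.

```python
import math


def c_value(term, freq):
    """C-value of `term` (a tuple of words) given candidate frequencies.

    `freq` maps every candidate word string (tuple of words) to its frequency.
    """
    def contains(longer, shorter):
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
        )

    # T_a: candidate strings that strictly contain `term`.
    parents = [t for t in freq if contains(t, term)]
    length_factor = math.log2(len(term))  # |a| counted in words (assumption)
    if not parents:
        return length_factor * freq[term]
    nested_mean = sum(freq[p] for p in parents) / len(parents)
    return length_factor * (freq[term] - nested_mean)


freq = {
    ("node", "technology"): 5,
    ("new", "node", "technology"): 2,
}
print(c_value(("node", "technology"), freq))
print(c_value(("new", "node", "technology"), freq))
```

Note that with this form a single-word string always scores zero, because log2(1) = 0; the measure is therefore mainly meaningful for multi-word candidates.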

(2) Unit filtering

Unit filtering is performed by judging how tightly the words within a word string are bound to one another, which can be measured by the SCP value (Symmetrical Conditional Probability). The improved form of SCP, MSCP (Macro Symmetrical Conditional Probability), is given by the following formula:

Here, w_1, ..., w_n denotes the candidate feature word string, w_i denotes a word making up the candidate feature word, and F(w_1, ..., w_n) denotes the weighted word frequency of the candidate feature word. The weighted word frequency of a word string a is calculated as follows:

Here, S_a denotes the set of region categories in which the word string occurs, and b denotes one of the designated regions, for example the sensitive area. f_b(a) denotes the frequency with which word string a occurs in region b, and weight(b) denotes the weight assigned to a candidate feature word in that text region. For example, a feature word appearing in a title can be given a weight of 10, one in the sensitive area a weight of 6, and one elsewhere a weight of 3. These values are empirical and serve only to distinguish the different regions.
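From the symbol definitions above, the weighted word frequency can be read as the sum, over the regions b in S_a, of f_b(a) * weight(b). The sketch below assumes that reading; the function name `weighted_frequency`, the region keys and the sample counts are hypothetical, while the weights 10, 6 and 3 are the example values given in the text.

```python
# Example weights from the text: title 10, sensitive area 6, other 3.
REGION_WEIGHTS = {"title": 10, "sensitive": 6, "other": 3}


def weighted_frequency(region_counts, weights=REGION_WEIGHTS):
    """Sum f_b(a) * weight(b) over the regions in which the word string occurs.

    `region_counts` maps a region name to the number of occurrences of the
    word string in that region.
    """
    return sum(count * weights[region] for region, count in region_counts.items())


# One occurrence in the title, two in the sensitive area, four elsewhere.
print(weighted_frequency({"title": 1, "sensitive": 2, "other": 4}))
```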

When the SCP value is calculated, some feature words contain words such as "technology" or "theory", which makes the calculated SCP value low. For example, in "information technology", because "technology" occurs many times in the text, the calculated Avp is large, so the SCP value becomes very small. In view of this, when the SCP value of such a feature word is calculated for a document, generic words of this kind (for example, theory, application, technology, method, status, experiment) can be left out of the calculation, without actually being removed from the feature word.
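The effect described here can be seen with the standard SCP definition, which the sketch below assumes: the squared frequency of the whole string divided by Avp, the average over all split points of the product of the frequencies of the left and right parts. The MSCP variant and the removal of generic words are not implemented; the function name `scp` and all counts are hypothetical.

```python
def scp(words, freq):
    """Standard SCP of a word string, computed from frequencies.

    `freq` maps word tuples (the whole string and every left/right part)
    to their frequencies. Using counts or probabilities gives the same
    ratio as long as all values share one normalisation.
    """
    words = tuple(words)
    n = len(words)
    if n < 2:
        raise ValueError("SCP needs at least two words")
    # Avp: average product of the two parts over the n - 1 split points.
    avp = sum(freq[words[:i]] * freq[words[i:]] for i in range(1, n)) / (n - 1)
    return freq[words] ** 2 / avp


freq = {
    ("information", "technology"): 4,
    ("information",): 5,
    ("technology",): 40,   # very frequent generic word inflates Avp
    ("tumour", "cell"): 4,
    ("tumour",): 5,
    ("cell",): 5,
}
print(scp(("information", "technology"), freq))
print(scp(("tumour", "cell"), freq))
```

Both strings occur four times, but the string containing the very frequent word "technology" receives a much smaller SCP value, which is the situation the text proposes to correct by leaving such generic words out of the calculation.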

The two filtering approaches above can be used separately or in combination. When they are used in combination, each is given a weight, and the filtering effect on the feature words is better than when either is used alone.

In the above embodiment, for step S503, when the two filtering approaches above are used in combination, they can be combined to calculate the noise value used to filter the feature words, specifically:

L = x * [C-value(a)] + y * SCP(w_1, ..., w_n)

where L is the noise value, and x and y can be preset, for example to 0.7 and 0.3 respectively.

For feature words, especially compound words made up of two words, for example word X and word Y, the higher the influence of word X on word Y, or of word Y on word X, the larger the noise value of the compound word they form. Therefore, to denoise the feature words, only the feature words whose noise value is greater than or equal to a predetermined noise threshold are selected and determined as the text features of the target text. The noise threshold can be determined using the Otsu algorithm.
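A minimal sketch of the combined score and the threshold selection, assuming the C-value and SCP of each feature word have already been computed. The Otsu step is applied directly to the list of noise values rather than to an image histogram, choosing the split that maximises between-class variance. The function names and sample scores are hypothetical; x = 0.7 and y = 0.3 are the example weights from the text.

```python
def noise_value(c_value, scp, x=0.7, y=0.3):
    """L = x * C-value(a) + y * SCP(w_1, ..., w_n)."""
    return x * c_value + y * scp


def otsu_threshold(values):
    """Pick the threshold t (one of the values) that maximises the
    between-class variance of the split {v < t} / {v >= t}."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to threshold")
    total = len(ordered)
    best_t, best_var = ordered[0], -1.0
    for t in sorted(set(ordered))[1:]:
        low = [v for v in ordered if v < t]
        high = [v for v in ordered if v >= t]
        w_low, w_high = len(low) / total, len(high) / total
        mean_low, mean_high = sum(low) / len(low), sum(high) / len(high)
        variance = w_low * w_high * (mean_low - mean_high) ** 2
        if variance > best_var:
            best_t, best_var = t, variance
    return best_t


def filter_by_noise(scores):
    """Keep feature words whose noise value reaches the Otsu threshold."""
    threshold = otsu_threshold(list(scores.values()))
    return [word for word, score in scores.items() if score >= threshold]


scores = {"node technology": 2.4, "tumour cell": 2.2, "of the": 0.3, "it is": 0.2}
print(filter_by_noise(scores))
```

If all noise values are identical, the sketch keeps every word, since no split improves on the trivial one.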

The method provided by the above embodiment offers a way of filtering the feature words of the target text, preferring feature words with higher noise values as the text features of the target text, so that the retained feature words are more accurate and the accuracy of the text features is improved.

Referring to Fig. 6, a schematic flow chart of another optional method for extracting feature words according to an embodiment of the present invention is shown, which includes the following steps:

S601: Obtain the target text and determine the part-of-speech feature of each word in the target text.

S602: Determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature word part-of-speech extraction rule.

S603: Receive a to-be-tested target text and obtain the feature words of the to-be-tested target text.

S604: Calculate a second similarity between the feature words of the target text and the feature words of the to-be-tested target text; when the second similarity is greater than or equal to a second predetermined similarity threshold, determine that the target text is similar to the to-be-tested target text.

In the above embodiment, for steps S601 and S602, refer to the description of steps S101 and S102 shown in Fig. 1; details are not repeated here.

In the above embodiment, for step S603, the network is flooded with texts. When text A is recommended to a user, a similar text B can be recommended at the same time, which reduces the omission of useful text information and improves the user's working efficiency.

Before recommending, however, it must be determined whether text A is similar to text B. The feature words of each text can be obtained at this point, and whether the two texts match is judged from the similarity between their feature words.

For step S604, a text can be divided into the following regions: title, sensitive area, and other; the title includes first-level titles, second-level titles, and so on.

In the matching process, an algorithm is used to count how many feature words of text A and text B are identical; specifically, a DFA (Deterministic Finite Automaton) algorithm can be used for this. The resulting count is divided by the number of feature words of text B, and this ratio is compared with the threshold to judge the similarity between the two texts. When the similarity between the two texts is greater than or equal to the second similarity threshold, the two texts are determined to be similar. Similar texts can be recommended together.
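The ratio described above can be sketched as follows. A set intersection is used here in place of the DFA matching named in the text, purely to keep the example short; the function names, the threshold of 0.5 and the sample words are hypothetical.

```python
def feature_word_similarity(words_a, words_b):
    """Number of feature words shared by A and B, divided by B's feature word count."""
    set_b = set(words_b)
    if not set_b:
        return 0.0
    return len(set(words_a) & set_b) / len(set_b)


def is_similar(words_a, words_b, threshold):
    """Texts are similar when the ratio reaches the second similarity threshold."""
    return feature_word_similarity(words_a, words_b) >= threshold


text_a = ["nanotechnology", "nanoparticle", "tumour cell"]
text_b = ["nanotechnology", "tumour cell", "drug delivery", "imaging"]
print(feature_word_similarity(text_a, text_b))
print(is_similar(text_a, text_b, threshold=0.5))
```

Because the count is divided by the number of feature words in text B only, the ratio is not symmetric: swapping A and B can give a different value.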

Take a specific text and a generality text as an example. The feature words of the specific text are divided into three levels: A1 title, A2 sensitive area, A3 other, and the weights assigned to them can be preset, for example X=10, Y=6, Z=4. The generality text has fewer feature words, so its title and sensitive area can be merged as B1, with the rest as B2, and the feature words extracted from the generality text are expanded. Matching by level then yields A1B1, A1B2, and so on. Because the feature words of the generality text are also handled by level, weights can be preset for them as well, for example α=0.6 and β=0.4, giving αA1B1 + βA1B2. Then, following the levels of the specific text's feature words, three parts are obtained: (αA1B1 + βA1B2)*X, (αA2B1 + βA2B2)*Y and (αA3B1 + βA3B2)*Z. Finally the three parts are summed, and the sum is taken as the similarity between the two texts.

For example, the specific text is a researcher's information, with title feature word: nanotechnology (A1); sensitive-area feature word: nanoparticle research (A2); and other-position feature word: tumour cell (A3);

The generality text is a nanotechnology guide, with title feature word: nanotechnology (B1); sensitive-area feature word: nanoparticle research (B1); and other-position feature word: tumour cell (B2);

The similarity matching degree of the two is calculated as:

Sum = [0.6*A1B1 + 0.4*A1B2]*10 + [0.6*A2B1 + 0.4*A2B2]*6 + [0.6*A3B1 + 0.4*A3B2]*4.
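The sum above can be worked through in code. The sketch assumes that each term AiBj is the number of identical feature words between level Ai of the specific text and level Bj of the generality text; the text does not state how AiBj is valued, so this is an assumption. The function name `region_similarity` and the dictionary layout are hypothetical; the weights are the example values from the text.

```python
def region_similarity(specific, general, specific_weights, general_weights):
    """Weighted sum over levels: sum_i W_i * sum_j w_j * |A_i ∩ B_j|."""
    total = 0.0
    for level, words in specific.items():
        part = sum(
            general_weights[g_level] * len(set(words) & set(g_words))
            for g_level, g_words in general.items()
        )
        total += specific_weights[level] * part
    return total


specific = {
    "A1": ["nanotechnology"],          # title
    "A2": ["nanoparticle research"],   # sensitive area
    "A3": ["tumour cell"],             # other
}
general = {
    "B1": ["nanotechnology", "nanoparticle research"],  # title + sensitive area
    "B2": ["tumour cell"],                               # other
}
weights_a = {"A1": 10, "A2": 6, "A3": 4}
weights_b = {"B1": 0.6, "B2": 0.4}

print(region_similarity(specific, general, weights_a, weights_b))
```

Under this assumption the three parts are 0.6*10, 0.6*6 and 0.4*4, so the sum is approximately 11.2.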

It should be noted that the method provided by the above embodiment is equally applicable when both texts are specific texts or both are generality texts; the present invention uses the similarity calculation between a specific text and a generality text only as an illustration.

Further, the method provided by the above embodiment is equally applicable to recommending a target text to a user. For user attributes, feature words that characterise the user can likewise be extracted; these feature words can be obtained from information filled in by the user personally. The feature words of the user attributes can be treated as a specific text.

The main purpose of text recommendation is to recommend to a user, according to the user's existing attributes, texts that have a certain relevance to the user, thereby achieving personalised intelligent recommendation. When a text is recommended to a user, texts similar to it can also be recommended to the user, which reduces the risk of missing information and makes it easier for the user to obtain relevant information.

The method provided by the above embodiment offers a way of determining whether two texts match in similarity: the feature words are divided by region and the similarity between the feature words of the two texts is calculated, which improves the accuracy of the similarity calculated between texts.

The method provided by the embodiments of the present invention offers a new approach to extracting the feature words of a target text. Through a multi-strategy approach based on part-of-speech features and/or sentence features, the feature words in the target text are extracted and filtered, which preserves the multi-dimensional features of the target text and improves the accuracy of the text features. It also enables the feature word similarity between texts, and between a text and a user, to be calculated during recommendation, improving both the recommendation range and the recommendation accuracy.

Referring to Fig. 7, a schematic diagram of the main modules of a device 700 for extracting feature words provided by an embodiment of the present invention is shown, including:

A determining module 701, configured to obtain the target text and determine the part-of-speech feature of each word in the target text;

An extraction module 702, configured to determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature word part-of-speech extraction rule.

In the device of this embodiment of the invention, the extraction module 702 is configured to:

When, in a first group of three adjacent words, the part of speech of the first word is adjective or noun and it occurs one or more times, the part of speech of the second word is adjective or noun and it occurs zero or more times, and the part of speech of the third word is noun, determine that the first group of three adjacent words is a feature word of the target text; or

When a word is a sensitive word, determine that the word located after the sensitive word and nearest to it is a feature word of the target text.
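The two rules handled by the extraction module can be sketched with a regular expression over part-of-speech tags. The sketch assumes the text has already been segmented and tagged, with tags given as the strings "adj", "noun" and so on; the tag names, the function names and the sensitive word used in the example are hypothetical.

```python
import re

# One-letter codes so that a tag sequence becomes a string a regex can scan.
TAG_CODES = {"adj": "a", "noun": "n"}


def match_adj_noun_rule(tagged_words):
    """Find runs matching (adj|noun)+ (adj|noun)* noun, i.e. [an]+n."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_words)
    phrases = []
    for match in re.finditer(r"[an]+n", codes):
        words = [word for word, _ in tagged_words[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


def words_after_sensitive(tagged_words, sensitive_words):
    """Return the word immediately following each sensitive word."""
    return [
        tagged_words[i + 1][0]
        for i in range(len(tagged_words) - 1)
        if tagged_words[i][0] in sensitive_words
    ]


sentence = [
    ("new", "adj"), ("node", "noun"), ("technology", "noun"),
    ("improves", "verb"), ("speed", "noun"),
]
print(match_adj_noun_rule(sentence))
print(words_after_sensitive(sentence, {"improves"}))
```

Because the second slot of the rule may be empty, the pattern reduces to one or more adjectives or nouns followed by a noun, so a lone noun such as "speed" is not matched by the first rule.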

The device of this embodiment of the invention further includes a sentence module 703, configured to: analyse the grammatical structure of each sentence in the target text and determine the sentence feature of each sentence, where the sentence feature includes determiners and centre words; and, according to a preset feature word sentence feature extraction rule, extract the centre word and the determiner that is located before the centre word and nearest to it, and determine that the compound word formed by the extracted centre word and determiner is a feature word of the target text.

The device of this embodiment of the invention further includes an expansion module 704, configured to: when the target text is a generality text, or the number of extracted feature words is less than a predetermined quantity threshold, obtain words similar to the extracted feature words from a preset feature dictionary; and calculate a first similarity between each obtained word and the feature words, and when the first similarity is greater than or equal to a first predetermined similarity threshold, determine that the obtained word is a feature word of the target text.

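The device of this embodiment of the invention further includes a filtering module 705, configured to, according to the formula

L = x * [C-value(a)] + y * SCP(w_1, ..., w_n)

calculate the noise value L of each feature word, and extract the feature words whose noise value is greater than or equal to a predetermined noise threshold as the feature words of the target text.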

The device of this embodiment of the invention further includes a similarity module 706, configured to: receive a to-be-tested target text and obtain the feature words of the to-be-tested target text; and calculate a second similarity between the feature words of the target text and the feature words of the to-be-tested target text, and when the second similarity is greater than or equal to a second predetermined similarity threshold, determine that the target text is similar to the to-be-tested target text.

The device provided by the embodiments of the present invention is a device for extracting the feature words of a target text. Through a multi-strategy approach based on part-of-speech features and/or sentence features, it extracts and filters the feature words in the target text, which preserves the multi-dimensional features of the target text and improves the accuracy of the text features. It also enables the feature word similarity between texts, and between a text and a user, to be calculated during recommendation, improving both the recommendation range and the recommendation accuracy.

In addition, the specific implementation of the feature word extraction device described in the embodiments of the present invention has already been described in detail in the feature word extraction method above, so the repeated content is not described again here.

Referring to Fig. 8, an exemplary system architecture 800 to which the feature word extraction method or the feature word extraction device of the embodiments of the present invention can be applied is shown.

As shown in Fig. 8, the system architecture 800 may include terminal devices 801, 802 and 803, a network 804 and a server 805. The network 804 is the medium that provides communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired or wireless communication links, or fibre-optic cables.

A user can use the terminal devices 801, 802, 803 to interact with the server 805 through the network 804, in order to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 801, 802, 803, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (examples only).

The terminal devices 801, 802, 803 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.

The server 805 can be a server that provides various services, for example a back-end management server that supports a shopping website browsed by users with the terminal devices 801, 802, 803 (example only). The back-end management server can analyse and otherwise process received data such as information query requests, and feed the processing result (for example target push information or product information, examples only) back to the terminal device.

It should be noted that the feature word extraction method provided by the embodiments of the present invention is generally executed by the server 805; accordingly, the feature word extraction device is generally arranged in the server 805.

It should be understood that the numbers of terminal devices, networks and servers in Fig. 8 are merely illustrative. There can be any number of terminal devices, networks and servers, as the implementation requires.

Referring to Fig. 9, a schematic structural diagram of a computer system 900 suitable for implementing a terminal device of the embodiments of the present invention is shown. The terminal device shown in Fig. 9 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present invention.

As shown in Fig. 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores the various programs and data required for the operation of the system 900. The CPU 901, the ROM 902 and the RAM 903 are connected to one another by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse and the like; an output section 907 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a loudspeaker and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.

In particular, according to the embodiments disclosed by the present invention, the process described above with reference to the flow chart can be implemented as a computer software program. For example, the embodiments disclosed by the present invention include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the system of the present invention are performed.

It should be noted that the computer-readable medium of the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of a computer-readable storage medium include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program, where the program can be used by or in combination with an instruction execution system, apparatus or device. In the present invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, electric wire, optical cable, RF and the like, or any suitable combination of the above.

The flow charts and block diagrams in the drawings illustrate the architecture, functions and operations that may be implemented by systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flow chart or block diagram can represent a module, a program segment or a part of code, and the module, program segment or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flow chart, and any combination of boxes in a block diagram or flow chart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present invention can be implemented in software or in hardware. The described modules can also be arranged in a processor; for example, this can be described as: a processor including a determining module and an extraction module. The names of these modules do not, in certain circumstances, limit the modules themselves; for example, the extraction module can also be described as a "feature word extraction module".

In another aspect, the present invention also provides a computer-readable medium. The computer-readable medium can be included in the device described in the above embodiments, or it can exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to:

Obtain the target text and determine the part-of-speech feature of each word in the target text;

Determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature word part-of-speech extraction rule.

The technical solution according to the embodiments of the present invention offers a new approach to extracting the feature words of a target text. Through a multi-strategy approach based on part-of-speech features and/or sentence features, the feature words in the target text are extracted and filtered, which preserves the multi-dimensional features of the target text and improves the accuracy of the text features. It also enables the feature word similarity between texts, and between a text and a user, to be calculated during recommendation, improving both the recommendation range and the recommendation accuracy.

The specific embodiments above do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions can occur depending on design requirements and other factors. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims

1. A method for extracting feature words, characterised by comprising:

Obtaining a target text, and determining the part-of-speech feature of each word in the target text;

Determining the feature words of the target text according to the part-of-speech feature of each word and a preset feature word part-of-speech extraction rule.

2. The method according to claim 1, characterised in that determining the feature words of the target text according to the part-of-speech feature of each word and the preset feature word part-of-speech extraction rule comprises:

When, in four adjacent words, the part of speech of the first word is adjective or noun and it occurs zero or more times, the part of speech of the second word is nominal phrase and it occurs zero times or once, the part of speech of the third word is adjective or noun and it occurs zero or more times, and the part of speech of the fourth word is noun, determining that the four adjacent words are a feature word of the target text; or

When, in a first group of three adjacent words, the part of speech of the first word is adjective or noun and it occurs one or more times, the part of speech of the second word is adjective or noun and it occurs zero or more times, and the part of speech of the third word is noun, determining that the first group of three adjacent words is a feature word of the target text; or

When, in a second group of three adjacent words, the part of speech of the first word is adjective, noun or verb, the part of speech of the second word is verb or noun and it occurs at most once, and the part of speech of the third word is noun or verb, determining that the second group of three adjacent words is a feature word of the target text; or

When a word is an academic word, determining that the academic word is a feature word of the target text; or

When a word is a sensitive word, determining that the word located after the sensitive word and nearest to it is a feature word of the target text.

3. The method according to claim 1, characterised by further comprising:

Analysing the grammatical structure of each sentence in the target text and determining the sentence feature of each sentence, wherein the sentence feature includes determiners and centre words;

According to a preset feature word sentence feature extraction rule, extracting the centre word and the determiner that is located before the centre word and nearest to it, and determining that the compound word formed by the extracted centre word and determiner is a feature word of the target text.

4. The method according to claim 1 or 3, characterised by further comprising:

When the target text is a generality text, or the number of extracted feature words is less than a predetermined quantity threshold, obtaining words similar to the extracted feature words from a preset feature dictionary;

Calculating a first similarity between each obtained word and the feature words, and when the first similarity is greater than or equal to a first predetermined similarity threshold, determining that the obtained word is a feature word of the target text.

5. The method according to claim 1 or 3, characterised by further comprising:

According to the formula

L = x * [C-value(a)] + y * SCP(w_1, ..., w_n)

calculating the noise value L of each feature word, and extracting the feature words whose noise value is greater than or equal to a predetermined noise threshold as the feature words of the target text; wherein C-value(a) is the noise value obtained by term filtering of the feature words, SCP(w_1, ..., w_n) is the noise value obtained by unit filtering of the feature words, w_1, ..., w_n is the feature word string, x and y are the weights of C-value(a) and SCP(w_1, ..., w_n) respectively, and a is a word string.

6. The method according to claim 1, characterised in that, after determining the feature words of the target text, the method further comprises:

Receiving a to-be-tested target text and obtaining the feature words of the to-be-tested target text;

Calculating a second similarity between the feature words of the target text and the feature words of the to-be-tested target text, and when the second similarity is greater than or equal to a second predetermined similarity threshold, determining that the target text is similar to the to-be-tested target text.

7. A device for extracting feature words, characterised by comprising:

A determining module, configured to obtain a target text and determine the part-of-speech feature of each word in the target text;

An extraction module, configured to determine the feature words of the target text according to the part-of-speech feature of each word and a preset feature word part-of-speech extraction rule.

8. The device according to claim 7, characterised in that the extraction module is configured to:

When, in a first group of three adjacent words, the part of speech of the first word is adjective or noun and it occurs one or more times, the part of speech of the second word is adjective or noun and it occurs zero or more times, and the part of speech of the third word is noun, determine that the first group of three adjacent words is a feature word of the target text; or

9. The device according to claim 7, characterised in that the device further comprises a sentence module, configured to:

10. The device according to claim 7 or 9, characterised in that the device further comprises an expansion module, configured to:

11. The device according to claim 7 or 9, characterised in that the device further comprises a filtering module, configured to:

According to the formula

L = x * [C-value(a)] + y * SCP(w_1, ..., w_n)

12. The device according to claim 7, characterised in that the device further comprises a similarity module, configured to:

13. An electronic device, characterised by comprising:

One or more processors;

A storage device for storing one or more programs,

Wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 6.

14. A computer-readable medium on which a computer program is stored, characterised in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.