CN110321561A - Keyword extraction method and device

Info

Publication number: CN110321561A
Application number: CN201910570592.3A
Authority: CN
Inventors: 王兴光; 许阳寅; 牛成
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-06-27
Filing date: 2019-06-27
Publication date: 2019-10-11

Abstract

Embodiments of the present application disclose a keyword extraction method and device. The method includes: obtaining a text set; performing keyword extraction on the text content of each text unit to obtain the candidate keywords corresponding to each text unit; obtaining unit frequency information of the candidate keywords for the text unit according to a text unit metric parameter that measures the importance of the candidate keywords to the text unit; selecting, according to the unit frequency information, the unit keyword corresponding to each text unit from the candidate keywords of that text unit; obtaining text frequency information of the unit keywords for the text set according to a text set metric parameter that measures the importance of the unit keywords to the text set; and extracting keywords from the multiple unit keywords of the text set according to the text frequency information. The scheme can improve the accuracy of keyword extraction from a text set.

Description

Keyword extraction method and device

Technical field

The present application relates to the field of computer technology, and in particular to a keyword extraction method and device.

Background

A keyword is a word that reflects the theme or the main content of a text. For example, when searching for a book text, a user can learn the theme of the book text, or the content it mainly describes, from the keywords corresponding to the book text, and then judge whether the book text is the one the user needs. Accurately extracted keywords therefore make it more efficient for users to obtain the information they are looking for. Existing methods for extracting keywords from book texts, however, are not accurate enough.

Summary of the invention

Embodiments of the present application provide a keyword extraction method and device that extract the keywords of a text set according to the unit frequency information of keywords for text units and the text frequency information of keywords for the text set, thereby improving the accuracy of keyword extraction from the text set.

An embodiment of the present application provides a keyword extraction method, comprising:

obtaining a text set, the text set including multiple text units arranged in a certain order;

performing keyword extraction on the text content of each text unit to obtain the candidate keywords corresponding to each text unit;

obtaining unit frequency information of the candidate keywords for the text unit according to a text unit metric parameter that measures the importance of the candidate keywords to the text unit;

selecting, according to the unit frequency information, the unit keyword corresponding to each text unit from the candidate keywords corresponding to that text unit;

obtaining text frequency information of the unit keywords for the text set according to a text set metric parameter that measures the importance of the unit keywords to the text set;

extracting keywords from the multiple unit keywords of the text set according to the text frequency information.
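The two-stage flow of the steps above can be sketched as follows. This is a minimal illustration only: the function names, the relative-frequency unit score and the sum used for the text-set score are assumptions standing in for the metric parameters described later, not the scoring claimed by the application.

```python
from collections import Counter

def unit_scores(unit_tokens):
    """Hypothetical unit-level score: relative frequency of each candidate in one unit."""
    counts = Counter(unit_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def extract_keywords(text_set, per_unit=2, top_k=3):
    # Stage 1: score the candidates of each unit and keep the best as unit keywords.
    unit_keywords = []
    for unit_tokens in text_set:
        scores = unit_scores(unit_tokens)
        best = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:per_unit]
        unit_keywords.append(dict(best))
    # Stage 2: re-score the unit keywords over the whole text set.
    # Summing the unit-level scores is an assumed stand-in for the text frequency information.
    totals = Counter()
    for keywords in unit_keywords:
        for word, score in keywords.items():
            totals[word] += score
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

# A toy "book" of three chapters, each already reduced to candidate keywords.
book = [
    ["dragon", "sword", "dragon", "inn"],
    ["dragon", "mountain", "mountain", "sword"],
    ["sword", "dragon", "sword", "river"],
]
print(extract_keywords(book, per_unit=2, top_k=2))
```

The point of the sketch is the ordering: candidates are filtered per text unit first, and only the surviving unit keywords compete at the level of the text set.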

Correspondingly, an embodiment of the present application also provides a keyword extraction device, comprising:

a text set acquisition module, configured to obtain a text set, the text set including multiple text units arranged in a certain order;

a first extraction module, configured to perform keyword extraction on the text content of each text unit to obtain the candidate keywords corresponding to each text unit;

a first information acquisition module, configured to obtain unit frequency information of the candidate keywords for the text unit according to a text unit metric parameter that measures the importance of the candidate keywords to the text unit;

a selection module, configured to select, according to the unit frequency information, the unit keyword corresponding to each text unit from the candidate keywords corresponding to that text unit;

a second information acquisition module, configured to obtain text frequency information of the unit keywords for the text set according to a text set metric parameter that measures the importance of the unit keywords to the text set;

a second extraction module, configured to extract keywords from the multiple unit keywords of the text set according to the text frequency information.

Optionally, in some embodiments, the first information acquisition module may include a unit topic probability acquisition submodule, a relevance information acquisition submodule, a unit frequency sub-information acquisition submodule, a word length information acquisition submodule and a unit frequency information acquisition submodule, as follows:

The unit topic probability acquisition submodule may be configured to obtain the keyword topic probability that each candidate keyword corresponds to a preset topic, and the unit topic probability that each text unit corresponds to a preset topic;

The relevance information acquisition submodule may be configured to obtain topic relevance information between a candidate keyword and a text unit according to the keyword topic probability of the candidate keyword and the unit topic probability of the text unit in which the candidate keyword appears;

The unit frequency sub-information acquisition submodule may be configured to obtain unit frequency sub-information of the candidate keyword for the text unit according to how often the candidate keyword appears in the corresponding text unit;

The word length information acquisition submodule may be configured to obtain the word length information of the candidate keyword based on the word length of the candidate keyword;

The unit frequency information acquisition submodule may be configured to fuse the topic relevance information, the unit frequency sub-information and the word length information to obtain the unit frequency information of the candidate keyword for the text unit.
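The fusion performed by these submodules could look like the sketch below. The passage does not say how the three components are computed or fused, so everything here is an assumption: the inner product as topic relevance, the capped length as word length information, and the plain product as the fusion are illustrative choices, and all function names are hypothetical.

```python
def topic_relevance(keyword_topic_probs, unit_topic_probs):
    # Assumed relevance: inner product of the two topic probability vectors.
    return sum(p * q for p, q in zip(keyword_topic_probs, unit_topic_probs))

def length_info(word, cap=4):
    # Assumed word length information: length capped at `cap`, scaled into (0, 1].
    return min(len(word), cap) / cap

def unit_frequency_info(relevance, freq_sub_info, length):
    # Assumed fusion: plain product of the three components.
    return relevance * freq_sub_info * length

rel = topic_relevance([0.75, 0.25], [0.5, 0.5])
score = unit_frequency_info(rel, freq_sub_info=0.5, length=length_info("sword"))
print(rel, score)
```

A product makes a candidate score low as soon as any one component is low; a weighted sum would be an equally plausible fusion when the components should be able to compensate for one another.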

The unit frequency sub-information acquisition submodule may be configured to: obtain unit word share sub-information, namely the share that the occurrences of the candidate keyword in the text unit make up of all candidate keywords in the text unit; select, from the multiple text units of the text set, the candidate target text units that include the candidate keyword; obtain text unit share sub-information, namely the share that the number of candidate target text units makes up of all text units in the text set; and obtain the unit frequency sub-information of the candidate keyword for the text unit based on the unit word share sub-information and the text unit share sub-information.
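The two shares described above resemble term frequency and document frequency, so one way to picture the computation is a TF-IDF-style sketch. The way the two shares are combined (share in the unit multiplied by a log-damped inverse of the unit share) is an assumption, as are the function names; the passage only states that both shares are used.

```python
import math
from collections import Counter

def unit_word_share(candidate, unit_candidates):
    # Share of this candidate among all candidate keyword occurrences in the unit.
    return Counter(unit_candidates)[candidate] / len(unit_candidates)

def text_unit_share(candidate, text_set):
    # Share of text units in the text set that contain the candidate.
    containing = sum(1 for unit in text_set if candidate in unit)
    return containing / len(text_set)

def unit_frequency_sub_info(candidate, unit_candidates, text_set):
    share = text_unit_share(candidate, text_set)
    if share == 0:
        return 0.0  # the candidate appears in no unit of the text set
    # Assumed combination in the spirit of TF-IDF: frequent in this unit,
    # but not spread evenly over every unit.
    return unit_word_share(candidate, unit_candidates) * math.log(1 + 1 / share)

text_set = [["a", "b", "a"], ["b", "c"]]
print(unit_frequency_sub_info("a", text_set[0], text_set))
print(unit_frequency_sub_info("b", text_set[0], text_set))
```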

The unit topic probability acquisition submodule may be configured to: determine the initial keyword topic probability that each candidate keyword corresponds to a preset topic and the initial unit topic probability that each text unit corresponds to a preset topic; sample the topic probability distribution with a preset sampling algorithm, based on the initial keyword topic probability and the initial unit topic probability, to obtain the keyword topic probability that each candidate keyword corresponds to a preset topic and the unit topic probability that each text unit corresponds to a preset topic; when the keyword topic probability and the unit topic probability satisfy a probability adjustment condition, set the initial keyword topic probability to the keyword topic probability and the initial unit topic probability to the unit topic probability, and return to the sampling step; and when the keyword topic probability and the unit topic probability do not satisfy the probability adjustment condition, take the keyword topic probability of each candidate keyword and the unit topic probability of each text unit as the result.
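The sample-and-repeat loop described above can be illustrated with collapsed Gibbs sampling for an LDA-style topic model. The passage names neither the sampling algorithm nor the adjustment condition, so both are assumptions here: Gibbs sampling is swapped in as the "preset sampling algorithm", and a fixed iteration budget plays the role of the probability adjustment condition. The function name and the hyperparameters `alpha` and `beta` are hypothetical.

```python
import random

def gibbs_topics(units, n_topics=2, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling over candidate keywords grouped by text unit."""
    rng = random.Random(seed)
    vocab = sorted({w for unit in units for w in unit})
    n_vocab = len(vocab)
    # Initial state: a random topic for every candidate keyword occurrence.
    assign = [[rng.randrange(n_topics) for _ in unit] for unit in units]
    unit_topic = [[0] * n_topics for _ in units]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, unit in enumerate(units):
        for i, w in enumerate(unit):
            k = assign[d][i]
            unit_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    # Assumed adjustment condition: keep resampling until the budget is used up.
    for _ in range(n_iters):
        for d, unit in enumerate(units):
            for i, w in enumerate(unit):
                k = assign[d][i]
                # Remove the current assignment before resampling it.
                unit_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (unit_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                unit_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Unit topic probability: smoothed share of each topic within a unit.
    unit_probs = [
        [(c + alpha) / (len(unit) + n_topics * alpha) for c in counts]
        for unit, counts in zip(units, unit_topic)
    ]
    # Keyword topic probability, taken here as the smoothed P(keyword | topic).
    keyword_probs = {
        w: [(topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta)
            for t in range(n_topics)]
        for w in vocab
    }
    return keyword_probs, unit_probs

units = [["dragon", "sword", "dragon"], ["loan", "interest", "loan"], ["sword", "dragon"]]
keyword_probs, unit_probs = gibbs_topics(units)
```

A convergence test, for example stopping once the probabilities change by less than a tolerance between rounds, would be a natural alternative to the fixed budget.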

Optionally, in some embodiments, the second information acquisition module may include a first text frequency sub-information acquisition submodule, a second text frequency sub-information acquisition submodule and a text frequency information acquisition submodule, as follows:

The first text frequency sub-information acquisition submodule may be configured to obtain the first text frequency sub-information of a unit keyword according to the unit frequency information of the unit keyword and how often the unit keyword appears in the text set;

The second text frequency sub-information acquisition submodule may be configured to obtain the second text frequency sub-information of the unit keyword according to the unit frequency information of the unit keyword in the text units of the text set;

The text frequency information acquisition submodule may be configured to fuse the first text frequency sub-information and the second text frequency sub-information to obtain the text frequency information of the unit keyword for the text set.

The first text frequency sub-information acquisition submodule may be configured to: fuse the unit frequency information of the unit keyword with the frequency at which the unit keyword appears in the text set to obtain the fused frequency sub-information of the unit keyword in the text set; obtain the cumulative frequency sub-information of all unit keywords in the text set according to the unit frequency information of each unit keyword in the text set; and obtain the first text frequency sub-information of the unit keyword as the ratio of the fused frequency sub-information to the cumulative frequency sub-information.
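As a rough illustration of this ratio, the sketch below assumes that each unit keyword carries a single unit frequency value and that "fusing" means multiplying that value by the keyword's occurrence count in the text set. Neither assumption comes from the passage, and the function and parameter names are hypothetical.

```python
def first_text_frequency(keyword, unit_scores, set_counts):
    """unit_scores: {unit keyword: unit frequency information};
    set_counts: {unit keyword: number of occurrences in the text set}."""
    # Assumed fusion: unit frequency information times occurrences in the text set.
    fused = unit_scores[keyword] * set_counts[keyword]
    # Cumulative frequency sub-information over all unit keywords of the text set.
    cumulative = sum(unit_scores.values())
    return fused / cumulative

unit_scores = {"dragon": 0.5, "sword": 0.25, "inn": 0.25}
set_counts = {"dragon": 4, "sword": 3, "inn": 1}
print(first_text_frequency("dragon", unit_scores, set_counts))
```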

The second text frequency sub-information acquisition submodule may be configured to: obtain, for each text unit of the text set, the maximum-frequency keyword, namely the keyword with the largest unit frequency information value; accumulate the unit frequency information of all maximum-frequency keywords in the text set to obtain comprehensive cumulative frequency sub-information; select, from the multiple text units of the text set, the target text units that include the unit keyword; obtain the maximum unit frequency information of the unit keyword in each target text unit; accumulate the maximum unit frequency information of the unit keyword over all target text units to obtain specified cumulative frequency sub-information; and obtain the second text frequency sub-information of the unit keyword according to the comprehensive cumulative frequency sub-information and the specified cumulative frequency sub-information.
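The accumulation described above can be sketched as follows, assuming that the two cumulative values are combined as a ratio (specified over comprehensive). The passage only says the result is obtained "according to" both values, so the ratio, like the function name and the data layout, is an assumption.

```python
def second_text_frequency(keyword, per_unit_scores):
    """per_unit_scores: one {unit keyword: unit frequency information} dict per text unit."""
    # Comprehensive cumulative value: the best score of every unit, summed.
    comprehensive = sum(max(scores.values()) for scores in per_unit_scores if scores)
    # Specified cumulative value: this keyword's score in every unit that contains it.
    specified = sum(scores[keyword] for scores in per_unit_scores if keyword in scores)
    # Assumed combination: ratio of the two cumulative values.
    return specified / comprehensive if comprehensive else 0.0

per_unit_scores = [
    {"dragon": 0.5, "inn": 0.25},
    {"mountain": 0.5, "dragon": 0.25},
    {"sword": 0.5, "dragon": 0.25},
]
print(second_text_frequency("dragon", per_unit_scores))
```

Under this reading, a keyword that scores moderately in many units can outrank one that tops a single unit, which matches the aim of finding keywords that matter to the text set as a whole.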

Optionally, in some embodiments, the first extraction module may include an original keyword acquisition submodule and a candidate keyword acquisition submodule, as follows:

The original keyword acquisition submodule may be configured to perform word segmentation on the text content of each text unit, dividing the text content of the text unit into multiple original keywords;

The candidate keyword acquisition submodule may be configured to merge the original keywords according to preset word merging rules to obtain the candidate keywords corresponding to each text unit.

The candidate keyword acquisition submodule may be configured to: merge the original keywords according to the preset word merging rules to obtain the original candidate keywords corresponding to each text unit; when an original candidate keyword satisfies a preset splitting condition, split the original candidate keyword into at least one candidate keyword; and when an original candidate keyword does not satisfy the preset splitting condition, take the original candidate keyword as a candidate keyword.
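The split-or-keep decision can be sketched as below. The passage does not define the preset splitting condition or how a split is performed; this sketch assumes the condition is "merged length exceeds `max_len` characters" and that an overlong candidate is split back into the original keywords it was merged from. The names `finalize_candidates` and `max_len` are hypothetical.

```python
def finalize_candidates(original_candidates, max_len=20, joiner=" "):
    """original_candidates: each entry is the list of original keywords merged into one candidate."""
    result = []
    for parts in original_candidates:
        merged = joiner.join(parts)
        if len(merged) > max_len:
            # Assumed splitting condition met: fall back to the component words.
            result.extend(parts)
        else:
            result.append(merged)
    return result

merged_groups = [
    ["investment", "theory"],
    ["reasonable", "price", "growth"],
]
print(finalize_candidates(merged_groups, max_len=20))
```

For Chinese text the `joiner` would normally be the empty string, since merged words are written without spaces.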

Optionally, in some embodiments, the keyword extraction device may further include a labeling module and a keyword extraction module, as follows:

The labeling module may be configured to label the original keywords in the text content according to the word features of each original keyword;

The keyword extraction module may be configured to extract candidate keywords from the original keywords of the text content according to the labels of the original keywords.

In addition, an embodiment of the present application also provides a storage medium storing a plurality of instructions, the instructions being suitable for loading by a processor to execute the steps of any keyword extraction method provided by the embodiments of the present application.

In the embodiments of the present application, a text set may be obtained, the text set including multiple text units arranged in a certain order. Keyword extraction is performed on the text content of each text unit to obtain the candidate keywords corresponding to each text unit. Unit frequency information of the candidate keywords for the text unit is obtained according to a text unit metric parameter that measures the importance of the candidate keywords to the text unit. According to the unit frequency information, the unit keyword corresponding to each text unit is selected from the candidate keywords of that text unit. Text frequency information of the unit keywords for the text set is obtained according to a text set metric parameter that measures the importance of the unit keywords to the text set, and keywords are extracted from the multiple unit keywords of the text set according to the text frequency information. The scheme extracts the keywords of a text set according to both the unit frequency information of keywords for text units and the text frequency information of keywords for the text set, thereby improving the accuracy of keyword extraction from the text set.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a scenario of the keyword extraction system provided by an embodiment of the present application;

Fig. 2 is a first flowchart of the keyword extraction method provided by an embodiment of the present application;

Fig. 3 is a second flowchart of the keyword extraction method provided by an embodiment of the present application;

Fig. 4 is a third flowchart of the keyword extraction method provided by an embodiment of the present application;

Fig. 5 is a schematic diagram of the main interface of a text reading application provided by an embodiment of the present application;

Fig. 6 is a first schematic diagram of book recommendation in the text reading application provided by an embodiment of the present application;

Fig. 7 is a second schematic diagram of book recommendation in the text reading application provided by an embodiment of the present application;

Fig. 8 is a structural schematic diagram of the keyword extraction device provided by an embodiment of the present application;

Fig. 9 is a structural schematic diagram of the network device provided by an embodiment of the present application.

Detailed description of embodiments

Referring to the drawings, in which identical reference symbols denote identical components, the principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the application and should not be regarded as limiting other specific embodiments not detailed herein.

In the following description, specific embodiments of the application are described with reference to steps and symbols of operations performed by one or more computers, unless otherwise stated. These steps and operations are therefore referred to several times as being executed by a computer, which here includes the manipulation, by the computer's processing unit, of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the data format. Although the principles of the application are described in these terms, this is not meant as a limitation; those skilled in the art will appreciate that the steps and operations described below may also be implemented in hardware.

The term "module" as used herein may be regarded as a software object executed on the computing system. The different components, modules, engines and services described herein may be regarded as objects implemented on the computing system. The devices and methods described herein may be implemented in software, and may of course also be implemented in hardware, both of which fall within the protection scope of the present application.

The terms "first", "second", "third" and so on in the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or modules is not limited to the listed steps or modules; some embodiments further include steps or modules that are not listed, or other steps or modules inherent to the process, method, product or device.

Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

An embodiment of the present application provides a keyword extraction method. The method may be executed by the keyword extraction device provided by the embodiments of the present application, or by a network device in which the keyword extraction device is integrated, where the keyword extraction device may be implemented in hardware or software. The network device may be a smartphone, a tablet computer, a palmtop computer, a laptop computer, a desktop computer or similar equipment. Network devices include, but are not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers.

Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the keyword extraction method provided by an embodiment of the present application. Taking the case where the keyword extraction device is integrated in a network device as an example, the network device may obtain a text set that includes multiple text units arranged in a certain order; perform keyword extraction on the text content of each text unit to obtain the candidate keywords corresponding to each text unit; obtain unit frequency information of the candidate keywords for the text unit according to a text unit metric parameter that measures the importance of the candidate keywords to the text unit; select, according to the unit frequency information, the unit keyword corresponding to each text unit from the candidate keywords of that text unit; obtain text frequency information of the unit keywords for the text set according to a text set metric parameter that measures the importance of the unit keywords to the text set; and extract keywords from the multiple unit keywords of the text set according to the text frequency information.

Specifically, the text set may be, for example, a book text that includes multiple chapters. After keywords are extracted from the book text, a user can learn the content, theme and other information of the book text from the extracted keywords and decide whether it is a book the user wishes to read. A text reading application can also use the extracted keywords to find books whose subject matter is similar to that of the book text, and then recommend them to the user according to the user's reading history, so that books are recommended according to the user's preferences.

Referring to Fig. 2, Fig. 2 is a flowchart of the keyword extraction method provided by an embodiment of the present application. The specific flow of the keyword extraction method provided by the embodiment may be as follows:

201. Obtain a text set.

Here, a text may be a combination of several sentences with a complete, systematic meaning; a text may be a sentence, a paragraph, a chapter and so on.

A text set may be a set composed of several texts. For example, a text set may include multiple text units arranged in a certain order. If the text set is a book, the book may include multiple chapters arranged in sequence, such as Chapter 1, Chapter 2 and so on.

A text unit may be a component of the text set. For example, when the text set is a book, the text units may be the chapters of the book, with each chapter serving as one text unit. The multiple text units in the text set may be arranged in a certain order, for example in the order Chapter 1, Chapter 2 and so on.

In practical applications, a text set may be obtained, and the text set may include multiple text units arranged in a certain order. For example, a book may be obtained as the text set, and the book may include multiple chapters. The text set may be denoted by Book and a text unit by Chap, so that the relationship between the text set and its text units can be expressed as $Book = \{Chap_1, Chap_2, \ldots, Chap_n\}$, where $Chap_n$ denotes the n-th chapter of the book.

In one embodiment, the text set may be obtained in many ways. For example, the text set may be obtained from a local text database: when the user opens a text reading application, the application may call the local text database and obtain the required text set from it. As another example, the text set may be obtained from an external storage unit, or from a network-side device.

202. Perform keyword extraction on the text content of each text unit to obtain the candidate keywords corresponding to each text unit.

In practical applications, keyword extraction may be performed on the text content of each text unit to obtain the candidate keywords corresponding to each text unit. For example, keyword extraction may be performed on each chapter of a book, with the words that express the content of the chapter taken as candidate keywords.

In one embodiment, because Chinese sentences have no explicit word boundaries, Chinese text may first be segmented into words when it is processed, and the words in the text unit may then be merged according to preset word merging rules so that accurate candidate keywords are obtained. Specifically, the step "performing keyword extraction on the text content of each text unit to obtain the candidate keywords corresponding to each text unit" may include:

performing word segmentation on the text content of each text unit, dividing the text content of the text unit into multiple original keywords;

merging the original keywords according to preset word merging rules to obtain the candidate keywords corresponding to each text unit.

In practical applications, word segmentation may be performed on the text content of each text unit, dividing the text content of the text unit into multiple original keywords. For example, a segmentation algorithm may be applied to the text content of each chapter so that each chapter of the book is divided into a combination of original keywords. The segmented book text may be denoted by $Book^{seg}$ and the n-th chapter after segmentation by $Chap_n^{seg}$, so that the relationship between the segmented book text and its chapters can be expressed as $Book^{seg} = \{Chap_1^{seg}, Chap_2^{seg}, \ldots, Chap_n^{seg}\}$.

For example, a passage in the book may read "Sometimes it is Jiaxuan, but the lyric I like best is, on the contrary, a clear one: it is Jiang Jie's 'Corn Poppy: Listening to the Rain'". After word segmentation, the passage may be represented as "sometimes:d time:v is:v Jiaxuan:n (punctuation):w but:c I:r most:d like:v (particle):u lyric:n on-the-contrary:d clear:a (particle):u (punctuation):w this:r is:v Jiang:n Jie:n (particle):u (punctuation):w corn:n poppy:n (punctuation):w listening-to-rain:n (punctuation):w", where each token is an English gloss of a Chinese word followed by its part-of-speech tag. Here "time", "is", "Jiaxuan" and so on are among the original keywords obtained by segmenting the passage. In this way, the originally continuous chapter can be broken into multiple separate original keywords.

In one embodiment, many segmentation algorithms may be used, such as segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics. Segmentation based on string matching matches the Chinese character string to be processed against the entries of a machine dictionary according to a certain strategy, and segments the string accordingly. Segmentation based on understanding performs syntactic and semantic analysis while segmenting, using syntactic and semantic information to simulate a person's understanding of the sentence. Segmentation based on statistics counts how often adjacent characters co-occur in a corpus to obtain their co-occurrence information, which characterizes how tightly the Chinese characters are bound together; when the tightness reaches a certain threshold, the character group is considered to form a word.
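Of the three families, string matching is the easiest to show in a few lines. The sketch below uses forward maximum matching, one common matching strategy; the passage does not say which strategy is used, and the tiny dictionary is made up for illustration.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation against a word dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary entry matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # A single character is always accepted as a fallback.
                tokens.append(piece)
                i += size
                break
    return tokens

# Illustrative dictionary: "keyword", "extraction", "method".
dictionary = {"关键词", "提取", "方法"}
print(forward_max_match("关键词提取方法", dictionary))
```

Greedy matching is fast but cannot resolve ambiguous boundaries, which is one reason the understanding-based and statistics-based approaches exist.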

Word segmentation may also label the part of speech of each original keyword, for example by appending a tag after each original keyword that marks its part of speech. Here d may denote an adverb, v a verb, n a noun, w a punctuation mark, c a conjunction, r a pronoun, a an adjective, u an auxiliary word, and so on. In this way, not only is the continuous chapter broken into multiple original keywords for keyword extraction, but the part-of-speech tag on each original keyword also makes it easy to merge original keywords by part of speech, which improves the accuracy of keyword extraction.
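Tagged output of this kind is straightforward to consume. The sketch below assumes tokens are separated by spaces and written as `word:tag`, which matches the notation used in this description but is otherwise an assumption; the helper names are hypothetical.

```python
from collections import defaultdict

def parse_tagged(segmented):
    """Turn 'word:tag word:tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in segmented.split():
        # Split on the last colon so that a punctuation token such as ',:w' still parses.
        word, tag = token.rsplit(":", 1)
        pairs.append((word, tag))
    return pairs

def group_by_tag(pairs):
    """Collect the words seen under each part-of-speech tag."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)

pairs = parse_tagged("sometimes:d is:v Jiaxuan:n ,:w but:c I:r")
nouns = group_by_tag(pairs).get("n", [])
print(pairs, nouns)
```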

In practical applications, after word segmentation has divided the text content of each text unit into multiple original keywords, the original keywords may be merged according to preset word merging rules to obtain the candidate keywords corresponding to each text unit. For example, after each chapter has been broken into multiple separate original keywords, the part of speech of each original keyword may be obtained, and at least one original keyword may be merged into one candidate keyword according to the preset word merging rules. For instance, given the three original keywords "fair and clear:a", "face:n" and "youngster:Ng", the preset word merging rules allow an adjective, a noun and a nominal morpheme in sequence to be merged, so the three original keywords can be merged into the single candidate keyword "fair and clear face".

In one embodiment, there may be many preset word merging rules:

An adjective, a noun and a nominal morpheme in sequence can be merged; for example, "fair and clear:a" "face:n" "youngster:Ng" can be merged into "fair and clear face";

An adjective, an auxiliary word and a nominal morpheme in sequence can be merged; for example, "wise:a" "(particle):u" "move:Ng" can be merged into "wise move";

A verbal noun and a noun in sequence can be merged; for example, "investment:vn" "theory:n" can be merged into "investment theory";

A noun, a punctuation mark and a noun in sequence can be merged; for example, "Nash-elmo:n" "(punctuation):w" "Nicholas:n" can be merged into "Nash-elmo Nicholas";

An adjective, a noun and a noun in sequence can be merged; for example, "high:a" "risk:n" "loan:n" can be merged into "high-risk loan";

A noun, an auxiliary word and a noun in sequence can be merged; for example, "national debt:n" "(particle):u" "interest rate:n" can be merged into "interest rate of national debt";

A noun and a verbal noun in sequence can be merged; for example, "momentum:n" "investment:vn" can be merged into "momentum investment";

An adjective, a noun and a verbal noun in sequence can be merged; for example, "reasonable:a" "price:n" "growth:vn" can be merged into "reasonable price growth";

Another proper noun and a noun in sequence can be merged; for example, "star-spangled banner:nz" "bank:n" can be merged into "Citibank";

A verbal noun, a noun and a noun in sequence can be merged; for example, "prediction:vn" "critical:n" "event:n" can be merged into "prediction of critical events";

A noun, a verbal noun and a noun in sequence can be merged; for example, "risk:n" "adjustment:vn" "concept:n" can be merged into "risk adjustment concept";

A noun and a suffix component in sequence can be merged; for example, "fluctuation:n" "property:k" can be merged into "volatility";

An adjective, an auxiliary word and a noun in sequence can be merged; for example, "outstanding:a" "(particle):u" "enemy:n" can be merged into "outstanding enemy";

A noun, an auxiliary word, a verbal noun and a nominal morpheme in sequence can be merged; for example, "assets:n" "(particle):u" "related:vn" "property:Ng" can be merged into "correlation of assets";

A noun, an auxiliary word, a verbal noun and a noun in sequence can be merged; for example, "group:n" "(particle):u" "responsible:vn" "person:n" can be merged into "person in charge of the group";

A noun and a verbal morpheme in sequence can be merged; for example, "ginger:n" "porcelain:Vg" can be merged into "ginger porcelain";

A numeral, a noun and a nominal morpheme in sequence can be merged; for example, "five:m" "stars:n" "Supreme Being:Ng" can be merged into "five stars Supreme Being";

A numeral, an adjective, a nominal morpheme and a noun in sequence can be merged; for example, "nine:m" "profound:a" "gold:Ng" "brontosaurus:n" can be merged into "nine profound Jin Leilong"; and so on.

In one embodiment, the preset word-merging rule may also include merging multiple consecutive noun-type primary keywords into one candidate keyword; for example, two or three consecutive nouns can be merged into one candidate keyword.
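The merge rules above can be sketched as pattern matching over part-of-speech tag sequences. This is only an illustrative sketch: the function name `merge_tokens`, the greedy longest-pattern-first strategy, and joining the merged words with a separator are assumptions, not details stated in the text (in particular, the text's examples sometimes drop the auxiliary word or punctuation mark when merging, which this sketch does not do).

```python
# Hypothetical sketch: merge adjacent (word, tag) tokens whose tag sequence
# matches one of the preset merge patterns listed in the text.
MERGE_PATTERNS = [
    ("m", "a", "Ng", "n"), ("n", "u", "vn", "Ng"), ("n", "u", "vn", "n"),
    ("a", "n", "Ng"), ("a", "u", "Ng"), ("a", "n", "n"), ("n", "u", "n"),
    ("a", "n", "vn"), ("vn", "n", "n"), ("n", "vn", "n"), ("a", "u", "n"),
    ("m", "n", "Ng"), ("n", "w", "n"),
    ("vn", "n"), ("n", "vn"), ("nz", "n"), ("n", "k"), ("n", "Vg"),
]


def merge_tokens(tokens, patterns=MERGE_PATTERNS, sep=" "):
    """Greedy left-to-right merge; longer patterns are tried first.

    tokens: list of (word, pos_tag) pairs for one text unit.
    Returns a list of candidate keyword strings.
    """
    ordered = sorted(patterns, key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for pattern in ordered:
            window = tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                merged.append(sep.join(word for word, _ in window))
                i += len(pattern)
                break
        else:
            # No pattern starts at this token: keep the word on its own.
            merged.append(tokens[i][0])
            i += 1
    return merged
```

For Chinese text the separator would presumably be the empty string (`sep=""`); the space is used here only because the examples are English glosses.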

In one embodiment, a candidate keyword that is too long is not suitable for the subsequent step of obtaining word themes. An original candidate keyword that is too long after merging therefore needs to be split again so that the candidate keywords have a suitable length. Specifically, the step "performing word merging on the primary keywords according to the preset word-merging rule to obtain the candidate keywords corresponding to each text unit" may include:

Performing word merging on the primary keywords according to the preset word-merging rule to obtain the original candidate keywords corresponding to each text unit;

When an original candidate keyword meets a preset splitting condition, splitting the original candidate keyword into at least one candidate keyword;

When an original candidate keyword does not meet the preset splitting condition, determining the original candidate keyword as a candidate keyword.

In practical applications, word merging can be performed on the primary keywords according to the preset word-merging rule to obtain the original candidate keywords of each text unit. An original candidate keyword that meets the preset splitting condition is split into at least one candidate keyword; one that does not is used directly as a candidate keyword. For example, multiple primary keywords can be merged into an original candidate keyword according to the preset word-merging rule, and the word length of the original candidate keyword is then checked. When the word length exceeds a preset word length threshold, the original candidate keyword is considered to meet the preset splitting condition and is split into at least one candidate keyword; the relationship between the original candidate keyword and the candidate keywords after splitting can be expressed as $word = \{word_{sub,0}, word_{sub,1}, \ldots\}$. When the word length does not exceed the preset word length threshold, the original candidate keyword is considered not to meet the preset splitting condition and need not be split.

In one embodiment, whether an original candidate keyword meets the preset splitting condition can also be judged from the number of primary keywords it contains. For example, when the number of primary keywords in the original candidate keyword exceeds a preset keyword number, the original candidate keyword is considered to meet the preset splitting condition and is split into at least one candidate keyword; when the number does not exceed the preset keyword number, the original candidate keyword is considered not to meet the preset splitting condition and need not be split.

For example, when an original candidate keyword contains 5 primary keywords, it can be considered to meet the preset splitting condition and be split into two candidate keywords, which contain 2 primary keywords and 3 primary keywords respectively.
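The count-based splitting condition can be sketched as follows. The threshold value `max_count=4` and the choice to split at the midpoint are assumptions made for illustration; the text only gives the example of 5 primary keywords being split into groups of 2 and 3.

```python
def split_candidate(primary_keywords, max_count=4):
    """Split an original candidate keyword when it holds too many primary keywords.

    primary_keywords: list of the primary keywords merged into one original
    candidate keyword. max_count is a hypothetical preset keyword number.
    Returns a list of candidate keywords, each given as a list of primary keywords.
    """
    if len(primary_keywords) <= max_count:
        # Splitting condition not met: keep the original candidate as is.
        return [primary_keywords]
    half = len(primary_keywords) // 2
    return [primary_keywords[:half], primary_keywords[half:]]
```

With five primary keywords this yields one candidate of two primary keywords and one of three, matching the example above.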

In one embodiment, candidate keywords that occur too infrequently are also unsuitable for the subsequent step of obtaining word themes. Therefore, after the candidate keywords are obtained, those whose frequency of occurrence is too low can be screened out and deleted so that more accurate candidate keywords are retained.

In one embodiment, named entities in the text, such as person names, organization names and place names, are also keywords of the text, but they emphasize the identifying features of the text. Keywords consisting of named entities can therefore also be obtained by recognizing the named entities in the text, so that the user can obtain more accurate keywords from multiple angles. Specifically, after the step "performing text word segmentation on the text content of each text unit to divide the text content of the text unit into multiple primary keywords", the method may further include:

Labeling the primary keywords in the text content according to the word features of each primary keyword;

Extracting candidate keywords from the primary keywords of the text content according to the labels of the primary keywords.

Here, a named entity is an entity with a specific meaning in the text to be recognized; for example, named entities may include person names, organization names, place names, proper nouns, times, dates, numeral-classifier phrases, etc.

In practical applications, the primary keywords in the text content can be labeled according to the word features of each primary keyword, and candidate keywords can then be extracted from the primary keywords of the text content according to those labels. For example, after the text unit has been divided into multiple primary keywords by text word segmentation, named entity recognition can be used to label each primary keyword according to its word features, that is, to assign each primary keyword a label, and candidate keywords are extracted from the primary keywords according to their labels. $Book^{ner}$ can denote the book text after named entity recognition and $Chap_n^{ner}$ the n-th chapter after named entity recognition; the relationship between the book text and the chapter texts can then be expressed as $Book^{ner} = \{Chap_1^{ner}, Chap_2^{ner}, \ldots, Chap_n^{ner}\}$.

For example, named entity recognition may identify named entities such as "Jiaxuan/PER", "Jiang Jie/PER", "the wealthy cloud in river/PER", "the day bright/LOC" and "U.S./LOC", and these named entities are used as candidate keywords. Here "PER" and "LOC" are named entity labels: "PER" indicates that the named entity is a person name, "LOC" indicates that it is a place name, and so on.
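Once each primary keyword carries a label, extracting the named-entity candidates is a simple filter. The sketch below assumes the recognizer's output is a list of `(word, label)` pairs and that non-entities carry some other label such as "O"; both are illustrative conventions, not something the text specifies.

```python
def extract_entity_keywords(labeled_keywords, keep_labels=("PER", "LOC")):
    """Return the distinct primary keywords whose label marks a named entity.

    labeled_keywords: list of (word, label) pairs (hypothetical NER output).
    Order of first appearance is preserved.
    """
    seen, candidates = set(), []
    for word, label in labeled_keywords:
        if label in keep_labels and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates
```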

In one embodiment, named entities can be recognized by a network model, for example a network model that uses a CRF (conditional random field) as the benchmark model for NER (Named Entity Recognition). Named entities can also be recognized by other models, such as a deep learning network model or an HMM (Hidden Markov Model).

203, Obtain the unit frequency information of the candidate keywords for the text unit according to the text unit metric parameters that measure the importance of the candidate keywords to the text unit.

Here, a text unit metric parameter is a parameter that measures how important a candidate keyword is relative to the text unit. From the text unit metric parameters, the importance of the candidate keyword to the text unit can be learned, and it can then be determined whether the candidate keyword is an accurate keyword for the text unit. There may be several types of text unit metric parameter; for example, the text unit metric parameters may include subject correlation information, unit frequency sub-information, word length information, and so on.

The unit frequency information is obtained from text unit metric parameters such as the frequency with which the candidate keyword appears in the text unit, and characterizes the importance of a keyword relative to the text unit. The larger the value of the unit frequency information, the more important the keyword is to the text unit; that is, for a text unit, a keyword with a larger unit frequency information value is a more accurate keyword. The unit frequency information can be expressed as a keyword weight.

In practical applications, the unit frequency information of the candidate keywords for the text unit can be obtained according to the text unit metric parameters that measure the importance of the candidate keywords to the text unit. For example, multiple text unit metric parameters of a candidate keyword can be calculated to obtain the unit frequency information, in the form of a weight, of the candidate keyword for the text unit.

In one embodiment, since the importance of a keyword to a text unit is determined by multiple factors, the subject correlation information, unit frequency sub-information and word length information of a candidate keyword can each be calculated and used to obtain the unit frequency information of the candidate keyword for the text unit before keywords are extracted, which improves the accuracy of keyword extraction. Specifically, the step "obtaining the unit frequency information of the candidate keyword for the text unit according to the text unit metric parameters that measure the importance of the candidate keyword to the text unit" may include:

Fusing the subject correlation information, the unit frequency sub-information and the word length information to obtain the unit frequency information of the candidate keyword for the text unit.

Since the importance of a keyword to a text unit increases in proportion to the frequency with which the keyword appears in that text unit, and decreases in inverse proportion to the frequency with which the keyword appears across the text units of the corpus, the text unit metric parameters may include unit frequency sub-information. In addition, the closer a keyword is to the theme of the text unit, the more important it is to the text unit, so the text unit metric parameters may include subject correlation information. In practice, the word length of a keyword also affects its importance, so the text unit metric parameters may also include word length information.

In practical applications, the subject correlation information, unit frequency sub-information and word length information can be fused to obtain the unit frequency information of the candidate keyword for the text unit. For example, the subject correlation information, unit frequency sub-information and word length information of the candidate keyword can be obtained and fused, and the fused information is used as the unit frequency information of the candidate keyword for the text unit.

In one embodiment, specifically, the step "fusing the subject correlation information, the unit frequency sub-information and the word length information to obtain the unit frequency information of the candidate keyword for the text unit" may include:

Obtaining the keyword theme probability of each candidate keyword for the preset themes and the unit theme probability of each text unit for the preset themes;

Obtaining the subject correlation information between the candidate keyword and the text unit according to the keyword theme probability of the candidate keyword and the unit theme probability of the text unit in which the candidate keyword is located;

Obtaining the unit frequency sub-information of the candidate keyword for the text unit according to the frequency with which the candidate keyword appears in the corresponding text unit;

Obtaining the word length information of the candidate keyword based on the word length of the candidate keyword.

In practical applications, for example, LDA (Latent Dirichlet Allocation, a document topic generation model) can be used to obtain the keyword theme probability of each candidate keyword for the preset themes, $Topic_{word_j} = [v_{j,1}, v_{j,2}, \ldots, v_{j,k}]$, and the unit theme probability of each text unit for the preset themes, $Topic_{chap_i} = [v_{i,1}, v_{i,2}, \ldots, v_{i,k}]$. The subject correlation information $r_{chap,word}$ between a candidate keyword and a text unit is then obtained by the formula $r_{i,j} = \mathrm{cosine}(Topic_{chap_i}, Topic_{word_j})$. Next, the unit frequency sub-information $w_{chap,word}$ of the candidate keyword for the text unit is obtained according to the frequency with which the candidate keyword appears in the corresponding text unit, and the word length information $weight_{len}(word)$ of the candidate keyword is obtained according to its word length. Finally, according to the formula $weight_{word} = r_{chap,word} \times w_{chap,word} \times weight_{len}(word)$, the subject correlation information, unit frequency sub-information and word length information are fused to obtain the unit frequency information $weight_{word}$ of the candidate keyword for the text unit.
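The fusion formula can be sketched directly. The function names are illustrative, and the theme probability vectors are assumed to be plain lists of floats of equal length; how they are produced is outside this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def unit_frequency_info(topic_chap, topic_word, w_chap_word, weight_len):
    """weight_word = r_chap,word * w_chap,word * weight_len(word)."""
    r_chap_word = cosine(topic_chap, topic_word)  # subject correlation information
    return r_chap_word * w_chap_word * weight_len
```

Because the three factors are multiplied, a candidate keyword that scores zero on any one of them (for example, a theme vector orthogonal to the chapter's) receives zero weight overall.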

In one embodiment, since in practical applications the length of a word also affects the importance of a keyword to a text unit, the word length information $weight_{len}(word)$ of candidate keywords of different word lengths can be set by the following rule:

weight_length = {1: 0.7, 2: 1.0, 3: 1.2, 4: 1.5, 5: 1.2}

Here, the word length information of a candidate keyword of length one can be 0.7; of length two, 1.0; of length three, 1.2; of length four, 1.5; and of length five, 1.2. The word length information of candidate keywords of any other length can be 1.0.
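The rule above is a lookup table with a default. In this sketch the length of a candidate keyword is taken to be `len(word)`, that is, the number of characters in the string; that measure is an assumption about how word length is counted.

```python
# Word length information from the rule in the text; other lengths default to 1.0.
WEIGHT_LENGTH = {1: 0.7, 2: 1.0, 3: 1.2, 4: 1.5, 5: 1.2}


def weight_len(word, default=1.0):
    """Look up the word length information for a candidate keyword."""
    return WEIGHT_LENGTH.get(len(word), default)
```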

In practical applications, to improve the accuracy of the theme probabilities, they can be obtained by LDA (Latent Dirichlet Allocation, a document topic generation model). Specifically, the step "obtaining the keyword theme probability of each candidate keyword for the preset themes and the unit theme probability of each text unit for the preset themes" may include:

Determining the initial keyword theme probability of each candidate keyword for the preset themes and the initial unit theme probability of each text unit for the preset themes;

Based on the initial keyword theme probability and the initial unit theme probability, sampling the theme probability distribution by a preset sampling algorithm to obtain the keyword theme probability of each candidate keyword for the preset themes and the unit theme probability of each text unit for the preset themes;

When the keyword theme probability and the unit theme probability meet a probability adjustment condition, updating the initial keyword theme probability to the keyword theme probability and the initial unit theme probability to the unit theme probability;

Returning to the step of sampling the theme probability distribution by the preset sampling algorithm, based on the initial keyword theme probability and the initial unit theme probability, to obtain the keyword theme probability of each candidate keyword for the preset themes and the unit theme probability of each text unit for the preset themes;

When the keyword theme probability and the unit theme probability do not meet the probability adjustment condition, taking them as the keyword theme probability of each candidate keyword for the preset themes and the unit theme probability of each text unit for the preset themes.

Here, LDA (Latent Dirichlet Allocation, a document topic generation model) is a generative model with a three-layer structure of words, themes and texts. The generation of a text can be viewed as follows: a specific theme is selected from multiple preset themes with a certain probability, a word is selected from that theme with a certain probability, and the words selected in this way make up a text. The text-to-theme relationship follows a multinomial distribution, and the theme-to-word relationship also follows a multinomial distribution.

In practical applications, each candidate keyword can first be assigned an initial keyword theme probability for the preset themes, and each text unit an initial unit theme probability for the preset themes. The Gibbs sampling formula is then completed from these initially assigned probabilities, and the completed Gibbs sampling formula is used to obtain the keyword theme probability of each candidate keyword for the preset themes and the unit theme probability of each text unit for the preset themes.

The obtained keyword theme probability is compared with the initial keyword theme probability, and the obtained unit theme probability with the initial unit theme probability. When the comparison shows that Gibbs sampling has converged, that is, the keyword theme probability and the unit theme probability do not meet the probability adjustment condition, they can be taken directly as the result. When the comparison shows that Gibbs sampling has not converged, that is, the keyword theme probability and the unit theme probability meet the probability adjustment condition, the initial keyword theme probability is updated to the keyword theme probability and the initial unit theme probability to the unit theme probability, and sampling continues according to the Gibbs sampling formula until it converges, yielding the keyword theme probability $Topic_{word_j} = [v_{j,1}, v_{j,2}, \ldots, v_{j,k}]$ and the unit theme probability $Topic_{chap_i} = [v_{i,1}, v_{i,2}, \ldots, v_{i,k}]$.
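The sample, compare and repeat loop described above can be illustrated with a small collapsed Gibbs sampler for LDA. This is a textbook-style sketch, not the patent's implementation: the hyperparameters `alpha` and `beta`, the convergence tolerance `tol`, the iteration cap, and the way the keyword theme probability is normalized over themes are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, max_iters=200, tol=1e-3, seed=0):
    """docs: list of text units, each a list of candidate keywords.

    Returns (keyword theme probabilities, unit theme probabilities).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V, K = len(vocab), num_topics
    ndk = [[0] * K for _ in docs]                       # unit-theme counts
    nkw = [dict.fromkeys(vocab, 0) for _ in range(K)]   # theme-keyword counts
    nk = [0] * K                                        # keywords per theme
    z = []
    for d, doc in enumerate(docs):                      # random initial assignment
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)

    def unit_probs():
        return [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
                for d, doc in enumerate(docs)]

    prev = unit_probs()                                 # "initial" probabilities
    for _ in range(max_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                             # remove current assignment
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
        cur = unit_probs()
        change = max((abs(a - b) for p, c in zip(prev, cur) for a, b in zip(p, c)),
                     default=0.0)
        prev = cur                                      # adopt as new "initial" values
        if change < tol:                                # adjustment condition not met
            break

    word_probs = {}
    for w in vocab:                                     # normalize over themes per keyword
        col = [nkw[k][w] + beta for k in range(K)]
        total = sum(col)
        word_probs[w] = [c / total for c in col]
    return word_probs, prev
```

Because Gibbs sampling is stochastic, the change between sweeps may never fall below a small tolerance on tiny inputs; the iteration cap keeps the loop bounded in that case.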

In one embodiment, for example, 10000 books can be obtained for theme probability acquisition; the 10000 books may include 5000000 chapters and 3000000 distinct candidate keywords, and theme probabilities are then obtained from this data. Because the data volume is too large, LDA is difficult to run on a single machine, and a cluster that can run distributed LDA is also difficult to obtain. The problem of excessive data volume can therefore be solved by dividing the data into several parts and obtaining LDA theme probabilities for each part separately. For example, the 5000000 chapters and 3000000 distinct candidate keywords of the 10000 books can be divided into K parts, with each part divided according to chapter number, so that the data of every book is evenly distributed across the parts. The value of K can be adjusted according to the actual situation so that each part can be run on a single machine.
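One way to divide by chapter number so that every book is spread evenly over the K parts is to assign chapter number n of each book to part n mod K. The text does not state the exact assignment rule, so the modulo scheme and the data layout below are assumptions.

```python
def partition_chapters(books, k):
    """Distribute chapters into k parts by chapter number.

    books: dict mapping a book id to its ordered list of chapters.
    Chapter number n of every book goes to part n % k (assumed rule).
    Returns a list of k parts, each a list of (book_id, chapter_number, chapter).
    """
    parts = [[] for _ in range(k)]
    for book_id, chapters in books.items():
        for number, chapter in enumerate(chapters):
            parts[number % k].append((book_id, number, chapter))
    return parts
```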

In one embodiment, since the importance of a keyword to a text unit increases in proportion to the frequency with which the keyword appears in that text unit, and decreases in inverse proportion to the frequency with which the keyword appears across the text units of the corpus, calculating the unit frequency sub-information can improve the accuracy with which keyword importance is determined. Specifically, the step "obtaining the unit frequency sub-information of the candidate keyword for the text unit according to the frequency with which the candidate keyword appears in the corresponding text unit" may include:

Obtaining the unit word proportion sub-information, that is, the proportion of the number of occurrences of the candidate keyword in the text unit to the number of all candidate keywords in the text unit;

Selecting, from the multiple text units of the text set, the candidate target text units that contain the candidate keyword;

Obtaining the text unit proportion sub-information, which relates the number of candidate target text units to the number of all text units in the text set;

Obtaining the unit frequency sub-information of the candidate keyword for the text unit based on the unit word proportion sub-information and the text unit proportion sub-information.

In practical applications, TF-IDF (term frequency-inverse document frequency, a weighting technique for information retrieval and data mining) can be used to calculate the unit frequency sub-information of a candidate keyword. For example, to calculate the unit frequency sub-information of the candidate keyword "I", the frequency $\#(word_j)$ with which "I" appears in the text unit and the number $\#(word\ in\ chap)$ of all candidate keywords in the text unit are obtained, and the unit word proportion sub-information is computed as the ratio $\#(word_j) / \#(word\ in\ chap)$. Then the candidate target text units that contain the candidate keyword "I" are selected from the text set, the number $\#(chap\ has\ word_j)$ of candidate target text units and the number $\#(chap)$ of all text units in the text set are obtained, and the text unit proportion sub-information is calculated from these two counts; in standard TF-IDF this takes the form $\log(\#(chap) / \#(chap\ has\ word_j))$. Finally, the two values are combined by the unit frequency sub-information formula, which in standard TF-IDF is their product, to obtain the unit frequency sub-information $w_{chap,word}$ of the candidate keyword "I" for the text unit.
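A minimal sketch of this calculation, assuming the standard TF-IDF form (term frequency times the natural logarithm of the inverse chapter frequency). The function name and the choice to return 0.0 for empty chapters or unseen keywords are illustrative assumptions.

```python
import math


def unit_frequency_sub_info(word, chapter, chapters):
    """Standard TF-IDF weight of `word` in `chapter`.

    chapter: list of candidate keywords in one text unit.
    chapters: list of all text units in the text set (each a list of keywords).
    """
    containing = sum(1 for c in chapters if word in c)  # #(chap has word_j)
    if not chapter or containing == 0:
        return 0.0
    tf = chapter.count(word) / len(chapter)             # #(word_j) / #(word in chap)
    idf = math.log(len(chapters) / containing)          # log(#(chap) / #(chap has word_j))
    return tf * idf
```

Under this form, a keyword that appears in every chapter gets an inverse-frequency factor of log(1) = 0 and therefore zero weight, which is the intended penalty for words that do not discriminate between chapters.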

Here, TF-IDF (term frequency-inverse document frequency) is a weighting technique for information retrieval and data mining. Its main idea is that if a word appears frequently in one text and rarely in other texts, the word has good class discrimination ability and is suitable for classification.

In one embodiment, the unit frequency sub-information can be calculated in many ways, as long as the method assigns each candidate keyword a unit frequency sub-information value on a statistical dimension. For example, the unit frequency sub-information of a candidate keyword can also be obtained by TextRank, a keyword extraction algorithm from the field of natural language processing that can be used to extract keywords and phrases and to generate text summaries automatically. As another example, the unit frequency sub-information of a candidate keyword can also be obtained by TF alone, and so on.

204, Select the unit keywords corresponding to each text unit from the candidate keywords corresponding to that text unit according to the unit frequency information.

In practical applications, the unit keywords of each text unit can be selected from the candidate keywords of that text unit according to the unit frequency information. For example, after the unit frequency information $weight_{word}$ of the candidate keywords for a chapter is obtained, the candidate keywords in each chapter can be sorted in descending order of their unit frequency information values, and the top 20 candidate keywords are taken as the unit keywords of that chapter.
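The selection in this step is a sort followed by a cut-off. The sketch below assumes the unit frequency information of one chapter is held in a dictionary from candidate keyword to weight; the function name is illustrative.

```python
def select_unit_keywords(weights, top_n=20):
    """Return the top_n candidate keywords by unit frequency information.

    weights: dict mapping candidate keyword -> weight for one chapter.
    """
    return sorted(weights, key=weights.get, reverse=True)[:top_n]
```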

205, Obtain the text frequency information of the unit keywords for the text set according to the text set metric parameters that measure the importance of the unit keywords to the text set.

Here, a text set metric parameter is a parameter that measures how important a keyword is relative to the text set. From the text set metric parameters, the importance of the keyword to the text set can be learned, and it can then be determined whether the keyword is an accurate keyword for the text set. There may be several types of text set metric parameter; for example, the text set metric parameters may include first text frequency sub-information, second text frequency sub-information, and so on.

The text frequency information characterizes the importance of a keyword relative to the text set. The larger the value of the text frequency information, the more important the keyword is to the text set; that is, for a text set, a keyword with a larger text frequency information value is a more accurate keyword. The text frequency information can be expressed as a keyword weight.

In practical applications, the text frequency information of the unit keywords for the text set can be obtained according to the text set metric parameters that measure the importance of the unit keywords to the text set. For example, the first text frequency sub-information $KF_{weighted}$ and the second text frequency sub-information of a unit keyword for the text set can be calculated and then combined by formula to obtain the text frequency information $weight_{word}$, in the form of a weight, of the unit keyword for the text set.

In one embodiment, since there can be several text set metric parameters that measure the importance of a unit keyword to the text set, multiple text set metric parameters can be calculated in order to obtain accurate text frequency information. Specifically, the step "obtaining the text frequency information of the unit keyword for the text set according to the text set metric parameters that measure the importance of the unit keyword to the text set" may include:

Obtaining the first text frequency sub-information of the unit keyword according to the unit frequency information of the unit keyword and the frequency with which the unit keyword appears in the text set;

Obtaining the second text frequency sub-information of the unit keyword according to the unit frequency information of the unit keyword in the text units of the text set;

Fusing the first text frequency sub-information and the second text frequency sub-information to obtain the text frequency information of the unit keyword for the text set.

In practical applications, for example, to obtain the text frequency information of the unit keyword "I", the unit frequency information of "I" and the frequency with which "I" appears in the whole book are obtained, and the first text frequency sub-information $KF_{weighted}$ of "I" is calculated from them. Then the second text frequency sub-information of "I" is calculated from the chapters of the whole book in which "I" appears and the unit frequency information of "I" in those chapters. Finally, the first text frequency sub-information and the second text frequency sub-information are fused to obtain the text frequency information $weight_{word}$ of the unit keyword "I" for the whole book.

In one embodiment, a book has multiple chapters, and the unit keywords extracted from these chapters may repeat; such repeated keywords are especially important for the whole book. The accuracy of keyword acquisition can therefore be improved by calculating the first text frequency sub-information of each unit keyword. Specifically, the step "obtaining the first text frequency sub-information of the unit keyword according to the unit frequency information of the unit keyword and the frequency with which the unit keyword appears in the text set" may include:

Fusing the unit frequency information of the unit keyword with the frequency with which the unit keyword appears in the text set to obtain the fusion frequency sub-information of the unit keyword in the text set;

Obtaining the cumulative frequency sub-information of all unit keywords in the text set according to the unit frequency information of each unit keyword in the text set;

Obtaining the first text frequency sub-information of the unit keyword according to the ratio of the fusion frequency sub-information to the cumulative frequency sub-information.

In practical applications, for example, to calculate the first text frequency sub-information of the unit keyword "I", the unit frequency information $weight_{word_i}$ of "I" and the frequency with which "I" appears in the whole book are obtained, and $weight_{word_i}$ is accumulated over the occurrences of "I" in the whole book to obtain the fusion frequency sub-information $\sum_{word_i} weight_{word_i}$ of "I" in the whole book. Then the unit frequency information $weight_{word}$ of every unit keyword in the whole book is obtained and accumulated to give the cumulative frequency sub-information $\sum_{words} weight_{word}$ of all unit keywords in the whole book. The ratio of the fusion frequency sub-information to the cumulative frequency sub-information, $KF_{weighted} = \sum_{word_i} weight_{word_i} / \sum_{words} weight_{word}$, is the first text frequency sub-information of the unit keyword "I".
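The ratio can be sketched as follows, assuming the whole book is represented as one dictionary per chapter mapping each unit keyword to its unit frequency information, and reading "accumulated over the occurrences" as a sum over the chapters in which the keyword is a unit keyword. Both the representation and that reading are assumptions.

```python
def kf_weighted(word, chapter_keyword_weights):
    """First text frequency sub-information of `word` for the whole book.

    chapter_keyword_weights: list of dicts, one per chapter,
    mapping unit keyword -> unit frequency information.
    """
    # Fusion frequency sub-information: the keyword's weights summed over its chapters.
    fused = sum(ch[word] for ch in chapter_keyword_weights if word in ch)
    # Cumulative frequency sub-information: all unit keyword weights in the book.
    total = sum(w for ch in chapter_keyword_weights for w in ch.values())
    return fused / total if total else 0.0
```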

In one embodiment, when the frequency information of a keyword for a text unit is calculated, the weight of each keyword for the text unit frequency information is always 1; when the frequency information of a keyword for the text set is calculated, however, the weight of the keyword for the text set frequency information is not necessarily 1. The accuracy of keyword acquisition can therefore be improved by assigning each keyword a second text frequency sub-information value for the text set. Specifically, the step "obtaining the second text frequency sub-information of the unit keyword according to the unit frequency information of the unit keyword in the text units of the text set" may include:

Obtaining the maximum frequency keyword, that is, the keyword with the largest unit frequency information value, in each text unit of the text set;

Accumulating the unit frequency information of all maximum frequency keywords in the text set to obtain the comprehensive cumulative frequency sub-information;

Selecting, from the multiple text units of the text set, the target text units that contain the unit keyword;

Obtaining the maximum unit frequency information of the unit keyword in each target text unit;

Accumulating the maximum unit frequency information of the unit keyword over all target text units to obtain the specified cumulative frequency sub-information;

Obtaining the second text frequency sub-information of the unit keyword according to the comprehensive cumulative frequency sub-information and the specified cumulative frequency sub-information.

In practical applications, for example, when computing the second text frequency sub-information of the unit keyword "I", the maximum frequency keyword of each chapter of the whole book, that is, the keyword with the largest unit frequency information value in that chapter, and its unit frequency information value max(weight_words) can be obtained. That value is compared with 1, and the larger of the two, max(1.0,max(weight_words)), is kept. The larger values of all chapters of the text set are then accumulated, giving the comprehensive cumulative frequency sub-information ∑_docs max(1.0,max(weight_words)). Next, the target text units including the unit keyword "I" are chosen from the chapters of the whole book, and the maximum unit frequency information max(weight_wordi) of "I" in each target text unit is obtained; these maxima are accumulated over all target text units, giving the specified cumulative frequency sub-information ∑_{docs has wordi} max(weight_wordi). Finally, the second text frequency sub-information corresponding to the unit keyword "I" is computed from the comprehensive cumulative frequency sub-information and the specified cumulative frequency sub-information according to the corresponding formula.
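The two accumulations described above can be sketched as follows. The sketch stops at the two sums, because the formula that combines them into the second text frequency sub-information is not reproduced in this text. The function name `cumulative_sub_information` and the input layout (one weight per keyword per chapter, so that the per-chapter maximum of a keyword is simply its weight) are assumptions.

```python
def cumulative_sub_information(chapter_weights, keyword):
    """Return (comprehensive, specified) cumulative frequency sub-information.

    chapter_weights: list of dicts, one per chapter, mapping a unit keyword
    to its unit frequency information in that chapter (assumed layout).
    """
    # Sum over all chapters of max(1.0, largest weight in the chapter).
    comprehensive = sum(
        max(1.0, max(weights.values()))
        for weights in chapter_weights
        if weights  # skip chapters without any unit keyword
    )
    # Sum over the target chapters (those containing the keyword) of the
    # keyword's own weight in that chapter.
    specified = sum(
        weights[keyword] for weights in chapter_weights if keyword in weights
    )
    return comprehensive, specified


chapters = [{"I": 0.5, "sea": 2.0}, {"I": 0.25}, {"sea": 0.5}]
comprehensive, specified = cumulative_sub_information(chapters, "I")
print(comprehensive, specified)
```

Clamping each chapter's maximum to at least 1.0 means that a chapter whose keywords all have small weights still contributes a full unit to the comprehensive sum.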

206, extracting keywords from the multiple unit keywords of the text set according to the text frequency information.

In practical applications, keywords can be extracted from the multiple unit keywords of the text set according to the text frequency information. For example, a preset number of unit keywords with the largest text frequency information values can be taken as the keywords of the whole book.
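Selecting a preset number of unit keywords with the largest text frequency information can be sketched as below. The function name `extract_keywords` and the default of 10 are illustrative assumptions; the text only speaks of "a preset number".

```python
import heapq


def extract_keywords(text_frequency, preset_number=10):
    """Return the preset_number (keyword, score) pairs with the largest scores.

    text_frequency: dict mapping a unit keyword to its text frequency
    information for the whole book.
    """
    return heapq.nlargest(
        preset_number, text_frequency.items(), key=lambda item: item[1]
    )


scores = {"frogman": 0.039869, "reserve force": 0.035875, "bayonet": 0.015978}
top = extract_keywords(scores, preset_number=2)
print(top)
```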

For example, the keywords obtained for the whole book may be as follows: clock generation: 0.075168, special police team member: 0.059726, winter: 0.046732, frogman: 0.039869, initiates: 0.037946, reserve force: 0.035875, Tang: 0.033632, strand police: 0.032783, qin: 0.030709, sunglasses man: 0.026317, company's equity: 0.025288, macaque hit rifle: 0.024864, tactics knapsack: 0.023070, Party A: 0.022199, partnership: 0.021476, divorce lawyer: 0.020492, training team: 0.020036, exhibitions: 0.020034, cobra: 0.020021, liability insurance: 0.019668, baseball cap man: 0.019668, reconnaissance equipment: 0.019124, photo is public: 0.018494, power grid: 0.017971, administrator: 0.017759, preparation against war: 0.017301, bayonet: 0.015978, application form: 0.015415, emergency ward: 0.015231, patrol police: 0.014060, hunting dog: 0.013307, safety island: 0.012297, first-aid centre: 0.012257, hoarse throat: 0.011931, violates: 0.011926, conservatory of music: 0.011926, cinereous vulture: 0.011550, employing: 0.011215, labor standard law: 0.011134, personal reason: 0.010949, masked man: 0.010912, appoints: 0.010890, poor: 0.010751, investor: 0.010691, 506: 0.010574, obligation: 0.010220, terminates: 0.010217, opening and close: 0.009624, subordinate: 0.009364, hawk: 0.009084, telescope: 0.008983, 56: 0.008353, stock: 0.007776, unit: 0.007724, 0.001: 0.007514, member: 0.006807, cargo ship: 0.006679, goes on patrol: 0.006429, etc. The text frequency information corresponding to each keyword can also be marked after the keyword.

In one embodiment, since candidate keywords can also be obtained by named entity recognition, keyword extraction can also be performed on the candidate keywords obtained after named entity recognition, giving multiple keywords that consist of named entities. For example, ner-Han Guang: 1.053325, ner-Ji Hui: 0.812526, ner-He Shichang: 0.651450, ner-know army: 0.547707, ner-Cai Xiaochun: 0.371123, ner-Lin Rui: 0.201864, ner-Tang Xiaojun: 0.185442, ner-tight woods: 0.146594, ner-Herman: 0.128432, ner-Zhao Baihe: 0.107168, ner-Wang Xin: 0.074263, ner-Zhong Shijia: 0.068198, ner-Feng Yunshan: 0.068186, ner-Qin Wei: 0.065761, ner-field calf: 0.064950, ner-Fukang: 0.063059, ner-black panther: 0.048289, ner-Roy: 0.046849, ner-Xue Gang: 0.044375, ner-Huang hair: 0.042946, ner-Qin secretary: 0.034775, ner-enlightening is special: 0.034694, ner-Ge Tong: 0.034419, ner-surpass: 0.029652, ner-sky: 0.026411, ner-Jiamei: 0.024464, ner-Ma Di: 0.023369, ner-focal length: 0.023190, ner-clock teacher: 0.021387, ner-is permitted lawyer: 0.020115, ner-Wang Bin: 0.019930, ner-term of office: 0.018647, ner-lily: 0.017575, ner-doctor Lin: 0.016030, ner-has a holiday: 0.014477, ner-is diligent: 0.014009, ner-director: 0.013335, ner-eagle: 0.012984, ner-criminal policeman: 0.011864, ner-chief crewman of a wooden boat: 0.011813, ner-is crucial: 0.011650, ner-team leader: 0.008063, etc. Here, ner indicates that the keyword is a named entity, and the text frequency information corresponding to the keyword can be marked after the keyword.

As can be seen from the above, the embodiment of the present application can obtain a text set that includes multiple text units in a certain ordering relation; perform keyword extraction on the text content of each text unit to obtain the candidate keywords corresponding to each text unit; obtain the unit frequency information of the candidate keywords for the text unit according to a text unit metric parameter measuring the importance of the candidate keywords to the text unit; choose, according to the unit frequency information, the unit keywords corresponding to each text unit from the candidate keywords corresponding to that text unit; obtain the text frequency information of the unit keywords for the text set according to a text set metric parameter measuring the importance of the unit keywords to the text set; and extract keywords from the multiple unit keywords of the text set according to the text frequency information. The scheme obtains the frequency information of a keyword at two levels: it first obtains the unit frequency information of the keyword for the text unit, then obtains from it the text frequency information of the keyword for the text set, takes the text frequency information as the importance of the keyword to the entire text set, and obtains keywords from the text set according to that importance, thereby improving the accuracy of keyword extraction in the text set.

Based on the method described in the foregoing embodiments, a further detailed description is given below, taking as an example the case where the keyword extracting device is integrated in a network device.

With reference to Fig. 3, the detailed process of the keyword extracting method of the embodiment of the present application may be as follows:

301, the network device obtains a book to be processed.

In practical applications, for example, as shown in Fig. 5, the network device can obtain a book through a text reading application as the book to be processed. The book to be processed may include multiple chapters. The book to be processed can be denoted by Book and a chapter by Chap, and the relationship between the book to be processed and its chapters can be expressed as Book={Chap₁,Chap₂,...,Chap_n}, where Chap_n denotes the n-th chapter of the book to be processed.

302, the network device performs keyword extraction on the text content of each chapter of the book to be processed to obtain the candidate keywords corresponding to each chapter.

In practical applications, for example, the network device can perform text segmentation on each chapter of the book to be processed. For instance, segmenting the sentence "sometimes it is Jiaxuan, but my favorite word is on the contrary clear, this is Jiang Jie's "corn poppy listens rain"" yields a sequence of word and part-of-speech pairs: "sometimes: d wait: v is: v Jiaxuan: n ,: w but: c I: r most: d likes: v 's: u word: n on the contrary: d clear: a 's: u ,: w this: r is: v Jiang: n Jie: n 's: u ": w anxiety: n beauty: n ·: w listens rain: n ": w". Each chapter of the book to be processed can thus be divided into multiple original keywords, and the relationship between the book to be processed and the segmented chapters can be expressed as Book^seg={Chap₁^seg,Chap₂^seg,...,Chap_n^seg}.

After text segmentation of each chapter, word merging can be performed on the obtained original keywords according to a preset word merging rule. For example, three original keywords "fair and clear: a", "face: n" and "youngster: Ng" may be obtained; according to the preset word merging rule, an adjective, a noun and a nominal morpheme arranged in sequence can be merged, so the three original keywords can be merged into one candidate keyword "fair and clear face". The preset word merging rule has been described above and is not repeated here.
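The merging of consecutive original keywords by part-of-speech pattern can be sketched as below. Only the single pattern named in the example (adjective, noun, nominal morpheme, tagged a, n, Ng) is encoded; the function name `merge_tokens`, the pattern table and the placeholder tokens w1 to w4 are assumptions for illustration, and the full preset rule set is described elsewhere in the application.

```python
# One merge pattern taken from the example: adjective, noun, nominal morpheme.
MERGE_PATTERNS = [("a", "n", "Ng")]


def merge_tokens(tokens, patterns=MERGE_PATTERNS, joiner=""):
    """Merge runs of tokens whose POS tags match a pattern.

    tokens: list of (word, pos_tag) pairs produced by text segmentation.
    Returns a list of words, where each matched run becomes one candidate.
    """
    merged = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            window = tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                merged.append(joiner.join(word for word, _ in window))
                i += len(pattern)
                break
        else:
            # No pattern starts at position i: keep the token unchanged.
            merged.append(tokens[i][0])
            i += 1
    return merged


tokens = [("w1", "a"), ("w2", "n"), ("w3", "Ng"), ("w4", "v")]
print(merge_tokens(tokens))
```

The default joiner is the empty string because Chinese words are concatenated without spaces; a space can be passed for space-delimited languages.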

In actual use, candidate keywords that are too long and candidate keywords that occur too rarely should not enter the subsequent step of obtaining word themes. A candidate keyword that is too long can therefore be split, which can be expressed as word={word_sub,0,word_sub,1,...}. For example, when an original candidate keyword includes 5 original keywords, it can be split into two candidate keywords, which include 2 and 3 original keywords respectively. If a candidate keyword is unsuitable as a keyword because its frequency is too low, it can simply be discarded.
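A minimal sketch of the splitting and discarding steps follows. The thresholds `max_parts` and `min_count` are hypothetical, since the text gives no concrete limits, and splitting at the midpoint is an assumption chosen because it reproduces the 2 and 3 split of the example.

```python
def split_candidate(parts, max_parts=4):
    """Split a candidate made of too many original keywords into two.

    parts: list of the original keywords that make up the candidate.
    max_parts is a hypothetical threshold, not a value from the text.
    """
    if len(parts) <= max_parts:
        return [parts]
    mid = len(parts) // 2  # 5 parts -> 2 and 3, as in the example
    return [parts[:mid], parts[mid:]]


def drop_rare(counts, min_count=2):
    """Discard candidates whose occurrence count is below min_count."""
    return {word: n for word, n in counts.items() if n >= min_count}


print(split_candidate(["a", "b", "c", "d", "e"]))
print(drop_rare({"frogman": 7, "typo": 1}))
```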

303, the network device obtains the unit frequency information of the candidate keywords for the chapters according to a text unit metric parameter measuring the importance of the candidate keywords to the chapters.

In practical applications, for example, the network device can obtain 10000 books for theme acquisition; these 10000 books can be divided into 5000000 chapters and 3000000 distinct candidate keywords. Because the data volume is too large for a single machine to run LDA, the obtained data can be divided into K parts, each division being made by chapter number so that the data of each book is evenly distributed across the parts. The value of K can be adjusted according to the actual situation so that each part can be run on a single machine. Then, through LDA, the keyword theme probability Topic_wordj=[v_j,1,v_j,2,...,v_j,k] of each candidate keyword for the preset themes and the unit theme probability Topic_chapi=[v_i,1,v_i,2,...,v_i,k] of each chapter for the preset themes are obtained. The subject correlation information r_chap,word between a candidate keyword and a chapter can then be calculated by the formula r_i,j=cosine(Topic_chapi,Topic_wordj).
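The cosine similarity used for r_i,j can be sketched in plain Python as follows. The theme probability vectors are assumed to be already available from LDA; the function name `cosine` mirrors the formula, and returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two theme probability vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_u * norm_v)


topic_chap = [0.5, 0.5]   # hypothetical unit theme probability of a chapter
topic_word = [1.0, 0.0]   # hypothetical keyword theme probability
r = cosine(topic_chap, topic_word)
print(r)
```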

The unit frequency sub-information corresponding to a candidate keyword can then be calculated by TF-IDF. When calculating the unit frequency sub-information corresponding to the candidate keyword "I", the number of occurrences #(word_j) of "I" in the chapter and the number #(word in chap) of all candidate keywords in the chapter can be obtained, and the unit word proportion sub-information is calculated from them according to the corresponding formula. Then the candidate target text units including the candidate keyword "I" are chosen from the book to be processed, the number #(chap has word_j) of candidate target text units and the number #(chap) of all chapters in the book to be processed are obtained, and the text unit proportion sub-information is calculated from them according to the corresponding formula. Finally, the unit frequency sub-information w_chap,word of the candidate keyword "I" for the chapter is calculated according to the calculation formula of the unit frequency sub-information.
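Since the formulas themselves are not reproduced in this text, the sketch below uses the textbook form of TF-IDF as a stand-in: term frequency #(word_j) / #(word in chap) multiplied by log(#(chap) / #(chap has word_j)). The unsmoothed logarithm, the function name `tf_idf` and the input layout are assumptions; the embodiment may use a different variant.

```python
import math
from collections import Counter


def tf_idf(chapters):
    """Textbook TF-IDF per chapter (assumed variant, no smoothing).

    chapters: list of lists, each holding the candidate keywords of a chapter
    with repetitions.
    Returns one dict per chapter mapping keyword -> TF-IDF score.
    """
    n_chapters = len(chapters)
    chapters_with = Counter()
    for chapter in chapters:
        chapters_with.update(set(chapter))  # count each chapter once per word

    scores = []
    for chapter in chapters:
        counts = Counter(chapter)
        total = len(chapter)
        scores.append({
            word: (count / total) * math.log(n_chapters / chapters_with[word])
            for word, count in counts.items()
        })
    return scores


scores = tf_idf([["I", "sea", "I"], ["sea"]])
print(scores)
```

In this toy input "sea" appears in every chapter, so its inverse document frequency is log(1) and its score is zero, while "I" is specific to the first chapter.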

Then, according to the word length of the candidate keyword, the word length information weight_len(word) corresponding to the candidate keyword can be obtained from the word length mapping weight_length={1:0.7,2:1.0,3:1.2,4:1.5,5:1.2}.

Finally, according to the formula weight_word=r_chap,word×w_chap,word×weight_len(word), the subject correlation information, the unit frequency sub-information and the word length information are fused to obtain the unit frequency information weight_word of the candidate keyword for the chapter.
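The length lookup and the fusion formula can be sketched together as below. The mapping values come from the text; the function name, measuring length in characters via `len(word)`, and the fallback weight of 1.0 for lengths outside the table are assumptions.

```python
# Word length information table given in the text.
WEIGHT_LENGTH = {1: 0.7, 2: 1.0, 3: 1.2, 4: 1.5, 5: 1.2}


def unit_frequency_information(r_chap_word, w_chap_word, word,
                               default_length_weight=1.0):
    """weight_word = r_chap,word * w_chap,word * weight_len(word).

    default_length_weight is a hypothetical fallback for lengths that the
    table does not cover.
    """
    length_weight = WEIGHT_LENGTH.get(len(word), default_length_weight)
    return r_chap_word * w_chap_word * length_weight


weight = unit_frequency_information(0.5, 0.5, "abcd")
print(weight)
```

For a four-character candidate the table gives 1.5, the largest length weight, so such candidates are boosted relative to shorter and longer ones.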

304, the network device chooses, according to the unit frequency information, the unit keywords corresponding to each chapter from the candidate keywords corresponding to that chapter.

In practical applications, for example, after the unit frequency information weight_word of the candidate keywords for the chapters is obtained, the candidate keywords in each chapter can be sorted in descending order of their unit frequency information values, and the top 20 candidate keywords are taken as the unit keywords corresponding to each chapter.

305, the network device obtains the text frequency information of the unit keywords for the book to be processed according to a text set metric parameter measuring the importance of the unit keywords to the book to be processed.

In practical applications, for example, when computing the first text frequency sub-information of the unit keyword "I", the unit frequency information weight_wordi corresponding to "I" and the occurrences of "I" in the book to be processed can be obtained. The values of weight_wordi are accumulated over the occurrences of "I" in the book to be processed, giving the fusion frequency sub-information ∑_wordi weight_wordi of the unit keyword "I" in the book to be processed. Then the unit frequency information weight_word corresponding to each unit keyword in the book to be processed is obtained, and the unit frequency information of all unit keywords is accumulated, giving the cumulative frequency sub-information ∑_words weight_word of all unit keywords in the book to be processed. The ratio of the fusion frequency sub-information to the cumulative frequency sub-information, KF_weighted = ∑_wordi weight_wordi / ∑_words weight_word, is then taken as the first text frequency sub-information KF_weighted corresponding to the unit keyword "I".

Then, the maximum frequency keyword of each chapter of the book to be processed, that is, the keyword with the largest unit frequency information value in that chapter, and its unit frequency information value max(weight_words) can be obtained. That value is compared with 1, and the larger of the two, max(1.0,max(weight_words)), is kept. The larger values of all chapters of the book to be processed are accumulated, giving the comprehensive cumulative frequency sub-information ∑_docs max(1.0,max(weight_words)). Next, the target text units including the unit keyword "I" are chosen from the chapters of the book to be processed, and the maximum unit frequency information max(weight_wordi) of "I" in each target text unit is obtained; these maxima are accumulated over all target text units, giving the specified cumulative frequency sub-information ∑_{docs has wordi} max(weight_wordi). The second text frequency sub-information corresponding to the unit keyword "I" is then computed according to the calculation formula of the second text frequency sub-information. Finally, the text frequency information weight_word, in the form of a weight, of the unit keyword for the book to be processed is obtained from the two text frequency sub-information items according to the corresponding formula.

306, the network device extracts keywords from the multiple unit keywords of the book to be processed according to the text frequency information.

In practical applications, for example, a preset number of unit keywords with the largest text frequency information values can be taken as the keywords of the whole book. The keywords obtained for the whole book may be as follows: clock generation: 0.075168, special police team member: 0.059726, winter: 0.046732, frogman: 0.039869, initiates: 0.037946, reserve force: 0.035875, Tang: 0.033632, strand police: 0.032783, qin: 0.030709, sunglasses man: 0.026317, company's equity: 0.025288, macaque hit rifle: 0.024864, tactics knapsack: 0.023070, Party A: 0.022199, partnership: 0.021476, divorce lawyer: 0.020492, training team: 0.020036, exhibitions: 0.020034, cobra: 0.020021, liability insurance: 0.019668, baseball cap man: 0.019668, reconnaissance equipment: 0.019124, photo is public: 0.018494, power grid: 0.017971, administrator: 0.017759, preparation against war: 0.017301, bayonet: 0.015978, application form: 0.015415, emergency ward: 0.015231, patrol police: 0.014060, hunting dog: 0.013307, safety island: 0.012297, first-aid centre: 0.012257, hoarse throat: 0.011931, violates: 0.011926, conservatory of music: 0.011926, cinereous vulture: 0.011550, employing: 0.011215, labor standard law: 0.011134, personal reason: 0.010949, masked man: 0.010912, appoints: 0.010890, poor: 0.010751, investor: 0.010691, 506: 0.010574, obligation: 0.010220, terminates: 0.010217, opening and close: 0.009624, subordinate: 0.009364, hawk: 0.009084, telescope: 0.008983, 56: 0.008353, stock: 0.007776, unit: 0.007724, 0.001: 0.007514, member: 0.006807, cargo ship: 0.006679, goes on patrol: 0.006429, etc. The text frequency information corresponding to each keyword can also be marked after the keyword.

In one embodiment, after obtaining the keywords corresponding to the book to be processed, the network device can mark the book in the text reading application with the obtained keywords, so that a user can learn the content, theme and other information of the book from the keywords marked on it in the text reading application and then decide whether to read it.

In one embodiment, the text reading application can also obtain the history read books that the user has read before and the keywords corresponding to those books, and recommend to the user books whose keywords are similar or identical to the keywords of the history read books. As shown in Fig. 6, when the user searches for books through the text reading application, the covers of recommended books can be shown on the interface of the application, and the user can start reading by clicking the cover of a book of interest. When the recommended books shown on the interface do not meet the user's reading needs, the user can also click the "change a batch" button on the interface to switch the recommended books. In this way, books that may interest the user are recommended according to the user's browsing history, and new books with few readers can also be recommended through their keywords, which effectively improves the recall rate of new books. The recommended books on the interface of the text reading application may be updated once every other day so that the user sees different recommended books from day to day.

In one embodiment, the text reading application can also mark, next to a recommended book, which specific history read book the recommendation is based on. For example, as shown in Fig. 7, the text reading application can mark beside recommended book 1 that book 1 is recommended according to book a; when the user wishes to read books similar to book a, the corresponding book 1 can be found quickly, which improves the accuracy and efficiency with which the user finds books. Book 1 recommended according to book a may be a book that has keywords similar to those of book a; for example, book 1 may be the book that shares the most identical keywords with book a, or the book that shares the most similar keywords with book a.
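Recommending the book that shares the most identical keywords with a history read book can be sketched as follows. The function name `recommend`, the catalog layout and the tie-break by title are assumptions for illustration; matching of merely similar keywords is not covered by this sketch.

```python
def recommend(history_keywords, catalog, top_n=1):
    """Rank catalog books by the number of keywords shared with a read book.

    history_keywords: set of keywords of the history read book (book a).
    catalog: dict mapping a book title to its set of keywords.
    Books that share no keyword are left out; ties are broken by title.
    """
    scored = [
        (len(history_keywords & keywords), title)
        for title, keywords in catalog.items()
    ]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored[:top_n]]


book_a = {"frogman", "bayonet", "hunting dog"}
catalog = {
    "book 1": {"frogman", "bayonet"},
    "book 2": {"bayonet"},
    "book 3": {"stock"},
}
print(recommend(book_a, catalog, top_n=2))
```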

As can be seen from the above, in the embodiment of the present application the network device can obtain a book to be processed; perform keyword extraction on the text content of each chapter of the book to be processed to obtain the candidate keywords corresponding to each chapter; obtain the unit frequency information of the candidate keywords for the chapters according to a text unit metric parameter measuring the importance of the candidate keywords to the chapters; choose, according to the unit frequency information, the unit keywords corresponding to each chapter from the candidate keywords corresponding to that chapter; obtain the text frequency information of the unit keywords for the book to be processed according to a text set metric parameter measuring the importance of the unit keywords to the book to be processed; and extract keywords from the multiple unit keywords of the book to be processed according to the text frequency information. The scheme obtains the frequency information of a keyword at two levels: it first obtains the unit frequency information of the keyword for the text unit, then obtains from it the text frequency information of the keyword for the text set, takes the text frequency information as the importance of the keyword to the entire text set, and obtains keywords from the text set according to that importance, thereby improving the accuracy of keyword extraction in the text set.

With reference to Fig. 4, the detailed process of the keyword extracting method of the embodiment of the present application may be as follows:

401, the network device obtains a book to be processed.

In practical applications, for example, the network device can obtain a book as the book to be processed. The book may include multiple chapters. The book to be processed can be denoted by Book and a chapter by Chap, and the relationship between the book to be processed and its chapters can be expressed as Book={Chap₁,Chap₂,...,Chap_n}, where Chap_n denotes the n-th chapter of the book to be processed.

402, the network device extracts keywords from the text content of each chapter of the book to be processed by named entity recognition to obtain the candidate keywords corresponding to each chapter.

In practical applications, for example, the network device can perform text segmentation on each chapter of the book to be processed, dividing each chapter into multiple original keywords. The relationship between the book to be processed and the segmented chapters can be expressed as Book^seg={Chap₁^seg,Chap₂^seg,...,Chap_n^seg}.

After text segmentation of each chapter, each original keyword can be annotated by named entity recognition according to its word features, that is, each original keyword is given a label, and candidate keywords are extracted from the original keywords according to their labels. At this point, the relationship between the book to be processed and its chapters can be expressed as Book^ner={Chap₁^ner,Chap₂^ner,...,Chap_n^ner}.
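Extracting candidate keywords from labelled original keywords can be sketched as below. The labelling itself is assumed to come from a named entity recognizer that is not shown; the label names PER, LOC, ORG and O are hypothetical, while the "ner-" prefix follows the example output given in this description.

```python
# Hypothetical label set: labels treated as named entities.
ENTITY_LABELS = frozenset({"PER", "LOC", "ORG"})


def entity_candidates(labelled_tokens, entity_labels=ENTITY_LABELS):
    """Keep only the original keywords whose label marks a named entity.

    labelled_tokens: iterable of (original keyword, label) pairs.
    Each kept keyword is prefixed with "ner-" to mark it as a named entity.
    """
    return [
        "ner-" + word
        for word, label in labelled_tokens
        if label in entity_labels
    ]


labelled = [("Han Guang", "PER"), ("said", "O"), ("bayonet", "O")]
print(entity_candidates(labelled))
```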

403, the network device obtains the unit frequency information of the candidate keywords for the chapters according to a text unit metric parameter measuring the importance of the candidate keywords to the chapters.

The unit frequency sub-information corresponding to a candidate keyword can be calculated by TF-IDF. When calculating the unit frequency sub-information corresponding to the candidate keyword "China", the number of occurrences #(word_j) of "China" in the chapter and the number #(word in chap) of all candidate keywords in the chapter can be obtained, and the unit word proportion sub-information is calculated from them according to the corresponding formula. Then the candidate target text units including the candidate keyword "China" are chosen from the book to be processed, the number #(chap has word_j) of candidate target text units and the number #(chap) of all chapters in the book to be processed are obtained, and the text unit proportion sub-information is calculated from them according to the corresponding formula. Finally, the unit frequency sub-information of the candidate keyword "China" for the chapter is calculated according to the calculation formula of the unit frequency sub-information.

404, the network device chooses, according to the unit frequency information, the unit keywords corresponding to each chapter from the candidate keywords corresponding to that chapter.

405, the network device obtains the text frequency information of the unit keywords for the book to be processed according to a text set metric parameter measuring the importance of the unit keywords to the book to be processed.

In practical applications, for example, when computing the first text frequency sub-information of the unit keyword "China", the unit frequency information weight_wordi corresponding to "China" and the occurrences of "China" in the book to be processed can be obtained. The values of weight_wordi are accumulated over the occurrences of "China" in the book to be processed, giving the fusion frequency sub-information ∑_wordi weight_wordi of the unit keyword "China" in the book to be processed. Then the unit frequency information weight_word corresponding to each unit keyword in the book to be processed is obtained, and the unit frequency information of all unit keywords is accumulated, giving the cumulative frequency sub-information ∑_words weight_word of all unit keywords in the book to be processed. The ratio of the fusion frequency sub-information to the cumulative frequency sub-information, KF_weighted = ∑_wordi weight_wordi / ∑_words weight_word, is then taken as the first text frequency sub-information KF_weighted corresponding to the unit keyword "China".

Then, the maximum frequency keyword of each chapter of the book to be processed, that is, the keyword with the largest unit frequency information value in that chapter, and its unit frequency information value max(weight_words) can be obtained. That value is compared with 1, and the larger of the two, max(1.0,max(weight_words)), is kept. The larger values of all chapters of the book to be processed are accumulated, giving the comprehensive cumulative frequency sub-information ∑_docs max(1.0,max(weight_words)). Next, the target text units including the unit keyword "China" are chosen from the chapters of the book to be processed, and the maximum unit frequency information max(weight_wordi) of "China" in each target text unit is obtained; these maxima are accumulated over all target text units, giving the specified cumulative frequency sub-information ∑_{docs has wordi} max(weight_wordi). The second text frequency sub-information corresponding to the unit keyword "China" is then computed according to the calculation formula of the second text frequency sub-information. Finally, the text frequency information weight_word, in the form of a weight, of the unit keyword for the book to be processed is obtained from the two text frequency sub-information items according to the corresponding formula.

406, the network device extracts keywords from the multiple unit keywords of the book to be processed according to the text frequency information.

In practical applications, for example, a preset number of unit keywords with the largest text frequency information values can be taken as the keywords of the whole book. The keywords obtained for the whole book may be as follows: ner-Han Guang: 1.053325, ner-Ji Hui: 0.812526, ner-He Shichang: 0.651450, ner-know army: 0.547707, ner-Cai Xiaochun: 0.371123, ner-Lin Rui: 0.201864, ner-Tang Xiaojun: 0.185442, ner-tight woods: 0.146594, ner-Herman: 0.128432, ner-Zhao Baihe: 0.107168, ner-Wang Xin: 0.074263, ner-Zhong Shijia: 0.068198, ner-Feng Yunshan: 0.068186, ner-Qin Wei: 0.065761, ner-field calf: 0.064950, ner-Fukang: 0.063059, ner-black panther: 0.048289, ner-Roy: 0.046849, ner-Xue Gang: 0.044375, ner-Huang hair: 0.042946, ner-Qin secretary: 0.034775, ner-enlightening is special: 0.034694, ner-Ge Tong: 0.034419, ner-surpass: 0.029652, ner-sky: 0.026411, ner-Jiamei: 0.024464, ner-Ma Di: 0.023369, ner-focal length: 0.023190, ner-clock teacher: 0.021387, ner-is permitted lawyer: 0.020115, ner-Wang Bin: 0.019930, ner-term of office: 0.018647, ner-lily: 0.017575, ner-doctor Lin: 0.016030, ner-has a holiday: 0.014477, ner-is diligent: 0.014009, ner-director: 0.013335, ner-eagle: 0.012984, ner-criminal policeman: 0.011864, ner-chief crewman of a wooden boat: 0.011813, ner-is crucial: 0.011650, ner-team leader: 0.008063, etc. Here, ner indicates that the keyword is a named entity, and the text frequency information corresponding to the keyword can be marked after the keyword.

As can be seen from the above, in the embodiment of the present application the network device can obtain a book to be processed; extract keywords from the text content of each chapter of the book to be processed by named entity recognition to obtain the candidate keywords corresponding to each chapter; obtain the unit frequency information of the candidate keywords for the chapters according to a text unit metric parameter measuring the importance of the candidate keywords to the chapters; choose, according to the unit frequency information, the unit keywords corresponding to each chapter from the candidate keywords corresponding to that chapter; obtain the text frequency information of the unit keywords for the book to be processed according to a text set metric parameter measuring the importance of the unit keywords to the book to be processed; and extract keywords from the multiple unit keywords of the book to be processed according to the text frequency information. The scheme obtains the frequency information of a keyword at two levels: it first obtains the unit frequency information of the keyword for the text unit, then obtains from it the text frequency information of the keyword for the text set, takes the text frequency information as the importance of the keyword to the entire text set, and obtains keywords from the text set according to that importance, thereby improving the accuracy of keyword extraction in the text set.

To better implement the above method, the embodiment of the present application further provides a keyword extracting device. The keyword extracting device may be integrated in a network device, and the network device may include a server, a terminal, and the like, where the terminal may include a mobile phone, a tablet computer, a notebook computer, a personal computer (PC), and the like.

For example, as shown in Figure 8, the keyword extracting device may include a text set obtaining module 81, a first extraction module 82, a first information obtaining module 83, a selection module 84, a second information obtaining module 85, and a second extraction module 86, as follows:

The text set obtaining module 81 is configured to obtain a text set, the text set including multiple text units in a certain ordering relation;

The first extraction module 82 is configured to perform keyword extraction on the text content of each text unit, to obtain the candidate keywords corresponding to each text unit;

The first information obtaining module 83 is configured to obtain, according to the text unit metric parameter measuring the importance of the candidate keyword to the text unit, the unit frequency information of the candidate keyword for the text unit;

The selection module 84 is configured to select, according to the unit frequency information, the unit keyword corresponding to each text unit from the candidate keywords corresponding to each text unit;

The second information obtaining module 85 is configured to obtain, according to the text set metric parameter measuring the importance of the unit keyword to the text set, the text frequency information of the unit keyword for the text set;

The second extraction module 86 is configured to extract keywords from the multiple unit keywords of the text set according to the text frequency information.

In one embodiment, the first information obtaining module 83 may include a unit theme probability acquisition submodule 831, a relevance information acquisition submodule 832, a unit frequency sub-information acquisition submodule 833, a word length information acquisition submodule 834, and a unit frequency information acquisition submodule 835, as follows:

The unit theme probability acquisition submodule 831 is configured to obtain the keyword theme probability of each candidate keyword for the preset theme and the unit theme probability of each text unit for the preset theme;

The relevance information acquisition submodule 832 is configured to obtain the theme relevance information between the candidate keyword and the text unit according to the keyword theme probability corresponding to the candidate keyword and the unit theme probability corresponding to the text unit in which the candidate keyword is located;

The unit frequency sub-information acquisition submodule 833 is configured to obtain the unit frequency sub-information of the candidate keyword for the text unit according to the number of times the candidate keyword occurs in the corresponding text unit;

The word length information acquisition submodule 834 is configured to obtain the word length information corresponding to the candidate keyword based on the word length of the candidate keyword;

The unit frequency information acquisition submodule 835 is configured to fuse the theme relevance information, the unit frequency sub-information, and the word length information, to obtain the unit frequency information of the candidate keyword for the text unit.
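
As a rough illustration of how submodules 832 to 835 could cooperate, the following Python sketch computes the three quantities and fuses them. The concrete formulas are assumptions made for this sketch only: the dot product used for theme relevance, the TF-IDF-like combination used for the unit frequency sub-information, the logarithmic word length term, and the product used as the fusion. All function names are hypothetical.

```python
import math


def theme_relevance(keyword_theme_probs, unit_theme_probs):
    """Assumed form: dot product of the two per-theme probability vectors."""
    return sum(k * u for k, u in zip(keyword_theme_probs, unit_theme_probs))


def unit_frequency_sub_info(keyword, unit_tokens, all_units):
    """Assumed TF-IDF-like form; all_units is a list of token lists.

    The keyword is assumed to occur in at least one unit of all_units.
    """
    # Share of the unit's tokens that are this keyword.
    word_share = unit_tokens.count(keyword) / len(unit_tokens)
    # Number of text units that contain the keyword.
    containing = sum(1 for unit in all_units if keyword in unit)
    return word_share * math.log(1 + len(all_units) / containing)


def word_length_info(keyword):
    """Assumed form: longer words receive a mildly larger weight."""
    return math.log(1 + len(keyword))


def unit_frequency_info(keyword, unit_tokens, all_units,
                        keyword_theme_probs, unit_theme_probs):
    """Assumed fusion: product of the three components."""
    return (theme_relevance(keyword_theme_probs, unit_theme_probs)
            * unit_frequency_sub_info(keyword, unit_tokens, all_units)
            * word_length_info(keyword))
```

A product makes a candidate score low whenever any one component is low; a weighted sum would be an equally plausible fusion and would let a strong component compensate for a weak one.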

In one embodiment, the unit frequency sub-information acquisition submodule 833 specifically can be used for:

In one embodiment, the unit theme probability acquisition submodule 831 can be specifically used for:

In one embodiment, the second information obtaining module 85 may include a first text frequency sub-information acquisition submodule 851, a second text frequency sub-information acquisition submodule 852, and a text frequency information acquisition submodule 853, as follows:

The first text frequency sub-information acquisition submodule 851 is configured to obtain the first text frequency sub-information corresponding to the unit keyword according to the unit frequency information of the unit keyword and the number of times the unit keyword occurs in the text set;

The second text frequency sub-information acquisition submodule 852 is configured to obtain the second text frequency sub-information corresponding to the unit keyword according to the unit frequency information corresponding to the unit keyword in the text units of the text set;

The text frequency information acquisition submodule 853 is configured to fuse the first text frequency sub-information and the second text frequency sub-information, to obtain the text frequency information of the unit keyword for the text set.
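
The cooperation of submodules 851 to 853 can be sketched as follows. This is one possible reading, not the formula of the embodiment: the sketch assumes that the first sub-information is the keyword's summed unit score multiplied by its occurrence count and divided by the summed unit scores of all unit keywords, that the second sub-information is the keyword's summed unit score divided by the sum of the highest unit score in each text unit, and that the fusion is a plain sum. The names `unit_scores` and `occurrences` are hypothetical.

```python
def first_sub_info(keyword, unit_scores, occurrences):
    """unit_scores: one {unit keyword: unit frequency information} dict per text unit."""
    # Fuse the keyword's unit scores with how often it occurs in the text set.
    fused = sum(s.get(keyword, 0.0) for s in unit_scores) * occurrences[keyword]
    # Accumulate the unit scores of all unit keywords in the text set.
    cumulative = sum(value for s in unit_scores for value in s.values())
    return fused / cumulative


def second_sub_info(keyword, unit_scores):
    # Sum of the keyword's unit scores over the units where it was selected.
    specified = sum(s[keyword] for s in unit_scores if keyword in s)
    # Sum of the highest unit score in each non-empty text unit.
    comprehensive = sum(max(s.values()) for s in unit_scores if s)
    return specified / comprehensive


def text_frequency_info(keyword, unit_scores, occurrences):
    """Assumed fusion: sum of the two sub-information values."""
    return (first_sub_info(keyword, unit_scores, occurrences)
            + second_sub_info(keyword, unit_scores))
```

Sorting the unit keywords by `text_frequency_info` in descending order and keeping a preset number of them corresponds to the final extraction performed by the second extraction module 86.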

In one embodiment, the first text frequency sub-information acquisition submodule 851 can be specifically used for:

In one embodiment, the second text frequency sub-information acquisition submodule 852 can be specifically used for:

In one embodiment, the first extraction module 82 may include an original keyword acquisition submodule 821 and a candidate keyword acquisition submodule 822, as follows:

The original keyword acquisition submodule 821 is configured to perform text word segmentation on the text content of each text unit, dividing the text content of the text unit into multiple original keywords;

The candidate keyword acquisition submodule 822 is configured to merge the original keywords according to a preset word merging rule, to obtain the candidate keywords corresponding to each text unit.
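
A minimal Python sketch of the segmentation-then-merge step follows. The preset word merging rule is not specified in this passage, so the sketch assumes a hypothetical rule expressed as a set of adjacent token pairs (`merge_pairs`) that should be joined, and it uses whitespace splitting as a stand-in for a real word segmenter.

```python
def merge_tokens(tokens, merge_pairs):
    """Join adjacent tokens whose pair appears in merge_pairs (assumed rule)."""
    merged = []
    i = 0
    while i < len(tokens):
        # Merge this token with the next one when the pair matches the rule.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in merge_pairs:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Usage: whitespace splitting stands in for word segmentation.
original_keywords = "Han Guang met Lin Rui".split()
candidates = merge_tokens(original_keywords, {("Han", "Guang"), ("Lin", "Rui")})
```

Merging before scoring lets a multi-token name such as a surname and given name be counted as one candidate rather than as two unrelated tokens.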

In one embodiment, the candidate keywords acquisition submodule 822 can be specifically used for:

In one embodiment, the keyword extracting device may further include a labeling module 87 and a keyword extraction module 88, as follows:

The labeling module 87 is configured to label the original keywords in the text content according to the word features of each original keyword;

The keyword extraction module 88 is configured to extract candidate keywords from the original keywords of the text content according to the labels of the original keywords.
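
The labeling and label-based extraction can be sketched in Python as below. The word features and the labeling model are not described in this passage, so the sketch assumes a hypothetical lookup in an entity lexicon as the only feature, and uses the label `ner` for named entities in line with the keyword list shown earlier.

```python
def label_tokens(tokens, entity_lexicon):
    """Label each token 'ner' if it is in the lexicon, else 'other' (assumed feature)."""
    return [(token, "ner" if token in entity_lexicon else "other") for token in tokens]


def extract_by_label(labeled_tokens, wanted_labels=("ner",)):
    """Keep only tokens whose label is wanted, formatted as 'label-token'."""
    return [f"{label}-{token}" for token, label in labeled_tokens
            if label in wanted_labels]


# Usage with a hypothetical lexicon of character names.
labeled = label_tokens(["Han Guang", "met", "Lin Rui"], {"Han Guang", "Lin Rui"})
candidates = extract_by_label(labeled)
```

In practice the label would come from a named entity recognition model rather than a fixed lexicon; the lexicon keeps the sketch self-contained.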

In specific implementation, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each unit, refer to the foregoing method embodiments; details are not repeated here.

As can be seen from the above, in the keyword extracting device of this embodiment, the text set obtaining module 81 obtains a text set that includes multiple text units in a certain ordering relation; the first extraction module 82 performs keyword extraction on the text content of each text unit, to obtain the candidate keywords corresponding to each text unit; the first information obtaining module 83 obtains, according to the text unit metric parameter measuring the importance of a candidate keyword to the text unit, the unit frequency information of the candidate keyword for the text unit; the selection module 84 selects, according to the unit frequency information, the unit keyword corresponding to each text unit from the candidate keywords corresponding to each text unit; the second information obtaining module 85 obtains, according to the text set metric parameter measuring the importance of the unit keyword to the text set, the text frequency information of the unit keyword for the text set; and the second extraction module 86 extracts keywords from the multiple unit keywords of the text set according to the text frequency information. This scheme obtains the frequency information of a keyword at two levels: it first obtains the unit frequency information of the keyword for a text unit, and then, based on that unit frequency information, obtains the text frequency information of the keyword for the text set. The text frequency information is taken as the importance of the keyword to the entire text set, and keywords are obtained from the text set according to this importance, which improves the accuracy of keyword extraction in the text set.

The embodiment of the present application further provides a network device, which may integrate any keyword extracting device provided by the embodiments of the present application.

For example, Figure 9 shows a schematic structural diagram of the network device involved in the embodiment of the present application. Specifically:

The network device may include a processor 901 with one or more processing cores, a memory 902 with one or more computer-readable storage media, a power supply 903, an input unit 904, and other components. Those skilled in the art will understand that the network device structure shown in Figure 9 does not limit the network device, which may include more or fewer components than shown, combine certain components, or use a different component arrangement. Specifically:

The processor 901 is the control center of the network device. It connects the various parts of the entire network device through various interfaces and lines, and performs the various functions of the network device and processes data by running or executing the software programs and/or modules stored in the memory 902 and calling the data stored in the memory 902, thereby monitoring the network device as a whole. Optionally, the processor 901 may include one or more processing cores; preferably, the processor 901 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 901.

The memory 902 may be used to store software programs and modules, and the processor 901 executes various functional applications and data processing by running the software programs and modules stored in the memory 902. The memory 902 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, the application programs required for at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created through use of the network device, and the like. In addition, the memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, the memory 902 may also include a memory controller to provide the processor 901 with access to the memory 902.

The network device further includes a power supply 903 that supplies power to each component. Preferably, the power supply 903 may be logically connected to the processor 901 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The power supply 903 may also include any components such as one or more direct-current or alternating-current power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.

The network device may further include an input unit 904, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.

Although not shown, the network device may further include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 901 in the network device loads the executable files corresponding to the processes of one or more application programs into the memory 902 according to the following instructions, and the processor 901 runs the application programs stored in the memory 902 to implement various functions, as follows:

Obtain a text set, the text set including multiple text units in a certain ordering relation; perform keyword extraction on the text content of each text unit, to obtain the candidate keywords corresponding to each text unit; obtain, according to the text unit metric parameter measuring the importance of a candidate keyword to the text unit, the unit frequency information of the candidate keyword for the text unit; select, according to the unit frequency information, the unit keyword corresponding to each text unit from the candidate keywords corresponding to each text unit; obtain, according to the text set metric parameter measuring the importance of the unit keyword to the text set, the text frequency information of the unit keyword for the text set; and extract keywords from the multiple unit keywords of the text set according to the text frequency information.

For the specific implementation of each of the above operations, refer to the foregoing embodiments; details are not repeated here.

Those of ordinary skill in the art will understand that all or some of the steps in the various methods of the above embodiments may be completed by instructions, or by instructions controlling relevant hardware. The instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, the embodiment of the present application provides a storage medium storing a plurality of instructions that can be loaded by a processor to execute the steps of any keyword extracting method provided by the embodiments of the present application. For example, the instructions may execute the following steps:

The storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.

Since the instructions stored in the storage medium can execute the steps of any keyword extracting method provided by the embodiments of the present application, they can achieve the beneficial effects achievable by any such method. For details, see the foregoing embodiments; they are not repeated here.

The keyword extracting method and device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A keyword extracting method, characterized by comprising:

performing keyword extraction on the text content of each text unit, to obtain candidate keywords corresponding to each text unit;

obtaining, according to a text unit metric parameter measuring the importance of the candidate keyword to the text unit, unit frequency information of the candidate keyword for the text unit;

selecting, according to the unit frequency information, a unit keyword corresponding to each text unit from the candidate keywords corresponding to each text unit;

obtaining, according to a text set metric parameter measuring the importance of the unit keyword to the text set, text frequency information of the unit keyword for the text set;

2. The keyword extracting method according to claim 1, characterized in that the text unit metric parameter includes theme relevance information, unit frequency sub-information, and word length information;

obtaining, according to the text unit metric parameter measuring the importance of the candidate keyword to the text unit, the unit frequency information of the candidate keyword for the text unit comprises:

obtaining a keyword theme probability of each candidate keyword for a preset theme and a unit theme probability of each text unit for the preset theme;

obtaining the theme relevance information between the candidate keyword and the text unit according to the keyword theme probability corresponding to the candidate keyword and the unit theme probability corresponding to the text unit in which the candidate keyword is located;

obtaining the unit frequency sub-information of the candidate keyword for the text unit according to the number of times the candidate keyword occurs in the corresponding text unit;

fusing the theme relevance information, the unit frequency sub-information, and the word length information, to obtain the unit frequency information of the candidate keyword for the text unit.

3. The keyword extracting method according to claim 2, characterized in that obtaining the unit frequency sub-information of the candidate keyword for the text unit according to the number of times the candidate keyword occurs in the corresponding text unit comprises:

obtaining unit word proportion sub-information, namely the proportion of the number of times the candidate keyword occurs in the text unit to the number of all candidate keywords in the text unit;

obtaining text unit proportion sub-information, namely the proportion of the number of candidate target text units to the number of all text units in the text set;

obtaining the unit frequency sub-information of the candidate keyword for the text unit based on the unit word proportion sub-information and the text unit proportion sub-information.

4. The keyword extracting method according to claim 2, characterized in that obtaining the keyword theme probability of each candidate keyword for the preset theme and the unit theme probability of each text unit for the preset theme comprises:

determining an initial keyword theme probability of each candidate keyword for the preset theme and an initial unit theme probability of each text unit for the preset theme;

sampling the theme probability distribution by a preset sampling algorithm based on the initial keyword theme probability and the initial unit theme probability, to obtain the keyword theme probability of each candidate keyword for the preset theme and the unit theme probability of each text unit for the preset theme;

when the keyword theme probability and the unit theme probability satisfy a probability adjustment condition, adjusting the initial keyword theme probability to the keyword theme probability, and adjusting the initial unit theme probability to the unit theme probability;

returning to the step of sampling the theme probability distribution by the preset sampling algorithm based on the initial keyword theme probability and the initial unit theme probability, to obtain the keyword theme probability of each candidate keyword for the preset theme and the unit theme probability of each text unit for the preset theme;

when the keyword theme probability and the unit theme probability do not satisfy the probability adjustment condition, obtaining the keyword theme probability of each candidate keyword for the preset theme and the unit theme probability of each text unit for the preset theme.

5. The keyword extracting method according to claim 1, characterized in that the text set metric parameter includes first text frequency sub-information and second text frequency sub-information;

obtaining, according to the text set metric parameter measuring the importance of the unit keyword to the text set, the text frequency information of the unit keyword for the text set comprises:

obtaining the first text frequency sub-information corresponding to the unit keyword according to the unit frequency information of the unit keyword and the number of times the unit keyword occurs in the text set;

obtaining the second text frequency sub-information corresponding to the unit keyword according to the unit frequency information corresponding to the unit keyword in the text units of the text set;

fusing the first text frequency sub-information and the second text frequency sub-information, to obtain the text frequency information of the unit keyword for the text set.

6. The keyword extracting method according to claim 5, characterized in that obtaining the first text frequency sub-information corresponding to the unit keyword according to the unit frequency information of the unit keyword and the number of times the unit keyword occurs in the text set comprises:

fusing the unit frequency information of the unit keyword with the number of times the unit keyword occurs in the text set, to obtain fused frequency sub-information of the unit keyword in the text set;

obtaining cumulative frequency sub-information of all unit keywords in the text set according to the unit frequency information of each unit keyword in the text set;

obtaining the first text frequency sub-information corresponding to the unit keyword according to the ratio of the fused frequency sub-information to the cumulative frequency sub-information.

7. The keyword extracting method according to claim 5, characterized in that obtaining the second text frequency sub-information corresponding to the unit keyword according to the unit frequency information corresponding to the unit keyword in the text units of the text set comprises:

accumulating the unit frequency information corresponding to all maximum-frequency keywords in the text set, to obtain comprehensive cumulative frequency sub-information;

accumulating the maximum unit frequency information corresponding to the unit keyword in all target text units, to obtain specified cumulative frequency sub-information;

obtaining the second text frequency sub-information corresponding to the unit keyword according to the comprehensive cumulative frequency sub-information and the specified cumulative frequency sub-information.

8. The keyword extracting method according to claim 1, characterized in that performing keyword extraction on the text content of each text unit, to obtain the candidate keywords corresponding to each text unit, comprises:

merging the original keywords according to a preset word merging rule, to obtain original candidate keywords corresponding to each text unit;

when an original candidate keyword does not satisfy a preset splitting condition, determining the original candidate keyword as a candidate keyword.

9. The keyword extracting method according to claim 8, characterized in that, after performing text word segmentation on the text content of each text unit and dividing the text content of the text unit into multiple original keywords, the method further comprises:

extracting candidate keywords from the original keywords of the text content according to the labels of the original keywords.

10. A keyword extracting device, characterized by comprising:

a text set obtaining module, configured to obtain a text set, the text set including multiple text units in a certain ordering relation;

a first extraction module, configured to perform keyword extraction on the text content of each text unit, to obtain candidate keywords corresponding to each text unit;

a first information obtaining module, configured to obtain, according to a text unit metric parameter measuring the importance of the candidate keyword to the text unit, unit frequency information of the candidate keyword for the text unit;

a selection module, configured to select, according to the unit frequency information, a unit keyword corresponding to each text unit from the candidate keywords corresponding to each text unit;

a second information obtaining module, configured to obtain, according to a text set metric parameter measuring the importance of the unit keyword to the text set, text frequency information of the unit keyword for the text set;

a second extraction module, configured to extract keywords from the multiple unit keywords of the text set according to the text frequency information.