CN110287493A - Risk phrase chunking method, apparatus, electronic equipment and storage medium - Google Patents

Publication number
CN110287493A
CN110287493A
Authority
CN
China
Prior art keywords
phrase
risk
text
scheduled
phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910580521.1A
Other languages
Chinese (zh)
Other versions
CN110287493B (en)
Inventor
高影繁
刘志辉
姚长青
李岩
崔笛
郑明�
浦墨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA filed Critical INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority to CN201910580521.1A priority Critical patent/CN110287493B/en
Publication of CN110287493A publication Critical patent/CN110287493A/en
Application granted granted Critical
Publication of CN110287493B publication Critical patent/CN110287493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The embodiments of the present application provide a risk phrase chunking method, apparatus, electronic device and storage medium, relating to the field of text-processing technology. The method comprises: performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list; processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list; and then merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases. The method of the embodiments can identify risk phrases quickly and accurately; the identified phrases are more comprehensive and carry more information, and can reveal the content of risk topics well.

Description

Risk phrase chunking method, apparatus, electronic equipment and storage medium
Technical field
The present application relates to the field of text-processing technology, and in particular to a risk phrase chunking method, apparatus, electronic device and storage medium.
Background technique
Risk information comprises the forward-looking, decision-relevant judgments and warnings that an enterprise makes about existing or potential factors affecting its survival and development, based on its external environment (politics, economy, society, market and so on) combined with its internal environment (finance, management and the like). Risk information helps alleviate information asymmetry and improves the transparency of a company's production and operations; particularly for risk early warning, the value content of risk information is higher than that of general voluntarily disclosed information.
At present, the key-phrase extraction results of the prior art tend toward single vocabulary items with short word length, so the effect of risk phrase recognition is poor: topics cannot be revealed well, a large amount of semantic content is lost, and the content of risk topics cannot be characterized adequately.
Summary of the invention
The present application provides a risk phrase chunking method, apparatus, electronic device and storage medium, to solve the technical problem that existing risk phrase recognition on risk description texts is ineffective.
In a first aspect, a risk phrase chunking method is provided, the method comprising:
performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list;
processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list;
merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
On the basis of the above technical solution, the risk description text is filtered based on a predetermined filtering rule;
part-of-speech tagging is performed on the filtered risk description text, and words of predetermined parts of speech are screened to form a text to be recognized;
word strings whose frequency of occurrence in the text to be recognized is greater than a preset quantity threshold are counted as candidate phrases;
risk phrases are picked out from the candidate phrases using the predetermined phrase chunking algorithm.
On the basis of the above technical solution, the predetermined filtering rule comprises: filtering stop words according to a predetermined stop-word list;
the screening of words of predetermined parts of speech comprises:
screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text.
On the basis of the above technical solution, picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm comprises:
calculating the mutual-information value of each candidate phrase using mutual information;
calculating the left entropy and right entropy of each candidate phrase using left-right entropy;
calculating the weighted value of each candidate phrase according to a predetermined weighting algorithm based on the statistical magnitudes of each candidate phrase, the statistical magnitudes comprising either the mutual-information value, the left entropy, the right entropy and the frequency of occurrence of the candidate phrase in the text to be recognized, or the mutual-information value, left entropy and right entropy alone;
selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weighted value of each candidate phrase.
On the basis of the above technical solution, the predetermined selection rule comprises:
sorting the weighted values of the candidate phrases from largest to smallest and choosing the top preset number of candidate phrases as risk phrases; or,
taking a candidate phrase as a risk phrase when its weighted value is not less than a preset threshold.
On the basis of the above technical solution, processing the risk description text with the predetermined word-segmentation tool comprises:
combining the words in the risk description text to form phrases to be matched;
querying the phrases to be matched against the predetermined lexicon of the word-segmentation tool, and determining the phrases that match vocabulary in the predetermined lexicon;
filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list.
On the basis of the above technical solution, the predetermined filtering rule comprises at least one of the following:
filtering single characters; filtering numbers; filtering phrases whose number of constituent characters is less than a predetermined value.
On the basis of the above technical solution, before performing phrase recognition on the risk description text using the predetermined phrase chunking algorithm, the method further comprises:
performing recognition processing on a predetermined text by paragraph, and extracting the paragraphs containing risk descriptions as the risk description text.
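The patent does not specify how a paragraph is recognised as containing a risk description; the sketch below uses an illustrative keyword test as a stand-in, with the function name and keywords being assumptions:

```python
def extract_risk_paragraphs(document: str, keywords=("风险", "不确定")) -> list:
    """Split a text into paragraphs and keep those that look like risk
    descriptions. The keyword test stands in for whatever paragraph
    classifier the predetermined recognition processing actually uses."""
    paragraphs = [p.strip() for p in document.split("\n") if p.strip()]
    return [p for p in paragraphs if any(k in p for k in keywords)]
```

The surviving paragraphs, joined together, form the risk description text fed to steps S100 and S200.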
In a second aspect, a risk phrase chunking device is provided, comprising:
a first obtaining module, for performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list;
a second obtaining module, for processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list;
a merging module, for merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
On the basis of the above technical solution, the first obtaining module comprises:
a first filtering module, for filtering the risk description text based on a predetermined filtering rule;
a screening module, for performing part-of-speech tagging on the filtered risk description text and screening words of predetermined parts of speech to form a text to be recognized;
a statistics module, for counting word strings whose frequency of occurrence in the text to be recognized is greater than a preset quantity threshold as candidate phrases;
a choosing module, for picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm.
On the basis of the above technical solution, the second obtaining module comprises:
a combining module, for combining the words in the risk description text to form phrases to be matched;
a matching module, for querying the phrases to be matched against the predetermined lexicon of the word-segmentation tool and determining the phrases that match vocabulary in the predetermined lexicon;
a second filtering module, for filtering the matched phrases based on a predetermined filtering rule and taking the filtered phrases as the risk phrases of the second risk phrase list.
In a third aspect, an electronic device is provided, comprising:
a processor; and
a memory configured to store machine-readable instructions which, when executed by the processor, cause the processor to execute the risk phrase chunking method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, for storing computer instructions which, when run on a computer, allow the computer to execute the risk phrase chunking method of the first aspect.
The technical solution provided by the present application has the following beneficial effects:
phrase recognition is performed on the risk description text using a predetermined phrase chunking algorithm, preliminarily extracting phrases to obtain the first risk phrase list; the risk description text is then processed with a predetermined word-segmentation tool, extending the risk phrases to obtain the second risk phrase list; finally, the first risk phrase list and the second risk phrase list are merged to determine a risk phrase list comprising multiple risk phrases. When phrases are extracted with the predetermined phrase chunking algorithm, the information-characterization ability of a phrase is clearly higher than that of a single keyword, and the accuracy of the phrases extracted by the phrase chunking algorithm is high; however, the number of risk phrases obtained by the phrase chunking algorithm alone is small and insufficient to characterize all the information the raw risk text intends to express. The present invention further processes the risk description text in combination with a word-segmentation tool, extending the risk phrases and further improving accuracy, and the risk phrases obtained in the two ways together are more comprehensive. The present invention can identify risk phrases quickly and accurately; the identified phrases are more comprehensive and carry more information, and can reveal risk topics well, solving the technical problem that existing recognition of risk phrases in risk description texts is ineffective.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below.
Fig. 1 is a schematic flowchart of a risk phrase chunking method provided by Embodiment 1 of the present application;
Fig. 2 is a schematic flowchart of a risk phrase chunking method provided by Embodiment 2 of the present application;
Fig. 3 is a schematic flowchart of a risk phrase chunking method provided by Embodiment 3 of the present application;
Fig. 4 is a schematic structural diagram of a risk phrase chunking device provided by Embodiment 4 of the present application;
Fig. 5 is a schematic structural diagram of the first obtaining module provided by Embodiment 5 of the present application;
Fig. 6 is a schematic structural diagram of the second obtaining module provided by Embodiment 6 of the present application;
Fig. 7 is a schematic structural diagram of the electronic device provided by Embodiment 7 of the present application.
Specific embodiment
The embodiments of the present application are described in detail below, with examples shown in the accompanying drawings, where identical or similar labels throughout denote identical or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are only used to explain the present application, and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an" and "the" used herein may also include the plural forms. It should be further understood that the wording "comprising" used in the description of the present application refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The wording "and/or" used herein includes all or any unit and all combinations of one or more of the associated listed items.
To make the purposes, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below in conjunction with the accompanying drawings.
First, several terms involved in the present application are introduced and explained:
Mutual information (Mutual Information) is a useful information measure in information theory: it refers to the correlation between two event sets. Mutual information is a common method of computational-linguistics model analysis; it measures the mutual dependence between two objects, and in filtering problems it measures how well a feature discriminates a topic. The definition of mutual information is similar to that of cross entropy. Originally a concept in information theory used to express relationships between pieces of information, mutual information is a measure of the statistical correlation of two random variables. Feature extraction using mutual-information theory is based on the assumption that an entry whose frequency of occurrence is high in a particular category but relatively low in other categories has large mutual information with that category. In general, mutual information is used as the measure between a feature word and a category: if the feature word belongs to the category, their mutual information is largest.
Left-right entropy is an important statistical feature of patterns, although computing the left-right entropy of massive word strings over a large-scale corpus involves reading a large number of unrelated characters. The larger the left and right entropy, the richer the words surrounding a given word string, meaning the greater its degree of freedom and the greater the possibility that it becomes an independent word.
Existing recognition technology is not well suited to the risk phrase recognition and extraction of annual-report risk description texts. Topic-word extraction results tend toward single vocabulary items with short word length, which cannot characterize risk topics well and also lose a large amount of semantic content. The information-characterization ability of a phrase is clearly higher than that of a single keyword: for example, "aging growth" conveys a richer meaning than the two words "aging" and "growth", and "managerial talent" conveys more information than the two words "management" and "talent". Moreover, given the short length of annual-report risk descriptions, existing key-phrase extraction algorithms can identify only a limited number of phrases; the importance of some important words in annual-report risk descriptions, such as vocabulary with strong risk-warning meaning like "growth", "decline" and "shortage", is lowered, and the extracted phrases are mostly composed of noun vocabulary. The overall recognition effect is poor and cannot satisfy the demand for risk phrase recognition in annual-report risk description texts.
The risk phrase chunking method, apparatus, electronic device and storage medium provided by the present application are intended to solve the above technical problems of the prior art.
The technical solution of the present application, and how it solves the above technical problems, are described in detail below with specific embodiments. The specific embodiments below may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present application are described below in conjunction with the accompanying drawings.
Embodiment one
A risk phrase chunking method is provided in this embodiment of the present application, as shown in Fig. 1. The method comprises:
S100, performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list.
S200, processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list.
S300, merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases. Specifically, duplicate risk phrases are removed during the merge.
On the basis of the above embodiment, the number of risk phrases obtained with the predetermined phrase chunking algorithm alone is small and insufficient to characterize the information the raw risk text intends to express. Processing the risk description text with a word-segmentation tool as well extends the risk phrases with high accuracy, and the risk phrases obtained in the two ways together are more comprehensive.
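Once the two lists exist, step S300 reduces to a duplicate-removing merge; a minimal sketch (function name illustrative), preserving first-seen order:

```python
def merge_phrase_lists(first_list, second_list):
    """Step S300: merge the two risk-phrase lists, removing duplicate
    risk phrases while preserving the order of first appearance."""
    merged, seen = [], set()
    for phrase in first_list + second_list:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged
```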
Embodiment two
As shown in Fig. 2, the embodiment of the invention provides a possible implementation in which, on the basis of Embodiment 1, step S100 comprises the following steps:
S101, filtering the risk description text based on a predetermined filtering rule. The filtering rule is: filter stop words according to a predetermined stop-word list, and retain punctuation marks, nouns, verbs, adjectives and degree adverbs.
S102, performing part-of-speech tagging on the filtered risk description text, and screening words of predetermined parts of speech to form a text to be recognized.
Further, screening words of predetermined parts of speech comprises: screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text. Not filtering out the punctuation marks prevents content separated by symbols from being joined together once punctuation is removed, ensures that the words constituting a risk phrase are adjacent in position, and avoids extracting noise word strings such as "risk company" or "risk country". Meanwhile, retaining all nouns, verbs, adjectives and degree adverbs ensures that the extracted phrases are of higher quality: words with larger information content such as "growth", "shortage" and "decline" are retained, preventing only a noun from being extracted.
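A minimal sketch of the stop-word filtering and part-of-speech screening of steps S101 and S102, assuming the input is already tagged (the single-letter tags follow jieba-style conventions and are an assumption; a real implementation would obtain them from a part-of-speech tagger):

```python
STOPWORDS = {"的", "了", "和"}          # illustrative stop-word list
KEPT_TAGS = {"n", "v", "a", "d", "x"}  # noun, verb, adjective, degree adverb, punctuation

def screen_tokens(tagged_tokens):
    """Steps S101/S102: drop stop words, then keep only words of the
    predetermined parts of speech; punctuation is retained so content
    separated by symbols is never joined across the boundary."""
    return [word for word, tag in tagged_tokens
            if word not in STOPWORDS and tag in KEPT_TAGS]
```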
S103, counting word strings whose frequency of occurrence in the text to be recognized is greater than a preset quantity threshold as candidate phrases. Specifically, the preset quantity threshold can be set according to the actual text, and a word string can be a combination of two words.
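Step S103 can be sketched as counting adjacent two-word strings, per the note above that a word string can be a combination of two words (the threshold default is illustrative):

```python
from collections import Counter

def candidate_phrases(tokens, threshold=1):
    """Step S103: count adjacent two-word strings in the text to be
    recognized, keeping those whose frequency exceeds the threshold."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {a + b: c for (a, b), c in counts.items() if c > threshold}
```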
S104, picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm.
Specifically, in step S104, picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm comprises:
S1041, calculating the mutual-information value of each candidate phrase using mutual information. The mutual-information value indicates the possibility that the candidate phrase forms a phrase, and is proportional to that possibility.
S1042, calculating the left entropy and right entropy of each candidate phrase using left-right entropy. The left entropy and right entropy respectively indicate the possibilities of left and right collocation of the candidate phrase's words, and are proportional to the possibility that the candidate phrase forms a phrase.
Further, steps S1041 and S1042 have no required order; they may be performed simultaneously or one after the other.
S1043, calculating the weighted value of each candidate phrase according to a predetermined weighting algorithm based on the statistical magnitudes of each candidate phrase. The statistical magnitudes comprise either the mutual-information value, the left entropy, the right entropy and the frequency of occurrence of the candidate phrase in the text to be recognized, or the mutual-information value, left entropy and right entropy alone.
S1044, selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weighted value of each candidate phrase.
Further, the weighted values of the candidate phrases are sorted from largest to smallest, and the top preset number of candidate phrases are chosen as risk phrases. In practical applications, meaningless phrases such as numbers can be removed first, and then, for example, the top 20 phrases are selected.
Or, when the weighted value of a candidate phrase is not less than a preset threshold, that candidate phrase is taken as a risk phrase. In practical applications, the threshold is set in advance, and candidate phrases below the preset threshold are excluded.
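The two variants of the selection rule in step S1044 can be sketched together (the parameter defaults are illustrative):

```python
def select_risk_phrases(weighted, top_n=20, threshold=None):
    """Step S1044 selection rule: either keep every phrase whose weight
    is not less than the preset threshold, or take the top-N phrases
    by weighted value."""
    if threshold is not None:
        return [p for p, w in weighted.items() if w >= threshold]
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```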
As an optional implementation, in step S1043, if the statistical magnitudes comprise the mutual-information value, the left entropy, the right entropy and the frequency of occurrence of the candidate phrase, then the first weighted value of each candidate phrase is calculated according to a first predetermined algorithm based on the mutual-information value, left entropy, right entropy and frequency of occurrence of each candidate phrase;
risk phrases are then selected from the candidate phrases using the predetermined selection rule according to the first weighted value of each candidate phrase.
If the statistical magnitudes comprise the mutual-information value, left entropy, right entropy and frequency of occurrence of the candidate phrase, the calculation is as follows:
(1) Calculating the mutual-information value of each candidate phrase using mutual information
Let the two component words of a candidate phrase t be the characters a and b. The mutual information is then calculated as shown in formula 1.1:

MI(t) = log( p(t) / ( p(a) · p(b) ) )   (formula 1.1)

where p(t), p(a) and p(b) respectively denote the probabilities of t, a and b. The probability estimates can be simplified and calculated in the form of normalized frequencies:

p(t) = n_t / N_P   (formula 1.2)
p(a) = n_a / N_T   (formula 1.3)
p(b) = n_b / N_T   (formula 1.4)

where n_t, n_a and n_b respectively denote the numbers of occurrences of t, a and b in the corpus, N_P denotes the total number of candidate-phrase occurrences in the corpus set, and N_T is the total number of single-word occurrences in the corpus set.
The higher the value of the mutual information, the higher the correlation of a and b, and the greater the possibility that a and b form a phrase; conversely, the lower the value of the mutual information, the lower the correlation between a and b and the greater the possibility of a phrase boundary between a and b, and therefore the smaller the possibility that they form a phrase.
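A sketch of formula 1.1 with the normalized-frequency estimates of formulas 1.2 to 1.4; the base-2 logarithm is an assumption, since the patent does not state the base:

```python
import math

def mutual_information(n_t, n_a, n_b, big_n_p, big_n_t):
    """Formula 1.1 with the normalized-frequency probability estimates
    of formulas 1.2-1.4 (log base 2 assumed)."""
    p_t = n_t / big_n_p          # formula 1.2
    p_a = n_a / big_n_t          # formula 1.3
    p_b = n_b / big_n_t          # formula 1.4
    return math.log2(p_t / (p_a * p_b))
```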
(2) Calculating the left entropy and right entropy of each candidate phrase using left-right entropy
Adjacency entropy comprises left adjacency entropy and right adjacency entropy; it is essentially a measurement, using information entropy, of the uncertainty of the words adjoining the candidate phrase on the left and on the right. The lower the uncertainty of the adjacent words, the fewer and more fixed the words before and after the candidate phrase, and thus the lower the possibility that it forms a word; conversely, the more numerous, chaotic and unstable the words before and after the candidate phrase, the higher the possibility that it becomes a word. The left entropy and right entropy are calculated as shown in formulas 2.1 and 2.2:

E_L(W) = − Σ_{a∈A} p(a|W) · log p(a|W)   (formula 2.1)
E_R(W) = − Σ_{b∈B} p(b|W) · log p(b|W)   (formula 2.2)

where E_L and E_R respectively denote the left entropy and right entropy of a candidate phrase, and W denotes the candidate phrase, W = {w1, w2, ..., wn}; A denotes the set of all words appearing on the left of the candidate phrase, with a denoting a word in set A; B denotes the set of all words appearing on the right of the candidate phrase, with b denoting a word in set B. If the E_L and E_R values of a candidate phrase are larger, the words appearing to its left and right are more chaotic and unstable and its collocations richer, and the candidate phrase is therefore more likely to be a phrase.
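Formulas 2.1 and 2.2 are the same computation applied to the left-neighbour and right-neighbour word lists respectively; a sketch (log base again assumed to be 2):

```python
import math
from collections import Counter

def adjacency_entropy(neighbour_words):
    """Formulas 2.1/2.2: Shannon entropy of the words observed on one
    side of a candidate phrase. Richer, freer context gives higher
    entropy; a single fixed neighbour gives zero."""
    counts = Counter(neighbour_words)
    total = len(neighbour_words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Calling it with the left-neighbour list gives E_L and with the right-neighbour list gives E_R.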
(3) Calculating the first weighted value of each candidate phrase according to the first predetermined algorithm
The composition boundary of a phrase is judged using the left and right adjacency entropies, combined with the frequency of occurrence TF for phrase extraction. The mutual information, left entropy, right entropy and frequency TF are fitted to obtain the first weighted value Score, and a threshold is set for phrase recognition. The Score value is calculated as follows:

Score = (NorFreq + NorMI + NorLE + NorRE) / 4   (formula 3.1)

where NorFreq, NorMI, NorLE and NorRE are respectively the values of the frequency of occurrence TF, the mutual information, the left entropy and the right entropy after min-max normalization, calculated as:

NorFreq_i = (Freq_i − MINFreq) / (MAXFreq − MINFreq)   (formula 3.2)
NorMI_i = (MI_i − MINMI) / (MAXMI − MINMI)   (formula 3.3)
NorLE_i = (LE_i − MINLE) / (MAXLE − MINLE)   (formula 3.4)
NorRE_i = (RE_i − MINRE) / (MAXRE − MINRE)   (formula 3.5)

Thus, the higher the first weighted value Score, the higher the possibility that the candidate phrase is a phrase; conversely, the lower the Score, the lower the possibility that the candidate phrase becomes a phrase.
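A sketch of formulas 3.1 to 3.5: min-max normalization of the four statistics followed by their mean (the `bounds` mapping is an illustrative assumption for passing the per-statistic minima and maxima):

```python
def min_max(x, lo, hi):
    """Min-max normalization onto [0, 1] (formulas 3.2-3.5)."""
    return (x - lo) / (hi - lo)

def score(freq, mi, le, re_, bounds):
    """First weighted value, formula 3.1: the mean of the four
    normalized statistics. `bounds` maps each statistic name to its
    (min, max) over all candidate phrases."""
    parts = [min_max(freq, *bounds["freq"]),
             min_max(mi, *bounds["mi"]),
             min_max(le, *bounds["le"]),
             min_max(re_, *bounds["re"])]
    return sum(parts) / 4
```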
As another implementation, in step S1043, if the statistical magnitudes comprise the mutual-information value, left entropy and right entropy, then the second weighted value of each candidate phrase is calculated according to a second predetermined algorithm based on the mutual-information value, left entropy and right entropy of each candidate phrase;
risk phrases are then selected from the candidate phrases using the predetermined selection rule according to the second weighted value of each candidate phrase.
If the statistical magnitudes comprise the mutual-information value, left entropy and right entropy, the calculation is as follows:
(1) Calculating the mutual-information value of each candidate phrase using mutual information
The mutual information can be calculated in the same form as formula 1.1, with the probabilities estimated from frequencies, where t denotes a candidate phrase, N denotes the number of all candidate phrases in the set whose length meets the requirement, and n_t, n_a and n_b respectively denote the frequencies with which the words t, a and b occur in the text. The larger the mutual-information value, the tighter the combination between the words and the greater the possibility that they become a phrase; conversely, the smaller the mutual-information value, the more uncorrelated the words are and the less able they are to constitute a phrase.
(2) Calculating the left entropy and right entropy of each candidate phrase using left-right entropy
The left-right entropy can be calculated in the same form as formulas 2.1 and 2.2, where E_L denotes the left entropy of a word string, E_R its right entropy, W the candidate phrase set, and A the set of all words appearing on the left of the candidate phrase, a ∈ A; similarly, B denotes the set of all words appearing on the right of the candidate phrase, b ∈ B. If the E_L and E_R values of a word string are larger, the words collocating on its left and right are more rich and varied, and the probability that the word string forms a phrase is greater.
(3) Calculating the second weighted value of each candidate phrase according to the second predetermined algorithm
The mutual-information value, left entropy and right entropy are fitted according to a predetermined algorithm, for example averaged, to obtain the second weighted value.
Both implementations above calculate a weighted value for each candidate phrase, obtaining a first weighted value or a second weighted value; risk phrases are then selected from the candidate phrases using the predetermined selection rule, comprising:
sorting the first or second weighted values of the candidate phrases from largest to smallest and choosing the top preset number of candidate phrases as risk phrases; or, when the first or second weighted value of a candidate phrase is not less than a preset threshold, taking that candidate phrase as a risk phrase.
Embodiment three
As shown in Fig. 3, the embodiment of the invention provides a possible implementation in which, on the basis of Embodiment 1, step S200 comprises the following steps:
S201, combining the words in the risk description text to form phrases to be matched.
S202, querying the phrases to be matched against the predetermined lexicon of the word-segmentation tool, and determining the phrases that match vocabulary in the predetermined lexicon.
S203, filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list. The predetermined filtering rule comprises at least one of the following: filtering single characters; filtering numbers; filtering phrases whose number of constituent characters is less than a predetermined value. Further, the phrases whose number of characters is less than the predetermined value can be everyday expressions of length less than 3, which further improves segmentation efficiency.
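Steps S201 to S203 can be sketched with a plain Python set standing in for the predetermined lexicon (the actual embodiment described below loads an encyclopedia vocabulary into jieba); all names here are illustrative:

```python
def match_phrases(words, lexicon, min_len=3):
    """Steps S201-S203: join runs of adjacent words into phrases to be
    matched, keep those found in the lexicon, then filter out single
    characters, pure numbers, and phrases shorter than min_len."""
    candidates = set()
    for i in range(len(words)):
        joined = words[i]
        for j in range(i + 1, len(words)):
            joined += words[j]
            candidates.add(joined)
    return sorted(p for p in candidates & lexicon
                  if len(p) >= min_len and not p.isdigit())
```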
With the phrase recognition algorithm of embodiment one alone, the number of phrases obtained by combining words via mutual information is limited, and when the left/right entropy thresholds are set high, some informative phrases fail to be recognised; for example, terms such as "economic downturn" and "rising unemployment rate" are not identified.
Specifically, the dictionaries of common word segmentation tools cover very few risk phrases, and in most cases a single risk phrase is cut into several words. In this embodiment, the segmentation tool may be jieba. A massive encyclopedia vocabulary is obtained from a Chinese encyclopedic knowledge graph and stored in a txt file, one entry per line; this txt file then replaces jieba's built-in dictionary as the lexicon. The enlarged vocabulary makes the obtained phrases more comprehensive, preventing phrases from being missed and risk topics from being poorly represented. Meanwhile, considering that jieba's built-in segmentation dictionary contains about 350,000 entries while the encyclopedia entity dictionary contains about 12,000,000 entries, roughly 34 times the size of the original dictionary, directly initialising jieba with the encyclopedia entity dictionary as the segmentation dictionary may cause the program to crash. Since jieba recognises out-of-vocabulary words with a character-based hidden Markov model (HMM) with word-formation ability, that model is disabled here to guarantee segmentation efficiency.
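One way to prepare the replacement lexicon is to write the encyclopedia vocabulary into jieba's dictionary format, one "word frequency" pair per line; the frequency constant below is an assumption, as the patent only specifies one entry per line. Loading the file and disabling the HMM would then use jieba's standard `set_dictionary` call and `HMM=False` option:

```python
def build_jieba_dict(vocab, path, default_freq=3):
    """Write an encyclopedia vocabulary into jieba's main-dictionary
    format: one 'word frequency' pair per line (the frequency value is
    an assumed constant, not specified by the patent)."""
    with open(path, "w", encoding="utf-8") as f:
        for word in vocab:
            f.write(f"{word} {default_freq}\n")

# Usage (requires jieba, not imported here):
#   jieba.set_dictionary("encyclopedia_dict.txt")  # replace built-in dict
#   words = jieba.lcut(text, HMM=False)            # HMM disabled, as above
```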
For recognition using encyclopedia vocabulary, Chinese encyclopedic knowledge graph data can be obtained from Fudan University's open Chinese knowledge graph website, which contains more than 9,000,000 encyclopedia entities and more than 66,000,000 triples. The general Chinese encyclopedic knowledge graph provided by Fudan University covers entries from Chinese encyclopedia websites such as Baidu Baike, Hudong Baike and Chinese Wikipedia, including concrete things, celebrities, abstract concepts, literary works, hot events, technical terms, Chinese words and combinations on specific topics. From it, more than 12,000,000 encyclopedia entity words in total are obtained, covering almost all fields with high accuracy.
When any phrase to be matched belongs to the encyclopedia vocabulary, that phrase is taken as a risk phrase, enlarging the vocabulary of risk phrases.
An embodiment of the present invention also provides another possible implementation in which, on the basis of embodiment one, the method further comprises, before step S100, the following step:
performing paragraph-level recognition on a predetermined text, and extracting the paragraphs containing risk descriptions from the text as the risk description text.
Specifically, the full text of a listed company's annual report is taken as an example.
First, the full texts of the annual reports of all A-share listed companies are obtained and the risk description information is extracted from them. Owing to the writing conventions of annual reports, risk description information mainly exists in the form of short texts: each risk category (e.g. "risk of increased accounts receivable and reduced cash flow" in the figure) corresponds to one brief, concise paragraph of specific risk description. Second, the risk description information is split by paragraph, so that each listed company corresponds to several risk description texts. Finally, data cleansing is performed: contents that carry almost no risk-indicating information, such as the "commitment to avoid horizontal competition" and "commitment on restricted shares" sections, are removed, along with illegal symbols, obviously non-textual content and mojibake.
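The cleaning step can be sketched as below; the boilerplate markers and the character whitelist are assumptions standing in for the patent's "commitment" sections and "illegal symbols and mojibake":

```python
import re

# Markers of commitment sections with no risk signal (illustrative).
BOILERPLATE = ("避免同业竞争承诺", "股份限售承诺")

def clean_risk_paragraphs(paragraphs):
    """Keep only risk-describing paragraphs: drop boilerplate commitment
    sections and strip characters outside an assumed whitelist of CJK,
    alphanumerics and common punctuation (removing obvious mojibake)."""
    cleaned = []
    for p in paragraphs:
        if any(marker in p for marker in BOILERPLATE):
            continue
        p = re.sub(r"[^\u4e00-\u9fff0-9A-Za-z，。、；：？！%（）.,;:()\s-]", "", p)
        p = p.strip()
        if p:
            cleaned.append(p)
    return cleaned
```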
On the basis of embodiments one to four, a risk description text is randomly selected from the 2016 annual report of Qingshuiyuan Science and Technology Co., Ltd. as the experimental text, and a risk phrase extraction experiment is carried out on it. The risk phrase extraction results of embodiments one to four are compared with those of the existing HanLP phrase extraction algorithm, as shown in Table 1.
Experimental text: "Risk of a large accounts receivable balance and increasing ageing. As of June 30, 2016, the company's accounts receivable balance was nearly 219 million yuan, part of which has an increasing ageing, and the bad-debt provision accrued has increased accordingly, affecting the company's operating performance. To solve the problem of an excessively high accounts receivable balance, the company has strengthened the responsibility of business personnel for collecting receivables, included the recovery status of receivables in performance feedback and linked it directly to income; the company has also set up a dedicated arrears clean-up group to fully recover the debts owed by key large-amount debtor customers, reinforced the risk assessment of business units' receivables, and taken appropriate legal measures against units with long-standing arrears, long payment cycles and small business volumes in recent years."
Table 1 Comparison of experimental results
Obviously, the risk phrases extracted by the recognition methods of embodiments one to four are more comprehensive and better characterise the information that the original risk text intends to express.
Embodiment four
Fig. 4 shows a risk phrase identification apparatus further provided by an embodiment of the present invention. As shown in Fig. 4, the risk phrase identification apparatus 1 comprises:
a first acquisition module 11, configured to perform phrase recognition on the risk description text using a predetermined phrase recognition algorithm to obtain a first risk phrase list;
a second acquisition module 12, configured to process the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
a merging module 13, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
The number of risk phrases obtained by the first acquisition module 11 alone is small and insufficient to characterise the information that the original risk text intends to express; the risk description text is therefore also processed by the second acquisition module 12, extending the set of risk phrases. The merging module 13 then merges the first risk phrase list and the second risk phrase list to determine the final risk phrase list. The accuracy is high, and the risk phrases obtained by combining the two approaches are more comprehensive.
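Merging the two lists while removing duplicates can be sketched as:

```python
def merge_phrase_lists(first, second):
    """Merge the first (statistical) and second (lexicon-matched) risk
    phrase lists, de-duplicating while preserving first-seen order."""
    seen, merged = set(), []
    for phrase in first + second:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged
```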
In addition, the risk phrase identification apparatus of the present invention may further comprise a text acquisition module, configured to process the predetermined text by paragraph and extract the paragraphs containing risk descriptions from the text to obtain the risk description text.
Embodiment five
Fig. 5 shows the specific content of the first acquisition module 11 further provided by an embodiment of the present invention. As shown in Fig. 5, the first acquisition module 11 comprises:
a first filtering module 111, configured to filter the risk description text based on a predetermined filtering rule; in the actual filtering process, the first filtering module 111 is specifically configured to filter out stop words according to a predetermined stop-word list;
a screening module 112, configured to perform part-of-speech tagging on the filtered risk description text and screen the words of predetermined parts of speech to form the text to be recognised; in the actual filtering process, the screening module 112 is specifically configured to screen nouns, verbs, adjectives and degree adverbs from the filtered risk description text;
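Assuming a POS tagger that yields (word, flag) pairs with jieba/ICTCLAS-style flags (an assumption; the patent does not fix the tag scheme), the screening step reduces to:

```python
# Flag prefixes for noun, verb, adjective and adverb (illustrative).
KEEP_FLAGS = ("n", "v", "a", "d")

def screen_by_pos(tagged_words):
    """Given (word, pos_flag) pairs from a POS tagger (e.g. jieba.posseg),
    keep nouns, verbs, adjectives and adverbs to form the text to be
    recognised."""
    return [w for w, flag in tagged_words if flag.startswith(KEEP_FLAGS)]
```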
a statistics module 113, configured to count the word strings whose number of occurrences in the text to be recognised exceeds a preset quantity threshold, as candidate phrases;
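Counting candidate word strings can be sketched as a character n-gram count (the maximum length and the threshold value are illustrative):

```python
from collections import Counter

def candidate_word_strings(text, max_len=6, threshold=1):
    """Count every character n-gram (length 2..max_len) in the text to be
    recognised and keep those occurring more often than the preset
    quantity threshold, as candidate phrases."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s for s, c in counts.items() if c > threshold}
```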
a selection module 114, configured to pick out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm.
In the actual process, the selection module 114 is specifically configured to calculate the mutual information value of each candidate phrase using mutual information, calculate the left entropy and right entropy of each candidate phrase, calculate the weight value of each candidate phrase according to a predetermined weighting algorithm based on the statistical measures of each candidate phrase, and select risk phrases from the candidate phrases using a predetermined selection rule according to the weight values. The statistical measures comprise the mutual information value, the left entropy, the right entropy and the number of occurrences of the candidate phrase; or the mutual information value, the left entropy and the right entropy, to which the calculation methods of embodiment two apply.
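The mutual information value used by the selection module can be computed as pointwise mutual information over corpus counts (a standard formulation; the patent does not spell out the exact estimator):

```python
import math

def mutual_information(freq_xy, freq_x, freq_y, total):
    """Pointwise mutual information of a candidate phrase xy formed from
    parts x and y: PMI = log(p(xy) / (p(x) * p(y))); all frequencies are
    raw counts over the same corpus of `total` tokens."""
    p_xy = freq_xy / total
    p_x = freq_x / total
    p_y = freq_y / total
    return math.log(p_xy / (p_x * p_y))
```

A high PMI means the two parts co-occur far more often than chance, so the word string is likely a genuine phrase.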
Embodiment six
Fig. 6 shows the specific content of the second acquisition module 12 further provided by an embodiment of the present invention. As shown in Fig. 6, the second acquisition module 12 comprises:
a combination module 121, configured to combine the individual words in the risk description text to form phrases to be matched;
a matching module 122, configured to query the phrases to be matched against the predetermined lexicon of the word segmentation tool and determine the phrases matching vocabulary entries in the predetermined lexicon;
a second filtering module 123, configured to filter the matched phrases based on a predetermined filtering rule and take the filtered phrases as the risk phrases of the second risk phrase list; in the actual filtering process, the second filtering module 123 is further configured to perform at least one of: filtering out single characters; filtering out numbers; filtering out phrases whose character count is below a predetermined value.
Embodiment seven
An embodiment of the present invention also provides an electronic device. As shown in Fig. 7, the electronic device 4000 comprises:
a processor 4001; and
a memory 4003 configured to store machine-readable instructions which, when executed by the processor, cause the processor to perform the risk phrase identification method of the preceding method embodiments.
The processor 4001 is connected with the memory 4003, for example via a bus 4002. Further, the electronic device 4000 may also comprise a transceiver 4004. It should be noted that in practical applications the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 is applied in the embodiments of the present application to implement the first acquisition module 11, the second acquisition module 12 and the merging module 13 shown in Fig. 4.
The transceiver 4004 comprises a receiver and a transmitter and is applied in the embodiments of the present application to the acquisition of the risk description text by the text acquisition module. The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules and circuits described in the present disclosure. The processor 4001 may also be a combination implementing computing functions, for example a combination comprising one or more microprocessors, or a combination of a DSP and a microprocessor. The bus 4002 may comprise a path that transfers information between the above components. The bus 4002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration only one thick line is shown in Fig. 7, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
The memory 4003 is used to store the application program code for executing the solution of the present application, and execution is controlled by the processor 4001. The processor 4001 executes the application program code stored in the memory 4003 to realise the actions of the risk phrase identification apparatus provided in the embodiment shown in Fig. 4.
An embodiment of the present invention also provides a computer-readable storage medium for storing computer instructions which, when run on a computer, enable the computer to execute the corresponding contents of the preceding method embodiments.
In addition, the risk phrase identification method of the present invention is mainly used to improve the accuracy of phrase extraction from annual-report risk description texts, so as to effectively mine the content contained in these texts and raise the utilisation rate of the textual content of annual reports. Besides its use in research on listed-company annual-report databases, the method can also broaden the lines of enquiry in enterprise risk early-warning research, making up for the tendency of existing research to focus on annual-report financial data while ignoring the textual content of annual reports. Risk phrase identification has two important practical functions:
(1) Providing technical support for follow-up research on listed-company annual-report databases. In recent years annual reports have grown ever longer, yet the three main financial statements, as the body of the annual report, have hardly grown, and the financial information they can disclose has reached its ceiling; meanwhile the textual content outside the financial statements has become richer and richer, with various supplementary notes and explanations disclosing more and more information. How to mine valuable information from the large volume of risk description text is a major issue for listed-company annual-report database projects, and it largely determines the accuracy and comprehensiveness of subsequent analysis and prediction. The risk phrase identification proposed by the present invention lays the foundation for the follow-up research of such projects.
(2) Providing technical support for risk early-warning research. The recognition method of the present invention can identify and extract high-quality, accurate, information-rich risk phrases. As the decision relevance of annual-report data is increasingly acknowledged, the textual information in annual reports is also gradually being taken seriously. The present invention provides technical support for scholars and enterprises to mine annual-report texts, helping to remedy the shortcomings of existing risk early-warning research and improve the comprehensiveness of enterprise risk early warning.
It should be understood that although the steps in the flow charts of the drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in the flow charts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A risk phrase identification method, characterised by comprising:
performing phrase recognition on a risk description text using a predetermined phrase recognition algorithm to obtain a first risk phrase list;
processing the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
2. The risk phrase identification method according to claim 1, characterised in that performing phrase recognition on the risk description text using the predetermined phrase recognition algorithm comprises:
filtering the risk description text based on a predetermined filtering rule;
performing part-of-speech tagging on the filtered risk description text, and screening the words of predetermined parts of speech to form a text to be recognised;
counting the word strings whose number of occurrences in the text to be recognised exceeds a preset quantity threshold, as candidate phrases;
picking out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm.
3. The risk phrase identification method according to claim 2, characterised in that the predetermined filtering rule comprises: filtering out stop words according to a predetermined stop-word list;
and that screening the words of predetermined parts of speech comprises:
screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text.
4. The risk phrase identification method according to claim 2, characterised in that picking out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm comprises:
calculating the mutual information value of each candidate phrase using mutual information;
calculating the left entropy and the right entropy of each candidate phrase;
calculating the weight value of each candidate phrase according to a predetermined weighting algorithm based on the statistical measures of each candidate phrase, the statistical measures comprising the mutual information value, the left entropy, the right entropy and the number of occurrences of the candidate phrase in the text to be recognised; or the mutual information value, the left entropy and the right entropy;
selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weight value of each candidate phrase.
5. The risk phrase identification method according to claim 4, characterised in that the predetermined selection rule comprises:
sorting the candidate phrases by weight value in descending order and taking the top predetermined number of candidate phrases as risk phrases; or,
taking a candidate phrase as a risk phrase when its weight value is not less than a preset threshold.
6. The risk phrase identification method according to claim 1, characterised in that processing the risk description text using the predetermined word segmentation tool comprises:
combining the individual words in the risk description text to form phrases to be matched;
querying the phrases to be matched against a predetermined lexicon of the word segmentation tool, and determining the phrases matching vocabulary entries in the predetermined lexicon;
filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list.
7. The risk phrase identification method according to claim 6, characterised in that the predetermined filtering rule comprises at least one of the following:
filtering out single characters; filtering out numbers; filtering out phrases whose character count is below a predetermined value.
8. The risk phrase identification method according to claim 1, characterised in that before performing phrase recognition on the risk description text using the predetermined phrase recognition algorithm, the method further comprises:
performing paragraph-level recognition on a predetermined text, and extracting the paragraphs containing risk descriptions from the text as the risk description text.
9. A risk phrase identification apparatus, characterised by comprising:
a first acquisition module, configured to perform phrase recognition on a risk description text using a predetermined phrase recognition algorithm to obtain a first risk phrase list;
a second acquisition module, configured to process the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
a merging module, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
10. The risk phrase identification apparatus according to claim 9, characterised in that the first acquisition module comprises:
a first filtering module, configured to filter the risk description text based on a predetermined filtering rule;
a screening module, configured to perform part-of-speech tagging on the filtered risk description text and screen the words of predetermined parts of speech to form a text to be recognised;
a statistics module, configured to count the word strings whose number of occurrences in the text to be recognised exceeds a preset quantity threshold, as candidate phrases;
a selection module, configured to pick out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm.
11. The risk phrase identification apparatus according to claim 9, characterised in that the second acquisition module comprises:
a combination module, configured to combine the individual words in the risk description text to form phrases to be matched;
a matching module, configured to query the phrases to be matched against a predetermined lexicon of the word segmentation tool and determine the phrases matching vocabulary entries in the predetermined lexicon;
a second filtering module, configured to filter the matched phrases based on a predetermined filtering rule and take the filtered phrases as the risk phrases of the second risk phrase list.
12. An electronic device, characterised by comprising:
a processor; and
a memory configured to store machine-readable instructions which, when executed by the processor, cause the processor to perform the risk phrase identification method of any one of claims 1-8.
13. A computer-readable storage medium, characterised in that the computer storage medium is used to store computer instructions which, when run on a computer, enable the computer to execute the risk phrase identification method of any one of claims 1-8.
CN201910580521.1A 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium Active CN110287493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910580521.1A CN110287493B (en) 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910580521.1A CN110287493B (en) 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110287493A true CN110287493A (en) 2019-09-27
CN110287493B CN110287493B (en) 2023-04-18

Family

ID=68020086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910580521.1A Active CN110287493B (en) 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110287493B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222316A (en) * 2020-01-03 2020-06-02 北京小米移动软件有限公司 Text detection method, device and storage medium
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field
CN113128209A (en) * 2021-04-22 2021-07-16 百度在线网络技术(北京)有限公司 Method and device for generating word stock

Citations (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361064A (en) * 2005-12-16 2009-02-04 Emil有限公司 A text editing apparatus and method
US20110191098A1 (en) * 2010-02-01 2011-08-04 Stratify, Inc. Phrase-based document clustering with automatic phrase extraction
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method
CN104484377A (en) * 2014-12-09 2015-04-01 百度在线网络技术(北京)有限公司 Generating method and device of substitute dictionaries
US20150195406A1 (en) * 2014-01-08 2015-07-09 Callminer, Inc. Real-time conversational analytics facility
US20150207819A1 (en) * 2013-01-23 2015-07-23 The Privacy Factor, LLC Methods for analyzing application privacy and devices thereof
CN104995650A (en) * 2011-12-27 2015-10-21 汤姆森路透社全球资源公司 Methods and systems for generating composite index using social media sourced data and sentiment analysis
CN105956740A (en) * 2016-04-19 2016-09-21 北京深度时代科技有限公司 Semantic risk calculating method based on text logical characteristic
CN106170553A (en) * 2013-12-13 2016-11-30 现代治疗公司 Nucleic acid molecules modified and application thereof
CN106649597A (en) * 2016-11-22 2017-05-10 浙江大学 Method for automatically establishing back-of-book indexes of book based on book contents
CN107085584A (en) * 2016-11-09 2017-08-22 中国长城科技集团股份有限公司 A kind of cloud document management method, system and service end based on content
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN107405325A (en) * 2015-02-06 2017-11-28 英特塞普特医药品公司 Pharmaceutical composition for combination treatment
CN107688594A (en) * 2017-05-05 2018-02-13 平安科技(深圳)有限公司 The identifying system and method for risk case based on social information
CN108021558A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method and device, electronic equipment and storage medium
CN108228556A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Key phrase extracting method and device
CN108447534A (en) * 2018-05-18 2018-08-24 灵玖中科软件(北京)有限公司 A kind of electronic health record data quality management method based on NLP
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
US20180309581A1 (en) * 2017-04-19 2018-10-25 International Business Machines Corporation Decentralized biometric signing of digital contracts
CN108764485A (en) * 2011-01-06 2018-11-06 电子湾有限公司 The interest-degree calculated in recommendation tools is recommended
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN109033224A (en) * 2018-06-29 2018-12-18 阿里巴巴集团控股有限公司 A kind of Risk Text recognition methods and device
CN109145313A (en) * 2018-07-18 2019-01-04 广州杰赛科技股份有限公司 Interpretation method, device and the storage medium of sentence
CN109299228A (en) * 2018-11-27 2019-02-01 阿里巴巴集团控股有限公司 The text Risk Forecast Method and device that computer executes
CN109460499A (en) * 2018-10-16 2019-03-12 青岛聚看云科技有限公司 Target search word generation method and device, electronic equipment, storage medium
CN109871426A (en) * 2018-12-18 2019-06-11 国网浙江桐乡市供电有限公司 A kind of monitoring recognition methods of confidential data
CN109872162A (en) * 2018-11-21 2019-06-11 阿里巴巴集团控股有限公司 A kind of air control classifying identification method and system handling customer complaint information
CN109918921A (en) * 2018-12-18 2019-06-21 国网浙江桐乡市供电有限公司 A kind of network communication data concerning security matters detection method

Patent Citations (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361064A (en) * 2005-12-16 2009-02-04 Emil有限公司 A text editing apparatus and method
US20110191098A1 (en) * 2010-02-01 2011-08-04 Stratify, Inc. Phrase-based document clustering with automatic phrase extraction
US20130185060A1 (en) * 2010-02-01 2013-07-18 Stratify, Inc. Phrase based document clustering with automatic phrase extraction
CN108764485A (en) * 2011-01-06 2018-11-06 电子湾有限公司 The interest-degree calculated in recommendation tools is recommended
CN104995650A (en) * 2011-12-27 2015-10-21 汤姆森路透社全球资源公司 Methods and systems for generating composite index using social media sourced data and sentiment analysis
US20160142445A1 (en) * 2013-01-23 2016-05-19 The Privacy Factor, LLC Methods and devices for analyzing user privacy based on a user's online presence
US20170111395A1 (en) * 2013-01-23 2017-04-20 The Privacy Factor, LLC Generating a privacy rating for an application or website
US20150207819A1 (en) * 2013-01-23 2015-07-23 The Privacy Factor, LLC Methods for analyzing application privacy and devices thereof
CN106170553A (en) * 2013-12-13 2016-11-30 现代治疗公司 Nucleic acid molecules modified and application thereof
US20150195406A1 (en) * 2014-01-08 2015-07-09 Callminer, Inc. Real-time conversational analytics facility
US20170013127A1 (en) * 2014-01-08 2017-01-12 Callminer, Inc. Real-time in-stream compliance monitoring facility
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method
CN104484377A (en) * 2014-12-09 2015-04-01 百度在线网络技术(北京)有限公司 Generating method and device of substitute dictionaries
CN107405325A (en) * 2015-02-06 2017-11-28 英特塞普特医药品公司 Pharmaceutical composition for combination treatment
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN105956740A (en) * 2016-04-19 2016-09-21 北京深度时代科技有限公司 Semantic risk calculating method based on text logical characteristic
CN107085584A (en) * 2016-11-09 2017-08-22 中国长城科技集团股份有限公司 A kind of cloud document management method, system and service end based on content
CN106649597A (en) * 2016-11-22 2017-05-10 浙江大学 Method for automatically establishing back-of-book indexes of book based on book contents
CN108228556A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Key phrase extracting method and device
US20180309581A1 (en) * 2017-04-19 2018-10-25 International Business Machines Corporation Decentralized biometric signing of digital contracts
CN107688594A (en) * 2017-05-05 2018-02-13 平安科技(深圳)有限公司 The identifying system and method for risk case based on social information
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108021558A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method and device, electronic equipment and storage medium
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108447534A (en) * 2018-05-18 2018-08-24 灵玖中科软件(北京)有限公司 A kind of electronic health record data quality management method based on NLP
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A topic phrase extraction method
CN109033224A (en) * 2018-06-29 2018-12-18 阿里巴巴集团控股有限公司 A risk text recognition method and device
CN109145313A (en) * 2018-07-18 2019-01-04 广州杰赛科技股份有限公司 Sentence translation method, device, and storage medium
CN109460499A (en) * 2018-10-16 2019-03-12 青岛聚看云科技有限公司 Target search word generation method and device, electronic equipment, storage medium
CN109872162A (en) * 2018-11-21 2019-06-11 阿里巴巴集团控股有限公司 A risk-control classification and identification method and system for handling customer complaint information
CN109299228A (en) * 2018-11-27 2019-02-01 阿里巴巴集团控股有限公司 Computer-executed text risk prediction method and device
CN109871426A (en) * 2018-12-18 2019-06-11 国网浙江桐乡市供电有限公司 A monitoring and recognition method for confidential data
CN109918921A (en) * 2018-12-18 2019-06-21 国网浙江桐乡市供电有限公司 A method for detecting classified content in network communication data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222316A (en) * 2020-01-03 2020-06-02 北京小米移动软件有限公司 Text detection method, device and storage medium
CN111222316B (en) * 2020-01-03 2023-08-29 北京小米移动软件有限公司 Text detection method, device and storage medium
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field
CN113128209A (en) * 2021-04-22 2021-07-16 百度在线网络技术(北京)有限公司 Method and device for generating word stock
CN113128209B (en) * 2021-04-22 2023-11-24 百度在线网络技术(北京)有限公司 Method and device for generating word stock

Also Published As

Publication number Publication date
CN110287493B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US9519634B2 (en) Systems and methods for determining lexical associations among words in a corpus
Zhang et al. Extracting implicit features in online customer reviews for opinion mining
US8108413B2 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
CN110134952A An erroneous-text rejection method, device, and storage medium
CN109471933A A text summary generation method, storage medium, and server
CN110287493A (en) Risk phrase chunking method, apparatus, electronic equipment and storage medium
CN110598066B (en) Bank full-name rapid matching method based on word vector expression and cosine similarity
KR20160149050A (en) Apparatus and method for selecting a pure play company by using text mining
CN115017903A (en) Method and system for extracting key phrases by combining document hierarchical structure with global local information
Bhakuni et al. Evolution and evaluation: Sarcasm analysis for twitter data using sentiment analysis
CN110968661A (en) Event extraction method and system, computer readable storage medium and electronic device
CN103116752A (en) Picture auditing method and system
Audichya et al. Stanza type identification using systematization of versification system of Hindi poetry
CN110347806A Original-text identification method, device, equipment, and computer-readable storage medium
CN112632964B (en) NLP-based industry policy information processing method, device, equipment and medium
Radygin et al. Application of text mining technologies in Russian language for solving the problems of primary financial monitoring
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN112465262A (en) Event prediction processing method, device, equipment and storage medium
CN116561320A (en) Method, device, equipment and medium for classifying automobile comments
CN107590163B Text feature selection method, device, and system
CN112685548B (en) Question answering method, electronic device and storage device
Rao et al. Model for improving relevant feature extraction for opinion summarization
Li et al. Confidence estimation and reputation analysis in aspect extraction
Chou et al. On the Construction of Web NER Model Training Tool based on Distant Supervision
CN111797213A (en) Method for mining financial risk clues from unstructured network information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant