CN110287493A - Risk phrase chunking method, apparatus, electronic equipment and storage medium - Google Patents

Publication number
CN110287493A
CN110287493A
Authority
CN
China
Prior art keywords
phrase
risk
text
scheduled
phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910580521.1A
Other languages
Chinese (zh)
Other versions
CN110287493B (en)
Inventor
高影繁
刘志辉
姚长青
李岩
崔笛
郑明�
浦墨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA filed Critical INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority to CN201910580521.1A priority Critical patent/CN110287493B/en
Publication of CN110287493A publication Critical patent/CN110287493A/en
Application granted granted Critical
Publication of CN110287493B publication Critical patent/CN110287493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The embodiments of the present application provide a risk phrase chunking method, apparatus, electronic device and storage medium, relating to the field of text-processing technology. The method comprises: performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list; processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list; and then merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases. The method of the embodiments can identify risk phrases quickly and accurately; the identified phrases are more comprehensive and carry more information, and can reveal the content of risk topics well.

Description

Risk phrase chunking method, apparatus, electronic equipment and storage medium
Technical field
The present application relates to the field of text-processing technology, and in particular to a risk phrase chunking method, apparatus, electronic device and storage medium.
Background technique
Risk information comprises the forward-looking, decision-relevant judgments and warnings that an enterprise makes about existing or potential factors affecting its survival and development, based on its external environment (politics, economy, society, market and so on) combined with its internal environment (finance, management and the like). Risk information helps alleviate information asymmetry and improves the transparency of a company's production and operations; particularly for risk early warning, the value content of risk information is higher than that of general voluntarily disclosed information.
At present, the key-phrase extraction results of the prior art tend toward single vocabulary items with short word length, so the effect of risk phrase recognition is poor: topics cannot be revealed well, a large amount of semantic content is lost, and the content of risk topics cannot be characterized adequately.
Summary of the invention
The present application provides a risk phrase chunking method, apparatus, electronic device and storage medium, to solve the technical problem that existing risk phrase recognition on risk description texts is ineffective.
In a first aspect, a risk phrase chunking method is provided, the method comprising:
performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list;
processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list;
merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
On the basis of the above technical solution, the risk description text is filtered based on a predetermined filtering rule;
part-of-speech tagging is performed on the filtered risk description text, and words of predetermined parts of speech are screened to form a text to be recognized;
word strings whose frequency of occurrence in the text to be recognized is greater than a preset quantity threshold are counted as candidate phrases;
risk phrases are picked out from the candidate phrases using the predetermined phrase chunking algorithm.
On the basis of the above technical solution, the predetermined filtering rule comprises: filtering stop words according to a predetermined stop-word list;
the screening of words of predetermined parts of speech comprises:
screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text.
On the basis of the above technical solution, picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm comprises:
calculating the mutual-information value of each candidate phrase using mutual information;
calculating the left entropy and right entropy of each candidate phrase using left-right entropy;
calculating the weighted value of each candidate phrase according to a predetermined weighting algorithm based on the statistical magnitudes of each candidate phrase, the statistical magnitudes comprising either the mutual-information value, the left entropy, the right entropy and the frequency of occurrence of the candidate phrase in the text to be recognized, or the mutual-information value, left entropy and right entropy alone;
selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weighted value of each candidate phrase.
On the basis of the above technical solution, the predetermined selection rule comprises:
sorting the weighted values of the candidate phrases from largest to smallest and choosing the top preset number of candidate phrases as risk phrases; or,
taking a candidate phrase as a risk phrase when its weighted value is not less than a preset threshold.
On the basis of the above technical solution, processing the risk description text with the predetermined word-segmentation tool comprises:
combining the words in the risk description text to form phrases to be matched;
querying the phrases to be matched against the predetermined lexicon of the word-segmentation tool, and determining the phrases that match vocabulary in the predetermined lexicon;
filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list.
On the basis of the above technical solution, the predetermined filtering rule comprises at least one of the following:
filtering single characters; filtering numbers; filtering phrases whose number of constituent characters is less than a predetermined value.
On the basis of the above technical solution, before performing phrase recognition on the risk description text using the predetermined phrase chunking algorithm, the method further comprises:
performing recognition processing on a predetermined text by paragraph, and extracting the paragraphs containing risk descriptions as the risk description text.
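The patent does not specify how a paragraph is recognised as containing a risk description; the sketch below uses an illustrative keyword test as a stand-in, with the function name and keywords being assumptions:

```python
def extract_risk_paragraphs(document: str, keywords=("风险", "不确定")) -> list:
    """Split a text into paragraphs and keep those that look like risk
    descriptions. The keyword test stands in for whatever paragraph
    classifier the predetermined recognition processing actually uses."""
    paragraphs = [p.strip() for p in document.split("\n") if p.strip()]
    return [p for p in paragraphs if any(k in p for k in keywords)]
```

The surviving paragraphs, joined together, form the risk description text fed to steps S100 and S200.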
In a second aspect, a risk phrase chunking device is provided, comprising:
a first obtaining module, for performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list;
a second obtaining module, for processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list;
a merging module, for merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
On the basis of the above technical solution, the first obtaining module comprises:
a first filtering module, for filtering the risk description text based on a predetermined filtering rule;
a screening module, for performing part-of-speech tagging on the filtered risk description text and screening words of predetermined parts of speech to form a text to be recognized;
a statistics module, for counting word strings whose frequency of occurrence in the text to be recognized is greater than a preset quantity threshold as candidate phrases;
a choosing module, for picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm.
On the basis of the above technical solution, the second obtaining module comprises:
a combining module, for combining the words in the risk description text to form phrases to be matched;
a matching module, for querying the phrases to be matched against the predetermined lexicon of the word-segmentation tool and determining the phrases that match vocabulary in the predetermined lexicon;
a second filtering module, for filtering the matched phrases based on a predetermined filtering rule and taking the filtered phrases as the risk phrases of the second risk phrase list.
In a third aspect, an electronic device is provided, comprising:
a processor; and
a memory configured to store machine-readable instructions which, when executed by the processor, cause the processor to execute the risk phrase chunking method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, for storing computer instructions which, when run on a computer, allow the computer to execute the risk phrase chunking method of the first aspect.
The technical solution provided by the present application has the following beneficial effects:
phrase recognition is performed on the risk description text using a predetermined phrase chunking algorithm, preliminarily extracting phrases to obtain the first risk phrase list; the risk description text is then processed with a predetermined word-segmentation tool, extending the risk phrases to obtain the second risk phrase list; finally, the first risk phrase list and the second risk phrase list are merged to determine a risk phrase list comprising multiple risk phrases. When phrases are extracted with the predetermined phrase chunking algorithm, the information-characterization ability of a phrase is clearly higher than that of a single keyword, and the accuracy of the phrases extracted by the phrase chunking algorithm is high; however, the number of risk phrases obtained by the phrase chunking algorithm alone is small and insufficient to characterize all the information the raw risk text intends to express. The present invention further processes the risk description text in combination with a word-segmentation tool, extending the risk phrases and further improving accuracy, and the risk phrases obtained in the two ways together are more comprehensive. The present invention can identify risk phrases quickly and accurately; the identified phrases are more comprehensive and carry more information, and can reveal risk topics well, solving the technical problem that existing recognition of risk phrases in risk description texts is ineffective.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below.
Fig. 1 is a schematic flowchart of a risk phrase chunking method provided by Embodiment 1 of the present application;
Fig. 2 is a schematic flowchart of a risk phrase chunking method provided by Embodiment 2 of the present application;
Fig. 3 is a schematic flowchart of a risk phrase chunking method provided by Embodiment 3 of the present application;
Fig. 4 is a schematic structural diagram of a risk phrase chunking device provided by Embodiment 4 of the present application;
Fig. 5 is a schematic structural diagram of the first obtaining module provided by Embodiment 5 of the present application;
Fig. 6 is a schematic structural diagram of the second obtaining module provided by Embodiment 6 of the present application;
Fig. 7 is a schematic structural diagram of the electronic device provided by Embodiment 7 of the present application.
Specific embodiment
The embodiments of the present application are described in detail below, with examples shown in the accompanying drawings, where identical or similar labels throughout denote identical or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are only used to explain the present application, and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an" and "the" used herein may also include the plural forms. It should be further understood that the wording "comprising" used in the description of the present application refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The wording "and/or" used herein includes all or any unit and all combinations of one or more of the associated listed items.
To make the purposes, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below in conjunction with the accompanying drawings.
First, several terms involved in the present application are introduced and explained:
Mutual information (Mutual Information) is a useful information measure in information theory: it refers to the correlation between two event sets. Mutual information is a common method of computational-linguistics model analysis; it measures the mutual dependence between two objects, and in filtering problems it measures how well a feature discriminates a topic. The definition of mutual information is similar to that of cross entropy. Originally a concept in information theory used to express relationships between pieces of information, mutual information is a measure of the statistical correlation of two random variables. Feature extraction using mutual-information theory is based on the assumption that an entry whose frequency of occurrence is high in a particular category but relatively low in other categories has large mutual information with that category. In general, mutual information is used as the measure between a feature word and a category: if the feature word belongs to the category, their mutual information is largest.
Left-right entropy is an important statistical feature of patterns, although computing the left-right entropy of massive word strings over a large-scale corpus involves reading a large number of unrelated characters. The larger the left and right entropy, the richer the words surrounding a given word string, meaning the greater its degree of freedom and the greater the possibility that it becomes an independent word.
Existing recognition technology is not well suited to the risk phrase recognition and extraction of annual-report risk description texts. Topic-word extraction results tend toward single vocabulary items with short word length, which cannot characterize risk topics well and also lose a large amount of semantic content. The information-characterization ability of a phrase is clearly higher than that of a single keyword: for example, "aging growth" conveys a richer meaning than the two words "aging" and "growth", and "managerial talent" conveys more information than the two words "management" and "talent". Moreover, given the short length of annual-report risk descriptions, existing key-phrase extraction algorithms can identify only a limited number of phrases; the importance of some important words in annual-report risk descriptions, such as vocabulary with strong risk-warning meaning like "growth", "decline" and "shortage", is lowered, and the extracted phrases are mostly composed of noun vocabulary. The overall recognition effect is poor and cannot satisfy the demand for risk phrase recognition in annual-report risk description texts.
The risk phrase chunking method, apparatus, electronic device and storage medium provided by the present application are intended to solve the above technical problems of the prior art.
The technical solution of the present application, and how it solves the above technical problems, are described in detail below with specific embodiments. The specific embodiments below may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present application are described below in conjunction with the accompanying drawings.
Embodiment one
A risk phrase chunking method is provided in this embodiment of the present application, as shown in Fig. 1. The method comprises:
S100, performing phrase recognition on a risk description text using a predetermined phrase chunking algorithm to obtain a first risk phrase list.
S200, processing the risk description text with a predetermined word-segmentation tool to obtain a second risk phrase list.
S300, merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases. Specifically, duplicate risk phrases are removed during the merge.
On the basis of the above embodiment, the number of risk phrases obtained with the predetermined phrase chunking algorithm alone is small and insufficient to characterize the information the raw risk text intends to express. Processing the risk description text with a word-segmentation tool as well extends the risk phrases with high accuracy, and the risk phrases obtained in the two ways together are more comprehensive.
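Once the two lists exist, step S300 reduces to a duplicate-removing merge; a minimal sketch (function name illustrative), preserving first-seen order:

```python
def merge_phrase_lists(first_list, second_list):
    """Step S300: merge the two risk-phrase lists, removing duplicate
    risk phrases while preserving the order of first appearance."""
    merged, seen = [], set()
    for phrase in first_list + second_list:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged
```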
Embodiment two
As shown in Fig. 2, the embodiment of the invention provides a possible implementation in which, on the basis of Embodiment 1, step S100 comprises the following steps:
S101, filtering the risk description text based on a predetermined filtering rule. The filtering rule is: filter stop words according to a predetermined stop-word list, and retain punctuation marks, nouns, verbs, adjectives and degree adverbs.
S102, performing part-of-speech tagging on the filtered risk description text, and screening words of predetermined parts of speech to form a text to be recognized.
Further, screening words of predetermined parts of speech comprises: screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text. Not filtering out the punctuation marks prevents content separated by symbols from being joined together once punctuation is removed, ensures that the words constituting a risk phrase are adjacent in position, and avoids extracting noise word strings such as "risk company" or "risk country". Meanwhile, retaining all nouns, verbs, adjectives and degree adverbs ensures that the extracted phrases are of higher quality: words with larger information content such as "growth", "shortage" and "decline" are retained, preventing only a noun from being extracted.
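A minimal sketch of the stop-word filtering and part-of-speech screening of steps S101 and S102, assuming the input is already tagged (the single-letter tags follow jieba-style conventions and are an assumption; a real implementation would obtain them from a part-of-speech tagger):

```python
STOPWORDS = {"的", "了", "和"}          # illustrative stop-word list
KEPT_TAGS = {"n", "v", "a", "d", "x"}  # noun, verb, adjective, degree adverb, punctuation

def screen_tokens(tagged_tokens):
    """Steps S101/S102: drop stop words, then keep only words of the
    predetermined parts of speech; punctuation is retained so content
    separated by symbols is never joined across the boundary."""
    return [word for word, tag in tagged_tokens
            if word not in STOPWORDS and tag in KEPT_TAGS]
```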
S103, counting word strings whose frequency of occurrence in the text to be recognized is greater than a preset quantity threshold as candidate phrases. Specifically, the preset quantity threshold can be set according to the actual text, and a word string can be a combination of two words.
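Step S103 can be sketched as counting adjacent two-word strings, per the note above that a word string can be a combination of two words (the threshold default is illustrative):

```python
from collections import Counter

def candidate_phrases(tokens, threshold=1):
    """Step S103: count adjacent two-word strings in the text to be
    recognized, keeping those whose frequency exceeds the threshold."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {a + b: c for (a, b), c in counts.items() if c > threshold}
```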
S104, picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm.
Specifically, in step S104, picking out risk phrases from the candidate phrases using the predetermined phrase chunking algorithm comprises:
S1041, calculating the mutual-information value of each candidate phrase using mutual information. The mutual-information value indicates the possibility that the candidate phrase forms a phrase, and is proportional to that possibility.
S1042, calculating the left entropy and right entropy of each candidate phrase using left-right entropy. The left entropy and right entropy respectively indicate the possibilities of left and right collocation of the candidate phrase's words, and are proportional to the possibility that the candidate phrase forms a phrase.
Further, steps S1041 and S1042 have no required order; they may be performed simultaneously or one after the other.
S1043, calculating the weighted value of each candidate phrase according to a predetermined weighting algorithm based on the statistical magnitudes of each candidate phrase. The statistical magnitudes comprise either the mutual-information value, the left entropy, the right entropy and the frequency of occurrence of the candidate phrase in the text to be recognized, or the mutual-information value, left entropy and right entropy alone.
S1044, selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weighted value of each candidate phrase.
Further, the weighted values of the candidate phrases are sorted from largest to smallest, and the top preset number of candidate phrases are chosen as risk phrases. In practical applications, meaningless phrases such as numbers can be removed first, and then, for example, the top 20 phrases are selected.
Or, when the weighted value of a candidate phrase is not less than a preset threshold, that candidate phrase is taken as a risk phrase. In practical applications, the threshold is set in advance, and candidate phrases below the preset threshold are excluded.
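The two variants of the selection rule in step S1044 can be sketched together (the parameter defaults are illustrative):

```python
def select_risk_phrases(weighted, top_n=20, threshold=None):
    """Step S1044 selection rule: either keep every phrase whose weight
    is not less than the preset threshold, or take the top-N phrases
    by weighted value."""
    if threshold is not None:
        return [p for p, w in weighted.items() if w >= threshold]
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```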
As an optional implementation, in step S1043, if the statistical magnitudes comprise the mutual-information value, the left entropy, the right entropy and the frequency of occurrence of the candidate phrase, then the first weighted value of each candidate phrase is calculated according to a first predetermined algorithm based on the mutual-information value, left entropy, right entropy and frequency of occurrence of each candidate phrase;
risk phrases are then selected from the candidate phrases using the predetermined selection rule according to the first weighted value of each candidate phrase.
If the statistical magnitudes comprise the mutual-information value, left entropy, right entropy and frequency of occurrence of the candidate phrase, the calculation is as follows:
(1) Calculating the mutual-information value of each candidate phrase using mutual information
Let the two component words of a candidate phrase t be the characters a and b. The mutual information is then calculated as shown in formula 1.1:

MI(t) = log( p(t) / ( p(a) · p(b) ) )   (formula 1.1)

where p(t), p(a) and p(b) respectively denote the probabilities of t, a and b. The probability estimates can be simplified and calculated in the form of normalized frequencies:

p(t) = n_t / N_P   (formula 1.2)
p(a) = n_a / N_T   (formula 1.3)
p(b) = n_b / N_T   (formula 1.4)

where n_t, n_a and n_b respectively denote the numbers of occurrences of t, a and b in the corpus, N_P denotes the total number of candidate-phrase occurrences in the corpus set, and N_T is the total number of single-word occurrences in the corpus set.
The higher the value of the mutual information, the higher the correlation of a and b, and the greater the possibility that a and b form a phrase; conversely, the lower the value of the mutual information, the lower the correlation between a and b and the greater the possibility of a phrase boundary between a and b, and therefore the smaller the possibility that they form a phrase.
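A sketch of formula 1.1 with the normalized-frequency estimates of formulas 1.2 to 1.4; the base-2 logarithm is an assumption, since the patent does not state the base:

```python
import math

def mutual_information(n_t, n_a, n_b, big_n_p, big_n_t):
    """Formula 1.1 with the normalized-frequency probability estimates
    of formulas 1.2-1.4 (log base 2 assumed)."""
    p_t = n_t / big_n_p          # formula 1.2
    p_a = n_a / big_n_t          # formula 1.3
    p_b = n_b / big_n_t          # formula 1.4
    return math.log2(p_t / (p_a * p_b))
```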
(2) Calculating the left entropy and right entropy of each candidate phrase using left-right entropy
Adjacency entropy comprises left adjacency entropy and right adjacency entropy; it is essentially a measurement, using information entropy, of the uncertainty of the words adjoining the candidate phrase on the left and on the right. The lower the uncertainty of the adjacent words, the fewer and more fixed the words before and after the candidate phrase, and thus the lower the possibility that it forms a word; conversely, the more numerous, chaotic and unstable the words before and after the candidate phrase, the higher the possibility that it becomes a word. The left entropy and right entropy are calculated as shown in formulas 2.1 and 2.2:

E_L(W) = − Σ_{a∈A} p(a|W) · log p(a|W)   (formula 2.1)
E_R(W) = − Σ_{b∈B} p(b|W) · log p(b|W)   (formula 2.2)

where E_L and E_R respectively denote the left entropy and right entropy of a candidate phrase, and W denotes the candidate phrase, W = {w1, w2, ..., wn}; A denotes the set of all words appearing on the left of the candidate phrase, with a denoting a word in set A; B denotes the set of all words appearing on the right of the candidate phrase, with b denoting a word in set B. If the E_L and E_R values of a candidate phrase are larger, the words appearing to its left and right are more chaotic and unstable and its collocations richer, and the candidate phrase is therefore more likely to be a phrase.
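Formulas 2.1 and 2.2 are the same computation applied to the left-neighbour and right-neighbour word lists respectively; a sketch (log base again assumed to be 2):

```python
import math
from collections import Counter

def adjacency_entropy(neighbour_words):
    """Formulas 2.1/2.2: Shannon entropy of the words observed on one
    side of a candidate phrase. Richer, freer context gives higher
    entropy; a single fixed neighbour gives zero."""
    counts = Counter(neighbour_words)
    total = len(neighbour_words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Calling it with the left-neighbour list gives E_L and with the right-neighbour list gives E_R.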
(3) Calculating the first weighted value of each candidate phrase according to the first predetermined algorithm
The composition boundary of a phrase is judged using the left and right adjacency entropies, combined with the frequency of occurrence TF for phrase extraction. The mutual information, left entropy, right entropy and frequency TF are fitted to obtain the first weighted value Score, and a threshold is set for phrase recognition. The Score value is calculated as follows:

Score = (NorFreq + NorMI + NorLE + NorRE) / 4   (formula 3.1)

where NorFreq, NorMI, NorLE and NorRE are respectively the values of the frequency of occurrence TF, the mutual information, the left entropy and the right entropy after min-max normalization, calculated as:

NorFreq_i = (Freq_i − MINFreq) / (MAXFreq − MINFreq)   (formula 3.2)
NorMI_i = (MI_i − MINMI) / (MAXMI − MINMI)   (formula 3.3)
NorLE_i = (LE_i − MINLE) / (MAXLE − MINLE)   (formula 3.4)
NorRE_i = (RE_i − MINRE) / (MAXRE − MINRE)   (formula 3.5)

Thus, the higher the first weighted value Score, the higher the possibility that the candidate phrase is a phrase; conversely, the lower the Score, the lower the possibility that the candidate phrase becomes a phrase.
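A sketch of formulas 3.1 to 3.5: min-max normalization of the four statistics followed by their mean (the `bounds` mapping is an illustrative assumption for passing the per-statistic minima and maxima):

```python
def min_max(x, lo, hi):
    """Min-max normalization onto [0, 1] (formulas 3.2-3.5)."""
    return (x - lo) / (hi - lo)

def score(freq, mi, le, re_, bounds):
    """First weighted value, formula 3.1: the mean of the four
    normalized statistics. `bounds` maps each statistic name to its
    (min, max) over all candidate phrases."""
    parts = [min_max(freq, *bounds["freq"]),
             min_max(mi, *bounds["mi"]),
             min_max(le, *bounds["le"]),
             min_max(re_, *bounds["re"])]
    return sum(parts) / 4
```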
As another implementation, in step S1043, if the statistical magnitudes comprise the mutual-information value, left entropy and right entropy, then the second weighted value of each candidate phrase is calculated according to a second predetermined algorithm based on the mutual-information value, left entropy and right entropy of each candidate phrase;
risk phrases are then selected from the candidate phrases using the predetermined selection rule according to the second weighted value of each candidate phrase.
If the statistical magnitudes comprise the mutual-information value, left entropy and right entropy, the calculation is as follows:
(1) Calculating the mutual-information value of each candidate phrase using mutual information
The mutual information can be calculated in the same form as formula 1.1, with the probabilities estimated from frequencies, where t denotes a candidate phrase, N denotes the number of all candidate phrases in the set whose length meets the requirement, and n_t, n_a and n_b respectively denote the frequencies with which the words t, a and b occur in the text. The larger the mutual-information value, the tighter the combination between the words and the greater the possibility that they become a phrase; conversely, the smaller the mutual-information value, the more uncorrelated the words are and the less able they are to constitute a phrase.
(2) Calculating the left entropy and right entropy of each candidate phrase using left-right entropy
The left-right entropy can be calculated in the same form as formulas 2.1 and 2.2, where E_L denotes the left entropy of a word string, E_R its right entropy, W the candidate phrase set, and A the set of all words appearing on the left of the candidate phrase, a ∈ A; similarly, B denotes the set of all words appearing on the right of the candidate phrase, b ∈ B. If the E_L and E_R values of a word string are larger, the words collocating on its left and right are more rich and varied, and the probability that the word string forms a phrase is greater.
(3) Calculating the second weighted value of each candidate phrase according to the second predetermined algorithm
The mutual-information value, left entropy and right entropy are fitted according to a predetermined algorithm, for example averaged, to obtain the second weighted value.
Both implementations above calculate a weighted value for each candidate phrase, obtaining a first weighted value or a second weighted value; risk phrases are then selected from the candidate phrases using the predetermined selection rule, comprising:
sorting the first or second weighted values of the candidate phrases from largest to smallest and choosing the top preset number of candidate phrases as risk phrases; or, when the first or second weighted value of a candidate phrase is not less than a preset threshold, taking that candidate phrase as a risk phrase.
Embodiment three
As shown in Fig. 3, the embodiment of the invention provides a possible implementation in which, on the basis of Embodiment 1, step S200 comprises the following steps:
S201, combining the words in the risk description text to form phrases to be matched.
S202, querying the phrases to be matched against the predetermined lexicon of the word-segmentation tool, and determining the phrases that match vocabulary in the predetermined lexicon.
S203, filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list. The predetermined filtering rule comprises at least one of the following: filtering single characters; filtering numbers; filtering phrases whose number of constituent characters is less than a predetermined value. Further, the phrases whose number of characters is less than the predetermined value can be everyday expressions of length less than 3, which further improves segmentation efficiency.
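Steps S201 to S203 can be sketched with a plain Python set standing in for the predetermined lexicon (the actual embodiment described below loads an encyclopedia vocabulary into jieba); all names here are illustrative:

```python
def match_phrases(words, lexicon, min_len=3):
    """Steps S201-S203: join runs of adjacent words into phrases to be
    matched, keep those found in the lexicon, then filter out single
    characters, pure numbers, and phrases shorter than min_len."""
    candidates = set()
    for i in range(len(words)):
        joined = words[i]
        for j in range(i + 1, len(words)):
            joined += words[j]
            candidates.add(joined)
    return sorted(p for p in candidates & lexicon
                  if len(p) >= min_len and not p.isdigit())
```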
With the phrase recognition algorithm of embodiment one alone, the number of phrases obtained by combining words via mutual information is limited, and when the left/right entropy thresholds are set high, some informative phrases fail to be recognised; for example, terms such as "economic downturn" and "rising unemployment rate" are not identified.
Specifically, the dictionaries of common word segmentation tools cover very few risk phrases, and in most cases a single risk phrase is cut into several words. In this embodiment, the segmentation tool may be jieba. A massive encyclopedia vocabulary is obtained from a Chinese encyclopedic knowledge graph and stored in a txt file, one entry per line; this txt file then replaces jieba's built-in dictionary as the lexicon. The enlarged vocabulary makes the obtained phrases more comprehensive, preventing phrases from being missed and risk topics from being poorly represented. Meanwhile, considering that jieba's built-in segmentation dictionary contains about 350,000 entries while the encyclopedia entity dictionary contains about 12,000,000 entries, roughly 34 times the size of the original dictionary, directly initialising jieba with the encyclopedia entity dictionary as the segmentation dictionary may cause the program to crash. Since jieba recognises out-of-vocabulary words with a character-based hidden Markov model (HMM) with word-formation ability, that model is disabled here to guarantee segmentation efficiency.
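One way to prepare the replacement lexicon is to write the encyclopedia vocabulary into jieba's dictionary format, one "word frequency" pair per line; the frequency constant below is an assumption, as the patent only specifies one entry per line. Loading the file and disabling the HMM would then use jieba's standard `set_dictionary` call and `HMM=False` option:

```python
def build_jieba_dict(vocab, path, default_freq=3):
    """Write an encyclopedia vocabulary into jieba's main-dictionary
    format: one 'word frequency' pair per line (the frequency value is
    an assumed constant, not specified by the patent)."""
    with open(path, "w", encoding="utf-8") as f:
        for word in vocab:
            f.write(f"{word} {default_freq}\n")

# Usage (requires jieba, not imported here):
#   jieba.set_dictionary("encyclopedia_dict.txt")  # replace built-in dict
#   words = jieba.lcut(text, HMM=False)            # HMM disabled, as above
```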
For recognition using encyclopedia vocabulary, Chinese encyclopedic knowledge graph data can be obtained from Fudan University's open Chinese knowledge graph website, which contains more than 9,000,000 encyclopedia entities and more than 66,000,000 triples. The general Chinese encyclopedic knowledge graph provided by Fudan University covers entries from Chinese encyclopedia websites such as Baidu Baike, Hudong Baike and Chinese Wikipedia, including concrete things, celebrities, abstract concepts, literary works, hot events, technical terms, Chinese words and combinations on specific topics. From it, more than 12,000,000 encyclopedia entity words in total are obtained, covering almost all fields with high accuracy.
When any phrase to be matched belongs to the encyclopedia vocabulary, that phrase is taken as a risk phrase, enlarging the vocabulary of risk phrases.
An embodiment of the present invention also provides another possible implementation in which, on the basis of embodiment one, the method further comprises, before step S100, the following step:
performing paragraph-level recognition on a predetermined text, and extracting the paragraphs containing risk descriptions from the text as the risk description text.
Specifically, the full text of a listed company's annual report is taken as an example.
First, the full texts of the annual reports of all A-share listed companies are obtained and the risk description information is extracted from them. Owing to the writing conventions of annual reports, risk description information mainly exists in the form of short texts: each risk category (e.g. "risk of increased accounts receivable and reduced cash flow" in the figure) corresponds to one brief, concise paragraph of specific risk description. Second, the risk description information is split by paragraph, so that each listed company corresponds to several risk description texts. Finally, data cleansing is performed: contents that carry almost no risk-indicating information, such as the "commitment to avoid horizontal competition" and "commitment on restricted shares" sections, are removed, along with illegal symbols, obviously non-textual content and mojibake.
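The cleaning step can be sketched as below; the boilerplate markers and the character whitelist are assumptions standing in for the patent's "commitment" sections and "illegal symbols and mojibake":

```python
import re

# Markers of commitment sections with no risk signal (illustrative).
BOILERPLATE = ("避免同业竞争承诺", "股份限售承诺")

def clean_risk_paragraphs(paragraphs):
    """Keep only risk-describing paragraphs: drop boilerplate commitment
    sections and strip characters outside an assumed whitelist of CJK,
    alphanumerics and common punctuation (removing obvious mojibake)."""
    cleaned = []
    for p in paragraphs:
        if any(marker in p for marker in BOILERPLATE):
            continue
        p = re.sub(r"[^\u4e00-\u9fff0-9A-Za-z，。、；：？！%（）.,;:()\s-]", "", p)
        p = p.strip()
        if p:
            cleaned.append(p)
    return cleaned
```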
On the basis of embodiments one to four, a risk description text is randomly selected from the 2016 annual report of Qingshuiyuan Science and Technology Co., Ltd. as the experimental text, and a risk phrase extraction experiment is carried out on it. The risk phrase extraction results of embodiments one to four are compared with those of the existing HanLP phrase extraction algorithm, as shown in Table 1.
Experimental text: "Risk of a large accounts receivable balance and increasing ageing. As of June 30, 2016, the company's accounts receivable balance was nearly 219 million yuan, part of which has an increasing ageing, and the bad-debt provision accrued has increased accordingly, affecting the company's operating performance. To solve the problem of an excessively high accounts receivable balance, the company has strengthened the responsibility of business personnel for collecting receivables, included the recovery status of receivables in performance feedback and linked it directly to income; the company has also set up a dedicated arrears clean-up group to fully recover the debts owed by key large-amount debtor customers, reinforced the risk assessment of business units' receivables, and taken appropriate legal measures against units with long-standing arrears, long payment cycles and small business volumes in recent years."
Table 1 Comparison of experimental results
Obviously, the risk phrases extracted by the recognition methods of embodiments one to four are more comprehensive and better characterise the information that the original risk text intends to express.
Embodiment four
Fig. 4 shows a risk phrase identification apparatus further provided by an embodiment of the present invention. As shown in Fig. 4, the risk phrase identification apparatus 1 comprises:
a first acquisition module 11, configured to perform phrase recognition on the risk description text using a predetermined phrase recognition algorithm to obtain a first risk phrase list;
a second acquisition module 12, configured to process the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
a merging module 13, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
The number of risk phrases obtained by the first acquisition module 11 alone is small and insufficient to characterise the information that the original risk text intends to express; the risk description text is therefore also processed by the second acquisition module 12, extending the set of risk phrases. The merging module 13 then merges the first risk phrase list and the second risk phrase list to determine the final risk phrase list. The accuracy is high, and the risk phrases obtained by combining the two approaches are more comprehensive.
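Merging the two lists while removing duplicates can be sketched as:

```python
def merge_phrase_lists(first, second):
    """Merge the first (statistical) and second (lexicon-matched) risk
    phrase lists, de-duplicating while preserving first-seen order."""
    seen, merged = set(), []
    for phrase in first + second:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged
```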
In addition, the risk phrase identification apparatus of the present invention may further comprise a text acquisition module, configured to process the predetermined text by paragraph and extract the paragraphs containing risk descriptions from the text to obtain the risk description text.
Embodiment five
Fig. 5 shows the specific content of the first acquisition module 11 further provided by an embodiment of the present invention. As shown in Fig. 5, the first acquisition module 11 comprises:
a first filtering module 111, configured to filter the risk description text based on a predetermined filtering rule; in the actual filtering process, the first filtering module 111 is specifically configured to filter out stop words according to a predetermined stop-word list;
a screening module 112, configured to perform part-of-speech tagging on the filtered risk description text and screen the words of predetermined parts of speech to form the text to be recognised; in the actual filtering process, the screening module 112 is specifically configured to screen nouns, verbs, adjectives and degree adverbs from the filtered risk description text;
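Assuming a POS tagger that yields (word, flag) pairs with jieba/ICTCLAS-style flags (an assumption; the patent does not fix the tag scheme), the screening step reduces to:

```python
# Flag prefixes for noun, verb, adjective and adverb (illustrative).
KEEP_FLAGS = ("n", "v", "a", "d")

def screen_by_pos(tagged_words):
    """Given (word, pos_flag) pairs from a POS tagger (e.g. jieba.posseg),
    keep nouns, verbs, adjectives and adverbs to form the text to be
    recognised."""
    return [w for w, flag in tagged_words if flag.startswith(KEEP_FLAGS)]
```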
a statistics module 113, configured to count the word strings whose number of occurrences in the text to be recognised exceeds a preset quantity threshold, as candidate phrases;
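Counting candidate word strings can be sketched as a character n-gram count (the maximum length and the threshold value are illustrative):

```python
from collections import Counter

def candidate_word_strings(text, max_len=6, threshold=1):
    """Count every character n-gram (length 2..max_len) in the text to be
    recognised and keep those occurring more often than the preset
    quantity threshold, as candidate phrases."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s for s, c in counts.items() if c > threshold}
```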
a selection module 114, configured to pick out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm.
In the actual process, the selection module 114 is specifically configured to calculate the mutual information value of each candidate phrase using mutual information, calculate the left entropy and right entropy of each candidate phrase, calculate the weight value of each candidate phrase according to a predetermined weighting algorithm based on the statistical measures of each candidate phrase, and select risk phrases from the candidate phrases using a predetermined selection rule according to the weight values. The statistical measures comprise the mutual information value, the left entropy, the right entropy and the number of occurrences of the candidate phrase; or the mutual information value, the left entropy and the right entropy, to which the calculation methods of embodiment two apply.
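The mutual information value used by the selection module can be computed as pointwise mutual information over corpus counts (a standard formulation; the patent does not spell out the exact estimator):

```python
import math

def mutual_information(freq_xy, freq_x, freq_y, total):
    """Pointwise mutual information of a candidate phrase xy formed from
    parts x and y: PMI = log(p(xy) / (p(x) * p(y))); all frequencies are
    raw counts over the same corpus of `total` tokens."""
    p_xy = freq_xy / total
    p_x = freq_x / total
    p_y = freq_y / total
    return math.log(p_xy / (p_x * p_y))
```

A high PMI means the two parts co-occur far more often than chance, so the word string is likely a genuine phrase.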
Embodiment six
Fig. 6 shows the specific content of the second acquisition module 12 further provided by an embodiment of the present invention. As shown in Fig. 6, the second acquisition module 12 comprises:
a combination module 121, configured to combine the individual words in the risk description text to form phrases to be matched;
a matching module 122, configured to query the phrases to be matched against the predetermined lexicon of the word segmentation tool and determine the phrases matching vocabulary entries in the predetermined lexicon;
a second filtering module 123, configured to filter the matched phrases based on a predetermined filtering rule and take the filtered phrases as the risk phrases of the second risk phrase list; in the actual filtering process, the second filtering module 123 is further configured to perform at least one of: filtering out single characters; filtering out numbers; filtering out phrases whose character count is below a predetermined value.
Embodiment seven
An embodiment of the present invention also provides an electronic device. As shown in Fig. 7, the electronic device 4000 comprises:
a processor 4001; and
a memory 4003 configured to store machine-readable instructions which, when executed by the processor, cause the processor to perform the risk phrase identification method of the preceding method embodiments.
The processor 4001 is connected with the memory 4003, for example via a bus 4002. Further, the electronic device 4000 may also comprise a transceiver 4004. It should be noted that in practical applications the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 is applied in the embodiments of the present application to implement the first acquisition module 11, the second acquisition module 12 and the merging module 13 shown in Fig. 4.
The transceiver 4004 comprises a receiver and a transmitter and is applied in the embodiments of the present application to the acquisition of the risk description text by the text acquisition module. The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules and circuits described in the present disclosure. The processor 4001 may also be a combination implementing computing functions, for example a combination comprising one or more microprocessors, or a combination of a DSP and a microprocessor. The bus 4002 may comprise a path that transfers information between the above components. The bus 4002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration only one thick line is shown in Fig. 7, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
The memory 4003 is used to store the application program code for executing the solution of the present application, and execution is controlled by the processor 4001. The processor 4001 executes the application program code stored in the memory 4003 to realise the actions of the risk phrase identification apparatus provided in the embodiment shown in Fig. 4.
An embodiment of the present invention also provides a computer-readable storage medium for storing computer instructions which, when run on a computer, enable the computer to execute the corresponding contents of the preceding method embodiments.
In addition, the risk phrase identification method of the present invention is mainly used to improve the accuracy of phrase extraction from annual-report risk description texts, so as to effectively mine the content contained in these texts and raise the utilisation rate of the textual content of annual reports. Besides its use in research on listed-company annual-report databases, the method can also broaden the lines of enquiry in enterprise risk early-warning research, making up for the tendency of existing research to focus on annual-report financial data while ignoring the textual content of annual reports. Risk phrase identification has two important practical functions:
(1) Providing technical support for follow-up research on listed-company annual-report databases. In recent years annual reports have grown ever longer, yet the three main financial statements, as the body of the annual report, have hardly grown, and the financial information they can disclose has reached its ceiling; meanwhile the textual content outside the financial statements has become richer and richer, with various supplementary notes and explanations disclosing more and more information. How to mine valuable information from the large volume of risk description text is a major issue for listed-company annual-report database projects, and it largely determines the accuracy and comprehensiveness of subsequent analysis and prediction. The risk phrase identification proposed by the present invention lays the foundation for the follow-up research of such projects.
(2) Providing technical support for risk early-warning research. The recognition method of the present invention can identify and extract high-quality, accurate, information-rich risk phrases. As the decision relevance of annual-report data is increasingly acknowledged, the textual information in annual reports is also gradually being taken seriously. The present invention provides technical support for scholars and enterprises to mine annual-report texts, helping to remedy the shortcomings of existing risk early-warning research and improve the comprehensiveness of enterprise risk early warning.
It should be understood that although the steps in the flow charts of the drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in the flow charts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A risk phrase identification method, characterised by comprising:
performing phrase recognition on a risk description text using a predetermined phrase recognition algorithm to obtain a first risk phrase list;
processing the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
2. The risk phrase identification method according to claim 1, characterised in that performing phrase recognition on the risk description text using the predetermined phrase recognition algorithm comprises:
filtering the risk description text based on a predetermined filtering rule;
performing part-of-speech tagging on the filtered risk description text, and screening the words of predetermined parts of speech to form a text to be recognised;
counting the word strings whose number of occurrences in the text to be recognised exceeds a preset quantity threshold, as candidate phrases;
picking out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm.
3. The risk phrase identification method according to claim 2, characterised in that the predetermined filtering rule comprises: filtering out stop words according to a predetermined stop-word list;
and that screening the words of predetermined parts of speech comprises:
screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text.
4. The risk phrase identification method according to claim 2, characterised in that picking out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm comprises:
calculating the mutual information value of each candidate phrase using mutual information;
calculating the left entropy and the right entropy of each candidate phrase;
calculating the weight value of each candidate phrase according to a predetermined weighting algorithm based on the statistical measures of each candidate phrase, the statistical measures comprising the mutual information value, the left entropy, the right entropy and the number of occurrences of the candidate phrase in the text to be recognised; or the mutual information value, the left entropy and the right entropy;
selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weight value of each candidate phrase.
5. The risk phrase identification method according to claim 4, characterised in that the predetermined selection rule comprises:
sorting the candidate phrases by weight value in descending order and taking the top predetermined number of candidate phrases as risk phrases; or,
taking a candidate phrase as a risk phrase when its weight value is not less than a preset threshold.
6. The risk phrase identification method according to claim 1, characterised in that processing the risk description text using the predetermined word segmentation tool comprises:
combining the individual words in the risk description text to form phrases to be matched;
querying the phrases to be matched against a predetermined lexicon of the word segmentation tool, and determining the phrases matching vocabulary entries in the predetermined lexicon;
filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list.
7. The risk phrase identification method according to claim 6, characterised in that the predetermined filtering rule comprises at least one of the following:
filtering out single characters; filtering out numbers; filtering out phrases whose character count is below a predetermined value.
8. The risk phrase identification method according to claim 1, characterised in that before performing phrase recognition on the risk description text using the predetermined phrase recognition algorithm, the method further comprises:
performing paragraph-level recognition on a predetermined text, and extracting the paragraphs containing risk descriptions from the text as the risk description text.
9. A risk phrase identification apparatus, characterised by comprising:
a first acquisition module, configured to perform phrase recognition on a risk description text using a predetermined phrase recognition algorithm to obtain a first risk phrase list;
a second acquisition module, configured to process the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
a merging module, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
10. The risk phrase identification apparatus according to claim 9, characterised in that the first acquisition module comprises:
a first filtering module, configured to filter the risk description text based on a predetermined filtering rule;
a screening module, configured to perform part-of-speech tagging on the filtered risk description text and screen the words of predetermined parts of speech to form a text to be recognised;
a statistics module, configured to count the word strings whose number of occurrences in the text to be recognised exceeds a preset quantity threshold, as candidate phrases;
a selection module, configured to pick out risk phrases from the candidate phrases using the predetermined phrase recognition algorithm.
11. The risk phrase identification apparatus according to claim 9, characterised in that the second acquisition module comprises:
a combination module, configured to combine the individual words in the risk description text to form phrases to be matched;
a matching module, configured to query the phrases to be matched against a predetermined lexicon of the word segmentation tool and determine the phrases matching vocabulary entries in the predetermined lexicon;
a second filtering module, configured to filter the matched phrases based on a predetermined filtering rule and take the filtered phrases as the risk phrases of the second risk phrase list.
12. An electronic device, characterised by comprising:
a processor; and
a memory configured to store machine-readable instructions which, when executed by the processor, cause the processor to perform the risk phrase identification method of any one of claims 1-8.
13. A computer-readable storage medium, characterised in that the computer storage medium is used to store computer instructions which, when run on a computer, enable the computer to execute the risk phrase identification method of any one of claims 1-8.
CN201910580521.1A 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium Active CN110287493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910580521.1A CN110287493B (en) 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910580521.1A CN110287493B (en) 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110287493A true CN110287493A (en) 2019-09-27
CN110287493B CN110287493B (en) 2023-04-18

Family

ID=68020086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910580521.1A Active CN110287493B (en) 2019-06-28 2019-06-28 Risk phrase identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110287493B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222316A (en) * 2020-01-03 2020-06-02 北京小米移动软件有限公司 Text detection method, device and storage medium
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field
CN113128209A (en) * 2021-04-22 2021-07-16 百度在线网络技术(北京)有限公司 Method and device for generating word stock

Citations (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361064A (en) * 2005-12-16 2009-02-04 Emil有限公司 A text editing apparatus and method
US20110191098A1 (en) * 2010-02-01 2011-08-04 Stratify, Inc. Phrase-based document clustering with automatic phrase extraction
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method
CN104484377A (en) * 2014-12-09 2015-04-01 百度在线网络技术(北京)有限公司 Generating method and device of substitute dictionaries
US20150195406A1 (en) * 2014-01-08 2015-07-09 Callminer, Inc. Real-time conversational analytics facility
US20150207819A1 (en) * 2013-01-23 2015-07-23 The Privacy Factor, LLC Methods for analyzing application privacy and devices thereof
CN104995650A (en) * 2011-12-27 2015-10-21 汤姆森路透社全球资源公司 Methods and systems for generating composite index using social media sourced data and sentiment analysis
CN105956740A (en) * 2016-04-19 2016-09-21 北京深度时代科技有限公司 Semantic risk calculating method based on text logical characteristic
CN106170553A (en) * 2013-12-13 2016-11-30 现代治疗公司 Nucleic acid molecules modified and application thereof
CN106649597A (en) * 2016-11-22 2017-05-10 浙江大学 Method for automatically establishing back-of-book indexes of book based on book contents
CN107085584A (en) * 2016-11-09 2017-08-22 中国长城科技集团股份有限公司 A kind of cloud document management method, system and service end based on content
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN107405325A (en) * 2015-02-06 2017-11-28 英特塞普特医药品公司 Pharmaceutical composition for combination treatment
CN107688594A (en) * 2017-05-05 2018-02-13 平安科技(深圳)有限公司 The identifying system and method for risk case based on social information
CN108021558A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method and device, electronic equipment and storage medium
CN108228556A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Key phrase extracting method and device
CN108447534A (en) * 2018-05-18 2018-08-24 灵玖中科软件(北京)有限公司 A kind of electronic health record data quality management method based on NLP
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
US20180309581A1 (en) * 2017-04-19 2018-10-25 International Business Machines Corporation Decentralized biometric signing of digital contracts
CN108764485A (en) * 2011-01-06 2018-11-06 电子湾有限公司 The interest-degree calculated in recommendation tools is recommended
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN109033224A (en) * 2018-06-29 2018-12-18 阿里巴巴集团控股有限公司 A kind of Risk Text recognition methods and device
CN109145313A (en) * 2018-07-18 2019-01-04 广州杰赛科技股份有限公司 Interpretation method, device and the storage medium of sentence
CN109299228A (en) * 2018-11-27 2019-02-01 阿里巴巴集团控股有限公司 The text Risk Forecast Method and device that computer executes
CN109460499A (en) * 2018-10-16 2019-03-12 青岛聚看云科技有限公司 Target search word generation method and device, electronic equipment, storage medium
CN109871426A (en) * 2018-12-18 2019-06-11 国网浙江桐乡市供电有限公司 A kind of monitoring recognition methods of confidential data
CN109872162A (en) * 2018-11-21 2019-06-11 阿里巴巴集团控股有限公司 A kind of air control classifying identification method and system handling customer complaint information
CN109918921A (en) * 2018-12-18 2019-06-21 国网浙江桐乡市供电有限公司 A kind of network communication data concerning security matters detection method

Patent Citations (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361064A (en) * 2005-12-16 2009-02-04 Emil有限公司 A text editing apparatus and method
US20110191098A1 (en) * 2010-02-01 2011-08-04 Stratify, Inc. Phrase-based document clustering with automatic phrase extraction
US20130185060A1 (en) * 2010-02-01 2013-07-18 Stratify, Inc. Phrase based document clustering with automatic phrase extraction
CN108764485A (en) * 2011-01-06 2018-11-06 电子湾有限公司 The interest-degree calculated in recommendation tools is recommended
CN104995650A (en) * 2011-12-27 2015-10-21 汤姆森路透社全球资源公司 Methods and systems for generating composite index using social media sourced data and sentiment analysis
US20160142445A1 (en) * 2013-01-23 2016-05-19 The Privacy Factor, LLC Methods and devices for analyzing user privacy based on a user's online presence
US20170111395A1 (en) * 2013-01-23 2017-04-20 The Privacy Factor, LLC Generating a privacy rating for an application or website
US20150207819A1 (en) * 2013-01-23 2015-07-23 The Privacy Factor, LLC Methods for analyzing application privacy and devices thereof
CN106170553A (en) * 2013-12-13 2016-11-30 现代治疗公司 Nucleic acid molecules modified and application thereof
US20150195406A1 (en) * 2014-01-08 2015-07-09 Callminer, Inc. Real-time conversational analytics facility
US20170013127A1 (en) * 2014-01-08 2017-01-12 Callminer, Inc. Real-time in-stream compliance monitoring facility
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method
CN104484377A (en) * 2014-12-09 2015-04-01 百度在线网络技术(北京)有限公司 Generating method and device of substitute dictionaries
CN107405325A (en) * 2015-02-06 2017-11-28 英特塞普特医药品公司 Pharmaceutical composition for combination treatment
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN105956740A (en) * 2016-04-19 2016-09-21 北京深度时代科技有限公司 Semantic risk calculating method based on text logical characteristic
CN107085584A (en) * 2016-11-09 2017-08-22 中国长城科技集团股份有限公司 A kind of cloud document management method, system and service end based on content
CN106649597A (en) * 2016-11-22 2017-05-10 浙江大学 Method for automatically establishing back-of-book indexes of book based on book contents
CN108228556A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Key phrase extracting method and device
US20180309581A1 (en) * 2017-04-19 2018-10-25 International Business Machines Corporation Decentralized biometric signing of digital contracts
CN107688594A (en) * 2017-05-05 2018-02-13 平安科技(深圳)有限公司 The identifying system and method for risk case based on social information
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108021558A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method and device, electronic equipment and storage medium
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108447534A (en) * 2018-05-18 2018-08-24 灵玖中科软件(北京)有限公司 A kind of electronic health record data quality management method based on NLP
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A topic phrase extraction method
CN109033224A (en) * 2018-06-29 2018-12-18 阿里巴巴集团控股有限公司 A risk text recognition method and device
CN109145313A (en) * 2018-07-18 2019-01-04 广州杰赛科技股份有限公司 Sentence translation method, device, and storage medium
CN109460499A (en) * 2018-10-16 2019-03-12 青岛聚看云科技有限公司 Target search word generation method and device, electronic equipment, storage medium
CN109872162A (en) * 2018-11-21 2019-06-11 阿里巴巴集团控股有限公司 A risk-control classification and identification method and system for handling customer complaint information
CN109299228A (en) * 2018-11-27 2019-02-01 阿里巴巴集团控股有限公司 Computer-executed text risk prediction method and device
CN109871426A (en) * 2018-12-18 2019-06-11 国网浙江桐乡市供电有限公司 A monitoring and recognition method for confidential data
CN109918921A (en) * 2018-12-18 2019-06-21 国网浙江桐乡市供电有限公司 A method for detecting classified content in network communication data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222316A (en) * 2020-01-03 2020-06-02 北京小米移动软件有限公司 Text detection method, device and storage medium
CN111222316B (en) * 2020-01-03 2023-08-29 北京小米移动软件有限公司 Text detection method, device and storage medium
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field
CN113128209A (en) * 2021-04-22 2021-07-16 百度在线网络技术(北京)有限公司 Method and device for generating word stock
CN113128209B (en) * 2021-04-22 2023-11-24 百度在线网络技术(北京)有限公司 Method and device for generating word stock

Also Published As

Publication number Publication date
CN110287493B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US9519634B2 (en) Systems and methods for determining lexical associations among words in a corpus
Zhang et al. Extracting implicit features in online customer reviews for opinion mining
US8108413B2 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
CN110134952A An erroneous-text rejection method, device, and storage medium
CN109471933A A text summary generation method, storage medium, and server
CN110287493A (en) Risk phrase chunking method, apparatus, electronic equipment and storage medium
CN110598066B (en) Bank full-name rapid matching method based on word vector expression and cosine similarity
KR20160149050A (en) Apparatus and method for selecting a pure play company by using text mining
CN115017903A (en) Method and system for extracting key phrases by combining document hierarchical structure with global local information
Bhakuni et al. Evolution and evaluation: Sarcasm analysis for twitter data using sentiment analysis
CN110968661A (en) Event extraction method and system, computer readable storage medium and electronic device
CN103116752A (en) Picture auditing method and system
Audichya et al. Stanza type identification using systematization of versification system of Hindi poetry
CN110347806A Original-text identification method, device, equipment, and computer-readable storage medium
CN112632964B (en) NLP-based industry policy information processing method, device, equipment and medium
Radygin et al. Application of text mining technologies in Russian language for solving the problems of primary financial monitoring
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN112465262A (en) Event prediction processing method, device, equipment and storage medium
CN116561320A (en) Method, device, equipment and medium for classifying automobile comments
CN107590163B Text feature selection method, device, and system
CN112685548B (en) Question answering method, electronic device and storage device
Rao et al. Model for improving relevant feature extraction for opinion summarization
Li et al. Confidence estimation and reputation analysis in aspect extraction
Chou et al. On the Construction of Web NER Model Training Tool based on Distant Supervision
CN111797213A (en) Method for mining financial risk clues from unstructured network information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant