CN110287493A - Risk phrase chunking method, apparatus, electronic equipment and storage medium - Google Patents
Risk phrase chunking method, apparatus, electronic equipment and storage medium
- Publication number: CN110287493A (application CN201910580521.1A)
- Authority: CN (China)
- Prior art keywords: phrase, risk, text, scheduled, phrases
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
An embodiment of the present application provides a risk phrase chunking method, apparatus, electronic equipment and storage medium, relating to the field of text processing. The method comprises: performing phrase chunking on a risk description text using a scheduled phrase chunking algorithm to obtain a first risk phrase list; processing the risk description text using a scheduled word segmentation tool to obtain a second risk phrase list; and then merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases. The method of the embodiment can identify risk phrases rapidly and accurately; the identified phrases are more comprehensive, carry more information, and can well reveal the content of the risk theme.
Description
Technical field
The present application relates to the field of text processing, and in particular to a risk phrase chunking method, apparatus, electronic equipment and storage medium.
Background
Risk information consists of the anticipations and warnings that an enterprise makes about existing or potential factors related to its survival and development, based on external environments such as politics, economy, society and the market, combined with internal environments such as finance and management; it is forward-looking and relevant to decision-making. Risk information can help alleviate information asymmetry and improve the transparency of a company's production and operation; especially in terms of risk early warning, the value of risk information is higher than that of general voluntarily disclosed information.
At present, the key-phrase extraction results of the prior art tend to be single words with short word length, so the effect of risk phrase chunking is poor: it cannot reveal the theme well, loses a large amount of semantic content, and fails to characterize the content of the risk theme.
Summary of the invention
The present application provides a risk phrase chunking method, apparatus, electronic equipment and storage medium, to solve the technical problem that risk phrase chunking of existing risk description texts is ineffective.
In a first aspect, a risk phrase chunking method is provided, the method comprising:
performing phrase chunking on a risk description text using a scheduled phrase chunking algorithm to obtain a first risk phrase list;
processing the risk description text using a scheduled word segmentation tool to obtain a second risk phrase list;
merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
Based on the above technical solution, the risk description text is filtered based on a scheduled filtering rule;
part-of-speech tagging is performed on the filtered risk description text, and words of predetermined parts of speech are screened to form a text to be identified;
word strings whose number of occurrences in the text to be identified is greater than a preset quantity threshold are counted as candidate phrases;
risk phrases are picked out from the candidate phrases using the scheduled phrase chunking algorithm.
Based on the above technical solution, the scheduled filtering rule comprises: filtering stop words according to a scheduled stop-word list.
Screening the words of predetermined parts of speech comprises: screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text.
Based on the above technical solution, picking out risk phrases from the candidate phrases using the scheduled phrase chunking algorithm comprises:
calculating the mutual information value of each candidate phrase using mutual information;
calculating the left entropy and right entropy of each candidate phrase using left-right entropy;
calculating the weight value of each candidate phrase according to a scheduled weight algorithm based on the statistical magnitudes of each candidate phrase, the statistical magnitudes comprising the mutual information value, left entropy, right entropy and the number of occurrences of the candidate phrase in the text to be identified; or the mutual information value, left entropy and right entropy;
selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weight value of each candidate phrase.
Based on the above technical solution, the predetermined selection rule comprises:
sorting the weight values of the candidate phrases from largest to smallest, and choosing the top preset number of candidate phrases as risk phrases; or,
taking a candidate phrase as a risk phrase when its weight value is not less than a preset threshold.
Based on the above technical solution, processing the risk description text using the scheduled word segmentation tool comprises:
combining the words in the risk description text to form phrases to be matched;
performing a matching query on the phrases to be matched in a scheduled lexicon of the word segmentation tool, and determining the phrases that match vocabulary in the scheduled lexicon;
filtering the matched phrases based on a scheduled filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list.
Based on the above technical solution, the scheduled filtering rule comprises at least one of the following: filtering single characters; filtering numbers; filtering phrases whose number of characters is less than a predetermined value.
Based on the above technical solution, before performing phrase chunking on the risk description text using the scheduled phrase chunking algorithm, the method further comprises:
performing identification processing on a predetermined text by paragraph, and extracting the paragraphs containing risk descriptions in the text as the risk description text.
In a second aspect, a risk phrase chunking device is provided, comprising:
a first obtaining module, configured to perform phrase chunking on a risk description text using a scheduled phrase chunking algorithm to obtain a first risk phrase list;
a second obtaining module, configured to process the risk description text using a scheduled word segmentation tool to obtain a second risk phrase list;
a merging module, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
Based on the above technical solution, the first obtaining module comprises:
a first filtering module, configured to filter the risk description text based on a scheduled filtering rule;
a screening module, configured to perform part-of-speech tagging on the filtered risk description text and screen words of predetermined parts of speech to form a text to be identified;
a statistics module, configured to count word strings whose number of occurrences in the text to be identified is greater than a preset quantity threshold as candidate phrases;
a choosing module, configured to pick out risk phrases from the candidate phrases using the scheduled phrase chunking algorithm.
Based on the above technical solution, the second obtaining module comprises:
a combining module, configured to combine the words in the risk description text to form phrases to be matched;
a matching module, configured to perform a matching query on the phrases to be matched in a scheduled lexicon of the word segmentation tool and determine the phrases that match vocabulary in the scheduled lexicon;
a second filtering module, configured to filter the matched phrases based on a scheduled filtering rule and take the filtered phrases as the risk phrases of the second risk phrase list.
In a third aspect, an electronic device is provided, comprising:
a processor; and
a memory configured to store machine-readable instructions which, when executed by the processor, cause the processor to execute the risk phrase chunking method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, the computer storage medium storing computer instructions which, when run on a computer, cause the computer to execute the risk phrase chunking method of the first aspect.
The technical solution provided by the present application has the following beneficial effects:
phrase chunking is performed on the risk description text using a scheduled phrase chunking algorithm to preliminarily extract phrases and obtain a first risk phrase list; the risk description text is then processed using a scheduled word segmentation tool to extend the risk phrases and obtain a second risk phrase list. The first risk phrase list and the second risk phrase list are then merged to determine a risk phrase list comprising multiple risk phrases. When phrases are extracted with a phrase chunking algorithm, the information-representation ability of a phrase is significantly higher than that of a single keyword, and the accuracy of the extracted phrases is high; however, the number of risk phrases obtained by the phrase chunking algorithm alone is small and insufficient to characterize all the information that the raw risk text intends to express. The present application further processes the risk description text with a word segmentation tool, extending the risk phrases and further improving accuracy, and the risk phrases obtained by the two approaches together are more comprehensive. The present application can identify risk phrases rapidly and accurately; the identified phrases are more comprehensive and carry more information, the risk theme can be well revealed, and the technical problem of poor recognition of risk phrases in existing risk description texts is solved.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below.
Fig. 1 is a schematic flowchart of a risk phrase chunking method provided by Embodiment one of the present application;
Fig. 2 is a schematic flowchart of a risk phrase chunking method provided by Embodiment two of the present application;
Fig. 3 is a schematic flowchart of a risk phrase chunking method provided by Embodiment three of the present application;
Fig. 4 is a schematic structural diagram of a risk phrase chunking device provided by Embodiment four of the present application;
Fig. 5 is a schematic structural diagram of the first obtaining module provided by Embodiment five of the present application;
Fig. 6 is a schematic structural diagram of the second obtaining module provided by Embodiment six of the present application;
Fig. 7 is a schematic structural diagram of the electronic device provided by Embodiment seven of the present application.
Specific embodiment
Embodiments of the present application are described in detail below, examples of which are shown in the accompanying drawings, where the same or similar reference numbers throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are only used to explain the present application, and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an" and "the" used herein may also include the plural forms. It should be further understood that the wording "include" used in the description of the present application refers to the presence of features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. In addition, "connection" or "coupling" as used herein may include wireless connection or wireless coupling. The wording "and/or" used herein includes all or any unit and all combinations of one or more of the associated listed items.
To make the purposes, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the drawings.
First, several terms involved in the present application are introduced and explained:
Mutual information (Mutual Information) is a useful information measure in information theory; it refers to the correlation between two event sets. Mutual information is a common method in computational-linguistics model analysis: it measures the mutual dependence between two objects, and in filtering problems it is used to measure the discrimination of a feature for a topic. The definition of mutual information is similar to that of cross entropy. Mutual information was originally a concept in information theory used to express the relationship between pieces of information; it is a measure of the statistical correlation of two random variables. Feature extraction using mutual information theory is based on the assumption that an entry with a high frequency of occurrence in a particular category but a relatively low frequency in other categories has larger mutual information with that category. In general, mutual information is used as the measure between a feature word and a category: if a feature word belongs to the category, their mutual information is largest.
Left-right entropy is an important statistical feature of a pattern, but computing the left and right entropy of massive word strings based on a large-scale corpus involves reading a large number of unrelated characters. The larger the left and right entropy, the richer the words surrounding the string, meaning the greater its degree of freedom and the greater the possibility that it forms an independent word.
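The two measures above can be illustrated with a minimal sketch; a toy English corpus stands in here for the Chinese annual-report text, and all names are illustrative:

```python
import math
from collections import Counter

def pmi(tokens, a, b):
    """Mutual information of candidate phrase t = a+b: log( p(t) / (p(a) * p(b)) )."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    p_t = bigrams[(a, b)] / sum(bigrams.values())
    p_a = unigrams[a] / len(tokens)
    p_b = unigrams[b] / len(tokens)
    return math.log(p_t / (p_a * p_b))

def side_entropy(tokens, a, b, side):
    """Left ('L') or right ('R') neighbour entropy of the candidate phrase (a, b)."""
    neigh = Counter()
    for i in range(len(tokens) - 1):
        if tokens[i] == a and tokens[i + 1] == b:
            if side == "L" and i > 0:
                neigh[tokens[i - 1]] += 1
            elif side == "R" and i + 2 < len(tokens):
                neigh[tokens[i + 2]] += 1
    total = sum(neigh.values())
    return -sum(c / total * math.log(c / total) for c in neigh.values()) if total else 0.0

corpus = ("the aging growth risk and the aging growth trend and "
          "one aging growth issue").split()
mi = pmi(corpus, "aging", "growth")                # "aging" and "growth" always co-occur
le = side_entropy(corpus, "aging", "growth", "L")  # several distinct left neighbours
```

A candidate with high mutual information and high entropy on both sides is the kind of string the method treats as an independent phrase.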
Existing identification techniques are not particularly suited to the risk phrase chunking and extraction of annual-report risk description texts. Topic-word extraction results tend to be single words with short word length, which cannot characterize the risk theme well and also lose a large amount of semantic content. The information-representation ability of a phrase is significantly higher than that of a single keyword: for example, "aging growth" expresses a richer meaning than the two words "aging" and "growth" separately, and "managerial talent" conveys more information than the two words "management" and "talent". Moreover, considering that annual-report risk descriptions are relatively short, existing key-phrase extraction algorithms can identify only a limited set of phrases; the importance of some important words in annual-report risk descriptions, such as early-warning vocabulary like "growth", "decline" and "shortage", is lowered, and the extracted phrases are mostly composed of noun vocabulary, so the overall recognition effect is poor and cannot satisfy the demand for risk phrase chunking of annual-report risk description texts.
The risk phrase chunking method, apparatus, electronic equipment and storage medium provided by the present application are intended to solve the above technical problems of the prior art.
The technical solution of the present application, and how it solves the above technical problems, are described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application are described below with reference to the drawings.
Embodiment one
A risk phrase chunking method is provided in the embodiment of the present application; referring to Fig. 1, the method comprises:
S100: performing phrase chunking on a risk description text using a scheduled phrase chunking algorithm to obtain a first risk phrase list.
S200: processing the risk description text using a scheduled word segmentation tool to obtain a second risk phrase list.
S300: merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases. Specifically, duplicate risk phrases are removed during merging.
In the above embodiment, the number of risk phrases obtained with the scheduled phrase chunking algorithm alone is small and insufficient to characterize the information that the raw risk text intends to express. Processing the risk description text with a word segmentation tool extends the risk phrases with high accuracy, and the risk phrases obtained by the two approaches together are more comprehensive.
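Step S300 (merging with removal of duplicates) can be sketched as follows; the order-preserving de-duplication is an assumption, since the patent only requires that duplicate risk phrases be removed:

```python
def merge_phrase_lists(first, second):
    """Merge two risk phrase lists, removing duplicates while keeping first-seen order."""
    seen = set()
    merged = []
    for phrase in first + second:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged

risk_phrases = merge_phrase_lists(
    ["economic downturn", "aging growth"],           # first list (chunking algorithm)
    ["aging growth", "unemployment rate increase"],  # second list (segmentation tool)
)
# risk_phrases == ["economic downturn", "aging growth", "unemployment rate increase"]
```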
Embodiment two
Referring to Fig. 2, an embodiment of the present invention provides a possible implementation in which, on the basis of Embodiment one, step S100 comprises the following steps:
S101: filtering the risk description text based on a scheduled filtering rule. The filtering rule is: filter stop words according to a scheduled stop-word list, and retain punctuation marks, nouns, verbs, adjectives and degree adverbs.
S102: performing part-of-speech tagging on the filtered risk description text and screening words of predetermined parts of speech to form a text to be identified.
Further, screening the words of predetermined parts of speech comprises: screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text. Not filtering out punctuation marks, nouns, verbs, adjectives and degree adverbs prevents content originally separated by punctuation from being joined after the punctuation is removed, ensuring that the words constituting a risk phrase are left-right adjacent in position, which avoids extracting noise word strings such as "risk company" or "risk country". Meanwhile, retaining all nouns, verbs, adjectives and degree adverbs ensures that the extracted phrases are of higher quality and that words carrying more information, such as "growth", "shortage" and "decline", are kept, rather than only a noun being extracted.
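Steps S101 and S102 can be sketched as follows, assuming the tokens already carry part-of-speech tags; the tag set and stop-word list here are illustrative:

```python
STOPWORDS = {"of", "such", "various"}                # illustrative stop-word list
KEEP_POS = {"NOUN", "VERB", "ADJ", "ADV", "PUNCT"}   # predetermined parts of speech

def to_identify_text(tagged_tokens):
    """Filter stop words, then keep only words of the predetermined parts of speech
    (punctuation is kept so that symbol-separated content is not joined)."""
    filtered = [(w, pos) for w, pos in tagged_tokens if w not in STOPWORDS]
    return [(w, pos) for w, pos in filtered if pos in KEEP_POS]

tagged = [("growth", "VERB"), ("of", "ADP"), ("demand", "NOUN"),
          (",", "PUNCT"), ("sharp", "ADJ"), ("decline", "NOUN")]
text_to_identify = to_identify_text(tagged)
# keeps growth, demand, ",", sharp, decline; drops the stop word "of"
```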
S103: counting word strings whose number of occurrences in the text to be identified is greater than a preset quantity threshold, as candidate phrases. Specifically, the preset quantity threshold can be set according to the actual text, and a word string can be a combination of two words.
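A minimal sketch of S103, counting adjacent two-word strings; the threshold value is illustrative:

```python
from collections import Counter

def candidate_phrases(words, threshold=1):
    """Word strings (adjacent two-word combinations) occurring more than
    `threshold` times in the text to be identified become candidate phrases."""
    counts = Counter(zip(words, words[1:]))
    return [pair for pair, n in counts.items() if n > threshold]

words = ["aging", "growth", "slows", "aging", "growth", "returns"]
candidates = candidate_phrases(words)   # [("aging", "growth")]
```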
S104: picking out risk phrases from the candidate phrases using the scheduled phrase chunking algorithm.
Specifically, in step S104, picking out risk phrases from the candidate phrases using the scheduled phrase chunking algorithm comprises:
S1041: calculating the mutual information value of each candidate phrase using mutual information. The mutual information value indicates the possibility that the candidate words form a phrase and is proportional to that possibility.
S1042: calculating the left entropy and right entropy of each candidate phrase using left-right entropy. The left entropy and right entropy respectively indicate the possibilities of collocation on the left and right of the candidate phrase, and are proportional to the possibility that the candidate forms a phrase.
Further, steps S1041 and S1042 have no required order and can be carried out simultaneously or successively.
S1043: calculating the weight value of each candidate phrase according to a scheduled weight algorithm based on the statistical magnitudes of each candidate phrase; the statistical magnitudes comprise the mutual information value, left entropy, right entropy and the number of occurrences of the candidate phrase in the text to be identified, or the mutual information value, left entropy and right entropy.
S1044: selecting risk phrases from the candidate phrases using a predetermined selection rule according to the weight value of each candidate phrase.
Further, the weight values of the candidate phrases are sorted from largest to smallest, and the top preset number of candidate phrases are chosen as risk phrases. In practical applications, meaningless phrases such as numbers can be removed first, and then the top 20 phrases are selected.
Or, when the weight value of a candidate phrase is not less than a preset threshold, the candidate phrase is taken as a risk phrase. In practical applications, the threshold is set in advance, and candidates below the preset threshold can be excluded.
As an optional implementation, in step S1043, if the statistical magnitudes comprise the mutual information value, left entropy, right entropy and the number of occurrences of the candidate phrase, then the first weight value of each candidate phrase is calculated according to a first predetermined algorithm based on the mutual information value, left entropy, right entropy and number of occurrences of each candidate phrase;
risk phrases are then selected from the candidate phrases using the predetermined selection rule according to the first weight value of each candidate phrase.
If the statistical magnitudes comprise the mutual information value, left entropy, right entropy and the number of occurrences of the candidate phrase, the calculation is as follows:
(1) Calculating the mutual information value of each candidate phrase using mutual information
Denote the two component words of a candidate phrase t by the characters a and b; the calculation formula of mutual information is shown in formula 1.1:
MI(t) = log( p(t) / (p(a) · p(b)) ) (formula 1.1)
where p(t), p(a), p(b) respectively denote the probabilities of t, a and b. The probability estimation can be simplified and calculated in the form of a normalized frequency:
p(t) = n_t / N_P (formula 1.2)
p(a) = n_a / N_T (formula 1.3)
p(b) = n_b / N_T (formula 1.4)
where n_t, n_a, n_b respectively denote the number of occurrences of t, a and b in the corpus, N_P denotes the total number of candidate-phrase occurrences in the corpus set, and N_T is the total number of single-word occurrences in the corpus set.
The higher the mutual information value, the higher the correlation of a and b, and the greater the possibility that a and b form a phrase; conversely, the lower the mutual information value, the lower the correlation between a and b, the greater the possibility that a phrasal boundary lies between a and b, and therefore the smaller the possibility that a and b form a phrase.
(2) Calculating the left entropy and right entropy of each candidate phrase using left-right entropy
Adjacent entropy comprises left adjacent entropy and right adjacent entropy; it essentially uses information entropy to measure the uncertainty of the words adjoining the candidate phrase on the left and on the right. The lower the uncertainty of the adjacent words, the fewer and more stable the words before and after the candidate phrase, so the lower the possibility that it forms a word; conversely, the more numerous, more chaotic and less stable the words before and after the candidate phrase, the higher the possibility that it becomes a word. The calculation formulas of left entropy and right entropy are shown in formula 2.1 and formula 2.2:
E_L(W) = -Σ_{a∈A} p(aW | W) · log p(aW | W) (formula 2.1)
E_R(W) = -Σ_{b∈B} p(Wb | W) · log p(Wb | W) (formula 2.2)
where E_L and E_R respectively denote the left entropy and right entropy of the candidate phrase, W denotes the candidate phrase, W = {w1, w2, ..., wn}; A denotes the set of all words appearing on the left of the candidate phrase, and a denotes a word in set A; B denotes the set of all words appearing on the right of the candidate phrase, and b denotes a word in set B. If the E_L and E_R values of a candidate phrase are larger, the words appearing on its left and right are more chaotic, less stable and more richly collocated, and the candidate phrase is therefore more likely to be a phrase.
(3) Calculating the first weight value of each candidate phrase according to the first predetermined algorithm
The composition boundary of the phrase is judged using left and right adjacent entropy, combined with the number of occurrences TF for phrase extraction; the mutual information, left entropy, right entropy and frequency TF are fitted to obtain the first weight value Score, and a threshold is set for phrase chunking. The Score value is calculated as follows:
Score = (NorFreq + NorMI + NorLE + NorRE) / 4 (formula 3.1)
where NorFreq, NorMI, NorLE and NorRE are respectively the values of the number of occurrences TF, mutual information, left entropy and right entropy after normalization, calculated as follows:
NorFreq_i = (Freq_i - MINFreq) / (MAXFreq - MINFreq) (formula 3.2)
NorMI_i = (MI_i - MINMI) / (MAXMI - MINMI) (formula 3.3)
NorLE_i = (LE_i - MINLE) / (MAXLE - MINLE) (formula 3.4)
NorRE_i = (RE_i - MINRE) / (MAXRE - MINRE) (formula 3.5)
Thus, the higher the first weight value Score, the higher the possibility that the candidate phrase is a phrase; conversely, the lower the Score, the lower the possibility that the candidate phrase forms a phrase.
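Formulas 3.1-3.5 can be sketched as follows: min-max normalization of each statistic over the candidate set, then the equal-weight average; the statistics below are illustrative:

```python
def min_max(values):
    """Min-max normalize a list of statistics into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def scores(freq, mi, le, re):
    """First weight value: Score = (NorFreq + NorMI + NorLE + NorRE) / 4."""
    cols = [min_max(freq), min_max(mi), min_max(le), min_max(re)]
    return [sum(col[i] for col in cols) / 4 for i in range(len(freq))]

# two candidate phrases with illustrative statistics
freq, mi = [10, 2], [3.0, 1.0]
le, re = [1.2, 0.4], [1.1, 0.3]
s = scores(freq, mi, le, re)   # the first candidate normalizes to 1.0, the second to 0.0
```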
As another implementation, in step S1043, if the statistical magnitudes comprise the mutual information value, left entropy and right entropy, then the second weight value of each candidate phrase is calculated according to a second predetermined algorithm based on the mutual information value, left entropy and right entropy of each candidate phrase;
risk phrases are then selected from the candidate phrases using the predetermined selection rule according to the second weight value of each candidate phrase.
If the statistical magnitudes comprise the mutual information value, left entropy and right entropy, the calculation is as follows:
(1) Calculating the mutual information value of each candidate phrase using mutual information
Mutual information can use the following calculation formula:
MI(t) = log( (n_t / N) / ((n_a / N) · (n_b / N)) )
where t denotes the candidate phrase, N denotes the number of all candidate phrases in the set whose length meets the requirement, and n_t, n_a, n_b respectively denote the frequencies with which t, a and b occur in the text. The larger the mutual information value, the tighter the combination between the words and the greater the possibility of forming a phrase; conversely, the smaller the mutual information value, the more uncorrelated the words and the less able they are to constitute a phrase.
(2) Calculating the left entropy and right entropy of each candidate phrase using left-right entropy
Left-right entropy can use the following calculation formulas:
E_L(W) = -Σ_{a∈A} p(aW | W) · log p(aW | W)
E_R(W) = -Σ_{b∈B} p(Wb | W) · log p(Wb | W)
where E_L denotes the left entropy of the word string and E_R its right entropy, W denotes the candidate phrase, A denotes the set of all words appearing on the left of the candidate phrase, a ∈ A; similarly, B denotes the set of all words appearing on the right of the candidate phrase, b ∈ B. The larger the E_L and E_R values of a word string, the more rich and varied the collocating words on its left and right, and the greater the probability that the word string forms a phrase.
(3) Calculating the second weight value of each candidate phrase according to the second predetermined algorithm
The mutual information value, left entropy and right entropy are fitted according to a predetermined algorithm, for example averaged, to obtain the second weight value.
Both of the above implementations calculate the weight value of each candidate phrase, obtaining the first weight value or the second weight value; risk phrases are then selected from the candidate phrases using the predetermined selection rule, comprising:
sorting the first weight values or second weight values of the candidate phrases from largest to smallest, and choosing the top preset number of candidate phrases as risk phrases. Or, taking a candidate phrase as a risk phrase when its first weight value or second weight value is not less than a preset threshold.
Embodiment three
Referring to Fig. 3, an embodiment of the present invention provides a possible implementation in which, on the basis of Embodiment one, step S200 comprises the following steps:
S201: combining the words in the risk description text to form phrases to be matched.
S202: performing a matching query on the phrases to be matched in the scheduled lexicon of the word segmentation tool, and determining the phrases that match vocabulary in the scheduled lexicon.
S203: filtering the matched phrases based on a scheduled filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list. The scheduled filtering rule comprises at least one of the following: filtering single characters; filtering numbers; filtering phrases whose number of characters is less than a predetermined value. Further, the phrases whose number of characters is less than the predetermined value can be everyday expressions shorter than 3 characters, which further improves segmentation efficiency.
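Steps S201-S203 can be sketched as follows; a toy lexicon stands in for the encyclopedia vocabulary described below, and the English phrases and word-count filter are illustrative:

```python
LEXICON = {"economic downturn", "unemployment rate increase", "risk", "7"}

def second_phrase_list(words, min_len=2):
    """Combine adjacent words into phrases to be matched (S201), keep those found
    in the lexicon (S202), then filter single words, numbers and too-short
    phrases (S203)."""
    matched = []
    for size in (2, 3):                      # adjacent 2- and 3-word combinations
        for i in range(len(words) - size + 1):
            phrase = " ".join(words[i:i + size])
            if phrase in LEXICON:
                matched.append(phrase)
    return [p for p in matched
            if len(p.split()) >= min_len and not p.isdigit()]

words = ["economic", "downturn", "and", "unemployment", "rate", "increase"]
phrases = second_phrase_list(words)
# ["economic downturn", "unemployment rate increase"]
```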
Using the phrase identification algorithm of Embodiment 1 alone, the number of phrases obtained by combining words via mutual information is limited, and when the left/right entropy threshold is set high, some informative phrases fail to be identified; for example, vocabulary such as "economic downturn" or "rising unemployment" is not recognized.
Conversely, the dictionary of a common word segmentation tool covers very few risk phrases, and in most cases splits a single risk phrase into several words. In this embodiment, the segmentation tool may be jieba. A massive encyclopedia vocabulary is obtained from a Chinese encyclopedia knowledge graph and stored one entry per line in a txt file, and this txt file then replaces jieba's built-in dictionary as the lexicon. This expands the vocabulary so that the phrases obtained are more comprehensive, preventing missed phrases that would otherwise keep the risk topic from being represented well. At the same time, considering that jieba's built-in segmentation dictionary holds roughly 350,000 entries while the encyclopedia entity dictionary holds around 12,000,000 (about 34 times the original dictionary), directly initializing jieba with the encyclopedia entity dictionary as the segmentation dictionary may crash the program. Furthermore, because jieba uses a character-based hidden Markov model (HMM) to recognize out-of-vocabulary words, that model is disabled here to guarantee segmentation efficiency.
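With the HMM disabled, segmentation becomes purely dictionary-driven: spans found in the enlarged lexicon come out as whole tokens, and out-of-vocabulary spans fall back to single characters. A minimal forward-maximum-matching segmenter illustrating this behaviour (a stdlib approximation, not jieba's actual algorithm):

```python
def fmm_segment(text, lexicon, max_word_len=6):
    """Dictionary-only segmentation: longest lexicon hit wins;
    anything not in the lexicon is emitted character by character,
    which is what disabling HMM-based OOV recognition implies."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon:   # longest dictionary match at position i
                tokens.append(text[i:j])
                i = j
                break
        else:                          # out-of-vocabulary: single character
            tokens.append(text[i])
            i += 1
    return tokens

lexicon = {"应收账款", "坏账", "准备"}  # stand-in for the encyclopedia lexicon
print(fmm_segment("应收账款坏账准备", lexicon))
```

The trade-off is the one stated above: a much larger lexicon catches more multi-character risk phrases in one piece, at the cost of memory when loading the dictionary.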
Baidu Baike vocabulary is used for the recognition. Chinese encyclopedia knowledge graph data can be obtained from Fudan University's open Chinese knowledge graph website, containing 9,000,000+ encyclopedia entities and 66,000,000+ triples. The general-purpose Chinese encyclopedia knowledge graph provided by Fudan University covers entries from Chinese encyclopedia websites such as Baidu Baike, Hudong Baike and Chinese Wikipedia, including concrete things, celebrities, abstract concepts, literary works, hot events, technical terms, and combinations of characters, words or specific topics. From it, more than 12,000,000 encyclopedia entity words are obtained in total, covering almost all fields with high accuracy.
When a phrase to be matched belongs to the Baidu Baike vocabulary, that phrase is taken as a risk phrase, enlarging the risk phrase vocabulary.
An embodiment of the present invention provides another possible implementation: on the basis of Embodiment 1, before step S100, the method further includes the following step:
performing recognition processing on a predetermined text by paragraph, and extracting the paragraphs containing risk descriptions in the text as the risk description text.
Specifically, take the full text of a listed company's annual report as an example.
First, the full annual reports of all A-share listed companies are obtained, and the risk description information is extracted from them. Owing to the way annual reports are written, risk description information mainly appears in the form of short texts: each risk category (e.g. "risk of growing accounts receivable and shrinking cash flow" in the figure) corresponds to a brief, concise description of the specific risk. Second, the risk description information is split by paragraph, so that each listed company corresponds to several risk description texts. Finally, data cleaning is performed: annual-report risk descriptions such as "commitment to avoid horizontal competition" or "commitment on restricted share sales", which carry almost no risk-indicating information, are removed, along with illegal symbols, obviously non-textual content and mojibake.
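The preprocessing described above can be sketched as follows; the keyword trigger and the cleaning regex are illustrative assumptions, not the patent's exact cleaning rules:

```python
import re

def extract_risk_paragraphs(report_text, keyword="风险"):
    """Split an annual report into paragraphs, keep those describing
    risks, and strip illegal symbols / mojibake from each."""
    paragraphs = [p.strip() for p in report_text.split("\n") if p.strip()]
    risk_paras = [p for p in paragraphs if keyword in p]
    # Keep CJK characters, ASCII alphanumerics and common punctuation;
    # drop everything else (replacement characters, stray symbols, ...).
    return [re.sub(r"[^\u4e00-\u9fffA-Za-z0-9，。；：、%（）\s.,;:()-]", "", p)
            for p in risk_paras]

report = "公司概况\n应收账款金额较大、账龄增加的风险�\n避免同业竞争承诺"
print(extract_risk_paragraphs(report))
```

Note that a paragraph such as "避免同业竞争承诺" is dropped here simply because it lacks the trigger keyword; the patent's cleaning step removes such boilerplate commitments explicitly.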
On the basis of Embodiments 1 to 4, one risk description text was randomly selected from the 2016 annual report of Qingshuiyuan Science and Technology Co., Ltd. as the experiment text, and a risk phrase extraction experiment was carried out on it. The extraction results of Embodiments 1 to 4 were compared with those of the existing HanLP phrase extraction algorithm, as shown in Table 1.
Experiment text: "Risk of a large accounts receivable balance and increasing aging: as of June 30, 2016, the company's accounts receivable balance was nearly 219 million yuan, part of it aging, with the accrued bad-debt provision increasing accordingly and affecting the company's operating performance. To address the excessive accounts receivable balance, the company strengthened the responsibility of sales staff for collection and incorporated the recovery status of receivables into performance feedback, tying it directly to compensation; the company also set up a dedicated collection team to fully recover the amounts owed by key large debtors, reinforced the risk assessment of business units' receivables, and took appropriate legal measures against units with long-overdue payments and little business volume in recent years."
Table 1 Comparison of experiment results
Clearly, the risk phrases extracted by the recognition methods of Embodiments 1 to 4 are more comprehensive and better characterize the information the original risk text intends to convey.
Embodiment 4
Fig. 4 shows a risk phrase identification apparatus also provided by an embodiment of the present invention. As shown in Fig. 4, the risk phrase identification apparatus 1 comprises:
a first acquisition module 11, configured to perform phrase identification on the risk description text using a predetermined phrase identification algorithm to obtain a first risk phrase list;
a second acquisition module 12, configured to process the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
a merging module 13, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
The risk phrases obtained by the first acquisition module 11 alone are relatively few and insufficient to characterize the information the original risk text intends to express. The second acquisition module 12 therefore processes the risk description text again to extend the risk phrases. The merging module 13 then merges the first and second risk phrase lists to determine the final risk phrase list; combining the two approaches yields risk phrases that are both accurate and more comprehensive.
In addition, the risk phrase identification apparatus of the present invention may further include a text acquisition module, configured to process a predetermined text by paragraph and extract the paragraphs containing risk descriptions to obtain the risk description text.
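The merge performed by module 13 can be sketched as a simple order-preserving union; first-seen order is an illustrative assumption, since the patent does not fix a particular merge order:

```python
def merge_phrase_lists(first, second):
    """Union the two risk phrase lists, removing duplicates while
    preserving the order in which phrases were first seen."""
    seen, merged = set(), []
    for phrase in first + second:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged

first = ["经济下行", "应收账款"]    # from the phrase identification algorithm
second = ["应收账款", "坏账准备"]  # from the segmentation-tool lexicon match
print(merge_phrase_lists(first, second))
```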
Embodiment 5
Fig. 5 shows the details of the first acquisition module 11 also provided by an embodiment of the present invention. As shown in Fig. 5, the first acquisition module 11 includes:
a first filtering module 111, configured to filter the risk description text based on a predetermined filtering rule; in practice, the first filtering module 111 is specifically configured to filter stop words according to a predetermined stop-word list;
a screening module 112, configured to perform part-of-speech tagging on the filtered risk description text and screen words of predetermined parts of speech to form the text to be identified; in practice, the screening module 112 is specifically configured to screen nouns, verbs, adjectives and degree adverbs from the filtered risk description text;
a statistics module 113, configured to count the word strings whose occurrence count in the text to be identified exceeds a preset threshold, as candidate phrases;
a selection module 114, configured to pick risk phrases from the candidate phrases using the predetermined phrase identification algorithm.
In practice, the selection module 114 is specifically configured to: compute the mutual information value of each candidate phrase using mutual information; compute the left entropy and right entropy of each candidate phrase using left/right entropy; compute a weight value for each candidate phrase from its statistics according to a predetermined weighting algorithm; and select risk phrases from the candidate phrases according to their weight values using a predetermined selection rule. The statistics include the mutual information value, left entropy, right entropy and the candidate phrase's occurrence count; or the mutual information value, left entropy and right entropy. The calculation method of Embodiment 2 is specifically applicable.
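The scoring inside module 114 can be sketched as follows. Pointwise mutual information measures the internal cohesion of a candidate phrase, the left/right neighbour entropies measure how freely its boundaries vary, and a weighted sum combines them. The weights and the toy counts below are illustrative assumptions, not the patent's predetermined weighting algorithm:

```python
import math
from collections import Counter

def pmi(phrase_count, left_count, right_count, total):
    """Pointwise mutual information of a two-part candidate phrase."""
    p_xy = phrase_count / total
    p_x, p_y = left_count / total, right_count / total
    return math.log(p_xy / (p_x * p_y))

def entropy(neighbours):
    """Shannon entropy of the distribution of neighbouring words."""
    total = sum(neighbours.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbours.values())

def score(phrase_count, left_count, right_count, total,
          left_nb, right_nb, w=(0.5, 0.25, 0.25)):
    """Weighted sum of PMI and left/right entropy (weights assumed)."""
    return (w[0] * pmi(phrase_count, left_count, right_count, total)
            + w[1] * entropy(left_nb) + w[2] * entropy(right_nb))

# Toy corpus: "应收账款" seen 8 times in 1000 tokens; "应收" 10, "账款" 9.
left_nb = Counter({"公司": 4, "清理": 2, "对": 2})   # words left of the phrase
right_nb = Counter({"余额": 5, "回收": 3})           # words right of the phrase
print(round(score(8, 10, 9, 1000, left_nb, right_nb), 3))
```

A high PMI with high entropy on both sides marks a cohesive, freely combinable unit, which is exactly the profile of a good risk phrase; the selection rule of claim 5 then keeps the top-weighted candidates or those above a threshold.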
Embodiment 6
Fig. 6 shows the details of the second acquisition module 12 also provided by an embodiment of the present invention. As shown in Fig. 6, the second acquisition module 12 includes:
a combination module 121, configured to combine the words in the risk description text into phrases to be matched;
a matching module 122, configured to query the phrases to be matched against the predetermined lexicon of the word segmentation tool and determine the phrases that match vocabulary in the predetermined lexicon;
a second filtering module 123, configured to filter the matched phrases based on a predetermined filtering rule and take the filtered phrases as the risk phrases of the second risk phrase list. In practice, the second filtering module 123 is further configured to filter at least one of the following: single characters; numbers; phrases whose character count is below a predetermined value.
Embodiment 7
An embodiment of the present invention further provides an electronic device. As shown in Fig. 7, the electronic device 4000 includes:
a processor 4001; and
a memory 4003 configured to store machine-readable instructions which, when executed by the processor, cause the processor to perform the risk phrase identification method of the preceding method embodiments.
The processor 4001 is connected to the memory 4003, for example via a bus 4002. Further, the electronic device 4000 may also include a transceiver 4004. Note that in practical applications the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
In the embodiments of the present application, the processor 4001 is used to implement the first acquisition module 11, the second acquisition module 12 and the merging module 13 shown in Fig. 4. The transceiver 4004 includes a receiver and a transmitter and is used, in the embodiments of the present application, for acquiring the risk description text of the text acquisition module.
The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, transistor logic, a hardware component, or any combination thereof, and may implement or execute the various illustrative logic blocks, modules and circuits described in this disclosure. The processor 4001 may also be a combination realizing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path for transferring information between the above components. The bus 4002 may be a PCI bus, an EISA bus, etc., and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 7, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
The memory 4003 is used to store the application program code for executing the present scheme, and the execution is controlled by the processor 4001. The processor 4001 executes the application program code stored in the memory 4003 to realize the actions of the risk phrase identification apparatus provided by the embodiment shown in Fig. 4.
An embodiment of the present invention further provides a computer-readable storage medium for storing computer instructions which, when run on a computer, enable the computer to execute the corresponding contents of the preceding method embodiments.
In addition, the risk phrase identification method of the present invention mainly serves to improve the accuracy of phrase extraction from annual-report risk description text, so as to effectively mine the content contained in it and raise the utilization of the textual content of annual reports. Beyond its use in listed-company annual-report database projects, the method can also broaden the research ideas of enterprise risk early warning and make up for the tendency of existing research to focus on annual-report financial data while ignoring the textual content of annual reports. Risk phrase identification serves two important practical functions:
(1) Providing technical support for follow-up research on listed-company annual-report database projects. In recent years annual reports have grown ever longer, yet the three major financial statements, as the main body of the report, have hardly grown; the financial information they can disclose has reached its ceiling, while the textual content outside the statements grows richer and richer, with various supplementary notes and explanations disclosing more and more information. How to mine valuable information from large volumes of risk description text is a major issue in listed-company annual-report database projects, and it largely determines the accuracy and comprehensiveness of subsequent analysis and prediction. The risk phrase identification proposed by the present invention lays the foundation for the follow-up research of such projects.
(2) Providing technical support for risk early-warning research. The recognition method of the present invention can identify and extract high-quality, accurate and information-rich risk phrases. As the decision relevance of annual-report data is increasingly recognized, the textual information in annual reports is also gradually being taken seriously. The present invention supplies technical support for scholars and enterprises mining annual-report statements, helps remedy gaps in usable risk early-warning research, and improves the comprehensiveness of enterprise risk early warning.
It should be understood that, although the steps in the flowcharts of the drawings are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with at least part of other steps or of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (13)
1. A risk phrase identification method, characterized by comprising:
performing phrase identification on a risk description text using a predetermined phrase identification algorithm to obtain a first risk phrase list;
processing the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
merging the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
2. The risk phrase identification method according to claim 1, characterized in that performing phrase identification on the risk description text using the predetermined phrase identification algorithm comprises:
filtering the risk description text based on a predetermined filtering rule;
performing part-of-speech tagging on the filtered risk description text and screening words of predetermined parts of speech to form a text to be identified;
counting the word strings whose occurrence count in the text to be identified exceeds a preset threshold, as candidate phrases;
picking risk phrases from the candidate phrases using the predetermined phrase identification algorithm.
3. The risk phrase identification method according to claim 2, characterized in that the predetermined filtering rule comprises:
filtering stop words according to a predetermined stop-word list;
and the screening of words of predetermined parts of speech comprises:
screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text.
4. The risk phrase identification method according to claim 2, characterized in that picking risk phrases from the candidate phrases using the predetermined phrase identification algorithm comprises:
computing the mutual information value of each candidate phrase using mutual information;
computing the left entropy and right entropy of each candidate phrase using left/right entropy;
computing a weight value for each candidate phrase from its statistics according to a predetermined weighting algorithm, the statistics comprising the mutual information value, the left entropy, the right entropy and the occurrence count of the candidate phrase in the text to be identified; or the mutual information value, the left entropy and the right entropy;
selecting risk phrases from the candidate phrases according to their weight values using a predetermined selection rule.
5. The risk phrase identification method according to claim 4, characterized in that the predetermined selection rule comprises:
sorting the candidate phrases by weight value in descending order and choosing the top preset number of candidate phrases as risk phrases; or,
taking a candidate phrase as a risk phrase when its weight value is not less than a preset threshold.
6. The risk phrase identification method according to claim 1, characterized in that processing the risk description text using the predetermined word segmentation tool comprises:
combining the words in the risk description text into phrases to be matched;
querying the phrases to be matched against the predetermined lexicon of the word segmentation tool and determining the phrases that match vocabulary in the predetermined lexicon;
filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list.
7. The risk phrase identification method according to claim 6, characterized in that the predetermined filtering rule comprises at least one of the following:
filtering single characters; filtering numbers; filtering phrases whose character count is below a predetermined value.
8. The risk phrase identification method according to claim 1, characterized in that, before performing phrase identification on the risk description text using the predetermined phrase identification algorithm, the method further comprises:
performing recognition processing on a predetermined text by paragraph, and extracting the paragraphs containing risk descriptions in the text as the risk description text.
9. A risk phrase identification apparatus, characterized by comprising:
a first acquisition module, configured to perform phrase identification on a risk description text using a predetermined phrase identification algorithm to obtain a first risk phrase list;
a second acquisition module, configured to process the risk description text using a predetermined word segmentation tool to obtain a second risk phrase list;
a merging module, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising multiple risk phrases.
10. The risk phrase identification apparatus according to claim 9, characterized in that the first acquisition module comprises:
a first filtering module, configured to filter the risk description text based on a predetermined filtering rule;
a screening module, configured to perform part-of-speech tagging on the filtered risk description text and screen words of predetermined parts of speech to form a text to be identified;
a statistics module, configured to count the word strings whose occurrence count in the text to be identified exceeds a preset threshold, as candidate phrases;
a selection module, configured to pick risk phrases from the candidate phrases using the predetermined phrase identification algorithm.
11. The risk phrase identification apparatus according to claim 9, characterized in that the second acquisition module comprises:
a combination module, configured to combine the words in the risk description text into phrases to be matched;
a matching module, configured to query the phrases to be matched against the predetermined lexicon of the word segmentation tool and determine the phrases that match vocabulary in the predetermined lexicon;
a second filtering module, configured to filter the matched phrases based on a predetermined filtering rule, taking the filtered phrases as the risk phrases of the second risk phrase list.
12. An electronic device, characterized by comprising:
a processor; and
a memory configured to store machine-readable instructions which, when executed by the processor, cause the processor to perform the risk phrase identification method of any one of claims 1-8.
13. A computer-readable storage medium, characterized in that the computer storage medium is used to store computer instructions which, when run on a computer, enable the computer to perform the risk phrase identification method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910580521.1A CN110287493B (en) | 2019-06-28 | 2019-06-28 | Risk phrase identification method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287493A true CN110287493A (en) | 2019-09-27 |
CN110287493B CN110287493B (en) | 2023-04-18 |
Family
ID=68020086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910580521.1A Active CN110287493B (en) | 2019-06-28 | 2019-06-28 | Risk phrase identification method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287493B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111222316A (en) * | 2020-01-03 | 2020-06-02 | 北京小米移动软件有限公司 | Text detection method, device and storage medium |
CN112633009A (en) * | 2020-12-29 | 2021-04-09 | 扬州大学 | Identification method for random combination uploading field |
CN113128209A (en) * | 2021-04-22 | 2021-07-16 | 百度在线网络技术(北京)有限公司 | Method and device for generating word stock |
Citations (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101361064A (en) * | 2005-12-16 | 2009-02-04 | Emil有限公司 | A text editing apparatus and method |
US20110191098A1 (en) * | 2010-02-01 | 2011-08-04 | Stratify, Inc. | Phrase-based document clustering with automatic phrase extraction |
CN104142998A (en) * | 2014-08-01 | 2014-11-12 | 中国传媒大学 | Text classification method |
CN104484377A (en) * | 2014-12-09 | 2015-04-01 | 百度在线网络技术(北京)有限公司 | Generating method and device of substitute dictionaries |
US20150195406A1 (en) * | 2014-01-08 | 2015-07-09 | Callminer, Inc. | Real-time conversational analytics facility |
US20150207819A1 (en) * | 2013-01-23 | 2015-07-23 | The Privacy Factor, LLC | Methods for analyzing application privacy and devices thereof |
CN104995650A (en) * | 2011-12-27 | 2015-10-21 | 汤姆森路透社全球资源公司 | Methods and systems for generating composite index using social media sourced data and sentiment analysis |
CN105956740A (en) * | 2016-04-19 | 2016-09-21 | 北京深度时代科技有限公司 | Semantic risk calculating method based on text logical characteristic |
CN106170553A (en) * | 2013-12-13 | 2016-11-30 | 现代治疗公司 | Nucleic acid molecules modified and application thereof |
CN106649597A (en) * | 2016-11-22 | 2017-05-10 | 浙江大学 | Method for automatically establishing back-of-book indexes of book based on book contents |
CN107085584A (en) * | 2016-11-09 | 2017-08-22 | 中国长城科技集团股份有限公司 | A kind of cloud document management method, system and service end based on content |
CN107153658A (en) * | 2016-03-03 | 2017-09-12 | 常州普适信息科技有限公司 | A kind of public sentiment hot word based on weighted keyword algorithm finds method |
CN107405325A (en) * | 2015-02-06 | 2017-11-28 | 英特塞普特医药品公司 | Pharmaceutical composition for combination treatment |
CN107688594A (en) * | 2017-05-05 | 2018-02-13 | 平安科技(深圳)有限公司 | The identifying system and method for risk case based on social information |
CN108021558A (en) * | 2017-12-27 | 2018-05-11 | 北京金山安全软件有限公司 | Keyword recognition method and device, electronic equipment and storage medium |
CN108228556A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | Key phrase extracting method and device |
CN108447534A (en) * | 2018-05-18 | 2018-08-24 | 灵玖中科软件(北京)有限公司 | A kind of electronic health record data quality management method based on NLP |
CN108509474A (en) * | 2017-09-15 | 2018-09-07 | 腾讯科技(深圳)有限公司 | Search for the synonym extended method and device of information |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Recognition methods, electronic device and the readable storage medium storing program for executing of text key message |
US20180309581A1 (en) * | 2017-04-19 | 2018-10-25 | International Business Machines Corporation | Decentralized biometric signing of digital contracts |
CN108764485A (en) * | 2011-01-06 | 2018-11-06 | 电子湾有限公司 | The interest-degree calculated in recommendation tools is recommended |
CN108845982A (en) * | 2017-12-08 | 2018-11-20 | 昆明理工大学 | A kind of Chinese word cutting method of word-based linked character |
CN108920454A (en) * | 2018-06-13 | 2018-11-30 | 北京信息科技大学 | A kind of theme phrase extraction method |
CN109033224A (en) * | 2018-06-29 | 2018-12-18 | 阿里巴巴集团控股有限公司 | A kind of Risk Text recognition methods and device |
CN109145313A (en) * | 2018-07-18 | 2019-01-04 | 广州杰赛科技股份有限公司 | Interpretation method, device and the storage medium of sentence |
CN109299228A (en) * | 2018-11-27 | 2019-02-01 | 阿里巴巴集团控股有限公司 | The text Risk Forecast Method and device that computer executes |
CN109460499A (en) * | 2018-10-16 | 2019-03-12 | 青岛聚看云科技有限公司 | Target search word generation method and device, electronic equipment, storage medium |
CN109871426A (en) * | 2018-12-18 | 2019-06-11 | 国网浙江桐乡市供电有限公司 | A kind of monitoring recognition methods of confidential data |
CN109872162A (en) * | 2018-11-21 | 2019-06-11 | 阿里巴巴集团控股有限公司 | A kind of air control classifying identification method and system handling customer complaint information |
CN109918921A (en) * | 2018-12-18 | 2019-06-21 | 国网浙江桐乡市供电有限公司 | A kind of network communication data concerning security matters detection method |
Patent Citations (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101361064A (en) * | 2005-12-16 | 2009-02-04 | Emil有限公司 | A text editing apparatus and method |
US20110191098A1 (en) * | 2010-02-01 | 2011-08-04 | Stratify, Inc. | Phrase-based document clustering with automatic phrase extraction |
US20130185060A1 (en) * | 2010-02-01 | 2013-07-18 | Stratify, Inc. | Phrase based document clustering with automatic phrase extraction |
CN108764485A (en) * | 2011-01-06 | 2018-11-06 | 电子湾有限公司 | The interest-degree calculated in recommendation tools is recommended |
CN104995650A (en) * | 2011-12-27 | 2015-10-21 | 汤姆森路透社全球资源公司 | Methods and systems for generating composite index using social media sourced data and sentiment analysis |
US20160142445A1 (en) * | 2013-01-23 | 2016-05-19 | The Privacy Factor, LLC | Methods and devices for analyzing user privacy based on a user's online presence |
US20170111395A1 (en) * | 2013-01-23 | 2017-04-20 | The Privacy Factor, LLC | Generating a privacy rating for an application or website |
US20150207819A1 (en) * | 2013-01-23 | 2015-07-23 | The Privacy Factor, LLC | Methods for analyzing application privacy and devices thereof |
CN106170553A (en) * | 2013-12-13 | 2016-11-30 | 现代治疗公司 | Nucleic acid molecules modified and application thereof |
US20150195406A1 (en) * | 2014-01-08 | 2015-07-09 | Callminer, Inc. | Real-time conversational analytics facility |
US20170013127A1 (en) * | 2014-01-08 | 2017-01-12 | Callminer, Inc. | Real-time in-stream compliance monitoring facility |
CN104142998A (en) * | 2014-08-01 | 2014-11-12 | 中国传媒大学 | Text classification method |
CN104484377A (en) * | 2014-12-09 | 2015-04-01 | 百度在线网络技术(北京)有限公司 | Generating method and device of substitute dictionaries |
CN107405325A (en) * | 2015-02-06 | 2017-11-28 | 英特塞普特医药品公司 | Pharmaceutical composition for combination treatment |
CN107153658A (en) * | 2016-03-03 | 2017-09-12 | 常州普适信息科技有限公司 | A kind of public sentiment hot word based on weighted keyword algorithm finds method |
CN105956740A (en) * | 2016-04-19 | 2016-09-21 | 北京深度时代科技有限公司 | Semantic risk calculating method based on text logical characteristic |
CN107085584A (en) * | 2016-11-09 | 2017-08-22 | 中国长城科技集团股份有限公司 | A kind of cloud document management method, system and service end based on content |
CN106649597A (en) * | 2016-11-22 | 2017-05-10 | 浙江大学 | Method for automatically establishing back-of-book indexes of book based on book contents |
CN108228556A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | Key phrase extracting method and device |
US20180309581A1 (en) * | 2017-04-19 | 2018-10-25 | International Business Machines Corporation | Decentralized biometric signing of digital contracts |
CN107688594A (en) * | 2017-05-05 | 2018-02-13 | 平安科技(深圳)有限公司 | The identifying system and method for risk case based on social information |
CN108509474A (en) * | 2017-09-15 | 2018-09-07 | 腾讯科技(深圳)有限公司 | Search for the synonym extended method and device of information |
CN108845982A (en) * | 2017-12-08 | 2018-11-20 | 昆明理工大学 | A kind of Chinese word cutting method of word-based linked character |
CN108021558A (en) * | 2017-12-27 | 2018-05-11 | 北京金山安全软件有限公司 | Keyword recognition method and device, electronic equipment and storage medium |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Recognition methods, electronic device and the readable storage medium storing program for executing of text key message |
CN108447534A (en) * | 2018-05-18 | 2018-08-24 | 灵玖中科软件(北京)有限公司 | A kind of electronic health record data quality management method based on NLP |
CN108920454A (en) * | 2018-06-13 | 2018-11-30 | 北京信息科技大学 | A kind of theme phrase extraction method |
CN109033224A (en) * | 2018-06-29 | 2018-12-18 | 阿里巴巴集团控股有限公司 | A kind of Risk Text recognition methods and device |
CN109145313A (en) * | 2018-07-18 | 2019-01-04 | 广州杰赛科技股份有限公司 | Interpretation method, device and the storage medium of sentence |
CN109460499A (en) * | 2018-10-16 | 2019-03-12 | 青岛聚看云科技有限公司 | Target search word generation method and device, electronic equipment, storage medium |
CN109872162A (en) * | 2018-11-21 | 2019-06-11 | 阿里巴巴集团控股有限公司 | Risk-control classification and recognition method and system for processing customer complaint information |
CN109299228A (en) * | 2018-11-27 | 2019-02-01 | 阿里巴巴集团控股有限公司 | Computer-executed text risk prediction method and device |
CN109871426A (en) * | 2018-12-18 | 2019-06-11 | 国网浙江桐乡市供电有限公司 | Monitoring and recognition method for confidential data |
CN109918921A (en) * | 2018-12-18 | 2019-06-21 | 国网浙江桐乡市供电有限公司 | Method for detecting classified information in network communication data |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111222316A (en) * | 2020-01-03 | 2020-06-02 | 北京小米移动软件有限公司 | Text detection method, device and storage medium |
CN111222316B (en) * | 2020-01-03 | 2023-08-29 | 北京小米移动软件有限公司 | Text detection method, device and storage medium |
CN112633009A (en) * | 2020-12-29 | 2021-04-09 | 扬州大学 | Identification method for random combination uploading field |
CN113128209A (en) * | 2021-04-22 | 2021-07-16 | 百度在线网络技术(北京)有限公司 | Method and device for generating word stock |
CN113128209B (en) * | 2021-04-22 | 2023-11-24 | 百度在线网络技术(北京)有限公司 | Method and device for generating word stock |
Also Published As
Publication number | Publication date |
---|---|
CN110287493B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9519634B2 (en) | Systems and methods for determining lexical associations among words in a corpus | |
Zhang et al. | Extracting implicit features in online customer reviews for opinion mining | |
US8108413B2 (en) | Method and apparatus for automatically discovering features in free form heterogeneous data | |
CN110134952A (en) | Erroneous text rejection method, device and storage medium | |
CN109471933A (en) | Text summary generation method, storage medium and server | |
CN110287493A (en) | Risk phrase chunking method, apparatus, electronic equipment and storage medium | |
CN110598066B (en) | Bank full-name rapid matching method based on word vector expression and cosine similarity | |
KR20160149050A (en) | Apparatus and method for selecting a pure play company by using text mining | |
CN115017903A (en) | Method and system for extracting key phrases by combining document hierarchical structure with global local information | |
Bhakuni et al. | Evolution and evaluation: Sarcasm analysis for twitter data using sentiment analysis | |
CN110968661A (en) | Event extraction method and system, computer readable storage medium and electronic device | |
CN103116752A (en) | Picture auditing method and system | |
Audichya et al. | Stanza type identification using systematization of versification system of Hindi poetry | |
CN110347806A (en) | Original text discriminating method, device, equipment and computer readable storage medium | |
CN112632964B (en) | NLP-based industry policy information processing method, device, equipment and medium | |
Radygin et al. | Application of text mining technologies in Russian language for solving the problems of primary financial monitoring | |
CN116050397B (en) | Method, system, equipment and storage medium for generating long text abstract | |
CN112465262A (en) | Event prediction processing method, device, equipment and storage medium | |
CN116561320A (en) | Method, device, equipment and medium for classifying automobile comments | |
CN107590163B (en) | Method, device and system for text feature selection | |
CN112685548B (en) | Question answering method, electronic device and storage device | |
Rao et al. | Model for improving relevant feature extraction for opinion summarization | |
Li et al. | Confidence estimation and reputation analysis in aspect extraction | |
Chou et al. | On the Construction of Web NER Model Training Tool based on Distant Supervision | |
CN111797213A (en) | Method for mining financial risk clues from unstructured network information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||