CN107193959A - Plain-text-oriented business entity classification method - Google Patents
Plain-text-oriented business entity classification method
- Publication number
- CN107193959A
- Authority
- CN
- China
- Prior art keywords
- business entity
- entity
- word
- training
- mark
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The present invention discloses a plain-text-oriented business entity classification method comprising the following steps: S1, annotate the business entities in the collected plain text data to form the training set of the business entity recognition module, and label those business entities with categories according to their industry to form the training sample set of the business entity classification module; S2, train a business entity recognition model with a conditional random field model; S3, build semantic vectors for the text data of the original training set; S4, train the business entity classification model on the category-labeled training set data after semantic vectorization; S5, use the business entity classification model to classify the business entities in the text to be predicted. Because the method uses the resulting semantic vectors as entity features, it reduces the dependence on hand-crafted features and external data, which helps ensure generality and robustness.
Description
Technical field
The invention belongs to the technical field of named entity recognition and fine-grained entity classification, and specifically relates to a plain-text-oriented business entity classification method.
Background
In recent years, with the rise of "Internet finance", more and more corporate decision makers urgently need more advanced information processing methods to extract and analyze massive Internet data in order to make better decisions. Within this data, plain text such as court documents and news and public-opinion reports has become a primary source from which enterprises obtain high-value information.
Named entity recognition is the basis for tasks such as entity semantic analysis and entity relation extraction. Current mainstream named entity recognition techniques simply divide entities into person names, place names, organization names and so on, which leaves entity types lacking in semantics. Moreover, entity classification depends heavily on hand-crafted features and external data, so its generality and robustness cannot be guaranteed.
Summary of the invention
The present invention addresses two problems. Current mainstream named entity recognition techniques simply divide entities into person names, place names, organization names and so on, so entity types lack semantics. In addition, entity classification depends heavily on hand-crafted features and external data, so generality and robustness cannot be guaranteed. To solve these problems, the present invention proposes a plain-text-oriented business entity classification method that divides business entities at a finer granularity and builds features from the semantics of the text itself, finally yielding the category of each business entity. Here, plain text means text containing business activity information, for example news text and court documents.
As shown in Fig. 1, the plain-text-oriented business entity classification method disclosed by the present invention comprises the following steps:
S1, annotate the business entities in the collected plain text data and use the annotated data as the training set of the business entity recognition module; label the business entities in the collected plain text data with categories according to their industry and use the labeled data as the training sample set of the business entity classification module;
S2, train a business entity recognition model with a conditional random field model and obtain the business entity recognition model;
S3, build semantic vectors for the text data of the original training set;
S4, train the business entity classification model on the category-labeled training set data after semantic vectorization;
S5, use the business entity classification model to classify the business entities in the text to be predicted.
Further, in S1, sentence splitting, word segmentation and part-of-speech tagging are performed on the collected plain text data, and the business entities in the plain text data and their industry categories are annotated manually.
Further, the open-source word segmentation and part-of-speech tagging software HanLP is used to perform sentence splitting, word segmentation and part-of-speech tagging on the plain text data.
Further, the business entities in the plain text data are annotated with the "BIO" scheme, in which the first word of a business entity is labeled "B", the remaining words of the business entity are labeled "I", and words unrelated to any business entity are labeled "O".
Further, in the manual annotation, each business entity in the plain text data is labeled, based on its context, with a category according to its industry.
Further, in S2, the business entity recognition model is trained with a conditional random field model into which boundary features are introduced.
Further, the conditional random field model with boundary features involves: segmenting enterprise names with HanLP and collating the results to obtain left and right boundary dictionaries; training left and right boundary prediction models with the open-source libSVM; taking words from the training set one at a time and using the boundary prediction models to judge whether each word is a left or right boundary word; and feeding the training set data, comprising the word itself, the part-of-speech tag, the left and right boundary labels and the entity label, into an open-source conditional random field tool to train the business entity recognition model and obtain the recognition model of business entities.
Further, in S3, a word-vector tool is used to obtain the word vectors of all words in the training sample set; the inverse document frequency (IDF) value of every word in the training sample set is computed; the word vectors and TF-IDF values are used to compute the business entity vector and the context vector for each sentence containing a business entity; and the business entity vector and the context vector are concatenated to obtain a business entity semantic vector that includes the context semantics.
Further, the open-source word2vec tool is used to compute the word vectors of all words in the training set.
Further, in S4, a softmax model is trained on the category-labeled training set data to obtain the business entity classification model.
The present invention has the following advantages:
1) The left and right boundaries of entities are determined in advance with dictionary rules and an SVM classifier, and the boundary judgments are then fed into the conditional random field model as a new feature. The improved method of the present invention achieves a large gain in recall and F1 score.
2) Entities and their contexts are represented as semantic vectors by weighting word embeddings, so that the semantic similarity between entities can be measured by the distance between their semantic vectors. Using the resulting semantic vectors as entity features reduces the dependence on hand-crafted features and external data.
3) Entity boundary features are introduced into the existing conditional random field model. This strengthens the model's control over entity boundaries, so that, for example, recognition recall improves markedly, and it also helps ensure generality and robustness.
Brief description of the drawings
Fig. 1 is a flow diagram of the plain-text-oriented business entity classification method disclosed by the present invention.
Fig. 2 is a flow chart of training set construction in the embodiment.
Fig. 3 is a flow chart of training the business entity recognition model based on the improved conditional random field in the embodiment.
Fig. 4 is a flow chart of building entity semantic vectors weighted by word vectors and TF-IDF values in the embodiment.
Fig. 5 is a flow chart of training the business entity classification model.
Fig. 6 is a flow chart of business entity classification.
Embodiment
To make the technical content of the present invention easier to understand, a specific embodiment of the business entity classification method for court documents is described below with reference to the accompanying drawings.
As shown in Fig. 2, the training sample set is built before the method is carried out. In this embodiment it is built as follows:
Step 1-0, start of training set construction.
Step 1-1, collect court documents from the Internet with a web crawler to form the original corpus.
Step 1-2, apply the open-source word segmentation and part-of-speech tagging software HanLP to the collected documents to perform sentence splitting, word segmentation and part-of-speech tagging. Any general-purpose open-source word segmenter, such as the Chinese Academy of Sciences segmenter, can be used instead; HanLP was chosen in this embodiment because its segmentation quality is comparatively good among current open-source segmenters and it supports user-defined dictionaries, which makes it more convenient.
Step 1-3, because a business entity in the text (that is, an enterprise name, which mainly takes two forms, the full name and the abbreviation) is split into several words by word segmentation, the business entities in the document text must be annotated manually. The annotation uses the "BIO" scheme: the first word of a business entity is labeled "B", the remaining words of the business entity are labeled "I", and words unrelated to any business entity are labeled "O", for example "defendant (O) Jiangsu (B) Eurasian (I) film (I) Co., Ltd (I)". The annotated data serves as the training set of the business entity recognition model.
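The BIO scheme described in step 1-3 can be illustrated with a minimal sketch. The function name `bio_tags` and the representation of an entity as a `(start, end)` pair of token indices are assumptions made for this illustration; in the embodiment the labels are produced by manual annotation, not by code.

```python
def bio_tags(tokens, entity_spans):
    """Assign B/I/O tags to segmented tokens.

    entity_spans holds (start, end) token index pairs, end exclusive
    (a hypothetical representation chosen for this sketch).
    """
    tags = ["O"] * len(tokens)          # words unrelated to any entity
    for start, end in entity_spans:
        tags[start] = "B"               # first word of the business entity
        for i in range(start + 1, end):
            tags[i] = "I"               # remaining words of the entity
    return tags


# The example sentence from step 1-3, already segmented.
tokens = ["defendant", "Jiangsu", "Eurasian", "film", "Co., Ltd"]
tags = bio_tags(tokens, [(1, 5)])
```

Here `tags` lines up one-to-one with `tokens`, which is the shape a sequence labeler such as a conditional random field expects.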
At the same time, each business entity in the collected document text is labeled, based on its context, with a category according to its industry. The labeled data serves as the training set of the business entity classification model. Each labeled item consists of a sentence containing an enterprise name together with the class label of the industry that enterprise belongs to, and the whole training set is a collection of such sentence and class-label pairs. For accuracy and authority, the labeling standard can follow the divisions of the Industrial Classification for National Economic Activities (GB/T 4754-2011).
Step 1-4, end of training set construction.
As shown in Fig. 3, after the training set has been built, the business entity recognition model is trained using an improved maximum matching method, that is, a conditional random field model into which boundary features are introduced.
Step 2-0, start of business entity recognition model training.
Step 2-1, input the training set data that has undergone sentence splitting, word segmentation, part-of-speech tagging and entity annotation (that is, the annotated data from step 1-3).
Step 2-2, crawl enterprise directories from the Internet, segment the enterprise names with HanLP, and collate the results into left and right boundary dictionaries. A left boundary word is the first word of a segmented enterprise name, and a right boundary word is the last word of a segmented enterprise name. All left and right boundary words are collected into the left and right boundary dictionaries.
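Step 2-2 can be sketched as follows, assuming the enterprise names have already been segmented into word lists (the segmentation itself, done with HanLP in the embodiment, is not shown). The function name `build_boundary_dicts` and the second sample name are hypothetical.

```python
def build_boundary_dicts(segmented_names):
    """Collect first and last words of segmented enterprise names."""
    left, right = set(), set()
    for words in segmented_names:
        if not words:               # skip names that segmented to nothing
            continue
        left.add(words[0])          # left boundary word: first word
        right.add(words[-1])        # right boundary word: last word
    return left, right


# Segmented enterprise names; the second one is made up for illustration.
names = [
    ["Jiangsu", "Eurasian", "film", "Co., Ltd"],
    ["Nanjing", "steel", "group"],
]
left_dict, right_dict = build_boundary_dicts(names)
```

Sets are used so that a lookup during step 2-4 is a constant-time membership test.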
Step 2-3, train the left and right boundary prediction models with the open-source libSVM. The features used to train the left boundary prediction model are the current word and the two following words, each with the word itself and its part of speech; the features used to train the right boundary prediction model are the current word and the two preceding words, each with the word itself and its part of speech. The open-source libSVM tool used here offers good robustness and good classification boundaries.
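The feature windows of step 2-3 can be sketched as below. This only shows which words and part-of-speech tags enter each window; the padding token `<PAD>`, the function names and the sample part-of-speech tags are assumptions, and the conversion of these symbolic features into the numeric vectors that libSVM expects is not shown.

```python
PAD = ("<PAD>", "<PAD>")  # hypothetical filler for positions outside the sentence


def left_boundary_features(tagged, i):
    """Current word and the two following words, each with its part of speech."""
    window = [tagged[j] if j < len(tagged) else PAD for j in range(i, i + 3)]
    return [value for pair in window for value in pair]


def right_boundary_features(tagged, i):
    """Current word and the two preceding words, each with its part of speech."""
    window = [tagged[j] if j >= 0 else PAD for j in range(i - 2, i + 1)]
    return [value for pair in window for value in pair]


# (word, part-of-speech) pairs; the tags are illustrative only.
tagged = [("defendant", "n"), ("Jiangsu", "ns"), ("Eurasian", "nz"), ("film", "n")]
left_feats = left_boundary_features(tagged, 2)
right_feats = right_boundary_features(tagged, 1)
```

Padding keeps every feature list the same length, so words near the start or end of a sentence can be handled like any other word.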
Step 2-4, take words from the training set one at a time and use the left and right boundary prediction models to judge whether each word is a left or right boundary word.
Whether the current word is a left boundary word is decided as follows: if the word appears in the left boundary dictionary and a word within the two-word window to its right is judged a left boundary word by the SVM method, it is accepted as a correct left boundary word; otherwise it is discarded. Each word has one judgment under the dictionary method and one under the SVM method, but both methods have shortcomings, so this step of the embodiment combines the two results to select a more reasonable one.
Whether the current word is a right boundary word is decided as follows: if the word appears in the right boundary dictionary and a word within the two-word window to its left is judged a right boundary word by the SVM method, it is accepted as a correct right boundary word; otherwise it is discarded.
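The combination rule of step 2-4 can be sketched as follows. The text does not say whether the two-word window includes the current word; this sketch assumes it does, so the window covers the current word plus two neighbours on the relevant side. The function name `confirm_boundaries` and the precomputed list of SVM judgments are also assumptions.

```python
def confirm_boundaries(words, boundary_dict, svm_flags, side, window=2):
    """Accept a word as a boundary word only if the dictionary and the SVM agree.

    svm_flags[i] is the SVM judgment for words[i] (True = boundary word).
    Assumption: the window includes the current word itself.
    """
    confirmed = []
    for i, word in enumerate(words):
        if side == "left":
            nearby = svm_flags[i : i + window + 1]          # look to the right
        else:
            nearby = svm_flags[max(0, i - window) : i + 1]  # look to the left
        confirmed.append(word in boundary_dict and any(nearby))
    return confirmed


words = ["defendant", "Jiangsu", "Eurasian", "film", "Co., Ltd"]
left_ok = confirm_boundaries(
    words, {"Jiangsu"}, [False, False, True, False, False], side="left"
)
right_ok = confirm_boundaries(
    words, {"Co., Ltd"}, [False, False, False, True, False], side="right"
)
```

In this example the SVM places each boundary one word off, and the dictionary lookup corrects it, which is the kind of disagreement the combined rule is meant to resolve.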
Step 2-5, judge whether all words have been traversed; if so, go to step 2-7, otherwise go to step 2-6.
Step 2-6, increment the counter i by 1 and take the next word from the text. In effect, the steps above judge for each word whether it is a left or right boundary word.
Step 2-7, feed the training set data into the open-source conditional random field tool CRF++ to train the business entity recognition model, and output the recognition model of business entities. The features selected for the training data are the word itself, the part-of-speech tag, the left and right boundary labels, and the entity label.
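The training data of step 2-7 can be laid out for CRF++ as sketched below. CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The column order, the "Y"/"N" boundary flags and the function name `to_crfpp_lines` are assumptions for this sketch; the feature template file that CRF++ also requires is not shown. Tokens must not contain whitespace, so the last token is shortened to "Ltd" here.

```python
def to_crfpp_lines(sentences):
    """Format (word, pos, left_flag, right_flag, tag) rows as CRF++ training text."""
    lines = []
    for sentence in sentences:
        for word, pos, left_flag, right_flag, tag in sentence:
            lines.append("\t".join([word, pos, left_flag, right_flag, tag]))
        lines.append("")  # a blank line marks the end of a sentence
    return "\n".join(lines)


# One toy sentence; part-of-speech tags and boundary flags are illustrative.
sentence = [
    ("defendant", "n", "N", "N", "O"),
    ("Jiangsu", "ns", "Y", "N", "B"),
    ("Ltd", "n", "N", "Y", "I"),
]
training_text = to_crfpp_lines([sentence])
```

The resulting text can be written to a file and passed to CRF++ together with a template that says which columns and offsets to use as features.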
Step 2-8, end of business entity recognition model training.
As can be seen, the present invention introduces entity boundary features into the existing conditional random field model: before the conditional random field model is applied, each word is judged to be a left or right boundary word or not, and this result is then used as a feature of the conditional random field model. Introducing entity boundary features strengthens the model's control over entity boundaries, which shows up as a marked improvement in recognition recall.
As shown in Fig. 4, semantic vectors are built for the text data of the original training set as follows.
Step 3-0, start of semantic vector construction for the training set text.
Step 3-1, input the training set texts that have undergone sentence splitting, word segmentation, part-of-speech tagging and category labeling.
Step 3-2, compute the word vectors of all words in the training set with the open-source word2vec tool. Note that word2vec is the word-vector tool open-sourced by Google; many such tools exist today, word2vec being among the best known, and there are alternatives such as the Java tool word2vec4j.
Step 3-3, compute the inverse document frequency (IDF) value of every word in the training set by the following formula:
$$\mathrm{idf}(w) = \log \frac{N}{n_w + 1}$$
where, in the fraction inside the logarithm, the numerator $N$ is the total number of documents in the whole document collection and the denominator is the number of documents $n_w$ that contain the word $w$, plus 1.
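The IDF computation of step 3-3 can be sketched as below, following the description of the fraction (total number of documents over the number of documents containing the word plus 1). The logarithm base is not stated in the text; the natural logarithm is assumed here, and the function name and toy documents are made up for illustration.

```python
import math


def inverse_document_frequency(documents):
    """Compute idf(w) = log(N / (n_w + 1)) for every word in the collection.

    documents is a list of word lists; the natural logarithm is an assumption.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):           # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / (df + 1)) for word, df in doc_freq.items()}


docs = [
    ["court", "company"],
    ["company", "contract"],
    ["court", "judge"],
    ["loan"],
]
idf = inverse_document_frequency(docs)
```

With this form of the formula, a word that occurs in every document gets a slightly negative value, because the denominator then exceeds the numerator.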
Step 3-4, starting from the first text in the training set, take out each text in turn.
Step 3-5, use the business entity recognition model to judge whether the text taken out contains a business entity; if so, go to step 3-6, otherwise go to step 3-10.
Step 3-6, once step 3-5 has determined that the text contains a business entity, compute the semantic vector of the entity part. Suppose the vector representation of an entity is v_m and the vectors of the words that make it up are w_1, w_2, ..., w_n; then v_m is computed as follows:
[formula missing]
Step 3-7, after the entity semantic vector has been computed in step 3-6, compute the semantic vector of the entity's context as follows:
[formula missing]
where v(context) is the vector representation of the context, tfidf(w_i) is the TF-IDF value of word w_i, v(w_i) is the word vector of word w_i, and k is the word window size (the k context words immediately preceding the central entity are taken). The TF value of a word is the frequency with which the word occurs in the text, and the TF-IDF value of a word is the product of its TF value and its IDF value.
Step 3-8, concatenate the entity semantic vector and the context semantic vector obtained in steps 3-6 and 3-7. Specifically, the k-dimensional entity vector and the k-dimensional context vector are concatenated, entity vector first and context vector second, to give a 2k-dimensional vector.
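Steps 3-6 to 3-8 can be sketched as follows. The formulas for v_m and v(context) are not present in this text, so both are assumptions here: the entity vector is taken as the element-wise mean of its word vectors, and the context vector as the TF-IDF-weighted sum of the k preceding word vectors divided by k. All function names and toy values are hypothetical.

```python
def entity_vector(word_vectors):
    """Assumption: element-wise mean of the word vectors that form the entity."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def context_vector(preceding_words, vectors, tf, idf, k):
    """Assumption: TF-IDF-weighted sum of the k nearest preceding words, over k.

    k must be at least 1.
    """
    window = preceding_words[-k:]       # the k words closest to the entity
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in window:
        weight = tf[word] * idf[word]   # TF-IDF value of the word
        for d in range(dim):
            total[d] += weight * vectors[word][d]
    return [value / k for value in total]


def entity_semantic_vector(entity_vec, context_vec):
    """Concatenate: entity vector first, context vector second."""
    return entity_vec + context_vec


# Toy two-dimensional word vectors and TF / IDF values.
vectors = {"defendant": [1.0, 0.0], "Jiangsu": [0.0, 2.0], "film": [2.0, 0.0]}
tf = {"defendant": 0.5}
idf = {"defendant": 2.0}

entity = entity_vector([vectors["Jiangsu"], vectors["film"]])
context = context_vector(["defendant"], vectors, tf, idf, k=1)
combined = entity_semantic_vector(entity, context)
```

The concatenated vector has twice the dimension of a single word vector, which matches the doubling of dimension described in step 3-8.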
Step 3-9, judge whether all sentences in the training set text have been traversed; if so, go to step 3-11, otherwise go to step 3-10.
Step 3-10, increment the counter i by 1 and take the next sentence from the training set text.
Step 3-11, output the resulting entity vectors, which incorporate the context semantics, as the training data of the business entity classification model. Note that the data produced by step 1-3 consists of plain text plus class label; the preceding steps convert the text into vectors, so the data here consists of vector plus class label.
Step 3-12, end of semantic vector construction for the training set text.
As shown in Fig. 5, after semantic vectors have been built for the original corpus (that is, the data set obtained after step 1-3), the business entity classification model is trained with the softmax multi-class classification algorithm. Softmax multi-class classification is a commonly used method; compared with other methods it is fast to compute, takes little space, and yields the probability of a test sample under each class.
Step 4-0, start of business entity classification model training.
Step 4-1, input the category-labeled training set data, after semantic vectorization, into the softmax classification model as training input.
Step 4-2, train the multi-class model with the softmax algorithm and output the trained softmax multi-class model, which is used for subsequent category prediction.
Step 4-3, end of business entity classification model training.
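The softmax training of steps 4-1 and 4-2 can be sketched in pure Python as below. This is a generic softmax regression trained by stochastic gradient descent on the cross-entropy loss, not the tool used in the embodiment; the learning rate, epoch count, function names and toy data are all assumptions.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (shifted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, weights, bias):
    """Probability of x under each class."""
    scores = [sum(w * xi for w, xi in zip(ws, x)) + b for ws, b in zip(weights, bias)]
    return softmax(scores)


def train_softmax(samples, labels, n_classes, lr=0.5, epochs=200):
    """Fit softmax regression by stochastic gradient descent on cross-entropy."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = predict_proba(x, weights, bias)
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # dLoss/dscore_c
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
                bias[c] -= lr * grad
    return weights, bias


# Toy "semantic vectors" with two industry classes (0 and 1).
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights, bias = train_softmax(samples, labels, n_classes=2)
probs = predict_proba([1.0, 0.0], weights, bias)
```

The output of `predict_proba` is a probability for every class, which corresponds to the property noted above that softmax gives the probability of a sample under each class.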
As shown in Fig. 6, after the business entity classification model has been obtained, it is used for classification as follows.
Step 5-0, start of business entity classification.
Step 5-1, input the text whose entity categories are to be predicted into the business entity classification process.
Step 5-2, use the business entity recognition model to judge whether the input text contains a business entity; if so, go to step 5-3, otherwise go to step 5-5.
Step 5-3, build the entity semantic vector for the text containing the business entity using steps 3-1 to 3-12, then input the resulting vector into the trained business entity classification model to obtain the classification result for the entity in the text.
Step 5-4, output the classification result of step 5-3.
Step 5-5, end of business entity classification.
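The control flow of steps 5-1 to 5-5 can be sketched as below. The three callables stand in for the recognition model, the semantic vector construction and the classification model; they, the function name `classify_text` and the class names are hypothetical placeholders, shown here with trivial stubs.

```python
def classify_text(text, find_entities, build_vector, predict_proba, class_names):
    """Recognize entities in text, vectorize each one and pick its best class."""
    entities = find_entities(text)          # step 5-2: recognition model
    if not entities:
        return {}                           # no business entity: finish (step 5-5)
    results = {}
    for entity in entities:
        probs = predict_proba(build_vector(text, entity))   # step 5-3
        best = max(range(len(probs)), key=probs.__getitem__)
        results[entity] = class_names[best]
    return results                          # step 5-4: output the result


# Stub components, for illustration only.
def stub_find(text):
    return ["Jiangsu Eurasian film Co., Ltd"] if "Jiangsu" in text else []


def stub_vector(text, entity):
    return [1.0, 0.0]


def stub_proba(vector):
    return [0.8, 0.2]


result = classify_text(
    "defendant Jiangsu Eurasian film Co., Ltd",
    stub_find, stub_vector, stub_proba,
    class_names=["manufacturing", "finance"],
)
```

Passing the models in as callables keeps the flow independent of how recognition, vectorization and classification are implemented.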
In summary, the method proposed by the present invention uses word-vector technology and the TF-IDF values of document words to obtain a vector representation of business entities that includes context semantics, and then classifies on that representation. It addresses the problem that current business entity classification methods have few types that lack semantics, giving business entity types a finer granularity and stronger semantic features.
A person having ordinary skill in the technical field of the present invention may make various modifications and variations without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the claims.
Claims (10)
1. A plain-text-oriented business entity classification method, characterized by comprising the following steps:
S1, annotate the business entities in the collected plain text data and use the annotated data as the training set of the business entity recognition module; label the business entities in the collected plain text data with categories according to their industry and use the labeled data as the training sample set of the business entity classification module;
S2, train a business entity recognition model with a conditional random field model and obtain the business entity recognition model;
S3, build semantic vectors for the text data of the original training set;
S4, train the business entity classification model on the category-labeled training set data after semantic vectorization;
S5, use the business entity classification model to classify the business entities in the text to be predicted.
2. The business entity classification method of claim 1, characterized in that, in S1, sentence splitting, word segmentation and part-of-speech tagging are performed on the collected plain text data, and the business entities in the plain text data and their industry categories are annotated manually.
3. The business entity classification method of claim 2, characterized in that the open-source word segmentation and part-of-speech tagging software HanLP is used to perform sentence splitting, word segmentation and part-of-speech tagging on the plain text data.
4. The business entity classification method of claim 2, characterized in that the business entities in the plain text data are annotated with the "BIO" scheme, in which the first word of a business entity is labeled "B", the remaining words of the business entity are labeled "I", and words unrelated to any business entity are labeled "O".
5. The business entity classification method of claim 2, characterized in that, in the manual annotation, each business entity in the plain text data is labeled, based on its context, with a category according to its industry.
6. The business entity classification method of claim 1, characterized in that, in S2, the business entity recognition model is trained with a conditional random field model into which boundary features are introduced.
7. The business entity classification method of claim 6, characterized in that the conditional random field model with boundary features involves: segmenting enterprise names with HanLP and collating the results to obtain left and right boundary dictionaries; training left and right boundary prediction models with the open-source libSVM; taking words from the training set one at a time and using the boundary prediction models to judge whether each word is a left or right boundary word; and feeding the training set data, comprising the word itself, the part-of-speech tag, the left and right boundary labels and the entity label, into an open-source conditional random field tool to train the business entity recognition model and obtain the recognition model of business entities.
8. The business entity classification method of claim 1, characterized in that, in S3, a word-vector tool is used to obtain the word vectors of all words in the training sample set; the inverse document frequency (IDF) value of every word in the training sample set is computed; the word vectors and TF-IDF values are used to compute the business entity vector and the context vector for each sentence containing a business entity; and the business entity vector and the context vector are concatenated to obtain a business entity semantic vector that includes the context semantics.
9. The business entity classification method of claim 8, characterized in that the open-source word2vec tool is used to compute the word vectors of all words in the training set.
10. The business entity classification method of claim 1, characterized in that, in S4, a softmax model is trained on the category-labeled training set data to obtain the business entity classification model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710371464.7A CN107193959B (en) | 2017-05-24 | 2017-05-24 | Pure text-oriented enterprise entity classification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710371464.7A CN107193959B (en) | 2017-05-24 | 2017-05-24 | Pure text-oriented enterprise entity classification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107193959A true CN107193959A (en) | 2017-09-22 |
CN107193959B CN107193959B (en) | 2020-11-27 |
Family
ID=59874712
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710371464.7A Active CN107193959B (en) | 2017-05-24 | 2017-05-24 | Pure text-oriented enterprise entity classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107193959B (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423264A (en) * | 2017-07-10 | 2017-12-01 | 广东华联建设投资管理股份有限公司 | A kind of engineering material borrowing-word extracting method |
CN107894986A (en) * | 2017-09-26 | 2018-04-10 | 北京纳人网络科技有限公司 | A kind of business connection division methods, server and client based on vectorization |
CN108255813A (en) * | 2018-01-23 | 2018-07-06 | 重庆邮电大学 | A kind of text matching technique based on term frequency-inverse document and CRF |
CN108460014A (en) * | 2018-02-07 | 2018-08-28 | 百度在线网络技术(北京)有限公司 | Recognition methods, device, computer equipment and the storage medium of business entity |
CN108733778A (en) * | 2018-05-04 | 2018-11-02 | 百度在线网络技术(北京)有限公司 | The industry type recognition methods of object and device |
CN108763402A (en) * | 2018-05-22 | 2018-11-06 | 广西师范大学 | Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary |
CN108763201A (en) * | 2018-05-17 | 2018-11-06 | 南京大学 | A kind of open field Chinese text name entity recognition method based on semi-supervised learning |
CN109408827A (en) * | 2018-11-07 | 2019-03-01 | 南京理工大学 | A kind of software entity recognition methods based on machine learning |
CN110083704A (en) * | 2019-05-06 | 2019-08-02 | 重庆天蓬网络有限公司 | A kind of company's information processing method, storage medium and equipment based on main business |
CN110297913A (en) * | 2019-06-12 | 2019-10-01 | 中电科大数据研究院有限公司 | A kind of electronic government documents entity abstracting method |
CN110472062A (en) * | 2019-07-11 | 2019-11-19 | 新华三大数据技术有限公司 | The method and device of identification name entity |
CN110502638A (en) * | 2019-08-30 | 2019-11-26 | 重庆誉存大数据科技有限公司 | A kind of Company News classification of risks method based on target entity |
CN110990587A (en) * | 2019-12-04 | 2020-04-10 | 电子科技大学 | Enterprise relation discovery method and system based on topic model |
CN111209392A (en) * | 2018-11-20 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for excavating polluted enterprises |
CN111539209A (en) * | 2020-04-15 | 2020-08-14 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN111881685A (en) * | 2020-07-20 | 2020-11-03 | 南京中孚信息技术有限公司 | Small-granularity strategy mixed model-based Chinese named entity identification method and system |
CN112418681A (en) * | 2020-11-26 | 2021-02-26 | 北京上奇数字科技有限公司 | Method and apparatus for analyzing industrial development, electronic device, and storage medium |
CN113065343A (en) * | 2021-03-25 | 2021-07-02 | 天津大学 | Enterprise research and development resource information modeling method based on semantics |
CN113408273A (en) * | 2021-06-30 | 2021-09-17 | 北京百度网讯科技有限公司 | Entity recognition model training and entity recognition method and device |
WO2021238337A1 (en) * | 2020-05-29 | 2021-12-02 | 华为技术有限公司 | Method and device for entity tagging |
CN114036933A (en) * | 2022-01-10 | 2022-02-11 | 湖南工商大学 | Information extraction method based on legal documents |
CN114647727A (en) * | 2022-03-17 | 2022-06-21 | 北京百度网讯科技有限公司 | Model training method, device and equipment applied to entity information recognition |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104965992A (en) * | 2015-07-13 | 2015-10-07 | 南开大学 | Text mining method based on online medical question and answer information |
US20160148116A1 (en) * | 2014-11-21 | 2016-05-26 | International Business Machines Corporation | Extraction of semantic relations using distributional relation detection |
CN105630768A (en) * | 2015-12-23 | 2016-06-01 | 北京理工大学 | Cascaded conditional random field-based product name recognition method and device |
CN105787461A (en) * | 2016-03-15 | 2016-07-20 | 浙江大学 | Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature |
CN106503035A (en) * | 2016-09-14 | 2017-03-15 | 海信集团有限公司 | A kind of data processing method of knowledge mapping and device |
CN106570179A (en) * | 2016-11-10 | 2017-04-19 | 中国科学院信息工程研究所 | Evaluative text-oriented kernel entity identification method and apparatus |
- 2017-05-24: CN application CN201710371464.7A (patent CN107193959B), status: Active
Non-Patent Citations (4)
Title |
---|
MUHAMMAD ASHRAF KHAN NIAZI et al.: "Signature automation of UMLS concepts: An un-supervised named entity recognition framework for classification of DNA and RNA in biological text", 《2015 SCIENCE AND INFORMATION CONFERENCE (SAI)》 * |
庄成龙: "基于树核函数的中文实体语义关系抽取方法的研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
李芳: "基于条件随机场的两阶段中文微博命名实体识别研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王树伟: "面向金融文本的实体识别与关系抽取研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423264A (en) * | 2017-07-10 | 2017-12-01 | 广东华联建设投资管理股份有限公司 | A kind of engineering material borrowing-word extracting method |
CN107894986A (en) * | 2017-09-26 | 2018-04-10 | 北京纳人网络科技有限公司 | A kind of business connection division methods, server and client based on vectorization |
CN107894986B (en) * | 2017-09-26 | 2021-03-30 | 北京纳人网络科技有限公司 | Enterprise relation division method based on vectorization, server and client |
CN108255813A (en) * | 2018-01-23 | 2018-07-06 | 重庆邮电大学 | A kind of text matching technique based on term frequency-inverse document and CRF |
CN108255813B (en) * | 2018-01-23 | 2021-11-16 | 重庆邮电大学 | Text matching method based on word frequency-inverse document and CRF |
CN108460014A (en) * | 2018-02-07 | 2018-08-28 | 百度在线网络技术(北京)有限公司 | Recognition methods, device, computer equipment and the storage medium of business entity |
CN108460014B (en) * | 2018-02-07 | 2022-02-25 | 百度在线网络技术(北京)有限公司 | Enterprise entity identification method and device, computer equipment and storage medium |
CN108733778A (en) * | 2018-05-04 | 2018-11-02 | 百度在线网络技术(北京)有限公司 | The industry type recognition methods of object and device |
CN108733778B (en) * | 2018-05-04 | 2022-05-17 | 百度在线网络技术(北京)有限公司 | Industry type identification method and device of object |
CN108763201B (en) * | 2018-05-17 | 2021-07-23 | 南京大学 | Method for identifying text named entities in open domain based on semi-supervised learning |
CN108763201A (en) * | 2018-05-17 | 2018-11-06 | 南京大学 | A kind of open field Chinese text name entity recognition method based on semi-supervised learning |
CN108763402A (en) * | 2018-05-22 | 2018-11-06 | 广西师范大学 | Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary |
CN108763402B (en) * | 2018-05-22 | 2021-08-27 | 广西师范大学 | Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary |
CN109408827A (en) * | 2018-11-07 | 2019-03-01 | 南京理工大学 | A kind of software entity recognition methods based on machine learning |
CN111209392A (en) * | 2018-11-20 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for excavating polluted enterprises |
CN110083704A (en) * | 2019-05-06 | 2019-08-02 | 重庆天蓬网络有限公司 | A kind of company's information processing method, storage medium and equipment based on main business |
CN110297913A (en) * | 2019-06-12 | 2019-10-01 | 中电科大数据研究院有限公司 | A kind of electronic government documents entity abstracting method |
CN110472062A (en) * | 2019-07-11 | 2019-11-19 | 新华三大数据技术有限公司 | The method and device of identification name entity |
CN110502638A (en) * | 2019-08-30 | 2019-11-26 | 重庆誉存大数据科技有限公司 | A kind of Company News classification of risks method based on target entity |
CN110502638B (en) * | 2019-08-30 | 2023-05-16 | 重庆誉存大数据科技有限公司 | Enterprise news risk classification method based on target entity |
CN110990587A (en) * | 2019-12-04 | 2020-04-10 | 电子科技大学 | Enterprise relation discovery method and system based on topic model |
CN110990587B (en) * | 2019-12-04 | 2023-04-18 | 电子科技大学 | Enterprise relation discovery method and system based on topic model |
CN111539209B (en) * | 2020-04-15 | 2023-09-15 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN111539209A (en) * | 2020-04-15 | 2020-08-14 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN113743117A (en) * | 2020-05-29 | 2021-12-03 | 华为技术有限公司 | Method and device for entity marking |
WO2021238337A1 (en) * | 2020-05-29 | 2021-12-02 | 华为技术有限公司 | Method and device for entity tagging |
CN113743117B (en) * | 2020-05-29 | 2024-04-09 | 华为技术有限公司 | Method and device for entity labeling |
CN111881685A (en) * | 2020-07-20 | 2020-11-03 | 南京中孚信息技术有限公司 | Small-granularity strategy mixed model-based Chinese named entity identification method and system |
CN112418681A (en) * | 2020-11-26 | 2021-02-26 | 北京上奇数字科技有限公司 | Method and apparatus for analyzing industrial development, electronic device, and storage medium |
CN113065343A (en) * | 2021-03-25 | 2021-07-02 | 天津大学 | Enterprise research and development resource information modeling method based on semantics |
CN113408273A (en) * | 2021-06-30 | 2021-09-17 | 北京百度网讯科技有限公司 | Entity recognition model training and entity recognition method and device |
CN113408273B (en) * | 2021-06-30 | 2022-08-23 | 北京百度网讯科技有限公司 | Training method and device of text entity recognition model and text entity recognition method and device |
CN114036933A (en) * | 2022-01-10 | 2022-02-11 | 湖南工商大学 | Information extraction method based on legal documents |
CN114647727A (en) * | 2022-03-17 | 2022-06-21 | 北京百度网讯科技有限公司 | Model training method, device and equipment applied to entity information recognition |
Also Published As
Publication number | Publication date |
---|---|
CN107193959B (en) | 2020-11-27 |
Similar Documents
Publication | Title |
---|---|
CN107193959A (en) | A kind of business entity's sorting technique towards plain text |
WO2019200806A1 (en) | Device for generating text classification model, method, and computer readable storage medium |
CN102332028B (en) | Webpage-oriented unhealthy Web content identifying method |
CN106919673A (en) | Text mood analysis system based on deep learning |
US8170969B2 (en) | Automated computation of semantic similarity of pairs of named entity phrases using electronic document corpora as background knowledge |
CN106776581A (en) | Subjective texts sentiment analysis method based on deep learning |
US7386544B2 (en) | Database search system |
CN110532563A (en) | The detection method and device of crucial paragraph in text |
CN102541838B (en) | Method and equipment for optimizing emotional classifier |
CN106649597A (en) | Method for automatically establishing back-of-book indexes of book based on book contents |
CN110688836A (en) | Automatic domain dictionary construction method based on supervised learning |
CN106844349A (en) | Comment spam recognition methods based on coorinated training |
CN111782807B (en) | Self-bearing technology debt detection classification method based on multiparty integrated learning |
CN112101027A (en) | Chinese named entity recognition method based on reading understanding |
CN106933800A (en) | A kind of event sentence abstracting method of financial field |
CN112051986B (en) | Code search recommendation device and method based on open source knowledge |
CN110134799A (en) | A kind of text corpus based on BM25 algorithm build and optimization method |
Saravanan et al. | Automatic identification of rhetorical roles using conditional random fields for legal document summarization |
CN115238040A (en) | Steel material science knowledge graph construction method and system |
CN111460147A (en) | Title short text classification method based on semantic enhancement |
CN103473356B (en) | Document-level emotion classifying method and device |
CN104794209A (en) | Chinese microblog sentiment classification method and system based on Markov logic network |
CN112257442B (en) | Policy document information extraction method based on corpus expansion neural network |
CN110888983B (en) | Positive and negative emotion analysis method, terminal equipment and storage medium |
CN107329951A (en) | Build name entity mark resources bank method, device, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||