CN107193959B - Pure text-oriented enterprise entity classification method - Google Patents

Pure text-oriented enterprise entity classification method

Info

Publication number
CN107193959B
CN107193959B (application CN201710371464.7A)
Authority
CN
China
Prior art keywords
entity
enterprise
word
text
training
Prior art date
Legal status
Active
Application number
CN201710371464.7A
Other languages
Chinese (zh)
Other versions
CN107193959A (en)
Inventor
张雷
陈嘉伟
谢璐遥
王崇骏
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Application filed by Nanjing University
Priority to CN201710371464.7A
Publication of CN107193959A
Application granted
Publication of CN107193959B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Abstract

The invention discloses a pure text-oriented enterprise entity classification method, which comprises the following steps: s1, carrying out category marking on the enterprise entities in the collected plain text data to be used as a training set of an enterprise entity identification module; carrying out category marking on the enterprise entities in the acquired plain text data according to the industry properties to be used as a training sample set of an enterprise entity classification module; s2, carrying out enterprise entity recognition model training through the conditional random field model, and obtaining an enterprise entity recognition model; s3, semantic vectorization construction is carried out on the text data of the original training set; s4, training an enterprise entity classification model by using the training set data which is subjected to semantic vectorization and has class labels as training parameters; and S5, classifying the business entities in the text to be predicted by utilizing the business entity classification model. The method uses the obtained semantic vector as the characteristic of the entity, reduces the dependence on artificial characteristics and external data, and ensures the universality and the robustness.

Description

Pure text-oriented enterprise entity classification method
Technical Field
The invention belongs to the technical field of named entity recognition and fine-grained entity classification, and particularly relates to a plain text-oriented enterprise entity classification method.
Background
In recent years, with the trend of "internet finance", more and more enterprise decision makers urgently need to extract and analyze massive internet data by using a more advanced information processing mode so as to make better decisions. Among the mass data, the plain text data such as court documents and news public opinions becomes the primary source for enterprises to acquire high-value information.
Named entity recognition technology is the basis on which enterprises carry out entity semantic analysis, entity relation extraction and similar work. At present, mainstream named entity recognition technology divides entities only into person names, place names, organization names and the like, so the entity types lack semantics. Meanwhile, entity classification relies too heavily on hand-crafted features and external data, so its universality and robustness cannot be guaranteed.
Disclosure of Invention
The current mainstream named entity recognition technology divides entities only into person names, place names, organization names and the like, so the entity types lack semantics. In addition, entity classification relies too heavily on hand-crafted features and external data, so universality and robustness are not guaranteed. To solve these problems, the invention provides a plain-text-oriented enterprise entity classification method, which adopts a finer-grained division of enterprise entities and classifies them using features constructed from the semantics of the text. Here, plain text means text containing information about enterprise activities, such as news text, court documents, etc.
As shown in fig. 1, the method for classifying business entities oriented to plain text disclosed by the present invention includes the following steps:
s1, carrying out category marking on the enterprise entities in the collected plain text data, and taking the marked data as a training set of an enterprise entity identification module; carrying out category marking on enterprise entities in the collected plain text data according to the industry properties, and using the marked data as a training sample set of an enterprise entity classification module;
s2, carrying out enterprise entity recognition model training through the conditional random field model, and obtaining an enterprise entity recognition model;
s3, semantic vectorization construction is carried out on the text data of the original training set;
s4, training an enterprise entity classification model by using the training set data which is subjected to semantic vectorization and has class labels as training parameters;
and S5, classifying the business entities in the text to be predicted by utilizing the business entity classification model.
Further, in S1, the collected plain text data is processed with sentence segmentation, word segmentation and part-of-speech tagging, and the enterprise entities and their industry categories in the plain text data are annotated by manual labeling.
Furthermore, sentence segmentation, word segmentation and part-of-speech tagging are carried out on the plain text data by using open-source word segmentation and part-of-speech tagging software HanLP.
Furthermore, the business entity in the plain text data is labeled in a "BIO" labeling mode, wherein the initial word of the business entity is labeled as "B", other words of the business entity, which are not the initial word, are labeled as "I", and words which are irrelevant to the business entity are labeled as "O".
Further, in the manual labeling method, the enterprise entities in the plain text data are labeled with industry categories according to their context.
Further, in S2, training the business entity recognition model is performed by introducing a conditional random field model of the boundary features.
Further, the conditional random field model for introducing boundary features comprises: dividing words of enterprise names by HanLP, and then sorting to obtain a left boundary dictionary and a right boundary dictionary; training by using an open-source libSVM to obtain a prediction model of a left boundary and a right boundary; sequentially taking out words from the training set and judging whether the words are left and right boundary words or not through the prediction models of the left and right boundaries; and inputting training set data comprising words, part of speech labels, left and right boundary labels and entity labels into an open-source conditional random field tool to train the enterprise entity recognition model and obtain the recognition model of the enterprise entity.
Further, in S3, a word vector calculation tool is used to obtain the word vectors of all words in the training sample set, the inverse text frequency (IDF) values of all words in the training sample set are calculated, the word vectors and TF-IDF values are used to compute, for each sentence containing an enterprise entity, the vector of the enterprise entity and the vector of its context, and the two vectors are concatenated to obtain an enterprise entity semantic vector that incorporates context semantics.
Further, word vectors for all words in the training set are calculated using the open source word2vec tool.
Further, in S4, a classification model of the business entity is trained on the training set data with class labels using the softmax model.
The invention has the following beneficial effects:
1) By first determining the left and right boundaries of entities with dictionary rules and SVM classifiers and then feeding the boundary decisions into the conditional random field model as new features, the improved method achieves substantial gains in recall and F1 value.
2) Entities and their contexts are represented as semantic vectors by weighting word embeddings, so that the semantic similarity between entities can be measured by the distance between their vectors. Using the resulting semantic vectors as entity features reduces the dependence on hand-crafted features and external data.
3) Introducing entity boundary features into the existing conditional random field model strengthens its control over entity boundaries, reflected for example in a clear improvement in recognition recall, and ensures universality and robustness.
Drawings
Fig. 1 is a flowchart of a plain text-oriented business entity classification method disclosed in the present invention.
Fig. 2 is a training set construction flowchart in the embodiment.
FIG. 3 is a flowchart of an embodiment of training an improved conditional random field-based enterprise entity recognition model.
FIG. 4 is a flow chart of entity semantic vector construction based on word vectors and TF-IDF value weighting in an embodiment.
FIG. 5 is a flowchart of an enterprise entity classification model training process.
FIG. 6 is a flowchart of business entity classification.
Detailed Description
To better understand the technical content of the present invention, a specific embodiment of the method for classifying business entities for court documents is described below with reference to the accompanying drawings.
As shown in FIG. 2, the present invention constructs a training sample set prior to implementation. The process of constructing the training sample set in the embodiment is as follows:
and 1-0, establishing an initial state of a training set.
Step 1-1, collecting court documents from the Internet by using a web crawler tool to serve as an original corpus.
And step 1-2, performing sentence segmentation, word segmentation and part-of-speech tagging on the collected document text with the open-source segmentation and part-of-speech tagging software HanLP. Other common open-source segmentation tools, such as the Chinese Academy of Sciences segmenter (ICTCLAS), could also be used; HanLP is chosen in this embodiment because it segments somewhat better than other existing open-source tools and supports manually customized dictionaries, which is more convenient.
Step 1-3, because an enterprise entity word in the text (namely the name of an enterprise, mainly including full names and abbreviations) is split into several words after segmentation, the enterprise entities in the text are marked by manual annotation in the "BIO" scheme: the first word of an enterprise entity is labeled "B", the other words of the enterprise entity that are not the first word are labeled "I", and words unrelated to any enterprise entity are labeled "O", as in "informed (O) Jiangsu (B) Eurasian (I) film (I) Limited company (I)". The labeled data is used as the training set of the enterprise entity recognition model.
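For illustration only, a minimal Python sketch of the "BIO" scheme just described; the example sentence, its segmentation and the entity span are hypothetical and not taken from the patent:

```python
# Minimal sketch of "BIO" tagging for one segmented sentence (hypothetical example).
# "B" = first word of an enterprise entity, "I" = other word of the entity,
# "O" = word unrelated to any enterprise entity.

tokens = ["被告", "江苏", "欧亚", "影视", "有限公司", "拖欠", "货款"]   # hypothetical segmentation
entity_span = (1, 5)   # tokens[1:5] form the enterprise name (hypothetical span)

tags = []
for i, _ in enumerate(tokens):
    if i == entity_span[0]:
        tags.append("B")
    elif entity_span[0] < i < entity_span[1]:
        tags.append("I")
    else:
        tags.append("O")

for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")   # 被告 O, 江苏 B, 欧亚 I, 影视 I, 有限公司 I, 拖欠 O, 货款 O
```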
Meanwhile, the enterprise entities in the collected document text are labeled with industry categories according to their context. The labeled data serves as the training set of the enterprise entity classification model; each labeled item comprises a sentence containing an enterprise name and the category label of the industry to which the enterprise belongs, and the whole training set is a collection of such sentence/label pairs. The category labels can follow the accurate and authoritative divisions of the national economic industry classification standard (GB/T 4754-2011).
And 1-4, finishing establishing the training set.
As shown in FIG. 3, after the training set is constructed, the improved conditional random field method is applied, i.e., the enterprise entity recognition model is trained with a conditional random field model into which boundary features are introduced.
And 2-0, starting the training of the enterprise entity recognition model.
And 2-1, inputting training set data (namely the labeling completion data in the step 1-3) after sentence segmentation, word segmentation, part of speech labeling and entity labeling.
And 2-2, crawling enterprise directories from the Internet and segmenting the enterprise names with HanLP to obtain a left boundary dictionary and a right boundary dictionary. The left boundary word is the first word of a segmented enterprise name, and the right boundary word is the last word; all left and right boundary words are collected into the two dictionaries.
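A rough sketch of this dictionary construction is given below; the enterprise-directory file name, its one-name-per-line format and the use of the pyhanlp wrapper around HanLP are assumptions made for illustration:

```python
# Sketch of step 2-2: build left/right boundary dictionaries from a crawled
# list of enterprise names.  The file name "enterprise_names.txt" (one name per
# line) and the pyhanlp wrapper are assumptions made for illustration.
from pyhanlp import HanLP

left_boundary, right_boundary = set(), set()

with open("enterprise_names.txt", encoding="utf-8") as f:
    for name in (line.strip() for line in f):
        if not name:
            continue
        words = [term.word for term in HanLP.segment(name)]
        if words:
            left_boundary.add(words[0])     # first word of the segmented name
            right_boundary.add(words[-1])   # last word of the segmented name
```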
And 2-3, training prediction models for the left and right boundaries with the open-source libSVM. The features for the left boundary prediction model are the word itself and the part of speech of the current word and of the two words following it; the features for the right boundary prediction model are the word itself and the part of speech of the current word and of the two words preceding it. The open-source libSVM is used because of its robustness and good classification boundaries.
And 2-4, sequentially taking out words from the training set and judging whether the words are left and right boundary words through the prediction models of the left and right boundaries.
The method for judging whether the current word is a left boundary word is: if the word appears in the left boundary dictionary and the SVM method also judges a word within the two-word window to its right to be a left boundary word, the word is accepted as a left boundary word; otherwise it is discarded. Each word in fact receives a judgment from both the dictionary method and the SVM method; since both methods have shortcomings, this step of the embodiment combines the two results and keeps the more reasonable one.
The method for judging whether the current word is a right boundary word is analogous: if the word appears in the right boundary dictionary and the SVM method also judges a word within the two-word window to its left to be a right boundary word, the word is accepted as a right boundary word; otherwise it is discarded.
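A minimal sketch of how the two judgments of steps 2-3 and 2-4 could be combined; the svm_left_predict/svm_right_predict helpers stand for the libSVM models of step 2-3, and the exact treatment of the two-word window is an assumption:

```python
# Sketch of the combined dictionary + SVM decision of step 2-4.
# svm_left_predict(words, j) / svm_right_predict(words, j) stand for the
# libSVM boundary models of step 2-3 and return True/False for position j.

def is_left_boundary(i, words, left_boundary, svm_left_predict):
    if words[i] not in left_boundary:                 # dictionary check
        return False
    window = range(i, min(i + 3, len(words)))         # current word plus the two words to its right (assumed)
    return any(svm_left_predict(words, j) for j in window)

def is_right_boundary(i, words, right_boundary, svm_right_predict):
    if words[i] not in right_boundary:
        return False
    window = range(max(i - 2, 0), i + 1)              # the two words to its left plus the current word (assumed)
    return any(svm_right_predict(words, j) for j in window)
```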
And 2-5, judging whether all the words are traversed or not, if so, going to step 2-7, and otherwise, going to step 2-6.
And 2-6, adding 1 to the counter i, and taking out the next word in the text. The above steps are actually to determine whether a word is a left or right boundary word.
And 2-7, inputting the training set data into the open-source conditional random field tool CRF++ to train the enterprise entity recognition model, and outputting the trained recognition model. The features selected from the training data are the word itself, its part-of-speech tag, its left and right boundary tags, and its entity tag.
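To make the CRF++ input concrete, the sketch below writes one row per word in the column layout named above (word, part of speech, left boundary flag, right boundary flag, entity tag); the column order, file names and the crf_learn/crf_test commands reflect common CRF++ usage rather than anything mandated by the patent:

```python
# Sketch of step 2-7: write the training data in CRF++ column format, one row
# per word: word, part of speech, left-boundary flag, right-boundary flag, BIO tag.
# Column order and file names are illustrative; sentences are separated by blank lines.

def write_crfpp_file(sentences, path="train.txt"):
    """sentences: list of sentences, each a list of (word, pos, lb, rb, tag) tuples."""
    with open(path, "w", encoding="utf-8") as out:
        for sent in sentences:
            for word, pos, lb, rb, tag in sent:
                out.write(f"{word}\t{pos}\t{lb}\t{rb}\t{tag}\n")
            out.write("\n")

# With a feature template file, training and tagging would then be run as, e.g.:
#   crf_learn template.txt train.txt entity_model
#   crf_test -m entity_model test.txt
```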
And 2-8, finishing the training of the enterprise entity recognition model.
Thus, the method introduces entity boundary features into the existing conditional random field model: before the conditional random field model is applied, each word is judged to be, or not to be, a left or right boundary word, and the judgment results are used as additional features when the conditional random field model is then applied. Introducing these entity boundary features strengthens the model's control over entity boundaries, which is reflected in a clear improvement in recognition recall.
Fig. 4 shows the flow of semantic vectorization construction for the text data of the original training set.
And 3-0, starting the construction of the text semantic vector of the training set.
And 3-1, inputting a training set text set with completed sentences, participles, part of speech labels and category labels.
And 3-2, calculating word vectors of all words in the training set with the open-source word2vec tool. word2vec is an open-source tool from Google for computing word vectors; it is widely known, many such tools now exist, and there are alternatives such as the Java implementation word2vec4j.
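As an illustration of step 3-2, word vectors could equally be trained with the gensim implementation of word2vec; the toy corpus and the hyper-parameter values below are placeholders, not values specified by the patent:

```python
# Sketch of step 3-2: train word vectors on the segmented training-set sentences.
# gensim's Word2Vec is used as a stand-in for the original word2vec tool;
# the toy corpus and hyper-parameters below are placeholders.
from gensim.models import Word2Vec

segmented_sentences = [
    ["原告", "江苏", "欧亚", "影视", "有限公司", "起诉", "被告"],   # placeholder sentences
    ["被告", "未", "按期", "支付", "货款"],
]

model = Word2Vec(sentences=segmented_sentences,
                 vector_size=100,   # word-vector dimension (assumed)
                 window=5,
                 min_count=1,
                 sg=1)              # skip-gram

vec = model.wv["江苏"]              # look up the vector of a word
```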
Step 3-3, calculating the inverse text frequency (IDF) values of all the words in the training set, wherein the calculation formula is as follows:
IDF(w) = log( N / (n_w + 1) )
In this formula, the numerator N inside the logarithm is the total number of documents in the corpus, and the denominator n_w + 1 is the number of documents containing the word w plus one; the IDF value is the logarithm of their ratio.
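A small sketch of this IDF computation, treating each document of the training corpus as a list of words (the function name and input format are illustrative assumptions):

```python
# Sketch of step 3-3: IDF(w) = log(N / (n_w + 1)), where N is the number of
# documents in the training corpus and n_w the number of documents containing w.
import math
from collections import Counter

def compute_idf(documents):
    """documents: list of documents, each given as a list of words."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))                     # count each word once per document
    return {w: math.log(n_docs / (df + 1)) for w, df in doc_freq.items()}
```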
And 3-4, sequentially taking out each sentence of text in the document from the first sentence of text in the training set.
And 3-5, judging whether the extracted text has the enterprise entity by using the enterprise entity identification model, if so, going to step 3-6, otherwise, going to step 3-10.
Step 3-6, after step 3-5 has determined that the text contains an enterprise entity, the semantic vector of the entity part is calculated. Suppose the vector of an entity is denoted v_m and the vectors of the words that form it are w_1, w_2, ..., w_n; then v_m is calculated as follows:
v_m = (1/n) * sum_{i=1}^{n} w_i
step 3-7, after calculating the semantic vector of the entity in step 3-6, calculating the semantic vector for the context part of the entity in the following way:
v(context) = sum_{i=1}^{k} [ tf·idf(w_i) · v(w_i) ] / sum_{i=1}^{k} tf·idf(w_i)
where v(context) is the vector representation of the context, tf·idf(w_i) is the TF-IDF value of the word w_i, v(w_i) is the word vector of w_i, and k is the word window size (i.e., the first k words of the context closest to the central entity are taken). The TF value of a word is the frequency with which the word appears in the text, and the TF-IDF value of a word is the product of its TF value and its IDF value.
And 3-8, concatenating the semantic vectors of the entity and of the context obtained in steps 3-6 and 3-7; specifically, the k-dimensional entity vector and the k-dimensional context vector are concatenated with the entity vector first and the context vector second, giving a 2k-dimensional vector.
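Putting steps 3-6 to 3-8 together under the weighted-average reading of the formulas above, a sketch of the vector construction; the helper inputs word_vec, tf and idf and the default window size are assumptions for illustration:

```python
# Sketch of steps 3-6 to 3-8: average the entity word vectors, take a
# TF-IDF-weighted average of the k context word vectors, and concatenate.
# word_vec maps a word to its vector, tf maps a word to its frequency in the
# current text, idf comes from step 3-3; all are assumed inputs.
import numpy as np

def entity_vector(entity_words, word_vec):
    return np.mean([word_vec[w] for w in entity_words], axis=0)          # step 3-6

def context_vector(context_words, word_vec, tf, idf, k=5):
    words = context_words[:k]                                            # k-word window
    weights = np.array([tf.get(w, 0.0) * idf.get(w, 0.0) for w in words])
    vecs = np.stack([word_vec[w] for w in words])
    if weights.sum() == 0:
        return vecs.mean(axis=0)                # fallback when all weights vanish (assumption)
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()         # step 3-7

def entity_semantic_vector(entity_words, context_words, word_vec, tf, idf, k=5):
    return np.concatenate([entity_vector(entity_words, word_vec),        # entity vector first
                           context_vector(context_words, word_vec, tf, idf, k)])  # context second (step 3-8)
```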
And 3-9, judging whether all sentences in the text of the training set are traversed or not, if so, going to step 3-11, and otherwise, going to step 3-10.
And 3-10, adding 1 to the counter i, and taking out the next sentence in the training set text.
And 3-11, outputting the obtained entity vectors, which fuse the context semantics, as the training data of the enterprise entity classification model. Note that the data labeled in step 1-3 consists of plain text plus category labels; the preceding steps convert that text into vectors, so the data here consists of vectors plus category labels.
And 3-12, finishing the semantic construction of the training set text.
As shown in FIG. 5, after semantic vectorization construction has been performed on the original corpus (i.e., the data set obtained after step 1-3), the enterprise entity classification model is trained with the softmax multi-classification algorithm. Softmax multi-classification is a common method; compared with other methods it is fast to compute, occupies little space, and yields the probability of a test sample on each class.
And 4-0, starting the training of the enterprise entity classification model.
And 4-1, inputting the training set data which is subjected to semantic vectorization and has class labels into a softmax classification model to serve as training parameters.
And 4-2, performing multi-classification model training by adopting a softmax algorithm, and outputting the trained softmax multi-classification model for subsequent classification prediction.
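As one possible realization of steps 4-1 and 4-2 (the patent only requires a softmax multi-class model; scikit-learn's logistic regression with the lbfgs solver, which fits a softmax loss for multi-class data, is used here as a stand-in, and the placeholder data is random):

```python
# Sketch of steps 4-1/4-2: train a softmax classifier on the concatenated
# entity semantic vectors and their industry labels.  Random placeholder data
# stands in for the real vectors of step 3-11 and the labels of step 1-3.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 200))     # placeholder 2k-dimensional vectors (k = 100 assumed)
y = rng.integers(0, 5, size=200)    # placeholder industry-category labels

clf = LogisticRegression(solver="lbfgs", max_iter=1000)   # multinomial (softmax) model for multi-class data
clf.fit(X, y)
print(clf.predict_proba(X[:1]))     # probability of one sample on each class
```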
And 4-3, finishing the training of the enterprise entity classification model.
As shown in FIG. 6, after the enterprise entity classification model is obtained, classification with the trained model proceeds as follows.
Step 5-0, start of business entity classification.
And 5-1, inputting texts of entity categories to be predicted into the enterprise entity classification model.
And 5-2, judging whether an enterprise entity exists in the input text by using the enterprise entity identification model, if so, turning to the step 5-3, otherwise, turning to the step 5-5.
And 5-3, constructing entity semantic vectors for the texts containing the enterprise entities by using the steps 3-1 to 3-12, and inputting the obtained vectors into the trained enterprise entity classification model to obtain the classification results of the entities in the texts.
And 5-4, outputting the classification result of step 5-3.
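A rough end-to-end sketch tying steps 5-1 to 5-4 together; recognize_entities, entity_semantic_vector and clf stand for the models produced in steps 2-7, 3-6 to 3-8 and 4-2, and their interfaces are assumptions made for illustration:

```python
# Sketch of steps 5-1 to 5-4: recognize enterprise entities in an input text,
# build their semantic vectors, and classify them with the trained model.
# recognize_entities, entity_semantic_vector and clf stand for the outputs of
# steps 2-7, 3-6..3-8 and 4-2; their interfaces are assumed for illustration.

def classify_text(text, recognize_entities, entity_semantic_vector, clf):
    results = []
    for entity_words, context_words in recognize_entities(text):     # step 5-2
        vec = entity_semantic_vector(entity_words, context_words)    # step 5-3: build the 2k-dimensional vector
        label = clf.predict([vec])[0]                                 # step 5-3: classify
        results.append(("".join(entity_words), label))
    return results                                                    # step 5-4: classification results
```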
And 5-5, finishing the enterprise entity classification.
In summary, the method provided by the invention, which uses word vector technology and the TF-IDF values of document words to build an enterprise entity vector representation that incorporates context semantics and then classifies on it, solves the problems of coarse categories and missing semantics in existing enterprise entity classification methods, so that the types of enterprise entities have finer granularity and stronger semantic features.
Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims (7)

1. A plain text-oriented enterprise entity classification method is characterized by comprising the following steps:
s1, marking the enterprise entities in the collected plain text data, and taking the marked data as a training set of an enterprise entity identification module; carrying out category marking on enterprise entities in the collected plain text data according to the industry properties, and using the marked data as a training sample set of an enterprise entity classification module;
s2, carrying out enterprise entity recognition model training by introducing a conditional random field model of boundary characteristics, and obtaining an enterprise entity recognition model;
s3, performing semantic vectorization construction on the text data of the original training set, namely, using a word vector calculation tool to obtain word vectors of all words in a training sample set, calculating inverse text frequency IDF values of all words in the training sample set, calculating vectors and context vectors of enterprise entities in enterprise entity sentences by using the word vectors and TF-IDF values, and splicing the vectors and the context vectors of the enterprise entities to obtain enterprise entity semantic vectors containing context semantics;
s4, training an enterprise entity classification model by using the training set data which is subjected to semantic vectorization and has class labels as training parameters;
s5, classifying the enterprise entities in the text to be predicted by utilizing the enterprise entity classification model;
the step S2 specifically includes: dividing words of enterprise names by HanLP, and then sorting to obtain a left boundary dictionary and a right boundary dictionary; training by using an open-source libSVM to obtain a prediction model of a left boundary and a right boundary; sequentially taking out words from the training set and judging whether the words are left and right boundary words or not through the prediction models of the left and right boundaries; inputting training set data comprising words, part of speech labels, left and right boundary labels and entity labels into an open-source conditional random field tool to train an enterprise entity recognition model and obtain the recognition model of the enterprise entity;
the step S3 specifically includes:
step 3-1, inputting a training set which is already finished with sentences, participles, part of speech labels and category labels;
step 3-2, calculating word vectors of all words in the training set;
step 3-3, calculating the inverse text frequency IDF values of all the words in the training set, wherein the calculation formula is as follows:
IDF(w) = log( N / (n_w + 1) )
in the formula, N is the total number of documents in the corpus and n_w is the number of documents containing the word w;
3-4, sequentially taking out each sentence of text in the document from the first sentence of text in the training set;
3-5, judging whether the extracted text has the enterprise entity by using an enterprise entity identification model, if so, going to step 3-6, otherwise, going to step 3-10;
step 3-6, after step 3-5 determines that the text contains an enterprise entity, calculating the semantic vector of the entity part, wherein the semantic vector v_m of one entity is calculated as follows:
v_m = (1/n) * sum_{i=1}^{n} w_i
in the formula, w_i is the vector of the i-th word constituting the entity, i = 1, 2, ..., n;
step 3-7, calculating a semantic vector for the context part of the entity, wherein the calculation mode is as follows:
v(context) = sum_{i=1}^{k} [ tf·idf(w_i) · v(w_i) ] / sum_{i=1}^{k} tf·idf(w_i)
where v(context) is the semantic vector of the context, tf·idf(w_i) is the TF-IDF value of the word w_i, v(w_i) is the word vector of w_i, and k is the word window size; the TF value of a word is the frequency with which the word appears in the text, and the TF-IDF value of a word is the product of its TF value and its IDF value;
3-8, splicing the semantic vectors of the entities and the contexts obtained in the steps 3-6 and 3-7, specifically, splicing the k-dimensional entity vector and the k-dimensional context vector in a mode that the entity vector is in front and the context vector is in back to obtain a 2 k-dimensional vector;
3-9, judging whether all sentences in the text of the training set are traversed or not, if so, going to step 3-11, otherwise, going to step 3-10;
3-10, adding 1 to the counter i, and taking out the next sentence in the training set text;
3-11, outputting the obtained entity vector fused with the context semantics as training data of an enterprise entity classification model;
3-12, finishing the semantic construction of the training set text;
the step S5 specifically includes:
step 5-1, inputting a text of an entity category to be predicted to the enterprise entity classification model;
step 5-2, judging whether an enterprise entity exists in the input text by using the enterprise entity identification model, if so, turning to step 5-3, otherwise, turning to step 5-5;
step 5-3, entity semantic vector construction is carried out on the text containing the enterprise entities by utilizing the steps 3-1 to 3-12, and then the obtained vectors are input into a trained enterprise entity classification model to obtain the classification results of the entities in the text;
step 5-4, outputting the classification result of the step 5-3;
and 5-5, finishing the enterprise entity classification.
2. The business entity classification method of claim 1, wherein in S1, the collected plain text data is labeled by sentence segmentation, word segmentation and part of speech, and the business entities and industry categories in the plain text data are labeled by manual labeling.
3. The business entity classification method of claim 2, wherein the plain text data is sentence-segmented, word-segmented and part-of-speech tagged using open-source word segmentation and part-of-speech tagging software HanLP.
4. The business entity classification method of claim 2, wherein the business entities in the plain text data are labeled in a "BIO" label format, wherein the initial word of the business entity is labeled as "B", other partial words of the business entity that are not the initial word are labeled as "I", and words that are not related to the business entity are labeled as "O".
5. The method of classifying business entities according to claim 2, wherein, in the manual labeling method, the business entities in the plain text data are labeled with industry categories according to their context.
6. The business entity classification method of claim 1, wherein word vectors for all words in the training set are computed using an open-source word2vec tool.
7. The business entity classification method of claim 1, wherein in S4, a classification model of the business entity is trained using softmax model on the training set data with class labels already.
CN201710371464.7A 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method Active CN107193959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710371464.7A CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710371464.7A CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Publications (2)

Publication Number Publication Date
CN107193959A CN107193959A (en) 2017-09-22
CN107193959B 2020-11-27

Family

ID=59874712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710371464.7A Active CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Country Status (1)

Country Link
CN (1) CN107193959B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423264A (en) * 2017-07-10 2017-12-01 广东华联建设投资管理股份有限公司 A kind of engineering material borrowing-word extracting method
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN108255813B (en) * 2018-01-23 2021-11-16 重庆邮电大学 Text matching method based on word frequency-inverse document and CRF
CN108460014B (en) * 2018-02-07 2022-02-25 百度在线网络技术(北京)有限公司 Enterprise entity identification method and device, computer equipment and storage medium
CN108733778B (en) * 2018-05-04 2022-05-17 百度在线网络技术(北京)有限公司 Industry type identification method and device of object
CN108763201B (en) * 2018-05-17 2021-07-23 南京大学 Method for identifying text named entities in open domain based on semi-supervised learning
CN108763402B (en) * 2018-05-22 2021-08-27 广西师范大学 Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary
CN109408827A (en) * 2018-11-07 2019-03-01 南京理工大学 A kind of software entity recognition methods based on machine learning
CN111209392B (en) * 2018-11-20 2023-06-20 百度在线网络技术(北京)有限公司 Method, device and equipment for excavating polluted enterprises
CN110083704B (en) * 2019-05-06 2020-06-09 重庆天蓬网络有限公司 Method, storage medium and device for processing company information based on main business
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110472062B (en) * 2019-07-11 2020-11-10 新华三大数据技术有限公司 Method and device for identifying named entity
CN110502638B (en) * 2019-08-30 2023-05-16 重庆誉存大数据科技有限公司 Enterprise news risk classification method based on target entity
CN110990587B (en) * 2019-12-04 2023-04-18 电子科技大学 Enterprise relation discovery method and system based on topic model
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN113743117B (en) * 2020-05-29 2024-04-09 华为技术有限公司 Method and device for entity labeling
CN111881685A (en) * 2020-07-20 2020-11-03 南京中孚信息技术有限公司 Small-granularity strategy mixed model-based Chinese named entity identification method and system
CN112418681B (en) * 2020-11-26 2021-08-03 北京上奇数字科技有限公司 Method and apparatus for analyzing industrial development, electronic device, and storage medium
CN113065343B (en) * 2021-03-25 2022-06-10 天津大学 Enterprise research and development resource information modeling method based on semantics
CN113408273B (en) * 2021-06-30 2022-08-23 北京百度网讯科技有限公司 Training method and device of text entity recognition model and text entity recognition method and device
CN114036933B (en) * 2022-01-10 2022-04-22 湖南工商大学 Information extraction method based on legal documents
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787461A (en) * 2016-03-15 2016-07-20 浙江大学 Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature
CN106503035A (en) * 2016-09-14 2017-03-15 海信集团有限公司 A kind of data processing method of knowledge mapping and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9792549B2 (en) * 2014-11-21 2017-10-17 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
CN104965992B (en) * 2015-07-13 2018-01-09 南开大学 A kind of text mining method based on online medical question and answer information
CN105630768B (en) * 2015-12-23 2018-10-12 北京理工大学 A kind of product name recognition method and device based on stacking condition random field
CN106570179B (en) * 2016-11-10 2019-11-19 中国科学院信息工程研究所 A kind of kernel entity recognition methods and device towards evaluation property text

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787461A (en) * 2016-03-15 2016-07-20 浙江大学 Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature
CN106503035A (en) * 2016-09-14 2017-03-15 海信集团有限公司 A kind of data processing method of knowledge mapping and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Signature automation of UMLS concepts: An un-supervised named entity recognition framework for classification of DNA and RNA in biological text";Muhammad Ashraf Khan Niazi等;《 2015 Science and Information Conference (SAI)》;20150903;全文 *
"基于树核函数的中文实体语义关系抽取方法的研究";庄成龙;《中国优秀硕士学位论文全文数据库 信息科技辑》;20091015;全文 *

Also Published As

Publication number Publication date
CN107193959A (en) 2017-09-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant