CN110879831A - Chinese medicine sentence word segmentation method based on entity recognition technology - Google Patents

Info

Publication number
CN110879831A
Authority
CN
China
Prior art keywords
word
chinese medicine
corpus
sentences
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910967537.8A
Other languages
Chinese (zh)
Inventor
崔智颖
佘莉
黄剑平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Normal University
Original Assignee
Hangzhou Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-10-12
Filing date
2019-10-12
Publication date
2020-03-13
Application filed by Hangzhou Normal University
Priority to CN201910967537.8A
Publication of CN110879831A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese medicine sentence word segmentation method based on entity recognition technology, comprising the following steps: collecting Chinese medicine sentences as a corpus; extracting common phrases from the corpus to form a lexicon W; classifying the lexicon W and, according to the classification result, sequence-labeling each element of each sequence in the corpus; pre-training on the corpus with the BERT language model to obtain word vectors; constructing a recurrent neural network model with a conditional random field and training it on the labeled data set; inputting a test sentence into the trained recurrent neural network model to obtain an entity list; performing a first segmentation of the input sentence according to the entity list, cutting out the elements contained in the list; scanning the remaining unsegmented text against a prefix dictionary to generate every possible word formed by the Chinese characters in the sentence, which together form a directed acyclic graph; and applying dynamic programming to the directed acyclic graph to find the maximum-probability path, which gives the best segmentation based on word frequency.

Description

Chinese medicine sentence word segmentation method based on entity recognition technology
Technical Field
The invention relates to the technical field of Chinese medicine informatization, in particular to a Chinese medicine sentence word segmentation method based on an entity recognition technology.
Background
Traditional Chinese medicine is a form of traditional medicine with unique characteristics and advantages in China, and it has accumulated thousands of years of theoretical knowledge and clinical experience. To modernize traditional Chinese medicine, the main information in its literature needs to be extracted and collated with modern computer technology, so as to facilitate research and analysis and to promote the use, transformation, dissemination and popularization of traditional Chinese medicine knowledge.
Extracting and organizing the main information of a text can be treated as a Chinese word segmentation problem. Researchers have published a succession of increasingly mature Chinese word segmentation algorithms. However, when these algorithms are applied to the field of traditional Chinese medicine, the results are unsatisfactory. Over thousands of years of historical change, Chinese grammar and modes of expression have diverged greatly. Traditional Chinese medicine symptom descriptions are mostly written in classical Chinese or in a half-classical, half-vernacular style, which makes them harder to segment; in addition, there is no unified national standard for symptom terms, and symptom descriptions are flexible and varied in form, which places higher demands on the correctness and completeness of symptom recognition.
Researchers have previously attempted word segmentation of Chinese medicine texts, but most of these attempts use methods based on probability statistics. In practice, these methods improve segmentation accuracy for short Chinese medicine symptom terms, but the improvement for long symptom terms in symptom description sentences is limited.
The key to word segmentation is information extraction, and extraction requires determining the boundaries of the information to be extracted, which closely resembles entity recognition. Entity recognition has become an important technique in natural language processing tasks. However, different kinds of named entities have different internal features, and no single unified model can describe the internal features of all entities. A dedicated entity recognition model is therefore used to recognize traditional Chinese medicine terms, so that the word boundaries of those terms are determined more reliably, which in turn assists the word segmentation of traditional Chinese medicine symptom sentences.
Therefore, the research on how to utilize the entity recognition technology to efficiently and accurately segment the Chinese medicine symptom description sentences has important significance.
Disclosure of Invention
In view of the above deficiencies in the field, the invention provides a Chinese medicine sentence word segmentation method based on entity recognition technology. It is a Chinese word segmentation method built on deep-learning natural language processing techniques; it improves the accuracy of segmenting Chinese medicine symptom description sentences and can greatly reduce the workload.
A Chinese medicine sentence word segmentation method based on an entity recognition technology comprises the following steps:
(A) collecting traditional Chinese medicine sentences and, after data cleaning, using them as a corpus;
(B) counting the occurrence frequency of adjacent Chinese character combinations in the corpus, and extracting common phrases from the combinations whose frequency exceeds a certain threshold to form a lexicon W; calculating the forward conditional probability and the reverse conditional probability of the remaining adjacent Chinese character combinations, extracting the combinations that meet a specified threshold, and adding them to the lexicon W;
(C) classifying the lexicon W, and sequence-labeling each element of each sequence in the corpus with a label according to the classification result;
(D) pre-training on the corpus with the BERT language model to obtain word vectors;
(E) constructing a recurrent neural network model with a conditional random field (BiLSTM-CRF), and training the model on the labeled data set;
(F) inputting a test sentence into the trained BiLSTM-CRF model to obtain a predicted entity list;
(G) performing a first segmentation of the input test sentence according to the predicted entity list, cutting out the elements contained in the list;
(H) scanning the remaining unsegmented text against a prefix dictionary to generate every possible word formed by the Chinese characters in the sentence, which together form a directed acyclic graph;
(I) applying dynamic programming to the directed acyclic graph to find the maximum-probability path, which gives the best segmentation based on word frequency.
In the step (A), the Chinese medicine sentence data can be collected and sorted through the channels of the existing Chinese medicine related databases, websites, documents and the like, then the obtained Chinese medicine sentences are integrated into a large-scale Chinese medicine text database, and then the Chinese medicine sentences are preliminarily divided according to Chinese and English punctuations to be used as a corpus.
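The preliminary split at Chinese and English punctuation could be sketched as follows. This is a minimal illustration, not the patent's implementation: the punctuation set, the function name `split_clauses`, and the sample text are all assumptions made for the example.

```python
import re

# Assumed set of Chinese and English punctuation marks (plus whitespace)
# treated as clause boundaries; the patent does not list the exact marks.
_PUNCT = r"[，。；：！？、,.;:!?\s]+"


def split_clauses(text: str) -> list[str]:
    """Split raw text into clauses at punctuation, dropping empty pieces."""
    return [clause for clause in re.split(_PUNCT, text) if clause]


# Made-up sample text, used only to show the shape of the output.
print(split_clauses("小儿发热，咳嗽；鼻塞."))
```

Each resulting clause would then be one sequence of the corpus used in the later steps.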
In the step (B), the forward conditional probability and the reverse conditional probability are calculated according to the formula (I) and the formula (II) respectively:
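$$P(Y \mid X) = \frac{\mathrm{count}(XY)}{\mathrm{count}(X)} \qquad \text{(I)}$$
$$P(X \mid Y) = \frac{\mathrm{count}(XY)}{\mathrm{count}(Y)} \qquad \text{(II)}$$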
for any adjacent Chinese character combination XY with X at the front and Y at the back, P (Y | X) and P (X | Y) are respectively the forward conditional probability and the reverse conditional probability of the combination, count (XY) is the frequency of the combination appearing in the corpus, and count (X) and count (Y) are respectively the frequency of the Chinese character X and the Chinese character Y appearing in the corpus.
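As a rough sketch of formulas (I) and (II), the counts and both conditional probabilities can be computed in one pass over the corpus. The function names, the 0.8 threshold, the rule that both probabilities must reach the threshold, and the toy corpus are assumptions for illustration; the patent only says that combinations meeting a specified threshold are kept.

```python
from collections import Counter


def bigram_conditional_probs(clauses):
    """Return {XY: (P(Y|X), P(X|Y))} for every adjacent character pair XY."""
    char_count = Counter()
    pair_count = Counter()
    for clause in clauses:
        char_count.update(clause)  # count(X) for every single character
        pair_count.update(clause[i:i + 2] for i in range(len(clause) - 1))
    return {
        pair: (n / char_count[pair[0]], n / char_count[pair[1]])
        for pair, n in pair_count.items()
    }


def select_candidates(probs, threshold=0.8):
    """Keep pairs whose forward AND reverse probabilities reach the threshold
    (an assumed selection rule)."""
    return {p for p, (fwd, rev) in probs.items() if fwd >= threshold and rev >= threshold}


# Toy corpus, made up for the example.
probs = bigram_conditional_probs(["小儿发热", "小儿咳嗽", "发热"])
print(probs["儿咳"])
print(select_candidates(probs))
```

In this toy corpus the pair "儿咳" has a high reverse probability but a low forward one, so it is rejected, while "小儿", "发热" and "咳嗽" pass both tests.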
Preferably, in step (C), the lexicon W is divided into three types of "body" (body part), "age-level" (age-level), and "symptom" (symptom).
Preferably, in step (C), labeling each element of each sequence of the corpus as "B-M", "I-M" or "O" form using BIO labels according to the classification result;
wherein, "M" represents the type of the segment in which the element belongs, "B" and "I" represent the start position and non-start position of the element in the segment, respectively, and "O" represents that the element does not belong to any type.
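The automatic BIO labeling from a classified lexicon might look like the sketch below. The left-to-right longest-match strategy, the function name `bio_label`, and the Chinese phrases chosen to stand for "child", "nose" and "red swelling" are assumptions; the patent does not state how lexicon entries are matched against the corpus.

```python
def bio_label(sentence, lexicon):
    """Label each character of `sentence` with B-M / I-M / O.

    `lexicon` maps a phrase to its type M. Matching is greedy, left to
    right, longest phrase first (an assumed strategy).
    """
    labels = ["O"] * len(sentence)
    max_len = max(map(len, lexicon), default=0)
    i = 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            kind = lexicon.get(sentence[i:i + n])
            if kind is not None:
                labels[i] = f"B-{kind}"          # first character of the segment
                for k in range(i + 1, i + n):
                    labels[k] = f"I-{kind}"      # remaining characters
                i += n
                break
        else:
            i += 1  # no lexicon phrase starts here; the character stays "O"
    return labels


# Hypothetical lexicon entries using the three categories from the text.
lexicon = {"小儿": "age-level", "鼻子": "body", "红肿": "symptom"}
print(bio_label("小儿的鼻子红肿", lexicon))
```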
In step (D), BERT is a fine-tuning-based multi-layer bidirectional Transformer encoder. It further increases the generalization ability of the word vector model and captures character-level, word-level, sentence-level and even inter-sentence relational features.
In step (E), the core of the recurrent neural network model with a conditional random field consists mainly of two layers:
the first is a bidirectional long short-term memory (BiLSTM) network layer, which extracts the features of the input sequence and outputs a probability distribution matrix over the label types of each character in the sequence;
the second is a conditional random field (CRF) layer, which uses the probability distribution matrix to determine the most reasonable sequence path in the space of all feasible label sequences, giving the label of each character.
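Choosing the most reasonable label path from a per-character score matrix is commonly done with Viterbi decoding. The sketch below is a plain-Python illustration under that assumption; the patent does not name the decoding algorithm, and the transition scores, the label set and the function name `viterbi` are made up for the example.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][k]   : score of label k for character t (the BiLSTM output role)
    transitions[a][b] : score of moving from label a to label b (the CRF role)
    """
    n_labels = len(labels)
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        step_ptr, new_score = [], []
        for b in range(n_labels):
            # Best previous label a for arriving at label b.
            best_a = max(range(n_labels), key=lambda a: score[a] + transitions[a][b])
            step_ptr.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + emit[b])
        backptr.append(step_ptr)
        score = new_score
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for step_ptr in reversed(backptr):
        path.append(step_ptr[path[-1]])
    return [labels[k] for k in reversed(path)]


labels = ["B", "I", "O"]
emissions = [[1.0, 0.0, 1.2], [0.0, 2.0, 0.0]]
# Penalise the invalid move O -> I; all other moves are neutral.
transitions = [[0, 0, 0], [0, 0, 0], [0, -10, 0]]
print(viterbi(emissions, transitions, labels))
```

Taking the best label at each position independently would give "O" followed by "I", which is not a valid BIO sequence; the transition penalty lets the decoder prefer "B" followed by "I" instead.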
In steps (F) and (G), a test sentence is input into the BiLSTM-CRF model to obtain the predicted entities, which form an entity list; every element in the list is then cut out of the input sentence in turn.
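The first segmentation pass, cutting predicted entities out of the sentence, can be sketched as below. Processing longer entities first and returning `(text, is_entity)` pairs are design choices assumed for this example, as are the function name and the sample sentence.

```python
def cut_by_entities(sentence, entities):
    """Split `sentence` into ordered (text, is_entity) pieces.

    Pieces flagged False are the remaining text that still needs
    dictionary-based segmentation.
    """
    pieces = [(sentence, False)]
    # Longer entities first, so a short entity cannot break up a longer one.
    for ent in sorted(set(entities), key=len, reverse=True):
        if not ent:
            continue
        next_pieces = []
        for text, is_entity in pieces:
            if is_entity:
                next_pieces.append((text, True))
                continue
            parts = text.split(ent)
            for idx, part in enumerate(parts):
                if part:
                    next_pieces.append((part, False))
                if idx < len(parts) - 1:
                    next_pieces.append((ent, True))
        pieces = next_pieces
    return pieces


# Made-up sentence and entity list.
print(cut_by_entities("小儿鼻子红肿疼痛", ["鼻子", "红肿"]))
```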
In the step (H), the remaining sentences which are not segmented are subjected to dictionary searching operation according to a given dictionary to generate several possible sentence segmentations, and the possible segmentation modes form a directed acyclic graph.
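A directed acyclic graph of candidate words can be represented as a mapping from each start position to the end positions of dictionary words beginning there. The sketch below uses a plain frequency dictionary rather than a true prefix dictionary with early termination; the function name, the frequencies, and the choice of "小儿惊风" as sample text are assumptions.

```python
def build_dag(sentence, word_freq):
    """dag[i] lists every end index j such that sentence[i:j+1] is a dictionary word.

    A position with no dictionary word falls back to the single character.
    """
    dag = {}
    for i in range(len(sentence)):
        ends = [
            j for j in range(i, len(sentence))
            if word_freq.get(sentence[i:j + 1], 0) > 0
        ]
        dag[i] = ends or [i]
    return dag


# Hypothetical dictionary with made-up frequencies.
word_freq = {"小": 5, "小儿": 20, "儿": 3, "惊": 2, "惊风": 10, "风": 4}
print(build_dag("小儿惊风", word_freq))
```

Every path from position 0 to the end of the sentence through this mapping is one possible segmentation.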
In step (I), the maximum probability of the sentence is calculated over the directed acyclic graph in reverse, from right to left; that is, on reaching each node, the maximum path probability from that node to the end point is calculated;
each node of the directed acyclic graph is weighted, and for the words in the prefix dictionary, the weights are the word frequencies of the words.
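The right-to-left dynamic programming pass can be sketched as follows. Scoring a word by the logarithm of its frequency divided by the total frequency is an assumption made here so that path scores add up; the patent only states that the weights are word frequencies. The function name, the frequencies and the hand-written graph are hypothetical.

```python
import math


def best_segmentation(sentence, dag, word_freq):
    """Pick the maximum-probability path through `dag`, scanning right to left."""
    log_total = math.log(sum(word_freq.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}  # route[i] = (best log-probability from i to the end, end index)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(word_freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    # Follow the stored end indices from left to right to read off the words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


word_freq = {"小": 5, "小儿": 20, "儿": 3, "惊": 2, "惊风": 10, "风": 4}
dag = {0: [0, 1], 1: [1], 2: [2, 3], 3: [3]}  # candidate word ends per start index
print(best_segmentation("小儿惊风", dag, word_freq))
```

With these made-up frequencies the two-character words outweigh the products of their single characters, so the sentence is split into two words.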
Compared with the prior art, the invention has the main advantages that:
(1) The invention introduces a BERT + BiLSTM + CRF network model into the Chinese word segmentation process. The model predicts Chinese medicine symptom term entities, and the predicted entities are then used for word segmentation, which improves the accuracy of segmenting Chinese medicine symptom description sentences.
(2) In the data processing process, manual sequence labeling is a difficult task because of the large data volume and the large amount of professional domain knowledge required for manual labeling. The invention extracts the commonly used phrases from the corpus by using a statistical method, automatically marks the commonly used phrases according to the classification of the phrases, and can greatly reduce the workload.
Drawings
FIG. 1 is a schematic diagram of a word segmentation judgment process in the present invention;
FIG. 2 is a diagram of the BERT + BiLSTM + CRF network model structure.
Detailed Description
The invention is further described with reference to the following drawings and specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Where conditions are not specified in the following examples, conventional conditions or the conditions recommended by the manufacturer are generally used.
The process of the Chinese medicine sentence word segmentation method based on the entity recognition technology in the embodiment is shown in fig. 1, and specifically includes:
Step A: collect Chinese medicine symptom description data and clean the data. Traditional Chinese medicine sentences are collected and sorted from existing channels such as traditional Chinese medicine databases, websites and documents, and integrated into a large traditional Chinese medicine text database; the symptom description sentences are then preliminarily split at Chinese and English punctuation marks to serve as the corpus.
Step B: count the occurrence frequency of adjacent Chinese character combinations in the corpus, extract the combinations whose frequency exceeds a certain threshold, and manually screen common phrases out of the extracted combinations to serve as the basic lexicon W. Then calculate the forward conditional probability and the reverse conditional probability of the remaining adjacent Chinese character combinations, and add the phrases that fall within the set threshold range to the lexicon W.
First, for any adjacent Chinese character combination XY (X in front, Y behind), compute the frequency count(XY) of the combination in the corpus, the frequency count(X) of the Chinese character X in the corpus, and the frequency count(Y) of the Chinese character Y in the corpus. Then compute the forward conditional probability of XY as P(Y | X) = count(XY) / count(X), and the reverse conditional probability of XY as P(X | Y) = count(XY) / count(Y).
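A short worked example may help here. The counts below are hypothetical and chosen only to show how the two probabilities can differ for the same combination; the threshold value of 0.5 is likewise an assumption.

```latex
% Hypothetical counts: count(XY) = 40, count(X) = 50, count(Y) = 100
\begin{aligned}
P(Y \mid X) &= \frac{\mathrm{count}(XY)}{\mathrm{count}(X)} = \frac{40}{50} = 0.8,\\
P(X \mid Y) &= \frac{\mathrm{count}(XY)}{\mathrm{count}(Y)} = \frac{40}{100} = 0.4.
\end{aligned}
```

With an assumed threshold of 0.5, this combination passes the forward test but fails the reverse one: X is almost always followed by Y, yet Y frequently appears after other characters, so XY is a weaker phrase candidate than the forward probability alone suggests.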
Step C: classify the lexicon W and label the sequences according to the classification result. The phrases in the lexicon W are divided into three categories: "body" (body part), "age-level" (age level) and "symptom" (symptom). For example, "nose" belongs to the body category, "child" belongs to the age-level category, and "red swelling" belongs to the symptom category.
Each element in the corpus is sequence-labeled with BIO labels, written here as "B-M", "I-M" and "O", where "M" is the type of the segment the element belongs to, "B" marks an element at the start of a segment, "I" marks an element at a non-start position of a segment, and "O" marks an element that belongs to no type. For example, the word "nose" should be labeled "B-body I-body".
Step D: pre-train on the corpus with a novel language model (BERT) to generate word vectors.
Step E: construct a recurrent neural network model with a conditional random field (BiLSTM-CRF) and train it on the labeled data set. Referring to FIG. 2, the BERT layer uses a pre-trained matrix to map each input word to a low-dimensional dense word vector and passes it to the BiLSTM layer. The BiLSTM layer integrates and extracts the features of the input sequence; once the text sequence has been input, its output is a probability distribution matrix over the label types of each character in the sequence. Finally, the CRF layer uses the probability distribution matrix produced by the BiLSTM layer to determine the most reasonable sequence path in the space of all feasible label sequences, where each label in that sequence is the label of the character at the same position.
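One common way to formalize the interaction between the two layers is the standard BiLSTM-CRF sequence score. The patent does not give this formula, so the following is an assumed textbook formulation: E is the matrix output by the BiLSTM layer, with E_{i,k} the score of label k for the i-th character, and A is a learned label-transition matrix.

```latex
% Assumed standard formulation; the symbols s, E and A do not appear in the source.
\begin{aligned}
s(x, y) &= \sum_{i=1}^{n} E_{i,\,y_i} + \sum_{i=2}^{n} A_{y_{i-1},\,y_i},\\
p(y \mid x) &= \frac{\exp\bigl(s(x, y)\bigr)}{\sum_{y'} \exp\bigl(s(x, y')\bigr)},\\
y^{*} &= \arg\max_{y}\; s(x, y).
\end{aligned}
```

Under this formulation, training maximizes log p(y | x) for the labeled sequences, and prediction returns y*, the most reasonable sequence path over all feasible label sequences.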
Step F: input a test sentence into the BiLSTM-CRF model to obtain the predicted entities, which form an entity list L.
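Turning the predicted BIO labels into the entity list L could look like the sketch below. The function name, the handling of a stray "I-" label (treated as outside any entity) and the sample characters are assumptions for illustration.

```python
def decode_entities(chars, labels):
    """Collect (text, type) entities from per-character BIO labels."""
    entities, current, kind = [], "", None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):
            if current:
                entities.append((current, kind))
            current, kind = ch, lab[2:]           # start a new entity
        elif lab.startswith("I-") and current and lab[2:] == kind:
            current += ch                         # extend the open entity
        else:
            if current:
                entities.append((current, kind))
            current, kind = "", None              # "O" or a stray "I-" closes it
    if current:
        entities.append((current, kind))
    return entities


# Made-up sentence with hypothetical model output.
print(decode_entities("小儿的鼻子", ["B-age", "I-age", "O", "B-body", "I-body"]))
```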
Step G: cut every element of L out of the input sentence in turn, according to the obtained list L.
Step H: look up the remaining unsegmented text in a given dictionary to generate the possible segmentations of the sentence; these possible segmentations form a directed acyclic graph.
For example, take the phrase "infantile convulsions". Starting from the character "small", there are two candidate words, "small" and "infantile"; starting from the character "er", no longer word has it as a prefix, so there is only one candidate; starting from the character "fright", there are two candidates, the single character "fright" and a longer word that begins with it. Proceeding in the same way gives the candidate prefix words starting from every character.
Step I: after obtaining the directed acyclic graph formed by all possible segmentations, calculate the maximum-probability path; that is, on reaching each node, calculate the maximum path probability from that node to the end point. Each node of the directed acyclic graph is weighted, and for a word in the prefix dictionary the weight is the word frequency of that word.
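The right-to-left calculation can be written as a recurrence. The notation is assumed for illustration: the sentence has characters c_0 ... c_{n-1}, DAG(i) is the set of end positions of dictionary words starting at position i, freq(·) is the word frequency, and N is the total frequency of all dictionary words. Using logarithms of relative frequencies is also an assumption; the patent only states that the weights are word frequencies.

```latex
% Assumed notation; r(i) is the best log-probability of segmenting c_i ... c_{n-1}.
\begin{aligned}
r(n) &= 0,\\
r(i) &= \max_{j \in \mathrm{DAG}(i)} \left[ \log \frac{\mathrm{freq}(c_i \cdots c_j)}{N} + r(j+1) \right],
\qquad i = n-1, \dots, 0.
\end{aligned}
```

The value r(0) is the score of the best segmentation of the whole sentence, and recording the maximizing j at each position allows the words to be read off from left to right.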
Take a sentence containing the term "蓐劳" as an example. In Chinese medicine terminology, "蓐劳" refers to postpartum symptoms such as fatigue, lassitude, accompanying chills and fever, asthma, a feeling of suffocation, cough and abdominal pain. With a traditional Chinese word segmentation method, the segmentation result is {"woman", "蓐" (literally "straw mat"), "劳"}; that is, the traditional method splits "蓐劳" apart. In the invention, after the input sentence passes through the network model, the predicted labels are "B-age I-age B-sym I-sym"; that is, the model predicts the entities "woman" and "蓐劳", and the segmentation result is {"woman", "蓐劳"}. The method therefore improves the accuracy of segmenting Chinese medicine description sentences.
Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the above description of the present invention, and equivalents also fall within the scope of the invention as defined by the appended claims.

Claims (6)

1. A Chinese medicine sentence word segmentation method based on an entity recognition technology is characterized by comprising the following steps:
(A) collecting traditional Chinese medicine sentences and, after data cleaning, using them as a corpus;
(B) counting the occurrence frequency of adjacent Chinese character combinations in the corpus, and extracting common phrases from the combinations whose frequency exceeds a certain threshold to form a lexicon W; calculating the forward conditional probability and the reverse conditional probability of the remaining adjacent Chinese character combinations, extracting the combinations that meet a specified threshold, and adding them to the lexicon W;
(C) classifying the lexicon W, and sequence-labeling each element of each sequence in the corpus with a label according to the classification result;
(D) pre-training on the corpus with the BERT language model to obtain word vectors;
(E) constructing a recurrent neural network model with a conditional random field, and training the model on the labeled data set;
(F) inputting a test sentence into the trained recurrent neural network model to obtain a predicted entity list;
(G) performing a first segmentation of the input test sentence according to the predicted entity list, cutting out the elements contained in the list;
(H) scanning the remaining unsegmented text against a prefix dictionary to generate every possible word formed by the Chinese characters in the sentence, which together form a directed acyclic graph;
(I) applying dynamic programming to the directed acyclic graph to find the maximum-probability path, which gives the best segmentation based on word frequency.
2. The method of claim 1, wherein the forward conditional probability and the reverse conditional probability are calculated according to the following formulas (I) and (II) in the step (B):
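$$P(Y \mid X) = \frac{\mathrm{count}(XY)}{\mathrm{count}(X)} \qquad \text{(I)}$$
$$P(X \mid Y) = \frac{\mathrm{count}(XY)}{\mathrm{count}(Y)} \qquad \text{(II)}$$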
for any adjacent Chinese character combination XY with X at the front and Y at the back, P (Y | X) and P (X | Y) are respectively the forward conditional probability and the reverse conditional probability of the combination, count (XY) is the frequency of the combination appearing in the corpus, and count (X) and count (Y) are respectively the frequency of the Chinese character X and the Chinese character Y appearing in the corpus.
3. The method as claimed in claim 1, wherein in the step (C), the lexicon W is divided into three types, i.e. body, age-level and symptom.
4. The method for segmenting words in Chinese medicine sentences based on entity recognition technology of claim 1 or 3, wherein in step (C), each element of each sequence in the corpus is labeled as "B-M", "I-M" or "O" form by using BIO label according to the classification result;
wherein, "M" represents the type of the segment in which the element belongs, "B" and "I" represent the start position and non-start position of the element in the segment, respectively, and "O" represents that the element does not belong to any type.
5. The method for segmenting words in Chinese medicine sentences based on entity recognition technology of claim 1, wherein in step (E), the recurrent neural network model with a conditional random field comprises:
a bidirectional long short-term memory network layer, which extracts the features of the input sequence and outputs a probability distribution matrix over the label types of each character in the sequence;
and a conditional random field, which determines the most reasonable sequence path in the space of all feasible label sequences according to the probability distribution matrix, giving the corresponding character labels.
6. The method according to claim 1, wherein in step (I), the maximum probability is calculated for the sentence from right to left in reverse direction according to the directed acyclic graph;
each node of the directed acyclic graph is weighted, and for the words in the prefix dictionary, the weights are the word frequencies of the words.
CN201910967537.8A 2019-10-12 2019-10-12 Chinese medicine sentence word segmentation method based on entity recognition technology Pending CN110879831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910967537.8A CN110879831A (en) 2019-10-12 2019-10-12 Chinese medicine sentence word segmentation method based on entity recognition technology

Publications (1)

Publication Number Publication Date
CN110879831A true CN110879831A (en) 2020-03-13

Family

ID=69728110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910967537.8A Pending CN110879831A (en) 2019-10-12 2019-10-12 Chinese medicine sentence word segmentation method based on entity recognition technology

Country Status (1)

Country Link
CN (1) CN110879831A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357780A (en) * 2017-06-28 2017-11-17 浙江大学 A kind of Chinese word cutting method for traditional Chinese medicine symptom sentence
CN108549639A (en) * 2018-04-20 2018-09-18 山东管理学院 Based on the modified Chinese medicine case name recognition methods of multiple features template and system
CN109117472A (en) * 2018-11-12 2019-01-01 新疆大学 A kind of Uighur name entity recognition method based on deep learning
CN109710087A (en) * 2018-12-28 2019-05-03 北京金山安全软件有限公司 Input method model generation method and device
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444704A (en) * 2020-03-27 2020-07-24 中南大学 Network security keyword extraction method based on deep neural network
CN111444704B (en) * 2020-03-27 2023-09-19 中南大学 Network safety keyword extraction method based on deep neural network
CN111563383A (en) * 2020-04-09 2020-08-21 华南理工大学 Chinese named entity identification method based on BERT and semi CRF
CN111680511A (en) * 2020-04-21 2020-09-18 华东师范大学 Military field named entity identification method with cooperation of multiple neural networks
CN111931506B (en) * 2020-05-22 2023-01-10 北京理工大学 Entity relationship extraction method based on graph information enhancement
CN111931506A (en) * 2020-05-22 2020-11-13 北京理工大学 Entity relationship extraction method based on graph information enhancement
CN112036178A (en) * 2020-08-25 2020-12-04 国家电网有限公司 Distribution network entity related semantic search method
CN112016319A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Pre-training model obtaining method, disease entity labeling method, device and storage medium
CN112016319B (en) * 2020-09-08 2023-12-15 平安科技(深圳)有限公司 Pre-training model acquisition and disease entity labeling method, device and storage medium
CN112307759A (en) * 2020-11-09 2021-02-02 西安交通大学 Cantonese word segmentation method for irregular short text of social network
CN112307759B (en) * 2020-11-09 2024-04-12 西安交通大学 Yue language word segmentation method for irregular short text of social network
CN113808752A (en) * 2020-12-04 2021-12-17 四川医枢科技股份有限公司 Medical document identification method, device and equipment
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training
CN113268988B (en) * 2021-07-19 2021-10-29 中国平安人寿保险股份有限公司 Text entity analysis method and device, terminal equipment and storage medium
CN113268988A (en) * 2021-07-19 2021-08-17 中国平安人寿保险股份有限公司 Text entity analysis method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200313