CN111259626A - Traditional Chinese medicine entity recognition algorithm - Google Patents
- Publication number
- CN111259626A (application CN202010057863.8A)
- Authority
- CN
- China
- Prior art keywords
- chinese medicine
- traditional chinese
- training
- algorithm
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a traditional Chinese medicine entity recognition algorithm comprising the following steps: A. data labeling: the traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is annotated with the BIO labeling scheme, where B (Begin) marks the start of an entity, I (Intermediate) marks its continuation, and O (Other) marks characters unrelated to any entity; B. pre-trained model: training by fine-tuning a pre-trained model, which is called transfer learning; C. model training. The algorithm breaks through the bottleneck of poor word segmentation in the traditional Chinese medicine field, lays a foundation for intelligent dialogue, traditional Chinese medicine knowledge graphs, and traditional Chinese medicine auxiliary diagnosis and treatment systems in the health field, and improves the performance of basic semantic components.
Description
Technical Field
The invention relates to the technical field of natural language processing application, in particular to a traditional Chinese medicine entity recognition algorithm.
Background
Understanding-based word segmentation recognizes words by having a computer simulate a person's understanding of a sentence. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity.
The main work of this document is to study three kinds of named entities in traditional Chinese medicine medical records (symptoms, syndrome types, and drug names) and the associations among them. The approach draws on linguistic knowledge from natural language processing corpora and on named entity recognition technology based on statistical methods.
Disclosure of Invention
The invention aims to provide a traditional Chinese medicine entity recognition algorithm to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme:
A traditional Chinese medicine entity recognition algorithm comprises the following steps:
A. data labeling: the traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is annotated with the BIO labeling scheme, where B (Begin) marks the start of an entity, I (Intermediate) marks its continuation, and O (Other) marks characters unrelated to any entity;
B. pre-trained model: training by fine-tuning a pre-trained model, which is called transfer learning;
C. model training.
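As an illustration of the BIO scheme in step A, the following minimal sketch assigns one tag per character from a list of annotated entity spans. The entity type suffix `SYM` (symptom), the span representation, and the sample sentence are assumptions made for the example; the source specifies only the B, I, and O prefixes.

```python
from typing import List, Tuple

# (start, end_exclusive, entity_type); the type names are hypothetical.
Span = Tuple[int, int, str]

def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Return one BIO tag per character of `text`."""
    tags = ["O"] * len(text)  # every character starts as "Other"
    for start, end, etype in spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, etype)}")
        if any(t != "O" for t in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, etype)}")
        tags[start] = f"B-{etype}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining characters of the entity
    return tags

# Hypothetical sentence: two adjacent symptom entities.
text = "患者头痛发热"
tags = spans_to_bio(text, [(2, 4, "SYM"), (4, 6, "SYM")])
```

Because each entity starts with its own B tag, two adjacent entities of the same type remain distinguishable, which a plain inside/outside marking could not express.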
As a further scheme of the invention: in step A, each line contains one character, followed by a space and then the label of that character.
As a further scheme of the invention: samples are separated from one another by an empty line.
As a further scheme of the invention: in step A, the Brat annotation tool is used to assist labeling.
As a further scheme of the invention: in step B, the Chinese pre-trained model of BERT, one of the natural language representation models, is used.
As a further scheme of the invention: in step C, the named entity model is trained based on the BERT + LSTM + CRF algorithm.
As a further scheme of the invention: a training log is output after step C is completed.
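The file layout described above (one character per line, a space, then its label, with an empty line between samples) can be sketched as a small writer and reader. The function names, the label names `SYM` and `DRUG`, and the sample sentences are hypothetical and serve only to illustrate the layout.

```python
from typing import List, Tuple

Sample = List[Tuple[str, str]]  # [(character, label), ...]

def dump_samples(samples: List[Sample]) -> str:
    """Serialize samples: 'char label' per line, empty line between samples."""
    blocks = ["\n".join(f"{ch} {tag}" for ch, tag in s) for s in samples]
    return "\n\n".join(blocks) + "\n"

def load_samples(data: str) -> List[Sample]:
    """Parse the layout produced by dump_samples."""
    samples: List[Sample] = []
    current: Sample = []
    for line in data.splitlines():
        if not line.strip():            # an empty line closes the current sample
            if current:
                samples.append(current)
                current = []
            continue
        ch, tag = line.rsplit(" ", 1)   # split on the last space only
        current.append((ch, tag))
    if current:                         # last sample may lack a trailing empty line
        samples.append(current)
    return samples

# Hypothetical labels and sentences, for illustration only.
samples = [
    [("头", "B-SYM"), ("痛", "I-SYM"), ("三", "O"), ("日", "O")],
    [("服", "O"), ("桂", "B-DRUG"), ("枝", "I-DRUG"), ("汤", "I-DRUG")],
]
text = dump_samples(samples)
restored = load_samples(text)
```

Splitting on the last space keeps the parser tolerant of a line whose character is itself a space, at the cost of assuming that labels never contain spaces.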
Compared with the prior art, the invention has the following beneficial effects: it breaks through the bottleneck of poor word segmentation in the traditional Chinese medicine field, lays a foundation for intelligent dialogue, traditional Chinese medicine knowledge graphs, and traditional Chinese medicine auxiliary diagnosis and treatment systems in the health field, and improves the performance of basic semantic components.
Drawings
FIG. 1 is a model schematic of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1: Referring to FIG. 1, a traditional Chinese medicine entity recognition algorithm includes the following steps:
A. Data labeling. The traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is annotated with the BIO labeling scheme, where B (Begin) marks the start of an entity, I (Intermediate) marks its continuation, and O (Other) marks characters unrelated to any entity. Each line contains one character, followed by a space and then the label of that character, and samples are separated from one another by an empty line. The Brat annotation tool is used to assist labeling.
B. Pre-trained model. Training by fine-tuning a pre-trained model is called transfer learning; it lets training converge faster and achieve good results even with fewer training samples. Here the Chinese pre-trained model of BERT, one of the best natural language representation models at present, is used. BERT yields better representations than word2vec (word vectors). Download address of the BERT model pre-trained on Chinese Wikipedia: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip.
C. Model training. The named entity model here is trained based on the BERT + LSTM + CRF algorithm, which performs better than LSTM + CRF alone. The project address is as follows:
https://github.com/macanv/BERT-BiLSTM-CRF-NER
The test output has the same format as the output produced after training is completed. If you have followed the steps in this text through to this point, you now have a named entity recognition model that recognizes three entity types: symptoms, syndrome types, and traditional Chinese medicine names.
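To turn a predicted tag sequence into the recognized entities, the BIO tags can be decoded back into spans. The sketch below is an assumption about how that post-processing might look, not the referenced project's code; the function name, the type names `SYM` and `DRUG`, and the sample sentence are hypothetical.

```python
from typing import List, Optional, Tuple

Entity = Tuple[str, str, int, int]  # (entity_text, entity_type, start, end_exclusive)

def bio_to_entities(chars: List[str], tags: List[str]) -> List[Entity]:
    """Decode per-character BIO tags into entity spans."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    entities: List[Entity] = []
    start: Optional[int] = None
    etype: Optional[str] = None

    def close(end: int) -> None:
        """Emit the currently open entity, if any, ending before `end`."""
        nonlocal start, etype
        if start is not None:
            entities.append(("".join(chars[start:end]), etype, start, end))
        start, etype = None, None

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            close(i)                     # a new entity ends the previous one
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                     # the open entity continues
        else:
            close(i)                     # "O", or an I- tag with no matching B-
    close(len(tags))
    return entities

# Hypothetical model output for one sentence.
chars = list("头痛三日服桂枝汤")
tags = ["B-SYM", "I-SYM", "O", "O", "O", "B-DRUG", "I-DRUG", "I-DRUG"]
entities = bio_to_entities(chars, tags)
```

In this sketch an I tag that does not continue an open entity of the same type is discarded like an O tag; a more lenient decoder could instead treat it as the start of a new entity.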
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is presented this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.
Claims (7)
1. A traditional Chinese medicine entity recognition algorithm, characterized by comprising the following steps:
A. data labeling: the traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is annotated with the BIO labeling scheme, where B (Begin) marks the start of an entity, I (Intermediate) marks its continuation, and O (Other) marks characters unrelated to any entity;
B. pre-trained model: training by fine-tuning a pre-trained model, which is called transfer learning;
C. model training.
2. The algorithm of claim 1, wherein in step A, each line contains one character, followed by a space and then the label of that character.
3. The algorithm of claim 2, wherein samples are separated from one another by an empty line.
4. The traditional Chinese medicine entity recognition algorithm according to claim 3, wherein in step A, the Brat annotation tool is used to assist labeling.
5. The algorithm of claim 4, wherein the Chinese pre-trained model of BERT, one of the natural language representation models, is used in step B.
6. The algorithm of claim 4, wherein in step C the named entity model is trained based on the BERT + LSTM + CRF algorithm.
7. The algorithm of claim 3, wherein a training log is output after step C is completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010057863.8A CN111259626A (en) | 2020-01-16 | 2020-01-16 | Traditional Chinese medicine entity recognition algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010057863.8A CN111259626A (en) | 2020-01-16 | 2020-01-16 | Traditional Chinese medicine entity recognition algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111259626A (en) | 2020-06-09 |
Family
ID=70947143
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010057863.8A (Pending), published as CN111259626A (en) | Traditional Chinese medicine entity recognition algorithm | 2020-01-16 | 2020-01-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111259626A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080063264A1 (en) * | 2006-09-08 | 2008-03-13 | Porikli Fatih M | Method for classifying data using an analytic manifold |
CN108549639A (en) * | 2018-04-20 | 2018-09-18 | 山东管理学院 | Based on the modified Chinese medicine case name recognition methods of multiple features template and system |
AU2018101606A4 (en) * | 2018-08-09 | 2018-12-13 | Northwest Institute Of Plateau Biology, Chinese Academy Of Sciences | A method for identifying meconopsis quintuplinervia regel from different geographical origins |
CN109635123A (en) * | 2018-11-28 | 2019-04-16 | 北京工业大学 | A kind of Chinese medicine text concept recognition methods of increment type |
CN109918644A (en) * | 2019-01-26 | 2019-06-21 | 华南理工大学 | A kind of Chinese medicine health consultation text name entity recognition method based on transfer learning |
CN110134953A (en) * | 2019-05-05 | 2019-08-16 | 北京科技大学 | Chinese medicine name entity recognition method and identifying system based on Chinese medical book document |
CN110321550A (en) * | 2019-04-25 | 2019-10-11 | 北京科技大学 | A kind of name entity recognition method and device towards Chinese medical book document |
Non-Patent Citations (1)
Title |
---|
Shu Hongping et al.: "《软件需求工程》 (Software Requirements Engineering)", pages: 163-164 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200609 |