CN111259626A - Traditional Chinese medicine entity recognition algorithm - Google Patents

Traditional Chinese medicine entity recognition algorithm

Publication number
CN111259626A
Authority
CN
China
Prior art keywords
chinese medicine
traditional chinese
training
algorithm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010057863.8A
Other languages
Chinese (zh)
Inventor
安静梅
张凯文
钱小菲
魏宇涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai National Group Health Technology Co ltd
Original Assignee
Shanghai National Group Health Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai National Group Health Technology Co ltd
Priority to CN202010057863.8A
Publication of CN111259626A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a traditional Chinese medicine entity recognition algorithm comprising the following steps: A. data labeling: traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is labeled in the BIO scheme, where B (Begin) marks the start of an entity, I (Intermediate) marks the middle of an entity, and O (Other) marks unrelated characters; B. pre-trained model: the training mode of fine-tuning from a pre-trained model is called transfer learning; C. model training. The algorithm breaks through the bottleneck of poor word segmentation in the traditional Chinese medicine field, lays a foundation for intelligent dialogue in the health field, traditional Chinese medicine knowledge graphs, and traditional Chinese medicine auxiliary diagnosis and treatment systems, and improves the performance of basic semantic components.

Description

Traditional Chinese medicine entity recognition algorithm
Technical Field
The invention relates to the technical field of natural language processing application, in particular to a traditional Chinese medicine entity recognition algorithm.
Background
Understanding-based word segmentation recognizes words by having a computer simulate a person's understanding of a sentence. The basic idea is to perform syntactic and semantic analysis while segmenting words, and to use the syntactic and semantic information to resolve ambiguity.
The main work of this document is to study three kinds of named entities in traditional Chinese medicine medical records (symptoms, syndrome types, and drug names) and the relationships among them. The methods used draw on linguistic knowledge from natural language processing corpora and on named entity recognition technology based on statistical methods.
Disclosure of Invention
The invention aims to provide a traditional Chinese medicine entity recognition algorithm to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme:
A traditional Chinese medicine entity recognition algorithm comprises the following steps:
A. data labeling: traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is labeled in the BIO scheme, where B (Begin) marks the start of an entity, I (Intermediate) marks the middle of an entity, and O (Other) marks unrelated characters;
B. pre-trained model: the training mode of fine-tuning from a pre-trained model is called transfer learning;
C. model training.
As a further scheme of the invention: in step A, each line contains one character, followed by a space and then the label of the character.
As a further scheme of the invention: each sample is separated by an empty line.
As a further scheme of the invention: in step A, the brat annotation tool is used to assist in labeling.
As a further scheme of the invention: in step B, the Chinese pre-trained model of BERT, one of the natural language representation models, is used.
As a further scheme of the invention: in step C, the named entity model is trained based on the BERT + LSTM + CRF algorithm.
As a further scheme of the invention: a training log is output after step C is finished.
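The labeling format described above (one character per line, a space, then the label, with an empty line between samples) can be sketched in Python. The sketch assumes the annotations are exported in brat's standoff `.ann` format, where each text-bound annotation line has the form `ID<TAB>TYPE START END<TAB>TEXT`. The type names `Symptom` and `Drug`, the `B-`/`I-` type suffixes, the helper names, and the sample sentence are illustrative assumptions and do not come from the patent.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity type, start offset, end offset)


def parse_ann(ann: str) -> List[Span]:
    """Read text-bound annotations ("T" lines) from brat standoff text."""
    spans: List[Span] = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, and notes
        _id, type_and_offsets, _text = line.split("\t", 2)
        if ";" in type_and_offsets:
            continue  # discontinuous spans are not handled in this sketch
        label, start, end = type_and_offsets.split(" ")
        spans.append((label, int(start), int(end)))
    return spans


def to_bio(text: str, spans: List[Span]) -> List[str]:
    """Return one "character label" line per character of the text."""
    tags = ["O"] * len(text)  # characters outside any entity stay "O"
    for label, start, end in spans:
        tags[start] = f"B-{label}"  # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining characters of the entity
    return [f"{ch} {tag}" for ch, tag in zip(text, tags)]


# Hypothetical medical record sentence and annotations (illustration only).
record = "头痛三天，服用黄芪"
ann = "T1\tSymptom 0 2\t头痛\nT2\tDrug 7 9\t黄芪"

bio_lines = to_bio(record, parse_ann(ann))
# Samples are separated by an empty line, so each sample ends with one.
sample_block = "\n".join(bio_lines) + "\n\n"
print(sample_block, end="")
```

Writing several such blocks one after another into a single file gives the layout the schemes describe, with the empty line acting as the sample separator.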
Compared with the prior art, the invention has the following beneficial effects: the method breaks through the bottleneck of poor word segmentation in the traditional Chinese medicine field, lays a foundation for intelligent dialogue in the health field, traditional Chinese medicine knowledge graphs, and traditional Chinese medicine auxiliary diagnosis and treatment systems, and improves the performance of basic semantic components.
Drawings
FIG. 1 is a model schematic of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1: referring to FIG. 1, a traditional Chinese medicine entity recognition algorithm includes the following steps:
A. Data labeling. Traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is labeled in the BIO scheme: B (Begin) marks the start of an entity, I (Intermediate) marks the middle of an entity, and O (Other) marks unrelated characters. Each line contains one character, followed by a space and then the label of the character, and each sample is separated by an empty line. The brat annotation tool is used to assist in labeling.
B. Pre-trained model. The training mode of fine-tuning from a pre-trained model is called transfer learning; it lets training converge faster and achieve good results even with fewer training samples. Here we use the Chinese pre-trained model of BERT, one of the best natural language representation models at present. BERT yields better representations than word2vec (word vectors). Download address of the BERT model pre-trained on Chinese Wikipedia: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
C. Model training. The named entity model here is trained with the BERT + LSTM + CRF algorithm, which performs better than a model based on LSTM + CRF alone. The project address is as follows:
https://github.com/macanv/BERT-BiLSTM-CRF-NER
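To illustrate what the CRF layer contributes in a BERT + LSTM + CRF tagger, the following is a minimal pure-Python sketch of Viterbi decoding over per-character tag scores. It is not code from the project above: the emission scores, the single transition penalty, and the function name `viterbi` are invented for illustration. In a trained model the emission scores would come from the BERT and LSTM layers and the transition scores would be learned.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the tag sequence with the highest total score."""
    # best[t][tag] is the best score of any path ending in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Pick the previous tag that gives the best score for `cur`.
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["B", "I", "O"]
# Hypothetical per-character scores for a three-character input.
emissions = [
    {"B": 0.9, "I": 0.1, "O": 1.0},
    {"B": 0.3, "I": 1.2, "O": 1.0},
    {"B": 0.1, "I": 0.2, "O": 1.5},
]
# Assumed penalty: an "I" tag should not directly follow an "O" tag.
transitions = {("O", "I"): -10.0}

greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
decoded = viterbi(emissions, transitions, tags)
print(greedy)   # per-character argmax, ignores transitions
print(decoded)  # sequence-level decoding with the transition penalty
```

With these made-up scores, picking the best tag for each character independently yields the invalid sequence O, I, O, while sequence-level decoding returns B, I, O. This is the kind of label consistency a CRF layer is meant to enforce.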
The test output has the same format as the output produced after training is completed. If you have followed all the steps in this text to this point, you now have a named entity recognition model that can recognize three kinds of entities: symptoms, syndrome types, and traditional Chinese medicine names.
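Once a model emits one BIO label per character, the three entity kinds can be read back out of the label sequence. The sketch below shows one way to do this; the tag names `SYMPTOM`, `SYNDROME`, and `DRUG`, the function name, and the example sentence are assumptions made for illustration, since the patent does not specify the label names.

```python
from typing import List, Optional, Tuple


def extract_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- labelled characters into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current, current_type
        if current and current_type is not None:
            entities.append(("".join(current), current_type))
        current, current_type = [], None

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


# Hypothetical model output for one sentence (illustration only).
chars = list("头痛属气虚证服黄芪")
tags = ["B-SYMPTOM", "I-SYMPTOM", "O",
        "B-SYNDROME", "I-SYNDROME", "I-SYNDROME", "O",
        "B-DRUG", "I-DRUG"]

entities = extract_entities(chars, tags)
for text, kind in entities:
    print(kind, text)
```

An `I-` tag that does not continue an open entity of the same type is treated like `O` here; stricter or more lenient handling is equally possible.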
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution; the description is written this way for clarity only. Those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (7)

1. A traditional Chinese medicine entity recognition algorithm, characterized by comprising the following steps:
A. data labeling: traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is labeled in the BIO scheme, where B (Begin) marks the start of an entity, I (Intermediate) marks the middle of an entity, and O (Other) marks unrelated characters;
B. pre-trained model: the training mode of fine-tuning from a pre-trained model is called transfer learning;
C. model training.
2. The traditional Chinese medicine entity recognition algorithm of claim 1, wherein in step A each line contains one character, followed by a space and then the label of the character.
3. The traditional Chinese medicine entity recognition algorithm of claim 2, wherein each sample is separated by an empty line.
4. The traditional Chinese medicine entity recognition algorithm of claim 3, wherein in step A the brat annotation tool is used to assist in labeling.
5. The traditional Chinese medicine entity recognition algorithm of claim 4, wherein the Chinese pre-trained model of BERT, one of the natural language representation models, is used in step B.
6. The traditional Chinese medicine entity recognition algorithm of claim 4, wherein in step C the named entity model is trained based on the BERT + LSTM + CRF algorithm.
7. The traditional Chinese medicine entity recognition algorithm of claim 3, wherein a training log is output after step C is finished.
CN202010057863.8A 2020-01-16 2020-01-16 Traditional Chinese medicine entity recognition algorithm Pending CN111259626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010057863.8A CN111259626A (en) 2020-01-16 2020-01-16 Traditional Chinese medicine entity recognition algorithm

Publications (1)

Publication Number Publication Date
CN111259626A (en) 2020-06-09

Family

ID=70947143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010057863.8A Pending CN111259626A (en) 2020-01-16 2020-01-16 Traditional Chinese medicine entity recognition algorithm

Country Status (1)

Country Link
CN (1) CN111259626A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080063264A1 (en) * 2006-09-08 2008-03-13 Porikli Fatih M Method for classifying data using an analytic manifold
CN108549639A (en) * 2018-04-20 2018-09-18 山东管理学院 Based on the modified Chinese medicine case name recognition methods of multiple features template and system
AU2018101606A4 (en) * 2018-08-09 2018-12-13 Northwest Institute Of Plateau Biology, Chinese Academy Of Sciences A method for identifying meconopsis quintuplinervia regel from different geographical origins
CN109635123A (en) * 2018-11-28 2019-04-16 北京工业大学 A kind of Chinese medicine text concept recognition methods of increment type
CN109918644A (en) * 2019-01-26 2019-06-21 华南理工大学 A kind of Chinese medicine health consultation text name entity recognition method based on transfer learning
CN110134953A (en) * 2019-05-05 2019-08-16 北京科技大学 Chinese medicine name entity recognition method and identifying system based on Chinese medical book document
CN110321550A (en) * 2019-04-25 2019-10-11 北京科技大学 A kind of name entity recognition method and device towards Chinese medical book document

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
舒红平等: "《软件需求工程》", pages: 163 - 164 *

Similar Documents

Publication Publication Date Title
CN111859987B (en) Text processing method, training method and device for target task model
CN105957518B (en) A kind of method of Mongol large vocabulary continuous speech recognition
CN104050160B (en) Interpreter's method and apparatus that a kind of machine is blended with human translation
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN110083710A (en) It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
JP2023535709A (en) Language expression model system, pre-training method, device, device and medium
CN113205817A (en) Speech semantic recognition method, system, device and medium
WO2021134524A1 (en) Data processing method, apparatus, electronic device, and storage medium
WO2020199600A1 (en) Sentiment polarity analysis method and related device
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN112016271A (en) Language style conversion model training method, text processing method and device
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN108304387B (en) Method, device, server group and storage medium for recognizing noise words in text
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN110969005B (en) Method and device for determining similarity between entity corpora
CN112949293B (en) Similar text generation method, similar text generation device and intelligent equipment
CN112749277B (en) Medical data processing method, device and storage medium
CN116662495A (en) Question-answering processing method, and method and device for training question-answering processing model
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN111259626A (en) Traditional Chinese medicine entity recognition algorithm
CN115017876A (en) Method and terminal for automatically generating emotion text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200609)