CN111259626A

CN111259626A - Traditional Chinese medicine entity recognition algorithm

Info

Publication number: CN111259626A
Application number: CN202010057863.8A
Authority: CN
Inventors: 安静梅; 张凯文; 钱小菲; 魏宇涛
Original assignee: Shanghai National Group Health Technology Co ltd
Current assignee: Shanghai National Group Health Technology Co ltd
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2020-06-09

Abstract

The invention discloses a traditional Chinese medicine entity recognition algorithm comprising the following steps: A. labeling data: traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is labeled in the BIO scheme, where B (Begin) marks the first character of an entity, I (Intermediate) marks a character inside an entity, and O (Other) marks characters unrelated to any entity; B. pre-training the model: fine-tuning a pre-trained model, a training mode known as transfer learning; C. training the model. The algorithm breaks through the bottleneck of poor word segmentation in the traditional Chinese medicine field, lays a foundation for intelligent dialogue, traditional Chinese medicine knowledge graphs and traditional Chinese medicine auxiliary diagnosis and treatment systems in the health field, and improves the effect of basic semantic components.

Description

Traditional Chinese medicine entity recognition algorithm

Technical Field

The invention relates to the technical field of natural language processing application, in particular to a traditional Chinese medicine entity recognition algorithm.

Background

The understanding-based word segmentation method recognizes words by having a computer simulate a person's understanding of a sentence. The basic idea is to perform syntactic and semantic analysis while segmenting words, and to resolve ambiguity using syntactic and semantic information.

The main work of this document is to study three kinds of named entities in traditional Chinese medicine medical records (symptoms, syndrome types and drug names) and the relationships among them. The methods adopted draw on linguistic knowledge of natural language processing corpora and on named entity recognition technology based on statistical methods.

Disclosure of Invention

The invention aims to provide a traditional Chinese medicine entity recognition algorithm to solve the problems in the background technology.

In order to achieve the purpose, the invention provides the following technical scheme:

A traditional Chinese medicine entity recognition algorithm comprises the following steps:

A. labeling data: traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is labeled in the BIO scheme, where B (Begin) marks the first character of an entity, I (Intermediate) marks a character inside an entity, and O (Other) marks characters unrelated to any entity;

B. pre-training the model: fine-tuning a pre-trained model, a training mode known as transfer learning;

C. training the model.

As a further scheme of the invention: in step A, each line contains one character, followed by a space and then the label of that character.

As a further scheme of the invention: each sample is separated by an empty line.

As a further scheme of the invention: in step A, the Brat annotation tool is used to assist labeling.

As a further scheme of the invention: in step B, the Chinese pre-trained model of BERT, a natural language representation model, is used.

As a further scheme of the invention: in step C, the named entity model is trained based on the BERT + LSTM + CRF algorithm.

As a further scheme of the invention: a training log is output after step C is finished.

Compared with the prior art, the invention has the following beneficial effects: the method breaks through the bottleneck of poor word segmentation in the field of traditional Chinese medicine, lays a foundation for intelligent dialogue, traditional Chinese medicine knowledge graphs and traditional Chinese medicine auxiliary diagnosis and treatment systems in the health field, and improves the effect of basic semantic components.

Drawings

FIG. 1 is a model schematic of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1: referring to FIG. 1, a traditional Chinese medicine entity recognition algorithm includes the following steps:

A. Data labeling. The traditional Chinese medicine case text collected by a traditional Chinese medicine group is labeled in the BIO scheme: B (Begin) marks the first character of an entity, I (Intermediate) marks a character inside an entity, and O (Other) marks characters unrelated to any entity. Each line contains one character, followed by a space and then the label of that character, and samples are separated by an empty line. The Brat annotation tool is used to assist labeling.
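The file layout described in step A can be sketched in Python as follows. This is a minimal illustration, not the applicant's tooling: the entity type names SYM and MED, the typed B-/I- tag suffixes, the sample sentence and all function names are assumptions introduced here. The text itself only specifies B, I and O tags, one character per line followed by a space and its label, and an empty line between samples.

```python
from typing import List, Tuple

# (start, end_exclusive, entity_type); the type names are hypothetical.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Assign one BIO tag per character from annotated character spans."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining characters of the entity
    return tags


def format_corpus(samples: List[Tuple[str, List[str]]]) -> str:
    """Write 'character label' per line, with an empty line between samples."""
    blocks = [
        "\n".join(f"{ch} {tag}" for ch, tag in zip(text, tags))
        for text, tags in samples
    ]
    return "\n\n".join(blocks) + "\n"


def parse_corpus(data: str) -> List[Tuple[str, List[str]]]:
    """Read the layout produced by format_corpus back into samples."""
    samples, chars, tags = [], [], []
    for line in data.splitlines():
        if not line.strip():                # empty line closes the current sample
            if chars:
                samples.append(("".join(chars), tags))
                chars, tags = [], []
            continue
        ch, tag = line.rsplit(" ", 1)       # label is the part after the last space
        chars.append(ch)
        tags.append(tag)
    if chars:
        samples.append(("".join(chars), tags))
    return samples


# Hypothetical sample: a symptom at characters 2-3, a medicine name at 7-9.
text = "患者头痛，服用桂枝汤"
tags = spans_to_bio(text, [(2, 4, "SYM"), (7, 10, "MED")])
corpus = format_corpus([(text, tags), ("无", ["O"])])
```

Character offsets of this kind are what an annotation tool such as Brat records for each marked span, so the same conversion could be applied to its exported annotations.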

B. Pre-training the model. Fine-tuning a pre-trained model is a training mode known as transfer learning; it lets training converge faster and achieves good results even with fewer training samples. Here the Chinese pre-trained model of BERT is used, one of the best natural language representation models at present. BERT yields better representations than word2vec (word vectors). Download address of the BERT model pre-trained on Chinese Wikipedia: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
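To make the fine-tuning step more concrete, the sketch below shows one way labeled characters could be turned into fixed-length model inputs. It is illustrative only: the tiny vocabulary stands in for the vocabulary file shipped with the pre-trained model, the label set and the choice to tag [CLS], [SEP] and padding positions as O are assumptions, and encode_sample is a hypothetical helper rather than part of any BERT release.

```python
from typing import Dict, List, Sequence, Tuple


def encode_sample(
    chars: Sequence[str],
    tags: Sequence[str],
    vocab: Dict[str, int],
    label_ids: Dict[str, int],
    max_len: int,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (input_ids, attention_mask, tag_ids), each of length max_len."""
    body = max_len - 2                       # room left after [CLS] and [SEP]
    tokens = ["[CLS]"] + list(chars[:body]) + ["[SEP]"]
    labels = ["O"] + list(tags[:body]) + ["O"]   # assumption: special tokens get O

    # Characters missing from the vocabulary fall back to [UNK].
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    tag_ids = [label_ids[lab] for lab in labels]
    mask = [1] * len(input_ids)

    # Pad every sequence to the same length; the mask marks real positions.
    pad = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    tag_ids += [label_ids["O"]] * pad
    mask += [0] * pad
    return input_ids, mask, tag_ids


# Toy vocabulary and label set (hypothetical values for illustration).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "头": 4, "痛": 5}
label_ids = {"O": 0, "B-SYM": 1, "I-SYM": 2}

ids, mask, tag_ids = encode_sample(
    "头痛甚", ["B-SYM", "I-SYM", "O"], vocab, label_ids, max_len=8
)
```

Working at the character level keeps one label per input position, which matches the one-character-per-line labeling of step A.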

C. Training the model. The named entity model here is trained with the BERT + LSTM + CRF algorithm, which performs better than a model based on LSTM + CRF alone. The project address is as follows:

https://github.com/macanv/BERT-BiLSTM-CRF-NER
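The role of the CRF layer at prediction time can be illustrated with Viterbi decoding over a linear chain of tags. The following is a generic textbook sketch, not code from the referenced project: the scores are made-up numbers, whereas in the described model the per-character scores would come from the BERT and LSTM layers and the transition scores would be learned during training.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    n_tags = len(transitions)
    score = list(emissions[0])               # best score ending in each tag
    backptrs: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)

    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptrs):
        best = ptr[best]
        path.append(best)
    path.reverse()
    return path


# Tag indices: 0 = O, 1 = B, 2 = I. All numbers are illustrative.
transitions = [
    [0.0, 0.0, -100.0],   # O -> I is strongly penalized (I must follow B or I)
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
emissions = [
    [1.0, 0.8, 0.0],      # position 0 slightly prefers O
    [0.2, 0.1, 1.0],      # position 1 prefers I
]
best_path = viterbi_decode(emissions, transitions)
```

Choosing each position independently would give O followed by I, which is not a valid BIO sequence; scoring the whole sequence with transition terms lets the decoder prefer B followed by I instead.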

The output of testing has the same format as the output after training is completed. Having followed the steps in this text to this point, one obtains a named entity recognition model that can recognize three kinds of entities in total: symptoms, syndrome types and traditional Chinese medicine names.
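Once the model outputs one tag per character, the recognized entities can be read off by grouping each B tag with the I tags that follow it. The sketch below assumes typed tags such as B-SYM and I-SYM; these type names, the sample sentence and the function name are hypothetical and are not taken from the text.

```python
from typing import List, Sequence, Tuple

# (entity_text, entity_type, start, end_exclusive)
Entity = Tuple[str, str, int, int]


def extract_entities(chars: Sequence[str], tags: Sequence[str]) -> List[Entity]:
    """Group B-/I- tagged characters into entities; stray I- tags are ignored."""
    entities: List[Entity] = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the text.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and etype == tag[2:]:
            continue                         # still inside the open entity
        if start is not None:                # any other tag closes the open entity
            entities.append(("".join(chars[start:i]), etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]        # open a new entity
    return entities


sample_text = "患者头痛，服用桂枝汤"
sample_tags = ["O", "O", "B-SYM", "I-SYM", "O", "O", "O", "B-MED", "I-MED", "I-MED"]
found = extract_entities(sample_text, sample_tags)
```

Returning character offsets alongside the entity text makes it straightforward to map each recognized entity back to its position in the original medical record.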

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims

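1. A traditional Chinese medicine entity recognition algorithm, characterized by comprising the following steps:

A. labeling data: traditional Chinese medicine medical record text collected by a traditional Chinese medicine group is labeled in the BIO scheme, where B (Begin) marks the first character of an entity, I (Intermediate) marks a character inside an entity, and O (Other) marks characters unrelated to any entity;

B. pre-training the model: fine-tuning a pre-trained model, a training mode known as transfer learning;

C. training the model.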

2. The algorithm of claim 1, wherein in step A, each line contains one character, followed by a space and then the label of that character.

3. The algorithm of claim 2, wherein each sample is separated by an empty line.

4. The algorithm of claim 3, wherein in step A, the Brat annotation tool is used to assist labeling.

5. The algorithm of claim 4, wherein in step B, the Chinese pre-trained model of BERT, a natural language representation model, is used.

6. The algorithm of claim 4, wherein in step C, the named entity model is trained based on the BERT + LSTM + CRF algorithm.

7. The algorithm of claim 3, wherein a training log is output after step C.