CN110134953B - Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature - Google Patents

Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature

Info

Publication number
CN110134953B
Authority
CN
China
Prior art keywords
training
traditional chinese
model
ancient
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910367376.9A
Other languages
Chinese (zh)
Other versions
CN110134953A (en
Inventor
张德政
杨石兵
贾麒
谢永红
夏超
栗辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN201910367376.9A
Publication of CN110134953A
Application granted
Publication of CN110134953B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and system for recognizing traditional Chinese medicine named entities based on ancient traditional Chinese medicine literature. The method obtains a corpus of ancient traditional Chinese medicine literature, cleans the data, and pre-trains a language model on the cleaned corpus. It then applies sequence labeling to the corpus to form the training set of the subsequent model. With the pre-trained language model as the coding layer and a neural network structure as the decoding layer, it trains a sequence labeling model on that training set and uses the trained model to recognize traditional Chinese medicine named entities. By building on an existing pre-trained language model, such as BERT, the language model pre-training method proposed by Google, the invention trains on a small-sample training set, which saves manual labeling cost, and improves recognition effect and accuracy. The method is easy to operate, makes effective and comprehensive use of ancient traditional Chinese medicine literature, especially ancient medical records, and lays a good foundation for research in the field of traditional Chinese medicine.

Description

Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature
Technical Field
The invention belongs to the field of information processing and traditional Chinese medicine literature, and particularly relates to a traditional Chinese medicine named entity identification method and system based on traditional Chinese medicine ancient book literature.
Background
Traditional Chinese medicine is broad and profound. It is passed down partly through the direct experience of senior practitioners and partly through literature. The literature of traditional Chinese medicine preserves a large number of ancient medical records, which contain the experience and the diagnosis and treatment methods of many renowned veteran practitioners. An ancient medical record of traditional Chinese medicine is the running record that a physician of the time kept of a patient's symptoms, causes of disease, prescriptions, medicines and so on while treating the disease. Named entities in traditional Chinese medicine are the information entities, such as symptoms, prescriptions and medicines, that describe and reconstruct the patient's condition in such a record. To make better use of traditional Chinese medicine literature, including ancient medical records, named entity recognition is an important prerequisite for related research in the field.
Named entity recognition for common entity types (e.g., person names, place names, organization names) has achieved good results, essentially approaching the level of manual labeling. However, ancient traditional Chinese medicine literature differs greatly from other literature in vocabulary and grammar and has its own characteristics, so prior-art named entity recognition methods do not achieve satisfactory results when applied to ancient medical records. In addition, ancient medical records contain many grammatical phenomena that are hard to handle, which makes manual labeling difficult and expensive and further increases the difficulty of traditional Chinese medicine named entity recognition.
Disclosure of Invention
The technical problem to be solved by the invention is that the prior art lacks an effective method for recognizing traditional Chinese medicine named entities. The invention provides a method and system for recognizing traditional Chinese medicine named entities based on ancient traditional Chinese medicine literature. By building on an existing pre-trained language model, such as BERT, the language model pre-training method proposed by Google, it obtains a named entity recognition method oriented to ancient traditional Chinese medicine literature and thereby enables effective use of that literature, especially ancient medical records.
In order to solve the technical problems, the embodiment of the invention provides a traditional Chinese medicine named entity identification method based on an ancient book literature of traditional Chinese medicine, which comprises the following steps:
step S1, acquiring traditional Chinese medical ancient medical record corpus;
step S2, performing data cleaning on the ancient Chinese medical record corpus to be processed, which is obtained in the step S1;
step S3, pre-training a language model facing the ancient Chinese medical record corpus based on the ancient Chinese medical record corpus obtained in the step S2;
step S4, based on the cleaned Chinese medical ancient book medical records corpus obtained in step S2, carrying out sequence labeling on the corpus to form a training set of a subsequent model;
step S5, based on the model training set of the sequence annotation obtained in step S4, the language model in step S3 is used as a coding layer, a preset neural network structure is used as a decoding layer, and a corresponding sequence annotation model is trained;
and step S6, performing entity recognition on the traditional Chinese medical ancient medical records based on the sequence labeling model obtained by training in the step S5.
In the above scheme, the step S1 of obtaining the ancient medical records corpus of traditional Chinese medicine specifically includes the following steps:
step S11, scanning and recognizing the existing paper-edition traditional Chinese medical ancient book by using optical character recognition to form an electronic text corpus;
step S12, capturing traditional Chinese medical ancient book medical record corpus without paper books from the network by using open-source web crawler;
and step S13, comparing and combining the corpus texts obtained in the step S11 and the step S12 to finally form a unified ancient Chinese medical record corpus to be processed.
In the above scheme, the step S2 of performing data cleaning on the ancient medical record corpus of traditional Chinese medicine to be processed specifically includes the following steps:
step S21, correcting wrongly written characters;
step S22, filter irrelevant statements.
In the above scheme, the language model pre-training in step S3 specifically includes the following steps:
step S31, downloading the source code of the language model pre-training method and a Chinese language model trained with it;
step S32, manually compiling a character table for ancient Chinese medical records, comparing it with the Chinese character table in the source code, and separating out a rarely-used character table specific to the field of traditional Chinese medicine;
step S33, merging the rarely-used character table into the Chinese character table by replacing low-frequency characters in the source code's table with characters from the rarely-used character table, so that the length of the Chinese character table is unchanged;
step S34, segmenting the ancient Chinese medical record corpus cleaned in step S2 into paragraphs, presetting a paragraph length threshold and/or a paragraph number threshold, and taking the paragraph texts that satisfy the paragraph length threshold and/or the paragraph number threshold as the training corpus of the language model;
and step S35, replacing the word segmentation method in the source code with a rule that splits text character by character, and, starting from the downloaded Chinese language model, using the downloaded pre-training method to pre-train a language model oriented to ancient Chinese medical records on the training corpus from step S34.
In the above scheme, in step S3, the language model is pre-trained using BERT, the language model pre-training method from Google.
In the foregoing solution, the pre-training of the language model specifically includes:
step S31, downloading the Google open-source Chinese language model chinese_L-12_H-768_A-12, which was trained with BERT, together with the BERT source code;
step S32, comparing the character table for ancient Chinese medical records with the Google open-source Chinese character table to separate out a rarely-used character table specific to the field of traditional Chinese medicine;
step S33, merging the Google open-source Chinese character table with the rarely-used character table, replacing low-frequency characters in the Chinese character table with characters from the rarely-used character table during merging, so that the length of the character table is unchanged;
step S34, segmenting the ancient Chinese medical record corpus cleaned in step S2 into paragraphs, presetting a paragraph length threshold and/or a paragraph number threshold, and taking the paragraph texts that satisfy the paragraph length threshold and/or the paragraph number threshold as the training corpus of the language model;
and step S35, replacing the word segmentation method in the BERT source code with a rule that splits text character by character, and, starting from the Google open-source Chinese language model, using the BERT pre-training method to pre-train a language model oriented to ancient Chinese medical records on the training corpus segmented in step S34.
In the above scheme, the training set of the subsequent model in step S4 is formed by performing sequence labeling in BIOES form on the ancient Chinese medical record corpus.
In the above scheme, the sequence labeling in BIOES form of the ancient Chinese medical record corpus specifically includes the following steps:
step S41, selecting entity identification type;
step S42, appointing a labeling rule;
step S43, randomly selecting a sentence set with a preset scale from the cleaned corpus, writing the sentence set into the file to be labeled according to character separation, and separating the sentences by an empty line;
and step S44, manually labeling the selected sentence set with the preset scale based on the selected entity identification type and the agreed labeling rule.
In the above scheme, the entity types are respectively: symptoms ZZ, pulse MX, tongue SX, traditional Chinese medicine ZY, dosage JL and prescription FJ.
The embodiment of the invention also provides a traditional Chinese medicine named entity recognition system based on the ancient book literature of traditional Chinese medicine, which comprises: a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set labeling module, a sequence labeling model training module and an entity identification module; wherein:
the corpus acquisition module is used for acquiring traditional Chinese medical ancient book medical record corpus;
the data cleaning module is used for cleaning the acquired traditional Chinese medical ancient book medical record corpus to be processed;
the language model pre-training module is used for pre-training a language model facing the traditional Chinese medical ancient medical record corpus based on the traditional Chinese medical ancient medical record corpus;
the training set labeling module is used for carrying out sequence labeling on the linguistic data based on the cleaned traditional Chinese medical ancient book medical record linguistic data to form a training set of a subsequent model;
the sequence labeling model training module is used for training a corresponding sequence labeling model by taking a language model as a coding layer and a preset neural network structure as a decoding layer based on a model training set of sequence labeling;
the entity identification module is used for carrying out entity identification on the traditional Chinese medical ancient medical record based on the sequence marking model.
The technical scheme of the invention has the following beneficial effects:
in the above scheme, the method and system for recognizing traditional Chinese medicine named entities based on ancient traditional Chinese medicine literature build on an existing pre-trained language model, such as BERT, the language model pre-training method proposed by Google, and train on a small-sample training set, so that little labeled data is needed and the cost of manual labeling is greatly reduced. The resulting named entity recognition method oriented to ancient traditional Chinese medicine literature is easy to operate and efficient; it makes more effective and more comprehensive use of that literature, especially ancient medical records, improves the effect and accuracy of named entity recognition in the field of traditional Chinese medicine, and lays a good data foundation for subsequent applications in the field.
Drawings
In order to illustrate the embodiments of the present invention and the prior art more clearly, the drawing used in describing the technical solution is briefly introduced below; those skilled in the art can obtain other drawings from it without creative effort.
Fig. 1 is a flow chart of a method for identifying named entities in traditional Chinese medicine based on ancient book literature in traditional Chinese medicine according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
To solve the problem of recognizing traditional Chinese medicine named entities in ancient traditional Chinese medicine literature, including ancient medical records, the invention provides a method and system for recognizing traditional Chinese medicine named entities based on that literature. By building on an existing pre-trained language model, such as BERT, the language model pre-training method proposed by Google, it obtains a named entity recognition method oriented to ancient traditional Chinese medicine literature, so that the literature can be used more effectively and a good foundation is laid for related research in the field of traditional Chinese medicine.
The present invention will be described in further detail below with reference to specific embodiments in conjunction with the accompanying drawings.
First embodiment
The embodiment provides a method for identifying named entities in traditional Chinese medicine based on ancient book literature in traditional Chinese medicine, and fig. 1 is a schematic flow chart of the method for identifying named entities in traditional Chinese medicine.
The named entity in this embodiment is directed to medical records in ancient Chinese medical book literature, but the present invention is not limited to medical records, and can also be applied to other ancient Chinese medical book literature.
As shown in Fig. 1, the method for identifying named entities in traditional Chinese medicine based on ancient book literature in traditional Chinese medicine comprises the following steps:
step S1, obtaining the ancient Chinese medical record corpus.
Further, the method for acquiring the ancient Chinese medical record corpus specifically comprises the following steps:
step S11, scanning and recognizing the existing paper ancient medical records and books by Optical Character Recognition (OCR) to form an electronic text corpus.
And step S12, capturing traditional Chinese medical ancient book medical record corpus without paper books from the network by using the open-source web crawler.
And step S13, comparing and combining the corpus texts obtained in the step S11 and the step S12 to finally form a unified ancient Chinese medical record corpus to be processed.
And step S2, performing data cleaning on the ancient Chinese medical record corpus to be processed acquired in the step S1.
Further, the data cleaning is carried out on the traditional Chinese medical ancient book medical record corpus to be processed, and the method specifically comprises the following steps:
step S21, the wrongly written characters are corrected.
In this step, wrongly written characters are characters that the physician miswrote when recording the case or that were misrecognized when the corpus was acquired. For example, "one-medical-treatment diarrhea" is thirst, which is actually "one-medical-treatment diarrhea and thirst".
Step S22, filter irrelevant statements.
In this step, the irrelevant statement includes:
22.1, part of the original corpus comes from compiled books, so it contains many book titles, author names and the like, such as "medical records of both parties", "week of course", etc.
22.2, physicians recorded cases with considerable subjectivity, and some long-preserved records are incomplete, so the corpus contains sentences whose meaning is unclear or that only express the physician's personal feelings, such as "attaching.", "a.", "Once", "temporary", "but!" and the like.
Further, the filtering irrelevant statements comprises the following steps:
step S221, the characteristics of the irrelevant sentences needing to be filtered are manually sorted.
Preferably, the features include:
221.1, information such as titles and authors generally appears together;
221.2, titles generally contain book-title marks (《 》);
221.3, other sentences whose meaning is unclear or that express the physician's feelings generally have a fixed form, such as "Once again!", "a." and the like;
221.4, most of the irrelevant sentences to be filtered are short, preferably no longer than 10 to 15 characters.
Step S222, designing corresponding regular expressions according to the features compiled in step S221, splitting the corpus into sentences according to the sentence segmentation rule, and filtering the matching irrelevant sentences out of the corpus to be processed in combination with the length limit.
Preferably, the sentence segmentation rule is:
222.1, taking the character string ending by the period, the question mark and the exclamation mark as a sentence;
222.2, if no period, question mark, exclamation mark, but a single line, it is also considered as a sentence.
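Steps S221 and S222 can be sketched as follows. This is a minimal illustration, not the patented implementation: the single pattern in `IRRELEVANT_PATTERNS`, the 15-character limit, the acceptance of ASCII punctuation alongside full-width marks, and all function names are assumptions made for the example.

```python
import re

# Rule 222.1: a sentence ends at a period, question mark, or exclamation mark.
# Both full-width and ASCII marks are accepted here (an assumption).
_SENTENCE_END = re.compile(r"[^。？！.?!]*[。？！.?!]")

# Hypothetical pattern for feature 221.2; real patterns would be written
# by hand after inspecting the corpus.
IRRELEVANT_PATTERNS = [
    re.compile(r"《[^》]*》"),  # book-title marks suggest a title line
]
MAX_IRRELEVANT_LEN = 15  # upper end of the 10 to 15 range in feature 221.4


def split_sentences(text):
    """Split text into sentences following rules 222.1 and 222.2."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        pos = 0
        for match in _SENTENCE_END.finditer(line):
            sentences.append(match.group().strip())
            pos = match.end()
        # Rule 222.2: text with no terminal mark still counts as a sentence.
        rest = line[pos:].strip()
        if rest:
            sentences.append(rest)
    return sentences


def is_irrelevant(sentence):
    """A sentence is filtered only if it is short AND matches a pattern."""
    if len(sentence) > MAX_IRRELEVANT_LEN:
        return False
    return any(p.search(sentence) for p in IRRELEVANT_PATTERNS)


def clean(text):
    """Return the sentences of text that survive the filter."""
    return [s for s in split_sentences(text) if not is_irrelevant(s)]
```

Combining the pattern match with the length limit keeps long clinical sentences that merely cite a book title from being discarded.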
And step S3, pre-training a language model facing the ancient Chinese medical record corpus based on the ancient Chinese medical record corpus obtained in the step S2.
Preferably, the language model pre-training in this step uses BERT, Google's language model pre-training method.
Further, the language model pre-training for the ancient medical record corpus of traditional Chinese medicine specifically comprises the following steps:
step S31, downloading the source code of the language model pre-training method and a Chinese language model trained with it.
Taking Google's BERT as an example, the Google open-source Chinese language model chinese_L-12_H-768_A-12, which was trained with BERT, and the BERT source code are downloaded.
And step S32, manually compiling a character table for ancient Chinese medical records, comparing it with the Chinese character table in the source code, and separating out a rarely-used character table specific to the field of traditional Chinese medicine.
Taking Google's BERT as an example, the compiled character table is compared with the Google open-source Chinese character table to separate out a rarely-used character table specific to the field of traditional Chinese medicine. Table 1 shows part of the separated rarely-used character table.
TABLE 1 rarely-used character table based on Google open source
[Table 1 is provided as images in the original publication: Figure BDA0002048647910000071, Figure BDA0002048647910000081]
And step S33, merging the rarely-used character table and the Chinese character table in a mode of replacing characters with low use frequency in the source code by characters in the rarely-used character table, and ensuring that the length of the Chinese character table is unchanged.
Taking Google's open-source BERT as an example, the Google open-source Chinese character table is merged with the rarely-used character table; during merging, low-frequency characters in the Chinese character table are replaced by characters from the rarely-used character table, so that the length of the character table is unchanged.
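The merge in step S33 can be sketched as below. This is an illustration under stated assumptions: the text does not say how "low use frequency" is measured, so the sketch counts characters in a supplied corpus, and it treats only single-character entries as replaceable so that multi-character special tokens such as `[PAD]` are never overwritten. The function name `merge_vocab` is hypothetical.

```python
from collections import Counter


def merge_vocab(vocab, rare_chars, corpus_text):
    """Return a copy of vocab in which the least frequent single-character
    entries are overwritten by the rare characters; length is unchanged."""
    existing = set(vocab)
    # Rare characters the vocabulary is actually missing, duplicates removed.
    missing = [c for c in dict.fromkeys(rare_chars) if c not in existing]
    freq = Counter(corpus_text)
    # Candidate slots: single-character entries, least frequent first.
    # sorted() is stable, so ties keep their vocabulary order.
    slots = sorted(
        (i for i, tok in enumerate(vocab) if len(tok) == 1),
        key=lambda i: freq[vocab[i]],
    )
    if len(missing) > len(slots):
        raise ValueError("not enough replaceable entries in the vocabulary")
    merged = list(vocab)
    for slot, char in zip(slots, missing):
        merged[slot] = char
    return merged
```

Keeping the table length fixed means the embedding matrix of the downloaded model keeps its shape; the overwritten rows are simply reused for the new characters.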
And step S34, segmenting the ancient Chinese medical record corpus cleaned in the step S2 into paragraphs, presetting a paragraph length threshold and/or a paragraph number threshold, and taking the paragraph texts that satisfy the paragraph length threshold and/or the paragraph number threshold as the training corpus of the language model.
Preferably, the segmentation of the paragraphs in the ancient medical record corpus of traditional Chinese medical science is performed according to the rules of 222.1 and 222.2. Preferably, the paragraph length threshold is 150, and the paragraph number threshold is 3.
And step S35, replacing the word segmentation method in the source code with a rule that splits text character by character, and, starting from the downloaded Chinese language model, using the downloaded pre-training method to pre-train a language model oriented to ancient Chinese medical records on the training corpus from step S34.
Taking Google's BERT as an example, this step replaces the word segmentation method in the BERT source code with a rule that splits text character by character and, starting from the Google open-source Chinese language model, uses the BERT pre-training method to pre-train a language model oriented to ancient Chinese medical records on the training corpus segmented in step S34.
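The character-by-character segmentation rule can be illustrated with a few lines. This sketch is not the BERT tokenizer and does not reproduce its source; `char_tokenize` and the default `[UNK]` placeholder are assumptions chosen for the example.

```python
def char_tokenize(text, vocab, unk_token="[UNK]"):
    """Split text into one token per character, skipping whitespace and
    mapping characters outside the vocabulary to unk_token."""
    return [ch if ch in vocab else unk_token for ch in text if not ch.isspace()]
```

For example, with a vocabulary containing only 口 and 渴, `char_tokenize("口渴 甚", {"口", "渴"})` yields one token per character, with the unseen character replaced by the placeholder.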
And step S4, carrying out sequence labeling on the linguistic data based on the cleaned traditional Chinese medical ancient book medical record linguistic data obtained in the step S2 to form a training set of a subsequent model.
Further, the training set of the subsequent model is formed by performing sequence labeling in BIOES form on the ancient Chinese medical record corpus. The labels here are manual labels.
The tags of the BIOES scheme are: B (Begin), the beginning character of an entity; I (Intermediate), a middle character of an entity; E (End), the ending character of an entity; S (Single), an entity consisting of a single character; O (Other), used for marking irrelevant characters. Specifically, the labeling comprises the following steps:
step S41, selecting an entity identification type.
The named entities involved in ancient medical records of traditional Chinese medicine are various in types, and the preferred entity types in this embodiment are respectively: symptom (ZZ), pulse (MX), tongue (SX), Chinese medicine (ZY), dosage (JL) and prescription (FJ). In practical applications, the entity types may be added or deleted according to the specific text of the ancient medical records of traditional Chinese medicine or other needs, and the above examples do not limit the selection of the entity types.
Step S42, appointing marking rules.
Named entities in ancient Chinese medical records have distinct domain characteristics, most prominently the "symptom" entities, so general labeling rules need to be adapted to these characteristics during labeling. In ancient Chinese medical records, a symptom name generally consists of the disease location, the disease condition and the disease nature, such as "black tongue and burnt lips". However, because physicians recorded cases freely, symptom names appear in a variety of forms, including:
42.1, several symptoms appear side by side without punctuation between them. For this case we agree that symptoms in a parallel relation are labeled separately, e.g. in "bitter taste and deafness" each symptom is taken as an independent symptom name, while symptoms in a progressive relation are labeled as one symptom, e.g. "abdominal pain and diarrhea" is taken as a single symptom name.
42.2, several disease locations appear side by side, separated by punctuation. For this case we agree that only the last disease location is kept, e.g. in "chest pain and abdominal pain" only "abdominal pain" is taken as the symptom.
Step S43, randomly selecting a sentence set with a preset scale from the cleaned corpus, writing the sentence set into the file to be labeled according to character separation, and separating the sentences by an empty line.
Preferably, the sentence sets of the predetermined size each contain 5000 sentences, which account for 5% of the total corpus text.
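Step S43 can be sketched as follows: a random sample is drawn and written out one character per line, with an empty line between sentences. The function names and the fixed random seed are assumptions made so the example is reproducible.

```python
import random


def sample_for_labeling(sentences, sample_size, seed=0):
    """Randomly pick up to sample_size sentences without replacement."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(sample_size, len(sentences)))


def to_labeling_text(sentences):
    """One character per line; sentences are separated by an empty line."""
    blocks = []
    for sentence in sentences:
        chars = [ch for ch in sentence if not ch.isspace()]
        blocks.append("\n".join(chars))
    return "\n\n".join(blocks) + "\n"
```

The returned text can be written to the file to be labeled; annotators then add a tag next to each character.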
And step S44, manually labeling the selected sentence set with the preset scale based on the selected entity identification type and the agreed labeling rule.
Taking the preferred entity types, symptom (ZZ), pulse (MX), tongue (SX), Chinese medicine (ZY), dosage (JL) and prescription (FJ), as an example, Table 2 shows an example of manual labeling:
Table 2 Example of manual annotation of a selected set of sentences
[Table 2 is provided as an image in the original publication: Figure BDA0002048647910000101]
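To make the BIOES scheme concrete, the sketch below turns entity spans into per-character tags. The tag spelling `B-ZZ` (prefix, hyphen, entity type) is an assumption, since the original table is only available as an image; the spans in the usage example are illustrative and not taken from the patent.

```python
def spans_to_bioes(length, spans):
    """Build BIOES tags for a sentence of the given length.

    spans: list of (start, end, entity_type) with end exclusive; spans are
    assumed not to overlap.
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags
```

For a five-character sentence with a two-character symptom followed by a three-character pulse entity, `spans_to_bioes(5, [(0, 2, "ZZ"), (2, 5, "MX")])` tags the first entity with B and E and the second with B, I and E.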
And step S5, based on the sequence labeled model training set obtained in step S4, taking the language model in step S3 as a coding layer, taking a preset neural network structure as a decoding layer, and training a corresponding sequence labeled model.
Preferably, the preset neural network structure is BiLSTM, and/or CRF, and/or Softmax.
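When a CRF is part of the decoding layer, prediction selects the tag sequence with the highest total of per-character scores and tag-to-tag transition scores, commonly computed with Viterbi decoding. The following pure-Python sketch illustrates that decoding step only; it is not the patent's network, and the data layout (dictionaries of scores) and the scores in the usage note are made up for illustration.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one {tag: score} dict per character position.
    transitions: {(prev_tag, tag): score}; missing pairs score 0.0.
    """
    if not emissions:
        return []
    best = {t: emissions[0][t] for t in tags}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            # Best previous tag for ending in t at this position.
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, t), 0.0))
            new_best[t] = best[prev] + transitions.get((prev, t), 0.0) + scores[t]
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With a positive score for the transition from `B-ZZ` to `E-ZZ`, the decoder can prefer a well-formed B, E pair even where the per-character scores alone would pick `B-ZZ` twice, which is the usual motivation for a CRF over an independent Softmax per character.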
Further, the training of the sequence labeling model is performed based on the model training set of the sequence labeling in this step, which specifically includes the following steps:
and step S51, dividing a training set, a verification set and a test set.
Preferably, in this step, the training set labeled in step S4 is uniformly divided into 10 shares, wherein each share is guaranteed to contain a substantially consistent number of sentences of six entity types. In each training, 1 part of the data is selected as a verification set, 1 part is selected as a test set, and the rest 8 parts are selected as training sets.
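The 10-share division with a rotating verification and test share can be sketched as follows. The round-robin dealing and the choice of the next share as the verification set are assumptions; the sketch also does not balance the six entity types across shares, which the text asks for and which would need the labels as an additional input.

```python
def make_folds(sentences, n_folds=10):
    """Deal sentences round-robin into n_folds shares of near-equal size."""
    return [sentences[i::n_folds] for i in range(n_folds)]


def rotation(folds, round_index):
    """Pick 1 share for testing, 1 for verification, the rest for training."""
    n = len(folds)
    test_i = round_index % n
    val_i = (round_index + 1) % n
    train = [
        s
        for i, fold in enumerate(folds)
        if i not in (test_i, val_i)
        for s in fold
    ]
    return train, folds[val_i], folds[test_i]
```

Iterating `round_index` from 0 to 9 uses every share as the test set exactly once, so the collected test predictions cover the whole manually labeled set.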
And step S52, training the sequence labeling model to form an automatic labeling result of the model.
Preferably, in this step, the language model trained based on the pre-training method in step S3 is used as a coding layer, the neural network structure in the form of "BiLSTM + CRF" is used as a decoding layer, model training is performed by combining the training set, the verification set and the test set divided in step S51, wherein different data are selected as the test set to repeat the model training until the training set covers all the manual labeling sets, and finally the labeling results of the model are collected to form the automatic labeling result of the model.
And step S53, comparing the automatic labeling result with the manual labeling result, filtering out the inconsistent labeling sequences, manually correcting the filtered labeling results, and finally writing the corrected results back to the training set.
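The comparison in step S53 amounts to finding the sentences whose automatic tag sequence differs from the manual one. A minimal sketch, assuming both label sets are lists of tag sequences aligned by sentence index (the function name is hypothetical):

```python
def find_disagreements(manual, automatic):
    """Return the indices of sentences whose two tag sequences differ."""
    if len(manual) != len(automatic):
        raise ValueError("both label sets must cover the same sentences")
    return [i for i, (m, a) in enumerate(zip(manual, automatic)) if m != a]
```

The returned indices identify the sentences to hand back to annotators for correction before the training set is updated.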
Step S54: judge whether the labeling requirement is satisfied; if so, output the final labeling result; if not, return to step S51.
Further, judging whether the labeling requirement is met includes checking whether the precision of the model reaches a preset threshold, or whether the automatic labeling result of the model is completely consistent with the manual labeling result. The preset threshold here may be the peak value of the precision.
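As a non-limiting illustration, the judgment in step S54 might be sketched as follows. The embodiment does not state how precision is computed; the sketch assumes a character-level precision over predicted non-"O" tags, which is only one possible reading, and the function names are hypothetical.

```python
def tag_precision(manual, automatic):
    """Share of predicted non-'O' tags that agree with the manual tag (assumed metric)."""
    predicted = 0
    correct = 0
    for sid, gold in manual.items():
        for gold_tag, auto_tag in zip(gold, automatic.get(sid, [])):
            if auto_tag != "O":
                predicted += 1
                correct += int(gold_tag == auto_tag)
    return correct / predicted if predicted else 0.0


def requirement_met(manual, automatic, threshold):
    """True if labels are fully consistent or precision reaches the preset threshold."""
    fully_consistent = all(automatic.get(sid) == gold for sid, gold in manual.items())
    return fully_consistent or tag_precision(manual, automatic) >= threshold
```

When requirement_met returns False, the procedure goes back to step S51 with the corrected training set.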
Step S6: perform entity recognition on the ancient traditional Chinese medicine medical records using the sequence labeling model trained in step S5.
Further, performing entity recognition on the ancient traditional Chinese medicine medical records with the trained sequence labeling model specifically comprises the following steps:
Step S61: automatically label the ancient traditional Chinese medicine medical record corpus to be recognized using the trained sequence labeling model.
Step S62: convert the character-level labels into word-level labels according to the general rules of the BIOES labeling scheme, that is, separate the entities from the labeling results.
Preferred examples of the separated labeling results for the entity types symptom (ZZ), pulse (MX), tongue (SX), Chinese medicine (ZY), dosage (JL) and prescription (FJ) are shown in Table 3.
Table 3. Example of separated annotation results
Figure BDA0002048647910000121
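As a non-limiting illustration, the separation of entities in step S62 might be sketched as follows. The sketch assumes the common BIOES convention (B = begin, I = inside, E = end, S = single character, O = outside) with tags written as "B-ZZ"; the function name extract_entities and the sample sentence are hypothetical and are not taken from Table 3.

```python
def extract_entities(chars, tags):
    """Turn character-level BIOES tags into a list of (entity_text, entity_type)."""
    entities = []
    start = None    # index where the currently open entity began
    current = None  # entity type of the currently open entity
    for i, tag in enumerate(tags):
        if "-" not in tag:  # 'O' or anything without a type closes any open span
            start = current = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":
            entities.append((chars[i], etype))
            start = current = None
        elif prefix == "B":
            start, current = i, etype
        elif prefix == "I":
            if current != etype:  # malformed sequence: drop the open span
                start = current = None
        elif prefix == "E":
            if start is not None and current == etype:
                entities.append(("".join(chars[start:i + 1]), etype))
            start = current = None
    return entities


# Made-up sample: two symptoms followed by a pulse description.
sample_chars = list("头痛发热,脉浮")
sample_tags = ["B-ZZ", "E-ZZ", "B-ZZ", "E-ZZ", "O", "B-MX", "E-MX"]
sample_entities = extract_entities(sample_chars, sample_tags)
```

Spans that do not follow the B, I, E order are dropped rather than repaired, which is a design choice of this sketch and not a rule stated in the embodiment.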
According to the technical solution of this embodiment, the traditional Chinese medicine named entity recognition method based on ancient traditional Chinese medicine medical records builds on an existing pre-trained language model, such as Google's BERT, and trains on a small-sample training set. Little labeled data is therefore needed in the training set, which greatly reduces the cost of manual labeling. The resulting named entity recognition method for ancient traditional Chinese medicine literature makes more effective use of that literature, improves the accuracy of named entity recognition in the field of traditional Chinese medicine, and lays a good data foundation for subsequent applications in the field.
Second embodiment
This embodiment provides a traditional Chinese medicine named entity recognition system based on ancient traditional Chinese medicine literature, comprising a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set labeling module, a sequence labeling model training module and an entity identification module, wherein:
the corpus acquisition module is used for acquiring traditional Chinese medical ancient book medical record corpus;
the data cleaning module is used for cleaning the acquired traditional Chinese medical ancient book medical record corpus to be processed;
the language model pre-training module is used for pre-training a language model facing the traditional Chinese medical ancient medical record corpus based on the traditional Chinese medical ancient medical record corpus;
the training set labeling module is used for carrying out sequence labeling on the linguistic data based on the cleaned traditional Chinese medical ancient book medical record linguistic data to form a training set of a subsequent model;
the sequence labeling model training module is used for training a corresponding sequence labeling model by taking a language model as a coding layer and a preset neural network structure as a decoding layer based on a model training set of sequence labeling;
the entity identification module is used for carrying out entity identification on the traditional Chinese medical ancient medical record based on the sequence marking model.
The named entity recognition system of the present embodiment based on ancient Chinese medical literature corresponds to the named entity recognition method of the first embodiment, and therefore, the corresponding description of the first embodiment is also applicable to the named entity recognition system of the present embodiment, and is not repeated herein.
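As a non-limiting illustration, the way the six modules hand data to one another might be sketched as follows. Each module is represented by a hypothetical callable supplied by the caller; the class name TcmNerSystem and the field names are assumptions made for this sketch and do not describe an actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class TcmNerSystem:
    """Wires the six modules of the second embodiment into one pipeline."""
    acquire_corpus: Callable[[], List[str]]
    clean_data: Callable[[List[str]], List[str]]
    pretrain_language_model: Callable[[List[str]], Any]
    label_training_set: Callable[[List[str]], Any]
    train_sequence_model: Callable[[Any, Any], Any]
    recognize_entities: Callable[[Any, List[str]], Any]

    def run(self, records_to_recognize: List[str]) -> Any:
        raw = self.acquire_corpus()                       # corpus acquisition module
        cleaned = self.clean_data(raw)                    # data cleaning module
        language_model = self.pretrain_language_model(cleaned)
        training_set = self.label_training_set(cleaned)   # training set labeling module
        sequence_model = self.train_sequence_model(language_model, training_set)
        return self.recognize_entities(sequence_model, records_to_recognize)
```

The pre-training and labeling modules both consume the cleaned corpus, and the sequence labeling model training module consumes the outputs of both, mirroring steps S3 to S5 of the first embodiment.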
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A traditional Chinese medicine named entity recognition method based on traditional Chinese medicine ancient book literature is characterized by comprising the following steps:
step S1, acquiring traditional Chinese medicine ancient medical record corpus to be processed;
step S2, performing data cleaning on the ancient Chinese medical record corpus to be processed, which is obtained in the step S1;
step S3, pre-training a language model facing the ancient Chinese medical record corpus based on the ancient Chinese medical record corpus obtained in the step S2; the method specifically comprises the following steps:
step S31, downloading the source code of the language model pre-training method and the pre-trained Chinese language model;
step S32, manually arranging character tables related to the ancient medical records of traditional Chinese medicine, comparing the character tables with the Chinese character tables in the source codes, and cutting out uncommon character tables in the field of traditional Chinese medicine;
step S33, merging the rarely-used character table and the Chinese character table in a mode of replacing characters with low use frequency in source codes by characters in the rarely-used character table, and ensuring the length of the Chinese character table to be unchanged;
step S34, segmenting paragraphs in the ancient Chinese medical record corpus cleaned in step S2, presetting a paragraph length threshold and/or a paragraph number threshold, and taking paragraph texts with the paragraph length threshold and/or the paragraph number threshold as training corpuses of the language training model;
step S35, replacing the word segmentation method in the source code with a character-by-character segmentation rule and, starting from the Chinese language model, using the downloaded language model pre-training method to pre-train a language model oriented to ancient traditional Chinese medicine medical records on the training corpus obtained in step S34;
step S4, based on the cleaned Chinese medical ancient book medical records corpus obtained in step S2, carrying out sequence labeling on the corpus to form a training set of a subsequent model; the sequence labeling specifically comprises the following steps:
step S41, selecting entity identification type;
step S42, agreeing on a labeling rule: when several symptoms appear side by side without separating punctuation, they are labeled separately if they are in a parallel relation and labeled as a single symptom if they are in a progressive relation; when several disease sites appear side by side but are separated by punctuation marks, only the last disease site is retained;
step S43, randomly selecting a sentence set with a preset scale from the cleaned corpus, writing the sentence set into the file to be labeled according to character separation, and separating the sentences by an empty line;
step S44, manually labeling the selected sentence set with the preset scale based on the selected entity recognition type and the agreed labeling rule;
step S5, based on the model training set of the sequence annotation obtained in step S4, the language model in step S3 is used as a coding layer, a preset neural network structure is used as a decoding layer, and a corresponding sequence annotation model is trained; the method specifically comprises the following steps:
step S51, dividing a training set, a verification set and a test set;
step S52, training sequence label model to form automatic label result;
step S53, comparing the automatic labeling result with the manual labeling result, filtering out the inconsistent labeling sequence, manually correcting the filtered labeling result, and finally writing the corrected result back to the training set;
step S54, judging whether the labeling requirement is satisfied, if so, outputting the final labeling result; if not, returning to the step S51;
and S6, performing entity recognition on the traditional Chinese medical ancient medical records based on the sequence labeling model obtained by training in the step S5.
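As a non-limiting illustration, the preparation of the file to be labeled in step S43 might be sketched as follows. The sketch writes one character per line and separates sentences with an empty line, as recited; appending a default "O" tag to each character is an assumption borrowed from common sequence-labeling file layouts, and the function name build_annotation_text is hypothetical.

```python
import random


def build_annotation_text(sentences, sample_size, seed=0, default_tag="O"):
    """Randomly pick sentences and lay them out one character per line."""
    rng = random.Random(seed)
    chosen = rng.sample(sentences, min(sample_size, len(sentences)))
    blocks = []
    for sentence in chosen:
        # One "character tag" pair per line; whitespace characters are skipped.
        lines = [f"{ch} {default_tag}" for ch in sentence if not ch.isspace()]
        blocks.append("\n".join(lines))
    # An empty line separates consecutive sentences.
    return "\n\n".join(blocks) + "\n"
```

The returned text can be written to the file to be labeled, after which the annotator replaces the default tags according to the agreed labeling rule.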
2. The method for identifying named entities in traditional Chinese medicine according to claim 1, wherein the step S1 of obtaining ancient medical records corpus in traditional Chinese medicine comprises the following steps:
step S11, scanning and recognizing the existing paper-edition traditional Chinese medical ancient book by using optical character recognition to form an electronic text corpus;
step S12, capturing traditional Chinese medical ancient book medical record corpus without paper books from the network by using open-source web crawler;
and step S13, comparing and combining the corpus texts obtained in the step S11 and the step S12 to finally form a unified ancient Chinese medical record corpus to be processed.
3. The method for recognizing named entities in traditional Chinese medicine according to claim 1, wherein the data cleaning of the ancient traditional Chinese medicine medical record corpus to be processed in step S2 comprises the following steps:
step S21, correcting wrongly written characters;
step S22, filter irrelevant statements.
4. The method for recognizing named entities according to claim 1, wherein in step S3 the language model is pre-trained using BERT, the language model pre-training method open-sourced by Google.
5. The method for recognizing named entities according to claim 4, wherein the language model is pre-trained, specifically comprising:
step S31, downloading the Google open-source Chinese language model chinese_L-12_H-768_A-12, trained with BERT, and the BERT source code;
step S32, comparing the character table with the Google open-source Chinese character table to separate out a table of uncommon characters unique to the field of traditional Chinese medicine;
step S33, merging the Chinese character table of the Google open source with the rarely-used character table, and replacing characters with low use frequency in the Chinese character table by the rarely-used character table during merging so as to ensure that the length of the character table is unchanged;
step S34, segmenting paragraphs in the ancient Chinese medical record corpus cleaned in step S2, presetting a paragraph length threshold and/or a paragraph number threshold, and taking paragraph texts with the paragraph length threshold and/or the paragraph number threshold as training corpuses of the language training model;
and step S35, replacing the word segmentation method in the BERT source code with a character-by-character segmentation rule and, starting from the Google open-source Chinese language model, using the BERT pre-training method to pre-train a language model oriented to ancient traditional Chinese medicine medical records on the training corpus segmented in step S34.
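As a non-limiting illustration, the character table merging of step S33 and the character-by-character segmentation of step S35 might be sketched as follows. The sketch assumes that the character table is a list of tokens, that usage frequencies are supplied by the caller as a dictionary, and that only single CJK characters are eligible for replacement; the function names and the sample data are hypothetical and do not reproduce the BERT source code.

```python
def merge_vocab(base_vocab, base_frequency, rare_chars):
    """Swap the least-used CJK characters of base_vocab for new rare characters.

    The list length is unchanged, so existing embedding indices stay valid.
    """
    existing = set(base_vocab)
    # Keep input order, drop duplicates and characters already in the table.
    new_chars = [c for c in dict.fromkeys(rare_chars) if c not in existing]
    candidates = sorted(
        (i for i, tok in enumerate(base_vocab)
         if len(tok) == 1 and "\u4e00" <= tok <= "\u9fff"),
        key=lambda i: base_frequency.get(base_vocab[i], 0),
    )
    if len(new_chars) > len(candidates):
        raise ValueError("not enough replaceable characters in the base table")
    merged = list(base_vocab)
    for index, char in zip(candidates, new_chars):
        merged[index] = char
    return merged


def char_tokenize(text):
    """Segment text character by character, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


# Made-up sample data for the sketch.
base = ["[PAD]", "[UNK]", "的", "一", "龘", "中"]
freq = {"的": 100, "一": 90, "龘": 0, "中": 50}
merged = merge_vocab(base, freq, ["癥", "瘕", "的"])
```

Replacing entries in place, rather than appending new ones, is what keeps the length of the character table unchanged as recited in step S33.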
6. The method for recognizing named entities in traditional chinese medical science according to claim 1, wherein the entity recognition types are respectively: symptoms ZZ, pulse MX, tongue SX, traditional Chinese medicine ZY, dosage JL and prescription FJ.
7. A traditional Chinese medicine named entity recognition system based on traditional Chinese medicine ancient book literature, characterized by comprising: a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set labeling module, a sequence labeling model training module and an entity identification module; wherein:
the corpus acquisition module is used for acquiring traditional Chinese medical ancient book medical record corpuses to be processed;
the data cleaning module is used for cleaning the acquired traditional Chinese medical ancient book medical record corpus to be processed;
the language model pre-training module is used for pre-training a language model facing the traditional Chinese medical ancient medical record corpus based on the traditional Chinese medical ancient medical record corpus; and further for:
downloading the source code of the language model pre-training method and the pre-trained Chinese language model; manually compiling a character table for ancient traditional Chinese medicine medical records, comparing it with the Chinese character table in the source code, and separating out a table of uncommon characters in the field of traditional Chinese medicine; merging the uncommon character table with the Chinese character table by replacing low-frequency characters in the source code's character table with characters from the uncommon character table, so that the length of the Chinese character table remains unchanged; segmenting paragraphs in the cleaned ancient traditional Chinese medicine medical record corpus, presetting a paragraph length threshold and/or a paragraph number threshold, and taking paragraph texts that are larger than the paragraph length threshold and/or the paragraph number threshold as the training corpus of the language model; replacing the word segmentation method in the source code with a character-by-character segmentation rule and, starting from the Chinese language model, using the downloaded language model pre-training method to pre-train a language model oriented to ancient traditional Chinese medicine medical records on that training corpus;
the training set labeling module is used for carrying out sequence labeling on the linguistic data based on the cleaned traditional Chinese medical ancient book medical record linguistic data to form a training set of a subsequent model; and further for:
selecting the entity recognition types; agreeing on a labeling rule: when several symptoms appear side by side without separating punctuation, they are labeled separately if they are in a parallel relation and labeled as a single symptom if they are in a progressive relation; when several disease sites appear side by side but are separated by punctuation marks, only the last disease site is retained; randomly selecting a sentence set of a preset scale from the cleaned corpus, writing it character by character into a file to be labeled, and separating sentences with an empty line; manually labeling the selected sentence set of the preset scale based on the selected entity recognition types and the agreed labeling rule;
the sequence labeling model training module is used for training a corresponding sequence labeling model by taking a language model as a coding layer and a preset neural network structure as a decoding layer based on a model training set of sequence labeling; and further for: dividing a training set, a verification set and a test set; training the sequence labeling model to form an automatic labeling result of the model; comparing the automatic labeling result with the manual labeling result of the model, filtering out inconsistent labeling sequences, manually correcting the filtered labeling results, and finally writing the corrected results back to the training set; judging whether the labeling requirements are met, and if so, outputting a final labeling result; if not, returning to the division of the data set;
the entity identification module is used for carrying out entity identification on the traditional Chinese medical ancient medical records based on the sequence labeling model.
CN201910367376.9A 2019-05-05 2019-05-05 Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature Active CN110134953B (en)

Publications (2)

Publication Number Publication Date
CN110134953A CN110134953A (en) 2019-08-16
CN110134953B true CN110134953B (en) 2020-12-18

Family

ID=67576196

Country Status (1)

Country Link
CN (1) CN110134953B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128225A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 Named entity identification method and device, electronic equipment and computer storage medium
CN111259626A (en) * 2020-01-16 2020-06-09 上海国民集团健康科技有限公司 Traditional Chinese medicine entity recognition algorithm
CN111312356B (en) * 2020-01-17 2022-07-01 四川大学 Traditional Chinese medicine prescription generation method based on BERT and integration efficacy information
CN111274764B (en) * 2020-01-23 2021-02-23 北京百度网讯科技有限公司 Language generation method and device, computer equipment and storage medium
CN111507351B (en) * 2020-04-16 2023-05-30 华南理工大学 Ancient book document digitizing method
CN112183099A (en) * 2020-10-09 2021-01-05 上海明略人工智能(集团)有限公司 Named entity identification method and system based on semi-supervised small sample extension
CN112163410A (en) * 2020-10-14 2021-01-01 四川大学 Ancient text pre-training system based on deep learning and training method thereof
CN112364655B (en) * 2020-10-30 2021-08-24 北京中科凡语科技有限公司 Named entity recognition model establishing method and named entity recognition method
CN112949310B (en) * 2021-03-01 2023-06-06 创新奇智(上海)科技有限公司 Model training method, traditional Chinese medicine name recognition method, device and network model
CN113255357A (en) * 2021-06-24 2021-08-13 北京金山数字娱乐科技有限公司 Data processing method, target recognition model training method, target recognition method and device
CN114297693B (en) * 2021-12-30 2022-11-18 北京海泰方圆科技股份有限公司 Model pre-training method and device, electronic equipment and storage medium
CN116796742A (en) * 2023-03-27 2023-09-22 上海交通大学医学院 Method, device, equipment and storage medium for identifying ancient books named entity of traditional Chinese medicine

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN109190113A (en) * 2018-08-10 2019-01-11 北京科技大学 A kind of knowledge mapping construction method of theory of traditional Chinese medical science ancient books and records

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Semantic Feature Expansion Technology Based on Knowledge Map;Yonghong Xie et al.;《14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery》;20181231;第1110-1114页 *
基于BERT预训练的中文命名实体识别TensorFlow实现;macanv;《CSDN博客-https://blog.csdn.net/macanv/article/details/85684284》;20190103;第1-8页 *
基于深度学习的中医典籍命名实体识别研究;高甦 等;《情报工程》;20190110;第5卷(第1期);第113-123页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant