CN110134953B

CN110134953B - Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature

Info

Publication number: CN110134953B
Application number: CN201910367376.9A
Authority: CN
Inventors: 张德政; 杨石兵; 贾麒; 谢永红; 夏超; 栗辉
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2019-05-05
Filing date: 2019-05-05
Publication date: 2020-12-18
Anticipated expiration: 2039-05-05
Also published as: CN110134953A

Abstract

The invention provides a traditional Chinese medicine named entity recognition method and system based on ancient traditional Chinese medicine literature, addressing the problem of recognizing traditional Chinese medicine named entities. The method obtains a corpus of ancient traditional Chinese medicine literature, cleans the data, and then pre-trains a language model on it. A training set for the subsequent model is formed by sequence labeling of the corpus. A sequence labeling model is then trained on this training set, with the language model as the coding layer and a neural network structure as the decoding layer, and traditional Chinese medicine named entity recognition is carried out with the trained sequence labeling model. By building on an existing pre-trained language model, such as the BERT language model pre-training method proposed by Google, the invention trains on a small-sample training set, which saves manual labeling cost, improves recognition accuracy, and is easy to operate. It enables effective and comprehensive use of ancient traditional Chinese medicine literature, especially ancient medical records, and lays a good foundation for research in the field of traditional Chinese medicine.

Description

Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature

Technical Field

The invention belongs to the field of information processing and traditional Chinese medicine literature, and particularly relates to a traditional Chinese medicine named entity identification method and system based on traditional Chinese medicine ancient book literature.

Background

Traditional Chinese medicine has a long and rich history. It is passed on partly through the direct experience of senior practitioners and partly through literature. The literature of traditional Chinese medicine preserves a large number of ancient medical records, which contain the experience and the diagnosis and treatment methods of many renowned veteran practitioners. An ancient medical record of traditional Chinese medicine is the continuous record of a patient's symptoms, causes of disease, prescriptions, medicines and the like, made by an ancient practitioner while treating the disease. Named entities in traditional Chinese medicine are the information entities in these records, such as symptoms, prescriptions and medicines, that explain and reproduce the patient's condition. To make better use of traditional Chinese medicine literature, including ancient medical records, named entity recognition is an important prerequisite for related research in the field.

Named entity recognition research on common entity types (e.g., person names, place names, organization names) has achieved good results, essentially approaching the level of manual labeling. However, ancient traditional Chinese medicine literature differs greatly from other literature in vocabulary and grammar and has its own characteristics, so applying prior-art named entity recognition methods to ancient medical records does not give satisfactory results. In addition, ancient medical records contain many grammatical phenomena that are hard to handle, which makes manual labeling difficult and expensive and further increases the difficulty of traditional Chinese medicine named entity recognition.

Disclosure of Invention

The technical problem to be solved by the invention is that the prior art lacks an effective method for recognizing named entities in traditional Chinese medicine. The invention provides a method and a system for recognizing traditional Chinese medicine named entities based on ancient traditional Chinese medicine literature. By building on an existing pre-trained language model, such as the BERT language model pre-training method provided by Google, it obtains a named entity recognition method oriented to ancient traditional Chinese medicine literature, so that such literature, especially ancient medical records, can be used effectively.

In order to solve the technical problems, the embodiment of the invention provides a traditional Chinese medicine named entity identification method based on an ancient book literature of traditional Chinese medicine, which comprises the following steps:

step S1, acquiring traditional Chinese medical ancient medical record corpus;

step S2, performing data cleaning on the ancient Chinese medical record corpus to be processed, which is obtained in the step S1;

step S3, pre-training a language model facing the ancient Chinese medical record corpus based on the ancient Chinese medical record corpus obtained in the step S2;

step S4, based on the cleaned Chinese medical ancient book medical records corpus obtained in step S2, carrying out sequence labeling on the corpus to form a training set of a subsequent model;

step S5, based on the model training set of the sequence annotation obtained in step S4, the language model in step S3 is used as a coding layer, a preset neural network structure is used as a decoding layer, and a corresponding sequence annotation model is trained;

and step S6, performing entity recognition on the traditional Chinese medical ancient medical records based on the sequence labeling model obtained by training in the step S5.

In the above scheme, the step S1 of obtaining the ancient medical records corpus of traditional Chinese medicine specifically includes the following steps:

step S11, scanning and recognizing the existing paper-edition traditional Chinese medical ancient book by using optical character recognition to form an electronic text corpus;

step S12, capturing traditional Chinese medical ancient book medical record corpus without paper books from the network by using open-source web crawler;

and step S13, comparing and combining the corpus texts obtained in the step S11 and the step S12 to finally form a unified ancient Chinese medical record corpus to be processed.

In the above scheme, the step S2 of performing data cleaning on the ancient medical record corpus of traditional Chinese medicine to be processed specifically includes the following steps:

step S21, correcting wrongly written characters;

step S22, filter irrelevant statements.

In the above scheme, the language model pre-training in step S3 specifically includes the following steps:

step S31, downloading the source code of the language model pre-training method and the Chinese language model trained with it;

step S32, manually compiling a character table for ancient traditional Chinese medicine medical records, comparing it with the Chinese character table in the source code, and separating out a table of rare characters specific to the field of traditional Chinese medicine;

step S33, merging the rare character table into the Chinese character table by replacing characters with low use frequency in the source code's table with characters from the rare character table, so that the length of the Chinese character table remains unchanged;

step S34, segmenting the ancient medical record corpus cleaned in step S2 into paragraphs, presetting a paragraph length threshold and/or a paragraph number threshold, and taking the paragraph texts that satisfy the threshold(s) as the training corpus of the language model;

and step S35, replacing the word segmentation method in the source code with a rule that splits text into individual characters, and, starting from the downloaded Chinese language model, pre-training a language model oriented to ancient traditional Chinese medicine medical records on the training corpus of step S34 using the downloaded pre-training method.

In the above scheme, in step S3, the language model is pre-trained by using Google's language model pre-training method BERT.

In the foregoing solution, the pre-training of the language model specifically includes:

step S31, downloading Google's open-source Chinese language model chinese_L-12_H-768_A-12, which was trained with BERT, and the BERT source code;

step S32, comparing the character table with Google's open-source Chinese character table to separate out a table of rare characters unique to the field of traditional Chinese medicine;

step S33, merging Google's open-source Chinese character table with the rare character table, replacing characters with low use frequency in the Chinese character table with characters from the rare character table during merging, so that the length of the character table remains unchanged;

and step S35, replacing the word segmentation method in the BERT source code with a rule that splits text into individual characters, and, starting from Google's open-source Chinese language model, using the BERT pre-training method to pre-train a language model oriented to ancient traditional Chinese medicine medical records on the training corpus segmented in step S34.

In the above solution, the subsequent model training set of step S4 is formed by performing sequence labeling in the BIOES form on the ancient traditional Chinese medicine medical record corpus.

In the above scheme, the sequence labeling in the BIOES form of the ancient traditional Chinese medicine medical record corpus specifically includes the following steps:

step S41, selecting entity identification type;

step S42, appointing a labeling rule;

step S43, randomly selecting a sentence set with a preset scale from the cleaned corpus, writing the sentence set into the file to be labeled according to character separation, and separating the sentences by an empty line;

and step S44, manually labeling the selected sentence set with the preset scale based on the selected entity identification type and the agreed labeling rule.

In the above scheme, the entity types are respectively: symptoms ZZ, pulse MX, tongue SX, traditional Chinese medicine ZY, dosage JL and prescription FJ.

The embodiment of the invention also provides a traditional Chinese medicine named entity recognition system based on ancient traditional Chinese medicine literature, which comprises: a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set labeling module, a sequence labeling model training module and an entity identification module; wherein:

the corpus acquisition module is used for acquiring traditional Chinese medical ancient book medical record corpus;

the data cleaning module is used for cleaning the acquired traditional Chinese medical ancient book medical record corpus to be processed;

the language model pre-training module is used for pre-training a language model facing the traditional Chinese medical ancient medical record corpus based on the traditional Chinese medical ancient medical record corpus;

the training set labeling module is used for carrying out sequence labeling on the linguistic data based on the cleaned traditional Chinese medical ancient book medical record linguistic data to form a training set of a subsequent model;

the sequence labeling model training module is used for training a corresponding sequence labeling model by taking a language model as a coding layer and a preset neural network structure as a decoding layer based on a model training set of sequence labeling;

the entity identification module is used for carrying out entity identification on the traditional Chinese medical ancient medical record based on the sequence marking model.

The technical scheme of the invention has the following beneficial effects:

in the above scheme, the traditional Chinese medicine named entity recognition method and system based on ancient traditional Chinese medicine literature build on an existing pre-trained language model, for example the BERT language model pre-training method provided by Google, and train on a small-sample training set, so that little labeled data is required and the manual labeling cost is greatly reduced. The resulting named entity recognition method for ancient traditional Chinese medicine literature is easy to operate and efficient. It allows ancient traditional Chinese medicine literature, especially ancient medical records, to be used more effectively and comprehensively, improves the effect and accuracy of named entity recognition in the field of traditional Chinese medicine, and lays a good data foundation for subsequent applications in that field.

Drawings

To illustrate the embodiments of the present invention and the prior art more clearly, the drawing used in the description of the technical scheme is briefly introduced below. Obviously, those skilled in the art can obtain other drawings without creative effort.

Fig. 1 is a flow chart of a method for identifying named entities in traditional Chinese medicine based on ancient book literature in traditional Chinese medicine according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

In order to solve the problem of traditional Chinese medicine named entity recognition of traditional Chinese medicine ancient book documents including traditional Chinese medicine ancient book medical cases, the invention provides a traditional Chinese medicine named entity recognition method and system based on traditional Chinese medicine ancient book documents, and the named entity recognition method facing the traditional Chinese medicine ancient book documents is obtained by combining the existing language training model, such as a language model pre-training method bert provided by Google, so that the traditional Chinese medicine ancient book documents can be more effectively utilized, and a good foundation is laid for relevant research in the field of traditional Chinese medicine.

The present invention will be described in further detail below with reference to specific embodiments in conjunction with the accompanying drawings.

First embodiment

The embodiment provides a method for identifying named entities in traditional Chinese medicine based on ancient book literature in traditional Chinese medicine, and fig. 1 is a schematic flow chart of the method for identifying named entities in traditional Chinese medicine.

The named entity in this embodiment is directed to medical records in ancient Chinese medical book literature, but the present invention is not limited to medical records, and can also be applied to other ancient Chinese medical book literature.

As shown in fig. 1, the method for identifying named entities in traditional Chinese medicine based on ancient book literature in traditional Chinese medicine comprises the following steps:

step S1, obtaining the ancient Chinese medical record corpus.

Further, the method for acquiring the ancient Chinese medical record corpus specifically comprises the following steps:

step S11, scanning and recognizing the existing paper ancient medical records and books by Optical Character Recognition (OCR) to form an electronic text corpus.

And step S12, capturing traditional Chinese medical ancient book medical record corpus without paper books from the network by using the open-source web crawler.

And step S2, performing data cleaning on the ancient Chinese medical record corpus to be processed acquired in the step S1.

Further, the data cleaning is carried out on the traditional Chinese medical ancient book medical record corpus to be processed, and the method specifically comprises the following steps:

step S21, the wrongly written characters are corrected.

In this step, wrongly written characters are characters that the physician miswrote when recording the case or that were misrecognized when the corpus was acquired. For example, "one-medical-treatment diarrhea" is thirst, which is actually "one-medical-treatment diarrhea and thirst".

Step S22, filter irrelevant statements.

In this step, the irrelevant statement includes:

22.1, because part of the original corpus comes from compiled books, it contains many book titles, author names and the like, such as "medical records of both parties", "week of course", etc.

22.2, because physicians recorded medical records with considerable subjective freedom, or because records with a long history are incompletely preserved, the corpus includes some sentences whose meaning is unclear or which only express the physician's personal emotion, such as "attaching.", "a.", "Once", "temporary", "but!" and the like.

Further, the filtering irrelevant statements comprises the following steps:

step S221, the characteristics of the irrelevant sentences needing to be filtered are manually sorted.

Preferably, the features include:

221.1, information such as titles, authors and the like are basically bound together to appear;

221.2, the title basically comprises a book name number;

221.3, other sentences whose meaning is unclear or which express the physician's emotion have an essentially fixed form, such as "Once again!", "a." and the like;

221.4, the irrelevant sentences to be filtered are mostly short, preferably not more than 10-15 characters.

Step S222, designing corresponding regular expressions according to the features compiled in step S221, segmenting the text into sentences according to the sentence segmentation rule, and filtering the irrelevant sentences out of the corpus to be processed in combination with the length limit.

Preferably, the sentence segmentation rule is:

222.1, a character string ending with a period, a question mark or an exclamation mark is taken as one sentence;

222.2, a string that has no period, question mark or exclamation mark but occupies a line of its own is also regarded as one sentence.
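The segmentation rules 222.1 and 222.2 and the filtering of step S222 can be sketched as follows. This is only an illustrative sketch: the regular expressions, the interjection list and the names `split_sentences`, `is_irrelevant` and `clean_corpus` are assumptions made for the example and are not given in the source, which only states that regular expressions are designed from the features of step S221.

```python
import re

# Sentence-ending punctuation for rule 222.1 (full-width and ASCII forms).
SENTENCE_END = "。？！?!"

# Hypothetical patterns for the features of step S221 (not from the source).
TITLE_PATTERN = re.compile(r"《[^》]*》")                    # feature 221.2: book title marks
INTERJECTION_PATTERN = re.compile(r"^(?:噫|嗟乎)[。！!]?$")  # feature 221.3: fixed-form interjections
MAX_IRRELEVANT_LENGTH = 15                                   # feature 221.4: upper end of 10-15


def split_sentences(text):
    """Split text into sentences following rules 222.1 and 222.2."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Rule 222.1: a run of characters closed by a period, question mark
        # or exclamation mark. Rule 222.2: an unterminated remainder on its
        # own line is matched as well, because the closing mark is optional.
        parts = re.findall(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?", line)
        sentences.extend(part.strip() for part in parts if part.strip())
    return sentences


def is_irrelevant(sentence):
    """Return True for short sentences that match one of the filter patterns."""
    if len(sentence) > MAX_IRRELEVANT_LENGTH:
        return False
    return bool(TITLE_PATTERN.search(sentence) or INTERJECTION_PATTERN.match(sentence))


def clean_corpus(text):
    """Segment the text and drop the sentences judged irrelevant."""
    return [s for s in split_sentences(text) if not is_irrelevant(s)]
```

In practice the patterns would be tuned by hand against the actual corpus, since the length limit alone would also remove short but meaningful sentences.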

And step S3, pre-training a language model facing the ancient Chinese medical record corpus based on the ancient Chinese medical record corpus obtained in the step S2.

Preferably, the language model in this step is pre-trained with Google's language model pre-training method BERT.

Further, the language model pre-training for the ancient medical record corpus of traditional Chinese medicine specifically comprises the following steps:

Step S31, downloading the source code of the language model pre-training method and the Chinese language model trained with it.

For example, taking Google's BERT as an example, Google's open-source Chinese language model chinese_L-12_H-768_A-12, which was trained with BERT, and the BERT source code are downloaded.

Step S32, manually compiling a character table for ancient traditional Chinese medicine medical records, comparing it with the Chinese character table in the source code, and separating out a table of rare characters specific to the field of traditional Chinese medicine.

For example, taking Google's BERT as an example, the compiled character table is compared with Google's open-source Chinese character table, and a table of rare characters unique to the field of traditional Chinese medicine is separated out. Table 1 shows part of the separated rare character table.

TABLE 1 rarely-used character table based on Google open source

Step S33, merging the rare character table into the Chinese character table by replacing characters with low use frequency in the source code's table with characters from the rare character table, so that the length of the Chinese character table remains unchanged.

Taking Google's open-source BERT as an example, Google's open-source Chinese character table is merged with the rare character table, and characters with low use frequency are replaced by characters from the rare character table during merging, so that the length of the character table remains unchanged.
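The length-preserving merge of step S33 might look like the sketch below. The function name `merge_vocab`, the frequency dictionary and the rule that only single-character entries are replaceable are assumptions for illustration; the source states only that low-frequency characters are replaced so that the table length stays unchanged.

```python
def merge_vocab(base_vocab, char_frequency, rare_chars):
    """Replace the least frequent single-character entries with rare characters.

    base_vocab:     list of vocabulary entries, in file order
    char_frequency: dict mapping a character to its use frequency
                    (hypothetical input; missing characters count as 0)
    rare_chars:     rare characters to add to the vocabulary
    """
    existing = set(base_vocab)
    # Keep only characters that are actually missing, without duplicates.
    new_chars = [c for c in dict.fromkeys(rare_chars) if c not in existing]

    # Assumption: only single-character entries may be replaced, so special
    # tokens such as "[PAD]" or "[UNK]" are never touched.
    candidates = [i for i, token in enumerate(base_vocab) if len(token) == 1]
    candidates.sort(key=lambda i: char_frequency.get(base_vocab[i], 0))

    if len(new_chars) > len(candidates):
        raise ValueError("not enough replaceable entries in the vocabulary")

    merged = list(base_vocab)
    for index, char in zip(candidates, new_chars):
        merged[index] = char  # same position, so the table length is unchanged
    return merged
```

Because each rare character takes over the position of a replaced entry, the embedding matrix of the downloaded model keeps its shape, which appears to be the reason the source insists on an unchanged table length.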

Step S34, segmenting the ancient medical record corpus cleaned in step S2 into paragraphs, presetting a paragraph length threshold and/or a paragraph number threshold, and taking the paragraph texts that satisfy the threshold(s) as the training corpus of the language model.

Preferably, the segmentation of the paragraphs in the ancient medical record corpus of traditional Chinese medical science is performed according to the rules of 222.1 and 222.2. Preferably, the paragraph length threshold is 150, and the paragraph number threshold is 3.
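The source does not spell out how the two thresholds are applied. One possible reading, sketched here purely as an assumption, is that consecutive sentences are grouped into training paragraphs of at most 3 sentences and at most 150 characters. The name `build_paragraphs` and its parameters are hypothetical.

```python
def build_paragraphs(sentences, max_length=150, max_sentences=3):
    """Group consecutive sentences into training paragraphs.

    Assumed interpretation: a paragraph is closed as soon as adding the next
    sentence would exceed max_sentences sentences or max_length characters.
    A single sentence longer than max_length is kept as its own paragraph.
    """
    paragraphs = []
    current = []
    for sentence in sentences:
        candidate_length = sum(len(s) for s in current) + len(sentence)
        if current and (len(current) >= max_sentences or candidate_length > max_length):
            paragraphs.append("".join(current))
            current = []
        current.append(sentence)
    if current:
        paragraphs.append("".join(current))
    return paragraphs
```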

Step S35, taking Google's BERT as an example, the word segmentation method in the BERT source code is replaced with a rule that splits text into individual characters, and, starting from Google's open-source Chinese language model, the BERT pre-training method is used to pre-train a language model oriented to ancient traditional Chinese medicine medical records on the training corpus segmented in step S34.
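A character-level segmentation rule of the kind described here can be sketched in a few lines. This standalone sketch does not reproduce or patch the BERT source code; the function names and the fallback to an "[UNK]" entry are assumptions for illustration.

```python
def char_tokenize(text):
    """Split text into individual characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def tokens_to_ids(tokens, vocab, unknown_token="[UNK]"):
    """Map characters to vocabulary indices, using unknown_token as fallback.

    vocab is a list of entries in file order; unknown_token is assumed to be
    present in it.
    """
    index = {token: i for i, token in enumerate(vocab)}
    return [index.get(token, index[unknown_token]) for token in tokens]
```

With this rule every Chinese character, including the rare characters merged into the table in step S33, becomes exactly one token, so no character is split into sub-pieces or grouped with its neighbours.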

And step S4, carrying out sequence labeling on the linguistic data based on the cleaned traditional Chinese medical ancient book medical record linguistic data obtained in the step S2 to form a training set of a subsequent model.

Further, the subsequent model training set is formed by performing sequence labeling in a BIOES form on the ancient Chinese medical record corpus. The labels here are manual labels.

Further, the ancient traditional Chinese medicine medical record corpus is labeled in the BIOES form, wherein: B (Begin) marks the first character of an entity; I (Intermediate) marks a middle character of an entity; E (End) marks the last character of an entity; S (Single) marks an entity consisting of a single character; and O (Other) marks characters that do not belong to any entity. Specifically, the labeling comprises the following steps:

step S41, selecting an entity identification type.

The named entities involved in ancient medical records of traditional Chinese medicine are various in types, and the preferred entity types in this embodiment are respectively: symptom (ZZ), pulse (MX), tongue (SX), Chinese medicine (ZY), dosage (JL) and prescription (FJ). In practical applications, the entity types may be added or deleted according to the specific text of the ancient medical records of traditional Chinese medicine or other needs, and the above examples do not limit the selection of the entity types.
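To make the BIOES scheme concrete for these entity types, the sketch below turns entity spans into per-character tags. The tag spelling that joins position and type with a hyphen (for example `B-ZZ`) and the span representation are assumptions for illustration; the source defines only the five position labels and the six type codes.

```python
ENTITY_TYPES = {"ZZ", "MX", "SX", "ZY", "JL", "FJ"}


def encode_bioes(sentence, entities):
    """Return one BIOES tag per character of the sentence.

    entities is a list of (start, end, entity_type) tuples, where end is
    exclusive. The "B-ZZ" style tag format is an assumed convention.
    """
    tags = ["O"] * len(sentence)
    for start, end, entity_type in entities:
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        if not 0 <= start < end <= len(sentence):
            raise ValueError(f"invalid span: ({start}, {end})")
        if end - start == 1:
            tags[start] = f"S-{entity_type}"      # single-character entity
        else:
            tags[start] = f"B-{entity_type}"      # first character
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{entity_type}"      # middle characters
            tags[end - 1] = f"E-{entity_type}"    # last character
    return tags
```

For instance, calling `encode_bioes("口苦耳聋，脉弦", [(0, 2, "ZZ"), (2, 4, "ZZ"), (5, 7, "MX")])` tags two separate symptoms followed by one pulse entity, with the comma tagged `O`.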

Step S42, appointing marking rules.

Named entities in ancient Chinese medical records have distinct field characteristics, particularly the entities with 'symptoms' are the most prominent, so that in the labeling process, general labeling rules need to be modified according to the characteristics of the named entities. In ancient Chinese medical records, the symptom name generally consists of the disease location, disease condition and disease nature, such as "black tongue and burnt lips". However, due to the liberty of the physician when recording, the symptom names appear in a variety of forms, including:

42.1, several symptoms appear side by side without punctuation marks separating them. In this case, it is agreed that if the symptoms are in a parallel relation, they are labeled separately; for example, in "bitter taste and deafness", each is taken as an independent symptom name. If the symptoms are in a progressive relation, they are labeled as one symptom; for example, "abdominal pain and diarrhea" is taken as a single symptom name.

42.2, several disease locations appear side by side, separated by punctuation marks. In this case, it is agreed that only the last disease location is retained; for example, in "chest pain and abdominal pain", only "abdominal pain" is labeled as a symptom.

Step S43, randomly selecting a sentence set with a preset scale from the cleaned corpus, writing the sentence set into the file to be labeled according to character separation, and separating the sentences by an empty line.

Preferably, the sentence sets of the predetermined size each contain 5000 sentences, which account for 5% of the total corpus text.
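The sampling and file layout of step S43 can be sketched as follows: one character per line, with an empty line between sentences. The function names and the fixed random seed are assumptions for illustration, and the sketch returns the file content as a string rather than writing to a particular path.

```python
import random


def sample_sentences(sentences, sample_size, seed=0):
    """Randomly pick sample_size sentences (or all of them if fewer exist)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(sentences, min(sample_size, len(sentences)))


def format_for_labeling(sentences):
    """Lay out sentences one character per line, separated by an empty line."""
    blocks = ["\n".join(sentence) for sentence in sentences]
    return "\n\n".join(blocks) + "\n"
```

An annotator can then add a tag next to each character, which yields the usual column format for sequence labeling data.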

For the preferred entity types, namely symptom (ZZ), pulse (MX), tongue (SX), Chinese medicine (ZY), dosage (JL) and prescription (FJ), examples of manual labeling are shown in Table 2:

Table 2 Example of manual annotation of a selected set of sentences

And step S5, based on the sequence labeled model training set obtained in step S4, taking the language model in step S3 as a coding layer, taking a preset neural network structure as a decoding layer, and training a corresponding sequence labeled model.

Preferably, the preset neural network structure is BiLSTM, and/or CRF, and/or Softmax.

Further, the training of the sequence labeling model is performed based on the model training set of the sequence labeling in this step, which specifically includes the following steps:

and step S51, dividing a training set, a verification set and a test set.

Preferably, in this step, the training set labeled in step S4 is divided evenly into 10 parts, each guaranteed to contain a roughly equal number of sentences for each of the six entity types. In each training run, 1 part is selected as the verification set, 1 part as the test set, and the remaining 8 parts as the training set.
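The 10-part division with a rotating verification and test part can be sketched as below. The function names are hypothetical, the choice of the part following the test part as verification set is an assumption, and the balancing of entity types across parts that the text calls for is deliberately left out to keep the sketch short.

```python
def make_folds(sentences, n_folds=10):
    """Deal sentences into n_folds parts of nearly equal size (round robin)."""
    return [sentences[i::n_folds] for i in range(n_folds)]


def rotate_splits(folds):
    """Yield (train, verification, test) for every choice of test part.

    Assumption: the part after the test part serves as the verification set;
    all remaining parts are concatenated into the training set.
    """
    n = len(folds)
    for test_index in range(n):
        verification_index = (test_index + 1) % n
        train = [
            sentence
            for i, fold in enumerate(folds)
            if i not in (test_index, verification_index)
            for sentence in fold
        ]
        yield train, folds[verification_index], folds[test_index]
```

Iterating over all ten splits means every labeled sentence ends up in a test part exactly once, so the model's predictions on the test parts together cover the whole labeled set.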

And step S52, training the sequence labeling model to form an automatic labeling result of the model.

Preferably, in this step, the language model pre-trained in step S3 is used as the coding layer and a neural network structure in the form of "BiLSTM + CRF" is used as the decoding layer. Model training is performed with the training set, verification set and test set divided in step S51, and the training is repeated with a different part selected as the test set each time until the whole manually labeled set has been covered. Finally, the labeling results of the model are collected to form the automatic labeling result of the model.

And step S53, comparing the automatic labeling result with the manual labeling result, filtering out the inconsistent labeling sequences, manually correcting the filtered labeling results, and finally writing the corrected results back to the training set.
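The comparison in step S53 amounts to finding the sentences whose automatic tag sequence differs from the manual one. A minimal sketch, with a hypothetical function name and assuming both results are lists of per-sentence tag lists in the same order:

```python
def find_disagreements(manual_tags, automatic_tags):
    """Return the indices of sentences whose two tag sequences differ."""
    if len(manual_tags) != len(automatic_tags):
        raise ValueError("both results must cover the same sentences")
    return [
        index
        for index, (manual, automatic) in enumerate(zip(manual_tags, automatic_tags))
        if manual != automatic
    ]
```

Only the sentences at the returned indices need to be reviewed by hand before the corrected labels are written back to the training set.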

Step S54, judging whether the labeling requirement is satisfied, if so, outputting the final labeling result; if not, the process returns to step S51.

Further, the judging whether the labeling requirement is met includes: whether the precision of the model reaches a preset threshold value or whether the automatic labeling result of the model is completely consistent with the manual labeling result. The preset threshold here may be a peak of the precision.
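The loop formed by steps S51 to S54 can be outlined as follows. The sketch implements only the "completely consistent" stopping criterion; the callbacks `train_and_label` and `review`, as well as the `max_rounds` cap, are hypothetical stand-ins for the model training and the manual correction, which the sketch does not implement.

```python
def refine_labels(manual_tags, train_and_label, review, max_rounds=5):
    """Repeat training and correction until both labelings agree.

    train_and_label(labels) -> automatic tag sequences for all sentences
    review(manual, automatic) -> corrected tag sequence for one sentence
    Returns the final labels and the number of rounds that were run.
    """
    labels = list(manual_tags)
    for round_number in range(1, max_rounds + 1):
        automatic_tags = train_and_label(labels)
        mismatched = [
            i for i, (manual, automatic) in enumerate(zip(labels, automatic_tags))
            if manual != automatic
        ]
        if not mismatched:
            return labels, round_number  # labeling requirement satisfied
        for i in mismatched:
            # Write the manually corrected result back to the training set.
            labels[i] = review(labels[i], automatic_tags[i])
    return labels, max_rounds
```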

Further, based on the trained sequence labeling model, the entity recognition of the traditional Chinese medical ancient medical record specifically comprises the following steps:

and step S61, automatically labeling the ancient Chinese medical record corpus to be identified based on the trained sequence labeling model.

Step S62, converting the character-level labels into word-level labels according to the general rules of the BIOES labeling scheme, i.e. separating the entities from the labeling results.
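Separating entities from a BIOES-tagged sentence can be sketched as below. The `B-ZZ` style tag spelling and the decision to silently drop malformed sequences (for example an `I` tag without a preceding `B`) are assumptions for illustration, not rules stated in the source.

```python
def decode_bioes(characters, tags):
    """Collect (entity_text, entity_type) pairs from per-character BIOES tags."""
    entities = []
    start, current_type = None, None
    for index, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            entities.append((characters[index], entity_type))
            start, current_type = None, None
        elif prefix == "B":
            start, current_type = index, entity_type
        elif prefix == "I" and start is not None and entity_type == current_type:
            continue  # still inside the current entity
        elif prefix == "E" and start is not None and entity_type == current_type:
            entities.append(("".join(characters[start:index + 1]), entity_type))
            start, current_type = None, None
        else:
            # "O" or a malformed sequence: discard any unfinished entity.
            start, current_type = None, None
    return entities
```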

Likewise, for the preferred entity types, namely symptom (ZZ), pulse (MX), tongue (SX), Chinese medicine (ZY), dosage (JL) and prescription (FJ), examples of the entities separated from the labeling results are shown in Table 3.

Table 3 separate annotation results example

According to the technical scheme of the embodiment, the traditional Chinese medicine named entity recognition method based on ancient traditional Chinese medicine medical records builds on an existing pre-trained language model, such as Google's BERT, and trains on a small-sample training set, so that little labeled data is required and the manual labeling cost is greatly reduced. The resulting named entity recognition method allows ancient traditional Chinese medicine literature to be used more effectively, improves the effect and accuracy of named entity recognition in the field of traditional Chinese medicine, and lays a good data foundation for subsequent applications in that field.

Second embodiment

The embodiment provides a traditional Chinese medicine named entity recognition system based on ancient traditional Chinese medicine literature, which comprises: a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set labeling module, a sequence labeling model training module and an entity identification module.

The named entity recognition system of the present embodiment based on ancient Chinese medical literature corresponds to the named entity recognition method of the first embodiment, and therefore, the corresponding description of the first embodiment is also applicable to the named entity recognition system of the present embodiment, and is not repeated herein.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A traditional Chinese medicine named entity recognition method based on traditional Chinese medicine ancient book literature is characterized by comprising the following steps:

step S1, acquiring traditional Chinese medicine ancient medical record corpus to be processed;

step S3, pre-training a language model facing the ancient Chinese medical record corpus based on the ancient Chinese medical record corpus obtained in the step S2; the method specifically comprises the following steps:

step S35, replacing the word segmentation method in the source code with a segmentation rule that splits text character by character, and, based on the Chinese language training model, using the downloaded language model pre-training method to pre-train a language model oriented to traditional Chinese medicine ancient medical records on the training corpus of the language training model obtained in step S34;

step S4, based on the cleaned Chinese medical ancient book medical records corpus obtained in step S2, carrying out sequence labeling on the corpus to form a training set of a subsequent model; the sequence labeling specifically comprises the following steps:

step S41, selecting entity identification type;

step S42, agreeing on a labeling rule: when several symptoms appear side by side with no punctuation separating them, the symptoms are labeled separately if they are in a parallel relation, and are labeled as one symptom if they are in a progressive relation; when several disease sites appear side by side and are separated by punctuation marks, only the last disease site is retained;

step S44, manually labeling the selected sentence set with the preset scale based on the selected entity recognition type and the agreed labeling rule;

step S5, based on the model training set of the sequence annotation obtained in step S4, the language model in step S3 is used as a coding layer, a preset neural network structure is used as a decoding layer, and a corresponding sequence annotation model is trained; the method specifically comprises the following steps:

step S51, dividing a training set, a verification set and a test set;

step S52, training the sequence labeling model to produce automatic labeling results;

step S53, comparing the automatic labeling result with the manual labeling result, filtering out the inconsistent labeling sequence, manually correcting the filtered labeling result, and finally writing the corrected result back to the training set;

step S54, judging whether the labeling requirement is satisfied, if so, outputting the final labeling result; if not, returning to the step S51;
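Steps S51 to S53 can be illustrated with a minimal sketch. It is not the patented implementation: the 80/10/10 split ratio, the BIO-style tag strings and the helper names `split_dataset`, `find_disagreements` and `write_back` are assumptions introduced only for illustration, and the model training of step S52 is omitted.

```python
import random


def split_dataset(sentences, ratios=(0.8, 0.1, 0.1), seed=0):
    """Step S51 (sketch): shuffle and split into training, validation and test sets.

    The 80/10/10 ratio is an assumption; the source does not state one.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def find_disagreements(manual_tags, auto_tags):
    """Step S53 (sketch): indices of sentences whose automatic label
    sequence differs from the manual label sequence."""
    return [i for i, (m, a) in enumerate(zip(manual_tags, auto_tags)) if m != a]


def write_back(training_tags, corrections):
    """Step S53 (sketch): write manually corrected sequences back to the training set.

    `corrections` maps a sentence index to its corrected label sequence.
    """
    updated = list(training_tags)
    for index, tags in corrections.items():
        updated[index] = tags
    return updated


# Hypothetical label sequences for two sentences (BIO tags are an assumption).
manual = [["B-ZZ", "I-ZZ", "O"], ["O", "B-ZY", "I-ZY"]]
auto = [["B-ZZ", "I-ZZ", "O"], ["O", "B-ZY", "O"]]
flagged = find_disagreements(manual, auto)
```

Only the flagged sentences would be passed to a human reviewer, which is how the loop keeps the manual effort small; the loop repeats from step S51 until the labeling requirement of step S54 is judged to be met.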

2. The method for identifying named entities in traditional Chinese medicine according to claim 1, wherein the step S1 of obtaining ancient medical records corpus in traditional Chinese medicine comprises the following steps:

3. The method for recognizing named entities in traditional Chinese medicine according to claim 1, wherein step S2, performing data cleaning on the traditional Chinese medicine ancient medical records to be processed, comprises the following steps:

step S21, correcting wrongly written characters;

step S22, filtering out irrelevant statements.
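The two cleaning steps can be sketched as follows. The correction table and the patterns for irrelevant statements are hypothetical examples, since the source does not list them; in practice both would be compiled manually for the corpus at hand.

```python
import re

# Hypothetical correction table: wrongly written string -> intended string.
CORRECTIONS = {"甘革": "甘草"}

# Hypothetical patterns for statements unrelated to the medical record itself,
# such as page markers or copyright notices.
IRRELEVANT_PATTERNS = [
    re.compile(r"^第\s*\d+\s*页$"),
    re.compile(r"^(目录|版权所有)"),
]


def clean_corpus(lines):
    """Correct wrongly written characters (S21), then drop irrelevant lines (S22)."""
    cleaned = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        for wrong, right in CORRECTIONS.items():
            text = text.replace(wrong, right)
        if any(pattern.search(text) for pattern in IRRELEVANT_PATTERNS):
            continue
        cleaned.append(text)
    return cleaned


sample = ["第 3 页", "炙甘革三钱", "   ", "版权所有"]
result = clean_corpus(sample)
```

With the sample input above, only the corrected medical record line survives cleaning.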

4. The method for recognizing named entities according to claim 1, wherein in step S3, the language model is pre-trained using BERT, the language model pre-training method proposed by Google.

5. The method for recognizing named entities according to claim 4, wherein the language model is pre-trained, specifically comprising:

6. The method for recognizing named entities in traditional Chinese medicine according to claim 1, wherein the entity recognition types are: symptoms ZZ, pulse MX, tongue SX, traditional Chinese medicine ZY, dosage JL and prescription FJ.
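As an illustration of how the six entity types could be used in per-character sequence labeling, the sketch below builds a tag set and recovers entity spans from a labeled sentence. The BIO scheme (`B-`, `I-`, `O`) is an assumption; the claim names only the type codes. The function names and the example sentence are likewise hypothetical.

```python
# Type codes as listed in the claim; the English glosses are for readability only.
ENTITY_TYPES = {
    "ZZ": "symptoms",
    "MX": "pulse",
    "SX": "tongue",
    "ZY": "traditional Chinese medicine",
    "JL": "dosage",
    "FJ": "prescription",
}


def build_tag_set(entity_types):
    """Return the BIO tag list: 'O' plus a B- and I- tag per entity type."""
    tags = ["O"]
    for code in entity_types:
        tags += [f"B-{code}", f"I-{code}"]
    return tags


def extract_entities(chars, tags):
    """Collect (text, type) spans from per-character BIO tags."""
    entities = []
    current, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)
        else:
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


chars = list("脉弦，甘草三钱")
tags = ["B-MX", "I-MX", "O", "B-ZY", "I-ZY", "B-JL", "I-JL"]
entities = extract_entities(chars, tags)
```

Labeling one tag per character matches the character-by-character segmentation used elsewhere in the method, so no word boundaries need to be decided before labeling.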

7. A traditional Chinese medicine named entity recognition system based on traditional Chinese medicine ancient book literature, characterized by comprising: a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set labeling module, a sequence labeling model training module and an entity identification module; wherein:

the corpus acquisition module is used for acquiring traditional Chinese medical ancient book medical record corpuses to be processed;

the language model pre-training module is used for pre-training a language model oriented to the traditional Chinese medicine ancient medical record corpus, based on that corpus; and further for:

downloading the source code of the language model pre-training method and the Chinese language training model; manually compiling a table of characters related to traditional Chinese medicine ancient medical records, comparing it with the Chinese character table in the source code, and extracting a table of rare characters from the field of traditional Chinese medicine; merging the rare character table into the Chinese character table by replacing characters with low use frequency in the source code with characters from the rare character table, so that the length of the Chinese character table remains unchanged; segmenting the cleaned traditional Chinese medicine ancient book medical record corpus into paragraphs, presetting a paragraph length threshold and/or a paragraph number threshold, and taking paragraph texts that exceed the paragraph length threshold and/or the paragraph number threshold as the training corpus of the language training model; replacing the word segmentation method in the source code with a segmentation rule that splits text character by character, and, based on the Chinese language training model, using the downloaded language model pre-training method to pre-train a language model oriented to traditional Chinese medicine ancient medical records on the training corpus of the language training model;

the training set labeling module is used for carrying out sequence labeling on the cleaned traditional Chinese medicine ancient book medical record corpus to form a training set for the subsequent model; and further for:

selecting the entity identification types; agreeing on a labeling rule: when several symptoms appear side by side with no punctuation separating them, the symptoms are labeled separately if they are in a parallel relation, and are labeled as one symptom if they are in a progressive relation; when several disease sites appear side by side and are separated by punctuation marks, only the last disease site is retained; randomly selecting a sentence set of a preset size from the cleaned corpus, writing it character by character into a file to be labeled, with sentences separated by an empty line; manually labeling the selected sentence set based on the selected entity identification types and the agreed labeling rule;

the sequence labeling model training module is used for training a corresponding sequence labeling model based on the sequence-labeled training set, with the language model as the encoding layer and a preset neural network structure as the decoding layer; and further for: dividing the data into a training set, a validation set and a test set; training the sequence labeling model to produce automatic labeling results; comparing the automatic labeling results of the model with the manual labeling results, filtering out the inconsistent label sequences, manually correcting the filtered labeling results, and writing the corrected results back to the training set; judging whether the labeling requirements are met and, if so, outputting the final labeling result; if not, returning to the division of the data set;

the entity identification module is used for carrying out entity identification on the traditional Chinese medical ancient medical records based on the sequence labeling model.
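The character table merge and the character-by-character segmentation performed by the language model pre-training module can be sketched as below. This is an illustrative reading, not the patented code: the frequency dictionary `char_freq`, the toy vocabulary and the function names are assumptions, and the source does not say how use frequency is measured.

```python
def merge_vocab(base_vocab, char_freq, rare_chars):
    """Replace the least frequent single-character entries of `base_vocab`
    with domain characters it lacks, keeping the vocabulary length unchanged.

    Multi-character entries such as special tokens are never replaced.
    """
    vocab = list(base_vocab)
    existing = set(vocab)
    # Rare characters not yet in the vocabulary, without duplicates.
    new_chars = [c for c in dict.fromkeys(rare_chars) if c not in existing]
    # Candidate slots: single-character entries, least frequent first.
    slots = sorted(
        (i for i, token in enumerate(vocab) if len(token) == 1),
        key=lambda i: char_freq.get(vocab[i], 0),
    )
    for slot, ch in zip(slots, new_chars):
        vocab[slot] = ch
    return vocab


def char_tokenize(text):
    """Split text character by character, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


# Toy data: the frequencies and characters are hypothetical.
base_vocab = ["[PAD]", "[UNK]", "的", "龘", "鱻", "脉"]
char_freq = {"的": 1000, "脉": 50}
rare_chars = ["癥", "瘕", "脉"]
merged = merge_vocab(base_vocab, char_freq, rare_chars)
tokens = char_tokenize("脉 弦\n数")
```

Keeping the table length unchanged means the embedding matrix of the downloaded Chinese model keeps its shape; the replaced positions simply take on new characters during the domain pre-training.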