CN115310446A - Traditional Chinese medicine ancient book named entity identification method and device, electronic equipment and memory - Google Patents


Info

Publication number
CN115310446A
CN115310446A (application CN202210928069.5A)
Authority
CN
China
Prior art keywords
module
training
chinese medicine
traditional chinese
ancient book
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210928069.5A
Other languages
Chinese (zh)
Inventor
晏峻峰 (Yan Junfeng)
沈蓉蓉 (Shen Rongrong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Chinese Medicine
Original Assignee
Hunan University of Chinese Medicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Chinese Medicine
Priority to CN202210928069.5A
Publication of CN115310446A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a method and an apparatus for identifying named entities in traditional Chinese medicine ancient books, an electronic device and a storage medium, belonging to the technical fields of information processing and traditional Chinese medicine literature. The method comprises the following steps: acquiring a traditional Chinese medicine ancient book text and preprocessing it to obtain a traditional Chinese medicine ancient book text sequence; constructing a named entity recognition model for traditional Chinese medicine ancient books, the model comprising an embedding layer, an ELECTRA module, a BiLSTM module and a CRF module connected in sequence; in the pre-training stage, considering that the pre-training data are scarce, adding a semi-fixed random mask mechanism to the model; and in the fine-tuning stage, introducing character radicals to enrich the text information. The method can improve the recognition of the prescription, Chinese medicine, syndrome, disease and symptom-manifestation entities in traditional Chinese medicine ancient books.

Description

Traditional Chinese medicine ancient book named entity identification method and device, electronic equipment and memory
Technical Field
The present application relates to the field of information processing and traditional Chinese medicine literature technologies, and in particular, to a method and an apparatus for identifying a named entity of an ancient Chinese medicine book, an electronic device, and a memory.
Background
With the accelerating digitization of traditional Chinese medicine, the vast body of ancient Chinese medical data and classics has become an important basis of traditional Chinese medicine research; at the same time, the ever-growing volume of text literature and medical diagnostic data has left traditional literature research methods inadequate. In an age of rapid data growth, how to make full use of text and technical resources to mine the internal relationships among traditional Chinese medicine entities is one of the important propositions for scientific research and development in the traditional Chinese medicine field.
Combining named entity recognition with the traditional Chinese medicine field is a current trend, but progress on named entity recognition in this field has been slow, and results are especially scarce for traditional Chinese medicine ancient books. Existing named entity recognition methods based on pre-trained models train with a random mask mechanism; because training corpora in the traditional Chinese medicine ancient book field are scarce, while function words, loan characters and the like are numerous in the ancient texts, random masking cannot guarantee training quality, so the named entity recognition of traditional Chinese medicine ancient books cannot meet practical requirements.
Disclosure of Invention
Accordingly, in view of the above technical problems, there is a need to provide a method, an apparatus, an electronic device and a memory for identifying named entities in traditional Chinese medicine ancient books.
A traditional Chinese medicine ancient book named entity identification method comprises the following steps:
obtaining a traditional Chinese medicine ancient book text, and preprocessing it to obtain a traditional Chinese medicine ancient book text sequence;
constructing a named entity recognition model for traditional Chinese medicine ancient books, the model comprising an embedding layer, an ELECTRA module, a BiLSTM module and a CRF module connected in sequence;
in the pre-training stage, newly adding a keyword mask mechanism to the ELECTRA module, the keyword mask mechanism adopting a semi-fixed random mask mode to randomly select keywords from a preset keyword table for masking;
pre-training the ELECTRA module, after the embedding layer and the newly added keyword mask mechanism, with the traditional Chinese medicine ancient book text sequence to obtain the pre-trained ELECTRA module;
in the fine-tuning stage, newly adding radical features of the characters to the embedding coding module of the named entity recognition model to obtain a new embedding layer;
performing fine-tuning training on the named entity recognition model of the fine-tuning stage with fine-tuning training samples to obtain a trained named entity recognition model, the fine-tuning-stage model comprising the new embedding layer, the pre-trained ELECTRA module, the BiLSTM module and the CRF module; and
identifying the traditional Chinese medicine ancient book named entities with the trained named entity recognition model.
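As a sketch of the semi-fixed random keyword mask described above (the function and variable names here are illustrative, not from the patent), the masked positions can be drawn preferentially from tokens that appear in the preset keyword table:

```python
import random

def keyword_mask(tokens, keyword_table, mask_ratio=0.15, seed=0):
    """Semi-fixed random mask: mask about 15% of tokens, drawing the masked
    positions randomly from tokens that appear in the preset keyword table,
    and falling back to other positions only if keyword hits are too few."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    keyword_pos = [i for i, t in enumerate(tokens) if t in keyword_table]
    chosen = set(rng.sample(keyword_pos, min(n_mask, len(keyword_pos))))
    if len(chosen) < n_mask:
        rest = [i for i in range(len(tokens)) if i not in chosen]
        chosen.update(rng.sample(rest, n_mask - len(chosen)))
    return [("[MASK]" if i in chosen else t) for i, t in enumerate(tokens)]
```

The "semi-fixed" character comes from the fact that which keywords are masked is random, but the candidate pool is fixed by the keyword table.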
In one embodiment, pre-training the ELECTRA module after the embedding layer and the newly added keyword mask mechanism with the traditional Chinese medicine ancient book text sequence to obtain the pre-trained ELECTRA module comprises the following steps:
setting a pre-training step-number division parameter;
dividing the total number of pre-training steps, according to the division parameter, into steps trained in the conventional manner and steps trained with the keyword mask mechanism;
training the ELECTRA module with its ordinary random mask mechanism on the codes obtained by feeding the traditional Chinese medicine ancient book text sequence into the embedding layer, until the number of training steps reaches the number allotted to conventional training, obtaining the conventionally trained ELECTRA module; and
continuing to train the conventionally trained ELECTRA module with the keyword mask mechanism on the same codes, until the number of training steps reaches the number allotted to the keyword mask mechanism, obtaining the pre-trained ELECTRA module; the keyword mask mechanism randomly masks 15% of tokens, where each masked token belongs to the preset keyword table.
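The step division above can be sketched as follows (a minimal illustration; the function names are assumptions, only the division parameter itself appears in the patent):

```python
def split_pretraining_steps(num_train_steps, divide_portion):
    """Split the total pre-training steps into a conventional random-mask
    phase followed by a keyword-mask phase, per the step-number division
    parameter."""
    conventional = int(num_train_steps * divide_portion)
    return conventional, num_train_steps - conventional

def training_schedule(num_train_steps, divide_portion):
    """Label each global step with the mask mechanism it uses:
    random masking first, keyword masking afterwards."""
    conv, kw = split_pretraining_steps(num_train_steps, divide_portion)
    return ["random_mask"] * conv + ["keyword_mask"] * kw
```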
In one embodiment, the step of constructing the preset keyword table comprises:
extracting the keyword features of the prescription, Chinese medicine, disease, syndrome and symptom-manifestation entities in the traditional Chinese medicine ancient books, and determining the keywords corresponding to each entity class;
extracting words appearing with frequency greater than or equal to 1 from the term dictionary as term-dictionary keywords; and
constructing the preset keyword table from the term-dictionary keywords and the keywords corresponding to each entity class.
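A minimal sketch of this table construction, assuming character-level frequency counting over the term dictionary (names are illustrative, not the patent's code):

```python
from collections import Counter

def build_keyword_table(entity_keywords, term_dictionary, min_freq=1):
    """Build the preset keyword table from (a) keywords summarized for each
    entity class and (b) characters extracted from the term dictionary whose
    occurrence frequency is >= min_freq (the patent uses >= 1)."""
    freq = Counter()
    for term in term_dictionary:
        freq.update(term)  # character-level frequency across all terms
    table = {ch for ch, c in freq.items() if c >= min_freq}
    for kws in entity_keywords.values():
        table.update(kws)
    return table
```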
In one embodiment, performing fine-tuning training on the named entity recognition model of the fine-tuning stage with the fine-tuning training samples to obtain the trained named entity recognition model comprises:
inputting the fine-tuning training samples into the new embedding layer of the fine-tuning-stage model, acquiring the word embedding information, the word position information and the encoding information of the sentence containing each word through a lookup operation, and acquiring the radical code of each Chinese character through a character-to-radical index lookup;
inputting the word embedding information, the word position information, the sentence encoding information and the radical codes into the pre-trained ELECTRA module of the fine-tuning-stage model to obtain the embedded representation of each word;
inputting the embedded representations into the BiLSTM module of the fine-tuning-stage model for feature extraction, and inputting the extracted features into the CRF module for decoding to obtain a predicted tag sequence; and
determining the fine-tuning-stage model loss from the predicted tag sequence and the corresponding tokens, and performing fine-tuning training on the fine-tuning-stage model according to this loss and the labels of the fine-tuning training samples, obtaining the trained named entity recognition model.
In one embodiment, taking the obtained labeled traditional Chinese medicine ancient book text sequence as the fine-tuning training sample comprises the following steps:
acquiring the traditional Chinese medicine ancient book text;
correcting the traditional Chinese medicine ancient book text, the correction including: correcting wrongly written characters, replacing traditional-form characters, and amending ambiguous characters;
manually annotating the corrected traditional Chinese medicine ancient book text with 5 classes of entities, the 5 classes being: prescription, Chinese medicine, syndrome, disease and symptom manifestation;
dividing the annotated traditional Chinese medicine ancient book texts in a preset proportion to obtain a training set, a validation set and a test set;
taking Chinese full stops, exclamation marks and question marks as segmentation points, performing sequence segmentation on each divided data set, and keeping the segmented sequences of similar length while below the preset maximum sequence length;
labeling the results of the sequence segmentation in BIO format to obtain entity-labeled traditional Chinese medicine ancient book text sequences; and
taking the entity-labeled traditional Chinese medicine ancient book text sequences as the fine-tuning training samples.
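The segmentation and BIO-labeling steps above can be sketched as follows (a minimal illustration; the entity span format `(start, end, type)` is a hypothetical choice for this sketch):

```python
import re

def split_sequences(text, max_len=64):
    """Cut at Chinese full stops, exclamation marks and question marks
    (keeping each mark with its clause), then merge neighbouring clauses so
    segment lengths stay similar while below max_len."""
    clauses = [c for c in re.split(r"(?<=[。！？])", text) if c]
    segments, buf = [], ""
    for c in clauses:
        if buf and len(buf) + len(c) > max_len:
            segments.append(buf)
            buf = ""
        buf += c
    if buf:
        segments.append(buf)
    return segments

def bio_tags(sequence, entities):
    """BIO-label a character sequence given (start, end, type) entity spans."""
    tags = ["O"] * len(sequence)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags
```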
A traditional Chinese medicine ancient book named entity recognition device, the device comprising:
a traditional Chinese medicine ancient book text sequence determining module, used for acquiring a traditional Chinese medicine ancient book text and preprocessing it to obtain a traditional Chinese medicine ancient book text sequence;
a named entity recognition model building module, used for building the named entity recognition model of the traditional Chinese medicine ancient books, the model comprising an embedding layer, an ELECTRA module, a BiLSTM module and a CRF module connected in sequence;
an ELECTRA pre-training module, used for adding the keyword mask mechanism to the ELECTRA module in the pre-training stage, the keyword mask mechanism adopting a semi-fixed random mask mode to randomly select keywords from the preset keyword table for masking, and for pre-training the ELECTRA module after the embedding layer and the newly added keyword mask mechanism with the traditional Chinese medicine ancient book text sequence to obtain the pre-trained ELECTRA module;
a named entity recognition model fine-tuning module, used for newly adding the radical features of the characters to the embedding coding module of the model in the fine-tuning stage to obtain a new embedding layer; taking the obtained labeled traditional Chinese medicine ancient book text sequence as the fine-tuning training sample; and performing fine-tuning training on the fine-tuning-stage model with the samples to obtain the trained named entity recognition model, the fine-tuning-stage model comprising the new embedding layer, the pre-trained ELECTRA module, the BiLSTM module and the CRF module; and
a traditional Chinese medicine ancient book named entity recognition module, used for identifying the traditional Chinese medicine ancient book named entities with the trained named entity recognition model.
In one embodiment, the ELECTRA pre-training module is further configured to set a pre-training step-number division parameter; divide the total number of pre-training steps, according to the parameter, into steps trained in the conventional manner and steps trained with the keyword mask mechanism; train the ELECTRA module with its ordinary random mask mechanism on the codes obtained by feeding the traditional Chinese medicine ancient book text sequence into the embedding layer until the conventional training step count is reached, obtaining the conventionally trained ELECTRA module; and continue training the conventionally trained ELECTRA module with the keyword mask mechanism until its allotted step count is reached, obtaining the pre-trained ELECTRA module; the keyword mask mechanism randomly masks 15% of tokens, where each masked token belongs to the preset keyword table.
In one embodiment, the ELECTRA pre-training module comprises a preset-keyword-table construction module, used for extracting the keyword features of the prescription, Chinese medicine, disease, syndrome and symptom-manifestation entities in the traditional Chinese medicine ancient books and determining the keywords corresponding to each entity class; extracting words with occurrence frequency greater than or equal to 1 from the term dictionary as term-dictionary keywords; and constructing the preset keyword table from the term-dictionary keywords and the keywords corresponding to each entity class.
An electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above.
In the above traditional Chinese medicine ancient book named entity identification method, apparatus, electronic device and storage medium, the method comprises: acquiring a traditional Chinese medicine ancient book text and preprocessing it to obtain a traditional Chinese medicine ancient book text sequence; constructing a named entity recognition model comprising an embedding layer, an ELECTRA module, a BiLSTM module and a CRF module connected in sequence; in the pre-training stage, considering that ELECTRA's random mask mechanism cannot guarantee training quality when the pre-training data are scarce, newly adding a semi-fixed random mask mechanism to the model; and in the fine-tuning stage, introducing character radicals to enrich the text information. The method can improve the recognition of the prescription, Chinese medicine, syndrome, disease and symptom-manifestation entities in traditional Chinese medicine ancient books.
Drawings
FIG. 1 is a flow chart illustrating a method for identifying ancient Chinese medical book named entities in an embodiment;
FIG. 2 is a schematic diagram of a mechanism for introducing a keyword mask during a pre-training phase in another embodiment;
FIG. 3 is an embedding layer of the named entity recognition model of medical ancient book at the fine tuning stage in another embodiment;
FIG. 4 is a diagram illustrating an embodiment of entity index change;
FIG. 5 is a block diagram of an embodiment of a device for identifying ancient Chinese medical book named entities;
FIG. 6 is a diagram illustrating an internal structure of an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, there is provided a method for identifying named entities in traditional Chinese medicine ancient books, comprising the following steps:
step 100: and acquiring an ancient Chinese medicine book text, and preprocessing the ancient Chinese medicine book text to obtain a Chinese medicine ancient book text sequence.
Specifically, the ancient Chinese medicine book text is a ancient Chinese medicine book document selected from three periods of Ming, qing and Ming nations.
The preprocessing mainly comprises the steps of performing character correction, data segmentation and the like on the Chinese and medical ancient book text.
The Chinese medicine ancient book text sequence is a text beginning with an identifier [ cls ] and ending with [ seq ], such as: [ cls ] Sun yangming syndrome [ seq ].
The Chinese traditional medicine ancient book text sequence is unmarked text data.
Step 102: and constructing a named entity recognition model of the traditional Chinese medicine ancient book.
The named entity recognition model of the traditional Chinese medicine ancient book comprises an imbedding layer, an ELECTRA module, a BilSTM module and a CRF module which are sequentially connected.
Specifically, the imbedding layer acquires word embedding (Token imbedding) information through a look up operation; position information (Position entries) of a word and encoding information (Segment entries) of a sentence in which the word is located are acquired.
The ELECTRA module is a pre-training model and is mainly used for obtaining embedded representation of words. The training method of the model adopts a Fine-tuning-based method (Fine-tuning), compared with a characteristic method, the method can dynamically represent the character according to the context information of the character, and the problem of polysemy of a word is solved; compared with the common neural network model, the ELECTRA model has deep representation characteristics and stronger characterization capability.
BiLSTM has strong feature extraction ability on traditional Chinese medicine texts, so the invention introduces a BiLSTM module after the pre-trained model to obtain a better entity recognition effect.
The decoding layer adopts a CRF model. The feature functions of the CRF comprise state feature functions and transition feature functions: a state feature represents the state information of the current position, and a transition feature carries the information of moving from the previous state to the current state. These feature functions constrain the predicted output, ensuring that the predicted tag sequence conforms to basic dependency constraints; the predicted tag sequence is finally output.
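The decoding a CRF layer performs can be illustrated with a minimal Viterbi sketch that combines per-position emission (state-feature) scores with transition-feature scores (a pure-Python illustration, not the patent's implementation):

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring tag sequence.
    emissions[t][j]: score of label j at position t (state features);
    transitions[i][j]: score of moving from label i to label j."""
    n = len(labels)
    score = list(emissions[0])      # best path score ending in each label
    backpointers = []
    for em in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + em[j])
            ptr.append(best_i)
        score = new_score
        backpointers.append(ptr)
    j = max(range(n), key=lambda i: score[i])   # best final label
    path = [j]
    for ptr in reversed(backpointers):
        j = ptr[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]
```

Note how a strongly negative transition score vetoes a tag the emissions alone would prefer, which is exactly the constraint role the CRF plays after the BiLSTM.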
Step 104: in the pre-training stage, a keyword mask mechanism is newly added to the ELECTRA module; the keyword mask mechanism adopts a semi-fixed random mask mode to randomly select keywords from the preset keyword table for masking.
Specifically, in traditional Chinese medicine ancient books, function words are important constituents of the ancient literature. Taking the Shanbu Mingyi Fanglun as an example, 18 function words commonly seen in ancient texts, such as "jie", "zhan" and "ye" (也), account for 13.69% of the text; with a purely random mask mechanism, the model cannot systematically learn the keywords in the text. In the pre-training stage, considering that the random mask mechanism of the ELECTRA model cannot guarantee training quality when the pre-training data are scarce, a keyword mask mechanism is newly added to the ELECTRA pre-training stage. The mechanism adopts a semi-fixed random mask mode, randomly masking the words of the sequence that appear in the keyword table, so as to better learn the keyword features of the ancient texts; a schematic diagram of the model is shown in FIG. 2.
The keyword mask mechanism is placed after the conventional mask mechanism for two reasons: first, the training corpora are scarce while function words, loan characters and the like are numerous in traditional Chinese medicine ancient books, so a purely random mask mechanism cannot guarantee training quality; second, having the model learn the harder tokens in the later stage better matches training practice.
Step 106: and pre-training the electromedding layer and the ECTRA module after a new keyword mask mechanism is added by adopting a traditional Chinese medicine ancient book text sequence to obtain the pre-trained ELECTRRA module.
Specifically, considering that the Chinese ELECTRA model is trained on a Chinese Wikipedia corpus, that texts in the traditional Chinese medicine field differ greatly from general-domain texts, and that ancient and modern Chinese differ greatly in language, this study further pre-trains the model on traditional Chinese medicine ancient book texts. The literature is selected from three periods: the Ming dynasty, the Qing dynasty and the Republic of China; the books recorded for each period are shown in Table 1.
TABLE 1 Books included in pre-training

Serial number | Period | Number of books | Example books
1 | Qing dynasty | 14 | Complete Collection of Yizong Jinjian (except the Shanbu Mingyi Fanglun)
2 | Qing dynasty | 34 | Ailu Medical Records, Cheng Xingxuan's Medical Records, Ancient and Modern Medical Records
3 | Ming dynasty | 1 | Xue's Medical Records
4 | Republic of China | 11 | Ding Ganren's medical records, Yuan Yi, Hua Yun Lou medical records
Remark: the Complete Collection of Yizong Jinjian in the table excludes the Shanbu Mingyi Fanglun for the following reason: the Shanbu Mingyi Fanglun is the manually annotated data set, and during actual training it is divided into training set : validation set : test set = 8:1:1; to ensure the fairness, validity and objectivity of the test results, the Shanbu Mingyi Fanglun is excluded from pre-training so as not to influence the performance judgment.
Step 108: in the fine tuning stage, character input of radicals of new characters is added in an embedded coding module of the named entity recognition model, and a new embedding layer is obtained;
specifically, when the ELECTRA pre-training model is introduced into the BilSTM-CRF model and then fine-tuning training is carried out, the information carried by rarely-used characters appearing in part is less by considering the ELECTRA model obtained by the existing pre-training, so that the invention provides the characteristic input of the radicals of newly-added characters in the embedded coding part of the model to enrich the content of text information.
The specific implementation mode is that parameters are initialized randomly, then radical codes of the Chinese characters are obtained according to the Chinese character-radical index look up, the loss in the fine tuning training stage is the sum of the losses of all tokens (tokens), and the schematic diagram of the modified imbedding layer is shown in fig. 3.
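The character-to-radical lookup and its contribution to the embedding can be sketched as follows (all tables here are toy stand-ins for the randomly initialized, learned parameters described above; names are illustrative):

```python
def embed_with_radicals(chars, char_to_radical, char_vecs, radical_vecs, dim=4):
    """For each character, look up its radical from a character-to-radical
    index and add the radical vector to the character vector, mirroring the
    modified embedding layer's extra radical input."""
    zero = [0.0] * dim
    embedded = []
    for ch in chars:
        radical = char_to_radical.get(ch)        # e.g. "汤" -> "氵"
        rvec = radical_vecs.get(radical, zero)   # unknown radical -> zeros
        cvec = char_vecs.get(ch, zero)
        embedded.append([c + r for c, r in zip(cvec, rvec)])
    return embedded
```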
Step 110: and taking the obtained marked Chinese medicinal ancient book text sequence as a fine-tuning training sample.
Step 112: carrying out fine tuning training on the named entity recognition model in the fine tuning stage by adopting a fine tuning training sample to obtain a trained named entity recognition model; the named entity recognition model in the fine tuning stage comprises a new embedding layer, a pretrained ELECTRA module, a BilSTM module and a CRF module.
Step 114: and identifying the ancient medical book named entities by adopting a trained named entity identification model.
In the above method for identifying traditional Chinese medicine ancient book named entities, the method comprises: acquiring a traditional Chinese medicine ancient book text and preprocessing it to obtain a traditional Chinese medicine ancient book text sequence; constructing a named entity recognition model comprising an embedding layer, an ELECTRA module, a BiLSTM module and a CRF module connected in sequence; in the pre-training stage, considering that ELECTRA's random mask mechanism cannot guarantee training quality when the pre-training data are scarce, newly adding a semi-fixed random mask mechanism to the model; and in the fine-tuning stage, introducing character radicals to enrich the text information. The method can improve the recognition of the prescription, Chinese medicine, syndrome, disease and symptom-manifestation entities in traditional Chinese medicine ancient books.
In one embodiment, step 106 comprises: setting a pre-training step-number division parameter; dividing the total number of pre-training steps, according to the parameter, into steps trained in the conventional manner and steps trained with the keyword mask mechanism; training the ELECTRA module with its ordinary random mask mechanism on the codes obtained by feeding the traditional Chinese medicine ancient book text sequence into the embedding layer until the conventional training step count is reached, obtaining the conventionally trained ELECTRA module; and continuing to train the conventionally trained ELECTRA module with the keyword mask mechanism until the keyword-mask step count is reached, obtaining the pre-trained ELECTRA module; the keyword mask mechanism randomly masks 15% of tokens, where each masked token belongs to the preset keyword table.
Specifically, a pre-training step-number division parameter divide_portion is set to split the training steps: the number of conventionally trained steps is divide_portion × num_train_steps (where num_train_steps is the total number of pre-training steps), and the remaining steps use the keyword mask mechanism. The newly added keyword mask mechanism still randomly masks 15% of tokens (each token belonging to the preset keyword table); the Transformer structure is unchanged, and the model loss is still the sum of the losses of the generator and the discriminator.
In one embodiment, the preset keyword table of step 104 is constructed by the following steps: extracting the keyword features of the prescription, Chinese medicine, disease, syndrome and symptom-manifestation entities in the traditional Chinese medicine ancient books, and determining the keywords corresponding to each entity class; extracting words with occurrence frequency greater than or equal to 1 from the term dictionary as term-dictionary keywords; and constructing the preset keyword table from the term-dictionary keywords and the keywords corresponding to each entity class.
In particular, the keyword data are used to improve the pre-training-stage model. The keyword data consist of two parts: keyword features obtained by direct induction and summarization, and high-frequency words collected from the sorted term dictionary (to ensure that the experiment proceeds smoothly, "high-frequency" here means a Chinese character occurring with frequency greater than or equal to 1). This scheme studies entity recognition for 5 classes of entities, namely prescription, Chinese medicine, disease, syndrome and symptom-manifestation entities, each class having its own keyword features. Prescription entities such as "Shenfu decoction" and "Baoyuan decoction" mostly carry formulation keywords such as "decoction", "powder" and "drink"; disease entities such as "taiyang disease" and "taiyang-yangming syndrome" mostly contain the word "disease"; syndrome entities are mostly composed of syndrome elements such as heart, liver, spleen, lung and kidney; symptom-manifestation entities, such as pulse and tongue manifestations, mostly carry the keywords "pulse" and "tongue". Accordingly, the authors collated the keywords of the 5 entity classes and the term dictionary, the latter comprising 917 common prescription names, 2035 disease names, 3199 syndrome names and 2461 symptom-manifestation names. The collated keyword table is shown in Table 2.
TABLE 2 Keyword table
Keyword category | Count | Examples
Prescription | 19 | pill, powder, decoction
Chinese medicine | 3 | raw, roasted, wine
Disease | 4 | disease, disorder, cancer
Syndrome type | 63 | syndrome, wei (defense), ying (nutrient)
Syndrome manifestation | 8 | pulse, tongue, symptom
Dictionary high-frequency words | 3139 | wei, wu, wei-dampness (transliterated glosses; the original characters were lost in extraction)
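The two-part construction described above, hand-collated per-entity keyword features plus words meeting the frequency threshold in the term dictionary, can be sketched as follows. The function name and the sample keyword lists are hypothetical illustrations (the patent's actual lists contain 19/3/4/63/8 entries), not the patent's own data or code:

```python
from collections import Counter

# Hypothetical stand-ins for the hand-collated keyword features of the 5 entity types.
ENTITY_KEYWORDS = {
    "Prescription": ["decoction", "pill", "powder", "drink"],
    "Medicine": ["raw", "roasted", "wine"],
    "Disease": ["disease", "cancer"],
    "Syndrome": ["heart", "liver", "spleen", "lung", "kidney"],
    "Manifestation": ["pulse", "tongue"],
}

def build_keyword_table(term_dictionary, min_freq=1):
    """Merge the per-entity keyword features with characters whose
    frequency across the term dictionary is >= min_freq."""
    counts = Counter()
    for term in term_dictionary:
        counts.update(term)  # character-level frequency over dictionary terms
    dict_keywords = {ch for ch, n in counts.items() if n >= min_freq}
    entity_keywords = {kw for kws in ENTITY_KEYWORDS.values() for kw in kws}
    return dict_keywords | entity_keywords
```

With min_freq=1 (as in the patent) every character of every dictionary term enters the table, so the threshold mainly serves to make the construction explicit.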
In one embodiment, step 112 includes: inputting the traditional Chinese medicine ancient book text sequence into the new embedding layer of the named entity recognition model in the fine-tuning stage, obtaining word embedding information, word position information and encoding information of the sentence containing each word via lookup operations, and obtaining the radical code of each Chinese character via a Chinese character-radical index lookup; inputting the word embedding information, the word position information, the sentence encoding information and the Chinese character radical codes into the pre-trained ELECTRA module of the named entity recognition model in the fine-tuning stage to obtain the embedded representation of each word; inputting the embedded representations into the BiLSTM module of the named entity recognition model in the fine-tuning stage for feature extraction, and inputting the extracted features into the CRF module for decoding to obtain a predicted tag sequence; and determining the fine-tuning-stage model loss from the predicted tag sequence and the corresponding tokens, and fine-tuning the named entity recognition model of the fine-tuning stage according to that loss and the labels of the fine-tuning training samples to obtain the trained named entity recognition model.
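The lookup-and-sum step of the new embedding layer can be sketched as follows. The table sizes, the `embed` helper and the toy character-to-radical index are hypothetical; a real implementation would use the trained embedding matrices inside the ELECTRA model rather than random ones:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, RADICALS, MAX_LEN, SEGS, DIM = 100, 20, 16, 2, 8

# Hypothetical embedding tables (word, position, sentence/segment, radical).
word_emb = rng.normal(size=(VOCAB, DIM))
pos_emb = rng.normal(size=(MAX_LEN, DIM))
seg_emb = rng.normal(size=(SEGS, DIM))
rad_emb = rng.normal(size=(RADICALS, DIM))

# Hypothetical character-id -> radical-id index; the patent builds the real
# one from its crawled Chinese character-radical table.
char_to_radical = {7: 3, 12: 5}

def embed(token_ids, segment_ids):
    """Sum the four lookups: word, position, segment and radical embedding."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rad = char_to_radical.get(tok, 0)  # 0 acts as an 'unknown radical' id
        out.append(word_emb[tok] + pos_emb[pos] + seg_emb[seg] + rad_emb[rad])
    return np.stack(out)

vecs = embed([7, 12, 3], [0, 0, 0])  # shape (3, DIM)
```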
Specifically, the radical data is used to improve the model in the fine-tuning stage. Traditional Chinese medicine culture is a crystallization of human history and thought, and its transmission and development are inseparable from Chinese character culture. Chinese characters originate from the oracle-bone script of around 1000 BC, which was pictorial in nature and a typical pictographic script; as Chinese characters evolved, their structure was simplified, yet they remained the recording symbols of the Chinese language, representing word semantics within a certain system. Today, simplified Chinese characters are the common symbols of Chinese text; their basic structural units are components, and characters sharing a radical share related meanings: characters with the "eye" radical mostly relate to the eyes, characters with the illness radical mostly relate to diseases, and so on. To let the machine learn such features, this study adds Chinese character radicals as input. The authors took statistics over the full text of "Min Bu Ming Yi Fang Lun", obtaining 1831 distinct Chinese characters and 225 radicals (when a character admits several radicals, the most suitable one was chosen manually); the radicals were obtained by crawling Baidu Hanyu. The Chinese character-radical table is shown in Table 3, and a separate txt file is generated from the radicals for radical coding.
TABLE 3 Chinese character-radical table (excerpt: the five characters of "Renshen Yangrong decoction"; most glyphs and radicals were lost in extraction)
Chinese character | Radical
person (ren) | person
ginseng (shen) | (lost)
nourish (yang) | (lost)
rong | (lost)
decoction (tang) | (lost)
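Generating the radical vocabulary and the character-to-radical index lookup described above might look like the following sketch. The helper names and the toy four-entry table are hypothetical stand-ins for the patent's 1831-character/225-radical table:

```python
# Hypothetical stand-in for the crawled character -> radical table
# (transliterations used here because the original glyphs are not available).
CHAR_RADICAL = {"ren": "ren", "shen": "rad_a", "yang": "rad_b", "tang": "rad_c"}

def build_radical_vocab(char_radical):
    """Assign each distinct radical an integer id, reserving 0 for
    'unknown'; this mirrors the separately generated radical txt file."""
    radicals = sorted(set(char_radical.values()))
    return {rad: i + 1 for i, rad in enumerate(radicals)}

def radical_ids(chars, char_radical, vocab):
    """Map each character to its radical id, falling back to 0."""
    return [vocab.get(char_radical.get(c, ""), 0) for c in chars]

vocab = build_radical_vocab(CHAR_RADICAL)
ids = radical_ids(["ren", "tang", "zzz"], CHAR_RADICAL, vocab)
```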
In one embodiment, step 110 comprises: acquiring a traditional Chinese medicine ancient book text; performing data correction on the traditional Chinese medicine ancient book text, the correction content comprising: correcting wrongly written characters, replacing traditional characters with simplified ones, and amending ambiguous characters; manually annotating 5 entity types in the corrected traditional Chinese medicine ancient book text, the 5 entity types comprising: prescription, traditional Chinese medicine, syndrome type, disease and syndrome manifestation; dividing the annotated traditional Chinese medicine ancient book text in a preset proportion into a training set, a validation set and a test set; segmenting the divided data set into sequences at Chinese periods, exclamation marks and question marks, keeping the segmented sequence lengths close to one another while each stays below the preset maximum sequence length; processing the segmented results in the BIO labeling scheme to obtain an entity-labeled traditional Chinese medicine ancient book text sequence; and using the entity-labeled traditional Chinese medicine ancient book text sequence as the fine-tuning training sample.
Specifically, the correction content includes correcting wrongly written characters, replacing traditional characters with their simplified forms, and amending ambiguous characters, for example replacing the traditional form of "Huangqi" with its simplified form (the replacement characters were rendered as an image in the original publication and are not recoverable here).
Entity annotation was performed manually, with the doccano term annotation system as the annotation tool; the 5 entity types of prescription, traditional Chinese medicine, syndrome type, disease and syndrome manifestation were labeled together, and the definitions and examples of each entity type are shown in Table 4. Syndrome-manifestation entities cover symptoms, pulse findings and tongue findings, e.g. the symptom "vomiting ascarids after eating" and the pulse finding "weak pulse"; auxiliary words and adjectives were not stripped during entity annotation. The annotation results were cross-checked and revised by several doctoral students of traditional Chinese medicine; the final manually annotated entity counts (including repetitions) are: 760 prescriptions, 3375 Chinese medicines, 831 syndrome manifestations, 593 diseases and 1762 syndrome types.
TABLE 4 entities, entity classes, and entity definitions
[Table 4 is provided as an image in the original publication and is not recoverable here.]
After data segmentation, the data is processed in the BIO labeling scheme, where B (begin) marks the start of an entity, I (inside) marks any position of an entity other than its start, and O (outside) marks non-entity tokens. For example, in the prescription "Renshen Yangrong decoction", the character "ren" is labeled B-Prescription and the remaining characters of the entity are labeled I-Prescription; the concrete labeling is shown in Table 5.
TABLE 5 Concrete labeling example (two token/label column pairs per row; the original Chinese characters were lost in extraction, leaving only approximate glosses)
Token | Label | Token | Label
if | O | yellow (huang) | B-Medicine
exterior (biao) | B-Syndrome | astragalus (qi) | I-Medicine
deficiency (xu) | I-Syndrome | yang | B-Syndrome
spontaneous (zi) | B-Disease | deficiency (xu) | I-Syndrome
sweating (han) | I-Disease | jue | B-Symptom
(lost) | O | sweating (han) | I-Symptom
with (yi) | O | , | O
aconite (fu) | B-Medicine | astragalus (qi) | B-Prescription
(zi) | I-Medicine | aconite (fu) | I-Prescription
change (yi) | O | decoction (tang) | I-Prescription
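The BIO scheme illustrated in Table 5 can be sketched as a small conversion routine; `bio_tags` and the toy span annotation are hypothetical helpers for illustration, not the patent's code:

```python
def bio_tags(text, entities):
    """Convert character-span entity annotations into per-character BIO tags:
    the first character of a span gets B-<label>, the rest I-<label>, and
    everything outside any span stays O."""
    tags = ["O"] * len(text)
    for start, end, label in entities:  # end is exclusive
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# A 5-character prescription (like "Renshen Yangrong decoction") at the
# start of a 7-character toy sequence.
tags = bio_tags("XXXXX,Y", [(0, 5, "Prescription")])
```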
(2) Data segmentation. The pre-trained ELECTRA model limits the length of its input sequence, and given the relatively small amount of training data in this study, the authors applied sequence segmentation to the divided data set. Chinese periods, exclamation marks and question marks are used as cut points, and, subject to each segmented sequence staying below the preset maximum sequence length (the hyperparameter max_seq_length), the segmented sequences are kept as close in length as possible.
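A minimal sketch of the segmentation rule above: cut at Chinese sentence-final punctuation, then pack consecutive sentences greedily under `max_seq_length`. The greedy packing is an assumption; the patent only requires that segment lengths stay below the maximum and remain roughly even:

```python
import re

def split_sequences(text, max_seq_length=128):
    """Cut text at Chinese sentence-final punctuation and greedily pack
    consecutive sentences so each output sequence stays within the limit.
    (Simplification: a single sentence longer than the limit is kept whole.)"""
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) > max_seq_length:
            chunks.append(current)
            current = ""
        current += sent
    if current:
        chunks.append(current)
    return chunks
```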
It should be understood that although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may comprise multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and the order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turn or in alternation with other steps or with sub-steps or stages of other steps.
In a validation embodiment, the study ran experiments on the proposed optimization scheme. Experiment 1 optimizes only the model pre-training stage; experiment 2 optimizes only the model fine-tuning stage; experiment 3 both adds the keyword mask mechanism in the pre-training stage and introduces Chinese character radicals as input features in the fine-tuning stage.
(1) Experiment 1 core parameters: divide_contribution = 7/8, num_train_steps = 100,000, learning_rate = 5e-4, electra_objective = True, model_size = "base"; fine-tuning epoch = 40, learning_rate = 3e-4. The remaining parameters are shown in Table 6, and the experimental results in Table 7.
TABLE 6 remaining parameters
[Table 6 is provided as an image in the original publication and is not recoverable here.]
TABLE 7 Model index values with only the semi-fixed random mask strategy introduced
Entity type | Precision | Recall | F1
Prescription | 87.85 | 92.16 | 89.95
Chinese medicine | 97.33 | 97.07 | 97.20
Syndrome type | 75.00 | 58.88 | 65.97
Disease | 70.27 | 60.47 | 65.00
Syndrome manifestation | 65.58 | 72.68 | 68.95
Average | 84.21 | 83.80 | 84.00
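The "harmonic mean" column in Tables 7-9 is the standard F1 score computed from precision and recall, which can be checked against the rows above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, as reported in the tables."""
    return 2 * precision * recall / (precision + recall)

# Spot-check the Prescription row of Table 7: P=87.85, R=92.16 -> F≈89.95
prescription_f1 = f1(87.85, 92.16)
```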
(2) Experiment 2 core parameters: use_radical_embedding = True, epoch = 40, learning_rate = 3e-4; the remaining parameters agree with Table 6, and the experimental results are shown in Table 8.
TABLE 8 Model index values with only the radical features introduced
Entity type | Precision | Recall | F1
Prescription | 85.19 | 90.20 | 87.62
Chinese medicine | 96.82 | 97.33 | 97.07
Syndrome type | 67.37 | 59.81 | 63.37
Disease | 61.70 | 67.44 | 64.44
Syndrome manifestation | 63.93 | 72.16 | 67.80
Average | 81.56 | 84.04 | 82.78
(3) Experiment 3 core parameters: use_radical_embedding = True, epoch = 40, learning_rate = 3e-4; the remaining parameters agree with Table 6, and the experimental results are shown in Table 9.
TABLE 9 Index values of the optimized model
Entity type | Precision | Recall | F1
Prescription | 91.00 | 89.22 | 90.10
Chinese medicine | 98.63 | 96.00 | 97.30
Syndrome type | 68.57 | 67.29 | 67.92
Disease | 69.05 | 67.44 | 68.24
Syndrome manifestation | 71.65 | 71.65 | 71.65
Average | 85.73 | 84.17 | 84.94
TABLE 10 Index values of the model before optimization
[Table 10 is provided as images in the original publication and is not recoverable here.]
Comparing the evaluation indexes of the models before and after optimization shows that adding the keyword mask mechanism in the pre-training stage improves model performance more than introducing radical features in the fine-tuning stage, and that combining the two methods yields a still larger gain; the average P, R and F values for each model's entities are shown in fig. 4.
Comparing Tables 9 and 10 shows that the improved model raises recognition of the prescription, Chinese medicine, syndrome-type, disease and syndrome-manifestation entities by 3.1%, 0.4%, 5.01%, 7.77% and 7.05% respectively. The disease entity improves the most and the Chinese medicine entity the least: disease text has conspicuous keyword and radical features, whereas the system already recognized most Chinese medicine entities, and the remaining hard cases, such as herb names like ephedra and cassia twig embedded inside prescription names, are difficult to resolve by adding radical features alone. The experimental results show that the optimization helps recognition: for example, "malaria" was labeled O before optimization but correctly recognized afterward, and "intermittent pulse" is likewise now recognized.
In one embodiment, as shown in fig. 5, a traditional Chinese medicine ancient book named entity recognition apparatus is provided, comprising: a traditional Chinese medicine ancient book text sequence determining module, a named entity recognition model building module, a pre-training module of the ELECTRA module, a named entity recognition model fine-tuning training module and a traditional Chinese medicine ancient book named entity recognition module, wherein:
the traditional Chinese medicine ancient book text sequence determining module is used for acquiring traditional Chinese medicine ancient book texts and preprocessing the traditional Chinese medicine ancient book texts to obtain traditional Chinese medicine ancient book text sequences.
The named entity recognition model building module is used for constructing a named entity recognition model of the traditional Chinese medicine ancient book, wherein the named entity recognition model comprises an embedding layer, an ELECTRA module, a BiLSTM module and a CRF module which are connected in sequence.
The pre-training module of the ELECTRA module is used for newly adding a keyword mask mechanism in the ELECTRA module in the pre-training stage, wherein the keyword mask mechanism randomly selects keywords from a preset keyword table for masking in a semi-fixed random mask manner; and for pre-training the embedding layer and the ELECTRA module with the newly added keyword mask mechanism by using the traditional Chinese medicine ancient book text sequence, obtaining a pre-trained ELECTRA module.
The named entity recognition model fine-tuning training module is used for newly adding radical features of characters as input to the embedding module of the named entity recognition model in the fine-tuning stage, obtaining a new embedding layer; using the obtained annotated traditional Chinese medicine ancient book text sequence as a fine-tuning training sample; and performing fine-tuning training on the named entity recognition model of the fine-tuning stage by using the fine-tuning training sample, obtaining a trained named entity recognition model. The named entity recognition model in the fine-tuning stage comprises the new embedding layer, the pre-trained ELECTRA module, a BiLSTM module and a CRF module.
And the traditional Chinese medicine ancient book named entity recognition module is used for recognizing the traditional Chinese medicine ancient book named entities by adopting a trained named entity recognition model.
In one embodiment, the pre-training module of the ELECTRA module is further configured to set a pre-training step segmentation parameter; divide the total number of pre-training steps, according to that parameter, into steps trained in the conventional manner and steps trained with the keyword mask mechanism; train the ELECTRA module with its own random mask mechanism on the codes obtained by inputting the traditional Chinese medicine ancient book text sequence into the embedding layer, until the number of training steps reaches the conventional step count, obtaining the conventionally trained ELECTRA module; and continue training the conventionally trained ELECTRA module with the keyword mask mechanism on those codes until the number of training steps reaches the keyword-mask step count, obtaining the pre-trained ELECTRA module. The keyword mask mechanism randomly masks 15% of tokens, the masked tokens belonging to keywords in the preset keyword table.
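The two-phase schedule and the semi-fixed mask rule described above can be sketched as follows. The function names, the sampling details and the empty-candidate fallback are hypothetical illustrations of the described mechanism, not the patent's implementation:

```python
import random

def split_steps(total_steps, divide_proportion):
    """Divide the pre-training budget into a regular-mask phase and a
    keyword-mask phase (e.g. 7/8 of 100,000 steps regular, the rest
    keyword-masked)."""
    regular = int(total_steps * divide_proportion)
    return regular, total_steps - regular

def choose_mask_positions(tokens, keyword_table, step, regular_steps,
                          mask_rate=0.15, rng=random):
    """Semi-fixed strategy: before the switch point, mask 15% of all
    tokens; afterwards, restrict the 15% sample to tokens found in the
    preset keyword table."""
    if step < regular_steps:
        candidates = list(range(len(tokens)))
    else:
        candidates = [i for i, t in enumerate(tokens) if t in keyword_table]
    k = max(1, int(len(tokens) * mask_rate)) if candidates else 0
    return sorted(rng.sample(candidates, min(k, len(candidates))))
```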
In one embodiment, the pre-training module of the ELECTRA module comprises a construction module of the preset keyword table, which is used for extracting keyword features of the prescription, traditional Chinese medicine, disease, syndrome-type and syndrome-manifestation entities in the traditional Chinese medicine ancient book and determining the keywords corresponding to each entity type; extracting words whose frequency of occurrence is greater than or equal to 1 from the associated dictionary as keywords of a term dictionary; and constructing the preset keyword table from the term-dictionary keywords and the keywords corresponding to each entity type.
In one embodiment, the named entity recognition model fine-tuning training module is further configured to input the traditional Chinese medicine ancient book text sequence into the new embedding layer of the named entity recognition model in the fine-tuning stage, obtain word embedding information, word position information and encoding information of the sentence containing each word via lookup operations, and obtain the radical code of each Chinese character via a Chinese character-radical index lookup; input the word embedding information, the word position information, the sentence encoding information and the Chinese character radical codes into the pre-trained ELECTRA module of the named entity recognition model in the fine-tuning stage to obtain the embedded representation of each word; input the embedded representations into the BiLSTM module of the named entity recognition model in the fine-tuning stage for feature extraction, and input the extracted features into the CRF module for decoding to obtain a predicted tag sequence; and determine the fine-tuning-stage model loss from the predicted tag sequence and the corresponding tokens, and fine-tune the named entity recognition model of the fine-tuning stage according to that loss and the labels of the fine-tuning training samples to obtain the trained named entity recognition model.
In one embodiment, the traditional Chinese medicine ancient book text sequence determining module is further configured to acquire a traditional Chinese medicine ancient book text; perform data correction on the traditional Chinese medicine ancient book text, the correction content comprising: correcting wrongly written characters, replacing traditional characters and amending ambiguous characters; manually annotate 5 entity types in the corrected traditional Chinese medicine ancient book text, the 5 entity types comprising: prescription, traditional Chinese medicine, syndrome type, disease and syndrome manifestation; divide the annotated traditional Chinese medicine ancient book text in a preset proportion into a training set, a validation set and a test set; segment the divided data set into sequences at Chinese periods, exclamation marks and question marks, keeping the segmented sequence lengths close to one another while each stays below the preset maximum sequence length; process the segmented results in the BIO labeling scheme to obtain an entity-labeled traditional Chinese medicine ancient book text sequence; and use the entity-labeled traditional Chinese medicine ancient book text sequence as the fine-tuning training sample.
For the specific limitation of the apparatus for identifying ancient Chinese medical books, reference may be made to the above limitation of the method for identifying ancient Chinese medical books, which is not described herein again. All modules in the traditional Chinese medicine ancient book named entity recognition device can be completely or partially realized through software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, an electronic device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The electronic device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a method for identifying ancient Chinese medicinal books. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, an electronic device is provided, comprising a memory storing a computer program and a processor implementing the steps of the above method embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A traditional Chinese medicine ancient book named entity identification method is characterized by comprising the following steps:
acquiring a traditional Chinese medicine ancient book text, and preprocessing the traditional Chinese medicine ancient book text to obtain a traditional Chinese medicine ancient book text sequence;
constructing a named entity recognition model of the traditional Chinese medicine ancient book, wherein the named entity recognition model comprises an embedding layer, an ELECTRA module, a BiLSTM module and a CRF module which are connected in sequence;
in the pre-training stage, newly adding a keyword mask mechanism in the ELECTRA module, wherein the keyword mask mechanism randomly selects keywords from a preset keyword table for masking in a semi-fixed random mask manner;
pre-training the embedding layer and the ELECTRA module with the newly added keyword mask mechanism by using the traditional Chinese medicine ancient book text sequence to obtain a pre-trained ELECTRA module;
in the fine-tuning stage, newly adding radical features of characters as input to the embedding module of the named entity recognition model to obtain a new embedding layer;
taking the obtained marked Chinese medicinal ancient book text sequence as a fine-tuning training sample;
performing fine-tuning training on the named entity recognition model of the fine-tuning stage by using the fine-tuning training sample to obtain a trained named entity recognition model, wherein the named entity recognition model of the fine-tuning stage comprises the new embedding layer, the pre-trained ELECTRA module, a BiLSTM module and a CRF module;
and identifying the named entities of the ancient medical books by adopting a trained named entity identification model.
2. The method of claim 1, wherein pre-training the embedding layer and the ELECTRA module with the newly added keyword mask mechanism by using the traditional Chinese medicine ancient book text sequence to obtain the pre-trained ELECTRA module comprises:
setting a pre-training step number segmentation parameter;
dividing the total pre-training steps into steps trained in a conventional training mode and training steps adopting a keyword mask mechanism according to the pre-training step division parameters;
according to the codes obtained by inputting the Chinese medicinal ancient book text sequence into the embedding layer, training the ELECTRA module by adopting a random mask mechanism of the ELECTRA module until the number of training steps reaches the number of training steps in a conventional training mode, and obtaining the ELECTRA module after conventional training;
continuing to train the conventionally trained ELECTRA module with the keyword mask mechanism, according to the codes obtained by inputting the traditional Chinese medicine ancient book text sequence into the embedding layer, until the number of training steps reaches the number of steps using the keyword mask mechanism, to obtain the pre-trained ELECTRA module; wherein the keyword mask mechanism randomly masks 15% of tokens, the masked tokens belonging to keywords in the preset keyword table.
3. The method of claim 1, wherein the step of constructing the preset keyword table comprises:
extracting keyword features of the prescription, traditional Chinese medicine, disease, syndrome-type and syndrome-manifestation entities in the traditional Chinese medicine ancient book, and determining the keywords corresponding to each entity type;
extracting words whose frequency of occurrence is greater than or equal to 1 from the associated dictionary as keywords of a term dictionary;
and constructing a preset keyword table according to the keywords of the term dictionary and the keywords corresponding to various entities.
4. The method of claim 1, wherein the fine-tuning training of the named entity recognition model in the fine-tuning stage using the fine-tuning training samples to obtain the trained named entity recognition model comprises:
inputting the fine-tuning training sample into the new embedding layer of the named entity recognition model in the fine-tuning stage, obtaining word embedding information, word position information and encoding information of the sentence containing each word via lookup operations, and obtaining the radical code of each Chinese character via a Chinese character-radical index lookup;
inputting the word embedding information, the word position information, the sentence encoding information and the Chinese character radical codes into the pre-trained ELECTRA module of the named entity recognition model in the fine-tuning stage to obtain the embedded representation of each word;
inputting the embedded representations of the words into the BiLSTM module of the named entity recognition model in the fine-tuning stage for feature extraction, and inputting the extracted features into the CRF module for decoding to obtain a predicted tag sequence;
and determining the model loss in the fine tuning stage according to the predicted label sequence and the corresponding token, and performing fine tuning training on the named entity recognition model in the fine tuning stage according to the model loss in the fine tuning stage and the label of the fine tuning training sample to obtain the trained named entity recognition model.
5. The method of claim 1, wherein the step of using the obtained marked ancient Chinese medicine text sequence as a fine training sample comprises:
acquiring a traditional Chinese medicine ancient book text;
performing data correction on the traditional Chinese medicine ancient book text, wherein the correction content comprises: correcting wrongly written characters, replacing traditional characters and amending ambiguous characters;
manually marking the corrected Chinese medicinal ancient book text to mark 5 types of entities; the 5 types of entities include: prescription, traditional Chinese medicine, syndrome type, disease and symptom expression;
dividing the marked traditional Chinese medicine ancient book texts according to a preset proportion to obtain a training set, a verification set and a test set;
taking Chinese periods, exclamation marks and question marks as segmentation points, performing sequence segmentation processing on the segmented data set, and ensuring that the lengths of the segmented sequences are similar under the condition that the length of the segmented sequences is smaller than the maximum value of the preset sequence length;
performing data processing on the result after the sequence segmentation processing by adopting a BIO labeling mode to obtain an entity-labeled Chinese medicinal ancient book text sequence;
and taking the Chinese medicinal ancient book text sequence after entity labeling as a fine-tuning training sample.
6. A traditional Chinese medicine ancient book named entity recognition device, characterized by comprising:
the traditional Chinese medicine ancient book text sequence determining module, used for acquiring a traditional Chinese medicine ancient book text and preprocessing the traditional Chinese medicine ancient book text to obtain a traditional Chinese medicine ancient book text sequence;
the named entity recognition model building module, used for building a traditional Chinese medicine ancient book named entity recognition model, the named entity recognition model comprising an embedding layer, an ELECTRA module, a BiLSTM module, and a CRF module connected in sequence;
the ELECTRA module pre-training module, used for adding a keyword mask mechanism to the ELECTRA module in the pre-training stage, the keyword mask mechanism randomly selecting, in a semi-fixed random mask manner, keywords from a preset keyword table to be masked; and pre-training the embedding layer and the ELECTRA module with the added keyword mask mechanism using the traditional Chinese medicine ancient book text sequence, to obtain the pre-trained ELECTRA module;
the named entity recognition model fine-tuning training module, used for adding radical character features to the embedding layer of the named entity recognition model in the fine-tuning stage to obtain a new embedding layer; taking the obtained annotated traditional Chinese medicine ancient book text sequence as a fine-tuning training sample; and performing fine-tuning training on the named entity recognition model of the fine-tuning stage with the fine-tuning training sample to obtain the trained named entity recognition model, the named entity recognition model of the fine-tuning stage comprising the new embedding layer, the pre-trained ELECTRA module, the BiLSTM module, and the CRF module;
and the traditional Chinese medicine ancient book named entity recognition module, used for recognizing traditional Chinese medicine ancient book named entities with the trained named entity recognition model.
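The radical-feature enrichment of the embedding layer described in claim 6 can be illustrated by concatenating a character vector with a vector for its radical (部首). This is a minimal sketch under assumed toy lookup tables; the names, dimensions, and `<unk>` fallback are illustrative, not from the patent.

```python
def embed_with_radical(char, char_emb, radical_emb, radical_of):
    """Concatenate a character's embedding with its radical's embedding,
    falling back to an <unk> radical vector for unknown characters."""
    radical = radical_of.get(char, "<unk>")
    return char_emb[char] + radical_emb[radical]  # list concatenation

# Toy lookup tables: 汤 has the water radical 氵.
char_emb = {"汤": [0.1, 0.2]}
radical_of = {"汤": "氵"}
radical_emb = {"氵": [0.9], "<unk>": [0.0]}
```

Calling `embed_with_radical("汤", char_emb, radical_emb, radical_of)` returns `[0.1, 0.2, 0.9]`, i.e. a character vector extended by one radical dimension.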
7. The apparatus of claim 6, wherein the ELECTRA module pre-training module is further configured to: set a pre-training step-number split parameter; divide the total number of pre-training steps, according to the split parameter, into steps trained in the conventional manner and steps trained with the keyword mask mechanism; train the ELECTRA module with its native random mask mechanism on the encoding obtained by feeding the traditional Chinese medicine ancient book text sequence into the embedding layer, until the number of training steps reaches the number of conventional training steps, to obtain the conventionally trained ELECTRA module; and continue training the conventionally trained ELECTRA module with the keyword mask mechanism on the encoding obtained by feeding the traditional Chinese medicine ancient book text sequence into the embedding layer, until the number of training steps reaches the number of keyword-mask training steps, to obtain the pre-trained ELECTRA module; the keyword mask mechanism randomly masks 15% of the tokens, the masked tokens belonging to the preset keyword table.
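The keyword mask selection of claim 7 (randomly masking 15% of tokens, restricted to tokens in the preset keyword table) might be sketched as below. The function name, the `[MASK]` placeholder, and the handling of sentences with too few keyword tokens are assumptions for illustration only.

```python
import random

MASK = "[MASK]"

def keyword_mask(tokens, keyword_table, mask_ratio=0.15, rng=random):
    """Mask roughly mask_ratio of the tokens, drawing masked positions
    only from tokens that appear in the keyword table."""
    keyword_positions = [i for i, t in enumerate(tokens) if t in keyword_table]
    n_mask = max(1, int(len(tokens) * mask_ratio))
    chosen = rng.sample(keyword_positions, min(n_mask, len(keyword_positions)))
    masked = list(tokens)
    for i in chosen:
        masked[i] = MASK
    return masked, sorted(chosen)
```

With `tokens = list("桂枝汤主之")` and a keyword table containing 桂 and 汤, exactly one of positions 0 or 2 is masked; all other characters pass through unchanged.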
8. The apparatus of claim 6, wherein the ELECTRA module pre-training module comprises a preset keyword table construction module, used for extracting the keyword features of the prescription entities, Chinese medicinal entities, disease entities, syndrome entities, and symptom expression entities in traditional Chinese medicine ancient books and determining the keywords corresponding to each entity type; extracting words with an occurrence frequency greater than or equal to 1 from a term dictionary as term-dictionary keywords; and constructing the preset keyword table from the term-dictionary keywords and the keywords corresponding to the entity types.
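The keyword-table construction of claim 8 amounts to a union of entity keywords and frequency-filtered term-dictionary words. The sketch below assumes both inputs are plain strings; all names and the example entries are illustrative, not from the patent.

```python
from collections import Counter

def build_keyword_table(entity_keywords, term_dictionary_words, min_freq=1):
    """Union the per-entity keywords with term-dictionary words whose
    occurrence frequency is at least min_freq."""
    freq = Counter(term_dictionary_words)
    dict_keywords = {w for w, c in freq.items() if c >= min_freq}
    return set(entity_keywords) | dict_keywords
```

For example, `build_keyword_table(["桂枝汤"], ["黄连", "黄连", "当归"])` yields a table containing the prescription keyword and both dictionary terms.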
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202210928069.5A 2022-08-03 2022-08-03 Traditional Chinese medicine ancient book named entity identification method and device, electronic equipment and memory Pending CN115310446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210928069.5A CN115310446A (en) 2022-08-03 2022-08-03 Traditional Chinese medicine ancient book named entity identification method and device, electronic equipment and memory


Publications (1)

Publication Number Publication Date
CN115310446A 2022-11-08

Family

ID=83857972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210928069.5A Pending CN115310446A (en) 2022-08-03 2022-08-03 Traditional Chinese medicine ancient book named entity identification method and device, electronic equipment and memory

Country Status (1)

Country Link
CN (1) CN115310446A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713083A (en) * 2022-11-23 2023-02-24 重庆邮电大学 Intelligent extraction method for key information of traditional Chinese medicine text
CN115713083B (en) * 2022-11-23 2023-12-15 北京约来健康科技有限公司 Intelligent extraction method for traditional Chinese medicine text key information
CN116341556A (en) * 2023-05-29 2023-06-27 浙江工业大学 Small sample rehabilitation medical named entity identification method and device based on data enhancement
CN117236342A (en) * 2023-09-28 2023-12-15 南京大经中医药信息技术有限公司 Chinese medicine classics semantic analysis method and system combined with knowledge graph
CN117236342B (en) * 2023-09-28 2024-05-28 南京大经中医药信息技术有限公司 Chinese medicine classics semantic analysis method and system combined with knowledge graph

Similar Documents

Publication Publication Date Title
US11093688B2 (en) Enhancing reading accuracy, efficiency and retention
CN111079377B (en) Method for recognizing named entities of Chinese medical texts
CN108549639A (en) Based on the modified Chinese medicine case name recognition methods of multiple features template and system
CN115310446A (en) Traditional Chinese medicine ancient book named entity identification method and device, electronic equipment and memory
CN109192255B (en) Medical record structuring method
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
CN106844351B (en) Medical institution organization entity identification method and device oriented to multiple data sources
CN109344250A (en) Single diseases diagnostic message rapid structure method based on medical insurance data
CN110502750B (en) Disambiguation method, disambiguation system, disambiguation equipment and disambiguation medium in Chinese medicine text word segmentation process
CN109947901B (en) Prescription efficacy prediction method based on multilayer perceptron and natural language processing technology
CN112487202A (en) Chinese medical named entity recognition method and device fusing knowledge map and BERT
CN109215798B (en) Knowledge base construction method for traditional Chinese medicine ancient languages
CN110162784A (en) Entity recognition method, device, equipment and the storage medium of Chinese case history
CN115310442A (en) Traditional Chinese medicine ancient book word segmentation method and device, computer equipment and storage medium
CN112949308A (en) Method and system for identifying named entities of Chinese electronic medical record based on functional structure
CN115050481A (en) Traditional Chinese medicine prescription efficacy prediction method based on graph convolution neural network
CN112307172A (en) Semantic parsing equipment, method, terminal and storage medium
CN116911300A (en) Language model pre-training method, entity recognition method and device
CN113254651B (en) Method and device for analyzing referee document, computer equipment and storage medium
CN114707497A (en) Cross Transformer Chinese medical named entity recognition method based on multi-source dictionary
Zheng et al. Decompose, fuse and generate: A formation-informed method for Chinese definition generation
Wang et al. Research on named entity recognition of doctor-patient question answering community based on BiLSTM-CRF model
Li et al. Development of a Natural Language Processing Tool to Extract Acupuncture Point Location Terms
Aziz et al. A hybrid model for spelling error detection and correction for Urdu language
Kelly et al. A system for extracting study design parameters from nutritional genomics abstracts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination