CN110134953A - Named entity recognition method and recognition system for traditional Chinese medicine (TCM) based on ancient TCM literature - Google Patents

Named entity recognition method and recognition system for traditional Chinese medicine (TCM) based on ancient TCM literature

Info

Publication number
CN110134953A
Authority
CN
China
Prior art keywords
training
chinese medical
corpus
medical book
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910367376.9A
Other languages
Chinese (zh)
Other versions
CN110134953B (en)
Inventor
张德政
杨石兵
贾麒
谢永红
夏超
栗辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN201910367376.9A
Publication of CN110134953A
Application granted
Publication of CN110134953B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The present invention provides a method and system for recognizing traditional Chinese medicine (TCM) named entities in ancient TCM literature, to solve the problem of TCM named entity recognition. The method obtains a corpus of ancient TCM literature, cleans the data, and then pre-trains a language model on it. The corpus is sequence-labeled to form the training set for the subsequent model. Based on this labeled training set, with the language model as the encoding layer and a neural network structure as the decoding layer, a sequence labeling model is trained and then used to recognize TCM named entities. The invention builds on existing language model training methods, such as the BERT pre-training method proposed by Google. Because it trains on a small-sample training set, it reduces the cost of manual annotation while improving recognition quality and accuracy. It is simple to operate, enables effective and comprehensive use of ancient TCM literature, especially ancient TCM medical case records, and lays a foundation for research in the TCM field.

Description

Named entity recognition method and recognition system for traditional Chinese medicine (TCM) based on ancient TCM literature
Technical field
The invention belongs to the fields of information processing and TCM literature, and specifically relates to a TCM named entity recognition method and recognition system based on ancient TCM literature.
Background
Traditional Chinese medicine is broad and profound. Its knowledge is passed on partly through the direct experience of older generations of practitioners and partly through literature. TCM literature preserves a large number of ancient medical case records, which contain the experience and the diagnosis and treatment methods of many renowned veteran TCM doctors. An ancient TCM medical case record, as used here, is the continuous record that a doctor in ancient times kept of a patient's symptoms, causes of disease, prescriptions, medication, and so on while treating a disease. A TCM named entity is an information entity in such a record, such as a symptom, prescription, or medication, that describes the patient's condition. To make better use of TCM literature, including ancient medical case records, TCM named entity recognition is an important prerequisite for related research in the TCM field.
Research on named entity recognition for common entity types (such as person names, place names, and organization names) has achieved good results, essentially approaching the level of manual annotation. However, ancient TCM literature differs greatly from other literature in both vocabulary and grammar and has its own characteristics, so applying existing named entity recognition methods to ancient TCM medical case records does not yield satisfactory results. The records also contain many grammatical phenomena that are hard to handle, which makes manual annotation difficult and expensive and further increases the difficulty of TCM named entity recognition.
Summary of the invention
The technical problem to be solved by the invention is that the prior art offers no effective method for recognizing TCM named entities. The invention provides a TCM named entity recognition method and recognition system based on ancient TCM literature. By building on existing language model training methods, such as the BERT pre-training method proposed by Google, it obtains a named entity recognition method for ancient TCM literature and enables effective use of that literature, especially ancient TCM medical case records.
To solve the above technical problem, an embodiment of the invention provides a TCM named entity recognition method based on ancient TCM literature, the method comprising the following steps:
Step S1: obtain a corpus of ancient TCM medical case records;
Step S2: clean the to-be-processed medical case corpus obtained in step S1;
Step S3: based on the medical case corpus obtained in step S2, pre-train a language model for the medical case corpus;
Step S4: based on the cleaned medical case corpus obtained in step S2, sequence-label the corpus to form the training set for the subsequent model;
Step S5: based on the sequence-labeled training set obtained in step S4, with the language model of step S3 as the encoding layer and a preset neural network structure as the decoding layer, train the corresponding sequence labeling model;
Step S6: based on the sequence labeling model trained in step S5, perform entity recognition on ancient TCM medical case records.
In the above scheme, step S1 of obtaining the medical case corpus specifically comprises the following steps:
Step S11: scan and recognize existing paper editions of ancient TCM medical case books using optical character recognition to form an electronic text corpus;
Step S12: use an open-source web crawler to collect from the web the medical case corpora for which no paper edition is available;
Step S13: compare and merge the corpus texts obtained in steps S11 and S12 to form a unified to-be-processed medical case corpus.
In the above scheme, step S2 of cleaning the to-be-processed medical case corpus specifically comprises the following steps:
Step S21: correct wrong characters;
Step S22: filter out irrelevant sentences.
In the above scheme, the language model pre-training of step S3 specifically comprises the following steps:
Step S31: download the pre-trained Chinese language model and the source code of the language model pre-training method;
Step S32: manually compile the table of characters that occur in the medical case records, compare it with the Chinese character table (vocabulary) in the source code, and extract the rare-character table of the TCM field;
Step S33: merge the rare-character table into the Chinese character table by replacing low-frequency characters in the source code's table with characters from the rare-character table, so that the length of the Chinese character table stays unchanged;
Step S34: split the cleaned medical case corpus from step S2 into paragraphs, preset a paragraph length threshold and/or a threshold on the number of sentences per paragraph, and take the paragraph texts that exceed the length threshold and/or the sentence-count threshold as the training corpus for the language model;
Step S35: replace the tokenization method in the source code with a rule that splits text character by character and, starting from the downloaded pre-trained Chinese language model, apply the downloaded language model pre-training method to the training corpus from step S34 to pre-train a language model for ancient TCM medical case records.
In the above scheme, the language model pre-training of step S3 uses Google's BERT language model pre-training method.
In the above scheme, the language model pre-training specifically comprises:
Step S31: download the Chinese language model chinese_L-12_H-768_A-12, open-sourced by Google and trained with BERT, together with the BERT source code;
Step S32: compare with the Chinese character table open-sourced by Google and extract the rare characters unique to the TCM field as a rare-character table;
Step S33: merge the Chinese character table open-sourced by Google with the rare-character table, replacing low-frequency characters in the Chinese character table with the rare characters so that the table length stays unchanged;
Step S34: split the cleaned medical case corpus from step S2 into paragraphs, preset a paragraph length threshold and/or a threshold on the number of sentences per paragraph, and take the paragraph texts that exceed the length threshold and/or the sentence-count threshold as the training corpus for the language model;
Step S35: replace the tokenization method in the BERT source code with a rule that splits text character by character and, starting from the Chinese language model open-sourced by Google, apply the BERT pre-training method to the training corpus produced in step S34 to pre-train a language model for ancient TCM medical case records.
In the above scheme, the training set for the subsequent model in step S4 is formed by labeling the medical case corpus with sequence labels in BIOES format.
In the above scheme, labeling the medical case corpus in BIOES format specifically comprises the following steps:
Step S41: choose the entity types to recognize;
Step S42: agree on the annotation rules;
Step S43: randomly select a sentence set of a predetermined size from the cleaned corpus and write it, separated character by character, to the file to be annotated, with a blank line between sentences;
Step S44: manually annotate the selected sentence set according to the chosen entity types and the agreed annotation rules.
In the above scheme, the entity types are: symptom ZZ, pulse condition MX, tongue appearance SX, Chinese medicinal ZY, dosage JL, and prescription FJ.
An embodiment of the invention also provides a TCM named entity recognition system based on ancient TCM literature, the system comprising: a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set annotation module, a sequence labeling model training module, and an entity recognition module, wherein:
The corpus acquisition module is used to obtain the corpus of ancient TCM medical case records;
The data cleaning module is used to clean the obtained to-be-processed medical case corpus;
The language model pre-training module is used to pre-train, based on the medical case corpus, a language model for the medical case corpus;
The training set annotation module is used to sequence-label the cleaned medical case corpus to form the training set for the subsequent model;
The sequence labeling model training module is used to train the corresponding sequence labeling model based on the sequence-labeled training set, with the language model as the encoding layer and a preset neural network structure as the decoding layer;
The entity recognition module is used to perform entity recognition on ancient TCM medical case records based on the sequence labeling model.
The above technical solution of the invention has the following beneficial effects:
In the above scheme, the TCM named entity recognition method and recognition system based on ancient TCM literature build on existing language model training methods, such as the BERT pre-training method proposed by Google, and train on a small-sample training set. Because the training set contains little annotated data, the cost of manual annotation is greatly reduced. The resulting named entity recognition method for ancient TCM literature is simple to operate and efficient. It enables more effective and more comprehensive use of ancient TCM literature, especially ancient TCM medical case records, improves the performance of named entity recognition in the TCM field, makes the recognition of TCM named entities more accurate, and lays a good data foundation for subsequent applications in the TCM field.
Brief description of the drawings
To describe the embodiments of the invention and the existing technical solutions more clearly, the drawing used in the description is briefly introduced below. A person of ordinary skill in the art can obtain other drawings from this drawing without creative effort.
Fig. 1 is a flow diagram of the TCM named entity recognition method based on ancient TCM literature according to an embodiment of the invention.
Detailed description of the embodiments
To make the technical problem to be solved, the technical solution, and the advantages of the invention clearer, a detailed description is given below with reference to the drawing and specific embodiments.
To solve the problem of TCM named entity recognition in ancient TCM literature, including ancient TCM medical case records, the invention provides a TCM named entity recognition method and recognition system based on ancient TCM literature. By building on existing language model training methods, such as the BERT pre-training method proposed by Google, it obtains a named entity recognition method for ancient TCM literature that makes more effective use of that literature and lays a foundation for related research in the TCM field.
The invention is described in further detail below with reference to the drawing and specific embodiments.
First embodiment
This embodiment provides a TCM named entity recognition method based on ancient TCM literature. Fig. 1 shows the flow diagram of the method.
The named entities in this embodiment are those in the medical case records of ancient TCM literature, but the invention is not limited to medical case records and can also be applied to other ancient TCM literature.
As shown in Fig. 1, the TCM named entity recognition method based on ancient TCM literature comprises the following steps:
Step S1: obtain a corpus of ancient TCM medical case records.
Further, obtaining the medical case corpus specifically comprises the following steps:
Step S11: scan and recognize existing paper editions of ancient TCM medical case books using optical character recognition (OCR) to form an electronic text corpus.
Step S12: use an open-source web crawler to collect from the web the medical case corpora for which no paper edition is available.
Step S13: compare and merge the corpus texts obtained in steps S11 and S12 to form a unified to-be-processed medical case corpus.
Step S2: clean the to-be-processed medical case corpus obtained in step S1.
Further, cleaning the to-be-processed medical case corpus specifically comprises the following steps:
Step S21: correct wrong characters.
In this step, wrong characters are erroneous characters caused by the doctor's slips of the pen during recording or by recognition errors when the corpus was obtained. For example, a sentence recognized as "the doctor observes diarrhea " " thirst" should actually read "the doctor observes diarrhea and thirst".
Step S22: filter out irrelevant sentences.
In this step, the irrelevant sentences include:
22.1. Part of the original corpus comes from compiled and bound books, so it contains much information such as book titles and author names, for example the title "Two All Cases" and the author name "Cheng Congzhou".
22.2. Because doctors recorded cases in a rather subjective and informal way, or because older cases have been preserved incompletely, the corpus contains fragmentary phrases or sentences that merely express the doctor's personal feelings, such as "attached.", "certain.", and "Alas!".
Further, filtering out the irrelevant sentences comprises the following steps:
Step S221: manually compile the features of the irrelevant sentences that need to be filtered.
Preferably, the features include:
221.1. Information such as titles and author names mostly appears together;
221.2. Titles almost always contain book-title marks (《》);
221.3. Other fragmentary phrases or sentences expressing the doctor's feelings have essentially fixed forms, such as "Alas!" and "certain.";
221.4. The irrelevant sentences to be filtered are mostly short, preferably no longer than 10 to 15 characters.
Step S222: design regular expressions according to the features compiled in step S221, split the text into sentences according to the sentence segmentation rules, and, combined with the length limit, filter the irrelevant sentences out of the to-be-processed corpus.
Preferably, the sentence segmentation rules are:
222.1. A character string ending with a full stop, question mark, or exclamation mark is treated as a sentence;
222.2. A string that contains no full stop, question mark, or exclamation mark but stands alone on a line is also treated as a sentence.
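The segmentation rules 222.1 and 222.2 and the filtering of step S222 can be sketched as follows. This is a minimal illustration, not the patent's actual regular expressions: the sample strings, the set `FIXED_FORMS`, and the 15-character limit (the upper end of the preferred range) are assumptions chosen for the example.

```python
import re

# Rule 222.1: a run of text ending in a full stop, question mark, or exclamation mark.
# Rule 222.2: an unterminated remainder that closes a line also counts as a sentence.
_SENTENCE = re.compile(r"[^。？！]*[。？！]|[^。？！]+$")

MAX_IRRELEVANT_LEN = 15                  # assumed upper bound from feature 221.4
TITLE = re.compile(r"^《[^》]*》$")       # feature 221.2: titles carry book-title marks
FIXED_FORMS = {"附。", "某。", "噫！"}    # hypothetical fixed forms (feature 221.3)


def split_sentences(text):
    """Split text line by line into sentences following rules 222.1 and 222.2."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            sentences.extend(s for s in _SENTENCE.findall(line) if s.strip())
    return sentences


def is_irrelevant(sentence):
    """Short sentences that look like titles or fixed fragments are irrelevant."""
    if len(sentence) > MAX_IRRELEVANT_LEN:
        return False
    return bool(TITLE.match(sentence)) or sentence in FIXED_FORMS


def clean(text):
    """Return the sentences of text that survive the irrelevant-sentence filter."""
    return [s for s in split_sentences(text) if not is_irrelevant(s)]
```

Calling `clean` on a raw corpus string returns only the sentences worth keeping; in practice the pattern list would be extended from the features compiled in step S221.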
Step S3: based on the medical case corpus obtained in step S2, pre-train a language model for the medical case corpus.
Preferably, the language model pre-training in this step uses Google's BERT language model pre-training method.
Further, the language model pre-training for the medical case corpus specifically comprises the following steps:
Step S31: download the pre-trained Chinese language model and the source code of the language model pre-training method.
For example, taking Google's BERT as an example, download the Chinese language model chinese_L-12_H-768_A-12, open-sourced by Google and trained with BERT, together with the BERT source code.
Step S32: manually compile the table of characters that occur in the medical case records, compare it with the Chinese character table (vocabulary) in the source code, and extract the rare-character table of the TCM field.
For example, taking Google's BERT as an example, compare with the Chinese character table open-sourced by Google and extract the rare characters unique to the TCM field. Table 1 shows an example of part of the extracted rare-character table.
Table 1. Rare-character table extracted relative to the character table open-sourced by Google
Step S33: merge the rare-character table into the Chinese character table by replacing low-frequency characters in the source code's table with characters from the rare-character table, so that the length of the Chinese character table stays unchanged.
Taking the BERT training open-sourced by Google as an example, merge the Chinese character table open-sourced by Google with the rare-character table, replacing low-frequency characters in the Chinese character table with the rare characters so that the table length stays unchanged.
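The merge in step S33 can be sketched as below. The patent does not say how "low frequency of use" is measured, so the `char_freq` mapping (character counts on some reference corpus) is an assumption, as are the function name and the sample characters. Only single-character entries are considered for replacement, which leaves multi-character special tokens such as `[PAD]` untouched.

```python
def merge_vocab(base_vocab, rare_chars, char_freq):
    """Swap the least frequent single-character entries of base_vocab for the
    rare characters it lacks, keeping the vocabulary length unchanged.

    base_vocab: list of tokens, in file order
    rare_chars: iterable of domain characters that must be representable
    char_freq:  dict mapping a character to its frequency (assumed input)
    """
    present = set(base_vocab)
    protected = set(rare_chars)
    # Rare characters missing from the base vocabulary, de-duplicated in order.
    missing = [c for c in dict.fromkeys(rare_chars) if c not in present]
    # Replaceable slots: single characters that the domain does not need.
    slots = [i for i, tok in enumerate(base_vocab)
             if len(tok) == 1 and tok not in protected]
    slots.sort(key=lambda i: char_freq.get(base_vocab[i], 0))
    if len(missing) > len(slots):
        raise ValueError("not enough replaceable entries in the vocabulary")
    merged = list(base_vocab)
    for idx, ch in zip(slots, missing):
        merged[idx] = ch
    return merged
```

Replacing entries in place, rather than appending, keeps every token index valid, so the embedding matrix of the downloaded model can presumably be loaded without resizing.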
Step S34: split the cleaned medical case corpus from step S2 into paragraphs, preset a paragraph length threshold and/or a threshold on the number of sentences per paragraph, and take the paragraph texts that exceed the length threshold and/or the sentence-count threshold as the training corpus for the language model.
Preferably, the paragraphs of the medical case corpus are split into sentences according to rules 222.1 and 222.2. Preferably, the paragraph length threshold is 150 and the sentence-count threshold is 3.
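A minimal sketch of the paragraph selection in step S34, using the preferred thresholds of 150 characters and 3 sentences. The `require_both` switch models the "and/or" wording and is an assumption, as is the simplified sentence count, which only counts terminating punctuation.

```python
import re


def count_sentences(paragraph):
    """Approximate sentence count: terminators found, or 1 for unterminated text."""
    terminators = len(re.findall(r"[。？！]", paragraph))
    return terminators or (1 if paragraph.strip() else 0)


def select_paragraphs(paragraphs, min_len=150, min_sentences=3, require_both=False):
    """Keep paragraphs that exceed the length and/or sentence-count thresholds."""
    kept = []
    for p in paragraphs:
        p = p.strip()
        long_enough = len(p) > min_len
        enough_sentences = count_sentences(p) > min_sentences
        if require_both:
            ok = long_enough and enough_sentences
        else:
            ok = long_enough or enough_sentences
        if ok:
            kept.append(p)
    return kept
```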
Step S35: replace the tokenization method in the source code with a rule that splits text character by character and, starting from the downloaded pre-trained Chinese language model, apply the downloaded language model pre-training method to the training corpus from step S34 to pre-train a language model for ancient TCM medical case records.
Taking Google's BERT as an example, this step replaces the tokenization method in the BERT source code with a rule that splits text character by character and, starting from the Chinese language model open-sourced by Google, applies the BERT pre-training method to the training corpus produced in step S34 to pre-train a language model for ancient TCM medical case records.
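The character-by-character rule of step S35 can be illustrated with a few lines of plain Python. This is a sketch of the rule only and does not reproduce the BERT source code; the function names and the `[UNK]` fallback are assumptions for the example.

```python
def char_tokenize(text):
    """Split text into single characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def encode(text, vocab, unk_token="[UNK]"):
    """Map each character to its vocabulary id, falling back to the unknown token.

    vocab: dict from token to integer id (assumed to contain unk_token).
    """
    unk_id = vocab[unk_token]
    return [vocab.get(ch, unk_id) for ch in char_tokenize(text)]
```

With this rule every character of a medical case record becomes one token, so rare characters merged into the vocabulary in step S33 are represented directly instead of being mapped to the unknown token.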
Step S4: based on the cleaned medical case corpus obtained in step S2, sequence-label the corpus to form the training set for the subsequent model.
Further, the training set for the subsequent model is formed by labeling the medical case corpus with sequence labels in BIOES format. The labeling here is manual annotation.
Further, the medical case corpus is labeled in BIOES format, in which: B (Begin) marks the first character of an entity; I (Intermediate) marks a character inside an entity; E (End) marks the last character of an entity; S (Single) marks an entity consisting of a single character; and O (Other) marks characters that belong to no entity. Specifically, the labeling comprises the following steps:
Step S41: choose the entity types to recognize.
Ancient TCM medical case records involve many kinds of named entities. The preferred entity types in this embodiment are: symptom (ZZ), pulse condition (MX), tongue appearance (SX), Chinese medicinal (ZY), dosage (JL), and prescription (FJ). In practice, entity types can be added or removed according to the specific text of the medical case records or other needs; the above example does not limit the choice of entity types.
Step S42: agree on the annotation rules.
Named entities in ancient TCM medical case records have distinct domain characteristics, most prominently the "symptom" entity, so the general annotation rules need to be adapted to these characteristics during annotation. In these records, the name of a symptom generally consists of three parts: disease location, disease condition, and disease nature, for example "black tongue and parched lips". However, because doctors recorded cases informally, symptom names appear in varied forms, including:
42.1. Several symptoms appear side by side with no punctuation separating them. For this case, we agree that symptoms in a coordinate relationship are annotated separately; for example, in "bitter taste in the mouth and deafness", "bitter taste in the mouth" and "deafness" are each treated as an independent symptom name. Symptoms in a progressive relationship are annotated as one symptom; for example, "abdominal pain with diarrhea" is treated as a single symptom name.
42.2. Several disease locations appear side by side, separated by punctuation. For this case, we agree to keep only the last disease location; for example, in "chest, abdomen pain", only "abdomen pain" is treated as a symptom.
Step S43: randomly select a sentence set of a predetermined size from the cleaned corpus and write it, separated character by character, to the file to be annotated, with a blank line between sentences.
Preferably, each sentence set of the predetermined size contains 5000 sentences, accounting for 5% of the whole corpus text.
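Step S43 can be sketched as follows. The one-character-per-line layout is an assumed reading of "separated character by character"; the function name and the fixed seed are illustrative, and writing the returned string to a file is left to the caller.

```python
import random


def build_annotation_text(sentences, sample_size=5000, seed=0):
    """Randomly sample sentences and lay them out for manual annotation:
    one character per line, with a blank line between sentences."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    chosen = rng.sample(list(sentences), min(sample_size, len(sentences)))
    blocks = ["\n".join(sentence) for sentence in chosen]
    return "\n\n".join(blocks) + "\n"
```

An annotator can then append a BIOES tag to each character line of the resulting file.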
Step S44: manually annotate the selected sentence set according to the chosen entity types and the agreed annotation rules.
Taking the preferred entity types, symptom (ZZ), pulse condition (MX), tongue appearance (SX), Chinese medicinal (ZY), dosage (JL), and prescription (FJ), as an example, the manual annotation is shown in Table 2:
Table 2. Example of manual annotation of the selected sentence set
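To make the BIOES scheme concrete, the following sketch converts entity spans into per-character tags. The `PREFIX-TYPE` tag layout (for example `B-ZZ`) is an assumed convention, and the sample phrases used with it are illustrative back-translations rather than the contents of Table 2.

```python
def bioes_encode(text, entities):
    """Produce one BIOES tag per character of text.

    entities: list of (start, end, type) with end exclusive; spans are
    assumed not to overlap.
    """
    tags = ["O"] * len(text)
    for start, end, etype in entities:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"invalid span ({start}, {end})")
        if end - start == 1:
            tags[start] = f"S-{etype}"          # single-character entity
        else:
            tags[start] = f"B-{etype}"          # first character
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inner characters
            tags[end - 1] = f"E-{etype}"        # last character
    return tags
```

Under rule 42.1, a phrase with two coordinate symptoms would be passed as two spans, so each symptom receives its own B/E pair.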
Step S5: based on the sequence-labeled training set obtained in step S4, with the language model of step S3 as the encoding layer and a preset neural network structure as the decoding layer, train the corresponding sequence labeling model.
Preferably, the preset neural network structure is BiLSTM, and/or CRF, and/or Softmax.
Further, training the sequence labeling model on the sequence-labeled training set in this step specifically comprises the following steps:
Step S51: divide the data into a training set, a validation set, and a test set.
Preferably, in this step the set annotated in step S4 is evenly divided into 10 parts, such that each part contains roughly the same number of sentences for each of the six entity types. In each training run, 1 part is chosen as the validation set, 1 part as the test set, and the remaining 8 parts as the training set.
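The 8/1/1 rotation of step S51 can be sketched as below. For brevity the folds are built by shuffling and dealing round-robin; balancing the six entity types across folds, as the text prefers, is not implemented here. Choosing the fold after the test fold as the validation set is an assumption.

```python
import random


def make_folds(samples, k=10, seed=0):
    """Shuffle the samples and deal them round-robin into k folds."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]


def rotations(folds):
    """Yield (train, validation, test) so that each fold is the test set once."""
    k = len(folds)
    for i in range(k):
        dev_index = (i + 1) % k
        train = [s for j, fold in enumerate(folds)
                 if j not in (i, dev_index) for s in fold]
        yield train, folds[dev_index], folds[i]
```

After all rotations, every annotated sentence has been in a test set exactly once, which is what allows the model's predictions to be aggregated over the whole annotated set.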
Step S52: train the sequence labeling model and form the model's automatic annotation result.
Preferably, in this step the language model pre-trained in step S3 serves as the encoding layer and a neural network structure of the form "BiLSTM+CRF" serves as the decoding layer. The model is trained with the training, validation, and test sets divided in step S51, and training is repeated with different data chosen as the test set until the whole manually annotated set has been covered. The model's annotation results are then aggregated to form the model's automatic annotation result.
Step S53: compare the model's automatic annotation result with the manual annotation result, filter out the annotated sequences that are inconsistent, have them proofread manually, and finally write the proofread results back to the training set.
Step S54: judge whether the annotation requirement is met. If so, output the final annotation result; if not, return to step S51.
Further, judging whether the annotation requirement is met comprises checking whether the precision of the model reaches a preset threshold or whether the model's automatic annotation result is fully consistent with the manual annotation result. The preset threshold here can be the peak value of the precision.
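The comparison in step S53 and the stopping test in step S54 can be sketched as follows. The function names are hypothetical, and the precision value is assumed to be computed elsewhere, for example on the test folds.

```python
def find_disagreements(manual, automatic):
    """Return the indices of sentences whose tag sequences differ.

    manual, automatic: lists of tag sequences aligned by sentence index.
    """
    if len(manual) != len(automatic):
        raise ValueError("annotation lists must be aligned")
    return [i for i, (m, a) in enumerate(zip(manual, automatic))
            if list(m) != list(a)]


def requirement_met(manual, automatic, precision, threshold):
    """Step S54: stop when precision reaches the threshold or when the model
    output fully agrees with the manual annotation."""
    return precision >= threshold or not find_disagreements(manual, automatic)
```

The indices returned by `find_disagreements` identify the sequences to hand back to annotators for proofreading before the next round.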
Step S6: based on the sequence labeling model trained in step S5, perform entity recognition on ancient TCM medical case records.
Further, performing entity recognition on ancient TCM medical case records based on the trained sequence labeling model specifically comprises the following steps:
Step S61: use the trained sequence labeling model to annotate the to-be-recognized medical case corpus automatically.
Step S62: convert the character-level annotations into word-level annotations according to the general rules of the BIOES scheme, and separate the entities from the annotation results.
Again taking the preferred entity types, symptom (ZZ), pulse condition (MX), tongue appearance (SX), Chinese medicinal (ZY), dosage (JL), and prescription (FJ), as an example, Table 3 shows an example of separated annotation results.
Table 3. Example of separated annotation results
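The conversion in step S62 from character-level tags to entities can be sketched as below. The `PREFIX-TYPE` tag layout and the sample sentence are assumptions for illustration and are not taken from Table 3; ill-formed tag runs (for example an `E` without a matching `B`) are simply dropped.

```python
def bioes_decode(chars, tags):
    """Collect (entity_text, entity_type) pairs from character-level BIOES tags."""
    entities = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            entities.append((chars[i], label))
            start = None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "I":
            if start is None or label != current:
                start = None                      # broken run: discard it
        elif prefix == "E":
            if start is not None and label == current:
                entities.append(("".join(chars[start:i + 1]), label))
            start = None
        else:                                     # "O" or anything unexpected
            start = None
    return entities
```

For example, decoding the characters of a phrase such as "服甘草三钱" with tags `O, B-ZY, E-ZY, B-JL, E-JL` yields a Chinese medicinal entity followed by a dosage entity.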
As the technical solution of this embodiment shows, the TCM named entity recognition method of the embodiment, based on ancient TCM medical case records, builds on existing language model training methods, such as Google's BERT, and trains on a small-sample training set. Because the training set contains little annotated data, the cost of manual annotation is greatly reduced. The resulting named entity recognition method for ancient TCM literature makes more effective use of that literature, improves the performance of named entity recognition in the TCM field, makes the recognition of TCM named entities more accurate, and lays a good data foundation for subsequent applications in the TCM field.
Second embodiment
This embodiment provides a traditional Chinese medicine (TCM) named entity recognition system based on ancient TCM literature. The system comprises a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set annotation module, a sequence labeling model training module and an entity recognition module, wherein:
The corpus acquisition module is configured to obtain a corpus of ancient TCM medical cases;
The data cleaning module is configured to perform data cleaning on the acquired ancient TCM medical case corpus to be processed;
The language model pre-training module is configured to perform, based on the ancient TCM medical case corpus, language model pre-training oriented to that corpus;
The training set annotation module is configured to perform sequence labeling on the cleaned ancient TCM medical case corpus to form a training set for the subsequent model;
The sequence labeling model training module is configured to train the corresponding sequence labeling model on the sequence labeling training set, with the language model as the encoding layer and a preset neural network structure as the decoding layer;
The entity recognition module is configured to perform entity recognition on ancient TCM medical cases based on the sequence labeling model.
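The way the six modules hand data to one another can be sketched as follows. This is only an illustration under stated assumptions: the class name `NerSystem`, its field names and the stub callables are hypothetical, and the actual modules (OCR, BERT pre-training, the decoding network) are replaced by placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Corpus = List[str]                        # one string per sentence or paragraph
Tagged = List[Tuple[str, List[str]]]      # (sentence, per-character tags)
Recognizer = Callable[[str], List[str]]   # entity recognition module


@dataclass
class NerSystem:
    """Hypothetical wiring of the modules in the order the text lists them."""
    acquire: Callable[[], Corpus]                      # corpus acquisition
    clean: Callable[[Corpus], Corpus]                  # data cleaning
    pretrain: Callable[[Corpus], object]               # language model pre-training
    annotate: Callable[[Corpus], Tagged]               # training set annotation
    train_tagger: Callable[[object, Tagged], Recognizer]  # sequence labeling training

    def build(self) -> Recognizer:
        raw = self.acquire()
        cleaned = self.clean(raw)
        language_model = self.pretrain(cleaned)        # encoding layer
        training_set = self.annotate(cleaned)
        # The returned callable plays the role of the entity recognition module.
        return self.train_tagger(language_model, training_set)


# Stub modules, present only so the sketch runs end to end.
system = NerSystem(
    acquire=lambda: ["头痛 ", ""],
    clean=lambda corpus: [s.strip() for s in corpus if s.strip()],
    pretrain=lambda corpus: {"paragraphs_seen": len(corpus)},
    annotate=lambda corpus: [(s, ["O"] * len(s)) for s in corpus],
    train_tagger=lambda lm, train: (lambda text: ["O"] * len(text)),
)
recognize = system.build()
```

Keeping each module behind a plain callable makes it possible to swap one stage, for example the cleaning rules, without touching the others.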
The TCM named entity recognition system based on ancient TCM literature of this embodiment corresponds to the TCM named entity recognition method of the first embodiment. The description given for the first embodiment therefore applies equally to the named entity recognition system of this embodiment and is not repeated here.
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A traditional Chinese medicine (TCM) named entity recognition method based on ancient TCM literature, characterized in that the method comprises the following steps:
Step S1: obtaining a corpus of ancient TCM medical cases;
Step S2: performing data cleaning on the to-be-processed ancient TCM medical case corpus obtained in step S1;
Step S3: performing language model pre-training oriented to the ancient TCM medical case corpus, based on the corpus obtained in step S2;
Step S4: performing sequence labeling on the cleaned ancient TCM medical case corpus obtained in step S2 to form a training set for the subsequent model;
Step S5: based on the sequence labeling training set obtained in step S4, training the corresponding sequence labeling model with the language model of step S3 as the encoding layer and a preset neural network structure as the decoding layer;
Step S6: performing entity recognition on ancient TCM medical cases based on the sequence labeling model trained in step S5.
2. The TCM named entity recognition method according to claim 1, characterized in that obtaining the ancient TCM medical case corpus in step S1 specifically comprises the following steps:
Step S11: scanning and recognizing existing paper editions of ancient TCM medical case books using optical character recognition to form an electronic text corpus;
Step S12: crawling from the web, using an open-source web crawler, ancient TCM medical case corpora for which no paper edition is available;
Step S13: comparing and merging the corpus texts obtained in steps S11 and S12 to form a unified ancient TCM medical case corpus to be processed.
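The compare-and-merge of step S13 is not detailed in the claim. One possible reading, sketched below in Python, treats two texts as the same passage when they are equal after Unicode normalization and whitespace removal, and keeps the first copy. The function names and the equality criterion are assumptions made for illustration.

```python
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Comparison key: NFKC-normalized text with all whitespace removed."""
    return "".join(unicodedata.normalize("NFKC", text).split())


def merge_corpora(ocr_texts: Iterable[str], crawled_texts: Iterable[str]) -> List[str]:
    """Merge both sources, keeping the first copy of each distinct passage."""
    merged: List[str] = []
    seen = set()
    for text in list(ocr_texts) + list(crawled_texts):
        key = normalize(text)
        if key and key not in seen:      # skip empty and already-seen passages
            seen.add(key)
            merged.append(text.strip())
    return merged


# Hypothetical inputs: the crawled copy of the first passage differs only in spacing.
unified = merge_corpora(["头痛 发热", "脉浮"], ["头痛发热", "舌红"])
```

Exact-match comparison will not reconcile OCR misreadings; a real pipeline would likely need fuzzy matching or manual review for those cases.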
3. The TCM named entity recognition method according to claim 1, characterized in that performing data cleaning on the to-be-processed ancient TCM medical case corpus in step S2 specifically comprises the following steps:
Step S21: correcting erroneous characters;
Step S22: filtering out irrelevant sentences.
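The claim does not say how characters are corrected or how relevance is judged. The Python sketch below assumes a manually maintained single-character correction table for step S21 and a caller-supplied relevance predicate for step S22; both, along with the sample data, are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def has_cjk(sentence: str) -> bool:
    """Example predicate: keep sentences with at least one CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in sentence)


def clean_corpus(sentences: Iterable[str],
                 corrections: Dict[str, str],
                 is_relevant: Callable[[str], bool]) -> List[str]:
    """Apply character corrections (S21), then drop irrelevant sentences (S22)."""
    table = str.maketrans(corrections)   # keys must be single characters
    cleaned = []
    for sentence in sentences:
        fixed = sentence.translate(table)
        if is_relevant(fixed):
            cleaned.append(fixed)
    return cleaned


# Hypothetical OCR confusion (巳 read for 已) and a stray page-number line.
result = clean_corpus(["病巳愈", "- 12 -"], {"巳": "已"}, has_cjk)
```

A context-free substitution table only suits characters that are wrong wherever they occur; context-dependent errors would need a different mechanism.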
4. The TCM named entity recognition method according to claim 1, characterized in that the language model pre-training in step S3 specifically comprises the following steps:
Step S31: downloading the source code of the language model pre-training method and a Chinese language training model;
Step S32: manually compiling the character table involved in ancient TCM medical cases and comparing it with the Chinese character table in the source code to separate out a table of rare characters for the TCM domain;
Step S33: merging the rare character table with the Chinese character table by replacing characters of low usage frequency in the source code's table with characters from the rare character table, so that the length of the Chinese character table remains unchanged;
Step S34: segmenting the cleaned ancient TCM medical case corpus from step S2 into paragraphs, presetting a paragraph length threshold and/or a threshold on the number of sentences per paragraph, and taking the paragraph texts that exceed the paragraph length threshold and/or the sentence number threshold as the training corpus of the language training model;
Step S35: replacing the tokenization method in the source code with a character-by-character splitting rule and, starting from the Chinese language training model, performing language model pre-training oriented to ancient TCM medical cases on the training corpus of step S34 using the downloaded language model pre-training method.
5. The TCM named entity recognition method according to claim 1, characterized in that the language model pre-training in step S3 uses Google's language model pre-training method bert.
6. The TCM named entity recognition method according to claim 5, characterized in that the language model pre-training specifically comprises:
Step S31: downloading Google's open-source Chinese language model chinese_L-12_H-768_A-12, trained with bert, together with the bert source code;
Step S32: comparing with Google's open-source Chinese character table to separate out a table of rare characters unique to the TCM domain;
Step S33: merging Google's open-source Chinese character table with the rare character table, replacing characters of low usage frequency in the Chinese character table with characters from the rare character table so that the table length remains unchanged;
Step S34: segmenting the cleaned ancient TCM medical case corpus from step S2 into paragraphs, presetting a paragraph length threshold and/or a threshold on the number of sentences per paragraph, and taking the paragraph texts that exceed the paragraph length threshold and/or the sentence number threshold as the training corpus of the language training model;
Step S35: replacing the tokenization method in the bert source code with a character-by-character splitting rule and, starting from Google's open-source Chinese language model, performing language model pre-training oriented to ancient TCM medical cases on the training corpus obtained in step S34 using the bert pre-training method.
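The paragraph selection of step S34 and the character-by-character splitting of step S35 can be sketched in Python as below. The helper names are hypothetical and are not functions of the bert source code; the sentence delimiters, the newline paragraph separator and the choice to require both thresholds when both are given are likewise assumptions.

```python
import re
from typing import List, Optional

SENTENCE_END = re.compile(r"[。！？；]")   # assumed sentence delimiters


def count_sentences(paragraph: str) -> int:
    """Count non-empty pieces between sentence delimiters."""
    return len([s for s in SENTENCE_END.split(paragraph) if s.strip()])


def select_paragraphs(text: str,
                      min_chars: Optional[int] = None,
                      min_sentences: Optional[int] = None) -> List[str]:
    """Keep paragraphs that exceed every threshold that is set (step S34)."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    kept = []
    for paragraph in paragraphs:
        if min_chars is not None and len(paragraph) <= min_chars:
            continue
        if min_sentences is not None and count_sentences(paragraph) <= min_sentences:
            continue
        kept.append(paragraph)
    return kept


def char_tokenize(paragraph: str) -> List[str]:
    """Split into single characters, dropping whitespace (step S35)."""
    return [ch for ch in paragraph if not ch.isspace()]


# Hypothetical two-paragraph corpus; only the longer paragraph passes.
corpus_text = "头痛。\n发热恶寒，脉浮紧。无汗而喘。身疼腰痛。\n"
training_paragraphs = select_paragraphs(corpus_text, min_chars=5, min_sentences=2)
```

Passing only one of the two thresholds covers the "or" branch of the claim; passing both covers the "and" branch.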
7. The TCM named entity recognition method according to claim 1, characterized in that the training set for the subsequent model in step S4 is formed by performing sequence labeling in BIOES format on the ancient TCM medical case corpus.
8. The TCM named entity recognition method according to claim 7, characterized in that performing sequence labeling in BIOES format on the ancient TCM medical case corpus specifically comprises the following steps:
Step S41: selecting the entity types to be recognized;
Step S42: agreeing on the annotation rules;
Step S43: randomly selecting a sentence set of predetermined size from the cleaned corpus, splitting it character by character and writing it to a file to be annotated, with sentences separated by blank lines;
Step S44: manually annotating the selected sentence set of predetermined size according to the selected entity types and the agreed annotation rules.
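Step S43 can be sketched in Python as follows, assuming one character per line in the file to be annotated and a fixed random seed for reproducibility. The function name, the seed parameter and the one-character-per-line layout are assumptions; the claim only requires character-wise separation and blank lines between sentences.

```python
import io
import random
from typing import List, TextIO


def write_annotation_file(sentences: List[str], sample_size: int,
                          out: TextIO, seed: int = 0) -> List[str]:
    """Randomly sample sentences and write them one character per line.

    A blank line follows each sentence. Returns the sampled sentences
    in the order they were written.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sentences, min(sample_size, len(sentences)))
    for sentence in chosen:
        for ch in sentence:
            if not ch.isspace():
                out.write(ch + "\n")
        out.write("\n")                  # blank line separates sentences
    return chosen


# Hypothetical cleaned corpus of three sentences, sampling two of them.
buffer = io.StringIO()
sampled = write_annotation_file(["头痛", "脉浮", "舌红"], 2, buffer, seed=1)
```

An annotator would then append a tag after each character, which yields the per-character layout that sequence labeling tools commonly read.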
9. The TCM named entity recognition method according to claim 7, characterized in that the entity types are: symptom ZZ, pulse condition MX, tongue manifestation SX, Chinese medicinal ZY, dosage JL and prescription FJ.
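To show how these six type codes could appear in recognition output, the Python sketch below reads a per-character BIOES tag sequence back into typed entities. The `B-ZZ` style tag spelling, the function name and the sample sentence are assumptions made for illustration; inconsistent tag runs are simply discarded.

```python
from typing import List, Optional, Tuple

TYPE_NAMES = {
    "ZZ": "symptom", "MX": "pulse condition", "SX": "tongue manifestation",
    "ZY": "Chinese medicinal", "JL": "dosage", "FJ": "prescription",
}


def decode_bioes(chars: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Return (entity_text, type_code) pairs from per-character BIOES tags."""
    entities: List[Tuple[str, str]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, code = tag.partition("-")
        if prefix == "S" and code in TYPE_NAMES:
            entities.append((chars[i], code))          # single-character entity
            start, current = None, None
        elif prefix == "B" and code in TYPE_NAMES:
            start, current = i, code                   # open a new entity
        elif prefix == "I" and start is not None and code == current:
            continue                                   # still inside the entity
        elif prefix == "E" and start is not None and code == current:
            entities.append((chars[start:i + 1], code))
            start, current = None, None
        else:
            start, current = None, None                # "O" or inconsistent tag
    return entities


# Hypothetical tagged sentence: a medicinal followed by its dosage.
found = decode_bioes("甘草三钱，", ["B-ZY", "E-ZY", "B-JL", "E-JL", "O"])
```

Looking each code up in `TYPE_NAMES` gives a readable label, for example `TYPE_NAMES["JL"]` for the dosage entity.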
10. A traditional Chinese medicine (TCM) named entity recognition system based on ancient TCM literature, characterized in that the system comprises a corpus acquisition module, a data cleaning module, a language model pre-training module, a training set annotation module, a sequence labeling model training module and an entity recognition module, wherein:
The corpus acquisition module is configured to obtain a corpus of ancient TCM medical cases;
The data cleaning module is configured to perform data cleaning on the acquired ancient TCM medical case corpus to be processed;
The language model pre-training module is configured to perform, based on the ancient TCM medical case corpus, language model pre-training oriented to that corpus;
The training set annotation module is configured to perform sequence labeling on the cleaned ancient TCM medical case corpus to form a training set for the subsequent model;
The sequence labeling model training module is configured to train the corresponding sequence labeling model on the sequence labeling training set, with the language model as the encoding layer and a preset neural network structure as the decoding layer;
The entity recognition module is configured to perform entity recognition on ancient TCM medical cases based on the sequence labeling model.
CN201910367376.9A 2019-05-05 2019-05-05 Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature Active CN110134953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910367376.9A CN110134953B (en) 2019-05-05 2019-05-05 Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910367376.9A CN110134953B (en) 2019-05-05 2019-05-05 Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature

Publications (2)

Publication Number Publication Date
CN110134953A true CN110134953A (en) 2019-08-16
CN110134953B CN110134953B (en) 2020-12-18

Family

ID=67576196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910367376.9A Active CN110134953B (en) 2019-05-05 2019-05-05 Traditional Chinese medicine named entity recognition method and recognition system based on traditional Chinese medicine ancient book literature

Country Status (1)

Country Link
CN (1) CN110134953B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259626A (en) * 2020-01-16 2020-06-09 上海国民集团健康科技有限公司 Traditional Chinese medicine entity recognition algorithm
CN111274764A (en) * 2020-01-23 2020-06-12 北京百度网讯科技有限公司 Language generation method and device, computer equipment and storage medium
CN111312356A (en) * 2020-01-17 2020-06-19 四川大学 Traditional Chinese medicine prescription generation method based on BERT and integration efficacy information
CN111507351A (en) * 2020-04-16 2020-08-07 华南理工大学 Ancient book document digitalization method
CN112163410A (en) * 2020-10-14 2021-01-01 四川大学 Ancient text pre-training system based on deep learning and training method thereof
CN112183099A (en) * 2020-10-09 2021-01-05 上海明略人工智能(集团)有限公司 Named entity identification method and system based on semi-supervised small sample extension
CN112364655A (en) * 2020-10-30 2021-02-12 北京中科凡语科技有限公司 Named entity recognition model establishing method and named entity recognition method
CN112949310A (en) * 2021-03-01 2021-06-11 创新奇智(上海)科技有限公司 Model training method, traditional Chinese medicine name recognition method and device and network model
CN113128225A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 Named entity identification method and device, electronic equipment and computer storage medium
CN113255357A (en) * 2021-06-24 2021-08-13 北京金山数字娱乐科技有限公司 Data processing method, target recognition model training method, target recognition method and device
CN114297693A (en) * 2021-12-30 2022-04-08 北京海泰方圆科技股份有限公司 Model pre-training method and device, electronic equipment and storage medium
CN116796742A (en) * 2023-03-27 2023-09-22 上海交通大学医学院 Method, device, equipment and storage medium for identifying ancient books named entity of traditional Chinese medicine

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN109190113A (en) * 2018-08-10 2019-01-11 北京科技大学 A kind of knowledge mapping construction method of theory of traditional Chinese medical science ancient books and records

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MACANV: "基于BERT预训练的中文命名实体识别TensorFlow实现", 《CSDN博客-HTTPS://BLOG.CSDN.NET/MACANV/ARTICLE/DETAILS/85684284》 *
YONGHONG XIE ET AL.: "Semantic Feature Expansion Technology Based on Knowledge Map", 《14TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY》 *
高甦 等: "基于深度学习的中医典籍命名实体识别研究", 《情报工程》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128225A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 Named entity identification method and device, electronic equipment and computer storage medium
CN111259626A (en) * 2020-01-16 2020-06-09 上海国民集团健康科技有限公司 Traditional Chinese medicine entity recognition algorithm
CN111312356A (en) * 2020-01-17 2020-06-19 四川大学 Traditional Chinese medicine prescription generation method based on BERT and integration efficacy information
CN111312356B (en) * 2020-01-17 2022-07-01 四川大学 Traditional Chinese medicine prescription generation method based on BERT and integration efficacy information
CN111274764A (en) * 2020-01-23 2020-06-12 北京百度网讯科技有限公司 Language generation method and device, computer equipment and storage medium
CN111507351A (en) * 2020-04-16 2020-08-07 华南理工大学 Ancient book document digitalization method
CN111507351B (en) * 2020-04-16 2023-05-30 华南理工大学 Ancient book document digitizing method
CN112183099A (en) * 2020-10-09 2021-01-05 上海明略人工智能(集团)有限公司 Named entity identification method and system based on semi-supervised small sample extension
CN112163410A (en) * 2020-10-14 2021-01-01 四川大学 Ancient text pre-training system based on deep learning and training method thereof
CN112364655B (en) * 2020-10-30 2021-08-24 北京中科凡语科技有限公司 Named entity recognition model establishing method and named entity recognition method
CN113807097A (en) * 2020-10-30 2021-12-17 北京中科凡语科技有限公司 Named entity recognition model establishing method and named entity recognition method
CN112364655A (en) * 2020-10-30 2021-02-12 北京中科凡语科技有限公司 Named entity recognition model establishing method and named entity recognition method
CN112949310A (en) * 2021-03-01 2021-06-11 创新奇智(上海)科技有限公司 Model training method, traditional Chinese medicine name recognition method and device and network model
CN113255357A (en) * 2021-06-24 2021-08-13 北京金山数字娱乐科技有限公司 Data processing method, target recognition model training method, target recognition method and device
CN114297693A (en) * 2021-12-30 2022-04-08 北京海泰方圆科技股份有限公司 Model pre-training method and device, electronic equipment and storage medium
CN116796742A (en) * 2023-03-27 2023-09-22 上海交通大学医学院 Method, device, equipment and storage medium for identifying ancient books named entity of traditional Chinese medicine

Also Published As

Publication number Publication date
CN110134953B (en) 2020-12-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant