CN110879831A - Chinese medicine sentence word segmentation method based on entity recognition technology

Info

Publication number: CN110879831A
Application number: CN201910967537.8A
Authority: CN
Inventors: 崔智颖; 佘莉; 黄剑平
Original assignee: Hangzhou Normal University
Current assignee: Hangzhou Normal University
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-03-13

Abstract

The invention discloses a Chinese medicine sentence word segmentation method based on entity recognition technology, comprising the following steps: collecting Chinese medicine sentences as a corpus; extracting common phrases from the corpus to form a lexicon W; classifying the lexicon W and, according to the classification result, sequence-labeling each element of each sequence of the corpus; pre-training on the corpus with a BERT language model to obtain word vectors; constructing a recurrent neural network model with a conditional random field (BiLSTM-CRF) and training it on the labeled data set; inputting a test sentence into the trained model to obtain an entity list; performing a first segmentation of the input sentence according to the entity list, cutting out the elements contained in the list; scanning the remaining unsegmented text against a prefix dictionary to generate every possible word formed by the Chinese characters in the sentence, which together form a directed acyclic graph; and applying dynamic programming to the directed acyclic graph to find the maximum-probability path, yielding the most probable segmentation based on word frequency.

Description

Chinese medicine sentence word segmentation method based on entity recognition technology

Technical Field

The invention relates to the technical field of Chinese medicine informatization, in particular to a Chinese medicine sentence word segmentation method based on an entity recognition technology.

Background

Traditional Chinese medicine is a traditional medicine of China with unique characteristics and advantages, and it has accumulated thousands of years of theoretical knowledge and clinical experience. To modernize traditional Chinese medicine, the main information in its literature needs to be extracted and collated with modern computer technology, so as to facilitate research and analysis and to promote the use, transformation, dissemination, and popularization of traditional Chinese medicine knowledge.

The extraction and collation of the main information in a text can be treated as a Chinese word segmentation problem. Researchers have published a succession of increasingly sophisticated Chinese word segmentation algorithms. However, when these algorithms are applied to the field of traditional Chinese medicine, the results are unsatisfactory. Over thousands of years of historical change, Chinese grammar and modes of expression have changed enormously. Descriptions of traditional Chinese medicine symptoms are mostly written in dialect or in a semi-classical, semi-vernacular style, which makes word segmentation more difficult; meanwhile, there is no unified national standard for symptom terms, and symptom descriptions are flexible and variable in form, which places higher demands on the correctness and completeness of symptom identification.

Researchers have previously attempted word segmentation of Chinese medicine texts, but most of them adopted methods based on probability statistics. In practice, these methods improve segmentation accuracy for short Chinese medicine symptom terms, but the improvement for long symptom terms in Chinese medicine symptom description sentences is limited.

The key to word segmentation is information extraction, and extraction requires determining the boundaries of the information to be extracted, which closely resembles entity recognition. Entity recognition technology has become important in natural language processing tasks. However, different named entities have different internal features, and no single unified model can describe the internal features of all entities. A dedicated entity recognition model is therefore used to recognize traditional Chinese medicine terms, so that the word boundaries of these terms are better determined, thereby assisting the word segmentation of traditional Chinese medicine symptom sentences.

Research on how to use entity recognition technology to segment Chinese medicine symptom description sentences efficiently and accurately is therefore of great significance.

Disclosure of Invention

To address the above deficiencies in the field, the invention provides a Chinese medicine sentence word segmentation method based on entity recognition technology. The method is a Chinese word segmentation method based on deep-learning natural language processing; it improves the accuracy of word segmentation for Chinese medicine symptom description sentences and can greatly reduce the workload.

A Chinese medicine sentence word segmentation method based on an entity recognition technology comprises the following steps:

(A) collecting traditional Chinese medicine sentences, and using the data as a corpus after cleaning;

(B) counting the occurrence frequency of adjacent Chinese character combinations in the corpus, and extracting common phrases from the combinations whose frequency exceeds a certain threshold to form a lexicon W; calculating the forward conditional probability and the reverse conditional probability of the remaining adjacent Chinese character combinations, extracting the combinations that meet the specified threshold, and adding them to the lexicon W;

(C) classifying the lexicon W, and sequence-labeling each element of each sequence of the corpus according to the classification result;

(D) pre-training on the corpus with a BERT language model to obtain word vectors;

(E) constructing a recurrent neural network model with a conditional random field (BiLSTM-CRF), and training the model on the labeled data set;

(F) inputting a test sentence into the trained BiLSTM-CRF model to obtain a predicted entity list;

(G) performing a first segmentation of the input test sentence according to the predicted entity list, cutting out the elements contained in the list;

(H) scanning the remaining unsegmented text against a prefix dictionary to generate every possible word formed by the Chinese characters in the sentence, which together form a directed acyclic graph;

(I) applying dynamic programming to the directed acyclic graph to find the maximum-probability path, yielding the most probable segmentation based on word frequency.

In step (A), the Chinese medicine sentence data can be collected and sorted from existing Chinese medicine databases, websites, documents, and similar channels. The collected sentences are integrated into a large-scale Chinese medicine text database and then preliminarily split at Chinese and English punctuation marks to serve as the corpus.

In step (B), the forward conditional probability and the reverse conditional probability are calculated according to formula (I) and formula (II), respectively:

P(Y | X) = count(XY) / count(X)    (I)

P(X | Y) = count(XY) / count(Y)    (II)

where, for any adjacent Chinese character combination XY with X in front and Y behind, P(Y | X) and P(X | Y) are respectively the forward conditional probability and the reverse conditional probability of the combination, count(XY) is the frequency with which the combination appears in the corpus, and count(X) and count(Y) are respectively the frequencies with which the Chinese characters X and Y appear in the corpus.

Preferably, in step (C), the lexicon W is divided into three types: "body" (body part), "age-level" (age group), and "symptom".

Preferably, in step (C), each element of each sequence of the corpus is labeled in the form "B-M", "I-M", or "O" using BIO labels according to the classification result;

where "M" represents the type of the segment to which the element belongs, "B" and "I" indicate that the element is at the start position or a non-start position of the segment, respectively, and "O" indicates that the element does not belong to any type.

In step (D), BERT is a fine-tuning-based multi-layer bidirectional Transformer encoder; it further improves the generalization capability of the word vector model and captures character-level, word-level, sentence-level, and even inter-sentence relationship features.

In step (E), the core of the recurrent neural network model with a conditional random field consists mainly of two layers:

the first is a bidirectional long short-term memory (BiLSTM) network layer, which extracts the features of the input sequence and finally outputs a probability distribution matrix of the label type of each character in the sequence;

the second is a conditional random field (CRF) layer, which uses the probability distribution matrix to determine the most reasonable sequence path in the space of all feasible label sequences, giving the label of each character.

In steps (F) and (G), test sentences are input into the BiLSTM-CRF model to obtain predicted entities, which form an entity list; every element of the list is then cut from the input sentence in turn.

In step (H), the remaining unsegmented text is looked up in a given dictionary to generate the possible segmentations, which together form a directed acyclic graph.

In step (I), the maximum probability of the sentence is calculated over the directed acyclic graph in reverse, from right to left; that is, on reaching a node, the maximum path probability from that node to the end point is calculated from the values already obtained for the nodes after it;

each node of the directed acyclic graph is weighted, and for a word in the prefix dictionary the weight is the word frequency of that word.

Compared with the prior art, the invention has the main advantages that:

(1) The invention introduces a BERT + BiLSTM + CRF network model into the Chinese word segmentation process. The model predicts Chinese medicine symptom term entities, and using the predicted entities for word segmentation improves the accuracy of segmenting Chinese medicine symptom description sentences.

(2) In the data processing process, manual sequence labeling is a difficult task because of the large data volume and the large amount of professional domain knowledge required for manual labeling. The invention extracts the commonly used phrases from the corpus by using a statistical method, automatically marks the commonly used phrases according to the classification of the phrases, and can greatly reduce the workload.

Drawings

FIG. 1 is a schematic diagram of a word segmentation judgment process in the present invention;

FIG. 2 is a diagram of the BERT + BiLSTM + CRF network model structure.

Detailed Description

The invention is further described with reference to the following drawings and specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Where conditions are not specified in the following examples, conventional conditions or the conditions recommended by the manufacturer are used.

The process of the Chinese medicine sentence word segmentation method based on entity recognition technology in this embodiment is shown in FIG. 1 and specifically includes:

Step A, collecting Chinese medicine symptom description data and cleaning the data. Traditional Chinese medicine sentences are collected and sorted from existing channels such as traditional Chinese medicine databases, websites, and documents, and integrated into a large traditional Chinese medicine text database. The symptom description sentences are then preliminarily split at Chinese and English punctuation marks to serve as the corpus.
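The preliminary split in Step A can be sketched as follows. This is a minimal illustration only: the punctuation set, the function name `split_clauses`, and the sample text are assumptions, since the patent does not list the punctuation marks or give sample data.

```python
import re

# Chinese and English punctuation treated as clause boundaries (an assumed set).
PUNCTUATION = r"[，。；：？！、,.;:?!]+"


def split_clauses(text: str) -> list[str]:
    """Split raw text at punctuation marks and drop empty fragments."""
    return [part.strip() for part in re.split(PUNCTUATION, text) if part.strip()]


# Hypothetical symptom description used only for illustration.
clauses = split_clauses("小儿惊风，发热；咳嗽。")
print(clauses)
```

Each resulting clause is one sequence of the corpus used in the later steps.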

Step B, counting the occurrence frequency of adjacent Chinese character combinations in the corpus, extracting the combinations whose frequency exceeds a certain threshold, and manually screening common phrases from the extracted combinations to serve as the basic lexicon W. The forward conditional probability and the reverse conditional probability of the remaining adjacent Chinese character combinations are then calculated, and the phrases that fall within the set threshold range are added to the lexicon W.

First, for any adjacent Chinese character combination XY (X in front, Y behind), the frequency count(XY) of the combination in the corpus, the frequency count(X) of the Chinese character X in the corpus, and the frequency count(Y) of the Chinese character Y in the corpus are calculated. Then the forward conditional probability of the combination XY is calculated as P(Y | X) = count(XY) / count(X), and the reverse conditional probability is calculated as P(X | Y) = count(XY) / count(Y).
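The statistics of Step B can be sketched as below. The toy corpus, the function names, and the threshold values are assumptions made for illustration. The sketch also assumes that a combination qualifies only when both conditional probabilities reach the threshold (the patent says only that the phrase must meet the set threshold range), and it omits the manual screening of frequent combinations.

```python
from collections import Counter


def bigram_statistics(clauses):
    """Count single characters and adjacent character pairs in the corpus."""
    char_count, pair_count = Counter(), Counter()
    for clause in clauses:
        char_count.update(clause)
        pair_count.update(clause[i:i + 2] for i in range(len(clause) - 1))
    return char_count, pair_count


def conditional_probabilities(pair, char_count, pair_count):
    """Return (P(Y|X), P(X|Y)) for the adjacent pair XY."""
    x, y = pair
    forward = pair_count[pair] / char_count[x]
    reverse = pair_count[pair] / char_count[y]
    return forward, reverse


def build_lexicon(clauses, min_count, min_prob):
    char_count, pair_count = bigram_statistics(clauses)
    # Stage 1: frequent pairs (manual screening in the patent is skipped here).
    lexicon = {pair for pair, count in pair_count.items() if count > min_count}
    # Stage 2: remaining pairs whose two conditional probabilities both
    # reach min_prob (an assumed reading of the threshold rule).
    for pair in pair_count:
        if pair in lexicon:
            continue
        forward, reverse = conditional_probabilities(pair, char_count, pair_count)
        if forward >= min_prob and reverse >= min_prob:
            lexicon.add(pair)
    return lexicon


# Hypothetical toy corpus.
corpus = ["小儿惊风", "小儿发热", "小儿咳嗽", "发热咳嗽"]
print(sorted(build_lexicon(corpus, min_count=2, min_prob=0.9)))
```

In this toy corpus the pair "小儿" passes the frequency stage, while "惊风", "发热", and "咳嗽" are rare but always occur together, so they enter the lexicon through the conditional probability stage.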

Step C, classifying the lexicon W and labeling the sequences according to the classification result. The phrases in the lexicon W are divided into three categories: "body" (body part), "age-level" (age group), and "symptom". For example, "nose" belongs to the body category, "child" belongs to the age-level category, and "red swelling" belongs to the symptom category.

Each element in the corpus is sequence-labeled with BIO labels, denoted here as "B-M", "I-M", and "O", where "M" represents the type of the segment to which the element belongs, "B" indicates that the element is at the start of the segment, "I" indicates that the element is inside the segment, and "O" indicates that the element does not belong to any type. For example, the two-character word "nose" should be labeled "B-body I-body".
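The automatic labeling of Step C might look like the sketch below. The patent does not say how lexicon entries are matched against the corpus, so forward longest matching is an assumption here, as are the function name `bio_label`, the Chinese renderings of the example words, and the sample sentence.

```python
def bio_label(sentence, lexicon):
    """Label each character B-<type>, I-<type>, or O by forward longest match."""
    labels = ["O"] * len(sentence)
    max_len = max(map(len, lexicon), default=0)
    i = 0
    while i < len(sentence):
        # Try the longest candidate first so longer lexicon entries win.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + length]
            if word in lexicon:
                kind = lexicon[word]
                labels[i] = f"B-{kind}"
                for j in range(i + 1, i + length):
                    labels[j] = f"I-{kind}"
                i += length
                break
        else:
            i += 1  # no lexicon entry starts here; the character stays "O"
    return labels


# Hypothetical classified lexicon W: "child", "nose", "red swelling".
lexicon_w = {"小儿": "age-level", "鼻子": "body", "红肿": "symptom"}
print(bio_label("小儿鼻子红肿甚", lexicon_w))
```

The final character of the sample sentence matches no lexicon entry and therefore keeps the label "O".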

Step D, pre-training on the corpus with the BERT language model to generate word vectors.

Step E, constructing a recurrent neural network model with a conditional random field (BiLSTM-CRF) and training the model on the labeled data set. Referring to FIG. 2, the BERT layer uses a pre-trained matrix to map each input word to a low-dimensional dense word vector and passes it to the BiLSTM layer. The BiLSTM layer integrates and extracts the features of the input sequence; for an input text sequence, its final output is a probability distribution matrix of the label type of each character in the sequence. Finally, the CRF layer uses the probability distribution matrix produced by the BiLSTM layer to determine the most reasonable sequence path in the space of all feasible label sequences, where each label in the path is the label of the character at the same position.
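The path search performed by the CRF layer is commonly implemented with Viterbi decoding; the patent does not name the decoding algorithm, so that choice is an assumption. The sketch below works on hand-made scores: in the actual model the emission scores would come from the BiLSTM layer and the transition scores would be learned by the CRF, whereas every number here is invented for illustration.

```python
def viterbi_decode(emissions, transitions, tags):
    """Find the highest-scoring tag sequence.

    emissions[t][k]   score of tag k at position t
    transitions[j][k] score of moving from tag j to tag k
    """
    n = len(tags)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step, new_score = [], []
        for k in range(n):
            best_prev = max(range(n), key=lambda j: score[j] + transitions[j][k])
            step.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        backpointers.append(step)
        score = new_score
    # Trace the best path backwards from the best final tag.
    path = [max(range(n), key=lambda k: score[k])]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return [tags[k] for k in path]


tags = ["B-sym", "I-sym", "O"]
# Invented scores: the per-position best tags would be O then I-sym,
# which is not a valid BIO sequence.
emissions = [[0.5, 0.1, 1.0], [0.2, 2.0, 0.3]]
# Invented transitions: moving from O directly to I-sym is heavily penalized.
transitions = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -10.0, 0.0]]
print(viterbi_decode(emissions, transitions, tags))
```

The example shows why a sequence-level layer helps: choosing the best tag per character independently would produce "O" followed by "I-sym", while the decoded path respects the transition penalty and returns "B-sym" followed by "I-sym".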

Step F, inputting a test sentence into the BiLSTM-CRF model to obtain the predicted entities, which form an entity list L.

Step G, cutting every element of L from the input sentence in turn, according to the obtained list L.

Step H, looking up the remaining unsegmented text in a given dictionary to generate the possible sentence segmentations, which together form a directed acyclic graph.
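Turning the predicted labels into the entity list L can be sketched as follows. The function name `extract_entities`, the sample sentence, and its labels are hypothetical; the labels are written out by hand in place of real model output.

```python
def extract_entities(sentence, labels):
    """Group characters into (text, type) entities from BIO labels."""
    entities = []
    current, kind = "", None
    for char, label in zip(sentence, labels):
        if label.startswith("B-"):
            if current:
                entities.append((current, kind))
            current, kind = char, label[2:]
        elif label.startswith("I-") and current and label[2:] == kind:
            current += char
        else:
            # "O", or an "I-" label that does not continue the open entity.
            if current:
                entities.append((current, kind))
            current, kind = "", None
    if current:
        entities.append((current, kind))
    return entities


# Hand-written labels standing in for the model's prediction.
sentence = "小儿鼻子红肿甚"
labels = ["B-age-level", "I-age-level", "B-body", "I-body", "B-symptom", "I-symptom", "O"]
print(extract_entities(sentence, labels))
```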
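The first segmentation of Step G can be sketched as below. The patent does not describe how overlapping or repeated matches are resolved; this sketch assumes a left-to-right scan that takes the earliest match and prefers the longer entity on ties. The function name and the sample data are hypothetical.

```python
def cut_entities(sentence, entity_list):
    """Split a sentence into (text, is_entity) pieces using the entity list."""
    entities = [e for e in entity_list if e]  # ignore empty strings
    pieces, pos = [], 0
    while pos < len(sentence):
        best = None  # (index, entity) of the earliest, then longest, match
        for entity in entities:
            idx = sentence.find(entity, pos)
            if idx == -1:
                continue
            if best is None or (idx, -len(entity)) < (best[0], -len(best[1])):
                best = (idx, entity)
        if best is None:
            pieces.append((sentence[pos:], False))
            break
        idx, entity = best
        if idx > pos:
            pieces.append((sentence[pos:idx], False))  # text left for Steps H and I
        pieces.append((entity, True))
        pos = idx + len(entity)
    return pieces


# Hypothetical sentence and predicted entity list L.
print(cut_entities("小儿鼻子红肿发热", ["小儿", "红肿"]))
```

The pieces marked `False` are the remaining unsegmented text that the dictionary-based steps handle next.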

For example, for "infantile convulsions", the first character ("small") has two candidate words, "small" and "infantile"; the second character ("er") is not the prefix of any longer word, so it has only one candidate; the third character ("surprise") has two candidates, "surprise" and "fright"; and so on, so that the candidate dictionary words starting from each character are obtained.
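The construction of the directed acyclic graph can be sketched as follows. The four-character sentence, the dictionary, and the word frequencies are invented for illustration, and storing prefixes with frequency 0 is an assumed design that lets the scan stop as soon as a fragment can no longer grow into a dictionary word.

```python
def build_prefix_dict(word_freq):
    """Copy the dictionary and add every proper prefix with frequency 0."""
    prefix = dict(word_freq)
    for word in word_freq:
        for i in range(1, len(word)):
            prefix.setdefault(word[:i], 0)
    return prefix


def build_dag(sentence, prefix):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends = []
        end = start
        fragment = sentence[start]
        while end < len(sentence) and fragment in prefix:
            if prefix[fragment] > 0:  # a real word, not only a prefix
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        dag[start] = ends or [start]  # unknown characters stand alone
    return dag


# Invented dictionary with word frequencies.
word_freq = {"小": 5, "小儿": 8, "儿": 3, "惊": 2, "惊风": 6, "风": 4}
dag = build_dag("小儿惊风", build_prefix_dict(word_freq))
print(dag)
```

In the printed graph, the first and third characters each have two outgoing edges and the second and fourth have one, mirroring the candidates listed in the example above.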

Step I, after obtaining the directed acyclic graph formed by all possible segmentations, calculating the maximum-probability path from right to left; that is, on reaching a node, the maximum path probability from that node to the end point is calculated from the values already obtained for the nodes after it. Each node of the directed acyclic graph is weighted, and for a word in the prefix dictionary the weight is the word frequency of that word.
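The right-to-left dynamic programming can be sketched as below. The graph and the frequencies are invented, and the use of log probabilities (word frequency divided by the total frequency) is an assumption made to avoid numeric underflow; the patent states only that the weights are word frequencies.

```python
import math


def best_path(sentence, dag, word_freq):
    """Return the maximum-probability segmentation over the DAG."""
    log_total = math.log(sum(word_freq.values()))
    n = len(sentence)
    route = {n: (0.0, n)}  # position -> (best log probability to the end, word end)
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (
                # Unknown words get frequency 1 so the logarithm is defined.
                math.log(word_freq.get(sentence[start:end + 1], 0) or 1)
                - log_total
                + route[end + 1][0],
                end,
            )
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words


# Invented frequencies and the matching graph for a four-character sentence.
word_freq = {"小": 5, "小儿": 8, "儿": 3, "惊": 2, "惊风": 6, "风": 4}
dag = {0: [0, 1], 1: [1], 2: [2, 3], 3: [3]}
print(best_path("小儿惊风", dag, word_freq))
```

Because each position reuses the best value already computed for the positions to its right, every edge of the graph is examined once.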

Take as an example a sentence consisting of the word for "woman" followed by "蓐劳". In Chinese medicine terminology, "蓐劳" denotes postpartum symptoms such as fatigue, lassitude, accompanying chills and fever, asthma, a suffocating sensation, cough, and abdominal pain. A traditional Chinese word segmentation method yields { "woman", "蓐", "劳" }; that is, it splits "蓐劳" into the separate characters "蓐" ("straw mat") and "劳". In the present invention, after the input sentence passes through the network model, the predicted labels are "B-age I-age B-sym I-sym"; that is, the model predicts the entities "woman" and "蓐劳". Accordingly, the word segmentation result is { "woman", "蓐劳" }. The method therefore improves the accuracy of word segmentation for Chinese medicine description sentences.

Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the above description of the present invention, and equivalents also fall within the scope of the invention as defined by the appended claims.

Claims

1. A Chinese medicine sentence word segmentation method based on an entity recognition technology is characterized by comprising the following steps:

(E) constructing a recurrent neural network model with a conditional random field, and performing model training based on a labeled data set;

2. The method of claim 1, wherein the forward conditional probability and the reverse conditional probability are calculated according to the following formulas (I) and (II) in the step (B):

P(Y | X) = count(XY) / count(X)    (I)

P(X | Y) = count(XY) / count(Y)    (II)

3. The method as claimed in claim 1, wherein in the step (C), the lexicon W is divided into three types, i.e. body, age-level and symptom.

4. The method for segmenting words in Chinese medicine sentences based on entity recognition technology of claim 1 or 3, wherein in step (C), each element of each sequence in the corpus is labeled as "B-M", "I-M" or "O" form by using BIO label according to the classification result;

5. The method for segmenting words in Chinese medicine sentences based on entity recognition technology of claim 1, wherein in step (E), said recurrent neural network model with a conditional random field comprises:

a bidirectional long short-term memory network layer, used for extracting the features of the input sequence and finally outputting a probability distribution matrix of the label type of each character in the sequence;

and a conditional random field, which determines the most reasonable sequence path in the space of all feasible label sequences according to the probability distribution matrix to obtain the corresponding character labels.

6. The method according to claim 1, wherein in step (I), the maximum probability is calculated for the sentence from right to left in reverse direction according to the directed acyclic graph;