CN114780691B - Model pre-training and natural language processing method, device, equipment and storage medium - Google Patents

Model pre-training and natural language processing method, device, equipment and storage medium

Info

Publication number
CN114780691B
CN114780691B (application number CN202210701343A)
Authority
CN
China
Prior art keywords
training
training text
target
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210701343.5A
Other languages
Chinese (zh)
Other versions
CN114780691A (en)
Inventor
冯韬
胡加学
贺志阳
赵景鹤
肖飞
鹿晓亮
魏思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Medical Technology Co ltd
Original Assignee
Anhui Xunfei Medical Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xunfei Medical Co ltd filed Critical Anhui Xunfei Medical Co ltd
Priority to CN202210701343.5A priority Critical patent/CN114780691B/en
Publication of CN114780691A publication Critical patent/CN114780691A/en
Application granted granted Critical
Publication of CN114780691B publication Critical patent/CN114780691B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/35 Clustering; Classification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a model pre-training and natural language processing method, device, equipment and storage medium. In the pre-training process, a training text and a knowledge graph of the field to which it belongs are obtained; the target entity words matched in the training text and the triples matched with the training text are searched based on the knowledge graph; the matched target entity words in the training text are masked to obtain a masked training text; meanwhile, a target triple is selected, and its head entity word and relation word are spliced with the training text to obtain a spliced training text; a neural network model is then trained with the objectives of predicting the masked target entity words in the masked training text and predicting the tail entity word of the target triple contained in the spliced training text, so as to obtain the pre-training model. In this way, knowledge from the knowledge graph of the field to which the training text belongs is integrated into the model pre-training process, promoting the model's understanding and mastery of knowledge in the relevant field.

Description

Model pre-training and natural language processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for model pre-training and natural language processing.
Background
With the continuous development of computer science, natural language processing technology is applied more and more widely, for example in machine translation, text information extraction and question-answering systems. Natural language processing algorithms can be divided into three stages: fully supervised learning without neural networks, fully supervised learning based on neural networks, and the two-stage pre-training plus fine-tuning scheme.
The pre-training plus fine-tuning scheme divides a natural language processing task into two stages. First, general linguistic knowledge, such as syntactic and lexical characteristics, is learned from large-scale unlabeled corpora. Then, for a specific natural language processing task, the parameters of the large-scale pre-trained model are adjusted with a small amount of task-related labeled corpora, so that the semantic characteristics of the specific task are learned. Because this approach can make full use of relatively cheap unlabeled corpora and can significantly improve the performance of downstream natural language processing tasks, it has become the mainstream method in natural language processing.
Current pre-trained models have not reached their upper limit. The pre-training effect is generally improved only by enlarging the training data set, so the utilization degree of the training data is not high, and a more effective pre-training scheme is urgently needed to further improve the pre-training effect of the model.
Disclosure of Invention
In view of the foregoing, the present application is proposed to provide a model pre-training and natural language processing method, apparatus, device and storage medium, so as to further improve the pre-training effect of the model and the natural language processing effect. The specific scheme is as follows:
in a first aspect, a model pre-training method is provided, including:
acquiring a training text and a knowledge graph of the field to which the training text belongs;
searching a target entity word matched with the knowledge graph in the training text, and masking the target entity word matched in the training text to obtain a masked training text;
searching a triple matched with the training text based on the knowledge graph, wherein the triple comprises a head entity word, a relation word and a tail entity word;
selecting a target triple from the matched triples, and splicing the head entity word and the relation word in the selected target triple with the training text to obtain a spliced training text;
and training a neural network model with the objectives of predicting the target entity words masked in the masked training text and predicting the tail entity words of the target triples contained in the spliced training text, until a set training end condition is reached, so as to obtain a pre-training model.
In a second aspect, a model pre-training apparatus is provided, including:
the data acquisition unit is used for acquiring a training text and a knowledge graph of the field to which the training text belongs;
the target entity word searching unit is used for searching the target entity words matched with the knowledge graph in the training text;
the entity word mask unit is used for masking the target entity words matched in the training text to obtain a masked training text;
the triple searching unit is used for searching the triples matched with the training texts based on the knowledge graph, wherein the triples comprise head entity words, relation words and tail entity words;
the training text splicing unit is used for selecting a target triple from the matched triples, splicing the head entity word and the relation word in the selected target triple with the training text to obtain a spliced training text;
and the parameter updating unit is used for training a neural network model with the objectives of predicting the target entity words masked in the masked training text and predicting the tail entity words of the target triples contained in the spliced training text, until a set training end condition is reached, so as to obtain a pre-training model.
In a third aspect, a model pre-training apparatus is provided, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the model pre-training method.
In a fourth aspect, a storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the model pre-training method as described above.
By means of the above technical scheme, in the process of pre-training a model, a training text and a knowledge graph of the field to which it belongs are obtained; the target entity words matched in the training text and the triples matched with the training text are searched based on the knowledge graph, each triple comprising a head entity word, a relation word and a tail entity word; the matched target entity words in the training text are masked to obtain a masked training text; meanwhile, a target triple is selected, and its head entity word and relation word are spliced with the training text to obtain a spliced training text; a neural network model is then trained with the objectives of predicting the masked target entity words in the masked training text and predicting the tail entity word of the target triple contained in the spliced training text, so as to obtain the pre-training model. In this way, knowledge from the knowledge graph of the field to which the training text belongs is integrated into the model pre-training process, promoting the model's understanding and mastery of knowledge in the relevant field.
Meanwhile, when the training text is masked, the target entity words are masked preferentially, so that the model predicts the masked target entity words; this integration of implicit knowledge can improve the model's ability to learn domain knowledge. Further, the present application also adds the fusion of triple knowledge: the head entity word and relation word of a target triple matched with the training text are spliced with the training text and then input into the model, which clearly tells the model that the triple exists in the training text, and the model is made to predict the tail entity word of the target triple. Combining the fusion of explicit knowledge with the fusion of implicit knowledge allows different semantic information and different types of features to be obtained and cross-fused, which promotes the model's understanding of semantics and knowledge, further improves the model's ability to learn domain knowledge, and greatly improves the effect of the pre-training model.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart of a model pre-training method according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic view of a knowledge-graph of a medical field;
FIG. 3 illustrates a schematic diagram of a pre-trained model structure;
FIG. 4 illustrates a mask character prediction process diagram;
FIG. 5 is a schematic structural diagram of a model pre-training apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of model pre-training equipment provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The application provides a model pre-training scheme which can be applied to a model pre-training stage in various fields. The trained model can be further subjected to secondary training by combining with the labeled data under the specific task, and model parameters are adjusted to obtain a specific natural language processing task model so as to process the specific natural language processing task. There are various fields to which the present disclosure may be applied, examples being: medical fields, judicial fields, and the like. Specifically, natural language processing tasks may also be varied, examples being: machine translation, question answering, dialog, text classification, sentiment analysis, and the like.
The scheme can be realized based on a terminal with data processing capacity, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
Next, as described in conjunction with fig. 1, the model pre-training method of the present application may include the following steps:
step S100, obtaining a training text and a knowledge graph of the field to which the training text belongs.
Specifically, after determining a domain to which the model to be trained is applicable, a text of the domain may be acquired as the training text. It may not be required for the training text to carry annotations. The training text may contain knowledge information in the corresponding domain. Meanwhile, a knowledge graph in the field is also obtained in the step. The knowledge graph is composed of nodes and edges, the nodes represent entity words representing knowledge in the field, such as field professional terms, proper nouns and the like, the edges represent relations among the entity words, two entity words and relation words corresponding to one edge form a triple, namely the triple comprises a head entity word, a relation word and a tail entity word.
Taking the medical field as an example:
the training texts may include medical records, examination reports, medical texts, medical guidelines, etc., which contain a large number of medical terms and medical common sense.
The medical knowledge graph may be as shown in FIG. 2: the nodes represent medical terms and proper nouns, the edges represent the relationships among entity words, and two entity words together with a relation word form a triple, such as (irbesartan, common adverse drug reaction, hepatitis), which expresses that a common adverse reaction of the drug irbesartan is hepatitis.
Step S110, searching a target entity word matched with the knowledge graph in the training text, and masking the target entity word matched in the training text to obtain a masked training text.
Specifically, the entity words included in the knowledge graph belong to the domain knowledge, and in order to train the model to learn the meaning of the domain knowledge, the entity words representing the knowledge in the training text may be masked. Therefore, firstly, a target entity word matched with the knowledge graph in the training text needs to be searched, namely, whether each entity word in the knowledge graph exists in the training text is judged, if yes, the existing position is found, the target entity word in the position is replaced by a mask [ mask ], and the training text after the mask is obtained.
Referring to table 1 below, it illustrates the process of target entity word matching and masking in the medical field:
[Table 1: target entity word matching and masking example, shown as an image in the original]
For the input medical training text "The common cold is also called cold, and is manifested as symptoms such as runny nose and dry throat", the input knowledge graph comprises two triples: (cold, clinical manifestations, dry throat) and (cold, clinical manifestations, cough).
First, the entity words of the knowledge graph are searched for in the medical training text: the start and end positions of "cold" are indexed as 2 and 3, and the start and end positions of "dry throat" as 18 and 19. The target entity words at the corresponding positions in the medical training text are then replaced with [mask], giving the finally output masked medical training text: "The common [mask][mask] is also called cold, and is manifested as symptoms such as runny nose and [mask][mask]".
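As an aid to understanding, the matching and masking procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under the assumptions that the knowledge graph is given as a list of (head, relation, tail) triples and that masking is done character by character as in the example; all function and variable names are made up for the illustration and are not taken from the patent.

def find_target_entity_spans(text, knowledge_triples):
    # Collect every entity word (head or tail) appearing in the knowledge graph.
    entity_words = set()
    for head, relation, tail in knowledge_triples:
        entity_words.update((head, tail))
    # Locate every occurrence of each entity word in the training text.
    spans = []
    for word in entity_words:
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word)))
            start = text.find(word, start + len(word))
    return sorted(spans)

def mask_target_entities(text, spans, mask_token="[mask]"):
    # Replace every character of each matched entity word with the mask token.
    tokens = list(text)
    for start, end in spans:
        for i in range(start, end):
            tokens[i] = mask_token
    return tokens  # one token per original character

triples = [("cold", "clinical manifestations", "dry throat"),
           ("cold", "clinical manifestations", "cough")]
text = "The common cold is also called cold, manifested as runny nose and dry throat."
masked_tokens = mask_target_entities(text, find_target_entity_spans(text, triples))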
And S120, searching the triple matched with the training text based on the knowledge graph.
Specifically, the knowledge graph contains a plurality of triples, and in order to further enable the model to learn the triples, a concatenation mask of the triples may be performed on the training text. First, the triples matching the training text with the knowledge-graph need to be found.
Referring to table 2 below, which illustrates the process of triplet matching in the medical field:
[Table 2: triple matching example, shown as an image in the original]
if the head entity word and the tail entity word of the triple in the knowledge graph appear in the training text at the same time, the triple can be considered to be matched with the training text.
It should be noted that the execution order of step S110 and step S120 does not have to be a sequential order, and the two steps may be executed simultaneously without being sequential, and fig. 1 only illustrates an optional execution order.
Step S130, selecting a target triple from the matched triples, and splicing the head entity word and the relation word in the selected target triple with the training text to obtain a spliced training text.
Specifically, there may be more than one triple matched with the training text, and at this time, a target triple may be selected from the matched triples, and then the head entity word and the relation word in the target triple are spliced with the training text to obtain the spliced training text.
The spliced training text is used to train the neural network model; that is, the model is told that the training text contains the triple and is required to predict the tail entity word of the triple. Through this "explicit" knowledge fusion, the model's understanding of semantics and knowledge can be promoted, and its ability to learn domain knowledge is further improved.
For the case illustrated in table 2 above:
Assuming that the triple (common cold, alternative name, cold) is randomly selected as the target triple, "common cold" and "alternative name" can be spliced in front of the training text; at the same time, for the purpose of distinction, a set separator [SEP] can be used to separate the relation word from the training text. The obtained spliced training text is:
"common cold alternative name [SEP] The common cold is also called cold, and is manifested as symptoms such as runny nose and dry throat".
And step S140, training the neural network model by using the training texts after the masks and the spliced training texts.
Specifically, the masked training text and the spliced training text may be input into a neural network model, so as to predict target entity words masked in the masked training text and predict tail entity words in the target triples included in the spliced training text as targets, and the neural network model is trained until a set training end condition is reached, so as to obtain a pre-training model.
The neural network model may adopt any of a variety of language models, that is, models that estimate the probability of a sentence, so that the model learns the probability distribution of characters or words from massive data. Representative language model structures include the Transformer Block, BERT, and the like.
Referring to fig. 3, a pre-training model structure is illustrated, which shows a process of pre-training in the medical field based on a Transformer Block model.
For the training text "The common cold is also called cold, and is manifested as symptoms such as runny nose and dry throat", knowledge matching and searching are performed first. The process can be divided into two links, knowledge masking and triple masking. Knowledge masking refers to masking the matched target entity words in the training text, which gives the masked training text "The common [mask][mask] is also called cold, and is manifested as symptoms such as runny nose and [mask][mask]". Triple masking refers to selecting a target triple, such as (common cold, alternative name, cold), from the triples matched with the training text and splicing its head entity word and relation word with the training text, which gives the spliced training text "common cold alternative name [SEP] The common cold is also called cold, and is manifested as symptoms such as runny nose and dry throat".
The masked training text and the spliced training text are input into the Transformer Block to extract hidden-layer features; the masked target entity words are then predicted by a knowledge prediction module, and the tail entity words of the target triples are predicted by a triple prediction module.
In the embodiment of the application, in the process of pre-training a model, a training text and a knowledge graph of the field to which it belongs are obtained; the target entity words matched in the training text and the triples matched with the training text are searched based on the knowledge graph; the matched target entity words in the training text are masked to obtain a masked training text; meanwhile, a target triple is selected, and its head entity word and relation word are spliced with the training text to obtain a spliced training text; and a neural network model is trained with the objectives of predicting the masked target entity words in the masked training text and predicting the tail entity word of the target triple contained in the spliced training text, so as to obtain the pre-training model. In this way, knowledge from the knowledge graph of the field to which the training text belongs is integrated into the model pre-training process, the model's understanding and mastery of knowledge in the relevant field are promoted, and the interpretability of the model is enhanced.
Meanwhile, when the training text is masked, the target entity words are masked preferentially, so that the model predicts the masked target entity words; this integration of implicit knowledge can improve the model's ability to learn domain knowledge. Furthermore, the present application adds the integration of triple knowledge: the head entity word and relation word of a target triple matched with the training text are spliced with the training text and input into the model, which definitely tells the model that the triple exists in the training text, and the model predicts the tail entity word of the target triple. Combining the integration of explicit knowledge with the integration of implicit knowledge allows different semantic information and different types of features to be obtained and cross-fused, which promotes the model's understanding of semantics and knowledge, further improves the model's ability to learn domain knowledge, and greatly improves the effect of the pre-training model.
In some embodiments of the present application, a process of searching for a target entity word matched with the knowledge graph in the training text in step S110, and masking the target entity word matched with the knowledge graph in the training text to obtain a masked training text is described.
Specifically, the entity words in the knowledge graph may be acquired, and further, words that are the same as the entity words in the knowledge graph are searched for in a training text as target entity words.
When masking the target entity words in the training text, there are several alternative masking ways:
firstly, each target entity word in the training text is replaced by a set mask character respectively to obtain a masked training text.
Namely, each target entity word in the training text is subjected to mask processing. Meanwhile, the mask may not be performed on the non-target entity words, or a part of the non-target entity words may be selected and subjected to the mask processing at the same time.
Secondly, the training text is randomly masked in a way that is biased toward masking the target entity words, to obtain the masked training text.
Specifically, random masking biased toward the target entity words means that the target entity words are masked with a first probability and non-target entity words are masked with a second probability, where the first probability is greater than the second probability, so that the model strengthens its learning of knowledge and improves its understanding of knowledge.
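A biased random masking of this kind can be sketched as below; the two probability values are placeholders chosen for illustration, since the patent only requires the first probability to be greater than the second.

import random

def biased_random_mask(chars, target_positions, p_target=0.5, p_other=0.1,
                       mask_token="[mask]"):
    # Characters belonging to target entity words are masked with the (larger)
    # first probability, all other characters with the (smaller) second one.
    out = []
    for i, ch in enumerate(chars):
        p = p_target if i in target_positions else p_other
        out.append(mask_token if random.random() < p else ch)
    return out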
In some embodiments of the present application, the process of finding the triplet matched with the training text based on the knowledge-graph in the above step S120 is described.
Alternatively, if step S110 is executed before step S120 is executed, that is, each target entity word matched by the training text has been obtained, the process of matching the triples may include:
s1, acquiring a triple set contained in the knowledge graph.
And S2, the target entity words matched between the training text and the knowledge graph are combined pairwise; for each combined target entity word pair, it is judged whether the pair exists in a triple of the triple set, and if so, the triple in which the target entity word pair exists is taken as a triple matched with the training text.
In another optional case, the process of matching the triples in step S120 may include:
s1, acquiring a triple set contained in the knowledge graph.
S2, for each triple in the triple set:
and judging whether the head entity word and the tail entity word in the triple exist in the training text at the same time, if so, taking the triple as the triple matched with the training text.
After the triple matching is performed in step S120, the triple matched with the training text can be obtained. The number of matching triplets may be one or more.
And selecting a target triple from the matched triples, sequentially splicing the head entity word and the relation word in the selected target triple in front of the training text, and separating the relation word and the training text by using a set separator to obtain the spliced training text.
In some embodiments of the present application, the process of training the neural network model by using the training texts after the mask and the training texts after the concatenation in the step S140 is described, which specifically includes the following steps:
s1, inputting the masked training text and the spliced training text into a neural network model.
S2, predicting original characters corresponding to mask characters in the training text after the mask by using a neural network model, and determining a first loss function based on a model prediction result.
In order to predict mask characters more accurately, the neural network model may predict original characters corresponding to mask words based on feature vectors of the mask characters and feature vectors of unmasked characters nearest to the mask characters.
Specifically, a neural network model may be used to determine a feature vector of each character in the training text after masking, and predict an original character corresponding to a masked character based on the feature vector of the masked character and the feature vectors of the unmasked characters in the nearest neighbors before and after the masked character.
Referring to FIG. 4:
For the input masked training text "The common [mask][mask] is also called cold, and is manifested as runny nose", the neural network model determines the feature vector of each character, namely the vectors h_1 to h_17.
When the character behind a [mask] is predicted, the feature vector of the [mask] character and the feature vectors corresponding to the boundary of the masked part are referred to at the same time. Taking the prediction of the first [mask] as an example:
The feature vectors of the unmasked characters nearest before and after the first [mask] are h_2 and h_5 respectively, and the feature vector corresponding to the first [mask] itself is h_3. The three feature vectors are merged, and the merged vector is denoted
h_merge = [h_2 ; h_3 ; h_5]
The neural network model predicts the original character corresponding to the first [mask] from the merged vector:
y = softmax( gelu(h_merge) · W )
where y represents the original character corresponding to the first [mask].
Assuming that the feature vector of each character encoded by the neural network model is n-dimensional, the dimension of the merged vector h_merge is 3n. Through the activation function gelu and the mapping matrix W, h_merge is mapped to the vocabulary space; if the length of the vocabulary is m, the dimension of W is (3n, m). Finally, the original character y corresponding to the masked character is obtained through the softmax function.
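Reading the formulas above as a small prediction head, a sketch in PyTorch might look as follows. The module below is only an assumed rendering of the described computation (concatenation of the three feature vectors, gelu activation, a (3n, m) mapping to the vocabulary and a softmax); it is not the patent's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskCharPredictor(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        # W maps the 3n-dimensional merged vector to the vocabulary space of size m.
        self.W = nn.Linear(3 * hidden_size, vocab_size, bias=False)

    def forward(self, h_prev, h_mask, h_next):
        # Merge the masked character's vector with its nearest unmasked neighbours.
        h_merge = torch.cat([h_prev, h_mask, h_next], dim=-1)
        return F.softmax(self.W(F.gelu(h_merge)), dim=-1)  # distribution over characters

# e.g. predicting the first [mask] of the example above from h_2, h_3 and h_5:
# probs = MaskCharPredictor(hidden_size=n, vocab_size=m)(h_2, h_3, h_5)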
And S3, predicting tail entity words in the target triples contained in the spliced training texts by using the neural network model, and determining a second loss function based on a model prediction result.
Specifically, in the foregoing embodiment, when the spliced training text is determined, its labels may be determined at the same time. The labels may include positive example labels and negative example labels: the positive example label is the tail entity word of the target triple, and the negative example labels are the remaining entity words, among the target entity words matched by the training text, other than the head entity word and the tail entity word of the target triple.
Following the foregoing example, the target triple matched with the training text is (common cold, alternative name, cold), and its head entity word and relation word are spliced with the training text to obtain the spliced training text "common cold alternative name [SEP] The common cold is also called cold, and is manifested as symptoms such as runny nose and dry throat". It can be understood that the positive example label of this spliced training text is "cold", and the negative example labels may include "dry throat" and "runny nose".
Based on the positive example label and the negative example label of the spliced training text, the neural network model can be trained in a comparative learning mode, so that the learning capability of the neural network model is improved.
Based on the positive example label and the negative example label of the spliced training text, the process of training the neural network model in a comparative learning mode may include:
and S31, determining a feature vector of each character in the spliced training text by using a neural network model, and determining respective feature vectors of the positive case label and the negative case label based on the feature vectors of the characters.
The above positive and negative labels are used as examples to explain:
The feature vector of the head entity word "common cold" (感冒) in the target triple, the feature vector of the positive example label "cold" (伤风), and the feature vectors of the negative example labels "dry throat" (咽干) and "runny nose" (流鼻涕) are each obtained by averaging the feature vectors of their constituent characters:
h_感冒 = meanpool(h_感, h_冒)
h_伤风 = meanpool(h_伤, h_风)
h_流鼻涕 = meanpool(h_流, h_鼻, h_涕)
h_咽干 = meanpool(h_咽, h_干)
where meanpool() denotes the element-wise average of the input vectors.
And S32, calculating scores of the positive example label and the negative example label based on the feature vectors of the positive example label and the negative example label.
Specifically, when calculating the scores of the positive example label and the negative example label, the score may be determined according to the similarity between each label and the feature vector between the head entities in the target triplet, and the higher the similarity is, the higher the score of the corresponding label is.
The above example is still used to illustrate:
p_i = exp( sim(h_感冒, h_i) ) / Σ_j exp( sim(h_感冒, h_j) )
where p_i denotes the score of the i-th label; the above example contains 1 positive example label and 2 negative example labels, 3 labels in total, so i takes values in [1, 3]; h_i denotes the feature vector of the i-th label, j denotes the j-th label, and h_j denotes the feature vector of the j-th label; h_感冒 is the feature vector of the head entity word, and sim(·, ·) is the similarity between two feature vectors.
And S33, determining a second loss function based on the scores of the positive example label and the negative example label.
It is understood that the model training expects the positive example label to score as high as possible and the negative example label to score as low as possible, so the second loss function can be calculated based on the respective scores of the positive example label and the negative example label.
And S4, determining a total loss function based on the first loss function and the second loss function, and updating parameters of the neural network model based on the total loss function.
In this embodiment, two loss functions, namely a first loss function and a second loss function, are provided during training of the neural network model. A total loss function may be determined based on the two, and the parameters of the neural network model are updated according to the total loss function. Both the first loss function and the second loss function may adopt cross-entropy loss.
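Putting the pieces of S31 to S33 together, the second loss can be sketched as below; the dot product is assumed as the similarity measure, and the positive example label is assumed to be placed first in the label list. This is an illustrative sketch, not the patent's exact formulation.

import torch
import torch.nn.functional as F

def label_vector(char_vectors):
    # Mean-pool the feature vectors of the characters that make up the label.
    return torch.stack(char_vectors).mean(dim=0)

def second_loss(h_head, label_vectors, positive_index=0):
    # Score each label by a softmax over its similarity to the head entity vector,
    # then take the negative log score of the positive example label.
    sims = torch.stack([torch.dot(h_head, h) for h in label_vectors])
    scores = F.softmax(sims, dim=0)
    return -torch.log(scores[positive_index])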
The model pre-training method provided by the embodiment of the application can be suitable for various fields, the corresponding training texts and the corresponding knowledge graph are data in the corresponding fields, and taking the medical field as an example, the training texts are medical texts, and the corresponding knowledge graph is a medical knowledge graph.
Of course, the method can be applied to other fields besides the medical field, such as the judicial field, the agricultural field and the like.
Based on the model pre-training method introduced in the foregoing embodiment, the embodiment of the present application further provides a natural language processing method, and the present application may perform secondary training on the basis of the pre-training model obtained by the model pre-training method of the foregoing embodiment, so as to obtain a natural language processing task model after the secondary training. And further inputting the task data to be subjected to natural language processing into the natural language processing task model to obtain a natural language processing result output by the model.
Where the natural language processing tasks may be of various types such as machine translation, question answering, dialogue, text classification, and so forth. When the pre-training model is trained for the second time, the pre-training model can be adjusted by adopting the labeled data under the corresponding task according to the difference of specific tasks.
In an example, the pre-training model may be applied to a medical text structured processing task, and then the pre-training model may be trained for the second time based on the labeled data of the task, so as to obtain a trained medical text structured processing model. And further, the text data to be structurally processed can be processed by utilizing the medical text structural processing model to obtain a processing result.
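As a rough sketch of this secondary training, assuming a classification-style task head and standard gradient updates (the patent does not prescribe a specific fine-tuning procedure, so every name below is an assumption):

import torch
import torch.nn as nn

class TaskModel(nn.Module):
    def __init__(self, pretrained_encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = pretrained_encoder        # the pre-trained model obtained above
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, inputs):
        features = self.encoder(inputs)          # pooled features are assumed here
        return self.head(features)

def fine_tune(model, labelled_batches, epochs=3, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in labelled_batches:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()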
The model pre-training device provided in the embodiments of the present application is described below, and the model pre-training device described below and the model pre-training method described above may be referred to in a corresponding manner.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a model pre-training apparatus disclosed in the embodiment of the present application.
As shown in fig. 5, the apparatus may include:
the data acquisition unit 11 is used for acquiring a training text and a knowledge graph of the field to which the training text belongs;
a target entity word searching unit 12, configured to search for a target entity word in the training text, where the target entity word is matched with the knowledge graph;
an entity word mask unit 13, configured to mask target entity words matched in the training text to obtain a masked training text;
the triple searching unit 14 is configured to search a triple matched with the training text based on the knowledge graph, where the triple includes a head entity word, a relation word, and a tail entity word;
the training text splicing unit 15 is configured to select a target triple from the matched triples, splice the head entity word and the relation word in the selected target triple with the training text, and obtain a spliced training text;
and the parameter updating unit 16 is configured to train a neural network model with the target entity words predicted to be masked in the masked training text and the tail entity words predicted in the target triples included in the spliced training text as targets until a set training end condition is reached, so as to obtain a pre-training model.
Optionally, the process of searching for the target entity word matched with the knowledge graph in the training text by the target entity word searching unit may include:
acquiring entity words in the knowledge graph;
and searching the words which are the same as the entity words in the knowledge graph in the training text to serve as target entity words.
Optionally, the process of masking the target entity words matched in the training text by the entity word masking unit to obtain the masked training text may include:
replacing each target entity word in the training text with a set mask character, to obtain a masked training text;
or,
randomly masking the training text in a way that is biased toward masking the target entity words, to obtain a masked training text.
Optionally, the embodiment of the present application introduces two processing logics of the triple lookup unit, which are respectively as follows:
first, the process of finding the triplet matched with the training text by the triplet finding unit based on the knowledge graph may include:
acquiring a triple set contained in the knowledge graph;
and combining every two target entity words matched with the knowledge graph and the training text, judging whether the target entity word pair exists in a triple in the triple set or not for each target entity word pair after combination, and if so, taking the triple in which the target entity word pair exists as the triple matched with the training text.
Secondly, the process of finding the triplet matched with the training text by the triplet finding unit based on the knowledge graph may include:
acquiring a triple set contained in the knowledge graph;
for each triplet in the set of triplets:
and judging whether the head entity word and the tail entity word in the triple exist in the training text at the same time, if so, taking the triple as the triple matched with the training text.
Optionally, the process of splicing the head entity word and the relation word in the selected target triple with the training text by the training text splicing unit to obtain the spliced training text may include:
and splicing the head entity words and the relation words in the selected target triples in sequence in front of the training text, and separating the relation words and the training text by using set separators to obtain the spliced training text.
Optionally, the process of training the neural network model by the parameter updating unit with a target entity word predicted to be masked in the training text after the masking and a tail entity word predicted in the target triple included in the training text after the concatenation as targets may include:
inputting the masked training text and the spliced training text into a neural network model;
predicting original characters corresponding to mask characters in the training text after the mask by using the neural network model, and determining a first loss function based on a model prediction result;
predicting tail entity words in the target triples contained in the spliced training texts by using the neural network model, and determining a second loss function based on a model prediction result;
a total loss function is determined based on the first loss function and the second loss function, and parameters of a neural network model are updated based on the total loss function.
Optionally, the process of predicting, by the parameter updating unit, the original character corresponding to the mask character in the masked training text by using the neural network model may include:
determining a feature vector of each character in the training text after masking by using a neural network model, and predicting an original character corresponding to the mask character based on the feature vector of the mask character and the feature vectors of the unmasked characters which are nearest to the mask character before and after the mask character.
Optionally, the labels of the spliced training texts may include positive example labels and negative example labels, the positive example labels are tail entity words in the target triples, and the negative example labels are each remaining entity word in each target entity word matched with the training texts, except for a head entity word and a tail entity word in the target triples. On this basis, the process that the parameter updating unit predicts the tail entity words in the target triples included in the spliced training text by using the neural network model and determines the second loss function based on the model prediction result may include:
determining a feature vector of each character in the spliced training text by using a neural network model, and determining respective feature vectors of the positive example label and the negative example label based on the feature vector of each character;
calculating respective scores of the positive example label and the negative example label based on the respective feature vectors of the positive example label and the negative example label;
a second loss function is determined based on the respective scores of the positive and negative examples labels.
Alternatively, the training text may be medical text and the knowledge-graph may be a medical knowledge-graph.
In some embodiments of the present application, there is further provided a natural language processing apparatus, which may include:
the task data acquisition unit is used for acquiring task data to be subjected to natural language processing;
the task data processing unit is used for inputting the task data into a configured natural language processing task model to obtain a natural language processing result output by the model; the natural language processing task model is obtained by performing secondary training on the basis of the pre-training model obtained by the model pre-training method in the embodiment.
The model pre-training device provided by the embodiment of the application can be applied to model pre-training equipment, such as a terminal: mobile phones, computers, etc. Optionally, fig. 6 shows a block diagram of a hardware structure of the model pre-training device, and referring to fig. 6, the hardware structure of the model pre-training device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit CPU, or an Application Specific Integrated Circuit ASIC (Application Specific Integrated Circuit), or one or more Integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a training text and a knowledge graph of the field to which the training text belongs;
searching a target entity word matched with the knowledge graph in the training text, and masking the target entity word matched in the training text to obtain a masked training text;
searching the triples matched with the training texts based on the knowledge graph;
selecting a target triple from the matched triples, and splicing the head entity word and the relation word in the selected target triple with the training text to obtain a spliced training text;
and training a neural network model with the objectives of predicting the target entity words masked in the masked training text and predicting the tail entity words of the target triples contained in the spliced training text, until a set training end condition is reached, so as to obtain a pre-training model.
Alternatively, the detailed function and the extended function of the program may be as described above.
An embodiment of the present application further provides a storage medium, where the storage medium may store a program adapted to be executed by a processor, where the program is configured to:
acquiring a training text and a knowledge graph of the field to which the training text belongs;
searching a target entity word matched with the knowledge graph in the training text, and masking the target entity word matched in the training text to obtain a masked training text;
searching the triples matched with the training texts based on the knowledge graph;
selecting a target triple from the matched triples, and splicing the head entity word and the relation word in the selected target triple with the training text to obtain a spliced training text;
and training a neural network model with the objectives of predicting the target entity words masked in the masked training text and predicting the tail entity words of the target triples contained in the spliced training text, until a set training end condition is reached, so as to obtain a pre-training model.
Alternatively, the detailed function and the extended function of the program may refer to the above description.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method of model pre-training, comprising:
acquiring a training text and a knowledge graph of the field to which the training text belongs;
searching a target entity word matched with the knowledge graph in the training text, and masking the target entity word matched with the knowledge graph in the training text to obtain a masked training text;
searching a triple matched with the training text based on the knowledge graph, wherein the triple comprises a head entity word, a relation word and a tail entity word;
selecting a target triple from the matched triples, and only splicing the head entity word and the relation word in the selected target triple with the training text to obtain a spliced training text; determining positive example labels and negative example labels of the spliced training texts at the same time, wherein the positive example labels are tail entity words in the target triples; the negative example labels are each remaining entity word except a head entity word and a tail entity word in the target triple in each target entity word matched with the training text;
and training a neural network model in a comparison learning mode on the basis of positive example labels and negative example labels of the spliced training texts by taking the target entity words which are predicted to be masked in the training texts after masking and the tail entity words in the target triples contained in the spliced training texts as targets until a set training end condition is reached, so as to obtain a pre-training model.
2. The method of claim 1, wherein the searching for the target entity word in the training text that matches the knowledge-graph comprises:
acquiring entity words in the knowledge graph;
and searching the words which are the same as the entity words in the knowledge graph in the training text to serve as target entity words.
3. The method of claim 1, wherein the masking the target entity words matched in the training text to obtain a masked training text comprises:
replacing each target entity word in the training text with a set mask character respectively to obtain a mask training text;
or the like, or, alternatively,
and carrying out random mask on the training text in a mode of carrying out mask on the target entity words in a biased mode to obtain a masked training text.
4. The method of claim 1, wherein the finding the triples matched by the training text based on the knowledge-graph comprises:
acquiring a triple set contained in the knowledge graph;
and combining every two target entity words matched with the knowledge graph and the training text, judging whether the target entity word pair exists in a triple in the triple set or not for each target entity word pair after combination, and if so, taking the triple in which the target entity word pair exists as the triple matched with the training text.
5. The method of claim 1, wherein the finding the triples matched by the training text based on the knowledge-graph comprises:
for each triplet in the set of triplets:
and judging whether the head entity word and the tail entity word in the triple exist in the training text at the same time, if so, taking the triple as the triple matched with the training text.
6. The method according to claim 1, wherein the concatenating the selected head entity word and relation word in the target triplet with the training text to obtain a concatenated training text comprises:
and splicing the head entity words and the relation words in the selected target triples in sequence in front of the training text, and separating the relation words and the training text by using set separators to obtain the spliced training text.
7. The method according to any one of claims 1 to 6, wherein training a neural network model with a target of predicting the target entity words masked in the masked training text and predicting the tail entity words in the target triples contained in the stitched training text as targets comprises:
inputting the masked training text and the spliced training text into a neural network model;
predicting original characters corresponding to mask characters in the training text after the mask by using the neural network model, and determining a first loss function based on a model prediction result;
predicting tail entity words in the target triples contained in the spliced training texts by using the neural network model, and determining a second loss function based on a model prediction result;
a total loss function is determined based on the first loss function and the second loss function, and parameters of a neural network model are updated based on the total loss function.
8. The method of claim 7, wherein predicting, using the neural network model, original characters corresponding to masked characters in the masked training text comprises:
determining a feature vector of each character in the training text after the mask by using a neural network model, and predicting an original character corresponding to the mask character based on the feature vector of the mask character and the feature vectors of the unmasked characters which are nearest to the front and the back of the mask character.
9. The method according to claim 7, wherein the labels of the spliced training text comprise a positive example label and negative example labels, the positive example label is the tail entity word in the target triple, and the negative example labels are the remaining target entity words matched with the training text, excluding the head entity word and the tail entity word of the target triple;
wherein the predicting, by using the neural network model, the tail entity words in the target triples contained in the spliced training text and determining a second loss function based on the model prediction result comprises:
determining a feature vector of each character in the spliced training text by using the neural network model, and determining the feature vectors of the positive example label and of each negative example label based on the feature vectors of the characters;
calculating a score for the positive example label and for each negative example label based on their respective feature vectors;
determining the second loss function based on the scores of the positive example label and the negative example labels.
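A contrastive-style sketch of this scoring; mean pooling over a label's character vectors, the scalar scoring head, and the cross-entropy over scores are illustrative assumptions, not the claimed design.

```python
import torch
import torch.nn.functional as F

def contrastive_tail_loss(char_vectors, offsets_positive, offsets_negative, score_head):
    """Score the positive label (tail entity word) against the negative labels and
    compute a loss that ranks the positive label first."""
    def label_vector(offsets):
        # pool the character vectors at the label's character offsets
        return char_vectors[offsets].mean(dim=0)

    pos_vec = label_vector(offsets_positive)
    neg_vecs = [label_vector(o) for o in offsets_negative]
    scores = torch.stack([score_head(v).squeeze(-1) for v in [pos_vec] + neg_vecs])
    target = torch.tensor(0)                   # index 0 is the positive example label
    return F.cross_entropy(scores.unsqueeze(0), target.unsqueeze(0))
```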
10. The method of claim 1, wherein the training text is medical text and the knowledge graph is a medical knowledge graph.
11. A natural language processing method, comprising:
acquiring task data to be subjected to natural language processing;
inputting the task data into a configured natural language processing task model to obtain a natural language processing result output by the model;
wherein the natural language processing task model is obtained by performing secondary training on the basis of a pre-training model obtained by the model pre-training method according to any one of claims 1 to 10.
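A minimal usage sketch, assuming the secondary training has already produced a task model and a tokenizer; both arguments and the calling convention are placeholders, not APIs defined here.

```python
import torch

def run_nlp_task(task_model, tokenizer, task_text: str):
    """Feed the task data to the configured natural language processing task model
    and return its output."""
    input_ids = torch.tensor([tokenizer(task_text)])
    with torch.no_grad():
        return task_model(input_ids)
```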
12. A model pre-training apparatus, comprising:
the data acquisition unit is used for acquiring a training text and a knowledge graph of the field to which the training text belongs;
the target entity word searching unit is used for searching the target entity words matched with the knowledge graph in the training text;
the entity word mask unit is used for masking the target entity words matched in the training text to obtain a masked training text;
the triple searching unit is used for searching the triples matched with the training texts based on the knowledge graph, wherein the triples comprise head entity words, relation words and tail entity words;
the training text splicing unit is used for selecting a target triple from the matched triples and splicing only the head entity word and the relation word in the selected target triple with the training text to obtain a spliced training text, and for determining a positive example label and negative example labels of the spliced training text, wherein the positive example label is the tail entity word in the target triple, and the negative example labels are the remaining target entity words matched with the training text, excluding the head entity word and the tail entity word of the target triple;
and the parameter updating unit is used for training a neural network model in a contrastive learning manner, on the basis of the positive example label and the negative example labels of the spliced training text, with the targets of predicting the target entity word masked in the masked training text and predicting the tail entity word in the target triple contained in the spliced training text, until a set training end condition is reached, so as to obtain a pre-training model.
13. A model pre-training apparatus, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to implement the steps of the model pre-training method according to any one of claims 1 to 10.
14. A storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the model pre-training method according to any one of claims 1 to 10.
CN202210701343.5A 2022-06-21 2022-06-21 Model pre-training and natural language processing method, device, equipment and storage medium Active CN114780691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210701343.5A CN114780691B (en) 2022-06-21 2022-06-21 Model pre-training and natural language processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210701343.5A CN114780691B (en) 2022-06-21 2022-06-21 Model pre-training and natural language processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114780691A CN114780691A (en) 2022-07-22
CN114780691B true CN114780691B (en) 2022-12-02

Family

ID=82421352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210701343.5A Active CN114780691B (en) 2022-06-21 2022-06-21 Model pre-training and natural language processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114780691B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114995903B (en) * 2022-05-30 2023-06-27 中电金信软件有限公司 Class label identification method and device based on pre-training language model
CN115312127B (en) * 2022-08-05 2023-04-18 抖音视界有限公司 Pre-training method of recognition model, recognition method, device, medium and equipment
CN116401335A (en) * 2023-03-15 2023-07-07 北京擎盾信息科技有限公司 Quantitative retrieval method and device for legal documents, storage medium and electronic device
CN117251555B (en) * 2023-11-17 2024-04-16 深圳须弥云图空间科技有限公司 Language generation model training method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941716B (en) * 2019-11-05 2023-07-18 北京航空航天大学 Automatic construction method of information security knowledge graph based on deep learning
CN111192692B (en) * 2020-01-02 2023-12-08 上海联影智能医疗科技有限公司 Entity relationship determination method and device, electronic equipment and storage medium
CN112632996A (en) * 2020-12-08 2021-04-09 浙江大学 Entity relation triple extraction method based on comparative learning
CN114330281B (en) * 2022-03-08 2022-06-07 北京京东方技术开发有限公司 Training method of natural language processing model, text processing method and device

Also Published As

Publication number Publication date
CN114780691A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN114780691B (en) Model pre-training and natural language processing method, device, equipment and storage medium
CN106095932B (en) Encyclopedic knowledge question recognition method and device
CN107526834B (en) Word2vec improvement method for training correlation factors of united parts of speech and word order
US9195646B2 (en) Training data generation apparatus, characteristic expression extraction system, training data generation method, and computer-readable storage medium
Daumé III et al. A large-scale exploration of effective global features for a joint entity detection and tracking model
CN110532328B (en) Text concept graph construction method
US11170169B2 (en) System and method for language-independent contextual embedding
Ismailov et al. A comparative study of stemming algorithms for use with the Uzbek language
US11003950B2 (en) System and method to identify entity of data
CN111046179A (en) Text classification method for open network question in specific field
JP6729095B2 (en) Information processing device and program
CN112131876A (en) Method and system for determining standard problem based on similarity
Wang et al. Semi-supervised self-training for sentence subjectivity classification
CN107256212A (en) Chinese search word intelligence cutting method
CN115713072A (en) Relation category inference system and method based on prompt learning and context awareness
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
Schaback et al. Multi-level feature extraction for spelling correction
CN114491079A (en) Knowledge graph construction and query method, device, equipment and medium
CN115774996A (en) Question-following generation method and device for intelligent interview and electronic equipment
CN110162615A (en) A kind of intelligent answer method, apparatus, electronic equipment and storage medium
CN116450781A (en) Question and answer processing method and device
CN113449119A (en) Method and device for constructing knowledge graph, electronic equipment and storage medium
Yao Product name recognition and normalization in internet forums
CN108733757B (en) Text search method and system
CN112269877A (en) Data labeling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 230000 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei City, Anhui Province

Patentee after: IFLYTEK Medical Technology Co.,Ltd.

Address before: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Patentee before: Anhui Xunfei Medical Co.,Ltd.