CN112990290A - Sample data generation method, device, equipment and storage medium - Google Patents

Sample data generation method, device, equipment and storage medium

Publication number
CN112990290A
Authority
CN
China
Prior art keywords
target
sentences
medical record
sample data
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110259314.3A
Other languages
Chinese (zh)
Inventor
孙超
王健宗
吴天博
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110259314.3A
Publication of CN112990290A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application belongs to the technical field of artificial intelligence and provides a sample data generation method, device, equipment and storage medium. A medical record sentence containing a named entity is obtained; a preset corpus is searched for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range; and sample data for a pre-training language model is then generated according to the plurality of target corpora and a preset question template. Because the sample data is generated automatically from a plurality of target corpora and the question template, rather than from a single target corpus, the richness of the generated sample data is greatly improved.

Description

Sample data generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating sample data.
Background
With the development of natural language processing in artificial intelligence, especially in the field of intelligent inquiry, pre-training language models are increasingly used to implement natural language processing.
However, a pre-training language model usually has to be trained on large-scale manually labeled data, which consumes a large amount of manpower and material resources. Moreover, different people understand the same natural language differently and therefore label the same question differently, so the labeled sample data is inaccurate and the resulting pre-training language model is inaccurate as well.
How to improve the richness of the sample data of the pre-training language model is therefore an urgent problem in the field of natural language processing in artificial intelligence.
Disclosure of Invention
The application provides a method, a device, a system, equipment and a storage medium for generating sample data, which can improve the richness of the sample data of a pre-training language model.
In a first aspect, an embodiment of the present application provides a method for generating sample data, where the method includes:
acquiring medical record sentences containing named entities;
searching a preset corpus for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range;
and generating sample data of a pre-training language model according to the plurality of target corpora and a preset question template.
In an embodiment, the searching a preset corpus for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range includes:
searching the preset corpus for candidate sentences that include the named entity;
acquiring similarity values between the candidate sentences and the medical record sentence;
and taking a plurality of candidate sentences whose similarity values with the medical record sentence are smaller than a preset threshold as the plurality of target corpora.
In an embodiment, the acquiring similarity values between the candidate sentences and the medical record sentence includes:
removing the candidate sentences that belong to the medical record in which the medical record sentence is located, to obtain target candidate sentences;
inputting the target candidate sentences and the medical record sentence into a vector conversion model to obtain candidate vectors corresponding to the target candidate sentences and a sentence vector corresponding to the medical record sentence;
and calculating cosine values between the candidate vectors and the sentence vector to obtain the similarity values between the target candidate sentences and the medical record sentence.
In an embodiment, the taking a plurality of candidate sentences whose similarity values with the medical record sentence are smaller than a preset threshold as the plurality of target corpora includes:
sorting the target candidate sentences whose similarity values with the medical record sentence are smaller than the preset threshold in descending order of similarity value, to obtain a sequence number for each target candidate sentence;
and taking the target candidate sentences whose sequence numbers are smaller than N as the plurality of target corpora, where N is a positive integer.
In an embodiment, the generating sample data of the pre-training language model according to the target corpora and the preset question template includes:
replacing the named entity in a target corpus with a target query word to obtain sample data of the pre-training language model.
In an embodiment, the replacing the named entity in a target corpus with a target query word to obtain sample data of the pre-training language model includes:
replacing the named entity in the target corpus with the target query word, placing the sentence fragment that follows the named entity in the target corpus after the target query word, and placing the sentence fragment that precedes the named entity in the target corpus after the fragment that follows the named entity, to obtain sample data of the pre-training language model.
In an embodiment, before generating sample data of the pre-training language model according to the target corpora and the preset question template, the method further includes:
determining the query word corresponding to the named entity in the medical record sentence according to a preset correspondence between named entities and query words;
and taking the query word corresponding to the named entity in the medical record sentence as the target query word in the preset question template.
In a second aspect, an embodiment of the present application provides an apparatus for generating sample data, the apparatus including:
an acquisition module, used for acquiring medical record sentences containing named entities;
a searching module, used for searching a preset corpus for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range;
and a generating module, used for generating sample data of a pre-training language model according to the plurality of target corpora and a preset question template.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method according to the first aspect.
With the sample data generation method, device, equipment and storage medium described above, a medical record sentence containing a named entity is obtained, a preset corpus is searched for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range, and sample data of a pre-training language model is then generated according to the plurality of target corpora and a preset question template. Because the sample data is generated automatically from a plurality of target corpora and the question template, rather than from a single target corpus, the richness of the generated sample data is greatly improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system for generating sample data according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a method for generating sample data according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating a method for generating sample data according to another embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a method for generating sample data according to another embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating a method for generating sample data according to another embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating a method for generating sample data according to another embodiment of the present application;
FIG. 7 is a schematic structural diagram of a sample data generation apparatus according to an embodiment of the present application;
FIG. 8 is an internal structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It is to be understood that the terms "first," "second," "third," "fourth," and the like (if any) in the embodiments of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method for generating sample data provided by this embodiment may be applied to the application environment shown in fig. 1, in which the electronic device 100 generates sample data 110 of a pre-training language model. The electronic device 100 may be, but is not limited to, an electronic device with a data processing function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, or a personal digital assistant; this embodiment does not limit the specific form of the electronic device 100.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.
The execution subject of the method embodiments described below may also be a sample data generation apparatus, which may be implemented as part or all of the electronic device by software, hardware, or a combination of software and hardware. The following method embodiments are described by taking an execution subject as an electronic device as an example.
Fig. 2 is a schematic flow chart of a method for generating sample data according to an embodiment of the present application. This embodiment relates to a specific process for improving the richness of the sample data of a pre-training language model. As shown in fig. 2, the method includes the following steps:
S101, acquiring medical record sentences containing named entities.
A named entity may be the name of a person, an organization, a place, or any other entity identified by a name. More broadly, named entities may also include numbers, dates, currencies, addresses, and the like. The named entities in a text may be obtained through named entity recognition, that is, the process of identifying the names or symbols of particular types of things in a document collection. Named entity recognition generally involves: identifying the named entities in the text; determining the type of each entity; and, when multiple entities refer to the same thing, selecting one of them as the representative of the group. A medical record sentence is a sentence in a medical record. For example, a medical record in a medical corpus may read: "The patient underwent radical resection of rectal cancer under general anesthesia in our hospital 3 months ago, received anti-infection and nutritional support treatment after the operation, and the incision healed well. Postoperative pathology showed rectal adenocarcinoma (poorly differentiated), infiltrative ulcerative type, area 3.5 x 2 cm, invading the adventitia." The medical record sentence is a sentence of the medical record that includes a named entity. If the named entity is "rectal cancer", the medical record sentence is the sentence that includes "rectal cancer", namely "The patient underwent radical resection of rectal cancer under general anesthesia in our hospital 3 months ago, received anti-infection and nutritional support treatment after the operation, and the incision healed well."
Note that a medical record sentence may contain more than one named entity. For example, suppose the medical record sentence is "The patient underwent radical resection of rectal cancer via laparoscope under general anesthesia in our hospital 3 months ago because of rectal cancer, received anti-infection and nutritional support treatment after the operation, and the incision healed well." Here "rectal cancer" may be a first named entity and "laparoscope" may be a second named entity. The present application does not limit the number of named entities.
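Step S101 can be sketched as follows. This is only a minimal illustration: the entity lexicon, the function names and the punctuation-based sentence splitter are assumptions made for the example, and simple dictionary matching stands in for the named entity recognition technique, which the application does not specify.

```python
import re

# Hypothetical entity lexicon; a real system would use a trained NER model instead.
ENTITY_LEXICON = ["rectal cancer", "laparoscope"]


def split_sentences(record: str) -> list[str]:
    """Split a record on sentence-ending punctuation (English or Chinese)."""
    # A period inside a number such as "3.5" is not followed by whitespace,
    # so it does not trigger a split.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", record)
    return [p.strip() for p in parts if p.strip()]


def sentences_with_entities(record: str, lexicon: list[str]) -> list[tuple[str, list[str]]]:
    """Return (sentence, entities) pairs for sentences mentioning at least one entity."""
    result = []
    for sentence in split_sentences(record):
        lowered = sentence.lower()
        found = [entity for entity in lexicon if entity in lowered]
        if found:
            result.append((sentence, found))
    return result
```

Each returned pair is a medical record sentence together with the named entities found in it, so a sentence mentioning both "rectal cancer" and "laparoscope" carries two entities.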
S102, searching a preset corpus for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range.
The preset corpus may be a medical corpus, and a target corpus is a corpus entry that contains the named entity and whose similarity with the medical record sentence falls within the preset range. The search generally covers the entire preset corpus, so a plurality of target corpora is found. Note that the embodiments of the present application can also be applied to the field of insurance dialogue; in that case the preset corpus may be an insurance dialogue corpus, and the medical record sentence acquired in S101 may instead be an insurance sentence.
When sentences that include the named entity are retrieved from the preset corpus, those whose similarity values with the medical record sentence fall within the preset range may be taken as target corpora. Optionally, before this selection, the sentences of the medical record in which the medical record sentence is located may be removed, so that the plurality of target corpora is retrieved from the part of the preset corpus outside that medical record. The embodiments of the present application do not limit this.
As described above, there may be one or more named entities. When there are several, for example "rectal cancer" and "laparoscope", and the search of the preset corpus yields m target corpora matching "rectal cancer" and n target corpora matching "laparoscope", a total of m + n target corpora is obtained.
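The m + n count can be illustrated with a short sketch. The function name and the choice to keep each target corpus paired with the entity it was retrieved for are assumptions for illustration; the pairing simply records which entity should later be replaced in which sentence.

```python
def collect_targets(targets_by_entity: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flatten per-entity search results into (entity, target sentence) pairs.

    With m sentences for one entity and n for another, m + n pairs are returned.
    """
    pairs = []
    for entity, sentences in targets_by_entity.items():
        for sentence in sentences:
            pairs.append((entity, sentence))
    return pairs
```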
S103, generating sample data of the pre-training language model according to the plurality of target corpora and the preset question template.
The pre-training language model may be a model that returns an answer to an input question. It can be applied to intelligent diagnosis guidance as well as to insurance question answering, and the embodiments of the present application do not limit this. After the plurality of target corpora is obtained, the word order of each target corpus and the words in it can be adjusted according to the question template to obtain sample data of the pre-training language model.
In this method for generating sample data, a medical record sentence containing a named entity is obtained, a preset corpus is searched for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range, and sample data of a pre-training language model is generated according to the plurality of target corpora and a preset question template. Because the sample data is generated automatically from a plurality of target corpora and the question template, rather than from a single target corpus, the richness of the generated sample data is greatly improved.
Fig. 3 is a schematic flowchart of a sample data generating method according to another embodiment of the present application. This embodiment relates to a specific process for searching for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range. As shown in fig. 3, one possible implementation of step S102, "searching a preset corpus for a plurality of target corpora that include the named entity and whose similarity values with the medical record sentence fall within a preset range", includes the following steps:
S201, searching the preset corpus for candidate sentences that include the named entity.
A candidate sentence is a sentence in the preset corpus that includes the named entity. It may or may not belong to the medical record in which the medical record sentence is located. The candidate sentences may be retrieved from the preset corpus with an Elasticsearch search engine.
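The retrieval in S201 can be sketched with an in-memory stand-in for the search engine. The corpus layout (pairs of a record identifier and a sentence) and the function name are assumptions for illustration, and case-insensitive substring matching is a simplification of what a full-text search engine would do.

```python
def find_candidates(corpus: list[tuple[str, str]], entity: str) -> list[tuple[str, str]]:
    """Return the (record_id, sentence) pairs whose sentence mentions the entity.

    `corpus` is a hypothetical in-memory corpus; each item pairs the identifier
    of a medical record with one sentence taken from that record.
    """
    needle = entity.lower()
    return [(record_id, sentence) for record_id, sentence in corpus
            if needle in sentence.lower()]
```

Keeping the record identifier with each candidate makes it possible to tell later whether a candidate comes from the same medical record as the medical record sentence.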
S202, acquiring the similarity values between the candidate sentences and the medical record sentence.
The similarity value measures how similar two sentences are. Generally, the distance between the features of the two sentences is calculated: a small distance means high similarity, and a large distance means low similarity. The embodiment shown in fig. 4 describes in detail how the similarity values between the candidate sentences and the medical record sentence may be obtained. As shown in fig. 4, one possible implementation of step S202, "acquiring the similarity values between the candidate sentences and the medical record sentence", includes the following steps:
S301, removing the candidate sentences that belong to the medical record in which the medical record sentence is located, to obtain target candidate sentences.
The candidate sentences that belong to the medical record in which the medical record sentence is located are removed, and the remaining candidate sentences in the preset corpus are taken as the target candidate sentences. For example, suppose the preset corpus is a medical corpus and the medical record in which the medical record sentence is located reads: "The patient underwent radical resection of rectal cancer under general anesthesia 3 months ago, received anti-infection and nutritional support treatment after the operation, and the incision healed well. Postoperative pathology showed rectal cancer (poorly differentiated), infiltrative ulcerative type, area 3.5 x 2 cm, invading the adventitia." The medical record sentence is the first sentence of this record. The search of the preset corpus returns, among the candidate sentences, the second sentence: "Postoperative pathology showed rectal cancer (poorly differentiated), infiltrative ulcerative type, area 3.5 x 2 cm, invading the adventitia." Although this sentence includes the named entity "rectal cancer", it is a context sentence from the same medical record as the medical record sentence, so it is removed before the similarity values between the candidate sentences and the medical record sentence are obtained.
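Step S301 amounts to a filter on the origin of each candidate. The sketch below assumes, purely for illustration, that every candidate is stored as a pair of a record identifier and a sentence; the function name is likewise hypothetical.

```python
def remove_same_record(candidates: list[tuple[str, str]],
                       source_record_id: str) -> list[tuple[str, str]]:
    """Drop candidates taken from the record that contains the medical record sentence."""
    return [(record_id, sentence) for record_id, sentence in candidates
            if record_id != source_record_id]
```

In the example above, the postoperative pathology sentence shares its record identifier with the medical record sentence and is therefore filtered out, even though it mentions the named entity.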
S302, inputting the target candidate sentences and the medical record sentence into a vector conversion model to obtain candidate vectors corresponding to the target candidate sentences and a sentence vector corresponding to the medical record sentence.
The vector conversion model produces a vector for a sentence by extracting feature values from the sentence. It may be a TF-IDF model or another model, which is not limited in the embodiments of the present application. Inputting a target candidate sentence into the vector conversion model yields the corresponding candidate vector, and inputting the medical record sentence into the vector conversion model yields the corresponding sentence vector.
S303, calculating cosine values between the candidate vectors and the sentence vector to obtain the similarity values between the target candidate sentences and the medical record sentence.
With the candidate vectors and the sentence vector obtained as above, the cosine value between each candidate vector and the sentence vector can be calculated and used as the similarity value between the corresponding target candidate sentence and the medical record sentence. Generally, the closer the cosine value is to 1, the more similar the two vectors are. Cosine similarity is independent of the magnitudes of the vectors and depends only on their directions.
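Steps S302 and S303 can be sketched in plain Python. The application only names TF-IDF as one possible vector conversion model, so the details here are assumptions: whitespace tokenization (Chinese text would need word segmentation instead), a smoothed IDF weighting, sparse vectors stored as dictionaries, and all function names.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Whitespace tokenization; an assumption suited to English examples only."""
    return text.lower().split()


def tfidf_vectors(sentences: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)
    # Document frequency: number of sentences in which each term occurs.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in doc.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because the dot product is divided by both norms, scaling either vector leaves the result unchanged, which is the magnitude independence noted above.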
S203, taking a plurality of candidate sentences whose similarity values with the medical record sentence are smaller than a preset threshold as the plurality of target corpora.
After the similarity values between the target candidate sentences and the medical record sentence are obtained, the target candidate sentences whose similarity values with the medical record sentence are smaller than the preset threshold can be taken as target corpora. For example, the preset threshold may be 95%: after the similarity value between each target candidate sentence and the medical record sentence is obtained, the target candidate sentences with similarity values smaller than 95% are taken as the plurality of target corpora. The plurality of target corpora may also be determined by the embodiment shown in fig. 5. Optionally, as shown in fig. 5, one possible implementation of step S203, "taking a plurality of candidate sentences whose similarity values with the medical record sentence are smaller than a preset threshold as the plurality of target corpora", includes the following steps:
S401, sorting the target candidate sentences whose similarity values with the medical record sentence are smaller than the preset threshold in descending order of similarity value, to obtain a sequence number for each target candidate sentence.
As described in the foregoing embodiments, the similarity value is the cosine value between two vectors, that is, a specific number. The target candidate sentences whose similarity with the medical record sentence is smaller than the preset threshold can therefore be sorted in descending order of their similarity values with the medical record sentence, which gives each target candidate sentence a sequence number.
S402, taking the target candidate sentences whose sequence numbers are smaller than N as the plurality of target corpora, where N is a positive integer.
After the sequence numbers of the target candidate sentences are obtained, the target candidate sentences whose sequence numbers are smaller than N may be taken as the plurality of target corpora. For example, the target candidate sentences with similarity values smaller than 95% are arranged in descending order of similarity value, and those with sequence numbers smaller than 20 are taken as the target corpora; that is, the first 20 candidate sentences in the preset corpus whose similarity with the medical record sentence is smaller than 95% are selected.
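Steps S203, S401 and S402 can be sketched together as a filter, a sort and a cut-off. The function name and the input layout (pairs of a sentence and its similarity value) are assumptions; the sketch also assumes that sequence numbers start at 0, so that "sequence numbers smaller than N" selects exactly the first N sentences, in line with the example of 20 above.

```python
def select_targets(scored: list[tuple[str, float]],
                   threshold: float = 0.95,
                   n: int = 20) -> list[str]:
    """Keep sentences below the similarity threshold, then take the N most similar."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # S203: discard sentences that are too similar to the medical record sentence.
    kept = [(sentence, sim) for sentence, sim in scored if sim < threshold]
    # S401: sort in descending order of similarity value.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # S402: sequence numbers 0 .. n-1 are the target corpora.
    return [sentence for sentence, _ in kept[:n]]
```

The upper bound removes near-duplicates of the medical record sentence, while the descending sort keeps the remaining sentences that are still closest to it.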
In this method for generating sample data, candidate sentences that include the named entity are searched for in the preset corpus, the similarity values between the candidate sentences and the medical record sentence are obtained, and the candidate sentences whose similarity values are smaller than the preset threshold are taken as the target corpora. The target corpora are therefore not identical to the medical record sentence, so the sample data generated from them is richer and more accurate, and the pre-training language model trained on that sample data is more accurate as well.
In an embodiment, the preset question template includes a target query word. Optionally, S103, "generating sample data of the pre-training language model according to the target corpora and the preset question template", includes: replacing the named entity in each target corpus with the target query word to obtain sample data of the pre-training language model.
The target query word may be a question word such as "what disease", "who" or "which part". Replacing the named entity in a target corpus with the target query word yields a question sentence, that is, a piece of sample data of the pre-training language model.
For example, the target corpus is "the patient was diagnosed three months ago in our hospital with rectal cancer, and underwent laparoscopic radical resection of rectal cancer under general anesthesia on October 26, 2019". Replacing the named entity "rectal cancer" in the target corpus with the target query word "what disease" gives the sample data of the pre-training language model: "what disease underwent laparoscopic radical resection of rectal cancer under general anesthesia on October 26, 2019, the patient was diagnosed three months ago in our hospital with".
When a named entity in a target corpus is replaced by a target query word, a binary map (Bi-gram) can be sampled based on a prior probability of the binary map being associated with the named entity, wherein the prior probability is calculated from a preset data set based on the named entity and a problem binary model. For example, in generating sample data for a pre-trained medical language model, prior probabilities associated with named entities may be computed from the CCKS medical data set by the named entities and the problem bigram. It should be noted that the accuracy of the obtained sample data of the pre-training language model is not affected by the selection of the target query word.
According to the sample data generation method, the named entity in each target corpus is automatically replaced with the target query word to obtain the sample data of the pre-training language model, which makes the sample data more convenient to obtain.
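The replacement step can be illustrated with a short sketch. The function name is hypothetical, and replacing only the first occurrence of the named entity is an assumption, since the text does not say how repeated occurrences are handled.

```python
def replace_entity(target_corpus: str, entity: str, query_word: str) -> str:
    """Replace the first occurrence of the named entity with the query word.

    Assumption: only the first occurrence is replaced.
    """
    if entity not in target_corpus:
        raise ValueError("named entity not found in target corpus")
    return target_corpus.replace(entity, query_word, 1)


sample = replace_entity(
    "three months ago the patient was diagnosed at our hospital with rectal cancer",
    "rectal cancer",
    "what disease",
)
```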
In an embodiment, when the target query word is used to replace the named entity in the target corpus to obtain sample data of the pre-training language model, the word order of the sentence fragments in the target corpus may also be adjusted. Optionally, the target query word is used to replace the named entity in the target corpus, the sentence fragment of the target corpus after the named entity is placed behind the target query word, and the sentence fragment of the target corpus before the named entity is placed behind the sentence fragment that followed the named entity, so as to obtain sample data of the pre-training language model.
When generating sample data of the pre-training language model, in addition to replacing the named entity in the target corpus with the target query word, the sentence fragment after the named entity in the target corpus is placed behind the target query word, and the sentence fragment before the named entity is placed behind the sentence fragment that followed the named entity. For example, the target corpus is "three months ago the patient was diagnosed at our hospital with rectal cancer, and underwent laparoscopic radical resection of rectal cancer under general anesthesia on October 26, 2019", where "rectal cancer" is the named entity and the target query word is "what disease". The sentence fragment before the named entity is "three months ago the patient was diagnosed at our hospital with", and the sentence fragment after the named entity is "and underwent laparoscopic radical resection of rectal cancer under general anesthesia on October 26, 2019". Placing the latter fragment behind "what disease" and the former fragment behind the latter yields the sample data of the pre-training language model: "what disease, and underwent laparoscopic radical resection of rectal cancer under general anesthesia on October 26, 2019, three months ago the patient was diagnosed at our hospital with".
In this sample data generation method, the named entity in the target corpus is replaced with the target query word, the sentence fragment after the named entity in the target corpus is placed behind the target query word, and the sentence fragment before the named entity is placed behind the sentence fragment that followed the named entity, to obtain the sample data of the pre-training language model. The word order of the generated sample data thus differs from that of the target corpus, which increases the diversity of the generated sample data in terms of word order. A pre-training language model trained on such diversified sample data is more accurate.
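The reordering can be sketched as: query word first, then the fragment that followed the named entity, then the fragment that preceded it. The function name, the comma used as a joiner, and the stripping of surrounding punctuation are assumptions made for this illustration.

```python
def build_reordered_sample(target_corpus: str, entity: str, query_word: str) -> str:
    """Return query word + fragment after the entity + fragment before the entity."""
    before, found, after = target_corpus.partition(entity)
    if not found:
        raise ValueError("named entity not found in target corpus")
    # Strip separators left over at the fragment boundaries (an assumption).
    parts = [query_word, after.strip(" ,."), before.strip(" ,.")]
    return ", ".join(p for p in parts if p)


sample = build_reordered_sample(
    "the patient was diagnosed with rectal cancer at our hospital three months ago",
    "rectal cancer",
    "what disease",
)
```

`str.partition` splits at the first occurrence of the named entity, so later mentions of the same entity stay inside the trailing fragment.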
Fig. 6 is a schematic flowchart of a method for generating sample data according to another embodiment of the present application. The embodiment relates to a specific process of generating a preset question template. As shown in fig. 6, before the step S103 "generating sample data of a pre-training language model according to a plurality of target corpora and a preset question template", the method further includes the following steps:
S501, determining the query words corresponding to the named entities in the medical record sentences according to the preset corresponding relation between the named entities and the query words.
The query words and the named entities may be in one-to-one correspondence, or multiple named entities may correspond to one query word, which is not limited in the embodiments of the present application. And determining the query words corresponding to the named entities in the medical record sentences according to the preset corresponding relation. For example, the query word corresponding to the named entity "rectal cancer" in the medical record sentence is "what disease", and the query word corresponding to the named entity "orthopedist" in the medical record sentence is "who".
S502, taking the query words corresponding to the named entities in the medical record sentences as target query words in the preset question template.
According to this sample data generation method, the query word corresponding to the named entity in the medical record sentence is determined according to the preset correspondence between named entities and query words, and that query word is used as the target query word in the preset question template. The target query word is therefore a query word associated with the named entity, so it can replace the named entity more accurately, which further improves the accuracy of the sample data generated by replacing the named entity with the target query word.
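Steps S501 and S502 amount to a lookup in a preset correspondence. The sketch below uses the two pairs given as examples in the text and adds a third, hypothetical pair to show several named entities sharing one query word; the dictionary and function names are illustrative only.

```python
# Preset correspondence between named entities and query words.
# The first two pairs come from the examples in the text; the third is a
# hypothetical addition showing a many-to-one mapping.
ENTITY_TO_QUERY_WORD = {
    "rectal cancer": "what disease",
    "orthopedist": "who",
    "rectal adenocarcinoma": "what disease",
}


def build_target_query_words(record_entities, correspondence):
    """S501/S502: look up the query word for each named entity found in the
    medical record sentence; entities without a preset mapping are skipped."""
    return {e: correspondence[e] for e in record_entities if e in correspondence}


target_query_words = build_target_query_words(
    ["rectal cancer", "orthopedist", "unknown entity"], ENTITY_TO_QUERY_WORD
)
```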
Table 1 shows the corpus information used in generating sample data of a pre-trained language model. As shown in Table 1, a complete medical record in the medical corpus reads: "The patient underwent radical resection of rectal cancer under general anesthesia at our hospital 3 months ago because of 'rectal cancer'; anti-infection and nutritional support treatment were given after the operation, and the incision healed well. Postoperative pathology showed rectal adenocarcinoma (poorly differentiated), infiltrative ulcerative type, 3.5 x 2 cm in area, invading the outer membrane." The target sentence obtained from it is: "The patient underwent radical resection of rectal cancer under general anesthesia at our hospital 3 months ago because of rectal cancer; anti-infection and nutritional support treatment were given after the operation, and the incision healed well." A search is performed in the preset corpus according to the target sentence to obtain a target corpus: "3 months ago the patient was diagnosed at our hospital with rectal cancer, and underwent laparoscopic radical resection of rectal cancer under general anesthesia on 2019-10-26." The named entity "rectal cancer" in the target corpus is replaced with the target query word "what disease", and the order of the sentence fragments before and after the named entity is adjusted, to obtain the generated sample data: "What disease, and underwent laparoscopic radical resection of rectal cancer under general anesthesia on 2019-10-26, 3 months ago the patient was diagnosed at our hospital with?"
TABLE 1
[Table 1 is provided as an image in the original publication.]
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Fig. 7 is a schematic structural diagram of a device for generating sample data according to an embodiment of the present application. As shown in fig. 7, the device for generating sample data includes: an obtaining module 10, a searching module 20 and a generating module 30, wherein:
an obtaining module 10, configured to obtain a medical record sentence including a named entity;
the searching module 20 is configured to search a preset corpus for a plurality of target corpora which include the named entity and whose similarity values with the medical record sentence meet a preset range;
and the generating module 30 is configured to generate sample data of the pre-training language model according to the plurality of target corpora and a preset question template.
In one embodiment, the searching module 20 includes: a searching unit 201, an obtaining unit 202 and a determining unit 203, wherein:
a searching unit 201, configured to search for candidate sentences including the named entity in the preset corpus;
an obtaining unit 202, configured to obtain similarity values of the candidate sentences and the medical record sentence;
the determining unit 203 is configured to use a plurality of candidate sentences whose similarity values with the medical record sentence are smaller than a preset threshold as the plurality of target corpora.
In an embodiment, the obtaining unit 202 is specifically configured to remove the candidate sentences that belong to the medical record where the medical record sentence is located, so as to obtain target candidate sentences; input the target candidate sentences and the medical record sentence into a vector conversion model to obtain a candidate vector corresponding to each target candidate sentence and a sentence vector corresponding to the medical record sentence; and calculate cosine values between the candidate vectors and the sentence vector to obtain the similarity values of the target candidate sentences and the medical record sentence.
In an embodiment, the obtaining unit 202 is specifically configured to sort the target candidate sentences whose similarity values with the medical record sentence are smaller than the preset threshold in descending order of similarity value, and obtain the sequence numbers of the target candidate sentences; and take each target candidate sentence with a sequence number smaller than N as the plurality of target corpora, wherein N is a positive integer.
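The filtering and ranking performed by these units can be sketched as below. The `Candidate` type, its field names, and the function name are hypothetical, the similarity values are assumed to be precomputed cosine values, and sequence numbers are assumed to start at 0, so that "sequence number smaller than N" keeps N sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    record_id: str     # medical record the sentence comes from
    similarity: float  # precomputed cosine value against the medical record sentence


def pick_top_n(candidates, source_record_id, threshold, n):
    """Drop same-record candidates, keep those below the threshold, sort by
    similarity in descending order, and keep sequence numbers 0..n-1."""
    eligible = [
        c for c in candidates
        if c.record_id != source_record_id and c.similarity < threshold
    ]
    ranked = sorted(eligible, key=lambda c: c.similarity, reverse=True)
    return [c for seq, c in enumerate(ranked) if seq < n]


candidates = [
    Candidate("sentence a", "r2", 0.95),  # above the threshold
    Candidate("sentence b", "r3", 0.80),
    Candidate("sentence c", "r4", 0.60),
    Candidate("sentence d", "r1", 0.85),  # same medical record as the query
    Candidate("sentence e", "r5", 0.40),
]
picked = pick_top_n(candidates, source_record_id="r1", threshold=0.9, n=2)
```

Sorting in descending order before truncating keeps the candidates that are closest to the medical record sentence while still below the threshold.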
In an embodiment, the preset question template includes a target query word, and the generating module 30 is specifically configured to replace a named entity in the target corpus with the target query word to obtain sample data of the pre-training language model.
In an embodiment, the generating module 30 is specifically configured to replace a named entity in the target corpus with a target query word, place the sentence fragment after the named entity in the target corpus behind the target query word, and place the sentence fragment before the named entity in the target corpus behind the sentence fragment that followed the named entity, to obtain sample data of the pre-training language model.
The apparatus for generating sample data further comprises: a setting module 40, wherein:
the setting module 40 is specifically configured to determine the query word corresponding to the named entity in the medical record sentence according to a preset correspondence between named entities and query words; and take the query word corresponding to the named entity in the medical record sentence as the target query word in the preset question template.
The apparatus for generating sample data provided in the embodiment of the present application may implement the method embodiment, and its implementation principle and technical effect are similar, which are not described herein again.
For specific limitations on the device for generating sample data, reference may be made to the above limitations on the method for generating sample data, which are not described herein again. The modules in the sample data generating device may be wholly or partially implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the electronic device in hardware form, or can be stored in a memory in the electronic device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, an electronic device is provided, the internal structure of which may be as shown in FIG. 8. The electronic device includes a processor, a memory, a network interface, and an input device connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the electronic device is used for connecting to and communicating with an external terminal through a network. The computer program, when executed by the processor, implements a method of generating sample data.
Those skilled in the art will appreciate that the structure shown in fig. 8 is a block diagram of only a portion of the structure relevant to the present disclosure, and does not constitute a limitation on the electronic device to which the present disclosure may be applied, and that a particular electronic device may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
It should be clear that, in the embodiments of the present application, the process of executing the computer program by the processor is consistent with the process of executing the steps in the above method, and specific reference may be made to the description above.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the method for generating sample data provided in the above method embodiments of the present application can be implemented.
It should be clear that, in the embodiments of the present application, the process of executing the computer program by the processor is consistent with the process of executing the steps in the above method, and specific reference may be made to the description above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of the present specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for generating sample data, comprising:
acquiring medical record sentences containing named entities;
searching a plurality of target corpora which comprise the named entity and have similarity values with the medical record sentences meeting a preset range in a preset corpus;
and generating sample data of a pre-training language model according to the plurality of target corpora and a preset question template.
2. The method according to claim 1, wherein the searching, in a preset corpus, for a plurality of target corpora which comprise the named entity and have similarity values with the medical record sentences meeting a preset range comprises:
searching for candidate sentences comprising the named entity in the preset corpus;
acquiring similarity values of the candidate sentences and the medical record sentences;
and taking a plurality of candidate sentences of which the similarity values with the medical record sentences are smaller than a preset threshold value as the plurality of target corpora.
3. The method of claim 2, wherein the acquiring similarity values of the candidate sentences and the medical record sentences comprises:
removing the candidate sentences in the medical record where the medical record sentences are located to obtain target candidate sentences;
inputting the target candidate sentences and the medical record sentences into a vector conversion model to obtain candidate vectors corresponding to the target candidate sentences and sentence vectors corresponding to the medical record sentences;
and calculating cosine values between the candidate vectors and the sentence vectors to obtain similarity values of the target candidate sentences and the medical record sentences.
4. The method according to claim 3, wherein the taking a plurality of candidate sentences of which the similarity values with the medical record sentences are smaller than a preset threshold value as the plurality of target corpora comprises:
sorting the target candidate sentences of which the similarity values with the medical record sentences are smaller than the preset threshold value in descending order of similarity value to obtain sequence numbers of the target candidate sentences;
and taking each target candidate sentence with a sequence number smaller than N as the plurality of target corpora, wherein N is a positive integer.
5. The method according to any one of claims 1 to 4, wherein the preset question template comprises a target query word, and the generating sample data of a pre-training language model according to the plurality of target corpora and a preset question template comprises:
and replacing the named entity in the target corpus with the target query word to obtain sample data of the pre-training language model.
6. The method according to claim 5, wherein the replacing the named entity in the target corpus with the target query word to obtain sample data of the pre-training language model comprises:
replacing the named entity in the target corpus with the target query word, placing the sentence fragment after the named entity in the target corpus behind the target query word, and placing the sentence fragment before the named entity in the target corpus behind the sentence fragment that followed the named entity, to obtain sample data of the pre-training language model.
7. The method according to claim 6, wherein before replacing the named entity in the target corpus with the target query word to obtain sample data of the pre-training language model, the method further comprises:
determining the query words corresponding to the named entities in the medical record sentences according to the preset corresponding relationship between the named entities and the query words;
and taking the query words corresponding to the named entities in the medical record sentences as target query words in the preset question template.
8. An apparatus for generating sample data, the apparatus comprising:
the acquisition module is used for acquiring medical record sentences containing named entities;
the searching module is used for searching a preset corpus for a plurality of target corpora which comprise the named entity and have similarity values with the medical record sentences meeting a preset range;
and the generating module is used for generating sample data of the pre-training language model according to the plurality of target corpora and a preset question template.
9. An electronic device, comprising a memory storing a computer program and a processor implementing the method of any of claims 1 to 7 when the processor executes the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110259314.3A 2021-03-10 2021-03-10 Sample data generation method, device, equipment and storage medium Pending CN112990290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110259314.3A CN112990290A (en) 2021-03-10 2021-03-10 Sample data generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110259314.3A CN112990290A (en) 2021-03-10 2021-03-10 Sample data generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112990290A true CN112990290A (en) 2021-06-18

Family

ID=76334774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110259314.3A Pending CN112990290A (en) 2021-03-10 2021-03-10 Sample data generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112990290A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554107A (en) * 2021-07-28 2021-10-26 工银科技有限公司 Corpus generating method, apparatus, device, storage medium and program product
CN114357974A (en) * 2021-12-28 2022-04-15 北京海泰方圆科技股份有限公司 Similar sample corpus generation method and device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination