WO2021170085A1 - Labeling method, relation extraction method, storage medium, and computing device - Google Patents

Labeling method, relation extraction method, storage medium, and computing device

Info

Publication number
WO2021170085A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
template
seed
sentence
correct
Prior art date
Application number
PCT/CN2021/078145
Other languages
English (en)
French (fr)
Inventor
代亚菲
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 filed Critical 京东方科技集团股份有限公司
Priority to EP21761699.4A priority Critical patent/EP4113358A4/en
Priority to US17/435,197 priority patent/US20220327280A1/en
Publication of WO2021170085A1 publication Critical patent/WO2021170085A1/zh
Priority to US18/395,509 priority patent/US20240126984A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G06F40/169 - Annotation, e.g. comment data or footnotes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G06F40/186 - Templates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the technical field of language recognition, and more specifically, to a labeling method, a relation extraction method, a storage medium, and a computing device.
  • relationship extraction is usually based on deep learning.
  • the premise of deep learning is to provide a large amount of labeled data for model training.
  • the current practice is based on manually labeling each sentence of the labeled text, which results in high labor and time costs.
  • a labeling method is provided, which includes: step S1, determining the text to be labeled, multiple correct seeds, and multiple wrong seeds, where the entities in each sentence of the text to be labeled have been marked by tags as the first entity or the second entity, and each correct seed and each wrong seed is an entity pair composed of a first entity and a second entity; step S2, traversing each sentence in the text to be labeled according to the correct seeds to generate at least one first template, and screening the at least one first template according to the correct seeds and the wrong seeds; step S3, traversing each sentence in the text to be labeled according to the screened first template to match at least one new seed; step S4, evaluating the at least one matched new seed, wherein new seeds that pass the evaluation are taken, together with the existing correct seeds, as correct seeds.
  • step S5, replacing the correct seeds in step S2 with the correct seeds obtained in step S4, and replacing the wrong seeds in step S2 with the wrong seeds obtained in step S4.
  • step S6, outputting the matched correct seeds and the classification relationship between the first entity and the second entity in each correct seed.
  • screening the at least one first template according to the correct seeds and the wrong seeds includes: using the at least one first template to match entity pairs in the text to be labeled; determining, according to the correct seeds and the wrong seeds, whether each entity pair matched by the at least one first template is a correct seed or a wrong seed; determining the number of correct seeds and the number of wrong seeds among the entity pairs matched by the at least one first template; calculating the evaluation index of the at least one first template according to those two numbers; and screening the at least one first template according to its evaluation index.
  • the evaluation index of the at least one first template is calculated by the following formula: Conf1(Pi) = Pip / (Pip + Pin), where:
  • Pip represents the number of correct seeds matched by the first template Pi
  • Pin represents the number of incorrect seeds matched by the first template Pi.
  • traversing each sentence in the text to be labeled according to the screened first template to match at least one new seed includes: obtaining a second template according to a sentence in the text to be labeled; calculating the similarity between the second template and the screened first template; and extracting an entity pair from the second template according to the similarity to match the at least one new seed.
  • the similarity between the second template and the screened first template is calculated by the following formula: Match(Ci, Pi) = α·Cosine(p, q) + β·Euclidean(p, q) + γ·Tanimoto(p, q), where:
  • Ci represents the second template obtained from the sentence of the seed T matched by the first template Pi;
  • p is the list composed of the first-character vectorized expression of the field that appears before the corresponding correct seed in the first template Pi, the second-character vectorized expression of the field between the first entity and the second entity of the corresponding correct seed, and the third-character vectorized expression of the field that appears after the corresponding correct seed;
  • q is the list composed of the first-character vectorized expression of the field that appears before the first entity and the second entity in the corresponding sentence of the second template Ci, the second-character vectorized expression of the field between the first entity and the second entity in the corresponding sentence, and the third-character vectorized expression of the field that appears after the first entity and the second entity in the corresponding sentence.
  • α, β, and γ are all scale coefficients greater than 0.
  • evaluating the at least one matched new seed includes: calculating the evaluation index of the matched new seed based on the similarity between the second template and the screened first template and on the evaluation index of the screened first template; and evaluating the matched new seed according to its evaluation index.
  • the evaluation index of the new seed is calculated by the following formula: Conf(T) = 1 - ∏i(1 - Conf1(Pi)·Match(Ci, Pi)), where:
  • T represents the new seed to be evaluated
  • Ci represents the second template obtained by matching the first template Pi to the sentence where the seed T is located
  • Conf1 (Pi) represents the evaluation index of the first template Pi
  • Match(Ci, Pi) represents the similarity between the first template Pi and the second template Ci.
  • the selection condition includes: steps S2-S4 having been repeated a set number of times, or the number of correct seeds that have passed the evaluation reaching a set threshold.
  • traversing each sentence in the text to be labeled according to the correct seeds to generate at least one first template includes: clustering those sentences in the text to be labeled that contain a correct seed; and
  • obtaining a first template from each cluster of sentences and the corresponding correct seed.
  • the first template includes the first-character vectorized expression of the field that appears before the corresponding correct seed in the clustered sentences, the second-character vectorized expression of the field between the first entity and the second entity of the corresponding correct seed, and the third-character vectorized expression of the field that appears after the corresponding correct seed.
  • obtaining the second template according to the sentence in the text to be labeled includes: determining the first entity and the second entity present in the sentence, and generating a second template according to the first entity and the second entity, wherein the second template includes the first-character vectorized expression of the field before both the first entity and the second entity in the sentence, the second-character vectorized expression of the field between the first entity and the second entity in the sentence, and the third-character vectorized expression of the field after both the first entity and the second entity in the sentence.
  • determining the text to be labeled includes: performing entity recognition on the sentences of the text to be labeled based on a medical dictionary, and placing a corresponding label at the location of each entity, each label indicating one of: disease name, examination method, treatment method, symptom manifestation, or preventive measure.
  • the first entity includes a field indicating the name of a disease
  • the second entity includes a field indicating one of an examination method, a treatment method, a symptom manifestation, or a preventive measure
  • the classification relationship includes disease-examination, disease-treatment, disease-symptom, and disease-prevention.
  • the entity in each sentence in the text to be labeled includes multiple first entities or multiple second entities
  • multiple copies of the sentence are made, where the tags of each copy include one first entity and one second entity, and at least one of the first entity and the second entity differs between copies.
  • a relation extraction method is provided, including: labeling text to be labeled using the above labeling method; and training a deep learning model with at least some of the sentences in the labeled text to obtain a relation extraction model.
  • the relation extraction method further includes: testing the relation extraction model using, as a test set, at least some of the sentences in the labeled text that were not involved in model training.
  • the deep learning model includes a piecewise convolutional neural network (PCNN) combined with an attention mechanism.
  • a non-transitory computer storage medium is provided that stores instructions executable by a processor to perform the above labeling method or the above relation extraction method.
  • a computing device includes a storage medium and a processor; the storage medium stores instructions that can be executed by the processor to perform the above labeling method or the above relation extraction method.
  • the computing device further includes a human-computer interaction interface for the user to input the original text to be labeled, the multiple correct seeds and the multiple wrong seeds, and/or to confirm the labeling result.
  • Fig. 1 is a flowchart of a labeling method according to an embodiment of the present disclosure.
  • Figure 2a is a schematic diagram of a human-computer interaction interface for inputting text to be labeled, correct seeds, and wrong seeds in the labeling method provided by an embodiment of the present disclosure.
  • Fig. 2b is a schematic diagram of a human-computer interaction interface for verifying a labeling result in the labeling method provided by an embodiment of the present disclosure.
  • Fig. 2c is a schematic diagram of a human-computer interaction interface for inputting a file to be relationship extracted in an embodiment of the present disclosure.
  • FIG. 2d is a schematic diagram of the test result of the human-computer interaction interface of the relationship extraction model in an embodiment of the present disclosure.
  • Fig. 2e is a schematic diagram of a saving interface of a human-computer interaction interface of a relationship extraction result in an embodiment of the present disclosure.
  • FIG. 3 is a detailed flowchart of the relationship extraction method according to an embodiment of the present disclosure.
  • Fig. 4 is a block diagram of a computing device according to an embodiment of the present disclosure.
  • an embodiment of the present disclosure provides a labeling method, which includes the following steps.
  • in step S1, the text to be labeled, multiple correct seeds, and multiple wrong seeds are determined.
  • the entities in each sentence in the to-be-annotated text have been marked as the first entity or the second entity by tags.
  • Both the correct seed and the wrong seed are an entity pair composed of a first entity and a second entity. That is, the entity in each sentence in the text to be marked is the first entity or the second entity.
  • the text to be marked is unstructured text data.
  • a human-computer interaction interface may be provided for the user to input the file to be marked, the file containing the correct seed, and the file containing the wrong seed.
  • these files can also be obtained in other ways.
  • here, a medical text is taken as an example of the text to be labeled.
  • a seed is an entity pair, i.e., a pair of entities.
  • a correct seed means that there is a logical connection between the two entities. For example: (fracture; X-ray film), which means that an X-ray film can be used to detect whether a fracture has occurred.
  • another correct seed is, for example: (mediastinal tumor; esophageal barium meal radiography), which means that esophageal barium meal radiography can be used to detect whether a mediastinal tumor has occurred.
  • a wrong seed means that the two entities are not logically related, for example: (diabetes; body weight) and (hypoproteinemia; blood oxygen saturation). The seed (diabetes; body weight) would indicate that the symptoms of diabetes are a manifestation of body weight, which is obviously a wrong logical connection; likewise, (hypoproteinemia; blood oxygen saturation) would indicate that the symptoms of hypoproteinemia are a manifestation of blood oxygen saturation, which is also obviously a wrong logical connection.
  • the original text to be labeled may consist only of sentences, without the entities of interest being marked with corresponding labels. In this case, tagging can be performed based on a dictionary.
  • entity recognition is performed on unlabeled sentences based on a medical dictionary, and corresponding labels are placed on the location of the entity.
  • the labels respectively indicate one of the name of the disease, inspection method, treatment method, manifestation of symptoms, and preventive measures.
  • the first entity includes a field that indicates the name of the disease
  • the second entity includes a field that indicates an examination method, a treatment method, manifestation symptoms, and preventive measures.
  • a sentence in the original text is, for example: "The clinical manifestations of this disease vary greatly, and none of the malformations is unique to trisomy 18; therefore, the diagnosis cannot be made from the clinical malformations alone, and a cellular chromosome examination must be done, with the diagnosis based on the results of karyotyping&". & is a programming symbol marking the end of a sentence.
  • ⁇ DES> and ⁇ /DES> are specific forms of tags, and their meaning is a field that indicates the name of the disease.
  • ⁇ CHE> and ⁇ /CHE> are the specific forms of tags, and their meaning is to indicate the field of the inspection method.
  • a human-computer interaction interface can also be provided for the user to input the original text, the multiple correct seeds, and the multiple error seeds.
  • the classification relationship includes disease-examination, disease-treatment, disease-symptom, and disease-prevention.
  • NA indicates an invalid relationship.
  • each sentence uses tags to indicate only one first entity and one second entity. If a sentence contains multiple first entities or multiple second entities, the sentence is copied into multiple copies, each with different labels. That is, the tags of each copy include one first entity and one second entity, and at least one of the two differs between copies, so that the copies are distinguished from one another.
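The copying rule above amounts to emitting one copy of the sentence per (first entity, second entity) combination. A minimal sketch (function name and tuple layout are illustrative, not from the source):

```python
from itertools import product

def make_copies(sentence, first_entities, second_entities):
    """One copy of the sentence per (first entity, second entity) pair,
    so that each copy carries exactly one labeled entity pair."""
    return [(sentence, e1, e2)
            for e1, e2 in product(first_entities, second_entities)]

copies = make_copies(
    "disease A can be detected by test B or test C&",
    first_entities=["disease A"],
    second_entities=["test B", "test C"],
)
print(len(copies))  # 2 copies, one per entity pair
```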
  • in step S2, each sentence in the text to be labeled is traversed according to the correct seeds to generate at least one first template, and the at least one first template is screened according to the correct seeds and the wrong seeds.
  • traversing each sentence in the text to be labeled according to the correct seeds to generate the first template includes: clustering those sentences in the text to be labeled that contain a correct seed; and then obtaining a first template from each cluster of sentences and the corresponding correct seed.
  • the first template includes the first-character vectorized expression of the field before the corresponding correct seed in the clustered sentences, the second-character vectorized expression of the field between the first entity and the second entity of the corresponding correct seed, and the third-character vectorized expression of the field after the corresponding correct seed.
  • for example, a template may read "tag1 can be used as tag2 and sometimes it can be diagnosed".
  • tag1 and tag2 represent the two entities of the seed, in no particular order; the vectorized expression of the field before the entity pair is empty, the field between the entity pair is "can be used as" (more precisely, its vectorized expression), and the field after the entity pair is "and sometimes it can be diagnosed".
  • the template generated from the correct seed is referred to as the first template.
  • the first template can also be understood as a list of vectorized text. This disclosure does not limit how text is vectorized; for example, the classic word2vec algorithm or the TF-IDF (term frequency-inverse document frequency) method can be chosen. If the left, middle, and right fields around the entity pair are vectorized as V1, V2, and V3, the list is [V1, V2, V3].
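The [V1, V2, V3] structure above can be sketched with a toy bag-of-words vectorizer standing in for word2vec or TF-IDF (which the text deliberately leaves open); the vocabulary and function names are illustrative assumptions:

```python
from collections import Counter

def vectorize(text, vocab):
    """Toy bag-of-words vectorization over a fixed vocabulary,
    a stand-in for word2vec or TF-IDF."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def build_template(left, middle, right, vocab):
    """A template is the list [V1, V2, V3]: the vectorized left,
    middle, and right fields around the entity pair."""
    return [vectorize(left, vocab),
            vectorize(middle, vocab),
            vectorize(right, vocab)]

vocab = ["can", "be", "used", "as", "sometimes", "diagnosed"]
# Fields taken from the example template "tag1 can be used as tag2
# and sometimes it can be diagnosed": empty left field.
tpl = build_template("", "can be used as",
                     "and sometimes it can be diagnosed", vocab)
print(tpl)  # [[0,0,0,0,0,0], [1,1,1,1,0,0], [1,1,0,0,1,1]]
```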
  • the screening of the at least one first template according to the correct seed and the wrong seed includes the following steps. First, the at least one first template is used to match the entity pairs in the text to be annotated. Then, according to the correct seed and the wrong seed, it is determined whether the entity pair matched by the at least one first template is the correct seed or the wrong seed. Then, the number of correct seeds and the number of wrong seeds in the entity pair matched by the at least one first template are determined. Then, the evaluation index of the at least one first template is calculated according to the number of correct seeds and the number of wrong seeds in the entity pair matched by the at least one first template. Finally, the at least one first template is screened according to the evaluation index of the at least one first template. If the evaluation index of at least one first template is within a predetermined threshold range, the first template is selected and retained.
  • Pip is the number of positive examples matched by the first template Pi (that is, the number of correct seeds matched); Pin is the number of negative examples matched by the first template Pi (that is, the number of wrong seeds matched).
  • whether a seed matched by the first template Pi is a negative example can be determined from the wrong seeds predetermined in step S1: if the matched seed equals one of the predetermined wrong seeds, the new seed is a negative example; otherwise, the new seed is a correct seed.
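The Pip/Pin screening criterion can be sketched as a template score. The formula image itself is not reproduced in this text, so the Snowball-style reading Conf1(Pi) = Pip / (Pip + Pin) is an assumption consistent with the Pip and Pin definitions above:

```python
def template_confidence(pip: int, pin: int) -> float:
    """Fraction of a template's matched entity pairs that are known
    correct seeds. pip = matched correct seeds, pin = matched wrong
    seeds. Templates whose score falls outside a threshold range are
    screened out."""
    if pip + pin == 0:
        return 0.0  # template matched nothing; treat as useless
    return pip / (pip + pin)

print(template_confidence(8, 2))  # 0.8
```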
  • step S3 each sentence in the text to be marked is traversed according to the filtered first template to match at least one new seed.
  • traversing each sentence in the text to be labeled according to the screened first template to match at least one new seed includes the following steps. First, a second template is obtained according to a sentence in the text to be labeled; then, the similarity between the second template and the screened first template is calculated; finally, an entity pair is extracted from the second template according to the similarity to match the at least one new seed.
  • the second template may be obtained by the following method: it is determined that the first entity and the second entity of the sentence in the text to be annotated are present, and the second template is generated according to the first entity and the second entity.
  • the second template includes the first-character vectorized expression of the field that appears before the first entity and the second entity in the sentence, the second-character vectorized expression of the field that appears between the first entity and the second entity in the sentence, and the third-character vectorized expression of the field that appears after the first entity and the second entity in the sentence.
  • the classical word2vector algorithm or the TF-IDF method can be used for vectorized expression.
  • the second template can thus be represented as three parts: the part to the left of the entity pair, the part between the two entities of the entity pair, and the part to the right of the entity pair, each in vectorized form.
  • the template is "tag1 can be used as tag2 and sometimes it can be diagnosed”
  • a sentence to be marked is " ⁇ DES>Disease A ⁇ /DES> can be used as ⁇ CHE>detection A ⁇ /CHE> and sometimes it can be diagnosed”.
  • Disease A represents the name of a certain disease
  • examination A represents the name of a certain examination method.
  • the template obtained from the sentence to be marked is also "tag1 can be used as tag2 and sometimes it can be diagnosed", and the similarity of the two templates is 100%. Of course, it is sufficient that the similarity of the two templates is greater than a certain threshold.
  • one way to compare the similarity of two templates is to vectorize the left, middle, and right parts around the two entities and then multiply them to obtain the template similarity, i.e., to use the vector cosine formula Cosine(p, q) to evaluate the similarity of the two templates.
  • consider sentence 1: "<DES>Leiomyoma</DES> patients can see that smooth muscle cells are long, spindle-shaped or slightly corrugated and arranged in parallel in <CHE>histopathological examination</CHE>&" and sentence 2: "<DES>Rectal prolapse</DES> patients can feel the mucosa in the rectal cavity folded and accumulated during the <CHE>digital rectal examination</CHE>, soft and smooth, moving up and down, with a ring groove between the prolapsed part and the intestinal wall&".
  • the embodiment of the present disclosure proposes an algorithm for calculating the similarity of two templates.
  • the similarity between the second template and the screened first template can be calculated by the following formula (2): Match(Ci, Pi) = α·Cosine(p, q) + β·Euclidean(p, q) + γ·Tanimoto(p, q). Then, an entity pair is extracted from the second template according to the similarity: if the similarity between the first template and the second template is greater than a set threshold, the entity pair is selected and retained.
  • Cosine is the cosine function
  • Euclidean is the Euclidean distance
  • Tanimoto is the Tanimoto similarity of the two vectors. That is, three evaluation indicators are used to comprehensively judge the similarity between the two templates.
  • the values of the three parameters α, β, and γ can be set according to experience, or some of the results in step S3 can be analyzed and the parameters adjusted to bring the function value closer to the real situation.
  • Cosine, Euclidean and Tanimoto are commonly known functions in the field.
  • the symbols in formula (2) are as follows: the first template is denoted Pi and the second template Ci; p is the list (or vector) composed of the first-character vectorized expression of the field that appears before the corresponding correct seed in the first template Pi, the second-character vectorized expression of the field between the first entity and the second entity of the corresponding correct seed, and the third-character vectorized expression of the field that appears after the corresponding correct seed; q is the list (or vector) composed of the first-character vectorized expression of the field that appears before the first entity and the second entity in the corresponding sentence of the second template Ci, the second-character vectorized expression of the field between the first entity and the second entity in the corresponding sentence, and the third-character vectorized expression of the field that appears after the first entity and the second entity in the corresponding sentence; and α, β, and γ are all scale coefficients greater than 0.
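The three indicators named above (Cosine, Euclidean, Tanimoto) can be sketched in plain Python. How the Euclidean distance (a distance, not a similarity) enters the combination is not spelled out in the source; here it is simply weighted in as one of the three terms, which is an assumption:

```python
import math

def cosine(p, q):
    """Cosine of the angle between vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def euclidean(p, q):
    """Euclidean distance between p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def tanimoto(p, q):
    """Tanimoto (extended Jaccard) similarity of p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    denom = sum(a * a for a in p) + sum(b * b for b in q) - dot
    return dot / denom if denom else 0.0

def match(p, q, alpha=1.0, beta=1.0, gamma=1.0):
    """Combined template similarity per formula (2); alpha, beta, gamma
    are the empirically set scale coefficients."""
    return alpha * cosine(p, q) + beta * euclidean(p, q) + gamma * tanimoto(p, q)

p = [1.0, 0.0, 1.0]
print(match(p, p))  # identical templates: 1 (cosine) + 0 (distance) + 1 (tanimoto) = 2.0
```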
  • step S4 at least one new seed that has been matched is evaluated.
  • new seeds that pass the evaluation are taken, together with the existing correct seeds, as correct seeds, and new seeds that fail the evaluation are taken, together with the existing wrong seeds, as wrong seeds.
  • the evaluation index of the matched new seed is calculated according to the similarity between the second template and the screened first template and the evaluation index of the screened first template; the matched new seed is then evaluated according to its evaluation index.
  • the evaluation index of the new seed is calculated according to the following formula (3): Conf(T) = 1 - ∏i(1 - Conf1(Pi)·Match(Ci, Pi)), where:
  • the seed to be evaluated is denoted as T
  • Ci is the second template obtained by matching the first template Pi to the sentence where the seed T is located.
  • the second template includes a list composed of the first-character vectorized expression of the field before the first entity and the second entity in the sentence, the second-character vectorized expression of the field between the first entity and the second entity in the sentence, and the third-character vectorized expression of the field after the first entity and the second entity in the sentence.
  • Conf1(Pi) characterizes the quality of the template Pi itself. Clearly, the more effective each template Pi is, and the more similar the template Pi generating the new seed T is to the corresponding sentence to be labeled, the higher the accuracy of the new seed. A threshold can be set: if the evaluation score of the Conf(T) function is above the threshold, the new seed is considered a qualified correct seed (i.e., a positive example); if it is below the threshold, the new seed is considered an unqualified wrong seed (i.e., a negative example) obtained from the first template.
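The seed-evaluation step can be sketched as follows. The formula (3) image is not reproduced in this text, so the Snowball-style form Conf(T) = 1 - ∏(1 - Conf1(Pi)·Match(Ci, Pi)) is an assumption consistent with the Conf1(Pi) and Match(Ci, Pi) definitions above, and it presumes the similarity is normalized to [0, 1]:

```python
def seed_confidence(evidence):
    """One minus the probability that every supporting template is
    wrong. `evidence` is a list of (conf1_pi, match_ci_pi) pairs, one
    per template Pi whose second template Ci matched the sentence of
    the candidate seed T."""
    prob_all_wrong = 1.0
    for conf1_pi, match_ci_pi in evidence:
        prob_all_wrong *= 1.0 - conf1_pi * match_ci_pi
    return 1.0 - prob_all_wrong

# Two supporting templates: a strong one and a weaker one.
print(seed_confidence([(0.8, 1.0), (0.5, 0.6)]))  # 1 - 0.2*0.7 = 0.86
```

A seed whose score exceeds the set threshold is accepted as a correct seed; otherwise it joins the wrong seeds.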
  • a new seed matched by traversing each sentence in the text to be labeled according to the first template may also be a wrong seed: if its type is the same as that of a wrong seed predetermined in step S1, it can be determined to be a wrong seed.
  • in step S5, the correct seeds in step S2 are replaced with the correct seeds obtained in step S4, the wrong seeds in step S2 are replaced with the wrong seeds among the new seeds obtained in step S4, and steps S2-S4 are repeated. Then, in S6, it is judged whether the selection condition is satisfied, for example, whether the set number of repetitions has been reached or the number of correct seeds that passed the evaluation has reached the set threshold. If the selection condition is met, the process proceeds to step S7; otherwise, it returns to step S2.
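The S2-S6 loop described above can be sketched as a bootstrapping skeleton. The three callables stand in for the template-generation, matching, and evaluation procedures detailed in this disclosure; their names and signatures are illustrative only:

```python
def bootstrap(sentences, correct_seeds, wrong_seeds,
              generate_templates, match_new_seeds, evaluate,
              max_rounds=5, target_correct=100):
    """Skeleton of steps S2-S6: grow the seed sets until a selection
    condition (round limit or enough correct seeds) is met, then
    return the correct seeds for output in step S7."""
    correct, wrong = set(correct_seeds), set(wrong_seeds)
    for _ in range(max_rounds):                                   # condition 1
        templates = generate_templates(sentences, correct, wrong)  # S2
        candidates = match_new_seeds(sentences, templates)         # S3
        for seed in candidates:                                    # S4
            (correct if evaluate(seed) else wrong).add(seed)
        if len(correct) >= target_correct:                        # condition 2
            break
    return correct                                                 # S6 -> S7

# Toy run with stub procedures, using seed examples from the text.
result = bootstrap(
    sentences=[],
    correct_seeds={("fracture", "X-ray film")},
    wrong_seeds={("diabetes", "body weight")},
    generate_templates=lambda s, c, w: [],
    match_new_seeds=lambda s, t: [("mediastinal tumor", "barium meal")],
    evaluate=lambda seed: True,
    max_rounds=1,
)
print(("mediastinal tumor", "barium meal") in result)  # True
```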
  • step S7 the matched correct seed and the classification relationship between the first entity and the second entity in the correct seed are output.
  • the classification relationship here is determined by the type of the second entity in the correct seed. For example, if the second entity belongs to the examination-method category, then the type of the correct seed is the "disease-examination" category, and so on.
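This rule amounts to a simple lookup from the second entity's label type to the relation category. The source only shows the <DES> and <CHE> tag forms, so the other tag names below are assumptions:

```python
# Tag of the second entity -> classification relationship.
# Only CHE (examination) appears in the source; TRE, SYM, PRE are
# assumed names for the treatment, symptom, and prevention tags.
RELATION_BY_TAG = {
    "CHE": "disease-examination",
    "TRE": "disease-treatment",
    "SYM": "disease-symptom",
    "PRE": "disease-prevention",
}

def classify_seed(second_entity_tag: str) -> str:
    """Return the relation category for a correct seed, falling back
    to NA (invalid relationship) for unknown tags."""
    return RELATION_BY_TAG.get(second_entity_tag, "NA")

print(classify_seed("CHE"))  # disease-examination
```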
  • a human-computer interaction interface is provided for the user to confirm the marking result.
  • the labeling process of the present disclosure is basically completed automatically by the program, which greatly reduces labor costs; only the confirmation work needs to be done manually.
  • The labeling results are as follows. Disease-examination: 10720 texts, 95% accuracy; disease-treatment: 10009 texts, 92% accuracy; disease-symptom: 13045 texts, 82% accuracy; disease-prevention: 11852 texts, 78% accuracy.
  • Embodiments of the present disclosure also provide a relation extraction method, including: labeling text to be labeled using the aforementioned labeling method; and training a deep learning model (for example, a PCNN+ATT model) with at least some of the sentences in the labeled text to obtain a relation extraction model. All the sentences in the labeled text can be used as the training set, or a manual selection can be made, with some sentences as the training set and the rest as the test set.
  • the file from which relationships are to be extracted is the file obtained by the aforementioned labeling method.
  • the method further includes testing the relation extraction model using, as a test set, at least some sentences of the labeled text that were not involved in model training. That is, part of the text obtained by the aforementioned labeling method is used for model training, and part for testing.
  • Figure 3 provides a complete flow of the relationship extraction method.
  • the training set and the test set are sentences in different parts of the text obtained by the aforementioned labeling method.
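The PCNN named above differs from an ordinary CNN mainly in its pooling step: each convolution filter's feature map is max-pooled separately over the three segments delimited by the two entity positions, preserving positional information about the entity pair. A minimal sketch of that pooling step only (the full PCNN+ATT model is not reproduced here):

```python
def piecewise_max_pool(feature_map, e1_pos, e2_pos):
    """Piecewise max-pooling, the core of PCNN: pool the per-position
    activations of one convolution filter over the three segments
    ending at entity 1, ending at entity 2, and after entity 2,
    instead of pooling over the whole sentence."""
    segments = (feature_map[:e1_pos + 1],
                feature_map[e1_pos + 1:e2_pos + 1],
                feature_map[e2_pos + 1:])
    return [max(seg) if seg else 0.0 for seg in segments]

# One filter's activations over a 7-token sentence, entities at positions 1 and 4.
print(piecewise_max_pool([0.1, 0.9, 0.3, 0.2, 0.7, 0.4, 0.8], 1, 4))
# [0.9, 0.7, 0.8]
```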
  • a training sample is, for example: "m.11452 m.12527 Pituitary gigantism is manifested as overgrowth in childhood, tall stature, and very rapid growth of the extremities /symptom &".
  • m.11452 is the character-vectorized expression of "pituitary gigantism"
  • m.12527 is the character-vectorized expression of "overgrowth in childhood, tall stature, and rapid growth of the limbs"
  • Pituitary gigantism is the first entity and excessive childhood Growth
  • tall stature rapid growth of limbs is the second entity
  • /symptom is the classification relationship of the sentence (that is, the sentence is a sentence describing the symptoms of the disease)
  • & is the end symbol, meaningless.
  • sentences from which none of the above four relationship labels were extracted by the above-mentioned labeling method, and which have a certain degree of interference, are classified as NA (i.e., interference or error) after review, so there are five categories in total.
  • an experiment was conducted with 2000 sentences in the training set and 500 sentences in the test set (all marked by the above-mentioned labeling method).
  • the resulting AUC value is 0.9, and the accuracy is 0.94.
  • a receiver operating characteristic (ROC) curve is the line connecting the points obtained by plotting, under specific stimulus conditions, the false-alarm probability P(y/N) obtained under different judgment criteria on the abscissa against the hit probability P(y/SN) on the ordinate.
  • the area under the ROC curve is the AUC (Area Under the Curve), with a value range of [0.5, 1]; the larger the value, the better the model's predictions.
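The AUC reported above can be computed without plotting the curve, since it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pure-Python sketch (ties counted as half; binary labels assumed):

```python
def roc_auc(labels, scores):
    """AUC as the Mann-Whitney rank statistic over positive/negative score pairs."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.2])  # → 0.75
```

This pairwise count is O(n²) but is exactly equivalent to integrating the ROC curve, which makes it a convenient cross-check against a plotted curve.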
  • in FIG. 2d, the test accuracy is displayed.
  • in Fig. 2e, a human-computer interaction interface is shown for the user to confirm whether to save the relation extraction result.
  • the Python modules are encapsulated and then called in the software, which makes operation more convenient for users.
  • the result displayed in the text box needs to be manually verified and confirmed by clicking OK before serving as the annotation data for the deep-learning relation extraction module.
  • a message box pops up in which the evaluation indicators of the model can be viewed.
  • enter the name of the file to undergo relation extraction, pass in the name of the parameter file, and use the trained PCNN+ATT model to extract relations.
  • a message box pops up asking whether to save the relation extraction results; click OK to save them.
  • in step 1, the correct seeds A and B and the wrong seeds C and D are set.
  • in step 2, first templates are matched according to the correct seeds A and B.
  • for example, from "<DES>Mediastinal tumor</DES> can undergo <CHE>barium esophagography</CHE>, which can sometimes confirm the diagnosis&", the template "tag1 can undergo tag2, which can sometimes confirm the diagnosis" can be generated.
  • three first templates a, b, and c can be generated in this step.
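The template-generation step above can be sketched as follows: a tagged sentence is split into the entity pair and the left/middle/right context fields that make up a first template. The regex and tag names follow the tagging scheme described earlier; character vectorization of the fields is omitted here, and the function name is illustrative.

```python
import re

def extract_template(sentence):
    """Split a tagged sentence into its entity pair and (left, middle, right) fields."""
    # Two tagged entities with arbitrary tag names, matched non-greedily;
    # \1 and \4 are backreferences to the opening tag names.
    m = re.search(r"<(\w+)>(.*?)</\1>(.*?)<(\w+)>(.*?)</\4>", sentence)
    if m is None:
        return None
    entities = (m.group(2), m.group(5))
    left = sentence[:m.start()]
    middle = m.group(3)
    right = sentence[m.end():].rstrip("&")  # drop the end-of-sentence symbol
    return entities, (left, middle, right)

sentence = ("<DES>Mediastinal tumor</DES> can undergo "
            "<CHE>barium esophagography</CHE>, which can sometimes confirm the diagnosis&")
entities, template = extract_template(sentence)
# entities → ("Mediastinal tumor", "barium esophagography")
# template → ("", " can undergo ", ", which can sometimes confirm the diagnosis")
```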
  • in step 3, the first templates a, b, and c are evaluated using the above formula (1) (i.e., screened by calculating the corresponding evaluation index).
  • suppose template a is "tag1 can undergo tag2, which can sometimes confirm the diagnosis", and that across all sentences it matches the entity pairs (seeds) A, C, and D, along with other entity pairs.
  • template a then scores 0.33, below a threshold such as 0.5; template c likewise scores below 0.5, so both are discarded. If template b scores 0.8, it is used in the next round.
  • that is, template b is selected and retained.
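The screening in steps 2–3 follows formula (1), Conf1(Pi) = Pip / (Pip + Pin). A minimal sketch of scoring a template against the known seed sets is shown below; it reproduces the worked example above, where a template matching seeds A, C, and D scores 0.33. The seed names and the zero-match convention are illustrative assumptions.

```python
def conf1(matched_pairs, correct_seeds, wrong_seeds):
    """Formula (1): fraction of a template's known matches that are correct seeds.

    Pairs that match neither seed set are ignored, since their label is unknown.
    """
    pip = sum(1 for pair in matched_pairs if pair in correct_seeds)
    pin = sum(1 for pair in matched_pairs if pair in wrong_seeds)
    if pip + pin == 0:
        return 0.0  # no labeled evidence for this template
    return pip / (pip + pin)

correct = {"A", "B"}
wrong = {"C", "D"}
# Template a matched seeds A, C, D plus an unlabeled pair E: score 1/3 ≈ 0.33,
# below a threshold such as 0.5, so template a is discarded.
score_a = conf1(["A", "C", "D", "E"], correct, wrong)
```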
  • in step 4, template b is used to match all sentences in the original text to be labeled. Suppose a new seed E is extracted from the original text through b, and only through b. The original sentence containing E yields the template "the cause of tag1 is tag2" (a second template), while template b is "the common cause of tag1 is tag2". The above formula (2) is used to calculate the similarity between this second template and template b; if the calculated similarity meets the threshold requirement, the newly extracted seed E is selected and retained.
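Formulas (2) and (3) of the disclosure can be sketched together: a weighted combination of cosine, Euclidean-based, and Tanimoto similarities scores a candidate (second) template against a retained first template, and the seed confidence Conf(T) = 1 − Π(1 − Conf1(Pi)·Match(Ci, Pi)) aggregates over all templates that produced the seed. Here the vectors stand for the concatenated left/middle/right field embeddings; the weights α, β, γ and the mapping of Euclidean distance into a similarity are illustrative assumptions.

```python
import math

def cosine(p, q):
    num = sum(a * b for a, b in zip(p, q))
    den = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return num / den if den else 0.0

def euclidean_sim(p, q):
    # Map Euclidean distance into (0, 1]; this mapping is an assumption.
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))

def tanimoto(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    den = sum(a * a for a in p) + sum(b * b for b in q) - dot
    return dot / den if den else 0.0

def match(p, q, alpha=0.4, beta=0.3, gamma=0.3):
    """Formula (2): Match(Ci, Pi) = α·Cosine + β·Euclidean + γ·Tanimoto."""
    return alpha * cosine(p, q) + beta * euclidean_sim(p, q) + gamma * tanimoto(p, q)

def seed_conf(evidence):
    """Formula (3): Conf(T) = 1 - Π(1 - Conf1(Pi) * Match(Ci, Pi))."""
    prod = 1.0
    for conf1_pi, match_ci_pi in evidence:
        prod *= 1.0 - conf1_pi * match_ci_pi
    return 1.0 - prod

# Identical template vectors score 1.0; the disclosure's worked example
# (Conf1 = 0.8, Match = 0.9) gives Conf(T) = 0.72, above a threshold such as 0.7.
assert abs(match([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]) - 1.0) < 1e-9
conf_e = seed_conf([(0.8, 0.9)])  # → 0.72
```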
  • Embodiments of the present disclosure also provide a non-transitory computer storage medium that stores instructions that can be executed by a processor to execute the above-mentioned labeling method or the above-mentioned relationship extraction method.
  • an embodiment of the present disclosure also provides a computing device, including a storage medium 100 and a processor 200; the storage medium 100 stores instructions that can be executed by the processor 200 to perform the above-mentioned labeling method or relation extraction method.
  • the computing device may also include the above-mentioned human-computer interaction interface for the user to input the original text, multiple correct seeds, and multiple wrong seeds, and/or to confirm the labeling result.
  • the storage medium 100 may include a non-transitory computer storage medium and/or a transitory computer storage medium.
  • the apparatus, equipment, and computer-readable storage medium provided in the embodiments of the present application correspond one-to-one to the method; therefore, they also have beneficial technical effects similar to those of the corresponding method.
  • since the beneficial technical effects of the method have been described in detail above, the beneficial technical effects of the apparatus, equipment, and computer-readable storage medium are not repeated here.
  • the embodiments of the present disclosure can be provided as a method, a system, or a computer program product. Therefore, the present disclosure may adopt the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, so as to execute on the computer or other programmable equipment.
  • the instructions provide steps for implementing the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • the computing device includes one or more processors (CPU, MCU, microcontroller, etc.), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in computer-readable media, in forms such as random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A labeling method, a relation extraction method, a non-transitory computer storage medium, and a computing device. The labeling method includes: traversing each sentence in the text to be labeled according to correct seeds to generate first templates and screening them; traversing each sentence in the text to be labeled according to the screened first templates to match at least one new seed; evaluating the matched at least one new seed; and repeating the above steps until a selected condition is met, then outputting the matched correct seeds and the classification relationship between the first entity and the second entity in each correct seed.

Description

标注方法、关系抽取方法、存储介质和运算装置
相关申请的交叉引用
本申请要求于2020年2月27日在中国知识产权局提交的No.202010124863.5的中国专利申请的优先权,该中国专利申请的全部内容通过引用合并于此。
技术领域
本公开涉及语言识别技术领域,更具体地,涉及一种标注方法、一种关系抽取方法、一种存储介质和一种运算装置。
背景技术
在自然语言识别技术领域,通常会基于深度学习进行关系抽取。进行深度学习的前提是提供大量的已标注数据以进行模型训练。目前做法是基于人工对待标注文本的每一个句子进行标注,导致人力和时间成本都很高。
发明内容
根据本公开的一个方面,提供了一种标注方法,包括:步骤S1、确定待标注文本、多个正确种子和多个错误种子,所述待标注文本中的每一个句子中的实体均已由标签标示为第一实体或第二实体,所述正确种子和所述错误种子中的每一个均是由所述第一实体和所述第二实体构成的实体对;步骤S2、根据所述正确种子遍历所述待标注文本中的每一个句子以生成至少一个第一模板,并且根据所述正确种子和所述错误种子对所述至少一个第一模板进行筛选;步骤S3、根据筛选后的第一模板遍历所述待标注文本中的每一个句子以匹配出至少一个新种子;步骤S4、对匹配出的至少一个新种子进行评价,其中,评价合格的新种子与已有正确种子一起作为正确种子,评价不 合格的新种子与已有错误种子一起作为错误种子;步骤S5、用步骤S4中得到的正确种子替换步骤S2中的正确种子以及用步骤S4中得到的错误种子替换步骤S2中的错误种子重复执行步骤S2-S4,直至满足选定条件后停止;以及步骤S6、输出匹配出的正确种子及所述正确种子中的第一实体和第二实体之间的分类关系。
在一些实施例中,根据所述正确种子和所述错误种子对所述至少一个第一模板进行筛选,包括:利用所述至少一个第一模板匹配所述待标注文本中的实体对;根据所述正确种子和所述错误种子,确定所述至少一个第一模板匹配的实体对为正确种子还是错误种子;确定所述至少一个第一模板匹配的实体对中正确种子的数量和错误种子的数量;根据所述至少一个第一模板匹配的实体对中正确种子的数量和错误种子的数量,计算所述至少一个第一模板的评价指数;以及根据所述至少一个第一模板的评价指数,对所述至少一个第一模板进行筛选。
在一些实施例中,通过下式计算所述至少一个第一模板的评价指数:
Conf1(Pi)=(Pip)/(Pip+Pin)
其中,Pip表示第一模板Pi匹配出来的正确种子的数量;Pin表示第一模板Pi匹配出来的错误种子的数量。
在一些实施例中,根据筛选后的第一模板遍历所述待标注文本中的每一个句子以匹配出至少一个新种子,包括:根据所述待标注文本中的句子得到第二模板;计算所述第二模板与所述筛选后的第一模板的相似度;以及根据所述相似度从所述第二模板中提取实体对,以匹配出所述至少一个新种子。
在一些实施例中,通过下式计算所述第二模板与所述筛选后的第一模板之间的相似度:
Match(Ci,Pi)=α*Cosine(p,q)+β*Euclidean(p,q)+γ*Tanimoto(p,q)
其中,Ci表示通过由第一模板Pi匹配出种子T时种子T所在句子得到的第二模板,p为第一模板Pi中出现在对应的正确种子之前 的字段的第一字符向量化表达、出现在对应的正确种子中第一实体与第二实体之间的字段的第二字符向量化表达、出现在对应的正确种子之后的字段的第三字符向量化表达组成的列表,q为第二模板Ci中出现在对应句子中第一实体和第二实体二者之前的字段的第一字符向量化表达、出现在对应句子中第一实体和第二实体二者之间的字段的第二字符向量化表达、出现在对应句子中第一实体和第二实体二者之后的字段的第三字符向量化表达组成的列表,α、β与γ均为大于0的比例系数。
在一些实施例中,对匹配出的至少一个新种子进行评价,包括:根据所述第二模板与所述筛选后的第一模板之间的相似度以及所述筛选后的第一模板的评价指数计算匹配出的新种子的评价指数;以及根据新种子的评价指数,评价匹配出的新种子。
在一些实施例中,通过下式计算新种子的评价指数:
Figure PCTCN2021078145-appb-000001
其中,T表示待评价的新种子,P={Pi}表示产生新种子T的所有第一模板,Ci表示通过由第一模板Pi匹配出种子T时种子T所在句子得到的第二模板,Conf1(Pi)表示所述第一模板Pi的评价指数,Match(Ci,Pi)表示所述第一模板Pi与第二模板Ci的相似度。
在一些实施例中,所述选定条件包括:重复执行步骤S2-S4设定次数,或评价合格的正确种子的数量达到设定阈值。
在一些实施例中,根据所述正确种子遍历所述待标注文本中的每一个句子以生成至少一个第一模板,包括:将所述待标注文本中的句子中出现所述正确种子的句子进行聚类;
根据同一类句子和对应的正确种子得到第一模板,所述第一模板包括该同一类句子中出现在对应的正确种子之前的字段的第一字符向量化表达、出现在对应的正确种子中第一实体与第二实体之间的字段的第二字符向量化表达、出现在对应的正确种子之后的字段的第三字符向量化表达。
在一些实施例中,根据所述待标注文本中的句子得到第二模板,包括:确定所述待标注文本中的句子的第一实体和第二实体在,并且 根据第一实体和第二实体生成第二模板,其中,所述第二模板包括出现在该句子中第一实体和第二实体二者之前的字段的第一字符向量化表达、出现在该句子中第一实体和第二实体二者之间的字段的第二字符向量化表达、出现在该句子中第一实体和第二实体二者之后的字段的第三字符向量化表达。
在一些实施例中,确定待标注文本包括:基于医学词典对待标注文本的句子进行实体识别,并在所述实体所在位置打上对应的标签,所述标签分别标示疾病名称、检查方法、治疗方法、表现症状、预防措施中的一项。
在一些实施例中,所述第一实体包括标示疾病名称的字段,所述第二实体包括标示检查方法、治疗方法、表现症状和预防措施的字段,所述分类关系包括疾病-检查、疾病-治疗、疾病-症状、疾病-预防。
在一些实施例中,在所述待标注文本中的每一个句子中的实体包括多个第一实体或多个第二实体的情况下,将该句子复制多份,每份打的标签包括一个第一实体和一个第二实体,并且不同份中的第一实体和第二实体中的至少一个不同。
根据本公开的一个方面,提供了一种关系抽取方法,包括:采用以上所述的标注方法对待标注文本进行标注;以及利用标注后的待标注文本中的至少部分句子对深度学习模型进行训练以得到关系抽取模型。
在一些实施例中,所述关系抽取方法还包括:将标注后的待标注文本中未参与模型训练的至少部分句子作为测试集,对所述关系抽取模型进行测试。
在一些实施例中,所述深度学习模型包括分段卷积神经网络结合注意力机制学习模型。
根据本公开的一个方面,提供了一种非暂时性计算机存储介质,所述非暂时性计算机存储介质存储指令,所述指令能够被处理器运行以执行以上所述的标注方法或者以上所述的关系抽取方法。
根据本公开的一个方面,包括一种运算装置,包括存储介质和 处理器,所述存储介质存储指令,所述指令能够被所述处理器运行以执行以上所述的标注方法或者以上所述的关系抽取方法。
在一些实施例中,所述运算装置还包括人机交互界面,以供用户输入原始待标注文本、多个正确种子和多个错误种子,和/或对标注结果进行确认。
附图说明
图1是本公开实施例的标注方法的流程图。
图2a是本公开实施例提供的标注方法中输入待标注文本、正确种子和错误种子人机交互界面的示意图。
图2b是本公开实施例提供的标注方法中对标注结果进行校验的人机交互界面示意图。
图2c是本公开实施例中输入待进行关系抽取的文件的人机交互界面示意图。
图2d是本公开实施例中关系抽取模型的人机交互界面的测试结果示意图。
图2e是本公开实施例关系抽取结果的人机交互界面的保存界面示意图。
图3是本公开实施例的关系抽取方法的详细流程示意图。
图4是本公开实施例的运算装置的框图。
具体实施方式
为使本领域技术人员更好地理解本公开的技术方案,下面结合附图和具体实施方式对本公开作进一步详细描述。
在本公开中,应理解,诸如“包括”或“具有”等术语旨在指示本说明书中所公开的特征、数字、步骤、行为、部件、部分或其组合的存在,并且不旨在排除一个或多个其他特征、数字、步骤、行为、部件、部分或其组合存在的可能性。
另外还需要说明的是,在不冲突的情况下,本公开中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详 细说明本公开。
参考图1,本公开的实施例提供了一种标注方法,包括以下步骤。
在步骤S1,确定待标注文本、多个正确种子和多个错误种子。所述待标注文本中的每一个句子中的实体均已由标签标示为第一实体或第二实体。所述正确种子和所述错误种子均是由第一实体和第二实体构成的实体对。也就是说,该待标注文本中的每一个句子中的实体为第一实体或第二实体。
例如,所述待标注文本为非结构化的文本数据。
例如参考图2a,可以提供人机交互界面,以供用户输入所述待标注文件、包含所述正确种子的文件、包含所述错误种子的文件。当然这些文件也可以通过其他方式获取。以下实施例中,均以待标注句子为医疗类的文本为例进行说明。
种子即实体对,或者说是一对实体。正确种子表示其中两个实体之间存在逻辑关联。例如为:骨折;X光片,即表明可通过X光片检测是否发生骨折。正确种子又例如是:纵隔肿瘤;食管钡餐造影,即表明可通过食管钡餐造影检测是否发生隔膜肿瘤。错误种子表示其中两个实体没有逻辑关联。例如为:糖尿病;体重、低蛋白血症;血氧饱和度。糖尿病;体重这个种子表明糖尿病的症状表现为与体重有关,显然是错误的逻辑关联。低蛋白血症;血氧饱和度这个种子表明低蛋白血症表现症状与血氧饱和度有关,显然是错误的逻辑关联。
只需要提供少量种子(例如提供十个正确的种子和十个错误的种子),即可通过程序自动运行得到更多正确种子,当然也就能确定出其余错误的种子。
待标注文本的原始文本可能仅是多个句子,句子中并未在关注的实体上打上对应的标签。此时可以基于词典打标签。
具体地,基于医学词典对未打标签的句子进行实体识别,并在实体所在位置打上对应的标签。所述标签分别标示疾病名称、检查方法、治疗方法、表现症状、预防措施中的一项。本公开实施例中的部分句子来自于百度百科。
具体地,所述第一实体包括标示疾病名称的字段,所述第二实 体包括标示检查方法、治疗方法、表现症状和预防措施的字段。
原始文本中的句子例如是:“本病临床表现有很大的变异,而且没有一种畸形是18-三体综合征特有的,因此,不能仅根据临床畸形做出诊断,必须做细胞染色体检查,确诊根据核型分析结果&”。&为一编程符号,标示一个句子的结尾。
医学词典中会有很多表示疾病名称的词、表示检查方法的词,这些词在医学词典中都会有对应的属性说明。原始文本中的句子中如出现一个表示疾病名称的词,就会在这个词的前后打上标签,表明这个词是疾病名称。以此类推。
利用医学词典识别出两个实体“18-三体综合征”和“细胞染色体检查”,并判断前一个实体为疾病名称,后一个实体为检查方法,故在待标注句子的对应位置打标签。打标签之后的结果为:“本病临床表现有很大的变异,而且没有一种畸形是<DES>18-三体综合征</DES>特有的,因此,不能仅根据临床畸形做出诊断,必须做<CHE>细胞染色体检查</CHE>,确诊根据核型分析结果&”。
<DES>和</DES>是标签的具体形式,其意义是标示出疾病名称的字段。<CHE>和</CHE>是标签的具体形式,其意义是标示出检查方法的字段。
需要说明的是,一个句子中出现了表示疾病名称和检查方法的两个词,并不表示这个句子的逻辑关系就是介绍该疾病的检查方法,即该句子所属分类关系并不一定是疾病-检查的对应关系。故需后续步骤识别出这句话是不是真的在讲该疾病的检查方法。
这种情况下还可以提供人机交互界面以供用户输入原始文本、所述多个正确种子和所述多个错误种子。
本公开的实施例中,所述分类关系包括疾病-检查、疾病-治疗、疾病-症状、疾病-预防。当然,可能存在某些句子中标签标示的实体之间并无逻辑关系(或者逻辑关系是错的),那么分类关系还包括无效关系(NG)。
需要说明的是,本公开的实施例中,每个句子仅用标签标示出一个第一实体和一个第二实体。如遇到一个句子中有多个第一实体或 多个第二实体,则将该句子复制多份,每份打的标签各有区别。即,每份打的标签包括一个第一实体和一个第二实体,并且不同份中的第一实体和第二实体中的至少一个不同,从而对复制后的每份进行区分。
在步骤S2,根据所述正确种子遍历所述待标注文本中的每一个句子以生成第一模板,并且根据所述正确种子和所述错误种子对所述至少一个第一模板进行筛选。
具体地,根据所述正确种子遍历所述待标注文本中的每一个句子以生成第一模板包括:将所述待标注文本中的句子中出现所述正确种子的句子进行聚类;然后根据同一类句子和对应的正确种子得到第一模板。所述第一模板包括该同一类句子中出现在对应的正确种子之前的字段的第一字符向量化表达、出现在对应的正确种子中第一实体与第二实体之间的字段的第二字符向量化表达、以及出现在对应的正确种子之后的字段的第三字符向量化表达。
例如利用“纵隔肿瘤;食管钡餐造影”这个种子,由“<DES>纵隔肿瘤</DES>可做<CHE>食管钡餐造影</CHE>有时即可诊断&”生成第一模板“tag1可做tag2有时即可诊断”。该第一模板中,tag1和tag2表示种子中的两个实体,不分先后,位于实体对之前的字段向量化表达为空,位于实体对之间的字段为“可做”(具体采用向量化表达),位于实体对之后的字段为“有时即可诊断”。
本公开中将由正确种子生成的模板称为第一模板。第一模板也可以理解为文本向量化后的一个列表。本公开对如何将文本向量化不做限定,例如可以选择经典的word2vector算法或者TF-IDF(term frequency–inverse document frequency)方法进行向量化表达,假如实体对左中右向量化表达为V1,V2,V3,则列表为[V1,V2,V3]。
模板的准确性越高、覆盖程度越大,后期根据该模板提取的新的种子越多,准确性越高。故需要根据经验,选择合适的正确种子作为最初的输入信息,以及在后续生成种子时对生成的种子进行评价挑选好的种子作为正确种子。
根据所述正确种子和所述错误种子对所述至少一个第一模板进行筛选包括包括以下几个步骤。首先,利用所述至少一个第一模板匹 配所述待标注文本中的实体对。然后,根据所述正确种子和所述错误种子,确定所述至少一个第一模板匹配的实体对为正确种子还是错误种子。然后,确定所述至少一个第一模板匹配的实体对中正确种子的数量和错误种子的数量。然后,根据所述至少一个第一模板匹配的实体对中正确种子的数量和错误种子的数量,计算所述至少一个第一模板的评价指数。最后,根据所述至少一个第一模板的评价指数,对所述至少一个第一模板进行筛选。如果至少一个第一模板的评价指数在预定的阈值范围内,则选择并保留该第一模板。
可采用下式(1)计算第一模板的评价指数:
Conf1(Pi)=(Pip)/(Pip+Pin)          (1)
其中,Pip是第一模板Pi匹配出来的正例个数(即匹配出的正确种子的个数);Pin是模板P匹配出来的负例个数(即匹配出的错误种子的个数)。第一模板Pi匹配出的种子是否是负例可通过步骤S1中预先确定的多个错误种子来确定,即,如果负例与预先确定的多个错误种子中的一者匹配,则说明该新种子为负例。否则,该新种子为正确种子。
在步骤S3,根据筛选后的第一模板遍历所述待标注文本中的每一个句子以匹配出至少一个新种子。
具体地,根据筛选后的第一模板遍历所述待标注文本中的每一个句子以匹配出至少一个新种子,包括以下步骤。首先,根据所述待标注文本中的句子得到第二模板;然后,计算所述第二模板与所述筛选后的第一模板的相似度;最后,根据所述相似度从所述第二模板中提取实体对,以匹配出所述至少一个新种子。
具体地，第二模板可通过以下方法获得：确定所述待标注文本中的句子中的第一实体和第二实体，并且根据第一实体和第二实体生成第二模板。所述第二模板包括出现在该句子中第一实体和第二实体二者之前的字段的第一字符向量化表达、出现在该句子中第一实体和第二实体二者之间的字段的第二字符向量化表达、出现在该句子中第一实体和第二实体二者之后的字段的第三字符向量化表达。
可采用经典的word2vector算法或者TF-IDF方法进行向量化表 达,例如将第二模板表示为三部分:实体对左边部分、实体对中两个实体之间的部分以及实体对右边部分。例如,所述第二模板包括该句子中出现在该句子中第一实体和第二实体二者之前的字段的第一字符向量化表达、出现在该句子中第一实体和第二实体二者之间的字段的第二字符向量化表达、出现在该句子中第一实体和第二实体二者之后的字段的第三字符向量化表达。
例如模板为“tag1可做tag2有时即可诊断”,某个待标注的句子为“<DES>疾病A</DES>可做<CHE>检测A</CHE>有时即可诊断”。疾病A代表某一疾病的名称,检查A代表某种检查手段的名称。那么待标注句子得到的模板也是“tag1可做tag2有时即可诊断”,两个模板的相似度为100%。当然,两个模板的相似度大于一定阈值即可。
通常比较两个模板的相似度方法可以是采用两实体左中右三个部分向量化处理后分别相乘求其模板的相似程度。即采用向量方向余弦公式Cosine(p,q)评价两个模板的相似度。
由于医学数据的特异性较强,例如句1:“<DES>平滑肌瘤</DES>患者在<CHE>病理组织学检查</CHE>可见平滑肌细胞呈长梭形或略显波纹状常平行排列&”、和句2:“<DES>直肠脱垂</DES>患者在<CHE>直肠指诊</CHE>时可触及直肠腔内黏膜折叠堆积,柔软光滑,上下移动,有壅阻感,内脱垂部分与肠壁之间有环形沟&”。本公开的实施例提出一种计算两个模板相似度的算法。
具体地,在本公开中,可通过下式(2)计算所述第二模板与所述筛选后的第一模板之间的相似度。然后,根据所述相似度从所述第二模板中提取实体对。在所述第一模板与所述第二模板的相似度大于设定阈值的情况下,选择并保留该实体对。所述第一模板与所述筛选后的第二模板的相似度可通过以下公式(2)计算:Match(Ci,Pi)=α*Cosine(p,q)+β*Euclidean(p,q)+γ*Tanimoto(p,q)(2)
其中,Cosine为余弦函数,Euclidean为欧式距离,Tanimoto为两个向量的相似度函数。即采用三种评价指标综合判断两个模板之间的相似度。三个参数α、β和γ的取值可根据经验设置,也可以对步骤S3中的部分结果进行分析后调整,以让函数值更接近真实的情 况。Cosine、Euclidean和Tanimoto为本领域常用已知函数。
公式(2)中符号说明如下:第一模板记为Pi,第二模板记为Ci,p为第一模板Pi中出现在对应的正确种子之前的字段的第一字符向量化表达、出现在对应的正确种子中第一实体与第二实体之间的字段的第二字符向量化表达、出现在对应的正确种子之后的字段的第三字符向量化表达组成的列表(或向量),q为第二模板Ci中出现在对应句子中第一实体和第二实体二者之前的字段的第一字符向量化表达、出现在对应句子中第一实体和第二实体二者之间的字段的第二字符向量化表达、出现在对应句子中第一实体和第二实体二者之后的字段的第三字符向量化表达组成的列表(或向量),α、β与γ均为大于0的比例系数。
在步骤S4,对匹配出的至少一个新种子进行评价。评价合格的新种子与已有正确种子一起作为正确种子,评价不合格的新种子与已有错误种子一起作为错误种子。
具体地,根据所述第二模板与所述筛选后的第一模板之间的相似度以及所述筛选后的第一模板的评价指数计算匹配出的新种子的评价指数;以及根据新种子的评价指数,评价匹配出的新种子。
具体地,按照如下公式(3)计算新种子的评价指数:
Figure PCTCN2021078145-appb-000002
其中,待评价的种子记为T,P={Pi}是产生种子T的所有第一模板,Ci是通过由第一模板Pi匹配出种子T时种子T所在句子得到的第二模板。第二模板包括第一实体和第二实体之前的字段的第一字符向量化表达、该句子中第一实体与第二实体之间的字段的第二字符向量化表达、该句子中第一实体与第二实体之后的字段的第三字符向量化表达组成的列表。
Conf1(Pi)表征模板Pi本身的优劣,显然各模板Pi本身越有效,产生新的种子T的模板Pi与对应的待标注句子越相似,则新的种子准确性越高。可以设定一定的阈值,Conf(T)函数评价分数高于一定阈值则认为该新种子是合格的正确种子(即正例),评价分数低于一定阈值则认为该新种子是不合格的错误种子(即负例),也即是由第 一模板得到的错误的种子。也就是说,根据所述第一模板遍历所述待标注文本中每一个句子以匹配出的新种子,可能为错误种子,并且该错误种子类型与步骤S1中预先确定的错误种子类型相同,则可以确定该新种子为错误种子。
在步骤S5,用步骤S4中得到的正确种子替换步骤S2中的正确种子,用步骤S4中得到的新种子中的错误种子替换步骤S2中的错误种子,重复执行步骤S2-S4,然后在步骤S6中判断是否满足选定条件,例如是否满足设定次数后或至评价合格的正确种子的数量是否达到设定阈值。如满足选定条件,则转至步骤S7,否则返回步骤S2。
即用新得到的正确种子再去生成新的正确种子,则正确种子的数量如同滚雪球(snowball)般增多。实验表明迭代5次左右之后正确种子的数量不会再增加。
在步骤S7,输出匹配出的正确种子及该正确种子中第一实体和第二实体之间的分类关系。
这里的分类关系由正确种子中第二实体的类型决定,例如是第二实体属于检查方法类的,那么该正确种子的类型就是“疾病-检查方法”类,以此类推。
当然,进一步还可以输出得到该正确种子的句子。
参考图2b,提供了人机交互界面以供用户对标注结果进行确认。
如是,本公开的标注过程基本由程序运行自动完成,大大降低了人力成本。仅需人工完成确认的工作。
参考表1,在一个实验例中,在待标注文本中,句子内实体对为疾病-检查关系的句子有10720句。句子内实体对为疾病-治疗关系的句子有10009句。句子内实体对为疾病-症状关系的句子有13045句。句子中实体对为疾病-预防关系的句子有11852句。当然,实体对关系为疾病-治疗,其所在句子的逻辑关系并不一定是疾病-治疗。
表1
  疾病-检查 疾病-治疗 疾病-症状 疾病-预防
文本数量 10720 10009 13045 11852
运用前述的标注方法进行试验,得到不同类型的句子标注的准确率见表2。
表2
  疾病-检查 疾病-治疗 疾病-症状 疾病-预防
准确率 95% 92% 82% 78%
本公开的实施例还提供一种关系抽取方法,包括:采用前述的标注方法对待标注文本进行标注;利用标注后的待标注文本中的至少部分句子对深度学习(例如PCNN+ATT模型)进行训练以得到关系抽取模型。既可以将待标注文本中全部句子作为训练集;也可以人工挑选,部分作为训练集,部分作为测试集。
参考图2c,待进行关系抽取的文件即前述标注方法得到的文件。
可选地,还包括将标注后的待标注文本中未参与模型训练的至少部分句子作为测试集,对所述关系抽取模型进行测试。即前述标注方法得到的文本部分用于模型训练,部分用于测试。图3提供了一个关系抽取方法的完整流程。其中训练集和测试集分别是前述标注方法得到的文本中的不同部分的句子。
在一个实验例中,基于Tensorflow的分段卷积神经网络(Piecewise Convolutional Neural Networks,PCNN)加注意力机制(Adversarial Tactics,Techniques)(PCNN+ATT)的方法,利用上述标注方法提取出的文本取排序靠前的句子(即匹配出的种子得分较高的句子)经查看无误后作为对应分类标签的文本,在python中整理数据格式。
模型训练的训练集中句子格式举例为：“m.11452m.12527垂体性巨人症儿童期过度生长，身材高大，四肢生长尤速/症状垂体性巨人症表现为儿童期过度生长，身材高大，四肢生长尤速&”。其中m.11452是垂体性巨人症的字符向量化表达，m.12527是儿童期过度生长，身材高大，四肢生长尤速的字符向量化表达，垂体性巨人症是第一实体，儿童期过度生长，身材高大，四肢生长尤速是第二实体，/症状即为句子的分类关系（即该句子是描述疾病的症状的句子），&为结束符号，无意义。
对于原始句子(即未打标签的句子)经上述标注方法未提取出上述四类关系标签且具有一定干扰性的句子经查看归类为NA(即干 扰类或错误类),因此共为五分类。为查看上述关系抽取方法的效果(例如分类的准确率),采用训练集2000个句子,测试集500个句子(均由上述标注方法进行标注得到)进行实验,结果AUC值为0.9,准确率为0.94。
接受者操作特征曲线（receiver operating characteristic curve），简称ROC曲线，是指在特定刺激条件下，以被试在不同判断标准下所得的虚报概率P(y/N)为横坐标，以击中概率P(y/SN)为纵坐标，画得的各点的连线。ROC曲线下的面积就是AUC（Area Under the Curve），取值范围[0.5,1]，越大表示模型预测的效果越好。
可见该方法可行性较高,准确率高,同时降低了人工标注的成本。
参考图2d,示出了显示测试的正确率。参考图2e,示出了人机交互界面供用户确认是否保存关系抽取的结果。
将各python模块封装后在软件中进行调用,更便于使用者进行操作。结合附图2a-2e,首先,输入待标注的文本名称、包含正确种子的文件名称与包含错误种子的文件名称,调用python中编写的利用半监督方法标注数据的模块,返回值为数据标注的结果,在文本框中显示,需人工校验后单击确定,作为深度学习关系抽取模块的标注数据。然后,进行模型训练并返回测试集结果,弹出消息框可查看模型的评价指标。最后,输入待进行关系抽取的文件名称,传入参数文件名称并利用训练好的PCNN+ATT模型进行关系抽取,弹出消息框是否保存关系抽取结果,单击确定则将相应结果保存下来。
以下通过一个具体示例来对本公开的标注方法进行说明。
在步骤1中,设定正确种子A、B,以及错误种子C、D。
在步骤2中,根据正确种子A、B匹配出第一模板。例如“<DES>纵隔肿瘤</DES>可做<CHE>食管钡餐造影</CHE>有时即可诊断&”可生成模板“tag1可做tag2有时即可诊断”。例如在该步骤中可生成三个第一模板a、b和c。
在步骤3中,用上述式(1)评价第一模板a、b和c(即筛选,计算相应的评价指数)。假设模板a是“tag1可做tag2有时即可诊 断”,在所有句子中匹配出实体对也就是种子A、C和D,以及其他实体对,该模板a为0.33得分,小于阈值例如0.5,c同样得分小于0.5,那么舍弃;假如模板b为得分为0.8,那么下一轮使用b。即将选择并保留模板b。
在步骤4中,用模板b进行原始待标记文本中所有句子的匹配。假如通过b从原始待标记文本中新提取出了种子E且E仅是通过b提取出来的,E所在的原句为“tag1的病因是tag2”(第二模板),而模板b是“tag1常见病因是tag2”。用上述式(2)计算第二模板与模板b的相似度。如果计算的相似度满足阈值要求,则选择并保留该新提取出的种子E。
在步骤5中，利用模板b与第二模板的相似度以及模板b的评价指数，对新提取出的种子E进行评价。假设模板b与第二模板的相似度是0.9，那么E对应的评价指数通过上述式（3）计算为：conf(T)=1-(1-0.8*0.9)=0.72，大于了预设阈值例如0.7，则E也作为了正确的种子，和正确种子A、B一起再提取新的模板。
需要说明的是,本申请所描述的各个步骤之间没有执行上的先后顺序限制,对于各个步骤的描述顺序并不构成对本申请的方案的限制。
本公开的实施例还提供一种非暂时性计算机存储介质,所述非暂时性计算机存储介质存储指令,所述指令能够被处理器运行以执行上述的标注方法或者上述的关系抽取方法。
参考图4,本公开的实施例还提供一种运算装置,包括存储介质100和处理器200,存储介质100存储指令,所述指令能够被处理器200运行以执行上述的标注方法或者上述的关系抽取方法。该运算装置还包括可以包括上述人机交互界面,以供用户输入原始文本、多个正确种子和多个错误种子,和/或对标注结果进行确认。存储介质100可包括非暂时性计算机存储介质和/或暂时性计算机存储介质。
本申请中的各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置、设备和计算机可读存储介质实施例而言,由于其基本相似于方法实施 例,所以其描述进行了简化,相关之处可参见方法实施例的部分说明即可。
本申请实施例提供的装置、设备和计算机可读存储介质与方法是一一对应的,因此,装置、设备和计算机可读存储介质也具有与其对应的方法类似的有益技术效果,由于上面已经对方法的有益技术效果进行了详细说明,因此,这里不再赘述装置、设备和计算机可读存储介质的有益技术效果。
本领域内的技术人员应明白,本公开的实施例可提供为方法、系统或计算机程序产品。因此,本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本公开可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本公开是参照根据本公开实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框 或多个方框中指定的功能的步骤。
在一个典型的配置中,计算设备包括一个或多个处理器(CPU、MCU、单片机等)、输入/输出接口、网络接口和内存。
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。此外,尽管在附图中以特定顺序描述了本公开方法的操作,但是,这并非要求或者暗示必须按照该特定顺序来执行这些操作,或是必须执行全部所示的操作才能实现期望的结果。附加地或备选地,可以省略某些步骤,将多个步骤合并为一个步骤执行,和/或将一个步骤分解为多个步骤执行。
可以理解的是,以上实施方式仅仅是为了说明本公开的原理而采用的示例性实施方式,然而本公开并不局限于此。对于本领域内的普通技术人员而言,在不脱离本公开的精神和实质的情况下,可以做出各种变型和改进,这些变型和改进也视为本公开的保护范围。

Claims (19)

  1. 一种标注方法,包括:
    步骤S1、确定待标注文本、多个正确种子和多个错误种子,所述待标注文本中的每一个句子中的实体均已由标签标示为第一实体或第二实体,所述正确种子和所述错误种子中的每一个均是由所述第一实体和所述第二实体构成的实体对;
    步骤S2、根据所述正确种子遍历所述待标注文本中的每一个句子以生成至少一个第一模板,并且根据所述正确种子和所述错误种子对所述至少一个第一模板进行筛选;
    步骤S3、根据筛选后的第一模板遍历所述待标注文本中的每一个句子以匹配出至少一个新种子;
    步骤S4、对匹配出的至少一个新种子进行评价,其中,评价合格的新种子与已有正确种子一起作为正确种子,评价不合格的新种子与已有错误种子一起作为错误种子;
    步骤S5、用步骤S4中得到的正确种子替换步骤S2中的正确种子以及用步骤S4中得到的错误种子替换步骤S2中的错误种子重复执行步骤S2-S4,直至满足选定条件后停止;以及
    步骤S6、输出匹配出的正确种子及所述正确种子中的第一实体和第二实体之间的分类关系。
  2. 根据权利要求1所述的标注方法,其中,根据所述正确种子和所述错误种子对所述至少一个第一模板进行筛选,包括:
    利用所述至少一个第一模板匹配所述待标注文本中的实体对;
    根据所述正确种子和所述错误种子,确定所述至少一个第一模板匹配的实体对为正确种子还是错误种子;
    确定所述至少一个第一模板匹配的实体对中正确种子的数量和错误种子的数量;
    根据所述至少一个第一模板匹配的实体对中正确种子的数量和错误种子的数量,计算所述至少一个第一模板的评价指数;以及
    根据所述至少一个第一模板的评价指数,对所述至少一个第一模板进行筛选。
  3. 根据权利要求2所述的标注方法,其中,通过下式计算所述至少一个第一模板的评价指数:
    Conf1(Pi)=(Pip)/(Pip+Pin)
    其中,Pip表示第一模板Pi匹配出来的正确种子的数量;Pin表示第一模板Pi匹配出来的错误种子的数量。
  4. 根据权利要求1-3中任一项所述的标注方法,其中,根据筛选后的第一模板遍历所述待标注文本中的每一个句子以匹配出至少一个新种子,包括:
    根据所述待标注文本中的句子得到第二模板;
    计算所述第二模板与所述筛选后的第一模板的相似度;以及
    根据所述相似度从所述第二模板中提取实体对,以匹配出所述至少一个新种子。
  5. 根据权利要求4所述的标注方法,其中,通过下式计算所述第二模板与所述筛选后的第一模板之间的相似度:
    Match(Ci,Pi)=α*Cosine(p,q)+β*Euclidean(p,q)+γ*Tanimoto(p,q)
    其中,Ci表示通过由第一模板Pi匹配出种子T时种子T所在句子得到的第二模板,p为第一模板Pi中出现在对应的正确种子之前的字段的第一字符向量化表达、出现在对应的正确种子中第一实体与第二实体之间的字段的第二字符向量化表达、出现在对应的正确种子之后的字段的第三字符向量化表达组成的列表,q为第二模板Ci中出现在对应句子中第一实体和第二实体二者之前的字段的第一字符向量化表达、出现在对应句子中第一实体和第二实体二者之间的字段的第二字符向量化表达、出现在对应句子中第一实体和第二实体二者之后的字段的第三字符向量化表达组成的列表,α、β与γ均为大于 0的比例系数。
  6. 根据权利要求5所述的标注方法,其中,对匹配出的至少一个新种子进行评价,包括:
    根据所述第二模板与所述筛选后的第一模板之间的相似度以及所述筛选后的第一模板的评价指数计算匹配出的新种子的评价指数;以及
    根据新种子的评价指数,评价匹配出的新种子。
  7. 根据权利要求6所述的标注方法,其中,通过下式计算新种子的评价指数:
    Figure PCTCN2021078145-appb-100001
    其中,T表示待评价的新种子,P={Pi}表示产生新种子T的所有第一模板,Ci表示通过由第一模板Pi匹配出种子T时种子T所在句子得到的第二模板,Conf1(Pi)表示所述第一模板Pi的评价指数,Match(Ci,Pi)表示所述第一模板Pi与第二模板Ci的相似度。
  8. 根据权利要求1-7中任一项所述的标注方法,其中,所述选定条件包括:重复执行步骤S2-S4设定次数,或评价合格的正确种子的数量达到设定阈值。
  9. 根据权利要求1-8中任一项所述的标注方法,其中,根据所述正确种子遍历所述待标注文本中的每一个句子以生成至少一个第一模板,包括:
    将所述待标注文本中的句子中出现所述正确种子的句子进行聚类;
    根据同一类句子和对应的正确种子得到第一模板,所述第一模板包括该同一类句子中出现在对应的正确种子之前的字段的第一字符向量化表达、出现在对应的正确种子中第一实体与第二实体之间的字段的第二字符向量化表达、出现在对应的正确种子之后的字段的第 三字符向量化表达。
  10. 根据权利要求4-9中任一项所述的标注方法,其中,根据所述待标注文本中的句子得到第二模板,包括:
    确定所述待标注文本中的句子的第一实体和第二实体在,并且根据第一实体和第二实体生成第二模板,其中,所述第二模板包括出现在该句子中第一实体和第二实体二者之前的字段的第一字符向量化表达、出现在该句子中第一实体和第二实体二者之间的字段的第二字符向量化表达、出现在该句子中第一实体和第二实体二者之后的字段的第三字符向量化表达。
  11. 根据权利要求1-10中任一项所述的标注方法,其中,确定待标注文本包括:
    基于医学词典对待标注文本的句子进行实体识别,并在所述实体所在位置打上对应的标签,所述标签分别标示疾病名称、检查方法、治疗方法、表现症状、预防措施中的一项。
  12. 根据权利要求11所述的标注方法,其中,所述第一实体包括标示疾病名称的字段,所述第二实体包括标示检查方法、治疗方法、表现症状和预防措施的字段,所述分类关系包括疾病-检查、疾病-治疗、疾病-症状、疾病-预防。
  13. 根据权利要求1-12中任一项所述的标注方法,其中,在所述待标注文本中的每一个句子中的实体包括多个第一实体或多个第二实体的情况下,将该句子复制多份,每份打的标签包括一个第一实体和一个第二实体,并且不同份中的第一实体和第二实体中的至少一个不同。
  14. 一种关系抽取方法,包括:
    采用根据权利要求1-13中任一项所述的标注方法对待标注文本 进行标注;以及
    利用标注后的待标注文本中的至少部分句子对深度学习模型进行训练以得到关系抽取模型。
  15. 根据权利要求14所述的关系抽取方法,还包括:
    将标注后的待标注文本中未参与模型训练的至少部分句子作为测试集,对所述关系抽取模型进行测试。
  16. 根据权利要求15所述的关系抽取方法,其中,所述深度学习模型包括分段卷积神经网络结合注意力机制学习模型。
  17. 一种非暂时性计算机存储介质,所述非暂时性计算机存储介质存储指令,所述指令能够被处理器运行以执行根据权利要求1-13中任一项所述的标注方法或者根据权利要求14-16中任一项所述的关系抽取方法。
  18. 一种运算装置,包括存储介质和处理器,所述存储介质存储指令,所述指令能够被所述处理器运行以执行根据权利要求1-13中任一项所述的标注方法或者根据权利要求14-16中任一项所述的关系抽取方法。
  19. 根据权利要求18所述的运算装置,还包括人机交互界面,以供用户输入原始待标注文本、多个正确种子和多个错误种子,和/或对标注结果进行确认。
PCT/CN2021/078145 2020-02-27 2021-02-26 标注方法、关系抽取方法、存储介质和运算装置 WO2021170085A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21761699.4A EP4113358A4 (en) 2020-02-27 2021-02-26 LABELING METHOD, RELATION EXTRACTION METHOD, STORAGE MEDIA AND OPERATING APPARATUS
US17/435,197 US20220327280A1 (en) 2020-02-27 2021-02-26 Annotation method, relation extraction method, storage medium and computing device
US18/395,509 US20240126984A1 (en) 2020-02-27 2023-12-23 Annotation method, relation extraction method, storage medium and computing device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010124863.5A CN111291554B (zh) 2020-02-27 2020-02-27 标注方法、关系抽取方法、存储介质和运算装置
CN202010124863.5 2020-02-27

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US17/435,197 A-371-Of-International US20220327280A1 (en) 2020-02-27 2021-02-26 Annotation method, relation extraction method, storage medium and computing device
US18/395,509 Continuation US20240126984A1 (en) 2020-02-27 2023-12-23 Annotation method, relation extraction method, storage medium and computing device

Publications (1)

Publication Number Publication Date
WO2021170085A1 true WO2021170085A1 (zh) 2021-09-02

Family

ID=71028346

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078145 WO2021170085A1 (zh) 2020-02-27 2021-02-26 标注方法、关系抽取方法、存储介质和运算装置

Country Status (4)

Country Link
US (2) US20220327280A1 (zh)
EP (1) EP4113358A4 (zh)
CN (1) CN111291554B (zh)
WO (1) WO2021170085A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238524A (zh) * 2021-12-21 2022-03-25 军事科学院系统工程研究院网络信息研究所 基于增强样本模型的卫星频轨数据信息抽取方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11392585B2 (en) * 2019-09-26 2022-07-19 Palantir Technologies Inc. Functions for path traversals from seed input to output
CN111291554B (zh) * 2020-02-27 2024-01-12 京东方科技集团股份有限公司 标注方法、关系抽取方法、存储介质和运算装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933164A (zh) * 2015-06-26 2015-09-23 华南理工大学 互联网海量数据中命名实体间关系提取方法及其系统
CN108427717A (zh) * 2018-02-06 2018-08-21 北京航空航天大学 一种基于逐步扩展的字母类语系医疗文本关系抽取方法
WO2018223271A1 (en) * 2017-06-05 2018-12-13 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for providing recommendations based on seeded supervised learning
CN109977391A (zh) * 2017-12-28 2019-07-05 中国移动通信集团公司 一种文本数据的信息抽取方法及装置
CN111291554A (zh) * 2020-02-27 2020-06-16 京东方科技集团股份有限公司 标注方法、关系抽取方法、存储介质和运算装置

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436759B2 (en) * 2007-12-27 2016-09-06 Nant Holdings Ip, Llc Robust information extraction from utterances
US20140082003A1 (en) * 2012-09-17 2014-03-20 Digital Trowel (Israel) Ltd. Document mining with relation extraction
US10223410B2 (en) * 2014-01-06 2019-03-05 Cisco Technology, Inc. Method and system for acquisition, normalization, matching, and enrichment of data
KR101536520B1 (ko) * 2014-04-28 2015-07-14 숭실대학교산학협력단 토픽을 추출하고, 추출된 토픽의 적합성을 평가하는 방법 및 서버
US10977573B1 (en) * 2015-05-07 2021-04-13 Google Llc Distantly supervised wrapper induction for semi-structured documents
US20180260860A1 (en) * 2015-09-23 2018-09-13 Giridhari Devanathan A computer-implemented method and system for analyzing and evaluating user reviews
US10755195B2 (en) * 2016-01-13 2020-08-25 International Business Machines Corporation Adaptive, personalized action-aware communication and conversation prioritization
US11210324B2 (en) * 2016-06-03 2021-12-28 Microsoft Technology Licensing, Llc Relation extraction across sentence boundaries
US11069432B2 (en) * 2016-10-17 2021-07-20 International Business Machines Corporation Automatic disease detection from unstructured textual reports
CN108052501A (zh) * 2017-12-13 2018-05-18 北京数洋智慧科技有限公司 一种基于人工智能的实体关系对识别方法及系统
US11221856B2 (en) * 2018-05-31 2022-01-11 Siemens Aktiengesellschaft Joint bootstrapping machine for text analysis
US20210365611A1 (en) * 2018-09-27 2021-11-25 Oracle International Corporation Path prescriber model simulation for nodes in a time-series network
CN109472033B (zh) * 2018-11-19 2022-12-06 华南师范大学 文本中的实体关系抽取方法及系统、存储介质、电子设备
US10871950B2 (en) * 2019-05-16 2020-12-22 Microsoft Technology Licensing, Llc Persistent annotation of syntax graphs for code optimization
US11526808B2 (en) * 2019-05-29 2022-12-13 The Board Of Trustees Of The Leland Stanford Junior University Machine learning based generation of ontology for structural and functional mapping
CN110444259B (zh) * 2019-06-06 2022-09-23 昆明理工大学 基于实体关系标注策略的中医电子病历实体关系提取方法
CN110289101A (zh) * 2019-07-02 2019-09-27 京东方科技集团股份有限公司 一种计算机设备、系统及可读存储介质
US20210005316A1 (en) * 2019-07-03 2021-01-07 Kenneth Neumann Methods and systems for an artificial intelligence advisory system for textual analysis
US11144728B2 (en) * 2019-07-19 2021-10-12 Siemens Aktiengesellschaft Neural relation extraction within and across sentence boundaries
US11636099B2 (en) * 2019-08-23 2023-04-25 International Business Machines Corporation Domain-specific labeled question generation for training syntactic parsers
US11709878B2 (en) * 2019-10-14 2023-07-25 Microsoft Technology Licensing, Llc Enterprise knowledge graph
US20210191938A1 (en) * 2019-12-19 2021-06-24 Oracle International Corporation Summarized logical forms based on abstract meaning representation and discourse trees
US11321382B2 (en) * 2020-02-11 2022-05-03 International Business Machines Corporation Secure matching and identification of patterns
US11669740B2 (en) * 2021-02-25 2023-06-06 Robert Bosch Gmbh Graph-based labeling rule augmentation for weakly supervised training of machine-learning-based named entity recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933164A (zh) * 2015-06-26 2015-09-23 华南理工大学 Method and system for extracting relations between named entities from massive Internet data
WO2018223271A1 (en) * 2017-06-05 2018-12-13 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for providing recommendations based on seeded supervised learning
CN109977391A (zh) * 2017-12-28 2019-07-05 中国移动通信集团公司 Information extraction method and device for text data
CN108427717A (zh) * 2018-02-06 2018-08-21 北京航空航天大学 Relation extraction method for medical texts in alphabetic languages based on stepwise expansion
CN111291554A (zh) * 2020-02-27 2020-06-16 京东方科技集团股份有限公司 Annotation method, relation extraction method, storage medium, and computing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4113358A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238524A (zh) * 2021-12-21 2022-03-25 军事科学院系统工程研究院网络信息研究所 Satellite frequency-orbit data information extraction method based on an augmented sample model

Also Published As

Publication number Publication date
EP4113358A1 (en) 2023-01-04
CN111291554A (zh) 2020-06-16
CN111291554B (zh) 2024-01-12
US20240126984A1 (en) 2024-04-18
EP4113358A4 (en) 2023-07-12
US20220327280A1 (en) 2022-10-13

Similar Documents

Publication Publication Date Title
WO2021170085A1 (zh) Annotation method, relation extraction method, storage medium, and computing device
CN109472033B (zh) Method and system for extracting entity relations from text, storage medium, and electronic device
US10853695B2 (en) Method and system for cell annotation with adaptive incremental learning
CN111613339B (zh) Deep-learning-based method and system for retrieving similar medical records
CN109378053A (zh) Knowledge graph construction method for medical images
CN114582470B (zh) Model training method, training device, and medical image report annotation method
CN106874643A (zh) Method and system for assisted diagnosis and treatment by automatically building a knowledge base from word vectors
CN112541066B (zh) Text-structuring-based medical technical report detection method and related device
CN108959474B (zh) Entity relation extraction method
US11074406B2 (en) Device for automatically detecting morpheme part of speech tagging corpus error by using rough sets, and method therefor
CN111243729B (zh) Automatic generation method for chest X-ray examination reports
CN112435651A (zh) Quality assessment method for automatic annotation of speech data
CN111275118A (zh) Multi-label chest X-ray classification method based on a self-correcting label generation network
CN112257441A (zh) Named entity recognition enhancement method based on counterfactual generation
WO2023045725A1 (zh) Method, electronic device, and computer program product for dataset creation
CN113159134A (zh) Intelligent diagnostic evaluation method based on structured breast reports
Abbood et al. EventEpi—A natural language processing framework for event-based surveillance
CN110674642B (zh) Semantic relation extraction method for noisy sparse text
CN116091836A (zh) Multimodal visual-language understanding and grounding method, device, terminal, and medium
CN113808758A (zh) Method, device, electronic device, and storage medium for standardizing laboratory test data
CN111709475B (zh) N-gram-based multi-label classification method and device
Hong et al. Rule-enhanced noisy knowledge graph embedding via low-quality error detection
US20170206317A1 (en) Systems and methods for targeted radiology resident training
CN109657710B (zh) Data screening method, device, server, and storage medium
WO2023000725A1 (zh) Named entity recognition method, device, and computer equipment for electric power metering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21761699; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2021761699; Country of ref document: EP; Effective date: 20220927)