WO2021147726A1 - Information extraction method, apparatus, electronic device and storage medium - Google Patents
Information extraction method, apparatus, electronic device and storage medium
- Publication number: WO2021147726A1 (PCT/CN2021/071485)
- Authority: WO (WIPO PCT)
- Prior art keywords: information, model, text, triplet, neural network
Classifications
- G06F16/35—Information retrieval of unstructured textual data; Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
- G06F16/367—Creation of semantic tools, e.g. ontology or thesauri; Ontology
- G06F40/169—Text processing; Editing; Annotation, e.g. comment data or footnotes
- G06F40/279—Natural language analysis; Recognition of textual entities
- G06F40/30—Semantic analysis
- G06N3/044—Neural networks; Recurrent networks, e.g. Hopfield networks
- G06N3/045—Neural networks; Combinations of networks
- G06N3/08—Neural networks; Learning methods
- G06N3/084—Learning methods; Backpropagation, e.g. using gradient descent
Definitions
- the present disclosure relates to the field of information processing technology, and in particular to an information extraction method, device, electronic equipment, and storage medium.
- the knowledge graph consists of entities, attributes and relationships. It is essentially a semantic network.
- the nodes in the network represent entities or attribute values that exist in the real world, and the edges between nodes represent the relationship between two entities.
- knowledge graph technology is mainly used in intelligent semantic search, mobile personal assistants and question answering systems.
- the present disclosure provides an information extraction method, device, electronic equipment, and storage medium to improve the efficiency and accuracy of information extraction.
- an information extraction method which includes:
- obtaining text data; and inputting the text data into a pre-trained information extraction model to obtain the triplet information contained in the text data.
- the triplet information includes the subject, predicate, and object in the text data;
- the information extraction model includes a binary classification sub-model and a multi-label classification sub-model.
- the binary classification sub-model is used to extract the subject in the text data
- the multi-label classification sub-model is used to extract, according to the subject and the text data, the predicate and object corresponding to the subject in the text data.
- before the step of inputting the text data into the pre-trained information extraction model to obtain the triplet information contained in the text data, the method further includes: obtaining the information extraction model, wherein the step of obtaining the information extraction model includes:
- obtaining a sample set including a plurality of texts to be trained and the triplet labeling information of each text to be trained, the triplet labeling information including subject labeling information, predicate labeling information, and object labeling information;
- training the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model according to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information, to obtain the information extraction model, wherein the trained first pre-trained language model and first neural network model constitute the binary classification sub-model, and the trained second pre-trained language model and second neural network model constitute the multi-label classification sub-model.
- the step of training the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model according to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information to obtain the information extraction model includes:
- optimizing the parameters in the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model to obtain the information extraction model, so that the sum of the first loss function and the second loss function is minimized.
- the first loss function and the second loss function are both cross-entropy loss functions.
- the step of obtaining a sample set includes:
- the step of obtaining the sample set further includes:
- the first triplet information is triplet information that appears in the triplet prediction information but does not appear in the triplet labeling information of the text to be labeled;
- the second triplet information is deleted from the triplet labeling information of the text to be labeled, wherein the second triplet information is triplet information that appears in the triplet labeling information of the text to be labeled but does not appear in the triplet prediction information;
- K is greater than or equal to 5 and less than or equal to 10.
- before the step of using the K pre-trained prediction models to predict the text to be labeled to obtain the K pieces of triplet prediction information, the method includes:
- the K-fold cross-validation method is adopted to obtain K prediction models.
- an information extraction device which includes:
- the obtaining module is configured to obtain text data
- the extraction module is configured to input the text data into a pre-trained information extraction model to obtain triple information contained in the text data.
- the triplet information includes the subject, predicate, and object; wherein the information extraction model includes a binary classification sub-model and a multi-label classification sub-model, the binary classification sub-model is used to extract the subject in the text data, and the multi-label classification sub-model is used to extract, according to the subject and the text data, the predicate and object corresponding to the subject in the text data.
- an electronic device which includes:
- a memory for storing executable instructions of the processor
- the processor is configured to execute the instructions to implement the information extraction method described in any embodiment.
- the present disclosure also discloses a storage medium.
- when the instructions in the storage medium are executed by the processor of the electronic device, the electronic device can execute the information extraction method described in any embodiment.
- Figure 1 shows a flow chart of the steps of an information extraction method provided by an embodiment of the present disclosure
- Figure 2 shows a flow chart of steps for obtaining an information extraction model provided by an embodiment of the present disclosure
- FIG. 3 shows a format of triplet tagging information provided by an embodiment of the present disclosure
- FIG. 4 shows a training framework of an information extraction model provided by an embodiment of the present disclosure
- FIG. 5 shows a flow chart of the steps of a method for automatically labeling data provided by an embodiment of the present disclosure
- FIG. 6 shows a schematic flowchart of an automated labeling provided by an embodiment of the present disclosure
- FIG. 7 shows a schematic flowchart of an information extraction method provided by an embodiment of the present disclosure
- FIG. 8 shows a structural block diagram of an information extraction device provided by an embodiment of the present disclosure
- FIG. 9 is a structural block diagram of a model acquisition module of an information extraction device provided by an embodiment of the present disclosure and the units contained therein;
- FIG. 10 schematically shows a block diagram of an electronic device for executing the method according to the present disclosure.
- FIG. 11 schematically shows a storage unit for holding or carrying program codes for implementing the method according to the present disclosure.
- Domain knowledge graph extracts entities and relationships between entities from specific resources in a specific field to build a knowledge base.
- the knowledge system it contains usually has a strong domain specificity and professionalism.
- the domain knowledge graph is constructed from top to bottom, including schema design, entity recognition, relationship extraction, entity linking, knowledge fusion, and knowledge calculation.
- the key is how to automatically extract information to obtain candidate knowledge units.
- the techniques involved include entity extraction, relationship extraction and attribute extraction, collectively referred to as information extraction.
- Information extraction is also called triple (S, P, O) extraction, where S and O are the subject and object of the sentence, corresponding to the entity or attribute value in the knowledge graph, and P is the predicate, corresponding to the relationship between the entities.
- an embodiment of the present disclosure provides an information extraction method.
- the method may include:
- Step 101 Obtain text data.
- the executing entity can obtain the data to be processed.
- the data to be processed may include, for example, data obtained by the executing entity (for example, a server) in real time from a database, data pre-stored in the storage unit of the executing entity, or data imported from a third party, etc.
- the text data may include unstructured text, etc. In some embodiments, the text data is unstructured text.
- text data can also be derived from text information extracted from pictures or files in other formats.
- when the file to be processed is a picture or a PDF file, the text data can be extracted from it by means of OCR recognition or the like, and then processed, for example as sketched below.
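- a minimal sketch of this step follows; the pytesseract and pdf2image packages (and the Tesseract language packs implied by the lang argument) are assumed library choices, since the embodiment only specifies "OCR recognition, etc.":

```python
# Illustrative OCR extraction; the library choices are assumptions, not the
# patent's implementation.
from PIL import Image
import pytesseract
from pdf2image import convert_from_path

def text_from_image(path: str) -> str:
    # Run OCR directly on a picture file.
    return pytesseract.image_to_string(Image.open(path), lang="chi_sim+eng")

def text_from_pdf(path: str) -> str:
    # Rasterize each PDF page, then OCR the pages one by one.
    pages = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(p, lang="chi_sim+eng") for p in pages)
```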
- Step 102 Input the text data into the pre-trained information extraction model to obtain the triple information contained in the text data.
- the triplet information includes the subject, predicate, and object in the text data; wherein the information extraction model includes a binary classification sub-model and a multi-label classification sub-model, and the binary classification sub-model is used to extract the subject in the text data
- the multi-label classification sub-model is used to extract the predicate and object corresponding to the subject in the text data according to the subject and the text data.
- domain knowledge graphs are usually constructed using a top-down approach: the top-level design is performed first, determining the types of entities, attributes, and relationships that the knowledge graph needs to include. There is no fixed standard for this part; it is usually designed according to business needs. For example, in the field of art, it may be necessary to obtain entities such as paintings, painters, and art institutions, along with the many attribute values and relationships among these entities: paintings have attributes such as creation time and creation medium, there are creation relationships between painters and paintings, and so on. Based on this, the following information extraction schema can be constructed:
- subject represents the subject s in the triple
- predicate represents the predicate p in the triple, which is also called relationship
- object represents the object o in the triple
- subject_type is the entity type of the subject
- object_type is the entity type of the object.
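- such a schema can be encoded in a machine-readable form, for example as follows; the field names mirror the description above, and the art-domain entries are illustrative assumptions rather than the patent's full schema:

```python
# Example encoding of the information extraction schema; the entries are
# illustrative only.
ART_SCHEMA = [
    {"subject_type": "painting", "predicate": "author",          "object_type": "painter"},
    {"subject_type": "painting", "predicate": "creation time",   "object_type": "date"},
    {"subject_type": "painting", "predicate": "creation medium", "object_type": "material"},
    {"subject_type": "painting", "predicate": "collection site", "object_type": "institution"},
    {"subject_type": "painter",  "predicate": "nationality",     "object_type": "country"},
]
```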
- first, the subject s is predicted; then the subject s is passed in to predict the object o corresponding to s; and finally the subject s and object o are passed in to predict the relational predicate p.
- the prediction of the object o and the predicate p can be combined into one step: the subject s is predicted first, and then the subject s is passed in to predict the object o and the predicate p corresponding to s, as shown in the following formula: P(s, p, o | x) = P(s | x) · P((p, o) | s, x)
- the binary classification sub-model and the multi-label classification sub-model in the information extraction model can be obtained by jointly training the pre-trained language model and the neural network model by using unstructured text labeled with triple information.
- the training process of the information extraction model and the process of labeling unstructured text will be described in detail.
- the text data is first input into the binary classification sub-model, which extracts all the subjects in the text data; each subject is then paired with the text data and sent to the multi-label classification sub-model,
- which extracts the predicate and object corresponding to that subject in the text data.
- this joint entity-relationship extraction model replaces the traditional pipeline of entity recognition followed by relationship extraction, improving both the efficiency and the accuracy of information extraction, as sketched below.
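- the two-stage flow can be summarized with a short sketch; subject_model and object_model stand for the trained sub-models, and the predict_subjects/predict_po interfaces are assumed names for illustration:

```python
# Sketch of the two-stage inference flow described above; the sub-model
# interfaces are assumed, not the patent's actual API.
def extract_triplets(text: str, subject_model, object_model) -> list:
    triplets = []
    # Stage 1: the binary classification sub-model proposes every subject.
    for subject in subject_model.predict_subjects(text):
        # Stage 2: each (subject, text) pair goes to the multi-label
        # classification sub-model, which returns (predicate, object) pairs.
        for predicate, obj in object_model.predict_po(subject, text):
            triplets.append((subject, predicate, obj))
    return triplets
```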
- a step of obtaining the information extraction model may also be included.
- the steps of obtaining the information extraction model may specifically include:
- Step 201 Obtain a sample set.
- the sample set includes multiple texts to be trained and triplet labeling information of each text to be trained.
- the triplet labeling information includes subject labeling information, predicate labeling information, and object labeling information.
- the text to be trained may be, for example: "The Mona Lisa is an oil painting created by the Italian Renaissance painter Leonardo da Vinci, which is now in the Louvre Museum in France.”
- the triplet information of this text to be trained includes (Mona Lisa, author, Leonardo da Vinci), (Mona Lisa, collection site, Louvre Museum, France), (Da Vinci, nationality, Italy), and (Mona Lisa, creation category, oil painting).
- the triple information can be labeled in a specific format.
- the start and end positions of the subject S in the sentence can be marked. For example, when labeling (Mona Lisa, author, Leonardo da Vinci), (Mona Lisa, creation category, oil painting), and (Da Vinci, nationality, Italy), the start and end positions of the subjects Mona Lisa and Da Vinci in the sentence are marked with two sequences: 1 is marked at the corresponding start and end positions, and 0 at all other positions. Figure 3 shows the subject labeling information of this text to be trained, and a sketch of the scheme follows.
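- a small sketch of this labeling scheme; character-level indexing and first-occurrence matching are simplifying assumptions:

```python
# Builds the two 0/1 sequences described above: one marking subject start
# positions, the other marking subject end positions.
def mark_subjects(text: str, subjects: list) -> tuple:
    start_tags = [0] * len(text)
    end_tags = [0] * len(text)
    for s in subjects:
        i = text.find(s)  # first occurrence only, for brevity
        if i != -1:
            start_tags[i] = 1
            end_tags[i + len(s) - 1] = 1
    return start_tags, end_tags
```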
- Step 202 Input the text to be trained into the first pre-training language model, and input the output information of the first pre-training language model into the first neural network model.
- Step 203 Input the output information of the first neural network model and the text to be trained into the second pre-training language model, and send the output information of the second pre-training language model into the second neural network model.
- Step 204 According to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information, train the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model to obtain the information extraction model, where the trained first pre-trained language model and first neural network model constitute the binary classification sub-model, and the trained second pre-trained language model and second neural network model constitute the multi-label classification sub-model.
- the first loss function can be determined according to the output information of the first neural network model and subject labeling information
- the second loss function can be determined according to the output information of the second neural network model, predicate labeling information, and object labeling information
- the first pre-training language model and the second pre-training language model may be a BERT model, an ERNIE model, a Span BERT model, and so on.
- the first pre-training language model and the second pre-training language model are both BERT models as an example.
- the first neural network model is a Dense layer + sigmoid
- the second neural network model is a Dense layer + softmax
- both the first loss function and the second loss function are cross-entropy loss functions. It should be noted that the minimum of the sum of the first loss function and the second loss function is not limited to a single value; it may be a range of values.
- the training framework of the information extraction model is shown.
- the specific steps of model training are as follows: first, the text X to be trained, that is, [CLS]"Mona Lisa" is an oil painting created by Italian Renaissance painter Leonardo...[SEP], is sent as a single input to the BERT model;
- the encoded output of the BERT model is sent to the Dense layer + sigmoid, and the first loss function loss_s (a cross-entropy loss function) is used for binary classification training that predicts the start and end positions of the subject; the trained first pre-trained language model (BERT) and first neural network model (Dense layer + sigmoid) constitute the binary classification sub-model subject_model.
- for the second stage, the output information of the BERT model, namely the vector corresponding to [CLS], is sent to the Dense layer + softmax, and the second loss function loss_o (a cross-entropy loss function) is used for multi-label classification training that predicts predicates and objects; the trained second pre-trained language model (BERT) and second neural network model (Dense layer + softmax) constitute the multi-label classification sub-model object_model.
- the binary classification sub-model subject_model and the multi-label classification sub-model object_model can be jointly trained.
- the parameters in the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model are iteratively optimized to obtain the information extraction model, for example as sketched below.
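- a hedged PyTorch sketch of this framework follows; the two-encoder layout, the Dense + sigmoid subject head, and the Dense + softmax relation head over the [CLS] vector follow the description above, while the class name, tensor shapes, and the bert-base-chinese checkpoint are assumptions for illustration:

```python
# Minimal sketch of the joint training framework; names and shapes are
# illustrative assumptions, not the patent's implementation.
import torch
import torch.nn as nn
from transformers import BertModel

class JointExtractor(nn.Module):
    def __init__(self, num_relations: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.subject_bert = BertModel.from_pretrained(bert_name)  # first pre-trained LM
        self.object_bert = BertModel.from_pretrained(bert_name)   # second pre-trained LM
        hidden = self.subject_bert.config.hidden_size
        self.subject_head = nn.Linear(hidden, 2)             # start/end logit per token
        self.object_head = nn.Linear(hidden, num_relations)  # relation logits from [CLS]

    def forward(self, subj_inputs: dict, obj_inputs: dict):
        # Binary classification sub-model: per-token subject start/end probabilities.
        h_s = self.subject_bert(**subj_inputs).last_hidden_state
        subject_probs = torch.sigmoid(self.subject_head(h_s))  # (batch, seq_len, 2)
        # Multi-label classification sub-model: relation scores from the [CLS]
        # vector of the second encoder, whose input pairs a subject with the text.
        h_cls = self.object_bert(**obj_inputs).last_hidden_state[:, 0]
        relation_logits = self.object_head(h_cls)  # (batch, num_relations)
        return subject_probs, relation_logits

# Joint objective, both terms cross-entropy as described above:
#   loss = loss_s(subject_probs, subject_tags) + loss_o(relation_logits, relation_labels)
```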
- the output of the input sample X after being encoded by BERT can be expressed as h_0 = X W_s, h_l = Trans(h_{l-1}), l ∈ [1, L], where L represents the number of Transformer layers and W_s is the sentence embedding matrix.
- the probability that the i-th token is the start position of a subject is then p_i^start = σ(W_start · x_i + b_start), where W_start is the trainable weight vector, b_start is the bias term, σ is the sigmoid activation function, and x_i is the BERT encoding of the i-th token; the end position is predicted analogously with its own weight vector and bias term.
- similarly, two sequences can also be used to determine the start and end positions of the object.
- alternatively, the multi-label classification method can be used to determine the start and end positions of the object and the relationship at the same time, that is, to predict the probability of each relationship label at the start and end positions of the object.
- the parameter to be optimized in the model training process is the above-mentioned trainable weight vector, and the loss function loss is minimized by iteratively updating and optimizing the parameters.
- the current mainstream relationship extraction methods are supervised, semi-supervised, and unsupervised learning methods. Compared with semi-supervised and unsupervised learning methods, supervised learning methods have higher accuracy and recall, so they have received more and more attention.
- however, supervised learning methods require a large amount of data annotation, so how to improve the efficiency of data annotation is also an urgent problem to be solved.
- step 201 may include:
- Step 501 Process the unstructured text sample to obtain the text to be labeled.
- Step 502 Obtain the labeled text to be trained and the triplet label information of the text to be trained.
- Step 503 In response to the text to be labeled containing the subject labeling information and object labeling information in the triplet labeling information, label the text to be labeled according to the triplet labeling information.
- in this way, by using the existing knowledge base to automatically label data, the cost of corpus labeling can be reduced.
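- a minimal sketch of this matching rule follows; the literal substring test is an assumption consistent with the description above:

```python
# A knowledge-base triplet labels a text only when both its subject and its
# object appear in that text.
def auto_label(text: str, kb_triplets: list) -> dict:
    spo_list = [(s, p, o) for (s, p, o) in kb_triplets if s in text and o in text]
    return {"text": text, "spo_list": spo_list}
```

- an automatically labeled sample then takes the following form: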
- { 'text': '"Mona Lisa" is an oil painting created by Italian Renaissance painter Leonardo da Vinci, now in the collection of the Louvre Museum in France', 'spo_list': [(Mona Lisa, author, Leonardo da Vinci), (Mona Lisa, collection site, Louvre Museum, France), (Da Vinci, nationality, Italy), (Mona Lisa, creation category, oil painting)] }.
- the method of knowledge distillation can be used to reduce the noise of the automatically labeled data.
- the foregoing implementation manner may also include:
- Step 504 Predict the text to be labeled using K prediction models obtained by pre-training, to obtain K pieces of triplet prediction information.
- the K prediction models can be trained by K-fold cross-validation based on the labeled text to be trained and the triplet labeling information of the text to be trained.
- the training samples are equally divided into K parts; K-1 parts are taken in turn to train the model, and the remaining part is used as the samples to be predicted. For example, if the samples are divided into [D1, D2, D3, ..., DK], take [D1, D2, ..., Dk-1, Dk+1, ..., DK] as the training samples and Dk as the samples to be predicted, where k ∈ [1, K], as sketched below.
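- this split can be sketched with scikit-learn's KFold (an assumed implementation choice); train_fn stands for any routine that trains one prediction model on a list of labeled samples:

```python
# K-fold construction described above: each fold Dk serves once as the
# prediction set while the other K-1 folds train one of the K models.
from sklearn.model_selection import KFold

def kfold_models(samples: list, train_fn, k: int = 5):
    models, held_out = [], []
    for train_idx, pred_idx in KFold(n_splits=k, shuffle=True).split(samples):
        models.append(train_fn([samples[i] for i in train_idx]))
        held_out.append([samples[i] for i in pred_idx])  # the part Dk to predict
    return models, held_out
```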
- Step 505 When the ratio of the quantity of the first triple information to K is greater than the first preset threshold, the first triple information is added to the sample set as the triple annotation information of the text to be annotated.
- the first triplet information is triplet information that appears in the triplet prediction information but does not appear in the triplet labeling information of the text to be labeled.
- Step 506 When the ratio of the quantity of the second triplet information to K is greater than the second preset threshold, delete the second triplet information from the triplet labeling information of the text to be labeled, where the second triplet information is triplet information that appears in the triplet labeling information of the text to be labeled but does not appear in the triplet prediction information.
- the K value can be greater than or equal to 5 and less than or equal to 10, or it can be set by itself according to the data scale.
- the first preset threshold and the second preset threshold may be the same or different, and the specific value may be determined according to actual needs.
- K-fold cross-validation can be used to train K models with labeled data, and then use the trained K models to predict the text to be labeled.
- suppose a first triplet Ti appears M times among the K pieces of triplet prediction information but does not exist in R_s, and suppose there are N prediction results that do not contain a second triplet Tj that does exist in R_s.
- both the first preset threshold and the second preset threshold can be set to Score.
- if M/K > Score, the first triplet Ti is considered to be a missed label of the text to be labeled, so Ti is added to the triplet labeling information R_s.
- if N/K > Score, the second triplet Tj is considered to be mislabeled data, so Tj needs to be deleted from the triplet labeling information R_s. In this way, by repeating training and prediction multiple times, the training sample set can be continuously revised.
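- the voting rule can be sketched as follows, where R_s is the labeled triplet set of one text, predictions holds the K model outputs for that text, and Score is the shared threshold:

```python
# Voting-based revision described above: missed labels (M/K > Score) are
# added, mislabeled triplets (N/K > Score) are removed.
from collections import Counter

def revise_labels(r_s: set, predictions: list, score: float = 0.5) -> set:
    k = len(predictions)
    votes = Counter(t for pred in predictions for t in pred)
    revised = set(r_s)
    for t, m in votes.items():          # candidate missed labels
        if t not in r_s and m / k > score:
            revised.add(t)
    for t in r_s:                       # candidate wrong labels
        n = sum(1 for pred in predictions if t not in pred)
        if n / k > score:
            revised.discard(t)
    return revised
```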
- in the embodiment of the present disclosure, the existing knowledge base is used to automatically label data, which can reduce the cost of corpus labeling.
- manual review is then performed, and the method of knowledge distillation is used to denoise the labeled data at a later stage.
- the information extraction method provided by this embodiment mainly involves the data annotation method, schema construction, the information extraction algorithm model, and data noise reduction.
- the solution uses an end-to-end joint entity-relationship extraction method to extract knowledge from unstructured text; while ensuring the accuracy of information extraction, it reduces the cost of constructing a knowledge graph, improves the efficiency of information extraction, and saves labor costs.
- the device may include:
- the obtaining module 801 is configured to obtain text data
- the extraction module 802 is configured to input the text data into a pre-trained information extraction model to obtain the triplet information contained in the text data, the triplet information including the subject, predicate, and object in the text data; wherein the information extraction model includes a binary classification sub-model and a multi-label classification sub-model, the binary classification sub-model is used to extract the subject in the text data, and the multi-label classification sub-model is used to extract, according to the subject and the text data, the predicate and object corresponding to the subject in the text data.
- the apparatus may further include: a model acquisition module 800 configured to acquire the information extraction model, and the model acquisition module includes:
- the first unit 8001 is configured to obtain a sample set, the sample set including a plurality of texts to be trained and the triplet labeling information of each text to be trained, the triplet labeling information including subject labeling information, predicate labeling information, and object labeling information;
- the second unit 8002 is configured to input the text to be trained into a first pre-training language model, and send output information of the first pre-training language model into the first neural network model;
- the third unit 8003 is configured to input the output information of the first neural network model and the text to be trained into a second pre-trained language model, and send the output information of the second pre-trained language model into the second neural network model;
- the fourth unit 8004 is configured to train, according to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information, the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model to obtain the information extraction model, wherein the trained first pre-trained language model and first neural network model constitute the binary classification sub-model, and the trained second pre-trained language model and second neural network model constitute the multi-label classification sub-model.
- the fourth unit is specifically configured as:
- the parameters in the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model are optimized to obtain the information extraction model, so that the sum of the first loss function and the second loss function is minimized.
- the first loss function and the second loss function are both cross-entropy loss functions.
- the first unit is specifically configured as:
- the first unit is further configured to:
- the first triplet information is triplet information that appears in the triplet prediction information but does not appear in the triplet tagging information of the text to be tagged;
- the second triplet information is deleted from the triplet labeling information of the text to be labeled, wherein the second triplet information is triplet information that appears in the triplet labeling information of the text to be labeled but does not appear in the triplet prediction information.
- the K value can be greater than or equal to 5 and less than or equal to 10, or it can be set by itself according to the data scale.
- the first unit is further configured to:
- the K-fold cross-validation method is adopted to obtain K prediction models.
- Another embodiment of the present disclosure also provides an electronic device, which includes:
- a memory for storing executable instructions of the processor
- the processor is configured to execute the instructions to implement the information extraction method described in any embodiment.
- Another embodiment of the present disclosure further provides a storage medium.
- the electronic device can execute the information extraction method described in any embodiment.
- the device embodiments described above are merely illustrative.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed across multiple network units.
- Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement it without creative work.
- the various component embodiments of the present disclosure may be implemented by hardware, or by software modules running on one or more processors, or by a combination of them.
- a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the electronic device according to the embodiments of the present disclosure.
- the present disclosure can also be implemented as a device or device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein.
- Such a program for realizing the present disclosure may be stored on a computer-readable medium, or may have the form of one or more signals.
- Such a signal can be downloaded from an Internet website, or provided on a carrier signal, or provided in any other form.
- FIG. 10 shows an electronic device that can implement the method according to the present disclosure.
- the electronic device typically includes a processor 1010 and a computer program product or computer-readable medium in the form of a memory 1020.
- the memory 1020 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
- the memory 1020 has a storage space 1030 for executing program codes 1031 of any method steps in the above methods.
- the storage space 1030 for program codes may include various program codes 1031 respectively used to implement various steps in the above method. These program codes can be read from or written into one or more computer program products.
- These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks.
- Such a computer program product is usually a portable or fixed storage unit as described with reference to FIG. 11.
- the storage unit may have storage segments, storage spaces, etc. arranged similarly to the memory 1020 in the electronic device of FIG. 10.
- for example, the program code can be compressed in an appropriate form.
- the storage unit includes computer-readable codes 1031', that is, codes that can be read by a processor such as the processor 1010. When run by an electronic device, these codes cause the electronic device to execute each step of the methods described above.
- the information extraction method, device, electronic equipment, and storage medium proposed in the embodiments of the present disclosure have at least the following advantages:
- the technical solution of this application provides an information extraction method, device, electronic equipment, and storage medium.
- source data is obtained, and then the source data is input into a pre-trained information extraction model to obtain the triplet information contained in the source data.
- the triplet information includes the subject, predicate, and object in the source data; the information extraction model includes a binary classification sub-model and a multi-label classification sub-model.
- the binary classification sub-model is used to extract the subject in the source data
- the multi-label classification sub-model is used to extract, based on the subject and the source data, the predicate and object corresponding to the subject in the source data.
- the technical solution of the present application adopts an end-to-end information extraction model to jointly extract the triplet information in the source data, replacing the traditional pipeline of entity recognition followed by relationship extraction, which can improve the efficiency and accuracy of information extraction.
Abstract
Description
Claims (20)
- An information extraction method, wherein the method comprises: obtaining text data; and inputting the text data into a pre-trained information extraction model to obtain triplet information contained in the text data, the triplet information comprising the subject, predicate, and object in the text data; wherein the information extraction model comprises a binary classification sub-model and a multi-label classification sub-model, the binary classification sub-model is used to extract the subject in the text data, and the multi-label classification sub-model is used to extract, according to the subject and the text data, the predicate and object corresponding to the subject in the text data.
- The information extraction method according to claim 1, wherein before the step of inputting the text data into the pre-trained information extraction model to obtain the triplet information contained in the text data, the method further comprises: obtaining the information extraction model.
- The information extraction method according to claim 2, wherein the step of obtaining the information extraction model comprises: obtaining a sample set, the sample set comprising a plurality of texts to be trained and triplet labeling information of each text to be trained, the triplet labeling information comprising subject labeling information, predicate labeling information, and object labeling information; inputting the text to be trained into a first pre-trained language model, and feeding output information of the first pre-trained language model into a first neural network model; inputting the output information of the first neural network model and the text to be trained into a second pre-trained language model, and feeding output information of the second pre-trained language model into a second neural network model; and training the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model according to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information to obtain the information extraction model, wherein the trained first pre-trained language model and first neural network model constitute the binary classification sub-model, and the trained second pre-trained language model and second neural network model constitute the multi-label classification sub-model.
- The information extraction method according to claim 3, wherein the step of training the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model according to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information to obtain the information extraction model comprises: determining a first loss function according to the output information of the first neural network model and the subject labeling information; determining a second loss function according to the output information of the second neural network model, the predicate labeling information, and the object labeling information; and optimizing the parameters in the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model to obtain the information extraction model, so that the sum of the first loss function and the second loss function is minimized.
- The information extraction method according to claim 4, wherein the first loss function and the second loss function are both cross-entropy loss functions.
- The information extraction method according to any one of claims 3-5, wherein the step of obtaining the sample set comprises: obtaining unstructured text samples; processing the unstructured text samples to obtain text to be labeled; obtaining labeled text to be trained and the triplet labeling information of the text to be trained; and in response to the text to be labeled containing the subject labeling information and the object labeling information in the triplet labeling information, labeling the text to be labeled according to the triplet labeling information.
- The information extraction method according to claim 6, wherein the step of obtaining the sample set further comprises: predicting the text to be labeled using K pre-trained prediction models to obtain K pieces of triplet prediction information; when the ratio of the quantity of first triplet information to K is greater than a first preset threshold, adding the first triplet information to the sample set as triplet labeling information of the text to be labeled, wherein the first triplet information is triplet information that appears in the triplet prediction information but does not appear in the triplet labeling information of the text to be labeled; and when the ratio of the quantity of second triplet information to K is greater than a second preset threshold, deleting the second triplet information from the triplet labeling information of the text to be labeled, wherein the second triplet information is triplet information that appears in the triplet labeling information of the text to be labeled but does not appear in the triplet prediction information.
- The information extraction method according to claim 7, wherein K is greater than or equal to 5 and less than or equal to 10.
- The information extraction method according to claim 7 or 8, wherein before the step of predicting the text to be labeled using the K pre-trained prediction models to obtain the K pieces of triplet prediction information, the method comprises: obtaining the K prediction models by means of K-fold cross-validation according to the labeled text to be trained and the triplet labeling information of the text to be trained.
- An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following information extraction operations, comprising: obtaining text data; and inputting the text data into a pre-trained information extraction model to obtain triplet information contained in the text data, the triplet information comprising the subject, predicate, and object in the text data; wherein the information extraction model comprises a binary classification sub-model and a multi-label classification sub-model, the binary classification sub-model is used to extract the subject in the text data, and the multi-label classification sub-model is used to extract, according to the subject and the text data, the predicate and object corresponding to the subject in the text data.
- The electronic device according to claim 10, wherein before the operation of inputting the text data into the pre-trained information extraction model to obtain the triplet information contained in the text data, the operations further comprise: obtaining the information extraction model.
- The electronic device according to claim 11, wherein the operation of obtaining the information extraction model comprises: obtaining a sample set, the sample set comprising a plurality of texts to be trained and triplet labeling information of each text to be trained, the triplet labeling information comprising subject labeling information, predicate labeling information, and object labeling information; inputting the text to be trained into a first pre-trained language model, and feeding output information of the first pre-trained language model into a first neural network model; inputting the output information of the first neural network model and the text to be trained into a second pre-trained language model, and feeding output information of the second pre-trained language model into a second neural network model; and training the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model according to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information to obtain the information extraction model, wherein the trained first pre-trained language model and first neural network model constitute the binary classification sub-model, and the trained second pre-trained language model and second neural network model constitute the multi-label classification sub-model.
- The electronic device according to claim 12, wherein the operation of training the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model according to the output information of the first neural network model, the output information of the second neural network model, and the triplet labeling information to obtain the information extraction model comprises: determining a first loss function according to the output information of the first neural network model and the subject labeling information; determining a second loss function according to the output information of the second neural network model, the predicate labeling information, and the object labeling information; and optimizing the parameters in the first pre-trained language model, the first neural network model, the second pre-trained language model, and the second neural network model to obtain the information extraction model, so that the sum of the first loss function and the second loss function is minimized.
- The electronic device according to claim 13, wherein the first loss function and the second loss function are both cross-entropy loss functions.
- The electronic device according to any one of claims 12-14, wherein the operation of obtaining the sample set comprises: obtaining unstructured text samples; processing the unstructured text samples to obtain text to be labeled; obtaining labeled text to be trained and the triplet labeling information of the text to be trained; and in response to the text to be labeled containing the subject labeling information and the object labeling information in the triplet labeling information, labeling the text to be labeled according to the triplet labeling information.
- The electronic device according to claim 15, wherein the operation of obtaining the sample set further comprises: predicting the text to be labeled using K pre-trained prediction models to obtain K pieces of triplet prediction information; when the ratio of the quantity of first triplet information to K is greater than a first preset threshold, adding the first triplet information to the sample set as triplet labeling information of the text to be labeled, wherein the first triplet information is triplet information that appears in the triplet prediction information but does not appear in the triplet labeling information of the text to be labeled; and when the ratio of the quantity of second triplet information to K is greater than a second preset threshold, deleting the second triplet information from the triplet labeling information of the text to be labeled, wherein the second triplet information is triplet information that appears in the triplet labeling information of the text to be labeled but does not appear in the triplet prediction information.
- The electronic device according to claim 16, wherein K is greater than or equal to 5 and less than or equal to 10.
- The electronic device according to claim 16 or 17, wherein before the operation of predicting the text to be labeled using the K pre-trained prediction models to obtain the K pieces of triplet prediction information, the operations comprise: obtaining the K prediction models by means of K-fold cross-validation according to the labeled text to be trained and the triplet labeling information of the text to be trained.
- A non-volatile computer-readable storage medium, wherein, when computer program code in the storage medium is executed by a processor of an electronic device, the electronic device is enabled to execute the information extraction method according to any one of claims 1 to 10.
- A computer program product comprising computer program code, wherein, when the computer program code is executed by a processor of an electronic device, the electronic device is enabled to execute the information extraction method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/425,556 US11922121B2 (en) | 2020-01-21 | 2021-01-13 | Method and apparatus for information extraction, electronic device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010071824.3A CN111291185B (zh) | 2020-01-21 | 2020-01-21 | 信息抽取方法、装置、电子设备及存储介质 |
CN202010071824.3 | 2020-01-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021147726A1 true WO2021147726A1 (zh) | 2021-07-29 |
Family
ID=71025634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/071485 WO2021147726A1 (zh) | 2020-01-21 | 2021-01-13 | 信息抽取方法、装置、电子设备及存储介质 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11922121B2 (zh) |
CN (1) | CN111291185B (zh) |
WO (1) | WO2021147726A1 (zh) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN111291185B (zh) | 2020-01-21 | 2023-09-22 | 京东方科技集团股份有限公司 | Information extraction method, apparatus, electronic device and storage medium |
- CN112052681A (zh) * | 2020-08-20 | 2020-12-08 | 中国建设银行股份有限公司 | Information extraction model training method, information extraction method, apparatus, and electronic device |
- CN112000808B (zh) * | 2020-09-29 | 2024-04-16 | 迪爱斯信息技术股份有限公司 | Data processing method and apparatus, and readable storage medium |
- CN112380356A (zh) * | 2020-11-30 | 2021-02-19 | 百度国际科技(深圳)有限公司 | Method, apparatus, electronic device, and medium for constructing a meal-plan knowledge graph |
- CN112507125A (zh) * | 2020-12-03 | 2021-03-16 | 平安科技(深圳)有限公司 | Triplet information extraction method, apparatus, device, and computer-readable storage medium |
- CN112528641A (zh) * | 2020-12-10 | 2021-03-19 | 北京百度网讯科技有限公司 | Method, apparatus, electronic device, and readable storage medium for building an information extraction model |
- CN112528600B (zh) * | 2020-12-15 | 2024-05-07 | 北京百度网讯科技有限公司 | Text data processing method, related apparatus, and computer program product |
- CN112613315B (zh) * | 2020-12-29 | 2024-06-07 | 重庆农村商业银行股份有限公司 | Automatic text knowledge extraction method, apparatus, device, and storage medium |
- CN113158671B (zh) * | 2021-03-25 | 2023-08-11 | 胡明昊 | Open-domain information extraction method combined with named entity recognition |
- CN112818138B (zh) * | 2021-04-19 | 2021-10-15 | 中译语通科技股份有限公司 | Knowledge graph ontology construction method, apparatus, terminal device, and readable storage medium |
- CN113051356B (zh) * | 2021-04-21 | 2023-05-30 | 深圳壹账通智能科技有限公司 | Open relation extraction method, apparatus, electronic device, and storage medium |
- CN113254429B (zh) * | 2021-05-13 | 2023-07-21 | 东北大学 | BERT- and MLM-based noise reduction method for distantly supervised relation extraction |
- CN113160917B (zh) * | 2021-05-18 | 2022-11-01 | 山东浪潮智慧医疗科技有限公司 | Entity relation extraction method for electronic medical records |
- CN113486189A (zh) * | 2021-06-08 | 2021-10-08 | 广州数说故事信息科技有限公司 | Open knowledge graph mining method and system |
- CN113420120B (zh) * | 2021-06-24 | 2024-05-31 | 平安科技(深圳)有限公司 | Training method, extraction method, device, and medium for a key information extraction model |
- CN113590810B (zh) * | 2021-08-03 | 2023-07-14 | 北京奇艺世纪科技有限公司 | Summary generation model training method, summary generation method, apparatus, and electronic device |
- CN113779260B (zh) * | 2021-08-12 | 2023-07-18 | 华东师范大学 | Method and system for joint extraction of domain-graph entities and relations based on a pre-trained model |
- CN113468344B (zh) * | 2021-09-01 | 2021-11-30 | 北京德风新征程科技有限公司 | Entity relation extraction method, apparatus, electronic device, and computer-readable medium |
- CN115544626B (zh) * | 2022-10-21 | 2023-10-20 | 清华大学 | Sub-model extraction method, apparatus, computer device, and medium |
- CN116340552B (zh) * | 2023-01-06 | 2024-07-02 | 北京达佳互联信息技术有限公司 | Label sorting method, apparatus, device, and storage medium |
- CN116415005B (zh) * | 2023-06-12 | 2023-08-18 | 中南大学 | Relation extraction method for constructing scholar academic networks |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070083357A1 (en) * | 2005-10-03 | 2007-04-12 | Moore Robert C | Weighted linear model |
US11222052B2 (en) * | 2011-02-22 | 2022-01-11 | Refinitiv Us Organization Llc | Machine learning-based relationship association and related discovery and |
US10303999B2 (en) * | 2011-02-22 | 2019-05-28 | Refinitiv Us Organization Llc | Machine learning-based relationship association and related discovery and search engines |
US10325106B1 (en) * | 2013-04-04 | 2019-06-18 | Marklogic Corporation | Apparatus and method for operating a triple store database with document based triple access security |
US10977573B1 (en) * | 2015-05-07 | 2021-04-13 | Google Llc | Distantly supervised wrapper induction for semi-structured documents |
- CN106055536B (zh) * | 2016-05-19 | 2018-08-21 | 苏州大学 | Chinese event joint inference method |
US20180232443A1 (en) * | 2017-02-16 | 2018-08-16 | Globality, Inc. | Intelligent matching system with ontology-aided relation extraction |
- KR101983455B1 (ko) * | 2017-09-21 | 2019-05-28 | 숭실대학교산학협력단 | Knowledge base construction method and server therefor |
US10824962B2 (en) * | 2017-09-29 | 2020-11-03 | Oracle International Corporation | Utterance quality estimation |
- CN108073711B (zh) | 2017-12-21 | 2022-01-11 | 北京大学深圳研究生院 | Knowledge-graph-based relation extraction method and system |
US11288294B2 (en) * | 2018-04-26 | 2022-03-29 | Accenture Global Solutions Limited | Natural language processing and artificial intelligence based search system |
- CN109597855A (zh) | 2018-11-29 | 2019-04-09 | 北京邮电大学 | Big-data-driven domain knowledge graph construction method and system |
US10825449B1 (en) * | 2019-09-27 | 2020-11-03 | CrowdAround Inc. | Systems and methods for analyzing a characteristic of a communication using disjoint classification models for parsing and evaluation of the communication |
- CN113204649A (zh) * | 2021-05-11 | 2021-08-03 | 西安交通大学 | Legal knowledge graph construction method and device based on joint entity-relation extraction |
- 2020-01-21 CN CN202010071824.3A patent/CN111291185B/zh active Active
- 2021-01-13 WO PCT/CN2021/071485 patent/WO2021147726A1/zh active Application Filing
- 2021-01-13 US US17/425,556 patent/US11922121B2/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100185700A1 (en) * | 2007-09-17 | 2010-07-22 | Yan Bodain | Method and system for aligning ontologies using annotation exchange |
- CN106844368A (zh) * | 2015-12-03 | 2017-06-13 | 华为技术有限公司 | Method, neural network system, and user equipment for human-machine dialogue |
- CN106709006A (zh) * | 2016-12-23 | 2017-05-24 | 武汉科技大学 | Query-friendly linked data compression method |
- KR20180108257A (ko) * | 2017-03-24 | 2018-10-04 | (주)아크릴 | Method for extending an ontology using resources represented by the ontology |
- CN108694208A (zh) * | 2017-04-11 | 2018-10-23 | 富士通株式会社 | Method and apparatus for constructing a database |
- CN108874778A (zh) * | 2018-06-15 | 2018-11-23 | 广东蔚海数问大数据科技有限公司 | Semantic entity relation extraction method, apparatus, and electronic device |
- CN111291185A (zh) * | 2020-01-21 | 2020-06-16 | 京东方科技集团股份有限公司 | Information extraction method, apparatus, electronic device and storage medium |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220129633A1 (en) * | 2020-10-23 | 2022-04-28 | Target Brands, Inc. | Multi-task learning of query intent and named entities |
US11934785B2 (en) * | 2020-10-23 | 2024-03-19 | Target Brands, Inc. | Multi-task learning of query intent and named entities |
- CN113821602A (zh) * | 2021-09-29 | 2021-12-21 | 平安银行股份有限公司 | Automatic question answering method, apparatus, device, and medium based on image-text chat records |
- CN113821602B (zh) * | 2021-09-29 | 2024-05-24 | 平安银行股份有限公司 | Automatic question answering method, apparatus, device, and medium based on image-text chat records |
- CN114266258A (zh) * | 2021-12-30 | 2022-04-01 | 北京百度网讯科技有限公司 | Semantic relation extraction method, apparatus, electronic device, and storage medium |
- CN114266258B (zh) * | 2021-12-30 | 2023-06-23 | 北京百度网讯科技有限公司 | Semantic relation extraction method, apparatus, electronic device, and storage medium |
- CN115759098A (zh) * | 2022-11-14 | 2023-03-07 | 中国科学院空间应用工程与技术中心 | Method and system for joint extraction of Chinese entities and relations from aerospace text data |
Also Published As
Publication number | Publication date |
---|---|
CN111291185A (zh) | 2020-06-16 |
CN111291185B (zh) | 2023-09-22 |
US20230153526A1 (en) | 2023-05-18 |
US11922121B2 (en) | 2024-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- WO2021147726A1 (zh) | Information extraction method, apparatus, electronic device and storage medium | |
CN108875051B (zh) | 面向海量非结构化文本的知识图谱自动构建方法及系统 | |
CN107783960B (zh) | 用于抽取信息的方法、装置和设备 | |
US20220050967A1 (en) | Extracting definitions from documents utilizing definition-labeling-dependent machine learning background | |
US20220171936A1 (en) | Analysis of natural language text in document | |
CN113591483A (zh) | 一种基于序列标注的文档级事件论元抽取方法 | |
WO2021042516A1 (zh) | 命名实体识别方法、装置及计算机可读存储介质 | |
US20240004677A1 (en) | Machine-Learned Models for User Interface Prediction, Generation, and Interaction Understanding | |
CN113392209B (zh) | 一种基于人工智能的文本聚类方法、相关设备及存储介质 | |
CN112989841A (zh) | 一种用于突发事件新闻识别与分类的半监督学习方法 | |
CN111026880B (zh) | 基于联合学习的司法知识图谱构建方法 | |
CN115238690A (zh) | 一种基于bert的军事领域复合命名实体识别方法 | |
CN116127090B (zh) | 基于融合和半监督信息抽取的航空系统知识图谱构建方法 | |
CN112101031A (zh) | 一种实体识别方法、终端设备及存储介质 | |
CN114416995A (zh) | 信息推荐方法、装置及设备 | |
CN115203507A (zh) | 一种面向文书领域的基于预训练模型的事件抽取方法 | |
CN116383399A (zh) | 一种事件舆情风险预测方法及系统 | |
CN115688920A (zh) | 知识抽取方法、模型的训练方法、装置、设备和介质 | |
CN116150361A (zh) | 一种财务报表附注的事件抽取方法、系统及存储介质 | |
CN115292568B (zh) | 一种基于联合模型的民生新闻事件抽取方法 | |
Li et al. | Multi-task deep learning model based on hierarchical relations of address elements for semantic address matching | |
CN112632223B (zh) | 案事件知识图谱构建方法及相关设备 | |
CN116384403A (zh) | 一种基于场景图的多模态社交媒体命名实体识别方法 | |
CN109582958A (zh) | 一种灾难故事线构建方法及装置 | |
CN114417016A (zh) | 一种基于知识图谱的文本信息匹配方法、装置及相关设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21745039 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21745039 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 27/03/2023) |
|