CN111639498A - Knowledge extraction method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN111639498A (application number CN202010318382.8A)
- Authority
- CN
- China
- Prior art keywords
- entity
- knowledge
- text
- initial
- entity list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a knowledge extraction method, a knowledge extraction device, an electronic device and a storage medium. The method preprocesses source data to obtain text data and identifies entities in the text data through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, so that unstructured data is converted accurately. The initial entity list is expanded based on a knowledge graph to obtain a candidate entity list, comprehensively covering similar expressions. A semantic matching model trained with an Attention-DSSM algorithm then disambiguates the candidate entity list to obtain a target entity; because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, the target entity obtained after data analysis is more accurate. Finally, the target entity is linked to a node of the knowledge graph and knowledge is extracted automatically based on the information on the node, improving the efficiency and accuracy of knowledge extraction.
Description
Technical Field
The invention relates to the technical field of data analysis, in particular to a knowledge extraction method and device, electronic equipment and a storage medium.
Background
Current knowledge extraction usually depends on templates, trigger words or supervised learning: rules are summarized and data are labeled manually to form a rule base, and matching is performed against that rule base.
Such methods are difficult to maintain and port: a large number of rule templates must be constructed by experts in each field, data labeling is labor-intensive, the quality of the labeled data is hard to control, the overall cost is excessive, and new relations and classes are inconvenient to extend.
Disclosure of Invention
In view of the above, it is desirable to provide a knowledge extraction method, apparatus, electronic device and storage medium that strengthen the association between each word and the other words through an Attention mechanism, extract knowledge automatically according to the weight of key words, and improve the efficiency and accuracy of knowledge extraction.
A knowledge extraction method, the knowledge extraction method comprising:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
According to a preferred embodiment of the present invention, the preprocessing the source data to obtain text data includes:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
According to a preferred embodiment of the present invention, the knowledge extraction method further comprises:
configuring a sequence marking mode according to predefined demand data;
and adding the sequence labeling mode into a Bi-LSTM + CRF model to obtain the sequence labeling model.
According to the preferred embodiment of the present invention, the identifying the entities in the text data through the sequence labeling model based on Bi-LSTM + CRF to obtain the initial entity list includes:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
According to a preferred embodiment of the present invention, the expanding the initial entity list based on the preconfigured knowledge-graph to obtain the candidate entity list includes:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
According to the preferred embodiment of the present invention, the disambiguating the candidate entity list using the semantic matching model trained based on the Attention-DSSM algorithm to obtain the target entity includes:
coding each entity in the candidate entity list based on One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity in an Interaction layer, and outputting a semantic vector after Interaction of each entity;
and matching the semantic vector after interaction of each entity with the entities on the nodes of the knowledge graph on a matching layer, and outputting the entity with the highest matching degree as the target entity.
According to a preferred embodiment of the present invention, said extracting knowledge based on information on said nodes comprises:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
A knowledge extraction device, the knowledge extraction device comprising:
the acquisition unit is used for acquiring source data when a knowledge extraction instruction is received;
the preprocessing unit is used for preprocessing the source data to obtain text data;
the identification unit is used for identifying the entity in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
the expansion unit is used for expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
the disambiguation unit is used for carrying out disambiguation processing on the candidate entity list by adopting a semantic matching model trained based on the Attention-DSSM algorithm to obtain a target entity;
a linking unit, configured to link the target entity to a node of the knowledge-graph;
and the extraction unit is used for extracting knowledge based on the information on the nodes.
According to a preferred embodiment of the present invention, the preprocessing unit is specifically configured to:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
According to a preferred embodiment of the present invention, the knowledge extraction device further comprises:
the configuration unit is used for configuring the sequence marking mode according to predefined demand data;
and the adding unit is used for adding the sequence labeling mode into the Bi-LSTM + CRF model to obtain the sequence labeling model.
According to a preferred embodiment of the present invention, the identification unit is specifically configured to:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
According to a preferred embodiment of the present invention, the extension unit is specifically configured to:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
According to a preferred embodiment of the invention, the disambiguation unit is specifically configured to:
coding each entity in the candidate entity list based on One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity in an Interaction layer, and outputting a semantic vector after Interaction of each entity;
and matching the semantic vector after interaction of each entity with the entities on the nodes of the knowledge graph on a matching layer, and outputting the entity with the highest matching degree as the target entity.
According to a preferred embodiment of the present invention, the extracting unit is specifically configured to:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the knowledge extraction method.
A computer-readable storage medium having at least one instruction stored therein, the at least one instruction being executable by a processor in an electronic device to implement the knowledge extraction method.
The technical scheme shows that, when a knowledge extraction instruction is received, the method acquires source data and preprocesses it to obtain text data, unifying the format. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, so that unstructured data is converted accurately. The initial entity list is then expanded based on a pre-configured knowledge graph to obtain a candidate entity list, comprehensively covering similar representations. A semantic matching model trained with an Attention-DSSM algorithm disambiguates the candidate entity list to obtain a target entity; because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the added Interaction layer strengthens the association between the texts to be matched, the obtained target entity is more accurate. Finally, the target entity is linked to a node of the knowledge graph and knowledge is extracted automatically based on the information on the node, improving the efficiency and accuracy of knowledge extraction.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the knowledge extraction method of the present invention.
FIG. 2 is a schematic diagram of extracting a relationship network from an exemplary data source according to the present invention.
FIG. 3 is a functional block diagram of a preferred embodiment of the knowledge extraction apparatus of the present invention.
FIG. 4 is a schematic structural diagram of an electronic device implementing a knowledge extraction method according to a preferred embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a preferred embodiment of the knowledge extraction method of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
The knowledge extraction method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive Internet Protocol Television (IPTV), an intelligent wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud computing (cloud computing) based cloud consisting of a large number of hosts or network servers.
The Network where the electronic device is located includes, but is not limited to, the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
And S10, acquiring the source data when the knowledge extraction instruction is received.
In at least one embodiment of the invention, the knowledge extraction instructions may be triggered by specified users including, but not limited to: project managers, and the like.
Further, the source data may be obtained from a configuration database.
For example: when knowledge extraction based on a legal knowledge base is performed, the source data may be obtained from a database accessible to the court, such as a database inside the court or a source database on the web.
The source data may be a picture type or a text type, and the present invention is not limited thereto.
And S11, preprocessing the source data to obtain text data.
In order for the machine to be able to identify the source data, the electronic device first needs to pre-process the source data.
Specifically, the preprocessing the source data to obtain text data includes:
when the source data is of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and encoding the filtered text based on a UTF-8 (8-bit Unicode Transformation Format) encoding algorithm to obtain the text data.
Or when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
Wherein the electronic device may employ an OCR (Optical Character Recognition) algorithm to convert the source data into the initial text.
Meanwhile, when the filtered text is encoded with the UTF-8 algorithm, operations such as full-width to half-width symbol conversion and removal of garbled characters can also be performed on the filtered text, finally unifying the encoding.
Further, the text data may be in a TXT text format, or may be in other text formats, which is not limited in the present invention.
Through this implementation, the source data can be filtered and cleaned to eliminate interference information and converted into a uniform text format, so that the data format is unified and the preprocessed text data can be recognized and processed by a machine.
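As an illustrative sketch (not part of the claimed method), the preprocessing step described above can be approximated as follows; the use of NFKC normalization for full-width to half-width conversion and the particular filtering rules are assumptions for illustration:

```python
import re
import unicodedata

def preprocess_text(raw: str) -> bytes:
    """Filter and clean raw text, then encode it as UTF-8."""
    # Convert full-width (full-angle) characters to their half-width forms.
    normalized = unicodedata.normalize("NFKC", raw)
    # Remove control characters and replacement characters left by bad decoding.
    cleaned = re.sub(r"[\u0000-\u0008\u000B-\u001F\uFFFD]", "", normalized)
    # Collapse runs of whitespace introduced by OCR line breaks.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Encode the filtered text with the UTF-8 algorithm.
    return cleaned.encode("utf-8")
```

For a picture-type source, an OCR step would first produce the raw string passed to this function.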
And S12, identifying the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list.
The text data obtained by preprocessing the source data is unstructured text data, so that key entity information in the text data needs to be identified, which is equivalent to performing sequence labeling on the text data. Therefore, the electronic device needs to first construct a sequence annotation model associated with the current knowledge extraction instruction.
Specifically, the knowledge extraction method further includes:
the electronic equipment configures a sequence labeling mode according to predefined demand data, and adds the sequence labeling mode to a Bi-LSTM + CRF model to obtain the sequence labeling model.
The sequence labeling mode can be configured according to specific task requirements.
For example: the same label may not be output consecutively, etc.
It should be noted that the Bi-LSTM (Bidirectional Long Short-Term Memory) layer models long-distance dependencies and strengthens the relation between each character and its context, while the CRF (Conditional Random Field) layer can accommodate arbitrary context information, making feature design flexible. The CRF layer models transitions between character labels and takes the ordering of output labels into account, thereby achieving a more accurate recognition effect.
In at least one embodiment of the present invention, the identifying the entities in the text data through the Bi-LSTM + CRF-based sequence labeling model, and obtaining the initial entity list includes:
the electronic equipment inputs the text data into the sequence labeling model based on the Bi-LSTM + CRF, obtains output probability and transition probability of each corresponding label at each sequence position in a Softmax layer, calculates sum of the output probability and the transition probability of each label as score of each label for each sequence position, determines the label with the highest score as the output label of each sequence position, and combines the output labels of each sequence position to obtain the initial entity list.
Specifically, for each input X = (x_1, x_2, …, x_n), a predicted tag sequence Y = (y_1, y_2, …, y_n) can be obtained, and its score is defined as follows:

score(X, Y) = Σ_{i=1}^{n} P_{i,y_i} + Σ_{i=1}^{n-1} A_{y_i,y_{i+1}}

where P_{i,y_i} is the probability, output by the Softmax layer, that the i-th sequence position takes label y_i, and A_{y_i,y_{i+1}} is the transition probability from y_i to y_{i+1}.
According to the formula, a predicted sequence with a high score is not necessarily the one in which every position takes the label with the maximum probability output by the Softmax layer; the transition probabilities are considered comprehensively, i.e. the sequence labeling mode must be satisfied (for example, B cannot be followed by B).
For example: suppose that after Bi-LSTM processing the per-position most likely output sequence is B B I B I O O. Since, according to the sequence labeling mode, the probability of the transition B -> B in the transition probability matrix is very small, even negative, this sequence will not obtain the highest score, i.e. it does not yield the initial entity list.
Continuing the above example, if B-PER denotes the first character of a person's name, E-PER the last character of a person's name, O an independent character, B-ORG the first character of an organization name, and I-ORG a middle character of an organization name, then the initial entity list obtained by merging tags of the same category in the sequence may include: the sequence (B, E), representing a person's name; the sequence (B, I, E), representing an organization name; and the sequence (O), representing an independent character.
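The merging of same-category tags into entities can be sketched as follows; this is an illustrative decoding of the B/I/E/O scheme described above, and the character and label inputs in the test are hypothetical examples:

```python
def merge_bio_labels(chars, labels):
    """Merge per-position B/I/E/O labels into an initial entity list.

    Each label is either "O" or a prefix (B-, I-, E-) plus a type
    suffix such as PER or ORG, as in the example in the text.
    """
    entities, current, current_type = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):
            if current:  # close any dangling entity
                entities.append(("".join(current), current_type))
            current, current_type = [ch], lab[2:]
        elif lab.startswith(("I-", "E-")) and current_type == lab[2:]:
            current.append(ch)
            if lab.startswith("E-"):  # last character of the entity
                entities.append(("".join(current), current_type))
                current, current_type = [], None
        else:  # "O" or a label inconsistent with the open entity
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities
```

In practice the label at each position would come from the highest-scoring label of the Bi-LSTM + CRF model described above.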
S13, expanding the initial entity list based on the pre-configured knowledge graph to obtain a candidate entity list.
The knowledge graph can be configured in advance according to each technical field, such as: legal knowledge maps in the legal domain, etc.
It should be noted that each entity in the initial entity list may be a partial or alternative representation of an entity; therefore, a surface-form expansion needs to be performed on each entity to obtain the candidate entity list.
For example: the "XX middle court" and the "XX City middle people court" are different representations of the same entity.
Specifically, the expanding the initial entity list based on the preconfigured knowledge graph to obtain the candidate entity list includes:
the electronic equipment calculates the cosine similarity between each entity in the initial entity list and the entities on each node in the knowledge graph, and obtains at least one entity with the cosine similarity larger than or equal to the preset similarity from each node as a candidate entity, and the electronic equipment constructs the candidate entity list according to the initial entity list and the candidate entity.
The preset similarity may be configured by the user, for example: 99.7%.
The cosine similarity measures the similarity between two texts by the cosine of the angle between their vectors in a vector space; compared with distance metrics, it focuses more on the difference in direction between the two vectors. Generally, after obtaining vector representations of two texts through an embedding, the cosine similarity can be used to calculate the similarity between them.
By the embodiment, the similarity between each entity and the entity on the node of the knowledge graph can be calculated by adopting cosine similarity so as to judge whether the coreference relationship exists, and further the expansion of the initial entity list is realized to obtain the candidate entity list with more comprehensive coverage.
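A minimal sketch of this expansion step follows; the embedding lookup table and the vectors in the test are hypothetical, and the default threshold mirrors the 99.7% example above:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_entities(initial, graph_entities, embed, threshold=0.997):
    """Add every knowledge-graph entity whose cosine similarity to an
    initial entity meets the preset threshold, forming the candidate list."""
    candidates = list(initial)
    for ent in initial:
        for node_ent in graph_entities:
            if node_ent not in candidates and \
                    cosine_similarity(embed[ent], embed[node_ent]) >= threshold:
                candidates.append(node_ent)
    return candidates
```

Real systems would use high-dimensional embedding vectors rather than the toy 2-D vectors shown in the test.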
S14, disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-Deep Structured Semantic Model (Attention-DSSM) algorithm to obtain a target entity.
The candidate entity list obtained by expanding the initial entity list may contain multiple candidates; therefore, the candidate entity list needs to be further disambiguated, so that the single target entity that best matches an entity on the knowledge-graph nodes can be determined more accurately from the multiple similar representations.
Specifically, the disambiguating the candidate entity list by using the semantic matching model trained based on the Attention-DSSM algorithm to obtain the target entity includes:
the electronic equipment encodes each entity in the candidate entity list based on an One-Hot encoding algorithm to obtain a word ID of each entity, inputs the word ID of each entity into a pre-configured dictionary, outputs a word vector of each entity, processes the word vector of each entity based on an Attention mechanism to obtain a semantic representation of each entity, further interacts the semantic representation of each entity on an Interaction layer (Interaction layer), outputs the semantic vector after Interaction of each entity, matches the semantic vector after Interaction of each entity with the entity on the knowledge graph node on a matching layer, and outputs the entity with the highest matching degree as the target entity.
The traditional DSSM (Deep Structured Semantic Model) represents the context information of the extracted entity and of the candidate entities as low-dimensional semantic vectors and calculates the distance between the two semantic vectors by cosine distance. However, because the DSSM adopts a bag-of-words model, word order information and context information are lost. In addition, the DSSM is a weakly supervised end-to-end model whose prediction results are hard to control; it cannot capture long-distance information and suffers from problems such as vanishing gradients.
In consideration of the particularity of the semantic matching task, the embodiment adopts a semantic matching model trained based on the Attention-DSSM algorithm, and the semantic matching model may include, from bottom to top: the system comprises an input layer, a semantic representation layer, an Interaction layer and a matching layer.
Through this implementation, the word vector of each entity is processed with the Attention mechanism, which enhances the semantic representation capability, strengthens the association between each word in a text and the other words, and raises the weight of key words, yielding high accuracy. In addition, the added Interaction layer lets the two texts to be matched interact and express each other, strengthening their association and improving the generalization capability of the model.
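The attention-weighted pooling at the heart of the semantic representation layer can be sketched as follows; this is a simplified dot-product attention over word vectors, not the full Attention-DSSM architecture, and the query vector is an illustrative assumption:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vectors, query):
    """Weight each word vector by its dot-product attention score against
    a query vector, then sum into one semantic representation, so that
    key words receive higher weight."""
    scores = [sum(q * w for q, w in zip(query, wv)) for wv in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    return [sum(weights[i] * word_vectors[i][d]
                for i in range(len(word_vectors)))
            for d in range(dim)]
```

The pooled vectors of the two texts would then be interacted and matched by cosine similarity in the layers above.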
S15, linking the target entity to the nodes of the knowledge-graph.
Specifically, the electronic device obtains a node on the knowledge-graph corresponding to the target entity, and further links the target entity to the node of the knowledge-graph.
For example: and linking the XX middle court in the legal document with the XX city middle people court node in the legal knowledge graph, wherein the attributes and the relationship of the XX middle court are the same as those of the XX city middle people court.
It should be noted that entity linking is a relatively mature technology, and the present invention does not describe it in detail here.
And S16, extracting knowledge based on the information on the node.
Specifically, the extracting knowledge based on the information on the node includes:
the electronic equipment acquires at least one path between the nodes and the associated information on each path from the information on the nodes, and extracts at least one relation network based on the associated information on each path and the corresponding path.
As shown in fig. 2, for the acquired source data, a corresponding relationship network may be extracted.
It should be noted that the knowledge graph is composed of a large amount of knowledge and the relations between pieces of knowledge: nodes in the network represent entities existing in the real world, edges between nodes represent the relations between two entities, and through this combination of points and edges, real-world knowledge is abstracted into a knowledge network suitable for machine processing.
Through the above embodiment, knowledge extraction is performed based on the information on the nodes, implicit information of the linked nodes can be acquired after the target entity is linked to the nodes of the knowledge graph, and extraction of relationship and event information can be performed according to paths among the nodes.
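The path-based extraction of relation networks can be sketched as follows; the adjacency-list graph layout and the relation labels in the test are hypothetical examples, not the patent's storage format:

```python
from collections import deque

def extract_paths(graph, start, end):
    """Enumerate simple paths (with edge relation labels) between two
    nodes of a small adjacency-list knowledge graph, from which a
    relation network can be assembled."""
    paths, queue = [], deque([[(start, None)]])
    while queue:
        path = queue.popleft()
        node = path[-1][0]
        if node == end:
            paths.append(path)
            continue
        for nxt, rel in graph.get(node, []):
            if all(nxt != n for n, _ in path):  # avoid revisiting nodes
                queue.append(path + [(nxt, rel)])
    return paths
```

Each returned path pairs every node after the first with the relation label of the edge leading into it, which is the associated information extracted along that path.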
The technical scheme shows that, when a knowledge extraction instruction is received, the method acquires source data and preprocesses it to obtain text data, unifying the format. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, so that unstructured data is converted accurately. The initial entity list is then expanded based on a pre-configured knowledge graph to obtain a candidate entity list, comprehensively covering similar representations. A semantic matching model trained with an Attention-DSSM algorithm disambiguates the candidate entity list to obtain a target entity; because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the added Interaction layer strengthens the association between the texts to be matched, the obtained target entity is more accurate. Finally, the target entity is linked to a node of the knowledge graph and knowledge is extracted automatically based on the information on the node, improving the efficiency and accuracy of knowledge extraction.
Fig. 3 is a functional block diagram of the knowledge extracting apparatus according to the preferred embodiment of the present invention. The knowledge extraction device 11 comprises an acquisition unit 110, a preprocessing unit 111, an identification unit 112, an expansion unit 113, a disambiguation unit 114, a linking unit 115, an extraction unit 116, a configuration unit 117, and an addition unit 118. The module/unit referred to in the present invention refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
When receiving a knowledge extraction instruction, the acquisition unit 110 acquires source data.
In at least one embodiment of the invention, the knowledge extraction instructions may be triggered by specified users including, but not limited to: project managers, and the like.
Further, the source data may be obtained from a configuration database.
For example: when knowledge extraction based on a legal knowledge base is performed, the source data may be obtained from databases accessible to the court, such as a database inside the court or a source database on the web.
The source data may be of a picture type or a text type, and the present invention is not limited thereto.
The preprocessing unit 111 preprocesses the source data to obtain text data.
In order to enable the machine to recognize the source data, the preprocessing unit 111 first needs to preprocess the source data.
Specifically, the preprocessing unit 111 preprocesses the source data to obtain text data, and includes:
when the source data is of a picture type, the preprocessing unit 111 converts the source data into an initial text, filters and cleans the initial text to obtain a filtered text, and encodes the filtered text with the UTF-8 (8-bit Unicode Transformation Format) encoding algorithm to obtain the text data.
Or, when the source data is of a text type, the preprocessing unit 111 filters and cleans the source data to obtain a filtered text, and encodes the filtered text with the UTF-8 encoding algorithm to obtain the text data.
Wherein the preprocessing unit 111 may employ an OCR (Optical Character Recognition) algorithm to convert the source data into the initial text.
Meanwhile, in encoding the filtered text with the UTF-8 encoding algorithm, operations such as full-width/half-width symbol conversion and removal of garbled characters can be performed on the filtered text, finally unifying the encoding.
Further, the text data may be in a TXT text format, or may be in other text formats, which is not limited in the present invention.
Through the above implementation, the source data can be filtered and cleaned to remove interfering information and then converted into a uniform text format, so that the data format is unified and the preprocessed text data can be recognized and processed by a machine.
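The filter-clean-encode pipeline described above can be sketched as follows. This is a minimal illustration, not the patent's exact procedure: the cleaning rule (dropping unprintable characters) and the use of NFKC normalization for full-width/half-width conversion are assumptions.

```python
import unicodedata

def preprocess(text: str) -> bytes:
    """Filter and clean raw text, then unify its encoding as UTF-8."""
    # NFKC normalization converts full-width (fullwidth-form) symbols
    # and letters to their half-width equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    # Drop control characters and other unprintable residue (garbled code).
    cleaned = "".join(ch for ch in normalized if ch.isprintable() or ch in "\n\t")
    # Encode the filtered text with the UTF-8 encoding algorithm.
    return cleaned.encode("utf-8")

preprocess("Ｈｅｌｌｏ")  # full-width input -> b"Hello"
```

In practice, the picture branch would first run OCR to obtain the initial text before this step.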
The identification unit 112 identifies the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF, resulting in an initial entity list.
The text data obtained by preprocessing the source data is unstructured, so the key entity information in it needs to be identified, which amounts to performing sequence labeling on the text data. Therefore, a sequence labeling model associated with the current knowledge extraction instruction must first be constructed.
Specifically, the configuration unit 117 configures a sequence labeling mode according to predefined demand data, and the adding unit 118 adds the sequence labeling mode to the Bi-LSTM + CRF model to obtain the sequence labeling model.
The sequence labeling mode can be configured according to specific task requirements.
For example: the same label may not be output consecutively, etc.
It should be noted that the Bi-LSTM (Bidirectional Long Short-Term Memory) layer provides long-distance dependency modeling and strengthens the relation between each character and its context characters, while the CRF (Conditional Random Field) can accommodate arbitrary context information, making feature design flexible. The CRF layer models the transfer of and correspondence between character features and at the same time considers the order of the output labels, thereby achieving a more accurate recognition effect.
In at least one embodiment of the present invention, the identifying unit 112 identifies the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF, and the obtaining of the initial entity list includes:
the identification unit 112 inputs the text data into the Bi-LSTM + CRF-based sequence labeling model and obtains, from the Softmax layer, the output probability and the transition probability of each candidate label at each sequence position. For each sequence position, the identification unit 112 computes the sum of a label's output probability and transition probability as that label's score and determines the label with the highest score as the output label of that position. The identification unit 112 then combines the output labels of all sequence positions to obtain the initial entity list.
Specifically, for each input sequence X = (x1, x2, …, xn), a predicted tag sequence Y = (y1, y2, …, yn) can be obtained, and its score is defined as follows:
score(X, Y) = Σ(i = 1…n) P(i, yi) + Σ(i = 1…n−1) A(yi, yi+1)
where P(i, yi) is the probability, output by the Softmax layer, that the label at the i-th sequence position is yi, and A(yi, yi+1) is the transition probability from yi to yi+1.
According to the formula, a predicted sequence with a high score does not necessarily take, at every position, the label with the maximum probability output by the Softmax layer; the transition probabilities are considered jointly, i.e. the sequence labeling mode must be satisfied (for example, B cannot be followed by B).
For example: suppose that after Bi-LSTM processing the most likely per-position output sequence is B B I B I O O. Since, according to the sequence labeling mode, the probability of B -> B in the transition probability matrix is very small or even negative, this sequence will not obtain the highest score, i.e. it does not yield the initial entity list.
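The scoring rule can be made concrete with a toy two-position example; all probability values below are invented for illustration and are not from the patent.

```python
# Output (emission) probability of each label at each sequence position,
# as produced by the Softmax layer (invented values).
emission = [
    {"B": 0.6, "I": 0.3, "O": 0.1},
    {"B": 0.5, "I": 0.4, "O": 0.1},
]
# Transition probabilities; B -> B is forbidden by the sequence
# labeling mode, so its value is negative.
transition = {("B", "I"): 0.8, ("B", "B"): -1.0}

def score(tags):
    """Sum of per-position output probabilities plus transition probabilities."""
    s = sum(emission[i][t] for i, t in enumerate(tags))
    s += sum(transition[(a, b)] for a, b in zip(tags, tags[1:]))
    return s

# "B" has the highest emission at both positions, yet the forbidden
# B -> B transition makes (B, I) outscore (B, B):
score(["B", "I"])  # 0.6 + 0.4 + 0.8 = 1.8
score(["B", "B"])  # 0.6 + 0.5 - 1.0 = 0.1
```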
Continuing the example above, if B-PER denotes the first-character tag of a person's name, E-PER the last-character tag of a person's name, O an independent-character tag, B-ORG the first-character tag of an organization name, and I-ORG a middle-character tag of an organization name, the initial entity list obtained by merging tag items of the same category in the sequence may include the following sequences: (B, E), representing a person's name; (B, I, E), representing an organization name; and (O), representing an independent character.
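Merging per-character tags into entities under the tag scheme of the example above might look like this; the helper name `merge_tags` and the sample strings are hypothetical.

```python
def merge_tags(chars, tags):
    """Merge per-character tags (B = first, I = middle, E = last,
    O = independent) into (entity text, entity type) pairs."""
    entities, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B":          # first character of an entity
            current, current_type = [ch], etype
        elif prefix == "I":        # middle character
            current.append(ch)
        elif prefix == "E":        # last character: emit the entity
            current.append(ch)
            entities.append(("".join(current), current_type))
            current = []
        else:                      # "O": independent character
            current = []
    return entities

merge_tags(list("张三在ACM"),
           ["B-PER", "E-PER", "O", "B-ORG", "I-ORG", "E-ORG"])
# -> [("张三", "PER"), ("ACM", "ORG")]
```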
The expansion unit 113 expands the initial entity list based on a preconfigured knowledge graph to obtain a candidate entity list.
The knowledge graph can be configured in advance according to each technical field, such as: legal knowledge maps in the legal domain, etc.
It should be noted that each entity in the initial entity list may be only a partial or alternative representation of an entity; therefore, surface-form expansion needs to be performed on each entity to obtain the candidate entity list.
For example: the "XX middle court" and the "XX City middle people court" are different representations of the same entity.
Specifically, the expanding unit 113 expands the initial entity list based on a preconfigured knowledge graph, and obtaining a candidate entity list includes:
the expansion unit 113 calculates the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph, obtains from the nodes at least one entity whose cosine similarity is greater than or equal to a preset similarity as a candidate entity, and constructs the candidate entity list from the initial entity list and the candidate entities.
The preset similarity may be configured by the user, for example: 99.7%.
Cosine similarity measures the similarity of two texts by the cosine of the angle between their vectors in a vector space; compared with distance metrics, it focuses more on the difference in direction between the two vectors. Generally, after vector representations of two texts are obtained through an embedding, the cosine similarity can be used to calculate the similarity between them.
By the embodiment, the similarity between each entity and the entity on the node of the knowledge graph can be calculated by adopting cosine similarity so as to judge whether the coreference relationship exists, and further the expansion of the initial entity list is realized to obtain the candidate entity list with more comprehensive coverage.
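A sketch of this expansion step with hand-made three-dimensional embeddings; the vectors and node names are invented, and real embeddings would be much higher-dimensional.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Embedding of an extracted entity and of the entities on two graph nodes.
extracted = [0.9, 0.1, 0.0]
graph_nodes = {
    "XX City Intermediate People's Court": [0.89, 0.11, 0.01],
    "XX Procuratorate": [0.1, 0.2, 0.95],
}

threshold = 0.997  # the preset similarity (99.7% in the example above)
candidates = [name for name, vec in graph_nodes.items()
              if cosine(extracted, vec) >= threshold]
# Only the court node passes the threshold and becomes a candidate entity.
```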
The disambiguation unit 114 performs disambiguation on the candidate entity list by using a semantic matching model trained based on the Attention-DSSM (Attention-based Deep Structured Semantic Model) algorithm to obtain a target entity.
The candidate entity list obtained by expanding the initial entity list may contain multiple candidates; therefore, it needs to be further disambiguated, so that the single target entity best matching an entity on a knowledge-graph node can be determined more accurately from among the several similar representations.
Specifically, the disambiguation unit 114 performs disambiguation on the candidate entity list by using a semantic matching model trained based on the Attention-DSSM algorithm, and obtaining a target entity includes:
the disambiguation unit 114 encodes each entity in the candidate entity list with the One-Hot encoding algorithm to obtain each entity's word ID, inputs the word IDs into a preconfigured dictionary, and outputs each entity's word vector. It processes each word vector with an Attention mechanism to obtain a semantic representation of each entity, lets these semantic representations interact at an Interaction layer, and outputs each entity's post-interaction semantic vector. Finally, at a matching layer, it matches each post-interaction semantic vector against the entities on the knowledge-graph nodes and outputs the entity with the highest matching degree as the target entity.
The traditional DSSM (Deep Structured Semantic Model) expresses the context information of the extracted entity and of the candidate entities as low-dimensional semantic vectors and measures the distance between the two semantic vectors by cosine distance. However, because DSSM uses a bag-of-words model, word-order and context information are lost. In addition, the DSSM model is a weakly supervised end-to-end model: its predictions are hard to control, it cannot capture long-distance information, and it suffers from problems such as vanishing gradients.
In consideration of the particularity of the semantic matching task, the embodiment adopts a semantic matching model trained based on the Attention-DSSM algorithm, and the semantic matching model may include, from bottom to top: the system comprises an input layer, a semantic representation layer, an Interaction layer and a matching layer.
Through this implementation, each entity's word vector is processed with the Attention mechanism, which strengthens the semantic representation, reinforces the association between each word in a text and the other words, and raises the weight of key words, giving high accuracy. In addition, the added Interaction layer makes the two texts to be matched interact with each other; by expressing each in terms of the other, their association is strengthened and the generalization ability of the model is improved.
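The effect of attention weighting can be illustrated with a toy pooling function. Real Attention-DSSM learns its weights during training; here the query vector is fixed, so this is only an illustration of the mechanism, not the model.

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to one."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vectors, query):
    """Pool word vectors into a single semantic vector, weighting each
    word by its dot-product relevance to a query (key words get more weight)."""
    scores = [sum(a * b for a, b in zip(w, query)) for w in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    return [sum(weights[i] * word_vectors[i][d] for i in range(len(word_vectors)))
            for d in range(dim)]

# The word aligned with the query receives the larger weight.
attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])
```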
The linking unit 115 links the target entity to a node of the knowledge-graph.
Specifically, the linking unit 115 acquires a node on the knowledge-graph corresponding to the target entity, and further links the target entity to the node of the knowledge-graph.
For example: and linking the XX middle court in the legal document with the XX city middle people court node in the legal knowledge graph, wherein the attributes and the relationship of the XX middle court are the same as those of the XX city middle people court.
It should be noted that entity linking is a relatively mature technology, so the present invention does not describe it in detail here.
The extraction unit 116 performs knowledge extraction based on the information on the node.
Specifically, the extracting unit 116 performs knowledge extraction based on the information on the node, including:
the extraction unit 116 obtains at least one path between nodes and associated information on each path from the information on the nodes, and extracts at least one relationship network based on the associated information on each path and the corresponding path.
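Extracting paths and their associated information between nodes can be sketched on a tiny adjacency-list graph; the node and relation names are invented for illustration.

```python
from collections import deque

# Toy knowledge graph: node -> list of (relation, neighbor) edges.
graph = {
    "Zhang San": [("defendant_in", "Case 123")],
    "Case 123": [("heard_by", "XX City Intermediate People's Court")],
    "XX City Intermediate People's Court": [],
}

def paths_between(start, end):
    """Breadth-first enumeration of edge-labeled paths from start to end.
    Assumes an acyclic graph; a visited set would be needed otherwise."""
    queue = deque([(start, [])])
    found = []
    while queue:
        node, path = queue.popleft()
        if node == end and path:
            found.append(path)
            continue
        for relation, neighbor in graph.get(node, []):
            queue.append((neighbor, path + [(node, relation, neighbor)]))
    return found

paths_between("Zhang San", "XX City Intermediate People's Court")
# each returned path is a chain of (node, relation, neighbor) triples
# from which a relationship network can be extracted
```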
As shown in fig. 2, for the acquired source data, a corresponding relationship network may be extracted.
Fig. 4 is a schematic structural diagram of an electronic device according to a preferred embodiment of the knowledge extraction method of the present invention.
The electronic device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a knowledge extraction program, stored in the memory 12 and executable on the processor 13.
It will be understood by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation on it. The electronic device 1 may have a bus-type or star-type structure, may include more or fewer hardware or software components than shown, or a different arrangement of components; for example, it may further include input/output devices, network access devices, and the like.
It should be noted that the electronic device 1 is only an example; other existing or future electronic products that can be adapted to the present invention should likewise fall within the scope of protection of the present invention and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, which includes flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. The memory 12 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 may be used not only to store application software installed in the electronic apparatus 1 and various types of data such as codes of a knowledge extraction program, but also to temporarily store data that has been output or is to be output.
The processor 13 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 13 is a Control Unit (Control Unit) of the electronic device 1, connects various components of the electronic device 1 by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (for example, executing a knowledge extraction program and the like) stored in the memory 12 and calling data stored in the memory 12.
The processor 13 executes an operating system of the electronic device 1 and various installed application programs. The processor 13 executes the application program to implement the steps in the above-described respective knowledge extraction method embodiments, such as steps S10, S11, S12, S13, S14, S15, S16 shown in fig. 1.
Alternatively, the processor 13, when executing the computer program, implements the functions of the modules/units in the above device embodiments, for example:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the electronic device 1. For example, the computer program may be divided into an acquisition unit 110, a pre-processing unit 111, an identification unit 112, an extension unit 113, a disambiguation unit 114, a linking unit 115, an extraction unit 116, a configuration unit 117, an addition unit 118.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the knowledge extraction method according to the embodiments of the present invention.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented.
Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 4, but this does not indicate only one bus or one type of bus. The bus is arranged to enable connection communication between the memory 12 and at least one processor 13 or the like.
Although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 13 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
Fig. 4 only shows the electronic device 1 with components 12-13, and it will be understood by a person skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
In conjunction with fig. 1, the memory 12 in the electronic device 1 stores a plurality of instructions to implement a knowledge extraction method, and the processor 13 executes the plurality of instructions to implement:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
Specifically, the processor 13 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the instruction, which is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A knowledge extraction method, characterized by comprising:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
2. The knowledge extraction method of claim 1, wherein the preprocessing the source data to obtain text data comprises:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
3. The knowledge extraction method of claim 1, further comprising:
configuring a sequence marking mode according to predefined demand data;
and adding the sequence labeling mode into a Bi-LSTM + CRF model to obtain the sequence labeling model.
4. The knowledge extraction method of claim 1, wherein the identifying entities in the text data through a Bi-LSTM + CRF based sequence labeling model, and obtaining an initial entity list comprises:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
5. The method of knowledge extraction as claimed in claim 1, wherein the expanding the initial entity list based on a preconfigured knowledge-graph to obtain a candidate entity list comprises:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
6. The knowledge extraction method of claim 1, wherein the disambiguating the candidate entity list using a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity comprises:
coding each entity in the candidate entity list based on One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity in an Interaction layer, and outputting a semantic vector after Interaction of each entity;
and matching the semantic vector after interaction of each entity with the entities on the nodes of the knowledge graph on a matching layer, and outputting the entity with the highest matching degree as the target entity.
7. The knowledge extraction method of claim 1, wherein the extracting knowledge based on information on the node comprises:
acquiring, from the information on the nodes, at least one path between nodes and associated information on each path;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
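The path-based extraction of claim 7 can be sketched with a depth-first search over a hypothetical toy graph; the entities and relation labels below are invented for illustration:

```python
# Hypothetical toy knowledge graph: (head, tail) edges labelled with a relation.
EDGES = {
    ("alice", "acme"): "works_at",
    ("acme", "berlin"): "located_in",
}

def find_paths(graph, start, end, path=None):
    """Depth-first search yielding every simple path between two nodes."""
    path = path or [start]
    if start == end:
        yield path
        return
    for src, dst in graph:
        if src == start and dst not in path:
            yield from find_paths(graph, dst, end, path + [dst])

def relation_networks(graph, start, end):
    """Pair each path with the associated relation labels on its edges,
    mirroring the 'extract at least one relation network' step."""
    networks = []
    for path in find_paths(graph, start, end):
        rels = [graph[(a, b)] for a, b in zip(path, path[1:])]
        networks.append((path, rels))
    return networks

print(relation_networks(EDGES, "alice", "berlin"))
# [(['alice', 'acme', 'berlin'], ['works_at', 'located_in'])]
```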
8. A knowledge extraction apparatus, characterized by comprising:
the acquisition unit is used for acquiring source data when a knowledge extraction instruction is received;
the preprocessing unit is used for preprocessing the source data to obtain text data;
the identification unit is used for identifying the entity in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
the expansion unit is used for expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
the disambiguation unit is used for carrying out disambiguation processing on the candidate entity list by adopting a semantic matching model trained based on the Attention-DSSM algorithm to obtain a target entity;
a linking unit, configured to link the target entity to a node of the knowledge graph;
and the extraction unit is used for extracting knowledge based on the information on the nodes.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing at least one instruction; and
a processor executing the instructions stored in the memory to implement the knowledge extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein at least one instruction that is executed by a processor in an electronic device to implement the knowledge extraction method of any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010318382.8A CN111639498A (en) | 2020-04-21 | 2020-04-21 | Knowledge extraction method and device, electronic equipment and storage medium |
PCT/CN2020/104964 WO2021212682A1 (en) | 2020-04-21 | 2020-07-27 | Knowledge extraction method, apparatus, electronic device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010318382.8A CN111639498A (en) | 2020-04-21 | 2020-04-21 | Knowledge extraction method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111639498A true CN111639498A (en) | 2020-09-08 |
Family
ID=72328869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010318382.8A Pending CN111639498A (en) | 2020-04-21 | 2020-04-21 | Knowledge extraction method and device, electronic equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111639498A (en) |
WO (1) | WO2021212682A1 (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112328653A (en) * | 2020-10-30 | 2021-02-05 | 北京百度网讯科技有限公司 | Data identification method and device, electronic equipment and storage medium |
CN112380359A (en) * | 2021-01-18 | 2021-02-19 | 平安科技(深圳)有限公司 | Knowledge graph-based training resource allocation method, device, equipment and medium |
CN112395429A (en) * | 2020-12-02 | 2021-02-23 | 上海三稻智能科技有限公司 | Method, system and storage medium for determining, pushing and applying HS (high speed coding) codes based on graph neural network |
CN112426726A (en) * | 2020-12-09 | 2021-03-02 | 网易(杭州)网络有限公司 | Game event extraction method, device, storage medium and server |
CN112464669A (en) * | 2020-12-07 | 2021-03-09 | 宁波深擎信息科技有限公司 | Stock entity word disambiguation method, computer device and storage medium |
CN112507126A (en) * | 2020-12-07 | 2021-03-16 | 厦门渊亭信息科技有限公司 | Entity linking device and method based on recurrent neural network |
CN112508615A (en) * | 2020-12-10 | 2021-03-16 | 深圳市欢太科技有限公司 | Feature extraction method, feature extraction device, storage medium, and electronic apparatus |
CN112528660A (en) * | 2020-12-04 | 2021-03-19 | 北京百度网讯科技有限公司 | Method, apparatus, device, storage medium and program product for processing text |
CN113111660A (en) * | 2021-04-22 | 2021-07-13 | 脉景(杭州)健康管理有限公司 | Data processing method, device, equipment and storage medium |
CN113220835A (en) * | 2021-05-08 | 2021-08-06 | 北京百度网讯科技有限公司 | Text information processing method and device, electronic equipment and storage medium |
CN113268452A (en) * | 2021-05-25 | 2021-08-17 | 联仁健康医疗大数据科技股份有限公司 | Entity extraction method, device, equipment and storage medium |
CN113297419A (en) * | 2021-06-23 | 2021-08-24 | 南京谦萃智能科技服务有限公司 | Video knowledge point determining method and device, electronic equipment and storage medium |
WO2021190653A1 (en) * | 2020-10-31 | 2021-09-30 | 平安科技(深圳)有限公司 | Semantic parsing device and method, terminal, and storage medium |
CN113505889A (en) * | 2021-07-23 | 2021-10-15 | 中国平安人寿保险股份有限公司 | Processing method and device of atlas knowledge base, computer equipment and storage medium |
CN113705194A (en) * | 2021-04-12 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Extraction method and electronic equipment for short |
CN114186690A (en) * | 2022-02-16 | 2022-03-15 | 中国空气动力研究与发展中心计算空气动力研究所 | Aircraft knowledge graph construction method, device, equipment and storage medium |
CN114237829A (en) * | 2021-12-27 | 2022-03-25 | 南方电网物资有限公司 | Data acquisition and processing method for power equipment |
CN114780749A (en) * | 2022-05-05 | 2022-07-22 | 国网江苏省电力有限公司营销服务中心 | Electric power entity chain finger method based on graph attention machine mechanism |
CN115062619A (en) * | 2022-08-11 | 2022-09-16 | 中国人民解放军国防科技大学 | Chinese entity linking method, device, equipment and storage medium |
CN116826933A (en) * | 2023-08-30 | 2023-09-29 | 深圳科力远数智能源技术有限公司 | Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system |
CN117668259A (en) * | 2024-02-01 | 2024-03-08 | 华安证券股份有限公司 | Knowledge-graph-based inside and outside data linkage analysis method and device |
Families Citing this family (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114218931B (en) * | 2021-11-04 | 2024-01-23 | 北京百度网讯科技有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
CN114239583B (en) * | 2021-12-15 | 2023-04-07 | 北京百度网讯科技有限公司 | Method, device, equipment and medium for training entity chain finger model and entity chain finger |
CN114218403B (en) * | 2021-12-20 | 2024-04-09 | 平安付科技服务有限公司 | Fault root cause positioning method, device, equipment and medium based on knowledge graph |
CN114416976A (en) * | 2021-12-23 | 2022-04-29 | 北京百度网讯科技有限公司 | Text labeling method and device and electronic equipment |
CN114330345B (en) * | 2021-12-24 | 2023-01-17 | 北京百度网讯科技有限公司 | Named entity recognition method, training method, device, electronic equipment and medium |
CN114491232B (en) * | 2021-12-24 | 2023-03-24 | 北京百度网讯科技有限公司 | Information query method and device, electronic equipment and storage medium |
CN114330353B (en) * | 2022-01-06 | 2023-06-13 | 腾讯科技(深圳)有限公司 | Entity identification method, device, equipment, medium and program product of virtual scene |
CN114186759A (en) * | 2022-02-16 | 2022-03-15 | 杭州杰牌传动科技有限公司 | Material scheduling control method and system based on reducer knowledge graph |
CN114925158A (en) * | 2022-03-15 | 2022-08-19 | 青岛海尔科技有限公司 | Sentence text intention recognition method and device, storage medium and electronic device |
CN114385833B (en) * | 2022-03-23 | 2023-05-12 | 支付宝(杭州)信息技术有限公司 | Method and device for updating knowledge graph |
CN114896408B (en) * | 2022-03-24 | 2024-04-19 | 北京大学深圳研究生院 | Construction method of material knowledge graph, material knowledge graph and application |
CN114942998B (en) * | 2022-04-25 | 2024-02-13 | 西北工业大学 | Knowledge graph neighborhood structure sparse entity alignment method integrating multi-source data |
CN114912637B (en) * | 2022-05-21 | 2023-08-29 | 重庆大学 | Human-computer object knowledge graph manufacturing production line operation and maintenance decision method and system and storage medium |
CN114861677B (en) * | 2022-05-30 | 2023-04-18 | 北京百度网讯科技有限公司 | Information extraction method and device, electronic equipment and storage medium |
CN114707005B (en) * | 2022-06-02 | 2022-10-25 | 浙江建木智能系统有限公司 | Knowledge graph construction method and system for ship equipment |
CN115017255B (en) * | 2022-08-08 | 2022-11-01 | 杭州实在智能科技有限公司 | Knowledge base construction and search method based on tree structure |
CN115050085B (en) * | 2022-08-15 | 2022-11-01 | 珠海翔翼航空技术有限公司 | Method, system and equipment for recognizing objects of analog machine management system based on map |
CN115510245B (en) * | 2022-10-14 | 2024-05-14 | 北京理工大学 | Unstructured data-oriented domain knowledge extraction method |
CN115544626B (en) * | 2022-10-21 | 2023-10-20 | 清华大学 | Sub-model extraction method, device, computer equipment and medium |
CN115795051B (en) * | 2022-12-02 | 2023-05-23 | 中科雨辰科技有限公司 | Data processing system for acquiring link entity based on entity relationship |
CN115796189B (en) * | 2023-01-31 | 2023-05-12 | 北京面壁智能科技有限责任公司 | Semantic determining method, semantic determining device, electronic equipment and medium |
CN116070001B (en) * | 2023-02-03 | 2023-12-19 | 深圳市艾莉诗科技有限公司 | Information directional grabbing method and device based on Internet |
CN115827935B (en) * | 2023-02-09 | 2023-05-23 | 支付宝(杭州)信息技术有限公司 | Data processing method, device and equipment |
CN116127053B (en) * | 2023-02-14 | 2024-01-02 | 北京百度网讯科技有限公司 | Entity word disambiguation, knowledge graph generation and knowledge recommendation methods and devices |
CN115878849B (en) * | 2023-02-27 | 2023-05-26 | 北京奇树有鱼文化传媒有限公司 | Video tag association method and device and electronic equipment |
CN116503865A (en) * | 2023-05-29 | 2023-07-28 | 北京石油化工学院 | Hydrogen road transportation risk identification method and device, electronic equipment and storage medium |
CN116362166A (en) * | 2023-05-29 | 2023-06-30 | 青岛泰睿思微电子有限公司 | Pattern merging system and method for chip packaging |
CN116663537B (en) * | 2023-07-26 | 2023-11-03 | 中信联合云科技有限责任公司 | Big data analysis-based method and system for processing selected question planning information |
CN116719955B (en) * | 2023-08-09 | 2023-10-27 | 北京国电通网络技术有限公司 | Label labeling information generation method and device, electronic equipment and readable medium |
CN116756151B (en) * | 2023-08-17 | 2023-11-24 | 公安部信息通信中心 | Knowledge searching and data processing system |
CN116821712B (en) * | 2023-08-25 | 2023-12-19 | 中电科大数据研究院有限公司 | Semantic matching method and device for unstructured text and knowledge graph |
CN117272170B (en) * | 2023-09-20 | 2024-03-08 | 东旺智能科技(上海)有限公司 | Knowledge graph-based IT operation and maintenance fault root cause analysis method |
CN117012373B (en) * | 2023-10-07 | 2024-02-23 | 广州市妇女儿童医疗中心 | Training method, application method and system of grape embryo auxiliary inspection model |
CN117349386B (en) * | 2023-10-12 | 2024-04-12 | 吉玖(天津)技术有限责任公司 | Digital humane application method based on data strength association model |
CN117172323B (en) * | 2023-11-02 | 2024-01-23 | 知呱呱(天津)大数据技术有限公司 | Patent multi-domain knowledge extraction method and system based on feature alignment |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7014288B2 (en) * | 2018-03-07 | 2022-02-01 | 日本電気株式会社 | Knowledge expansion systems, methods and programs |
CN108733792B (en) * | 2018-05-14 | 2020-12-01 | 北京大学深圳研究生院 | Entity relation extraction method |
CN110609902B (en) * | 2018-05-28 | 2021-10-22 | 华为技术有限公司 | Text processing method and device based on fusion knowledge graph |
CN110362660B (en) * | 2019-07-23 | 2023-06-09 | 重庆邮电大学 | Electronic product quality automatic detection method based on knowledge graph |
2020
- 2020-04-21 CN CN202010318382.8A patent/CN111639498A/en active Pending
- 2020-07-27 WO PCT/CN2020/104964 patent/WO2021212682A1/en active Application Filing
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112328653A (en) * | 2020-10-30 | 2021-02-05 | 北京百度网讯科技有限公司 | Data identification method and device, electronic equipment and storage medium |
CN112328653B (en) * | 2020-10-30 | 2023-07-28 | 北京百度网讯科技有限公司 | Data identification method, device, electronic equipment and storage medium |
WO2021190653A1 (en) * | 2020-10-31 | 2021-09-30 | 平安科技(深圳)有限公司 | Semantic parsing device and method, terminal, and storage medium |
CN112395429A (en) * | 2020-12-02 | 2021-02-23 | 上海三稻智能科技有限公司 | Method, system and storage medium for determining, pushing and applying HS (high speed coding) codes based on graph neural network |
CN112528660B (en) * | 2020-12-04 | 2023-10-24 | 北京百度网讯科技有限公司 | Method, apparatus, device, storage medium and program product for processing text |
CN112528660A (en) * | 2020-12-04 | 2021-03-19 | 北京百度网讯科技有限公司 | Method, apparatus, device, storage medium and program product for processing text |
CN112464669A (en) * | 2020-12-07 | 2021-03-09 | 宁波深擎信息科技有限公司 | Stock entity word disambiguation method, computer device and storage medium |
CN112507126A (en) * | 2020-12-07 | 2021-03-16 | 厦门渊亭信息科技有限公司 | Entity linking device and method based on recurrent neural network |
CN112464669B (en) * | 2020-12-07 | 2024-02-09 | 宁波深擎信息科技有限公司 | Stock entity word disambiguation method, computer device, and storage medium |
CN112507126B (en) * | 2020-12-07 | 2022-11-15 | 厦门渊亭信息科技有限公司 | Entity linking device and method based on recurrent neural network |
CN112426726A (en) * | 2020-12-09 | 2021-03-02 | 网易(杭州)网络有限公司 | Game event extraction method, device, storage medium and server |
CN112508615A (en) * | 2020-12-10 | 2021-03-16 | 深圳市欢太科技有限公司 | Feature extraction method, feature extraction device, storage medium, and electronic apparatus |
CN112380359B (en) * | 2021-01-18 | 2021-04-20 | 平安科技(深圳)有限公司 | Knowledge graph-based training resource allocation method, device, equipment and medium |
CN112380359A (en) * | 2021-01-18 | 2021-02-19 | 平安科技(深圳)有限公司 | Knowledge graph-based training resource allocation method, device, equipment and medium |
CN113705194A (en) * | 2021-04-12 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Extraction method and electronic equipment for short |
CN113111660A (en) * | 2021-04-22 | 2021-07-13 | 脉景(杭州)健康管理有限公司 | Data processing method, device, equipment and storage medium |
CN113220835A (en) * | 2021-05-08 | 2021-08-06 | 北京百度网讯科技有限公司 | Text information processing method and device, electronic equipment and storage medium |
CN113220835B (en) * | 2021-05-08 | 2023-09-29 | 北京百度网讯科技有限公司 | Text information processing method, device, electronic equipment and storage medium |
CN113268452A (en) * | 2021-05-25 | 2021-08-17 | 联仁健康医疗大数据科技股份有限公司 | Entity extraction method, device, equipment and storage medium |
CN113268452B (en) * | 2021-05-25 | 2024-02-02 | 联仁健康医疗大数据科技股份有限公司 | Entity extraction method, device, equipment and storage medium |
CN113297419A (en) * | 2021-06-23 | 2021-08-24 | 南京谦萃智能科技服务有限公司 | Video knowledge point determining method and device, electronic equipment and storage medium |
CN113505889A (en) * | 2021-07-23 | 2021-10-15 | 中国平安人寿保险股份有限公司 | Processing method and device of atlas knowledge base, computer equipment and storage medium |
CN114237829A (en) * | 2021-12-27 | 2022-03-25 | 南方电网物资有限公司 | Data acquisition and processing method for power equipment |
CN114186690A (en) * | 2022-02-16 | 2022-03-15 | 中国空气动力研究与发展中心计算空气动力研究所 | Aircraft knowledge graph construction method, device, equipment and storage medium |
CN114780749A (en) * | 2022-05-05 | 2022-07-22 | 国网江苏省电力有限公司营销服务中心 | Electric power entity chain finger method based on graph attention machine mechanism |
CN115062619B (en) * | 2022-08-11 | 2022-11-22 | 中国人民解放军国防科技大学 | Chinese entity linking method, device, equipment and storage medium |
CN115062619A (en) * | 2022-08-11 | 2022-09-16 | 中国人民解放军国防科技大学 | Chinese entity linking method, device, equipment and storage medium |
CN116826933A (en) * | 2023-08-30 | 2023-09-29 | 深圳科力远数智能源技术有限公司 | Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system |
CN116826933B (en) * | 2023-08-30 | 2023-12-01 | 深圳科力远数智能源技术有限公司 | Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system |
CN117668259A (en) * | 2024-02-01 | 2024-03-08 | 华安证券股份有限公司 | Knowledge-graph-based inside and outside data linkage analysis method and device |
CN117668259B (en) * | 2024-02-01 | 2024-04-26 | 华安证券股份有限公司 | Knowledge-graph-based inside and outside data linkage analysis method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2021212682A1 (en) | 2021-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111639498A (en) | Knowledge extraction method and device, electronic equipment and storage medium | |
CN111428488A (en) | Resume data information analyzing and matching method and device, electronic equipment and medium | |
CN111680168A (en) | Text feature semantic extraction method and device, electronic equipment and storage medium | |
CN110688854A (en) | Named entity recognition method, device and computer readable storage medium | |
CN113051356B (en) | Open relation extraction method and device, electronic equipment and storage medium | |
CN113157927B (en) | Text classification method, apparatus, electronic device and readable storage medium | |
CN110110213B (en) | Method and device for mining user occupation, computer readable storage medium and terminal equipment | |
CN112100384B (en) | Data viewpoint extraction method, device, equipment and storage medium | |
CN112883730B (en) | Similar text matching method and device, electronic equipment and storage medium | |
CN113378970A (en) | Sentence similarity detection method and device, electronic equipment and storage medium | |
CN111753089A (en) | Topic clustering method and device, electronic equipment and storage medium | |
CN115238670B (en) | Information text extraction method, device, equipment and storage medium | |
CN113268615A (en) | Resource label generation method and device, electronic equipment and storage medium | |
CN116245097A (en) | Method for training entity recognition model, entity recognition method and corresponding device | |
CN115238115A (en) | Image retrieval method, device and equipment based on Chinese data and storage medium | |
CN113204698B (en) | News subject term generation method, device, equipment and medium | |
CN112364068A (en) | Course label generation method, device, equipment and medium | |
CN113254814A (en) | Network course video labeling method and device, electronic equipment and medium | |
CN113205814A (en) | Voice data labeling method and device, electronic equipment and storage medium | |
CN117290515A (en) | Training method of text annotation model, method and device for generating text graph | |
CN116450829A (en) | Medical text classification method, device, equipment and medium | |
CN111339760A (en) | Method and device for training lexical analysis model, electronic equipment and storage medium | |
CN115510188A (en) | Text keyword association method, device, equipment and storage medium | |
CN114943306A (en) | Intention classification method, device, equipment and storage medium | |
CN115146064A (en) | Intention recognition model optimization method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||