CN111639498A - Knowledge extraction method and device, electronic equipment and storage medium - Google Patents
Info

Publication number
CN111639498A
CN111639498A
Authority
CN
China
Prior art keywords
entity
knowledge
text
initial
entity list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010318382.8A
Other languages
Chinese (zh)
Inventor
张聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202010318382.8A priority Critical patent/CN111639498A/en
Priority to PCT/CN2020/104964 priority patent/WO2021212682A1/en
Publication of CN111639498A publication Critical patent/CN111639498A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a knowledge extraction method and device, an electronic device, and a storage medium. The method preprocesses source data to obtain text data and identifies entities in the text data through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, achieving accurate structuring of unstructured data. The initial entity list is expanded on the basis of a knowledge graph to obtain a candidate entity list, covering similar surface expressions comprehensively. A semantic matching model trained with an Attention-DSSM algorithm then disambiguates the candidate entity list to obtain a target entity; because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, the target entity obtained after this analysis is more accurate. Finally, the target entity is linked to a node of the knowledge graph, and knowledge is extracted automatically from the information on that node, improving both the efficiency and the accuracy of knowledge extraction.

Description

Knowledge extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of data analysis, in particular to a knowledge extraction method and device, electronic equipment and a storage medium.
Background
Current knowledge extraction usually depends on templates, trigger words, or supervised learning: rules are summarized and data are labeled manually to form a rule base, and matching is then performed against that rule base.
Such methods are difficult to maintain and poorly portable: a large number of rule templates must be constructed by experts in each field, data labeling is labor-intensive, the quality of the labeled data is hard to control, the overall cost is too high, and new relations and classes are inconvenient to extend.
Disclosure of Invention
In view of the above, it is desirable to provide a knowledge extraction method, apparatus, electronic device, and storage medium that strengthen the association between each word and other words through an Attention mechanism, extract knowledge automatically according to the weight of key words, and improve the efficiency and accuracy of knowledge extraction.
A knowledge extraction method, the knowledge extraction method comprising:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
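For illustration only, the claimed steps can be lined up as a minimal pipeline sketch. Every function below is a hypothetical stand-in built from simple string and set operations; it is not the patent's Bi-LSTM + CRF or Attention-DSSM implementation, and the graph contents are invented.

```python
# Hypothetical sketch of the claimed pipeline; all logic is an
# illustrative stand-in, not the patent's implementation.

def preprocess(source):
    # Unify the source into cleaned text (OCR would run first for pictures).
    return source.strip()

def label_sequences(text):
    # Stand-in for the Bi-LSTM + CRF sequence labeling model.
    return text.split()

def expand(entities, graph):
    # Stand-in for knowledge-graph expansion with known aliases.
    out = list(entities)
    for entity in entities:
        out.extend(graph.get("aliases", {}).get(entity, []))
    return out

def disambiguate(candidates, graph):
    # Stand-in for Attention-DSSM matching: first candidate that is a node.
    return next((c for c in candidates if c in graph.get("nodes", set())), None)

def extract_knowledge(source, graph):
    text = preprocess(source)                 # S11: preprocessing
    initial = label_sequences(text)           # S12: initial entity list
    candidates = expand(initial, graph)       # S13: candidate entity list
    return disambiguate(candidates, graph)    # S14: target entity; linking omitted

graph = {"nodes": {"XX city middle people court"},
         "aliases": {"court": ["XX city middle people court"]}}
print(extract_knowledge("  the court  ", graph))  # XX city middle people court
```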
According to a preferred embodiment of the present invention, the preprocessing the source data to obtain text data includes:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
According to a preferred embodiment of the present invention, the knowledge extraction method further comprises:
configuring a sequence marking mode according to predefined demand data;
and adding the sequence labeling mode into a Bi-LSTM + CRF model to obtain the sequence labeling model.
According to the preferred embodiment of the present invention, the identifying the entities in the text data through the sequence labeling model based on Bi-LSTM + CRF to obtain the initial entity list includes:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
According to a preferred embodiment of the present invention, the expanding the initial entity list based on the preconfigured knowledge-graph to obtain the candidate entity list includes:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
According to the preferred embodiment of the present invention, the disambiguating the candidate entity list using the semantic matching model trained based on the Attention-DSSM algorithm to obtain the target entity includes:
coding each entity in the candidate entity list based on One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity in an Interaction layer, and outputting a semantic vector after Interaction of each entity;
and matching the semantic vector after interaction of each entity with the entities on the nodes of the knowledge graph on a matching layer, and outputting the entity with the highest matching degree as the target entity.
According to a preferred embodiment of the present invention, said extracting knowledge based on information on said nodes comprises:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
A knowledge extraction device, the knowledge extraction device comprising:
the acquisition unit is used for acquiring source data when a knowledge extraction instruction is received;
the preprocessing unit is used for preprocessing the source data to obtain text data;
the identification unit is used for identifying the entity in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
the expansion unit is used for expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
the disambiguation unit is used for carrying out disambiguation processing on the candidate entity list by adopting a semantic matching model trained based on the Attention-DSSM algorithm to obtain a target entity;
a linking unit, configured to link the target entity to a node of the knowledge-graph;
and the extraction unit is used for extracting knowledge based on the information on the nodes.
According to a preferred embodiment of the present invention, the preprocessing unit is specifically configured to:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
According to a preferred embodiment of the present invention, the knowledge extraction device further comprises:
the configuration unit is used for configuring the sequence marking mode according to predefined demand data;
and the adding unit is used for adding the sequence labeling mode into the Bi-LSTM + CRF model to obtain the sequence labeling model.
According to a preferred embodiment of the present invention, the identification unit is specifically configured to:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
According to a preferred embodiment of the present invention, the extension unit is specifically configured to:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
According to a preferred embodiment of the invention, the disambiguation unit is specifically configured to:
coding each entity in the candidate entity list based on One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity in an Interaction layer, and outputting a semantic vector after Interaction of each entity;
and matching the semantic vector after interaction of each entity with the entities on the nodes of the knowledge graph on a matching layer, and outputting the entity with the highest matching degree as the target entity.
According to a preferred embodiment of the present invention, the extracting unit is specifically configured to:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the knowledge extraction method.
A computer-readable storage medium having at least one instruction stored therein, the at least one instruction being executable by a processor in an electronic device to implement the knowledge extraction method.
The above technical scheme shows that the invention can acquire source data when a knowledge extraction instruction is received and preprocess the source data to obtain text data, unifying the formats. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, achieving accurate structuring of unstructured data. The initial entity list is then expanded on the basis of a pre-configured knowledge graph to obtain a candidate entity list, covering similar surface expressions comprehensively, and a semantic matching model trained with an Attention-DSSM algorithm disambiguates the candidate entity list to obtain a target entity. Because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the added Interaction layer strengthens the association between the texts to be matched, the resulting target entity is more accurate. The target entity is further linked to a node of the knowledge graph, and knowledge is extracted automatically from the information on the node, improving both the efficiency and the accuracy of knowledge extraction.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the knowledge extraction method of the present invention.
FIG. 2 is a schematic diagram of a relationship network extracted from an exemplary data source according to the present invention.
FIG. 3 is a functional block diagram of a preferred embodiment of the knowledge extraction apparatus of the present invention.
FIG. 4 is a schematic structural diagram of an electronic device implementing a knowledge extraction method according to a preferred embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a preferred embodiment of the knowledge extraction method of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
The knowledge extraction method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive Internet Protocol Television (IPTV), an intelligent wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud computing (cloud computing) based cloud consisting of a large number of hosts or network servers.
The Network where the electronic device is located includes, but is not limited to, the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
And S10, acquiring the source data when the knowledge extraction instruction is received.
In at least one embodiment of the invention, the knowledge extraction instructions may be triggered by specified users including, but not limited to: project managers, and the like.
Further, the source data may be obtained from a configuration database.
For example, when knowledge extraction is performed on the basis of a legal knowledge base, the source data may be obtained from a database accessible to the court, such as an internal court database or a source database on the web.
The source data may be a picture type or a text type, and the present invention is not limited thereto.
And S11, preprocessing the source data to obtain text data.
In order for the machine to be able to identify the source data, the electronic device first needs to pre-process the source data.
Specifically, the preprocessing the source data to obtain text data includes:
when the source data is of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and encoding the filtered text based on a UTF-8 (8-bit Unicode Transformation Format) encoding algorithm to obtain the text data.
Or when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
The electronic device may employ an OCR (Optical Character Recognition) algorithm to convert the source data into the initial text.
Meanwhile, encoding the filtered text with the UTF-8 algorithm allows operations such as full-width/half-width symbol conversion and removal of garbled characters to be performed, finally unifying the encoding.
Further, the text data may be in a TXT text format, or may be in other text formats, which is not limited in the present invention.
Through this embodiment, the source data can be filtered and cleaned to remove interfering information and then converted into a uniform text format, unifying the data format so that the preprocessed text data can be recognized and processed by a machine.
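A minimal sketch of this preprocessing step, assuming only Python's standard `unicodedata` module. NFKC normalization is used here as a stand-in for the full-width/half-width conversion, and the control-character filter as a stand-in for garbled-text removal; neither is necessarily the patent's exact method.

```python
import unicodedata

def clean_text(raw: str) -> bytes:
    # NFKC folds full-width symbols to their half-width forms.
    normalized = unicodedata.normalize("NFKC", raw)
    # Drop control characters (a common form of encoding debris),
    # keeping ordinary whitespace.
    filtered = "".join(ch for ch in normalized
                       if unicodedata.category(ch) != "Cc" or ch in "\n\t")
    # Emit unified UTF-8, silently dropping unencodable characters.
    return filtered.encode("utf-8", errors="ignore")

print(clean_text("Ａ１，ｂ"))  # b'A1,b'
```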
And S12, identifying the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list.
The text data obtained by preprocessing the source data is unstructured text data, so that key entity information in the text data needs to be identified, which is equivalent to performing sequence labeling on the text data. Therefore, the electronic device needs to first construct a sequence annotation model associated with the current knowledge extraction instruction.
Specifically, the knowledge extraction method further includes:
the electronic equipment configures a sequence labeling mode according to predefined demand data, and adds the sequence labeling mode to a Bi-LSTM + CRF model to obtain the sequence labeling model.
The sequence labeling mode can be configured according to specific task requirements.
For example: the same label may not be output consecutively, etc.
It should be noted that the Bi-LSTM (Bidirectional Long Short-Term Memory) layer provides long-distance dependency modeling and strengthens the relation between each character and its context. A CRF (Conditional Random Field) can accommodate arbitrary context information, so feature design is flexible; the CRF layer models the transitions between labels and takes the ordering of output labels into account, achieving a more accurate recognition result.
In at least one embodiment of the present invention, the identifying the entities in the text data through the Bi-LSTM + CRF-based sequence labeling model, and obtaining the initial entity list includes:
the electronic equipment inputs the text data into the sequence labeling model based on the Bi-LSTM + CRF, obtains output probability and transition probability of each corresponding label at each sequence position in a Softmax layer, calculates sum of the output probability and the transition probability of each label as score of each label for each sequence position, determines the label with the highest score as the output label of each sequence position, and combines the output labels of each sequence position to obtain the initial entity list.
Specifically, for each input sequence X = (x_1, x_2, …, x_n), a predicted tag sequence Y = (y_1, y_2, …, y_n) can be obtained, and its score is defined as:

s(X, Y) = Σ_{i=1}^{n} P_{i,y_i} + Σ_{i=1}^{n-1} A_{y_i,y_{i+1}}

where P_{i,y_i} is the probability output by the Softmax layer that the i-th sequence position takes tag y_i, and A_{y_i,y_{i+1}} is the transition probability from y_i to y_{i+1}.
According to this formula, a predicted sequence with a high score does not necessarily take, at every position, the label with the maximum probability output by the Softmax layer; the transition probabilities are considered jointly, i.e. the sequence labeling mode is respected (for example, B cannot be followed by B).
For example, suppose that after Bi-LSTM processing the most probable per-position output sequence is B B I B I O O. Since, according to the sequence labeling mode, the probability of the transition B -> B in the transition probability matrix is very small or even negative, this sequence will not obtain the highest score, i.e. it will not yield the initial entity list.
Continuing the example above: if B-PER denotes the first character of a person's name, E-PER the last character of a person's name, O an independent character, B-ORG the first character of an organization name, and I-ORG a middle character of an organization name, then the initial entity list obtained by merging label items of the same category in the sequence may include: the sequence (B, E), representing a person's name; the sequence (B, I, E), representing an organization name; and the sequence (O), representing an independent character.
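The scoring rule in the example above can be sketched as follows: for a tag sequence, sum the Softmax output of each position's tag with the transition score between adjacent tags. The tag set and all numbers below are illustrative, not taken from the patent.

```python
# Hedged sketch of the score s(X, Y): emission scores plus transition scores.
def sequence_score(emissions, transitions, tags):
    emit = sum(emissions[i][t] for i, t in enumerate(tags))
    trans = sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return emit + trans

emissions = [{"B": 2.0, "I": 0.1, "O": 0.3},   # position 1 Softmax outputs
             {"B": 1.5, "I": 1.2, "O": 0.2}]   # position 2 Softmax outputs
transitions = {"B": {"B": -10.0, "I": 1.0, "O": 0.0},   # B -> B is forbidden
               "I": {"B": 0.0, "I": 0.5, "O": 0.5},
               "O": {"B": 0.5, "I": -10.0, "O": 0.2}}

# Although "B" has the highest emission at both positions, the forbidden
# B -> B transition keeps (B, B) from winning.
print(sequence_score(emissions, transitions, ["B", "I"]))  # 4.2
print(sequence_score(emissions, transitions, ["B", "B"]))  # -6.5
```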
S13, expanding the initial entity list based on the pre-configured knowledge graph to obtain a candidate entity list.
The knowledge graph can be configured in advance according to each technical field, such as: legal knowledge maps in the legal domain, etc.
It should be noted that each entity in the initial entity list may be a partial representation or an alternative representation of the entity, and therefore, a surface name extension needs to be performed on each entity to obtain the candidate entity list.
For example: the "XX middle court" and the "XX City middle people court" are different representations of the same entity.
Specifically, the expanding the initial entity list based on the preconfigured knowledge graph to obtain the candidate entity list includes:
the electronic equipment calculates the cosine similarity between each entity in the initial entity list and the entities on each node in the knowledge graph, and obtains at least one entity with the cosine similarity larger than or equal to the preset similarity from each node as a candidate entity, and the electronic equipment constructs the candidate entity list according to the initial entity list and the candidate entity.
The preset similarity may be configured by the user, for example: 99.7%.
Cosine similarity measures the similarity of two texts by the cosine of the angle between their vectors in a vector space; compared with distance metrics, it focuses more on the difference of the two vectors in direction. Generally, after obtaining vector representations of two texts via an embedding, cosine similarity can be used to compute their similarity.
By the embodiment, the similarity between each entity and the entity on the node of the knowledge graph can be calculated by adopting cosine similarity so as to judge whether the coreference relationship exists, and further the expansion of the initial entity list is realized to obtain the candidate entity list with more comprehensive coverage.
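The expansion step can be sketched in plain Python as below. The embeddings, entity names, and the 0.9 threshold are illustrative assumptions; the patent does not specify them.

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors; 0.0 for degenerate vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_entities(initial, node_vectors, threshold=0.9):
    # initial: entity name -> embedding, from the initial entity list.
    # node_vectors: knowledge-graph node entity name -> embedding.
    candidates = list(initial)
    for vec in initial.values():
        for node_name, node_vec in node_vectors.items():
            if node_name not in candidates and cosine(vec, node_vec) >= threshold:
                candidates.append(node_name)
    return candidates

nodes = {"XX city middle people court": [0.96, 0.28],  # toy embeddings
         "unrelated entity": [0.0, 1.0]}
print(expand_entities({"XX middle court": [1.0, 0.0]}, nodes))
# ['XX middle court', 'XX city middle people court']
```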
S14, disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM (Attention + Deep Structured Semantic Model) algorithm to obtain a target entity.
The candidate entity list obtained by expanding the initial entity list may contain several candidates; it therefore needs further disambiguation so that the unique target entity best matching an entity on the knowledge-graph nodes can be determined more accurately from the multiple similar representations.
Specifically, the disambiguating the candidate entity list by using the semantic matching model trained based on the Attention-DSSM algorithm to obtain the target entity includes:
the electronic equipment encodes each entity in the candidate entity list based on an One-Hot encoding algorithm to obtain a word ID of each entity, inputs the word ID of each entity into a pre-configured dictionary, outputs a word vector of each entity, processes the word vector of each entity based on an Attention mechanism to obtain a semantic representation of each entity, further interacts the semantic representation of each entity on an Interaction layer (Interaction layer), outputs the semantic vector after Interaction of each entity, matches the semantic vector after Interaction of each entity with the entity on the knowledge graph node on a matching layer, and outputs the entity with the highest matching degree as the target entity.
The traditional DSSM (Deep Structured Semantic Model) represents the extracted entity's context information and the candidate entities' context information as low-dimensional semantic vectors and computes the distance between the two semantic vectors by cosine distance. However, because the DSSM adopts a bag-of-words model, word-order information and context information are lost. In addition, the DSSM is a weakly supervised end-to-end model: its predictions are hard to control, it cannot capture long-distance information, and it suffers from problems such as vanishing gradients.
In consideration of the particularity of the semantic matching task, the embodiment adopts a semantic matching model trained based on the Attention-DSSM algorithm, and the semantic matching model may include, from bottom to top: the system comprises an input layer, a semantic representation layer, an Interaction layer and a matching layer.
Through this embodiment, the word vector of each entity is processed with the Attention mechanism, which strengthens the semantic representation, the association between each word in the text and the other words, and the weight of key words in the text, yielding high accuracy. In addition, the added Interaction layer lets the two texts to be matched interact: expressing each in terms of the other strengthens their association and improves the model's generalization ability.
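A toy sketch of the semantic-representation and matching layers, using plain Python in place of a trained Attention-DSSM. The query vector, word vectors, and node names are illustrative assumptions, and the Interaction layer is omitted for brevity.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention_pool(word_vectors, query):
    # Semantic-representation layer sketch: weight each word vector by its
    # softmax-normalized dot product with a query vector, then sum, so key
    # words contribute more to the sentence vector.
    scores = softmax([sum(a * b for a, b in zip(w, query)) for w in word_vectors])
    dim = len(word_vectors[0])
    return [sum(s * w[d] for s, w in zip(scores, word_vectors)) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_match(entity_vector, node_vectors):
    # Matching-layer sketch: output the node entity with the highest cosine score.
    return max(node_vectors, key=lambda name: cosine(entity_vector, node_vectors[name]))

# Two toy word vectors; the query makes the first word the "key" word.
entity_vector = attention_pool([[2.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])
print(best_match(entity_vector, {"court": [1.0, 0.1], "hospital": [0.0, 1.0]}))  # court
```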
S15, linking the target entity to the nodes of the knowledge-graph.
Specifically, the electronic device obtains a node on the knowledge-graph corresponding to the target entity, and further links the target entity to the node of the knowledge-graph.
For example: and linking the XX middle court in the legal document with the XX city middle people court node in the legal knowledge graph, wherein the attributes and the relationship of the XX middle court are the same as those of the XX city middle people court.
It should be noted that entity linking is a relatively mature technology, which the present invention does not describe further here.
And S16, extracting knowledge based on the information on the node.
Specifically, the extracting knowledge based on the information on the node includes:
the electronic equipment acquires at least one path between the nodes and the associated information on each path from the information on the nodes, and extracts at least one relation network based on the associated information on each path and the corresponding path.
As shown in fig. 2, for the acquired source data, a corresponding relationship network may be extracted.
It should be noted that the knowledge graph is composed of a large amount of knowledge and the relations between pieces of knowledge: nodes in the network represent entities that exist in the real world, and the edges between nodes represent the relations between two entities. Through this combination of points and edges, real-world knowledge is abstracted into a knowledge network suitable for machine processing.
Through the above embodiment, knowledge extraction is performed based on the information on the nodes, implicit information of the linked nodes can be acquired after the target entity is linked to the nodes of the knowledge graph, and extraction of relationship and event information can be performed according to paths among the nodes.
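The path-based extraction described above can be sketched as a breadth-first collection of paths over knowledge-graph triples. The triples, relation names, and the path-length cap below are invented for illustration only.

```python
from collections import deque

def find_paths(edges, start, end, max_len=3):
    # edges: (head, relation, tail) triples of the knowledge graph.
    adjacency = {}
    for head, rel, tail in edges:
        adjacency.setdefault(head, []).append((rel, tail))
    paths, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == end and path:
            paths.append(path)  # one relation network: a path plus its relations
            continue
        if len(path) >= max_len:
            continue
        for rel, tail in adjacency.get(node, []):
            queue.append((tail, path + [(node, rel, tail)]))
    return paths

# Invented triples for illustration only.
edges = [("party A", "sued", "party B"),
         ("party B", "represented by", "lawyer C")]
print(find_paths(edges, "party A", "lawyer C"))
```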
The technical scheme above shows that the method acquires source data when a knowledge extraction instruction is received and preprocesses the source data to obtain text data, unifying the format. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, so that unstructured data is converted accurately. The initial entity list is then expanded based on a preconfigured knowledge graph to obtain a candidate entity list, covering similar representations comprehensively, and the candidate entity list is disambiguated with a semantic matching model trained on the Attention-DSSM algorithm to obtain a target entity. Because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the newly added Interaction layer strengthens the association between the texts to be matched, the obtained target entity is more accurate. The target entity is then linked to the nodes of the knowledge graph, and knowledge is extracted automatically from the information on the nodes, improving both the efficiency and the accuracy of knowledge extraction.
Fig. 3 is a functional block diagram of the knowledge extracting apparatus according to the preferred embodiment of the present invention. The knowledge extraction device 11 comprises an acquisition unit 110, a preprocessing unit 111, an identification unit 112, an expansion unit 113, a disambiguation unit 114, a linking unit 115, an extraction unit 116, a configuration unit 117, and an addition unit 118. The module/unit referred to in the present invention refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
When receiving a knowledge extraction instruction, the acquisition unit 110 acquires source data.
In at least one embodiment of the invention, the knowledge extraction instructions may be triggered by specified users including, but not limited to: project managers, and the like.
Further, the source data may be obtained from a configuration database.
For example, when knowledge extraction is performed on a legal knowledge base, the source data may be obtained from databases accessible to the court, such as a database inside the court or a source database on the web.
The source data may be a picture type or a text type, and the present invention is not limited thereto.
The preprocessing unit 111 preprocesses the source data to obtain text data.
In order to enable the machine to recognize the source data, the preprocessing unit 111 first needs to preprocess the source data.
Specifically, the preprocessing unit 111 preprocesses the source data to obtain text data, and includes:
when the source data is of a picture type, the preprocessing unit 111 converts the source data into an initial text, filters and cleans the initial text to obtain a filtered text, and encodes the filtered text with the UTF-8 (8-bit Unicode Transformation Format) encoding algorithm to obtain the text data.
Or when the source data is a text type, the preprocessing unit 111 filters and cleans the source data to obtain a filtered text, and codes the filtered text based on a UTF-8 coding algorithm to obtain the text data.
Wherein the preprocessing unit 111 may employ an OCR (Optical Character Recognition) algorithm to convert the source data into the initial text.
Meanwhile, encoding the filtered text with the UTF-8 encoding algorithm allows operations such as full-width/half-width symbol conversion and removal of garbled characters to be performed on the filtered text, finally unifying the encoding.
Further, the text data may be in a TXT text format, or may be in other text formats, which is not limited in the present invention.
Through the implementation mode, the source data can be filtered and cleaned to eliminate interference information, and further the source data is converted into a uniform text format, so that the uniformity of the data format is realized, and the preprocessed text data can be recognized and processed by a machine.
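As a minimal sketch of this preprocessing step, full-width/half-width conversion followed by UTF-8 encoding might look like the following (the specific cleaning rules here are illustrative assumptions, not the exact filter list of this embodiment):

```python
def fullwidth_to_halfwidth(text: str) -> str:
    # Map full-width ASCII variants (U+FF01..U+FF5E) to their half-width
    # counterparts, and the ideographic space (U+3000) to a normal space.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return ''.join(out)

def preprocess(raw: str) -> bytes:
    # Drop Unicode replacement characters left over from garbled input,
    # normalize full-width symbols, then encode the result as UTF-8.
    cleaned = raw.replace('\ufffd', '')
    cleaned = fullwidth_to_halfwidth(cleaned)
    return cleaned.encode('utf-8')
```

For instance, `preprocess('Ｈｅｌｌｏ')` yields the UTF-8 bytes of `'Hello'`, so downstream components see a single, uniform encoding.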
The identification unit 112 identifies the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF, resulting in an initial entity list.
The text data obtained by preprocessing the source data is unstructured text data, so that key entity information in the text data needs to be identified, which is equivalent to performing sequence labeling on the text data. Therefore, it is necessary to first construct a sequence annotation model associated with the current knowledge extraction instruction.
Specifically, the configuration unit 117 configures a sequence labeling mode according to predefined demand data, and the adding unit 118 adds the sequence labeling mode to the Bi-LSTM + CRF model to obtain the sequence labeling model.
The sequence labeling mode can be configured according to specific task requirements.
For example: the same label may not be output consecutively, etc.
It should be noted that the Bi-LSTM (Bidirectional Long Short-Term Memory) layer models long-distance dependencies and strengthens the relation between each character and its context characters, while a CRF (Conditional Random Field) can accommodate arbitrary context information, making feature design flexible. The CRF layer models the transitions and correspondences between characters and, at the same time, takes the order of the output labels into account, thereby achieving a more accurate recognition result.
In at least one embodiment of the present invention, the identifying unit 112 identifies the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF, and the obtaining of the initial entity list includes:
the identification unit 112 inputs the text data into the Bi-LSTM + CRF-based sequence labeling model and obtains, at the Softmax layer, the output probability and the transition probability of each candidate label at each sequence position. For each sequence position, the identification unit 112 calculates the sum of the output probability and the transition probability of each label as the score of that label and determines the label with the highest score as the output label of that position. The identification unit 112 then combines the output labels of all sequence positions to obtain the initial entity list.
Specifically, for each input sequence X = (x1, x2, …, xn), a predicted tag sequence Y = (y1, y2, …, yn) can be obtained. The score is defined as follows:

    s(X, Y) = Σ_{i=1..n} P(i, yi) + Σ_{i=0..n} A(yi, yi+1)

where P(i, yi) is the probability that the output of the Softmax layer at the i-th sequence position is yi, and A(yi, yi+1) is the transition probability from yi to yi+1.
According to this formula, a high-scoring predicted sequence does not simply take, at every position, the label with the maximum probability output by the Softmax layer; the transition probabilities are considered as well, i.e., the sequence labeling pattern must be satisfied (for example, B cannot be followed by B).
For example, suppose that after Bi-LSTM processing the most likely per-position output sequence is B B I B I O O. Since, according to the sequence labeling pattern, the probability of B -> B in the transition probability matrix is very small (or even negative), that sequence will not receive the highest score, i.e., it will not yield the initial entity list.
Continuing the example, let B-PER denote the first-character label of a person name, E-PER the last-character label of a person name, O the independent-character label, B-ORG the first-character label of an organization name, and I-ORG a middle-character label of an organization name. The initial entity list obtained by merging label items of the same category in the sequence may then include: the sequence (B, E), representing a person name; the sequence (B, I, E), representing an organization name; and the sequence (O), representing an independent character.
The expansion unit 113 expands the initial entity list based on a preconfigured knowledge graph to obtain a candidate entity list.
The knowledge graph can be configured in advance according to each technical field, such as: legal knowledge maps in the legal domain, etc.
It should be noted that each entity in the initial entity list may be a partial representation or an alternative representation of the entity, and therefore, a surface name extension needs to be performed on each entity to obtain the candidate entity list.
For example: the "XX middle court" and the "XX City middle people court" are different representations of the same entity.
Specifically, the expanding unit 113 expands the initial entity list based on a preconfigured knowledge graph, and obtaining a candidate entity list includes:
the expansion unit 113 calculates the cosine similarity between each entity in the initial entity list and the entity on each node of the knowledge graph, and obtains from the nodes at least one entity whose cosine similarity is greater than or equal to a preset similarity as a candidate entity. The expansion unit 113 then constructs the candidate entity list from the initial entity list and the candidate entities.
The preset similarity may be configured by the user, for example, 99.7%.
Cosine similarity measures the similarity of two texts by the cosine of the angle between their vectors in a vector space; compared with distance metrics, it focuses more on the difference in direction between the two vectors. Generally, after a vector representation of two texts is obtained through an embedding, cosine similarity can be used to calculate the similarity between them.
By the embodiment, the similarity between each entity and the entity on the node of the knowledge graph can be calculated by adopting cosine similarity so as to judge whether the coreference relationship exists, and further the expansion of the initial entity list is realized to obtain the candidate entity list with more comprehensive coverage.
The disambiguation unit 114 disambiguates the candidate entity list with a semantic matching model trained on the Attention-DSSM (Attention-based Deep Structured Semantic Model) algorithm to obtain a target entity.
The candidate entity list obtained by expanding the initial entity list may contain multiple candidates; therefore, the candidate entity list needs further disambiguation, so that the unique target entity best matching an entity on a knowledge-graph node can be determined more accurately from the multiple similar representations.
Specifically, the disambiguation unit 114 performs disambiguation on the candidate entity list by using a semantic matching model trained based on the Attention-DSSM algorithm, and obtaining a target entity includes:
the disambiguation unit 114 encodes each entity in the candidate entity list with the One-Hot encoding algorithm to obtain the word ID of each entity, inputs the word ID into a preconfigured dictionary, and outputs the word vector of each entity. The word vector of each entity is processed with the Attention mechanism to obtain a semantic representation of each entity; the semantic representations are then made to interact at the Interaction layer, which outputs the post-interaction semantic vector of each entity. At the matching layer, the post-interaction semantic vector of each entity is matched against the entities on the knowledge-graph nodes, and the entity with the highest matching degree is output as the target entity.
The traditional DSSM (Deep Structured Semantic Model) expresses the context information of the extracted entity and of the candidate entities as low-dimensional semantic vectors and calculates the distance between the two vectors by cosine distance. However, because DSSM adopts a bag-of-words model, word-order information and context information are lost. In addition, the DSSM model is a weakly supervised, end-to-end model: its prediction result is hard to control, it cannot capture long-distance information, and it suffers from problems such as vanishing gradients.
In view of the particularity of the semantic matching task, this embodiment adopts a semantic matching model trained with the Attention-DSSM algorithm; from bottom to top, the model may comprise an input layer, a semantic representation layer, an Interaction layer, and a matching layer.
In this implementation, processing the word vector of each entity with the Attention mechanism strengthens the semantic representation: the association between each word in a text and the other words is reinforced, and the weight of key words in the text is raised, giving high accuracy. In addition, the newly added Interaction layer lets the two texts to be matched interact, so that each is expressed in terms of the other; this strengthens the association between the two texts and improves the generalization ability of the model.
The linking unit 115 links the target entity to a node of the knowledge-graph.
Specifically, the linking unit 115 acquires a node on the knowledge-graph corresponding to the target entity, and further links the target entity to the node of the knowledge-graph.
For example, the entity "XX Intermediate Court" in a legal document is linked to the node "XX City Intermediate People's Court" in the legal knowledge graph; after linking, "XX Intermediate Court" carries the same attributes and relations as "XX City Intermediate People's Court".
It should be noted that entity linking is a relatively mature technique, which the present invention does not describe in detail here.
The extraction unit 116 performs knowledge extraction based on the information on the node.
Specifically, the extracting unit 116 performs knowledge extraction based on the information on the node, including:
the extraction unit 116 obtains at least one path between nodes and associated information on each path from the information on the nodes, and extracts at least one relationship network based on the associated information on each path and the corresponding path.
As shown in fig. 2, for the acquired source data, a corresponding relationship network may be extracted.
It should be noted that a knowledge graph consists of a large amount of knowledge and the relations between items of knowledge: nodes in the network represent entities that exist in the real world, and edges between nodes represent the relations between two entities. Through this combination of nodes and edges, real-world knowledge is abstracted into a knowledge network suitable for machine processing.
Through the above embodiment, knowledge extraction is performed based on the information on the nodes, implicit information of the linked nodes can be acquired after the target entity is linked to the nodes of the knowledge graph, and extraction of relationship and event information can be performed according to paths among the nodes.
The technical scheme above shows that the method acquires source data when a knowledge extraction instruction is received and preprocesses the source data to obtain text data, unifying the format. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, so that unstructured data is converted accurately. The initial entity list is then expanded based on a preconfigured knowledge graph to obtain a candidate entity list, covering similar representations comprehensively, and the candidate entity list is disambiguated with a semantic matching model trained on the Attention-DSSM algorithm to obtain a target entity. Because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the newly added Interaction layer strengthens the association between the texts to be matched, the obtained target entity is more accurate. The target entity is then linked to the nodes of the knowledge graph, and knowledge is extracted automatically from the information on the nodes, improving both the efficiency and the accuracy of knowledge extraction.
Fig. 4 is a schematic structural diagram of an electronic device according to a preferred embodiment of the knowledge extraction method of the present invention.
The electronic device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a knowledge extraction program, stored in the memory 12 and executable on the processor 13.
It will be understood by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not limit it. The electronic device 1 may have a bus-type or star-type structure, may include more or fewer hardware or software components than shown, or may have a different arrangement of components; for example, the electronic device 1 may further include input/output devices, network access devices, and the like.
It should be noted that the electronic device 1 is only an example, and other existing or future electronic products, such as those that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
The memory 12 includes at least one type of readable storage medium, which includes flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. The memory 12 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 may be used not only to store application software installed in the electronic apparatus 1 and various types of data such as codes of a knowledge extraction program, but also to temporarily store data that has been output or is to be output.
The processor 13 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 13 is a Control Unit (Control Unit) of the electronic device 1, connects various components of the electronic device 1 by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (for example, executing a knowledge extraction program and the like) stored in the memory 12 and calling data stored in the memory 12.
The processor 13 executes an operating system of the electronic device 1 and various installed application programs. The processor 13 executes the application program to implement the steps in the above-described respective knowledge extraction method embodiments, such as steps S10, S11, S12, S13, S14, S15, S16 shown in fig. 1.
Alternatively, the processor 13, when executing the computer program, implements the functions of the modules/units in the above device embodiments, for example:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the electronic device 1. For example, the computer program may be divided into an acquisition unit 110, a pre-processing unit 111, an identification unit 112, an extension unit 113, a disambiguation unit 114, a linking unit 115, an extraction unit 116, a configuration unit 117, an addition unit 118.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the knowledge extraction method according to the embodiments of the present invention.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented.
Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 4, but this does not indicate only one bus or one type of bus. The bus is arranged to enable connection communication between the memory 12 and at least one processor 13 or the like.
Although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 13 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
Fig. 4 only shows the electronic device 1 with components 12-13, and it will be understood by a person skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
In conjunction with fig. 1, the memory 12 in the electronic device 1 stores a plurality of instructions to implement a knowledge extraction method, and the processor 13 executes the plurality of instructions to implement:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
Specifically, the processor 13 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the instruction, which is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or devices recited in the system claims may also be implemented by one unit or device through software or hardware. Terms such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A knowledge extraction method, characterized by comprising:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
2. The knowledge extraction method of claim 1, wherein the preprocessing the source data to obtain text data comprises:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
3. The knowledge extraction method of claim 1, further comprising:
configuring a sequence marking mode according to predefined demand data;
and adding the sequence labeling mode into a Bi-LSTM + CRF model to obtain the sequence labeling model.
4. The knowledge extraction method of claim 1, wherein the identifying entities in the text data through a Bi-LSTM + CRF based sequence labeling model, and obtaining an initial entity list comprises:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
5. The knowledge extraction method of claim 1, wherein the expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list comprises:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
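A small sketch of claim 5's expansion step, assuming pre-computed embedding vectors for every entity (the claim does not say how the vectors are obtained, so the `embed` lookup table here is illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_entities(initial_list, graph_entities, embed, preset_similarity=0.8):
    """Claim 5: keep graph entities whose cosine similarity to any entity in
    the initial list reaches the preset similarity, then merge the candidates
    with the initial list to construct the candidate entity list."""
    candidates = []
    for entity in initial_list:
        for node_entity in graph_entities:
            if node_entity in candidates or node_entity in initial_list:
                continue
            if cosine(embed[entity], embed[node_entity]) >= preset_similarity:
                candidates.append(node_entity)
    return initial_list + candidates
```

The preset similarity of 0.8 is a placeholder; the patent leaves the threshold as a configurable value.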
6. The knowledge extraction method of claim 1, wherein the disambiguating the candidate entity list using a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity comprises:
coding each entity in the candidate entity list based on a One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity at an Interaction layer, and outputting an interacted semantic vector of each entity;
and matching the interacted semantic vector of each entity with the entities on the nodes of the knowledge graph at a matching layer, and outputting the entity with the highest matching degree as the target entity.
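The Attention step in claim 6 can be illustrated with a toy softmax attention over an entity's word vectors. The dot-product weighting scheme is an assumption for illustration only, since the claim names the mechanism but not its exact form:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_representation(word_vectors, query):
    """Weight each word vector by the softmax of its dot product with a
    query vector, then sum the weighted vectors, giving one semantic
    representation per entity (the input to the Interaction layer)."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, word_vectors)) for i in range(dim)]
```

In the full Attention-DSSM pipeline these representations would then be interacted and matched against knowledge-graph entities; only the attention pooling is sketched here.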
7. The knowledge extraction method of claim 1, wherein the extracting knowledge based on information on the node comprises:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
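Claim 7's extraction step can be sketched as a breadth-first enumeration of simple paths whose edges carry the associated relation information. The adjacency-list shape (`node -> [(neighbor, relation), ...]`) is an illustrative assumption:

```python
from collections import deque

def extract_relation_networks(graph, start, goal):
    """Enumerate simple paths between two nodes together with the
    associated relation stored on each traversed edge, yielding one
    (path, relations) pair per relation network."""
    networks = []
    queue = deque([(start, [start], [])])
    while queue:
        node, path, relations = queue.popleft()
        if node == goal:
            networks.append((path, relations))
            continue
        for neighbor, relation in graph.get(node, []):
            if neighbor not in path:  # keep paths simple (no revisits)
                queue.append((neighbor, path + [neighbor], relations + [relation]))
    return networks
```

Each returned pair is one relation network in the sense of the claim: a node path plus the associated information along it.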
8. A knowledge extraction apparatus, characterized by comprising:
the acquisition unit is used for acquiring source data when a knowledge extraction instruction is received;
the preprocessing unit is used for preprocessing the source data to obtain text data;
the identification unit is used for identifying the entity in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
the expansion unit is used for expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
the disambiguation unit is used for carrying out disambiguation processing on the candidate entity list by adopting a semantic matching model trained based on the Attention-DSSM algorithm to obtain a target entity;
a linking unit, configured to link the target entity to a node of the knowledge graph;
and the extraction unit is used for extracting knowledge based on the information on the nodes.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the knowledge extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein at least one instruction that is executed by a processor in an electronic device to implement the knowledge extraction method of any one of claims 1 to 7.
CN202010318382.8A 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium Pending CN111639498A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010318382.8A CN111639498A (en) 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium
PCT/CN2020/104964 WO2021212682A1 (en) 2020-04-21 2020-07-27 Knowledge extraction method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010318382.8A CN111639498A (en) 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111639498A true CN111639498A (en) 2020-09-08

Family

ID=72328869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010318382.8A Pending CN111639498A (en) 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111639498A (en)
WO (1) WO2021212682A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328653A (en) * 2020-10-30 2021-02-05 北京百度网讯科技有限公司 Data identification method and device, electronic equipment and storage medium
CN112380359A (en) * 2021-01-18 2021-02-19 平安科技(深圳)有限公司 Knowledge graph-based training resource allocation method, device, equipment and medium
CN112395429A (en) * 2020-12-02 2021-02-23 上海三稻智能科技有限公司 Method, system and storage medium for determining, pushing and applying HS (high speed coding) codes based on graph neural network
CN112426726A (en) * 2020-12-09 2021-03-02 网易(杭州)网络有限公司 Game event extraction method, device, storage medium and server
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112507126A (en) * 2020-12-07 2021-03-16 厦门渊亭信息科技有限公司 Entity linking device and method based on recurrent neural network
CN112508615A (en) * 2020-12-10 2021-03-16 深圳市欢太科技有限公司 Feature extraction method, feature extraction device, storage medium, and electronic apparatus
CN112528660A (en) * 2020-12-04 2021-03-19 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for processing text
CN113111660A (en) * 2021-04-22 2021-07-13 脉景(杭州)健康管理有限公司 Data processing method, device, equipment and storage medium
CN113220835A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Text information processing method and device, electronic equipment and storage medium
CN113268452A (en) * 2021-05-25 2021-08-17 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113297419A (en) * 2021-06-23 2021-08-24 南京谦萃智能科技服务有限公司 Video knowledge point determining method and device, electronic equipment and storage medium
WO2021190653A1 (en) * 2020-10-31 2021-09-30 平安科技(深圳)有限公司 Semantic parsing device and method, terminal, and storage medium
CN113505889A (en) * 2021-07-23 2021-10-15 中国平安人寿保险股份有限公司 Processing method and device of atlas knowledge base, computer equipment and storage medium
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short
CN114186690A (en) * 2022-02-16 2022-03-15 中国空气动力研究与发展中心计算空气动力研究所 Aircraft knowledge graph construction method, device, equipment and storage medium
CN114237829A (en) * 2021-12-27 2022-03-25 南方电网物资有限公司 Data acquisition and processing method for power equipment
CN114780749A (en) * 2022-05-05 2022-07-22 国网江苏省电力有限公司营销服务中心 Electric power entity chain finger method based on graph attention machine mechanism
CN115062619A (en) * 2022-08-11 2022-09-16 中国人民解放军国防科技大学 Chinese entity linking method, device, equipment and storage medium
CN116826933A (en) * 2023-08-30 2023-09-29 深圳科力远数智能源技术有限公司 Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system
CN117668259A (en) * 2024-02-01 2024-03-08 华安证券股份有限公司 Knowledge-graph-based inside and outside data linkage analysis method and device

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218931B (en) * 2021-11-04 2024-01-23 北京百度网讯科技有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium
CN114239583B (en) * 2021-12-15 2023-04-07 北京百度网讯科技有限公司 Method, device, equipment and medium for training entity chain finger model and entity chain finger
CN114218403B (en) * 2021-12-20 2024-04-09 平安付科技服务有限公司 Fault root cause positioning method, device, equipment and medium based on knowledge graph
CN114416976A (en) * 2021-12-23 2022-04-29 北京百度网讯科技有限公司 Text labeling method and device and electronic equipment
CN114330345B (en) * 2021-12-24 2023-01-17 北京百度网讯科技有限公司 Named entity recognition method, training method, device, electronic equipment and medium
CN114491232B (en) * 2021-12-24 2023-03-24 北京百度网讯科技有限公司 Information query method and device, electronic equipment and storage medium
CN114330353B (en) * 2022-01-06 2023-06-13 腾讯科技(深圳)有限公司 Entity identification method, device, equipment, medium and program product of virtual scene
CN114186759A (en) * 2022-02-16 2022-03-15 杭州杰牌传动科技有限公司 Material scheduling control method and system based on reducer knowledge graph
CN114925158A (en) * 2022-03-15 2022-08-19 青岛海尔科技有限公司 Sentence text intention recognition method and device, storage medium and electronic device
CN114385833B (en) * 2022-03-23 2023-05-12 支付宝(杭州)信息技术有限公司 Method and device for updating knowledge graph
CN114896408B (en) * 2022-03-24 2024-04-19 北京大学深圳研究生院 Construction method of material knowledge graph, material knowledge graph and application
CN114942998B (en) * 2022-04-25 2024-02-13 西北工业大学 Knowledge graph neighborhood structure sparse entity alignment method integrating multi-source data
CN114912637B (en) * 2022-05-21 2023-08-29 重庆大学 Human-computer object knowledge graph manufacturing production line operation and maintenance decision method and system and storage medium
CN114861677B (en) * 2022-05-30 2023-04-18 北京百度网讯科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN114707005B (en) * 2022-06-02 2022-10-25 浙江建木智能系统有限公司 Knowledge graph construction method and system for ship equipment
CN115017255B (en) * 2022-08-08 2022-11-01 杭州实在智能科技有限公司 Knowledge base construction and search method based on tree structure
CN115050085B (en) * 2022-08-15 2022-11-01 珠海翔翼航空技术有限公司 Method, system and equipment for recognizing objects of analog machine management system based on map
CN115510245B (en) * 2022-10-14 2024-05-14 北京理工大学 Unstructured data-oriented domain knowledge extraction method
CN115544626B (en) * 2022-10-21 2023-10-20 清华大学 Sub-model extraction method, device, computer equipment and medium
CN115795051B (en) * 2022-12-02 2023-05-23 中科雨辰科技有限公司 Data processing system for acquiring link entity based on entity relationship
CN115796189B (en) * 2023-01-31 2023-05-12 北京面壁智能科技有限责任公司 Semantic determining method, semantic determining device, electronic equipment and medium
CN116070001B (en) * 2023-02-03 2023-12-19 深圳市艾莉诗科技有限公司 Information directional grabbing method and device based on Internet
CN115827935B (en) * 2023-02-09 2023-05-23 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment
CN116127053B (en) * 2023-02-14 2024-01-02 北京百度网讯科技有限公司 Entity word disambiguation, knowledge graph generation and knowledge recommendation methods and devices
CN115878849B (en) * 2023-02-27 2023-05-26 北京奇树有鱼文化传媒有限公司 Video tag association method and device and electronic equipment
CN116503865A (en) * 2023-05-29 2023-07-28 北京石油化工学院 Hydrogen road transportation risk identification method and device, electronic equipment and storage medium
CN116362166A (en) * 2023-05-29 2023-06-30 青岛泰睿思微电子有限公司 Pattern merging system and method for chip packaging
CN116663537B (en) * 2023-07-26 2023-11-03 中信联合云科技有限责任公司 Big data analysis-based method and system for processing selected question planning information
CN116719955B (en) * 2023-08-09 2023-10-27 北京国电通网络技术有限公司 Label labeling information generation method and device, electronic equipment and readable medium
CN116756151B (en) * 2023-08-17 2023-11-24 公安部信息通信中心 Knowledge searching and data processing system
CN116821712B (en) * 2023-08-25 2023-12-19 中电科大数据研究院有限公司 Semantic matching method and device for unstructured text and knowledge graph
CN117272170B (en) * 2023-09-20 2024-03-08 东旺智能科技(上海)有限公司 Knowledge graph-based IT operation and maintenance fault root cause analysis method
CN117012373B (en) * 2023-10-07 2024-02-23 广州市妇女儿童医疗中心 Training method, application method and system of grape embryo auxiliary inspection model
CN117349386B (en) * 2023-10-12 2024-04-12 吉玖(天津)技术有限责任公司 Digital humane application method based on data strength association model
CN117172323B (en) * 2023-11-02 2024-01-23 知呱呱(天津)大数据技术有限公司 Patent multi-domain knowledge extraction method and system based on feature alignment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7014288B2 (en) * 2018-03-07 2022-02-01 日本電気株式会社 Knowledge expansion systems, methods and programs
CN108733792B (en) * 2018-05-14 2020-12-01 北京大学深圳研究生院 Entity relation extraction method
CN110609902B (en) * 2018-05-28 2021-10-22 华为技术有限公司 Text processing method and device based on fusion knowledge graph
CN110362660B (en) * 2019-07-23 2023-06-09 重庆邮电大学 Electronic product quality automatic detection method based on knowledge graph

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328653A (en) * 2020-10-30 2021-02-05 北京百度网讯科技有限公司 Data identification method and device, electronic equipment and storage medium
CN112328653B (en) * 2020-10-30 2023-07-28 北京百度网讯科技有限公司 Data identification method, device, electronic equipment and storage medium
WO2021190653A1 (en) * 2020-10-31 2021-09-30 平安科技(深圳)有限公司 Semantic parsing device and method, terminal, and storage medium
CN112395429A (en) * 2020-12-02 2021-02-23 上海三稻智能科技有限公司 Method, system and storage medium for determining, pushing and applying HS (high speed coding) codes based on graph neural network
CN112528660B (en) * 2020-12-04 2023-10-24 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for processing text
CN112528660A (en) * 2020-12-04 2021-03-19 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for processing text
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112507126A (en) * 2020-12-07 2021-03-16 厦门渊亭信息科技有限公司 Entity linking device and method based on recurrent neural network
CN112464669B (en) * 2020-12-07 2024-02-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device, and storage medium
CN112507126B (en) * 2020-12-07 2022-11-15 厦门渊亭信息科技有限公司 Entity linking device and method based on recurrent neural network
CN112426726A (en) * 2020-12-09 2021-03-02 网易(杭州)网络有限公司 Game event extraction method, device, storage medium and server
CN112508615A (en) * 2020-12-10 2021-03-16 深圳市欢太科技有限公司 Feature extraction method, feature extraction device, storage medium, and electronic apparatus
CN112380359B (en) * 2021-01-18 2021-04-20 平安科技(深圳)有限公司 Knowledge graph-based training resource allocation method, device, equipment and medium
CN112380359A (en) * 2021-01-18 2021-02-19 平安科技(深圳)有限公司 Knowledge graph-based training resource allocation method, device, equipment and medium
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short
CN113111660A (en) * 2021-04-22 2021-07-13 脉景(杭州)健康管理有限公司 Data processing method, device, equipment and storage medium
CN113220835A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Text information processing method and device, electronic equipment and storage medium
CN113220835B (en) * 2021-05-08 2023-09-29 北京百度网讯科技有限公司 Text information processing method, device, electronic equipment and storage medium
CN113268452A (en) * 2021-05-25 2021-08-17 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113268452B (en) * 2021-05-25 2024-02-02 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113297419A (en) * 2021-06-23 2021-08-24 南京谦萃智能科技服务有限公司 Video knowledge point determining method and device, electronic equipment and storage medium
CN113505889A (en) * 2021-07-23 2021-10-15 中国平安人寿保险股份有限公司 Processing method and device of atlas knowledge base, computer equipment and storage medium
CN114237829A (en) * 2021-12-27 2022-03-25 南方电网物资有限公司 Data acquisition and processing method for power equipment
CN114186690A (en) * 2022-02-16 2022-03-15 中国空气动力研究与发展中心计算空气动力研究所 Aircraft knowledge graph construction method, device, equipment and storage medium
CN114780749A (en) * 2022-05-05 2022-07-22 国网江苏省电力有限公司营销服务中心 Electric power entity chain finger method based on graph attention machine mechanism
CN115062619B (en) * 2022-08-11 2022-11-22 中国人民解放军国防科技大学 Chinese entity linking method, device, equipment and storage medium
CN115062619A (en) * 2022-08-11 2022-09-16 中国人民解放军国防科技大学 Chinese entity linking method, device, equipment and storage medium
CN116826933A (en) * 2023-08-30 2023-09-29 深圳科力远数智能源技术有限公司 Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system
CN116826933B (en) * 2023-08-30 2023-12-01 深圳科力远数智能源技术有限公司 Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system
CN117668259A (en) * 2024-02-01 2024-03-08 华安证券股份有限公司 Knowledge-graph-based inside and outside data linkage analysis method and device
CN117668259B (en) * 2024-02-01 2024-04-26 华安证券股份有限公司 Knowledge-graph-based inside and outside data linkage analysis method and device

Also Published As

Publication number Publication date
WO2021212682A1 (en) 2021-10-28

Similar Documents

Publication Publication Date Title
CN111639498A (en) Knowledge extraction method and device, electronic equipment and storage medium
CN111428488A (en) Resume data information analyzing and matching method and device, electronic equipment and medium
CN111680168A (en) Text feature semantic extraction method and device, electronic equipment and storage medium
CN110688854A (en) Named entity recognition method, device and computer readable storage medium
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
CN113157927B (en) Text classification method, apparatus, electronic device and readable storage medium
CN110110213B (en) Method and device for mining user occupation, computer readable storage medium and terminal equipment
CN112100384B (en) Data viewpoint extraction method, device, equipment and storage medium
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN113378970A (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN111753089A (en) Topic clustering method and device, electronic equipment and storage medium
CN115238670B (en) Information text extraction method, device, equipment and storage medium
CN113268615A (en) Resource label generation method and device, electronic equipment and storage medium
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
CN113204698B (en) News subject term generation method, device, equipment and medium
CN112364068A (en) Course label generation method, device, equipment and medium
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN117290515A (en) Training method of text annotation model, method and device for generating text graph
CN116450829A (en) Medical text classification method, device, equipment and medium
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN115146064A (en) Intention recognition model optimization method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination