CN111639498A - Knowledge extraction method and device, electronic equipment and storage medium - Google Patents
Info

Publication number
CN111639498A
CN111639498A
Authority
CN
China
Prior art keywords
entity
knowledge
text
initial
entity list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010318382.8A
Other languages
Chinese (zh)
Inventor
张聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202010318382.8A priority Critical patent/CN111639498A/en
Priority to PCT/CN2020/104964 priority patent/WO2021212682A1/en
Publication of CN111639498A publication Critical patent/CN111639498A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a knowledge extraction method and device, an electronic device, and a storage medium. The method preprocesses source data to obtain text data and identifies entities in the text data through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, achieving accurate structuring of unstructured data. The initial entity list is expanded on the basis of a knowledge graph to obtain a candidate entity list, covering similar surface expressions comprehensively. A semantic matching model trained with an Attention-DSSM algorithm then disambiguates the candidate entity list to obtain a target entity; because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, the target entity obtained after this analysis is more accurate. Finally, the target entity is linked to a node of the knowledge graph, and knowledge is extracted automatically from the information on that node, improving both the efficiency and the accuracy of knowledge extraction.

Description

Knowledge extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of data analysis, in particular to a knowledge extraction method and device, electronic equipment and a storage medium.
Background
Current knowledge extraction usually depends on templates, trigger words, or supervised learning: rules are summarized and data are labeled manually to form a rule base, and matching is then performed against that rule base.
Such methods are difficult to maintain and poorly portable: a large number of rule templates must be constructed by experts in each field, data labeling is labor-intensive, the quality of the labeled data is hard to control, the overall cost is too high, and new relations and classes are inconvenient to extend.
Disclosure of Invention
In view of the above, it is desirable to provide a knowledge extraction method, apparatus, electronic device, and storage medium that strengthen the association between each word and other words through an Attention mechanism, extract knowledge automatically according to the weight of key words, and improve the efficiency and accuracy of knowledge extraction.
A knowledge extraction method, the knowledge extraction method comprising:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
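For illustration only, the claimed steps can be lined up as a minimal pipeline sketch. Every function below is a hypothetical stand-in built from simple string and set operations; it is not the patent's Bi-LSTM + CRF or Attention-DSSM implementation, and the graph contents are invented.

```python
# Hypothetical sketch of the claimed pipeline; all logic is an
# illustrative stand-in, not the patent's implementation.

def preprocess(source):
    # Unify the source into cleaned text (OCR would run first for pictures).
    return source.strip()

def label_sequences(text):
    # Stand-in for the Bi-LSTM + CRF sequence labeling model.
    return text.split()

def expand(entities, graph):
    # Stand-in for knowledge-graph expansion with known aliases.
    out = list(entities)
    for entity in entities:
        out.extend(graph.get("aliases", {}).get(entity, []))
    return out

def disambiguate(candidates, graph):
    # Stand-in for Attention-DSSM matching: first candidate that is a node.
    return next((c for c in candidates if c in graph.get("nodes", set())), None)

def extract_knowledge(source, graph):
    text = preprocess(source)                 # S11: preprocessing
    initial = label_sequences(text)           # S12: initial entity list
    candidates = expand(initial, graph)       # S13: candidate entity list
    return disambiguate(candidates, graph)    # S14: target entity; linking omitted

graph = {"nodes": {"XX city middle people court"},
         "aliases": {"court": ["XX city middle people court"]}}
print(extract_knowledge("  the court  ", graph))  # XX city middle people court
```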
According to a preferred embodiment of the present invention, the preprocessing the source data to obtain text data includes:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
According to a preferred embodiment of the present invention, the knowledge extraction method further comprises:
configuring a sequence marking mode according to predefined demand data;
and adding the sequence labeling mode into a Bi-LSTM + CRF model to obtain the sequence labeling model.
According to the preferred embodiment of the present invention, the identifying the entities in the text data through the sequence labeling model based on Bi-LSTM + CRF to obtain the initial entity list includes:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
According to a preferred embodiment of the present invention, the expanding the initial entity list based on the preconfigured knowledge-graph to obtain the candidate entity list includes:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
According to the preferred embodiment of the present invention, the disambiguating the candidate entity list using the semantic matching model trained based on the Attention-DSSM algorithm to obtain the target entity includes:
coding each entity in the candidate entity list based on One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity in an Interaction layer, and outputting a semantic vector after Interaction of each entity;
and matching the semantic vector after interaction of each entity with the entities on the nodes of the knowledge graph on a matching layer, and outputting the entity with the highest matching degree as the target entity.
According to a preferred embodiment of the present invention, said extracting knowledge based on information on said nodes comprises:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
A knowledge extraction device, the knowledge extraction device comprising:
the acquisition unit is used for acquiring source data when a knowledge extraction instruction is received;
the preprocessing unit is used for preprocessing the source data to obtain text data;
the identification unit is used for identifying the entity in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
the expansion unit is used for expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
the disambiguation unit is used for carrying out disambiguation processing on the candidate entity list by adopting a semantic matching model trained based on the Attention-DSSM algorithm to obtain a target entity;
a linking unit, configured to link the target entity to a node of the knowledge-graph;
and the extraction unit is used for extracting knowledge based on the information on the nodes.
According to a preferred embodiment of the present invention, the preprocessing unit is specifically configured to:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
According to a preferred embodiment of the present invention, the knowledge extraction device further comprises:
the configuration unit is used for configuring the sequence marking mode according to predefined demand data;
and the adding unit is used for adding the sequence labeling mode into the Bi-LSTM + CRF model to obtain the sequence labeling model.
According to a preferred embodiment of the present invention, the identification unit is specifically configured to:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
According to a preferred embodiment of the present invention, the extension unit is specifically configured to:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
According to a preferred embodiment of the invention, the disambiguation unit is specifically configured to:
coding each entity in the candidate entity list based on One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity in an Interaction layer, and outputting a semantic vector after Interaction of each entity;
and matching the semantic vector after interaction of each entity with the entities on the nodes of the knowledge graph on a matching layer, and outputting the entity with the highest matching degree as the target entity.
According to a preferred embodiment of the present invention, the extracting unit is specifically configured to:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the knowledge extraction method.
A computer-readable storage medium having at least one instruction stored therein, the at least one instruction being executable by a processor in an electronic device to implement the knowledge extraction method.
The above technical scheme shows that the invention can acquire source data when a knowledge extraction instruction is received and preprocess the source data to obtain text data, unifying the formats. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, achieving accurate structuring of unstructured data. The initial entity list is then expanded on the basis of a pre-configured knowledge graph to obtain a candidate entity list, covering similar surface expressions comprehensively, and a semantic matching model trained with an Attention-DSSM algorithm disambiguates the candidate entity list to obtain a target entity. Because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the added Interaction layer strengthens the association between the texts to be matched, the resulting target entity is more accurate. The target entity is further linked to a node of the knowledge graph, and knowledge is extracted automatically from the information on the node, improving both the efficiency and the accuracy of knowledge extraction.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the knowledge extraction method of the present invention.
FIG. 2 is a schematic diagram of a relationship network extracted from an exemplary data source according to the present invention.
FIG. 3 is a functional block diagram of a preferred embodiment of the knowledge extraction apparatus of the present invention.
FIG. 4 is a schematic structural diagram of an electronic device implementing a knowledge extraction method according to a preferred embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a preferred embodiment of the knowledge extraction method of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
The knowledge extraction method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive Internet Protocol Television (IPTV), an intelligent wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud computing (cloud computing) based cloud consisting of a large number of hosts or network servers.
The Network where the electronic device is located includes, but is not limited to, the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
And S10, acquiring the source data when the knowledge extraction instruction is received.
In at least one embodiment of the invention, the knowledge extraction instructions may be triggered by specified users including, but not limited to: project managers, and the like.
Further, the source data may be obtained from a configuration database.
For example, when knowledge extraction is performed on the basis of a legal knowledge base, the source data may be obtained from a database accessible to the court, such as an internal court database or a source database on the web.
The source data may be a picture type or a text type, and the present invention is not limited thereto.
And S11, preprocessing the source data to obtain text data.
In order for the machine to be able to identify the source data, the electronic device first needs to pre-process the source data.
Specifically, the preprocessing the source data to obtain text data includes:
when the source data is of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and encoding the filtered text based on a UTF-8 (8-bit Unicode Transformation Format) encoding algorithm to obtain the text data.
Or when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
The electronic device may employ an OCR (Optical Character Recognition) algorithm to convert the source data into the initial text.
Meanwhile, encoding the filtered text with the UTF-8 algorithm allows operations such as full-width/half-width symbol conversion and removal of garbled characters to be performed, finally unifying the encoding.
Further, the text data may be in a TXT text format, or may be in other text formats, which is not limited in the present invention.
Through this embodiment, the source data can be filtered and cleaned to remove interfering information and then converted into a uniform text format, unifying the data format so that the preprocessed text data can be recognized and processed by a machine.
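A minimal sketch of this preprocessing step, assuming only Python's standard `unicodedata` module. NFKC normalization is used here as a stand-in for the full-width/half-width conversion, and the control-character filter as a stand-in for garbled-text removal; neither is necessarily the patent's exact method.

```python
import unicodedata

def clean_text(raw: str) -> bytes:
    # NFKC folds full-width symbols to their half-width forms.
    normalized = unicodedata.normalize("NFKC", raw)
    # Drop control characters (a common form of encoding debris),
    # keeping ordinary whitespace.
    filtered = "".join(ch for ch in normalized
                       if unicodedata.category(ch) != "Cc" or ch in "\n\t")
    # Emit unified UTF-8, silently dropping unencodable characters.
    return filtered.encode("utf-8", errors="ignore")

print(clean_text("Ａ１，ｂ"))  # b'A1,b'
```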
And S12, identifying the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list.
The text data obtained by preprocessing the source data is unstructured text data, so that key entity information in the text data needs to be identified, which is equivalent to performing sequence labeling on the text data. Therefore, the electronic device needs to first construct a sequence annotation model associated with the current knowledge extraction instruction.
Specifically, the knowledge extraction method further includes:
the electronic equipment configures a sequence labeling mode according to predefined demand data, and adds the sequence labeling mode to a Bi-LSTM + CRF model to obtain the sequence labeling model.
The sequence labeling mode can be configured according to specific task requirements.
For example: the same label may not be output consecutively, etc.
It should be noted that the Bi-LSTM (Bidirectional Long Short-Term Memory) layer provides long-distance dependency modeling and strengthens the relation between each character and its context. A CRF (Conditional Random Field) can accommodate arbitrary context information, so feature design is flexible; the CRF layer models the transitions between labels and takes the ordering of output labels into account, achieving a more accurate recognition result.
In at least one embodiment of the present invention, the identifying the entities in the text data through the Bi-LSTM + CRF-based sequence labeling model, and obtaining the initial entity list includes:
the electronic equipment inputs the text data into the sequence labeling model based on the Bi-LSTM + CRF, obtains output probability and transition probability of each corresponding label at each sequence position in a Softmax layer, calculates sum of the output probability and the transition probability of each label as score of each label for each sequence position, determines the label with the highest score as the output label of each sequence position, and combines the output labels of each sequence position to obtain the initial entity list.
Specifically, for each input sequence X = (x_1, x_2, …, x_n), a predicted tag sequence Y = (y_1, y_2, …, y_n) can be obtained, and its score is defined as:

s(X, Y) = Σ_{i=1}^{n} P_{i,y_i} + Σ_{i=1}^{n-1} A_{y_i,y_{i+1}}

where P_{i,y_i} is the probability output by the Softmax layer that the i-th sequence position takes tag y_i, and A_{y_i,y_{i+1}} is the transition probability from y_i to y_{i+1}.
According to this formula, a predicted sequence with a high score does not necessarily take, at every position, the label with the maximum probability output by the Softmax layer; the transition probabilities are considered jointly, i.e. the sequence labeling mode is respected (for example, B cannot be followed by B).
For example, suppose that after Bi-LSTM processing the most probable per-position output sequence is B B I B I O O. Since, according to the sequence labeling mode, the probability of the transition B -> B in the transition probability matrix is very small or even negative, this sequence will not obtain the highest score, i.e. it will not yield the initial entity list.
Continuing the example above: if B-PER denotes the first character of a person's name, E-PER the last character of a person's name, O an independent character, B-ORG the first character of an organization name, and I-ORG a middle character of an organization name, then the initial entity list obtained by merging label items of the same category in the sequence may include: the sequence (B, E), representing a person's name; the sequence (B, I, E), representing an organization name; and the sequence (O), representing an independent character.
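The scoring rule in the example above can be sketched as follows: for a tag sequence, sum the Softmax output of each position's tag with the transition score between adjacent tags. The tag set and all numbers below are illustrative, not taken from the patent.

```python
# Hedged sketch of the score s(X, Y): emission scores plus transition scores.
def sequence_score(emissions, transitions, tags):
    emit = sum(emissions[i][t] for i, t in enumerate(tags))
    trans = sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return emit + trans

emissions = [{"B": 2.0, "I": 0.1, "O": 0.3},   # position 1 Softmax outputs
             {"B": 1.5, "I": 1.2, "O": 0.2}]   # position 2 Softmax outputs
transitions = {"B": {"B": -10.0, "I": 1.0, "O": 0.0},   # B -> B is forbidden
               "I": {"B": 0.0, "I": 0.5, "O": 0.5},
               "O": {"B": 0.5, "I": -10.0, "O": 0.2}}

# Although "B" has the highest emission at both positions, the forbidden
# B -> B transition keeps (B, B) from winning.
print(sequence_score(emissions, transitions, ["B", "I"]))  # 4.2
print(sequence_score(emissions, transitions, ["B", "B"]))  # -6.5
```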
S13, expanding the initial entity list based on the pre-configured knowledge graph to obtain a candidate entity list.
The knowledge graph can be configured in advance according to each technical field, such as: legal knowledge maps in the legal domain, etc.
It should be noted that each entity in the initial entity list may be a partial representation or an alternative representation of the entity, and therefore, a surface name extension needs to be performed on each entity to obtain the candidate entity list.
For example: the "XX middle court" and the "XX City middle people court" are different representations of the same entity.
Specifically, the expanding the initial entity list based on the preconfigured knowledge graph to obtain the candidate entity list includes:
the electronic equipment calculates the cosine similarity between each entity in the initial entity list and the entities on each node in the knowledge graph, and obtains at least one entity with the cosine similarity larger than or equal to the preset similarity from each node as a candidate entity, and the electronic equipment constructs the candidate entity list according to the initial entity list and the candidate entity.
The preset similarity may be configured by the user, for example: 99.7%.
Cosine similarity measures the similarity of two texts by the cosine of the angle between their vectors in a vector space; compared with distance metrics, it focuses more on the difference of the two vectors in direction. Generally, after obtaining vector representations of two texts via an embedding, cosine similarity can be used to compute their similarity.
By the embodiment, the similarity between each entity and the entity on the node of the knowledge graph can be calculated by adopting cosine similarity so as to judge whether the coreference relationship exists, and further the expansion of the initial entity list is realized to obtain the candidate entity list with more comprehensive coverage.
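The expansion step can be sketched in plain Python as below. The embeddings, entity names, and the 0.9 threshold are illustrative assumptions; the patent does not specify them.

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors; 0.0 for degenerate vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_entities(initial, node_vectors, threshold=0.9):
    # initial: entity name -> embedding, from the initial entity list.
    # node_vectors: knowledge-graph node entity name -> embedding.
    candidates = list(initial)
    for vec in initial.values():
        for node_name, node_vec in node_vectors.items():
            if node_name not in candidates and cosine(vec, node_vec) >= threshold:
                candidates.append(node_name)
    return candidates

nodes = {"XX city middle people court": [0.96, 0.28],  # toy embeddings
         "unrelated entity": [0.0, 1.0]}
print(expand_entities({"XX middle court": [1.0, 0.0]}, nodes))
# ['XX middle court', 'XX city middle people court']
```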
S14, disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM (Attention + Deep Structured Semantic Model) algorithm to obtain a target entity.
The candidate entity list obtained by expanding the initial entity list may contain several candidates; it therefore needs further disambiguation so that the unique target entity best matching an entity on the knowledge-graph nodes can be determined more accurately from the multiple similar representations.
Specifically, the disambiguating the candidate entity list by using the semantic matching model trained based on the Attention-DSSM algorithm to obtain the target entity includes:
the electronic equipment encodes each entity in the candidate entity list based on an One-Hot encoding algorithm to obtain a word ID of each entity, inputs the word ID of each entity into a pre-configured dictionary, outputs a word vector of each entity, processes the word vector of each entity based on an Attention mechanism to obtain a semantic representation of each entity, further interacts the semantic representation of each entity on an Interaction layer (Interaction layer), outputs the semantic vector after Interaction of each entity, matches the semantic vector after Interaction of each entity with the entity on the knowledge graph node on a matching layer, and outputs the entity with the highest matching degree as the target entity.
The traditional DSSM (Deep Structured Semantic Model) represents the extracted entity's context information and the candidate entities' context information as low-dimensional semantic vectors and computes the distance between the two semantic vectors by cosine distance. However, because the DSSM adopts a bag-of-words model, word-order information and context information are lost. In addition, the DSSM is a weakly supervised end-to-end model: its predictions are hard to control, it cannot capture long-distance information, and it suffers from problems such as vanishing gradients.
In consideration of the particularity of the semantic matching task, the embodiment adopts a semantic matching model trained based on the Attention-DSSM algorithm, and the semantic matching model may include, from bottom to top: the system comprises an input layer, a semantic representation layer, an Interaction layer and a matching layer.
Through this embodiment, the word vector of each entity is processed with the Attention mechanism, which strengthens the semantic representation, the association between each word in the text and the other words, and the weight of key words in the text, yielding high accuracy. In addition, the added Interaction layer lets the two texts to be matched interact: expressing each in terms of the other strengthens their association and improves the model's generalization ability.
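A toy sketch of the semantic-representation and matching layers, using plain Python in place of a trained Attention-DSSM. The query vector, word vectors, and node names are illustrative assumptions, and the Interaction layer is omitted for brevity.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention_pool(word_vectors, query):
    # Semantic-representation layer sketch: weight each word vector by its
    # softmax-normalized dot product with a query vector, then sum, so key
    # words contribute more to the sentence vector.
    scores = softmax([sum(a * b for a, b in zip(w, query)) for w in word_vectors])
    dim = len(word_vectors[0])
    return [sum(s * w[d] for s, w in zip(scores, word_vectors)) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_match(entity_vector, node_vectors):
    # Matching-layer sketch: output the node entity with the highest cosine score.
    return max(node_vectors, key=lambda name: cosine(entity_vector, node_vectors[name]))

# Two toy word vectors; the query makes the first word the "key" word.
entity_vector = attention_pool([[2.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])
print(best_match(entity_vector, {"court": [1.0, 0.1], "hospital": [0.0, 1.0]}))  # court
```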
S15, linking the target entity to the nodes of the knowledge-graph.
Specifically, the electronic device obtains a node on the knowledge-graph corresponding to the target entity, and further links the target entity to the node of the knowledge-graph.
For example: and linking the XX middle court in the legal document with the XX city middle people court node in the legal knowledge graph, wherein the attributes and the relationship of the XX middle court are the same as those of the XX city middle people court.
It should be noted that entity linking is a relatively mature technology, which the present invention does not describe further here.
And S16, extracting knowledge based on the information on the node.
Specifically, the extracting knowledge based on the information on the node includes:
the electronic equipment acquires at least one path between the nodes and the associated information on each path from the information on the nodes, and extracts at least one relation network based on the associated information on each path and the corresponding path.
As shown in fig. 2, for the acquired source data, a corresponding relationship network may be extracted.
It should be noted that the knowledge graph is composed of a large amount of knowledge and the relations between pieces of knowledge: nodes in the network represent entities that exist in the real world, and the edges between nodes represent the relations between two entities. Through this combination of points and edges, real-world knowledge is abstracted into a knowledge network suitable for machine processing.
Through the above embodiment, knowledge extraction is performed based on the information on the nodes, implicit information of the linked nodes can be acquired after the target entity is linked to the nodes of the knowledge graph, and extraction of relationship and event information can be performed according to paths among the nodes.
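The path-based extraction described above can be sketched as a breadth-first collection of paths over knowledge-graph triples. The triples, relation names, and the path-length cap below are invented for illustration only.

```python
from collections import deque

def find_paths(edges, start, end, max_len=3):
    # edges: (head, relation, tail) triples of the knowledge graph.
    adjacency = {}
    for head, rel, tail in edges:
        adjacency.setdefault(head, []).append((rel, tail))
    paths, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == end and path:
            paths.append(path)  # one relation network: a path plus its relations
            continue
        if len(path) >= max_len:
            continue
        for rel, tail in adjacency.get(node, []):
            queue.append((tail, path + [(node, rel, tail)]))
    return paths

# Invented triples for illustration only.
edges = [("party A", "sued", "party B"),
         ("party B", "represented by", "lawyer C")]
print(find_paths(edges, "party A", "lawyer C"))
```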
The technical scheme above shows that the method acquires source data when a knowledge extraction instruction is received and preprocesses the source data to obtain text data, unifying the format. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, so that unstructured data is converted accurately. The initial entity list is then expanded based on a preconfigured knowledge graph to obtain a candidate entity list, covering similar representations comprehensively, and the candidate entity list is disambiguated with a semantic matching model trained on the Attention-DSSM algorithm to obtain a target entity. Because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the newly added Interaction layer strengthens the association between the texts to be matched, the obtained target entity is more accurate. The target entity is then linked to the nodes of the knowledge graph, and knowledge is extracted automatically from the information on the nodes, improving both the efficiency and the accuracy of knowledge extraction.
Fig. 3 is a functional block diagram of the knowledge extracting apparatus according to the preferred embodiment of the present invention. The knowledge extraction device 11 comprises an acquisition unit 110, a preprocessing unit 111, an identification unit 112, an expansion unit 113, a disambiguation unit 114, a linking unit 115, an extraction unit 116, a configuration unit 117, and an addition unit 118. The module/unit referred to in the present invention refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
When receiving a knowledge extraction instruction, the acquisition unit 110 acquires source data.
In at least one embodiment of the invention, the knowledge extraction instructions may be triggered by specified users including, but not limited to: project managers, and the like.
Further, the source data may be obtained from a configuration database.
For example, when knowledge extraction is performed on a legal knowledge base, the source data may be obtained from databases accessible to the court, such as a database inside the court or a source database on the web.
The source data may be a picture type or a text type, and the present invention is not limited thereto.
The preprocessing unit 111 preprocesses the source data to obtain text data.
In order to enable the machine to recognize the source data, the preprocessing unit 111 first needs to preprocess the source data.
Specifically, the preprocessing unit 111 preprocesses the source data to obtain text data, and includes:
when the source data is of a picture type, the preprocessing unit 111 converts the source data into an initial text, filters and cleans the initial text to obtain a filtered text, and encodes the filtered text with the UTF-8 (8-bit Unicode Transformation Format) encoding algorithm to obtain the text data.
Or when the source data is a text type, the preprocessing unit 111 filters and cleans the source data to obtain a filtered text, and codes the filtered text based on a UTF-8 coding algorithm to obtain the text data.
Wherein the preprocessing unit 111 may employ an OCR (Optical Character Recognition) algorithm to convert the source data into the initial text.
Meanwhile, encoding the filtered text with the UTF-8 encoding algorithm allows operations such as full-width/half-width symbol conversion and removal of garbled characters to be performed on the filtered text, finally unifying the encoding.
Further, the text data may be in a TXT text format, or may be in other text formats, which is not limited in the present invention.
Through the implementation mode, the source data can be filtered and cleaned to eliminate interference information, and further the source data is converted into a uniform text format, so that the uniformity of the data format is realized, and the preprocessed text data can be recognized and processed by a machine.
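As a minimal sketch of this preprocessing step, full-width/half-width conversion followed by UTF-8 encoding might look like the following (the specific cleaning rules here are illustrative assumptions, not the exact filter list of this embodiment):

```python
def fullwidth_to_halfwidth(text: str) -> str:
    # Map full-width ASCII variants (U+FF01..U+FF5E) to their half-width
    # counterparts, and the ideographic space (U+3000) to a normal space.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return ''.join(out)

def preprocess(raw: str) -> bytes:
    # Drop Unicode replacement characters left over from garbled input,
    # normalize full-width symbols, then encode the result as UTF-8.
    cleaned = raw.replace('\ufffd', '')
    cleaned = fullwidth_to_halfwidth(cleaned)
    return cleaned.encode('utf-8')
```

For instance, `preprocess('Ｈｅｌｌｏ')` yields the UTF-8 bytes of `'Hello'`, so downstream components see a single, uniform encoding.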
The identification unit 112 identifies the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF, resulting in an initial entity list.
The text data obtained by preprocessing the source data is unstructured text data, so that key entity information in the text data needs to be identified, which is equivalent to performing sequence labeling on the text data. Therefore, it is necessary to first construct a sequence annotation model associated with the current knowledge extraction instruction.
Specifically, the configuration unit 117 configures a sequence labeling mode according to predefined demand data, and the adding unit 118 adds the sequence labeling mode to the Bi-LSTM + CRF model to obtain the sequence labeling model.
The sequence labeling mode can be configured according to specific task requirements.
For example: the same label may not be output consecutively, etc.
It should be noted that the Bi-LSTM (Bidirectional Long Short-Term Memory) layer models long-distance dependencies and strengthens the relation between each character and its context characters, while a CRF (Conditional Random Field) can accommodate arbitrary context information, making feature design flexible. The CRF layer models the transitions and correspondences between characters and, at the same time, takes the order of the output labels into account, thereby achieving a more accurate recognition result.
In at least one embodiment of the present invention, the identifying unit 112 identifies the entities in the text data through a sequence labeling model based on Bi-LSTM + CRF, and the obtaining of the initial entity list includes:
the identification unit 112 inputs the text data into the Bi-LSTM + CRF-based sequence labeling model and obtains, at the Softmax layer, the output probability and the transition probability of each candidate label at each sequence position. For each sequence position, the identification unit 112 calculates the sum of the output probability and the transition probability of each label as the score of that label and determines the label with the highest score as the output label of that position. The identification unit 112 then combines the output labels of all sequence positions to obtain the initial entity list.
Specifically, for each input sequence X = (x1, x2, …, xn), a predicted tag sequence Y = (y1, y2, …, yn) can be obtained. The score is defined as follows:

    s(X, Y) = Σ_{i=1..n} P(i, yi) + Σ_{i=0..n} A(yi, yi+1)

where P(i, yi) is the probability that the output of the Softmax layer at the i-th sequence position is yi, and A(yi, yi+1) is the transition probability from yi to yi+1.
According to this formula, a high-scoring predicted sequence does not simply take, at every position, the label with the maximum probability output by the Softmax layer; the transition probabilities are considered as well, i.e., the sequence labeling pattern must be satisfied (for example, B cannot be followed by B).
For example, suppose that after Bi-LSTM processing the most likely per-position output sequence is B B I B I O O. Since, according to the sequence labeling pattern, the probability of B -> B in the transition probability matrix is very small (or even negative), that sequence will not receive the highest score, i.e., it will not yield the initial entity list.
Continuing the example, let B-PER denote the first-character label of a person name, E-PER the last-character label of a person name, O the independent-character label, B-ORG the first-character label of an organization name, and I-ORG a middle-character label of an organization name. The initial entity list obtained by merging label items of the same category in the sequence may then include: the sequence (B, E), representing a person name; the sequence (B, I, E), representing an organization name; and the sequence (O), representing an independent character.
The expansion unit 113 expands the initial entity list based on a preconfigured knowledge graph to obtain a candidate entity list.
The knowledge graph can be configured in advance according to each technical field, such as: legal knowledge maps in the legal domain, etc.
It should be noted that each entity in the initial entity list may be a partial representation or an alternative representation of the entity, and therefore, a surface name extension needs to be performed on each entity to obtain the candidate entity list.
For example: the "XX middle court" and the "XX City middle people court" are different representations of the same entity.
Specifically, the expanding unit 113 expands the initial entity list based on a preconfigured knowledge graph, and obtaining a candidate entity list includes:
the expansion unit 113 calculates the cosine similarity between each entity in the initial entity list and the entity on each node of the knowledge graph, and obtains from the nodes at least one entity whose cosine similarity is greater than or equal to a preset similarity as a candidate entity. The expansion unit 113 then constructs the candidate entity list from the initial entity list and the candidate entities.
The preset similarity may be configured by the user, for example, 99.7%.
Cosine similarity measures the similarity of two texts by the cosine of the angle between their vectors in a vector space; compared with distance metrics, it focuses more on the difference in direction between the two vectors. Generally, after a vector representation of two texts is obtained through an embedding, cosine similarity can be used to calculate the similarity between them.
By the embodiment, the similarity between each entity and the entity on the node of the knowledge graph can be calculated by adopting cosine similarity so as to judge whether the coreference relationship exists, and further the expansion of the initial entity list is realized to obtain the candidate entity list with more comprehensive coverage.
The disambiguation unit 114 disambiguates the candidate entity list with a semantic matching model trained on the Attention-DSSM (Attention-based Deep Structured Semantic Model) algorithm to obtain a target entity.
The candidate entity list obtained by expanding the initial entity list may contain multiple candidates; therefore, the candidate entity list needs further disambiguation, so that the unique target entity best matching an entity on a knowledge-graph node can be determined more accurately from the multiple similar representations.
Specifically, the disambiguation unit 114 performs disambiguation on the candidate entity list by using a semantic matching model trained based on the Attention-DSSM algorithm, and obtaining a target entity includes:
the disambiguation unit 114 encodes each entity in the candidate entity list with the One-Hot encoding algorithm to obtain the word ID of each entity, inputs the word ID into a preconfigured dictionary, and outputs the word vector of each entity. The word vector of each entity is processed with the Attention mechanism to obtain a semantic representation of each entity; the semantic representations are then made to interact at the Interaction layer, which outputs the post-interaction semantic vector of each entity. At the matching layer, the post-interaction semantic vector of each entity is matched against the entities on the knowledge-graph nodes, and the entity with the highest matching degree is output as the target entity.
The traditional DSSM (Deep Structured Semantic Model) expresses the context information of the extracted entity and of the candidate entities as low-dimensional semantic vectors and calculates the distance between the two vectors by cosine distance. However, because DSSM adopts a bag-of-words model, word-order information and context information are lost. In addition, the DSSM model is a weakly supervised, end-to-end model: its prediction result is hard to control, it cannot capture long-distance information, and it suffers from problems such as vanishing gradients.
In view of the particularity of the semantic matching task, this embodiment adopts a semantic matching model trained with the Attention-DSSM algorithm; from bottom to top, the model may comprise an input layer, a semantic representation layer, an Interaction layer, and a matching layer.
In this implementation, processing the word vector of each entity with the Attention mechanism strengthens the semantic representation: the association between each word in a text and the other words is reinforced, and the weight of key words in the text is raised, giving high accuracy. In addition, the newly added Interaction layer lets the two texts to be matched interact, so that each is expressed in terms of the other; this strengthens the association between the two texts and improves the generalization ability of the model.
The linking unit 115 links the target entity to a node of the knowledge-graph.
Specifically, the linking unit 115 acquires a node on the knowledge-graph corresponding to the target entity, and further links the target entity to the node of the knowledge-graph.
For example, the entity "XX Intermediate Court" in a legal document is linked to the node "XX City Intermediate People's Court" in the legal knowledge graph; after linking, "XX Intermediate Court" carries the same attributes and relations as "XX City Intermediate People's Court".
It should be noted that entity linking is a relatively mature technique, which the present invention does not describe in detail here.
The extraction unit 116 performs knowledge extraction based on the information on the node.
Specifically, the extracting unit 116 performs knowledge extraction based on the information on the node, including:
the extraction unit 116 obtains at least one path between nodes and associated information on each path from the information on the nodes, and extracts at least one relationship network based on the associated information on each path and the corresponding path.
As shown in fig. 2, for the acquired source data, a corresponding relationship network may be extracted.
It should be noted that a knowledge graph consists of a large amount of knowledge and the relations between items of knowledge: nodes in the network represent entities that exist in the real world, and edges between nodes represent the relations between two entities. Through this combination of nodes and edges, real-world knowledge is abstracted into a knowledge network suitable for machine processing.
Through the above embodiment, knowledge extraction is performed based on the information on the nodes, implicit information of the linked nodes can be acquired after the target entity is linked to the nodes of the knowledge graph, and extraction of relationship and event information can be performed according to paths among the nodes.
The technical scheme above shows that the method acquires source data when a knowledge extraction instruction is received and preprocesses the source data to obtain text data, unifying the format. Entities in the text data are identified through a Bi-LSTM + CRF sequence labeling model to obtain an initial entity list, so that unstructured data is converted accurately. The initial entity list is then expanded based on a preconfigured knowledge graph to obtain a candidate entity list, covering similar representations comprehensively, and the candidate entity list is disambiguated with a semantic matching model trained on the Attention-DSSM algorithm to obtain a target entity. Because the Attention mechanism strengthens the association between each word and the other words and raises the weight of key words, and the newly added Interaction layer strengthens the association between the texts to be matched, the obtained target entity is more accurate. The target entity is then linked to the nodes of the knowledge graph, and knowledge is extracted automatically from the information on the nodes, improving both the efficiency and the accuracy of knowledge extraction.
Fig. 4 is a schematic structural diagram of an electronic device according to a preferred embodiment of the knowledge extraction method of the present invention.
The electronic device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a knowledge extraction program, stored in the memory 12 and executable on the processor 13.
It will be understood by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not limit it. The electronic device 1 may have a bus-type or star-type structure, may include more or fewer hardware or software components than shown, or may have a different arrangement of components; for example, the electronic device 1 may further include input/output devices, network access devices, and the like.
It should be noted that the electronic device 1 is only an example, and other existing or future electronic products, such as those that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
The memory 12 includes at least one type of readable storage medium, which includes flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. The memory 12 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 may be used not only to store application software installed in the electronic apparatus 1 and various types of data such as codes of a knowledge extraction program, but also to temporarily store data that has been output or is to be output.
The processor 13 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 13 is a Control Unit (Control Unit) of the electronic device 1, connects various components of the electronic device 1 by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (for example, executing a knowledge extraction program and the like) stored in the memory 12 and calling data stored in the memory 12.
The processor 13 executes an operating system of the electronic device 1 and various installed application programs. The processor 13 executes the application program to implement the steps in the above-described respective knowledge extraction method embodiments, such as steps S10, S11, S12, S13, S14, S15, S16 shown in fig. 1.
Alternatively, the processor 13, when executing the computer program, implements the functions of the modules/units in the above device embodiments, for example:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the electronic device 1. For example, the computer program may be divided into an acquisition unit 110, a pre-processing unit 111, an identification unit 112, an extension unit 113, a disambiguation unit 114, a linking unit 115, an extraction unit 116, a configuration unit 117, an addition unit 118.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the knowledge extraction method according to the embodiments of the present invention.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented.
Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 4, but this does not indicate only one bus or one type of bus. The bus is arranged to enable connection communication between the memory 12 and at least one processor 13 or the like.
Although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 13 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
Fig. 4 only shows the electronic device 1 with components 12-13, and it will be understood by a person skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
In conjunction with fig. 1, the memory 12 in the electronic device 1 stores a plurality of instructions to implement a knowledge extraction method, and the processor 13 executes the plurality of instructions to implement:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
Specifically, the processor 13 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the instruction, which is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or devices recited in the system claims may also be implemented by one unit or device through software or hardware. Terms such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A knowledge extraction method, characterized by comprising:
when a knowledge extraction instruction is received, acquiring source data;
preprocessing the source data to obtain text data;
identifying entities in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
disambiguating the candidate entity list by adopting a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity;
linking the target entity to a node of the knowledge-graph;
and extracting knowledge based on the information on the nodes.
2. The knowledge extraction method of claim 1, wherein the preprocessing the source data to obtain text data comprises:
when the source data are of a picture type, converting the source data into an initial text, filtering and cleaning the initial text to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data; or
And when the source data is of a text type, filtering and cleaning the source data to obtain a filtered text, and coding the filtered text based on a UTF-8 coding algorithm to obtain the text data.
3. The knowledge extraction method of claim 1, further comprising:
configuring a sequence marking mode according to predefined demand data;
and adding the sequence labeling mode into a Bi-LSTM + CRF model to obtain the sequence labeling model.
4. The knowledge extraction method of claim 1, wherein the identifying entities in the text data through a Bi-LSTM + CRF based sequence labeling model, and obtaining an initial entity list comprises:
inputting the text data into the sequence labeling model based on Bi-LSTM + CRF, and acquiring the output probability and the transition probability of each corresponding label at each sequence position in a Softmax layer;
calculating the sum of the output probability and the transition probability of each label as the score of each label for each sequence position;
determining the label with the highest score as the output label of each sequence position;
and combining the output labels of each sequence position to obtain the initial entity list.
5. The knowledge extraction method of claim 1, wherein the expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list comprises:
calculating the cosine similarity between each entity in the initial entity list and the entity on each node in the knowledge graph;
acquiring at least one entity with cosine similarity greater than or equal to preset similarity from each node as a candidate entity;
and constructing the candidate entity list according to the initial entity list and the candidate entities.
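A small sketch of claim 5's expansion step, assuming pre-computed embedding vectors for every entity (the claim does not say how the vectors are obtained, so the `embed` lookup table here is illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_entities(initial_list, graph_entities, embed, preset_similarity=0.8):
    """Claim 5: keep graph entities whose cosine similarity to any entity in
    the initial list reaches the preset similarity, then merge the candidates
    with the initial list to construct the candidate entity list."""
    candidates = []
    for entity in initial_list:
        for node_entity in graph_entities:
            if node_entity in candidates or node_entity in initial_list:
                continue
            if cosine(embed[entity], embed[node_entity]) >= preset_similarity:
                candidates.append(node_entity)
    return initial_list + candidates
```

The preset similarity of 0.8 is a placeholder; the patent leaves the threshold as a configurable value.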
6. The knowledge extraction method of claim 1, wherein the disambiguating the candidate entity list using a semantic matching model trained based on an Attention-DSSM algorithm to obtain a target entity comprises:
coding each entity in the candidate entity list based on a One-Hot coding algorithm to obtain a word ID of each entity;
inputting the word ID of each entity into a pre-configured dictionary, and outputting a word vector of each entity;
processing the word vector of each entity based on an Attention mechanism to obtain semantic representation of each entity;
interacting the semantic representation of each entity at an Interaction layer, and outputting an interacted semantic vector of each entity;
and matching the interacted semantic vector of each entity with the entities on the nodes of the knowledge graph at a matching layer, and outputting the entity with the highest matching degree as the target entity.
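The Attention step in claim 6 can be illustrated with a toy softmax attention over an entity's word vectors. The dot-product weighting scheme is an assumption for illustration only, since the claim names the mechanism but not its exact form:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_representation(word_vectors, query):
    """Weight each word vector by the softmax of its dot product with a
    query vector, then sum the weighted vectors, giving one semantic
    representation per entity (the input to the Interaction layer)."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, word_vectors)) for i in range(dim)]
```

In the full Attention-DSSM pipeline these representations would then be interacted and matched against knowledge-graph entities; only the attention pooling is sketched here.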
7. The knowledge extraction method of claim 1, wherein the extracting knowledge based on information on the node comprises:
acquiring at least one path between nodes and associated information on each path from the information on the nodes;
and extracting at least one relation network based on the associated information on each path and the corresponding path.
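Claim 7's extraction step can be sketched as a breadth-first enumeration of simple paths whose edges carry the associated relation information. The adjacency-list shape (`node -> [(neighbor, relation), ...]`) is an illustrative assumption:

```python
from collections import deque

def extract_relation_networks(graph, start, goal):
    """Enumerate simple paths between two nodes together with the
    associated relation stored on each traversed edge, yielding one
    (path, relations) pair per relation network."""
    networks = []
    queue = deque([(start, [start], [])])
    while queue:
        node, path, relations = queue.popleft()
        if node == goal:
            networks.append((path, relations))
            continue
        for neighbor, relation in graph.get(node, []):
            if neighbor not in path:  # keep paths simple (no revisits)
                queue.append((neighbor, path + [neighbor], relations + [relation]))
    return networks
```

Each returned pair is one relation network in the sense of the claim: a node path plus the associated information along it.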
8. A knowledge extraction apparatus, characterized by comprising:
the acquisition unit is used for acquiring source data when a knowledge extraction instruction is received;
the preprocessing unit is used for preprocessing the source data to obtain text data;
the identification unit is used for identifying the entity in the text data through a sequence labeling model based on Bi-LSTM + CRF to obtain an initial entity list;
the expansion unit is used for expanding the initial entity list based on a pre-configured knowledge graph to obtain a candidate entity list;
the disambiguation unit is used for carrying out disambiguation processing on the candidate entity list by adopting a semantic matching model trained based on the Attention-DSSM algorithm to obtain a target entity;
a linking unit, configured to link the target entity to a node of the knowledge graph;
and the extraction unit is used for extracting knowledge based on the information on the nodes.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the knowledge extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein at least one instruction that is executed by a processor in an electronic device to implement the knowledge extraction method of any one of claims 1 to 7.
CN202010318382.8A 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium Pending CN111639498A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010318382.8A CN111639498A (en) 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium
PCT/CN2020/104964 WO2021212682A1 (en) 2020-04-21 2020-07-27 Knowledge extraction method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010318382.8A CN111639498A (en) 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111639498A true CN111639498A (en) 2020-09-08

Family

ID=72328869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010318382.8A Pending CN111639498A (en) 2020-04-21 2020-04-21 Knowledge extraction method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111639498A (en)
WO (1) WO2021212682A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328653A (en) * 2020-10-30 2021-02-05 北京百度网讯科技有限公司 Data identification method and device, electronic equipment and storage medium
CN112380359A (en) * 2021-01-18 2021-02-19 平安科技(深圳)有限公司 Knowledge graph-based training resource allocation method, device, equipment and medium
CN112395429A (en) * 2020-12-02 2021-02-23 上海三稻智能科技有限公司 Method, system and storage medium for determining, pushing and applying HS (high speed coding) codes based on graph neural network
CN112426726A (en) * 2020-12-09 2021-03-02 网易(杭州)网络有限公司 Game event extraction method, device, storage medium and server
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112507126A (en) * 2020-12-07 2021-03-16 厦门渊亭信息科技有限公司 Entity linking device and method based on recurrent neural network
CN112508615A (en) * 2020-12-10 2021-03-16 深圳市欢太科技有限公司 Feature extraction method, feature extraction device, storage medium, and electronic apparatus
CN112528660A (en) * 2020-12-04 2021-03-19 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for processing text
CN113111660A (en) * 2021-04-22 2021-07-13 脉景(杭州)健康管理有限公司 Data processing method, device, equipment and storage medium
CN113220835A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Text information processing method and device, electronic equipment and storage medium
CN113268452A (en) * 2021-05-25 2021-08-17 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113297419A (en) * 2021-06-23 2021-08-24 南京谦萃智能科技服务有限公司 Video knowledge point determining method and device, electronic equipment and storage medium
WO2021190653A1 (en) * 2020-10-31 2021-09-30 平安科技(深圳)有限公司 Semantic parsing device and method, terminal, and storage medium
CN113505889A (en) * 2021-07-23 2021-10-15 中国平安人寿保险股份有限公司 Processing method and device of atlas knowledge base, computer equipment and storage medium
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short
CN114186690A (en) * 2022-02-16 2022-03-15 中国空气动力研究与发展中心计算空气动力研究所 Aircraft knowledge graph construction method, device, equipment and storage medium
CN114237829A (en) * 2021-12-27 2022-03-25 南方电网物资有限公司 Data acquisition and processing method for power equipment
CN114780749A (en) * 2022-05-05 2022-07-22 国网江苏省电力有限公司营销服务中心 Electric power entity chain finger method based on graph attention machine mechanism
CN115062619A (en) * 2022-08-11 2022-09-16 中国人民解放军国防科技大学 Chinese entity linking method, device, equipment and storage medium
CN116826933A (en) * 2023-08-30 2023-09-29 深圳科力远数智能源技术有限公司 Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system
CN117668259A (en) * 2024-02-01 2024-03-08 华安证券股份有限公司 Knowledge-graph-based inside and outside data linkage analysis method and device

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218931B (en) * 2021-11-04 2024-01-23 北京百度网讯科技有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium
CN114239583B (en) * 2021-12-15 2023-04-07 北京百度网讯科技有限公司 Method, device, equipment and medium for training entity chain finger model and entity chain finger
CN114218403B (en) * 2021-12-20 2024-04-09 平安付科技服务有限公司 Fault root cause positioning method, device, equipment and medium based on knowledge graph
CN114416976A (en) * 2021-12-23 2022-04-29 北京百度网讯科技有限公司 Text labeling method and device and electronic equipment
CN114330345B (en) * 2021-12-24 2023-01-17 北京百度网讯科技有限公司 Named entity recognition method, training method, device, electronic equipment and medium
CN114491232B (en) * 2021-12-24 2023-03-24 北京百度网讯科技有限公司 Information query method and device, electronic equipment and storage medium
CN114330353B (en) * 2022-01-06 2023-06-13 腾讯科技(深圳)有限公司 Entity identification method, device, equipment, medium and program product of virtual scene
CN114186759A (en) * 2022-02-16 2022-03-15 杭州杰牌传动科技有限公司 Material scheduling control method and system based on reducer knowledge graph
CN114925158A (en) * 2022-03-15 2022-08-19 青岛海尔科技有限公司 Sentence text intention recognition method and device, storage medium and electronic device
CN114385833B (en) * 2022-03-23 2023-05-12 支付宝(杭州)信息技术有限公司 Method and device for updating knowledge graph
CN114896408B (en) * 2022-03-24 2024-04-19 北京大学深圳研究生院 Construction method of material knowledge graph, material knowledge graph and application
CN114942998B (en) * 2022-04-25 2024-02-13 西北工业大学 Knowledge graph neighborhood structure sparse entity alignment method integrating multi-source data
CN114912637B (en) * 2022-05-21 2023-08-29 重庆大学 Human-computer object knowledge graph manufacturing production line operation and maintenance decision method and system and storage medium
CN114861677B (en) * 2022-05-30 2023-04-18 北京百度网讯科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN114707005B (en) * 2022-06-02 2022-10-25 浙江建木智能系统有限公司 Knowledge graph construction method and system for ship equipment
CN115017255B (en) * 2022-08-08 2022-11-01 杭州实在智能科技有限公司 Knowledge base construction and search method based on tree structure
CN115050085B (en) * 2022-08-15 2022-11-01 珠海翔翼航空技术有限公司 Method, system and equipment for recognizing objects of analog machine management system based on map
CN115510245B (en) * 2022-10-14 2024-05-14 北京理工大学 Unstructured data-oriented domain knowledge extraction method
CN115544626B (en) * 2022-10-21 2023-10-20 清华大学 Sub-model extraction method, device, computer equipment and medium
CN115795051B (en) * 2022-12-02 2023-05-23 中科雨辰科技有限公司 Data processing system for acquiring link entity based on entity relationship
CN115796189B (en) * 2023-01-31 2023-05-12 北京面壁智能科技有限责任公司 Semantic determining method, semantic determining device, electronic equipment and medium
CN116070001B (en) * 2023-02-03 2023-12-19 深圳市艾莉诗科技有限公司 Information directional grabbing method and device based on Internet
CN115827935B (en) * 2023-02-09 2023-05-23 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment
CN116127053B (en) * 2023-02-14 2024-01-02 北京百度网讯科技有限公司 Entity word disambiguation, knowledge graph generation and knowledge recommendation methods and devices
CN115878849B (en) * 2023-02-27 2023-05-26 北京奇树有鱼文化传媒有限公司 Video tag association method and device and electronic equipment
CN116503865A (en) * 2023-05-29 2023-07-28 北京石油化工学院 Hydrogen road transportation risk identification method and device, electronic equipment and storage medium
CN116362166A (en) * 2023-05-29 2023-06-30 青岛泰睿思微电子有限公司 Pattern merging system and method for chip packaging
CN116663537B (en) * 2023-07-26 2023-11-03 中信联合云科技有限责任公司 Big data analysis-based method and system for processing selected question planning information
CN116719955B (en) * 2023-08-09 2023-10-27 北京国电通网络技术有限公司 Label labeling information generation method and device, electronic equipment and readable medium
CN116756151B (en) * 2023-08-17 2023-11-24 公安部信息通信中心 Knowledge searching and data processing system
CN116821712B (en) * 2023-08-25 2023-12-19 中电科大数据研究院有限公司 Semantic matching method and device for unstructured text and knowledge graph
CN117272170B (en) * 2023-09-20 2024-03-08 东旺智能科技(上海)有限公司 Knowledge graph-based IT operation and maintenance fault root cause analysis method
CN117012373B (en) * 2023-10-07 2024-02-23 广州市妇女儿童医疗中心 Training method, application method and system of grape embryo auxiliary inspection model
CN117349386B (en) * 2023-10-12 2024-04-12 吉玖(天津)技术有限责任公司 Digital humane application method based on data strength association model
CN117172323B (en) * 2023-11-02 2024-01-23 知呱呱(天津)大数据技术有限公司 Patent multi-domain knowledge extraction method and system based on feature alignment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7014288B2 (en) * 2018-03-07 2022-02-01 日本電気株式会社 Knowledge expansion systems, methods and programs
CN108733792B (en) * 2018-05-14 2020-12-01 北京大学深圳研究生院 Entity relation extraction method
CN110609902B (en) * 2018-05-28 2021-10-22 华为技术有限公司 Text processing method and device based on fusion knowledge graph
CN110362660B (en) * 2019-07-23 2023-06-09 重庆邮电大学 Electronic product quality automatic detection method based on knowledge graph

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328653A (en) * 2020-10-30 2021-02-05 北京百度网讯科技有限公司 Data identification method and device, electronic equipment and storage medium
CN112328653B (en) * 2020-10-30 2023-07-28 北京百度网讯科技有限公司 Data identification method, device, electronic equipment and storage medium
WO2021190653A1 (en) * 2020-10-31 2021-09-30 平安科技(深圳)有限公司 Semantic parsing device and method, terminal, and storage medium
CN112395429A (en) * 2020-12-02 2021-02-23 上海三稻智能科技有限公司 Method, system and storage medium for determining, pushing and applying HS (high speed coding) codes based on graph neural network
CN112528660B (en) * 2020-12-04 2023-10-24 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for processing text
CN112528660A (en) * 2020-12-04 2021-03-19 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for processing text
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112507126A (en) * 2020-12-07 2021-03-16 厦门渊亭信息科技有限公司 Entity linking device and method based on recurrent neural network
CN112464669B (en) * 2020-12-07 2024-02-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device, and storage medium
CN112507126B (en) * 2020-12-07 2022-11-15 厦门渊亭信息科技有限公司 Entity linking device and method based on recurrent neural network
CN112426726A (en) * 2020-12-09 2021-03-02 网易(杭州)网络有限公司 Game event extraction method, device, storage medium and server
CN112508615A (en) * 2020-12-10 2021-03-16 深圳市欢太科技有限公司 Feature extraction method, feature extraction device, storage medium, and electronic apparatus
CN112380359B (en) * 2021-01-18 2021-04-20 平安科技(深圳)有限公司 Knowledge graph-based training resource allocation method, device, equipment and medium
CN112380359A (en) * 2021-01-18 2021-02-19 平安科技(深圳)有限公司 Knowledge graph-based training resource allocation method, device, equipment and medium
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short
CN113111660A (en) * 2021-04-22 2021-07-13 脉景(杭州)健康管理有限公司 Data processing method, device, equipment and storage medium
CN113220835A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Text information processing method and device, electronic equipment and storage medium
CN113220835B (en) * 2021-05-08 2023-09-29 北京百度网讯科技有限公司 Text information processing method, device, electronic equipment and storage medium
CN113268452A (en) * 2021-05-25 2021-08-17 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113268452B (en) * 2021-05-25 2024-02-02 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113297419A (en) * 2021-06-23 2021-08-24 南京谦萃智能科技服务有限公司 Video knowledge point determining method and device, electronic equipment and storage medium
CN113505889A (en) * 2021-07-23 2021-10-15 中国平安人寿保险股份有限公司 Processing method and device of atlas knowledge base, computer equipment and storage medium
CN114237829A (en) * 2021-12-27 2022-03-25 南方电网物资有限公司 Data acquisition and processing method for power equipment
CN114186690A (en) * 2022-02-16 2022-03-15 中国空气动力研究与发展中心计算空气动力研究所 Aircraft knowledge graph construction method, device, equipment and storage medium
CN114780749A (en) * 2022-05-05 2022-07-22 国网江苏省电力有限公司营销服务中心 Electric power entity chain finger method based on graph attention machine mechanism
CN115062619B (en) * 2022-08-11 2022-11-22 中国人民解放军国防科技大学 Chinese entity linking method, device, equipment and storage medium
CN115062619A (en) * 2022-08-11 2022-09-16 中国人民解放军国防科技大学 Chinese entity linking method, device, equipment and storage medium
CN116826933A (en) * 2023-08-30 2023-09-29 深圳科力远数智能源技术有限公司 Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system
CN116826933B (en) * 2023-08-30 2023-12-01 深圳科力远数智能源技术有限公司 Knowledge-graph-based hybrid energy storage battery power supply backstepping control method and system
CN117668259A (en) * 2024-02-01 2024-03-08 华安证券股份有限公司 Knowledge-graph-based inside and outside data linkage analysis method and device
CN117668259B (en) * 2024-02-01 2024-04-26 华安证券股份有限公司 Knowledge-graph-based inside and outside data linkage analysis method and device

Also Published As

Publication number Publication date
WO2021212682A1 (en) 2021-10-28

Similar Documents

Publication Publication Date Title
CN111639498A (en) Knowledge extraction method and device, electronic equipment and storage medium
CN111428488A (en) Resume data information analyzing and matching method and device, electronic equipment and medium
CN111680168A (en) Text feature semantic extraction method and device, electronic equipment and storage medium
CN110688854A (en) Named entity recognition method, device and computer readable storage medium
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
CN113157927B (en) Text classification method, apparatus, electronic device and readable storage medium
CN110110213B (en) Method and device for mining user occupation, computer readable storage medium and terminal equipment
CN112100384B (en) Data viewpoint extraction method, device, equipment and storage medium
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN113378970A (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN111753089A (en) Topic clustering method and device, electronic equipment and storage medium
CN115238670B (en) Information text extraction method, device, equipment and storage medium
CN113268615A (en) Resource label generation method and device, electronic equipment and storage medium
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
CN113204698B (en) News subject term generation method, device, equipment and medium
CN112364068A (en) Course label generation method, device, equipment and medium
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN117290515A (en) Training method of text annotation model, method and device for generating text graph
CN116450829A (en) Medical text classification method, device, equipment and medium
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN115146064A (en) Intention recognition model optimization method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination