CN115982389A - Knowledge graph generation method, device and equipment - Google Patents

Knowledge graph generation method, device and equipment

Info

Publication number
CN115982389A
Authority
CN
China
Prior art keywords
entity
entities
attribute information
candidate attribute
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310246624.0A
Other languages
Chinese (zh)
Other versions
CN115982389B (en)
Inventor
王甫宁
代旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guohua Zhonglian Technology Co ltd
Original Assignee
Beijing Guohua Zhonglian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guohua Zhonglian Technology Co ltd filed Critical Beijing Guohua Zhonglian Technology Co ltd
Priority to CN202310246624.0A priority Critical patent/CN115982389B/en
Publication of CN115982389A publication Critical patent/CN115982389A/en
Application granted granted Critical
Publication of CN115982389B publication Critical patent/CN115982389B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a knowledge graph generation method, device, and equipment. The method includes: acquiring web text for the industry the knowledge graph targets, extracting sentences from the web text, and performing word segmentation and labeling on the sentences to obtain a plurality of segmented words and labeling information for each segmented word; identifying entities in the sentences from the segmented words based on the labeling information; identifying candidate attribute information of the entity from the segmented words, and identifying the attribute information of the entity from the candidate attribute information; and determining entity relationships among the entities and generating the knowledge graph based on the entities, the attribute information of the entities, and the entity relationships among the entities. In this method, candidate attribute information of the entity is first identified by coarse screening, and the entity's true attribute information is then selected from the candidates by fine screening. Extracting entity attribute information by combining coarse and fine screening improves both the accuracy of attribute extraction and the accuracy of the resulting knowledge graph.

Description

Knowledge graph generation method, device and equipment
Technical Field
The present disclosure relates to the field of knowledge graph construction technologies, and in particular, to a method, an apparatus, and a device for generating a knowledge graph.
Background
With the development of the internet, network data has grown explosively. The large scale, heterogeneity, and loose organization of internet content make it challenging to acquire information and knowledge effectively. The knowledge graph, with its strong semantic processing capability and open organization capability, lays a foundation for knowledge organization and intelligent applications in the internet era, and has therefore been widely adopted.
At present, in the process of constructing a knowledge graph, an entity's attributes and attribute values are usually extracted directly using preset grammar rules. However, as network data grows explosively, preset grammar rules struggle to adapt to changing data, so the extraction accuracy of entity attributes and attribute values is low, which in turn reduces the accuracy of the knowledge graph.
Disclosure of Invention
In view of this, the present disclosure provides a method, an apparatus, and a device for generating a knowledge graph, which can improve accuracy of extracting entity attribute information, and further improve accuracy of the knowledge graph.
According to a first aspect of the present disclosure, there is provided a knowledge-graph generating method, comprising:
acquiring web text for the industry the knowledge graph targets, extracting sentences from the web text, and performing word segmentation and labeling on the sentences to obtain a plurality of segmented words and labeling information for each segmented word;
identifying entities in the sentences from the segmented words based on the labeling information of each segmented word;
identifying candidate attribute information of the entity from the segmented words, and identifying the attribute information of the entity from the candidate attribute information;
determining entity relationships among the entities, and generating the knowledge graph based on the entities, the attribute information of the entities, and the entity relationships among the entities.
In one possible implementation, when acquiring the web text for the industry, the method further includes performing data cleaning on the web text;
the data cleaning includes at least one of: screening web texts that closely match the industry, removing stop words from the web text, and normalizing words in the web text.
In one possible implementation, identifying the entities in the sentence from the segmented words based on their labeling information is implemented with a pre-constructed entity extraction model.
In one possible implementation, the candidate attribute information includes at least one of a candidate attribute and a candidate attribute value.
In one possible implementation, identifying candidate attribute information of the entity from the segmented words is implemented with pre-configured candidate attribute information extraction rules.
In one possible implementation, identifying the attribute information of the entity from the candidate attribute information is implemented with a pre-constructed attribute information extraction model.
In one possible implementation, the entity relationship between the entities includes: at least one of a type of relationship between entities, a strength of relationship between entities, and a preference of relationship between entities.
In one possible implementation, the method further includes:
acquiring entry files of the industry, and screening industry hot words from the entry files;
and expanding the description of the entities in the knowledge graph based on the industry hot words.
According to a second aspect of the present disclosure, there is provided a knowledge-graph generating apparatus comprising:
the data acquisition module is used for acquiring web text for the industry the knowledge graph targets, extracting sentences from the web text, and performing word segmentation and labeling on the sentences to obtain a plurality of segmented words and labeling information for each segmented word;
the entity identification module is used for identifying the entities in the sentences from the segmented words based on the labeling information of each segmented word;
the attribute information identification module is used for identifying candidate attribute information of the entity from the segmented words and identifying the attribute information of the entity from the candidate attribute information;
and the graph construction module is used for determining entity relationships among the entities and generating the knowledge graph based on the entities, the attribute information of the entities, and the entity relationships among the entities.
According to a third aspect of the present disclosure, there is provided a knowledge-graph generating apparatus comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the method of the first aspect of the present disclosure.
The invention provides a knowledge graph generation method that includes: acquiring web text for the industry the knowledge graph targets, extracting sentences from the web text, and performing word segmentation and labeling on the sentences to obtain a plurality of segmented words and labeling information for each segmented word; identifying entities in the sentences from the segmented words based on the labeling information; identifying candidate attribute information of the entity from the segmented words, and identifying the attribute information of the entity from the candidate attribute information; and determining entity relationships among the entities and generating the knowledge graph based on the entities, the attribute information of the entities, and the entity relationships among the entities. In this method, candidate attribute information of the entity is first identified by coarse screening, and the entity's true attribute information is then selected from the candidates by fine screening; extracting entity attribute information by combining coarse and fine screening improves both the accuracy of attribute extraction and the accuracy of the knowledge graph.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow diagram of a knowledge-graph generation method according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic structural diagram of a third neural network model, according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a first syntactic dependency tree structure sample, according to an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of a second syntactic dependency tree structure sample, in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of a third syntactic dependency tree structure sample, in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a syntactic dependency tree diagram, according to an embodiment of the present disclosure;
FIG. 7 illustrates a syntactic dependency tree diagram according to another embodiment of the present disclosure;
FIG. 8 illustrates a syntactic dependency tree diagram according to yet another embodiment of the present disclosure;
FIG. 9 shows a schematic block diagram of a knowledge-graph generating apparatus according to an embodiment of the present disclosure;
FIG. 10 shows a schematic block diagram of a knowledge-graph generating apparatus according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.
< method example >
FIG. 1 shows a flow diagram of a knowledge graph generation method according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include steps S1100 to S1400.
S1100, acquiring web text for the industry the knowledge graph targets, extracting sentences from the web text, and performing word segmentation and labeling on the sentences to obtain a plurality of segmented words and labeling information for each segmented word.
When an industry knowledge graph is constructed, web texts related to the industry are first crawled in a targeted manner using web crawler technology. For example, when constructing a knowledge graph for the disability-assistance industry, policy documents related to disabled persons can be crawled from the network as the web texts for constructing the graph.
To improve the quality of the acquired web text, in one possible implementation, after the web text for constructing the knowledge graph is acquired, the method further includes performing data cleaning on it. The data cleaning may include at least one of: screening web texts that closely match the industry, removing stop words from the web text, and normalizing words in the web text.
In one possible implementation, web texts that closely match the industry are screened based on a pre-constructed industry dictionary and text screening rules. Specifically, the industry dictionary records the professional terms of the industry, and the text screening rules set a word frequency threshold and a screening rule for each professional term; the thresholds of different terms may be the same or different and are not specifically limited here. After a web text is obtained, the professional terms it contains are determined from the industry dictionary and their word frequencies are calculated; web texts that closely match the industry are then screened out according to each term's word frequency, its threshold, and the corresponding screening rule.
It should be noted that the word frequency threshold of each professional term may be determined from the average of that term's word frequencies across web texts. For example, for a professional term A, N industry-related web texts are obtained, the word frequency of A in each of the N texts is calculated, and the average of those N word frequencies is used as the word frequency threshold for A.
In one possible implementation, the word frequency of each professional term in a web text can be calculated with a TF-IDF model. For example, inputting term A and web text B into the TF-IDF model yields the word frequency of A in B.
For example, suppose the disability-assistance industry dictionary includes the professional terms "disabled person", "disability insurance", and "warm family", the word frequency threshold for each of the three terms is 5%, and the screening rule requires that all three terms appear together with word frequencies above their thresholds. A web text is then screened out as closely matching the industry only when it contains all three terms and the word frequency of each term in the text exceeds its 5% threshold.
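The screening rule described above can be sketched in plain Python. This is a toy illustration only: the token list, the term names, and the 5% thresholds are hypothetical stand-ins, and a production system would compute frequencies with a real TF-IDF model over properly segmented Chinese text.

```python
def word_frequency(term, tokens):
    """Term frequency: occurrences of `term` divided by the total token count."""
    return tokens.count(term) / len(tokens) if tokens else 0.0

def matches_industry(tokens, thresholds):
    """Screening rule: every professional term must appear with a
    word frequency strictly above its threshold."""
    return all(word_frequency(term, tokens) > th for term, th in thresholds.items())

# Hypothetical thresholds for three disability-assistance terms (5% each).
thresholds = {
    "disabled person": 0.05,
    "disability insurance": 0.05,
    "warm family": 0.05,
}
```

A text in which each of the three terms accounts for more than 5% of the tokens passes the screen; a text missing any term, or containing it too rarely, is filtered out.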
In one possible implementation, removing stop words from the web text is implemented with a pre-constructed stop-word dictionary. Specifically, the dictionary records stop words; after the web text is obtained, the dictionary is used to determine whether stop words appear in the text, and any that do are deleted.
In one possible implementation, the normalization of words in the web text is performed based on a pre-constructed industry dictionary. Specifically, the web text is segmented using the industry dictionary, a first similarity between each segmented word and the professional terms in the dictionary is calculated, and when the first similarity exceeds a set first similarity threshold, the segmented word is replaced with the corresponding professional term, normalizing the web text.
In one possible implementation, calculating the first similarity between a segmented word and a professional term in the industry dictionary may include the following steps: first, compute word vectors for the segmented word and the professional term; then compute the cosine distance between the two word vectors and use it as the first similarity of the segmented word and the professional term.
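The first-similarity computation can be sketched as follows. This is a minimal sketch: real word vectors would come from a trained embedding model, and "cosine distance" is taken here as the cosine of the angle between the two vectors, matching its use in the text as a similarity score.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate vector: treat as no similarity
    return dot / (norm_u * norm_v)

def first_similarity(word_vec, term_vec):
    """First similarity between a segmented word and a professional term."""
    return cosine_similarity(word_vec, term_vec)
```

A segmented word whose first similarity with some professional term exceeds the first similarity threshold would then be replaced by that term.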
In one possible implementation, determining the first similarity threshold may include the following steps: first, select a set number of professional terms from the industry dictionary and assign each a corresponding similar term; then calculate the first similarity of each professional term with its similar term; finally, use the average of these first similarities as the first similarity threshold.
In one possible implementation, after the web text data is obtained, the texts that closely match the industry can be screened out first, and stop-word removal and word normalization are then performed in turn on the screened texts to obtain the web texts from which the knowledge graph is finally constructed. These data cleaning operations improve the quality of the acquired web text.
After data cleaning is complete, sentences are extracted one by one from each screened web text, and for each extracted sentence the entities, their attribute information, and the entity relationships among them are extracted. Since every sentence in every web text is processed the same way, the extraction of entities, attribute information, and entity relationships is described below using a single sentence as an example.
In one possible implementation, a natural language processing tool may be used to segment and label each extracted sentence, yielding a segmentation result that contains a plurality of segmented words together with labeling information for each of them. The labeling information may include at least one of: segmented word length, prefix and suffix, position, part of speech, syntactic dependency, and punctuation.
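In practice, Chinese segmentation and part-of-speech labeling are typically done with an off-the-shelf NLP tool; the shape of the output those tools produce can be sketched with a toy lexicon-driven tagger. The lexicon, the tag names, and the whitespace tokenization below are all hypothetical simplifications.

```python
# Toy POS lexicon standing in for a real NLP tool's trained model.
LEXICON = {
    "z": "noun", "has": "verb", "4000": "quantifier",
    "units": "quantifier", "of": "particle", "equipment": "noun",
}

def segment_and_label(sentence):
    """Return (segmented word, part-of-speech label) pairs -- the structure
    consumed by the later entity and attribute extraction steps."""
    return [(w, LEXICON.get(w, "unknown")) for w in sentence.split()]
```

The downstream steps (entity extraction, candidate attribute screening) operate on exactly this list-of-labeled-words structure.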
S1200, identifying entities in the sentence from the segmented words based on the labeling information of each segmented word.
In one possible implementation, entities in the sentence can be extracted with a pre-constructed entity extraction model. In this implementation, an entity extraction training data set is constructed first; the selected first neural network model is then trained on that data set to obtain the entity extraction model; finally, the segmentation result and the labeling information of each segmented word are input into the model, which identifies the entities in the sentence from the segmented words according to the labeling information.
In one possible implementation, constructing the entity extraction training data may include the following steps. First, select a set number of cleaned web texts as training texts. Second, extract the sentences of each training text one by one. Third, segment and label each extracted sentence with a natural language processing tool to obtain its segmentation result and the labeling information of each segmented word; mark the entities in the segmentation result by manual annotation; and take the sentence's segmentation result, the labeling information of each segmented word, and the marked entities as one piece of entity extraction training data, where the segmentation result and labeling information serve as the input data and the marked entities serve as the output data. Finally, take all pieces of entity extraction training data obtained in this way as the entity extraction training data set.
With the entity extraction training data set constructed, the selected first neural network model is trained with the input data of each piece of training data (the segmentation result and the labeling information of each segmented word) as input and the output data (the marked entities) as output, yielding an entity extraction model that extracts entities from a segmentation result and its labeling information.
In one possible implementation, the selected first neural network model may be constructed based on a long short-term memory (LSTM) network and a conditional random field (CRF).
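In an LSTM-CRF tagger, the CRF layer picks the best label sequence over the LSTM's per-token scores via Viterbi decoding. A minimal pure-Python sketch of that decoding step follows; the emission scores, transition scores, and the B/I/O tag set in the test are hypothetical, since in a real model the emissions come from the trained LSTM.

```python
def viterbi(emissions, transitions, tags):
    """emissions: list of {tag: score} dicts, one per token.
    transitions: {(prev_tag, cur_tag): score}.
    Returns the highest-scoring tag sequence (max-sum Viterbi)."""
    # Initialize each path with the first token's emission score.
    paths = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        new_paths = {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: paths[p][0] + transitions[(p, cur)])
            score = paths[prev][0] + transitions[(prev, cur)] + em[cur]
            new_paths[cur] = (score, paths[prev][1] + [cur])
        paths = new_paths
    best = max(tags, key=lambda t: paths[t][0])
    return paths[best][1]
```

The transition scores are what let the CRF forbid invalid tag sequences (for example, an "I" tag directly after "O" in BIO entity tagging), which a per-token softmax alone cannot do.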
S1300, identifying candidate attribute information of the entity from the segmented words, and identifying the attribute information of the entity from the candidate attribute information. The candidate attribute information may include at least one of a candidate attribute and an attribute value of the candidate attribute.
In one possible implementation, identifying candidate attribute information of an entity from the segmented words can be implemented with pre-constructed attribute information extraction rules.
In an implementation in which the candidate attribute information includes candidate attributes of the entity, the attribute information extraction rules may include at least one of a first, a second, and a third candidate attribute extraction rule.
In one possible implementation, the first candidate attribute extraction rule may be set based on the parts of speech of the segmented words adjacent to the entity: if the segmentation result contains a first structure of quantifier segmented word / noun segmented word (that is, a noun segmented word immediately follows a quantifier segmented word) and this structure is adjacent to the entity, the noun after the quantifier is screened out as a candidate attribute of the entity. For example, in the sentence "z has 4000 units of various types of equipment", "z" is an entity, and the quantifier "4000 units" followed by the noun "equipment" forms a first structure adjacent to "z"; the noun "equipment" after the quantifier "4000 units" is therefore screened out as a candidate attribute of the entity "z".
In one possible implementation, the second candidate attribute extraction rule may likewise be set based on the parts of speech of the segmented words adjacent to the entity: if the segmentation result contains a second structure of adjective segmented word / noun segmented word (that is, a noun segmented word immediately follows an adjective segmented word) and this structure is adjacent to the entity, the noun after the adjective is screened out as a candidate attribute of the entity. For example, in the sentence "z has a strong learning ability", "z" is an entity, and the adjective "strong" followed by the noun "learning ability" forms a second structure; the noun "learning ability" after the adjective "strong" is therefore screened out as a candidate attribute of the entity "z".
In one possible implementation, the third candidate attribute extraction rule may be: calculate a second similarity between each noun segmented word in the segmentation result and a preset attribute word of the entity, and screen out the noun as a candidate attribute of the entity when the second similarity exceeds a preset second similarity threshold. The word vector distance between each noun and the preset attribute word can be calculated and used as the second similarity. The second similarity threshold can be set in the same way as the first similarity threshold, which is not repeated here.
It should be noted that corresponding preset attribute words are configured for the entities that may appear in the industry. After the segmentation result of a sentence is obtained, the preset attribute words corresponding to the entity can therefore be looked up first, the second similarity between each noun segmented word and those preset attribute words calculated, and the noun screened out as a candidate attribute of the entity when the second similarity exceeds the preset second similarity threshold.
In an implementation that uses the first, second, and third candidate attribute extraction rules together, a segmented word is screened out as a candidate attribute of the entity whenever it satisfies any one of the rules. Specifically, when identifying candidate attributes from the segmented words: nouns following a quantifier adjacent to the entity and nouns following an adjective adjacent to the entity are screened out as candidate attributes; the second similarity of each segmented word with the entity's preset attribute words is then calculated, and a segmented word is screened out as a candidate attribute when that similarity exceeds the preset second similarity threshold. In other words, any segmented word satisfying any one extraction rule is taken as a candidate attribute, so the coarse screening performed by the three rules together widens the screening range and avoids losing candidate attributes.
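The first two candidate-attribute rules can be sketched as a pattern match over (segmented word, part-of-speech) pairs. This is a simplified sketch: it scans the words that follow the entity for a quantifier/noun or adjective/noun structure, and the tag names are hypothetical stand-ins for a real tagger's tag set.

```python
def candidate_attributes(tagged, entity):
    """tagged: list of (word, pos) pairs for one sentence.
    Rule 1: a quantifier/noun structure after the entity -> take the noun.
    Rule 2: an adjective/noun structure after the entity -> take the noun."""
    candidates = []
    entity_positions = [i for i, (w, _) in enumerate(tagged) if w == entity]
    for i in entity_positions:
        # Scan adjacent pairs following the entity for the two structures.
        for j in range(i + 1, len(tagged) - 1):
            pos_a = tagged[j][1]
            word_b, pos_b = tagged[j + 1]
            if pos_a in ("quantifier", "adjective") and pos_b == "noun":
                candidates.append(word_b)
    return candidates
```

The third rule (similarity to preset attribute words) would be applied to the same tagged list as an additional, independent screen; a word passing any of the rules becomes a candidate attribute.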
To improve the efficiency of candidate attribute extraction, in one possible implementation, identifying the candidate attributes of an entity further includes a step of filtering out verb segmented words. Specifically, verbs are filtered out of the segmentation result to obtain the candidate segmented words, and the candidate attributes of the entity are then extracted from those candidates according to the preset candidate attribute extraction rules.
In an implementation in which the candidate attribute information includes candidate attribute values of the entity's candidate attributes, once the candidate attributes have been extracted, the method further includes extracting candidate attribute values of those candidate attributes from the segmentation result based on preset candidate attribute value extraction rules.
In this implementation, the candidate attribute information extraction rules further include candidate attribute value extraction rules, which may be set based on the parts of speech of the segmented words adjacent to the candidate attribute and may include a first and a second candidate attribute value extraction rule.
The first candidate attribute value extraction rule may be: and extracting quantifier participles adjacent to the candidate attribute in the participle result to serve as candidate attribute values of the candidate attribute. Specifically, after the candidate attribute of the entity is extracted, the position of the candidate attribute in the word segmentation result is determined, and then the quantifier word segmentation adjacent to the candidate attribute is extracted as the candidate attribute value of the candidate attribute. For example, in the sentence "z has 4000 pieces of equipment," z "is an entity, and" equipment "is a candidate attribute of the entity, a quantifier participle" 4000 pieces of equipment "adjacent to the candidate attribute" equipment "is screened out as a candidate attribute value of" equipment ".
The second candidate attribute value extraction rule may be: extract the adjective segment adjacent to the candidate attribute in the word segmentation result as the candidate attribute value of the candidate attribute. Specifically, after a candidate attribute of the entity is extracted, the position of the candidate attribute in the word segmentation result is determined, and the adjacent adjective segment is extracted as its candidate attribute value. For example, in the sentence "z has strong learning ability", "z" is an entity and "learning ability" is a candidate attribute of the entity; the adjective segment "strong", adjacent to the candidate attribute "learning ability", is extracted as the candidate attribute value of "learning ability".
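The two adjacency rules above can be sketched as follows. The POS codes "m" (numeral), "q" (quantifier), and "a" (adjective), and the list-of-pairs token format, are illustrative assumptions:

```python
def extract_attribute_value(tagged_segments, attr_index):
    """Apply the two candidate attribute value extraction rules to the
    candidate attribute at attr_index: first collect an adjacent run of
    numeral/quantifier segments (first rule), then fall back to an
    adjacent adjective (second rule)."""
    # First rule: numeral/quantifier segments immediately left of the attribute.
    parts, i = [], attr_index - 1
    while i >= 0 and tagged_segments[i][1] in {"m", "q"}:
        parts.insert(0, tagged_segments[i][0])
        i -= 1
    if parts:
        return " ".join(parts)
    # Second rule: an adjective adjacent to the attribute (either side).
    for j in (attr_index - 1, attr_index + 1):
        if 0 <= j < len(tagged_segments) and tagged_segments[j][1] == "a":
            return tagged_segments[j][0]
    return None
```

On the two example sentences, the attribute "equipment" yields "4000 pieces" and the attribute "learning ability" yields "strong".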
After the candidate attribute information of the entity is identified from the segmented words, a step of identifying the real attribute information of the entity from the candidate attribute information is performed.
In a possible implementation manner, when the real attribute information of the entity is identified from the candidate attribute information, the identification may be implemented based on a pre-constructed attribute information extraction model.
In this implementation, an attribute information extraction training data set needs to be constructed first; the selected second neural network model is then trained on this data set to obtain an attribute information extraction model; finally, the entity and its candidate attribute information are input into the attribute information extraction model, which identifies the real attribute information of the entity from the candidate attribute information.
In one possible implementation, constructing the attribute information extraction training data set may include the following steps. First, an entity extraction training data set is obtained. Second, for each piece of training data in the entity extraction training data set, the candidate attribute information extraction rules above are used to extract the candidate attribute information of the entity from the segmented words of the word segmentation result; the real attribute information of the entity is then marked out of the candidate attribute information by manual labeling; and the entity, the candidate attribute information of the entity, and the real attribute information of the entity together form one piece of attribute information extraction training data, where the entity and its candidate attribute information serve as the input data and the real attribute information serves as the output data. Finally, all obtained pieces of attribute information extraction training data form the attribute information extraction training data set.
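A minimal sketch of assembling one such training record (the dictionary layout and field names are illustrative assumptions, not specified by the disclosure):

```python
def make_training_record(entity, candidate_info, labeled_real_info):
    """Assemble one piece of attribute information extraction training data:
    input = the entity and its candidate attribute information,
    output = the manually labeled real attribute information."""
    real = set(labeled_real_info)
    return {
        "input": {"entity": entity, "candidates": list(candidate_info)},
        "output": [c for c in candidate_info if c in real],
    }
```

For example, with candidates `("equipment", "4000 pieces")` and `("topic", None)` and only the first labeled real, the record's output keeps only the labeled pair.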
With the attribute information extraction training data set constructed, the input data of each piece of training data (i.e., the entity and its candidate attribute information) is used as input and the output data (i.e., the real attribute information of the entity) as output to train the selected second neural network model, yielding an attribute information extraction model that can extract the real attribute information of an entity from the entity and its candidate attribute information. The training of the second neural network model may be implemented based on a multilayer perceptron algorithm.
It should be noted here that when extracting the attribute information of an entity, all possible candidate attribute information of the entity is first screened out according to the preset candidate attribute information extraction rules, which widens the extraction range and avoids losing possible attribute information. The candidate attribute information is then screened precisely to obtain the real attribute information. That is, in this implementation, accurate extraction of entity attribute information is achieved by combining a coarse screening step with a fine screening step.
S1400, determining entity relationships among the entities, and generating a knowledge graph based on the entities, the attribute information of the entities and the entity relationships among the entities.
In one possible implementation, the entity relationship between entities may include at least one of a relationship type, a relationship strength, and a relationship preference between the entities. The relationship type between entities may include at least one of containment, peer-to-peer, and belonging. The relationship strength between entities may range from 1 to 100; the larger the value, the closer the relationship between the two entities, and the shorter the connecting line between the two corresponding entities in the knowledge graph. The relationship preference between entities ranges from -1 to 1, covering positive (labeled +1) and negative (labeled -1). The closer the preference value between two entities is to 1, the more positive their relationship, and the connecting line of the two corresponding entities in the knowledge graph is a first color (for example, green); the closer the preference value is to -1, the more negative their relationship, and the connecting line is a second color (for example, red).
In one possible implementation, the relationship type between entities may be determined based on a syntactic dependency tree. This specifically includes the following steps:
First, a syntactic dependency tree of the sentence is generated.
In one possible implementation, generating the syntactic dependency tree of a sentence may be implemented based on a pre-constructed entity relationship extraction model. In this implementation, entity relationship extraction training data needs to be constructed first; the selected third neural network model is then trained on this data to obtain an entity relationship extraction model; next, the sentence is input into the entity relationship extraction model, which outputs a hidden vector of the sentence; finally, the syntactic dependency tree of the sentence is generated from the hidden vector. The hidden vector of the sentence characterizes its syntactic dependencies.
In one possible implementation, constructing the entity relationship extraction training data may include the following steps. First, training texts are obtained (see step S1200). Second, for each training text, the sentences in the text are extracted one by one, the hidden vector of each sentence is marked by manual labeling, and each sentence together with its hidden vector forms one piece of entity relationship extraction training data, where the sentence is the input data and its hidden vector is the output data. Finally, the training data constructed from the sentences of all training texts form the entity relationship extraction training data set.
With the entity relationship extraction training data constructed, the input data of each piece of training data (i.e., the sentence) is used as input and the output data (i.e., the hidden vector of the sentence) as output to train the selected third neural network model, yielding an entity relationship extraction model that can compute the hidden vector of a sentence.
In one possible implementation, the structure of the third neural network model may be as shown in fig. 2. Specifically, the third neural network model includes an input layer, an encoding network Net1, a bias layer, a decoding network Net2, and an output layer, connected in sequence. The input layer computes word vectors (W1-W12) for each character of the input sentence and feeds them into the encoding network Net1. The encoding network Net1 consists of three long short-term memory networks connected in sequence (LSTM1, LSTM2, and LSTM3); it encodes the word vector of each character and passes the resulting vectors to the bias layer. The bias layer provides bias vectors with which the vectors output by the encoding network are biased, strengthening the influence of part of speech on the final output-layer data and ensuring the robustness of the output on low-quality text sentences. The bias vector is set according to the part-of-speech type; for example, the bias vector of a noun may be set to 1, that of an adjective to 1.5, that of a verb to 1.8, that of a modal particle to 0.2, and so on. The decoding network Net2 consists of eight back-propagation network layers connected in sequence and decodes the data output by the bias layer; s1-s12 are output-layer network nodes fully connected to the output of the decoding network Net2, and network training finally yields the hidden vector Y of the sentence.
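The bias layer can be sketched in isolation as follows. The bias values follow the examples in the text (noun 1, adjective 1.5, verb 1.8, modal particle 0.2); the tag codes, the default weight of 1.0 for other parts of speech, and the list-of-vectors data shape are illustrative assumptions:

```python
# Sketch of the bias layer: each encoded character vector is scaled by a
# bias value chosen according to the character's part of speech.
POS_BIAS = {"n": 1.0, "a": 1.5, "v": 1.8, "y": 0.2}

def bias_layer(encoded, pos_tags):
    """encoded: list of vectors (one per character) output by Net1.
    pos_tags: one POS code per character position."""
    return [
        [POS_BIAS.get(tag, 1.0) * x for x in vec]
        for vec, tag in zip(encoded, pos_tags)
    ]
```

Scaling verb positions up and modal-particle positions down is what lets the part of speech dominate the data reaching the decoding network.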
In the hidden vector Y, R11 is the hidden vector of the first character of entity R1, R12 of the second character of entity R1, N11 of the first character of noun N1, N12 of the second character of noun N1, D11 of the first character of verb D1, D12 of the second character of verb D1, R21 of the first character of entity R2, R22 of the second character of entity R2, N21 of the first character of noun N2, N22 of the second character of noun N2; L1 is the hidden vector of the punctuation mark, and &lt;EOS&gt; is the hidden vector of the sentence-end symbol.
After the hidden vector of the sentence is generated, the syntactic dependency tree of the sentence can be generated from it according to a probabilistic graphical model.
Second, the entity relationship type between entities is determined based on the syntactic dependency tree. Specifically, the tree nodes where the entities are located are determined in the syntactic dependency tree; the dependency relationships involving the entities are marked in the syntactic dependency tree by a CRF algorithm; the syntactic dependency tree structure where the entities are located is then extracted from the syntactic dependency tree, and the relationship type between the two entities is determined using a pre-configured entity relationship type matching template. The relationship type between entities may include at least one of containment, peer-to-peer, and belonging.
It should be noted here that the entity relationship type matching template records syntactic dependency tree structure samples corresponding to the various entity relationship types. After the syntactic dependency tree structure where the two entities are located is extracted from the syntactic dependency tree, the entity relationship type matching template can be searched for a syntactic dependency tree structure sample identical to that structure; if one exists, the entity relationship type to which the matching sample belongs is directly used as the relationship type between the two entities.
In one possible implementation, the entity relationship type matching template includes a first syntactic dependency tree structure sample reflecting a containment relationship between entities, a second syntactic dependency tree structure sample reflecting a belonging relationship between entities, and a third syntactic dependency tree structure sample reflecting a peer relationship between entities. The first syntactic dependency tree structure sample may be as shown in fig. 3: under the ROOT node, R1 is a first entity, D1 is a first proprietary verb reflecting a containment relationship (for example, a word such as "has" or "contains"), m is a numeral, g is a quantifier, and R2 is a second entity. The second syntactic dependency tree structure sample may be as shown in fig. 4: under the ROOT node, R1 is a first entity, D2 is a second proprietary verb reflecting a belonging relationship (for example, a word such as "belongs to" or "is affiliated with"), m is a numeral, g is a quantifier, and R2 is a second entity. The third syntactic dependency tree structure sample may be as shown in fig. 5: under the ROOT node, R1 is a first entity used to modify a first noun N1, R2 is a second entity used to modify a second noun N2, and D3 is a third proprietary verb reflecting a peer relationship (for example, a word such as "confronts").
For example, for sentence A, the hidden vector of the sentence computed by the entity relationship extraction model is "R1/N1/D3/R2/N2/L1", where R1 is a first entity, N1 is a first noun, D3 is a third proprietary verb, R2 is a second entity, N2 is a second noun, and L1 is a first punctuation mark. The syntactic dependency tree generated from the hidden vector "R1/N1/D3/R2/N2/L1" is shown in fig. 6. The CRF algorithm is used to mark in the syntactic dependency tree that the dependency relationship between the first entity R1 and the first noun N1 is modification, and that the dependency relationship between the second entity R2 and the second noun N2 is modification; the marking result may be as shown in fig. 7. The syntactic dependency tree structure where the entity R1 and the entity R2 are located is extracted from the syntactic dependency tree, as shown in fig. 8. This structure is matched, by graph matching, against the syntactic dependency tree structure samples recorded in the entity relationship type matching template; the third syntactic dependency tree structure sample matches, so the peer relationship it reflects can be used as the entity relationship type between the first entity R1 and the second entity R2.
In one possible implementation, the relationship strength between entities may be determined based on the word frequencies of the entities. For example, for an entity A and an entity B, the word frequency of entity A in the network text, the word frequency of entity B in the network text, and the joint word frequency of entities A and B are calculated; the product of the two individual word frequencies is then divided by the joint word frequency of entities A and B to serve as the relationship strength between entity A and entity B. The joint word frequency may be determined from the joint probability of entity A and entity B occurring in the same sentence.
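Following the formula as literally stated above, the computation can be sketched as follows. Clamping the result into the 1-100 range stated earlier, and treating zero co-occurrence as the weakest strength, are added assumptions:

```python
def relationship_strength(freq_a, freq_b, joint_freq):
    """Relationship strength between entity A and entity B per the text:
    (word frequency of A * word frequency of B) / joint word frequency,
    clamped to the stated 1-100 range (clamping is an assumption)."""
    if joint_freq == 0:
        return 1  # no co-occurrence: weakest strength (assumption)
    return max(1, min(100, round((freq_a * freq_b) / joint_freq)))
```

For example, frequencies 10 and 20 with a joint frequency of 4 give a strength of 50.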
Through the calculation of the entity relationship, the knowledge graph not only can reflect the relationship type between two entities, but also can reflect the relationship strength between the entities, thereby enriching the information content of the knowledge graph.
In one possible implementation, the relationship preference between entities may be determined based on the adjective segment between the entities, specifically on whether that adjective is commendatory or derogatory. For example, when the adjective between two entities is a commendatory word, the relationship preference between them is positive and may be represented by +1; when the adjective is a derogatory word, the relationship preference is negative and may be represented by -1.
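A minimal sketch of this lookup; the toy sentiment lexicon below is an assumption, since the text only states that commendatory adjectives map to +1 and derogatory ones to -1:

```python
# Toy commendatory/derogatory lexicon (illustrative assumption).
COMMENDATORY = {"strong", "excellent", "friendly"}
DEROGATORY = {"weak", "poor", "hostile"}

def relation_preference(adjective):
    """Relation preference from the adjective between two entities:
    +1 commendatory, -1 derogatory."""
    if adjective in COMMENDATORY:
        return 1
    if adjective in DEROGATORY:
        return -1
    return 0  # unknown adjective: neutral (assumption)
```

In practice a real sentiment lexicon or classifier would replace the hard-coded sets.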
Through the calculation of the relation preference, the knowledge graph not only can reflect the relation type and the relation strength between two entities, but also can reflect the relation preference between the entities, thereby enriching the information content of the knowledge graph.
In an implementation in which the entity relationship includes an entity relationship type, an entity relationship strength, and an entity relationship preference, generating the knowledge graph based on the entities, the attribute information of the entities, and the entity relationships between the entities may include the following steps: the entities are used as nodes in the knowledge graph and the attribute information of the entities as the attribute information of the nodes; the relationship description between two entities is determined from their relationship type, the length of the connecting line between them from their relationship strength, and the color of the connecting line from their relationship preference.
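The mapping from one entity relationship onto the rendering attributes above can be sketched as follows. The concrete length formula and the color names are illustrative assumptions (the text only requires that stronger relations give shorter lines and positive/negative preferences give different colors):

```python
def build_edge(entity_a, entity_b, rel_type, strength, preference):
    """Map one entity relationship onto rendering attributes:
    relation type -> edge label, relation strength (1-100) -> line length
    (stronger relation, shorter line), relation preference -> line color."""
    return {
        "nodes": (entity_a, entity_b),
        "label": rel_type,
        "length": 101 - strength,  # strength 100 -> shortest line (assumption)
        "color": "green" if preference > 0 else "red",
    }
```

A graph-drawing library would then consume these per-edge attributes when laying out the knowledge graph.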
In one possible implementation, when the network text is obtained, entry files of the industry are also obtained, and industry hot words are screened out of the entry files; the entities are then given expanded descriptions based on the industry hot words. Specifically, the entry files related to the industry can be crawled from Wikipedia or an encyclopedia, and the entry names in the industry entry files are used as industry hot words. After the knowledge graph has been constructed, each industry hot word is compared with the entities in the knowledge graph to judge whether the hot word is identical to an entity. If an industry hot word is identical to an entity, the hot word is merged with the entity. If it is not identical, the third similarity between the hot word and the entity is computed; if the third similarity is larger than a set third similarity threshold, the hot word is displayed near the entity so as to expand the description of the entity through the hot word. If the industry hot word is not identical to any entity and its third similarity is smaller than the set third similarity threshold, the hot word is deleted. In this implementation, the currently popular expressions for an entity can be added as expanded descriptions, further enriching the information content of the knowledge graph.
It should be noted that the third similarity may be calculated in the same way as the first similarity, which is not repeated here. The third similarity threshold may be obtained by first computing an initial threshold with the method used for the first similarity threshold and then multiplying that initial threshold by 1.5.
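The three-way decision above can be sketched as a small dispatcher; the similarity computation itself is out of scope here, and the return labels are illustrative assumptions:

```python
def reconcile_hot_word(hot_word, entity, similarity, threshold):
    """Decide what to do with one industry hot word relative to an entity:
    merge on an exact match; display the hot word near the entity (as an
    expanded description) when the third similarity exceeds the threshold;
    otherwise delete the hot word."""
    if hot_word == entity:
        return "merge"
    if similarity > threshold:
        return "annotate"  # expand the entity's description
    return "discard"
```

Each hot word would be checked against every entity in the constructed graph, with the threshold obtained as described above.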
The knowledge-graph generation method in the present disclosure may include the following steps: acquiring network text of the industry where the knowledge graph is located, extracting sentences from the network text, and performing word segmentation and labeling processing on the sentences to obtain a plurality of segmented words and the labeling information of each segmented word; identifying the entities in the sentences from the segmented words based on the labeling information of the segmented words; identifying candidate attributes and candidate attribute values of the entities from the segmented words, and screening the attributes and attribute values of the entities out of the candidate attributes and candidate attribute values; generating syntactic dependency trees of the sentences and determining the entity relationships between the entities based on them; and generating the knowledge graph based on the entities and the attributes, attribute values, and entity relationships of the entities. In this method, the attributes and attribute values of the entities are extracted by combining a coarse screening step (identifying candidates from the segmented words) with a fine screening step (screening the real attributes and attribute values out of the candidates), which improves the accuracy of attribute and attribute value extraction and thus the accuracy of the constructed knowledge graph.
< apparatus embodiment >
FIG. 9 shows a schematic block diagram of a knowledge-graph generating apparatus according to an embodiment of the present disclosure. As shown in fig. 9, the knowledge-graph generating apparatus 100 includes:
the data acquisition module 110 is configured to acquire a web text of an industry where the knowledge graph is located, extract a sentence from the web text, and perform word segmentation and labeling processing on the sentence to obtain a plurality of segmented words and labeling information of each segmented word;
an entity identification module 120, configured to identify an entity in a sentence from each participle based on the labeling information of each participle;
an attribute information identification module 130, configured to identify candidate attribute information of the entity from each segmented word, and identify attribute information of the entity from the candidate attribute information;
the map building module 140 is configured to determine entity relationships between entities, and generate a knowledge map based on the entities and attributes, attribute values, and entity relationships of the entities.
< equipment embodiment >
FIG. 10 shows a schematic block diagram of a knowledge-graph generating apparatus according to an embodiment of the present disclosure. As shown in fig. 10, the knowledge-map generating apparatus 200 includes: a processor 210 and a memory 220 for storing instructions executable by the processor 210. Wherein the processor 210 is configured to execute the executable instructions to implement any of the aforementioned methods of knowledge-graph generation.
Here, it should be noted that there may be one or more processors 210. The knowledge-graph generating apparatus 200 of the embodiment of the present disclosure may further include an input device 230 and an output device 240. The processor 210, the memory 220, the input device 230, and the output device 240 may be connected via a bus or in other manners, which is not specifically limited herein.
The memory 220, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and various modules, such as the programs or modules corresponding to the knowledge-graph generation method of the embodiment of the present disclosure. The processor 210 executes the various functional applications and data processing of the knowledge-graph generating apparatus 200 by running the software programs or modules stored in the memory 220.
The input device 230 may be used to receive an input number or signal. Wherein the signal may be a key signal generated in connection with user settings and function control of the device/terminal/server. The output device 240 may include a display device such as a display screen.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or technical improvements over techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for generating a knowledge graph, comprising:
acquiring a network text of the industry where the knowledge graph is located, extracting sentences from the network text, and performing word segmentation and labeling processing on the sentences to obtain a plurality of segmented words and labeling information of each segmented word;
identifying entities in the sentences from the participles based on the labeling information of the participles;
identifying candidate attribute information of the entity from each word segmentation, and identifying attribute information of the entity from the candidate attribute information;
determining entity relationships among the entities, and generating the knowledge graph based on the entities, the attribute information of the entities, and the entity relationships among the entities.
2. The method of claim 1, wherein, when obtaining the network text of the industry where the knowledge graph is located, the method further comprises: performing data cleaning on the network text;
the data cleaning comprises at least one of screening network texts with a high degree of matching with the industry, cleaning stop words in the network text, and standardizing characters in the network text.
3. The method of claim 1, wherein the identifying of the entity in the sentence from each of the segmented words is performed based on a pre-constructed entity extraction model based on labeling information of each segmented word.
4. The method of claim 1, wherein the candidate attribute information comprises at least one of a candidate attribute and a candidate attribute value.
5. The method of claim 1, wherein the candidate attribute information of the entity is identified from each of the segmented words based on a pre-configured candidate attribute information extraction rule.
6. The method of claim 1, wherein the attribute information of the entity is identified from the candidate attribute information based on a pre-constructed attribute information extraction model.
7. The method of claim 1, wherein the entity relationships between the entities comprise: at least one of a type of relationship between entities, a strength of relationship between entities, and a preference of relationship between entities.
8. The method of claim 1, further comprising:
acquiring entry files of the industry, and screening out industry hot words from the entry files;
and carrying out expansion description on the entities in the knowledge graph based on the industry hot words.
9. A knowledge-graph generating apparatus, comprising:
the system comprises a data acquisition module, a word segmentation module and a word segmentation and labeling module, wherein the data acquisition module is used for acquiring a network text of the industry where the knowledge graph is located, extracting a sentence from the network text, and performing word segmentation and labeling processing on the sentence to obtain a plurality of segmented words and labeling information of each segmented word;
the entity identification module is used for identifying the entity in the sentence from each participle based on the labeling information of each participle;
the attribute information identification module is used for identifying candidate attribute information of the entity from each participle and identifying the attribute information of the entity from the candidate attribute information;
and the map building module is used for determining entity relations among the entities and generating the knowledge map based on the entities and the attributes, attribute values and entity relations of the entities.
10. A knowledge-graph generating apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method of any one of claims 1 to 8.
CN202310246624.0A 2023-03-10 2023-03-10 Knowledge graph generation method, device and equipment Active CN115982389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310246624.0A CN115982389B (en) 2023-03-10 2023-03-10 Knowledge graph generation method, device and equipment


Publications (2)

Publication Number Publication Date
CN115982389A true CN115982389A (en) 2023-04-18
CN115982389B CN115982389B (en) 2023-05-30

Family

ID=85964679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310246624.0A Active CN115982389B (en) 2023-03-10 2023-03-10 Knowledge graph generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN115982389B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777275A (en) * 2016-12-29 2017-05-31 北京理工大学 Entity attribute and property value extracting method based on many granularity semantic chunks
CN110569496A (en) * 2018-06-06 2019-12-13 腾讯科技(深圳)有限公司 Entity linking method, device and storage medium
US20200210466A1 (en) * 2018-12-26 2020-07-02 Microsoft Technology Licensing, Llc Hybrid entity matching to drive program execution
CN111368094A (en) * 2020-02-27 2020-07-03 沈阳东软熙康医疗系统有限公司 Entity knowledge map establishing method, attribute information acquiring method, outpatient triage method and device
CN111488738A (en) * 2019-01-25 2020-08-04 阿里巴巴集团控股有限公司 Illegal information identification method and device
CN115186109A (en) * 2022-08-08 2022-10-14 军工保密资格审查认证中心 Data processing method, equipment and medium of threat intelligence knowledge graph


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
东天岳 (Dong Tianyue): "Research and Application of Knowledge Graph Construction for the HPC Field", China Master's Theses Full-text Database (Information Science and Technology), pages 137-257 *

Also Published As

Publication number Publication date
CN115982389B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN106776711B Chinese medical knowledge graph construction method based on deep learning
CN106570180B (en) Voice search method and device based on artificial intelligence
RU2679988C1 (en) Extracting information objects with the help of a classifier combination
CN110413760B (en) Man-machine conversation method, device, storage medium and computer program product
CN111832282B Method, device and computer equipment for fine-tuning a BERT model fused with external knowledge
CN112906392B (en) Text enhancement method, text classification method and related device
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN114417865B Method, device, equipment and storage medium for processing description text of disaster events
CN111090771B (en) Song searching method, device and computer storage medium
CN110096599B (en) Knowledge graph generation method and device
CN112381038A (en) Image-based text recognition method, system and medium
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN111091009B (en) Document association auditing method based on semantic analysis
CN115544303A (en) Method, apparatus, device and medium for determining label of video
CN111859950A (en) Method for automatically generating lecture notes
CN115017335A (en) Knowledge graph construction method and system
CN111737523B Method for generating video tags and search content, and server
CN112906391A (en) Meta-event extraction method and device, electronic equipment and storage medium
CN111858894A Method and device for recognizing missing semantics, electronic equipment and storage medium
CN110309258B (en) Input checking method, server and computer readable storage medium
CN111062199A (en) Bad information identification method and device
CN115982389B (en) Knowledge graph generation method, device and equipment
CN115858733A (en) Cross-language entity word retrieval method, device, equipment and storage medium
CN110986972A (en) Information processing method and device for vehicle navigation
CN115374258A (en) Knowledge base query method and system combining semantic understanding with question template

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant