CN113553439A - Method and system for knowledge graph mining - Google Patents


Publication number
CN113553439A
Authority
CN
China
Prior art keywords
entity
words
speech
text
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110678441.7A
Other languages
Chinese (zh)
Inventor
高鹏
郝少春
袁兰
吴飞
周伟华
高峰
潘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Mjoys Big Data Technology Co ltd
Original Assignee
Hangzhou Mjoys Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Mjoys Big Data Technology Co ltd
Priority to CN202110678441.7A
Publication of CN113553439A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The application relates to a method and a system for knowledge graph mining. The method comprises the following steps: acquiring a text and performing error correction on the text; performing word segmentation and part-of-speech tagging on the error-corrected text according to a preset word list to obtain the words in the text and their parts of speech; identifying entities in the text according to the words and the parts of speech, and extracting the attributes and relations of the entities in the text according to the words, the parts of speech and the entities; and performing entity linking according to the entities, and performing knowledge fusion according to the entity linking results and the attributes and relations of the entities to obtain the knowledge graph.

Description

Method and system for knowledge graph mining
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a system for knowledge graph mining.
Background
A knowledge graph (Knowledge Graph), known in library and information science as knowledge domain visualization or knowledge domain mapping, is a series of graphs that display the development process and structural relationships of knowledge. It uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interrelations among knowledge resources and carriers. Knowledge graphs are widely applied in fields such as question answering, search and recommendation.
In the related art, knowledge graph mining requires manual participation: information is mined offline to obtain new knowledge, and the new knowledge is then written to the knowledge graph store on a schedule, so updates to the knowledge graph lag considerably.
No effective solution has yet been proposed for the problem in the related art that knowledge updates of the knowledge graph lag considerably.
Disclosure of Invention
The embodiments of the present application provide a method and a system for knowledge graph mining, which at least solve the problem in the related art that knowledge updates of the knowledge graph lag considerably.
In a first aspect, an embodiment of the present application provides a method for knowledge graph mining, where the method includes:
acquiring a text and carrying out error correction processing on the text;
performing word segmentation and part-of-speech tagging on the text subjected to error correction processing according to a preset word list to obtain words in the text and parts-of-speech of the words;
identifying entities in the text according to the words and the parts of speech, and extracting the attributes and the relations of the entities in the text according to the words, the parts of speech and the entities;
and carrying out entity linking according to the entities, and carrying out knowledge fusion according to the entity linking result, the attributes of the entities and the relationship to obtain a knowledge graph.
In some embodiments, the vocabulary building process includes:
adopting a plurality of part-of-speech tagging tools, and configuring the parts of speech in the plurality of part-of-speech tagging tools into parts of speech in a target part-of-speech tagging set;
acquiring basic data for constructing a word list, dividing sentences of the basic data, and inputting the divided basic data into a plurality of part-of-speech tagging tools to obtain tagging results, wherein the tagging results comprise words of the basic data and parts of speech of the words;
and recording a tagging result when at least two tagging tools produce the same tagging result, counting how often each recorded tagging result occurs, and generating the word list from the tagging results and their frequencies.
In some of these embodiments, the identification process of the entity includes:
respectively carrying out entity recognition through a dictionary and a recognition model;
under the condition that the recognition result of the dictionary is the same as the recognition result of the recognition model, adopting the entity words in the recognition result;
and when the recognition result of the dictionary is empty and the confidence of the recognition result of the recognition model reaches a confidence threshold, storing the entity words in the recognition result of the recognition model and the associated information of the entity words, wherein the associated information comprises the dialogue sentences in which the entity words are located.
In some embodiments, the extracting of the attributes and relationships of the entities includes:
inputting the words in the text, the parts of speech of the words and the entities into a syntactic analysis model to obtain an analysis result, wherein the analysis result comprises a subject-predicate-object relationship in which at least one of the subject and the object is an entity;
and, using a Labeled Property Graph as the basic data structure, constructing an attribute of the entity when only one of the subject and the object of the subject-predicate-object relationship is an entity, and constructing a relationship between the entities when both the subject and the object of the subject-predicate-object relationship are entities.
In some of these embodiments, the entity link comprises a candidate entity recall, the process of candidate entity recall comprising:
inputting the entity into a word vector model and determining near-synonyms of the entity to obtain a first set, and inputting the entity into a BERT model and determining near-synonyms of the entity to obtain a second set, wherein the word vector model is trained on the word list;
merging the first set and the second set to obtain a near-synonym set of the entity;
determining which words in the near-synonym set exist in the knowledge graph to obtain a recalled entity list.
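The candidate entity recall described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the two lookup functions stand in for the word vector model and the BERT model, and the names `recall_candidates`, `word_vector_synonyms` and `bert_synonyms` are hypothetical.

```python
from typing import Callable, Iterable, List, Set


def recall_candidates(
    entity: str,
    word_vector_synonyms: Callable[[str], Iterable[str]],
    bert_synonyms: Callable[[str], Iterable[str]],
    graph_entities: Set[str],
) -> List[str]:
    """Merge two near-synonym sets and keep the words already in the graph."""
    first_set = set(word_vector_synonyms(entity))   # from the word vector model
    second_set = set(bert_synonyms(entity))         # from the BERT model
    near_synonyms = first_set | second_set          # merged near-synonym set
    # Only near-synonyms that exist in the knowledge graph are recalled.
    return sorted(near_synonyms & graph_entities)


# Toy lookups (hypothetical stand-ins for the two models).
candidates = recall_candidates(
    "PC",
    lambda e: ["computer", "desktop"],
    lambda e: ["computer", "laptop"],
    graph_entities={"computer", "laptop", "phone"},
)
print(candidates)
```

Taking the union before intersecting with the graph means a near-synonym found by only one of the two models can still be recalled.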
In some embodiments, the text error correction process includes error checking and error correction, and the error checking process includes:
inputting the text into a classification model to identify sentences that contain errors;
and recalling similar characters for each Chinese character in a sentence that contains errors, wherein the similar characters comprise characters similar in shape or similar in pronunciation.
In a second aspect, an embodiment of the present application provides a system for knowledge graph mining, where the system includes:
the acquisition module is used for acquiring a text and correcting the text;
the word segmentation module is used for performing word segmentation and part-of-speech tagging on the text after error correction processing according to a preset word list to obtain words and parts-of-speech of the words in the text;
the extraction module is used for identifying entities in the text according to the words and the parts of speech and extracting the attributes and the relations of the entities in the text according to the words, the parts of speech and the entities;
and the fusion module is used for carrying out entity link according to the entity and carrying out knowledge fusion according to the result of the entity link, the attribute of the entity and the relationship to obtain a knowledge graph.
In some embodiments, the word segmentation module is further configured to:
acquiring basic data for constructing a word list, segmenting the basic data, inputting the segmented basic data into a plurality of part-of-speech tagging tools, and obtaining tagging results, wherein the parts of speech in the plurality of part-of-speech tagging tools are all configured as the parts of speech in a target part-of-speech tagging set, and the tagging results comprise words of the basic data and the parts of speech of the words;
and recording a tagging result when at least two tagging tools produce the same tagging result, counting how often each recorded tagging result occurs, and generating the word list from the tagging results and their frequencies.
In a third aspect, the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of knowledge-graph mining when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for knowledge-graph mining.
Compared with the related art, the method for knowledge graph mining provided by the embodiments of the present application acquires a text and performs error correction on the text; performs word segmentation and part-of-speech tagging on the error-corrected text according to a preset word list to obtain the words in the text and their parts of speech; identifies entities in the text according to the words and the parts of speech, and extracts the attributes and relations of the entities in the text according to the words, the parts of speech and the entities; and performs entity linking according to the entities and performs knowledge fusion according to the entity linking results and the attributes and relations of the entities to obtain the knowledge graph. This solves the problem in the related art that knowledge updates of the knowledge graph lag considerably, and achieves timely updating of the knowledge in the knowledge graph.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an application environment of a method of knowledge-graph mining according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of knowledge-graph mining according to a first embodiment of the present application;
FIG. 3 is a flow chart of a process of text correction according to a second embodiment of the present application;
FIG. 4 is a flow chart of a vocabulary construction process according to a third embodiment of the present application;
FIG. 5 is a flow chart of an entity identification process according to a fourth embodiment of the present application;
FIG. 6 is a flow chart of an extraction process of attributes and relationships of entities according to a fifth embodiment of the present application;
FIG. 7 is a flowchart of an entity linking process according to a sixth embodiment of the present application;
FIG. 8 is a flow chart of a method of knowledge-graph mining according to a seventh embodiment of the present application;
FIG. 9 is a block diagram of a system for knowledge-graph mining according to an eighth embodiment of the present application;
FIG. 10 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The method for knowledge graph mining provided by the present application can be applied in the application environment shown in FIG. 1, which is a schematic diagram of an application environment of the method according to an embodiment of the present application. As shown in FIG. 1, a server 102 acquires text data from a terminal 101 through a network and runs the knowledge graph mining method to mine knowledge from the text data, and then adds the knowledge to the knowledge graph or uses it to update the knowledge graph. The server 102 may be implemented as an independent server or as a server cluster consisting of a plurality of servers, and the terminal 101 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer or a portable wearable device.
The present embodiment provides a method for knowledge graph mining. FIG. 2 is a flowchart of a method for knowledge graph mining according to a first embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:
step S201, acquiring a text and performing error correction on the text; for example, text newly added during each interval is acquired at short time intervals, such as every 30 seconds, and is then error-corrected, where the text may be dialog content in a human-computer dialog scenario;
step S202, performing word segmentation and part-of-speech tagging on the error-corrected text according to a preset word list to obtain the words in the text and their parts of speech, where optionally a method combining N-gram and HMM may be used for word segmentation and part-of-speech tagging;
step S203, identifying entities in the text according to the words and the parts of speech, and extracting the attributes and relations of the entities in the text according to the words, the parts of speech and the entities, where the entity types mainly include person names, organization names, place names, dates, times, quantities and the like; for example, an entity may be "morals";
and step S204, performing entity linking according to the entities, and performing knowledge fusion according to the entity linking results and the attributes and relations of the entities to obtain a knowledge graph.
Through steps S201 to S204, the present embodiment acquires a text, performs error correction, word segmentation and part-of-speech tagging on it, identifies the entities in the text, extracts their attributes and relations, and performs entity linking and knowledge fusion to obtain the knowledge graph. Newly generated human-computer dialog text is thus analyzed for the knowledge it contains, and new knowledge produced during a conversation is written to the knowledge graph promptly. This solves the problem in the related art that knowledge updates of the knowledge graph lag considerably, and achieves timely updating of the knowledge in the knowledge graph.
In addition, the present embodiment requires no manual mining or review, which reduces labor input and lowers the cost of updating the knowledge graph.
In some embodiments, FIG. 3 is a flowchart of a text error correction process according to a second embodiment of the present application. As shown in FIG. 3, the flow includes the following steps:
step S301, inputting a text into a classification model to identify sentences that contain errors, where the classification model may be a Bi-LSTM model;
step S302, recalling similar characters for each Chinese character in a sentence that contains errors, where the similar characters include characters similar in shape or similar in pronunciation;
step S303, inputting the text and the similar characters into a BERT model and determining the character with the maximum probability to obtain a candidate character; for example, the text and the similar characters are input into a TinyBERT model, and the character that reads most fluently in the text is taken as the candidate character;
step S304, determining whether the Chinese character is the same as its candidate character; if not, replacing the Chinese character with the candidate character, and if so, leaving it unchanged.
Through steps S301 to S304, and in contrast to the related art, in which text is corrected word by word with low efficiency, the present embodiment splits text error correction into two steps: errors are first detected, and only sentences found to contain errors are then corrected. This improves the efficiency of text error correction, which saves time in the knowledge mining process and speeds up updating of the knowledge graph.
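The detect-then-correct flow of steps S301 to S304 can be sketched as follows. This is only an illustrative outline: the classifier and the character scorer are passed in as plain functions, and `toy_has_error` and `toy_best_char` are hypothetical stand-ins for the Bi-LSTM classifier and the BERT model, not real models.

```python
from typing import Callable, Dict, List


def correct_text(
    sentences: List[str],
    has_error: Callable[[str], bool],
    similar_chars: Dict[str, List[str]],
    best_char: Callable[[str, int, List[str]], str],
) -> List[str]:
    """Detect erroneous sentences first, then correct only those sentences."""
    corrected = []
    for sentence in sentences:
        if not has_error(sentence):            # S301: sentences judged correct are skipped
            corrected.append(sentence)
            continue
        chars = list(sentence)
        for i, ch in enumerate(chars):
            # S302: recall characters similar in shape or sound; keep the original too
            candidates = [ch] + similar_chars.get(ch, [])
            # S303: choose the most probable character in the current context
            choice = best_char("".join(chars), i, candidates)
            # S304: replace only when the chosen candidate differs
            if choice != ch:
                chars[i] = choice
        corrected.append("".join(chars))
    return corrected


# Toy stand-ins (assumptions for illustration only).
def toy_has_error(sentence: str) -> bool:
    return "我门" in sentence


def toy_best_char(sentence: str, index: int, candidates: List[str]) -> str:
    if index > 0 and sentence[index - 1] == "我" and "们" in candidates:
        return "们"
    return candidates[0]


result = correct_text(
    ["我门去公园", "今天天气好"], toy_has_error, {"门": ["们"]}, toy_best_char
)
print(result)
```

Because the per-character scoring runs only on flagged sentences, the expensive model is called far less often than in a correct-everything approach.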
Because updating the knowledge graph in the embodiments of the present application involves no human review, the accuracy of the mined knowledge needs to be as high as possible to avoid adding wrong content to the knowledge graph. In some embodiments, FIG. 4 is a flowchart of a word list construction process according to a third embodiment of the present application. As shown in FIG. 4, the flow includes the following steps:
step S401, adopting a plurality of part-of-speech tagging tools and mapping the parts of speech used by each tool to parts of speech in a target part-of-speech tag set; optionally, the tools may be word segmentation and part-of-speech tagging tools such as a Chinese word segmentation tool, Baidu LAC, Peking University's PKU tool and Harbin Institute of Technology's LTP, and the target part-of-speech tag set may be the 863 part-of-speech tag set;
step S402, acquiring basic data for constructing the word list, splitting the basic data into sentences, and inputting the resulting sentences into the plurality of part-of-speech tagging tools to obtain tagging results, where the tagging results include the words of the basic data and their parts of speech; optionally, the basic data may be publicly available online data sets and a proprietary dialog corpus;
and step S403, recording a tagging result when at least two tagging tools produce the same tagging result, counting how often each recorded tagging result occurs, and generating the word list from the tagging results and their frequencies, where the frequencies can serve as a reference for subsequent word segmentation and part-of-speech tagging.
Through steps S401 to S403, and in contrast to the related art, in which a single word segmentation and part-of-speech tagging tool is used and the accuracy of the results is therefore limited, the present embodiment maps the parts of speech of a plurality of tagging tools to the target part-of-speech tag set, which unifies the part-of-speech standards of the different tools, and records a tagging result only when at least two tools agree on it. This improves the accuracy of the word segmentation and part-of-speech tagging results and, in turn, the accuracy of the mined knowledge.
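The agreement rule of steps S401 to S403 can be sketched as follows, assuming the taggers have already been run and their outputs are available as lists of (word, tag) pairs. The function name `build_vocabulary`, the tag mappings and the sample tags are hypothetical; comparing results by character offset is one possible reading of "the same tagging result", not something the text specifies.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (word, part of speech)


def build_vocabulary(
    corpus_outputs: List[List[List[Tagged]]],
    tag_maps: List[Dict[str, str]],
    min_agree: int = 2,
) -> Counter:
    """Count (word, tag) pairs that at least `min_agree` tools agree on.

    corpus_outputs[s][t] is the output of tool t for sentence s.
    tag_maps[t] maps tool t's own tags to the target tag set (S401).
    """
    vocab: Counter = Counter()
    for sentence_outputs in corpus_outputs:
        votes: Counter = Counter()
        for tool_index, tagged in enumerate(sentence_outputs):
            offset = 0
            spans = set()
            for word, tag in tagged:
                unified = tag_maps[tool_index].get(tag, tag)
                spans.add((offset, word, unified))   # same position, word and tag
                offset += len(word)
            votes.update(spans)                       # one vote per tool per span
        for (_, word, tag), count in votes.items():
            if count >= min_agree:                    # S403: keep agreed results only
                vocab[(word, tag)] += 1               # and count their frequency
    return vocab


# One sentence tagged by three hypothetical tools.
outputs = [[
    [("我", "PN"), ("爱", "VV"), ("北京", "NR")],   # tool 0 uses its own tag names
    [("我", "r"), ("爱", "v"), ("北京", "ns")],     # tool 1 already uses target tags
    [("我", "r"), ("爱北", "v"), ("京", "n")],      # tool 2 segments differently
]]
maps = [{"PN": "r", "VV": "v", "NR": "ns"}, {}, {}]
vocab = build_vocabulary(outputs, maps)
print(vocab)
```

The segmentation produced by only one tool ("爱北", "京") never reaches the word list, which is the filtering effect the embodiment relies on.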
In some embodiments, FIG. 5 is a flowchart of an entity identification process according to a fourth embodiment of the present application. As shown in FIG. 5, the flow includes the following steps:
step S501, performing entity recognition through a dictionary and through a recognition model respectively, where the recognition model may be built on an architecture combining Bi-LSTM and CRF; the Bi-LSTM makes effective use of the context information of text features, and the CRF learns the contextual dependencies between labels;
step S502, adopting the entity words in the recognition result when the recognition result of the dictionary is the same as the recognition result of the recognition model;
step S503, when the recognition result of the dictionary is empty and the confidence of the recognition result of the recognition model reaches a confidence threshold, storing the entity words in the recognition result of the recognition model together with their associated information, where the associated information includes the dialog sentences in which the entity words occur, their frequency of occurrence and the like; optionally, the confidence threshold may be set to 0.8.
Through steps S501 to S503, and in contrast to the related art, in which entity recognition relies on a dictionary alone or on a model alone and the accuracy of the recognition result is therefore limited, the embodiments of the present application combine the two recognition methods and adopt the entity words only when both methods give the same recognition result. This improves the accuracy of the recognition result and, in turn, the accuracy of the mined knowledge.
Meanwhile, the embodiments of the present application also store the entity words for which the dictionary returns nothing but the confidence of the recognition model's result reaches the confidence threshold. Storing them does not affect the normal operation of the current knowledge mining process, and the stored data can later be processed manually in periodic batches. Specifically, information such as the frequency and the confidence of an entity word can be considered together to decide whether to add it to the dictionary, so that the dictionary's vocabulary is continuously enriched, which benefits subsequent dictionary-based entity recognition.
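The decision rule of steps S502 and S503 can be sketched as follows, assuming the dictionary lookup and the recognition model have already been run on a sentence. The names `RecognitionOutcome` and `combine_recognition` are hypothetical, and treating "the same recognition result" as equality of the two sets of entity words is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RecognitionOutcome:
    accepted: Set[str] = field(default_factory=set)                  # used for mining now
    pending: List[Dict[str, object]] = field(default_factory=list)   # kept for later review


def combine_recognition(
    sentence: str,
    dictionary_hits: Set[str],
    model_hits: Dict[str, float],   # entity word -> model confidence
    threshold: float = 0.8,
) -> RecognitionOutcome:
    outcome = RecognitionOutcome()
    if dictionary_hits and dictionary_hits == set(model_hits):
        # S502: both methods agree, so the entity words are adopted.
        outcome.accepted = set(dictionary_hits)
    elif not dictionary_hits:
        # S503: dictionary is empty; store confident model results with context.
        for word, confidence in model_hits.items():
            if confidence >= threshold:
                outcome.pending.append(
                    {"entity": word, "confidence": confidence, "sentence": sentence}
                )
    return outcome


agreed = combine_recognition("I work in Hangzhou", {"Hangzhou"}, {"Hangzhou": 0.95})
unseen = combine_recognition("I work at Mjoys", set(), {"Mjoys": 0.86, "work": 0.40})
print(agreed.accepted, [p["entity"] for p in unseen.pending])
```

Entries in `pending` never enter the graph directly; they only feed the periodic manual review that decides whether a word joins the dictionary.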
In some embodiments, FIG. 6 is a flowchart of an extraction process of attributes and relationships of entities according to a fifth embodiment of the present application. As shown in FIG. 6, the flow includes the following steps:
step S601, inputting the words in a text, the parts of speech of the words, and the entities into a syntactic analysis model to obtain an analysis result. Because knowledge in the embodiments of the present application is updated into the knowledge graph automatically, the related-art approach of labeling syntactic relations in advance to obtain standard data is not applicable. The embodiments of the present application therefore adopt an unsupervised, data-driven method: the subject-predicate-object relations in the analysis result are extracted, where at least one of the subject and the object in each relation is an entity. Specifically, the analysis result may include the subject-verb relation SBV (for example, "baby birth"), the attributive relation ATT (for example, "university teaching"), the verb-object relation VOB (for example, "three days and three nights"), and the head relation HED (for example, "this is the cheapest taxi for engendering");
step S602, using a Labeled Property Graph as the basic data structure, constructing an attribute of the entity when only one of the subject and the object of the subject-predicate-object relationship is an entity, and constructing a relationship between the entities when both the subject and the object are entities. With a Labeled Property Graph as the basic data structure, existing entity properties and relationship properties can be refined continuously while new entities and relationships keep being added.
Through steps S601 to S602, the analysis result is restricted to subject-predicate-object relationships. Because such relationships clearly express an entity together with its attributes and relationships, knowledge determined from them is highly reliable, which improves the accuracy of the mined knowledge. In addition, because the knowledge mined by this embodiment is accurate, the embodiments of the present application are also suitable for building a knowledge graph for a domain that no knowledge graph yet covers: the knowledge graph of that domain can be mined conveniently and quickly by acquiring related data or historical dialog records of the domain and running the knowledge graph mining method.
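The rule of step S602 can be sketched with a very small in-memory property graph. The class `PropertyGraph` and the function `add_triple` are hypothetical, and storing an attribute on the object when only the object is an entity is an assumption about a case the text leaves open.

```python
from typing import Dict, List, Set, Tuple


class PropertyGraph:
    """Tiny stand-in for a labeled property graph: nodes carry key/value properties."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {}
        self.edges: List[Tuple[str, str, str]] = []   # (subject, predicate, object)


def add_triple(
    graph: PropertyGraph, subject: str, predicate: str, obj: str, entities: Set[str]
) -> None:
    subject_is_entity = subject in entities
    object_is_entity = obj in entities
    if subject_is_entity and object_is_entity:
        # Both ends are entities: record a relationship between them.
        graph.nodes.setdefault(subject, {})
        graph.nodes.setdefault(obj, {})
        graph.edges.append((subject, predicate, obj))
    elif subject_is_entity:
        # Only the subject is an entity: the object becomes an attribute value.
        graph.nodes.setdefault(subject, {})[predicate] = obj
    elif object_is_entity:
        # Only the object is an entity (assumed handling): attach the subject as value.
        graph.nodes.setdefault(obj, {})[predicate] = subject
    # Triples with no entity at either end are ignored.


graph = PropertyGraph()
known_entities = {"Alice", "Hangzhou"}
add_triple(graph, "Alice", "lives in", "Hangzhou", known_entities)
add_triple(graph, "Alice", "age", "30", known_entities)
add_triple(graph, "weather", "is", "nice", known_entities)
print(graph.nodes, graph.edges)
```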
In some of these embodiments, entity linking includes candidate entity recall and mention matching. FIG. 7 is a flowchart of an entity linking process according to a sixth embodiment of the present application. As shown in FIG. 7, the flow includes the following steps:
step S701, inputting an entity into a word vector model and determining near-synonyms of the entity to obtain a first set, and inputting the entity into a BERT model and determining near-synonyms of the entity to obtain a second set, where the word vector model is trained on the word list;
step S702, merging the first set and the second set to obtain the near-synonym set of the entity;
step S703, determining which words in the near-synonym set exist in the knowledge graph to obtain a recalled entity list, where the near-synonyms in the entity list may be referred to as candidate entities;
step S704, replacing the entity in the original sentence with each candidate entity to obtain a new sentence containing the candidate entity, feeding the original sentence and the new sentence into a Siamese (twin) network at the same time, and determining whether the two sentences are similar; if they are, the entity in the original sentence can be linked to that entity in the knowledge graph, where the loss function of the network is the contrastive loss function:
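$$L(W, Y, X_1, X_2) = (1 - Y)\,\frac{1}{2}\,(D_W)^2 + Y\,\frac{1}{2}\,\{\max(0,\ m - D_W)\}^2$$
where $D_W$ is the distance between the outputs of the network for the input pair $X_1$ and $X_2$ under the weights $W$; $Y = 0$ indicates that the original sentence and the new sentence are similar or matched, and $Y = 1$ indicates that they are not similar or matched; optionally, the confidence threshold $m$ (the margin term of the loss) may be set relatively high, for example, to 0.9;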
step S705, performing knowledge fusion according to the result of entity linking to obtain the knowledge graph. Specifically, when an entity cannot be linked to an existing entity of the knowledge graph, the entity is added to the knowledge graph; in this case, if a single entity was extracted, the attribute of the entity is added, and if a pair of entities was extracted, the relationship between the entities is added. When an entity can be linked to an existing entity of the knowledge graph: if a single entity was extracted, it is determined whether the attribute of the entity already exists; if so, the attribute is updated and the modification time is recorded, and if not, the attribute is added; if a pair of entities was extracted, it is determined whether the relationship between the entities already exists; if so, the relationship is updated and the modification time is recorded, and if not, the relationship is added.
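As a numerical illustration of the contrastive loss used in step S704, the sketch below evaluates the formula for a single sentence pair. It assumes that the distance is the Euclidean distance between two sentence embeddings, which the application does not state; the function names are hypothetical, and the default margin of 0.9 follows the example value given in the text.

```python
import math


def euclidean_distance(u, v):
    """Distance between two embedding vectors (assumed metric, not stated in the source)."""
    return math.dist(u, v)


def contrastive_loss(distance, label, margin=0.9):
    """Contrastive loss for one pair.

    label = 0: the pair is similar, so any distance is penalised.
    label = 1: the pair is dissimilar, so only distances below the margin are penalised.
    """
    similar_term = (1 - label) * 0.5 * distance ** 2
    dissimilar_term = label * 0.5 * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term
```

With these definitions, a similar pair at distance 0 and a dissimilar pair whose distance is at least the margin both contribute zero loss, which is the behaviour the formula is designed to encourage.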
Through steps S701 to S705, the present application avoids duplicate content in the knowledge graph by means of near-synonym query, candidate entity recall, mention matching, and knowledge fusion, so that the content of the knowledge graph is tidy and systematic. In addition, in the related art only a BERT model is used to obtain near-synonyms when recalling candidate entities, which may miss some near-synonyms; the present embodiment merges the near-synonyms obtained from the word vector model with those obtained from the BERT model, which reduces such omissions.
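Candidate entity recall (steps S701 to S703) amounts to a set union followed by an intersection with the entities already in the knowledge graph. The sketch below shows only that set logic; the two lookup callables stand in for the word vector model and the BERT model and are hypothetical, as are the function and parameter names.

```python
def recall_candidates(entity, word_vector_synonyms, bert_synonyms, graph_entities):
    """Return the near-synonyms of `entity` that already exist in the knowledge graph.

    word_vector_synonyms, bert_synonyms: callables mapping an entity to an
    iterable of near-synonyms (placeholders for the two models).
    graph_entities: iterable of entity names currently in the knowledge graph.
    """
    first_set = set(word_vector_synonyms(entity))   # S701, word vector model
    second_set = set(bert_synonyms(entity))         # S701, BERT model
    near_synonyms = first_set | second_set          # S702, merge both sets
    # S703: keep only the near-synonyms that are present in the knowledge graph.
    return sorted(near_synonyms & set(graph_entities))
```

Because the two sets are merged before the lookup, a near-synonym found by only one of the two models can still be recalled.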
In some of these embodiments, fig. 8 is a flow chart of a method of knowledge graph mining according to a seventh embodiment of the present application. As shown in fig. 8, the method comprises: obtaining a dialogue text of a user and preprocessing the text, wherein the preprocessing comprises text error correction, word segmentation, and part-of-speech tagging; performing knowledge extraction on the preprocessed text, wherein the knowledge extraction comprises named entity recognition, attribute extraction, and relationship extraction; performing entity linking according to the extracted entities, wherein the entity linking comprises near-synonym query, candidate entity recall, and mention matching; performing knowledge fusion according to the result of entity linking, wherein the knowledge fusion comprises entity fusion, attribute fusion, and relationship fusion, so as to obtain an updated knowledge graph; and continuously acquiring newly generated dialogue text of the user and executing the knowledge graph mining process on it to keep obtaining an updated knowledge graph. In this way the robot keeps learning by itself during actual dialogues to update the knowledge graph, and the content of the knowledge graph is continuously improved as the number of dialogues grows.
The present embodiment further provides a system for knowledge graph mining. Fig. 9 is a structural block diagram of the system for knowledge graph mining according to an eighth embodiment of the present application. As shown in fig. 9, the system includes:
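The attribute branch of knowledge fusion might be sketched as below; relationship fusion would follow the same add-or-update pattern. The dictionary layout, the function name, and the return labels are assumptions made for illustration and are not specified in the application.

```python
from datetime import datetime, timezone


def fuse_attribute(graph, entity, linked_entity, name, value, now=None):
    """Add or update one attribute according to the entity linking result.

    graph: dict mapping entity -> {attribute name: {"value": ..., "modified": ...}}
    linked_entity: the existing graph entity that `entity` was linked to, or None.
    """
    now = now or datetime.now(timezone.utc)
    if linked_entity is None:
        # No link found: add the new entity together with its attribute.
        graph.setdefault(entity, {})[name] = {"value": value, "modified": None}
        return "entity_added"
    attributes = graph[linked_entity]
    if name in attributes:
        # Attribute already exists: update it and record the modification time.
        attributes[name] = {"value": value, "modified": now}
        return "attribute_updated"
    # Linked entity has no such attribute yet: add it.
    attributes[name] = {"value": value, "modified": None}
    return "attribute_added"
```

Keeping the modification time next to the value means that repeated dialogues overwrite stale attributes in place instead of creating duplicate entries.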
the acquiring module 91 is configured to acquire a text and perform error correction processing on the text;
the word segmentation module 92 is configured to perform word segmentation and part-of-speech tagging on the text after the error correction processing according to a preset word list, so as to obtain words and parts-of-speech of the words in the text;
the extraction module 93 is configured to identify entities in the text according to the words and the parts of speech, and extract attributes and relationships of the entities in the text according to the words, the parts of speech, and the entities;
and the fusion module 94 is configured to perform entity linking according to the entities, and perform knowledge fusion according to the result of the entity linking and the attributes and relationships of the entities to obtain a knowledge graph.
In some embodiments, the word segmentation module 92 is further configured for:
acquiring basic data for constructing the word list, splitting the basic data into sentences, and inputting the split basic data into a plurality of part-of-speech tagging tools to obtain tagging results, wherein the parts of speech in the plurality of part-of-speech tagging tools are all configured as parts of speech in a target part-of-speech tagging set, and the tagging results comprise the words of the basic data and the parts of speech of the words;
and, when the tagging results obtained by at least two tagging tools are the same, recording the tagging results, counting the frequency with which the tagging results occur, and generating the word list according to the tagging results and the frequency.
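The agreement rule for building the word list can be sketched as a simple vote among tagging tools. In this sketch each tagger is a hypothetical callable returning `(word, part_of_speech)` pairs that are already mapped to the target tag set, and each agreed pair is counted once per sentence; both are simplifying assumptions rather than details from the application.

```python
from collections import Counter


def build_word_list(sentences, taggers, min_agreement=2):
    """Count (word, part_of_speech) pairs that at least `min_agreement` taggers agree on."""
    frequencies = Counter()
    for sentence in sentences:
        votes = Counter()
        for tagger in taggers:
            # A set is used so that one tagger contributes at most one vote per pair.
            votes.update(set(tagger(sentence)))
        for pair, vote_count in votes.items():
            if vote_count >= min_agreement:
                frequencies[pair] += 1  # assumed: counted once per sentence
    return frequencies
```

The returned counter maps each agreed pair to its frequency, which is the information the word list is generated from.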
In one embodiment, an electronic device is provided, which may be a server. Fig. 10 is a schematic diagram of the internal structure of the electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the electronic device is used for storing data. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of knowledge graph mining.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of part of the structure related to the present solution and does not limit the electronic devices to which the present solution applies; a particular electronic device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A method of knowledge-graph mining, the method comprising:
acquiring a text and carrying out error correction processing on the text;
performing word segmentation and part-of-speech tagging on the text subjected to error correction processing according to a preset word list to obtain words in the text and parts-of-speech of the words;
identifying entities in the text according to the words and the parts of speech, and extracting the attributes and relationships of the entities in the text according to the words, the parts of speech, and the entities;
and performing entity linking according to the entities, and performing knowledge fusion according to the result of the entity linking and the attributes and relationships of the entities to obtain a knowledge graph.
2. The method of claim 1, wherein the process of constructing the word list comprises:
adopting a plurality of part-of-speech tagging tools, and configuring the parts of speech in the plurality of part-of-speech tagging tools into parts of speech in a target part-of-speech tagging set;
acquiring basic data for constructing the word list, splitting the basic data into sentences, and inputting the split basic data into the plurality of part-of-speech tagging tools to obtain tagging results, wherein the tagging results comprise the words of the basic data and the parts of speech of the words;
and, when the tagging results obtained by at least two tagging tools are the same, recording the tagging results, counting the frequency with which the tagging results occur, and generating the word list according to the tagging results and the frequency.
3. The method of claim 1, wherein the identification process of the entity comprises:
respectively carrying out entity recognition through a dictionary and a recognition model;
under the condition that the recognition result of the dictionary is the same as the recognition result of the recognition model, adopting the entity words in the recognition result;
and when the recognition result of the dictionary is empty and the confidence of the recognition result of the recognition model reaches a confidence threshold, storing the entity words in the recognition result of the recognition model and the associated information of the entity words, wherein the associated information comprises the dialogue sentences in which the entity words are located.
4. The method of claim 1, wherein the extracting of the attributes and relationships of the entities comprises:
inputting the words in the text, the parts of speech of the words, and the entities into a syntactic analysis model to obtain an analysis result, wherein the analysis result comprises a subject-predicate-object relationship in which at least one of the subject and the object is an entity;
and, using a labeled property graph as the basic data structure, constructing the attribute of the entity when one of the subject and the object of the subject-predicate-object relationship is an entity, and constructing the relationship between the entities when both the subject and the object of the subject-predicate-object relationship are entities.
5. The method of claim 2, wherein the entity linking comprises candidate entity recall, and wherein the candidate entity recall comprises:
inputting the entity into a word vector model and determining near-synonyms of the entity to obtain a first set, and inputting the entity into a BERT model and determining near-synonyms of the entity to obtain a second set, wherein the word vector model is trained according to the word list;
merging the first set and the second set to obtain a near-synonym set of the entity;
determining which words in the near-synonym set exist in the knowledge graph to obtain a recalled entity list.
6. The method of claim 1, wherein the text error correction process comprises error detection and error correction, and wherein the error detection process comprises:
inputting the text into a classification model, and determining the sentences in which errors exist;
and recalling similar characters for each Chinese character in the sentences in which errors exist, wherein the similar characters comprise characters of similar shape or similar pronunciation.
7. A system for knowledge-graph mining, the system comprising:
the acquisition module is used for acquiring a text and correcting the text;
the word segmentation module is used for performing word segmentation and part-of-speech tagging on the text after error correction processing according to a preset word list to obtain words and parts-of-speech of the words in the text;
the extraction module is used for identifying entities in the text according to the words and the parts of speech and extracting the attributes and the relations of the entities in the text according to the words, the parts of speech and the entities;
and the fusion module is used for performing entity linking according to the entities and performing knowledge fusion according to the result of the entity linking and the attributes and relationships of the entities to obtain a knowledge graph.
8. The system of claim 7, wherein the word segmentation module is further configured to:
acquiring basic data for constructing the word list, splitting the basic data into sentences, and inputting the split basic data into a plurality of part-of-speech tagging tools to obtain tagging results, wherein the parts of speech in the plurality of part-of-speech tagging tools are all configured as parts of speech in a target part-of-speech tagging set, and the tagging results comprise the words of the basic data and the parts of speech of the words;
and, when the tagging results obtained by at least two tagging tools are the same, recording the tagging results, counting the frequency with which the tagging results occur, and generating the word list according to the tagging results and the frequency.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements a method of knowledge-graph mining as claimed in any one of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of knowledge-graph mining according to any one of claims 1 to 6.
CN202110678441.7A 2021-06-18 2021-06-18 Method and system for knowledge graph mining Pending CN113553439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110678441.7A CN113553439A (en) 2021-06-18 2021-06-18 Method and system for knowledge graph mining

Publications (1)

Publication Number Publication Date
CN113553439A true CN113553439A (en) 2021-10-26

Family

ID=78130709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110678441.7A Pending CN113553439A (en) 2021-06-18 2021-06-18 Method and system for knowledge graph mining

Country Status (1)

Country Link
CN (1) CN113553439A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9348815B1 (en) * 2013-06-28 2016-05-24 Digital Reasoning Systems, Inc. Systems and methods for construction, maintenance, and improvement of knowledge representations
CN109509556A (en) * 2018-11-09 2019-03-22 天津开心生活科技有限公司 Knowledge mapping generation method, device, electronic equipment and computer-readable medium
CN111191051A (en) * 2020-04-09 2020-05-22 速度时空信息科技股份有限公司 Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN111209412A (en) * 2020-02-10 2020-05-29 同方知网(北京)技术有限公司 Method for building knowledge graph of periodical literature by cyclic updating iteration
CN111914550A (en) * 2020-07-16 2020-11-10 华中师范大学 Knowledge graph updating method and system for limited field
CN112612884A (en) * 2020-11-27 2021-04-06 中山大学 Entity label automatic labeling method based on public text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《图书情报工作》杂志社: "《馆藏资源聚合研究与实践进展》", 哈尔滨:哈尔滨工业大学出版社, pages: 292 - 293 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780756A (en) * 2022-06-07 2022-07-22 国网浙江省电力有限公司信息通信分公司 Entity alignment method and device based on noise detection and noise perception
CN114780756B (en) * 2022-06-07 2022-09-16 国网浙江省电力有限公司信息通信分公司 Entity alignment method and device based on noise detection and noise perception
WO2024065952A1 (en) * 2022-09-30 2024-04-04 中国四维测绘技术有限公司 Remote sensing satellite information recommendation method, system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination