CN113569051A - Knowledge graph construction method and device - Google Patents

Info

Publication number
CN113569051A
Authority
CN
China
Prior art keywords
text
corpus
data set
text corpus
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010358878.8A
Other languages
Chinese (zh)
Inventor
李长亮
刘晓楠
汪美玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Software Co Ltd
Kingsoft Corp Ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Software Co Ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Software Co Ltd and Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202010358878.8A
Publication of CN113569051A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a knowledge graph construction method and device. The method comprises: acquiring a text data set, wherein the text data set comprises a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus; determining at least one feature tag corresponding to each text corpus according to the text data set; and constructing a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus. Because the knowledge graph is constructed on the basis of the feature tags of each text corpus, the semantic description of the text corpora, including celebrity quotes, is expanded, the text corpora are organized in a better-suited graph data structure, and users can obtain information from multiple angles.

Description

Knowledge graph construction method and device
Technical Field
The present application relates to the field of information data processing technologies, and in particular, to a method and an apparatus for constructing a knowledge graph, a computing device, and a computer-readable storage medium.
Background
Celebrity quotes can currently be looked up online through channels such as Chinese-language databases, collections of classic sayings, some forums, and literature websites. These channels mainly provide information such as the content and author of a quote, and usually organize and store it in a relational database. On this basis they offer users two main modes of use: looking up related information from a known quote, and searching for related quotes by author name or by specific keywords contained in the quote. However, no knowledge graph built from celebrity quotes as data exists on the network or on the market, and because the information provided by the relevant websites covers only the content, author, and type of a quote, users' needs cannot be well met. For example, "Honesty is a person's nature" and "A trustworthy person is the happiest, and honesty is the most natural" are both quotes by "luck" about "trustworthiness", but at present it is difficult for a user to retrieve both quotes from a knowledge graph using the two keywords "luck" and "trustworthiness". Constructing a knowledge graph containing celebrity quote data is therefore an urgent problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present application provide a method, an apparatus, a computing device, and a computer-readable storage medium for constructing a knowledge graph, so as to solve technical defects in the prior art.
According to a first aspect of embodiments of the present specification, there is provided a knowledge-graph construction method, including:
acquiring a text data set, wherein the text data set comprises a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus;
determining at least one feature tag corresponding to each text corpus according to the text data set;
and constructing a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus.
According to a second aspect of embodiments herein, there is provided a knowledge-graph constructing apparatus including:
a data acquisition module configured to acquire a text data set, wherein the text data set comprises a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus;
a tag acquisition module configured to determine at least one feature tag corresponding to each text corpus according to the text data set;
a graph construction module configured to construct a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus.
According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the tag-based celebrity-quote knowledge graph construction method when executing the instructions.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the tag-based celebrity-quote knowledge graph construction method.
The application provides a knowledge graph construction method in which the most representative words in each text corpus are obtained through natural language processing and used as feature tags, and the knowledge graph corresponding to the text data set is constructed from the text corpora, the attribute information corresponding to each text corpus, and the feature tags corresponding to each text corpus. This expands the semantic description of the text corpora, including celebrity quotes, increases the association relationships among the text corpora, organizes them in a better-suited graph data structure, lets users obtain information from multiple angles, and better meets users' needs for intelligent retrieval and use of the text corpora.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flow chart of a tag-based celebrity-quote knowledge graph construction method provided by an embodiment of the present application;
FIG. 3 is a flow chart of tag computation provided by an embodiment of the present application;
FIG. 4 is a flow chart of LDA algorithm processing provided by an embodiment of the present application;
FIG. 5 is a flow chart of tag replenishment provided by an embodiment of the present application;
FIG. 6 is a flow chart of celebrity-quote knowledge graph construction provided by an embodiment of the present application;
FIG. 7 is a flow chart of celebrity-quote knowledge graph storage and interfacing provided by an embodiment of the present application;
FIG. 8 is a flow chart of processing a specific celebrity quote provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a tag-based celebrity-quote knowledge graph construction apparatus provided by an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, the information should not be limited by these terms, which are used only to distinguish one type of information from another. For example, a first may also be referred to as a second and, similarly, a second may be referred to as a first without departing from the scope of one or more embodiments of the present description. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination".
First, the terms used in one or more embodiments of the present invention are explained.
Celebrity quote: a saying in which a famous, knowledgeable person expresses an insight that is widely meaningful and enlightening to people, such as "red and black juveniles between jujubes, black and black between juveniles" and "good and not good but not good".
Knowledge graph: the knowledge graph aims to describe various entities or concepts existing in the real world and relations thereof, and forms a huge semantic network graph, wherein nodes represent the entities or concepts, and edges are formed by attributes or relations.
Entity: an entity refers to something that is distinguishable and exists independently, such as a person's name, a city name, a plant name, a commodity name, and the like, and is the most basic element in a knowledge graph, and different relationships exist among different entities.
Attribute: an edge pointing from an entity to its attribute value; different attribute types correspond to different types of attribute edges. An attribute mainly refers to characteristic information of an object; for example, "area", "population", and "capital" are different attributes. An attribute value is the value of an attribute, such as 9.6 million square kilometers.
Association relationship: on a knowledge graph, a relationship is a function that maps several graph nodes (entities, semantic classes, attribute values) to Boolean values.
Triplet: triples are a general representation of a knowledge graph; their basic forms are (head entity, relationship, tail entity) and (entity, attribute, attribute value).
Knowledge fusion: knowledge fusion is needed because knowledge comes from many sources, its quality is uneven, knowledge from different data sources is duplicated, hierarchical structure is missing, and so on. Knowledge fusion is a high-level form of knowledge organization: knowledge from different sources goes through heterogeneous data integration, disambiguation, processing, reasoning and verification, updating, and similar steps under one framework specification, fusing data, information, methods, experience, and human ideas into a high-quality knowledge base.
LDA algorithm: Latent Dirichlet Allocation (LDA) is a typical document topic generation model. It treats a document as a set of words with no ordering among them. A document may contain multiple topics, and each word in the document is generated by one of those topics. As a topic model, LDA gives the topic of each document in a document set in the form of a probability distribution, and it can be regarded as a clustering algorithm.
Text data set: a data set containing a large number of celebrity-quote text corpora together with author information, type information, provenance information, country information, era information, English original sentence information and/or translation information, and the like.
Text corpus: a Chinese-language corpus recording the content of a celebrity quote.
Attribute information: the general term for information about a celebrity quote such as author information, type information, provenance information, country information, era information, English original sentence information, and/or translation information.
Text participle: a word obtained by removing stop words and symbols from a text corpus and then performing word segmentation; the text participles are the words that make up the text corpus.
Subject term: a word representing the topic of a text corpus, obtained by applying the LDA algorithm to the text participles.
Feature tag: a subject term that, after screening, can represent the topic or subject of a celebrity quote.
Stop words: in information retrieval, certain characters or words are automatically filtered out before or after natural language data or text is processed in order to save storage space and improve search efficiency; these are called stop words. Stop words are entered manually rather than generated automatically, and together they form a stop word list.
Symbol: full-width and half-width symbols, including punctuation marks.
In the present application, a knowledge graph construction method and apparatus, a computing device, and a computer-readable storage medium are provided. The following embodiments describe them in detail, taking celebrity quotes or Chinese and foreign verses as examples.
FIG. 1 shows a block diagram of a computing device 100, according to an embodiment of the present description. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 100 and other components not shown in FIG. 1 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps of the method shown in FIG. 2. FIG. 2 is a schematic flow chart of a tag-based celebrity-quote knowledge graph construction method according to an embodiment of the present application, including steps 202 to 206.
Step 202: acquire a text data set, wherein the text data set comprises a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus.
In an embodiment of the application, the system or terminal of the application collects text data for a large number of celebrity quotes from a network or a preset database, for example by means of a network tool or manual import. The text data comprises the text corpora of a large number of celebrity quotes and at least one piece of attribute information corresponding to each text corpus, where the attribute information includes the author, type, source work, country, era, English original sentence, or translation of the quote. For example, for the celebrity quote "Honesty is the way of heaven; to strive for honesty is the way of man.", the attribute information includes the author "Mengzi", the type "literary culture", the source work "middle-plains", the country "China", the era "Warring States", and the translation "Honesty is the law of heaven; achieving honesty is the law of humanity. A naturally honest person handles matters reasonably without forcing it and speaks properly without deliberation." As another example, for the celebrity quote "If you want others to be honest, you must first be honest yourself", the attribute information includes the author "Shakespeare", the type "novel", the country "UK", and the English original sentence "ifyouwangtthe good face of others, first of all to the one own integration".
By collecting a large amount of celebrity quote data, the application provides the basic raw data for constructing the celebrity-quote knowledge graph and ensures its comprehensiveness and accuracy.
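As a minimal sketch of the data structure that step 202 implies, each text corpus can be held together with its attribute information in one record. The names `QuoteRecord` and `build_text_data_set` and the attribute keys are hypothetical and do not appear in the application; the check that every text corpus has at least one attribute follows the wording of step 202.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QuoteRecord:
    """One text corpus (a quote) plus its attribute information."""
    text: str
    attributes: Dict[str, str] = field(default_factory=dict)


def build_text_data_set(rows: List[dict]) -> List[QuoteRecord]:
    """Turn raw rows into records, keeping only non-empty attributes."""
    records = []
    for row in rows:
        attrs = {k: v for k, v in row.items() if k != "text" and v}
        if not attrs:
            # Step 202 requires at least one piece of attribute information.
            raise ValueError("each text corpus needs at least one attribute")
        records.append(QuoteRecord(text=row["text"], attributes=attrs))
    return records


# Illustrative row; the empty "era" field is dropped.
rows = [
    {"text": "If you want others to be honest, you must first be honest yourself",
     "author": "Shakespeare", "type": "novel", "country": "UK", "era": ""},
]
data_set = build_text_data_set(rows)
```

Keeping the attributes in a dictionary lets records with different sets of attributes (for instance, a translation for one quote and an English original sentence for another) share one type.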
Step 204: determine at least one feature tag corresponding to each text corpus according to the text data set.
In an embodiment of the present application, as shown in FIG. 3, determining at least one feature tag corresponding to each text corpus according to the text data set includes steps 302 to 308.
Step 302: filter the stop words and symbols out of each text corpus and perform word segmentation to obtain at least one text participle corresponding to each text corpus, forming a word list.
In an embodiment of the present application, the system or terminal of the application performs natural language processing on the text corpus of each celebrity quote: it removes stop words and symbols and performs word segmentation with a Chinese word segmentation component, obtaining at least one text participle corresponding to the text corpus and forming a word list. For example, for the celebrity quote "A pile of sand is loose, but after being mixed with cement, pebbles, and water it is tougher than granite", natural language processing yields the word list {"sand", "loose", "cement", "pebble", "water", "mixed", "granite", "tough"}.
By performing natural language processing on the text corpus of each celebrity quote to obtain its text participles, the application organizes unstructured data into structured form, which makes processing and calculation in subsequent downstream tasks easier.
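The filtering and segmentation of step 302 can be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting of English text stands in for the Chinese word segmentation component, and the stop word list `STOP_WORDS` and the function `to_word_list` are hypothetical. The resulting list therefore differs slightly from the word list given in the example above.

```python
import re
from typing import List, Set

# Hypothetical, manually entered stop word list.
STOP_WORDS: Set[str] = {"a", "of", "is", "but", "than", "after", "being", "with", "and"}


def to_word_list(text: str, stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Strip symbols, split into tokens, and drop stop words."""
    # Replace every character that is neither a word character nor whitespace
    # (punctuation, full-width or half-width symbols) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Whitespace splitting is a stand-in for a real word segmentation component.
    return [tok for tok in cleaned.split() if tok not in stop_words]


quote = ("A pile of sand is loose, but is tougher than granite "
         "after being mixed with cement, pebble, and water.")
word_list = to_word_list(quote)
```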
Step 304: store the word list corresponding to each text corpus in a corpus, obtaining a corpus that contains the at least one text participle corresponding to each text corpus.
In an embodiment of the present application, after obtaining the word list corresponding to the text corpus of each celebrity quote, the system or terminal of the application stores each word list in a corpus, thereby constructing a corpus that contains the text participles of every quote. For example, when 100 celebrity quotes are collected, natural language processing of the 100 quotes yields 100 word lists, one per quote; if the 100 word lists contain 500 text participles in total, counting repeated text participles, a corpus containing those 500 text participles is constructed.
By constructing a corpus that contains a large number of text participles representing celebrity quotes, the application provides the basic data for the subsequent tag calculation.
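A small sketch of step 304, assuming the word lists are kept in a dictionary keyed by a quote identifier (the identifiers `q1` and `q2` and the function `build_corpus` are hypothetical). As in the 100-quote example, repeated text participles are counted in the corpus; the set of distinct participles is derived separately.

```python
from collections import Counter
from typing import Dict, List


def build_corpus(word_lists: Dict[str, List[str]]) -> List[str]:
    """Concatenate all word lists; repeated participles are kept."""
    corpus: List[str] = []
    for words in word_lists.values():
        corpus.extend(words)
    return corpus


word_lists = {
    "q1": ["sand", "loose", "cement", "pebble", "water", "mixed", "granite", "tough"],
    "q2": ["honest", "life", "honest"],
}
corpus = build_corpus(word_lists)
vocabulary = sorted(set(corpus))   # distinct (non-repeating) participles
frequencies = Counter(corpus)      # how often each participle occurs
```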
Step 306: determine at least one subject term corresponding to each text corpus from the corpus through a text clustering algorithm, forming a feature tag list corresponding to each text corpus.
In the embodiment of the application, the system or terminal of the application runs a text clustering algorithm iteratively, using topics and a threshold set from manual experience, to calculate the probability distribution with which each text participle of a celebrity quote, or each text participle in the corpus, appears in that quote. Text participles whose probability exceeds the threshold are retained as subject terms representing the quote, and words of little value are screened out. For example, words such as "honest", "ideal", and "reading" appear in many celebrity quotes and can take part in the calculation as subject terms; likewise, if a word such as "honest", "ideal", or "reading" appears several times in one quote, its probability in that quote is higher and it can serve as a subject term representing the quote, whereas an incidental word such as "sunlight" is screened out. The subject terms are then collected into a feature tag list corresponding to the text corpus of each quote. For example, the words "honesty", "hearts", and "life" can serve as the subject terms of the celebrity quote "A trustworthy person is the happiest, and honesty is the most natural", forming a feature tag list containing "honesty", "hearts", and "life".
Optionally, as shown in FIG. 4, the text clustering algorithm may be the LDA algorithm. After parameters such as the topics and the number of subject terms are set, the LDA algorithm first obtains the word list corresponding to the text corpus of a celebrity quote, then draws a topic from the topic distribution, draws a text participle from the word distribution corresponding to the drawn topic, and repeats this process until every text participle in the word list has been traversed.
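The draw-a-topic, then draw-a-participle loop described above can be sketched with the standard library alone. The distributions below are made-up illustrative values, and the names `generate_word_list`, `topic_dist`, and `word_dists` are hypothetical; the sketch shows only the generative loop, not how LDA fits the distributions to data.

```python
import random
from typing import Dict, List


def generate_word_list(topic_dist: Dict[str, float],
                       word_dists: Dict[str, Dict[str, float]],
                       length: int,
                       seed: int = 0) -> List[str]:
    """For each position: draw a topic, then draw a participle from that topic."""
    rng = random.Random(seed)
    topics = list(topic_dist)
    words: List[str] = []
    for _ in range(length):
        # Draw a topic according to the quote's topic distribution.
        topic = rng.choices(topics, weights=[topic_dist[t] for t in topics])[0]
        # Draw a participle according to that topic's word distribution.
        vocab = list(word_dists[topic])
        word = rng.choices(vocab, weights=[word_dists[topic][v] for v in vocab])[0]
        words.append(word)
    return words


# Illustrative distributions (assumed values, each summing to 1).
topic_dist = {"integrity": 0.75, "study": 0.25}
word_dists = {
    "integrity": {"honest": 0.75, "heart": 0.25},
    "study": {"reading": 0.5, "ideal": 0.5},
}
sample = generate_word_list(topic_dist, word_dists, length=6, seed=1)
```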
Specifically, the word list corresponding to each celebrity quote $d$ in the corpus $D$ is $\langle w_1, w_2, \ldots, w_n \rangle$, where $w_i$ denotes the $i$-th text participle and the quote $d$ has $n$ text participles $w$. All non-repeating text participles in the corpus $D$ form a set VOC containing $m$ text participles $v$; $n$ and $m$ are positive integers greater than 1.
The text participles $w$ (which may repeat) in the word list of a quote $d$ correspond to a multinomial distribution over the $k$ topics, denoted $\theta$. That is, for the quote $d$, $\theta_d = \langle p_{t_1}, \ldots, p_{t_k} \rangle$, where $p_{t_i}$ is the probability that $d$ corresponds to the $i$-th of the $k$ topics, computed as $p_{t_i} = n_{t_i} / n$. Here $n_{t_i}$ is the number of text participles $w$ of $d$ that correspond to the $i$-th topic, $n$ is the total number of text participles $w$ in $d$, and $k$ is a positive integer greater than 1.
Each topic $t$ corresponds to a label set, and each topic $t$ corresponds to a multinomial distribution over the $m$ non-repeating text participles $v$ in the corpus, denoted $\varphi$. That is, for the topic $t$, $\varphi_t = \langle p_{w_1}, \ldots, p_{w_m} \rangle$, where $p_{w_i}$ is the probability that topic $t$ generates the $i$-th of the $m$ text participles $v$ in the corpus, computed as $p_{w_i} = N_{w_i} / N$. Here $N_{w_i}$ is the number of occurrences of the $i$-th text participle $v$ in the corpus that correspond to topic $t$, and $N$ is the total number of text participles $v$ corresponding to topic $t$.
The core formula of the LDA algorithm is as follows:
$p(w \mid d) = p(w \mid t) \cdot p(t \mid d)$
The formula takes the topic $t$ as an intermediate layer: the current topic distribution $\theta_d$ and word distribution $\varphi_t$ give the probability that the text participle $w$ appears in the quote $d$. Here $p(t \mid d)$ is computed from $\theta_d$ and is the probability that the quote $d$ corresponds to each topic $t$, and $p(w \mid t)$ is computed from $\varphi_t$ and is the probability that topic $t$ generates the text participle $w$. Note that the text participle $w$ here is not limited to the text participles of the current quote $d$; it may also be any text participle $v$ in the corpus $D$.
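The count-based definitions above can be worked through on a toy example. The sketch below assumes that every text participle already carries a topic assignment (the `assignments` table is made up for illustration); real LDA estimates such assignments iteratively, which is not shown. It computes the topic distribution of a quote (n_ti / n), the word distribution of a topic (N_wi / N), and then the probability of a participle in a quote. As an assumption of this sketch, the product p(w|t) * p(t|d) is summed over all topics, which is the usual way to combine the per-topic terms.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical topic assignments: (participle, topic) pairs for each quote.
assignments: Dict[str, List[Tuple[str, str]]] = {
    "d1": [("honest", "t1"), ("honest", "t1"), ("life", "t2"), ("heart", "t1")],
    "d2": [("reading", "t2"), ("life", "t2"), ("honest", "t1"), ("ideal", "t2")],
}


def theta(doc: str) -> Dict[str, float]:
    """Topic distribution of one quote: p_ti = n_ti / n."""
    pairs = assignments[doc]
    counts = Counter(topic for _, topic in pairs)
    return {topic: c / len(pairs) for topic, c in counts.items()}


def phi(topic: str) -> Dict[str, float]:
    """Word distribution of one topic over the whole corpus: p_wi = N_wi / N."""
    words = [w for pairs in assignments.values() for w, t in pairs if t == topic]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def p_word_given_doc(word: str, doc: str) -> float:
    """Sum of p(w|t) * p(t|d) over the topics of the quote."""
    return sum(phi(t).get(word, 0.0) * p_t for t, p_t in theta(doc).items())


def subject_terms(doc: str, threshold: float) -> List[str]:
    """Keep participles whose probability in the quote exceeds the threshold."""
    vocab = sorted({w for pairs in assignments.values() for w, _ in pairs})
    return [w for w in vocab if p_word_given_doc(w, doc) > threshold]


terms_d1 = subject_terms("d1", threshold=0.1)
```

Because the probability is computed over the whole vocabulary, a participle that does not occur in a quote (such as "reading" for `d1`) still receives a small non-zero probability, which matches the remark that $w$ may be any text participle in the corpus.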
By using a text clustering algorithm to assist manual work, the application quickly calculates the subject terms that can represent each celebrity quote and uses them as the tags of that quote, thereby identifying latent topic information in a large document set or corpus and providing the core subject terms of each quote.
Step 308: screen the at least one subject term in the feature tag list corresponding to each text corpus, and take the screened subject terms as the at least one feature tag corresponding to the text corpus.
In the embodiment of the application, after the feature tag list calculated by the text clustering algorithm for the text corpus of each celebrity quote is obtained, the subject terms in each feature tag list are further screened in view of the errors and inaccuracies of the algorithm. The screening may be performed manually, so as to determine the subject terms that finally represent each quote.
In the above embodiment, as shown in fig. 5, after the at least one filtered subject term is used as the at least one feature tag corresponding to the text corpus, steps 502 to 504 are further included.
Step 502: acquire at least one subject term of each text corpus from the at least one piece of attribute information corresponding to that text corpus.
In the embodiment of the present application, the attribute information corresponding to the text corpus of a celebrity quote may also contain key information capable of representing the quote. For example, if the author of the quote is a highly reputed writer or a famous person, the author's name, such as "rushing" or "bang", may be used as a subject term of the text corpus of the quote; such terms may be introduced manually.
Step 504: add the subject term obtained from the attribute information corresponding to the text corpus, as a feature tag of the text corpus, to the feature tag list corresponding to the text corpus.
In the embodiment of the application, key information such as the author name, type, source work, country, or era is introduced and added to the feature tag list corresponding to the text corpus of a celebrity quote, which enriches the feature tags of the quote and facilitates the construction and retrieval of the celebrity-quote knowledge graph.
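Steps 502 and 504 can be sketched as a function that appends selected attribute values to an existing feature tag list. The function name `supplement_tags`, the attribute keys, and the sample values are hypothetical; which attributes are promoted to tags is left to the caller here, since the application describes that choice as one that may be made manually.

```python
from typing import Dict, Iterable, List


def supplement_tags(tags: List[str],
                    attributes: Dict[str, str],
                    keys: Iterable[str]) -> List[str]:
    """Return a new tag list with the chosen attribute values appended."""
    result = list(tags)
    for key in keys:
        value = attributes.get(key)
        # Skip missing attributes and values already present as tags.
        if value and value not in result:
            result.append(value)
    return result


tags = ["honesty", "life"]
attributes = {"author": "Shakespeare", "type": "novel", "country": "UK"}
enriched = supplement_tags(tags, attributes, keys=("author", "country", "era"))
```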
Step 206: Construct a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus.
In an embodiment of the present application, as shown in Fig. 6, constructing the knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus includes steps 602 to 606.
Step 602: Take each text corpus and the at least one feature tag corresponding to each text corpus as entities, and construct association relationships between the entities.
In an embodiment of the present application, the system or terminal of the present application takes the text corpus of each famous quote and the at least one feature tag corresponding to it as entities, and establishes association relationships between each quote entity and its tag entities. For example, the famous quote "Those who keep their word are the happiest; honesty is the most natural." and its feature tags "integrity", "honesty" and "Lu Xun" are each taken as entities, and association relationships are established between them.
Step 604: Take the at least one piece of attribute information corresponding to each text corpus as attributes of the corresponding entity.
In an embodiment of the present application, the system or terminal of the present application takes the at least one piece of attribute information corresponding to the text corpus of each famous quote, such as author name, type, source work, country, or era, as attributes of the corresponding entity. It should be noted that if a piece of key information in the attribute information, such as "Lu Xun", also appears among the feature tags as an entity, this does not affect "Lu Xun" serving as the value of the attribute "author". The feature tag merely facilitates question-answer matching during retrieval; that is, when recommending a famous quote, the famous-quote knowledge graph of the present application takes the quote as the entity and displays all attributes of that entity.
Step 606: Construct the knowledge graph corresponding to the text data set according to the plurality of entities, the association relationships between the entities, and the attributes of each entity.
In an embodiment of the present application, the system or terminal of the present application constructs a data layer and a schema layer according to the plurality of entities, the association relationships between the entities, and the attributes of each entity, thereby forming the famous-quote knowledge graph. The data layer stores the actual data, for example "Bill Gates - wife - Melinda Gates" and "Bill Gates - president - Microsoft". The schema layer stores the refined knowledge and is managed by an ontology base, covering the "entity - relationship - entity" and "entity - attribute - value" forms.
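Steps 602 to 606 can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the `Entity` class, the `build_graph` function, the relation name `has_tag`, and the English rendering of the quote are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in the graph: either a quote or a feature tag."""
    name: str
    kind: str  # "quote" or "tag" (hypothetical labels)
    attributes: dict = field(default_factory=dict)


def build_graph(quote_text, feature_tags, attribute_info):
    """Return (entities, triples) for one text corpus.

    Step 602: the quote and each feature tag become entities, linked by a relation.
    Step 604: attribute information is attached to the quote entity.
    Step 606: everything is expressed as triples.
    """
    quote = Entity(quote_text, "quote", dict(attribute_info))
    tags = [Entity(tag, "tag") for tag in feature_tags]

    triples = []
    # entity - relationship - entity
    for tag in tags:
        triples.append((quote.name, "has_tag", tag.name))
    # entity - attribute - value
    for key, value in quote.attributes.items():
        triples.append((quote.name, key, value))

    return [quote] + tags, triples


QUOTE = "Those who keep their word are the happiest; honesty is the most natural."
entities, triples = build_graph(
    QUOTE,
    ["integrity", "honesty", "Lu Xun"],
    {"author": "Lu Xun", "country": "China"},
)
```

Here "Lu Xun" appears both as a tag entity and as the value of the "author" attribute, and the two uses do not interfere with each other, which mirrors the note under step 604.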
The present application provides a knowledge graph construction method in which the most representative words in a text corpus are obtained through natural language processing and used as feature tags, and the knowledge graph corresponding to the text data set is constructed on the basis of the text corpora, the attribute information corresponding to each text corpus, and the feature tags corresponding to each text corpus. This expands the semantic description attached to each famous quote, increases the association relationships among the text corpora, and organizes the text corpora in a more suitable graph data structure, which supports users in obtaining information from multiple angles and better meets users' intelligent retrieval needs and usage needs for the text corpora.
In an embodiment of the present application, as shown in Fig. 7, after the knowledge graph corresponding to the text data set is constructed according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus, the method further includes steps 702 to 704.
Step 702: Store the constructed knowledge graph corresponding to the text data set in a graph database.
Optionally, the graph database may be a Neo4j database. The Neo4j graph database models data using the "graph" concept from data structures. Its two most basic concepts are nodes and edges: nodes represent entities, and edges represent relationships between entities. Both nodes and edges can have their own attributes, and different entities are associated through various relationships to form a complex object graph. The Neo4j graph database also provides search and traversal functions over the object graph.
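As a rough sketch of how step 702 might map onto Neo4j's node-and-edge model, the following Python function only builds parameterized Cypher statements and does not connect to a database. The node labels `Quote` and `Tag`, the relationship type `HAS_TAG`, and the function name `to_cypher` are hypothetical and do not come from the present application.

```python
def to_cypher(quote_text, feature_tags, attribute_info):
    """Build (statement, parameters) pairs for one quote and its feature tags."""
    statements = [(
        # Node for the quote; its attribute information becomes node properties.
        "MERGE (q:Quote {text: $text}) SET q += $props",
        {"text": quote_text, "props": dict(attribute_info)},
    )]
    for tag in feature_tags:
        statements.append((
            # One node per feature tag, plus an edge from the quote to the tag.
            "MERGE (q:Quote {text: $text}) "
            "MERGE (t:Tag {name: $name}) "
            "MERGE (q)-[:HAS_TAG]->(t)",
            {"text": quote_text, "name": tag},
        ))
    return statements


statements = to_cypher(
    "Those who keep their word are the happiest; honesty is the most natural.",
    ["integrity", "honesty", "Lu Xun"],
    {"author": "Lu Xun", "country": "China"},
)
```

With the official Neo4j Python driver, each pair could presumably be passed to `session.run(statement, parameters)`; that call is not exercised in this sketch. `MERGE` is used rather than `CREATE` so that re-running the statements does not duplicate nodes or edges.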
Step 704: When an attribute of a target entity in the knowledge graph corresponding to the text data set is the same as an attribute of a target knowledge graph, and the target entity has an inclusion relationship with any entity in the target knowledge graph, use the target entity to dock the knowledge graph corresponding to the text data set with the target knowledge graph.
In an embodiment of the present application, suppose that the "type" of an entity a in the famous-quote knowledge graph of the present application is the same as the "type" of the target knowledge graph, for example the type of entity a is "poetry" and the target knowledge graph is a knowledge graph constructed for "poetry", and that entity a is contained in some entity b of the target knowledge graph, for example the famous quote "Those who keep their word are the happiest; honesty is the most natural." is contained in a novel or a line of poetry stored as an entity in the target knowledge graph. In that case, the famous-quote knowledge graph can be docked with the target knowledge graph.
By exploiting the common attribute information of entities in the knowledge graphs and the inclusion relationships between entities, the present application docks related graphs with one another, which makes it convenient to build a larger-scale knowledge graph through knowledge fusion.
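The two conditions of step 704 can be sketched as a simple predicate. This is only an illustration: it assumes that the shared attribute is a "type" field and that the inclusion relationship can be approximated by substring containment, neither of which is specified in that detail by the present application. The function name `can_dock` and the sample texts are made up.

```python
def can_dock(entity, target_graph_type, target_entity_texts):
    """Return True if the entity satisfies both docking conditions of step 704."""
    # Condition 1: the entity's type matches the type of the target graph.
    same_type = entity["attributes"].get("type") == target_graph_type
    # Condition 2: the entity is contained in some entity of the target graph
    # (approximated here by substring containment).
    contained = any(entity["text"] in text for text in target_entity_texts)
    return same_type and contained


quote_entity = {
    "text": "honesty is the most natural",
    "attributes": {"type": "poetry"},
}
# Made-up entity text standing in for a poem stored in the target graph.
poetry_graph_entities = ["In this verse, honesty is the most natural thing of all."]

dockable = can_dock(quote_entity, "poetry", poetry_graph_entities)
```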
Fig. 8 illustrates a tag-based famous-quote knowledge graph construction method according to an embodiment of the present specification. The method is described taking one famous quote as an example and includes steps 802 to 818.
Step 802: Acquire the famous quote "Those who keep their word are the happiest; honesty is the most natural." together with its attribute information, such as the author "Lu Xun", the type "dialect", the country "China", and the era "near modern".
Step 804: Remove stop words and punctuation marks from the famous quote "Those who keep their word are the happiest; honesty is the most natural." and segment it using a Chinese word segmentation component, obtaining the segmented words "integrity", "happiness", "honesty" and "naturalness" corresponding to the quote and forming a word list. The Chinese word segmentation component may be Jieba. The Jieba segmentation algorithm performs efficient word-graph scanning based on a prefix dictionary and generates a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence; it uses dynamic programming to find the maximum-probability path, and thus the best segmentation, based on word frequency; and for out-of-vocabulary words it uses an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm.
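The DAG and dynamic-programming stage described in step 804 can be illustrated with a small self-contained sketch. This is not Jieba's code and it omits the HMM/Viterbi handling of out-of-vocabulary words. The toy dictionary `FREQ`, its frequencies, and the sample sentence are invented for illustration and do not come from the present application.

```python
import math

# Toy word-frequency dictionary (made-up counts).
FREQ = {"南京": 50, "南京市": 30, "市": 10, "市长": 40,
        "长江": 60, "长江大桥": 20, "江": 5, "大桥": 50}


def build_dag(sentence, freq):
    """Map each start index to the end indices (inclusive) of dictionary words."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # fall back to the single character
    return dag


def segment(sentence, freq):
    """Pick the maximum-probability path through the DAG by dynamic programming."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    # Walk the best path from left to right to recover the words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


words = segment("南京市长江大桥", FREQ)
```

The probabilities are scanned from the end of the sentence backwards, so each position needs only the already-computed best score of the remaining suffix.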
Step 806: Store the word list corresponding to the famous quote "Those who keep their word are the happiest; honesty is the most natural." in the corpus, obtaining a corpus that contains the segmented words "integrity", "happiness", "honesty" and "naturalness".
Step 808: Based on the corpus, use the LDA algorithm to determine words in the corpus such as "honesty", "mindset" and "life" as the subject terms corresponding to the famous quote "Those who keep their word are the happiest; honesty is the most natural.", forming the feature tag list corresponding to the quote.
In the above embodiment, the LDA algorithm calculates, for each word in the corpus, the probability that the word is a subject term of the famous quote "Those who keep their word are the happiest; honesty is the most natural.", and selects the words whose probability is greater than a probability threshold, such as "honesty", "mindset" and "life", as the subject terms corresponding to the quote.
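The threshold-based selection described above can be sketched as follows. The sketch does not implement LDA itself; it assumes the per-word probabilities have already been produced by some LDA model. The function name `select_subject_terms`, the probability values, and the threshold are placeholders chosen for illustration, not values from the present application.

```python
def select_subject_terms(word_probabilities, threshold):
    """Keep words whose probability exceeds the threshold, highest first."""
    selected = [(word, p) for word, p in word_probabilities.items() if p > threshold]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in selected]


# Placeholder probabilities (illustrative only).
probabilities = {
    "honesty": 0.31,
    "mindset": 0.24,
    "life": 0.22,
    "happiness": 0.13,
    "naturalness": 0.10,
}
subject_terms = select_subject_terms(probabilities, threshold=0.2)
```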
Step 810: Manually review the subject terms "integrity", "honesty", "mindset" and "life", and determine the two segmented words "integrity" and "honesty" as feature tags of the famous quote "Those who keep their word are the happiest; honesty is the most natural.".
Step 812: Manually add two further special feature tags, "Lu Xun" and "dialect", taken from the author "Lu Xun" and the type "dialect" in the attribute information, to the feature tag list corresponding to the famous quote "Those who keep their word are the happiest; honesty is the most natural.".
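Steps 810 and 812 are manual in this embodiment, but their effect on the feature tag list can be sketched in a few lines. The function name `finalize_tags` and its parameters are hypothetical; the `approved` set stands in for the outcome of the manual review.

```python
def finalize_tags(subject_terms, approved, attribute_info, promoted_keys):
    """Apply the review result, then append tags taken from attribute values."""
    # Step 810: keep only the subject terms that passed the manual review.
    tags = [term for term in subject_terms if term in approved]
    # Step 812: promote selected attribute values to feature tags.
    for key in promoted_keys:
        value = attribute_info.get(key)
        if value and value not in tags:
            tags.append(value)
    return tags


attribute_info = {
    "author": "Lu Xun",
    "type": "dialect",
    "country": "China",
    "era": "near modern",
}
feature_tags = finalize_tags(
    ["integrity", "honesty", "mindset", "life"],
    approved={"integrity", "honesty"},
    attribute_info=attribute_info,
    promoted_keys=("author", "type"),
)
```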
Step 814: Take the famous quote "Those who keep their word are the happiest; honesty is the most natural." and the feature tags "integrity", "honesty", "Lu Xun" and "dialect" in its feature tag list as entities, and construct the association relationships between the entities.
Step 816: Take the attribute information, such as the author "Lu Xun", the type "dialect", the country "China", and the era "near modern", as the attributes of the entity of the famous quote "Those who keep their word are the happiest; honesty is the most natural.".
Step 818: Add the entities, relationships, and attributes corresponding to the famous quote "Those who keep their word are the happiest; honesty is the most natural." to the knowledge graph corresponding to the text data set in the form of triples.
The famous-quote knowledge graph constructed by this method can meet users' retrieval needs. For example, when a user asks for "Lu Xun's famous quotes about honesty", the knowledge graph can take "Lu Xun" and "honesty" from the question as keywords, search and match them against the feature tags, and return the famous quote whose feature tags contain "Lu Xun" and "honesty", namely "Those who keep their word are the happiest; honesty is the most natural.", thereby achieving the effect of obtaining famous quotes from multiple angles.
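The tag-matching retrieval in this example can be sketched as a set-containment check. The sketch assumes the keywords have already been extracted from the user's question; the function name `retrieve` and the second record are hypothetical and serve only to show a non-matching entry.

```python
def retrieve(quotes, keywords):
    """Return the quotes whose feature tags contain every keyword."""
    wanted = set(keywords)
    return [q["text"] for q in quotes if wanted <= set(q["tags"])]


QUOTE = "Those who keep their word are the happiest; honesty is the most natural."
QUOTES = [
    {"text": QUOTE, "tags": ["integrity", "honesty", "Lu Xun", "dialect"]},
    # Hypothetical second record, present only to show a non-match.
    {"text": "(another quote)", "tags": ["life", "Lu Xun"]},
]

matches = retrieve(QUOTES, ["Lu Xun", "honesty"])
```

Because the author name is stored as a feature tag as well as an attribute, a question that mentions both a topic and an author can be answered with a single tag match.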
Corresponding to the above method embodiment, the present specification further provides an embodiment of a tag-based famous-quote knowledge graph construction apparatus. Fig. 9 shows a schematic structural diagram of the tag-based famous-quote knowledge graph construction apparatus according to an embodiment of the present specification. As shown in Fig. 9, the apparatus includes:
a data acquisition module 901, configured to acquire a text data set, wherein the text data set includes a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus;
a tag acquisition module 902, configured to determine, according to the text data set, at least one feature tag corresponding to each text corpus;
a graph construction module 903, configured to construct a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus.
Optionally, the tag acquisition module 902 includes:
a word list unit, configured to filter stop words and symbols from each text corpus and perform word segmentation, to obtain at least one segmented word corresponding to each text corpus and form a word list;
a tag list unit, configured to obtain a feature tag list corresponding to each text corpus according to the word list corresponding to each text corpus;
and a tag screening unit, configured to screen the at least one subject term in the feature tag list corresponding to each text corpus, and take the screened at least one subject term as the at least one feature tag corresponding to the text corpus.
Optionally, the tag list unit includes:
a corpus construction unit, configured to store the word list corresponding to each text corpus into a corpus, to obtain a corpus including the at least one segmented word corresponding to each text corpus;
and a tag calculation unit, configured to determine, from the corpus through a text clustering algorithm, at least one subject term corresponding to each text corpus, to form the feature tag list corresponding to each text corpus.
Optionally, the apparatus further includes:
an attribute filtering module, configured to acquire at least one subject term corresponding to each text corpus from the at least one piece of attribute information corresponding to each text corpus;
and a tag supplement module, configured to add the subject term acquired from the at least one piece of attribute information corresponding to the text corpus, as a feature tag corresponding to the text corpus, to the feature tag list corresponding to the text corpus.
Optionally, the graph construction module 903 includes:
a relation construction unit, configured to take each text corpus and the at least one feature tag corresponding to each text corpus as entities, and construct association relationships between the entities;
an attribute configuration unit, configured to take the at least one piece of attribute information corresponding to each text corpus as attributes of the corresponding entity;
and a graph weaving unit, configured to construct the knowledge graph corresponding to the text data set according to the plurality of entities, the association relationships between the entities, and the attributes of each entity.
Optionally, the apparatus further includes:
a graph storage module, configured to store the constructed knowledge graph corresponding to the text data set in a graph database.
Optionally, the apparatus further includes:
a graph docking module, configured to, when an attribute of a target entity in the knowledge graph corresponding to the text data set is the same as an attribute of a target knowledge graph and the target entity has an inclusion relationship with any entity in the target knowledge graph, use the target entity to dock the knowledge graph corresponding to the text data set with the target knowledge graph.
The present application provides a knowledge graph construction apparatus in which the most representative words in a text corpus are obtained through natural language processing and used as feature tags, and the knowledge graph corresponding to the text data set is constructed on the basis of the text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus. This expands the semantic description attached to each famous quote, increases the association relationships among the text corpora, and organizes the text corpora in a more suitable graph data structure, which supports users in obtaining information from multiple angles and better meets users' intelligent retrieval needs and usage needs for the text corpora.
An embodiment of the present application further provides a computing device, including a memory, a processor, and computer instructions stored on the memory and executable on the processor, where the processor executes the instructions to implement the following steps:
acquiring a text data set, wherein the text data set comprises a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus;
determining at least one feature tag corresponding to each text corpus according to the text data set;
and constructing a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the tag-based famous-quote knowledge graph construction method described above.
The above is an illustrative scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the computer-readable storage medium and the technical solution of the tag-based famous-quote knowledge graph construction method belong to the same concept; for details of the technical solution of the computer-readable storage medium that are not described in detail, reference may be made to the description of the technical solution of the tag-based famous-quote knowledge graph construction method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (16)

1. A knowledge graph construction method, characterized by comprising:
acquiring a text data set, wherein the text data set comprises a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus;
determining at least one feature tag corresponding to each text corpus according to the text data set;
and constructing a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus.
2. The method according to claim 1, wherein determining at least one feature tag corresponding to each text corpus according to the text data set comprises:
filtering stop words and symbols from each text corpus and performing word segmentation, to obtain at least one segmented word corresponding to each text corpus and form a word list;
obtaining a feature tag list corresponding to each text corpus according to the word list corresponding to each text corpus;
and screening at least one subject term in the feature tag list corresponding to each text corpus, and taking the screened at least one subject term as the at least one feature tag corresponding to the text corpus.
3. The method according to claim 2, wherein obtaining the feature tag list corresponding to each text corpus according to the word list corresponding to each text corpus comprises:
storing the word list corresponding to each text corpus into a corpus, to obtain a corpus comprising the at least one segmented word corresponding to each text corpus;
and determining, from the corpus through a text clustering algorithm, at least one subject term corresponding to each text corpus, to form the feature tag list corresponding to each text corpus.
4. The method according to claim 2, wherein after taking the screened at least one subject term as the at least one feature tag corresponding to the text corpus, the method further comprises:
acquiring at least one subject term corresponding to each text corpus from the at least one piece of attribute information corresponding to each text corpus;
and adding the subject term acquired from the at least one piece of attribute information corresponding to the text corpus, as a feature tag corresponding to the text corpus, to the feature tag list corresponding to the text corpus.
5. The method according to claim 1 or 4, wherein constructing the knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus comprises:
taking each text corpus and the at least one feature tag corresponding to each text corpus as entities, and constructing association relationships between the entities;
taking the at least one piece of attribute information corresponding to each text corpus as attributes of the corresponding entity;
and constructing the knowledge graph corresponding to the text data set according to the plurality of entities, the association relationships between the entities, and the attributes of each entity.
6. The method according to claim 1, wherein after constructing the knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus, the method further comprises:
storing the constructed knowledge graph corresponding to the text data set in a graph database.
7. The method according to claim 6, wherein after storing the constructed knowledge graph corresponding to the text data set in the graph database, the method further comprises:
when an attribute of a target entity in the knowledge graph corresponding to the text data set is the same as an attribute of a target knowledge graph, and the target entity has an inclusion relationship with any entity in the target knowledge graph, using the target entity to dock the knowledge graph corresponding to the text data set with the target knowledge graph.
8. A knowledge graph construction apparatus, comprising:
a data acquisition module, configured to acquire a text data set, wherein the text data set comprises a plurality of text corpora and at least one piece of attribute information corresponding to each text corpus;
a tag acquisition module, configured to determine at least one feature tag corresponding to each text corpus according to the text data set;
a graph construction module, configured to construct a knowledge graph corresponding to the text data set according to the plurality of text corpora, the at least one piece of attribute information corresponding to each text corpus, and the at least one feature tag corresponding to each text corpus.
9. The apparatus of claim 8, wherein the tag acquisition module comprises:
a word list unit, configured to filter stop words and symbols from each text corpus and perform word segmentation, to obtain at least one segmented word corresponding to each text corpus and form a word list;
a tag list unit, configured to obtain a feature tag list corresponding to each text corpus according to the word list corresponding to each text corpus;
and a tag screening unit, configured to screen the at least one subject term in the feature tag list corresponding to each text corpus, and take the screened at least one subject term as the at least one feature tag corresponding to the text corpus.
10. The apparatus of claim 9, wherein the tag list unit comprises:
a corpus construction subunit, configured to store the word list corresponding to each text corpus into a corpus, to obtain a corpus comprising the at least one segmented word corresponding to each text corpus;
and a tag calculation subunit, configured to determine, from the corpus through a text clustering algorithm, at least one subject term corresponding to each text corpus, to form the feature tag list corresponding to each text corpus.
11. The apparatus of claim 9, further comprising:
an attribute filtering module, configured to acquire at least one subject term corresponding to each text corpus from the at least one piece of attribute information corresponding to each text corpus;
and a tag supplement module, configured to add the subject term acquired from the at least one piece of attribute information corresponding to the text corpus, as a feature tag corresponding to the text corpus, to the feature tag list corresponding to the text corpus.
12. The apparatus of claim 8 or 11, wherein the graph construction module comprises:
a relation construction unit, configured to take each text corpus and the at least one feature tag corresponding to each text corpus as entities, and construct association relationships between the entities;
an attribute configuration unit, configured to take the at least one piece of attribute information corresponding to each text corpus as attributes of the corresponding entity;
and a graph weaving unit, configured to construct the knowledge graph corresponding to the text data set according to the plurality of entities, the association relationships between the entities, and the attributes of each entity.
13. The apparatus of claim 8, further comprising:
a graph storage module, configured to store the constructed knowledge graph corresponding to the text data set in a graph database.
14. The apparatus of claim 13, further comprising:
a graph docking module, configured to, when an attribute of a target entity in the knowledge graph corresponding to the text data set is the same as an attribute of a target knowledge graph and the target entity has an inclusion relationship with any entity in the target knowledge graph, use the target entity to dock the knowledge graph corresponding to the text data set with the target knowledge graph.
15. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-7 when executing the instructions.
16. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN202010358878.8A 2020-04-29 2020-04-29 Knowledge graph construction method and device Pending CN113569051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010358878.8A CN113569051A (en) 2020-04-29 2020-04-29 Knowledge graph construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010358878.8A CN113569051A (en) 2020-04-29 2020-04-29 Knowledge graph construction method and device

Publications (1)

Publication Number Publication Date
CN113569051A true CN113569051A (en) 2021-10-29

Family

ID=78158897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010358878.8A Pending CN113569051A (en) 2020-04-29 2020-04-29 Knowledge graph construction method and device

Country Status (1)

Country Link
CN (1) CN113569051A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880455A (en) * 2022-07-12 2022-08-09 科大讯飞股份有限公司 Triple extraction method, device, equipment and storage medium
CN116821377A (en) * 2023-08-31 2023-09-29 南京云创大数据科技股份有限公司 Primary school Chinese automatic evaluation system based on knowledge graph and large model

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633044A (en) * 2017-09-14 2018-01-26 国家计算机网络与信息安全管理中心 A kind of public sentiment knowledge mapping construction method based on focus incident
WO2018072563A1 (en) * 2016-10-18 2018-04-26 中兴通讯股份有限公司 Knowledge graph creation method, device, and system
CN108694177A (en) * 2017-04-06 2018-10-23 北大方正集团有限公司 Knowledge mapping construction method and system
CN109189942A (en) * 2018-09-12 2019-01-11 山东大学 A kind of construction method and device of patent data knowledge mapping
CN109446343A (en) * 2018-11-05 2019-03-08 上海德拓信息技术股份有限公司 A kind of method of public safety knowledge mapping building
CN109684483A (en) * 2018-12-11 2019-04-26 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of knowledge mapping
CN109977233A (en) * 2019-03-15 2019-07-05 北京金山数字娱乐科技有限公司 A kind of idiom knowledge map construction method and device
CN110119473A (en) * 2019-05-23 2019-08-13 北京金山数字娱乐科技有限公司 A kind of construction method and device of file destination knowledge mapping
CN110442733A (en) * 2019-08-08 2019-11-12 恒生电子股份有限公司 A kind of subject generating method, device and equipment and medium
CN110457487A (en) * 2019-07-10 2019-11-15 北京邮电大学 The construction method and device of patent knowledge map
CN110543574A (en) * 2019-08-30 2019-12-06 北京百度网讯科技有限公司 knowledge graph construction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination