CN110543574B - Knowledge graph construction method, device, equipment and medium

Info

Publication number: CN110543574B
Application number: CN201910817819.XA
Authority: CN
Inventors: 方舟; 冯知凡; 汪琦; 秦华鹏; 张扬; 陆超
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2022-05-17
Anticipated expiration: 2039-08-30
Also published as: CN110543574A

Abstract

The embodiments of the application disclose a method, a device, equipment and a medium for constructing a knowledge graph. The method comprises: acquiring a text corpus of at least one information object; extracting phrases from the text corpus; identifying, from the extracted phrases, content phrases of topics of interest, object entities, object sides and action events, and identifying the association relations among the phrases; and updating the content phrases into point elements of the knowledge graph and the association relations into edge elements of the knowledge graph. According to the technical scheme of the embodiments, the knowledge graph of the information object is constructed from the text corpus of the information object; point elements such as topics of interest, action events and object sides are added together with the corresponding edge elements; the information on existing object entities in the knowledge graph is expanded; and new object entities are continuously mined from information objects to expand and supplement the knowledge graph.

Description

Knowledge graph construction method, device, equipment and medium

Technical Field

The embodiments of the application relate to data processing technology, in particular to natural language processing technology, and specifically to a method, a device, equipment and a medium for constructing a knowledge graph.

Background

In existing natural language processing (NLP) technology, knowledge graph databases have gradually been built to facilitate the recognition of semantic knowledge. A knowledge graph includes point elements and edge elements: entities are recorded by the point elements, and relationships between entities are recorded by the edge elements. The entities are generally specific things in the real world.

However, when the knowledge graph is used to recognize information other than text, such as pictures, audio and video, it is difficult to meet the recognition requirements, because the entities that such information covers are complex.

Disclosure of Invention

The embodiment of the application provides a method, a device, equipment and a medium for constructing a knowledge graph, so as to construct a more complete graph capable of reflecting complex knowledge data cognition.

In a first aspect, an embodiment of the present application provides a method for constructing a knowledge graph, including:

acquiring a text corpus of at least one information object;

extracting phrases from the text corpus, each phrase comprising at least one word;

identifying content phrases of topics of interest, object entities, object sides and action events from the extracted phrases, and identifying the association relations among the phrases according to the word structure in the phrases and the sentence structure in the text corpus where the phrases are located;

and updating the content phrases into point elements of the knowledge graph, and updating the association relation into edge elements of the knowledge graph.

One embodiment in the above application has the following advantages or benefits: the knowledge graph of the information object is constructed by extracting from the text corpus of the information object; while object entities are determined, point elements such as topics of interest, action events and object sides are added together with the corresponding edge elements, which enriches the constituent elements of the knowledge graph, expands the information on existing object entities in the knowledge graph, and allows new object entities to be continuously mined from information objects to expand and supplement the knowledge graph.

Extracting phrases from the text corpus comprises: extracting minimum unit phrases and compound phrases from the text corpus, wherein a minimum unit phrase comprises one word and a compound phrase comprises at least two words.

One embodiment in the above application has the following advantages or benefits: the phrases extracted from the text corpus include not only minimum unit phrases of a single word but also compound phrases of several words, so the actual content of the text corpus is reflected more accurately.

After extracting the minimum unit phrases and the compound phrases from the text corpus, the method further comprises: adjusting the words in a compound phrase according to the text corpus, so as to expand and generate other compound phrases.

One embodiment in the above application has the following advantages or benefits: the acquired compound phrases are expanded to generate other compound phrases, which enlarges the sources of constituent elements for the knowledge graph.

Extracting the minimum unit phrase from the text corpus comprises: performing word segmentation processing on the text corpus to form a word sequence; and for each word, if the word collocation structure of the word in each word sequence is stable, determining the word as the extracted minimum unit phrase.

One embodiment in the above application has the following advantages or benefits: through word segmentation processing and word sequence screening, accurate minimum unit phrases are obtained from the text corpus, and accurate alternative phrases are provided for point elements in the knowledge graph.

Before determining, for each word, that the word is an extracted minimum unit phrase if its collocation structure is stable in each word sequence, the method further comprises at least one of the following preprocessing operations: removing stop words from the word sequence; filtering existing phrases out of the word sequence according to the phrases recorded by existing point elements in the knowledge graph; and sorting the words in the word sequence by their occurrence frequency in the text corpus, so that the stability of their collocation structures is identified in that order.

One embodiment in the above application has the following advantages or benefits: removing stop words, filtering existing phrases and sorting by frequency screen out the useful words in the word sequence, which greatly narrows the candidate range for minimum unit phrases and improves the accuracy of obtaining them.

Extracting compound phrases from the text corpus comprises: performing part-of-speech tagging on the words in the text corpus; and extracting, according to a preset part-of-speech collocation template, at least two words in the text corpus that conform to the part-of-speech collocation structure as a compound phrase; wherein the preset part-of-speech collocation template comprises at least one of the following: an entity and side template; a noun, verb and modifier template; and a verb, noun and modifier template.

One embodiment in the above application has the following advantages or benefits: according to the part-of-speech tagging and the preset template, the compound phrases in the text corpus are accurately acquired, and the acquisition efficiency and accuracy of the compound phrases are improved.

Adjusting the words in the compound phrase according to the text corpus to expand and generate other compound phrases comprises: combining at least two words according to the syntactic dependency tree relationship of the words in the text corpus to generate a new compound phrase; and/or, for an extracted compound phrase, replacing words in the compound phrase with semantically similar words to generate a new compound phrase.

One embodiment in the above application has the following advantages or benefits: words that are not directly adjacent in the text corpus are obtained through the syntactic dependency tree, giving an abstract description of the meaning expressed by the sentences in the text corpus; and synonym replacement enlarges the coverage of the compound phrases and provides a wider range of optional elements for the knowledge graph.
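The two expansion routes described above can be sketched as follows. This is a minimal illustration only: the dependency heads are assumed to come from some external parser (none is named in the application), and the synonym table, function names and sample data are hypothetical.

```python
def combine_by_dependency(tokens, heads):
    """Pair each word with its syntactic head.

    heads[i] is the index of the head of tokens[i], or -1 for the root.
    The pairs may join words that are not adjacent in the sentence.
    """
    return [(tokens[i], tokens[h]) for i, h in enumerate(heads) if h >= 0]


def expand_by_synonyms(phrase, synonyms):
    """Generate new compound phrases by swapping one word for a near-synonym."""
    expanded = []
    for i, word in enumerate(phrase):
        for alternative in synonyms.get(word, ()):
            expanded.append(phrase[:i] + (alternative,) + phrase[i + 1:])
    return expanded


# Hypothetical parse: both modifiers depend on "hair accessories".
tokens = ["popular", "winter", "hair accessories"]
heads = [2, 2, -1]
pairs = combine_by_dependency(tokens, heads)

# Hypothetical synonym table.
variants = expand_by_synonyms(tuple(tokens), {"popular": ["fashionable"]})
```

Here `pairs` links "popular" to "hair accessories" even though the two words are separated by "winter", which is the kind of non-adjacent combination the dependency tree is meant to expose.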

Identifying content phrases of topics of interest from the extracted phrases comprises: performing topic cleaning on the extracted phrases, wherein the topic cleaning comprises at least removing stop words, adding custom stop words, and synonym normalization; and screening the cleaned phrases according to their occurrence frequency in each text corpus, so as to determine the content phrases that are topics of interest.

One embodiment in the above application has the following advantages or benefits: topic cleaning filters out useless phrases, so that more accurate topics of interest are obtained from the extracted phrases and the efficiency of obtaining them is improved.
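A minimal sketch of topic cleaning followed by frequency screening is given below. The stop-word sets, the synonym table, the document-frequency threshold `min_corpora` and all sample phrases are assumptions made for illustration; the application does not specify them.

```python
from collections import Counter


def clean_and_screen(phrases_per_corpus, stop_words, custom_stop_words,
                     canonical, min_corpora=2):
    """Clean extracted phrases, then keep those found in enough corpora.

    canonical maps a phrase to its normalized synonym (hypothetical table).
    """
    blocked = set(stop_words) | set(custom_stop_words)
    corpus_frequency = Counter()
    for phrases in phrases_per_corpus:
        # Count each cleaned phrase at most once per text corpus.
        cleaned = {canonical.get(p, p) for p in phrases if p not in blocked}
        corpus_frequency.update(cleaned)
    return sorted(p for p, n in corpus_frequency.items() if n >= min_corpora)


topics = clean_and_screen(
    [["of", "hair accessory", "winter"],
     ["hair accessories", "video"],
     ["winter", "hair accessories"]],
    stop_words={"of"},
    custom_stop_words={"video"},
    canonical={"hair accessory": "hair accessories"},
)
```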

Identifying content phrases of topics of interest from the extracted phrases comprises: for the extracted compound phrases, screening the compound phrases according to the occurrence frequency, in the existing knowledge graph, of the words they include, so as to determine the content phrases that are topics of interest.

One embodiment in the above application has the following advantages or benefits: screening according to the occurrence frequency in the knowledge graph ensures that the retained compound phrases have a certain popularity, that is, a certain level of attention, so that once a topic of interest is determined it is likely to be queried or accessed by more users.

Identifying content phrases of object entities from the extracted phrases comprises: for a text corpus that includes a topic-of-interest phrase, identifying, based on a preset sentence template, a phrase that is in a contextual position relationship with the topic-of-interest phrase in a sentence of the text corpus, and taking that phrase as a content phrase of an object entity.

One embodiment in the above application has the following advantages or benefits: object entities in the text corpus are acquired accurately according to the preset sentence template, which provides accurate point elements for the construction of the knowledge graph.

Identifying content phrases of object sides from the extracted phrases comprises: determining, from the extracted compound phrases, the compound phrases that include an object entity phrase; for the compound phrases that include the object entity phrase, determining the phrases collocated with the object entity phrase as candidate side phrases according to a preset phrase structure template; and, for the candidate side phrases, identifying their co-occurrence frequency with the object entity phrase in each text corpus, and screening according to the co-occurrence frequency to determine the content phrases of the object sides of the object entity; wherein the preset phrase structure template comprises at least one of the following: an entity and verb template, an entity and noun template, and an entity and modifier template.

One embodiment in the above application has the following advantages or benefits: object sides in the text corpus are acquired accurately according to the preset phrase structure template and the co-occurrence frequency with the object entity, which provides accurate point elements for the construction of the knowledge graph.

Identifying content phrases of action events from the extracted phrases comprises: extracting verbs from the extracted phrases as content phrases of candidate action events; and performing action-event quality screening on the content phrases of the candidate action events with a preset machine learning model, wherein the preset machine learning model is trained on text corpora in which action events with independent semantics have been manually labeled.

One embodiment in the above application has the following advantages or benefits: quality screening of verb phrases with the preset machine learning model accurately acquires the action events in the text corpus, which provides accurate point elements for the construction of the knowledge graph.

According to the word structure in the phrases and the sentence structure of the text corpus where the phrases are located, identifying the association relations of phrases that are topics of interest comprises: performing cluster identification on the text corpora of the information objects, matching the clustering result against preset superior topics in the knowledge graph, and determining the association relation between each information object and a superior topic; establishing an association relation between the topics of interest extracted from the text corpus of the information object and the superior topic; and, taking the topics of interest included in the text corpus as basic topics of interest, establishing an association relation between each open topic of interest formed by expanding a basic topic of interest and the superior topic, according to the association relation between the basic topic of interest and the superior topic; wherein the preset superior topics comprise one or more levels.

One embodiment in the above application has the following advantages or benefits: the association relation between open topics of interest and superior topics is determined through cluster identification and through the association relation between the extracted topics of interest and the superior topics, which provides accurate edge elements for the construction of the knowledge graph.
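The propagation of the superior-topic relation from basic topics of interest to open topics of interest might be sketched as below. The clustering and matching step is omitted: its outcome is simply assumed to be available as a mapping from information object to superior topic. All identifiers and sample values are hypothetical.

```python
def link_to_superior_topics(topics_by_object, superior_by_object,
                            open_topics_by_base):
    """Return (topic, superior topic) edges.

    superior_by_object stands in for the result of clustering the text
    corpora and matching the clusters against preset superior topics.
    """
    edges = set()
    for info_object, base_topics in topics_by_object.items():
        superior = superior_by_object.get(info_object)
        if superior is None:
            continue  # no superior topic matched for this information object
        for base in base_topics:
            edges.add((base, superior))
            # Open topics inherit the relation of the basic topic they expand.
            for open_topic in open_topics_by_base.get(base, ()):
                edges.add((open_topic, superior))
    return edges


edges = link_to_superior_topics(
    {"video_1": ["popular winter hair accessories"]},
    {"video_1": "fashion"},
    {"popular winter hair accessories": ["popular summer hair accessories"]},
)
```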

According to the sentence structure of the text corpus where the phrases are located, identifying the association relations of phrases that are object entities comprises: in the process of identifying the content phrases of object entities, establishing an association relation between each object entity phrase and the topic-of-interest phrase with which it is in a contextual position relationship.

One embodiment in the above application has the following advantages or benefits: the association relation between object entities and topics of interest is established, which provides accurate edge elements for the construction of the knowledge graph.

According to the word structure in the phrases, identifying the association relations of phrases that are object sides comprises: in the process of identifying the content phrases of object sides, establishing an association relation between an object entity phrase and an object side phrase that co-occur in the same phrase.

One embodiment in the above application has the following advantages or benefits: the association relation between object entities and object sides is established, which provides accurate edge elements for the construction of the knowledge graph.

According to the sentence structure of the text corpus where the phrases are located, identifying the association relations of phrases that are action events comprises: determining the topics of interest and object entities that co-occur in a text corpus including the action event, and establishing association relations between them and the action event.

One embodiment in the above application has the following advantages or benefits: the association relations between action events and object entities, and between action events and topics of interest, are established, which provides accurate edge elements for the construction of the knowledge graph.

The information object includes at least one of: pictures, audio and video; the text corpus of the video comprises: video titles, video tags, video captions, video descriptions, user posting information for videos, searched logs of videos, and user reviews of videos.

One embodiment in the above application has the following advantages or benefits: the content contained in the information object is specified, and particularly for videos, the text corpora of the information object comprise a plurality of corpora related to the videos.

In a second aspect, an embodiment of the present application provides an apparatus for constructing a knowledge graph, including:

the text corpus acquiring module is used for acquiring text corpuses of at least one information object;

the phrase extraction module is used for extracting phrases from the text corpus, wherein each phrase comprises at least one word;

the association relation acquisition module is used for identifying content phrases of topics of interest, object entities, object sides and action events from the extracted phrases, and identifying association relations among the phrases according to the word structure in the phrases and the sentence structure in the text corpus where the phrases are located;

and the knowledge graph updating module is used for updating the content phrases into the point elements of the knowledge graph and updating the association relation into the edge elements of the knowledge graph.

In a third aspect, an embodiment of the present application provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of constructing a knowledge-graph as described in any of the embodiments of the present application.

In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute a method for constructing a knowledge graph according to any of the embodiments of the present application.

Other effects of the above optional implementations will be described below with reference to specific embodiments.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1A is a flow chart of a method for constructing a knowledge graph according to a first embodiment of the present application;

FIG. 1B is a block diagram of a KG database according to an embodiment of the present application;

FIG. 1C is a block diagram of the structure of a knowledge-graph constructed in one embodiment of the present application;

FIG. 2 is a flow chart of a method for constructing a knowledge graph according to the second embodiment of the present application;

FIG. 3A is a flow chart of a method for constructing a knowledge graph in the third embodiment of the present application;

FIG. 3B is a flowchart of a method for constructing a knowledge-graph according to a third embodiment of the present application;

FIG. 3C is a block diagram of the structure of a knowledge-graph constructed in the third embodiment of the present application;

FIG. 3D is a block diagram of the structure of a knowledge-graph constructed in the third embodiment of the present application;

FIG. 3E is a schematic flow chart diagram of a method for constructing a knowledge graph provided by an embodiment of the present application;

FIG. 4 is a block diagram of an apparatus for constructing a knowledge graph according to a fourth embodiment of the present invention;

FIG. 5 is a block diagram of an electronic device for implementing the method of construction of a knowledge-graph of an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Example one

Fig. 1A is a flowchart of a method for constructing a knowledge graph according to an embodiment of the present application, where this embodiment is applicable to a case of constructing a knowledge graph of knowledge data, and the method may be executed by a knowledge graph constructing apparatus in an embodiment of the present application, where the apparatus may be implemented in software and/or hardware, and may be generally integrated on a knowledge graph constructing server, and the method specifically includes the following operations:

S110, obtaining text corpora of at least one information object.

A corpus is linguistic material; the text corpus of an information object is the textual linguistic material related to that information object. The text corpus of the information object can be obtained from open data on the Internet, for example from a KG (Knowledge Graph) database. As shown in FIG. 1B, the KG database contains tens of millions of specific object entities (e.g., Kung Pao chicken) and the relationships between the specific object entities and their concept entities (e.g., food); information on things, people or places, such as landmarks, celebrities, cities, teams, buildings, geographic features and works of art, can be obtained through the KG database.

Optionally, in this embodiment of the present application, the information object includes at least one of: pictures, audio and video; wherein, the text corpus of the video comprises: video titles, video tags, video captions, video descriptions, user posting information for videos, searched logs of videos, and user reviews of videos. The corresponding text corpus can also be obtained by performing image recognition and audio recognition on the video.

S120, extracting phrases from the text corpus, wherein each phrase comprises at least one word.

A word is the smallest unit that reflects text content and has independent natural semantics, and a phrase may include one or more words. Optionally, extracting phrases from the text corpus includes: extracting minimum unit phrases and compound phrases from the text corpus, wherein a minimum unit phrase comprises one word and a compound phrase comprises at least two words. The minimum unit phrases can be extracted from the text corpus by means of mutual information and information entropy. Specifically, word segmentation is performed on the text corpus to form word sequences; and for each word, if the collocation structure of the word is stable across the word sequences, the word is determined as an extracted minimum unit phrase. Word segmentation splits the text corpus into separate words so that a computer or other recognition device can effectively recognize the meaning of a sentence. For example, suppose the word sequence formed by segmenting one text corpus of an information object contains the word "beauty"; if "beauty" appears many times, in the same collocation, in the word sequences formed from the other text corpora of the information object, the word "beauty" is determined to be a minimum unit phrase.
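As a rough illustration of the two statistics mentioned above, the sketch below computes pointwise mutual information for adjacent word pairs and the entropy of the words that follow a given word. How these scores are turned into a "stable collocation" decision is not specified in the application, so no threshold is applied; the sample sequences, including the word "makeup", are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(sequences):
    """Pointwise mutual information of each adjacent word pair."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (a, b): math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }


def follower_entropy(sequences, word):
    """Shannon entropy (bits) of the words that directly follow `word`."""
    followers = Counter(b for seq in sequences for a, b in zip(seq, seq[1:]) if a == word)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


sequences = [
    ["beauty", "makeup", "tutorial"],
    ["beauty", "makeup", "tips"],
    ["daily", "beauty", "makeup"],
]
pmi = adjacent_pmi(sequences)
```

In this toy data "beauty" is always followed by the same word, so its follower entropy is zero, while the pair ("beauty", "makeup") has a positive PMI; both point to a collocation that recurs consistently.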

Specifically, before a word is determined as an extracted minimum unit phrase on the basis of a stable collocation structure in each word sequence, the method further comprises at least one of the following preprocessing operations:

removing stop words from the word sequence;

filtering existing phrases out of the word sequence according to the phrases recorded by existing point elements in the knowledge graph;

and sorting the words in the word sequence by their occurrence frequency in the text corpus, so that the stability of their collocation structures is identified in that order.

Stop words are words that appear frequently in every text corpus but carry no practical meaning, such as "of", "has", "etc." and "like"; they can be removed directly by matching the text corpus against the stop words stored in a stop-word database. If a phrase in the word sequence already exists in an existing knowledge graph (e.g., the KG database), it is filtered out. For example, if the phrase "panda hairpin" is already recorded by an existing point element in the knowledge graph and also appears in the word sequence, "panda hairpin" is filtered out of the word sequence to avoid duplicate point elements in the knowledge graph.
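The three preprocessing operations can be chained as in the following minimal sketch. The stop-word set stands in for the stop-word database, and the function name, parameters and sample data are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the stop-word database.
STOP_WORDS = {"of", "has", "etc.", "like"}


def preprocess(sequence, existing_phrases, corpus_counts):
    """Drop stop words and known phrases, then sort by corpus frequency."""
    kept = [w for w in sequence
            if w not in STOP_WORDS and w not in existing_phrases]
    unique = list(dict.fromkeys(kept))  # de-duplicate, keep first-seen order
    # Most frequent first; sorted() is stable, so ties keep their order.
    return sorted(unique, key=lambda w: -corpus_counts.get(w, 0))


sequence = ["panda hairpin", "of", "hair accessories", "winter", "winter"]
ordered = preprocess(sequence,
                     existing_phrases={"panda hairpin"},
                     corpus_counts=Counter(sequence))
```

Here "panda hairpin" is dropped because the knowledge graph already records it, "of" is dropped as a stop word, and "winter" is examined before "hair accessories" because it occurs more often.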

A minimum unit phrase is sometimes too general in meaning: although it occurs frequently and its collocation structure is stable, it contributes little to recognizing the information object, that is, it has little distinguishing power, so minimum unit phrases need to be combined into compound phrases to express the text content of the information object accurately. For example, the text corpus includes "popular winter hair accessories", where "winter" and "hair accessories" are minimum unit phrases that do little to distinguish and recognize the information object, so the compound phrase "popular winter hair accessories" needs to be formed to represent the text content of the information object accurately. Optionally, extracting compound phrases from the text corpus includes: performing part-of-speech tagging on the words in the text corpus; and extracting, according to a preset part-of-speech collocation template, at least two words in the text corpus that conform to the part-of-speech collocation structure as a compound phrase; wherein the preset part-of-speech collocation template comprises at least one of the following: an entity and side template; a noun, verb and modifier template; and a verb, noun and modifier template. Part-of-speech tagging means tagging each word in the text corpus with a part-of-speech tagging tool; the tags include word categories (such as person names, place names, numbers and meaningless conjunctions) and parts of speech (such as verbs, nouns and adjectives). The preset part-of-speech collocation template can be set based on manual experience, or can be selected from various word collocation templates according to how commonly they are used.
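Template-based extraction of compound phrases can be sketched as a sliding-window match over tagged words. The tag names, the reading of each template as an ordered tag sequence, and the sample input are all assumptions; the tagging tool itself is outside the sketch.

```python
# Hypothetical tag names; each template is read as an ordered tag sequence.
TEMPLATES = [
    ("ENTITY", "SIDE"),
    ("NOUN", "VERB", "MODIFIER"),
    ("VERB", "NOUN", "MODIFIER"),
]


def extract_compound_phrases(tagged, templates=TEMPLATES):
    """Return word groups whose tags match one of the templates.

    tagged is a list of (word, tag) pairs from a part-of-speech tagger.
    """
    tags = [tag for _, tag in tagged]
    phrases = []
    for template in templates:
        width = len(template)
        for start in range(len(tagged) - width + 1):
            if tuple(tags[start:start + width]) == template:
                phrases.append(tuple(w for w, _ in tagged[start:start + width]))
    return phrases


tagged = [("Chinese wolfberry", "ENTITY"), ("efficacy", "SIDE")]
compound = extract_compound_phrases(tagged)
```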

S130, identifying content phrases of topics of interest, object entities, object sides and action events from the extracted phrases, and identifying the association relations among the phrases according to the word structure in the phrases and the sentence structure in the text corpus where the phrases are located.

Optionally, in an embodiment of the present application, identifying content phrases of topics of interest from the extracted phrases includes: for the extracted compound phrases, screening the compound phrases according to the occurrence frequency, in the existing knowledge graph, of the words they include, so as to determine the content phrases that are topics of interest. For example, for the compound phrase "popular winter hair accessories", the words "winter", "popular" and "hair accessories" are all existing point elements in the existing KG database and occur frequently, and there is a relationship among the three (representing the categories "season-style-fashion category"), so "popular winter hair accessories" can be taken as a topic of interest.
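The frequency-based screening of compound phrases can be sketched as below. The frequency table, the threshold `min_frequency` and the rule that every component word must reach it are assumptions; the check for a relationship among the component words is not modeled.

```python
def screen_topics(compound_phrases, kg_frequency, min_frequency=100):
    """Keep compound phrases whose words are all frequent in the graph.

    kg_frequency maps a word to how often it occurs in the existing
    knowledge graph (hypothetical counts).
    """
    return [
        phrase for phrase in compound_phrases
        if all(kg_frequency.get(word, 0) >= min_frequency for word in phrase)
    ]


kg_frequency = {"winter": 500, "popular": 800, "hair accessories": 300}
topics = screen_topics(
    [("popular", "winter", "hair accessories"), ("popular", "umbrella")],
    kg_frequency,
)
```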

Optionally, in an embodiment of the present application, identifying content phrases of object entities from the extracted phrases includes: for a text corpus that includes a topic-of-interest phrase, identifying, based on a preset sentence template, a phrase that is in a contextual position relationship with the topic-of-interest phrase in a sentence of the text corpus, and taking that phrase as a content phrase of an object entity. For example, for the acquired topic of interest "popular winter hair accessories", the corresponding text corpus includes a sentence stating that the rabbit hairpin is, after all, the one to buy, being the most popular hair accessory this winter. Since a topic of interest usually describes an entity, and the phrase "rabbit hairpin" in its context is an object entity, "rabbit hairpin" is determined as the identified object entity.
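One way to picture a sentence template is as a regular expression that captures the phrase standing in a fixed position relative to the topic phrase. The template "X is/are (a|an|the) TOPIC" and the sample sentence below are hypothetical; the application does not list its actual templates.

```python
import re


def find_object_entities(sentence, topic):
    """Capture the phrase that precedes "is/are <topic>" in a sentence."""
    pattern = re.compile(
        r"(?:^|[,.;] ?)(?:the |a |an )?(?P<entity>[\w' -]+?) (?:is|are) "
        r"(?:a |an |the )?" + re.escape(topic),
        flags=re.IGNORECASE,
    )
    return [match.group("entity").strip() for match in pattern.finditer(sentence)]


entities = find_object_entities(
    "The rabbit hairpin is a popular winter hair accessory.",
    "popular winter hair accessory",
)
```

A real system would likely hold several such templates and apply them only to sentences already known to contain a topic-of-interest phrase.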

For example, in a phrase with a subject-predicate structure, the subject is determined as the object entity and the predicate as the object side, as in "making Kung Pao chicken". Optionally, identifying content phrases of object sides from the extracted phrases includes: determining, from the extracted compound phrases, the compound phrases that include an object entity phrase; for those compound phrases, determining the phrases collocated with the object entity phrase as candidate side phrases according to a preset phrase structure template; and, for the candidate side phrases, identifying their co-occurrence frequency with the object entity phrase in each text corpus, and screening according to the co-occurrence frequency to determine the content phrases of the object sides of the object entity; wherein the preset phrase structure template includes at least one of the following: an entity and verb template, an entity and noun template, and an entity and modifier template. According to the preset phrase structure template, the object side is a verb, noun or modifier collocated with the object entity. For example, in the compound phrase "efficacy of Chinese wolfberry", the object side collocated with the object entity "Chinese wolfberry" is "efficacy"; in the compound phrase "cheetah speed", the object side collocated with the object entity "cheetah" is "speed". After the candidate side phrases are determined, to make sure that they are meaningful, they can be screened by their co-occurrence frequency with the object entity phrase: candidate side phrases whose co-occurrence frequency exceeds a preset threshold are selected as the object sides of the object entity.

Optionally, word-space features, statistical features and context can be input into a black-box model to determine whether an object entity and an object side can be combined; the black-box model may be a combined BiLSTM (Bi-directional Long Short-Term Memory) and Softmax model. The association between an object entity and an object side can also be extended through the concept entity corresponding to the object entity. For example, the above technical scheme obtains the object side "efficacy" for the object entity "Chinese wolfberry"; if the existing knowledge graph in the KG database describes "Chinese wolfberry" with the concept entity "medicinal material", the object side "efficacy" can be extended to all object entities that have the concept entity "medicinal material", for example to the object entity "ginseng".
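The template match and co-occurrence screening for object sides can be sketched as follows. The tag names, the substring-based co-occurrence test, the threshold value and the sample corpora are assumptions; the BiLSTM and Softmax model and the concept-entity extension are not modeled.

```python
from collections import Counter

# Tags allowed by the entity + verb / noun / modifier templates (hypothetical names).
SIDE_TAGS = {"VERB", "NOUN", "MODIFIER"}


def select_object_sides(entity, tagged_phrases, corpora, threshold=1):
    """Pick object sides that co-occur with the entity often enough."""
    candidates = set()
    for phrase in tagged_phrases:
        if entity in [word for word, _ in phrase]:
            candidates.update(
                word for word, tag in phrase
                if word != entity and tag in SIDE_TAGS
            )
    cooccurrence = Counter()
    for text in corpora:
        if entity in text:  # crude substring test, for illustration only
            cooccurrence.update(side for side in candidates if side in text)
    # "Exceeding a preset threshold" is read as strictly greater than.
    return sorted(side for side in candidates if cooccurrence[side] > threshold)


sides = select_object_sides(
    "Chinese wolfberry",
    [[("Chinese wolfberry", "ENTITY"), ("efficacy", "NOUN")],
     [("Chinese wolfberry", "ENTITY"), ("price", "NOUN")]],
    ["efficacy of Chinese wolfberry",
     "Chinese wolfberry efficacy and usage",
     "Chinese wolfberry price"],
)
```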

Optionally, identifying content phrases of the action event from the extracted phrases includes: extracting verbs from the extracted phrases as content phrases of candidate action events; and performing quality screening on the content phrases of the candidate action events by using a preset machine learning model, where the preset machine learning model is trained on text corpora manually labeled with action events having independent semantics. For example, action events with independent semantics, such as "cooking", "eating and broadcasting" and "marriage", are labeled manually in a plurality of acquired sample text corpora, the preset machine learning model is obtained by training on these samples, and the model then determines whether a content phrase extracted as a candidate action event can serve as an action event.

On the basis of the above technical solution, as shown in fig. 1C, in the process of identifying the content phrases of the object entities, an association relation has already been established between each object entity phrase and the concerned subject phrase that conforms to the contextual position relationship, for example, the association between the object entity "hairpin rabbit hair" and the concerned subject "popular hair accessories in winter" in the above technical solution; in the process of identifying the content phrases of the object sides, an association relation is established between the object entity phrase and the object side phrase, for example, between the object entity "Chinese wolfberry" and the object side "efficacy" in the above technical solution; and the concerned subjects and object entities that co-occur in the text corpora including an action event are determined, and an association relation is established accordingly.

S140, updating the content phrases into the point elements of the knowledge graph, and updating the association relation into the edge elements of the knowledge graph.

The content phrases of the concerned subject, the object entity, the object side and the action event acquired through the above technical scheme are updated into point elements of the knowledge graph, and the association relations between these point elements are updated into edge elements of the knowledge graph, so as to construct a complete knowledge graph that describes the object entities in detail.
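
The update of point elements and edge elements can be pictured with a small in-memory structure. This is only a sketch under stated assumptions: the class name, the element type labels and the relation label are hypothetical and are not defined by the application.

```python
class KnowledgeGraph:
    """Toy graph: point elements keyed by phrase, edge elements as triples."""

    def __init__(self):
        self.points = {}    # phrase -> element type (hypothetical labels)
        self.edges = set()  # (source phrase, relation, target phrase)

    def update_point(self, phrase, kind):
        # Re-adding an existing phrase simply overwrites its type.
        self.points[phrase] = kind

    def update_edge(self, source, relation, target):
        # An edge is only valid between two existing point elements.
        if source not in self.points or target not in self.points:
            raise KeyError("both endpoints must be point elements")
        self.edges.add((source, relation, target))


kg = KnowledgeGraph()
kg.update_point("Chinese wolfberry", "object_entity")
kg.update_point("efficacy", "object_side")
kg.update_edge("Chinese wolfberry", "has_side", "efficacy")
print(len(kg.points), len(kg.edges))
```

Storing edges in a set makes repeated updates of the same association idempotent, which suits a graph that is extended continuously.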

According to the technical scheme of the embodiment of the application, the knowledge graph of the information object is constructed by extracting the text corpus of the information object. While the object entities are determined, a plurality of point elements such as concerned subjects, action events and object sides, together with the corresponding edge elements, are added, which enriches the constituent elements of the knowledge graph, expands the information of the object entities already in the knowledge graph, and allows new object entities to be continuously mined from the information objects to expand and supplement the knowledge graph.

Example two

Fig. 2 is a flowchart of a method for constructing a knowledge graph in a second embodiment of the present application. This embodiment is refined on the basis of the above embodiment: after the minimum unit phrases and the compound phrases are extracted from the text corpus, the method further includes adjusting the vocabulary in the compound phrases according to the text corpus so as to expand and generate other compound phrases. Correspondingly, the method of this embodiment specifically includes the following operations:

S210, obtaining text corpora of at least one information object.

S220, extracting phrases from the text corpus; the phrases include a minimum unit phrase including one word and a compound phrase including at least two words.

S230, combining at least two vocabularies according to the syntactic dependency tree relationship of the vocabularies in the text corpus to generate a new compound phrase; and/or replacing words in the compound phrase by approximate semantic words for the extracted compound phrase to generate a new compound phrase.

A syntactic dependency tree, i.e., dependency syntax, is constructed by parsing a sentence into a tree structure according to syntactic logic. It describes the dependency relationships among words, that is, the syntactic collocation relationships among words, and these collocation relationships are semantically related. Take "winter popular hair accessories" as an example. The text corpus does not contain the expression "winter popular hair accessories", so this compound phrase cannot be extracted directly; nor does it contain a variant of the expression with an intervening stop word, so the phrase cannot be obtained by removing a stop word either. However, the text corpus contains "hair accessories which are commonly used by stars and are the most popular style in winter", and according to the syntactic dependency tree relationship, "winter", "popular" and "hair accessories" can be extracted and recombined into the compound phrase "winter popular hair accessories".

The words in a compound phrase can also be replaced with words of similar semantics to generate new compound phrases. For example, vocabulary replacement applied to the generated compound phrase "winter fashion hair accessory" yields new compound phrases such as "spring fashion hair accessory", "winter fashion clothing" and "spring retro makeup".
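
The vocabulary replacement step can be sketched as follows. The similar-word table is a hand-written, hypothetical stand-in; how similar semantic words are actually obtained is not specified here, and the function name is likewise an assumption.

```python
from itertools import product


def expand_compound_phrase(words, similar_words):
    """Generate new compound phrases by swapping words for similar ones.

    `similar_words` maps a word to its replacement candidates
    (hypothetical data supplied by the caller).
    """
    # Each slot offers the original word plus its replacement candidates.
    options = [[w] + similar_words.get(w, []) for w in words]
    original = tuple(words)
    return [" ".join(combo) for combo in product(*options) if combo != original]


similar = {"winter": ["spring"], "clothing": ["hair accessory"]}
new_phrases = expand_compound_phrase(["winter", "fashion", "clothing"], similar)
print(new_phrases)
```

Each generated phrase is only a candidate; it would still need the reasonableness check that the text describes for recombined compound phrases.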

Specifically, for a recombined compound phrase, its reasonableness can be checked. For example, whether it is a reasonable compound phrase can be determined based on the superordinate concept entity of the newly generated compound phrase, a PV (Page View) attribute, and/or a combined model of the Attention and CRF (Conditional Random Field) algorithms, where the combined model may be trained on a plurality of samples labeled as reasonable compound phrases.

S240, performing topic cleaning on the extracted phrases; wherein the topic cleaning includes at least removing stop words, adding custom stop words, and synonym normalization.

The extracted phrases may contain expressions that are non-standard, not fluent or repetitive, so topic cleaning is required to ensure that the concerned subjects are obtained from well-formed phrases. Adding custom stop words means adding special stop words as needed; for example, when the information object is a video, "old iron" and "double-click six-six" are added as special stop words so that the phrases do not include these words. Synonym normalization can be realized through data stored in a synonym database or through synonyms added as needed, so that synonyms appear as the same point element in the knowledge graph, which avoids an overly complicated graph structure caused by each synonym appearing independently.

S250, screening the phrases according to the occurrence frequency of the topic-cleaned phrases in each text corpus to determine the content phrases serving as the concerned subjects.

A concerned subject is a phrase with a certain degree of attention. If the occurrence frequency of a phrase in the text corpora is too low, the phrase is considered to have no practical significance and need not be used as a concerned subject. Therefore, not all cleaned phrases become concerned subjects; the cleaned phrases are screened by their occurrence frequency in each text corpus, and the higher the occurrence frequency, the higher the importance and the higher the priority for serving as a concerned subject. Optionally, the importance of a phrase is judged by TF-IDF (term frequency-inverse document frequency) weighted statistics, and a preset number of phrases with the highest importance are selected as the concerned subjects.
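
A minimal sketch of the three cleaning operations, assuming the phrase has already been segmented into words. The function name, the stop-word sets and the synonym table are hypothetical illustration data, not part of the application.

```python
def clean_phrase(words, stop_words, custom_stop_words, synonyms):
    """Drop general and custom stop words, then map synonyms to one form."""
    blocked = stop_words | custom_stop_words
    kept = [w for w in words if w not in blocked]
    # Synonym normalization: every synonym collapses to its canonical word.
    return [synonyms.get(w, w) for w in kept]


cleaned = clean_phrase(
    ["old iron", "the", "winter", "popular", "headwear"],
    stop_words={"the"},
    custom_stop_words={"old iron", "double-click six-six"},
    synonyms={"headwear": "hair accessories"},
)
print(cleaned)
```

Because synonyms collapse to one canonical word, two differently worded phrases end up as the same point element rather than as separate ones.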
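
The optional TF-IDF screening can be sketched in a few lines. This is an assumption-laden illustration: each text corpus is represented as a list of already cleaned phrases, the smoothed IDF formula is one common variant chosen for the example, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def top_topics(docs, k):
    """Rank phrases by summed TF-IDF over all documents and keep the top k."""
    n = len(docs)
    # Document frequency: in how many documents each phrase appears.
    df = Counter(p for doc in docs for p in set(doc))
    scores = Counter()
    for doc in docs:
        for phrase, count in Counter(doc).items():
            idf = math.log((1 + n) / (1 + df[phrase])) + 1  # smoothed IDF
            scores[phrase] += (count / len(doc)) * idf
    return [phrase for phrase, _ in scores.most_common(k)]


docs = [
    ["winter fashion clothing", "winter fashion clothing", "star"],
    ["winter fashion clothing", "makeup"],
    ["star", "makeup", "winter fashion clothing"],
]
print(top_topics(docs, 2))
```

The parameter `k` plays the role of the preset number of phrases to keep as concerned subjects.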

Optionally, in the embodiment of the present application, the quality of the extracted phrases as the attention subjects may be further screened through the black box model, so as to obtain phrases meeting a certain quality requirement, that is, phrases meeting a certain occurrence frequency requirement as the attention subjects.

S260, identifying the association relation between the phrases according to the vocabulary structure in the phrases and the sentence structure in the text corpus where the phrases are located.

S270, updating the content phrases into point elements of the knowledge graph, and updating the association relation into edge elements of the knowledge graph.

According to the technical scheme of the embodiment of the application, the extracted vocabulary in the compound phrases is adjusted to expand and generate other compound phrases, so that point elements of the knowledge graph are greatly enriched, the expansion of the concerned subjects of the same type is realized, and a more complete knowledge graph is constructed.

Example three

Fig. 3A is a flowchart of a method for constructing a knowledge graph in a third embodiment of the present application, which is embodied on the basis of the foregoing embodiment, and in this embodiment, identifying, according to a vocabulary structure in a phrase and a sentence structure in a text corpus in which the phrase is located, an association relationship between phrases serving as topics of interest includes: and performing clustering identification according to the text corpus of the information object, matching with a preset superior theme in the knowledge graph according to a clustering result, and determining an association relation between the information object and the superior theme. Correspondingly, the method of the embodiment specifically includes the following operations:

S310, obtaining text corpora of at least one information object.

S320, extracting phrases from the text corpus, wherein the phrases comprise a minimum unit phrase and a compound phrase, the minimum unit phrase comprises a word, and the compound phrase comprises at least two words.

S330, identifying the content phrases of the concerned subjects from the extracted phrases.

S340, performing clustering identification according to the text corpora of the information object, matching with a preset superior theme in the knowledge graph according to a clustering result, and determining an association relation between the information object and the superior theme; the preset upper-level theme comprises one or more levels.

The upper-level theme is a closed theme set and can be preset in the knowledge graph according to the requirement; the superior theme may include different levels, for example, the superior theme may include a primary theme and a secondary theme, wherein the primary theme includes "fashion" and the secondary theme includes "apparel".

As shown in fig. 3B, taking an information object including a video as an example, the text corpora of the information object are clustered and identified according to an existing clustering algorithm, for example, the LDA (Latent Dirichlet Allocation) document topic generation model and a similarity algorithm based on deep hash features (Feature Hashing), so as to identify the key vocabulary in the text corpora. The key vocabulary is then compared with the preset superior topics for similarity, and a mapping relationship between the text corpora and a preset superior topic is established based on the similarity. For example, the key word "clothing" is identified by clustering in the text corpus and is completely consistent with the secondary topic "clothing", so the mapping relationship between the information object and the secondary topic "clothing" can be established.

S350, establishing an association relation between the concerned subject extracted from the text corpus of the information object and the superior topic.

The concerned subject extracted from the text corpus of the information object has a natural mapping relationship with the key vocabulary obtained in the above technical scheme. For example, as shown in fig. 3C, the concerned subject "winter fashion clothing" is related to the keyword "clothing", so a mapping relationship can be established between "winter fashion clothing" and the preset superior topic. In particular, the concerned subject "winter fashion clothing" can establish an association relationship with the secondary topic "clothing", with the primary topic "fashion", or with both the primary topic "fashion" and the secondary topic "clothing".
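
The matching of key vocabulary against the preset superior topics can be illustrated with a deliberately simple stand-in. The sketch below does not use LDA or feature hashing; it assumes the key words have already been produced by clustering and substitutes a plain word-overlap (Jaccard) similarity. The function names and the threshold are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two phrases (stand-in measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def match_superior_topic(key_words, preset_topics, threshold=0.5):
    """Return the preset topic most similar to any key word, or None."""
    best_topic, best_score = None, 0.0
    for topic in preset_topics:
        for word in key_words:
            score = jaccard(word, topic)
            if score > best_score:
                best_topic, best_score = topic, score
    return best_topic if best_score >= threshold else None


matched = match_superior_topic(["clothing", "winter"], ["fashion", "clothing"])
print(matched)
```

An exact match such as "clothing" scores 1.0, which corresponds to the completely consistent case in the example above; key words below the threshold leave the information object unmapped.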

S360, taking the concerned subject included in the text corpus as a basic concerned subject, and establishing an association relation between an open concerned subject formed by expanding the basic concerned subject and a superior subject according to the association relation between the basic concerned subject and the superior subject.

For example, open concerned subjects such as "spring popular dress", "summer popular dress" and "spring national dress", formed by expanding the concerned subject "winter popular dress", also establish an association relationship with the preset superior topic.

In particular, the existing KG database already includes relationships between specific object entities and concept entities. If a concept entity is an existing preset superior topic, the specific object entities corresponding to that concept entity can be associated with the preset superior topic. For example, as shown in fig. 3D, in the KG database the concept entities corresponding to the specific object entity "no-lane" are "Hong Kong action" and "movie", and the preset superior topics include a primary topic "movie" and a secondary topic "port"; the association relationship between the specific object entity "no-lane" and the preset superior topics can therefore be obtained directly from the correspondence between the concept entities and the preset superior topics.
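
How open concerned subjects inherit the association of their basic concerned subject can be sketched as a simple propagation over two mappings. The function name and both input tables are hypothetical illustration data.

```python
def link_open_topics(base_links, expansions):
    """Copy each basic topic's superior topics to its expanded open topics.

    base_links: basic topic -> set of superior topics
    expansions: basic topic -> list of open topics derived from it
    """
    links = {topic: set(sup) for topic, sup in base_links.items()}
    for base, open_topics in expansions.items():
        inherited = base_links.get(base, set())
        for open_topic in open_topics:
            links.setdefault(open_topic, set()).update(inherited)
    return links


base_links = {"winter popular dress": {"fashion", "clothing"}}
expansions = {"winter popular dress": ["spring popular dress", "summer popular dress"]}
links = link_open_topics(base_links, expansions)
print(sorted(links))
```

The basic topic keeps its own associations, and every open topic derived from it receives the same set of superior topics.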

S370, updating the open concern subjects formed by expanding the basic concern subjects to the point elements of the knowledge graph, and updating the association relationship to the edge elements of the knowledge graph.

According to the technical scheme of the embodiment of the application, the relation between the concerned subject extracted from the text corpus and the preset superior subject is established by presetting the superior subject, so that the open concerned subject and the closed preset superior subject are connected, the structure of the knowledge graph is clearer, and the object entity under the preset subject is convenient to search.

The technical scheme of the embodiment of the application realizes the construction of a novel knowledge graph that is suitable for various application scenarios, typically video cognition. Video content is rich and changes across multiple frames of images, so videos are sometimes difficult to distinguish through simple label entities; with the constructed knowledge graph, videos can be effectively identified and classified. Fig. 3E is a complete flow chart of the graph construction method provided in the embodiment of the present application. As shown in fig. 3E, the raw corpora may be obtained from a variety of sources, such as the KG universal knowledge base, search/feed stream user logs, web-wide web pages/hundred's, and various video resources. On the basis of the raw corpora, processing such as knowledge analysis, semantic structuring, normalization and fusion, and relation completion and verification is carried out, and the processed information is added to the knowledge understanding graph; the semantic structuring and the relation completion and verification can be carried out by means of representation learning and knowledge reasoning. Alternatively, knowledge representation is performed based on the raw corpora and the result is added to the knowledge understanding graph. The knowledge understanding graph is also called the knowledge graph, and it can be expanded and cleaned by means of repetition control, feature mining and the like. Through semantic point/edge screening, the knowledge graph can be used in different scenarios, such as video semantic understanding and cross-media generation.

Example four

Fig. 4 is a schematic structural diagram of an apparatus for constructing a knowledge graph according to a fourth embodiment of the present application, where the apparatus specifically includes: a text corpus obtaining module 410, a phrase extracting module 420, an association relation obtaining module 430 and a knowledge graph updating module 440.

A text corpus obtaining module 410, configured to obtain a text corpus of at least one information object;

a phrase extraction module 420, configured to extract phrases from the text corpus, where the phrases include at least one vocabulary;

an association relation obtaining module 430, configured to identify content phrases of the concerned subject, the object entity, the object side, and the action event from the extracted phrases, and identify an association relation between the phrases according to a vocabulary structure in the phrases and a sentence structure in a text corpus where the phrases are located;

a knowledge graph updating module 440, configured to update the content phrases into point elements of the knowledge graph and update the association relation into edge elements of the knowledge graph.

Optionally, on the basis of the foregoing technical solution, the phrase extraction module 420 is specifically configured to:

and extracting a minimum unit phrase and a compound phrase from the text corpus, wherein the minimum unit phrase comprises one word, and the compound phrase comprises at least two words.

Optionally, on the basis of the above technical solution, the apparatus for constructing a knowledge graph further includes:

and the compound phrase adjusting module is used for adjusting the vocabulary in the compound phrase according to the text corpus so as to generate other compound phrases in an expanding way.

Optionally, on the basis of the foregoing technical solution, the phrase extracting module 420 includes:

the word segmentation processing unit is used for carrying out word segmentation processing on the text corpus to form a word sequence;

and the minimum unit phrase acquiring unit is used for determining the vocabulary as the extracted minimum unit phrase if the vocabulary collocation structure of the vocabulary in each vocabulary sequence is stable.

the stop word removing module is used for removing stop words from the vocabulary sequence;

the existing element filtering module is used for filtering the existing phrases from the vocabulary sequence according to the phrases recorded by the existing point elements in the knowledge graph;

and the vocabulary frequency sequencing module is used for sequencing the vocabularies in the vocabulary sequence according to the occurrence frequency in the text corpus so as to identify the stability of the vocabulary collocation structure in sequence.

Optionally, on the basis of the foregoing technical solution, the phrase extracting module 420 further includes:

the part-of-speech tagging unit is used for carrying out part-of-speech tagging on the vocabulary in the text corpus;

the compound phrase extracting unit is used for extracting, according to a preset part-of-speech collocation template, at least two vocabularies in the text corpus that conform to a part-of-speech collocation structure to serve as a compound phrase; wherein the preset part-of-speech collocation template includes at least one of the following: an entity and side template; a noun, verb and modifier template; and a verb, noun and modifier template.

Optionally, on the basis of the above technical solution, the compound phrase adjusting module is specifically configured to:

combining at least two vocabularies according to the syntactic dependency tree relationship of the vocabularies in the text corpus to generate a new compound phrase; and/or replacing words in the compound phrase by approximate semantic words for the extracted compound phrase to generate a new compound phrase.

Optionally, on the basis of the foregoing technical solution, the association relationship obtaining module 430 includes:

the theme cleaning unit is used for cleaning the theme of the extracted phrases; wherein, the theme cleaning at least comprises removing stop words, setting stop word addition and synonym normalization;

and the first concerned subject determining unit is used for screening all phrases according to the appearance frequency of the phrases after the subject is cleaned in all text corpora so as to determine the content phrases which are the concerned subjects.

Optionally, on the basis of the foregoing technical solution, the association relationship obtaining module 430 further includes:

and the second attention topic determining unit is used for screening the extracted compound phrases according to the occurrence frequency of the words included in the compound phrases in the existing knowledge graph so as to determine the content phrases serving as the attention topics.

and the object entity determining unit is used for identifying phrases which accord with the context position relation with the concerned subject phrase in the sentences of the text corpus as the content phrases of the object entity based on a preset sentence template aiming at the text corpus comprising the concerned subject phrase.

an object entity phrase extracting unit, configured to determine, from the extracted compound phrases, the compound phrases that include an object entity phrase;

the candidate side phrase determining unit is used for determining phrases matched with the object entity phrases according to a preset phrase structure template aiming at the compound phrases comprising the object entity phrases, and the phrases are used as candidate side phrases;

the object side determining unit is used for identifying the co-occurrence frequency of the object entity phrases in each text corpus aiming at the candidate side phrases, and screening and determining the content phrases of the object sides of the object entities according to the co-occurrence frequency;

wherein the preset phrase structure template at least comprises one of the following items: entity and verb templates, entity and noun templates, and entity and modifier templates.

a candidate action event phrase extracting unit, configured to extract a verb as a content phrase of a candidate action event from the extracted phrases;

and the quality screening unit is used for performing quality screening on the content phrases of the candidate action events by using a preset machine learning model, wherein the preset machine learning model is trained on text corpora manually labeled with action events having independent semantics.

the clustering identification unit is used for performing clustering identification according to the text corpora of the information object, matching with a preset superior topic in the knowledge graph according to the clustering result, and determining the association relation between the information object and the superior topic;

the association relation establishing unit is used for establishing an association relation between the concerned subject extracted from the text corpus of the information object and the superior topic;

a first association relation determining unit, configured to use an attention topic included in the text corpus as a basic attention topic, and establish, according to an association relation between the basic attention topic and a higher-level topic, an association relation between an open attention topic formed by expanding the basic attention topic and the higher-level topic;

the preset upper-level theme comprises one or more levels.

and the second association relation determining unit is used for establishing, in the process of identifying the content phrases of the object entity, an association relation between the object entity phrase and the concerned subject phrase that conform to the contextual position relationship.

and the third association relation determining unit is used for establishing, in the process of identifying the content phrases of the object side, an association relation between the object entity phrases and the object side phrases that co-occur in the same phrase.

and the fourth association relation determining unit is used for determining the concerned subjects and object entities that co-occur in the text corpora including the action events, and establishing an association relation.

Optionally, on the basis of the above technical solution, the information object includes at least one of: pictures, audio and video; the text corpus of the video comprises: video titles, video tags, video captions, video descriptions, user posting information for videos, searched logs of videos, and user reviews of videos.

The device can execute the construction method of the knowledge graph provided by any embodiment of the application, and has corresponding functional modules and beneficial effects of the execution method. For technical details not described in detail in this embodiment, reference may be made to the method provided in any embodiment of the present application.

Example five

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 5 is a block diagram of an electronic device for the method of constructing a knowledge graph according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.

Memory 502 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of constructing a knowledge-graph provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method of constructing a knowledge-graph provided herein.

The memory 502, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for constructing a knowledge graph in the embodiment of the present application (for example, the text corpus obtaining module 410, the phrase extracting module 420, the association relation obtaining module 430, and the knowledge graph updating module 440 shown in fig. 4). The processor 501 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 502, that is, implements the method of constructing the knowledge graph in the above method embodiments.

The memory 502 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the electronic device for constructing the knowledge graph, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 502 may optionally include memory remotely located from the processor 501, which may be connected over a network to the electronic device for constructing the knowledge graph. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device for the method of constructing a knowledge graph may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.

The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device for constructing the knowledge graph; examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, and the like. The output device 504 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

According to the technical solution of the embodiments of the present application, the knowledge graph of the information object is constructed by extracting the text corpus of the information object. While the object entity is determined, a plurality of point elements, such as the concerned subject, the action event and the object side, are added together with the corresponding edge elements. This enriches the constituent elements of the knowledge graph, expands the information on object entities already in the knowledge graph, and allows new object entities to be mined continuously from information objects so that the composition of the knowledge graph is continuously expanded and supplemented.

It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this respect as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for constructing a knowledge graph, comprising:

acquiring a text corpus of at least one information object;

identifying content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, performing clustering identification according to text corpora of the information objects, matching the clustering results with preset superior subjects in a knowledge graph, and determining the association relationship between the information objects and the superior subjects;

establishing an association relation between the concerned subject extracted from the text corpus of the information object and the superior subject;

taking the concerned subject included in the text corpus as a basic concerned subject, and establishing an association relation between an open concerned subject formed by expanding the basic concerned subject and a superior subject according to the association relation between the basic concerned subject and the superior subject;

the preset superior subject comprises one or more levels; the object side is used for describing the object entity and is a verb, a noun or a modifier matched with the object entity;

2. The method of claim 1, wherein extracting phrases from the corpus of text comprises:

3. The method of claim 2, further comprising, after extracting the minimum unit phrase and the compound phrase from the corpus of text:

and adjusting the vocabulary in the compound phrase according to the text corpus so as to expand and generate other compound phrases.

4. The method of claim 2, wherein extracting the smallest unit phrase from the corpus of text comprises:

performing word segmentation processing on the text corpus to form a word sequence;

and for each word, if the word collocation structure of the word in each word sequence is stable, determining the word as the extracted minimum unit phrase.
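
By way of non-limiting example, the word segmentation and stability check might be sketched as follows. The claim does not define when a collocation structure is "stable"; this sketch assumes, purely for illustration, that a word is stable when it appears in at least `min_sequences` word sequences and its most frequent right-hand neighbour accounts for at least `min_ratio` of its following words. The names `segment`, `stable_words`, `min_sequences` and `min_ratio` are hypothetical, and whitespace splitting stands in for a real word segmenter.

```python
from collections import Counter, defaultdict


def segment(text):
    # Placeholder segmentation: split on whitespace (an assumption;
    # Chinese text would need a real word segmenter).
    return text.lower().split()


def stable_words(corpus, min_sequences=2, min_ratio=0.6):
    """Return words whose right-hand neighbour is consistent across sequences."""
    sequences = [segment(text) for text in corpus]
    seq_count = Counter()              # number of sequences containing each word
    neighbours = defaultdict(Counter)  # word -> counts of the word that follows it
    for seq in sequences:
        seq_count.update(set(seq))
        for left, right in zip(seq, seq[1:]):
            neighbours[left][right] += 1

    result = set()
    for word, followers in neighbours.items():
        total = sum(followers.values())
        top = followers.most_common(1)[0][1]
        # Assumed stability rule: frequent enough, and dominated by one neighbour.
        if seq_count[word] >= min_sequences and top / total >= min_ratio:
            result.add(word)
    return result


corpus = ["braised pork recipe", "braised pork tutorial", "spicy pork recipe"]
minimum_unit_phrases = stable_words(corpus)
```

Other stability measures, such as pointwise mutual information or boundary entropy, could be substituted without changing the surrounding flow.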

5. The method of claim 4, wherein for each vocabulary, before determining the vocabulary as the minimum unit phrase extracted if the vocabulary collocation structure in the respective vocabulary sequence is stable, further comprising at least one of the following pre-processing:

removing stop words from the vocabulary sequence;

6. The method of claim 2, wherein extracting compound phrases from the text corpus comprises:

performing part-of-speech tagging on the vocabulary in the text corpus;

extracting, according to a preset part-of-speech collocation template, at least two words in the text corpus that conform to the part-of-speech collocation structure, to serve as compound phrases;

wherein the preset part-of-speech collocation template comprises at least one of the following: entity and side templates; noun, verb and modifier templates; verb, noun and modifier templates.
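
By way of non-limiting example, matching part-of-speech collocation templates against tagged words might be sketched as follows. The tag names (`ADJ`, `VERB`, `NOUN`), the two templates and the function name `extract_compound_phrases` are assumptions for illustration; the input is given as already tagged, whereas a real pipeline would obtain the tags from a part-of-speech tagger.

```python
def extract_compound_phrases(tagged, templates):
    """Slide each part-of-speech template over the tagged words and collect matches."""
    phrases = []
    tags = [tag for _, tag in tagged]
    for template in templates:
        n = len(template)
        for i in range(len(tagged) - n + 1):
            # A match requires the tag window to equal the template exactly.
            if tuple(tags[i:i + n]) == template:
                phrases.append(" ".join(word for word, _ in tagged[i:i + n]))
    return phrases


# Hypothetical pre-tagged input and templates.
tagged = [("slow", "ADJ"), ("cook", "VERB"), ("beef", "NOUN"), ("stew", "NOUN")]
templates = [("VERB", "NOUN"), ("ADJ", "VERB", "NOUN")]
compound_phrases = extract_compound_phrases(tagged, templates)
```

With this input the sketch yields "cook beef" and "slow cook beef"; every matched window contains at least two words because each template has at least two slots.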

7. The method of claim 3, wherein adjusting the vocabulary in the compound phrase to expand to generate other compound phrases according to the text corpus comprises:

combining at least two words according to the syntactic dependency tree relationship of the words in the text corpus to generate a new compound phrase; and/or

for an extracted compound phrase, replacing words in the compound phrase with semantically similar words to generate a new compound phrase.
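
By way of non-limiting example, the replacement of words by semantically similar words might be sketched as follows; the dependency-tree branch is not shown. The synonym table and the function name `expand_by_synonyms` are hypothetical, and a real system might derive the similar words from embeddings or a thesaurus.

```python
def expand_by_synonyms(phrase, synonyms):
    """Generate new compound phrases by swapping one word for a similar word."""
    words = phrase.split()
    expanded = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            # Replace only position i, leaving the rest of the phrase intact.
            expanded.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return expanded


# Hypothetical table of semantically similar words.
synonyms = {"cook": ["prepare"], "beef": ["steak", "veal"]}
new_phrases = expand_by_synonyms("cook beef", synonyms)
```

Replacing a single word at a time keeps each generated phrase close to the original; replacing several words at once would enlarge the output but drift further from the text corpus.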

8. The method of any of claims 2-7, wherein identifying content phrases of the concerned subject from the extracted phrases comprises:

performing topic cleaning on the extracted phrases, wherein the topic cleaning comprises at least removing stop words, adding set stop words, and synonym normalization;

and screening the phrases after the topic cleaning according to their frequency of occurrence in each text corpus, so as to determine the content phrases serving as the concerned subjects.
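
By way of non-limiting example, the topic cleaning and frequency screening might be sketched as follows. The stop-word sets, the synonym map, the threshold `min_corpora` and the function names are assumptions for illustration; frequency is counted here as the number of text corpora in which a cleaned phrase appears.

```python
from collections import Counter


def clean(phrase, stop_words, synonym_map):
    """Drop stop words and map each remaining word to its canonical synonym."""
    words = [synonym_map.get(w, w) for w in phrase.lower().split() if w not in stop_words]
    return " ".join(words)


def screen_topics(phrases_per_corpus, stop_words, synonym_map, min_corpora=2):
    """Keep cleaned phrases that appear in at least `min_corpora` text corpora."""
    corpus_freq = Counter()
    for phrases in phrases_per_corpus:
        cleaned = {clean(p, stop_words, synonym_map) for p in phrases}
        cleaned.discard("")  # phrases made up only of stop words vanish
        corpus_freq.update(cleaned)
    return sorted(p for p, n in corpus_freq.items() if n >= min_corpora)


# Hypothetical inputs: general stop words plus added domain-specific ones.
stop_words = {"the"} | {"video"}
synonym_map = {"cookery": "cooking"}
phrases_per_corpus = [["the home cooking", "video"], ["home cookery"], ["street food"]]
topics = screen_topics(phrases_per_corpus, stop_words, synonym_map)
```

Here "home cooking" survives because synonym normalization merges "cookery" into "cooking", giving the phrase two supporting corpora, while "street food" is filtered out.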

9. The method of any of claims 2-7, wherein identifying content phrases of the concerned subject from the extracted phrases comprises:

for the extracted compound phrases, screening the compound phrases according to the frequency with which the words they include occur in the existing knowledge graph, so as to determine the content phrases serving as the concerned subjects.

10. The method of claim 1, wherein:

the information object includes at least one of: pictures, audio and video;

the text corpus of the video comprises: video titles, video tags, video captions, video descriptions, user posting information for videos, search logs of videos, and user reviews of videos.

11. A method for constructing a knowledge graph, comprising:

acquiring a text corpus of at least one information object;

identifying content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, and, in the process of identifying the content phrases of the object entities, establishing an association relationship between an object entity phrase and a concerned subject phrase that conform to a context position relationship; the object side is used for describing the object entity and is a verb, a noun or a modifier matched with the object entity;

12. The method of claim 11, wherein extracting phrases from the text corpus comprises:

13. The method of claim 12, further comprising, after extracting the smallest unit phrase and the compound phrase from the corpus of text:

14. The method of claim 12, wherein extracting the smallest unit phrase from the corpus of text comprises:

15. The method of claim 14, wherein for each vocabulary, before determining the vocabulary as the minimum unit phrase extracted if the vocabulary collocation structure in the respective vocabulary sequence is stable, further comprising at least one of the following pre-processing:

removing stop words from the sequence of words;

16. The method of claim 12, wherein extracting compound phrases from the text corpus comprises:

performing part-of-speech tagging on the vocabulary in the text corpus;

17. The method of claim 13, wherein adjusting the vocabulary in the compound phrase to expand to generate other compound phrases according to the text corpus comprises:

18. The method of any one of claims 12-17, wherein identifying content phrases of the object entity from the extracted phrases comprises:

for a text corpus comprising the concerned subject phrase, identifying, based on a preset sentence template, a phrase that is in a context position relationship with the concerned subject phrase in a sentence of the text corpus, and taking the phrase as a content phrase of the object entity.
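
By way of non-limiting example, identifying an object entity by its position relative to a concerned subject phrase might be sketched with regular-expression sentence templates. The template text, the `{topic}` placeholder convention, the group name `entity` and the function name `find_entities` are assumptions for illustration, not templates taken from the present application.

```python
import re


def find_entities(sentences, topic, templates):
    """Return the phrases captured next to `topic` by any sentence template."""
    found = []
    for template in templates:
        # Fill the subject phrase into the template, escaping any regex characters.
        pattern = re.compile(template.format(topic=re.escape(topic)), re.IGNORECASE)
        for sentence in sentences:
            for match in pattern.finditer(sentence):
                found.append(match.group("entity"))
    return found


# Hypothetical template: the entity is the word right after "<topic> with".
templates = [r"{topic} with (?P<entity>\w+)"]
sentences = ["Easy cooking with tofu tonight", "Home cooking with eggplant", "I like tea"]
entities = find_entities(sentences, "cooking", templates)
```

Each captured entity could then be linked to the concerned subject phrase that anchored the match, which corresponds to the association relationship described above.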

19. The method of claim 11, wherein:

the information object includes at least one of: pictures, audio and video;

20. A method for constructing a knowledge graph, comprising:

acquiring a text corpus of at least one information object;

identifying content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, and, in the process of identifying the content phrases of the object sides, establishing an association relationship between an object entity phrase and an object side phrase that co-occur in the same phrase; the object side is used for describing the object entity and is a verb, a noun or a modifier matched with the object entity;

21. The method of claim 20, wherein extracting phrases from the text corpus comprises:

22. The method of claim 21, further comprising, after extracting the smallest unit phrase and the compound phrase from the corpus of text:

23. The method of claim 21, wherein extracting the smallest unit phrase from the corpus of text comprises:

24. The method of claim 23, wherein for each vocabulary, before determining the vocabulary as the minimum unit phrase extracted if the vocabulary collocation structure in the respective vocabulary sequence is stable, further comprising at least one of the following pre-processing:

removing stop words from the sequence of words;

25. The method of claim 21, wherein extracting compound phrases from the text corpus comprises:

performing part-of-speech tagging on the vocabulary in the text corpus;

26. The method of claim 22, wherein adjusting the vocabulary in the compound phrase to expand to generate other compound phrases according to the text corpus comprises:

27. The method of any one of claims 21-26, wherein identifying content phrases of the object side from the extracted phrases comprises:

determining, from the extracted compound phrases, a compound phrase comprising an object entity phrase;

for the compound phrase comprising the object entity phrase, determining, according to a preset phrase structure template, the phrases matched with the object entity phrase as candidate side phrases;

for the candidate side phrases, identifying their frequency of co-occurrence with the object entity phrase in each text corpus, and screening according to the co-occurrence frequency to determine the content phrases of the object sides of the object entity;
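
By way of non-limiting example, screening candidate side phrases by co-occurrence frequency might be sketched as follows. Substring matching, the threshold `min_cooccurrence` and the function name `screen_side_phrases` are assumptions for illustration; the candidates are taken as given, whereas the claim derives them from a phrase structure template.

```python
from collections import Counter


def screen_side_phrases(entity, candidates, corpora, min_cooccurrence=2):
    """Keep candidate side phrases that co-occur with `entity` in enough corpora."""
    counts = Counter()
    for text in corpora:
        lowered = text.lower()
        if entity in lowered:
            # Count a candidate once per corpus in which it appears with the entity.
            for candidate in candidates:
                if candidate in lowered:
                    counts[candidate] += 1
    return [c for c in candidates if counts[c] >= min_cooccurrence]


corpora = ["Tofu recipe for dinner", "Quick tofu recipe", "Tofu price today", "Chicken calories"]
sides = screen_side_phrases("tofu", ["recipe", "price", "calories"], corpora)
```

Only "recipe" reaches the threshold here: "price" co-occurs with the entity once, and "calories" never appears in a corpus that mentions the entity.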

28. The method of claim 20, wherein:

the information object includes at least one of: pictures, audio and video;

29. A method for constructing a knowledge graph, comprising:

acquiring a text corpus of at least one information object;

identifying content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, determining a concerned subject and an object entity that co-occur in a text corpus comprising the action event, and establishing an association relationship; the object side is used for describing the object entity and is a verb, a noun or a modifier matched with the object entity;

30. The method of claim 29, wherein extracting phrases from the text corpus comprises:

31. The method of claim 30, further comprising, after extracting the smallest unit phrase and the compound phrase from the corpus of text:

32. The method of claim 30, wherein extracting the smallest unit phrase from the corpus of text comprises:

33. The method of claim 32, wherein for each word, before determining the word as the minimum unit phrase extracted if the word collocation structure of the word in each word sequence is stable, further comprising at least one of the following pre-processing:

removing stop words from the sequence of words;

34. The method of claim 30, wherein extracting compound phrases from the text corpus comprises:

performing part-of-speech tagging on the vocabulary in the text corpus;

35. The method of claim 31, wherein adjusting the vocabulary in the compound phrase to expand to generate other compound phrases according to the text corpus comprises:

36. The method of any of claims 30-35, wherein identifying content phrases of action events from the extracted phrases comprises:

extracting verbs from the extracted phrases as content phrases of candidate action events;

and screening the content phrases of the candidate action events for quality by using a preset machine learning model, wherein the preset machine learning model is trained on a manually labeled text corpus of action events having independent semantics.
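
By way of non-limiting example, the two steps above might be sketched as follows. The trained model is represented by a hypothetical callable that returns a score between 0 and 1; `toy_model` is a stand-in rule used only so that the sketch runs, and it is not the machine learning model of the present application. The tag name `VERB` and the threshold are likewise assumptions.

```python
def candidate_events(tagged_phrases):
    """Collect verb phrases as content phrases of candidate action events."""
    return [phrase for phrase, tag in tagged_phrases if tag == "VERB"]


def screen_events(candidates, model, threshold=0.5):
    """Keep the candidates that the model scores at or above `threshold`."""
    return [c for c in candidates if model(c) >= threshold]


# Stand-in for a trained classifier: verbs with little independent
# meaning receive a low score (hypothetical rule).
LIGHT_VERBS = {"do", "make", "have"}


def toy_model(phrase):
    return 0.1 if phrase in LIGHT_VERBS else 0.9


tagged_phrases = [("marinate", "VERB"), ("do", "VERB"), ("tofu", "NOUN"), ("stir-fry", "VERB")]
events = screen_events(candidate_events(tagged_phrases), toy_model)
```

Passing the model in as a callable keeps the screening step independent of how the classifier is trained, so a model fitted on manually labeled action events could replace the stand-in without other changes.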

37. The method of claim 29, wherein:

the information object includes at least one of: pictures, audio and video;

38. An apparatus for constructing a knowledge graph, comprising:

an association relationship acquisition module, configured to identify content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, perform clustering identification according to the text corpora of the information objects, match the clustering results with preset superior subjects in a knowledge graph, and determine the association relationship between the information objects and the superior subjects;

39. An apparatus for constructing a knowledge graph, comprising:

an association relationship acquisition module, configured to identify content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, and, in the process of identifying the content phrases of the object entities, establish an association relationship between an object entity phrase and a concerned subject phrase that conform to a context position relationship; the object side is used for describing the object entity and is a verb, a noun or a modifier matched with the object entity;

and a knowledge graph updating module, configured to update the content phrases into point elements of the knowledge graph and to update the association relationship into edge elements of the knowledge graph.
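
By way of non-limiting example, updating content phrases into point elements and association relationships into edge elements might be sketched as follows. The class name `KnowledgeGraph`, the node type labels and the relation names are hypothetical; a production system would more likely write to a graph database.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)  # phrase -> point element type
    edges: set = field(default_factory=set)    # (source, relation, target)

    def update(self, content_phrases, associations):
        # Each content phrase becomes a point element; existing ones are kept.
        for phrase, node_type in content_phrases:
            self.nodes.setdefault(phrase, node_type)
        # Each association becomes an edge element, provided both ends exist.
        for source, relation, target in associations:
            if source in self.nodes and target in self.nodes:
                self.edges.add((source, relation, target))


graph = KnowledgeGraph()
graph.update(
    [("cooking", "concerned_subject"), ("tofu", "object_entity"), ("recipe", "object_side")],
    [("tofu", "belongs_to", "cooking"), ("tofu", "has_side", "recipe"), ("tofu", "has_side", "price")],
)
```

The last association is dropped because "price" was never added as a point element; storing edges in a set also makes repeated updates idempotent.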

40. An apparatus for constructing a knowledge graph, comprising:

an association relationship acquisition module, configured to identify content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, and, in the process of identifying the content phrases of the object sides, establish an association relationship between an object entity phrase and an object side phrase that co-occur in the same phrase; the object side is used for describing the object entity and is a verb, a noun or a modifier matched with the object entity;

41. An apparatus for constructing a knowledge graph, comprising:

an association relationship acquisition module, configured to identify content phrases of concerned subjects, object entities, object sides and action events from the extracted phrases, determine a concerned subject and an object entity that co-occur in a text corpus comprising the action event, and establish an association relationship; the object side is used for describing the object entity and is a verb, a noun or a modifier matched with the object entity;

42. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10 or 11-19 or 20-28 or 29-37.

43. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10 or 11-19 or 20-28 or 29-37.