CN109977198B - Method and device for establishing a mapping relationship, hardware device, and computer-readable medium

Publication number: CN109977198B
Authority: CN (China)
Authority
CN
China
Prior art keywords
unstructured data
entity
target
topic
candidate
Legal status: Active
Application number
CN201910257829.2A
Other languages
Chinese (zh)
Other versions
CN109977198A (en
Inventor
李千
史亚冰
梁海金
张扬
朱勇
Current Assignee / Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910257829.2A
Publication of CN109977198A
Application granted
Publication of CN109977198B

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method for establishing a mapping relationship, the method comprising: determining a target topic of unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set; extracting at least one target entity from the unstructured data, and establishing a target entity set according to the target entity; and establishing a mapping relationship between the target topic and the target entity set. The disclosure also provides a device for establishing a mapping relationship, a hardware device, and a computer-readable medium.

Description

Method and device for establishing a mapping relationship, hardware device, and computer-readable medium
Technical Field
The embodiments of the present disclosure relate to the technical field of databases, and in particular to a method and a device for establishing a mapping relationship, a hardware device, and a computer-readable medium.
Background
When a user searches for or asks a question about an entity set, the entities corresponding to the entity set need to be recommended to the user. The entities corresponding to an entity set can currently be obtained in the following ways:
1) associating entities with the entity set manually; however, this method requires human participation, is time-consuming, and has low accuracy;
2) obtaining the correspondence between the entity set and entities by analyzing structured data; however, this method only works for structured data with a specific structure and is not applicable to unstructured data, semi-structured data, or structured data with a different structure.
Disclosure of Invention
The embodiments of the present disclosure provide a method and a device for establishing a mapping relationship, a hardware device, and a computer-readable medium.
In a first aspect, an embodiment of the present disclosure provides a method for establishing a mapping relationship, including:
determining a target topic of the unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set;
extracting at least one target entity from the unstructured data, and establishing a target entity set according to the target entity;
and establishing a mapping relation between the target topic and the target entity set.
In some embodiments, the unstructured data is unstructured data of an information introduction class;
the determining the target topic of the unstructured data according to the original topic of the unstructured data comprises: and extracting a target topic from the original topics of the unstructured data according to a preset regular matching model.
In some embodiments, the original topic of the unstructured data comprises preset keywords.
In some embodiments, the unstructured data is question-answer like unstructured data;
the determining the target topic of the unstructured data according to the original topic of the unstructured data comprises: matching the original topic of the unstructured data with a preset part-of-speech sequence template, wherein the part-of-speech sequence template comprises at least one noun; extracting parts corresponding to the preset nouns in the part-of-speech sequence template from the original topics to obtain at least two candidate topics; and determining the correlation degree of each candidate topic and the original topic, and taking the candidate topic with the maximum corresponding correlation degree as a target topic.
In some embodiments, said extracting at least one target entity from said unstructured data comprises:
identifying an entity in the unstructured data;
screening at least one candidate entity from the identified entities according to the distribution characteristics of the identified entities in the unstructured data;
and screening at least one target entity from the candidate entities according to the correlation degree of the candidate entities and the unstructured data.
In some embodiments, the unstructured data comprises an inventory block comprising consecutive first and second paragraphs, the first paragraph having a length less than or equal to a first threshold, the second paragraph having a length greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the identifying the entity in the unstructured data comprises: identifying an entity in a first paragraph of the inventory block;
the distribution characteristics of the entities in the unstructured data include distribution characteristics of the entities throughout the unstructured data and distribution characteristics of the entities in the inventory blocks.
In some embodiments, there is no inventory block in the unstructured data, the inventory block comprising consecutive first and second paragraphs, the length of the first paragraph being less than or equal to a first threshold, the length of the second paragraph being greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the identifying the entity in the unstructured data comprises: identifying an entity in the unstructured data;
the distribution characteristics of the entity in the unstructured data include distribution characteristics of the entity throughout the unstructured data.
In some embodiments, the screening at least one target entity from the candidate entities according to the degree of correlation of the candidate entities with the unstructured data comprises:
obtaining a first vector according to the relevant information of the candidate entity in a preset first database, and obtaining a second vector according to the unstructured data;
calculating the similarity of the first vector and the second vector;
calculating a score of the candidate entity according to the similarity, the heat of the candidate entity, the authority coefficient of the first database and the position coefficient of the candidate entity in the unstructured data;
and screening at least one target entity from each candidate entity according to the score of each candidate entity.
In some embodiments, after the establishing the mapping relationship between the target topic and the target entity set, the method further includes:
and adding the mapping relation into a second database.
In a second aspect, an embodiment of the present disclosure provides an apparatus for establishing a mapping relationship, including:
the target topic determining unit is used for determining a target topic of the unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set;
the target entity extraction unit is used for extracting at least one target entity from the unstructured data and establishing a target entity set according to the target entity;
and the mapping relation establishing unit is used for establishing the mapping relation between the target topic and the target entity set.
In some embodiments, the unstructured data is unstructured data of an information introduction class;
the target topic determination unit is configured to: and extracting a target topic from the original topics of the unstructured data according to a preset regular matching model.
In some embodiments, the original topic of the unstructured data comprises preset keywords.
In some embodiments, the unstructured data is question-answer like unstructured data;
the target topic determination unit is configured to: matching the original topic of the unstructured data with a preset part-of-speech sequence template, wherein the part-of-speech sequence template comprises at least one noun; extracting parts corresponding to the preset nouns in the part-of-speech sequence template from the original topics to obtain at least two candidate topics; and determining the correlation degree of each candidate topic and the original topic, and taking the candidate topic with the maximum corresponding correlation degree as a target topic.
In some embodiments, the target entity extraction unit comprises:
an entity identification subunit, configured to identify an entity in the unstructured data;
a candidate entity screening subunit, configured to screen at least one candidate entity from the identified entities according to distribution characteristics of the identified entities in the unstructured data;
and the target entity screening subunit is used for screening at least one target entity from the candidate entities according to the correlation degree of the candidate entities and the unstructured data.
In some embodiments, the unstructured data comprises an inventory block comprising consecutive first and second paragraphs, the first paragraph having a length less than or equal to a first threshold, the second paragraph having a length greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the entity identification subunit is used for: identifying an entity in a first paragraph of the inventory block;
the distribution characteristics of the entities in the unstructured data include distribution characteristics of the entities throughout the unstructured data and distribution characteristics of the entities in the inventory blocks.
In some embodiments, there is no inventory block in the unstructured data, the inventory block comprising consecutive first and second paragraphs, the length of the first paragraph being less than or equal to a first threshold, the length of the second paragraph being greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the entity identification subunit is used for: identifying an entity in the unstructured data;
the distribution characteristics of the entity in the unstructured data include distribution characteristics of the entity throughout the unstructured data.
In some embodiments, the target entity screening subunit is to:
obtaining a first vector according to the relevant information of the candidate entity in a preset first database, and obtaining a second vector according to the unstructured data;
calculating the similarity of the first vector and the second vector;
calculating a score of the candidate entity according to the similarity, the heat of the candidate entity, the authority coefficient of the first database and the position coefficient of the candidate entity in the unstructured data;
and screening at least one target entity from each candidate entity according to the score of each candidate entity.
In some embodiments, the means for establishing a mapping relationship further comprises: and the adding unit is used for adding the mapping relation into a second database.
In a third aspect, an embodiment of the present disclosure provides a hardware device, including:
one or more processors;
a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement any of the above-described methods of establishing a mapping relationship.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium having a computer program stored thereon, wherein,
the program, when executed by a processor, implements any of the above-described methods for establishing a mapping relationship.
In the method for establishing a mapping relationship according to the embodiments of the present disclosure, a target topic representing an entity set and the target entities corresponding to that entity set can be extracted from unstructured data, so as to determine which entities the entity set corresponds to; this can then be used to respond to user searches, recommend information to users, enrich a knowledge graph, and so on. Moreover, the method is automated, does not rely on manual work, and has high efficiency and accuracy. In addition, the method can process a large amount of unstructured data and is not limited to structured data with a specific structure, so it has a wide application range and can make full use of existing data resources.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
fig. 1 is a flowchart of a method for establishing a mapping relationship according to an embodiment of the present disclosure;
fig. 2 is a flowchart of another method for establishing a mapping relationship according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of an apparatus for establishing a mapping relationship according to an embodiment of the disclosure;
fig. 4 is a block diagram of another apparatus for establishing a mapping relationship according to an embodiment of the disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the method, device, hardware device, and computer-readable medium for establishing a mapping relationship provided by the present disclosure are described in detail below with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but which may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure.
The disclosed embodiments are not limited to the embodiments shown in the drawings, but include modifications of configurations formed based on a manufacturing process. Thus, the regions illustrated in the figures have schematic properties, and the shapes of the regions shown in the figures illustrate specific shapes of regions of elements, but are not intended to be limiting.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The following is a brief introduction to the terminology set forth in the disclosure.
An entity (or concept) refers to an actual physical body or abstract concept existing or ever existing in the real world, such as a person, an article, a structure, a product, a building, a place, a country, an organization, an event, an art work, a scientific technology, a scientific theorem, and the like.
An entity set is a concept representing a set made up of a plurality of entities. The entities corresponding to an entity set may not be completely determined: for example, for the entity set "important campaigns of World War II" or "famous English scientists", different people may give different corresponding entities. The entities corresponding to an entity set may also be determined: for example, the entity set "eight planets in the solar system" is generally considered to correspond to eight specific planet entities.
A knowledge graph is a database representing the relationships between different entities and the attributes of those entities. In a knowledge graph, entities are nodes; entities are connected to each other by edges, and entities are connected by edges to the values of their attributes (attribute-value pairs), forming a structured, network-shaped database. A connection (edge) between two entities represents a relationship between them, for example, the entity Zhang San is the father of the entity Li Si; a connection (edge) between an entity and one of its attribute values indicates that a certain attribute of the entity has a certain value, for example, the height attribute of Zhang San (a person) has a value of 172 cm.
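As a rough illustration only (not part of the patent), the node-edge structure described above can be thought of as a set of triples. The Python sketch below uses made-up entities and edge labels.

```python
# Toy illustration of the knowledge-graph structure described above: entities
# as nodes, and edges that either link two entities (a relation) or link an
# entity to an attribute value. The triples are made-up examples.
knowledge_graph = [
    # (subject entity, edge label, object entity or attribute value)
    ("Zhang San", "father_of", "Li Si"),   # entity-to-entity edge (relation)
    ("Zhang San", "height", "172 cm"),     # entity-to-attribute-value edge
]

def neighbors(graph, entity):
    """All edges touching an entity: how a node's relations and attributes are read off."""
    return [(s, p, o) for s, p, o in graph if entity in (s, o)]

print(neighbors(knowledge_graph, "Zhang San"))
```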
Data, which refers to relatively independent, digitized information that carries a certain amount of information. For example, a piece of data may be an article, a web page (or page), and so on.
Unstructured data is data whose parts (such as a title, a table of contents, etc.) are not delimited by a data structure; from the data-structure point of view, the parts of unstructured data are therefore indistinguishable. Of course, unstructured data can be divided into sentences, paragraphs, etc. from a textual perspective, but such divisions are not part of a data structure, since their separators are themselves part of the text.
The database is a data set formed by one or more data according to a certain form. All data in one database may be stored together centrally, e.g., the database may be a knowledge graph stored on a particular server; alternatively, all data in a database may be stored in a distributed manner, for example, if a plurality of web pages stored on different servers are classified by the index of a search engine and are thus retrieved by the search engine, they are also a database.
Fig. 1 is a flowchart of a method for establishing a mapping relationship according to an embodiment of the present disclosure.
in a first aspect, an embodiment of the present disclosure provides a method for establishing a mapping relationship, with reference to fig. 1, including:
s100, determining a target topic of the unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set.
The specific form of the unstructured data is various, such as web pages, articles, and the like, which can be divided into different sentences and paragraphs from the perspective of text, but the parts of the unstructured data are not different from each other from the perspective of data structure, and thus are unstructured.
For unstructured data, a definite "title" cannot be derived from the data structure; its original topic can therefore only be identified from the text. For example, the text of the first centered paragraph of the unstructured data may be taken as its original topic, or the first sentence or first paragraph of the unstructured data may be taken as its original topic. Of course, the original topic determined in this way may or may not be the title originally intended for the unstructured data.
Since the original topics of the unstructured data have various forms and no unified standard, the step needs to extract a target topic representing a certain entity set from the original topics.
S200, extracting at least one target entity from the unstructured data, and establishing a target entity set according to the target entity.
There must be many entities in the unstructured data, some of which correspond to the above entity set of the target topic, and this step needs to extract these entities (target entities) and compose the target entity set.
S300, establishing a mapping relation between the target topic and the target entity set.
And establishing a mapping relation between the target topic and the target entity set, namely indicating that the target entity corresponds to the entity set of the target topic.
In the method for establishing a mapping relationship according to the embodiments of the present disclosure, a target topic representing an entity set and the target entities corresponding to that entity set can be extracted from unstructured data, so as to determine which entities the entity set corresponds to; this can then be used to respond to user searches, recommend information to users, enrich a knowledge graph, and so on. Moreover, the method is automated, does not rely on manual work, and has high efficiency and accuracy. In addition, the method can process a large amount of unstructured data and is not limited to structured data with a specific structure, so it has a wide application range and can make full use of existing data resources.
Fig. 2 is a flowchart of another method for establishing a mapping relationship according to an embodiment of the present disclosure.
Referring to fig. 2, the method for establishing a mapping relationship according to the embodiment of the present disclosure may specifically include the following steps:
s100, determining a target topic of the unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set.
The specific form of the unstructured data is various, such as web pages, articles, and the like, which can be divided into different sentences and paragraphs from the perspective of text, but the parts of the unstructured data are not different from each other from the perspective of data structure, and thus are unstructured.
For unstructured data, a definite "title" cannot be derived from the data structure; its original topic can therefore only be identified from the text. For example, the text of the first centered paragraph of the unstructured data may be taken as its original topic, or the first sentence or first paragraph of the unstructured data may be taken as its original topic. Of course, the original topic determined in this way may or may not be the title originally intended for the unstructured data.
Before the unstructured data is processed, the original topics can be filtered to remove unstructured data that does not meet the requirements and to reduce the amount of data to be processed.
For example, unstructured data that is not processed includes:
(a) unstructured data whose original topic includes illegal content such as pornography or politically sensitive material. For example, a pre-trained SVM (Support Vector Machine) model can be applied to the word vector of the original topic to evaluate the likelihood of illegal content.
(b) unstructured data whose original topic matches an accumulated negative example (i.e., unstructured data with the same original topic was previously found to be unqualified).
Before the unstructured data is processed, the original topic can also be preprocessed to remove special symbols, redundant words, quantifiers, words with low information content, and the like.
In this step, a target topic representing an entity set needs to be extracted from the original topics, and the specific extraction mode may be different according to different types of unstructured data.
The type of unstructured data can be identified by classifying its source website, or by analyzing its original topic; this is not described in detail here.
In some embodiments, for unstructured data of the information introduction class, this step (S100) may include: and extracting a target topic from the original topics of the unstructured data according to a preset regular matching model.
"Unstructured data of the information introduction class", also referred to as "news-like unstructured data", refers to data written by a person to introduce or comment on a subject, such as a news web page or a news article.
The original topics of information-introduction unstructured data are usually relatively regular and typically contain the corresponding entity set at a particular position. A regular-expression matching model can therefore be preset to extract a specific part of the original topic as the target topic (entity set). For example, for the original topic "Inventory the important campaigns of World War II", the target topic "important campaigns of World War II" is extracted by regular matching.
In some embodiments, the original topic of unstructured data comprises preset keywords.
Since not all information-introduction unstructured data introduces an entity set, it is not necessary to process all of it; only unstructured data whose original topic contains specific keywords, and which is therefore likely to introduce an entity set, is processed. For example, the keywords may include "inventory", "enumerate", "top ten", etc., since such keywords are commonly used when an entity set is being introduced; filtering by keywords thus reduces the amount of data to be processed and improves processing accuracy.
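As a hedged illustration only (not the patent's actual rules), the sketch below combines the keyword filter with a regular-expression extraction of the target topic; the keyword list and the pattern are assumptions written for English titles.

```python
import re

# Hypothetical keyword list and extraction pattern; both are assumptions.
KEYWORDS = ("inventory", "enumerate", "top ten")

# Assumed pattern: an optional lead-in verb such as "inventory"/"introducing",
# followed by the entity-set phrase kept as the target topic.
TOPIC_PATTERN = re.compile(
    r"^(?:inventory|introducing|a look at)\s+(?:the\s+)?(?P<topic>.+)$", re.I)

def extract_target_topic(original_topic: str):
    """Return the target topic (entity set) or None if the title does not qualify."""
    if not any(k in original_topic.lower() for k in KEYWORDS):
        return None  # skip data unlikely to introduce an entity set
    m = TOPIC_PATTERN.match(original_topic.strip())
    return m.group("topic") if m else None

print(extract_target_topic("Inventory the important campaigns of World War II"))
# -> "important campaigns of World War II"
```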
In some embodiments, for question-and-answer unstructured data, this step (S100) may include: matching the original topic of the unstructured data with a preset part-of-speech sequence template, wherein the part-of-speech sequence template comprises at least one noun; extracting the parts corresponding to predetermined nouns in the part-of-speech sequence template from the original topic to obtain at least two candidate topics; and determining the correlation degree of each candidate topic with the original topic, and taking the candidate topic with the maximum corresponding correlation degree as the target topic.
"Question-and-answer unstructured data" refers to data formed when a person asks a question on the internet and another person answers it, such as a web page of a question-and-answer website.
The original topic of question-and-answer unstructured data is usually the question itself. Compared with information-introduction unstructured data, the original questions of question-and-answer unstructured data are more varied and less standardized in form; for example, the question may be "Who can tell me which are the important campaigns of World War II?", "May I ask what the important campaigns of World War II were?", or "Can you introduce the important campaigns of World War II to me?", and so on. It is therefore difficult to extract the target topic directly from the original topic with a uniform rule.
For this purpose, templates composed of words of specific parts of speech in a specific order, such as a "noun-auxiliary-noun (n-u-n) template", an "auxiliary-noun (u-n) template", a "noun (n) template", and the like, may be set in advance as part-of-speech sequence templates. The part of speech of each part of the original topic is then analyzed and matched against the part-of-speech sequence templates, so as to extract the parts of the original topic corresponding to particular nouns in the templates (since an entity set is necessarily a noun phrase) and take them as candidate topics that may be the entity set.
Because the original topics of question-and-answer unstructured data are so varied, matching an original topic against one part-of-speech sequence template in different ways can produce different candidate topics, and matching it against different templates can also produce different candidate topics. For example, if the original topic "Who can tell me which are the important campaigns of World War II?" is matched against the "n-u-n" (noun-auxiliary-noun) template, the candidate topic may be "important campaigns of World War II", "World War II", "important campaigns", and so on. If the same original topic is matched against the noun (n) template, the resulting candidate topics may also include "who", "me", and so on.
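A minimal sketch of this template matching, assuming the original topic has already been tokenized and POS-tagged; the tags, templates, and example tokens are illustrative assumptions, not the patent's actual tag set.

```python
# Part-of-speech sequence template matching over pre-tagged (word, tag) pairs.
TEMPLATES = [("n", "u", "n"), ("u", "n"), ("n",)]

def candidate_topics(tagged_tokens):
    """Yield candidate topics: the text spans whose tag sequence matches a template."""
    tags = [t for _, t in tagged_tokens]
    for tpl in TEMPLATES:
        for i in range(len(tags) - len(tpl) + 1):
            if tuple(tags[i:i + len(tpl)]) == tpl:
                # keep the whole matched span as one candidate topic
                yield " ".join(w for w, _ in tagged_tokens[i:i + len(tpl)])

# "who can tell me which are the important campaigns of world war ii"
tagged = [("who", "r"), ("can", "v"), ("tell", "v"), ("me", "r"),
          ("which", "r"), ("are", "v"), ("important campaigns", "n"),
          ("of", "u"), ("world war ii", "n")]
print(set(candidate_topics(tagged)))
# {"important campaigns of world war ii", "of world war ii",
#  "important campaigns", "world war ii"}
```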
Obviously, only one of these candidate topics is the entity set that is actually wanted ("important campaigns of World War II"). For this purpose, the correlation degree between each candidate topic and the original topic must also be compared, so as to find the most relevant candidate topic (i.e., the candidate topic that best represents the intent of the original topic) as the target topic.
For example, the relevance of each candidate topic to the original topic can be ranked by a pair-wise learning to rank model obtained by pre-training, so as to find the most relevant candidate topic as the target topic. The pair-wise learning to rank model is one of learning to rank models, and the learning to rank model is an existing model for ranking the correlation degrees between different data.
For example, the learning to rank model may rank the relevance of candidate topics by their following features:
(a) Grammatical features: an n-gram model is trained on existing data (such as data in the same domain as the original topic) to obtain the probability of generating the candidate topic from the words (terms) of the original topic, the perplexity (i.e., the coherence of the word sequence), the probability that the first word of the candidate topic is also the first word of the original topic, and so on. The n-gram model is an existing model for calculating the probability that multiple words are combined (ordered) in different ways.
(b) Syntactic characteristics: training a Dependency syntax (Dependency Parsing) template by using existing data (such as data belonging to the same field as the original topic), and scoring the Dependency sequence of the candidate topic; wherein the dependency syntax template is an existing template for describing dependency relationships between multiple words.
(c) Syntactic features: the original topic is subjected to trunk (main-stem) analysis, and a retention score for each candidate topic (i.e., a score representing how well syntactic properties are retained) is obtained from the rank and weight of the words of the original topic.
(d) Semantic features: an existing topic (such as a topic in the same domain as the original topic) is used as a positive sample; the order and weight of each word of the original question are extracted, the weights of words with little information content are increased, and the weights of the main words are decreased, to form a negative sample; a word vector is trained on existing data (such as data in the same domain as the original topic); the word vectors of all words of the positive and negative samples are weighted and summed to obtain a positive-sample vector and a negative-sample vector; and a predicted score for the candidate topic is obtained through a Gradient Boosting Decision Tree (GBDT) model, where the GBDT model is an existing model comprising multiple decision trees and used for classification and regression.
(e) Semantic features: training a word vector by using the existing data (such as the data belonging to the same field as the original topic), constructing semantic vectors of the original topic and the candidate topics according to the ranking and weighting, and calculating the similarity of the semantic vectors as a semantic retention score (namely, a score representing the retention degree of semantic properties).
(f) Information content: the word frequency of each word in existing data (such as data in the same domain as the original topic) is counted; single-character words, high-frequency words, stop words, and the like are filtered out to obtain a number of information words in the domain; for each information word, several (e.g., 50) neighboring words are obtained through the word vectors, their weighted sum (using similarity as the weight) is computed, and its logarithm is taken as the information content of the information word; the information content of the information words is then spread to all neighboring words using similarity as the weight, so as to enlarge the set of information words and reduce sparsity; and the sum of the information content of all words in a candidate topic is taken as the information-content score of the candidate topic.
(g) Other features: for example, the length of the candidate topic, and so on.
Of course, there are various ways to compare the relevance of a candidate topic to the original topic, and they are not limited to the models and features above.
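As one hedged example of feature (e) above, the sketch below computes a semantic-retention score as the cosine similarity between (optionally weighted) word-vector sums of the original topic and a candidate topic; the tiny word-vector table is a dummy stand-in for trained vectors.

```python
import math

# Dummy 3-dimensional word vectors; real vectors would be trained on domain data.
WORD_VECTORS = {
    "important": [0.9, 0.1, 0.0],
    "campaigns": [0.8, 0.2, 0.1],
    "world":     [0.1, 0.9, 0.2],
    "war":       [0.2, 0.8, 0.3],
    "who":       [0.0, 0.1, 0.9],
}

def topic_vector(words, weights=None):
    """Weighted sum of the word vectors of the words found in the table."""
    dim = 3
    vec = [0.0] * dim
    for w in words:
        wt = (weights or {}).get(w, 1.0)
        for i, x in enumerate(WORD_VECTORS.get(w, [0.0] * dim)):
            vec[i] += wt * x
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

original = ["who", "important", "campaigns", "world", "war"]
candidate = ["important", "campaigns", "world", "war"]
print(cosine(topic_vector(original), topic_vector(candidate)))  # semantic-retention score
```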
S200, extracting at least one target entity from the unstructured data, and establishing a target entity set according to the target entity.
There must be many entities in the unstructured data, some of which correspond to the entity set of the above target topic, and this step needs to extract these entities (target entities) and compose the target entity set.
In some embodiments, the step (S200) may include:
s201, identifying entities in the unstructured data.
S202, screening at least one candidate entity from the identified entities according to the distribution characteristics of the identified entities in the unstructured data.
Obviously, the number of entities included in the unstructured data is very large, and they cannot be all target entities, so all or part of the entities in the unstructured data can be extracted first, and then the extracted entities are subjected to preliminary screening according to the distribution rules (distribution characteristics) of the entities in the unstructured data, so as to obtain candidate entities which may be the target entities.
For example, candidate entities may be determined in different ways depending on whether the unstructured data contains an inventory block. An inventory block comprises a first paragraph and a second paragraph that are consecutive, where the length of the first paragraph is less than or equal to a first threshold, the length of the second paragraph is greater than or equal to a second threshold, and the second threshold is greater than the first threshold. That is, an inventory block is a piece of text in which a short paragraph is followed by at least one long paragraph; the short paragraph is short (e.g., fewer than 20 characters) and the long paragraph is long (e.g., more than 100 characters). The short paragraph is often a summarizing "subtitle", and the long paragraph is the content under that subtitle. Because unstructured data is not divided by a data structure, the inventory block can be identified from the length of each paragraph.
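A minimal sketch of inventory-block detection under the assumptions above, treating the data as a list of plain-text paragraphs; the 20- and 100-character thresholds are the illustrative values from the text, not fixed by the patent.

```python
FIRST_THRESHOLD = 20    # max length of the short "subtitle" paragraph (assumption)
SECOND_THRESHOLD = 100  # min length of the long "content" paragraph (assumption)

def find_inventory_blocks(paragraphs):
    """Return (short_index, long_index) pairs: a short paragraph directly followed by a long one."""
    blocks = []
    for i in range(len(paragraphs) - 1):
        if len(paragraphs[i]) <= FIRST_THRESHOLD and len(paragraphs[i + 1]) >= SECOND_THRESHOLD:
            blocks.append((i, i + 1))
    return blocks

paragraphs = [
    "Battle of Stalingrad",                        # short subtitle paragraph
    "The Battle of Stalingrad was " + "x" * 120,   # long content paragraph
    "A closing remark.",
]
print(find_inventory_blocks(paragraphs))  # -> [(0, 1)]
```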
Illustratively, in some embodiments, when the unstructured data includes an inventory block, the above step S201 includes: identifying an entity in a first paragraph of an inventory block; and the distribution characteristics of the entities in the unstructured data in the above step S202 include the distribution characteristics of the entities in the unstructured data overall and the distribution characteristics of the entities in the inventory block.
When there is an inventory block, the entity corresponding to the entity set is often located in each subtitle, i.e., in the first paragraph (short paragraph) of the inventory block, so that the entity can be extracted from only the first paragraph of the inventory block. The entity extracted at this time has integral distribution characteristics in the unstructured data overall text and specific distribution characteristics in the inventory block, so that whether the entity belongs to a candidate entity can be judged according to the two characteristics.
For example, the distribution characteristics of the above entities throughout the unstructured data may include one or more of the following:
(a) the frequency of an entity in a paragraph in a particular format refers to the ratio of the number of times an entity appears in a paragraph in a particular format to the number of times an entity appears in the unstructured data text. For example, if an entity occurs 11 times in the full unstructured data, where 3 times occur in a paragraph of a particular format, the frequency of the entity in the paragraph of the particular format is 3/11. The specific format may be bold, italic, underlined, etc., which generally represents a more important paragraph.
(b) The word count frequency of an entity in an entire page refers to the ratio of the total number of words of the entity to the total number of words of the full text of unstructured data. For example, if an entity appears 11 times in the full text of unstructured data, each entity has 3 words, and the full text of unstructured data has 9200 words, the frequency of words of the entity in the entire page is 33/9200.
(c) The frequency of entities in short paragraphs of a page is a ratio of the number of occurrences of an entity in a short paragraph (e.g., a paragraph with a number of characters less than 20) to the number of occurrences of an entity in the full text of unstructured data. For example, if an entity occurs 11 times in the full unstructured data, 4 of which are in short paragraphs, the frequency of the entity in the short paragraphs of the page is 4/11.
(d) The segment frequency of entity occurrence refers to the ratio of the number of segments with entity occurrence to the total number of segments in the unstructured data full text. For example, if there are 30 paragraphs in the unstructured data corpus, and there are entities in 8 paragraphs, the frequency of occurrence of the entities is 8/30.
(e) The frequency with which an entity appears at the beginning of a paragraph refers to the ratio of the number of times an entity appears at the beginning of a paragraph (which may be the first word, or the first 5 words, etc.) to the number of times an entity appears in the unstructured data text. For example, if an entity occurs 11 times in the full unstructured data, 2 of which occur at the beginning of a paragraph, the entity occurs at the beginning of the paragraph with a frequency of 2/11.
(f) The frequency with which an entity appears at a non-beginning of a paragraph refers to the ratio of the number of times the entity appears in the non-beginning part of a paragraph to the number of times the entity appears in the unstructured data text. For example, if an entity occurs 11 times in total in the full unstructured data, 9 of which occur in the non-beginning part of a paragraph, then the frequency of the entity at a non-beginning is 9/11.
(g) The frequency of the most frequent occurrence of an entity in a paragraph refers to the proportion of paragraphs in which the entity occurs most frequently. For example, if an entity appears in 7 paragraphs, 3 times in 1 of them and fewer than 3 times in the others, the frequency with which the entity appears most often in a paragraph is 1/7.
(h) The frequency of non-maximum occurrence of an entity in a paragraph refers to the proportion of paragraphs in which the entity does not occur most frequently. For example, if an entity occurs in 7 paragraphs, 3 times in 1 of them and fewer than 3 times in the others, the frequency of non-maximum occurrence of the entity in a paragraph is 6/7.
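A minimal sketch computing two of the whole-text distribution features above for a single entity, namely (d) the fraction of paragraphs containing the entity and (e) the fraction of its occurrences at the beginning of a paragraph (taken here as the first five words, an assumption).

```python
def distribution_features(entity, paragraphs):
    """Compute features (d) and (e) for one entity over plain-text paragraphs."""
    total_occurrences = sum(p.count(entity) for p in paragraphs)
    paragraphs_with_entity = sum(1 for p in paragraphs if entity in p)
    at_beginning = sum(1 for p in paragraphs if entity in " ".join(p.split()[:5]))
    return {
        "paragraph_frequency": paragraphs_with_entity / len(paragraphs) if paragraphs else 0.0,
        "beginning_frequency": at_beginning / total_occurrences if total_occurrences else 0.0,
    }

paras = ["Normandy landings opened a second front.",
         "The landings at Normandy were decisive.",
         "Other fronts saw heavy fighting."]
print(distribution_features("Normandy", paras))
```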
Accordingly, the distribution of entities is also related to the inventory block, and, for example, the distribution characteristics of entities in the inventory block may include one or more of the following:
(a) the frequency of an entity in a particular format passage of an inventory block refers to the ratio of the number of times an entity appears in a particular format passage of an inventory block to the number of times an entity appears in an inventory block.
(b) The frequency of the number of words of an entity in a paragraph of an inventory block is the ratio of the total number of words of the entity in a paragraph of the inventory block to the total number of words of the paragraph.
(c) The frequency of an entity in a paragraph of an inventory block refers to the ratio of the number of times the entity appears in a given paragraph of the inventory block to the total number of times the entity appears in the unstructured data.
(d) Whether an entity appears at the beginning of a paragraph refers to whether an entity appears at the beginning of a paragraph (which may be the first word, or the first 5 words, etc.) of an inventory block.
(e) The frequency of an entity in the inventory block refers to the ratio of the number of times the entity appears in the inventory block to the total number of times it appears in the unstructured data.
Illustratively, in some embodiments, when the unstructured data has no inventory block, the above step S201 includes: identifying entities in the unstructured data; and the distribution characteristics of the entities in the unstructured data in the above step S202 include the distribution characteristics of the entities in the unstructured data overall.
When there is no inventory block, the entity needs to be identified from the full text of the unstructured data, and only the distribution characteristics of the entity throughout the unstructured data are used to determine the candidate entities.
The candidate entities may be judged by a plurality of preset models, including but not limited to a naive Bayes model, a KNN (k-Nearest Neighbor) model, an LR (Logistic Regression) model, an RF (Random Forest) model, a DT (Decision Tree) model, an SVM (Support Vector Machine) model, a GBDT (Gradient Boosting Decision Tree) model, an ensemble model, and the like.
The specific manner of determination by the above model is also various. For example, a model may be used to determine which entities belong to candidate entities. For another example, the entities may be scored by using a plurality of models, the scoring results of the models are normalized (i.e., the scoring of each model is converted into a value between 0 and 1), corresponding weights are set for the models, and finally the scores of the models are multiplied by the weights and added, so as to determine whether the entities are candidate entities according to the total scores obtained by weighting.
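As a hedged sketch of the multi-model combination just described (normalize each model's scores to [0, 1], multiply by per-model weights, and sum), with toy stand-in models and assumed weights:

```python
def normalize(scores):
    """Rescale a dict of raw scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {e: (s - lo) / span for e, s in scores.items()}

def ensemble_scores(entities, models, weights):
    """models: callables entity -> raw score; weights: one weight per model."""
    total = {e: 0.0 for e in entities}
    for model, w in zip(models, weights):
        normalized = normalize({e: model(e) for e in entities})
        for e in entities:
            total[e] += w * normalized[e]
    return total

entities = ["Battle of Stalingrad", "weather", "Battle of Midway"]
models = [lambda e: len(e),               # toy stand-in for model 1
          lambda e: e.count("Battle")]    # toy stand-in for model 2
print(ensemble_scores(entities, models, weights=[0.6, 0.4]))
```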
S203, screening at least one target entity from the candidate entities according to the correlation degree of the candidate entities and the unstructured data.
After obtaining a plurality of candidate entities, continuing to analyze the correlation of the candidate entities with the unstructured data, wherein the candidate entities with higher correlation with the unstructured data are more likely to be target entities.
In some embodiments, the step (S203) may include:
s2031, a first vector is obtained according to the relevant information of the candidate entity in a preset first database, and a second vector is obtained according to the unstructured data.
To obtain more information about the candidate entity, the first vector of the candidate entity can be derived from its related information in an existing first database. The first database may be a database that introduces entities, such as a knowledge encyclopedia or a knowledge graph, and contains related information about the entity (such as attribute-value information, related introductory text, and the like).
For example, the first vector may include a plurality of attribute parameters (or items), each of which may have a value. The selection and specific value of the attribute parameter are obtained according to the related information of the candidate entity in the first database, and the specific calculation method can be various.
For example, when the first database is structured data such as a knowledge graph, the attributes, types, etc. related to the candidate entity in it may be selected as candidate attribute parameters; when the first database is unstructured or semi-structured data such as a knowledge encyclopedia, candidate attribute parameters can be extracted from it through existing TF-IDF (term frequency–inverse document frequency) templates, TextRank (text ranking) templates, part-of-speech matching templates, and the like. Each candidate attribute parameter can then be filtered (single-character word filtering, stop-word filtering, special-symbol filtering, etc.), and the candidate attribute parameters remaining after filtering are used as the attribute parameters of the first vector.
For another example, the value of the attribute parameter of the first vector may be a specific value of the attribute parameter in the related information, or may be 1 if the attribute parameter exists in the related information, or 0 if the attribute parameter does not exist in the related information.
Accordingly, the second vector may have the same attribute parameters and values as the first vector, except that the values of the attribute parameters of the second vector are determined based on the unstructured data (e.g., based on whether the attribute parameters appear in the unstructured data).
S2032, calculating the similarity between the first vector and the second vector.
After the first vector and the second vector are obtained, the similarity between the first vector and the second vector can be calculated, and the similarity represents the correlation degree between the extracted candidate entity and the unstructured data.
For example, the similarity may be calculated using cosine values of the first vector and the second vector. Of course, other ways of comparing existing vector similarity are also available and will not be described in detail herein.
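A hedged sketch of steps S2031-S2032: build the first vector from a candidate entity's related information in a first database, build the second vector from the unstructured data, and compute their cosine similarity. The attribute parameters, the mini database, and the 0/1 value scheme (one of the options mentioned above) are illustrative assumptions.

```python
import math

FIRST_DATABASE = {  # dummy stand-in for a knowledge graph / knowledge encyclopedia
    "Battle of Stalingrad": {"type": "battle", "war": "World War II", "year": "1942"},
}
ATTRIBUTE_PARAMS = ["battle", "World War II", "1942", "scientist", "physics"]  # assumed

def entity_vectors(entity, unstructured_text):
    """First vector from the database's related values, second vector from the data (0/1 scheme)."""
    related_values = set(FIRST_DATABASE.get(entity, {}).values())
    first = [1 if p in related_values else 0 for p in ATTRIBUTE_PARAMS]
    second = [1 if p.lower() in unstructured_text.lower() else 0 for p in ATTRIBUTE_PARAMS]
    return first, second

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

v1, v2 = entity_vectors("Battle of Stalingrad",
                        "An inventory of the decisive battles of World War II, fought from 1942.")
print(v1, v2, cosine(v1, v2))  # the cosine reflects the candidate's relevance to the data
```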
S2033, calculating the score of the candidate entity according to the similarity, the heat degree of the candidate entity, the authority coefficient of the first database and the position coefficient of the candidate entity in the unstructured data.
After the similarity between the first vector and the second vector is obtained, the score of the candidate entity is calculated by combining other information, and the score also represents the correlation degree of the candidate entity and the unstructured data.
For example, the score of a candidate entity may be calculated according to the following formula:
score of the candidate entity = similarity × heat of the candidate entity × authority coefficient of the first database × position coefficient of the candidate entity in the unstructured data;
the similarity is the similarity between the first vector and the second vector obtained above.
The heat of the candidate entity refers to how frequently the candidate entity is used in other databases, such as the number of times the candidate entity is searched for within a predetermined time, or the number of times data about the candidate entity (such as its corresponding knowledge-encyclopedia web page or knowledge-graph entry) is accessed within a predetermined time. Illustratively, the heat of the candidate entity may be obtained from a QueryLog (search log), knowledge-encyclopedia access records, knowledge-graph access records, and the like.
The authority coefficient of the first database refers to the reliability of the first database for generating the first vector, and may be set manually or calculated based on the access amount of the first database and the like. For example, the authority coefficient of the knowledge-encyclopedia may be considered to be 1, while the authority coefficient of the knowledge-graph is 1.5, and so on.
The position coefficient of the candidate entity in the unstructured data refers to a coefficient generated according to which part of the unstructured data the candidate entity is selected from, and may be set artificially. For example, the position coefficient may be 1.5 when a candidate entity is extracted from an inventory block of unstructured data, or 1 when a candidate entity is extracted from a position of a non-inventory block of unstructured data.
In this way, a score that comprehensively evaluates the candidate entity from multiple aspects can be obtained.
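The scoring formula above, written out as a small sketch; all concrete numbers are illustrative assumptions.

```python
def candidate_entity_score(similarity, heat, authority_coefficient, position_coefficient):
    """Score = similarity x heat x authority coefficient x position coefficient."""
    return similarity * heat * authority_coefficient * position_coefficient

score = candidate_entity_score(
    similarity=0.82,            # cosine similarity of the first and second vectors
    heat=120.0,                 # e.g. searches of the entity in a predetermined time window
    authority_coefficient=1.5,  # e.g. a knowledge graph as the first database
    position_coefficient=1.5,   # entity was extracted from an inventory block
)
print(score)  # 0.82 * 120 * 1.5 * 1.5 = 221.4
```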
S2034, screening at least one target entity from each candidate entity according to the score of each candidate entity.
After the scores of the candidate entities are obtained, the better candidate entities can be selected from the candidate entities as target entities according to the scores.
For example, the scores of all candidate entities may be clustered with a Gaussian Mixture Model (GMM), for example into 3 clusters; the cluster with the lowest scores is removed, and the candidate entities in the remaining 2 clusters are taken as target entities. Clustering in this way evaluates the relative quality of the candidate entities more effectively, so that more reasonable target entities are selected.
Of course, there are various ways to select target entities based on the scores. For example, when clustering with a GMM, the number of clusters and the number of clusters retained may differ; alternatively, target entities may be obtained by comparing the scores against a preset threshold, by selecting candidate entities whose scores are above average, and so on.
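A hedged sketch of the GMM-based selection described above, assuming scikit-learn is available; the candidate scores are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed available; any GMM implementation works

# Made-up candidate scores; in practice these come from the formula above.
scores = {"Battle of Stalingrad": 221.4, "Battle of Midway": 198.0,
          "weather": 12.3, "winter": 9.8, "Battle of Kursk": 176.5}

X = np.array(list(scores.values())).reshape(-1, 1)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)

lowest_cluster = int(np.argmin(gmm.means_.ravel()))  # cluster with the lowest mean score
target_entities = [e for e, lab in zip(scores, labels) if lab != lowest_cluster]
print(target_entities)  # e.g. the battle entities survive, "weather"/"winter" are dropped
```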
S300, establishing a mapping relation between the target topic and the target entity set.
And establishing a mapping relation between the target topic and the target entity set, namely indicating that the target entity corresponds to the entity set of the target topic.
In some embodiments, after the step S300 above, the method further comprises:
and S400, adding the mapping relation into a second database.
After the mapping relationship between the target topic and the target entity set is established, the mapping relationship can be stored in a second database; that is, the obtained mapping relationships can be used to build the second database or to enrich its content.
Illustratively, the second database may be a knowledge graph; that is, the method of the embodiments of the present disclosure may be used to construct a knowledge graph.
Of course, it should be understood that after the above mapping is established, other operations may be performed according to the mapping. For example, when a user searches for an entity set of a target topic or needs to recommend content related to the entity set to the user, a target entity set corresponding to the entity set can be found according to the mapping relationship, and the target entity set is provided for the user.
Of course, it should be understood that the above describes the processing of a single piece of unstructured data; if multiple pieces of unstructured data are to be processed, the above process may be performed multiple times. For example, the above process may be performed on a plurality of pieces of unstructured data in a preset database (e.g., all data retrievable by a search engine), and all of the established mapping relationships may be added to a database (e.g., the second database) to obtain a new database (e.g., a knowledge graph).
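To tie the steps together, here is a high-level, purely illustrative sketch of processing a corpus and adding the resulting mapping relationships to a second database; every helper is a trivial stand-in for the components described above, not the patent's actual modules.

```python
def determine_target_topic(data):
    """S100 stand-in: take the first line as the original topic and strip a lead-in word."""
    original_topic = data.splitlines()[0].strip()
    if original_topic.lower().startswith("inventory "):
        return original_topic[len("inventory "):]
    return None  # not recognized as an entity-set topic

def extract_target_entities(data):
    """S200 stand-in: pretend every capitalized line after the first is a target entity."""
    return {line.strip() for line in data.splitlines()[1:] if line[:1].isupper()}

def build_mappings(corpus, second_database):
    """Run S100-S400 over a corpus and collect topic -> entity-set mappings."""
    for data in corpus:
        target_topic = determine_target_topic(data)
        target_entities = extract_target_entities(data)
        if target_topic and target_entities:
            second_database[target_topic] = target_entities  # S300 + S400
    return second_database

corpus = ["Inventory important campaigns of World War II\n"
          "Battle of Stalingrad\n"
          "Battle of Midway"]
print(build_mappings(corpus, {}))
# {'important campaigns of World War II': {'Battle of Stalingrad', 'Battle of Midway'}}
```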
Fig. 3 is a block diagram of an apparatus for establishing a mapping relationship according to an embodiment of the disclosure.
In a second aspect, referring to fig. 3, an embodiment of the present disclosure provides an apparatus for establishing a mapping relationship, including:
the target topic determining unit is used for determining a target topic of the unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set;
the target entity extraction unit is used for extracting at least one target entity from the unstructured data and establishing a target entity set according to the target entity;
and the mapping relation establishing unit is used for establishing a mapping relation between the target topic and the target entity set.
In some embodiments, the unstructured data is unstructured data of an information introduction class;
the target topic determination unit is used for: and extracting a target topic from the original topics of the unstructured data according to a preset regular matching model.
In some embodiments, the original topic of unstructured data comprises preset keywords.
In some embodiments, the unstructured data is question-answer like unstructured data;
the target topic determination unit is used for: matching an original topic of the unstructured data with a preset part-of-speech sequence template, wherein the part-of-speech sequence template comprises at least one noun; extracting parts corresponding to predetermined nouns in the part-of-speech sequence template from the original topic to obtain at least two candidate topics; and determining the correlation degree of each candidate topic with the original topic, and taking the candidate topic with the maximum corresponding correlation degree as the target topic.
Fig. 4 is a block diagram of another apparatus for establishing a mapping relationship according to an embodiment of the disclosure.
Referring to fig. 4, in some embodiments, the target entity extraction unit includes:
an entity identification subunit, configured to identify an entity in the unstructured data;
the candidate entity screening subunit is used for screening at least one candidate entity from the identified entities according to the distribution characteristics of the identified entities in the unstructured data;
and the target entity screening subunit is used for screening at least one target entity from the candidate entities according to the correlation degree of the candidate entities and the unstructured data.
In some embodiments, the unstructured data comprises an inventory block comprising consecutive first and second paragraphs, the length of the first paragraph being less than or equal to a first threshold, the length of the second paragraph being greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the entity identification subunit is used for: identifying an entity in a first paragraph of an inventory block;
the distribution characteristics of the entities in the unstructured data include the distribution characteristics of the entities throughout the unstructured data and the distribution characteristics of the entities in the inventory blocks.
In some embodiments, there is no inventory block in the unstructured data, the inventory block comprising consecutive first and second paragraphs, the length of the first paragraph being less than or equal to a first threshold, the length of the second paragraph being greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the entity identification subunit is used for: identifying entities in the unstructured data;
the distribution characteristics of the entities in the unstructured data include distribution characteristics of the entities throughout the unstructured data.
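The inventory-block structure referred to in the two preceding embodiments can be detected with a simple length check over consecutive paragraphs, as sketched below. The concrete thresholds are illustrative assumptions; the method only requires that the second threshold be greater than the first.

```python
# Sketch of locating "inventory blocks": a short paragraph (for example an
# entity name used as a sub-heading) immediately followed by a long paragraph
# (its description). Thresholds are illustrative.

def find_inventory_blocks(paragraphs, first_threshold=30, second_threshold=80):
    blocks = []
    for first, second in zip(paragraphs, paragraphs[1:]):
        if len(first) <= first_threshold and len(second) >= second_threshold:
            blocks.append((first, second))
    return blocks

paragraphs = [
    "West Lake",
    "West Lake is a freshwater lake in Hangzhou, China. It is divided into five "
    "sections by three causeways and is known for its scenic surroundings.",
    "A short closing remark.",
]
for heading, body in find_inventory_blocks(paragraphs):
    print("entity candidate from first paragraph:", heading)
```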
In some embodiments, the target entity screening subunit is used for:
obtaining a first vector according to the relevant information of the candidate entity in a preset first database, and obtaining a second vector according to the unstructured data;
calculating the similarity of the first vector and the second vector;
calculating the score of the candidate entity according to the similarity, the heat of the candidate entity, the authority coefficient of the first database and the position coefficient of the candidate entity in the unstructured data;
and screening at least one target entity from each candidate entity according to the score of each candidate entity.
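A hedged sketch of the scoring step is given below. The bag-of-words vectors, the cosine similarity and the weighted combination are illustrative choices; the method only states that the score depends on the similarity, the candidate entity's heat, the first database's authority coefficient and the candidate entity's position coefficient, without fixing a particular formula or weights.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Very simple vectorization: lowercase whitespace tokens with counts.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def candidate_score(db_description, unstructured_text,
                    heat, authority, position_coefficient,
                    weights=(0.5, 0.2, 0.2, 0.1)):
    # Weighted sum is an assumed combination; the weights are arbitrary here.
    similarity = cosine_similarity(bag_of_words(db_description),
                                   bag_of_words(unstructured_text))
    w_sim, w_heat, w_auth, w_pos = weights
    return (w_sim * similarity + w_heat * heat
            + w_auth * authority + w_pos * position_coefficient)

score = candidate_score(
    db_description="West Lake is a famous freshwater lake in Hangzhou",
    unstructured_text="Top scenic spots: West Lake in Hangzhou is known for its scenery",
    heat=0.8, authority=0.9, position_coefficient=1.0,
)
print(round(score, 3))
```

The final screening could then keep, for example, the candidates whose scores exceed a threshold or the top-k candidates.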
In some embodiments, the apparatus for establishing a mapping relationship further includes:
and the adding unit is used for adding the mapping relation into the second database.
In a third aspect, an embodiment of the present disclosure provides a hardware device, which includes:
one or more processors;
a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement any one of the above methods for establishing a mapping relationship.
In a fourth aspect, the present disclosure provides a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements any one of the above methods for establishing a mapping relationship.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (16)

1. A method of establishing a mapping relationship, comprising:
determining a target topic of the unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set;
extracting at least one target entity from the unstructured data, and establishing a target entity set according to the target entity;
establishing a mapping relation between the target topic and the target entity set;
wherein the extracting at least one target entity from the unstructured data comprises:
identifying entities in the unstructured data and screening at least one candidate entity from the identified entities according to the distribution characteristics of the identified entities in the unstructured data;
obtaining a first vector according to the relevant information of the candidate entity in a preset first database, and obtaining a second vector according to the unstructured data;
calculating the similarity of the first vector and the second vector, and calculating the score of the candidate entity according to the similarity, the heat of the candidate entity, the authority coefficient of the first database and the position coefficient of the candidate entity in the unstructured data;
and screening at least one target entity from each candidate entity according to the score of each candidate entity.
2. The method of claim 1, wherein,
the unstructured data are unstructured data of an information introduction class;
the determining the target topic of the unstructured data according to the original topic of the unstructured data comprises: extracting the target topic from the original topic of the unstructured data according to a preset regular-expression matching model.
3. The method of claim 2, wherein,
the original topic of the unstructured data comprises preset keywords.
4. The method of claim 1, wherein,
the unstructured data are unstructured data of a question answering class;
the determining the target topic of the unstructured data according to the original topic of the unstructured data comprises: matching the original topic of the unstructured data with a preset part-of-speech sequence template, wherein the part-of-speech sequence template comprises at least one noun; extracting parts corresponding to the preset nouns in the part-of-speech sequence template from the original topic to obtain at least two candidate topics; and determining the correlation degree between each candidate topic and the original topic, and taking the candidate topic with the maximum correlation degree as the target topic.
5. The method of claim 1, wherein,
the unstructured data comprises an inventory block comprising consecutive first and second paragraphs, the length of the first paragraph being less than or equal to a first threshold, the length of the second paragraph being greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the identifying the entity in the unstructured data comprises: identifying an entity in a first paragraph of the inventory block;
the distribution characteristics of the entities in the unstructured data include distribution characteristics of the entities throughout the unstructured data and distribution characteristics of the entities in the inventory blocks.
6. The method of claim 1, wherein,
the unstructured data has no inventory block;
the identifying the entity in the unstructured data comprises: identifying an entity in the unstructured data;
the distribution characteristics of the entity in the unstructured data include distribution characteristics of the entity throughout the unstructured data.
7. The method of claim 1, wherein after the establishing the mapping relationship between the target topic and the target entity set, further comprising:
and adding the mapping relation into a second database.
8. An apparatus for establishing a mapping relationship, comprising:
the target topic determination unit is used for determining a target topic of the unstructured data according to an original topic of the unstructured data, wherein the target topic is an entity set;
the target entity extraction unit is used for extracting at least one target entity from the unstructured data and establishing a target entity set according to the target entity;
the mapping relation establishing unit is used for establishing a mapping relation between the target topic and the target entity set;
wherein the target entity extraction unit includes:
an entity identification subunit, configured to identify an entity in the unstructured data;
a candidate entity screening subunit, configured to screen at least one candidate entity from the identified entities according to distribution characteristics of the identified entities in the unstructured data;
the target entity screening subunit is used for obtaining a first vector according to the relevant information of the candidate entity in a preset first database and obtaining a second vector according to the unstructured data; calculating the similarity of the first vector and the second vector, and calculating the score of the candidate entity according to the similarity, the heat of the candidate entity, the authority coefficient of the first database and the position coefficient of the candidate entity in the unstructured data; and screening at least one target entity from each candidate entity according to the score of each candidate entity.
9. The apparatus of claim 8, wherein,
the unstructured data are unstructured data of an information introduction class;
the target topic determination unit is configured to: extract the target topic from the original topic of the unstructured data according to a preset regular-expression matching model.
10. The apparatus of claim 9, wherein,
the original topic of the unstructured data comprises preset keywords.
11. The apparatus of claim 8, wherein,
the unstructured data are unstructured data of a question answering class;
the target topic determination unit is configured to: match the original topic of the unstructured data with a preset part-of-speech sequence template, wherein the part-of-speech sequence template comprises at least one noun; extract parts corresponding to the preset nouns in the part-of-speech sequence template from the original topic to obtain at least two candidate topics; and determine the correlation degree between each candidate topic and the original topic, and take the candidate topic with the maximum correlation degree as the target topic.
12. The apparatus of claim 8, wherein,
the unstructured data comprises an inventory block comprising consecutive first and second paragraphs, the length of the first paragraph being less than or equal to a first threshold, the length of the second paragraph being greater than or equal to a second threshold, the second threshold being greater than the first threshold;
the entity identification subunit is used for: identifying an entity in a first paragraph of the inventory block;
the distribution characteristics of the entities in the unstructured data include distribution characteristics of the entities throughout the unstructured data and distribution characteristics of the entities in the inventory blocks.
13. The apparatus of claim 8, wherein,
the unstructured data has no inventory block;
the entity identification subunit is used for: identifying an entity in the unstructured data;
the distribution characteristics of the entity in the unstructured data include distribution characteristics of the entity throughout the unstructured data.
14. The apparatus of claim 8, further comprising:
and the adding unit is used for adding the mapping relation into a second database.
15. A hardware device, comprising:
one or more processors;
storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
16. A computer-readable medium, having stored thereon a computer program, wherein,
the program when executed by a processor implementing the method of any one of claims 1 to 7.
CN201910257829.2A 2019-04-01 2019-04-01 Method and device for establishing mapping relation, hardware equipment and computer readable medium Active CN109977198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910257829.2A CN109977198B (en) 2019-04-01 2019-04-01 Method and device for establishing mapping relation, hardware equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN109977198A CN109977198A (en) 2019-07-05
CN109977198B true CN109977198B (en) 2021-08-31

Family

ID=67082179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910257829.2A Active CN109977198B (en) 2019-04-01 2019-04-01 Method and device for establishing mapping relation, hardware equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN109977198B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270458A1 (en) * 2007-04-24 2008-10-30 Gvelesiani Aleksandr L Systems and methods for displaying information about business related entities
CN100592293C (en) * 2007-04-28 2010-02-24 李树德 Knowledge search engine based on intelligent noumenon and implementing method thereof
CN105426379A (en) * 2014-10-22 2016-03-23 武汉理工大学 Keyword weight calculation method based on position of word
CN106250365A (en) * 2016-07-21 2016-12-21 成都德迈安科技有限公司 The extracting method of item property Feature Words in consumer reviews based on text analyzing
CN107239481B (en) * 2017-04-12 2021-03-12 北京大学 Knowledge base construction method for multi-source network encyclopedia
CN107665252B (en) * 2017-09-27 2020-08-25 深圳证券信息有限公司 Method and device for creating knowledge graph
CN108197197A (en) * 2017-12-27 2018-06-22 北京百度网讯科技有限公司 Entity description type label method for digging, device and terminal device
CN108763321B (en) * 2018-05-02 2021-07-06 深圳智能思创科技有限公司 Related entity recommendation method based on large-scale related entity network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant