CN111814027B

CN111814027B - Multi-source character attribute fusion method based on search engine

Info

Publication number: CN111814027B
Application number: CN202010867732.6A
Authority: CN
Inventors: 于富财; 叶浩维; 胡光岷
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2023-03-21
Anticipated expiration: 2040-08-26
Also published as: CN111814027A

Abstract

The invention discloses a search-engine-based multi-source person attribute fusion method, applied to multi-source person attribute fusion. Addressing the shortcomings of the prior art, it provides an effective way to judge whether multi-source attribute sets belong to the same target person while eliminating the effects of homonyms and noise as far as possible. An adaptive parameter is set according to how well known each person is, which adjusts the dispersion of the confidence values. Finally, the two confidence calculation methods are combined into a weighted joint confidence calculation method for person attribute pairs.

Description

Multi-source character attribute fusion method based on search engine

Technical Field

The invention belongs to the field of big data processing, and particularly relates to a character attribute extraction technology.

Background

With the rapid development of Internet applications, the volume of data obtainable over the network is growing explosively, and it has become both critical and urgent to extract the truly useful information from this mass of data quickly and accurately.

Person attributes, also called person characteristics, comprise all information describing a person from birth to death, such as place of birth, date of birth, nationality, occupation, religion, place of death and date of death. Person attribute extraction identifies the attributes of persons found on the network, and it has important practical applications such as name disambiguation, construction of person knowledge bases and person search engines. Most current research focuses on information extraction in specific domains of the network, and relatively little work addresses the extraction of person attributes.

Person attribute extraction generally comprises two important processes: multi-source person attribute extraction and person attribute fusion. Attribute extraction analyzes different information sources to obtain attribute sets for a person, where the information sources are mainly the different web pages returned by searching for the person's name. Attribute fusion analyzes the attribute sets from the different sources, judges whether each belongs to the target person being searched for, and finally merges the attribute sets that belong to the target person and outputs the result. Attribute fusion mainly addresses homonym disambiguation and noise among person attributes from different sources.

As shown in Table 1, if we search for the person Zhang San, we obtain the following attribute sets from different sources:

Table 1. Example attribute sets from different sources for the target person Zhang San

As shown in Table 1, assume that four attribute sets for Zhang San have been extracted from different sources. The following inferences can be made:

(1) Literally, attribute sets 1 and 4 can be assumed to belong to the Zhang San who is our target person, because 1 and 4 share the same birthday and university attributes.

(2) Attribute set 3 may belong to another person who is also named Zhang San; this is the homonym disambiguation problem.

(3) We cannot determine whether 1 and 2 are the same Zhang San. Set 2 may well describe another person, because a search for Zhang San can return web pages about other people related to Zhang San, and these affect the extraction result. This is the noise problem.

Fusion generally involves two important steps: attribute alignment and entity alignment. Attribute alignment determines whether attribute sets contain similar or identical attributes; the main methods are based on string distance, dictionary matching, semantic similarity and the like. For example, in the table above the attribute "education background" of attribute set 1 and the attribute "university" of attribute set 4 correspond to each other, and finding such a correspondence is the process of attribute alignment. Entity alignment is the process of determining whether multiple entity records point to the same real-world entity. That is, it must be determined whether person attribute sets from different sources point to the same objective person entity; entity alignment is implemented by calculating the similarity between persons and attribute values or by introducing other technical means.

The Web is a natural, massive text corpus. For example, relative page counts can be obtained from Google, these counts are close to the frequency with which words and phrases are actually used in society, and current linguistics research has also begun to support this approach [1].

Search engines can provide two useful sources of information: result page counts and snippets. The page count of a query is an estimate of the number of pages containing the query terms. In general the page count does not necessarily equal the word frequency, because the queried words may appear several times on one page. Given the amount of data held by a search engine, however, it can be used here to estimate how often words occur. The page count for the query "P and Q" can be viewed as a global measure of the co-occurrence of words P and Q. A snippet is a short window of text that the search engine extracts around the query term in a document, and it provides useful information about the local context of the query term. Snippet-based semantic similarity measurement has been used in query expansion, personal name disambiguation, community mining and other fields. For web content, snippets are relatively easy to obtain, and from an engineering standpoint using them avoids the efficiency problem of downloading every page in the search results.

The related prior art is as follows:

1. Normalized Google Distance (NGD)

Cilibrasi and Vitányi proposed a lexical semantic similarity algorithm based on the page counts returned by Google queries, called the Normalized Google Distance:

$$\mathrm{NGD}(P,Q)=\frac{\max\{\log H(P),\log H(Q)\}-\log H(P,Q)}{\log N-\min\{\log H(P),\log H(Q)\}}$$

where H(P) is the result page count for query P, H(Q) is the result page count for query Q, H(P, Q) is the result page count for the query on both P and Q (that is, pages that contain both P and Q), and N is the total number of pages indexed by Google, usually taken as 10^11. N may also be set to any value larger than H(x). Increasing N substantially reduces the NGD values and makes their distribution more compact; decreasing N enlarges the NGD values and makes their distribution more dispersed.
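
As a minimal sketch of the NGD calculation described above, the following Python function applies the standard NGD expression to three page counts. The function name, the argument names, the default N of 1e11 and the choice to return infinity when a count is zero are illustrative assumptions; obtaining the page counts from a search engine is outside the sketch.

```python
import math


def ngd(h_p: float, h_q: float, h_pq: float, n: float = 1e11) -> float:
    """Normalized Google Distance from page counts H(P), H(Q), H(P, Q)."""
    if min(h_p, h_q, h_pq) <= 0:
        # Assumption: with no hits the terms are treated as unrelated.
        return math.inf
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))


# Hypothetical counts: raising N shrinks the distance, as the text notes.
print(ngd(10_000, 5_000, 100, n=1e11))
print(ngd(10_000, 5_000, 100, n=1e13))
```

Passing N as a parameter makes it easy to observe how its choice compresses or spreads the resulting distances.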

2. Pointwise mutual information similarity algorithm (WebPMI)

In data mining and information retrieval, PMI (Pointwise Mutual Information) is often used to measure the correlation between two events. On this basis, a PMI-style similarity measure using search engine page counts can be defined as follows:

$$\mathrm{WebPMI}(P,Q)=\log_2\left(\frac{H(P\cap Q)/N}{\left(H(P)/N\right)\left(H(Q)/N\right)}\right)$$

Here, N is the number of documents indexed by the search engine, H(P) is the result page count for query P, H(Q) is the result page count for query Q, and H(P ∩ Q) is the result page count for the query on both P and Q (that is, pages that contain both P and Q).
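
The PMI-style measure can be sketched in Python as below. This assumes the commonly used form log2((H(P ∩ Q)/N) / ((H(P)/N)(H(Q)/N))); the function name, the default N and the convention of returning 0 when any count is zero are assumptions made for illustration, not details given in the text.

```python
import math


def web_pmi(h_p: float, h_q: float, h_pq: float, n: float = 1e11) -> float:
    """PMI-style similarity from page counts H(P), H(Q) and H(P ∩ Q)."""
    if min(h_p, h_q, h_pq) <= 0:
        # Assumption: no co-occurrence evidence maps to a similarity of 0.
        return 0.0
    joint = h_pq / n                    # estimated P(P and Q)
    independent = (h_p / n) * (h_q / n)  # P(P) * P(Q) under independence
    return math.log2(joint / independent)


# Hypothetical counts in a 1000-page index.
print(web_pmi(10, 10, 10, n=1000))    # terms always co-occur
print(web_pmi(100, 100, 10, n=1000))  # co-occurrence at chance level
```

A value near 0 means the two terms co-occur about as often as independent terms would, while larger values indicate stronger association.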

3. Double-check similarity calculation method based on search result snippets

Chen et al. proposed a double-check model that uses the text snippets returned by a web search engine to compute the semantic similarity between words. For two words P and Q, they collect the snippets returned for each word by the search engine. They then count the number of times word P appears in the snippets returned for Q, and the number of times word Q appears in the snippets returned for P. These two values are combined non-linearly to calculate the similarity between P and Q. The Co-Occurrence Double Check (CODC) metric is defined as:

$$\mathrm{CODC}(P,Q)=\begin{cases}0, & H(P@Q)=0\ \text{or}\ H(Q@P)=0\\ e^{\log\left[\frac{H(P@Q)}{H(P)}\times\frac{H(Q@P)}{H(Q)}\right]^{\alpha}}, & \text{otherwise}\end{cases}$$

where H(P@Q) is the number of times word P appears in the search result snippets for Q, H(Q@P) is the number of times word Q appears in the search result snippets for P, H(P) is the result page count for query P, and H(Q) is the result page count for query Q. α is an adjustable parameter.
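
A small Python sketch of the double-check idea follows. It assumes the usual reading of the CODC expression, in which the exponent α applies to the bracketed product so that exp(log[x]^α) reduces to x^α, and it assumes that a zero count in either direction gives a similarity of 0. The function and argument names are hypothetical.

```python
def codc(p_in_q: int, q_in_p: int, h_p: float, h_q: float, alpha: float) -> float:
    """Co-occurrence double check.

    p_in_q: occurrences of P in the snippets returned for Q, H(P@Q)
    q_in_p: occurrences of Q in the snippets returned for P, H(Q@P)
    h_p, h_q: normalizing counts H(P) and H(Q)
    """
    if p_in_q == 0 or q_in_p == 0 or h_p <= 0 or h_q <= 0:
        # Assumption: the check must succeed in both directions.
        return 0.0
    ratio = (p_in_q / h_p) * (q_in_p / h_q)
    return ratio ** alpha  # equals exp(alpha * log(ratio))


# Hypothetical counts for illustration only.
print(codc(5, 5, 10, 10, alpha=1.0))
print(codc(0, 5, 10, 10, alpha=0.5))
```

Requiring both directions to be non-zero is what makes this a double check: a term that merely appears in the other term's snippets, without the reverse, contributes nothing.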

Disclosure of Invention

The technical problem to be solved by the invention is as follows: how to judge whether the multi-source attribute sets belong to the same target person and eliminate the influence of the same name and noise as much as possible.

In order to solve the technical problems, the invention adopts the technical scheme that: a multi-source character attribute fusion method based on a search engine comprises the following steps:

S1, performing attribute alignment on two target attribute sets; if corresponding attributes exist, recording them as attribute pairs and executing step S2, otherwise ending;

S2, calculating a confidence for each attribute pair having a correspondence;

S3, calculating the average of the confidences; if the average is greater than a threshold, the two target attribute sets belong to the same target person, otherwise they do not belong to the same target person.

Further, step S1 comprises the following sub-steps:

S11, establishing a high-confidence attribute dictionary;

S12, pairing attribute names based on string edit distance;

S13, aligning the remaining attribute names based on pointwise mutual information.

Further, the confidence in step S2 is calculated based on search engine page counts, and the specific calculation formula is:

M = f(C)

where f(C, v_p) is the result page count for the search on "person C" and "attribute value v_p", f(C, v_q) is the result page count for the search on "person C" and "attribute value v_q", and f(C, v_p, v_q) is the result page count for the search on "person C", "attribute value v_p" and "attribute value v_q".

Further, the confidence in step S2 is calculated based on snippet content, and the specific calculation formula is as follows:

where f(v_q@(C, v_p)) is the number of occurrences of v_q in the result snippets of the query for person C and attribute value v_p; f(C, v_p) is the total number of result snippets for the query for person C and attribute value v_p; μ denotes the proportion of the f(C, v_p) snippets that is taken, μ ∈ (0, 1]; q and p are adjustment factors.

Further, the confidence in step S2 is calculated using the following formula:

Con(T_p, T_q, C) = β × TCDC(T_p, T_q, C) + (1 - β)(1 - NGDC(T_p, T_q, C))

where β is the weight, N is the total number of pages indexed by the search engine, α is an adjustable parameter, TCDC(T_p, T_q, C) denotes the person attribute pair confidence based on double checking, and NGDC(T_p, T_q, C) denotes the person attribute pair confidence based on the co-occurrence page counts of the person's name and the two attributes.

The beneficial effects of the invention are as follows. For the person attribute fusion scenario, the invention designs a stepwise attribute alignment method that reduces the amount of computation, and introduces the WebPMI similarity calculation, based on search engine page counts, to compute the similarity between attribute names. This is applicable to attribute names in any form, and the results are cacheable, which reduces computation further. The invention also designs three methods for calculating person attribute pair confidence. Drawing on the idea of search-engine-based lexical semantic similarity, it provides two confidence measures, NGDC and TCDC, and a measure that fuses the two; NGDC and TCDC calculate the likelihood that an attribute pair belongs to the same target person from search engine page counts and from snippet content respectively. The method makes skillful use of the characteristics of search engines to bring in additional information, treating the Web as a natural massive database, and solves the following problems in calculating the confidence of person attribute pairs:

a. attribute values often contain long phrases, whose semantic similarity is difficult to calculate with traditional methods;

b. attribute values may take unforeseen forms of expression or contain new words not yet included in any lexicon;

c. information is insufficient: when the attribute values of two attribute pairs are completely different, it still cannot be concluded that the attributes do not belong to the same person.

Drawings

FIG. 1 is a flow chart of person attribute fusion;

FIG. 2 is a flow diagram of attribute alignment.

Detailed Description

The multi-source character attribute fusion is an important part in the character attribute extraction application process. The main purposes of character attribute fusion are noise removal and name disambiguation. Colloquially, person attribute fusion requires a determination of whether a set of attributes from different information sources point to our target person.

For ease of presentation, assume that several attribute sets for person C have been obtained from different information sources (e.g., web pages, knowledge bases, etc.). The attribute sets must be compared pairwise; consider two target attribute sets to be compared:

In attribute set P, k_p is an attribute name and v_p is the corresponding attribute value; K_p denotes the set of attribute names of P and V_p the set of attribute values of P. An attribute pair in P is defined as T_p = (k_p, v_p). Attribute set Q is defined in the same way, and the superscripts 1, 2, …, n denote the attribute indices. The remainder of this description relies on these definitions without repeating them.

Suppose we search for the person C = "Zhang San" and extract different attribute sets from different web page sources, as shown in Table 1. Taking attribute set 1 as P and attribute set 2 as Q gives the attribute sets P and Q from different sources shown in Table 2.

Table 2. Example of attribute sets P and Q from different sources

In the fusion process the attribute sets must be compared pairwise. Taking P and Q as an example, we need to calculate whether P and Q belong to the same Zhang San; if they do not, they are not fused, which eliminates the ambiguity. P or Q may also belong to a different person altogether, that is, noise, which must likewise be identified during fusion.

In the calculation process, the following problems may exist:

(1) A single attribute can be expressed in many ways; for example, the birthday attribute may be written as {birthday, …}, and it is difficult to build a dictionary that covers every expression of every attribute. Traditional lexical similarity calculation is also unsuitable here, because attribute expressions may include many long phrases, short texts and even newly coined words. The invention therefore needs to solve the technical problem of attribute alignment in the person attribute fusion setting.

(2) Person attribute pair confidence measurement. Assume that attribute alignment has yielded a pair of aligned attribute pairs for person C, T_p = (k_p, v_p) and T_q = (k_q, v_q). We call the probability that T_p and T_q both belong to the target person C the person attribute pair confidence, denoted Con(T_p, T_q, C). Like attribute names, attribute values suffer from inconsistent forms of expression, are mostly long phrases, and contain many abbreviations and new words. A further difficulty is that the confidence cannot be measured from the string or semantic information of the attribute values alone, because different attribute sets may share no information at all. For example, suppose P and Q each contain an attribute pair named "occupation", namely T_p = ("occupation", "teacher") and T_q = ("occupation", "scholar"). The two attribute values are literally unrelated, yet the target person may well be both a teacher and a scholar. This requires introducing information beyond the literal attribute values. The invention therefore needs to solve the problem of calculating attribute value confidence in the person attribute fusion setting.

To solve these technical problems, the invention provides a search-engine-based multi-source person attribute fusion method which, as shown in FIG. 1, includes:

A1, performing attribute alignment on two target attribute sets; if corresponding attributes exist, executing step A2, otherwise ending;

A2, calculating a confidence for each person attribute pair having a correspondence;

A3, calculating the average of the confidences; if the average is greater than the threshold, the two target attribute sets belong to the same target person, otherwise they do not belong to the same target person.

The attribute alignment process of step A1 is shown in FIG. 2 and includes the following sub-steps:

(1) A high-confidence attribute dictionary is built first. High-confidence attributes are attributes that strongly distinguish one person entity from another and whose forms of expression are predictable, for example date of birth and date of death. The dictionary should contain all expressions of each such attribute. Although, as noted above, it is difficult to cover every expression of every attribute, only a dictionary for a few attributes needs to be built here, which can be done manually. We first check whether P and Q both contain these high-confidence attributes; if they do, the attribute values are parsed directly with regular expressions, and whether P and Q belong to the same person is judged directly from the parsing result.

(2) Next, attribute names are paired based on string edit distance: the edit distance similarity is calculated pairwise for the attribute names k_p and k_q in P and Q. If the edit distance similarity exceeds the threshold threshold_lev, the two attributes are judged to be aligned.

For attribute pairs T_p and T_q, denote the edit distance similarity of k_p and k_q as lev(k_p, k_q). If lev(k_p, k_q) > threshold_lev, T_p and T_q are judged to be aligned.

The threshold threshold_lev should be set high, because many attribute names are short words and the edit distance serves mainly to correct things such as redundant spaces and singular versus plural forms. Experiments indicate that threshold_lev should lie in the range [0.9, 1]. Note, however, that edit distance does not reflect the semantics of an attribute name, and many attribute names are written in non-canonical or abbreviated forms, so semantic information must be taken into account in the final alignment step.

(3) Finally, we perform attribute alignment based on Pointwise Mutual Information (PMI) for the remaining attribute names.

where N is the number of documents indexed by the search engine, H(k_p) is the search result count for attribute name k_p, and H(k_p ∩ k_q) is the search result count for attribute name k_p together with attribute name k_q. If WebPMI(k_p, k_q) > threshold_pmi, the attribute pairs T_p and T_q are considered aligned. The threshold threshold_pmi is generally taken in the range [0.5, 0.7].

Compared with NGD (Normalized Google Distance), similarity calculation based on PMI (Pointwise Mutual Information) is better suited to cases with many result pages. Since most attribute names are common words, the PMI-based similarity calculation is the more suitable choice here.
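
The stepwise alignment of sub-steps (2) and (3) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the edit distance similarity is assumed to be 1 minus the Levenshtein distance divided by the length of the longer name, the tiers are passed in as hypothetical matcher functions, and the WebPMI scores are replaced by a hard-coded stand-in table because real scores would require search engine page counts.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


Matcher = Callable[[str, str], bool]


def align_names(names_p: Sequence[str], names_q: Sequence[str],
                tiers: Sequence[Matcher]) -> List[Tuple[str, str]]:
    """Run the matchers in order; later tiers only see unmatched names."""
    left_p, left_q = list(names_p), list(names_q)
    pairs = []
    for is_match in tiers:
        for kp in list(left_p):
            kq = next((k for k in left_q if is_match(kp, k)), None)
            if kq is not None:
                pairs.append((kp, kq))
                left_p.remove(kp)
                left_q.remove(kq)
    return pairs


# Hypothetical stand-in for WebPMI scores that would come from page counts.
FAKE_PMI = {("education background", "university"): 0.62}

tiers = [
    lambda p, q: edit_similarity(p, q) > 0.9,      # threshold_lev
    lambda p, q: FAKE_PMI.get((p, q), 0.0) > 0.5,  # threshold_pmi
]
print(align_names(["birth date", "education background"],
                  ["birth  date", "university"], tiers))
```

Ordering the tiers from cheapest to most expensive means the search-engine-based test only runs on names that the string comparison could not pair, which is the source of the computational saving.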

Step A2, the search-engine-based person attribute pair confidence algorithm, comprises the following:

1) Rationale of the algorithm

The number of web pages currently indexed by Google is close to 10^10, and each common search term appears in millions of web pages. Such a huge amount of data can be regarded as a sample that truly represents human knowledge. The Google probability of a search term, that is, the page count returned by Google divided by the number of pages indexed by Google, is close to the actual relative frequency with which the term is used in society.

In the name disambiguation process, suppose we search for the name Zhang San and two different attribute sets in the results carry two different occupation attributes, teacher and student. The terms "Zhang San teacher" and "Zhang San student" are then the targets of the Google distance calculation. If Zhang San really does have the dual identity of teacher and student, the page counts for the two terms will not differ greatly. If the two attribute sets do not point to the same Zhang San, or one of the attributes does not belong to Zhang San at all, the count for the term "Zhang San student" drops, and this is reflected in the NGD calculation formula. This also matches the setting for which the Google distance was proposed. The invention therefore uses the idea of the Google distance to calculate the similarity of person attribute pairs.

In addition, attribute similarity should not be measured by string similarity alone. In the example above, "University of Electronic Science and Technology of China" and "UESTC" have the same meaning, yet their string similarity is 0. Likewise, many attributes have unpredictable forms of expression and most consist of more than one word, so semantic measurement with traditional methods such as word2vec or WordNet alone is difficult.

2) Person attribute pair confidence based on Google distance

Based on the idea of the NGD algorithm, the invention designs a person attribute pair confidence calculation method. The method mainly uses the co-occurrence page counts of the person's name and the two attributes to measure the degree of association between the person and the attributes, and is called NGDC (Normalized Google distance of characters).

M = f(C)

where f(C, v_p) is the result page count for the search on "person C" and "attribute value v_p", and f(C, v_p, v_q) is the result page count for the search on "person C", "attribute value v_p" and "attribute value v_q". Search results for person attribute pairs may be sparse, and they differ widely with how well known the target person is. M is therefore an adaptive parameter: f(C) is the result page count for searching person C alone, and setting M in this way makes the results more dispersed. NGDC(T_p, T_q, C) ∈ [0, +∞); in some special cases, when the search engine results are inaccurate, NGDC(T_p, T_q, C) may fall below 0, and such cases can be ignored. The larger the value of NGDC(T_p, T_q, C), the lower the probability that T_p and T_q both belong to C, and vice versa. The threshold generally lies in the range [0.5, 1.5]. In this embodiment, for example, the threshold is set to 1.0: when NGDC(T_p, T_q, C) is less than 1.0, T_p and T_q both belong to C; otherwise at least one of T_p and T_q does not belong to C.
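
The NGDC expression itself is not reproduced in this text, so the following Python sketch is a reconstruction under an explicit assumption: that NGDC has the same shape as NGD, with the page counts replaced by f(C, v_p), f(C, v_q) and f(C, v_p, v_q), and with the index size N replaced by the adaptive parameter M = f(C). The function name and the handling of degenerate counts are likewise assumptions.

```python
import math


def ngdc(f_c_vp: float, f_c_vq: float, f_c_vp_vq: float, f_c: float) -> float:
    """Assumed NGD-shaped distance for an attribute pair of person C.

    f_c_vp, f_c_vq: page counts for C with v_p, and C with v_q
    f_c_vp_vq: page count for C with both v_p and v_q
    f_c: page count for C alone, used as the adaptive parameter M
    """
    if min(f_c_vp, f_c_vq, f_c_vp_vq, f_c) <= 0:
        return math.inf  # assumption: no evidence means maximal distance
    log_p, log_q = math.log(f_c_vp), math.log(f_c_vq)
    denominator = math.log(f_c) - min(log_p, log_q)
    if denominator <= 0:
        return math.inf  # inconsistent counts; treated as unusable
    return (max(log_p, log_q) - math.log(f_c_vp_vq)) / denominator


# Hypothetical counts, compared against the embodiment's threshold of 1.0.
for joint in (300, 20, 2):
    value = ngdc(500, 400, joint, f_c=10_000)
    print(joint, round(value, 3), value < 1.0)
```

Because M tracks the popularity of the person rather than the size of the whole index, the same threshold can be applied to both well-known and obscure names.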

3) Person attribute pair confidence based on a snippet double-check algorithm

The Google-distance-based method is suited to calculating person attribute pair confidence when there are few search results. It considers only page counts and ignores page content. To compensate, the invention provides a snippet-based person attribute pair confidence calculation method. A snippet is the excerpt window shown for each web page in the search engine results; it typically contains the search keywords and reflects the key content of the page concerning them. Since requesting the full content of every web page is difficult from an engineering standpoint, analyzing snippets is a good choice here.

The invention designs a double-check-based person attribute pair confidence algorithm, called TCDC (pipe of characters and attributes confidence on double check).

where f(v_q@(C, v_p)) is the number of occurrences of v_q in the result snippets of the query for person C and attribute value v_p, and f(C, v_p) is the total number of result snippets for the query for person C and attribute value v_p. μ ∈ (0, 1] denotes the proportion of the f(C, v_p) snippets that is taken, which keeps the number of snippets processed from growing too large and so keeps the algorithm feasible. q and p are adjustment factors that prevent a value of f(C, v_p) or f(C, v_q) that is itself small from distorting the result. TCDC(T_p, T_q, C) ∈ [0, 1]. In general the snippet proportion μ can be set according to the processing capacity of the computer, and the number of snippets processed is usually kept within 1000. The threshold for TCDC(T_p, T_q, C) is about 0.5: when TCDC(T_p, T_q, C) is greater than 0.5, T_p and T_q both belong to C; otherwise at least one of T_p and T_q does not belong to C.
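
Only the counting step of TCDC, f(v_q@(C, v_p)) restricted to a proportion μ of the snippets, is sketched below in Python; the TCDC expression that combines these counts with the adjustment factors is not reproduced here. The function name, the case-insensitive substring matching and the use of the leading snippets as the sampled portion are assumptions for illustration.

```python
import math
from typing import Sequence


def count_in_snippets(snippets: Sequence[str], term: str, mu: float = 1.0) -> int:
    """Count occurrences of `term` in the first ceil(mu * len) snippets."""
    if not 0.0 < mu <= 1.0:
        raise ValueError("mu must lie in (0, 1]")
    keep = math.ceil(mu * len(snippets))
    needle = term.lower()
    # Assumption: plain case-insensitive substring counting.
    return sum(s.lower().count(needle) for s in snippets[:keep])


# Hypothetical snippets returned for the query "Zhang San student".
snippets = [
    "Zhang San is a teacher",
    "Zhang San, student",
    "Teacher Zhang San",
    "teacher and teacher",
]
print(count_in_snippets(snippets, "teacher"))          # all snippets
print(count_in_snippets(snippets, "teacher", mu=0.5))  # first half only
```

Running the same count in the opposite direction, for v_p within the snippets of the query on C and v_q, gives the second half of the double check.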

4) Person attribute pair confidence Con(T_p, T_q, C)

Combining the person attribute pair similarity calculation methods NGDC and TCDC, the invention proposes a method for calculating the person attribute pair confidence Con(T_p, T_q, C) for a person C and attribute pairs T_p and T_q:

Con(T_p, T_q, C) = β × TCDC(T_p, T_q, C) + (1 - β)(1 - NGDC(T_p, T_q, C)) (7)

where β is the weight and N is the total number of pages indexed by the search engine, typically 10^11. α is an adjustable parameter, α ∈ (0, 1], used to reduce the effect of an excessive gap between N and the query result counts. It can be seen that the fewer result pages the query for a person attribute pair returns, the smaller β is and the lower the weight of TCDC, and vice versa. Con(T_p, T_q, C) ∈ (-∞, +∞). When Con(T_p, T_q, C) is greater than a threshold, T_p and T_q are judged to belong to C; otherwise they are judged not to. Normally the absolute value of Con(T_p, T_q, C) is not very large (less than 1, mainly influenced by NGDC), and the threshold range is [0.5, 0.75]. In this embodiment, for example, the threshold is 0.6: when Con(T_p, T_q, C) is greater than 0.6, T_p and T_q are judged to belong to C; otherwise at least one of T_p and T_q is judged not to belong to C.
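
Formula (7) can be written directly in Python, as sketched below. The expression for β in terms of N and α is not reproduced in this text, so β is taken as an input; restricting β to [0, 1] and the function name are assumptions.

```python
def combined_confidence(tcdc: float, ngdc: float, beta: float) -> float:
    """Con = beta * TCDC + (1 - beta) * (1 - NGDC), as in formula (7)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to be a weight in [0, 1]")
    # NGDC is a distance, so (1 - NGDC) turns it into a similarity.
    return beta * tcdc + (1.0 - beta) * (1.0 - ngdc)


# Hypothetical scores checked against the embodiment's threshold of 0.6.
con = combined_confidence(tcdc=0.8, ngdc=0.4, beta=0.5)
print(con, con > 0.6)
```

Since NGDC is unbounded above, a very large distance can pull the combined value below zero, which is consistent with the stated range of Con.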

The average confidence in step A3 is calculated as:

$$\mathrm{Con}(P,Q)=\frac{1}{n}\sum_{i=1}^{n}\mathrm{Con}\left(T_p^{i},T_q^{i},C\right)$$

where Con(P, Q) is the confidence that P and Q belong to C, and n is the total number of aligned attribute pairs. Con(T_p, T_q, C) is the confidence that T_p and T_q belong to C. If Con(P, Q) is greater than the threshold threshold_con, P and Q are both judged to belong to C; otherwise they are judged not to. The threshold threshold_con takes its value with reference to the threshold for Con(T_p, T_q, C) above.
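
The final decision of step A3 amounts to averaging the per-pair confidences and comparing the result with a threshold, which the following Python sketch illustrates. The function name, the default threshold of 0.6 (borrowed from the embodiment's threshold for Con) and the choice to return False when no pairs were aligned are assumptions.

```python
from typing import Sequence


def same_person(pair_confidences: Sequence[float],
                threshold_con: float = 0.6) -> bool:
    """Decide whether two attribute sets describe the same target person."""
    if not pair_confidences:
        # No aligned attribute pairs: the method ends without fusing.
        return False
    average = sum(pair_confidences) / len(pair_confidences)
    return average > threshold_con


# Hypothetical per-pair confidences Con(T_p, T_q, C).
print(same_person([0.7, 0.8, 0.5]))
print(same_person([0.7, 0.4]))
```

Averaging lets one weak attribute pair be outweighed by several strong ones, rather than vetoing the fusion on its own.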

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. A multi-source character attribute fusion method based on a search engine is characterized by comprising the following steps:

S1, performing attribute alignment on two target attribute sets; if corresponding attributes exist, recording them as attribute pairs and executing step S2, otherwise ending;

S2, calculating a confidence for each attribute pair having a correspondence; the confidence in step S2 is calculated based on search engine page counts, and the specific calculation formula is:

M = f(C)

wherein f(C, v_p) is the result page count for the search on "person C" and "attribute value v_p", f(C, v_p, v_q) is the result page count for the search on "person C", "attribute value v_p" and "attribute value v_q", max[] denotes taking the maximum value, and min[] denotes taking the minimum value;

or,

the confidence in step S2 is calculated based on snippet content, and the specific calculation formula is:

wherein f(v_q@(C, v_p)) is the number of occurrences of v_q in the result snippets of the query for person C and attribute value v_p; f(C, v_p) is the total number of result snippets for the query for person C and attribute value v_p; μ denotes the proportion of the f(C, v_p) snippets that is taken, μ ∈ (0, 1]; q and p are adjustment factors;

or,

the confidence in step S2 is calculated using the following formula:

Con(T_p, T_q, C) = β × TCDC(T_p, T_q, C) + (1 - β)(1 - NGDC(T_p, T_q, C))

wherein β is the weight, N is the total number of pages indexed by the search engine, α is an adjustable parameter, TCDC(T_p, T_q, C) denotes the person attribute pair confidence based on double checking, and NGDC(T_p, T_q, C) denotes the person attribute pair confidence based on the co-occurrence page counts of the person's name and the two attributes;

2. The multi-source character attribute fusion method based on the search engine as claimed in claim 1, wherein the step S1 comprises the following sub-steps:

S11, establishing a high-confidence attribute dictionary;

3. The multi-source character attribute fusion method based on the search engine as claimed in claim 1, wherein the formula for calculating the average value of the confidence degrees in step S3 is:

wherein Con(P, Q) is the confidence that P and Q belong to C, n is the total number of aligned attribute pairs, and Con(T_p, T_q, C) is the confidence that T_p and T_q belong to C.