CN111814027B - Multi-source character attribute fusion method based on search engine - Google Patents


Publication number
CN111814027B
Authority
CN
China
Prior art keywords
attribute
person
confidence
search engine
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010867732.6A
Other languages
Chinese (zh)
Other versions
CN111814027A (en)
Inventor
于富财
叶浩维
胡光岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010867732.6A
Publication of CN111814027A
Application granted
Publication of CN111814027B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a search-engine-based multi-source person attribute fusion method, applied to multi-source person attribute fusion. Addressing shortcomings of the prior art, it provides an effective way to judge whether multi-source attribute sets belong to the same target person and to eliminate homonym ambiguity and noise as far as possible. Adaptive parameters are set according to how well known the person is, which adjusts how widely the confidence values are dispersed. Finally, the two confidence calculation methods are combined into a weighted joint confidence calculation method for person attribute pairs.

Description

Multi-source character attribute fusion method based on search engine
Technical Field
The invention belongs to the field of big data processing, and particularly relates to a character attribute extraction technology.
Background
With the rapid development of Internet applications, the volume of data available over the network has grown explosively, and quickly and accurately extracting genuinely useful information from this mass of data has become both critical and urgent.
Person attributes, also called person characteristics, comprise all the information that describes a person from birth to death, such as birthplace, date of birth, country, occupation, religion, place of death and date of death. Person attribute extraction identifies the attributes of persons mentioned on the Web and has important practical applications, such as name disambiguation, construction of person knowledge bases and person search engines. Most current research focuses on information extraction in specific domains of the Web, and comparatively little addresses the extraction of person attributes.
Person attribute extraction generally comprises two main processes: multi-source person attribute extraction and person attribute fusion. Attribute extraction parses attribute sets for a person from different information sources, chiefly the different web pages returned by searching for the person's name. Attribute fusion analyzes the attribute sets from the different sources, judges whether each belongs to the target person being searched for, and finally merges the sets that do belong to the target person and outputs the result. Fusion mainly addresses homonym disambiguation and noise among person attributes from different sources.
As shown in Table 1, if we search for the person Zhang San, we obtain the following attribute sets from different sources:
TABLE 1 Example of attribute sets from different sources for the target person Zhang San
Figure BDA0002650227170000011
As shown in Table 1, assume that four attribute sets for Zhang San have been extracted from different sources. The following inferences can be made:
(1) On the face of it, attribute sets 1 and 4 can be assumed to belong to our target person Zhang San, because they share the same birthday and university.
(2) Attribute set 3 may belong to another person who is also named Zhang San; this is the homonym disambiguation problem.
(3) We cannot determine whether sets 1 and 2 describe the same Zhang San. Set 2 may well describe a different person, because a search for Zhang San can return pages about other people related to Zhang San, and these pages affect the extraction result. This is the noise problem.
Fusion generally involves two main steps: attribute alignment and entity alignment. Attribute alignment determines whether attribute sets contain similar or identical attributes; the main approaches are based on string distance, dictionary matching and semantic similarity. For example, in the table above, the attribute "education background" of attribute set 1 corresponds to the attribute "university" of attribute set 4, and finding such correspondences is the attribute alignment process. Entity alignment determines whether multiple entity records refer to the same real-world entity. Here, that means determining whether person attribute sets from different sources point to the same real person; it is carried out by computing the similarity between persons and attribute values, or by introducing other techniques.
The Web is a natural, massive text corpus. For example, the relative page counts returned by Google approximate the frequency with which words and phrases are actually used in society, and linguistics research has begun to support this approach [1].
Search engines provide two useful sources of information: result page counts and snippets. The page count of a query is an estimate of the number of pages containing the query terms. In general, the page count does not necessarily equal the word frequency, because a query word may appear several times on one page; given the volume of data a search engine indexes, however, it can be used here to estimate how often words occur. The page count for the query "P and Q" can be viewed as a global measure of the co-occurrence of the words P and Q. A snippet is a short window of text that the search engine extracts around the query terms in a document, and it provides useful information about the local context of those terms. Snippet-based semantic similarity measures have been used in query expansion, personal name disambiguation, community mining and other fields. Snippets are relatively easy to obtain, and from an engineering standpoint they avoid the inefficiency of downloading every page in the search results.
The related prior art is as follows:
1. Normalized Google Distance (NGD)
Cilibrasi and Vitányi proposed a lexical semantic similarity algorithm based on Google page counts, called the Normalized Google Distance:
$$\mathrm{NGD}(P,Q) = \frac{\max\{\log H(P), \log H(Q)\} - \log H(P,Q)}{\log N - \min\{\log H(P), \log H(Q)\}}$$
where H(P) is the result page count for query P, H(Q) is the result page count for query Q, H(P, Q) is the result page count for the query "P and Q" (that is, pages containing both P and Q), and N is the total number of pages indexed by Google, usually taken as 10^11. N may also be set to any value larger than every H(x): increasing N reduces the NGD values and makes their distribution more compact, while decreasing N enlarges the NGD values and makes their distribution more dispersed.
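As a minimal sketch, the NGD computation can be written as a small Python function, assuming the standard Cilibrasi and Vitányi form of the distance. The function name `ngd`, the default for `n` and the handling of zero counts are illustrative assumptions; real counts would have to come from a search engine.

```python
import math


def ngd(h_p: int, h_q: int, h_pq: int, n: float = 1e11) -> float:
    """Normalized Google Distance from page counts.

    h_p, h_q: result page counts for queries P and Q.
    h_pq:     result page count for pages containing both P and Q.
    n:        assumed total number of indexed pages (the text suggests 10^11).
    """
    if min(h_p, h_q, h_pq) <= 0:
        # Assumption: with no (co-)occurrence, treat the terms as maximally distant.
        return float("inf")
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))
```

With the same page counts, a larger `n` gives a smaller distance, which matches the remark above that increasing N makes the distribution more compact.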
2. Pointwise mutual information similarity algorithm (WebPMI)
In data mining and information retrieval, PMI (Pointwise Mutual Information) is often used to measure the correlation between two events. On this basis, a PMI-style similarity measure based on search engine page counts can be defined as follows:
Figure BDA0002650227170000031
Here, N is the number of documents indexed by the search engine, H(P) is the result page count for query P, H(Q) is the result page count for query Q, and H(P ∩ Q) is the result page count for the query "P and Q" (that is, pages containing both P and Q).
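The following sketch shows one way to compute a PMI-style score from these page counts. It assumes the usual PMI form, log2 of P(P, Q) / (P(P) · P(Q)) with each probability estimated as a page count divided by N; the original equation image is not reproduced here, so the logarithm base and the zero-count handling are assumptions, as is the function name `web_pmi`.

```python
import math


def web_pmi(h_p: int, h_q: int, h_pq: int, n: float = 1e11) -> float:
    """PMI-style similarity from page counts.

    Computes log2( (h_pq / n) / ((h_p / n) * (h_q / n)) ), which simplifies
    algebraically to log2( h_pq * n / (h_p * h_q) ).
    """
    if min(h_p, h_q, h_pq) <= 0:
        # Assumption: missing co-occurrence is treated as "no association".
        return 0.0
    return math.log2(h_pq * n / (h_p * h_q))
```

A score of 0 means the two terms co-occur exactly as often as independence would predict; positive scores mean they co-occur more often than that.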
3. Double-checking similarity calculation based on search result snippets
Chen et al. proposed a double-checking model that uses the text snippets returned by a web search engine to compute semantic similarity between words. For two words P and Q, snippets are collected for each word from the search engine. The number of times P appears in the snippets for Q and the number of times Q appears in the snippets for P are then counted, and these two values are combined non-linearly to compute the similarity between P and Q. The Co-Occurrence Double Checking (CODC) measure is defined as:
Figure BDA0002650227170000032
where H(P@Q) is the number of times the word P appears in the search result snippets for Q, H(Q@P) is the number of times the word Q appears in the search result snippets for P, H(P) is the result page count for query P, H(Q) is the result page count for query Q, and α is an adjustable parameter.
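For reference, the CODC measure is commonly stated in the form below. This is a reconstruction from the symbol definitions in the text and from the way the measure is usually cited, not a copy of the original equation image, so the exact form should be treated as an assumption.

```latex
\mathrm{CODC}(P,Q) =
\begin{cases}
0, & \text{if } H(P@Q) = 0, \\[4pt]
\exp\!\left( \log\!\left[ \dfrac{H(P@Q)}{H(P)} \times \dfrac{H(Q@P)}{H(Q)} \right]^{\alpha} \right), & \text{otherwise.}
\end{cases}
```

The first case is the "double check": if P never appears in the snippets for Q, the similarity is set to zero regardless of the other counts.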
Disclosure of Invention
The technical problem to be solved by the invention is as follows: how to judge whether the multi-source attribute sets belong to the same target person and eliminate the influence of the same name and noise as much as possible.
In order to solve the technical problems, the invention adopts the technical scheme that: a multi-source character attribute fusion method based on a search engine comprises the following steps:
S1. Perform attribute alignment on the two target attribute sets; if corresponding attributes exist, record them as attribute pairs and proceed to step S2; otherwise, end;
S2. Calculate the confidence of each attribute pair that has a correspondence;
S3. Calculate the average of the confidences; if the average is greater than the threshold, the two target attribute sets belong to the same target person; otherwise, they do not.
Further, step S1 comprises the following sub-steps:
S11. Establish a high-confidence attribute dictionary;
S12. Pair attribute names based on string edit distance;
S13. Align the remaining attribute names based on pointwise mutual information.
Further, the confidence level in step S2 is calculated based on the page count of the search engine, and the specific calculation formula is:
Figure BDA0002650227170000041
M=f(C)
where f(C, v_p) is the result page count for the query "person C" and "attribute value v_p"; f(C, v_q) is the result page count for the query "person C" and "attribute value v_q"; and f(C, v_p, v_q) is the result page count for the query "person C", "attribute value v_p" and "attribute value v_q".
Further, the confidence in step S2 is calculated based on snippet content, using the following formulas:
Figure BDA0002650227170000042
Figure BDA0002650227170000043
where f(v_q@(C, v_p)) is the number of times v_q occurs in the result snippets of the query for person C and attribute value v_p; f(C, v_p) is the total number of result snippets for the query for person C and attribute value v_p; μ is the proportion of the f(C, v_p) snippets that is used, μ ∈ (0, 1]; and q and p are adjustment factors.
Further, the confidence level in step S2 is calculated by using the following formula:
$$\mathrm{Con}(T_p,T_q,C) = \beta \times \mathrm{TCDC}(T_p,T_q,C) + (1-\beta)\,\bigl(1-\mathrm{NGDC}(T_p,T_q,C)\bigr)$$
Figure BDA0002650227170000044
where β is the weight, N is the total number of pages indexed by the search engine, α is an adjustable parameter, TCDC(T_p, T_q, C) is the person attribute pair confidence based on double checking, and NGDC(T_p, T_q, C) is the person attribute pair confidence based on the co-occurrence page count of the person's name and the two attributes.
The invention has the following beneficial effects. For the person attribute fusion scenario, the invention designs a stepwise attribute alignment method that reduces the amount of computation, and introduces the WebPMI similarity measure based on search engine page counts to compute similarity between attribute names; this measure works for attribute names in any form, and its results can be cached, which reduces computation further. Drawing on search-engine-based lexical semantic similarity, the invention also designs three methods for computing person attribute pair confidence: two confidence measures, NGDC and TCDC, and a measure that fuses the two. NGDC and TCDC compute the likelihood that an attribute pair belongs to the same target person from search engine page counts and snippet content, respectively. The method uses the characteristics of search engines to bring in additional information, treating the Web as a natural, massive database, and thereby addresses the following problems in computing person attribute pair confidence:
a. attribute values contain many long phrases, whose semantic similarity is difficult to compute with traditional methods;
b. attribute values may take unforeseen forms of expression or include new words that are not yet recorded anywhere;
c. information is insufficient: when the values of two attribute pairs are completely different, it still cannot be concluded that the attributes do not belong to the same person.
Drawings
FIG. 1 is a flow chart of person attribute fusion;
FIG. 2 is a flow diagram of attribute alignment.
Detailed Description
Multi-source person attribute fusion is an important part of person attribute extraction. Its main purposes are noise removal and name disambiguation. Put simply, person attribute fusion must determine whether attribute sets from different information sources point to our target person.
For ease of presentation, assume that several attribute sets for person C have been obtained from different information sources (e.g., web pages, knowledge bases); these sets must be compared pairwise. Consider two target attribute sets to be compared:
Figure BDA0002650227170000051
Figure BDA0002650227170000052
In the attribute set P, k_p is an attribute name and v_p is the corresponding attribute value; K_p denotes the set of attribute names of P and V_p the set of attribute values of P. An attribute pair in P is defined as T_p = (k_p, v_p). The attribute set Q is defined in the same way, and the superscripts 1, 2, …, n denote attribute indices. The rest of this description relies on these definitions without repeating them.
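Written out, the two attribute sets described here would take roughly the following form. This is a reconstruction from the surrounding definitions rather than a copy of the original equation images; in particular, giving Q the same number of attributes n as P is an assumption.

```latex
\begin{aligned}
P &= \{(k_p^1, v_p^1),\ (k_p^2, v_p^2),\ \dots,\ (k_p^n, v_p^n)\} \\
Q &= \{(k_q^1, v_q^1),\ (k_q^2, v_q^2),\ \dots,\ (k_q^n, v_q^n)\}
\end{aligned}
```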
Suppose we search for the person C = "Zhang San" and extract attribute sets from different web page sources, as in Table 1. Taking attribute set 1 as P and attribute set 2 as Q gives the attribute sets P and Q from different sources shown in Table 2.
TABLE 2 Example of attribute sets P and Q from different sources
Figure BDA0002650227170000061
During fusion, the attribute sets must be compared pairwise. Taking P and Q as an example, we need to compute whether P and Q belong to the same Zhang San; if they do not, they are not fused, which removes the ambiguity. P or Q may also belong to some other person, that is, be noise, and this too must be identified during fusion.
The following problems can arise in this computation:
(1) An attribute can be expressed in many ways; for example, the birthday attribute may be expressed as {birthday, …}, and it is difficult to construct a dictionary that covers every expression of every attribute. Traditional lexical similarity measures do not suit this situation, because attribute expressions can include long phrases, short texts and even new words. The invention therefore needs to solve the attribute alignment problem in the setting of person attribute fusion.
(2) Person attribute pair confidence. Suppose attribute alignment has produced a pair of aligned attribute pairs for person C, T_p = (k_p, v_p) and T_q = (k_q, v_q). We call the probability that T_p and T_q both belong to the target person C the person attribute pair confidence, denoted Con(T_p, T_q, C). Like attribute names, attribute values are expressed inconsistently, are mostly long phrases, and include many abbreviations and new words. A further difficulty is that confidence cannot be measured from the string or semantic content of the attribute values alone, because different attribute sets may carry information that does not overlap at all. For example, suppose P and Q each contain an attribute pair named "occupation": T_p = ("occupation", "teacher") and T_q = ("occupation", "scholar"). The two values are literally unrelated, yet the target person may be both a teacher and a student. We therefore need to bring in information beyond the literal attribute values. The invention thus needs to solve the problem of computing attribute value confidence in the setting of person attribute fusion.
To solve these technical problems, the invention provides a search-engine-based multi-source person attribute fusion method which, as shown in FIG. 1, includes:
A1. Perform attribute alignment on the two target attribute sets; if corresponding attributes exist, proceed to step A2; otherwise, end;
A2. Calculate the confidence of each person attribute pair that has a correspondence;
A3. Calculate the average of the confidences; if the average is greater than the threshold, the two target attribute sets belong to the same target person; otherwise, they do not.
The attribute alignment process of step A1 is shown in FIG. 2 and includes the following sub-steps:
(1) First, a high-confidence attribute dictionary is built. High-confidence attributes are attributes that strongly distinguish one person entity from another and whose forms of expression are predictable, for example date of birth and date of death. The dictionary should contain all expressions of each such attribute. Although, as noted earlier, it is difficult to cover every expression of every attribute, only a few attributes need a dictionary here, so it can be built by hand. We first check whether P and Q both contain these high-confidence attributes; if they do, the attribute values are parsed directly with regular-expression rules, and the parsing result directly decides whether P and Q belong to the same person.
(2) Next, attribute names are paired based on string edit distance: the edit distance similarity is computed pairwise for the attribute names k_p and k_q in P and Q. If the edit distance similarity is greater than the threshold threshold_lev, the two attributes are judged to be aligned.
For attribute pairs T_p and T_q, denote the edit distance similarity of k_p and k_q as lev(k_p, k_q). If lev(k_p, k_q) > threshold_lev, then T_p and T_q are judged to be aligned.
The threshold threshold_lev should be set high, because many attribute names are short words and the edit distance serves mainly to absorb redundant spaces, singular and plural forms and similar variations. Experiments indicate that threshold_lev should lie in [0.9, 1]. Note, however, that edit distance does not reflect the semantics of an attribute name, and many attribute names are non-canonical or abbreviated, so semantic information must be considered in the final alignment step.
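A minimal sketch of this step is shown below. The text does not say how the edit distance is turned into a similarity in [0, 1]; dividing by the length of the longer name is an assumption made here, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lev_similarity(k_p: str, k_q: str) -> float:
    """Edit-distance similarity in [0, 1]; 1.0 means identical strings.

    Assumption: normalize by the length of the longer attribute name.
    """
    longest = max(len(k_p), len(k_q))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(k_p, k_q) / longest


def names_aligned(k_p: str, k_q: str, threshold_lev: float = 0.9) -> bool:
    """Judge two attribute names aligned when their similarity exceeds the threshold."""
    return lev_similarity(k_p, k_q) > threshold_lev
```

Under this normalization a single stray character in a 14-character name still passes a 0.9 threshold, while unrelated names such as "university" and "education" do not.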
(3) Finally, we perform attribute alignment based on Pointwise Mutual Information (PMI) for the remaining attribute names.
Figure BDA0002650227170000071
where N is the number of documents indexed by the search engine, H(k_p) is the search result count for attribute name k_p, and H(k_p ∩ k_q) is the search result count for attribute names k_p and k_q together. If WebPMI(k_p, k_q) > threshold_pmi, the attribute pairs T_p and T_q are considered aligned. The threshold threshold_pmi is generally taken from [0.5, 0.7].
Compared with NGD (Normalized Google Distance), PMI-based similarity is better suited to queries that return many result pages; since most attribute names are common words, the PMI-based method is the better fit here.
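The three alignment stages can be sketched as a cascade that tries the cheapest check first. This is a simplified illustration, not the patented procedure: stage 1 here only checks whether both names appear in the same hand-built synonym group and omits the value comparison described for high-confidence attributes. The function `align_names`, its parameters and the default thresholds are hypothetical; the two similarity functions are passed in as callables so that any edit-distance or page-count-based implementation can be plugged in.

```python
from typing import Callable, Iterable, Optional


def align_names(
    k_p: str,
    k_q: str,
    synonym_groups: Iterable[set],
    string_sim: Callable[[str, str], float],
    pmi_sim: Callable[[str, str], float],
    threshold_lev: float = 0.9,
    threshold_pmi: float = 0.6,
) -> Optional[str]:
    """Return the name of the stage that aligned k_p and k_q, or None."""
    # Stage 1: hand-built dictionary of high-confidence attribute names.
    for group in synonym_groups:
        if k_p in group and k_q in group:
            return "dictionary"
    # Stage 2: cheap string comparison.
    if string_sim(k_p, k_q) > threshold_lev:
        return "edit_distance"
    # Stage 3: search-engine-based similarity, the most expensive check.
    if pmi_sim(k_p, k_q) > threshold_pmi:
        return "pmi"
    return None
```

Ordering the stages this way means the search engine is queried only for name pairs that the dictionary and the string comparison could not resolve, which is where the reduction in computation comes from.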
Step A2, the search-engine-based person attribute pair confidence algorithm, comprises the following:
1) Rationale of the algorithm
The number of web pages currently indexed by Google is close to 10^10, and each common search term appears in millions of pages. Such a huge amount of data can be regarded as a sample that genuinely represents human knowledge. The Google probability of a search term, that is, the page count returned by Google divided by the number of pages Google indexes, is close to the actual relative frequency with which the term is used in society.
In name disambiguation, suppose we search for the name Zhang San and two different attribute sets in the results carry two different occupation attributes, teacher and student. The terms "Zhang San teacher" and "Zhang San student" are then the targets of the Google distance calculation. If Zhang San really is both a teacher and a student, the page counts for "Zhang San teacher" and "Zhang San student" will not differ greatly. If the two attribute sets do not point to the same Zhang San, or one of the attributes does not belong to Zhang San at all, the page count for "Zhang San student" will drop, and this is reflected in the NGD formula. This matches the setting for which Google distance was designed, so the invention uses the idea of Google distance to compute the similarity of person attribute pairs.
In addition, attribute similarity should not be measured by string similarity alone. For example, "University of Electronic Science and Technology of China" and "UESTC" in the above example have the same meaning, yet their string similarity is 0. Likewise, many attribute names have unpredictable forms of expression, most consist of more than one word, and it is difficult to measure their semantics with traditional methods such as word2vec or WordNet alone.
2) Person attribute pair confidence based on Google distance
Based on the idea of the NGD algorithm, the invention designs a person attribute pair confidence calculation method. It uses the co-occurrence page count of the person's name and the two attributes to measure the degree of association between the person and the attributes, and is called NGDC (Normalized Google distance of characters).
Figure BDA0002650227170000081
M=f(C)
where f(C, v_p) is the result page count for the query "person C" and "attribute value v_p" (f(C, v_q) is defined likewise), and f(C, v_p, v_q) is the result page count for the query "person C", "attribute value v_p" and "attribute value v_q". Search results for person attribute pairs can be rare, and they differ greatly depending on how well known the target person is. M is therefore an adaptive parameter: f(C) is the page count for a search on person C alone, and setting M in this way makes the results more dispersed. NGDC(T_p, T_q, C) ∈ [0, +∞); in some special cases, if the search engine's counts are inaccurate, NGDC(T_p, T_q, C) may fall below 0, and such cases can be ignored. The larger the value of NGDC(T_p, T_q, C), the lower the probability that T_p and T_q both belong to C, and vice versa. The threshold generally lies in [0.5, 1.5]. In this embodiment the threshold is set to 1.0: when NGDC(T_p, T_q, C) is less than 1.0, T_p and T_q are judged to both belong to C; otherwise, at least one of T_p and T_q does not belong to C.
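The sketch below illustrates how such a score could be computed. It assumes that NGDC has the same shape as NGD with N replaced by the adaptive parameter M = f(C), which is what the surrounding text and the max/min notation in the claims suggest; the original equation image is not available here, so the exact formula, the function names and the handling of degenerate counts are all assumptions.

```python
import math


def ngdc(f_c_vp: int, f_c_vq: int, f_c_vp_vq: int, f_c: int) -> float:
    """NGD-style distance for a person attribute pair.

    Assumed form:
        (max[log f(C,v_p), log f(C,v_q)] - log f(C,v_p,v_q))
        / (log M - min[log f(C,v_p), log f(C,v_q)]),   with M = f(C).
    """
    if min(f_c_vp, f_c_vq, f_c_vp_vq) <= 0:
        # Assumption: no co-occurrence pages means maximal distance.
        return float("inf")
    log_p, log_q = math.log(f_c_vp), math.log(f_c_vq)
    denom = math.log(f_c) - min(log_p, log_q)
    if denom <= 0:
        # Assumption: inconsistent counts (f(C) not larger than f(C, v)) are
        # treated as "no evidence" rather than producing a negative distance.
        return float("inf")
    return (max(log_p, log_q) - math.log(f_c_vp_vq)) / denom


def same_person_by_ngdc(score: float, threshold: float = 1.0) -> bool:
    """Smaller distance means the two attribute pairs more likely share person C."""
    return score < threshold
```

For example, with hypothetical counts f(C) = 10000, f(C, v_p) = 500 and f(C, v_q) = 400, a co-occurrence count of 400 gives a score well below 1.0, while a co-occurrence count of 2 gives a score above 1.0.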
3) Person attribute pair confidence based on snippet double checking
The Google-distance-based method above is suitable for computing person attribute pair confidence when there are few search results. It considers only page counts, not page content. To compensate, the invention provides a snippet-based person attribute pair confidence method. A snippet is the excerpt window shown for each web page in the search results; it typically contains the search keywords and reflects the key content of the page that relates to them. Because requesting the full content of every page is difficult from an engineering standpoint, analyzing snippets is a good choice here.
The invention's double-checking-based person attribute pair confidence algorithm is called TCDC (pipe of characters and attributes confidence on double check).
Figure BDA0002650227170000091
Figure BDA0002650227170000092
where f(v_q@(C, v_p)) is the number of times v_q occurs in the result snippets of the query for person C and attribute value v_p, and f(C, v_p) is the total number of result snippets for that query. μ ∈ (0, 1] is the proportion of the f(C, v_p) snippets that is used; it keeps the number of snippets processed from growing too large, so that the algorithm remains feasible. q and p are adjustment factors that prevent distortion of the result when f(C, v_p) or f(C, v_q) is itself small. TCDC(T_p, T_q, C) ∈ [0, 1]. The snippet proportion μ can generally be set according to the available computing power, with the number of snippets processed usually kept within 1000. The threshold for TCDC(T_p, T_q, C) is about 0.5: when TCDC(T_p, T_q, C) is greater than 0.5, T_p and T_q are judged to both belong to C; otherwise, at least one of T_p and T_q does not belong to C.
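The basic quantity in this method, f(v_q@(C, v_p)), is an occurrence count over snippets and can be sketched as follows. The sketch covers only that counting step, not the TCDC formula itself, whose equation image is not reproduced here. Taking the first μ fraction of the snippets and matching case-insensitively are assumptions, and the function name is illustrative.

```python
import math
from typing import Sequence


def count_in_snippets(term: str, snippets: Sequence[str], mu: float = 1.0) -> int:
    """Count occurrences of `term` in the first ceil(mu * len(snippets)) snippets.

    `snippets` stands for the result snippets of the query (C, v_p), and `term`
    for the other attribute value v_q.
    """
    if not 0.0 < mu <= 1.0:
        raise ValueError("mu must lie in (0, 1]")
    if not term:
        return 0
    limit = math.ceil(mu * len(snippets))
    needle = term.lower()
    # Assumption: plain case-insensitive substring matching.
    return sum(s.lower().count(needle) for s in snippets[:limit])
```

Calling the function once with the snippets for (C, v_p) and the term v_q, and once with the snippets for (C, v_q) and the term v_p, yields the two counts that a double-checking measure combines.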
4) Person attribute pair confidence Con(T_p, T_q, C)
Combining the two person attribute pair measures NGDC and TCDC, the invention proposes the following method for computing the person attribute pair confidence Con(T_p, T_q, C) for a person C and attribute pairs T_p and T_q:
$$\mathrm{Con}(T_p,T_q,C) = \beta \times \mathrm{TCDC}(T_p,T_q,C) + (1-\beta)\,\bigl(1-\mathrm{NGDC}(T_p,T_q,C)\bigr) \tag{7}$$
Figure BDA0002650227170000101
where β is the weight; N is the total number of pages indexed by the search engine, typically 10^11; and α ∈ (0, 1] is an adjustable parameter that reduces the effect of an excessive gap between N and the query result counts. The fewer pages the query for the person attribute pair returns, the smaller β becomes and the greater the weight given to NGDC, and vice versa. Con(T_p, T_q, C) ∈ (−∞, +∞). When Con(T_p, T_q, C) is greater than the threshold, T_p and T_q are judged to belong to C; otherwise they are not. In practice the absolute value of Con(T_p, T_q, C) is not large (below 1, and driven mainly by NGDC), and the threshold is taken from [0.5, 0.75]. In this embodiment the threshold is 0.6: when Con(T_p, T_q, C) is greater than 0.6, T_p and T_q are judged to belong to C; otherwise, at least one of T_p and T_q is judged not to belong to C.
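Equation (7) itself is a simple weighted combination, sketched below. The weight β is passed in directly because its defining formula is not reproduced in this text; restricting β to [0, 1] is an assumption, and the function names are illustrative.

```python
def combined_confidence(tcdc: float, ngdc: float, beta: float) -> float:
    """Con = beta * TCDC + (1 - beta) * (1 - NGDC), following equation (7)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to lie in [0, 1]")
    return beta * tcdc + (1.0 - beta) * (1.0 - ngdc)


def pair_belongs_to_person(con: float, threshold: float = 0.6) -> bool:
    """Decision rule from the embodiment: Con above the threshold means both belong to C."""
    return con > threshold
```

Because NGDC is a distance, it enters as 1 − NGDC, so that a high TCDC and a low NGDC both push the combined confidence up.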
The confidence level average calculation formula in the step A3 is as follows:
Figure BDA0002650227170000102
where Con(P, Q) is the confidence that P and Q both belong to C, n is the total number of aligned attribute pairs, and Con(T_p, T_q, C) is the confidence that T_p and T_q belong to C. If Con(P, Q) is greater than the threshold threshold_con, P and Q are judged to both belong to C; otherwise they are not. The threshold threshold_con takes its value in the same way as the threshold for Con(T_p, T_q, C) above.
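Step A3 can be sketched as an arithmetic mean over the aligned attribute pairs followed by a threshold test, that is, Con(P, Q) = (1/n) Σ Con(T_p^i, T_q^i, C) for i = 1..n. This form is reconstructed from the description of the average, not copied from the equation image; the function names and the default threshold of 0.6 (borrowed from the pair-level embodiment) are assumptions.

```python
from typing import Sequence


def set_confidence(pair_confidences: Sequence[float]) -> float:
    """Average the pair-level confidences Con(T_p, T_q, C) over all aligned pairs."""
    if not pair_confidences:
        raise ValueError("no aligned attribute pairs to average")
    return sum(pair_confidences) / len(pair_confidences)


def same_target_person(pair_confidences: Sequence[float],
                       threshold_con: float = 0.6) -> bool:
    """Judge that P and Q both belong to C when the average exceeds the threshold."""
    return set_confidence(pair_confidences) > threshold_con
```

An empty list raises an error here because the method ends at step A1 when no attributes can be aligned, so step A3 is never reached in that case.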
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (3)

1. A multi-source character attribute fusion method based on a search engine, characterized by comprising the following steps:
S1, performing attribute alignment on two target attribute sets; if corresponding attributes exist, recording them as attribute pairs and executing step S2; otherwise, ending;
S2, calculating a confidence for each attribute pair having a correspondence; wherein in step S2 the confidence is calculated based on search engine page counts, using the following formula:
Figure FDA0004054742050000011
M=f(C)
wherein f(C, v_p) is the page count returned by searching for "person C" and "attribute value v_p", f(C, v_p, v_q) is the page count returned by searching for "person C", "attribute value v_p" and "attribute value v_q", max[·] denotes taking the maximum, and min[·] denotes taking the minimum;
or,
in step S2 the confidence is calculated based on fragment content, using the following formulas:
Figure FDA0004054742050000012
Figure FDA0004054742050000013
wherein f(v_q@(C, v_p)) is the number of occurrences of v_q in the result fragments of the query for person C and attribute name v_p; f(C, v_p) is the total number of result fragments of the query for person C and attribute name v_p; μ ∈ (0,1] is the proportion of the f(C, v_p) fragments that is taken; q and p are regulatory factors;
or,
in step S2 the confidence is calculated using the following formula:
Con(T_p, T_q, C) = β × TCDC(T_p, T_q, C) + (1 − β)(1 − NGDC(T_p, T_q, C))
Figure FDA0004054742050000014
wherein β is the weight, N is the total number of pages indexed by the search engine, α is an adjustable parameter, TCDC(T_p, T_q, C) is the person attribute pair confidence based on double checking, and NGDC(T_p, T_q, C) is the person attribute pair confidence based on the page counts for the co-occurrence of the person name and the two attributes;
S3, calculating the average value of the confidences; if the average value is greater than a threshold, the two target attribute sets belong to the same target person; otherwise, they do not belong to the same target person.
2. The multi-source character attribute fusion method based on the search engine as claimed in claim 1, wherein the step S1 comprises the following sub-steps:
S11, establishing a high-confidence attribute dictionary;
S12, pairing attribute names based on string edit distance;
S13, aligning the remaining attribute names based on pointwise mutual information.
3. The multi-source character attribute fusion method based on the search engine as claimed in claim 1, wherein the formula for calculating the average value of the confidence degrees in step S3 is:
Figure FDA0004054742050000021
wherein Con(P, Q) is the confidence that P and Q belong to C, n is the total number of aligned attribute pairs, and Con(T_p, T_q, C) is the confidence that T_p and T_q belong to C.
CN202010867732.6A 2020-08-26 2020-08-26 Multi-source character attribute fusion method based on search engine Active CN111814027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010867732.6A CN111814027B (en) 2020-08-26 2020-08-26 Multi-source character attribute fusion method based on search engine

Publications (2)

Publication Number Publication Date
CN111814027A CN111814027A (en) 2020-10-23
CN111814027B true CN111814027B (en) 2023-03-21

Family

ID=72859681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010867732.6A Active CN111814027B (en) 2020-08-26 2020-08-26 Multi-source character attribute fusion method based on search engine

Country Status (1)

Country Link
CN (1) CN111814027B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001195426A (en) * 2000-01-17 2001-07-19 Nippon Telegr & Teleph Corp <Ntt> Method and device for retrieving document class and storage medium with document class retrieval program stored therein
CN104699818A (en) * 2015-03-25 2015-06-10 武汉大学 Multi-source heterogeneous multi-attribute POI (point of interest) integration method
CN107748799A (en) * 2017-11-08 2018-03-02 四川长虹电器股份有限公司 A kind of method of multi-data source movie data entity alignment
CN110377747A (en) * 2019-06-10 2019-10-25 河海大学 A kind of knowledge base fusion method towards encyclopaedia website
CN110457486A (en) * 2019-07-05 2019-11-15 中国人民解放军战略支援部队信息工程大学 The people entities alignment schemes and device of knowledge based map
CN111221982A (en) * 2020-01-13 2020-06-02 腾讯科技(深圳)有限公司 Information processing method, information processing device, computer-readable storage medium and computer equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10148589B2 (en) * 2014-09-29 2018-12-04 Pearson Education, Inc. Resource allocation in distributed processing systems
CN106599297A (en) * 2016-12-28 2017-04-26 北京百度网讯科技有限公司 Method and device for searching question-type search terms on basis of deep questions and answers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
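Research on Extraction and Fusion Methods for Structured Person Attributes on the Web; 叶浩维; China Master's Theses Full-text Database, Information Science and Technology; I138-919 *
Research on Multi-source Person Attribute Fusion Methods; 张磊; China Master's Theses Full-text Database, Information Science and Technology; I138-1612 *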

Also Published As

Publication number Publication date
CN111814027A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
AU2019263758B2 (en) Systems and methods for generating a contextually and conversationally correct response to a query
US9558264B2 (en) Identifying and displaying relationships between candidate answers
US9613024B1 (en) System and methods for creating datasets representing words and objects
Plachouras et al. Interacting with financial data using natural language
Kowalski et al. Information storage and retrieval systems: theory and implementation
Kilgarriff et al. Introduction to the special issue on the web as corpus
US8073877B2 (en) Scalable semi-structured named entity detection
Varma et al. IIIT Hyderabad at TAC 2009.
US8370129B2 (en) System and methods for quantitative assessment of information in natural language contents
US20110301941A1 (en) Natural language processing method and system
US20040141354A1 (en) Query string matching method and apparatus
JP2005539283A (en) System, method, and software for hyperlinking names
Küçük et al. Exploiting information extraction techniques for automatic semantic video indexing with an application to Turkish news videos
CN114912449B (en) Technical feature keyword extraction method and system based on code description text
Yilahun et al. Entity extraction based on the combination of information entropy and TF-IDF
Liu et al. Temporal knowledge extraction from large-scale text corpus
Karpagam et al. A framework for intelligent question answering system using semantic context-specific document clustering and Wordnet
Derici et al. A closed-domain question answering framework using reliable resources to assist students
Kashefi et al. Optimizing Document Similarity Detection in Persian Information Retrieval.
CN115828854B (en) Efficient table entity linking method based on context disambiguation
CN111814027B (en) Multi-source character attribute fusion method based on search engine
Zhang et al. An approach for named entity disambiguation with knowledge graph
CN111259136A (en) Method for automatically generating theme evaluation abstract based on user preference
Patel et al. An automatic text summarization: A systematic review
CN115544225A (en) Digital archive information association retrieval method based on semantics

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant