CN111079035B

CN111079035B - Domain searching and sorting method based on dynamic map link analysis

Info

Publication number: CN111079035B
Application number: CN201911146865.8A
Authority: CN
Inventors: 鲍家坤; 刘思培; 高天成; 曹玲玲; 张志虎; 袁鸯; 宋春林; 侯海婷; 邹媛媛; 童安玲; 李金龙; 李香亭; 王娟; 杨磊
Original assignee: North Information Control Institute Group Co ltd
Current assignee: North Information Control Institute Group Co ltd
Priority date: 2019-11-21
Filing date: 2019-11-21
Publication date: 2023-04-28
Anticipated expiration: 2039-11-21
Also published as: CN111079035A

Abstract

The invention belongs to the field of Internet search, and particularly relates to a domain search ranking method based on dynamic map link analysis. The method establishes semantic-level link relations among the file resources in domain search, evaluates these relations from the two aspects of authority and relevance, and finally achieves fused ranking of search results. The method comprises the following steps: dynamic construction of a domain map oriented to search ranking; offline incremental calculation of file node authority based on the full graph; online calculation of file node relevance based on the search subgraph; and fused ranking of search results based on authority and relevance. Using the entities and relations in the text content of the files as ties, the method associates originally isolated files at the semantic level and overcomes the information-island problem of single files in search ranking; it analyzes file nodes at both the authority level and the relevance level, and finally achieves fused ranking of search results.

Description

Domain searching and sorting method based on dynamic map link analysis

Technical Field

The invention belongs to the field of Internet search, and particularly relates to a domain search ranking method based on dynamic map link analysis.

Background

Helping users locate the resources they need accurately and quickly is a constant goal of search engines. However, as information is continuously generated and accumulated, a single search often returns a large number of results. A search engine must therefore rely on an effective search ranking method to identify the results the user wants and present them first. Compared with general Internet search, domain search serves more specific users with more specific purposes, and places higher demands on search ranking.

Conventional search ranking methods based on word frequency and word position rely on too narrow a ranking basis and cannot take the quality of file resources into account. Existing search ranking methods based on web page link analysis (such as PageRank and HillTop) cannot be applied directly to domain search, which lacks web page link relations. Existing search ranking methods based on learning user browsing preferences (such as RankSVM) usually train on "user-query" records as isolated sample sets; they handle past search requests of past users well, but have difficulty providing effective rankings for new users or new requests, and even when improved by means of similar "user-query" records they are not suitable for domain search scenarios with a small number of users. The bidding ranking method of Internet search engines runs contrary to the principles of professionalism and authority in domain search and is not applicable.

Disclosure of Invention

The invention aims to provide a domain searching and sorting method based on dynamic map link analysis.

The technical solution for realizing the purpose of the invention is as follows:

The method first establishes semantic-level link relations for the file resources in domain search, then performs calculation from the two aspects of authority and relevance, and finally achieves fused ranking of search results. The specific steps are as follows:

step (1): dynamic construction of a domain map oriented to search ranking; the various file sets in the domain are taken as input, and a domain map is constructed;

step (2): offline incremental calculation of file node authority based on the full graph; the domain map of step (1) is taken as input, and the authority of each file node in the domain map is calculated;

step (3): online calculation of file node relevance based on the search subgraph; the domain map and the user's search terms are taken as input, a search subgraph related to the search is extracted from the whole domain map, and the relevance of each file node in the subgraph is calculated;

step (4): fused ranking of search results based on authority and relevance; the authority and relevance of each file node in the search subgraph of step (3) are taken as input, the ranking degree of each file node is calculated from them, and the results are sorted by ranking degree and returned to the user.

Compared with the prior art, the invention has the remarkable advantages that:

(1) The search-ranking-oriented domain map construction method uses the entities and relations in the text content of the files as ties and associates originally isolated files at the semantic level. This overcomes the information-island problem of single files in search ranking and brings all files of the domain into one association system for evaluation; the constructed domain map lays the foundation for analyzing the authority and relevance of each file node.

(2) The definitions and calculation methods for file node authority and relevance based on the domain map, provided in step (2) and step (3), quantitatively evaluate the authority of a file node within the whole domain map and the relevance between a file node and the user's search keywords within the search subgraph. This enables the search ranking method that fuses authority and relevance, provided in step (4).

(3) The dynamic construction method and incremental calculation method provided in step (1) and step (2) can dynamically construct the domain map and incrementally calculate file node authority over the whole domain map as the files to be searched are added, deleted or modified. This reduces the amount of computation and improves the computational efficiency and practicability of the system.

Drawings

FIG. 1 is a flow chart of a search ranking method of the present invention.

Fig. 2 is a partial schematic diagram of a domain map of the search ranking method of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

As shown in fig. 1, an overall flowchart of a domain search ranking method based on dynamic graph link analysis according to an embodiment of the present invention includes the following steps:

Step S1: dynamic construction of a domain map oriented to search ranking. Fig. 2 is a partial schematic diagram of a domain map, which is composed of four kinds of elements: entity nodes, file nodes, association edges and link edges. The file nodes correspond to all files to be searched in the domain search; the file types include, but are not limited to, text files, multimedia files and database files. The entity nodes correspond to the entities described in the file contents and are obtained through entity extraction, entity disambiguation, coreference resolution and similar steps. An association edge describes the association relation between two entity nodes; it is obtained through relation extraction once the entity nodes have been extracted, and new potential relations between entities are discovered through relation reasoning. Each association edge has a weight that represents how close the relation between the two entity nodes is. A link edge connects an entity node and a file node when the entity node is extracted from the description in the file; each link edge has a weight that represents how closely the file and the entity are related.

Step S101: construction of "entity nodes" and "association edges" and calculation of "association weights". Drawing on prior knowledge of the domain, entity recognition, entity disambiguation (resolving one name that refers to different entities) and coreference resolution (resolving different names that refer to the same entity) are performed on the text content of the files to obtain named entities that are as accurate and unambiguous as possible; these entities form the "entity nodes" of the domain map. Next, the association relations between entities in the text content are identified to obtain candidate relations, which are then disambiguated and resolved to obtain accurate, unambiguous association relations; these form the "association edges" of the domain map. Note that association edges are directed and appear in pairs. Each association edge has an "association weight" that represents how close the relation between the two entities is; it may be, but is not limited to being, measured by the co-occurrence of the entities.

The association weight is calculated in two steps, initial association weight calculation and normalization. If the entities at the two ends of an association edge co-occur in k files in total, the initial association weight corrValue'(i, j) of that edge equals k. After the initial weights of all association edges have been calculated, the initial weights corrValue'(i, j) of the edges leaving the same entity node i are normalized in proportion to their values, which gives the association weights corrValue(i, j).
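The two-step calculation above can be illustrated with a minimal Python sketch. The input format (`file_entities`, a mapping from file id to the set of entity names extracted from it) and the function name are hypothetical, and the sketch assumes for simplicity that every pair of entities co-occurring in a file has an association edge, whereas the method itself derives edges from relation extraction.

```python
from collections import defaultdict
from itertools import combinations

def association_weights(file_entities):
    """Return corrValue(i, j) for every directed pair of co-occurring entities."""
    initial = defaultdict(int)                 # corrValue'(i, j) = number of shared files
    for entities in file_entities.values():
        for a, b in combinations(sorted(entities), 2):
            initial[(a, b)] += 1
            initial[(b, a)] += 1               # edges are directed and appear in pairs
    out_total = defaultdict(int)               # sum of initial weights leaving each node
    for (i, _j), k in initial.items():
        out_total[i] += k
    # Normalize the weights leaving the same entity node in proportion to their values.
    return {(i, j): k / out_total[i] for (i, j), k in initial.items()}

# Hypothetical example: three files and the entities extracted from them.
files = {"f1": {"A", "B"}, "f2": {"A", "B", "C"}, "f3": {"A", "C"}}
corr = association_weights(files)
```

In this example A co-occurs with B in two files and with C in two files, so corrValue(A, B) = 2/4, while corrValue(B, A) = 2/3 because B has only three co-occurrences in total; the weights of a pair of opposite edges therefore generally differ.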

Step S102: construction of "file nodes" and "link edges" and calculation of "link weights". The file nodes in the domain map are in one-to-one correspondence with the files to be searched and can be constructed directly; each file node in the map represents one file to be searched. If an "entity node" is extracted from the content of the file corresponding to a "file node", a link edge exists between that entity node and that file node. The link weight calculation comprises two processes, initial link weight calculation and normalization. The initial link weight takes two aspects into account: the degree of association α of the entity node to the file node, and the degree of importance β of the file node to the entity node.

(1) When the importance of file nodes to entity nodes is difficult to classify or evaluate manually, β ≡ 1 for all file nodes, and the initial link weight is linkValue' = α. After the initial weights of all link edges have been calculated, the initial weights of the link edges connected to the same file node are normalized to obtain the link weight linkValue. α is calculated as follows:

$$\alpha = TF(t,d) \cdot IDF(t,d) \cdot \alpha_1(t,d)$$

where t is the entity name of the entity node; d is the file to be retrieved; TF(t, d) is the frequency of occurrence of t in d; $IDF(t,d) = \log\left(N / (N_{t,d} + \gamma)\right)$, where N is the number of files in the set of files to be retrieved, $N_{t,d}$ is the number of files containing entity t, and γ is set to 0.01 to ensure that the denominator is not zero; and $\alpha_1(t,d)$ is a position coefficient that is greater than 1 when the entity name t appears in a special position such as the title, abstract or keywords, and equal to 1 otherwise.

(2) Further, when entities and files can be classified and scored manually according to the domain, a β value is set for the importance of each type of file in that domain. For example, files in the financial domain may be classified into reports, accounts, financial news and so on; files in the mechanical domain into instruction manuals, operation manuals, reference materials and so on; and files in the software domain into software test descriptions, software development manuals, software test reports and so on. In this case the initial link weight is linkValue' = α·β (α is calculated as in case (1)).
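The link weight calculation for one file node can be sketched as follows. The function names, the corpus size, the position coefficient 1.5 and the value β = 1.2 are all hypothetical values chosen for illustration; TF is taken here as a raw count, and normalization over the link edges of the file is assumed to apply in case (2) as well as in case (1).

```python
import math

def association_degree(tf, n_files, n_files_with_t, position_coeff=1.0, gamma=0.01):
    """alpha = TF * IDF * alpha_1, with IDF = log(N / (N_t + gamma))."""
    idf = math.log(n_files / (n_files_with_t + gamma))
    return tf * idf * position_coeff

def normalize(initial):
    """Scale the initial link weights of one file node so that they sum to 1."""
    total = sum(initial.values())
    return {entity: w / total for entity, w in initial.items()}

N = 100       # hypothetical number of files in the corpus
beta = 1.2    # hypothetical importance score for this file's type (use 1.0 for case (1))
initial = {
    # "pump" occurs 3 times, appears in the title (assumed coefficient 1.5), in 10 files overall
    "pump": association_degree(tf=3, n_files=N, n_files_with_t=10, position_coeff=1.5) * beta,
    # "valve" occurs once in the body text and is found in 50 files overall
    "valve": association_degree(tf=1, n_files=N, n_files_with_t=50) * beta,
}
link_value = normalize(initial)
```

Because β is constant for a given file, it cancels in the per-file normalization of this sketch; it would only affect the result if normalization were carried out differently, which the text does not specify.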

Step S103: incremental dynamic updating of the map. In domain search applications the files to be searched may change, so a corresponding incremental update mechanism for the domain map is needed to avoid rebuilding the whole map when only some files change. The set of files to be searched can change in three ways: files are added, deleted or modified. When a file is added, the entity nodes, file node, association edges and link edges corresponding to the new file are extracted according to the methods of steps S101 and S102, and the weights of the affected association edges and link edges are updated. When a file is deleted, the corresponding file node and the edges connected to it are deleted first; if this leaves an entity node with no link edge, that entity node and the edges connected to it are deleted as well; the weights of the affected association edges and link edges are then updated. When a file is modified, the domain map is updated by the equivalent operations of deleting the old file and adding the new one.
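A minimal sketch of this incremental bookkeeping is given below. The class `CooccurrenceIndex` and its methods are hypothetical; it tracks only which entities each file links to and the co-occurrence counts behind the initial association weights, and it again assumes that co-occurring entities are associated. Weight normalization is left out.

```python
from collections import Counter
from itertools import combinations

class CooccurrenceIndex:
    """Keeps file-to-entity links and pair co-occurrence counts up to date."""

    def __init__(self):
        self.file_entities = {}        # file id -> frozenset of linked entities
        self.pair_count = Counter()    # unordered entity pair -> number of shared files

    def add_file(self, file_id, entities):
        self.file_entities[file_id] = frozenset(entities)
        for pair in combinations(sorted(entities), 2):
            self.pair_count[pair] += 1

    def remove_file(self, file_id):
        entities = self.file_entities.pop(file_id)
        for pair in combinations(sorted(entities), 2):
            self.pair_count[pair] -= 1
            if self.pair_count[pair] == 0:
                del self.pair_count[pair]   # the association edge disappears

    def modify_file(self, file_id, entities):
        # A modification is treated as a deletion followed by an addition.
        self.remove_file(file_id)
        self.add_file(file_id, entities)

    def entities(self):
        # An entity node survives only while some file still links to it.
        return set().union(*self.file_entities.values()) if self.file_entities else set()

index = CooccurrenceIndex()
index.add_file("f1", {"A", "B"})
index.add_file("f2", {"B", "C"})
index.remove_file("f1")                 # entity A loses its only link edge
after_delete = (index.entities(), dict(index.pair_count))
index.modify_file("f2", {"C", "D"})
after_modify = (index.entities(), dict(index.pair_count))
```

Only the pairs and entities touched by the changed file are updated, which is the point of the incremental mechanism: files that did not change are never re-processed.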

Step S2: offline incremental calculation of file node authority based on the full graph. The invention treats the entity nodes of the domain map as the states the system can reach; the transition probabilities between states are determined by the association edge weights between entity nodes, so that the whole system forms a Markov chain whose stationary distribution gives the authority of the entity nodes. If the total number of entity nodes is N, the transition matrix is $B_{N \times N}$ (a matrix B with N rows and N columns) and the authority vector of the N entity nodes is $x_{N \times 1}$ (a column vector x with N rows), then Bx = x. Based on the Monte Carlo method, random walks are used to simulate the behavior of users visiting entity nodes; when the domain map changes, the random walk processes can be updated incrementally for the affected entity nodes, which achieves incremental calculation of entity node authority. The authority of a file node equals the sum, over its link edges, of the link weight multiplied by the authority of the linked entity node.

Step S201: design of "entity node" authority.

If entity node i has an association edge pointing to entity node j, then $A_{ji}$ = corrValue(i, j); otherwise $A_{ji} = 0$. Because the weights of the association edges leaving the same entity node were normalized in step S101, the i-th column of matrix A sums to 1 whenever entity node i has an association edge pointing to another entity node. If entity node i does not point to any other entity node, $A_{ii} = 1$ is enforced. This ensures that A is a transition matrix whose columns each sum to 1.

Considering that a user, with a certain probability 1 − δ (which can be estimated as the number of times users directly visit a new node divided by the total number of node visits), skips the link relations and directly visits a new node, an iterative relation between $x_n$ and $x_{n+1}$ is obtained from the Markov model.

According to the definition in step S2, the authority vector x of the N entity nodes is the stationary distribution of the Markov chain. The quantities $x_n$ and $x_{n+1}$ in this relation can be regarded as successive iterates in the calculation of x (that is, $x = x_{n=\infty}$).

Let B denote the one-step transition matrix of this iteration, so that $x_{n+1} = B x_n$. B also has columns that each sum to 1, so calculating the entity node authority is equivalent to finding the stationary distribution of the Markov chain with transition matrix B; that is, the authority vector satisfies x = Bx (equivalently, $x = x_{n=\infty}$).
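As an illustration, one form of B that is consistent with the surrounding definitions is B = δ·A + ((1 − δ)/N)·E, where E is the N × N all-ones matrix. This form assumes that a direct jump picks the new node uniformly at random, which the text does not state explicitly, and the value δ = 0.85 below is likewise an assumption. The following sketch builds such a B in plain Python and finds x by repeated multiplication; the function names and the example graph are hypothetical.

```python
def transition_matrix(corr, nodes, delta=0.85):
    """Build B = delta * A + (1 - delta) / N, assuming uniform direct jumps."""
    n = len(nodes)
    idx = {v: k for k, v in enumerate(nodes)}
    A = [[0.0] * n for _ in range(n)]
    for (i, j), w in corr.items():
        A[idx[j]][idx[i]] = w                  # A_ji = corrValue(i, j)
    for c in range(n):
        if sum(A[r][c] for r in range(n)) == 0.0:
            A[c][c] = 1.0                      # node with no outgoing edge: A_ii = 1
    return [[delta * A[r][c] + (1.0 - delta) / n for c in range(n)] for r in range(n)]

def stationary(B, iters=200):
    """Iterate x_{n+1} = B x_n from the uniform vector."""
    n = len(B)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [sum(B[r][c] * x[c] for c in range(n)) for r in range(n)]
    return x

nodes = ["A", "B", "C"]
corr = {("A", "B"): 0.5, ("A", "C"): 0.5, ("B", "A"): 1.0}   # C has no outgoing edge
B = transition_matrix(corr, nodes)
x = stationary(B)
```

In this example node C receives the largest authority, since walks that reach it can only leave through a direct jump.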

Step S202: design of incremental "entity node" authority calculation based on the Monte Carlo method. Random walks are used to simulate the behavior of users visiting entity nodes, and the stationary distribution of the Markov chain in step S201, that is, the authority of each entity node, is estimated by counting the number of times each node is visited.

The invention cycles through the starting points: M random walk processes are started from each of the N entity nodes (N × M random walk processes in total). At each step, the walk directly visits a new node with probability (1 − α), which can be regarded as the end of the current walk, and moves from entity node i to entity node j with probability α·corrValue(i, j). Here α is the probability of following a link, corresponding to δ in step S201, and is distinct from the degree of association α of step S102. Finally, the number of visits v(i) to each entity node i is counted, and v(i) is divided by the total number of visits to all entity nodes, which gives the average visit probability of node i, that is, the authority of entity node i.
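The random walk estimate can be sketched as follows. The data layout (`out_edges`, mapping each node to a list of (neighbor, corrValue) pairs), the function names and the values M = 200 and α = 0.85 are assumptions for illustration; a node without outgoing edges is given a self-loop of weight 1, matching the convention A_ii = 1 of step S201.

```python
import random

def random_walks(out_edges, m, alpha, rng):
    """Start m walks from every node; each step continues with probability alpha."""
    walks = []
    for start in out_edges:
        for _ in range(m):
            walk = [start]
            while rng.random() < alpha and out_edges[walk[-1]]:
                nbrs, weights = zip(*out_edges[walk[-1]])
                walk.append(rng.choices(nbrs, weights=weights)[0])
            walks.append(walk)                 # the walk ends on a direct jump
    return walks

def authority_from_walks(walks):
    """authority(i) = v(i) / total visits, where v(i) counts visits to node i."""
    visits = {}
    for walk in walks:
        for node in walk:
            visits[node] = visits.get(node, 0) + 1
    total = sum(visits.values())
    return {node: v / total for node, v in visits.items()}

out_edges = {
    "A": [("B", 0.5), ("C", 0.5)],
    "B": [("A", 1.0)],
    "C": [("C", 1.0)],                         # self-loop for a node with no outgoing edge
}
walks = random_walks(out_edges, m=200, alpha=0.85, rng=random.Random(0))
authority = authority_from_walks(walks)
```

Keeping the recorded `walks` rather than only the visit counts is what later makes an incremental update possible, because individual walks can be truncated and continued.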

When the structure of the domain map changes, the authority of each entity node can be calculated incrementally. The specific method is as follows. First, the random walk processes from before each round of map structure change must be recorded. For the current round, the entity nodes that changed (added or deleted entity nodes; this set is denoted X) and the association edges that changed (added or deleted association edges, or edges whose weight changed; this set is denoted Y) are collected. The entity nodes that have an association relation with a node in X, or that are connected by an edge in Y, are denoted as set Z; X ∪ Z is then the set of trigger nodes for which the random walk flow must be updated. The update examines the N × M random walk processes of the previous round, finds the first trigger node in each, keeps the part of the walk before the trigger node, continues the subsequent walk according to the new domain map, and then calculates the authority of each entity node.
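The update of recorded walks can be sketched as below. The function `refresh_walks`, the data layout and the example graph are hypothetical. The sketch keeps each walk up to and including its first trigger node, on the reasoning that the step into that node was taken from an unchanged node, and it drops walks that started at a deleted node; starting M fresh walks from newly added nodes is not shown.

```python
import random

def refresh_walks(walks, triggers, new_out_edges, alpha, rng):
    """Re-simulate each recorded walk from its first trigger node onward."""
    refreshed = []
    for walk in walks:
        cut = next((k for k, node in enumerate(walk) if node in triggers), None)
        if cut is None:
            refreshed.append(list(walk))       # walk never touches a changed region
            continue
        # Step back past nodes that no longer exist in the new map.
        while cut >= 0 and walk[cut] not in new_out_edges:
            cut -= 1
        if cut < 0:
            continue                           # walk started at a deleted node
        new_walk = list(walk[:cut + 1])        # keep the walk up to the trigger node
        while rng.random() < alpha and new_out_edges[new_walk[-1]]:
            nbrs, weights = zip(*new_out_edges[new_walk[-1]])
            new_walk.append(rng.choices(nbrs, weights=weights)[0])
        refreshed.append(new_walk)
    return refreshed

# Hypothetical change: an edge A -> C was added, so X is empty, Y = {A->B, A->C}, Z = {A, B, C}.
new_out_edges = {"A": [("B", 0.5), ("C", 0.5)], "B": [("A", 1.0)],
                 "C": [("D", 1.0)], "D": [("C", 1.0)]}
old_walks = [["D"], ["D", "C", "D"], ["A", "B"]]
triggers = {"A", "B", "C"}
refreshed = refresh_walks(old_walks, triggers, new_out_edges,
                          alpha=0.85, rng=random.Random(1))
```

After the refresh, the authority values would be recomputed from the visit counts of the refreshed walks; walks that never meet a trigger node are reused unchanged, which is where the saving in computation comes from.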

Step S203: calculation of "file node" authority. The authority authorityFile of a file node equals the sum, over its link edges, of the link weight linkValue multiplied by the authority authorityEntity of the linked entity node. That is,

$$\mathrm{authorityFile}(p) = \sum_{q} \mathrm{linkValue}(p,q) \cdot \mathrm{authorityEntity}(q)$$

where authorityFile(p) is the authority of file node p; authorityEntity(q) is the authority of entity node q, with q ranging over the entity nodes that share a link edge with file node p; and linkValue(p, q) is the weight of the link edge between file node p and entity node q.
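A short numeric sketch of this weighted sum follows; the function name, the nested-dictionary layout and all values are hypothetical.

```python
def file_authority(link_value, authority_entity):
    """authorityFile(p) = sum over linked q of linkValue(p, q) * authorityEntity(q)."""
    return {p: sum(w * authority_entity[q] for q, w in links.items())
            for p, links in link_value.items()}

# file node -> {linked entity node: linkValue}
link_value = {"doc1": {"A": 0.75, "B": 0.25}, "doc2": {"B": 1.0}}
authority_entity = {"A": 0.6, "B": 0.4}
authority_file = file_authority(link_value, authority_entity)
```

Here doc1 obtains 0.75·0.6 + 0.25·0.4 = 0.55 and doc2 obtains 1.0·0.4 = 0.4, so a file that is strongly linked to a high-authority entity ranks above one linked only to a lower-authority entity.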

Step S3: online calculation of file node relevance based on the search subgraph. A search subgraph is extracted from the domain map according to the file nodes contained in the search results. The relevance of an entity node is determined by the number of file nodes linked to it. The relevance of a file node is determined by the product of the weight of each of its link edges and the relevance of the linked entity node.

Step S301: construction of the search subgraph. The search subgraph is constructed from the relevant results of each search and is a subgraph of the domain map. Each relevant result that the search engine obtains through keyword matching or similar means corresponds to a file node; these file nodes form the "file nodes" of the search subgraph. The link edges of these file nodes in the domain map, and the entity nodes they link to, form the link edges and entity nodes of the search subgraph respectively. The association relations among the entity nodes in the search subgraph are retained according to the structure of the domain map and form the association edges of the search subgraph.

Step S302: calculation of "entity node" relevance. The relevance of an entity node is determined by the number of file nodes linked to it: the relevance of each entity node in the search subgraph equals the number of file nodes linked to that entity node. Assuming that fig. 2 is a search subgraph, the relevance of entity node A and the relevance of entity node B are both 3.

Step S303: calculation of "file node" relevance. The relevance of a file node is determined by the product of the weight of each of its link edges and the relevance of the linked entity node. When the file node has several link edges, the product is calculated for each link edge and the products are summed.

To illustrate the above calculation rule, assume that fig. 2 is a search subgraph, that the relevance values of entity node A and entity node B are relavancyEntityA and relavancyEntityB respectively, that linkValue3 is the weight of the link edge between entity node A and file node C, and that linkValue4 is the weight of the link edge between entity node B and file node C. The relevance of file node C is then calculated as:

relavancyFileC = relavancyEntityA·linkValue3 + relavancyEntityB·linkValue4.
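Steps S301 to S303 can be sketched together as follows. The function name, the nested-dictionary layout and the sample values are hypothetical; `relevance_entity` and `relevance_file` correspond to relavancyEntity and relavancyFile in the text, and entity relevance is counted over the file nodes of the search subgraph only.

```python
def search_relevance(link_value, hits):
    """Return (entity relevance, file relevance) for the subgraph induced by the hits."""
    sub = {p: link_value[p] for p in hits}     # link edges of the search subgraph
    relevance_entity = {}
    for links in sub.values():
        for q in links:                        # one count per linked file node
            relevance_entity[q] = relevance_entity.get(q, 0) + 1
    relevance_file = {p: sum(w * relevance_entity[q] for q, w in links.items())
                      for p, links in sub.items()}
    return relevance_entity, relevance_file

# file node -> {linked entity node: linkValue} for the whole domain map
link_value = {
    "doc1": {"A": 0.75, "B": 0.25},
    "doc2": {"B": 1.0},
    "doc3": {"A": 0.5, "C": 0.5},
    "doc4": {"C": 1.0},                        # not matched by this search
}
relevance_entity, relevance_file = search_relevance(link_value, ["doc1", "doc2", "doc3"])
```

Because doc4 is outside the search results, entity C is linked to only one file node of the subgraph and contributes less to doc3 than entity A does.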

Step S4: fused ranking of search results based on authority and relevance.

The influence of both authority and relevance is taken into account when ranking search results, so the ranking degree of each file node is rankValue = Ω·authorityFile + (1 − Ω)·λ·relavancyFile, where λ is introduced to bring authority and relevance to a similar scale and Ω determines the weights of authority and relevance in the ranking of file nodes. Only the files retrieved in the current search are considered as file nodes here.

If the median of authorityFile is a and the median of relavancyFile is b, λ may be taken as a/b. A set of m searches with manually ranked results is constructed, where the i-th search has $n_i$ manually ranked samples; for each given Ω, an automatic ranking of the $n_i$ results of the i-th search can be obtained. The manual rankings are treated as the correct results, and minimizing the error rate of the automatic rankings is taken as the optimization objective. The value of Ω can then be obtained by equidistant sampling (Ω is increased from 0 to 1 in steps of Δ, which is determined by the required precision, for example 0.01) or by a one-dimensional search algorithm (such as Newton's method).
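The fused score and the equidistant sampling of Ω can be sketched as below. The text does not define the error rate, so this sketch assumes the fraction of result pairs that the automatic ranking orders differently from the manual ranking; the function names and sample values are hypothetical.

```python
from statistics import median

def ranking(authority, relevance, omega):
    """Sort files by rankValue = omega*authority + (1-omega)*lambda*relevance."""
    lam = median(authority.values()) / median(relevance.values())   # lambda = a / b
    scores = {p: omega * authority[p] + (1 - omega) * lam * relevance[p]
              for p in authority}
    return sorted(scores, key=scores.get, reverse=True)

def pairwise_error(predicted, manual):
    """Assumed error rate: share of pairs ordered differently from the manual ranking."""
    pos = {p: k for k, p in enumerate(predicted)}
    pairs = [(a, b) for k, a in enumerate(manual) for b in manual[k + 1:]]
    return sum(pos[a] > pos[b] for a, b in pairs) / len(pairs)

def fit_omega(samples, step=0.01):
    """Try omega = 0, step, 2*step, ..., 1 and keep the value with the lowest total error."""
    grid = [k * step for k in range(int(round(1 / step)) + 1)]
    return min(grid, key=lambda w: sum(pairwise_error(ranking(a, r, w), m)
                                       for a, r, m in samples))

authority = {"d1": 0.5, "d2": 0.3, "d3": 0.2}
relevance = {"d1": 1.0, "d2": 3.0, "d3": 2.0}
manual = ["d1", "d2", "d3"]                    # one manually ranked search (m = 1)
best_omega = fit_omega([(authority, relevance, manual)])
best_ranking = ranking(authority, relevance, best_omega)
```

With several values of Ω giving the same error, `min` keeps the smallest one on the grid; a different tie-breaking rule could equally be used.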

Claims

1. A domain searching and sorting method based on dynamic map link analysis, characterized in that semantic-level link relations are established for the file resources in domain search, calculation is then performed from the two aspects of authority and relevance, and fused ranking of search results is finally achieved; the method comprises the following specific steps:

the step (1) comprises the following steps:

step (11): construction of "entity nodes" and "association edges" and calculation of "association weights";

obtaining accurate and unambiguous named entities by carrying out entity identification, entity disambiguation and coreference resolution on text contents of the file, wherein the entities form 'entity nodes' in the domain map; the method comprises the steps of identifying association relations among entities in text content to obtain potential candidate relations, and obtaining accurate and unambiguous association relations through disambiguation and resolution of the relations, wherein the association relations form an association edge in a domain map; each association edge has an association weight, and the weight size represents the tightness degree of the relationship between the entities;

step (12): the "file node" and the "link edge" are constructed and calculated with the "link weight"; the file nodes in the domain map and the files to be searched are mutually bijective, and are directly constructed, and each file node in the map represents one file to be searched; if a certain 'entity node' is extracted from the file content corresponding to a certain 'file node', a link edge exists between the entity node and the file node; the calculation of the link weight comprises two processes of initial link weight calculation and normalization calculation;

step (13): dynamically updating the map increment;

the change forms of the file set to be searched comprise a new added file, a deleted file and a modified file, and the extraction of entity nodes, file nodes, associated edges and linked edges corresponding to the new added file is required to be completed according to the steps (11) and (12) in the case of the new added file; updating the weight of the affected associated edge and the link edge; in the case of deleting files, corresponding file nodes and associated edges thereof need to be deleted first; if the entity node is caused to have no connected link edge, deleting the entity node and the related edge thereof; updating the weight of the affected associated edge and the link edge; in the case of modifying the file, updating the domain map according to the equivalent operation of deleting and adding;

in the step (2), the entity nodes of the domain map are used as states which can be reached by the system, the transition probability among the states is determined by the association edge weight among the entity nodes, the whole system forms a Markov chain, and the stable distribution of the Markov chain is the authority degree of the entity nodes, and the method specifically comprises the following steps:

step (21): authority degree design of 'entity node';

step (22): the authority degree increment of the entity node is calculated; based on a Monte Carlo method, simulating the behavior of a user accessing entity nodes by utilizing random walk, and when the domain map changes, incrementally updating the random walk process aiming at the affected entity nodes to realize incremental calculation of the authority degree of the entity nodes;

step (23): calculating the authority of the file nodes; the authority authorityFile of a file node equals the sum, over its link edges, of the link weight linkValue multiplied by the authority authorityEntity of the linked entity node, namely

$$\mathrm{authorityFile}(p) = \sum_{q} \mathrm{linkValue}(p,q) \cdot \mathrm{authorityEntity}(q)$$

where authorityFile(p) is the authority of file node p; authorityEntity(q) is the authority of entity node q, with q ranging over the entity nodes that share a link edge with file node p; and linkValue(p, q) is the weight of the link edge between file node p and entity node q;

the step (3) specifically comprises the following steps:

step (31): searching for sub-graph construction; the searching subgraph is constructed according to the related results obtained by each search and is a subgraph of the domain map; each related result obtained by the search engine in a keyword matching mode corresponds to a certain file node, and the file nodes form a file node of a search sub-graph; the linked edges of the file nodes in the domain map and the linked entity nodes respectively form the linked edges and the entity nodes of the search subgraph; the entity nodes in the searching subgraph keep the association relation among the entity nodes according to the structure of the domain map to form the association edges of the searching subgraph;

step (32): searching the relevance calculation of the entity node of the subgraph; the relevance of the entity nodes is determined by the number of the file nodes linked by the entity nodes, and the relevance of each entity node in the searching sub-graph is equal to the number of the file nodes linked by the entity nodes;

step (33): calculating the "file node" relevance of the search subgraph; the relevance of a file node is determined by the product of the weight of each of its link edges and the relevance of the linked entity node; when the file node has a plurality of link edges, the product is calculated for each link edge and the products are summed;

2. The method according to claim 1, wherein the calculation of the "association weight" in step (11) comprises two steps, initial association weight calculation and normalization, specifically: if the entities at the two ends of an association edge co-occur in k files in total, the initial association weight corrValue'(i, j) of the association edge equals k; after the initial association weights of all association edges have been calculated, the initial association weights corrValue'(i, j) of the edges leaving the same entity node are normalized in proportion to their values, and the association weights corrValue(i, j) of the association edges are obtained.

3. The method according to claim 1, wherein the initial link weight calculation in the step (12) considers two aspects, namely, a degree of association α of the entity node to the file node and a degree of importance β of the file node to the entity node; the method comprises the following steps:

(1) when the importance of file nodes to entity nodes is difficult to classify or evaluate manually, β = 1 for all file nodes and the initial link weight is linkValue' = α; after the initial weight of each link edge has been calculated, the initial weights of the link edges connected to the same file node are normalized to obtain the link weight linkValue; α is calculated as follows:

$$\alpha = TF(t,d) \cdot IDF(t,d) \cdot \alpha_1(t,d)$$

where t is the entity name of the entity node; d is the file to be retrieved; TF(t, d) is the frequency of occurrence of t in d; $IDF(t,d) = \log\left(N / (N_{t,d} + \gamma)\right)$, where N is the number of files in the set of files to be retrieved, $N_{t,d}$ is the number of files containing entity t, and γ is set to 0.01 to ensure that the denominator is not zero; and $\alpha_1(t,d)$ is a position coefficient that is greater than 1 when the entity name t appears in the title, abstract or keywords, and equal to 1 otherwise;

(2) when entities and files can be classified and scored manually according to the domain, a β value is set for the importance of each type of file in that domain, and the initial link weight is then linkValue' = α·β.

4. The method according to claim 1, wherein the incremental calculation of entity node authority cycles through the starting points: M random walk processes are started from each of the N entity nodes, N × M random walk processes in total; at each step the walk directly visits a new node with probability (1 − α) and moves from entity node i to entity node j with probability α·corrValue(i, j); finally, the number of visits v(i) to each entity node i is counted, and v(i) is divided by the total number of visits to all entity nodes, which gives the average visit probability of node i, that is, the authority of entity node i;

when the structure of the domain map changes, the authority of each entity node is calculated incrementally; the specific method is as follows: first, the random walk processes from before each round of map structure change are recorded; for the current round, the entity nodes that changed, including added and deleted entity nodes, are collected as set X, and the association edges that changed are collected as set Y; the entity nodes that have an association relation with a node in X, or that are connected by an edge in Y, are denoted as set Z; X ∪ Z is then the set of trigger nodes for which the random walk flow must be updated; the update examines the N × M random walk processes of the previous round, finds the first trigger node in each, keeps the part of the walk before the trigger node, continues the subsequent walk according to the new domain map, and calculates the authority of each entity node.

5. The method according to claim 1, wherein the step (4) is specifically:

the search result ordering needs to comprehensively consider the influence of authority and relevance, so that the ranking degree of each file node is as follows:

rankValue＝Ω·authorityFile+(1-Ω)·λ·relavancyFile，

λ is introduced to bring authority and relevance to a similar scale, and Ω determines the weights of authority and relevance in the ranking of file nodes; only the files retrieved in the current search are considered as file nodes here;

if the median of authorityFile is a and the median of relavancyFile is b, λ is taken as a/b; a set of m searches with manually ranked results is constructed, where the i-th search has $n_i$ manually ranked samples, and for each given Ω an automatic ranking of the $n_i$ results of the i-th search is obtained; the manual rankings are treated as the correct results, minimizing the error rate of the automatic rankings is taken as the optimization objective, and the value of Ω is obtained by equidistant sampling, in which Ω is increased from 0 to 1 in steps of Δ, or by a one-dimensional search algorithm.