CN114519132A - Formula retrieval method and device based on formula reference graph - Google Patents

Formula retrieval method and device based on formula reference graph

Info

Publication number
CN114519132A
Authority
CN
China
Prior art keywords
formula
original
nodes
article
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011293008.3A
Other languages
Chinese (zh)
Inventor
汤帜
袁珂
高良才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202011293008.3A
Publication of CN114519132A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Abstract

The invention discloses a formula retrieval method and device based on a formula reference graph. The formula retrieval device comprises a formula reference graph construction module, a module for automatically generating description keywords for the query formula, an initial ranking module, and a re-ranking module. By designing and constructing a formula reference graph, the invention represents the reference relations between formulas and expands the keywords of the query formula, which further enhances semantic retrieval performance. This addresses the problem that existing formula retrieval systems cannot adequately recall mathematical formulas that have the same semantics but dissimilar structures, which results in low retrieval efficiency, and it improves the accuracy of formula retrieval.

Description

Formula retrieval method and device based on formula reference graph
Technical Field
The invention relates to the technical field of information retrieval, in particular to a formula retrieval method and device based on a formula reference graph.
Background
With the rapid development of the internet, a large number of mathematical formulas appear in web pages and scientific documents. As the volume of content on the internet grows, it becomes increasingly difficult to find documents that help in understanding a given mathematical formula. Mainstream search engines such as Google and Baidu are widely used, but they only support conventional text search. A mathematical formula differs from conventional text: it carries not only semantic information but also a large amount of structural information.
Existing mathematical formula retrieval systems, such as MIAS, WikiMirs and Tangent, help users retrieve mathematical formulas and thereby understand them. The articles recalled (retrieved) by these systems typically contain formulas whose structure exactly matches that of the user's query formula. According to how formulas are represented, existing systems can be divided into text-based retrieval, tree-based retrieval and other retrieval. In text-based retrieval, the formula is expressed linearly, e.g. "a + b" as "a plus b". Tree-based retrieval represents the formula as a tree so as to preserve as much of its structural information as possible. Other approaches mainly convert formulas into vector representations in various ways.
Most mainstream formula retrieval methods represent formulas as trees and aim to recall formulas whose structure is very similar to that of the query formula. This ignores the semantic information in the formula, so formulas with closely related mathematical semantics but dissimilar shapes cannot be recalled effectively. For example, with the mathematical formula
Figure BDA0002784382420000011
as the query, the formula "P(A, B) = P(A|B)P(B) = P(B|A)P(A)" cannot be recalled well: although both formulas express Bayes' theorem, they have different "shapes". To improve the recall of semantically related documents, Google exploits the reference (link) relations between pages, as in the PageRank algorithm. Introducing reference relations between mathematical formulas into a formula retrieval system can therefore be expected to improve its retrieval effectiveness to some extent.
At present, some mathematical formula retrieval systems require users to supply text along with the query formula in order to improve recall. However, for users without a strong mathematical background, providing keywords for the formula being searched is very difficult. Moreover, many mathematical formulas have different meanings in different fields, such as the formula
Figure BDA0002784382420000012
In mathematics it may be called a "logistic function" or "Sigmoid function", whereas in machine learning it is referred to as an "activation function". Existing mathematical formula retrieval systems cannot automatically expand a user-supplied formula into all of its relevant mathematical keywords, so their retrieval performance is limited and the user experience suffers.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a mathematical formula retrieval method and device based on a formula reference graph. By designing and constructing the formula reference graph, the invention expresses the reference relations between mathematical formulas, further enhances the semantic retrieval performance of the retrieval system, and solves the technical problem that existing mathematical formula retrieval systems cannot adequately recall mathematical formulas with the same semantics but dissimilar structures, which results in low retrieval efficiency.
The technical scheme of the invention is as follows:
a formula retrieval method based on a formula reference graph comprises the following steps:
1) constructing a formula reference graph;
The invention constructs the formula reference graph from the mathematical formulas and the link relations in articles. The graph contains three types of nodes and three types of edges. The three node types are A nodes, O nodes and G nodes: an A node represents the title of an article, an O node represents an original mathematical formula, and a G node represents a generalized mathematical formula. The three edge types are the edge between nodes A and O, denoted r(A-O), the edge between nodes O and G, denoted r(O-G), and the edge between nodes A and G, denoted r(A-G).
This section describes in detail how the value (probability) of the edge between each pair of nodes in the formula reference graph is calculated. The edge between node a_i and node o indicates that article a_i contains the original formula o. The probability on this edge, w(o, a_i), is the ratio of the importance of the original formula o in article a_i to the sum of its importance over all articles containing o, and is calculated as
Figure BDA0002784382420000021
where I(o, a_i) denotes the importance of the original formula o to article a_i, and n_o is the number of articles in the data set (for example, Wikipedia article data) that contain the original formula o. I(o, a_i) is computed following the paper "WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document", based on whether the original formula o occupies a line of its own and on the similarity between the context of o and the whole article. The edge between o and g represents the probability that the generalized formula represented by node g is generalized from the original formula o, and its value is calculated as
Figure BDA0002784382420000022
where count(o, g) is the number of times the original formula o is converted to the generalized formula g in the Wikipedia data set, and n_g is the number of original formulas that are converted to the generalized formula g. The value (probability) of the edge between node a and node g represents the strength of the connection between article a and the generalized formula g, and is calculated as:
Figure BDA0002784382420000023
where count(g, a) is the cumulative number of times the original formula o refers to article a in its context (the text immediately before and after the formula in the document), and n_a is the number of times the original formula o refers to article a.
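The three edge values above all have the shape "raw score divided by a sum of raw scores". The following is a minimal Python sketch of that normalization, not the patented implementation. The function name `normalize`, the toy inputs, and in particular the choice to normalize w(g, a) over all generalized formulas linked to the same article are assumptions, since the formula images are not reproduced in this text.

```python
from collections import defaultdict

def normalize(raw, key_index):
    """Divide each raw score by the total of all scores that share one endpoint.

    raw maps (node, node) pairs to non-negative scores; key_index selects
    which element of the pair defines the group being normalised.
    """
    totals = defaultdict(float)
    for pair, score in raw.items():
        totals[pair[key_index]] += score
    return {pair: score / totals[pair[key_index]]
            for pair, score in raw.items() if totals[pair[key_index]] > 0}

# Hypothetical toy inputs, not taken from the patent.
importance = {("o1", "a1"): 3.0, ("o1", "a2"): 1.0}  # I(o, a)
count_og = {("o1", "g1"): 2, ("o2", "g1"): 6}         # count(o, g)
count_ga = {("g1", "a1"): 1, ("g2", "a1"): 3}         # count(g, a)

w_oa = normalize(importance, 0)  # over all articles containing o
w_og = normalize(count_og, 1)    # over all formulas generalised to g
w_ga = normalize(count_ga, 1)    # assumed: over all formulas linked to a
```

With these toy inputs, the two articles containing o1 receive edge values that sum to 1, which is what makes the values usable as probabilities in the later steps.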
2) Automatically generating description keywords of a formula, wherein the description keywords can be used for query retrieval;
In a retrieval system, keywords of mathematical formulas help recall formulas that are semantically highly related but structurally dissimilar. The invention automatically generates description keywords for the query formula. The query formula q is first converted into original and generalized tree structures and used for retrieval with the method proposed in the paper "WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document", in which: 1) formula data from different sources are normalized in format and stored in a database; 2) a formula tree is built for each normalized formula according to symbol precedence and symbol scope; 3) for each formula tree, its original substructures are extracted and corresponding generalized substructures are created; 4) an inverted index is created for the database from the original and generalized substructures; 5) a formula tree is built for the formula to be queried, its original substructures are extracted and generalized substructures created, and the formulas containing these substructures are looked up in the database; 6) the retrieved formulas are ranked by their similarity to the formula to be queried.
From the returned list of formulas, the top 3 formulas by relevance are selected, and the original formula node o and generalized formula node g of each of the three formulas are located in the formula reference graph. The edges to all article nodes directly connected to node g and node o are then evaluated to find potentially related keywords, namely the titles of the connected article nodes a. In the formula reference graph constructed in the previous step, each related keyword (the title of an article node a) is scored as follows:
Figure BDA0002784382420000031
where KP(q, a_j) is the relevance of article a_j to the top 3 recalled formulas for the query formula q; w(o_i, a_j) is the probability of edge o_i-a_j, w(o_i, g_k) is the probability of edge o_i-g_k, and w(g_k, a_j) is the probability of edge g_k-a_j.
Evaluating this formula automatically generates the description keywords related to the query formula.
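As an illustration of how article titles could be scored as keywords, the sketch below combines the direct edge o_i-a_j with the two-hop route o_i-g_k-a_j. The exact form of KP is given only as an image in the original, so the combination used here (direct probability plus the product of the two-hop probabilities, summed over the top formulas) is an assumption, as are the function name `keyword_scores` and the toy edge values.

```python
def keyword_scores(top_formulas, w_oa, w_og, w_ga):
    """Score article titles as keyword candidates for the top recalled formulas.

    Assumed form: KP(q, a_j) = sum_i [ w(o_i, a_j) + sum_k w(o_i, g_k) * w(g_k, a_j) ].
    """
    scores = {}
    for o in top_formulas:
        # Direct edges o -> a.
        for (src, article), w in w_oa.items():
            if src == o:
                scores[article] = scores.get(article, 0.0) + w
        # Two-hop routes o -> g -> a.
        for (src, g), w1 in w_og.items():
            if src != o:
                continue
            for (g2, article), w2 in w_ga.items():
                if g2 == g:
                    scores[article] = scores.get(article, 0.0) + w1 * w2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical edge probabilities, not taken from the patent.
w_oa = {("o1", "Bayes theorem"): 0.5, ("o1", "Conditional probability"): 0.5}
w_og = {("o1", "g1"): 1.0}
w_ga = {("g1", "Bayes theorem"): 0.5, ("g1", "Posterior probability"): 0.5}

ranked = keyword_scores(["o1"], w_oa, w_og, w_ga)
```

In this toy graph the title reachable both directly and through the generalized formula ranks first, and a title reachable only through the generalized formula still appears as a candidate keyword.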
3) Carrying out initial sequencing;
Given the query formula and its related keywords, the articles are ranked using the ranking method of the paper "WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document": 1) formula data from different sources are normalized in format and stored in a database; 2) a formula tree is built for each normalized formula according to symbol precedence and symbol scope; 3) for each formula tree, its original substructures are extracted and corresponding generalized substructures are created; 4) an inverted index is created for the database from the original and generalized substructures; 5) a formula tree is built for the formula to be queried, its original substructures are extracted and generalized substructures created, and the formulas containing these substructures are looked up in the database; 6) the retrieved formulas are ranked by their similarity to the formula to be queried, and several formulas are selected and returned. This yields an initial ranking of all related articles and identifies the articles that are highly similar to the query formula.
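The six steps above can be illustrated with a deliberately simplified sketch. It is not the WikiMirs implementation: formula trees are written as nested tuples, a generalized substructure is formed by replacing every leaf with a wildcard, and the match weights (2 for an exact substructure, 1 for a generalized one) are hypothetical choices made only for this example.

```python
from collections import defaultdict

def subtrees(tree):
    """Yield every subtree of a formula tree given as nested tuples (op, child, ...)."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)

def generalize(tree):
    """Replace every leaf (variable or constant) with the wildcard '_'."""
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(generalize(c) for c in tree[1:])
    return "_"

def build_index(formulas):
    """Inverted index from original and generalized substructures to formula ids."""
    index = defaultdict(set)
    for fid, tree in formulas.items():
        for sub in subtrees(tree):
            if isinstance(sub, tuple):  # bare leaves are not indexed
                index[("orig", sub)].add(fid)
                index[("gen", generalize(sub))].add(fid)
    return index

def search(index, query):
    """Rank formulas by weighted substructure overlap with the query (weights assumed)."""
    hits = defaultdict(int)
    for sub in subtrees(query):
        if isinstance(sub, tuple):
            for fid in index.get(("orig", sub), ()):
                hits[fid] += 2
            for fid in index.get(("gen", generalize(sub)), ()):
                hits[fid] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical formula collection.
formulas = {
    "f1": ("=", "F", ("mul", "m", "a")),
    "f2": ("=", "E", ("mul", "m", ("pow", "c", "2"))),
    "f3": ("add", "x", "y"),
}
index = build_index(formulas)
exact = search(index, ("=", "F", ("mul", "m", "a")))
renamed = search(index, ("=", "G", ("mul", "k", "x")))
```

The second query shares no symbols with f1 but has the same shape, so it matches only through the generalized substructures. This is the structural matching that the initial ranking relies on, and it is also why formulas with the same meaning but a different shape are missed at this stage.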
4) Carrying out rearrangement;
Because the initial ranking recalls articles highly similar to the query formula, the aim of re-ranking is to balance structural matching and semantic matching of the formula so that the result list finally returned to the user is more reasonable. During re-ranking, the invention therefore uses a greedy matching approach: it searches the formula reference graph for all articles semantically related to the query formula, computes the semantic relevance of these articles, updates the relevance scores of all articles from the initial ranking accordingly, and obtains a new ranked result list.
During re-ranking, the invention first performs a random walk starting from the seed nodes (the original formula nodes) to find all related article nodes whose similarity exceeds a set threshold t. The random walk is expressed as:
Path = {(r_0, r_1, r_2, ..., r_i) | RW(r)}
where Path is the set of paths from the seed node to the related article nodes; r_i is the i-th path, obtained by walking according to RW(r), the random walk algorithm.
The relevance value(a_i) of each traversed article is then calculated as follows; when value(a_i) is less than the set threshold t, the article node is discarded:
Figure BDA0002784382420000041
where w(v_i, v_j) is the probability of the edge between the two nodes v_i and v_j.
Figure BDA0002784382420000042
where value(a_i) is the semantic relevance score of article a_i with respect to the query formula, and S(q, a_i) is the composite score of the query formula q and article a_i, i.e. the score of article a_i in the initial ranking.
The final score of each recalled formula with respect to the query formula is calculated according to the above formulas, and the results are sorted by score in descending order to obtain the formula retrieval ranking.
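The re-ranking step can be sketched as follows. Two simplifications are made and should be read as assumptions: random sampling is replaced by an exhaustive enumeration of short paths from the seed node so that the output is deterministic, and the final score is taken as the plain sum of the initial score S(q, a) and value(a), because the combining formula is given only as an image. The value of an article is assumed to be the largest product of edge probabilities over the paths that reach it. All names and numbers are hypothetical.

```python
def article_values(graph, seed, threshold, max_hops=3):
    """Best path probability from the seed formula node to each article node.

    graph maps a node to a list of (neighbour, edge_probability) pairs.
    Article nodes are identified by the (hypothetical) name prefix 'a_'.
    """
    best = {}
    stack = [(seed, 1.0, (seed,))]
    while stack:
        node, prob, path = stack.pop()
        if len(path) - 1 >= max_hops:  # hop budget used up
            continue
        for nxt, w in graph.get(node, ()):
            if nxt in path:  # do not revisit nodes on the current path
                continue
            p = prob * w
            if nxt.startswith("a_"):
                best[nxt] = max(best.get(nxt, 0.0), p)
            else:
                stack.append((nxt, p, path + (nxt,)))
    # Discard article nodes whose value falls below the threshold t.
    return {a: v for a, v in best.items() if v >= threshold}

def rerank(initial_scores, values):
    """Combine initial scores with semantic values (a simple sum is assumed)."""
    articles = set(initial_scores) | set(values)
    final = {a: initial_scores.get(a, 0.0) + values.get(a, 0.0) for a in articles}
    return sorted(final.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical graph fragment and initial scores.
graph = {
    "o1": [("a_force", 0.6), ("g1", 0.5)],
    "g1": [("a_inertia", 0.8), ("a_mass", 0.1)],
}
values = article_values(graph, "o1", threshold=0.1)
ranked = rerank({"a_newton": 1.0, "a_mass": 0.9, "a_force": 0.2}, values)
```

Here `a_inertia` was absent from the initial ranking but enters the final list through the generalized formula node, while `a_mass` receives no semantic boost because its path probability of 0.05 falls below the threshold.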
Through the steps, formula retrieval based on the formula reference graph is achieved.
The invention also provides a formula retrieval device based on the formula reference graph, comprising a formula reference graph construction module, a module for automatically generating description keywords for the query formula, an initial ranking module and a re-ranking module, wherein:
A. The formula reference graph construction module is used for constructing a formula reference graph from the mathematical formulas and the link relations in articles. The graph contains three types of nodes and three types of edges. The three node types are A nodes, O nodes and G nodes: an A node represents the title of an article, an O node represents an original mathematical formula, and a G node represents a generalized mathematical formula. The three edge types are the edge between nodes A and O, denoted r(A-O), the edge between nodes O and G, denoted r(O-G), and the edge between nodes A and G, denoted r(A-G).
B. The query mathematical formula description keyword automatic generation module is used for automatically generating description keywords of the query mathematical formula to obtain related description keywords of the query mathematical formula.
C. The initial ranking module is used for obtaining an initial ranking of all related documents according to the query formula and its related keywords, and for obtaining the documents that are highly similar to the query formula.
D. The re-ranking module is used for searching the formula reference graph for all documents semantically related to the query formula, calculating the semantic relevance of these documents, and updating the relevance scores of all documents from the initial ranking so as to obtain a new ranked retrieval result list.
Compared with the prior art, the invention has the following positive effects:
The invention provides a mathematical formula retrieval method and device based on a formula reference graph. By designing and constructing the formula reference graph, the invention explores the reference relations between mathematical formulas and uses the graph to expand the query formula with mathematical keywords, which further enhances the semantic retrieval performance of the retrieval system. This addresses the problem that existing mathematical formula retrieval systems cannot adequately recall formulas with the same semantics but dissimilar structures, which results in low retrieval efficiency, and it improves the accuracy of formula retrieval. In a specific implementation, the method and an existing formula retrieval method, WikiMirs, were each used for formula retrieval and compared on two kinds of indicators: the precision of the top 5, top 10 and top 15 results, and DCG (Discounted Cumulative Gain), a search engine quality measure that describes how well the results are ordered.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
FIG. 2 is a schematic diagram of a process for constructing a formula reference map in accordance with an embodiment of the present invention;
In the figure, nodes labeled A represent article titles (the titles of the two articles are A_1 and A_2), O nodes represent original mathematical formulas, and G nodes represent generalized mathematical formulas.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a mathematical formula retrieval method and a mathematical formula retrieval device based on a formula reference graph, which are used for exploring the reference relation among mathematical formulas by designing and constructing the formula reference graph, further enhancing the semantic retrieval performance of a retrieval system and improving the retrieval efficiency of the mathematical formulas.
Fig. 1 shows the overall flow of the method, which comprises four main processes: constructing the formula reference graph, automatically generating description keywords for the query formula, initial ranking, and re-ranking.
The construction of the formula reference graph is based on two assumptions: 1) the stronger the relation between an article and a formula, the better the article can express and describe the formula; 2) if a formula does not refer to any other article, then the current article is the description of the formula. Fig. 2 shows the construction of a formula reference graph from the mathematical formulas and link relations in articles. The graph contains three types of nodes and three types of edges. In the figure, a nodes represent article titles; the figure contains two article titles, denoted a_1 and a_2. The o node represents the original mathematical formula and the g node represents the generalized mathematical formula. The edge between node a and node o indicates that the article contains the original mathematical formula of node o, and the probability on this edge is calculated as
Figure BDA0002784382420000061
where I(o, a) represents the importance of the original formula o to article a, and n_o is the number of documents containing the original formula o; I(o, a) is calculated according to the paper "WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document". The edge between o and g indicates that the generalized formula g is generalized from the original formula o, and the probability on this edge is
Figure BDA0002784382420000062
where count(o, g) is the number of times the original formula o is converted to the generalized formula g in the data set, and n_g is the number of original formulas that are converted to the generalized formula g. The edge between node a and node g represents the strength of the connection between article a and the generalized mathematical formula, and its probability is calculated as:
Figure BDA0002784382420000063
where count(g, a) is the cumulative number of times the original mathematical formula o refers to article a in its context, and n_a is the number of times the original mathematical formula o refers to document a.
Automatic generation of description keywords for the query formula. In a retrieval system, the keywords of a mathematical formula are very important: they help recall formulas that are semantically highly related but structurally dissimilar. In this process, the invention first converts the query formula q into original and generalized tree structures using the method proposed in the paper "WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document". Potentially related keywords are then found by following the edges to the document nodes directly connected to the original and generalized formula nodes. In the formula reference graph constructed in the previous step, each related keyword is scored as follows:
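Because the three edge formulas appear only as image placeholders in this text, the following LaTeX block gives one plausible reading of them based solely on the verbal definitions above. It is a reconstruction under stated assumptions, not the published formulas: the summation indices, and the choice of normalizing w(g, a) over the formulas linked to article a, are guesses.

```latex
% Assumed reconstruction from the verbal definitions; not the published images.
\begin{align}
  w(o, a_i) &= \frac{I(o, a_i)}{\sum_{j=1}^{n_o} I(o, a_j)}
    && \text{importance of } o \text{ in } a_i \text{ relative to all } n_o \text{ articles containing } o \\
  w(o, g)   &= \frac{\mathrm{count}(o, g)}{\sum_{k=1}^{n_g} \mathrm{count}(o_k, g)}
    && \text{share of conversions to } g \text{ that come from } o \\
  w(g, a)   &= \frac{\mathrm{count}(g, a)}{\sum_{k=1}^{n_a} \mathrm{count}(g_k, a)}
    && \text{share of references to } a \text{ attributed to } g
\end{align}
```

Under this reading each weight lies in [0, 1] and the weights that share a denominator sum to 1, which is consistent with the text calling them probabilities.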
Figure BDA0002784382420000064
where w(o_i, a_j) is the probability of edge o_i-a_j, w(o_i, g_k) is the probability of edge o_i-g_k, and w(g_k, a_j) is the probability of edge g_k-a_j.
Evaluating this formula yields keywords related to the query formula.
Table 2 shows keywords generated for query formulas by the formula reference graph: the query formula is on the left, and the top two keywords extracted for it are on the right.
Table 2: sample keyword query
Figure BDA0002784382420000071
After the query formula and its keywords are obtained, the documents are ranked by the initial retrieval system to obtain an initial ranked list, and related documents are then found by random walk for re-ranking.
Initial ranking: given the query formula and its related keywords, an initial ranking of all related documents is obtained using the ranking method of the paper "WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document".
Initial ranking tends to recall documents containing formulas highly similar to the query formula, so the goal of re-ranking is to balance structural and semantic matching of the formula and make the result list returned to the user more reasonable. During re-ranking, the invention therefore uses a greedy matching approach: it searches the formula reference graph for all documents semantically related to the query formula, calculates the semantic relevance of these documents, updates the relevance scores of all documents from the initial ranking accordingly, and obtains a new ranked list.
During re-ranking, the invention first performs a random walk from the seed node (the original formula node) to find all related document nodes whose similarity exceeds a set threshold. The random walk is expressed as:
Path = {(r_0, r_1, r_2, ..., r_i) | RW(r)}
where Path is the set of paths from the seed node to the related document nodes, and r_i is the i-th path, obtained by the random walk RW(r).
The relevance calculation of the document being traversed can thus be performed by:
Figure BDA0002784382420000072
where w(v_i, v_j) is the probability of the edge between the two nodes v_i and v_j.
Figure BDA0002784382420000073
where value(d_i) is the semantic relevance score of document d_i with respect to the query formula, and S(q, d_i) is the composite score of the query formula q and document d_i, i.e. the score of document d_i in the initial ranking.
For the query formula "F = ma", Table 3 shows the top five results of the system's initial ranking and of the re-ranking.
Table 3: retrieval system initial ordering and re-ordering results
Rank  Initial ordering          Reordering
1     Newton (unit)             Newton (unit)
2     Classical mechanics       Force
3     Added mass                Inertia
4     Newton’s laws of motion   History of variational principles in physics
5     Force                     Newton’s laws of motion
The final score of each recalled formula with respect to the query formula is calculated, and the results are sorted by score in descending order to obtain the formula retrieval ranking.
In a specific implementation, the invention provides a formula retrieval device based on the formula reference graph, comprising a formula reference graph construction module, a module for automatically generating description keywords for the query formula, an initial ranking module and a re-ranking module. The device is installed and run on a computer system. It expands the query formula with keywords, which further enhances semantic retrieval performance, addresses the problem that existing formula retrieval systems cannot adequately recall formulas with the same semantics but dissimilar structures, which results in low retrieval efficiency, and improves the accuracy of formula retrieval.
The invention was compared with an existing formula retrieval method, WikiMirs, by computing the precision of the top 5, top 10 and top 15 results and DCG (Discounted Cumulative Gain), a search engine quality measure that reflects how well the results are ordered. The specific results are shown in Table 1.
Table 1: Comparison of the index values obtained by the invention and by the WikiMirs formula retrieval method
Metric  WikiMirs  The invention
P5      0.307     0.347
P10     0.240     0.243
P15     0.198     0.211
DCG5    14.295    15.539
DCG10   17.867    18.693
DCG15   19.762    20.954
The results in Table 1 show that the invention improves both precision and DCG over WikiMirs at every cutoff; for example, P5 rises from 0.307 to 0.347 and DCG5 from 14.295 to 15.539.
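For reference, the two kinds of metric in Table 1 can be computed as in the sketch below. The patent does not state which DCG variant or relevance scale was used, so the common form with gain rel_i / log2(i + 1) and a graded relevance list are assumptions; the sample judgments are hypothetical and unrelated to the reported numbers.

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k results judged relevant (relevance > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) for ranks i = 1..k."""
    return sum(r / math.log2(rank + 1)
               for rank, r in enumerate(relevances[:k], start=1))

# Hypothetical graded judgments for one ranked result list.
judged = [3, 0, 2, 0, 1]
p5 = precision_at_k(judged, 5)
dcg5 = dcg_at_k(judged, 5)
```

Precision only counts how many of the top k results are relevant, whereas DCG also rewards placing the more relevant results earlier, which is why both are reported.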
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (7)

1. A formula retrieval method based on a formula reference graph comprises the following steps:
1) constructing a formula reference graph through the formula of the article and the relation of the links;
the formula reference graph comprises three types of nodes and three types of edges; the nodes are A nodes, O nodes and G nodes, which respectively represent the titles of articles, original formulas and generalized formulas; the edges include an edge r(A-O) between nodes A and O, an edge r(O-G) between nodes O and G, and an edge r(A-G) between nodes A and G;
the ratio of the importance of the original formula o in article a_i to the sum of its importance over all articles containing the original formula o is expressed as:
Figure FDA0002784382410000011
wherein, w (o, a)i) Is node aiProbability of an edge with node o, i.e., the value of the edge; i (o, a) denotes the original formula o for article aiNo is the number of articles in the data set containing the original formula o;
the value of the edge between nodes O and G is the probability that the generalized formula represented by the G node is obtained by generalizing the original formula, and is calculated by the following formula:
Figure FDA0002784382410000012
wherein count(o, g) represents the number of times the original formula o is converted to the generalized formula g in the data set; N_g represents the number of times formula o is converted to generalized formula g;
the value of the edge between node a and node g is the strength of the connection between article a and generalized formula g, and is calculated by the following formula:
Figure FDA0002784382410000013
wherein count(g, a) represents the cumulative number of times that the original formula o refers to the article a in its context, and N_a represents the number of times that the original formula o refers to the article a;
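The edge values of step 1) can be sketched as a ratio normalization. The equation figures are not reproduced in this text, so the sketch covers only the A-O edge, whose definition (importance of o in one article divided by its summed importance over all articles containing o) is stated in words. The helper name and the sample importance values are hypothetical; applying the same helper to the O-G and A-G counts would be a further assumption about their denominators.

```python
from collections import defaultdict
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (source node id, target node id)


def normalize_by_source(raw: Dict[Edge, float]) -> Dict[Edge, float]:
    """Divide each raw edge value by the total over all edges that share
    the same source node, so the values leaving each source sum to 1."""
    totals: Dict[str, float] = defaultdict(float)
    for (src, _dst), value in raw.items():
        totals[src] += value
    return {
        edge: value / totals[edge[0]]
        for edge, value in raw.items()
        if totals[edge[0]] > 0
    }


# Hypothetical importance scores I(o, a) for two original formulas.
importance = {
    ("o1", "a1"): 2.0,
    ("o1", "a2"): 1.0,
    ("o1", "a3"): 1.0,
    ("o2", "a1"): 3.0,
}
w_oa = normalize_by_source(importance)  # assumed reading of w(o, a_i)
```

Here formula o1 appears in three articles, so its edge to a1 carries half of its total importance, while o2 appears in one article and its single edge has value 1.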
2) automatically generating description keywords of a formula;
the title of the article node a is a keyword, and each related keyword can be obtained by the following calculation:
Figure FDA0002784382410000014
wherein KP(q, a_j) is the relevance of article a_j to the query formula q; w(o_i, a_j) is the probability of the edge o_i-a_j, w(o_i, g_k) is the probability of the edge o_i-g_k, and w(g_k, a_j) is the probability of the edge g_k-a_j;
3) performing an initial ranking to obtain the articles containing formulas highly similar to the query formula, which serves as the initial ranking of all related articles;
4) rearranging to obtain a more reasonable final retrieval result list:
searching the formula reference graph, by greedy matching, for all articles semantically related to the query formula, calculating the semantic relevance of the related articles, updating the relevance scores of all articles in the initial ranking with the semantic relevance of the related articles, and then obtaining a new ranked list of retrieval results;
the method specifically comprises the following steps:
41) taking an original formula node as a seed node, performing a random walk from the seed node, and finding all related article nodes whose similarity exceeds a set threshold, wherein the random walk is expressed as:
Path = {(r_0, r_1, r_2, ..., r_i) | RW(r)}
wherein Path is the set of paths from the seed node to the related articles; r_i is the i-th path, obtained by walking with RW(r); and RW(r) is the random walk algorithm;
42) the relevance of the traversed article is calculated by:
Figure FDA0002784382410000021
wherein w(v_i, v_j) is the probability of the edge between two nodes;
Figure FDA0002784382410000022
wherein value(a_i) is the semantic relevance score of document a_i to the query formula; S(q, a_i) is the composite score of the formula query q and article a_i, i.e., the score of article a_i in the initial ranking;
calculating, according to the above formulas, the final score of each recalled formula with respect to the query formula; sorting by score in descending order to obtain the ranked result of the formula retrieval;
through the steps, formula retrieval based on the formula reference graph is achieved.
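The random walk of step 41) and the path relevance of step 42) can be sketched as follows. The relevance equation is not reproduced in this text; since it is defined over the edge probabilities w(v_i, v_j), the sketch assumes the relevance of a path is the product of the edge probabilities along it. The function name, parameters, graph layout and sample values are all hypothetical, and the final score update that combines this relevance with S(q, a_i) is not shown.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Adjacency list: node -> [(neighbour, edge probability)].
Graph = Dict[str, List[Tuple[str, float]]]


def random_walk_articles(
    graph: Graph,
    seed_node: str,
    is_article: Callable[[str], bool],
    threshold: float,
    n_walks: int = 200,
    max_steps: int = 4,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Walk from the seed node and keep the article nodes reached with a
    path score (assumed: product of edge probabilities) >= threshold."""
    rng = rng or random.Random(0)
    best: Dict[str, float] = {}
    for _ in range(n_walks):
        node, score, visited = seed_node, 1.0, {seed_node}
        for _ in range(max_steps):
            candidates = [(n, w) for n, w in graph.get(node, []) if n not in visited]
            if not candidates:
                break
            # Pick the next node with probability proportional to the edge value.
            nxt, w = rng.choices(candidates, weights=[w for _, w in candidates])[0]
            score *= w
            if score < threshold:
                break  # path is too weak to count as related
            visited.add(nxt)
            node = nxt
            if is_article(node):
                best[node] = max(best.get(node, 0.0), score)
                break  # a walk ends when it reaches an article
    return best


# Hypothetical graph: o1 links to article a1 directly and to a2, a3 via g1.
graph = {
    "o1": [("a1", 0.5), ("g1", 0.5)],
    "g1": [("a2", 0.8), ("a3", 0.2)],
}
related = random_walk_articles(graph, "o1", lambda n: n.startswith("a"), threshold=0.15)
```

With a threshold of 0.15, article a3 can never be returned, because its only path scores 0.5 × 0.2 = 0.1.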
2. The formula retrieval method based on a formula reference graph as claimed in claim 1, wherein in step 1), the number N_o of articles containing the original formula o is obtained by calculating the similarity between the context of the original formula o and the articles according to whether the original formula o is independent in the articles.
3. The formula retrieval method based on a formula reference graph as claimed in claim 1, wherein the step 2) of automatically generating the description keywords of the formula comprises the following processes:
21) firstly, converting a formula q into an original and generalized tree structure, carrying out format normalization on formula data from different sources, and storing the formula data into a database;
22) establishing a formula tree for each normalized formula;
23) for each formula tree, extracting an original substructure of the formula tree and creating a generalized substructure corresponding to the original substructure;
24) creating an inverted index for the database according to the original substructure and the generalized substructure;
25) establishing a formula tree for a formula to be queried, extracting an original substructure and establishing a generalized substructure, and searching the formula containing the original substructure and the generalized substructure of the formula to be queried from a database;
26) sorting the searched formulas according to the similarity between the searched formulas and the formulas to be inquired; and selecting a formula with the top relevance rank according to the returned formula list, and respectively finding out the edges of the article nodes which are directly connected with the corresponding original formula and the generalized formula, thereby finding out potential related keywords.
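Processes 22) to 26) can be sketched as a substructure index. This is an illustration under stated assumptions, not the claimed implementation: formula trees are modelled as nested tuples, generalization replaces variable leaves with `*` and numeric leaves with `#`, and the number of shared substructures stands in for the similarity measure, which the claim does not specify. Format normalization (process 21) and the lookup of article nodes are omitted, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, Set, Tuple, Union

# A tree is a leaf symbol or a tuple (operator, child, child, ...).
Tree = Union[str, Tuple]


def subtrees(tree: Tree) -> Iterator[Tree]:
    """Yield every operator-rooted subtree (the original substructures)."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)


def generalize(tree: Tree) -> Tree:
    """Keep operators; replace variable leaves with '*' and numbers with '#'."""
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(generalize(c) for c in tree[1:])
    return "#" if tree.replace(".", "", 1).isdigit() else "*"


def build_index(formulas: Dict[str, Tree]) -> Dict[Tree, Set[str]]:
    """Inverted index: original or generalized substructure -> formula ids."""
    index: Dict[Tree, Set[str]] = defaultdict(set)
    for fid, tree in formulas.items():
        for sub in subtrees(tree):
            index[sub].add(fid)
            index[generalize(sub)].add(fid)
    return index


def search(index: Dict[Tree, Set[str]], query: Tree) -> Dict[str, int]:
    """Count, per indexed formula, how many query substructures it shares."""
    hits: Dict[str, int] = defaultdict(int)
    for sub in subtrees(query):
        for key in {sub, generalize(sub)}:
            for fid in index.get(key, ()):
                hits[fid] += 1
    return dict(hits)


formulas = {
    "f1": ("+", ("^", "x", "2"), "y"),  # x^2 + y
    "f2": ("+", ("^", "a", "2"), "b"),  # a^2 + b
    "f3": ("*", "x", "y"),              # x * y
}
index = build_index(formulas)
hits = search(index, ("+", ("^", "x", "2"), "z"))  # query: x^2 + z
ranked = sorted(hits, key=hits.get, reverse=True)
```

The query shares the original substructure x^2 only with f1, but shares generalized substructures with both f1 and f2, so f1 ranks first and f3 is not returned.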
4. The formula retrieval method based on the formula reference graph as recited in claim 3, wherein the initial ranking of step 3) specifically comprises ranking the searched formulas according to their similarity to the formula to be queried, so as to obtain an initial ranking of all related articles, thereby obtaining the articles containing formulas highly similar to the query formula.
5. The formula retrieval method based on the formula reference graph as claimed in claim 4, wherein the step 4) is rearranged by a greedy matching method.
6. A formula retrieval apparatus for implementing the formula retrieval method based on the formula reference graph of claim 4, comprising: a formula reference graph building module, a query formula description keyword automatic generation module, an initial ranking module and a rearrangement module; wherein:
A. the formula reference graph building module is used for building a formula reference graph according to the relation between the formulas and the links in the article; the formula reference graph comprises three types of nodes and three types of edges; the nodes are A nodes, O nodes and G nodes, which represent article titles, original formulas and generalized formulas, respectively; the edges include an edge r(A-O) between nodes A and O, an edge r(O-G) between nodes O and G, and an edge r(A-G) between nodes A and G;
B. the query formula description keyword automatic generation module is used for automatically generating description keywords of the query formula to obtain related description keywords of the query formula;
C. the initial ranking module is used for obtaining an initial ranking of all related documents according to the obtained formula and related keywords of the formula, so that documents which are highly similar to the query formula are obtained;
D. the rearrangement module is used for searching the formula reference graph for all documents semantically relevant to the query formula, calculating the semantic relevance of the relevant documents, and updating the relevance scores of all documents in the initial ranking so as to obtain a new ranked list of retrieval results.
7. A computer system for implementing the formula retrieval apparatus of claim 6.
CN202011293008.3A 2020-11-18 2020-11-18 Formula retrieval method and device based on formula reference graph Pending CN114519132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011293008.3A CN114519132A (en) 2020-11-18 2020-11-18 Formula retrieval method and device based on formula reference graph

Publications (1)

Publication Number Publication Date
CN114519132A true CN114519132A (en) 2022-05-20

Family

ID=81594534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011293008.3A Pending CN114519132A (en) 2020-11-18 2020-11-18 Formula retrieval method and device based on formula reference graph

Country Status (1)

Country Link
CN (1) CN114519132A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116363566A (en) * 2023-06-02 2023-06-30 华东交通大学 Target interaction relation recognition method based on relation knowledge graph
CN116363566B (en) * 2023-06-02 2023-10-17 华东交通大学 Target interaction relation recognition method based on relation knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination