CN104991905B - A mathematical expression retrieval method based on a hierarchical index - Google Patents

A mathematical expression retrieval method based on a hierarchical index

Info

Publication number
CN104991905B
Authority
CN
China
Prior art keywords
node
mathematic
representation
level
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510336356.7A
Other languages
Chinese (zh)
Other versions
CN104991905A (en)
Inventor
田学东
周南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University
Original Assignee
Hebei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University
Priority to CN201510336356.7A
Publication of CN104991905A
Application granted
Publication of CN104991905B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a mathematical expression retrieval method based on a hierarchical index. The method relies on an index built in advance. Building the index involves parsing mathematical expressions in LaTeX form, forming the hash-value hierarchical structure tree of each expression, and constructing a Treap by inserting nodes, which yields a KP mapping index layer and an inverted index layer. At retrieval time, the LaTeX expression entered by the user is parsed in the same way into a hash-value hierarchical structure tree, which is then looked up in the KP mapping index layer and the inverted index layer of the index to achieve exact retrieval or isomorphic retrieval of mathematical expressions. The invention uses a two-layer index structure that balances the completeness of the expression information against retrieval efficiency and functionality, and offers richer retrieval modes and higher retrieval efficiency.

Description

A mathematical expression retrieval method based on a hierarchical index
Technical field
The present invention relates to the field of mathematical information retrieval, and specifically to a mathematical expression retrieval method based on a hierarchical index.
Background
Research institutions and organizations in China and abroad have begun to study mathematical information retrieval and have built prototype systems with mathematical search functions, such as DLMF Search, EgoMath, MathDex, LeActiveMath, MathWebSearch, and WikiMirs. By the strategy used to develop their expression retrieval technology, these prototypes fall into two classes: those that extend a full-text search engine with mathematical capabilities, and those designed specifically around the characteristics of mathematical expressions. DLMF Search, EgoMath, MathDex, and LeActiveMath belong to the first class; MathWebSearch and WikiMirs belong to the second. These prototype systems are introduced in turn below.
DLMF (Digital Library of Mathematical Functions) is a digital library of mathematical functions sponsored by the U.S. National Institute of Standards and Technology. DLMF Search, built on DLMF, takes mathematical material in TeX/LaTeX form as its retrieval object. Its indexing component uses a parse tree as the index structure, maps special characters and mathematical symbols to letters of the alphabet, and linearizes, delimits, and serializes mathematical objects; each expression element is converted in order into a canonical form, and retrieval of mathematical expressions is then carried out by a conventional full-text search engine.
EgoMath targets MathML, and in particular the retrieval of mathematical expressions in Presentation MathML. EgoMath converts a complex formula into an ordered set. The system processes the query expression entered by the user and generates multiple sub-items; these serve as elements of the Boolean model of information retrieval and are combined by logical AND into a single query condition attached to the document query. A dedicated algorithm splits each mathematical document into a mathematical part and a text part that are stored separately, which makes queries more targeted.
MathDex is a mathematical search engine built on the Apache Software Foundation's Lucene full-text search engine, and is the earliest search engine able to recognize mathematical material. While indexing each expression, MathDex also marks its subexpressions and records how often they occur. For retrieval, MathDex uses N-gram matching, converts result documents into XHTML files with embedded MathML, and ranks the results according to the query condition; result documents are then divided into multiple fields, such as title and body text, each assigned a different weight. Because the index stores the subexpressions of each expression, the query expression is decomposed into multiple subexpressions that are queried in parallel. The more of the resulting subexpressions that match in the index, the higher the relevance of the expression.
LeActiveMath is also a mathematical search engine based on Apache Lucene. It processes mathematical documents that are encoded in OMDoc and carry semantic information, which can improve document relevance. The system includes the semantic information of mathematical expressions in the index and, during index construction, converts the OMDoc format into text tokens containing specific information. LeActiveMath queries the index structure iteratively from depth one up to the maximum depth, to avoid unnecessary problems introduced when the input expression is converted.
Another mathematical indexing method obtained by extending a full-text search engine was designed and proposed by Misutka et al. Index construction is divided into three parts: expression extraction, expression classification, and result ranking. In expression classification, the expression is parsed and abstracted into multiple query conditions, which are then queried iteratively one by one. Iterative querying addresses the problem that a full-text search engine attends only to the textual characteristics of mathematical symbols and ignores their semantics, so that equivalent expressions fail to match.
In China, MathSearch, designed at Lanzhou University, is representative. The system is based on the Lucene full-text search engine and extends it with mathematical search functions. MathSearch targets Content MathML: it builds a parse tree whose content is Content MathML markup, normalizes the tree through a series of standardization steps, and, on the basis of N-gram partitioning, builds a semantics-based index structure. To improve retrieval efficiency, MathSearch also adopts MQL (Math Query Language), a MathML-based mathematical query language that conforms to the XML specification, to support mathematical queries including structural queries, semantic queries, combined queries, and abstract queries. In addition, because Content MathML and Presentation MathML both belong to MathML, MathSearch incorporates research results on the key problem of converting Presentation formulae to Content: it defines expression rules that eliminate ambiguity in the conversion.
MathWebSearch uses a non-textual query method, with a substitution tree as the structure that represents parsed expressions. In a substitution tree, each non-root node contains substitutions for the substitution variables of its parent; that is, it is an instantiation of the parent's content. The system adds every subexpression of each expression to the index so that subexpressions can be searched. The query language used by MathWebSearch is an extension of MathML.
Lin Xiaoyan et al. designed a mathematical expression retrieval method. A PDF reader plug-in lets the user express a search need by visually and manually extracting expression information from a PDF document. In the expression preprocessing stage, a semantics-based operator tree is constructed and normalized, which improves recall and precision for equivalent expressions with the same semantics. In the index construction stage, an index of formula terms and expressions (Index of terms and formulae) and an index of formula terms and documents (Index of terms and documents) are built, and document relevance is computed offline to improve online retrieval efficiency. To return the documents carrying the retrieval results accurately, the design focuses on the relevance between the query expression and the documents and computes it precisely, which is also the key to improving the accuracy of the returned results.
WikiMirs is a retrieval system for mathematical expressions in Wikipedia. Following a normalization approach to expression form, the system designs a tokenizer that turns expressions with a clear structure into a hierarchical generalized tree. It uses a traditional inverted-file index model and adds the normalized expressions to the index in the form of inverted files.
Approaches that extend full-text retrieval technology to accommodate the characteristics of mathematical expressions essentially convert expressions of various forms, by some method, into a textual form suited to a full-text search engine and then retrieve them. They use the indexing and matching mechanisms of the full-text search engine, so their retrieval performance depends on the quality of the expression conversion and on the performance of the underlying full-text retrieval technology. Retrieval methods designed specifically for the characteristics of mathematical expressions parse and convert expressions by building a semantic structure tree, and build the index with the aim of retrieving expressions with similar semantics. Their retrieval capability depends on the quality of expression parsing and conversion and on the performance of the purpose-built mathematical index and matching model. Some of the transformations made during index construction to extract mathematical semantics may lose the original hierarchical structure of the expression, or may ignore the influence of that structure on the order of the operators in the expression.
Summary of the invention
The object of the invention is to provide a mathematical expression retrieval method based on a hierarchical index. The method uses a two-layer index structure that balances the completeness of the expression information against retrieval efficiency and functionality, and offers richer retrieval modes and higher retrieval efficiency.
The invention is realized as follows: a mathematical expression retrieval method based on a hierarchical index, comprising the following steps:
a. Represent the mathematical expression to be retrieved in LaTeX form; the operators and operands in the expression are collectively called nodes;
b. Parse the LaTeX expression to obtain the level of each node in the expression. The level of the first node in the expression is the first level. A node that causes the node level to change is called a trigger point, and the level of the nodes whose level it changes is the level immediately below the level of that trigger point;
c. According to the level of each node in the expression, represent the mathematical expression as a hierarchical structure tree; the first level of the hierarchical structure tree is set as the main level;
d. Compute the hash value of the node string formed by all nodes in the main level and use it as the key value of the main level; compute the hash value of the node string formed by all operator nodes in the main level and use it as the priority value of the main level; at the same time, extract the operator code string and the operand code string of every level of the hierarchical structure tree of the expression;
e. Insert the key value and priority value of the main level, together with the operator code string and operand code string of the main level, into the main level of the hierarchical structure tree, and insert the operator code strings and operand code strings of the other levels into their corresponding levels, forming the hash-value hierarchical structure tree of the mathematical expression;
f. Based on the hash-value hierarchical structure tree of the mathematical expression, perform exact retrieval or isomorphic retrieval against the pre-built index. The pre-built index comprises a KP mapping index layer and an inverted index layer. The KP mapping index layer is a Treap formed by inserting a number of nodes; in the Treap, each node consists of the key value and the priority value of the main level of a mathematical expression. The inverted index layer contains the hash-value hierarchical structure tree of the mathematical expression corresponding to each node in the Treap;
Exact retrieval against the pre-built index proceeds as follows: take the key value and priority value of the main level of the expression to be retrieved as the target key value and target priority value; find the node containing the target key value and priority value in the KP mapping index layer, and find the hash-value hierarchical structure tree of the corresponding expression in the inverted index layer; compare the hash-value hierarchical structure tree of the expression to be retrieved with the tree found in the inverted index layer level by level, starting from the main level; if the operator code string, operand code string, and trigger point of every level are identical, retrieval succeeds, otherwise retrieval fails;
Isomorphic retrieval against the pre-built index proceeds as follows: take the priority value of the main level of the expression to be retrieved as the target priority value; find the nodes containing the target priority value in the KP mapping index layer, and find the hash-value hierarchical structure trees of the corresponding expressions in the inverted index layer; compare the hash-value hierarchical structure tree of the expression to be retrieved with each tree found in the inverted index layer level by level, starting from the main level, comparing only the trigger point and the operator code string at each level; a level matches if these are identical. If the main level does not match, retrieval fails. Once the main level matches, the more of the other levels match, the higher the structural similarity between the corresponding expression and the expression to be retrieved. Finally, all expressions whose main level matched and whose structure is similar to that of the expression to be retrieved are returned as the retrieval result.
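The key value, the priority value, and the two comparison modes of step f can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Level` record, the flat list-of-levels representation of the tree, the function names, and the use of SHA-1 are all hypothetical choices, since the text does not name a hash function or a storage layout.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Level:
    trigger: str    # trigger point that opened this level ("" for the main level)
    operators: str  # operator code string of this level
    operands: str   # operand code string of this level

def digest(text):
    # SHA-1 is an assumption; the source does not name a hash function.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def key_and_priority(main_nodes, operators):
    """Key value: hash of all main-level nodes. Priority value: hash of the operators only."""
    key = digest("".join(main_nodes))
    priority = digest("".join(n for n in main_nodes if n in operators))
    return key, priority

def exact_match(query, candidate):
    """Exact retrieval: trigger, operator string, and operand string agree on every level."""
    return query == candidate

def isomorphic_score(query, candidate):
    """Isomorphic retrieval: compare trigger and operators only.
    Returns None if the main level differs, else the number of matching levels."""
    def same(a, b):
        return (a.trigger, a.operators) == (b.trigger, b.operators)
    if not query or not candidate or not same(query[0], candidate[0]):
        return None
    return sum(1 for a, b in zip(query, candidate) if same(a, b))
```

With this sketch, x = a/b and y = c/d get different key values but the same priority value, so they fail exact retrieval yet match on every level in isomorphic retrieval.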
In step f, each node in the Treap and the corresponding hash-value hierarchical structure tree in the inverted index layer are obtained according to steps a to e.
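As a rough illustration of steps d and e, the sketch below groups parsed nodes by level and builds an operator string and an operand string for each level. The operator set, the function name `code_strings`, the flat per-level grouping, and the use of the raw symbols as their own "codes" are all assumptions made for the example; the source describes a tree of levels and does not specify the encoding.

```python
from collections import defaultdict

# Assumed toy operator set; every other symbol is treated as an operand.
OPERATORS = {"=", "+", "-", "\\pm", "\\frac", "\\sqrt"}

def code_strings(nodes):
    """nodes: list of (symbol, level) pairs in reading order.
    Returns {level: (operator_string, operand_string)}."""
    ops, opds = defaultdict(str), defaultdict(str)
    for symbol, level in nodes:
        target = ops if symbol in OPERATORS else opds
        target[level] += symbol
    levels = sorted(set(ops) | set(opds))
    return {lvl: (ops[lvl], opds[lvl]) for lvl in levels}
```

For the nodes of `x=\frac{a \pm 1}{2}`, level 1 holds the operators `=` and `\frac` with operand `x`, and level 2 holds the operator `\pm` with operands `a`, `1`, `2`.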
In step f, the KP mapping index layer is formed as follows: nodes are inserted one by one to form a Treap; in the Treap, the key values of the nodes satisfy the ordering requirement of a binary search tree, and the priority values of the nodes satisfy the ordering requirement of a max-heap;
Each time a node is inserted into the Treap, the hash-value hierarchical structure tree corresponding to the inserted node is formed in the inverted index layer. When a node is inserted, if its key value equals the key value of a previously inserted node but the two priority values differ, the priority value of the node to be inserted is added to the existing node, so that the node holds a set of priority values, and the inverted index layer is updated. If the key value equals that of a previously inserted node and the priority values are also equal, the inverted index layer is updated so that the hash-value hierarchical structure tree of the node to be inserted is merged with the tree corresponding to the existing node.
In step f, the inverted index layer also contains information on the document in which the mathematical expression corresponding to each Treap node appears;
Each time a node is inserted into the Treap, the information on the document containing the corresponding expression is formed in the inverted index layer together with the corresponding hash-value hierarchical structure tree.
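The insertion rules above can be sketched as a small Treap with an attached posting table. This is a simplified illustration, not the patented index: the names `Node`, `KPIndex`, and `postings` are hypothetical, integer keys and priorities stand in for hash values, the inverted index layer is reduced to a list of document ids per (key, priority) pair, and a node that accumulates several priority values keeps its first one for heap ordering, which is an assumption the source does not settle.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    key: int
    priority: int                                  # value used for heap ordering
    priorities: set = field(default_factory=set)   # all priority values sharing this key
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r

def insert(root, key, priority):
    if root is None:
        return Node(key, priority, {priority})
    if key == root.key:
        root.priorities.add(priority)       # same key: extend the priority-value set
    elif key < root.key:
        root.left = insert(root.left, key, priority)
        if root.left.priority > root.priority:   # restore max-heap order
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key, priority)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root

class KPIndex:
    def __init__(self):
        self.root = None
        self.postings = {}   # (key, priority) -> list of document ids (simplified inverted layer)

    def add(self, key, priority, doc):
        self.root = insert(self.root, key, priority)
        self.postings.setdefault((key, priority), []).append(doc)
```

Keeping the posting table keyed by (key, priority) means that inserting an expression whose key and priority already exist only extends the existing entry, which mirrors the merge case described above.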
In step b, the LaTeX expression is parsed as follows:
① A mathematical expression in LaTeX form is a character string; read the first character of the LaTeX expression;
② Determine whether the character read is the keyword start character "\"; if so, perform step ③; if not, perform step ⑨;
③ Starting from the keyword start character "\", take a character string of the maximum length;
④ Determine whether the resulting character string is a keyword in the dictionary; if so, perform step ⑤; if not, perform step ⑧;
⑤ Determine whether the keyword is a keyword with parameters; if so, perform step ⑥; if not, perform step ⑨;
⑥ Record the keyword as a node and compute the level of the node; also obtain the parameters of the keyword;
⑦ According to the mathematical meaning of the keyword, compute the level of the parameters of the keyword, read the characters in the parameters, and perform step ②, processing the parameters of the keyword recursively;
⑧ Delete the last character of the resulting character string, then perform step ④;
⑨ Record the character or keyword as a node and compute the level of the node;
⑩ Determine whether all characters of the LaTeX expression have been processed; if so, finish; if not, read the next unprocessed character and perform step ②.
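The ten steps above can be sketched as a short recursive routine. It is a toy under stated assumptions rather than the patented parser: the two-entry keyword dictionary, the function names, and the rule that every keyword with parameters pushes its parameters one level down are hypothetical simplifications, and grouping braces and spaces are simply skipped instead of being recorded as nodes.

```python
# Toy keyword dictionary (assumption): keyword -> number of parameters.
# Every keyword with parameters is treated here as a trigger point.
KEYWORDS = {"\\frac": 2, "\\pm": 0}
MAX_LEN = max(len(k) for k in KEYWORDS)

def match_keyword(s, i):
    """Steps 3, 4 and 8: take the longest candidate starting at the backslash
    and shorten it from the right until it is found in the dictionary."""
    for end in range(min(len(s), i + MAX_LEN), i + 1, -1):
        if s[i:end] in KEYWORDS:
            return s[i:end]
    return None

def read_param(s, i):
    """Return (parameter_text, next_index) for a single character or a {...} group."""
    if i >= len(s):
        raise ValueError("missing parameter")
    if s[i] != "{":
        return s[i], i + 1
    depth = 0
    for j in range(i, len(s)):
        depth += (s[j] == "{") - (s[j] == "}")
        if depth == 0:
            return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")

def parse(s, level=1, out=None):
    """Return a list of (node, level) pairs in reading order."""
    out = [] if out is None else out
    i = 0
    while i < len(s):                      # step 10: loop until all characters are handled
        ch = s[i]
        if ch in " {}":                    # spaces and grouping braces are not nodes
            i += 1
            continue
        kw = match_keyword(s, i) if ch == "\\" else None   # step 2
        if kw is None:                     # step 9: plain character
            out.append((ch, level))
            i += 1
            continue
        out.append((kw, level))            # steps 6 and 9: record the keyword
        i += len(kw)
        for _ in range(KEYWORDS[kw]):      # step 7: recurse into each parameter
            while i < len(s) and s[i] == " ":
                i += 1
            param, i = read_param(s, i)
            parse(param, level + 1, out)
    return out
```

Calling `parse` on `x=\frac{a \pm 1}{2}` places `x`, `=`, and `\frac` on level 1 and `a`, `\pm`, `1`, `2` on level 2; a fraction nested inside a numerator lands one level deeper again.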
The basis (or precondition) of the hierarchical-index retrieval method provided by the invention is that an index of mathematical expression resources is built first. On that basis, mathematical expressions in LaTeX form are parsed into hash-value hierarchical structure trees, so that exact retrieval or isomorphic retrieval of mathematical expressions can be performed. The pre-built index combines the advantages of a full-text index and a tree index and forms a hierarchical structure: it comprises a KP mapping index layer and an inverted index layer. The KP mapping index layer is a Treap whose node data consist of the key value and the priority value of the main level of an expression's hash-value hierarchical structure tree; according to the user's retrieval need, it serves to quickly locate the Treap nodes that correspond to mathematical resources with similar features. The inverted index layer organizes formula resources by their hierarchical-structure attributes and, once a Treap node has been determined, quickly locates the resources the user needs. The invention is characterized by a hierarchical architecture that accommodates both fast matching and complete representation of expression structure, enabling faster lookup of mathematical material and richer retrieval modes.
The invention extracts the structural features of an expression from its hierarchical structure and builds the indexing system from those features. By indexing resources that contain mathematical expressions in a way suited to their characteristics, it can retrieve the required content from a repository of expression resources according to a LaTeX query entered by the user.
Brief description of the drawings
Fig. 1 is a flow chart of the method for parsing a LaTeX expression.
Fig. 2 is a schematic diagram of the hierarchical structure tree of formula (1).
Fig. 3 is a schematic diagram of the structural clustering of the hierarchical structure trees of two similar expressions.
Fig. 4 is a schematic diagram of the structural clustering of the hash-value hierarchical structure trees of two similar expressions.
Fig. 5 is a structural diagram of building the Treap.
Fig. 6 is a schematic diagram of the functional layout of the mathematical expression retrieval system.
Fig. 7 is a schematic diagram of the interface when a user retrieves a mathematical expression.
Fig. 8 is a schematic diagram of the interface that returns the retrieval results for the query in Fig. 7.
Detailed description
Preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.
The precondition of the hierarchical-index retrieval method provided by the invention is that an index is built in advance. Building the index mainly involves parsing the collected mathematical expressions in LaTeX form, computing the hash-value hierarchical structure tree of each expression, constructing the Treap, and forming the KP mapping index layer and the inverted index layer. This is explained further below.
In the mathematical expressions people usually see, the symbols are laid out in two dimensions, and there are many ways to represent them in a computer. LaTeX is one of them. It was created by the American computer scientist Leslie Lamport in the early 1980s, describes expressions with a one-dimensional string structure, and is currently one of the most widely used and most capable description languages for mathematical expressions. Take, for example, the mathematical expression
$$x=\frac{-b\pm\sqrt{b^2-4ac}}{2a} \tag{1}$$
Its LaTeX representation is: x=\frac{{-b \pm \sqrt{{b^2}-4ac}}}{{2a}}.
The operators and operands that carry independent meaning in a mathematical expression are called nodes (they may also be called symbols). For example, the operators in formula (1) include "=", "±", "-" (minus sign), and "-" (fraction bar, that is, the division sign); the operands in formula (1) include "x", "b", "2", "4", "a", and "c".
The purpose of parsing a mathematical expression in LaTeX form (LaTeX expression for short) is to obtain the logical relationships between the operators and operands in the mathematical expression (expression for short) and the importance of each symbol within the expression, laying the groundwork for extracting the retrieval features of the expression and judging the similarity between expressions. The parsing result, together with the location information of the expression, forms the inverted index layer of the expression index.
As shown in Fig. 1, the flow for parsing a LaTeX expression comprises the following steps:
① Start; read the first character of the LaTeX expression.
A LaTeX expression is one complete character string, and any node in it is itself a character string. This step first reads the first character of the LaTeX expression.
② Determine whether the character read is the keyword start character "\"; if so, perform step ③; if not, perform step ⑨.
A keyword generally consists of "\" followed by a word abbreviation. For example, the keyword "\frac" denotes "-" (the fraction bar, that is, the division sign), the keyword "\sqrt" denotes the radical sign "√", and the keyword "\pm" denotes "±". All keywords are generally stored in a dedicated dictionary.
③ Starting from the keyword start character "\", take a character string of the maximum length.
If the character read is the keyword start character "\", the corresponding keyword has to be found. To extract it, a character string of a certain length is taken starting from the keyword start character "\".
In this step, a character string of the maximum length is taken starting from "\". "Maximum length" here generally means the length of the longest character string stored in the dictionary dedicated to keywords. If the substring from "\" to the last character of the LaTeX expression is shorter than this maximum length, the string is taken from "\" to the last character of the LaTeX expression.
④ Determine whether the resulting character string is a keyword in the dictionary; if so, perform step ⑤; if not, perform step ⑧.
⑤ Determine whether the keyword is a keyword with parameters; if so, perform step ⑥; if not, perform step ⑨.
Once a character string has been identified as a keyword in the dictionary, it must further be determined whether the keyword takes parameters. Keywords with parameters are illustrated as follows. For the LaTeX expression "x=\frac{1}{2}", the parameters of the keyword "\frac" are "1" and "2"; for the LaTeX expression "x=\sqrt2", the parameter of the keyword "\sqrt" is "2"; for the LaTeX expression "x=\frac{{-b \pm \sqrt{{b^2}-4ac}}}{{2a}}", the parameters of the keyword "\frac" are "{-b \pm \sqrt{{b^2}-4ac}}" and "{2a}".
⑥ Record the keyword as a node and compute the level of the node; also obtain the parameters of the keyword.
If the keyword is found to take parameters, it is first recorded as a node, together with its sequence number in the LaTeX expression. If the keyword corresponds to the first character read in step ①, it is recorded as the 1st node. Thereafter, each time a character is read and recorded as a node, its sequence number is the previous node's number plus 1, giving the 2nd node, the 3rd node, and so on, until all nodes in the LaTeX expression have been recorded. After a node is recorded, the level of the node is computed.
The level of a node is computed as follows:
The level of the 1st node is designated level 1. The 1st node in the mathematical expression is the 1st node in the LaTeX expression.
Each subsequent node is compared with the 1st node. If, in the mathematical expression, it lies on the same baseline (or horizontal line) as the 1st node, its level is also level 1. If it lies on a different baseline from the 1st node, it must be determined which node it is a parameter of, that is, which node caused its horizontal-line position to change.
Once it has been determined which node it is a parameter of, the level of the parameter is the level immediately below the level of the node that caused the parameter's horizontal-line position to change.
For a node with parameters, its mathematical semantics determine whether its parameters lie on the same horizontal line as the node. If they do, the node and its parameters are on the same level in the mathematical expression; if they do not, the level of the parameters is the level immediately below the level of the node.
For example, in the LaTeX formula "x=(a+b)×5", the nodes "(" and ")" and their parameters "a", "+" and "b" lie on the same horizontal line in the mathematical expression, so they all belong to the same level. In the LaTeX formula "x=\frac{1}{2}", the node "\frac" and its parameters "1" and "2" do not lie on the same horizontal line: parameter "1" is above the node "\frac" and parameter "2" is below it, so the level of "1" and "2" is the layer below the level of the node "\frac". A parameter may be a single node or may contain several nodes; in the latter case, the nodes in the parameter are processed recursively.
In the present invention, a node that causes the horizontal line position of other nodes in the expression to change is called a trigger point. A trigger point is generally a node with parameters, but not every node with parameters is a trigger point. A trigger point and its parameters lie on different horizontal lines in the mathematical expression, and the level of its parameters is the layer below the level of the trigger point. Where a parameter itself contains nodes with parameters, it is processed recursively.
When a keyword takes parameters, the parameters corresponding to the keyword must be found. If the parameter is a single character, the single character immediately following the keyword is its parameter. If the parameter is enclosed by the delimiters "{" and "}", the delimiters in the LaTeX formula can be processed with a stack data structure to obtain the parameter corresponding to the keyword.
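The delimiter handling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `read_parameter` and its return convention are assumptions, and the outermost delimiter pair is stripped so that the result matches the parameter strings quoted in the examples above.

```python
def read_parameter(formula: str, start: int) -> tuple[str, int]:
    """Return (parameter, index just past it) for the parameter at `start`.

    Hypothetical helper: the name and return convention are assumptions.
    """
    if formula[start] != "{":
        # Single-character parameter: the character right after the keyword.
        return formula[start], start + 1
    stack = []  # positions of delimiters that are still open
    for i in range(start, len(formula)):
        ch = formula[i]
        if ch == "{":
            stack.append(i)
        elif ch == "}":
            stack.pop()
            if not stack:
                # The opening delimiter at `start` is now closed.
                return formula[start + 1:i], i + 1
    raise ValueError("unbalanced delimiters")


formula = r"x=\frac{{-b\pm\sqrt{{b^2}-4ac}}}{{2a}}"
first, after_first = read_parameter(formula, formula.index("{"))
second, after_second = read_parameter(formula, after_first)
```

Here `first` and `second` are the two parameters of the keyword "\frac"; nested delimiters inside a parameter are left in place for the recursive processing of that parameter.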
7. According to the mathematical meaning of the keyword, calculate the level of the parameters corresponding to the keyword, read the characters in the parameters, and perform step 2 to process the parameters recursively.
After the parameters corresponding to the keyword are obtained, the level of those parameters is calculated. The level of the keyword was already calculated in the previous step, so in this step the level of its parameters follows from the mathematical meaning of the keyword alone. The mathematical meaning of each keyword is stored in the dictionary, and it indicates whether the keyword and its parameters lie on the same horizontal line in the mathematical expression. If they lie on the same horizontal line, the level of the parameters is the same as the level of the keyword; if they lie on different horizontal lines, the level of the parameters is the layer below the level of the keyword.
If the parameter corresponding to the keyword is a single character, the character is recorded as a node together with its level; the next character in the LaTeX formula is then read and step 2 is performed, until processing ends.
If the parameter corresponding to the keyword is not a single character, the level calculated in this step is the level of the parameter as a whole, which is also the level of the first node in the parameter; the levels of the other nodes in the parameter are obtained by recursive processing. In this case, the characters in the parameter are read in order and step 2 is performed, so that the level of each node in the parameter is obtained recursively.
8. Delete the last character of the character string obtained, then perform step 4.
If the judgment in step 4 is negative, i.e. the character string obtained is not a keyword stored in the dictionary, the character string intercepted in step 3 is too long. Its length is therefore reduced by one, i.e. its last character is deleted, and step 4 is performed again, until the character string obtained is a keyword stored in the dictionary.
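The shrink-until-match loop of steps 4 and 8 can be sketched as below. The dictionary contents, the function name `match_keyword` and the initial interception length `max_len` are assumptions made for illustration; the text does not fix them here.

```python
# Hypothetical keyword dictionary; the real dictionary also stores each
# keyword's mathematical meaning.
KEYWORDS = {r"\frac", r"\sqrt", r"\pm", r"\sin", r"\cos"}


def match_keyword(formula, start, max_len=8):
    """Return the dictionary keyword beginning at `start`, or None."""
    candidate = formula[start:start + max_len]  # string intercepted in step 3
    while candidate and candidate not in KEYWORDS:
        candidate = candidate[:-1]              # step 8: delete the last character
    return candidate or None


keyword = match_keyword(r"x=\frac{1}{2}", 2)
```

Because the candidate only ever shrinks from the right, the first hit is the longest dictionary keyword that starts at the given position.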
9. Record the character or keyword as a node and calculate the level of the node.
This step is performed when the judgment in step 2 is negative. A negative judgment in step 2 indicates that the character read is not the keyword start character "\", that is, the character read is a digit or a letter variable; the character is then recorded as a node and the level of the node is calculated.
This step is also performed when the judgment in step 5 is negative. A negative judgment in step 5 indicates that the keyword does not take parameters; the keyword is then recorded as a node and the level of the node is calculated.
When a character or keyword is recorded as a node in this step, its sequence number among the nodes of the LaTeX formula is recorded at the same time. For the calculation of the node's level, see step 6.
10. Judge whether all characters in the LaTeX formula have been processed. If so, end; if not, read the next unprocessed character and perform step 2, until all characters in the LaTeX formula have been processed.
The method of calculating the level of a node is described in detail below with an example.
The mathematical expression is:
$$x=\frac{-b\pm\sqrt{b^{2}-4ac}}{2a}\qquad(1)$$
The LaTeX formula corresponding to formula (1) is "x=\frac{{-b\pm\sqrt{{b^2}-4ac}}}{{2a}}". The nodes in formula (1) are "x", "=", "-" (fraction bar), "-" (negative sign), "b", "±", "√", "b", "2", "-" (minus sign), "4", "a", "c", "2" and "a". To distinguish the fraction bar from the minus sign (or negative sign), the fraction bar is written as "\frac" in the following description; likewise the radical sign is written as "\sqrt".
In the LaTeX formula corresponding to formula (1), "x" is the 1st node, so the level of "x" is the 1st layer (the present invention calls the 1st layer of the hierarchical structure tree the main layer). "=" is the 2nd node and is first compared with the 1st node "x"; in formula (1), "=" and "x" lie on the same horizontal line, i.e. they are distributed horizontally, so the level of node "=" is the same as that of node "x", namely the 1st layer. "\frac" is the 3rd node and is a node with parameters; its parameters are "{-b\pm\sqrt{{b^2}-4ac}}" and "{2a}". Since "\frac" lies on the same horizontal line as the first node "x" in formula (1), the level of "\frac" is also the 1st layer. The mathematical meaning (or semantics) of "\frac" stored in the dictionary is "fraction bar", and a fraction bar and its parameters lie on different horizontal lines in a mathematical expression; therefore the parameters "{-b\pm\sqrt{{b^2}-4ac}}" and "{2a}" are at the layer below the level of "\frac", i.e. the 2nd layer. In the parameter "{2a}", with the delimiters removed, "2" and "a" lie on the same horizontal line, so both are in the 2nd layer. In the parameter "{-b\pm\sqrt{{b^2}-4ac}}", the delimiters are removed and the nodes are analyzed one by one. The 1st node "-" (negative sign) is at the level of the parameter as a whole, i.e. the 2nd layer. The nodes "b", "±" and "\sqrt" lie on the same horizontal line as the node "-" (negative sign) in the mathematical expression, so they are all in the 2nd layer. The node "\sqrt" has the parameter "{b^2}-4ac"; according to the semantics of "\sqrt" in the dictionary, its parameter lies on a different horizontal line from it, so the parameter "{b^2}-4ac" is in the 3rd layer. The 1st node "b" of the parameter "{b^2}-4ac" is in the 3rd layer; the node "^" means superscript in the dictionary, so its parameter "2" is in the 4th layer; the nodes "-" (minus sign), "4", "a" and "c" lie on the same horizontal line as node "b", so they are all in the 3rd layer.
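The level calculation walked through above can be reproduced with a short recursive sketch. It is only an illustration under stated assumptions: the dictionary `DICTIONARY`, the regular-expression tokenizer and the function names are hypothetical, only a tiny subset of LaTeX is handled, and "^" is recorded as an ordinary node at its own level.

```python
import re

# Hypothetical dictionary: keyword -> (number of parameters, is trigger point).
DICTIONARY = {r"\frac": (2, True), r"\sqrt": (1, True), r"\pm": (0, False),
              "^": (1, True), "_": (1, True)}
TOKEN = re.compile(r"\\[A-Za-z]+|[{}]|[^\s{}]")


def parse(tokens, pos, level, out, single=False):
    """Append (node, level) pairs to `out`; return the index after the span.

    With single=True exactly one parameter is parsed: one node together with
    its own parameters, or one delimited group.
    """
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "}":
            return pos                      # caller skips the closing delimiter
        if tok == "{":
            pos = parse(tokens, pos + 1, level, out) + 1
        else:
            out.append((tok, level))
            pos += 1
            n_params, trigger = DICTIONARY.get(tok, (0, False))
            for _ in range(n_params):
                # Parameters of a trigger point lie one layer lower.
                pos = parse(tokens, pos, level + (1 if trigger else 0),
                            out, single=True)
        if single:
            return pos
    return pos


def node_levels(formula):
    out = []
    parse(TOKEN.findall(formula), 0, 1, out)
    return out


levels = node_levels(r"x=\frac{{-b\pm\sqrt{{b^2}-4ac}}}{{2a}}")
```

For formula (1) the resulting levels agree with the walkthrough: "x", "=" and "\frac" at level 1, the numerator's top-level nodes and "2", "a" of the denominator at level 2, the contents of the radical at level 3, and the exponent "2" at level 4.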
While the level of a node is calculated, the position of the node and its symbol attribute are also determined. Node position: for any node, find the nearest node in the layer above the current node (generally a trigger point); according to whether the current node lies above, to the upper right of, to the right of, to the lower right of, below, inside, to the upper left of, or to the lower left of that upper-layer node, its position is recorded as "1", "2", "3", "4", "5", "6", "7" or "8" respectively. Symbol attribute: a node is recorded as "1" if it is an operator and as "0" if it is an operand.
For example, in formula (1), the 2nd-layer nodes "-" (negative sign), "b", "±" and "\sqrt" are above their upper-layer node "\frac", so their position is recorded as "1"; the 2nd-layer nodes "2" and "a" are below their upper-layer node "\frac", so their position is recorded as "5". In Fig. 2, to distinguish the two groups, an arrow is drawn from the nodes above "\frac" to the nodes below "\frac". The parameter values of formula (1) are shown in Table 1.
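One possible way to hold these per-node attributes is sketched below. The record layout, the English relation names and the helper `make_node` are assumptions; only the numeric codes come from the description above.

```python
from dataclasses import dataclass

# Position codes relative to the nearest upper-layer node, as listed above.
POSITION_CODES = {"above": 1, "upper right": 2, "right": 3, "lower right": 4,
                  "below": 5, "inside": 6, "upper left": 7, "lower left": 8}


@dataclass
class Node:
    symbol: str
    level: int
    position: int     # one of the codes in POSITION_CODES
    is_operator: int  # 1 for an operator, 0 for an operand


def make_node(symbol, level, relation, operator):
    """Hypothetical constructor that maps a relation name to its code."""
    return Node(symbol, level, POSITION_CODES[relation], 1 if operator else 0)


denominator_two = make_node("2", 2, "below", False)
radical = make_node(r"\sqrt", 2, "above", True)
```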
Table 1. Result of parsing the LaTeX formula corresponding to formula (1)
From the level of each node in the expression, the hierarchical structure tree can be drawn, as shown in Fig. 2. In Fig. 2, "x", "=" and "\frac" are in the 1st layer; "-" (negative sign), "b", "±", "\sqrt", "2" and "a" are in the 2nd layer; "b", "-" (minus sign), "4", "a" and "c" are in the 3rd layer; and "2" is in the 4th layer. The layers are arranged from top to bottom, forming a hierarchical structure tree. The nodes "\frac", "\sqrt" and "^" are trigger points. A superscript is an implicit operator in a mathematical expression: it is expressed indirectly through the relative size and position of the symbols. LaTeX makes it explicit as the keyword "^", so it is hidden in the hierarchical structure tree. In the hierarchical structure tree, the node preceding the superscript serves as the trigger point in its place; thus the "b" in the 3rd layer of Fig. 2 appears to be a trigger point only because the superscript operator "^" has been omitted and "b" stands in for it. Besides the superscript operator "^", the present invention also hides the subscript operator "_" in the hierarchical structure tree.
When several expressions have similar or identical structures, they can also be represented by one hierarchical structure tree. For example, two expressions with the same main-layer structure can be represented together by the hierarchical structure tree shown in Fig. 3, forming a structure cluster (a set of expression structures with the same main-layer structure).
After the hierarchical structure tree corresponding to the mathematical expression is obtained, the next step is to compute the hash-value hierarchical structure tree corresponding to the mathematical expression.
First, the hash value of the node string formed by all nodes in the main layer is calculated.
In this step, the "MD5 algorithm" is used to extract a 16-character double MD5 code from the node string (or character string) formed by all nodes in the main layer of the expression, and this code serves as the hash value. For details of the "MD5 algorithm", see "R. L. Rivest. The MD5 message-digest algorithm [J]. Internet Activities Board, 1992, 143."
The "MD5 algorithm" operates on a character string; all nodes in the main layer of the expression (operators and operands) are concatenated to form one character string. The "MD5 algorithm" is applied once to this string, and the 9th to 24th characters (16 in total) of the result are selected; the "MD5 algorithm" is then applied once more to the selected 16-character string, and the 9th to 24th characters (16 in total) of the second result are output as the hash value. This is the calculation of the 16-character double MD5 code.
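The 16-character double MD5 code can be sketched with Python's standard `hashlib` module. The function names are hypothetical, and two details are assumptions because the text does not state them: the node string is encoded as UTF-8, and the MD5 result is taken as a lowercase hexadecimal string.

```python
import hashlib


def md5_16(text: str) -> str:
    """Characters 9 to 24 (16 in total) of the hexadecimal MD5 digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[8:24]


def double_md5_16(node_string: str) -> str:
    """Apply the 16-character extraction twice, as described above."""
    return md5_16(md5_16(node_string))


code = double_md5_16(r"y=\frac")  # illustrative main-layer node string
```

The slice `[8:24]` is the zero-based equivalent of "the 9th to the 24th characters".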
Take the expression $y=\frac{x+2}{3}$ as an example. Its LaTeX form is "y=\frac{{x+2}}{3}"; the expression nodes are "y", "=", the fraction bar, "x", "+", "2" and "3", and the nodes in the LaTeX form are "y", "=", "\frac", "x", "+", "2" and "3".
Let md5(a) denote the hash value of "a"; then the 16-character double MD5 hash value of the expression $y=\frac{x+2}{3}$ is:
Note that the hash value is computed with the expression as the unit. Although the LaTeX form of each expression is a character string, the expression is composed of expression nodes. Taking $y=\frac{x+2}{3}$ as an example, its LaTeX form is "y=\frac{{x+2}}{3}"; its constituent elements are "y", "=", the fraction bar, "x", "+", "2" and "3", which in the LaTeX form are "y", "=", "\frac", "x", "+", "2" and "3", and not "y", "=", "\", "f", "r", "a", "c", "x", "+", "2" and "3". That is, the character string of the LaTeX formula is treated as a sequence of nodes in the computation.
The hash value (ExpCode) of all nodes in the main layer of the expression is computed as follows: the operators and operands in the main layer are taken as the main-layer symbols of the expression, and the hash value (ExpCode) computed from the symbol string formed by these symbols serves as the primary identifier of the main layer, also called the key value.
For example, in the expression $y=\frac{x+2}{3}$ ("y=\frac{{x+2}}{3}"), the symbol string of the operators and operands in the main layer consists of "y", "=" and the fraction bar. The main-layer symbol hash value (ExpCode) of the expression is:
Second, the hash value of the node string formed by all operator nodes in the main layer is calculated.
Because the structure and mathematical meaning of an expression are closely tied to the operators at each level, all operator nodes in the main layer are extracted, and the "MD5 algorithm" is used to extract a 16-character double MD5 code from the main-layer operator node string as its hash value. This hash value serves as the secondary identifier of the main layer, also called the priority value (ExpStructureCode), and prepares for structure matching.
For example, in the expression $y=\frac{x+2}{3}$ ("y=\frac{{x+2}}{3}"), the code string formed by the main-layer operators consists of "=" and the fraction bar. The hash value (ExpStructureCode) obtained by extracting the 16-character double MD5 code from this code string with the "MD5 algorithm" is:
Third, after the key value and priority value of the main layer are obtained, the operator code string (LayerStructureCode) and the operand code string (LayerCode) of each layer are extracted from the hierarchical structure tree of the mathematical expression. The operator code string of a layer is the code string formed by all operators in that layer; the operand code string of a layer is the code string formed by all operands in that layer.
Fourth, the key value, priority value, operator code string and operand code string of the main layer are stored in the main layer of the hierarchical structure tree, and the operator code string and operand code string of every other layer are stored in the corresponding layer, forming the hash-value hierarchical structure tree of the mathematical expression.
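The four steps above can be put together in a small sketch. It assumes that each node is available as a (symbol, level, is_operator) triple, that a code string is the plain concatenation of symbols in formula order, and that the tree is stored as a dictionary keyed by layer number; these representations and the function names are assumptions, not the patent's data format.

```python
import hashlib
from collections import defaultdict


def double_md5_16(text):
    """16-character double MD5 code (UTF-8 and lowercase hex are assumed)."""
    first = hashlib.md5(text.encode("utf-8")).hexdigest()[8:24]
    return hashlib.md5(first.encode("utf-8")).hexdigest()[8:24]


def build_hash_tree(nodes):
    """nodes: list of (symbol, level, is_operator); level 1 is the main layer."""
    layers = defaultdict(lambda: {"LayerStructureCode": "", "LayerCode": ""})
    for symbol, level, is_operator in nodes:
        field = "LayerStructureCode" if is_operator else "LayerCode"
        layers[level][field] += symbol
    tree = {level: dict(codes) for level, codes in sorted(layers.items())}
    main_nodes = "".join(s for s, lvl, _ in nodes if lvl == 1)
    main_operators = "".join(s for s, lvl, op in nodes if lvl == 1 and op)
    tree[1]["ExpCode"] = double_md5_16(main_nodes)               # key value
    tree[1]["ExpStructureCode"] = double_md5_16(main_operators)  # priority value
    return tree


# Nodes of "y=\frac{{x+2}}{3}" with their levels and symbol attributes.
example = [("y", 1, False), ("=", 1, True), (r"\frac", 1, True),
           ("x", 2, False), ("+", 2, True), ("2", 2, False), ("3", 2, False)]
tree = build_hash_tree(example)
```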
When several expressions have the same or a similar main-layer structure, they can be represented by one hierarchical structure tree; their main-layer key values and priority values are then also respectively identical, so their hash-value hierarchical structure trees can likewise be represented as a structure cluster. For example, Fig. 4 shows the structure cluster of the hash-value hierarchical structure trees of two such expressions. The dashed parts in the figure show the nodes in each layer.
Before the index is established, the LaTeX formulas of all collected mathematical expressions must be parsed and the corresponding hash-value hierarchical structure trees formed. A Treap tree is then constructed; the Treap tree structure is the KP mapping index layer, and the inverted index layer is formed as the Treap tree is built. The KP mapping index layer helps to quickly find expression clusters with similar structure, and the inverted index layer is used to locate more precisely the required mathematical expression and the information of the file containing it.
The Treap tree is built as follows. After the LaTeX formulas are parsed and the hash-value hierarchical structure trees are formed, the key value and priority value of the main layer are extracted from each hash-value hierarchical structure tree, and the main-layer key value and priority value of the same expression are combined into one node. The nodes corresponding to all expressions are inserted one by one, finally forming the Treap tree structure. While the Treap tree is formed by inserting nodes, the key values in the nodes must satisfy the ordering requirement of a binary search tree, and the priority values in the nodes must satisfy the ordering requirement of a max-heap. If, during insertion, the key value and priority value of a node cannot both satisfy their ordering requirements, the key value is made to satisfy the binary search tree requirement. To satisfy the ordering requirements, each time a node is inserted its key value and priority value are compared with the key values and priority values of the previously inserted nodes. In the present invention, the key value and priority value are both 16-character double MD5 code strings extracted by the "MD5 algorithm". Two character strings are compared as follows: starting from the first character, the strings are compared character by character until a differing character appears, and the ASCII values of the first differing characters determine the order. For example, for "abc" and "aabdfg", the 1st characters are both "a", so the next characters are examined; the 2nd characters are "b" and "a", and since the ASCII value of "b" is greater than that of "a", the result of the comparison is "abc" > "aabdfg".
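The comparison rule just described is ordinary lexicographic comparison, which is also how Python's built-in string comparison behaves. A minimal sketch follows; the function name is hypothetical, and the tie-break for the case where one string is a prefix of the other is an assumption, since that case cannot arise between two 16-character codes.

```python
def compare_codes(a: str, b: str) -> int:
    """Return 1 if a > b, -1 if a < b, 0 if equal, comparing character by character."""
    for ca, cb in zip(a, b):
        if ca != cb:
            # The first differing characters decide, by ASCII value.
            return 1 if ord(ca) > ord(cb) else -1
    # Assumption: when one string is a prefix of the other, the longer is greater.
    return (len(a) > len(b)) - (len(a) < len(b))


result = compare_codes("abc", "aabdfg")
```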
Take six expressions as examples, among them "x+y=sin α", "x-1", "3x≠0" and "x-y=cos β" as the second, third, fifth and sixth. Their corresponding nodes in the Treap tree (in the form "ExpCode:ExpStructureCode") are respectively (76c60a9302863009:b69555f2bd3dc4fe), (85f09a43e1e2f970:5289df294f8f0d9f), (d8abe6a49c0e9c25:c25d02f078e7efc6), (76c60a9302863009:b69555f2bd3dc4fe), (3554ec91046c34ed:d6230586ef2c409a) and (b0e1bada1503e8fe:9e0d0965cfb9026a). These nodes are inserted into the Treap tree one by one, and the resulting structure is shown schematically in Fig. 5. The upper half of Fig. 5 corresponds to the part of the Treap tree formed by these inserted nodes. In the Treap tree, the key values of these nodes satisfy the binary search tree ordering requirement: the key values of all nodes in a left subtree are smaller than the key value of its root node, and the key values of all nodes in a right subtree are greater than the key value of its root node. Because the first and fourth expressions correspond to the same node in the Treap tree, they are shown as a single node. This is a case of conflict handling during node insertion.
When a node is inserted, it is inserted into the Treap tree as long as its key value differs from the key values of all previously inserted nodes. If the key value of the node to be inserted is the same as that of a previously inserted node (a key-value conflict) and the two priority values are also the same (a priority-value conflict), which is the case described above for two expressions sharing one node, the node is not inserted again; the existing node represents both expressions, and the inverted index layer is then updated. If the key value of the node to be inserted is the same as that of a previously inserted node (a key-value conflict) but the priority values differ (no priority-value conflict), the key value is already stored in the Treap tree structure; the priority value of the node to be inserted is then added to that existing node, so that the priority values coexist in the node as a priority value set, and the inverted index layer is updated.
While the Treap tree is formed by inserting nodes, each time a node is inserted the hash-value hierarchical structure tree corresponding to that node is formed in the inverted index layer, and the correspondence between the node in the Treap tree and the hash-value hierarchical structure tree in the inverted index layer is established. When the hash-value hierarchical structure tree corresponding to a node is formed in the inverted index layer, the information of the document containing the corresponding mathematical expression is also recorded.
Conflicts during node insertion are handled in the inverted index layer as follows. If both the key value and the priority value conflict during insertion, the hash-value hierarchical structure tree corresponding to the node to be inserted is merged with the hash-value hierarchical structure tree corresponding to the conflicting node, forming a structure cluster of hash-value hierarchical structure trees, as shown in Fig. 5. In Fig. 5, the two expressions that share a node correspond to the Treap tree node (76c60a9302863009:b69555f2bd3dc4fe), and the structure cluster of hash-value hierarchical structure trees corresponding to this node in the inverted index layer is shown in the lower half of Fig. 5. If the key value conflicts but the priority value does not, the priority value is added to the conflicting node so that the priority values coexist in that node, and the inverted index layer is updated so that this node in the Treap tree points to both hash-value hierarchical structure trees in the inverted index layer.
Once the KP mapping index layer and the inverted index layer have both been formed, the index is established. The index can then be used to retrieve mathematical expressions.
Fig. 6 shows the LaTeX parsing module 1, the retrieval module 2 and the index module 3. The main function of the LaTeX parsing module 1 is to parse a number of documents in LaTeX format with the LaTeX parser (the parsing process is described above) and to store the final parsing results in the data store. The index module 3 obtains the parsing results of all expressions from the data store and establishes the index (as described above); the index established in the index module 3 comprises the KP mapping index layer and the inverted index layer. The main function of the retrieval module 2 is as follows. First, the mathematical expression in the user's query is normalized, i.e. converted into standard LaTeX form, which is the same form as that of the LaTeX documents in the LaTeX parsing module 1. After normalization, the mathematical expression to be retrieved, now in LaTeX form, is parsed by the LaTeX parser in the same way as LaTeX formulas are parsed when the index is established, which is not repeated here; the parsing result (including the hash-value hierarchical structure tree formed by parsing) is stored in the data store. The parsed data are then extracted from the data store and the index in the index module 3 is searched; exact retrieval or isomorphic retrieval can be performed through the KP mapping index layer and the inverted index layer. The retrieval results are collated and finally output for display.
Exact retrieval through the KP mapping index layer and the inverted index layer proceeds as follows. The main-layer key value of the mathematical expression to be retrieved is taken as the target key value, and its main-layer priority value as the target priority value. The KP mapping index layer is searched for a node containing both the target key value and the target priority value; if none is found, retrieval fails. If one is found, the hash-value hierarchical structure tree corresponding to that node is looked up in the inverted index layer. The hash-value hierarchical structure tree of the mathematical expression to be retrieved (the former) is compared layer by layer, starting from the main layer, with the hash-value hierarchical structure tree found in the inverted index layer (the latter); in each layer the operator code string, the operand code string and the trigger points are compared. Taking the main layer as an example: if the operator code strings in the main layers of the former and the latter are identical, the operand code strings in the two main layers are identical, and the trigger points in the two main layers are identical, then the main layers match and matching continues with the next layer. If all layers match, the mathematical expression corresponding to the hash-value hierarchical structure tree found in the inverted index layer is the retrieval result; if any layer fails to match, retrieval fails. For example, if the main layers do not match, retrieval fails, the failure result is returned, and the next layer is not compared.
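The layer-by-layer comparison for exact retrieval can be sketched as follows. The tree representation (a dictionary keyed by layer number, with a "triggers" field per layer) is hypothetical, and treating trees with different numbers of layers as a mismatch is an assumption the text does not spell out.

```python
FIELDS = ("LayerStructureCode", "LayerCode", "triggers")


def exact_match(query_tree, indexed_tree):
    """Compare two hash-value hierarchical structure trees layer by layer."""
    if query_tree.keys() != indexed_tree.keys():
        return False  # assumption: differing layer sets cannot match exactly
    for level in sorted(query_tree):  # the main layer (1) is compared first
        q, t = query_tree[level], indexed_tree[level]
        if any(q[field] != t[field] for field in FIELDS):
            return False  # stop at the first layer that fails to match
    return True


query = {1: {"LayerStructureCode": r"\frac", "LayerCode": "", "triggers": (r"\frac",)},
         2: {"LayerStructureCode": "", "LayerCode": "ba", "triggers": ()}}
same = {level: dict(codes) for level, codes in query.items()}
swapped = {1: dict(query[1]),
           2: {"LayerStructureCode": "", "LayerCode": "ab", "triggers": ()}}
```

Here `query` stands for the parsed form of "\frac{b}{a}"; `swapped` differs only in its second-layer operand code string and therefore fails exact retrieval.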
Isomorphic retrieval through the KP mapping index layer and the inverted index layer proceeds as follows. The main-layer priority value of the mathematical expression to be retrieved is taken as the target priority value. The KP mapping index layer is searched for nodes containing the target priority value; if none is found, retrieval fails. If such nodes are found, the corresponding hash-value hierarchical structure trees are looked up in the inverted index layer. Because only the priority value is searched, several nodes may be found, whose key values differ but whose priority values equal the target priority value. For each qualifying node, the corresponding hash-value hierarchical structure tree is looked up in the inverted index layer. The hash-value hierarchical structure tree of the mathematical expression to be retrieved (the former) is compared layer by layer, starting from the main layer, with each hash-value hierarchical structure tree found in the inverted index layer (the latter); in each layer the operator code string and the trigger points are compared. Unlike exact retrieval, isomorphic retrieval does not compare operand code strings. For example, if the operator code strings in the main layers of the former and the latter are identical and the trigger points in the two main layers are identical, the main layers match and matching continues with the next layer. If the main layers do not match, retrieval fails. In the comparison, the more layers other than the main layer that match, the higher the structural similarity between the corresponding expression and the expression to be retrieved, and the higher the score given to the document information of the corresponding expression. Even if the score of the document information of an expression is not high, the expression counts as matched as long as its main layer matches. Finally, all matched expressions are returned to the user in descending order of document-information score, and the user can select among them as needed.
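The scoring used in isomorphic retrieval can be sketched as below. The score is taken to be the number of matching layers other than the main layer; this counting rule, the tree representation and all names are assumptions chosen for illustration, since the text only says that more matching layers yield a higher score.

```python
STRUCTURE_FIELDS = ("LayerStructureCode", "triggers")  # operands are ignored


def isomorphic_score(query_tree, candidate_tree):
    """Return None if the main layers differ, else the count of further matching layers."""
    def same(level):
        q, c = query_tree.get(level), candidate_tree.get(level)
        return (q is not None and c is not None
                and all(q[f] == c[f] for f in STRUCTURE_FIELDS))

    if not same(1):
        return None
    return sum(same(level) for level in query_tree if level != 1)


def rank(query_tree, candidates):
    """candidates: {document info: tree}; matched documents, best score first."""
    scored = [(isomorphic_score(query_tree, tree), doc)
              for doc, tree in candidates.items()]
    matched = [(score, doc) for score, doc in scored if score is not None]
    return [doc for score, doc in sorted(matched, key=lambda pair: -pair[0])]


def layer(ops, operands, triggers=()):
    return {"LayerStructureCode": ops, "LayerCode": operands, "triggers": triggers}


query = {1: layer(r"\frac", "", (r"\frac",)), 2: layer("", "ba")}
candidates = {
    "book A": {1: layer(r"\frac", "", (r"\frac",)), 2: layer("", "xy")},
    "book B": {1: layer(r"\frac", "", (r"\frac",)), 2: layer("+", "x1y")},
    "book C": {1: layer("=", "xy"), 2: layer("", "")},
}
ranking = rank(query, candidates)
```

In this made-up data, "book A" matches the query in both layers, "book B" only in the main layer, and "book C" not at all, so only the first two are returned, in that order.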
The running environment of the mathematical expression retrieval prototype system established by the present invention is:
Server system: Microsoft Windows Server 2012
Programming language: ASP.NET
Database system: Microsoft SQL Server 2012
The system architecture is the B/S (browser/server) model.
In the present invention, 6234 mathematical expressions taken from the secondary-school mathematics textbooks of People's Education Press and the college mathematics textbooks published by Higher Education Press were used as samples for effect testing. At present the system is open only to the project team and is not open to the Internet.
The retrieval system offers two retrieval modes: exact retrieval and isomorphic retrieval. Exact retrieval requires a result to be identical to the input query expression; isomorphic retrieval requires a result to have the same main-layer structure as the input query expression. All mathematical expressions in the system are expressed in LaTeX form. The system front end is shown in Fig. 7 and Fig. 8.
Fig. 7 shows the LaTeX formula of the mathematical expression to be retrieved, "\frac{b}{a}", with isomorphic retrieval (also called structure retrieval) selected. Fig. 8 shows the returned retrieval results; each result contains a mathematical expression and the information of the document containing it (for example, which book it comes from).
The 6234 expressions used in the experimental stage of the present invention reach a maximum depth of 11 levels and contain 52680 operators and 72010 operands, so the sample has a certain complexity.
In the experimental tests, the distribution of retrieval time per expression over the collected sample is shown in Table 2; the time to build the complete index over the 6234 mathematical expressions and the resulting file size are shown in Table 3.
Table 2. Retrieval efficiency statistics of the present invention
The times in Table 2 and Table 3 are system response times. As Table 2 and Table 3 show, when the method of the present invention retrieves from the sample of 6234 mathematical expressions, exact retrieval and isomorphic retrieval each take about 360 ms on average; building the complete index over the 6234 mathematical expressions takes about 33831 ms, and the resulting index file is 549 KB. These figures are satisfactory and meet the requirements of index construction and expression retrieval.

Claims (5)

1. A mathematical expression retrieval method based on a hierarchical index, characterized by comprising the following steps:
a. Represent the mathematical expression to be retrieved in LaTeX form; the operators and operands in the mathematical expression are collectively called nodes;
b. Parse the LaTeX formula to obtain the level of each node in the expression; the level of the first node in the expression is the first level; a node that causes the level of other nodes to change is called a trigger point, and a node whose level is changed by a trigger point lies one level below the level of that trigger point;
c. According to the level of each node in the expression, represent the mathematical expression as a hierarchical structure tree; the first level of the hierarchical structure tree is set as the main level;
d. Compute the hash value of the node string formed by all nodes in the main level and use it as the key value of the main level; compute the hash value of the node string formed by all operator nodes in the main level and use it as the priority value of the main level; at the same time, extract the operator code string and the operand code string of each level in the hierarchical structure tree of the mathematical expression;
e. Insert the key value and priority value of the main level, together with the operator code string and operand code string of the main level, into the main level of the hierarchical structure tree; insert the operator code strings and operand code strings of the levels other than the main level into their corresponding levels, forming the hash-value hierarchical structure tree of the mathematical expression;
f. According to the hash-value hierarchical structure tree of the mathematical expression, perform exact retrieval or isomorphic retrieval on a pre-built index; the pre-built index comprises a KP mapping index layer and an inverted index layer; the KP mapping index layer is a Treap structure formed by inserting a number of nodes, and in the Treap each node consists of the key value and the priority value of the main level of a mathematical expression; the inverted index layer contains the hash-value hierarchical structure trees of the mathematical expressions corresponding to the nodes of the Treap;
Exact retrieval on the pre-built index is performed as follows: take the key value and the priority value of the main level of the mathematical expression to be retrieved as the target key value and target priority value; find the node containing the target key value and target priority value in the KP mapping index layer, and find the hash-value hierarchical structure tree of the mathematical expression corresponding to that node in the inverted index layer; compare the hash-value hierarchical structure tree of the expression to be retrieved with the tree found in the inverted index layer level by level, starting from the main level; if the operator code strings, operand code strings and trigger points of every level are identical, retrieval succeeds; otherwise retrieval fails;
Isomorphic retrieval on the pre-built index is performed as follows: take the priority value of the main level of the mathematical expression to be retrieved as the target priority value; find the nodes containing the target priority value in the KP mapping index layer, and find the hash-value hierarchical structure trees of the mathematical expressions corresponding to those nodes in the inverted index layer; compare the hash-value hierarchical structure tree of the expression to be retrieved with each tree found in the inverted index layer level by level, starting from the main level, comparing only the trigger points and the operator code strings in each level; if the comparison for a level is identical, that level matches; if the main level does not match, retrieval fails; once the main level matches, the more of the other levels that match, the higher the structural similarity between the corresponding expression and the expression to be retrieved; finally, all expressions whose main level matches and which are structurally similar to the expression to be retrieved are returned as the retrieval result, ordered from highest to lowest structural similarity.
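By way of illustration only, steps d to f might be sketched in Python as below. This is a simplified sketch under several assumptions that are not stated in the claim: each level is reduced to a pair of space-joined strings, trigger points are not compared separately, `zlib.crc32` stands in for the unspecified hash function, and the `OPERATORS` set is a hypothetical classification of nodes. The input is a list of (node, level) pairs as produced by step b.

```python
import zlib
from collections import defaultdict

# Hypothetical operator set; the claim does not enumerate the operators.
OPERATORS = {"\\frac", "\\sqrt", "^", "_", "+", "-", "=", "\\times"}


def h(text):
    """Stand-in hash function (assumption: the source does not name one)."""
    return zlib.crc32(text.encode("utf-8"))


def build_levels(nodes):
    """Map each level to (operator code string, operand code string)."""
    ops, opnds = defaultdict(list), defaultdict(list)
    for token, level in nodes:
        (ops if token in OPERATORS else opnds)[level].append(token)
    levels = sorted(set(ops) | set(opnds))
    return {lv: (" ".join(ops[lv]), " ".join(opnds[lv])) for lv in levels}


def main_level_values(nodes):
    """Step d: key value from all main-level nodes, priority value from operators only."""
    main = [token for token, level in nodes if level == 1]
    key = h(" ".join(main))
    priority = h(" ".join(t for t in main if t in OPERATORS))
    return key, priority


def exact_match(query, candidate):
    """Exact retrieval: operators and operands must agree on every level."""
    return build_levels(query) == build_levels(candidate)


def isomorphic_score(query, candidate):
    """Isomorphic retrieval: None if the main level fails, else the count of other matching levels."""
    q, c = build_levels(query), build_levels(candidate)
    if q.get(1, ("", ""))[0] != c.get(1, ("", ""))[0]:
        return None
    return sum(1 for lv in q if lv != 1 and lv in c and q[lv][0] == c[lv][0])
```

Under these assumptions, the nodes of "\frac{b}{a}" and "\frac{m}{n}" yield the same key value and priority value and an isomorphic score of 1, while exact retrieval rejects the pair because the operand strings on the second level differ.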
2. The mathematical expression retrieval method based on a hierarchical index according to claim 1, characterized in that, in step f, the nodes of the Treap and the hash-value hierarchical structure trees in the inverted index layer corresponding to those nodes are all obtained according to steps a to e.
3. The mathematical expression retrieval method based on a hierarchical index according to claim 2, characterized in that, in step f, the KP mapping index layer is formed as follows: nodes are inserted one by one to form the Treap structure; in the Treap, the key values of the nodes satisfy the ordering requirement of a binary search tree, and the priority values of the nodes satisfy the ordering requirement of a max-heap;
Each time a node is inserted into the Treap, the hash-value hierarchical structure tree corresponding to the inserted node is created in the inverted index layer; when a node is inserted into the Treap, if the key value of the node to be inserted equals the key value of a previously inserted node but their priority values differ, the priority value of the node to be inserted is added to that existing node so that the priority values coexist in the node as a priority value set, and the inverted index layer is updated accordingly; if the key value of the node to be inserted equals the key value of a previously inserted node and their priority values are also equal, the inverted index layer is updated so that the hash-value hierarchical structure tree of the node to be inserted is merged with that of the existing node.
4. The mathematical expression retrieval method based on a hierarchical index according to claim 3, characterized in that, in step f, the inverted index layer further contains the document information of the mathematical expressions corresponding to the nodes of the Treap;
Each time a node is inserted into the Treap, the document information of the mathematical expression corresponding to the inserted node is created in the inverted index layer together with the hash-value hierarchical structure tree corresponding to the inserted node.
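For illustration only, the insertion rule of claim 3 might be sketched in Python as follows. The names `Node`, `add` and `find` and the dictionary-based `index` are hypothetical. Two points are assumptions rather than statements of the claim: the first priority value inserted for a key is taken to fix the node's heap position, and "merging" trees with identical key value and priority value is modelled as appending to a list in the inverted index.

```python
class Node:
    def __init__(self, key, priority):
        self.key = key
        self.priorities = {priority}    # priority value set held in the node
        self.heap_priority = priority   # assumption: first priority fixes heap position
        self.left = None
        self.right = None


def _rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l


def _rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r


def _insert(root, key, priority):
    if root is None:
        return Node(key, priority)
    if key == root.key:
        root.priorities.add(priority)           # same key: priority values coexist
    elif key < root.key:
        root.left = _insert(root.left, key, priority)
        if root.left.heap_priority > root.heap_priority:
            root = _rotate_right(root)          # restore max-heap order
    else:
        root.right = _insert(root.right, key, priority)
        if root.right.heap_priority > root.heap_priority:
            root = _rotate_left(root)
    return root


def add(index, key, priority, tree):
    """Insert into the Treap and update the inverted index layer in one step."""
    index["root"] = _insert(index["root"], key, priority)
    index["postings"].setdefault((key, priority), []).append(tree)


def find(root, key):
    """Binary-search-tree lookup by key value; returns None if the key is absent."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root
```

An index would start as `{"root": None, "postings": {}}`; exact retrieval can then look up a node with `find` and test whether the target priority value is in `node.priorities` before reading the inverted index entry.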
5. The mathematical expression retrieval method based on a hierarchical index according to claim 1, characterized in that, in step b, the LaTeX formula is parsed as follows:
(1) The mathematical expression in LaTeX form is a character string; read the first character of the LaTeX formula;
(2) Determine whether the character read is the keyword start character "\"; if so, go to step (3); otherwise go to step (9);
(3) Starting from the keyword start character "\", take the character string of maximum length that follows;
(4) Determine whether the obtained character string is a keyword in the dictionary; if so, go to step (5); otherwise go to step (8);
(5) Determine whether the keyword is a keyword that takes parameters; if so, go to step (6); otherwise go to step (9);
(6) Record the keyword as a node and compute the level of the node; at the same time obtain the parameters of the keyword;
(7) According to the mathematical meaning of the keyword, compute the level of the keyword's parameters, read the characters in the parameters, and go to step (2), processing the parameters of the keyword recursively;
(8) Delete the last character of the obtained character string, then go to step (4);
(9) Record the character or keyword as a node and compute the level of the node;
(10) Determine whether all characters in the LaTeX formula have been processed; if so, end; otherwise read the next unprocessed character and go to step (2).
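The parsing steps above can be sketched, for illustration only, as the following Python fragment. The keyword dictionary `KEYWORDS`, the treatment of "^" and "_" as one-parameter trigger points, and the rule that every parameter lies exactly one level below its keyword are simplifying assumptions; the claim itself leaves the level of a parameter to the mathematical meaning of each keyword. Braces are assumed to be balanced.

```python
# Hypothetical keyword dictionary: keyword -> number of parameters it takes.
KEYWORDS = {"\\frac": 2, "\\sqrt": 1, "\\alpha": 0, "\\sin": 0, "\\times": 0}
SCRIPTS = {"^", "_"}  # assumed here to act as one-parameter trigger points
MAX_LEN = max(len(k) for k in KEYWORDS)


def match_keyword(s, i):
    """Steps (3), (4), (8): take the longest candidate, then trim until it is a keyword."""
    candidate = s[i:i + MAX_LEN]
    while len(candidate) > 1 and candidate not in KEYWORDS:
        candidate = candidate[:-1]
    return candidate  # a lone "\" if nothing in the dictionary matched


def read_parameter(s, i):
    """Return (parameter_text, index_after_parameter)."""
    while i < len(s) and s[i].isspace():
        i += 1
    if i < len(s) and s[i] == "{":
        depth, j = 1, i + 1
        while j < len(s) and depth:
            depth += {"{": 1, "}": -1}.get(s[j], 0)
            j += 1
        return s[i + 1:j - 1], j
    if i < len(s) and s[i] == "\\":
        token = match_keyword(s, i)
        return token, i + len(token)
    return s[i:i + 1], i + 1


def parse(s, level=1, nodes=None):
    """Return a list of (node, level) pairs in reading order."""
    nodes = [] if nodes is None else nodes
    i = 0
    while i < len(s):
        ch = s[i]
        if ch.isspace() or ch in "{}":
            i += 1                               # layout only, not a node
        elif ch == "\\":
            token = match_keyword(s, i)
            i += len(token)
            nodes.append((token, level))         # steps (6) and (9)
            for _ in range(KEYWORDS.get(token, 0)):
                inner, i = read_parameter(s, i)
                parse(inner, level + 1, nodes)   # step (7): recurse one level down
        elif ch in SCRIPTS:
            nodes.append((ch, level))
            inner, i = read_parameter(s, i + 1)
            parse(inner, level + 1, nodes)
        else:
            nodes.append((ch, level))            # step (9): plain character node
            i += 1
    return nodes
```

With this sketch, `parse(r"\frac{b}{a}")` places `\frac` on level 1 and both `b` and `a` on level 2.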
CN201510336356.7A 2015-06-17 2015-06-17 A kind of mathematic(al) representation search method based on level index Active CN104991905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510336356.7A CN104991905B (en) 2015-06-17 2015-06-17 A kind of mathematic(al) representation search method based on level index

Publications (2)

Publication Number Publication Date
CN104991905A CN104991905A (en) 2015-10-21
CN104991905B true CN104991905B (en) 2018-01-30

Family

ID=54303721

Country Status (1)

Country Link
CN (1) CN104991905B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7287023B2 (en) * 2003-11-26 2007-10-23 International Business Machines Corporation Index structure for supporting structural XML queries
CN102693303A (en) * 2012-05-18 2012-09-26 上海极值信息技术有限公司 Method and device for searching formulation data
CN104281589A (en) * 2013-07-03 2015-01-14 深圳习习网络科技有限公司 Mathematical formula searching method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant