CN105447477B - Formula identification method and device based on formula library - Google Patents

Formula identification method and device based on formula library

Info

Publication number
CN105447477B
Authority
CN
China
Prior art keywords
component
formula
image
level
subgraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510985871.8A
Other languages
Chinese (zh)
Other versions
CN105447477A (en
Inventor
韦秋华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hanvon Digital Technology Co Ltd
Original Assignee
Beijing Hanvon Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hanvon Digital Technology Co Ltd
Priority to CN201510985871.8A
Publication of CN105447477A
Application granted
Publication of CN105447477B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a formula recognition method and device based on a formula library. The method comprises: constructing the components of each formula in the formula library, and from them a component set; assigning a code to each component in the component set to form a component code set, extracting the component feature corresponding to each code, and thereby building a component feature set corresponding to the component code set; processing a formula image to be recognized to obtain component sub-images; and extracting the component feature of each component sub-image, comparing it with the component feature set to obtain multiple candidate recognition results for that sub-image, and from these obtaining the formula recognition result of the formula image to be recognized.

Description

Formula identification method and device based on formula library
Technical field
The present invention relates to a formula recognition method and device, and in particular to a formula recognition method and device based on a formula library, which can be used, for example, in the field of online education.
Background art
With the development of online education technology, providing a simple and fast question-answering service has become an important line of business. The mainstream application on the market today is photo-based question search, in which a question is photographed and the search is driven by the recognition results for the text and formulas in the resulting image. Mathematics is a core subject in primary and secondary schools, and formulas are the most important form of expression in mathematics questions. Formula recognition is therefore a core technology that determines the performance of question search.
Traditional formula recognition technology is mainly applied to recognizing formulas in images or handwritten formulas. Its main principle is to first segment single characters from the image, then recognize each character, and finally analyze the structural relations between characters to arrive at an understanding of the formula. Such technology is strongly affected by image quality and handwriting style. In addition, the structure between all characters making up the formula must be analyzed, so the recognition process is cumbersome, recognition takes a long time, and efficiency is low; it is therefore poorly suited to photo-based question search.
Summary of the invention
Photo-based question search presupposes a large, continuously updated database of questions, answers and explanations. For the formula recognition performed during such a search, this amounts to having a specific formula library established in advance. The present invention provides a formula recognition method and device based on such a specific formula library. The method constructs the components of each formula from the formula library, establishes the component set of the library, and treats a formula as a horizontal, left-to-right combination of components from the component set.
According to one aspect of the present invention, an embodiment provides a formula recognition method based on a formula library, the method comprising:
constructing the components of each formula in the formula library, wherein the components comprise first-level symbols and sub-formulas; the first-level symbols are the start symbol of each formula and the same-level symbols that stand in a horizontal structural relation to the start symbol; and a sub-formula is the formula formed by a first-level symbol together with all lower-level symbols derived with that first-level symbol as reference;
establishing the component set of the formula library from the constructed components of each formula;
assigning a code to each component in the component set to form a component code set, extracting the component feature corresponding to each code in the component code set, and thereby building a component feature set corresponding to the component code set;
obtaining the component sub-images of a formula image to be recognized, wherein each component sub-image contains one component;
obtaining the component feature of each component sub-image in the formula image to be recognized, comparing the component feature of each component sub-image with the component feature set to obtain multiple candidate recognition results for each component sub-image, and thereby obtaining the formula recognition result of the formula image to be recognized.
The start symbol is the highest-level symbol at the far left of the formula, and its reference symbol is the symbol itself; a reference symbol is the structural reference of a symbol in the formula.
Constructing the components of each formula in the formula library comprises:
obtaining all first-level symbols $t_0, t_1, t_2, \ldots, t_n$ of each formula in the formula library, wherein, starting with the start symbol $t_0$ as the reference symbol, the right-hand symbols standing in a horizontal structural relation are obtained one after another, and the start symbol together with these right-hand symbols form the first-level symbols $t_0, t_1, t_2, \ldots, t_n$;
taking each first-level symbol $t_0, t_1, t_2, \ldots, t_n$ of each formula in turn as the reference symbol and obtaining its lower-level symbols; a first-level symbol without lower-level symbols is itself a component, and a first-level symbol with lower-level symbols forms, together with those lower-level symbols, a sub-formula that is a component.
Assigning a code to each component in the component set to form the component code set comprises: de-duplicating the components in the component set, and assigning a code to each component in the de-duplicated component set to form the component code set.
The component feature comprises a component texture feature and an edge feature.
Obtaining the component sub-images of the formula image to be recognized comprises:
performing horizontal projection segmentation on the formula image to be recognized to obtain its level-1 sub-image sequence;
taking the average distance between the adjacent boundaries of neighboring level-1 sub-images in the level-1 sub-image sequence as a first reference value, and the average width of all level-1 sub-images in the sequence as a second reference value;
performing connected-region segmentation on each level-1 sub-image in the level-1 sub-image sequence; where segmentation succeeds, obtaining the level-2 sub-image sequence of the formula image to be recognized, and merging adjacent level-2 sub-images according to their spatial position relation to obtain the level-3 sub-image sequence of the formula image to be recognized;
performing single-character recognition separately on each level-3 sub-image in the level-3 sub-image sequence and on each level-1 sub-image for which connected-region segmentation did not succeed, and determining the component sub-images in combination with the first reference value and the second reference value.
Performing single-character recognition on each level-3 sub-image of the level-3 sub-image sequence and determining the component sub-images comprises:
performing single-character recognition on each level-3 sub-image in the level-3 sub-image sequence to obtain the recognition result of each level-3 sub-image and the corresponding recognition confidence;
judging, from the single-character recognition, the type of first-level symbol present in the recognition result and the corresponding recognition confidence, and determining the component sub-images according to the judgment.
Judging the type of first-level symbol present in the recognition result and the corresponding recognition confidence, and determining the component sub-images according to the judgment, comprises:
if the level-3 sub-image sequence contains a level-3 sub-image whose recognition result is a fraction and whose recognition confidence is higher than a first preset value, the width of that level-3 sub-image is close to the width of its corresponding level-1 sub-image, and the width of that level-3 sub-image is greater than both the first reference value and the second reference value, then taking the corresponding level-1 sub-image as a component sub-image of fraction type;
if the level-3 sub-image sequence contains a level-3 sub-image whose recognition result is a radical and whose recognition confidence is higher than the first preset value, the height of that level-3 sub-image is close to the height of its corresponding level-1 sub-image, and the height of that level-3 sub-image is greater than both the first reference value and the second reference value, then taking the corresponding level-1 sub-image as a component sub-image of radical type.
Performing single-character recognition on a level-1 sub-image for which connected-region segmentation did not succeed and determining the component sub-image comprises:
judging the recognition result and recognition confidence obtained by single-character recognition of that level-1 sub-image; if the recognition result is a formula symbol, the recognition confidence is higher than the first preset value, and the horizontal distance between that level-1 sub-image and its neighboring level-1 sub-image is close to the first reference value, then taking that level-1 sub-image as a component sub-image of formula-symbol type.
The method further comprises: when single-character recognition of the level-3 sub-images in the level-3 sub-image sequence and of the level-1 sub-images for which connected-region segmentation did not succeed leaves sub-images that are not determined to be component sub-images, re-labeling all such level-3 and level-1 sub-images as a new sub-image sequence, obtaining the optimal combined sub-image sequence of the new sequence by dynamic programming, and taking each sub-image in the optimal combined sequence as a component sub-image.
Obtaining the component feature of each component sub-image in the formula image to be recognized, and comparing it with the component feature set to obtain multiple candidate recognition results for each component sub-image, comprises:
normalizing each component sub-image in the formula image to be recognized;
extracting the texture feature and edge feature of each normalized component sub-image to form the component feature of that component sub-image;
comparing the component feature of a single component sub-image with the component features in the component feature set to obtain the similarity between the component feature of that sub-image and each feature in the set;
sorting the similarities in descending order, selecting the component features corresponding to the top M similarities, and taking the component codes corresponding to these M component features as the multiple candidate recognition results of that single component sub-image.
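The candidate selection just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how similarity is measured, so cosine similarity is used here as a stand-in, and the names `cosine_similarity`, `top_m_candidates` and the toy feature vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Assumed similarity measure; the source does not specify one."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_m_candidates(feature, feature_set, m):
    """Return the m (component_code, similarity) pairs with the highest similarity.

    feature_set maps each component code to its stored component feature.
    """
    scored = [(cosine_similarity(feature, ref), code)
              for code, ref in feature_set.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # descending similarity
    return [(code, sim) for sim, code in scored[:m]]

# Toy component feature set: code -> feature vector (hypothetical values).
feature_set = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
candidates = top_m_candidates([1.0, 0.1], feature_set, m=2)
print([code for code, _ in candidates])
```

With M = 2, the two codes whose stored features are closest to the query are kept as candidates, in descending order of similarity.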
Obtaining the formula recognition result of the formula image to be recognized comprises:
if the formula image to be recognized contains only one component sub-image, selecting from the candidate recognition results the component code whose component feature has the highest similarity as the formula recognition result;
if the formula image to be recognized contains multiple component sub-images, taking the multiple candidate recognition results of each component sub-image as basic units, scoring and ranking each candidate formula recognition result by dynamic programming combined with a probabilistic statistical language model, and taking the component codes of the highest-scoring candidate as the formula recognition result.
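One way to picture the dynamic-programming step is a Viterbi-style search over the per-sub-image candidates. The sketch below is an assumption-laden illustration, not the patented procedure: it assumes a bigram model over component codes as the statistical language model, combines similarity and bigram probability by adding logarithms, and uses hypothetical names (`best_code_sequence`, `bigram`, `floor`) and made-up numbers.

```python
import math

def best_code_sequence(candidates, bigram, floor=1e-6):
    """Pick the highest-scoring sequence of component codes.

    candidates: one list per component sub-image of (code, similarity) pairs.
    bigram: {(previous_code, code): probability}, an assumed language model.
    floor: small value used for unseen bigrams and zero similarities.
    """
    # best[code] = (score, path) of the best path ending in `code` so far.
    best = {code: (math.log(max(sim, floor)), [code])
            for code, sim in candidates[0]}
    for position in candidates[1:]:
        following = {}
        for code, sim in position:
            score, path = max(
                (prev_score + math.log(bigram.get((prev, code), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            following[code] = (score + math.log(max(sim, floor)), path + [code])
        best = following
    return max(best.values())[1]

# Three sub-images with their candidate codes and similarities (toy values).
candidates = [[(0, 0.9), (3, 0.8)], [(1, 0.95)], [(2, 0.6), (0, 0.55)]]
bigram = {(0, 1): 0.5, (3, 1): 0.1, (1, 2): 0.6, (1, 0): 0.05}
print(best_code_sequence(candidates, bigram))
```

When the image contains a single component sub-image, the same function simply returns the candidate with the highest similarity, which matches the single-sub-image case described above.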
The formula recognition device comprises:
a component construction module, configured to construct the components of each formula in the formula library, wherein the components comprise first-level symbols and sub-formulas; the first-level symbols are the start symbol of each formula and the same-level symbols that stand in a horizontal structural relation to the start symbol; and a sub-formula is the formula formed by a first-level symbol together with all lower-level symbols derived with that first-level symbol as reference;
a component set establishing module, configured to establish the component set of the formula library from the constructed components of each formula;
a component code set and component feature set construction module, configured to assign a code to each component in the component set to form a component code set, extract the feature of the component corresponding to each code in the component code set, and thereby build a component feature set corresponding to the component code set;
a component sub-image obtaining module, configured to process the formula image to be recognized and obtain its component sub-images, wherein each component sub-image contains one component;
a formula recognition module, configured to obtain the component feature of each component sub-image in the formula image to be recognized, compare the component feature of each component sub-image with the component feature set to obtain multiple candidate recognition results for each component sub-image, and thereby obtain the formula recognition result of the formula image to be recognized.
Compared with traditional formula recognition methods, the recognition method of the present invention is based on a pre-established formula library and can make full use of that library to calibrate and optimize the recognition result. At the same time, the method treats each sub-formula of complex type as a new single character, so that a formula is regarded as a horizontal combination of one or more complex-type sub-formulas and ordinary single characters. With the formula recognition method and device of the present invention, structural analysis between characters is no longer needed, which simplifies the formula recognition steps, saves recognition time and improves recognition accuracy.
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and preferred embodiments, from which its beneficial effects will become clearer.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of it; they serve only to explain the invention and do not unduly limit it.
Fig. 1 is a flowchart of the formula recognition method according to a preferred embodiment of the present invention;
Fig. 2 is a flowchart of obtaining the component sub-images of a formula image to be recognized according to a preferred embodiment of the present invention;
Fig. 3 is a block diagram of the formula recognition device according to a preferred embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the preferred embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
The meanings of start symbol, reference symbol, first-level symbol, lower-level symbol and sub-formula as used in the present invention are defined first. A formula is composed of characters from a traditional character set, and each formula character in a formula has a reference symbol, which is the structural reference of that formula character.
The structural relations between a formula character and its reference symbol include the horizontal relation, the fraction superscript and subscript relations, the radical index relation, the inside-radical relation, the matrix relation, the superscript and subscript relations, and so on. The horizontal relation considers only the horizontal right-hand side: if two symbols are in a horizontal relation, the left symbol is the reference symbol of the right symbol. If the structural relation between a formula character and its reference symbol is horizontal, the two are symbols of the same level; for any other structural relation, the formula character is one level lower than its reference symbol and is defined as a lower-level symbol.
Each formula has one and only one start symbol: the highest-level symbol at the far left of the formula, whose reference symbol is itself. All symbols in the formula whose structural relation to the start symbol is horizontal are first-level symbols. The reference symbol and structural relation of a formula character together determine the character's unique order and structural features within the formula. The formula formed by a first-level symbol together with all lower-level symbols derived with it as reference is called the sub-formula with that first-level symbol as its reference symbol.
For example, in the formula "$\sqrt{abc}$", the reference symbols of the formula characters "a", "b" and "c" are "radical", "a" and "b" respectively, and the structural relations are inside-radical, horizontal and horizontal respectively. The highest-level symbol at the far left is the radical, so the start symbol is the radical.
As another example, consider formula I. First find the start symbol, i.e. the highest-level symbol at the far left. The leftmost part of formula I is a complex-type formula with a fraction structure, so its highest-level symbol is "fraction-", which is therefore the start symbol of the formula. The horizontal right-hand symbol of the fraction is "+", that of "+" is "a", that of "a" is "+", and that of this second "+" is a second "fraction-". Thus "fraction-", "+", "a", "+" and "fraction-" are the first-level symbols of formula I. Formula I also contains two sub-formulas: the first is the sub-formula derived with the first "fraction-" as reference symbol, and the second is the sub-formula derived with the second "fraction-" as reference symbol.
Next, analyze the structural relations of the two sub-formulas in formula I. For the first sub-formula, the structural relations between its formula characters, written as "reference symbol - structural relation - formula character", are as follows: fraction - fraction subscript - b; b - horizontal - +; + - horizontal - radical; radical - inside radical - b; b - upper-right superscript - 2; b - horizontal - 4; 4 - horizontal - a; fraction - fraction subscript - 2; 2 - horizontal - a. For instance, "fraction - fraction subscript - b" means that the reference symbol is "fraction", the formula character is "b", and the relation between them is "fraction subscript"; the other entries are read in the same way. The second sub-formula is analyzed likewise. The definition of components below builds on these definitions of formula characters.
The specific steps of the formula recognition method according to a preferred embodiment of the present invention are described below with reference to Fig. 1.
Fig. 1 is a flowchart of the formula recognition method according to a preferred embodiment of the present invention. As shown in Fig. 1, the specific steps of the method are as follows.
Step S1: construct the components of each formula in the formula library.
The components of each formula comprise first-level symbols and sub-formulas. The first-level symbols are the start symbol of each formula and the same-level symbols that stand in a horizontal structural relation to the start symbol. A sub-formula is the formula formed by a first-level symbol together with all lower-level symbols derived with that first-level symbol as reference.
Step S2: establish the component set of the formula library from the constructed components of each formula.
The components of each formula in the formula library are constructed by the method described in step S1, thereby establishing the component set of the formula library.
Step S3: assign a code to each component in the component set to form a component code set, extract the component feature corresponding to each code in the component code set, and thereby build a component feature set corresponding to the component code set.
In this step, the components in the component set are de-duplicated, and a code is assigned to each component in the de-duplicated component set to form the component code set. The component feature comprises a component texture feature and an edge feature. The texture feature and edge feature corresponding to each code in the component code set are extracted, and the component feature set corresponding to the component code set is then built.
Step S4: obtain the component sub-images of the formula image to be recognized, wherein each component sub-image contains one component.
In this step, horizontal projection segmentation is performed on the formula image to be recognized to obtain its level-1 sub-image sequence. The average distance between the adjacent boundaries of neighboring level-1 sub-images in the sequence is taken as the first reference value, and the average width of all level-1 sub-images in the sequence as the second reference value. Connected-region segmentation is performed on the level-1 sub-image sequence; where segmentation succeeds, the level-2 sub-image sequence of the formula image to be recognized is obtained, and adjacent level-2 sub-images are merged according to their spatial position relation to obtain the level-3 sub-image sequence. Single-character recognition is then performed separately on the level-3 sub-image sequence and on the level-1 sub-images for which connected-region segmentation did not succeed, and the component sub-images are determined in combination with the first reference value and the second reference value.
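The first part of step S4 can be illustrated with a small pure-Python sketch. It assumes that "horizontal projection segmentation" means summing ink along each column of a binarized image and cutting at blank columns, which is one common reading; the function names and the toy image are hypothetical.

```python
def split_by_column_projection(image):
    """Cut a binary image (rows of 0/1, 1 = ink) at blank columns.

    Returns the level-1 sub-images as (left, right) column spans,
    with `right` exclusive.
    """
    width = len(image[0])
    column_ink = [sum(row[x] for row in image) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(column_ink):
        if ink and start is None:
            start = x                      # a new sub-image begins
        elif not ink and start is not None:
            spans.append((start, x))       # a blank column ends it
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

def reference_values(spans):
    """First reference value: mean gap between neighboring sub-images.
    Second reference value: mean sub-image width."""
    if not spans:
        return 0.0, 0.0
    gaps = [b[0] - a[1] for a, b in zip(spans, spans[1:])]
    widths = [right - left for left, right in spans]
    first_ref = sum(gaps) / len(gaps) if gaps else 0.0
    second_ref = sum(widths) / len(widths)
    return first_ref, second_ref

image = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
]
spans = split_by_column_projection(image)
print(spans, reference_values(spans))
```

The connected-region segmentation, merging and single-character recognition that follow in step S4 are not covered by this sketch.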
Step S5: obtain the component feature of each component sub-image and compare it with the component feature set to recognize the formula.
In this step, the component feature of each component sub-image in the formula image to be recognized is obtained and compared with the component feature set to obtain multiple candidate recognition results for each component sub-image, from which the formula recognition result of the formula image to be recognized is obtained.
The implementation of the above steps is described in detail below.
For step S1, constructing the components of each formula in the formula library: first, obtain the first-level symbols of each formula in the formula library. To do so, the start symbol $t_0$ of the formula is obtained first; the rule is to take the highest-level symbol at the far left of the formula, whose reference symbol is the symbol itself. Then, with $t_0$ as the reference symbol, the right-hand symbol $t_1$ in a horizontal structural relation to it is obtained, and so on, until the right-hand symbol $t_n$ in a horizontal relation to the reference symbol $t_{n-1}$ is obtained. This completes the acquisition of all first-level symbols $t_0, t_1, t_2, \ldots, t_n$ of the formula.
After the first-level symbols have been obtained, each first-level symbol $t_0, t_1, t_2, \ldots, t_n$ of each formula is taken in turn as the reference symbol and its lower-level symbols are obtained.
If no lower-level symbol exists with a given first-level symbol as reference, that first-level symbol is defined as a component. If lower-level symbols do exist, the formula formed by that first-level symbol and its lower-level symbols is a sub-formula, and this sub-formula is a component. This completes the collection of components for each formula in the formula library.
Take the aforementioned formula I as an example. First obtain the start symbol "-" (fraction) ($t_0$); then, with the start symbol as the reference symbol, obtain the right-hand symbol "+" ($t_1$) in a horizontal structural relation to it, and so on, until, with "+" ($t_3$) as the reference symbol, the right-hand symbol "-" (fraction) ($t_4$) in a horizontal relation is obtained. This yields all first-level symbols of the formula: "-" (fraction) ($t_0$), "+" ($t_1$), "a" ($t_2$), "+" ($t_3$) and "-" (fraction) ($t_4$). Finally, with $t_0$ and $t_4$ as reference symbols, all lower-level symbols are obtained, giving the sub-formulas of $t_0$ and $t_4$. The whole formula can thus be split into these two sub-formulas and the three first-level symbols "+" ($t_1$), "a" ($t_2$) and "+" ($t_3$), which together are the five components of the formula. For the other formulas in the formula library, the components are constructed by the same steps.
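Step S1 can be sketched as a walk over "reference symbol, relation, symbol" triples. The code below is a minimal illustration under assumptions: the triple format, the function name `build_components`, the symbol identifiers and the numerator/denominator contents (`x`, `y`, `p`, `q`) are all hypothetical and only mimic the fraction + a + fraction shape of formula I, not its actual sub-formulas.

```python
from collections import defaultdict

HORIZONTAL = "horizontal"

def build_components(start, relations):
    """Split a formula into components.

    relations: (reference_symbol, relation, symbol) triples (hypothetical format).
    Returns a list of (first_level_symbol, lower_level_symbols) pairs; an empty
    list means the first-level symbol is a component on its own, otherwise the
    pair describes a sub-formula.
    """
    children = defaultdict(list)
    for reference, relation, symbol in relations:
        children[reference].append((relation, symbol))

    # Follow horizontal relations from the start symbol: t_0, t_1, ..., t_n.
    first_level = [start]
    current = start
    while True:
        right = [s for rel, s in children[current] if rel == HORIZONTAL]
        if not right:
            break
        current = right[0]
        first_level.append(current)

    def lower_level(symbol, is_first_level=True):
        found = []
        for rel, child in list(children[symbol]):
            if is_first_level and rel == HORIZONTAL:
                continue  # the right neighbour is itself a first-level symbol
            found.append(child)
            found.extend(lower_level(child, is_first_level=False))
        return found

    return [(t, lower_level(t)) for t in first_level]

RELATIONS = [
    ("frac1", "numerator", "x"),
    ("frac1", "denominator", "y"),
    ("frac1", HORIZONTAL, "plus1"),
    ("plus1", HORIZONTAL, "a"),
    ("a", HORIZONTAL, "plus2"),
    ("plus2", HORIZONTAL, "frac2"),
    ("frac2", "numerator", "p"),
    ("frac2", "denominator", "q"),
]
components = build_components("frac1", RELATIONS)
for first_level_symbol, lower in components:
    print(first_level_symbol, "sub-formula" if lower else "symbol", lower)
```

The result has five components, two of which are sub-formulas, mirroring the split described for formula I.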
Setting the component code set in step S3. First, duplicates are removed from all components in the component set. Then, a code is assigned to every component in the component set so that components and codes correspond one to one, forming the component code set. For example, the five components of formula I are first deduplicated, removing one duplicate "+", which leaves four components: the first subformula, "+", "a" and the second subformula. These four components are then coded as 0, 1, 2 and 3 respectively: the component coded 0 is the first subformula, the component coded 1 is "+", the component coded 2 is "a", and the component coded 3 is the second subformula. Formula I is thereby converted into the coded form 01213.
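The deduplication and coding described above can be sketched as follows. This is only an illustrative sketch: the function names and the string stand-ins for the two subformulas of formula I are assumptions, not part of the embodiment.

```python
def build_code_set(formulas):
    """Assign one integer code per distinct component, in first-seen order."""
    code_of = {}
    for components in formulas:
        for comp in components:
            if comp not in code_of:      # deduplication: a repeated component keeps its code
                code_of[comp] = len(code_of)
    return code_of


def encode(components, code_of):
    """Convert a formula (a list of components) into its list of codes."""
    return [code_of[c] for c in components]


# Hypothetical stand-ins for the five components of formula I.
formula_i = ["subformula_1", "+", "a", "+", "subformula_2"]
code_of = build_code_set([formula_i])
print(encode(formula_i, code_of))  # [0, 1, 2, 1, 3]
```

With these stand-ins the duplicate "+" shares code 1, so the formula is coded 0 1 2 1 3, matching the 01213 form given in the text.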
Next, component feature extraction is performed on each component in the component code set to obtain the texture feature and edge feature of each component. These form the component feature set corresponding to the component code set, which is used in the recognition process described later to recognize the formula in an image.
Extraction of component texture features. The texture of an image is a quantified image feature used in image computation; it describes the spatial distribution of color and light intensity in the image or in a small region of it. This embodiment extracts texture features by the following steps.
1. First, the image is divided into 16 × 16 small regions (cells). For each pixel in a cell, the pixel is compared with the 8 neighboring points in its ring neighborhood in clockwise or counterclockwise order: if the center pixel value is greater than the neighboring point, that point is assigned 1, otherwise 0. Each pixel thus yields an 8-bit binary number (usually converted to a decimal number).
2. Then, the histogram of each cell is computed, that is, the frequency with which each number (taken here as a decimal number) occurs, which amounts to counting the binary sequences obtained by comparing each pixel with its neighborhood. The histogram is then normalized.
3. Finally, the histograms of all cells are concatenated to obtain the texture feature of the whole image.
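The three steps above can be sketched in plain Python as follows. Several details are assumptions made for illustration: the image is a list of rows of gray values, a cell is taken to be 16 × 16 pixels, border pixels without a full 8-neighborhood are skipped, and the function names are hypothetical.

```python
# Clockwise 8-neighborhood offsets (row, column), starting at the top-left neighbor.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit code of pixel (r, c): a bit is 1 when the center exceeds that neighbor."""
    center = img[r][c]
    code = 0
    for dr, dc in OFFSETS:
        code = (code << 1) | (1 if center > img[r + dr][c + dc] else 0)
    return code


def texture_feature(img, cell=16):
    """Concatenate the normalized 256-bin code histograms of all cells."""
    h, w = len(img), len(img[0])
    feature = []
    for top in range(0, h, cell):
        for left in range(0, w, cell):
            hist = [0] * 256
            # Skip the outermost image rows and columns (assumed border handling).
            for r in range(max(top, 1), min(top + cell, h - 1)):
                for c in range(max(left, 1), min(left + cell, w - 1)):
                    hist[lbp_code(img, r, c)] += 1
            total = sum(hist)
            feature.extend(v / total if total else 0.0 for v in hist)
    return feature


img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
print(lbp_code(img, 1, 1))        # 255
print(len(texture_feature(img)))  # 256
```

The comparison rule follows the text (a neighbor is marked 1 when the center is greater); other formulations of this kind of local binary pattern reverse the comparison.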
Extraction of component edge features. This embodiment uses the Canny edge detection algorithm to extract component edge features. The procedure is: denoise the original image by Gaussian smoothing; compute the gradient values; determine the gradient magnitude; make a preliminary determination of the image edge points; locate the edge positions precisely; and finally output a binary image. The purpose of edge detection is to find the set of pixels in the image where brightness changes sharply, which usually appears as the contour of the object. Ideally, applying an edge detector to a given image yields a series of continuous curves that represent the boundaries in the image. The result of edge detection therefore greatly reduces the amount of image data, filters out unneeded information and retains the important structure of the image, which greatly simplifies subsequent processing.
Step S4: with reference to Fig. 2, obtaining the component sub-images of the formula image to be recognized comprises the following steps:
Step S401: perform horizontal projection cutting on the formula image to be recognized.
To perform horizontal projection cutting, the formula image to be recognized is first binarized; then the number of non-zero pixel values in each column is counted and stored in the array Col[w]; finally, the array Col[w] is traversed, and every boundary between a zero value and a non-zero value in Col[w] is a cut point.
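A minimal sketch of this projection cutting follows, assuming the binarized image is given as a list of rows of 0/1 values. The function names and the half-open (start, end) span convention are illustrative assumptions.

```python
def column_projection(binary):
    """Col[w]: number of non-zero pixels in each column of a binarized image."""
    width = len(binary[0])
    return [sum(1 for row in binary if row[x]) for x in range(width)]


def cut_spans(col):
    """Return (start, end) column spans, end exclusive, where Col is non-zero."""
    spans, start = [], None
    for x, count in enumerate(col):
        if count and start is None:
            start = x                   # zero -> non-zero boundary: a span begins
        elif not count and start is not None:
            spans.append((start, x))    # non-zero -> zero boundary: the span ends
            start = None
    if start is not None:               # a span touching the right image border
        spans.append((start, len(col)))
    return spans


image = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]
col = column_projection(image)
print(col)             # [0, 2, 1, 0, 0, 2, 0]
print(cut_spans(col))  # [(1, 3), (5, 6)]
```

Each returned span corresponds to one first-level sub-image.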
Determining the cut points as above yields the first-level sub-image sequence {s_0, s_1, s_2, ..., s_n} of the formula image. Because the result of horizontal projection cutting is affected by factors such as cutting precision and image sharpness, each sub-image s_i in the first-level sub-image sequence falls into one of three cases: ① s_i is the complete image of a component; ② s_i is only part of a component; ③ s_i is a combination of two or more components.
Because a sub-image s_i in case ② or ③ is not the desired component sub-image, further processing in step S402 is required to obtain the desired component sub-images.
In addition, calculations are performed on the first-level sub-image sequence {s_0, s_1, s_2, ..., s_n}. The distance d_n (d_n > 0) between the right boundary of first-level sub-image s_{n-1} and the left boundary of the adjacent first-level sub-image s_n is calculated, the mean of the resulting d_n values is computed, and this mean is defined as the first reference value.
Then, the widths d_w0, d_w1, d_w2, ..., d_wn of all first-level sub-images s_0, s_1, s_2, ..., s_n in the first-level sub-image sequence are obtained, the mean width of all first-level sub-images is computed, and this mean is defined as the second reference value.
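The two reference values can be computed as in the sketch below. It assumes each first-level sub-image is described by a hypothetical (left, right) column span with an exclusive right bound, listed from left to right.

```python
def reference_values(spans):
    """Return (first reference value, second reference value) for the given spans."""
    # d_n: gap between the right edge of one sub-image and the left edge of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
    gaps = [g for g in gaps if g > 0]
    widths = [right - left for left, right in spans]
    first_ref = sum(gaps) / len(gaps) if gaps else 0.0   # mean gap
    second_ref = sum(widths) / len(widths)               # mean width
    return first_ref, second_ref


print(reference_values([(1, 3), (5, 6), (9, 12)]))  # (2.5, 2.0)
```

Returning 0.0 when there are no gaps (a single sub-image) is a choice made for this sketch; the text does not address that case.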
Step S402: perform connected-component segmentation on the first-level sub-image sequence.
Connected-component segmentation is performed on each sub-image s_i of the first-level sub-image sequence {s_0, s_1, s_2, ..., s_n}, dividing each first-level sub-image s_i into a sequence of m second-level sub-images. Because the second-level sub-images obtained by connected-component segmentation are rather fragmentary, they must be merged according to the spatial relationships between them. The merging process is as follows:
First, the bounding rectangle Rect of each second-level sub-image (index j, 0 ≤ j ≤ m) in the second-level sub-image sequence is calculated.
Then, the m calculated rectangles are traversed. If any two rectangles overlap in both the horizontal and the vertical direction, the two corresponding second-level sub-images are merged into one new third-level sub-image.
This process is repeated until no second-level sub-images can be merged, finally yielding a sequence of k (k ≤ m) third-level sub-images produced by connected-component segmentation.
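The merging loop can be sketched as below. The rectangle format (left, top, right, bottom) with exclusive right and bottom edges, and the function names, are assumptions for illustration; the connected-component labeling that produces the rectangles is not shown.

```python
def overlaps(a, b):
    """True when rectangles a and b overlap on both the horizontal and vertical axes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_rects(rects):
    """Merge overlapping rectangles repeatedly until no pair can be merged."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if overlaps(rects[i], rects[j]):
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    merged = True
                    break
            if merged:
                break          # restart the scan after every merge
    return rects


boxes = [(0, 0, 4, 4), (3, 3, 6, 6), (10, 0, 12, 2)]
print(merge_rects(boxes))  # [(0, 0, 6, 6), (10, 0, 12, 2)]
```

Restarting the scan after each merge keeps the loop simple and guarantees that a rectangle enlarged by one merge is compared again with all the others.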
Step S403: perform single-character recognition on each third-level sub-image in the third-level sub-image sequence and on each first-level sub-image for which connected-component segmentation did not succeed.
Single-character recognition is performed on each third-level sub-image in the successfully segmented third-level sub-image sequence and on each first-level sub-image for which connected-component segmentation did not succeed, yielding a recognition result and a corresponding recognition confidence for each of them.
Specifically, single-character recognition is performed on each third-level sub-image in the third-level sub-image sequence to obtain its recognition result and corresponding recognition confidence. The component sub-images are then determined from the type of first-grade symbol present in the recognition result and the corresponding recognition confidence, in combination with the first reference value and the second reference value. The following two types of component sub-image can be determined in this way:
1. If the third-level sub-image sequence contains a third-level sub-image whose recognition result is a fraction and whose recognition confidence is higher than a first preset value, whose width is close to the width of its corresponding first-level sub-image, and whose width is greater than both the first reference value and the second reference value (that is, its width-to-height ratio is very large), the first-level sub-image is regarded as a component sub-image of fraction type and needs no further processing. In particular, the first preset value is any value between 80% and 100%.
2. If the third-level sub-image sequence contains a third-level sub-image whose recognition result is a radical and whose recognition confidence is higher than the first preset value, whose height is close to the height of its corresponding first-level sub-image, and whose height is greater than both the first reference value and the second reference value (that is, its width-to-height ratio is very small), the first-level sub-image is regarded as a component sub-image of radical type and needs no further processing.
Single-character recognition is performed on each first-level sub-image for which connected-component segmentation did not succeed, and component sub-images are determined from the recognition result and its corresponding recognition confidence in combination with the first reference value, as follows:
For a first-level sub-image for which connected-component segmentation did not succeed, single-character recognition is performed. If the recognition result is a formula symbol and the recognition confidence is higher than the first preset value, and the distance between this first-level sub-image and its horizontally adjacent first-level sub-image is close to the first reference value, the first-level sub-image is regarded as a component sub-image of formula-symbol type and needs no further processing.
Referring to Fig. 2, single-character recognition and component sub-image determination for the third-level sub-images and for the first-level sub-images for which connected-component segmentation did not succeed may be performed in any order. That is, single-character recognition may first be performed on each third-level sub-image to determine the fraction-type and radical-type component sub-images; or it may first be performed on the unsegmented first-level sub-images to determine the formula-symbol-type component sub-images; or the two may be performed simultaneously.
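The three decision rules of step S403 can be sketched as simple predicates. The text does not quantify "close to", so the `close` and `tolerance` parameters below are assumptions, as are the function names and the string labels returned by the hypothetical single-character recognizer; only the 80% lower bound on the first preset value comes from the text.

```python
def classify_third_level(label, confidence, width, height,
                         parent_width, parent_height,
                         first_ref, second_ref,
                         preset=0.8, close=0.9):
    """Return 'fraction', 'radical' or None for one third-level sub-image."""
    if confidence <= preset:
        return None
    ref = max(first_ref, second_ref)   # must exceed both reference values
    if label == "fraction" and width >= close * parent_width and width > ref:
        return "fraction"
    if label == "radical" and height >= close * parent_height and height > ref:
        return "radical"
    return None


def is_formula_symbol(label_is_symbol, confidence, gap, first_ref,
                      preset=0.8, tolerance=0.5):
    """Rule for a first-level sub-image that segmentation left whole."""
    near = abs(gap - first_ref) <= tolerance * first_ref   # assumed meaning of "close to"
    return label_is_symbol and confidence > preset and near


print(classify_third_level("fraction", 0.95, 48, 3, 50, 40, 5.0, 12.0))  # fraction
print(classify_third_level("radical", 0.90, 20, 39, 50, 40, 5.0, 12.0))  # radical
print(is_formula_symbol(True, 0.9, 6, 5.0))                              # True
```

Sub-images for which both functions give a negative answer are the ones passed on to the dynamic programming of step S404.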
Step S404: perform dynamic programming on the third-level sub-images and first-level sub-images that have not been determined to be component sub-images.
When component sub-images cannot be determined after single-character recognition of the third-level sub-images and of the first-level sub-images for which connected-component segmentation did not succeed, dynamic programming is performed with all third-level and first-level sub-images not determined to be component sub-images as the basic units. The steps are as follows.
All sub-images not determined to be component sub-images are sorted by their position in the image to be recognized, from left to right and from top to bottom, and relabeled as a new sub-image sequence. Dynamic programming derives the global optimum from local optima to obtain the optimal combination of the new sub-image sequence. The procedure is as follows: to compute the optimal combination of the 0th to kth sub-images in the sequence, the optimal combination of the 0th to ith sub-images and the optimal combination of the ith to kth sub-images are computed separately, where i ranges over [1, k-1]. The optimal combination of the 0th to ith sub-images is in turn formed from the optimal combination of the 0th to jth sub-images and the optimal combination of the jth to ith sub-images, where j ranges over [1, i-1], and so on; the global optimum is thus derived from the local optima.
During the search for the optimal solution, the evaluation rule is that the final score is the weighted sum of a geometry score and a recognition confidence score: score = n1*Gem + n2*Reg, where score is the final score, Gem is the geometry score, Reg is the recognition confidence score, and n1 and n2 are the weights of the geometry score and the recognition confidence score respectively. A higher score indicates a better result, so the result with the highest score is finally chosen as the optimal solution. For example, when evaluating the merging of the (i-1)th and ith sub-images, score denotes the final score of combining i with i-1. The two sub-images are merged and component recognition is performed on the merged sub-image. Component recognition is modeled on the single-character recognition method: it is a statistical recognition method whose recognition objects are the components of the new component set.
Component recognition proceeds as follows: the merged sub-image is scaled nonlinearly, that is, normalized by curve fitting; the texture feature and edge feature of the sub-image are extracted to form its component feature, which is compared with the component feature set to obtain the recognition result Code and the recognition confidence score Reg. The spatial position feature of the merged sub-image is then extracted and matched against the spatial position feature of the component coded Code, giving the similarity score Gem. The final score of the combination is computed from the values of Gem and Reg.
The highest-scoring combination of the new sub-image sequence is found by the above process. This highest-scoring combination is the optimally combined sub-image sequence, and each sub-image in it is a component sub-image.
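The search for the highest-scoring combination can be sketched as a standard dynamic program over split positions. The sketch makes several assumptions: consecutive units are grouped into components, the score of a whole grouping is the sum of its group scores (the text does not state how group scores are aggregated), and `score(i, j)` is a hypothetical callback standing in for n1*Gem + n2*Reg evaluated on the merge of units i to j-1.

```python
def weighted_score(gem, reg, n1=0.5, n2=0.5):
    """score = n1*Gem + n2*Reg; the weight values here are placeholders."""
    return n1 * gem + n2 * reg


def best_grouping(k, score):
    """Split units 0..k-1 into consecutive groups maximizing the summed score."""
    best = [float("-inf")] * (k + 1)   # best[e]: best total for units 0..e-1
    back = [0] * (k + 1)               # back[e]: start index of the last group
    best[0] = 0.0
    for end in range(1, k + 1):
        for start in range(end):
            cand = best[start] + score(start, end)
            if cand > best[end]:
                best[end], back[end] = cand, start
    groups, end = [], k
    while end > 0:                     # walk the back-pointers to recover the groups
        groups.append((back[end], end))
        end = back[end]
    return best[k], groups[::-1]


# Toy (Gem, Reg) pairs for every candidate group of three units.
toy = {(0, 1): (0.1, 0.3), (1, 2): (0.2, 0.4), (0, 2): (0.9, 0.9),
       (2, 3): (0.8, 0.8), (1, 3): (0.1, 0.1), (0, 3): (0.4, 0.4)}
total, groups = best_grouping(3, lambda i, j: weighted_score(*toy[(i, j)]))
print(groups)  # [(0, 2), (2, 3)]
```

In this toy case units 0 and 1 score well when merged, so they form one component sub-image and unit 2 forms another.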
Steps S401 to S404 above complete the acquisition of all component sub-images of the formula image to be recognized, and each component sub-image contains one component.
Step S5: obtain the component feature of each component sub-image and compare it with the component feature set to recognize the formula.
First, each component sub-image S_i obtained in step S4 is scaled nonlinearly, that is, normalized by curve fitting, to a uniform size.
Then, texture features and edge features are extracted from each component sub-image S_i to form the component feature of S_i. Component recognition is performed by comparing the extracted component feature of S_i with the component feature set extracted in step S3. Specifically, the similarity between the component feature of a single component sub-image S_i and each feature in the component feature set is calculated. For example, the Euclidean distance between each feature in the component feature set and the feature of the single sub-image S_i can be calculated, with the reciprocal of the Euclidean distance used as the similarity measure. The similarity measures are sorted in descending order, the M component features in the component feature set corresponding to the top M similarity values are selected, and the component codes corresponding to these M component features are taken as the multi-candidate recognition result of the single component sub-image.
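The top-M selection by reciprocal Euclidean distance can be sketched as follows. The dictionary layout of the feature set and the function name are assumptions; treating a zero distance as infinite similarity is a choice made for this sketch.

```python
import math


def top_m_candidates(feature, feature_set, m=3):
    """feature_set maps component code -> feature vector; returns [(code, similarity)]."""
    scored = []
    for code, ref in feature_set.items():
        dist = math.dist(feature, ref)                  # Euclidean distance
        sim = 1.0 / dist if dist > 0 else float("inf")  # reciprocal as similarity
        scored.append((code, sim))
    scored.sort(key=lambda item: item[1], reverse=True)  # descending similarity
    return scored[:m]


feature_set = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [3.0, 4.0]}
candidates = top_m_candidates([1.0, 1.0], feature_set, m=2)
print([code for code, _ in candidates])  # [1, 0]
```

`math.dist` requires Python 3.8 or later.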
After the multi-candidate recognition results of the single component sub-images have been obtained, the final formula recognition result is derived from them as follows:
When the formula image to be recognized contains only one component, the component code corresponding to the component feature with the greatest similarity among the multi-candidate recognition results is chosen as the formula recognition result.
When the formula image to be recognized contains multiple components, the multi-candidate recognition results of each component sub-image are taken as the basic units, and dynamic programming combined with a specific language model is used to score and rank each candidate formula recognition result. The scoring rule is based on a language probability model compiled in advance, and the result with the highest score under this model is chosen as the formula recognition result. The language probability model used in the present invention is described in detail below.
The language model is obtained by using a corpus to count the probability that certain components occur together. First, the language probability P(α_0, α_1) that recognition candidate α_0 of component β_0 and recognition candidate α_1 of component β_1 appear together in the corpus is calculated. Calculating P(α_0, α_1) requires counting the number of times f(α_0, α_1) that α_0 and α_1 appear together in the corpus, and the total number of times S that all formulas occur in the corpus, so that P(α_0, α_1) = f(α_0, α_1) / S. For example, the formula image to be recognized yields, after processing, component sub-image ImageA and component sub-image ImageB. The features of the two sub-images are each compared with the component feature set, and the results with the greatest similarity give the multi-candidate recognition results CA_1, CA_2, ..., CA_n of ImageA and, likewise, CB_1, CB_2, ..., CB_m of ImageB, where n and m are the numbers of candidates. Simply combining the best recognition result of each of the two component sub-images usually does not give the best result for the formula. Dynamic programming is therefore used with the language model: the multi-candidate recognition results of the two component sub-images are combined pairwise, the probability value of each combination is looked up in the language model and used as its score, and finally the optimal recognition result of the formula is determined from the recognition scores of the individual components together with the language model score across the component sub-images. In general, the two scores are weighted to give a weighted score, and the result with the highest weighted score is chosen as the formula recognition result.
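The combination of recognition scores and language-model scores can be sketched as a Viterbi-style dynamic program over the candidate lists. The sketch assumes that the language model scores adjacent component pairs only, that scores are added along a path, and that both weights are 0.5; the data layout, the counts and the function names are hypothetical.

```python
def pair_probability(pair_counts, total, a, b):
    """P(a, b) = f(a, b) / S, with f the co-occurrence count and S the formula total."""
    return pair_counts.get((a, b), 0) / total


def best_formula(candidates, pair_counts, total, w_rec=0.5, w_lm=0.5):
    """candidates: one list of (code, recognition score) per component sub-image."""
    # layer[code] = (best weighted score of a path ending in code, that path)
    layer = {code: (w_rec * s, [code]) for code, s in candidates[0]}
    for cands in candidates[1:]:
        nxt = {}
        for code, s in cands:
            # Pick the predecessor that maximizes path score plus language-model score.
            prev = max(layer, key=lambda p: layer[p][0]
                       + w_lm * pair_probability(pair_counts, total, p, code))
            score = (layer[prev][0]
                     + w_lm * pair_probability(pair_counts, total, prev, code)
                     + w_rec * s)
            nxt[code] = (score, layer[prev][1] + [code])
        layer = nxt
    return max(layer.values(), key=lambda v: v[0])[1]


image_a = [(0, 0.60), (3, 0.50)]      # candidates CA_1, CA_2 with recognition scores
image_b = [(1, 0.55), (2, 0.50)]      # candidates CB_1, CB_2
counts = {(3, 1): 40, (0, 2): 5}      # toy co-occurrence counts f
print(best_formula([image_a, image_b], counts, total=100))  # [3, 1]
```

In this toy case the top-ranked single candidates (0 and 1) never occur together in the corpus, so the weighted score prefers the pair (3, 1), which illustrates why the individually best candidates are not necessarily the best formula.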
The specific configuration of a formula recognition device according to a preferred embodiment of the present invention is described below with reference to Fig. 2.
The formula recognition device of the present invention comprises a component construction module, a component set establishment module, a module for constructing the component code set and the component feature set, a component sub-image acquisition module and a formula recognition module. Each module is described below.
Component construction module: this module obtains the components of each formula in the formula library. It first obtains the start symbol t_0 of the formula, then obtains the symbol t_1 whose structural relation is horizontal with the start symbol t_0 as the reference symbol, and so on, obtaining the symbol t_n in horizontal relation with t_{n-1} as the reference symbol, thereby completing the acquisition of the first-grade symbols t_0, t_1, t_2, ..., t_n. Next, the secondary-grade symbols are obtained with each of t_0, t_1, t_2, ..., t_n as the reference. If no secondary-grade symbol exists with a given first-grade symbol as its reference, that first-grade symbol is defined as a component of the formula; otherwise, if secondary-grade symbols exist, the subformula derived with that first-grade symbol as the reference is obtained and taken as a component. These steps complete the collection of components for each formula in the formula library.
Component set establishment module: establishes the component set of the formula library from the constructed components of each formula.
Module for constructing the component code set and the component feature set: this module deduplicates the components in the component set, assigns a code to each component remaining after deduplication to form the component code set, extracts the feature of the component corresponding to each code in the component code set, and thereby constructs the component feature set corresponding to the component code set.
Component sub-image acquisition module: this module comprises a horizontal projection cutting module, a connected-component segmentation module, a single-character recognition module and a dynamic programming module, which respectively perform horizontal projection cutting, connected-component segmentation, single-character recognition and dynamic programming on the formula image to be recognized, finally obtaining its component sub-images, each of which contains one component.
The horizontal projection cutting module performs horizontal projection cutting on the formula image to be recognized.
The formula image to be recognized is first binarized; then the number of non-zero pixel values in each column is counted and stored in the array Col[w]; finally, the array Col[w] is traversed, and every boundary between a zero value and a non-zero value in Col[w] is a cut point.
Determining the cut points as above yields the first-level sub-image sequence {s_0, s_1, s_2, ..., s_n} of the formula image. Because the result of horizontal projection cutting is affected by factors such as cutting precision and image sharpness, each sub-image s_i in the first-level sub-image sequence falls into one of three cases: ① s_i is the complete image of a component; ② s_i is only part of a component; ③ s_i is a combination of two or more components.
Because a sub-image s_i in case ② or ③ is not the desired component sub-image, further processing by the connected-component segmentation module is required to obtain the desired component sub-images.
In addition, calculations are performed on the first-level sub-image sequence {s_0, s_1, s_2, ..., s_n}. The distance d_n (d_n > 0) between the right boundary of first-level sub-image s_{n-1} and the left boundary of the adjacent first-level sub-image s_n is calculated, the mean of the resulting d_n values is computed, and this mean is defined as the first reference value.
Then, the widths d_w0, d_w1, d_w2, ..., d_wn of all first-level sub-images s_0, s_1, s_2, ..., s_n in the first-level sub-image sequence are obtained, the mean width of all first-level sub-images is computed, and this mean is defined as the second reference value.
The connected-component segmentation module performs connected-component segmentation on each sub-image s_i of the first-level sub-image sequence {s_0, s_1, s_2, ..., s_n}, dividing each first-level sub-image s_i into a sequence of m second-level sub-images. Because the second-level sub-images obtained by connected-component segmentation are rather fragmentary, they must be merged according to the spatial relationships between them. The merging process is as follows:
First, the bounding rectangle Rect of each second-level sub-image (index j, 0 ≤ j ≤ m) in the second-level sub-image sequence is calculated;
Then, the m calculated rectangles are traversed. If any two rectangles overlap in both the horizontal and the vertical direction, the two corresponding second-level sub-images are merged into one new third-level sub-image.
This process is repeated until no second-level sub-images can be merged, finally yielding a sequence of k (k ≤ m) third-level sub-images produced by connected-component segmentation.
The single-character recognition module performs single-character recognition on each third-level sub-image in the third-level sub-image sequence and on each first-level sub-image for which connected-component segmentation did not succeed.
Single-character recognition is performed on each third-level sub-image in the successfully segmented third-level sub-image sequence and on each first-level sub-image for which connected-component segmentation did not succeed, yielding a recognition result and a corresponding recognition confidence for each of them.
Specifically, single-character recognition is performed on each third-level sub-image in the third-level sub-image sequence to obtain its recognition result and corresponding recognition confidence. The component sub-images are then determined from the type of first-grade symbol present in the recognition result and the corresponding recognition confidence, in combination with the first reference value and the second reference value. The following two types of component sub-image can be determined in this way:
1. If the third-level sub-image sequence contains a third-level sub-image whose recognition result is a fraction and whose recognition confidence is higher than a first preset value, whose width is close to the width of its corresponding first-level sub-image, and whose width is greater than both the first reference value and the second reference value (that is, its width-to-height ratio is very large), the first-level sub-image is regarded as a component sub-image of fraction type and needs no further processing. In particular, the first preset value is any value between 80% and 100%.
2. If the third-level sub-image sequence contains a third-level sub-image whose recognition result is a radical and whose recognition confidence is higher than the first preset value, whose height is close to the height of its corresponding first-level sub-image, and whose height is greater than both the first reference value and the second reference value (that is, its width-to-height ratio is very small), the first-level sub-image is regarded as a component sub-image of radical type and needs no further processing.
Single-character recognition is performed on each first-level sub-image for which connected-component segmentation did not succeed, and component sub-images are determined from the recognition result and its corresponding recognition confidence in combination with the first reference value, as follows:
For a first-level sub-image for which connected-component segmentation did not succeed, single-character recognition is performed. If the recognition result is a formula symbol and the recognition confidence is higher than the first preset value, and the distance between this first-level sub-image and its horizontally adjacent first-level sub-image is close to the first reference value, the first-level sub-image is regarded as a component sub-image of formula-symbol type and needs no further processing.
The dynamic programming module performs dynamic programming on the third-level sub-images and first-level sub-images that have not been determined to be component sub-images.
When component sub-images cannot be determined after single-character recognition of the third-level sub-images and of the first-level sub-images for which connected-component segmentation did not succeed, dynamic programming is performed with all third-level and first-level sub-images not determined to be component sub-images as the basic units. The steps are as follows.
All sub-images not determined to be component sub-images are sorted by their position in the image to be recognized, from left to right and from top to bottom, and relabeled as a new sub-image sequence. Dynamic programming derives the global optimum from local optima to obtain the optimal combination of the new sub-image sequence. The procedure is as follows: to compute the optimal combination of the 0th to kth sub-images in the sequence, the optimal combination of the 0th to ith sub-images and the optimal combination of the ith to kth sub-images are computed separately, where i ranges over [1, k-1]. The optimal combination of the 0th to ith sub-images is in turn formed from the optimal combination of the 0th to jth sub-images and the optimal combination of the jth to ith sub-images, where j ranges over [1, i-1], and so on; the global optimum is thus derived from the local optima.
During the search for the optimal solution, the evaluation rule is that the final score is the weighted sum of a geometry score and a recognition confidence score: score = n1*Gem + n2*Reg, where score is the final score, Gem is the geometry score, Reg is the recognition confidence score, and n1 and n2 are the weights of the geometry score and the recognition confidence score respectively. A higher score indicates a better result, so the result with the highest score is finally chosen as the optimal solution. For example, when evaluating the merging of the (i-1)th and ith sub-images, score denotes the final score of combining i with i-1. The two sub-images are merged and component recognition is performed on the merged sub-image. Component recognition is modeled on the single-character recognition method: it is a statistical recognition method whose recognition objects are the components of the new component set.
Component recognition proceeds as follows: the merged sub-image is scaled nonlinearly, that is, normalized by curve fitting; the texture feature and edge feature of the sub-image are extracted to form its component feature, which is compared with the component feature set to obtain the recognition result Code and the recognition confidence score Reg. The spatial position feature of the merged sub-image is then extracted and matched against the spatial position feature of the component coded Code, giving the similarity score Gem. The final score of the combination is computed from the values of Gem and Reg.
The highest-scoring combination of the new sub-image sequence is found by the above process. This highest-scoring combination is the optimally combined sub-image sequence, and each sub-image in it is a component sub-image.
Processing by the above modules completes the acquisition of all component sub-images of the formula image to be recognized, and each component sub-image contains one component.
Formula recognition module: this module extracts texture features and edge features from each sub-image S_i to form the component feature of each sub-image. Using the module for constructing the component code set and the component feature set, the component feature of a single component sub-image is extracted, and the similarity between the component feature of the single sub-image S_i and each feature in the component feature set is calculated. The similarity measures are sorted in descending order, the M component features corresponding to the top M similarities are selected, and the component codes corresponding to these M component features are taken as the multi-candidate recognition result of the single component sub-image.
If the formula image to be recognized contains only one component sub-image, the component code corresponding to the component feature with the greatest similarity among the multi-candidate recognition results is chosen as the formula recognition result;
If the formula image to be recognized contains multiple component sub-images, the multi-candidate recognition results of each component sub-image are taken as the basic units, dynamic programming combined with a probabilistic statistical language model is used to score and rank each candidate formula recognition result, and the component codes corresponding to the highest-scoring component features are taken as the formula recognition result.
A kind of formula identification method and device based on formula library proposed by the present invention is described above.With tradition Formula identification method compare, recognition methods of the present invention similar to line of text identify.Using the formula library pre-established as base Plinth, the recognition methods can make full use of formula library to calibrate with Statistical error as a result, simultaneously, and the recognition methods is by complicated type As soon as subformula be converted into a new character, a formula is treated as one or more complicated type subformula and single passes The horizontal direction combination of system character.The formula identification method and device through the invention do not need to carry out between character Structural analysis, saves the time of formulas solutions, improves the accuracy rate of formulas solutions the step of simplifying formulas solutions.
The above is merely an embodiment of the present application and is not intended to limit the invention. For those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (12)

1. A formula recognition method based on a formula library, characterized in that the formula recognition method comprises:
constructing the components of each formula in the formula library; wherein the components include first-level symbols and sub-formulas; the first-level symbols are the starting symbol of each formula and the same-level symbols whose structural relationship with the starting symbol is horizontal; a sub-formula is a formula composed of a first-level symbol together with all lower-level symbols derived with that first-level symbol as reference; the starting symbol is the leftmost, highest-level symbol of the formula whose reference symbol is the symbol itself; and the reference symbol is the structural reference of each symbol in the formula;
establishing a component set of the formula library according to the constructed components of each formula;
assigning a code to each component in the component set to form a component code set, extracting the component feature corresponding to each code in the component code set, and thereby constructing a component feature set corresponding to the component code set;
obtaining the component sub-images of a formula image to be recognized, wherein each component sub-image contains one component;
obtaining the component feature of each component sub-image in the formula image to be recognized, comparing the component feature of each component sub-image with the component feature set to obtain multi-candidate recognition results for each component sub-image, and thereby obtaining the formula recognition result of the formula image to be recognized.
2. The method of claim 1, characterized in that constructing the components of each formula in the formula library comprises:
obtaining all first-level symbols t_0, t_1, t_2, ..., t_n of each formula in the formula library; wherein, taking the starting symbol t_0 as the reference symbol, the symbols to its right whose structural relationship with the starting symbol is horizontal are obtained one by one, and the starting symbol and these right-side symbols are taken as the first-level symbols t_0, t_1, t_2, ..., t_n;
taking each first-level symbol t_0, t_1, t_2, ..., t_n of each formula in turn as the reference symbol to obtain its lower-level symbols; taking each first-level symbol that has no lower-level symbols as a component, and taking the sub-formula composed of a first-level symbol that has lower-level symbols together with those lower-level symbols as a component.
3. The method of claim 1, characterized in that assigning a code to each component in the component set to form a component code set comprises: de-duplicating the components in the component set, and assigning a code to each component in the de-duplicated component set to form the component code set.
4. The method of claim 1, characterized in that the component feature comprises a component texture feature and an edge feature.
5. The method of claim 1, characterized in that obtaining the component sub-images of the formula image to be recognized comprises:
performing projection segmentation in the horizontal direction on the formula image to be recognized to obtain a level-1 sub-image sequence of the formula image to be recognized;
obtaining the average distance between the adjacent boundaries of adjacent level-1 sub-images in the level-1 sub-image sequence as a first reference value, and obtaining the average width of all level-1 sub-images in the level-1 sub-image sequence as a second reference value;
performing connected-region segmentation on each level-1 sub-image in the level-1 sub-image sequence; obtaining, after successful segmentation, a level-2 sub-image sequence of the formula image to be recognized; and merging according to the spatial positional relationship between adjacent level-2 sub-images in the level-2 sub-image sequence to obtain a level-3 sub-image sequence of the formula image to be recognized;
performing single-character recognition separately on each level-3 sub-image in the level-3 sub-image sequence and on each level-1 sub-image for which connected-region segmentation did not succeed, and determining the component sub-images in combination with the first reference value and the second reference value.
6. The method of claim 5, characterized in that performing single-character recognition on each level-3 sub-image in the level-3 sub-image sequence and determining the component sub-images comprises:
performing single-character recognition separately on each level-3 sub-image in the level-3 sub-image sequence to obtain the recognition result of each level-3 sub-image and the corresponding recognition confidence;
judging, on the basis of the single-character recognition, the type of first-level symbol present in the recognition results and the corresponding recognition confidence, and determining the component sub-images according to the judgment result.
7. The method of claim 6, characterized in that judging, on the basis of the single-character recognition, the type of first-level symbol present in the recognition results and the corresponding recognition confidence, and determining the component sub-images according to the judgment result comprises:
if the level-3 sub-image sequence contains a level-3 sub-image whose recognition result is a fraction and whose recognition confidence is higher than a first preset value, and the width of that level-3 sub-image is close to the width of its corresponding level-1 sub-image and is greater than both the first reference value and the second reference value, taking the level-1 sub-image corresponding to that level-3 sub-image as a component sub-image of fraction type;
if the level-3 sub-image sequence contains a level-3 sub-image whose recognition result is a radical and whose recognition confidence is higher than the first preset value, and the height of that level-3 sub-image is close to the height of its corresponding level-1 sub-image and is greater than both the first reference value and the second reference value, taking the level-1 sub-image corresponding to that level-3 sub-image as a component sub-image of radical type.
8. The method of claim 5, characterized in that performing single-character recognition on each level-1 sub-image for which connected-region segmentation did not succeed and determining the component sub-images comprises:
judging the recognition result and recognition confidence obtained by single-character recognition of each level-1 sub-image for which connected-region segmentation did not succeed; if the recognition result is a formula symbol and the recognition confidence is higher than the first preset value, and the horizontal distance between the level-1 sub-image and its adjacent level-1 sub-image is close to the first reference value, taking the level-1 sub-image as a component sub-image of formula-symbol type.
9. The method of claim 7 or 8, characterized in that the method further comprises: when component sub-images cannot be determined after single-character recognition of each level-3 sub-image in the level-3 sub-image sequence and of each level-1 sub-image for which connected-region segmentation did not succeed, re-labelling all level-3 sub-images and level-1 sub-images not determined to be component sub-images as a new sub-image sequence, obtaining the optimal combined sub-image sequence of the re-labelled new sub-image sequence by dynamic programming, and taking each sub-image in the optimal combined sub-image sequence as a component sub-image.
10. The method of claim 1, characterized in that obtaining the component feature of each component sub-image in the formula image to be recognized, comparing the component feature of each component sub-image with the component feature set, and obtaining the multi-candidate recognition results of each component sub-image comprises:
normalizing each component sub-image in the formula image to be recognized;
extracting the texture feature and edge feature of each normalized component sub-image to form the component feature of each component sub-image;
comparing the component feature of a single component sub-image with the component features in the component feature set to obtain the similarity between the component feature of the single component sub-image and each feature in the component feature set;
sorting the similarities in descending order, selecting the M component features corresponding to the top M similarities, and taking the component codes corresponding to those M component features as the multi-candidate recognition results of the single component sub-image.
11. The method of claim 10, characterized in that obtaining the formula recognition result of the formula image to be recognized comprises:
if the formula image to be recognized contains only one component sub-image, taking the component code corresponding to the component feature with the highest similarity among the multi-candidate recognition results as the formula recognition result;
if the formula image to be recognized contains multiple component sub-images, taking the multi-candidate recognition results of each component sub-image of the formula image to be recognized as basic units, scoring and ranking each candidate formula recognition result using dynamic programming combined with a probabilistic statistical language model, and taking the component codes of the highest-scoring candidate as the formula recognition result.
12. A formula recognition device based on a formula library, characterized in that the formula recognition device comprises:
a component construction module, configured to construct the components of each formula in the formula library, wherein the components include first-level symbols and sub-formulas; the first-level symbols are the starting symbol of each formula and the same-level symbols whose structural relationship with the starting symbol is horizontal; a sub-formula is a formula composed of a first-level symbol together with all lower-level symbols derived with that first-level symbol as reference; the starting symbol is the leftmost, highest-level symbol of the formula whose reference symbol is the symbol itself; and the reference symbol is the structural reference of each symbol in the formula;
a component set establishing module, configured to establish a component set of the formula library according to the constructed components of each formula;
a component code set and component feature set construction module, configured to assign a code to each component in the component set to form a component code set, extract the feature of the component corresponding to each code in the component code set, and thereby construct a component feature set corresponding to the component code set;
a component sub-image acquisition module, configured to process the formula image to be recognized and obtain the component sub-images of the formula image to be recognized, wherein each component sub-image contains one component;
a formula recognition module, configured to obtain the component feature of each component sub-image in the formula image to be recognized, compare the component feature of each component sub-image with the component feature set to obtain multi-candidate recognition results for each component sub-image, and thereby obtain the formula recognition result of the formula image to be recognized.
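
The first two sub-image acquisition steps recited in claim 5 (projection segmentation and the two reference values) can be sketched as below. This is an assumption-laden illustration: it reads "projection segmentation in the horizontal direction" as splitting the image at ink-free columns, represents the image as rows of 0/1 pixels, and uses hypothetical function names; connected-region segmentation and single-character recognition are not shown.

```python
def split_by_column_projection(image):
    """Split a binary image (rows of 0/1, 1 = ink) at ink-free columns.

    Returns a list of (start, end) column ranges, end exclusive, one per
    level-1 sub-image.
    """
    width = len(image[0])
    column_has_ink = [any(row[x] for row in image) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(column_has_ink):
        if ink and start is None:
            start = x                      # a new sub-image begins
        elif not ink and start is not None:
            spans.append((start, x))       # the current sub-image ends
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

def reference_values(spans):
    """Return (first, second) reference values for a non-empty span list.

    first: mean gap between adjacent boundaries of neighbouring sub-images.
    second: mean sub-image width.
    """
    gaps = [right[0] - left[1] for left, right in zip(spans, spans[1:])]
    widths = [end - start for start, end in spans]
    first = sum(gaps) / len(gaps) if gaps else 0.0
    second = sum(widths) / len(widths)
    return first, second

# Toy 2x9 image containing three ink runs separated by blank columns.
image = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
]
spans = split_by_column_projection(image)
first_ref, second_ref = reference_values(spans)
```

The two reference values give later steps a scale for judging whether a recognized fraction bar or radical is wide or tall enough to span a whole sub-formula.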
CN201510985871.8A 2015-12-25 2015-12-25 Formula identification method and device based on formula library Active CN105447477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510985871.8A CN105447477B (en) 2015-12-25 2015-12-25 Formula identification method and device based on formula library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510985871.8A CN105447477B (en) 2015-12-25 2015-12-25 Formula identification method and device based on formula library

Publications (2)

Publication Number Publication Date
CN105447477A CN105447477A (en) 2016-03-30
CN105447477B true CN105447477B (en) 2019-03-01

Family

ID=55557637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510985871.8A Active CN105447477B (en) 2015-12-25 2015-12-25 Formula identification method and device based on formula library

Country Status (1)

Country Link
CN (1) CN105447477B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389061A (en) * 2018-09-26 2019-02-26 苏州友教习亦教育科技有限公司 Paper recognition methods and system
CN111539383B (en) * 2020-05-22 2023-05-05 浙江蓝鸽科技有限公司 Formula knowledge point identification method and device
CN113610068B (en) * 2021-10-11 2022-07-08 江西风向标教育科技有限公司 Test question disassembling method, system, storage medium and equipment based on test paper image

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101149790A (en) * 2007-11-14 2008-03-26 哈尔滨工程大学 Chinese printing style formula identification method
CN101393601A (en) * 2007-09-21 2009-03-25 汉王科技股份有限公司 Method for identifying mathematical formula of print form
CN103810493A (en) * 2012-11-06 2014-05-21 夏普株式会社 Method and apparatus for identifying mathematical formula
CN104268118A (en) * 2014-09-23 2015-01-07 赵方 Mathematical formula calculator including touch screen and method for identifying mathematical formulas

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5267546B2 (en) * 2010-12-22 2013-08-21 カシオ計算機株式会社 Electronic computer and program with handwritten mathematical expression recognition function

Also Published As

Publication number Publication date
CN105447477A (en) 2016-03-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant