CN108241847A - LaTeX format formula processing method and device in text recognition - Google Patents

Info

Publication number
CN108241847A
Authority
CN
China
Prior art keywords
formula
character
fragment
space character
left bracket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611227736.8A
Other languages
Chinese (zh)
Other versions
CN108241847B (en)
Inventor
白建国
熊蜀光
周迅溢
兴百桥
杨镜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xintang Sichuang Education Technology Co Ltd
Original Assignee
Beijing Xintang Sichuang Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xintang Sichuang Education Technology Co Ltd
Priority to CN201611227736.8A
Publication of CN108241847A
Application granted
Publication of CN108241847B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiments of the application provide a method and a device for processing LaTeX format formulas in text recognition. The method comprises: obtaining the number of formula delimiters of a formula in text recognition, and judging whether that number is even; if it is even, determining the position of the head of each formula fragment according to the type of the character before the fragment's first formula delimiter; determining the position of the tail of each formula fragment according to the type of the character after the fragment's last formula delimiter; and deleting redundant formula delimiters to obtain a complete LaTeX format formula. According to the embodiments of the application, fragments of a LaTeX format formula can be combined automatically into a LaTeX format formula, which saves the labor cost of image recognition and improves recognition efficiency.

Description

Method and device for processing LaTeX format formulas in text recognition

Technical Field

The present application belongs to the technical field of image recognition, and in particular relates to a method and a device for processing LaTeX format formulas in text recognition.

Background

LaTeX (transliterated in Chinese as “拉泰赫”) is a typesetting system based on TeX, developed by the American computer scientist Leslie Lamport in the early 1980s. With this format, even users with no knowledge of typesetting or programming can make full use of the powerful features TeX provides and produce many book-quality printed documents within days or even hours. This is especially evident when generating complex tables and mathematical formulas, so LaTeX is well suited to producing scientific and mathematical documents of high print quality. The system is equally suitable for all other kinds of documents, from simple letters to complete books.

In traditional computer-aided instruction systems, teachers often need to enter large numbers of exam questions and exercise-book questions into the computer system so that students can practice online and teachers can tutor online. This entry process tends to consume a great deal of manpower and resources, and progress is often very slow. Image recognition can complete most of the entry quickly and conveniently, but because the formulas in a question cannot be recognized as a whole in one pass, the recognition result still needs manual correction, so the efficiency gain is very limited. If the LaTeX format formula fragments produced by image recognition (a fragment being a part of a LaTeX format formula set off by formula delimiters) could be merged automatically, the labor cost of image recognition would fall and recognition efficiency would improve.

Therefore, how to process LaTeX format formulas automatically in image recognition has become a technical problem in urgent need of a solution in the prior art.

Summary of the Invention

One of the technical problems addressed by the embodiments of the present application is to provide a method and a device for processing LaTeX format formulas in text recognition that can automatically combine LaTeX format formula fragments into a LaTeX format formula, saving the labor cost of image recognition and improving recognition efficiency.

An embodiment of the present application provides a method for processing LaTeX format formulas in text recognition, comprising:

obtaining the number of formula delimiters of a formula in text recognition, and judging whether that number is even;

if it is even, determining the position of the head of each formula fragment according to the type of the character before the fragment's first formula delimiter;

determining the position of the tail of each formula fragment according to the type of the character after the fragment's last formula delimiter;

deleting redundant formula delimiters to obtain a complete LaTeX format formula.

In a specific implementation of the present application, the method further comprises:

if it is odd, finding, for each formula delimiter, the preceding character not contained in a formula fragment, or the preceding formula delimiter, and inserting a formula delimiter after that character or formula delimiter.

In a specific implementation of the present application, determining, if the number is even, the position of the head of the formula fragment according to the type of the character before the first formula delimiter of each formula fragment comprises:

detecting the type of the first character, that is, the character before the first formula delimiter of each formula fragment;

if the first character is Chinese, a formula delimiter, or a punctuation mark, ending the leftward search and determining that the character after the formula delimiter is the head of the formula fragment;

if the first character is a digit, a letter, or a mathematical symbol, swapping the positions of the formula delimiter and the first character, and continuing the leftward detection to determine the position of the head of the formula fragment;

if the first character is a right parenthesis, determining the position of the head of the formula fragment according to whether a leftward search finds a left parenthesis.

In a specific implementation of the present application, determining, if the first character is a right parenthesis, the position of the head of the formula fragment according to whether a leftward search finds a left parenthesis comprises:

if the first character is a right parenthesis, judging whether a leftward search finds a left parenthesis;

if no left parenthesis is found, ending the leftward search and determining that the character after the formula delimiter is the head of the formula fragment;

if a left parenthesis is found, and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits, inserting the formula delimiter in front of the left parenthesis.
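The head-determination rules above can be sketched in Python as follows. This is only an illustrative reading, not the patent's implementation: the names `tokenize`, `classify`, and `move_head`, the contents of `MATH_SYMBOLS` and `PUNCTUATION`, and the Unicode range used for Chinese are all assumptions. The source also does not say whether scanning continues after the delimiter is moved in front of a left parenthesis; the sketch assumes it does. It treats parenthesised digits alone, such as "(1)", as not part of a formula, which is one way to read the condition on the characters between the parentheses.

```python
DELIM = "$$"                              # formula delimiter named in the source
MATH_SYMBOLS = set("+-*/=<>|^_\\{}")      # assumed, illustrative set
PUNCTUATION = set(",.;:!?,。;:!?、")   # assumed, illustrative set

def tokenize(text):
    """Split text so that each "$$" is one token and every other character is one token."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith(DELIM, i):
            tokens.append(DELIM)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def classify(tok):
    """Return the character type used by the head-determination rules."""
    if tok == DELIM:
        return "delimiter"
    if tok == "(":
        return "lparen"
    if tok == ")":
        return "rparen"
    if (tok.isascii() and tok.isalnum()) or tok in MATH_SYMBOLS:
        return "formula"                  # digit, Latin letter, or math symbol
    if "\u4e00" <= tok <= "\u9fff":
        return "chinese"
    if tok in PUNCTUATION:
        return "punctuation"
    return "other"                        # anything else also stops the search

def move_head(tokens, i):
    """Shift the opening delimiter at index i leftward; return its new index."""
    while i > 0:
        kind = classify(tokens[i - 1])
        if kind == "formula":
            # Swap the delimiter with the preceding character and keep going.
            tokens[i - 1], tokens[i] = tokens[i], tokens[i - 1]
            i -= 1
        elif kind == "rparen":
            j = i - 2
            while j >= 0 and tokens[j] != "(":
                j -= 1                    # leftward search for a left parenthesis
            inner = tokens[j + 1:i - 1] if j >= 0 else []
            ok = (j >= 0
                  and all(classify(t) == "formula" for t in inner)
                  and any(not t.isdigit() for t in inner))
            if not ok:
                break                     # no usable "(": the head stays here
            tokens.pop(i)                 # move the delimiter in front of "("
            tokens.insert(j, DELIM)
            i = j
        else:
            break                         # Chinese, delimiter, punctuation, other
    return i

def fix_first_head(text):
    """Apply move_head to the first delimiter of text (demo helper)."""
    tokens = tokenize(text)
    move_head(tokens, tokens.index(DELIM))
    return "".join(tokens)
```

For instance, `fix_first_head(r"Question 1:|-$$\frac{1}{2}$$")` moves the opening delimiter past `-` and `|` and stops at the colon, giving `Question 1:$$|-\frac{1}{2}$$`. The nested-parenthesis case is deliberately left unhandled in this sketch.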

In a specific implementation of the present application, determining the position of the tail of the formula fragment according to the type of the character after the last formula delimiter of each formula fragment comprises:

detecting the type of the second character, that is, the character after the last formula delimiter of each formula fragment;

if the second character is Chinese, a formula delimiter, or a punctuation mark, ending the rightward search and determining that the character before the formula delimiter is the tail of the formula fragment;

if the second character is a letter, a digit, or a mathematical symbol, swapping the positions of the formula delimiter and the second character, and continuing the rightward detection to determine the position of the tail of the formula fragment;

if the second character is a left parenthesis, determining the position of the tail of the formula fragment according to whether a rightward search finds a right parenthesis.

In a specific implementation of the present application, determining, if the second character is a left parenthesis, the position of the tail of the formula fragment according to whether a rightward search finds a right parenthesis comprises:

if the second character is a left parenthesis, judging whether a rightward search finds a right parenthesis;

if no right parenthesis is found, ending the rightward search and determining that the character after the formula delimiter is the tail of the formula fragment;

if a right parenthesis is found, and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits, inserting the formula delimiter behind the right parenthesis.
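The tail-determination rules mirror the head rules, scanning rightward instead of leftward. A minimal sketch under stated assumptions: `tokenize`, `is_formula_char`, `move_tail`, and the `MATH_SYMBOLS` set are hypothetical names and values, every character that is not a digit, Latin letter, math symbol, or left parenthesis is treated as ending the search, and scanning is assumed to continue after the delimiter has been moved behind a right parenthesis.

```python
DELIM = "$$"                          # formula delimiter named in the source
MATH_SYMBOLS = set("+-*/=<>|^_\\{}")  # assumed, illustrative set

def tokenize(text):
    """Split text so that each "$$" is one token and every other character is one token."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith(DELIM, i):
            tokens.append(DELIM)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def is_formula_char(tok):
    """True for a digit, a Latin letter, or one of the assumed math symbols."""
    return (tok.isascii() and tok.isalnum()) or tok in MATH_SYMBOLS

def move_tail(tokens, i):
    """Shift the closing delimiter at index i rightward; return its new index."""
    while i < len(tokens) - 1:
        nxt = tokens[i + 1]
        if is_formula_char(nxt):
            # Swap the delimiter with the following character and keep going.
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
            i += 1
        elif nxt == "(":
            try:
                j = tokens.index(")", i + 2)   # rightward search for ")"
            except ValueError:
                break                          # no ")": end the search
            inner = tokens[i + 2:j]
            if not (all(is_formula_char(t) for t in inner)
                    and any(not t.isdigit() for t in inner)):
                break
            tokens.insert(j + 1, DELIM)        # put the delimiter behind ")"
            tokens.pop(i)                      # and remove it from its old place
            i = j
        else:
            break                              # Chinese, delimiter, punctuation, ...
    return i

def fix_first_tail(text):
    """Apply move_tail to the second delimiter of text (demo helper)."""
    tokens = tokenize(text)
    closing = [k for k, t in enumerate(tokens) if t == DELIM][1]
    move_tail(tokens, closing)
    return "".join(tokens)
```

With the hypothetical input `$$y=$$f(x),其中`, the closing delimiter is swapped past `f`, moved behind `)`, and stops at the comma, so the fragment becomes `$$y=f(x)$$`.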

In a specific implementation of the present application, the redundant formula delimiters are specifically two consecutive formula delimiters.

Corresponding to the above method, the present application further provides a device for processing LaTeX format formulas in text recognition, comprising:

a quantity judging module, configured to obtain the number of formula delimiters of a formula in text recognition and to judge whether that number is even;

a head determination module, configured to determine, if the number is even, the position of the head of each formula fragment according to the type of the character before the fragment's first formula delimiter;

a tail determination module, configured to determine the position of the tail of each formula fragment according to the type of the character after the fragment's last formula delimiter;

a symbol deletion module, configured to delete redundant formula delimiters to obtain a complete LaTeX format formula.

In a specific implementation of the present application, the device further comprises:

a symbol insertion module, configured to find, if the number is odd, for each formula delimiter the preceding character not contained in a formula fragment, or the preceding formula delimiter, and to insert a formula delimiter after that character or formula delimiter.

In a specific implementation of the present application, the head determination module comprises:

a first character judging unit, configured to detect the type of the first character, that is, the character before the first formula delimiter of each formula fragment;

a first search ending unit, configured to end the leftward search if the first character is Chinese, a formula delimiter, or a punctuation mark, and to determine that the character after the formula delimiter is the head of the formula fragment;

a first character swapping unit, configured to swap the positions of the formula delimiter and the first character if the first character is a digit, a letter, or a mathematical symbol, and to continue the leftward detection to determine the position of the head of the formula fragment;

a left parenthesis search unit, configured to determine, if the first character is a right parenthesis, the position of the head of the formula fragment according to whether a leftward search finds a left parenthesis.

In a specific implementation of the present application, the left parenthesis search unit comprises:

a first judging subunit, configured to judge, if the first character is a right parenthesis, whether a leftward search finds a left parenthesis;

a first obtaining subunit, configured to end the leftward search if no left parenthesis is found, and to determine that the character after the formula delimiter is the head of the formula fragment;

a first non-obtaining subunit, configured to insert the formula delimiter in front of the left parenthesis if a left parenthesis is found and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits.

In a specific implementation of the present application, the tail determination module comprises:

a second character judging unit, configured to detect the type of the second character, that is, the character after the last formula delimiter of each formula fragment;

a second search ending unit, configured to end the rightward search if the second character is Chinese, a formula delimiter, or a punctuation mark, and to determine that the character before the formula delimiter is the tail of the formula fragment;

a second character swapping unit, configured to swap the positions of the formula delimiter and the second character if the second character is a letter, a digit, or a mathematical symbol, and to continue the rightward detection to determine the position of the tail of the formula fragment;

a right parenthesis search unit, configured to determine, if the second character is a left parenthesis, the position of the tail of the formula fragment according to whether a rightward search finds a right parenthesis.

In a specific implementation of the present application, the right parenthesis search unit comprises:

a second judging subunit, configured to judge, if the second character is a left parenthesis, whether a rightward search finds a right parenthesis;

a second obtaining subunit, configured to end the rightward search if no right parenthesis is found, and to determine that the character after the formula delimiter is the tail of the formula fragment;

a second non-obtaining subunit, configured to insert the formula delimiter behind the right parenthesis if a right parenthesis is found and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits.

In a specific implementation of the present application, the redundant formula delimiters are specifically two consecutive formula delimiters.

When the embodiments of the present application judge that the number of formula delimiters is even, the position of the formula head in each formula fragment is determined according to the type of the character before the fragment's first formula delimiter, and the position of the formula tail is determined according to the type of the character after the fragment's last formula delimiter. The redundant formula delimiters are then deleted to obtain a complete LaTeX format formula. The present application can thus combine LaTeX format formula fragments into a LaTeX format formula automatically, saving the labor cost of image recognition and improving recognition efficiency.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present application, and those of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a flowchart of one embodiment of the method for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 2 is a flowchart of another embodiment of the method for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 3 is a flowchart of one embodiment of step S2 of the method for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 4 is a flowchart of one embodiment of step S24 of the method for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 5 is a flowchart of one embodiment of step S3 of the method for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 6 is a flowchart of one embodiment of step S34 of the method for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 7 is a structural diagram of one embodiment of the device for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 8 is a structural diagram of another embodiment of the device for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 9 is a structural diagram of one embodiment of the head determination module of the device for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 10 is a structural diagram of one embodiment of the left parenthesis search unit in the head determination module of the device for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 11 is a structural diagram of one embodiment of the tail determination module of the device for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 12 is a structural diagram of one embodiment of the right parenthesis search unit in the tail determination module of the device for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 13 is a schematic diagram of the hardware structure of an electronic device that performs the method for processing LaTeX format formulas in text recognition provided by the present application;

Fig. 14 is a flowchart of a specific application scenario of the present application.

Detailed Description

When the embodiments of the present application judge that the number of formula delimiters is even, the position of the formula head in each formula fragment is determined according to the type of the character before the fragment's first formula delimiter, and the position of the formula tail is determined according to the type of the character after the fragment's last formula delimiter. The redundant formula delimiters are then deleted to obtain a complete LaTeX format formula. The present application can thus combine LaTeX format formula fragments into a LaTeX format formula automatically, saving the labor cost of image recognition and improving recognition efficiency.

Although the present application can be embodied in many different forms, specific embodiments are shown in the drawings and described in detail herein, with the understanding that the disclosure of such embodiments is to be regarded as an example of the principles and is not intended to limit the present application to the specific embodiments shown and described. In the following description, the same reference numerals are used to describe the same, similar, or corresponding parts in the several views of the drawings.

As used herein, the term "a" or "an" is defined as one or more than one. The term "plurality" is defined as two or more than two. The term "other" is defined as at least a second or more. The terms "comprising" and/or "having" are defined as including (that is, open language). The term "coupled" is defined as connected, although not necessarily directly and not necessarily mechanically. The term "program" or "computer program" or similar terms is defined as a sequence of instructions designed for execution on a computer system. A "program" or "computer program" may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, source code, object code, a shared library/dynamically loaded library, and/or another sequence of instructions designed for execution on a computer system.

Reference throughout this document to "one embodiment", "certain embodiments", "an embodiment", or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of such phrases in various places throughout this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

As used herein, the term "or" is to be interpreted as inclusive, meaning any one or any combination. Therefore, "A, B or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, steps, or acts is in some way inherently mutually exclusive.

To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application shall fall within the scope of protection of the present application.

The specific implementation of the present application is further described below with reference to the drawings.

Referring to Fig. 1, one embodiment of the present application provides a method for processing LaTeX format formulas in text recognition, comprising:

S1. Obtain the number of formula delimiters of a formula in text recognition, and judge whether that number is even.

In LaTeX format, every mathematical formula should be placed between formula delimiters $$, but in LaTeX format text crawled from websites a complete formula is usually split into several formula fragments.

For example, the crawled LaTeX format text “Question 1:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$” contains the formula fragments $$\frac{1}{2}$$, $$\sqrt{12}$$, and $$^{-1}$$, whereas the complete formula should be $$|-\frac{1}{2}|+\sqrt{12}-2^{-1}$$.

Because the crawled LaTeX format text contains fragments of a formula, a question that is really a single formula is displayed as three formulas. In “Question 1:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$”, six formula delimiters $$ appear; they pair up two by two, and the part between each pair is a formula fragment in LaTeX format. The method then checks whether the delimiter count is even: this text has six delimiters, which enclose 3 formula fragments.
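Step S1 amounts to counting the $$ delimiters and testing the parity of the count. The following Python sketch shows one way to do that; the function names are hypothetical and the source does not prescribe any particular implementation.

```python
DELIM = "$$"  # formula delimiter named in the source

def count_delimiters(text: str) -> int:
    """Count non-overlapping occurrences of the $$ delimiter."""
    return text.count(DELIM)

def has_even_delimiters(text: str) -> bool:
    """True when the delimiters can pair up into whole fragments."""
    return count_delimiters(text) % 2 == 0

question_1 = r"Question 1:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$"
print(count_delimiters(question_1))       # 6
print(count_delimiters(question_1) // 2)  # 3 fragments
print(has_even_delimiters(question_1))    # True
```

An even count sends the text to steps S2 to S4; an odd count means a delimiter is unpaired and has to be balanced first.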

S2. If the number is even, determine the position of the head of each formula fragment according to the type of the character before the fragment's first formula delimiter.

If the number of formula delimiters is even, the head position of each formula fragment can be determined from the type of the character before the fragment's first formula delimiter.

S3. Determine the position of the tail of each formula fragment according to the type of the character after the fragment's last formula delimiter.

If the number of formula delimiters is even, the tail position of each formula fragment can be determined from the type of the character after the fragment's last formula delimiter.

S4. Delete the redundant formula delimiters to obtain a complete LaTeX format formula.

The redundant formula delimiters among the formula fragments are deleted so that the formulas on either side are joined, which yields the complete LaTeX format formula.

因此,本申请能够使拉泰赫格式公式碎片自动化合成为拉泰赫格式公式,节约图像识别的人工成本,提高识别效率。Therefore, the present application can automatically synthesize fragments of the Rateh format formula into Rateh format formulas, save labor costs for image recognition, and improve recognition efficiency.

In another specific embodiment of the present application, referring to FIG. 2, the method further includes:

S5. If the number is odd, for each formula delimiter, search for the preceding character that is not contained in a formula fragment, or the preceding formula delimiter, and insert a formula delimiter after that character or formula delimiter.

If the number of formula delimiters is odd, then for each formula delimiter $$, search backward from it for the first character that is not part of a formula, or for the first formula delimiter $$, and insert a formula delimiter $$ after it.

Consider Problem 2 below:

Calculate:|-$$\frac{1}{2}$$|+$$ (Problem 2)

For the first formula delimiter $$, the first preceding character that is not part of a formula is the colon “:”, so after the first pass the problem becomes:

Calculate:$$|-$$\frac{1}{2}$$|+$$

Handling the remaining delimiters in the same way, the whole problem becomes:

Calculate:$$|-$$$$\frac{1}{2}$$$$|+$$

After this processing, the number of formula delimiters $$ in the problem is even, and step S2 is executed.
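A minimal sketch of step S5 follows, under stated assumptions. The character classification in `is_formula_char` (ASCII letters, digits, and a small hand-picked set of mathematical symbols) is an assumption made for illustration, and the names `tokenize` and `balance_delimiters` are hypothetical; the application does not specify them.

```python
DELIM = "$$"
# Assumed set of symbols treated as formula content; not taken from the application.
MATH_SYMBOLS = set("+-*/=<>|^_\\{}.")


def is_formula_char(ch: str) -> bool:
    """ASCII letters, digits, and the assumed math symbols count as formula content."""
    return ch.isascii() and (ch.isalnum() or ch in MATH_SYMBOLS)


def tokenize(text: str) -> list[str]:
    """Split text into single characters, keeping each '$$' as one token."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith(DELIM, i):
            tokens.append(DELIM)
            i += len(DELIM)
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def balance_delimiters(text: str) -> str:
    """If the delimiter count is odd, give every delimiter an inserted partner."""
    tokens = tokenize(text)
    if tokens.count(DELIM) % 2 == 0:
        return text  # already even: nothing to do
    insert_before = set()
    for i, tok in enumerate(tokens):
        if tok != DELIM:
            continue
        j = i - 1
        # Walk left over formula content until a non-formula character,
        # another delimiter, or the start of the text is reached.
        while j >= 0 and tokens[j] != DELIM and is_formula_char(tokens[j]):
            j -= 1
        insert_before.add(j + 1)  # i.e. insert right after token j
    out = []
    for i, tok in enumerate(tokens):
        if i in insert_before:
            out.append(DELIM)
        out.append(tok)
    return "".join(out)


print(balance_delimiters(r"Calculate:|-$$\frac{1}{2}$$|+$$"))
# Calculate:$$|-$$$$\frac{1}{2}$$$$|+$$
```

Because one delimiter is inserted for each existing one, the resulting count is always even, which is what allows step S2 to proceed.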

In another specific embodiment of the present application, referring to FIG. 3, step S2 includes:

S21. Detect the type of the first character, that is, the character preceding the first formula delimiter of each formula fragment.

S22. If the first character is a Chinese character, a formula delimiter, or a punctuation mark, end the forward search (toward the start of the text) and determine that the character following the formula delimiter is the head of the formula fragment.

Specifically, if the first character is a Chinese character, a formula delimiter, or a punctuation mark, the first character is not part of the formula fragment. For example, in “(1)、$$a+b$$” the first character before the first formula delimiter $$ is “、”, so the character “a” following the formula delimiter $$ is determined to be the head of the formula fragment.

S23. If the first character is a digit, a letter, or a mathematical symbol, swap the positions of the formula delimiter and the first character, and continue detecting forward to determine the head of the formula fragment.

Specifically, if the first character is a digit, a letter, or a mathematical symbol, the first character is part of the formula fragment. For example, in “6+$$5+9$$” the first character before the first formula delimiter $$ is “+”, so swapping the formula delimiter and the first character yields “6$$+5+9$$”. Detection then continues forward: the next character, “6”, is also swapped with the formula delimiter, yielding “$$6+5+9$$”.
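Steps S21 to S23 amount to a loop that swaps the opening delimiter with the character on its left for as long as that character is formula content. The sketch below assumes a simple ASCII-based classifier (`is_formula_char`, with a hand-picked symbol set); the classifier and the function name `extend_head` are illustrative assumptions, not the application's definitions.

```python
DELIM = "$$"
# Assumed set of symbols treated as formula content; not taken from the application.
MATH_SYMBOLS = set("+-*/=<>|^_\\{}.")


def is_formula_char(ch: str) -> bool:
    """ASCII letters, digits, and the assumed math symbols count as formula content."""
    return ch.isascii() and (ch.isalnum() or ch in MATH_SYMBOLS)


def extend_head(text: str, pos: int) -> tuple[str, int]:
    """Move the opening delimiter at index `pos` leftward (S21 to S23).

    Returns the updated text and the new index of the delimiter.
    Parentheses (S24) are deliberately not handled here.
    """
    while pos > 0:
        prev = text[pos - 1]
        if not is_formula_char(prev):
            break  # S22: Chinese character, punctuation, or another delimiter
        # S23: swap the delimiter with the character on its left.
        text = text[:pos - 1] + DELIM + prev + text[pos + len(DELIM):]
        pos -= 1
    return text, pos


print(extend_head("6+$$5+9$$", 2))      # ('$$6+5+9$$', 0)
print(extend_head("(1)、$$a+b$$", 4))   # unchanged: '、' stops the search
```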

S24. If the first character is a right parenthesis, determine the position of the head of the formula fragment according to whether a left parenthesis is found by searching forward.

Specifically, referring to FIG. 4, step S24 includes:

S241. If the first character is a right parenthesis, search forward and determine whether a left parenthesis is found.

S242. If no left parenthesis is found, end the forward search and determine that the character following the formula delimiter is the head of the formula fragment.

S243. If a left parenthesis is found, and the characters between the left and right parentheses are letters and/or mathematical symbols, optionally together with digits, insert the formula delimiter before the left parenthesis.

In this embodiment, when the first character is a right parenthesis, a forward search is needed to decide whether the formula delimiter should be moved in front of the left parenthesis that precedes the right parenthesis; that is, the head of the formula fragment is determined according to whether a left parenthesis can be found.
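The parenthesis rule of S241 to S243 can be sketched as follows. Several points are assumptions made for illustration: the ASCII-based `is_formula_char` classifier, the reading of S243 as "all formula characters and at least one non-digit" (so that an item label such as “(1)” is not pulled into the formula), and the use of the nearest preceding “(”, which ignores nesting. The function name `move_head_over_parens` is hypothetical.

```python
DELIM = "$$"
# Assumed set of symbols treated as formula content; not taken from the application.
MATH_SYMBOLS = set("+-*/=<>|^_\\{}.")


def is_formula_char(ch: str) -> bool:
    """ASCII letters, digits, and the assumed math symbols count as formula content."""
    return ch.isascii() and (ch.isalnum() or ch in MATH_SYMBOLS)


def move_head_over_parens(text: str, pos: int) -> tuple[str, int]:
    """Apply S241 to S243 to the opening delimiter at index `pos`."""
    if pos == 0 or text[pos - 1] != ")":
        return text, pos  # S24 only applies when a ')' precedes the delimiter
    open_idx = text.rfind("(", 0, pos - 1)
    if open_idx == -1:
        return text, pos  # S242: no '(' found, the head stays where it is
    inner = text[open_idx + 1:pos - 1]
    looks_like_formula = (
        inner != ""
        and all(is_formula_char(c) for c in inner)
        and any(not c.isdigit() for c in inner)  # pure digits, e.g. "(1)", do not qualify
    )
    if not looks_like_formula:
        return text, pos
    # S243: place the delimiter in front of the '('.
    text = text[:open_idx] + DELIM + text[open_idx:pos] + text[pos + len(DELIM):]
    return text, open_idx


print(move_head_over_parens("-9(b-c)$$^{2}$$", 7))  # ('-9$$(b-c)^{2}$$', 2)
print(move_head_over_parens("(1)$$a+b$$", 3))       # unchanged
```

After the delimiter lands in front of the left parenthesis, the ordinary swap loop of S23 can continue from the new position.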

In another specific embodiment of the present application, referring to FIG. 5, step S3 includes:

S31. Detect the type of the second character, that is, the character following the last formula delimiter of each formula fragment.

S32. If the second character is a Chinese character, a formula delimiter, or a punctuation mark, end the backward search (toward the end of the text) and determine that the character preceding the formula delimiter is the tail of the formula fragment.

Specifically, if the second character is a Chinese character, a formula delimiter, or a punctuation mark, the second character is not part of the formula fragment. For example, in “$$a+b$$、” the second character after the last formula delimiter $$ is “、”, so the character “b” preceding the formula delimiter $$ is determined to be the tail of the formula fragment.

S33. If the second character is a letter, a digit, or a mathematical symbol, swap the positions of the formula delimiter and the second character, and continue detecting backward to determine the tail of the formula fragment.

Specifically, if the second character is a digit, a letter, or a mathematical symbol, the second character is part of the formula fragment. For example, in “$$5+9$$-2” the second character after the last formula delimiter $$ is “-”, so swapping the formula delimiter and the second character yields “$$5+9-$$2”. Detection then continues backward: the next character, “2”, is also swapped with the formula delimiter, yielding “$$5+9-2$$”.
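The tail search of S31 to S33 mirrors the head search: the closing delimiter is swapped with the character on its right while that character is formula content. As before, the ASCII-based `is_formula_char` classifier and the name `extend_tail` are assumptions for illustration only.

```python
DELIM = "$$"
# Assumed set of symbols treated as formula content; not taken from the application.
MATH_SYMBOLS = set("+-*/=<>|^_\\{}.")


def is_formula_char(ch: str) -> bool:
    """ASCII letters, digits, and the assumed math symbols count as formula content."""
    return ch.isascii() and (ch.isalnum() or ch in MATH_SYMBOLS)


def extend_tail(text: str, pos: int) -> tuple[str, int]:
    """Move the closing delimiter at index `pos` rightward (S31 to S33).

    Returns the updated text and the new index of the delimiter.
    Parentheses (S34) are deliberately not handled here.
    """
    end = pos + len(DELIM)  # index of the character right after the delimiter
    while end < len(text):
        nxt = text[end]
        if not is_formula_char(nxt):
            break  # S32: Chinese character, punctuation, or another delimiter
        # S33: swap the delimiter with the character on its right.
        text = text[:pos] + nxt + DELIM + text[end + 1:]
        pos += 1
        end += 1
    return text, pos


print(extend_tail("$$5+9$$-2", 5))   # ('$$5+9-2$$', 7)
print(extend_tail("$$a+b$$、", 5))   # unchanged: '、' stops the search
```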

S34. If the second character is a left parenthesis, determine the position of the tail of the formula fragment according to whether a right parenthesis is found by searching backward.

Specifically, referring to FIG. 6, step S34 includes:

S341. If the second character is a left parenthesis, search backward and determine whether a right parenthesis is found.

S342. If no right parenthesis is found, end the backward search and determine that the character preceding the formula delimiter is the tail of the formula fragment.

S343. If a right parenthesis is found, and the characters between the left and right parentheses are letters and/or mathematical symbols, optionally together with digits, insert the formula delimiter after the right parenthesis.

In this embodiment, when the second character is a left parenthesis, a backward search is needed to decide whether the formula delimiter should be moved behind the right parenthesis that follows the left parenthesis; that is, the tail of the formula fragment is determined according to whether a right parenthesis can be found.
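A sketch of S341 to S343 follows, under the same assumptions as for the head case: an ASCII-based `is_formula_char` classifier, the requirement that the parenthesized text contain at least one non-digit, and the nearest following “)” (nesting is ignored). The function name `move_tail_over_parens` and the sample string are hypothetical.

```python
DELIM = "$$"
# Assumed set of symbols treated as formula content; not taken from the application.
MATH_SYMBOLS = set("+-*/=<>|^_\\{}.")


def is_formula_char(ch: str) -> bool:
    """ASCII letters, digits, and the assumed math symbols count as formula content."""
    return ch.isascii() and (ch.isalnum() or ch in MATH_SYMBOLS)


def move_tail_over_parens(text: str, pos: int) -> tuple[str, int]:
    """Apply S341 to S343 to the closing delimiter at index `pos`."""
    end = pos + len(DELIM)
    if end >= len(text) or text[end] != "(":
        return text, pos  # S34 only applies when a '(' follows the delimiter
    close_idx = text.find(")", end + 1)
    if close_idx == -1:
        return text, pos  # S342: no ')' found, the tail stays where it is
    inner = text[end + 1:close_idx]
    looks_like_formula = (
        inner != ""
        and all(is_formula_char(c) for c in inner)
        and any(not c.isdigit() for c in inner)
    )
    if not looks_like_formula:
        return text, pos
    # S343: place the delimiter right after the ')'.
    group = text[end:close_idx + 1]
    text = text[:pos] + group + DELIM + text[close_idx + 1:]
    return text, pos + len(group)


print(move_tail_over_parens("$$2$$(a+b)和", 3))  # ('$$2(a+b)$$和', 8)
```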

In another specific embodiment of the present application, the redundant formula delimiters are specifically two consecutive formula delimiters.

After steps S1 to S3 are completed, two formula delimiters $$ may end up adjacent to each other, that is, “$$$$”. This indicates that the text both before and after the “$$$$” is genuine formula content, so the “$$$$” can be deleted directly to join the formulas on both sides.
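Deleting the adjacent pair is a plain string replacement. The short sketch below uses the intermediate form of Problem 1 as input; the function name `delete_redundant_delimiters` is hypothetical.

```python
DELIM = "$$"


def delete_redundant_delimiters(text: str) -> str:
    """Remove every pair of adjacent delimiters ('$$$$') to join neighbouring fragments."""
    return text.replace(DELIM * 2, "")


expanded = r"$$|-\frac{1}{2}$$$$|+\sqrt{12}$$$$-2^{-1}$$"
print(delete_redundant_delimiters(expanded))
# $$|-\frac{1}{2}|+\sqrt{12}-2^{-1}$$
```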

Referring to FIG. 7, corresponding to the above method, another embodiment of the present application provides a device for processing LaTeX-format formulas in text recognition, including:

A number judging module 71, configured to obtain the number of formula delimiters of a formula in text recognition and to determine whether that number is even.

A head determination module 72, configured to, if the number is even, determine the position of the head of each formula fragment according to the type of the character preceding the fragment's first formula delimiter.

A tail determination module 73, configured to determine the position of the tail of each formula fragment according to the type of the character following the fragment's last formula delimiter.

A symbol deletion module 74, configured to delete the redundant formula delimiters to obtain the complete LaTeX-format formula.

In LaTeX format, every mathematical formula should be placed between formula delimiters $$, but LaTeX-format text crawled from websites often splits one complete formula into several formula fragments.

For example, the crawled LaTeX-format text “Problem 1:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$” contains the formula fragments $$\frac{1}{2}$$, $$\sqrt{12}$$, and $$^{-1}$$, whereas the complete formula should be $$|-\frac{1}{2}|+\sqrt{12}-2^{-1}$$.

Because the crawled LaTeX-format text contains only fragments of the formula, a problem that should contain one formula is displayed as three formulas. In “Problem 1:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$”, the formula delimiter $$ occurs 6 times. These 6 delimiters are grouped in pairs, and the text between each pair is a LaTeX-format formula fragment. The device checks whether the number of formula delimiters is even; in “|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$” the 6 delimiters enclose 3 formula fragments.

If the number of formula delimiters is even, the head position of each formula fragment can be determined from the type of the character that precedes the fragment's first formula delimiter.

If the number of formula delimiters is even, the tail position of each formula fragment can be determined from the type of the character that follows the fragment's last formula delimiter.

The redundant formula delimiters between the formula fragments are deleted so that the formulas on both sides are joined, yielding the complete LaTeX-format formula.

Therefore, the present application can automatically merge LaTeX-format formula fragments into LaTeX-format formulas, which saves the labor cost of image recognition and improves recognition efficiency.

In another specific embodiment of the present application, referring to FIG. 8, the device further includes:

A symbol insertion module 75, configured to, if the number is odd, search for the character preceding each formula delimiter that is not contained in a formula fragment, or the preceding formula delimiter, and insert a formula delimiter after that character or formula delimiter.

If the number of formula delimiters is odd, then for each formula delimiter $$, search backward from it for the first character that is not part of a formula, or for the first formula delimiter $$, and insert a formula delimiter $$ after it.

Consider Problem 2 below:

Calculate:|-$$\frac{1}{2}$$|+$$ (Problem 2)

For the first formula delimiter $$, the first preceding character that is not part of a formula is the colon “:”, so after the first pass the problem becomes:

Calculate:$$|-$$\frac{1}{2}$$|+$$

Handling the remaining delimiters in the same way, the whole problem becomes:

Calculate:$$|-$$$$\frac{1}{2}$$$$|+$$

After this processing, the number of formula delimiters $$ in the problem is even, and the head determination module 72 is invoked.

In another specific embodiment of the present application, referring to FIG. 9, the head determination module 72 includes:

A first character judging unit 721, configured to detect the type of the first character, that is, the character preceding the first formula delimiter of each formula fragment.

A first search ending unit 722, configured to, if the first character is a Chinese character, a formula delimiter, or a punctuation mark, end the forward search and determine that the character following the formula delimiter is the head of the formula fragment.

A first character swapping unit 723, configured to, if the first character is a digit, a letter, or a mathematical symbol, swap the positions of the formula delimiter and the first character, and continue detecting forward to determine the head of the formula fragment.

A left parenthesis search unit 724, configured to, if the first character is a right parenthesis, determine the position of the head of the formula fragment according to whether a left parenthesis is found by searching forward.

Specifically, if the first character is a Chinese character, a formula delimiter, or a punctuation mark, the first character is not part of the formula fragment. For example, in “(1)、$$a+b$$” the first character before the first formula delimiter $$ is “、”, so the character “a” following the formula delimiter $$ is determined to be the head of the formula fragment.

Specifically, if the first character is a digit, a letter, or a mathematical symbol, the first character is part of the formula fragment. For example, in “6+$$5+9$$” the first character before the first formula delimiter $$ is “+”, so swapping the formula delimiter and the first character yields “6$$+5+9$$”. Detection then continues forward: the next character, “6”, is also swapped with the formula delimiter, yielding “$$6+5+9$$”.

Specifically, referring to FIG. 10, the left parenthesis search unit 724 includes:

A first judging subunit 724a, configured to, if the first character is a right parenthesis, search forward and determine whether a left parenthesis is found.

A first obtaining subunit 724b, configured to, if no left parenthesis is found, end the forward search and determine that the character following the formula delimiter is the head of the formula fragment.

A first non-obtaining subunit 724c, configured to, if a left parenthesis is found and the characters between the left and right parentheses are letters and/or mathematical symbols, optionally together with digits, insert the formula delimiter before the left parenthesis.

In this embodiment, when the first character is a right parenthesis, a forward search is needed to decide whether the formula delimiter should be moved in front of the left parenthesis that precedes the right parenthesis; that is, the head of the formula fragment is determined according to whether a left parenthesis can be found.

In another specific embodiment of the present application, referring to FIG. 11, the tail determination module 73 includes:

A second character judging unit 731, configured to detect the type of the second character, that is, the character following the last formula delimiter of each formula fragment.

A second search ending unit 732, configured to, if the second character is a Chinese character, a formula delimiter, or a punctuation mark, end the backward search and determine that the character preceding the formula delimiter is the tail of the formula fragment.

A second character swapping unit 733, configured to, if the second character is a letter, a digit, or a mathematical symbol, swap the positions of the formula delimiter and the second character, and continue detecting backward to determine the tail of the formula fragment.

A right parenthesis search unit 734, configured to, if the second character is a left parenthesis, determine the position of the tail of the formula fragment according to whether a right parenthesis is found by searching backward.

Specifically, if the second character is a Chinese character, a formula delimiter, or a punctuation mark, the second character is not part of the formula fragment. For example, in “$$a+b$$、” the second character after the last formula delimiter $$ is “、”, so the character “b” preceding the formula delimiter $$ is determined to be the tail of the formula fragment.

Specifically, if the second character is a digit, a letter, or a mathematical symbol, the second character is part of the formula fragment. For example, in “$$5+9$$-2” the second character after the last formula delimiter $$ is “-”, so swapping the formula delimiter and the second character yields “$$5+9-$$2”. Detection then continues backward: the next character, “2”, is also swapped with the formula delimiter, yielding “$$5+9-2$$”.

Specifically, referring to FIG. 12, the right parenthesis search unit 734 includes:

A second judging subunit 734a, configured to, if the second character is a left parenthesis, search backward and determine whether a right parenthesis is found.

A second obtaining subunit 734b, configured to, if no right parenthesis is found, end the backward search and determine that the character preceding the formula delimiter is the tail of the formula fragment.

A second non-obtaining subunit 734c, configured to, if a right parenthesis is found and the characters between the left and right parentheses are letters and/or mathematical symbols, optionally together with digits, insert the formula delimiter after the right parenthesis.

In this embodiment, when the second character is a left parenthesis, a backward search is needed to decide whether the formula delimiter should be moved behind the right parenthesis that follows the left parenthesis; that is, the tail of the formula fragment is determined according to whether a right parenthesis can be found.

In another specific embodiment of the present application, the redundant formula delimiters are specifically two consecutive formula delimiters.

After processing by the above modules, two formula delimiters $$ may end up adjacent to each other, that is, “$$$$”. This indicates that the text both before and after the “$$$$” is genuine formula content, so the “$$$$” can be deleted directly to join the formulas on both sides.

FIG. 13 is a schematic diagram of the hardware structure of an electronic device that performs the LaTeX-format formula processing method in text recognition of the present application. As shown in FIG. 13, the device includes:

One or more processors 1310 and a memory 1320; one processor 1310 is taken as an example in FIG. 13.

The device that performs the LaTeX-format formula processing method in text recognition may further include an input device 1330 and an output device 1330.

The processor 1310, the memory 1320, the input device 1330, and the output device 1330 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 13.

The memory 1320, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the LaTeX-format formula processing method in text recognition in the embodiments of the present application (for example, the list setting module 131 and the poster insertion module 132 shown in FIG. 13). By running the non-volatile software programs, instructions, and modules stored in the memory 1320, the processor 1310 executes the various functional applications and data processing of the server, that is, implements the LaTeX-format formula processing method in text recognition of the above method embodiments.

The memory 1320 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the LaTeX-format formula processing device in text recognition, and the like. In addition, the memory 1320 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 1320 optionally includes memory located remotely from the processor 1310, and such remote memory may be connected to the sound effect mode selection device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 1330 can receive input digit or character information and generate key signal inputs related to the user settings and function control of the LaTeX-format formula processing device in text recognition. The output device 1330 may include devices such as a speaker.

The one or more modules are stored in the memory 1320 and, when executed by the one or more processors 1310, perform the LaTeX-format formula processing method in text recognition of any of the above method embodiments.

The above product can execute the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to the method. For technical details not described exhaustively in this embodiment, refer to the method provided by the embodiments of the present application.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices: these devices are characterized by mobile communication capability, and their main purpose is to provide voice and data communication. Such terminals include smartphones (for example, iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing capability, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, for example iPad.

(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (for example, iPod), handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.

(4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, and manageability.

(5) Other electronic devices with data interaction capability.

下面通过本申请一具体应用场景来进一步说明本申请实现。The implementation of this application will be further described below through a specific application scenario of this application.

Referring to Figure 14, the method includes:

1401. Receive a question for text recognition.

1402. Determine whether the question contains the formula separator $$.

1403. If the question does not contain the formula separator $$, no processing is required.

1404. If the question contains the formula separator $$, count the number N of formula separators.

1405. Determine whether the number N of formula separators is even.

1406. If N is not even, search before each formula separator $$ for the position of the first character that is not in a formula, insert a formula separator $$ after that character, and proceed to step 1407.

1407. If N is even, group the formula separators in pairs to obtain the formula fragments.

1408. Determine whether all formula fragments have been processed.

1409. If all formula fragments have been processed, find and delete the repeated formula separators $$$$.

1410. If not all formula fragments have been processed, determine, for each formula fragment, whether the formula separator $$ is at the beginning of its formula fragment.

1411. If the formula separator $$ is at the beginning of its formula fragment, determine whether the character before it is formula content; if not, return to step 1408.

1412. If the character before the formula separator $$ is formula content, swap the positions of the formula separator $$ and the preceding character, and return to step 1411.

1413. If the formula separator $$ is not at the beginning of its formula fragment, determine whether the character after it is formula content; if not, return to step 1408.

1414. If the character after the formula separator $$ is formula content, swap the positions of the formula separator $$ and the following character, and return to step 1413.
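Steps 1402 to 1407 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names separator_positions and pair_fragments are hypothetical, and the odd case of step 1406 is only signalled with an exception rather than repaired.

```python
SEP = "$$"  # the formula separator named in steps 1402-1407


def separator_positions(text: str) -> list[int]:
    """Return the start index of every non-overlapping "$$" in text."""
    positions = []
    i = text.find(SEP)
    while i != -1:
        positions.append(i)
        i = text.find(SEP, i + len(SEP))
    return positions


def pair_fragments(text: str) -> list[tuple[int, int]]:
    """Group the separators two by two (step 1407).

    Each pair is (index of the opening "$$", index of the closing "$$").
    Raises ValueError when the count is odd, because the repair of
    step 1406 is not part of this sketch.
    """
    positions = separator_positions(text)
    if len(positions) % 2 != 0:
        raise ValueError("odd number of separators; step 1406 must run first")
    return list(zip(positions[0::2], positions[1::2]))
```

For the string "计算:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$;" the sketch finds six separators and three fragments whose bodies are \frac{1}{2}, \sqrt{12}, and ^{-1}.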

For example, question 1, "计算:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$;" (计算 means "calculate"), becomes the following strings in turn after a series of processing steps:

计算:$$|-\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$;

计算:$$|-\frac{1}{2}$$$$|+\sqrt{12}$$-2$$^{-1}$$;

计算:$$|-\frac{1}{2}$$$$|+\sqrt{12}$$$$-2^{-1}$$;

计算:$$|-\frac{1}{2}|+\sqrt{12}-2^{-1}$$;
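The character swapping of steps 1411 to 1414, followed by the deletion of step 1409, can be sketched as follows. The sketch assumes that ASCII letters, ASCII digits, and the symbols in MATH_SYMBOLS count as formula content; that set and all function names are illustrative assumptions, and parentheses are not handled here.

```python
SEP = "$$"
MATH_SYMBOLS = set("+-*/=<>|^_")  # assumed set of "mathematical symbols"


def is_formula_char(ch: str) -> bool:
    """ASCII letters and digits plus a few symbols count as formula content."""
    return ch in MATH_SYMBOLS or (ch.isascii() and ch.isalnum())


def swap_head_left(text: str, sep_index: int) -> str:
    """Steps 1411-1412: swap an opening "$$" with the preceding character
    for as long as that character is formula content."""
    while sep_index > 0 and is_formula_char(text[sep_index - 1]):
        ch = text[sep_index - 1]
        # "c$$" becomes "$$c": the separator moves one position to the left.
        text = text[:sep_index - 1] + SEP + ch + text[sep_index + len(SEP):]
        sep_index -= 1
    return text


def swap_tail_right(text: str, sep_index: int) -> str:
    """Steps 1413-1414: swap a closing "$$" with the following character
    for as long as that character is formula content."""
    end = sep_index + len(SEP)
    while end < len(text) and is_formula_char(text[end]):
        ch = text[end]
        # "$$c" becomes "c$$": the separator moves one position to the right.
        text = text[:sep_index] + ch + SEP + text[end + 1:]
        sep_index += 1
        end += 1
    return text


def delete_repeated(text: str) -> str:
    """Step 1409: delete "$$$$" so that neighbouring fragments join up."""
    return text.replace(SEP + SEP, "")


q1 = r"计算:|-$$\frac{1}{2}$$|+$$\sqrt{12}$$-2$$^{-1}$$;"
s = swap_head_left(q1, q1.index("$$"))      # head of the first fragment
s = swap_head_left(s, s.index(r"$$\sqrt"))  # head of the second fragment
s = swap_head_left(s, s.index("$$^"))       # head of the third fragment
print(delete_repeated(s))  # 计算:$$|-\frac{1}{2}|+\sqrt{12}-2^{-1}$$;
```

Each call to swap_head_left reproduces one of the intermediate strings of question 1, and delete_repeated produces the final one. The search stops at the full-width colon and semicolon because they are neither ASCII letters or digits nor members of the assumed symbol set.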

For example, question 2, "$$\frac{1}{4}$$a$$^{2}$$-9(b-c)$$^{2}$$的一个因式是$$\frac{1}{2}$$a-3b+3c,另一个因式是()" ("的一个因式是" means "one factor of ... is", and "另一个因式是()" means "the other factor is ( )"), becomes the following strings in turn after a series of processing steps:

$$\frac{1}{4}$$a$$^{2}$$-9(b-c)$$^{2}$$的一个因式是$$\frac{1}{2}$$a-3b+3c,另一个因式是()

$$\frac{1}{4}$$$$a^{2}$$-9(b-c)$$^{2}$$的一个因式是$$\frac{1}{2}$$a-3b+3c,另一个因式是()

$$\frac{1}{4}$$$$a^{2}$$$$-9(b-c)^{2}$$的一个因式是$$\frac{1}{2}$$a-3b+3c,另一个因式是()

$$\frac{1}{4}$$$$a^{2}$$$$-9(b-c)^{2}$$的一个因式是$$\frac{1}{2}$$a-3b+3c,另一个因式是()

$$\frac{1}{4}$$$$a^{2}$$$$-9(b-c)^{2}$$的一个因式是$$\frac{1}{2}$$a-3b+3c,另一个因式是()

$$\frac{1}{4}$$$$a^{2}$$$$-9(b-c)^{2}$$的一个因式是$$\frac{1}{2}a-3b+3c$$,另一个因式是()

$$\frac{1}{4}a^{2}-9(b-c)^{2}$$的一个因式是$$\frac{1}{2}a-3b+3c$$,另一个因式是()
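In question 2, one of the steps moves a separator across the parenthesised group (b-c). Claims 3 and 4 give the rule for a right parenthesis, and a minimal sketch of that rule follows. Several points are assumptions of the sketch rather than statements of the patent: "searching forward" is read as scanning toward the start of the text, the nearest preceding "(" is taken as the match (nested parentheses are not handled), scanning continues after the separator has been placed before "(", and MATH_SYMBOLS is an illustrative set. The names head_start and move_head are hypothetical.

```python
SEP = "$$"
MATH_SYMBOLS = set("+-*/=<>|^_")  # assumed set of "mathematical symbols"


def is_formula_char(ch: str) -> bool:
    """ASCII letters and digits plus a few symbols count as formula content."""
    return ch in MATH_SYMBOLS or (ch.isascii() and ch.isalnum())


def head_start(text: str, sep_index: int) -> int:
    """Return the index where the opening separator at sep_index should go."""
    i = sep_index
    while i > 0:
        ch = text[i - 1]
        if is_formula_char(ch):
            i -= 1  # digit, letter or symbol: keep moving toward the start
        elif ch == ")":
            left = text.rfind("(", 0, i - 1)
            inner = text[left + 1:i - 1]
            if left == -1 or not inner or not all(map(is_formula_char, inner)):
                break  # no usable "(": the head stays where it is
            i = left  # the separator is placed before the "("
        else:
            break  # Chinese character, "$", punctuation, ...
    return i


def move_head(text: str, sep_index: int) -> str:
    """Rebuild text with the opening separator moved to head_start()."""
    start = head_start(text, sep_index)
    return text[:start] + SEP + text[start:sep_index] + text[sep_index + len(SEP):]
```

Applied to the separator in front of ^{2} in "$$a^{2}$$-9(b-c)$$^{2}$$的一个因式是", move_head yields "$$a^{2}$$$$-9(b-c)^{2}$$的一个因式是", which matches the corresponding part of the third string listed for question 2.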

The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

Those skilled in the art should understand that the embodiments of this application may be provided as a method, an apparatus (device), or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

This application is described with reference to flowcharts and/or block diagrams of the method, apparatus (device), and computer program product of the embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of this application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of this application. Obviously, those skilled in the art can make various changes and variations to this application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to include them as well.

Claims (14)

1. A method for processing Lateh (LaTeX) format formulas in text recognition, characterized by comprising:
obtaining the number of formula separators of a formula in text recognition, and determining whether the number of formula separators is even;
if the number is even, determining the position of the head of each formula fragment according to the type of the character before the first formula separator of that formula fragment;
determining the position of the tail of each formula fragment according to the type of the character after the last formula separator of that formula fragment;
deleting redundant formula separators to obtain a complete Lateh (LaTeX) format formula.
2. The method according to claim 1, characterized in that the method further comprises:
if the number is odd, searching before each formula separator for a character or formula separator that is not included in a formula fragment, and inserting a formula separator after that character or formula separator.
3. The method according to claim 1, characterized in that, if the number is even, determining the position of the head of each formula fragment according to the type of the character before the first formula separator of that formula fragment comprises:
detecting the type of a first character before the first formula separator of each formula fragment;
if the first character is any one of a Chinese character, a formula separator, or a punctuation mark, ending the forward search and determining that the character after the formula separator is the position of the head of the formula fragment;
if the first character is a digit, a letter, or a mathematical symbol, swapping the positions of the formula separator and the first character, and continuing to detect forward to determine the position of the head of the formula fragment;
if the first character is a right parenthesis, determining the position of the head of the formula fragment according to whether a left parenthesis is found by searching forward.
4. The method according to claim 3, characterized in that, if the first character is a right parenthesis, determining the position of the head of the formula fragment according to whether a left parenthesis is found by searching forward comprises:
if the first character is a right parenthesis, determining whether a left parenthesis is found by searching forward;
if no left parenthesis is found, ending the forward search and determining that the character after the formula separator is the position of the head of the formula fragment;
if a left parenthesis is found, and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits, inserting the formula separator before the left parenthesis.
5. The method according to claim 1, characterized in that determining the position of the tail of each formula fragment according to the type of the character after the last formula separator of that formula fragment comprises:
detecting the type of a second character after the last formula separator of each formula fragment;
if the second character is any one of a Chinese character, a formula separator, or a punctuation mark, ending the backward search and determining that the character before the formula separator is the position of the tail of the formula fragment;
if the second character is a letter, a digit, or a mathematical symbol, swapping the positions of the formula separator and the second character, and continuing to detect backward to determine the position of the tail of the formula fragment;
if the second character is a left parenthesis, determining the position of the tail of the formula fragment according to whether a right parenthesis is found by searching backward.
6. The method according to claim 5, characterized in that, if the second character is a left parenthesis, determining the position of the tail of the formula fragment according to whether a right parenthesis is found by searching backward comprises:
if the second character is a left parenthesis, determining whether a right parenthesis is found by searching backward;
if no right parenthesis is found, ending the backward search and determining that the character after the formula separator is the position of the tail of the formula fragment;
if a right parenthesis is found, and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits, inserting the formula separator after the right parenthesis.
7. The method according to claim 6, characterized in that the redundant formula separator is specifically: two consecutive formula separators.
8. A device for processing Lateh (LaTeX) format formulas in text recognition, characterized by comprising:
a quantity judgment module, configured to obtain the number of formula separators of a formula in text recognition and to determine whether the number of formula separators is even;
a head determining module, configured to, if the number is even, determine the position of the head of each formula fragment according to the type of the character before the first formula separator of that formula fragment;
a tail determining module, configured to determine the position of the tail of each formula fragment according to the type of the character after the last formula separator of that formula fragment;
a deletion module, configured to delete redundant formula separators to obtain a complete Lateh (LaTeX) format formula.
9. The device according to claim 8, characterized in that the device further comprises:
a symbol insertion module, configured to, if the number is odd, search before each formula separator for a character or formula separator that is not included in a formula fragment, and insert a formula separator after that character or formula separator.
10. The device according to claim 9, characterized in that the head determining module comprises:
a first character judging unit, configured to detect the type of a first character before the first formula separator of each formula fragment;
a first search ending unit, configured to, if the first character is any one of a Chinese character, a formula separator, or a punctuation mark, end the forward search and determine that the character after the formula separator is the position of the head of the formula fragment;
a first character exchange unit, configured to, if the first character is a digit, a letter, or a mathematical symbol, swap the positions of the formula separator and the first character, and continue to detect forward to determine the position of the head of the formula fragment;
a left parenthesis searching unit, configured to, if the first character is a right parenthesis, determine the position of the head of the formula fragment according to whether a left parenthesis is found by searching forward.
11. The device according to claim 10, characterized in that the left parenthesis searching unit comprises:
a first judging subunit, configured to, if the first character is a right parenthesis, determine whether a left parenthesis is found by searching forward;
a first obtaining subunit, configured to, if no left parenthesis is found, end the forward search and determine that the character after the formula separator is the position of the head of the formula fragment;
a first non-obtaining subunit, configured to, if a left parenthesis is found, and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits, insert the formula separator before the left parenthesis.
12. The device according to claim 11, characterized in that the tail determining module comprises:
a second character judging unit, configured to detect the type of a second character after the last formula separator of each formula fragment;
a second search ending unit, configured to, if the second character is any one of a Chinese character, a formula separator, or a punctuation mark, end the backward search and determine that the character before the formula separator is the position of the tail of the formula fragment;
a second character exchange unit, configured to, if the second character is a letter, a digit, or a mathematical symbol, swap the positions of the formula separator and the second character, and continue to detect backward to determine the position of the tail of the formula fragment;
a right parenthesis searching unit, configured to, if the second character is a left parenthesis, determine the position of the tail of the formula fragment according to whether a right parenthesis is found by searching backward.
13. The device according to claim 12, characterized in that the right parenthesis searching unit comprises:
a second judging subunit, configured to, if the second character is a left parenthesis, determine whether a right parenthesis is found by searching backward;
a second obtaining subunit, configured to, if no right parenthesis is found, end the backward search and determine that the character after the formula separator is the position of the tail of the formula fragment;
a second non-obtaining subunit, configured to, if a right parenthesis is found, and the characters between the right parenthesis and the left parenthesis are letters and/or mathematical symbols, or letters and/or mathematical symbols together with digits, insert the formula separator after the right parenthesis.
14. The device according to claim 13, characterized in that the redundant formula separator is specifically: two consecutive formula separators.
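The module structure of claims 8 and 9 could be wired together roughly as follows. This is only a sketch: the class name FormulaProcessor, its field names, and the order in which the head and tail modules run are hypothetical, and the callables passed in the usage example are simple stand-ins rather than real head or tail logic.

```python
from dataclasses import dataclass
from typing import Callable

Step = Callable[[str], str]


@dataclass
class FormulaProcessor:
    """Hypothetical wiring of the four modules named in claim 8."""

    count_is_even: Callable[[str], bool]  # quantity judgment module
    fix_heads: Step                       # head determining module
    fix_tails: Step                       # tail determining module
    delete_redundant: Step                # deletion module

    def process(self, text: str) -> str:
        if not self.count_is_even(text):
            # The symbol insertion module of claim 9 would handle this
            # case; it is outside this sketch, so the text is returned as is.
            return text
        return self.delete_redundant(self.fix_tails(self.fix_heads(text)))


# Usage with stand-in callables (identity functions for head and tail).
processor = FormulaProcessor(
    count_is_even=lambda t: t.count("$$") % 2 == 0,
    fix_heads=lambda t: t,
    fix_tails=lambda t: t,
    delete_redundant=lambda t: t.replace("$$$$", ""),
)
print(processor.process("$$a$$$$b$$"))  # $$ab$$
```

Passing the modules in as callables keeps each one independently replaceable, which mirrors the description's remark that the modules may or may not be physically separate.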
CN201611227736.8A 2016-12-27 2016-12-27 Lateh format formula processing method and device in text recognition Active CN108241847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611227736.8A CN108241847B (en) 2016-12-27 2016-12-27 Lateh format formula processing method and device in text recognition

Publications (2)

Publication Number Publication Date
CN108241847A true CN108241847A (en) 2018-07-03
CN108241847B CN108241847B (en) 2021-02-26

Family

ID=62702564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611227736.8A Active CN108241847B (en) 2016-12-27 2016-12-27 Lateh format formula processing method and device in text recognition

Country Status (1)

Country Link
CN (1) CN108241847B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507067A (en) * 2019-01-31 2020-08-07 北京易真学思教育科技有限公司 Method for obtaining formula pictures, method and device for transferring formula pictures
CN113139547A (en) * 2020-01-20 2021-07-20 阿里巴巴集团控股有限公司 Text recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108241847B (en) 2021-02-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant