WO2020232866A1 - Scanned text segmentation method and apparatus, computer device and storage medium - Google Patents

Publication number
WO2020232866A1
WO2020232866A1 PCT/CN2019/102549 CN2019102549W WO2020232866A1 WO 2020232866 A1 WO2020232866 A1 WO 2020232866A1 CN 2019102549 W CN2019102549 W CN 2019102549W WO 2020232866 A1 WO2020232866 A1 WO 2020232866A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
characters
character
line
parameters
Prior art date
Application number
PCT/CN2019/102549
Other languages
French (fr)
Chinese (zh)
Inventor
许剑勇
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2020232866A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to a scanned text segmentation method and apparatus, a computer device, and a storage medium.
  • Conventionally, paper text is scanned to obtain a picture containing text content, and the text content in the picture is recognized through intelligent recognition technology to obtain editable text. However, the inventor realized that traditional intelligent recognition methods can only recognize the text content of the picture. If the corresponding paragraphs of the text content need to be located or analyzed for further processing, these methods cannot determine the beginning and end positions within the text content, so inaccurate segmentation may lead to errors in subsequent text content processing.
  • A scanned text segmentation method and apparatus, a computer device, and a storage medium are provided.
  • A scanned text segmentation method, including:
  • the vertex parameters of each line of characters in the text page include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation criterion;
  • A scanned text segmentation apparatus, including:
  • a picture acquisition module, configured to acquire a picture containing text content;
  • a content conversion module, configured to perform text recognition on the picture to obtain a text page, the text page containing characters consistent with the arrangement order of the text content;
  • a vertex parameter acquisition module, configured to acquire the vertex parameters of each line of characters in the text page, where the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation criterion;
  • a standard parameter obtaining module, configured to identify the longest line of characters in the text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters;
  • a difference calculation module, configured to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters; and
  • a segmentation module, configured to determine the target character in each line whose difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.
  • A computer device, including a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
  • the vertex parameters of each line of characters in the text page include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation criterion;
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the vertex parameters of each line of characters in the text page include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation criterion;
  • Fig. 1 is an application scenario diagram of a scanning text segmentation method according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a scanning text segmentation method according to one or more embodiments.
  • Fig. 3 is a schematic flowchart of a method for obtaining a preset value according to one or more embodiments.
  • Fig. 4 is a block diagram of an apparatus for scanning text segmentation according to one or more embodiments.
  • Fig. 5 is a block diagram of a computer device according to one or more embodiments.
  • the scanning text segmentation method provided in this application can be applied to the application environment as shown in FIG. 1.
  • The terminal 102 communicates with the server 104 through a network. After receiving a scan request from the user, the terminal 102 obtains an image of the target document, converts it into the corresponding target text, and then segments the target text.
  • The terminal 102 sends the segmented target text and any problems found during segmentation to the server 104.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for scanning text segmentation is provided. Taking the method applied to the terminal 102 in FIG. 1 as an example for description, the method includes the following steps:
  • the picture with text content is the picture obtained by the terminal taking or scanning the target document to be converted through the scanning device.
  • Target documents are documents that users want to scan and convert into editable text, such as legal documents or technical documents.
  • the scanning device is a built-in or external scanning device of the terminal, such as a camera of a mobile phone or a computer, or a scanner connected to the computer.
  • The area to be scanned is the area whose content the user wants to scan through the terminal: when the scanning device is the camera of a mobile phone or computer, the area to be scanned is the shooting area of the camera; when the scanning device is an external scanner, the area to be scanned is the scan area of the scanner.
  • When the user has a scanning requirement, the user can place the document to be scanned and recognized in the area to be scanned of the scanning device built into or connected to the terminal; the scanning device then captures an image of the target document to obtain a picture containing text content.
  • S204 Perform text recognition on the picture to obtain a text page, and the text page contains characters consistent with the arrangement order of the text content.
  • The text content in the picture can be converted into editable text (or characters), in the same arrangement order as in the picture, by a content recognition device built into or external to the terminal, to obtain a text page.
  • A content recognition device is a device that converts text content in a picture into an editable text page, for example an OCR device. OCR (Optical Character Recognition) examines the characters in the picture, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate the shapes into computer text.
  • the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are vertex parameters used to determine a segmentation criterion.
  • The vertex parameters of each line of characters are parameters that characterize the position, height, and width of the line in the text page.
  • For example, they can be the distances between the upper, lower, left, and right boundaries of the line and the corresponding boundaries of the display interface, expressed as left (the distance between the leftmost end of the first character of the line and the left border of the display interface), top (the distance between the uppermost end of the line and the upper border of the display interface), and width (the left-to-right width of the line).
  • The vertex parameters of each line of characters can also be the position of the line in a coordinate system set along the length and width of the text page. For example, with the lower left corner of the text page as the origin of the length and width directions, the position of each line of characters can be expressed as four sets of coordinate parameters relative to this origin.
  • The vertex parameters of each line of characters include the first set of vertex parameters and the second set of vertex parameters.
  • The second set of vertex parameters are the vertex parameters used to determine the segmentation standard. In the above examples, they can be the parameters that represent the width of a line of characters, such as left and width, or the difference in the width direction among the four coordinate parameters of a line.
  • the vertex parameters of each line of characters in the text page can be adjusted according to the needs of technicians or the recognition method of the terminal, and are not limited to the above examples.
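  • The per-line vertex parameters described above can be pictured as a small record. The following is a minimal sketch, not the applicant's data structure: the class name LineBox, the height field, and the choice of (top, height) as the first set are assumptions made for illustration; only left, top, and width come from the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LineBox:
    """Vertex parameters of one recognized line, in pixels (hypothetical layout)."""
    text: str
    left: float          # distance from the left border to the first character
    top: float           # distance from the upper border to the top of the line
    width: float         # left-to-right width of the line
    height: float = 0.0  # assumed extra field, not named in the description

    @property
    def right(self) -> float:
        # Right end of the last character, derived from left and width.
        return self.left + self.width

    def first_set(self) -> Tuple[float, float]:
        # Assumption: the first set holds the vertical placement of the line.
        return (self.top, self.height)

    def second_set(self) -> Tuple[float, float]:
        # The second set carries the horizontal extent used for segmentation.
        return (self.left, self.width)


box = LineBox("Example line", left=40.0, top=100.0, width=500.0, height=20.0)
print(box.second_set(), box.right)
```

  • Keeping the second set separate makes it easy to pass only the horizontal extent to the later comparison steps.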
  • When the terminal performs text conversion through the content recognition device, it needs to analyze the position of each line of characters within the text page.
  • The terminal can record the vertex parameters of each line of characters directly in the text page through the content recognition device.
  • The vertex parameters may include two sets of vertex parameters for positioning each line of characters in the text page, where the second set is the vertex parameters related to segmentation.
  • S208 Identify the longest line of characters in the text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters.
  • The longest line of characters in a text page is the line of the text page that serves as the segmentation standard. It can be the line whose last character ends closest to the right edge of the text page, or the line with the largest width.
  • The terminal can compare the widths of the lines of characters in the text page and take the line with the largest width as the longest line of characters, or compare the distance between the right end of the last character of each line and the right edge of the text page and take the line with the smallest distance as the longest line of characters.
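  • The two ways of picking the longest line in step S208 can be sketched as follows. This is an illustrative sketch only: the Line tuple, the function names, and the page_width argument are hypothetical, and the sample values are invented to show that the two criteria may select different lines when a line is indented.

```python
from typing import NamedTuple, Sequence


class Line(NamedTuple):
    text: str
    left: float   # distance from the left border to the first character
    width: float  # left-to-right width of the line


def longest_line_by_width(lines: Sequence[Line]) -> Line:
    # Criterion 1: the line with the largest width value.
    return max(lines, key=lambda ln: ln.width)


def longest_line_by_right_edge(lines: Sequence[Line], page_width: float) -> Line:
    # Criterion 2: the line whose last character ends closest to the right edge.
    return min(lines, key=lambda ln: page_width - (ln.left + ln.width))


lines = [
    Line("A full line at the left margin", 40.0, 500.0),
    Line("A short closing line.", 40.0, 180.0),
    Line("An indented line reaching slightly further right", 72.0, 470.0),
]
print(longest_line_by_width(lines).text)
print(longest_line_by_right_edge(lines, page_width=600.0).text)
```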
  • S210 Calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters.
  • The terminal compares the second set of vertex parameters of each line in the text page with the standard parameters and calculates the difference between the two to determine whether the line needs to be segmented.
  • S212 Determine the target character in the row where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.
  • The preset value is the threshold against which the difference between the second set of vertex parameters of each line and the standard parameters is judged when deciding whether to segment. Its specific value is obtained by technicians through analysis of historical character samples; its format is consistent with that of the second set of vertex parameters of each line and of the standard parameters, and it can be set, for example, to a number of pixels.
  • The target character is the last character at the end of the line, which can be text or punctuation.
  • If the difference between the second set of vertex parameters of a line and the standard parameters is greater than the preset value, the paragraph character "/n" is added after the target character of that line; if the difference is less than or equal to the preset value, the line is considered not to end a paragraph and no segmentation symbol needs to be added.
  • the terminal judges whether the line is segmented line by line, until the target text is segmented, and the segmented text is obtained for the user to use or machine recognition, for example, so that the machine can recognize the segmentation symbol to locate the paragraph.
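  • Steps S208 to S212 can be sketched end to end as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the second set of vertex parameters is reduced to a single right-edge value (left + width), the preset value is taken as 0.12 times the widest line (inside the 0.1-0.15 range mentioned for the preset value), and a newline character is used as the segmentation mark. The names Line and segment are hypothetical.

```python
from typing import NamedTuple, Sequence


class Line(NamedTuple):
    text: str
    left: float
    width: float


def segment(lines: Sequence[Line], preset_ratio: float = 0.12) -> str:
    """Join recognized lines, adding a break after lines that end early."""
    if not lines:
        return ""
    # Standard parameter: right edge of the line that reaches furthest right.
    standard = max(ln.left + ln.width for ln in lines)
    # Assumed preset value: a fraction of the widest line, in pixels.
    preset = preset_ratio * max(ln.width for ln in lines)
    parts = []
    for ln in lines:
        parts.append(ln.text)
        difference = standard - (ln.left + ln.width)
        if difference > preset:
            # The line stops well short of a full line: treat it as a paragraph end.
            parts.append("\n")
    return "".join(parts)


sample = [
    Line("The quick brown fox ", 40.0, 500.0),
    Line("jumps.", 40.0, 120.0),
    Line("A new paragraph starts ", 40.0, 498.0),
    Line("here.", 40.0, 90.0),
]
segmented = segment(sample)
print(repr(segmented))
```

  • Lines whose right edge lies within the preset value of the standard are joined directly to the next line, so only genuinely short lines produce a break.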
  • When the difference between the second set of vertex parameters of each line of characters and the standard parameters is calculated in step S210, the vertex parameter corresponding to the start position of the next line of characters can also be obtained.
  • Based on this parameter, a paragraph break is inserted after the last character of the current line where appropriate. For text in which a new paragraph begins with a space at the start of the line, this step judges whether to segment from both the end of the current paragraph and the beginning of the next, which improves the accuracy of segmentation.
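  • One way to combine the end-of-line check with the start position of the next line is sketched below. The exact condition is not recoverable from the text above, so the rule used here (break when the next line starts at least `indent` pixels to the right of the leftmost line) is purely an assumption, as are the names Line, breaks_with_lookahead, preset, and indent.

```python
from typing import List, NamedTuple, Sequence


class Line(NamedTuple):
    text: str
    left: float
    width: float


def breaks_with_lookahead(lines: Sequence[Line], preset: float, indent: float) -> List[bool]:
    """Return, for each line, whether a paragraph break should follow it."""
    if not lines:
        return []
    standard_right = max(ln.left + ln.width for ln in lines)
    base_left = min(ln.left for ln in lines)
    flags = []
    for i, ln in enumerate(lines):
        # End-of-line criterion from step S212.
        ends_early = standard_right - (ln.left + ln.width) > preset
        # Assumed look-ahead criterion: the next line begins with an indent.
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        next_indented = nxt is not None and (nxt.left - base_left) >= indent
        flags.append(ends_early or next_indented)
    return flags


page = [
    Line("A paragraph whose last line happens to be full", 40.0, 500.0),
    Line("An indented first line of the next paragraph", 72.0, 468.0),
    Line("Short tail.", 40.0, 100.0),
]
flags = breaks_with_lookahead(page, preset=60.0, indent=20.0)
print(flags)
```

  • In this sample the first line fills the row, so the width check alone would miss the paragraph end; the indent of the following line catches it.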
  • The terminal can obtain the text content page by page as the text page and perform the above segmentation steps. After segmenting the text in one page, it continues with the next page of the target text as the text page, until all pages of the target text are segmented.
  • Alternatively, the terminal can recognize the first page and use the standard parameters obtained from it as the standard parameters of the entire text; that is, only the standard parameters of the first page need to be obtained, and the remaining pages still use them as the segmentation standard. For multiple pages of the same text, only the standard parameters of a full line on the first page need to be obtained, which improves the segmentation efficiency of the method.
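  • Reusing the first page's standard parameters for every page can be sketched as follows. This is an illustrative sketch: pages are modelled as lists of (text, left, width) tuples, the standard is again simplified to a right-edge value, and the function name segment_document is hypothetical.

```python
from typing import List, Sequence, Tuple

PageLine = Tuple[str, float, float]  # (text, left, width), an assumed layout


def segment_document(pages: Sequence[Sequence[PageLine]], preset: float) -> List[str]:
    """Segment every page against the standard measured on the first page only."""
    if not pages or not pages[0]:
        return []
    # The standard parameter is computed once, from the first page.
    standard = max(left + width for _, left, width in pages[0])
    results = []
    for page in pages:
        parts = []
        for text, left, width in page:
            parts.append(text)
            if standard - (left + width) > preset:
                parts.append("\n")
        results.append("".join(parts))
    return results


pages = [
    [("aaaa ", 0.0, 400.0), ("bb.", 0.0, 150.0)],
    [("cccc ", 0.0, 398.0), ("d.", 0.0, 60.0)],
]
segmented_pages = segment_document(pages, preset=40.0)
print(segmented_pages)
```

  • This saves one pass per page, at the cost of assuming that all pages share the same line width as the first.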
  • In the above scanned text segmentation method, the terminal obtains a text page by recognizing a scanned picture containing text content, obtains the vertex parameters of each line of the text page, determines the longest line of characters in the text page, takes the second set of vertex parameters of the longest line as standard parameters, and compares the second set of vertex parameters of each line of characters with the standard parameters in turn. When the difference is greater than the preset value, the terminal considers the line to be the ending line of a paragraph and adds a paragraph break after the target character of the line, until the whole text is segmented. In this way, the terminal can accurately segment the recognized text after recognizing the text content in the picture.
  • the way of obtaining the preset value in the above scanning text segmentation method may include:
  • S302 Obtain a character sample, and identify the type of character corresponding to the character sample.
  • Character samples are used to analyze the width of non-Chinese characters and Chinese characters in the same document, and can be scanned documents that have been processed before.
  • Character types are types of non-Chinese characters, such as letters or numbers.
  • When the terminal calculates the preset value, it needs to use historically processed segmented documents as character samples.
  • A character sample contains characters of different character types. After the terminal obtains the character sample, it first identifies the corresponding character types.
  • S304 Calculate the duty ratio of the character corresponding to each character type to the Chinese character, where the duty ratio is the ratio of the width of the character in the document to the width of the Chinese character.
  • The terminal calculates the ratio of the width of each character type other than Chinese characters in the document to the width of a Chinese character. For example, based on the sample, the terminal can roughly treat each number or letter as half the width of a Chinese character. For a more accurate calculation, the width of each digit 0-9, each letter a-z and A-Z, and other non-Chinese characters can be divided in advance by the width of a Chinese character to generate a relative-width hash table of non-Chinese characters; the terminal then calculates the duty ratio of each character type to Chinese characters from this hash table.
  • The terminal can calculate the preset value according to the character types in the character sample and their width ratios to Chinese characters. For example, historically scanned documents contain Chinese characters, letters, numbers, punctuation marks, and other characters, so full lines containing them do not align exactly; from this, a preset value that compensates for segmentation errors caused by character typesetting is obtained, which can specifically be set to 0.1-0.15 times the width of the longest line of characters.
  • the technician studies a large number of character samples, and calculates accurate preset values based on the difference in typesetting of different character types, and accurately segments the recognized text.
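  • The relative-width hash table and the resulting preset value can be sketched as below. This is a minimal sketch under assumptions: the default of 0.5 for letters and digits follows the rough estimate in the description, any character missing from the table (including punctuation) is treated as one full Chinese character width for simplicity, and the function names and the 0.12 factor are hypothetical choices within the stated 0.1-0.15 range.

```python
import string
from typing import Dict, Optional


def build_relative_widths(measured: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Relative width of non-Chinese characters, in units of one Chinese character."""
    # Rough default: every ASCII letter and digit is half a Chinese character wide.
    table = {ch: 0.5 for ch in string.ascii_letters + string.digits}
    if measured:
        table.update(measured)  # more accurate per-character measurements, if any
    return table


def estimated_width(text: str, table: Dict[str, float], cjk_width: float) -> float:
    # Characters absent from the table are assumed to be full width.
    return sum(table.get(ch, 1.0) * cjk_width for ch in text)


def preset_value(longest_line_width: float, factor: float = 0.12) -> float:
    if not 0.1 <= factor <= 0.15:
        raise ValueError("factor is expected to lie in the 0.1-0.15 range")
    return factor * longest_line_width


table = build_relative_widths()
width = estimated_width("ab中文12", table, cjk_width=20.0)
threshold = preset_value(500.0)
print(width, threshold)
```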
  • After a segmentation character is added after the target character in step S212 to obtain the segmented text, the method may further include: sending the segmented text to the server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
  • the update instruction is an instruction sent by the server to the terminal to update the preset value of the terminal, and may be an instruction to replace the local preset value of the terminal with a new preset value.
  • the terminal sends the segmented text to the server for further processing or use. If the server finds that the segment recognition of the terminal is incorrect, it can adjust the preset value according to the cause of the error.
  • The server may generate an update instruction and return it to the terminal to update the terminal's local preset value, which is then used to segment scanned documents.
  • the server detects the accuracy of the preset value based on the document segmented by the terminal, and if the preset value is not accurate, it updates it to improve the accuracy of the text segmentation processed later by the terminal.
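  • The exchange between terminal and server can be sketched as follows. How the server detects a wrong segmentation is not specified above, so comparing the number of breaks against an expected paragraph count is an assumption made only for illustration; UpdateInstruction, Terminal, server_review, and the step size are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpdateInstruction:
    preset_value: float  # new preset value the terminal should adopt


class Terminal:
    def __init__(self, preset_value: float) -> None:
        self.preset_value = preset_value

    def apply_update(self, instruction: Optional[UpdateInstruction]) -> None:
        # Replace the local preset value with the one carried by the instruction.
        if instruction is not None:
            self.preset_value = instruction.preset_value


def server_review(segmented_text: str, expected_paragraphs: int,
                  current_preset: float, step: float = 0.01) -> Optional[UpdateInstruction]:
    """Return an update instruction only when the segmentation looks wrong."""
    found = segmented_text.count("\n")
    if found == expected_paragraphs:
        return None
    # Too many breaks: raise the threshold. Too few: lower it.
    direction = 1.0 if found > expected_paragraphs else -1.0
    return UpdateInstruction(preset_value=current_preset + direction * step)


terminal = Terminal(preset_value=0.10)
instruction = server_review("a\nb\nc\n", expected_paragraphs=2,
                            current_preset=terminal.preset_value)
terminal.apply_update(instruction)
print(terminal.preset_value)
```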
  • After the vertex parameters of each line of characters in the text page are obtained in step S206, the method may further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page from the partitioned text; and continuing to identify the longest line of characters in the text page according to the vertex parameters.
  • When the terminal recognizes that the typesetting of the text in the text page varies widely, that is, the vertex parameters of the lines differ greatly (for example, the vertex at the beginning of one line or several consecutive lines differs greatly from the vertex at the beginning of the remaining lines), the terminal can partition the text page according to the magnitude of the difference, generate a new text page from each area obtained by partitioning, and segment the text in each new text page in turn.
  • Technicians can set the corresponding partitioning standard by studying historical segmented samples.
  • For example, the terminal can treat these lines as one new area and the other lines in the text page as another area.
  • the terminal generates a new text page for each area, and the terminal segments the content in the new page according to the above steps.
  • This embodiment is aimed at a situation where texts with different layouts on a page, such as newspapers, posters, etc., are segmented.
  • the text with different layouts on a page can also be accurately segmented by means of partitions.
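  • Partitioning a page by line start positions can be sketched as below. This is an illustrative sketch: grouping consecutive lines whose left offset stays within a tolerance of the first line of the group is one possible partitioning standard, chosen here as an assumption; the tuple layout and the name partition_by_left are hypothetical.

```python
from typing import List, Sequence, Tuple

PageLine = Tuple[str, float, float]  # (text, left, width), an assumed layout


def partition_by_left(lines: Sequence[PageLine], tolerance: float) -> List[List[PageLine]]:
    """Group consecutive lines whose left offsets are close to the group's first line."""
    regions: List[List[PageLine]] = []
    for line in lines:
        left = line[1]
        if regions and abs(left - regions[-1][0][1]) <= tolerance:
            regions[-1].append(line)   # same layout as the current region
        else:
            regions.append([line])     # start position jumped: open a new region
    return regions


page = [
    ("Headline", 200.0, 300.0),
    ("Column text, first line", 40.0, 250.0),
    ("Column text, second line", 42.0, 248.0),
    ("Caption", 320.0, 200.0),
]
regions = partition_by_left(page, tolerance=10.0)
print([len(region) for region in regions])
```

  • Each region can then be treated as its own text page and passed through the longest-line and difference steps separately.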
  • After a segmentation character is added after the target character in step S212 to obtain the segmented text, the method may further include: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
  • The terminal saves the segmented text, and deletes the vertex parameters of each line of characters when automatically clearing the scanned text.
  • The terminal can also obtain a delete instruction input by the user or sent by the server, and delete the vertex parameters of each line of characters in the text page accordingly.
  • The vertex parameters are the position characterization parameters of each line of text in the text page, and the amount of data is large.
  • A scanned text segmentation apparatus is provided, which includes: a picture acquisition module 100, a content conversion module 200, a vertex parameter acquisition module 300, a standard parameter acquisition module 400, a difference calculation module 500, and a segmentation module 600, of which:
  • the picture obtaining module 100 is used to obtain pictures containing text content.
  • the content conversion module 200 is configured to perform text recognition on the picture to obtain a text page, and the text page contains characters in the same order as the text content.
  • the vertex parameter acquisition module 300 is used to acquire the vertex parameters of each line of characters in the text page.
  • The vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation standard.
  • the standard parameter obtaining module 400 is configured to identify the longest line of characters in a text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters.
  • the difference calculation module 500 is used to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters.
  • the segmentation module 600 is used to determine the target character in the row where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.
  • the aforementioned scanning text segmentation device may further include:
  • the sample acquisition module is used to acquire character samples and identify the type of characters corresponding to the character samples.
  • the character type analysis module is used to calculate the duty ratio of the character corresponding to each character type to the Chinese character.
  • the duty ratio is the ratio of the width of the character in the document to the width of the Chinese character.
  • the preset value calculation module is used to calculate the preset value according to the duty ratio.
  • the aforementioned scanning text segmentation device may further include:
  • the sending module is used to send the segmented text to the server.
  • the update instruction receiving module is used to receive the update instruction returned by the server according to the segmented text.
  • the preset value update module is used to update the preset value according to the update instruction.
  • the aforementioned scanning text segmentation device may further include:
  • the partition module is used to partition the text according to the vertex parameters of each line of characters in the text page.
  • the page update module is used to generate a new text page based on the partitioned text.
  • the new page segmentation module is used to continue to identify the longest line of characters in the text page according to the vertex parameters.
  • the aforementioned scanning text segmentation device may further include:
  • the save module is used to save the segmented text.
  • the parameter deletion module is used to delete the vertex parameter of each line of characters in the text page.
  • Each module in the aforementioned scanning text segmentation device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The foregoing modules may be embedded in, or independent of, the processor in the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 5.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a scanning text segmentation method.
  • The display screen of the computer device can be a liquid crystal display or an electronic ink display.
  • The input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied.
  • A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • a computer device including a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • When the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps: acquiring a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters in the same order as the text content; acquiring the vertex parameters of each line of characters in the text page, where the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation standard; identifying the longest line of characters in the text page according to the vertex parameters, and obtaining the second set of vertex parameters of the longest line of characters as standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining the target character in each line whose difference is greater than the preset value, and adding a segmentation mark after the target character to obtain the segmented text.
  • The way of obtaining the preset value implemented when the processor executes the computer-readable instructions includes: obtaining a character sample and identifying the character type corresponding to the character sample; calculating the duty ratio of the characters of each character type to Chinese characters, where the duty ratio is the ratio of the width of the character in the document to the width of a Chinese character; and calculating the preset value according to the duty ratio.
  • When the processor executes the computer-readable instructions, after adding a segmentation character after the target character to obtain the segmented text, the steps further include: sending the segmented text to the server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
  • When the processor executes the computer-readable instructions, after acquiring the vertex parameters of each line of characters in the text page, the steps further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page from the partitioned text; and continuing to identify the longest line of characters in the text page according to the vertex parameters.
  • When the processor executes the computer-readable instructions, after adding a paragraph break after the target character to obtain the segmented text, the steps further include: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: acquiring a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters in the same order as the text content; acquiring the vertex parameters of each line of characters in the text page, where the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation standard; identifying the longest line of characters in the text page according to the vertex parameters, and obtaining the second set of vertex parameters of the longest line of characters as standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining the target character in each line whose difference is greater than the preset value, and adding a segmentation mark after the target character to obtain the segmented text.
  • The way of obtaining the preset value implemented when the computer-readable instructions are executed by the processor includes: obtaining a character sample and identifying the character type corresponding to the character sample; calculating the duty ratio of the characters of each character type to Chinese characters, where the duty ratio is the ratio of the width of the character in the document to the width of a Chinese character; and calculating the preset value according to the duty ratio.
  • When the computer-readable instructions are executed by the processor, after adding a segmentation character after the target character to obtain the segmented text, the steps further include: sending the segmented text to the server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
  • When the computer-readable instructions are executed by the processor, after acquiring the vertex parameters of each line of characters in the text page, the steps further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page from the partitioned text; and continuing to identify the longest line of characters in the text page according to the vertex parameters.
  • When the computer-readable instructions are executed by the processor, after adding a paragraph break after the target character to obtain the segmented text, the steps further include: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A scanned text segmentation method, comprising: acquiring a picture containing text content; performing text recognition of the picture to obtain a text page, the text page containing characters, the arrangement order of which is consistent with that of the text content; acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters comprising a first group of vertex parameters and a second group of vertex parameters, and the second group of vertex parameters being vertex parameters used for determining a segmentation standard; recognizing a longest line of characters in the text page according to the vertex parameters, and acquiring the second group of vertex parameters of the longest line of characters as standard parameters; calculating a difference value between the second group of vertex parameters of each line of characters and the standard parameters; determining a target character in the line where the difference value is greater than a preset value, and adding a segmentation symbol after the target character to obtain segmented text.

Description

Scanned text segmentation method and apparatus, computer device and storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 201910418522.6, filed with the Chinese Patent Office on May 20, 2019 and entitled "Scanned text segmentation method and apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to a scanned text segmentation method and apparatus, a computer device, and a storage medium.
Background
With the development of data technology, more and more information is processed and exchanged over networks, and technologies for converting paper materials into electronic formats continue to emerge.
Traditionally, a paper document is scanned to obtain a picture containing text content, and intelligent recognition technology recognizes the text content in the picture to produce editable text. However, the inventor realized that traditional intelligent recognition methods can only recognize the text content contained in the picture. When further processing is needed, such as locating or analyzing particular paragraphs of the text content, these methods cannot determine where paragraphs begin and end among the characters, so inaccurate segmentation of the text content may cause errors in subsequent processing.
Summary
According to various embodiments disclosed in this application, a scanned text segmentation method and apparatus, a computer device, and a storage medium are provided.
A scanned text segmentation method, including:
acquiring a picture containing text content;
performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content;
acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first group of vertex parameters and a second group of vertex parameters, the second group of vertex parameters being the vertex parameters used to determine the segmentation standard;
identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the vertex parameters of the longest line of characters as standard parameters;
calculating the difference between the vertex parameters of each line of characters and the standard parameters; and
determining a target character in each line where the difference is greater than a preset value, and adding a segmentation symbol after the target character to obtain the segmented text.
A scanned text segmentation apparatus, including:
a picture acquisition module, configured to acquire a picture containing text content;
a content conversion module, configured to perform text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content;
a vertex parameter acquisition module, configured to acquire vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first group of vertex parameters and a second group of vertex parameters, the second group of vertex parameters being the vertex parameters used to determine the segmentation standard;
a standard parameter acquisition module, configured to identify the longest line of characters in the text page according to the vertex parameters, and acquire the second group of vertex parameters of the longest line of characters as standard parameters;
a difference calculation module, configured to calculate the difference between the second group of vertex parameters of each line of characters and the standard parameters; and
a segmentation module, configured to determine a target character in each line where the difference is greater than a preset value, and add a segmentation symbol after the target character to obtain the segmented text.
A computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
acquiring a picture containing text content;
performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content;
acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first group of vertex parameters and a second group of vertex parameters, the second group of vertex parameters being the vertex parameters used to determine the segmentation standard;
identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second group of vertex parameters of the longest line of characters as standard parameters;
calculating the difference between the second group of vertex parameters of each line of characters and the standard parameters; and
determining a target character in each line where the difference is greater than a preset value, and adding a segmentation symbol after the target character to obtain the segmented text.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring a picture containing text content;
performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content;
acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first group of vertex parameters and a second group of vertex parameters, the second group of vertex parameters being the vertex parameters used to determine the segmentation standard;
identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second group of vertex parameters of the longest line of characters as standard parameters;
calculating the difference between the second group of vertex parameters of each line of characters and the standard parameters; and
determining a target character in each line where the difference is greater than a preset value, and adding a segmentation symbol after the target character to obtain the segmented text.
The details of one or more embodiments of this application are set forth in the drawings and description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an application scenario diagram of a scanned text segmentation method according to one or more embodiments.
Fig. 2 is a schematic flowchart of a scanned text segmentation method according to one or more embodiments.
Fig. 3 is a schematic flowchart of a way of obtaining the preset value according to one or more embodiments.
Fig. 4 is a block diagram of a scanned text segmentation apparatus according to one or more embodiments.
Fig. 5 is a block diagram of a computer device according to one or more embodiments.
Detailed Description
To make the technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
The scanned text segmentation method provided in this application can be applied in the application environment shown in Fig. 1. The terminal 102 communicates with the server 104 through a network. After receiving a scan request from the user, the terminal acquires a picture of the target document, converts it into the corresponding target text, and segments the target text. The terminal 102 then sends the segmented target text, together with any problems found during segmentation, to the server 104. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one of the embodiments, as shown in Fig. 2, a scanned text segmentation method is provided. Taking the method as applied to the terminal 102 in Fig. 1 as an example, it includes the following steps:
S202: Acquire a picture containing text content.
The picture containing text content is obtained when the terminal photographs or scans, through a scanning device, the target document to be converted. The target document is a document that the user wants to scan and convert into editable text, such as a legal document or a technical document. The scanning device is built into or externally connected to the terminal, such as the camera of a mobile phone or computer, or a scanner connected to a computer. The area to be scanned is where the user places the content to be scanned by the terminal: when the scanning device is the camera of a mobile phone or computer, the area to be scanned is the camera's shooting area; when the scanning device is an external scanner, the area to be scanned is the scanner's scanning area.
Specifically, when the user needs to scan, the user can place the document to be scanned and recognized in the area to be scanned of the terminal's built-in or connected scanning device, and the scanning device captures a picture of the target document to obtain the picture containing text content.
S204: Perform text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content.
Specifically, after the terminal captures the picture containing text content, a content recognition device built into or externally connected to the terminal can convert the text content in the picture, in the order in which it is arranged in the picture, into editable text (characters) to obtain the text page. The content recognition device is a device for converting the text content of a picture into an editable text page, and may be an OCR device or the like. An OCR (Optical Character Recognition) device examines the characters in a picture, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using character recognition methods.
S206: Acquire vertex parameters of each line of characters in the text page. The vertex parameters of each line of characters include a first group of vertex parameters and a second group of vertex parameters, the second group being the vertex parameters used to determine the segmentation standard.
The vertex parameters of each line of characters characterize the position, height and width of that line in the text page. For example, when the text page is a rectangular display interface, the vertex parameters of a line can be the distances from the line's top, bottom, left and right boundaries to the corresponding boundaries of the display interface. They can be expressed as the distance left from the leftmost end of the line's first character to the left border of the display interface, the distance top from the top of the line to the top border of the display interface, the horizontal width width of the line, and the height height of the line, written as (left, top, width, height). Alternatively, after coordinates are set along the length and width directions of the text page, the vertex parameters of a line can be its position in the page's coordinate system; for example, with the lower-left corner of the text page as the origin, the position of each line can be expressed as four sets of coordinate parameters relative to that origin. Correspondingly, the vertex parameters of each line include a first group and a second group. The second group consists of the vertex parameters used to determine the segmentation standard; it can be the parameters that characterize the width of the line in the examples above, such as the distance left together with the width width, or the difference in the width direction among the line's four sets of coordinate parameters. The vertex parameters of each line in the text page can be adjusted according to the needs of the technician or the recognition method of the terminal, and are not limited to the above examples.
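As an illustration of the (left, top, width, height) representation, the following is a minimal sketch rather than the application's implementation. The class name `LineBox`, the pixel units, and the choice of (top, height) as the first group are assumptions made here; the text only states that the second group can be the width-related parameters left and width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineBox:
    """Vertex parameters of one recognized line (hypothetical layout, pixels)."""
    left: float    # distance from the page's left border to the first character
    top: float     # distance from the page's top border to the top of the line
    width: float   # horizontal extent of the line
    height: float  # vertical extent of the line

    @property
    def first_group(self):
        # Assumption: the parameters not used for the segmentation decision.
        return (self.top, self.height)

    @property
    def second_group(self):
        # Width-related parameters used to determine the segmentation standard.
        return (self.left, self.width)

    @property
    def right(self):
        # Right end of the last character, measured from the left border.
        return self.left + self.width


box = LineBox(left=40, top=120, width=520, height=18)
print(box.second_group, box.right)
```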
Specifically, when the terminal device performs text conversion through the content recognition device, the terminal needs to analyze the position of each line of characters within the text page. The terminal can record, directly through the content recognition device, the vertex parameters of each line of characters in the text page. The vertex parameters may include two groups used to locate each line within the text page, of which the second group is the one related to segmentation.
S208: Identify the longest line of characters in the text page according to the vertex parameters, and acquire the second group of vertex parameters of the longest line as standard parameters.
The longest line of characters is the line of the text page that serves as the segmentation standard. It can be the line whose last character's right end is closest to the right border of the text page, or the line with the largest width.
Specifically, after acquiring the vertex parameters of each line of characters in the text page, the terminal can compare the widths of the lines and take the line with the largest width as the longest line, or take the line with the smallest distance between the right end of its last character and the right border of the text page as the longest line. The terminal then takes the segmentation-related second group of vertex parameters of the longest line, for example the position of the line's right end relative to the right border of the text page, as the standard for subsequently segmenting the text in the page, that is, the standard parameters.
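The two selection criteria described for step S208 can be sketched as follows. This is an illustrative sketch under assumptions: lines are (left, top, width, height) tuples in pixels, and the function names and the `page_width` parameter are hypothetical.

```python
def longest_by_width(lines):
    """Pick the line with the largest width."""
    return max(lines, key=lambda b: b[2])


def longest_by_right_gap(lines, page_width):
    """Pick the line whose right end is closest to the page's right border."""
    return min(lines, key=lambda b: page_width - (b[0] + b[2]))


def standard_parameters(line):
    """Return the second group (left, width) of the chosen line."""
    left, _top, width, _height = line
    return (left, width)


lines = [(80, 100, 470, 18), (40, 120, 520, 18), (40, 140, 300, 18)]
longest = longest_by_width(lines)
print(standard_parameters(longest))
```

On this sample both criteria select the same line; on pages where indented lines reach the right margin they can differ, so the choice between them is a design decision.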
S210: Calculate the difference between the second group of vertex parameters of each line of characters and the standard parameters.
Specifically, after obtaining in step S208 the standard parameters for deciding whether to segment, the terminal compares the second group of vertex parameters of each line in the text page with the standard parameters and calculates the difference between them, in order to decide whether the line ends a paragraph.
S212: Determine a target character in each line where the difference is greater than the preset value, and add a segmentation symbol after the target character to obtain the segmented text.
The preset value is the threshold for judging whether the difference between a line's second group of vertex parameters and the standard parameters can serve as the criterion for segmentation. Its specific value is obtained by technicians from analysis of historical character samples; its format is consistent with that of the second group of vertex parameters and the standard parameters, and it can be set, for example, to a number of pixels.
The target character is the last character of the line, which can be a word character or a punctuation mark.
Specifically, if the difference between the line's second group of vertex parameters and the standard parameters is greater than the preset value, the segmentation symbol "/n" is added after the target character of the line. If the difference is less than or equal to the preset value, the line is considered not to end a paragraph and no segmentation symbol is added. The terminal judges line by line whether each line ends a paragraph until the target text has been fully segmented, yielding the segmented text for use by the user or for machine recognition; for example, a machine can recognize the segmentation symbols to locate paragraphs.
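Steps S210 and S212 can be sketched as below. This is a minimal sketch, not the application's implementation: lines are assumed to be (left, width, text) tuples in pixels, the standard parameter is taken to be the largest right end on the page, and the function name `segment` is hypothetical. The marker defaults to "/n" because that is how the text writes the segmentation symbol.

```python
def segment(lines, preset, marker="/n"):
    """Append the marker to every line that ends well short of the standard."""
    # Standard parameter: right end of the line reaching furthest right.
    standard_right = max(left + width for left, width, _ in lines)
    pieces = []
    for left, width, text in lines:
        diff = standard_right - (left + width)
        # A difference above the preset value means the line ends a paragraph.
        pieces.append(text + marker if diff > preset else text)
    return "".join(pieces)


page = [(40, 520, "AAAA"), (40, 300, "BB"), (80, 480, "CCCC"), (40, 200, "D")]
segmented = segment(page, preset=60)
print(segmented)
```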
In addition, after the difference between a line's second group of vertex parameters and the standard parameters is calculated in step S210 and found to be greater than the preset value, the terminal can further acquire the vertex parameter corresponding to the start position of the next line of characters. When the start position of the next line is greater than the vertex parameter corresponding to the start position of the longest line, a segmentation symbol is inserted after the last character of the current line. This step targets text in which the first line of a paragraph is indented: whether to segment is decided from both the end of the current paragraph and the start of the next one, which improves segmentation accuracy.
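The combined end-of-line and next-line-indent check can be sketched as follows. The sketch assumes lines are (left, width) tuples in pixels; the function name is hypothetical, and treating the last line of a page as a paragraph end is an assumption made here, since the text does not cover that case.

```python
def ends_paragraph(i, lines, std_left, std_right, preset):
    """Decide whether line i closes a paragraph."""
    left, width = lines[i]
    if std_right - (left + width) <= preset:
        return False  # line runs to the margin, so the paragraph continues
    if i + 1 == len(lines):
        return True   # assumption: a short last line ends a paragraph
    # The next line must start further right than the longest line does.
    return lines[i + 1][0] > std_left


lines = [(80, 480), (40, 520), (40, 300), (80, 480), (40, 350), (40, 200)]
std_left, std_width = max(lines, key=lambda b: b[1])  # longest line by width
std_right = std_left + std_width
flags = [ends_paragraph(i, lines, std_left, std_right, 60)
         for i in range(len(lines))]
print(flags)
```

In this sample the fifth line is short but is followed by an unindented line, so it is not treated as a paragraph end.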
Optionally, when the text to be segmented spans multiple pages, the terminal can acquire the text content page by page as the text page and perform the segmentation steps above; after the text of one page has been segmented, it acquires the next page of the target text as the text page, until all pages of the target text have been segmented. In addition, for a set of scanned pages with the same layout, the terminal can recognize only the first page and use the standard parameters obtained from it for the whole text: only the standard parameters of the first page's full line need to be acquired, and the remaining pages reuse them as the segmentation standard, which improves the efficiency of the method.
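Reusing the first page's standard parameter across pages with the same layout might look like the sketch below. It assumes each page is a list of (left, width, text) tuples in pixels and that the standard parameter is the largest right end on the first page; the function name is hypothetical.

```python
def segment_pages(pages, preset, marker="/n"):
    """Segment every page against the standard taken from the first page."""
    # Computed once, then reused for all pages that share the layout.
    standard_right = max(left + width for left, width, _ in pages[0])
    result = []
    for page in pages:
        result.append("".join(
            text + marker if standard_right - (left + width) > preset else text
            for left, width, text in page
        ))
    return result


pages = [
    [(40, 520, "A"), (40, 100, "B")],
    [(40, 400, "C"), (40, 510, "D")],
]
segmented_pages = segment_pages(pages, preset=60)
print(segmented_pages)
```

Note that the second page contains no full-width line of its own, which is exactly the case where a per-page standard would misjudge line "C".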
In the scanned text segmentation method above, the terminal recognizes the scanned picture containing text content to obtain a text page, acquires the vertex parameters of each line of characters in the text page, determines the longest line of characters, and takes the second group of vertex parameters of the longest line as the standard parameters. It then compares the second group of vertex parameters of each line with the standard parameters in turn. When the difference is greater than the preset value corresponding to the segmentation standard, the terminal treats the line as the last line of a paragraph and adds a segmentation symbol after its target character, until the text has been fully segmented. With this method, after recognizing the text content of the picture, the terminal can also segment the recognized text accurately.
In one of the embodiments, referring to Fig. 3, the way of obtaining the preset value in the scanned text segmentation method above may include:
S302: Obtain character samples, and identify the character types corresponding to the character samples.
Character samples are samples used to analyze the width occupied in a line by non-Chinese characters relative to Chinese characters in the same document; they can be previously processed scanned documents. A character type is a type of non-Chinese character, such as letters or digits.
Specifically, when calculating the preset value, the terminal uses previously processed segmented documents as character samples. These samples contain characters of different types; after obtaining the character samples, the terminal first identifies the corresponding character types.
S304: Calculate, for each character type, the width ratio of its characters to Chinese characters, the width ratio being the ratio of the width a character occupies in the document to the width a Chinese character occupies.
Specifically, the terminal calculates, for each character type other than Chinese characters, the ratio of the width it occupies in the document to the width a Chinese character occupies. For example, from the samples the terminal can roughly estimate that each digit or letter occupies half the width of a Chinese character. For greater precision, the width of each non-Chinese character, such as each digit 0-9 and each letter a-z and A-Z, can be divided in advance by the width of a Chinese character to build a relative-width hash table of non-Chinese characters; the terminal then uses this table to calculate the width ratio of the characters of each character type to Chinese characters.
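The relative-width hash table can be sketched as a plain dictionary. This is a sketch under assumptions: the function names are hypothetical, measured widths are in pixels, and characters missing from the table are treated as full-width (ratio 1.0), which is a simplification made here.

```python
import string


def rough_relative_widths():
    """Rough estimate from the text: each digit or letter is half a Chinese character."""
    return {ch: 0.5 for ch in string.digits + string.ascii_letters}


def build_relative_widths(measured, hanzi_width):
    """Precise variant: divide each measured width by the Chinese character width."""
    return {ch: width / hanzi_width for ch, width in measured.items()}


def line_width_in_hanzi(text, table):
    """Width of a line in units of one Chinese character."""
    # Assumption: any character not in the table counts as full width.
    return sum(table.get(ch, 1.0) for ch in text)


rough = rough_relative_widths()
precise = build_relative_widths({"a": 9, "W": 15}, hanzi_width=18)
print(line_width_in_hanzi("扫描OCR文本", rough))
```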
S306: Calculate the preset value according to the width ratio.
Specifically, the terminal can calculate the preset value from the width ratios of the different character types to Chinese characters in the character samples. For example, historically scanned documents contain Chinese characters, letters, digits, punctuation marks and other characters; from how closely their full lines align, the terminal derives a preset value that compensates for segmentation errors caused by character typesetting. Specifically, the preset value can be set to 0.1 to 0.15 times the width of the longest line of characters.
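As a worked example of the stated range, the sketch below computes the bounds of the preset value from the width of the longest line. The function name and the pixel units are assumptions; only the factors 0.1 and 0.15 come from the text.

```python
import math


def preset_range(longest_width, low=0.1, high=0.15):
    """Return the (minimum, maximum) preset value for a given longest line width."""
    return (longest_width * low, longest_width * high)


# For a longest line 520 pixels wide, the preset value falls between
# roughly 52 and 78 pixels.
low_px, high_px = preset_range(520)
print(low_px, high_px)
```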
上述实施例中,技术人员通过对大量字符样本进行研究,通过不同字符种类的排版的差异,计算出准确的预设值,对识别出的文本准确分段。In the above-mentioned embodiment, the technician studies a large number of character samples, and calculates accurate preset values based on the difference in typesetting of different character types, and accurately segments the recognized text.
In some embodiments, after the paragraph break is added after the target character in step S212 to obtain the segmented text, the method may further include: sending the segmented text to a server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
The update instruction is an instruction that the server sends to the terminal to update the terminal's preset value. It may be an instruction to replace the terminal's local preset value with a new preset value.
Specifically, the terminal sends the segmented text to the server for further processing or use. If the server finds that the terminal's segmentation is wrong, it can adjust the preset value according to the cause of the error. The server can generate an update instruction and return it to the terminal, so that the terminal's local preset value is updated and used to segment subsequently scanned documents.
In the above embodiment, the server checks the accuracy of the preset value against the documents segmented by the terminal and, if the preset value is inaccurate, updates it, which improves the accuracy with which the terminal segments text processed later.
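The update exchange between server and terminal might look like the sketch below. The class names, the `server_review` function, and the way the server decides that a segmentation is wrong are all hypothetical; the source only states that the server returns an instruction carrying a new preset value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpdateInstruction:
    """Hypothetical message carrying the new preset value."""
    new_preset: float


@dataclass
class Terminal:
    """Holds the terminal's local preset value."""
    preset: float

    def apply_update(self, instruction: Optional[UpdateInstruction]) -> None:
        """Replace the local preset value with the one in the instruction."""
        if instruction is None:
            return  # the server found nothing to correct
        if instruction.new_preset <= 0:
            raise ValueError("new_preset must be positive")
        self.preset = instruction.new_preset


def server_review(segmentation_is_wrong: bool,
                  corrected_preset: float) -> Optional[UpdateInstruction]:
    """Assumed server-side check: issue an update only when an error is found."""
    if segmentation_is_wrong:
        return UpdateInstruction(new_preset=corrected_preset)
    return None
```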
In one embodiment, after the vertex parameters of each line of characters in the text page are obtained in step S206, the method may further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating new text pages from the partitioned text; and continuing with the step of identifying the longest line of characters in the text page according to the vertex parameters.
Specifically, the terminal may recognize that the typesetting of the text in the text page varies widely, that is, that the vertex parameters of the lines differ widely. For example, the line-start vertex parameters of one line or several consecutive lines may differ widely from those of the remaining lines, or the line-end vertex parameters of one line or several consecutive lines may differ widely from those of the remaining lines. In that case, the terminal can partition the text page according to the size of the difference, generate a new text page from each resulting region, and segment the text in each new text page in turn. A technician can set the corresponding partitioning criterion by studying historical segmentation samples. For example, when one or both sets of vertex parameters of one line or several consecutive lines differ from the corresponding vertex parameters of the other lines by more than a preset pixel value, the terminal can treat those lines as one new region and the other lines in the text page as another region. The terminal generates a new text page for each region and then segments the content of each new page according to the steps above. This embodiment addresses pages that contain text with different layouts, such as newspapers and posters. In the above embodiment, text with different layouts on one page can also be segmented accurately by partitioning.
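The partitioning step can be sketched as below. This is a simplified illustration under stated assumptions: each line's first set of vertex parameters is reduced to a single line-start x coordinate, only line starts are compared (the text also allows comparing line ends), and the `ScanLine` type, the function name, and the 100-pixel threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScanLine:
    text: str
    left_x: float   # line start, standing in for the first set of vertex parameters
    right_x: float  # line end, standing in for the second set of vertex parameters


def partition_by_line_start(lines, max_diff_px=100.0):
    """Group consecutive lines into regions.

    A line joins the current region when its line start is within
    max_diff_px of the region's first line; otherwise it opens a new region.
    Each returned region would become its own text page.
    """
    regions = []
    for line in lines:
        if regions and abs(line.left_x - regions[-1][0].left_x) <= max_diff_px:
            regions[-1].append(line)
        else:
            regions.append([line])
    return regions
```

The threshold should be larger than an ordinary first-line indent, so that an indented paragraph opening does not start a new region.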
In one embodiment, after the paragraph break is added after the target character in step S212 to obtain the segmented text, the method may further include: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
Specifically, once the content of the text page has been segmented, the terminal saves the segmented text and automatically deletes the vertex parameters of each line of characters that were obtained while recognizing the scanned text. Optionally, the terminal may delete the vertex parameters of each line of characters in the text page only after it receives a delete instruction entered by the user or sent by the server.
In the above embodiment, because the vertex parameters characterize the position of every line of text in the text page, the amount of data is large. After the terminal finishes segmenting the text in the text page, the vertex parameters obtained while segmenting that page should be deleted, which improves the running speed of the terminal.
It should be understood that, although the steps in the flowcharts of FIGS. 2-3 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times. Nor are they necessarily executed one after another; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 4, a scanned-text segmentation apparatus is provided, which includes a picture acquisition module 100, a content conversion module 200, a vertex parameter acquisition module 300, a standard parameter acquisition module 400, a difference calculation module 500, and a segmentation module 600, wherein:
The picture acquisition module 100 is configured to acquire a picture containing text content.
The content conversion module 200 is configured to perform text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content.
The vertex parameter acquisition module 300 is configured to acquire the vertex parameters of each line of characters in the text page. The vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, the second set being the vertex parameters used to determine the segmentation criterion.
The standard parameter acquisition module 400 is configured to identify the longest line of characters in the text page according to the vertex parameters and to take the second set of vertex parameters of the longest line of characters as the standard parameters.
The difference calculation module 500 is configured to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters.
The segmentation module 600 is configured to determine a target character in each line whose difference is greater than the preset value and to add a paragraph break after the target character to obtain the segmented text.
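The cooperation of modules 400, 500 and 600 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the applicant's implementation: the second set of vertex parameters is reduced to the x coordinate of the line end, the target character is taken to be the last character of the line, the paragraph break is a newline, and the names `TextLine` and `segment_page` are hypothetical.

```python
from dataclasses import dataclass

PARAGRAPH_BREAK = "\n"  # assumed representation of the paragraph break


@dataclass
class TextLine:
    text: str
    left_x: float   # stands in for the first set of vertex parameters
    right_x: float  # stands in for the second set of vertex parameters


def segment_page(lines, preset):
    """Join recognized lines, adding a paragraph break after short lines."""
    if not lines:
        return ""
    # Module 400: the longest line supplies the standard parameter.
    longest = max(lines, key=lambda ln: ln.right_x - ln.left_x)
    standard = longest.right_x
    parts = []
    for line in lines:
        parts.append(line.text)
        # Module 500: difference between this line's end and the standard.
        difference = abs(standard - line.right_x)
        # Module 600: a large difference marks the last line of a paragraph.
        if difference > preset:
            parts.append(PARAGRAPH_BREAK)
    return "".join(parts)
```

With a preset value of 100, a line ending 400 pixels short of the longest line receives a break, while a line ending only 10 pixels short is treated as a full line.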
In some embodiments, the scanned-text segmentation apparatus may further include:
A sample acquisition module, configured to acquire character samples and identify the character types corresponding to the character samples.
A character type analysis module, configured to calculate the width ratio between the characters of each character type and a Chinese character, the width ratio being the ratio of the width a character occupies in the document to the width a Chinese character occupies.
A preset value calculation module, configured to calculate the preset value according to the width ratio.
In some embodiments, the scanned-text segmentation apparatus may further include:
A sending module, configured to send the segmented text to a server.
An update instruction receiving module, configured to receive an update instruction returned by the server according to the segmented text.
A preset value update module, configured to update the preset value according to the update instruction.
In some embodiments, the scanned-text segmentation apparatus may further include:
A partition module, configured to partition the text according to the vertex parameters of each line of characters in the text page.
A page update module, configured to generate new text pages from the partitioned text.
A new page segmentation module, configured to continue with identifying the longest line of characters in the text page according to the vertex parameters.
In some embodiments, the scanned-text segmentation apparatus may further include:
A saving module, configured to save the segmented text.
A parameter deletion module, configured to delete the vertex parameters of each line of characters in the text page.
For the specific limitations of the scanned-text segmentation apparatus, refer to the limitations of the scanned-text segmentation method above, which are not repeated here. Each module of the scanned-text segmentation apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in or independent of a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In some embodiments, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a scanned-text segmentation method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: acquiring a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content; acquiring the vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set being the vertex parameters used to determine the segmentation criterion; identifying the longest line of characters in the text page according to the vertex parameters and taking the second set of vertex parameters of the longest line of characters as the standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining a target character in each line whose difference is greater than the preset value and adding a paragraph break after the target character to obtain the segmented text.
In some embodiments, the preset value obtained when the processor executes the computer-readable instructions is obtained by: acquiring character samples and identifying the character types corresponding to the character samples; calculating the width ratio between the characters of each character type and a Chinese character, the width ratio being the ratio of the width a character occupies in the document to the width a Chinese character occupies; and calculating the preset value according to the width ratio.
In some embodiments, after the paragraph break is added after the target character to obtain the segmented text, the processor executing the computer-readable instructions further performs: sending the segmented text to a server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
In some embodiments, after the vertex parameters of each line of characters in the text page are acquired, the processor executing the computer-readable instructions further performs: partitioning the text according to the vertex parameters of each line of characters in the text page; generating new text pages from the partitioned text; and continuing with identifying the longest line of characters in the text page according to the vertex parameters.
In some embodiments, after the paragraph break is added after the target character to obtain the segmented text, the processor executing the computer-readable instructions further performs: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content; acquiring the vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set being the vertex parameters used to determine the segmentation criterion; identifying the longest line of characters in the text page according to the vertex parameters and taking the second set of vertex parameters of the longest line of characters as the standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining a target character in each line whose difference is greater than the preset value and adding a paragraph break after the target character to obtain the segmented text.
In some embodiments, the preset value obtained when the computer-readable instructions are executed by the processor is obtained by: acquiring character samples and identifying the character types corresponding to the character samples; calculating the width ratio between the characters of each character type and a Chinese character, the width ratio being the ratio of the width a character occupies in the document to the width a Chinese character occupies; and calculating the preset value according to the width ratio.
In some embodiments, after the paragraph break is added after the target character to obtain the segmented text, the computer-readable instructions, when executed by the processor, further implement: sending the segmented text to a server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
In some embodiments, after the vertex parameters of each line of characters in the text page are acquired, the computer-readable instructions, when executed by the processor, further implement: partitioning the text according to the vertex parameters of each line of characters in the text page; generating new text pages from the partitioned text; and continuing with identifying the longest line of characters in the text page according to the vertex parameters.
In some embodiments, after the paragraph break is added after the target character to obtain the segmented text, the computer-readable instructions, when executed by the processor, further implement: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions that instruct the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described. However, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be understood as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A scanned-text segmentation method, comprising:
    acquiring a picture containing text content;
    performing text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content;
    acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters comprising a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;
    identifying the longest line of characters in the text page according to the vertex parameters, and taking the second set of vertex parameters of the longest line of characters as standard parameters;
    calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and
    determining a target character in each line whose difference is greater than a preset value, and adding a paragraph break after the target character to obtain segmented text.
  2. The method according to claim 1, wherein the preset value is obtained by:
    acquiring character samples, and identifying the character types corresponding to the character samples;
    calculating the width ratio between the characters of each of the character types and a Chinese character, the width ratio being the ratio of the width a character occupies in the document to the width a Chinese character occupies; and
    calculating the preset value according to the width ratio.
  3. The method according to claim 2, wherein, after the paragraph break is added after the target character to obtain the segmented text, the method further comprises:
    sending the segmented text to a server;
    receiving an update instruction returned by the server according to the segmented text; and
    replacing the preset value in the terminal with the preset value in the update instruction.
  4. The method according to claim 1, wherein, after the vertex parameters of each line of characters in the text page are acquired, the method further comprises:
    partitioning the text according to the vertex parameters of each line of characters in the text page;
    generating a new text page from the partitioned text; and
    continuing with the identifying of the longest line of characters in the text page according to the vertex parameters.
  5. The method according to any one of claims 1 to 4, wherein, after the paragraph break is added after the target character to obtain the segmented text, the method further comprises:
    saving the segmented text; and
    deleting the vertex parameters of each line of characters in the text page.
  6. A scanned-text segmentation apparatus, comprising:
    a picture acquisition module, configured to acquire a picture containing text content;
    a content conversion module, configured to perform text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content;
    a vertex parameter acquisition module, configured to acquire vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters comprising a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;
    a standard parameter acquisition module, configured to identify the longest line of characters in the text page according to the vertex parameters, and to take the second set of vertex parameters of the longest line of characters as standard parameters;
    a difference calculation module, configured to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters; and
    a segmentation module, configured to determine a target character in each line whose difference is greater than a preset value, and to add a paragraph break after the target character to obtain segmented text.
  7. The apparatus according to claim 6, wherein the apparatus further comprises:
    a sample acquisition module, configured to acquire character samples and identify the character types corresponding to the character samples;
    a character type analysis module, configured to calculate the width ratio between the characters of each of the character types and a Chinese character, the width ratio being the ratio of the width a character occupies in the document to the width a Chinese character occupies; and
    a preset value calculation module, configured to calculate the preset value according to the width ratio.
  8. The apparatus according to claim 7, wherein the apparatus further comprises:
    a sending module, configured to send the segmented text to a server;
    an update instruction receiving module, configured to receive an update instruction returned by the server according to the segmented text; and
    a preset value update module, configured to replace the preset value in the terminal with the preset value in the update instruction.
  9. The apparatus according to claim 6, wherein the apparatus further comprises:
    a partition module, configured to partition the text according to the vertex parameters of each line of characters in the text page;
    a page update module, configured to generate a new text page from the partitioned text; and
    a new page segmentation module, configured to continue with the identifying of the longest line of characters in the text page according to the vertex parameters.
  10. The apparatus according to any one of claims 6 to 9, wherein the apparatus further comprises:
    a saving module, configured to save the segmented text; and
    a parameter deletion module, configured to delete the vertex parameters of each line of characters in the text page.
  11. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    acquiring a picture containing text content;
    performing text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content;
    acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line comprising a first group of vertex parameters and a second group of vertex parameters, the second group being the vertex parameters used as the criterion for segmentation;
    identifying the longest line of characters in the text page according to the vertex parameters, and taking the second group of vertex parameters of that longest line as standard parameters;
    calculating the difference between the second group of vertex parameters of each line of characters and the standard parameters; and
    determining a target character in each line whose difference is greater than a preset value, and inserting a paragraph break after the target character to obtain the segmented text.
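The steps of claim 11 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes that the first group of vertex parameters reduces to the x-coordinate of a line's left edge and the second group to the x-coordinate of its right edge, that the "longest line" is the one with the largest right-minus-left extent, and that the target character is the last character of a short line. The names `Line`, `segment` and `break_mark` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left_x: float   # assumed stand-in for the first group of vertex parameters
    right_x: float  # assumed stand-in for the second group of vertex parameters


def segment(lines, preset, break_mark="\n"):
    """Join OCR lines, inserting a paragraph break after each short line."""
    if not lines:
        return ""
    # Identify the longest line and use its right edge as the standard parameter.
    longest = max(lines, key=lambda ln: ln.right_x - ln.left_x)
    standard = longest.right_x
    parts = []
    for ln in lines:
        parts.append(ln.text)
        # A line that ends well short of the standard is assumed to end a paragraph;
        # its last character is treated as the target character.
        if standard - ln.right_x > preset:
            parts.append(break_mark)
    return "".join(parts)
```

With four lines whose right edges are 100, 50, 100 and 20 and a preset value of 10, the second and fourth lines are treated as paragraph ends.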
  12. The computer device according to claim 11, wherein the processor further performs the following steps when executing the computer-readable instructions, the preset value being obtained by:
    acquiring character samples and identifying the character type of each character sample;
    calculating, for each character type, the width ratio of a character of that type to a Chinese character, the width ratio being the ratio of the width the character occupies in the document to the width a Chinese character occupies; and
    calculating the preset value according to the width ratios.
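Claim 12 does not state how the preset value is derived from the width ratios, so the following Python sketch fills that gap with an assumption: the preset value is the widest character type's ratio multiplied by the width of a Chinese character, that is, the largest gap a full line could leave because the next character did not fit. The character classification rules and the names `char_kind`, `width_ratios` and `preset_value` are likewise hypothetical.

```python
from collections import defaultdict


def char_kind(ch):
    """Classify a character into a coarse type (illustrative rules only)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "han"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def width_ratios(samples):
    """samples: iterable of (character, measured width) pairs.

    Returns the mean width of each character type divided by the mean
    width of a Chinese character. At least one Chinese sample is required.
    """
    widths = defaultdict(list)
    for ch, width in samples:
        widths[char_kind(ch)].append(width)
    mean = {kind: sum(ws) / len(ws) for kind, ws in widths.items()}
    han_width = mean["han"]
    return {kind: m / han_width for kind, m in mean.items()}


def preset_value(ratios, han_width):
    # Assumption: the threshold equals the width of the widest character type.
    return max(ratios.values()) * han_width
```

Other derivations, such as a weighted average of the ratios, would fit the claim wording equally well.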
  13. The computer device according to claim 12, wherein, after inserting the paragraph break after the target character to obtain the segmented text, the processor further performs the following steps when executing the computer-readable instructions:
    sending the segmented text to a server;
    receiving an update instruction returned by the server according to the segmented text; and
    replacing the preset value in the terminal with the preset value in the update instruction.
  14. The computer device according to claim 11, wherein, after acquiring the vertex parameters of each line of characters in the text page, the processor further performs the following steps when executing the computer-readable instructions:
    partitioning the text according to the vertex parameters of each line of characters in the text page;
    generating a new text page from the partitioned text; and
    continuing with the step of identifying the longest line of characters in the text page according to the vertex parameters.
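The partitioning in claim 14 could be sketched as follows, under the assumption that a partition is a column of text and that lines belong to the same column when their left edges lie within a tolerance of each other. The claim itself only says that partitioning uses the vertex parameters. The function name `partition_by_left_edge`, the tuple layout and the `tolerance` parameter are hypothetical.

```python
def partition_by_left_edge(lines, tolerance):
    """Group lines into partitions (assumed here to be columns).

    lines: list of (text, left_x, right_x) tuples in reading order.
    Returns a list of new "pages", each a list of lines, ordered left to right.
    """
    pages = []  # each entry: (anchor_left_x, [lines in this partition])
    for line in lines:
        for anchor, members in pages:
            if abs(line[1] - anchor) <= tolerance:
                members.append(line)
                break
        else:
            # No existing partition is close enough, so start a new one.
            pages.append((line[1], [line]))
    return [members for _, members in sorted(pages, key=lambda p: p[0])]
```

Each returned list can then be treated as a new text page on which the longest line is identified separately. The tolerance would need to exceed the first-line indent of a paragraph, otherwise indented lines would be split into a partition of their own.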
  15. The computer device according to any one of claims 11 to 14, wherein, after inserting the paragraph break after the target character to obtain the segmented text, the processor further performs the following steps when executing the computer-readable instructions:
    saving the segmented text; and
    deleting the vertex parameters of each line of characters in the text page.
  16. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a picture containing text content;
    performing text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content;
    acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line comprising a first group of vertex parameters and a second group of vertex parameters, the second group being the vertex parameters used as the criterion for segmentation;
    identifying the longest line of characters in the text page according to the vertex parameters, and taking the second group of vertex parameters of that longest line as standard parameters;
    calculating the difference between the second group of vertex parameters of each line of characters and the standard parameters; and
    determining a target character in each line whose difference is greater than a preset value, and inserting a paragraph break after the target character to obtain the segmented text.
  17. The storage medium according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed, the preset value being obtained by:
    acquiring character samples and identifying the character type of each character sample;
    calculating, for each character type, the width ratio of a character of that type to a Chinese character, the width ratio being the ratio of the width the character occupies in the document to the width a Chinese character occupies; and
    calculating the preset value according to the width ratios.
  18. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed after the paragraph break is inserted after the target character to obtain the segmented text:
    sending the segmented text to a server;
    receiving an update instruction returned by the server according to the segmented text; and
    replacing the preset value in the terminal with the preset value in the update instruction.
  19. The storage medium according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed after the vertex parameters of each line of characters in the text page are acquired:
    partitioning the text according to the vertex parameters of each line of characters in the text page;
    generating a new text page from the partitioned text; and
    continuing with the step of identifying the longest line of characters in the text page according to the vertex parameters.
  20. The storage medium according to any one of claims 16 to 19, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed after the paragraph break is inserted after the target character to obtain the segmented text:
    saving the segmented text; and
    deleting the vertex parameters of each line of characters in the text page.
PCT/CN2019/102549 2019-05-20 2019-08-26 Scanned text segmentation method and apparatus, computer device and storage medium WO2020232866A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910418522.6A CN110245570B (en) 2019-05-20 2019-05-20 Scanned text segmentation method and device, computer equipment and storage medium
CN201910418522.6 2019-05-20

Publications (1)

Publication Number Publication Date
WO2020232866A1 (en) 2020-11-26

Family

ID=67884469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102549 WO2020232866A1 (en) 2019-05-20 2019-08-26 Scanned text segmentation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110245570B (en)
WO (1) WO2020232866A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191348A (en) * 2021-05-31 2021-07-30 山东新一代信息产业技术研究院有限公司 Template-based text structured extraction method and tool

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN113177532B (en) * 2021-05-27 2024-04-05 中国平安人寿保险股份有限公司 Method, device, equipment and medium for identifying paragraph boundary of characters in image
CN114444439B (en) * 2022-04-08 2022-08-26 深圳市壹箭教育科技有限公司 Test question set file generation method and device, electronic equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US7970213B1 (en) * 2007-05-21 2011-06-28 A9.Com, Inc. Method and system for improving the recognition of text in an image
CN107545223A (en) * 2016-06-29 2018-01-05 腾讯科技(深圳)有限公司 Image-recognizing method and electronic equipment
CN108734089A (en) * 2018-04-02 2018-11-02 腾讯科技(深圳)有限公司 Identify method, apparatus, equipment and the storage medium of table content in picture file
CN109697414A (en) * 2018-12-13 2019-04-30 北京金山数字娱乐科技有限公司 A kind of text positioning method and device

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US8565474B2 (en) * 2010-03-10 2013-10-22 Microsoft Corporation Paragraph recognition in an optical character recognition (OCR) process
CN105487396A (en) * 2015-12-29 2016-04-13 宇龙计算机通信科技(深圳)有限公司 Method and device of controlling smart home
CN106326854B (en) * 2016-08-19 2019-09-06 掌阅科技股份有限公司 A kind of format document paragraph recognition methods

Also Published As

Publication number Publication date
CN110245570A (en) 2019-09-17
CN110245570B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US11631050B2 (en) Syncing physical and electronic document
WO2021017260A1 (en) Multi-language text recognition method and apparatus, computer device, and storage medium
CN111476227B (en) Target field identification method and device based on OCR and storage medium
WO2019201035A1 (en) Method and device for identifying object node in image, terminal and computer readable storage medium
US9710704B2 (en) Method and apparatus for finding differences in documents
WO2018233055A1 (en) Method and apparatus for entering policy information, computer device and storage medium
RU2651144C2 (en) Data input from images of the documents with fixed structure
WO2017202232A1 (en) Business card content identification method, electronic device and storage medium
WO2020232866A1 (en) Scanned text segmentation method and apparatus, computer device and storage medium
US8693790B2 (en) Form template definition method and form template definition apparatus
WO2021012382A1 (en) Method and apparatus for configuring chat robot, computer device and storage medium
US8515176B1 (en) Identification of text-block frames
CN110136198B (en) Image processing method, apparatus, device and storage medium thereof
WO2021017272A1 (en) Pathology image annotation method and device, computer apparatus, and storage medium
US10679089B2 (en) Systems and methods for optical character recognition
WO2014086287A1 (en) Text image automatic dividing method and device, method for automatically dividing handwriting entries
WO2022166833A1 (en) Image processing method and apparatus, and electronic device and storage medium
WO2018233171A1 (en) Method and apparatus for entering document information, computer device and storage medium
US9734132B1 (en) Alignment and reflow of displayed character images
WO2022206534A1 (en) Method and apparatus for text content recognition, computer device, and storage medium
WO2022166707A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN111223155B (en) Image data processing method, device, computer equipment and storage medium
CN108334800B (en) Stamp image processing device and method and electronic equipment
US9110926B1 (en) Skew detection for vertical text
CN116860747A (en) Training sample generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929509

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19929509

Country of ref document: EP

Kind code of ref document: A1