WO2020232866A1 - Scanned text segmentation method and apparatus, computer device, and storage medium - Google Patents

Scanned text segmentation method and apparatus, computer device, and storage medium

Info

Publication number
WO2020232866A1
WO2020232866A1 PCT/CN2019/102549 CN2019102549W WO2020232866A1 WO 2020232866 A1 WO2020232866 A1 WO 2020232866A1 CN 2019102549 W CN2019102549 W CN 2019102549W WO 2020232866 A1 WO2020232866 A1 WO 2020232866A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
characters
character
line
parameters
Prior art date
Application number
PCT/CN2019/102549
Other languages
English (en)
Chinese (zh)
Inventor
许剑勇
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020232866A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to a method, device, computer equipment and storage medium for scanning text segmentation.
  • paper text is scanned to obtain a picture containing text content, and the text content in the picture is recognized through intelligent recognition technology to obtain editable text. However, the inventor realized that the traditional intelligent recognition method can only identify the text content in the picture. When the corresponding paragraph of the text content needs to be located or analyzed for further processing, the traditional method cannot determine the beginning and end positions of the characters in the text content, so inaccurate segmentation of the text content may cause errors in subsequent processing.
  • a method, apparatus, computer device, and storage medium for scanning text segments are provided.
  • a method for scanning text segmentation including:
  • the vertex parameters of each line of characters in the text page include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are vertex parameters used to determine the segmentation criterion ;
  • a scanning text segmentation device including:
  • a picture acquisition module used to acquire a picture containing text content;
  • a content conversion module configured to perform text recognition on the picture to obtain a text page, the text page containing characters consistent with the arrangement order of the text content;
  • a vertex parameter acquisition module used to acquire the vertex parameters of each line of characters in the text page, where the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation criterion;
  • a standard parameter obtaining module configured to identify the longest line of characters in the text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters;
  • a difference calculation module used to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters;
  • a segmentation module used to determine the target character in each line where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.
  • a computer device including a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to execute the following steps:
  • the vertex parameters of each line of characters in the text page include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are vertex parameters used to determine the segmentation criterion ;
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • the one or more processors execute the following steps:
  • the vertex parameters of each line of characters in the text page include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are vertex parameters used to determine the segmentation criterion ;
  • Fig. 1 is an application scenario diagram of a scanning text segmentation method according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a scanning text segmentation method according to one or more embodiments.
  • Fig. 3 is a schematic flowchart of a method for obtaining a preset value according to one or more embodiments.
  • Fig. 4 is a block diagram of an apparatus for scanning text segmentation according to one or more embodiments.
  • Fig. 5 is a block diagram of a computer device according to one or more embodiments.
  • the scanning text segmentation method provided in this application can be applied to the application environment as shown in FIG. 1.
  • the terminal 102 communicates with the server 104 through the network. After receiving the scan request from the user, the terminal obtains the image of the target document and converts it into the corresponding target text, and then segments the target text.
  • the terminal 102 sends the segmented target text and the problems found in the segmentation process to the server 104.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for scanning text segmentation is provided. Taking the method applied to the terminal 102 in FIG. 1 as an example for description, the method includes the following steps:
  • the picture with text content is the picture obtained by the terminal taking or scanning the target document to be converted through the scanning device.
  • Target documents are documents that users want to scan and convert into editable text, such as legal documents or technical documents.
  • the scanning device is a built-in or external scanning device of the terminal, such as a camera of a mobile phone or a computer, or a scanner connected to the computer.
  • the area to be scanned is the area whose content the user wants to scan through the terminal; when the scanning device is the camera of a mobile phone or computer, the area to be scanned is the shooting area of the camera; when the scanning device is an external scanner, the area to be scanned is the scanner's scan area.
  • when the user has a scanning requirement, the user can place the document to be scanned and recognized in the area to be scanned of the scanning device built into or connected to the terminal, and the scanning device captures an image of the target document to obtain a picture containing text content.
  • S204 Perform text recognition on the picture to obtain a text page, and the text page contains characters consistent with the arrangement order of the text content.
  • the text content in the picture can be converted into an editable text (or character) form, following its arrangement order in the picture, through the built-in or external content recognition device of the terminal to obtain a text page.
  • Content recognition equipment is a device used to convert text content in a picture into an editable text page, such as OCR recognition equipment. OCR (Optical Character Recognition) refers to examining the characters in the picture, determining their shapes by detecting dark and light patterns, and then using character recognition methods to translate the shapes into computer text.
  • the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are vertex parameters used to determine a segmentation criterion.
  • the vertex parameter of each line of characters is a parameter used to characterize the position, height and width of a line of characters in the text page.
  • the vertex parameters of each line of characters can be the distances between the upper, lower, left, and right boundaries of the line and the corresponding boundaries of the display interface. For example, they can be expressed as left, the distance between the leftmost end of the first character of the line and the left border of the display interface; top, the distance between the uppermost end of the line and the upper border of the display interface; and width, the left-to-right width of the line.
  • the vertex parameters of each line of characters can also be the coordinates of the line in the text page after length and width coordinate axes are set for the page. For example, with the lower left corner of the text page set as the zero coordinate of the length and width directions, the position of each line of characters in the text page can be expressed as four sets of coordinate parameters relative to this zero coordinate.
  • the vertex parameters of each line of characters include the first set of vertex parameters and the second set of vertex parameters.
  • the second set of vertex parameters are the vertex parameters used to determine the segmentation standard. In the above examples, they can be the vertex parameters that represent the width of a line of characters, such as left (the distance between the leftmost end of the first character of the line and the left border of the display interface) and width (the left-to-right width of the line), or the difference in the width direction among the four coordinate parameters of the line.
  • the vertex parameters of each line of characters in the text page can be adjusted according to the needs of technicians or the recognition method of the terminal, and are not limited to the above examples.
  • when the terminal performs text conversion through the content recognition device, it needs to analyze the position of each line of characters within the text page.
  • the terminal can directly record the vertex parameters of each line of characters in the text page through the content recognition device.
  • the vertex parameters may include two sets of vertex parameters for positioning each line of characters in the text page, where the second set is the set of vertex parameters related to segmentation.
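  • As a rough illustration of the vertex parameters described above, the sketch below stores one recognized line as left, top, width, and height values. The class name `LineBox`, the `height` field, and the choice of (top, height) as the first set are assumptions made for illustration; the text only names left and width as examples of the second set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineBox:
    """Position data for one recognized line (values in pixels)."""
    text: str
    left: float    # distance from the left border to the first character
    top: float     # distance from the upper border to the top of the line
    width: float   # left-to-right extent of the line
    height: float  # assumed extra field, not named in the text

    @property
    def first_set(self) -> tuple:
        # Assumed grouping: vertical data, not used for segmentation.
        return (self.top, self.height)

    @property
    def second_set(self) -> tuple:
        # Horizontal data, given in the text as the segmentation-related set.
        return (self.left, self.width)

    @property
    def right(self) -> float:
        # Right end of the line, derived from left and width.
        return self.left + self.width


line = LineBox(text="example line", left=40.0, top=120.0, width=520.0, height=18.0)
```

  • Keeping the two sets as separate properties makes it explicit which values later steps compare against the standard parameters.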
  • S208 Identify the longest line of characters in the text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters.
  • the longest line of characters in a text page is the line of the text page used as the segmentation standard. It can be the line whose last character ends closest to the right edge of the text page, or the line with the largest line width.
  • the terminal can compare the width of each line of characters in the text page and use the line with the largest width value as the longest line of characters in the text page.
  • alternatively, the terminal can compare, for each line of characters, the distance between the right end of the last character and the right edge of the text page, and regard the line with the smallest distance as the longest line of characters.
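  • The two ways of picking the longest line can be sketched as follows. This is a minimal example that assumes each line is a dictionary with hypothetical `left` and `width` keys and that the page width is known; none of these names come from the original text.

```python
def longest_by_width(lines):
    """Return the line with the largest width value."""
    return max(lines, key=lambda ln: ln["width"])


def longest_by_right_gap(lines, page_width):
    """Return the line whose right end is closest to the right page edge."""
    return min(lines, key=lambda ln: page_width - (ln["left"] + ln["width"]))


lines = [
    {"text": "first line of a paragraph", "left": 40, "width": 500},
    {"text": "a full line reaching the margin", "left": 20, "width": 560},
    {"text": "short tail", "left": 20, "width": 180},
]

by_width = longest_by_width(lines)
by_gap = longest_by_right_gap(lines, page_width=600)
```

  • On typical justified text both criteria select the same line; they can differ when lines start at different left offsets.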
  • S210 Calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters.
  • the terminal compares the second set of vertex parameters of each line in the text page with the standard parameters, and calculates the difference between the two to determine whether the line needs to be segmented.
  • S212 Determine the target character in the row where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.
  • the preset value is the threshold used to judge whether the difference between the second set of vertex parameters of a line and the standard parameters indicates a segment break.
  • the specific value is obtained by a technician through analysis of historical character samples. Its format is the same as that of the second set of vertex parameters of each line and of the standard parameters, and it can be set, for example, to a number of pixels.
  • the target character is the last character at the end of the line, which can be text or punctuation.
  • if the difference between the second set of vertex parameters of a line and the standard parameters is greater than the preset value, the paragraph character "/n" is added after the target character of this line; if the difference is less than or equal to the preset value, the line is considered not to end a segment and there is no need to add a segmentation symbol.
  • the terminal judges whether the line is segmented line by line, until the target text is segmented, and the segmented text is obtained for the user to use or machine recognition, for example, so that the machine can recognize the segmentation symbol to locate the paragraph.
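  • Steps S208 to S212 can be sketched in a few lines. The example assumes the second set of vertex parameters is reduced to a single `width` value per line and that the segmentation mark is rendered as the newline character `"\n"`; both are simplifications made for illustration, and the function name `segment` is hypothetical.

```python
def segment(lines, preset):
    """Join recognized lines, adding a break after lines that end a paragraph."""
    # S208: the width of the longest line serves as the standard parameter.
    standard = max(ln["width"] for ln in lines)
    parts = []
    for ln in lines:
        # S210: difference between this line and the standard parameter.
        difference = standard - ln["width"]
        parts.append(ln["text"])
        # S212: a large difference marks the last line of a paragraph.
        if difference > preset:
            parts.append("\n")  # assumed rendering of the segmentation mark
    return "".join(parts)


lines = [
    {"text": "AAAA", "width": 560},
    {"text": "BB", "width": 200},
    {"text": "CCCC", "width": 555},
    {"text": "D", "width": 80},
]
segmented = segment(lines, preset=56)
```

  • With a preset of 56 pixels, the 5-pixel shortfall of the third line is treated as typesetting noise, while the second and fourth lines are treated as paragraph endings.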
  • in step S210, in addition to calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters, the vertex parameter corresponding to the start position of the next line of characters can also be obtained.
  • when that start position shows a space at the beginning of the next line, a paragraph break is inserted after the last character of this line. In this step, for text whose next line begins with a space after the line break, whether to segment is judged from both the end of the current paragraph and the beginning of the next paragraph, which improves the accuracy of segmentation.
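  • A possible reading of this refinement is sketched below: a break is placed after a line either when the line itself is short or when the following line starts with extra leading space. Treating either signal as sufficient, using a separate `indent_preset` threshold, and the function name `ends_paragraph` are all assumptions; the text does not specify how the two signals are combined.

```python
def ends_paragraph(current, following, standard, width_preset, indent_preset):
    """Decide whether a paragraph break belongs after `current`."""
    # Signal 1: the current line stops well short of the longest line.
    short_line = (standard["width"] - current["width"]) > width_preset
    # Signal 2: the next line starts further right than the longest line does.
    next_indented = (
        following is not None
        and (following["left"] - standard["left"]) > indent_preset
    )
    return short_line or next_indented


standard = {"left": 20, "width": 560}
full_line = {"left": 20, "width": 558}
indented_next = {"left": 60, "width": 520}
flush_next = {"left": 20, "width": 560}
```

  • The second signal catches a paragraph whose last line happens to fill the full width, which the width comparison alone would miss.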
  • the terminal can obtain the text content page by page as the text page and perform the above segmentation steps. After segmenting the text in one page, it continues to obtain the next page of the target text as a text page, until all pages of the target text are segmented.
  • the terminal can recognize the first page and use the obtained standard parameters as the standard parameters of the entire text. That is, only the standard parameters of the first page need to be obtained, and the remaining pages still use the standard parameters of the first page as the segmentation standard. For multiple pages of the same text, only the standard parameters of the full line of the first page need to be obtained, which improves the segmentation efficiency of this method.
  • the terminal obtains the text page by recognizing the scanned picture containing text content, obtains the vertex parameters of each line of the text page, determines the longest line of characters in the text page, uses the second set of vertex parameters of the longest line of characters as the standard parameters, and compares the second set of vertex parameters of each line of characters with the standard parameters in turn.
  • when the difference is greater than the preset value, the terminal considers the line to be the ending line of a paragraph and adds a paragraph break after the target character of the line, until the text is segmented.
  • in this way, the terminal can accurately segment the recognized text after recognizing the text content in the picture.
  • the way of obtaining the preset value in the above scanning text segmentation method may include:
  • S302 Obtain a character sample, and identify the type of character corresponding to the character sample.
  • Character samples are used to analyze the width of non-Chinese characters and Chinese characters in the same document, and can be scanned documents that have been processed before.
  • Character types are types of non-Chinese characters, such as letters or numbers.
  • when the terminal calculates the preset value, it needs to use historically processed segmented documents as character samples.
  • a character sample contains characters of different character types. After the terminal obtains the character sample, it first recognizes the corresponding character types.
  • S304 Calculate the duty ratio of the character corresponding to each character type to the Chinese character, where the duty ratio is the ratio of the width of the character in the document to the width of the Chinese character.
  • the terminal calculates the ratio of the width of each character type other than Chinese characters in the document to the width of Chinese characters. For example, according to the sample, the terminal can roughly estimate each number or letter as half the width of a Chinese character. For a more accurate calculation, the width of each digit 0-9, each letter a-z and A-Z, and each other non-Chinese character can be divided in advance by the width of a Chinese character to generate a relative width hash table of non-Chinese characters, and the terminal then calculates the duty ratio of the character corresponding to each character type to the Chinese character based on the relative width hash table.
  • the terminal can calculate the preset value from the character types in the character sample and their width ratios to Chinese characters. For example, historically scanned documents contain Chinese characters, letters, numbers, punctuation marks, and other characters, so full lines are not perfectly aligned with one another.
  • from this, the preset value used to compensate for the segmentation error caused by character typesetting is obtained; it can specifically be set to 0.1-0.15 times the width of the longest line of characters.
  • the technician studies a large number of character samples, and calculates accurate preset values based on the difference in typesetting of different character types, and accurately segments the recognized text.
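  • The relative width hash table and the suggested preset range can be sketched as below. The text does not give a formula linking the duty ratios to the preset value, so the two helpers are shown separately; the function names, the `factor` parameter, and the pixel widths in the example are hypothetical.

```python
def relative_width_table(measured_widths, chinese_width):
    """Map each non-Chinese character to its width relative to a Chinese character."""
    return {ch: w / chinese_width for ch, w in measured_widths.items()}


def preset_value(longest_line_width, factor=0.1):
    """Preset as a fraction of the longest line width (0.1 to 0.15 per the text)."""
    if not 0.1 <= factor <= 0.15:
        raise ValueError("factor outside the 0.1-0.15 range given in the text")
    return factor * longest_line_width


# Hypothetical pixel widths measured from a character sample.
measured = {"a": 10.0, "W": 15.0, "1": 10.0}
table = relative_width_table(measured, chinese_width=20.0)
preset = preset_value(600, factor=0.15)
```

  • A table value of 0.5 corresponds to the rough estimate in the text that a digit or letter is half as wide as a Chinese character.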
  • after the segmentation mark is added after the target character in step S212 to obtain the segmented text, the method may further include: sending the segmented text to the server; receiving the update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
  • the update instruction is an instruction sent by the server to the terminal to update the preset value of the terminal, and may be an instruction to replace the local preset value of the terminal with a new preset value.
  • the terminal sends the segmented text to the server for further processing or use. If the server finds that the segment recognition of the terminal is incorrect, it can adjust the preset value according to the cause of the error.
  • the server may generate an update instruction and return it to the terminal to update the local preset value of the terminal, which the terminal then uses to segment scanned documents.
  • the server detects the accuracy of the preset value based on the document segmented by the terminal, and if the preset value is not accurate, it updates it to improve the accuracy of the text segmentation processed later by the terminal.
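  • The update flow on the terminal side can be sketched as follows. The message layout (a dictionary with `type` and `preset` keys) and the class name `Terminal` are hypothetical; the text only states that the update instruction carries a new preset value that replaces the local one.

```python
class Terminal:
    """Holds the local preset value used for segmentation."""

    def __init__(self, preset):
        self.preset = preset

    def apply_update(self, instruction):
        """Replace the local preset with the value carried by the instruction."""
        # Assumed message format: {"type": "update_preset", "preset": <number>}.
        if instruction.get("type") == "update_preset":
            self.preset = instruction["preset"]
            return True
        return False


terminal = Terminal(preset=56)
applied = terminal.apply_update({"type": "update_preset", "preset": 70})
```

  • Instructions of any other type leave the local preset unchanged in this sketch.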
  • after obtaining the vertex parameters of each line of characters in the text page in step S206, the method may further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page according to the partitioned text; and continuing to identify the longest line of characters in the text page based on the vertex parameters.
  • when the terminal recognizes that the typesetting of the text in the text page differs greatly, that is, the vertex parameters of the lines differ greatly (for example, the starting vertices of one line or several consecutive lines differ greatly from those of the remaining lines), the terminal can partition the text page according to the magnitude of the difference, generate a new text page from each area obtained by the partition, and segment the text in each new text page in turn.
  • Technicians can set the corresponding partitioning standard by studying historical segmented samples.
  • when the difference meets the partitioning standard, the terminal can use these lines as one new area, the other lines in the text page as another area, and so on.
  • the terminal generates a new text page for each area, and the terminal segments the content in the new page according to the above steps.
  • This embodiment is aimed at a situation where texts with different layouts on a page, such as newspapers, posters, etc., are segmented.
  • the text with different layouts on a page can also be accurately segmented by means of partitions.
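  • One simple way to partition a page by vertex parameters is to group consecutive lines whose left offsets are close to each other. This is only a sketch under that assumption; the text leaves the partitioning standard to the technician, and the function name `partition_by_left` and the `tolerance` parameter are hypothetical.

```python
def partition_by_left(lines, tolerance):
    """Group consecutive lines whose left offsets stay within `tolerance`."""
    regions = []
    for ln in lines:
        # Compare against the first line of the current region.
        if regions and abs(ln["left"] - regions[-1][0]["left"]) <= tolerance:
            regions[-1].append(ln)
        else:
            regions.append([ln])  # start a new region (a new text page)
    return regions


lines = [
    {"text": "headline", "left": 20},
    {"text": "subhead", "left": 22},
    {"text": "side column a", "left": 300},
    {"text": "side column b", "left": 305},
    {"text": "footer", "left": 20},
]
regions = partition_by_left(lines, tolerance=10)
```

  • Each returned region can then be treated as its own text page and segmented with its own longest line. The tolerance should exceed a normal first-line indent so that indented paragraph starts do not open a new region.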
  • after the segmentation mark is added after the target character in step S212 to obtain the segmented text, the method may further include: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
  • the terminal saves the segmented text, and deletes the vertex parameters of each line of characters when automatically clearing the scanned text.
  • the terminal can also obtain a delete instruction entered by the user or sent by the server, and then delete the vertex parameters of each line of characters in the text page.
  • the vertex parameters are the position characterization parameters of each line of text in the text page, so the amount of data is large.
  • a scanning text segmentation device is provided, which includes: a picture acquisition module 100, a content conversion module 200, a vertex parameter acquisition module 300, a standard parameter acquisition module 400, a difference calculation module 500, and a segmentation module 600, of which:
  • the picture obtaining module 100 is used to obtain pictures containing text content.
  • the content conversion module 200 is configured to perform text recognition on the picture to obtain a text page, and the text page contains characters in the same order as the text content.
  • the vertex parameter acquisition module 300 is used to acquire the vertex parameters of each line of characters in the text page.
  • the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation standard.
  • the standard parameter obtaining module 400 is configured to identify the longest line of characters in a text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters.
  • the difference calculation module 500 is used to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters.
  • the segmentation module 600 is used to determine the target character in the row where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.
  • the aforementioned scanning text segmentation device may further include:
  • the sample acquisition module is used to acquire character samples and identify the type of characters corresponding to the character samples.
  • the character type analysis module is used to calculate the duty ratio of the character corresponding to each character type to the Chinese character.
  • the duty ratio is the ratio of the width of the character in the document to the width of the Chinese character.
  • the preset value calculation module is used to calculate the preset value according to the duty ratio.
  • the aforementioned scanning text segmentation device may further include:
  • the sending module is used to send the segmented text to the server.
  • the update instruction receiving module is used to receive the update instruction returned by the server according to the segmented text.
  • the preset value update module is used to update the preset value according to the update instruction.
  • the aforementioned scanning text segmentation device may further include:
  • the partition module is used to partition the text according to the vertex parameters of each line of characters in the text page.
  • the page update module is used to generate a new text page based on the partitioned text.
  • the new page segmentation module is used to continue to identify the longest line of characters in the text page according to the vertex parameters.
  • the aforementioned scanning text segmentation device may further include:
  • the save module is used to save the segmented text.
  • the parameter deletion module is used to delete the vertex parameter of each line of characters in the text page.
  • Each module in the aforementioned scanning text segmentation device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 5.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a scanning text segmentation method.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the computer equipment can be a touch layer covering the display screen, a button, trackball, or touchpad set on the housing of the computer equipment, or an external keyboard, touchpad, or mouse.
  • FIG. 5 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • a specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • a computer device including a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps: obtaining a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters in the same order as the text content; obtaining the vertex parameters of each line of characters in the text page, where the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation standard; identifying the longest line of characters in the text page according to the vertex parameters, and obtaining the second set of vertex parameters of the longest line of characters as standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining the target character in each line where the difference is greater than the preset value, and adding a segmentation mark after the target character to obtain the segmented text.
  • the way of obtaining the preset value, realized when the processor executes the computer-readable instructions, includes: obtaining a character sample, and identifying the character type corresponding to the character sample; calculating the duty ratio of the character corresponding to each character type to the Chinese character, where the duty ratio is the ratio of the width of the character in the document to the width of the Chinese character; and calculating the preset value according to the duty ratio.
  • when the processor executes the computer-readable instructions, after adding a segmentation mark after the target character to obtain the segmented text, the method further includes: sending the segmented text to the server; receiving the update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
  • when the processor executes the computer-readable instructions, after obtaining the vertex parameters of each line of characters in the text page, the method further includes: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page according to the partitioned text; and continuing to identify the longest line of characters in the text page based on the vertex parameters.
  • when the processor executes the computer-readable instructions, after adding a paragraph break after the target character to obtain the segmented text, the method further includes: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: obtaining a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters in the same order as the text content; obtaining the vertex parameters of each line of characters in the text page, where the vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation standard; identifying the longest line of characters in the text page according to the vertex parameters, and obtaining the second set of vertex parameters of the longest line of characters as standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining the target character in each line where the difference is greater than the preset value, and adding a segmentation mark after the target character to obtain the segmented text.
  • the way of obtaining the preset value, realized when the computer-readable instructions are executed by the processor, includes: obtaining a character sample, and identifying the character type corresponding to the character sample; calculating the duty ratio of the character corresponding to each character type to the Chinese character, where the duty ratio is the ratio of the width of the character in the document to the width of the Chinese character; and calculating the preset value according to the duty ratio.
  • when the computer-readable instructions are executed by the processor, after adding a segmentation mark after the target character to obtain the segmented text, the method further includes: sending the segmented text to the server; receiving the update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.
  • when the computer-readable instructions are executed by the processor, after obtaining the vertex parameters of each line of characters in the text page, the method further includes: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page according to the partitioned text; and continuing to identify the longest line of characters in the text page according to the vertex parameters.
  • when the computer-readable instructions are executed by the processor, after adding a paragraph break after the target character to obtain the segmented text, the method further includes: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
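
The preset-value step in the claims above (sample characters, identify each character type, compute its width relative to a Chinese character, then derive the preset value) might be sketched as follows. This is only an illustration: the type categories in `char_type`, the function names, and in particular the final rule in `preset_value` (largest ratio times the Chinese character width times a hypothetical `margin` factor) are assumptions, since the text does not say how the preset value is derived from the ratios.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def char_type(ch: str) -> str:
    """Coarse character type. The categories are illustrative assumptions."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "chinese"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def width_ratios(samples: Iterable[Tuple[str, float]],
                 chinese_width: float) -> Dict[str, float]:
    """Mean width of each character type divided by the Chinese character width.

    `samples` holds (character, measured width) pairs taken from the document.
    """
    widths = defaultdict(list)
    for ch, width in samples:
        widths[char_type(ch)].append(width)
    return {t: (sum(ws) / len(ws)) / chinese_width for t, ws in widths.items()}


def preset_value(ratios: Dict[str, float], chinese_width: float,
                 margin: float = 1.0) -> float:
    """Hypothetical rule: a line is 'short' if the gap to the longest line
    exceeds the widest character type, scaled by `margin`."""
    return max(ratios.values()) * chinese_width * margin


samples = [("中", 20.0), ("文", 20.0), ("a", 10.0), ("b", 12.0), ("7", 10.0)]
ratios = width_ratios(samples, chinese_width=20.0)
threshold = preset_value(ratios, chinese_width=20.0)
```

Deriving the threshold from the widest character type reflects the idea that a line ending less than one character short of the longest line was probably wrapped, not ended on purpose. A different aggregation could be substituted without changing the rest of the flow.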

Abstract

Disclosed is a scanned text segmentation method, comprising: acquiring a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters arranged in the same order as the text content; acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters comprising a first set of vertex parameters and a second set of vertex parameters, the second set being the vertex parameters used to determine the segmentation standard; identifying the longest line of characters in the text page according to the vertex parameters, and taking the second set of vertex parameters of the longest line as standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining a target character in each line whose difference is greater than a preset value, and adding a segmentation mark after the target character to obtain segmented text.
PCT/CN2019/102549 2019-05-20 2019-08-26 Scanned text segmentation method and apparatus, computer device, and storage medium WO2020232866A1 (fr)
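
The flow in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Line` type, the reading of the first set of vertex parameters as the top-left corner and the second set as the bottom-right corner, the comparison on the x coordinate only, and the choice of the last character of a line as the target character are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    text: str
    first: Tuple[float, float]   # assumed: (x, y) of the top-left vertex
    second: Tuple[float, float]  # assumed: (x, y) of the bottom-right vertex


def segment(lines: List[Line], preset: float, mark: str = "\n") -> str:
    """Append `mark` after every line that ends well short of the longest line."""
    if not lines:
        return ""
    # The longest line is the one with the greatest horizontal extent.
    longest = max(lines, key=lambda ln: ln.second[0] - ln.first[0])
    standard_x = longest.second[0]  # standard parameter (x of the second set)
    parts = []
    for ln in lines:
        parts.append(ln.text)
        # A gap larger than the preset value is treated as a paragraph end;
        # the target character is assumed to be the last character of the line.
        if standard_x - ln.second[0] > preset:
            parts.append(mark)
    return "".join(parts)


page = [
    Line("aaaa", (0, 0), (100, 10)),
    Line("bb", (0, 10), (50, 20)),
    Line("cccc", (0, 20), (100, 30)),
    Line("c", (0, 30), (30, 40)),
]
result = segment(page, preset=20, mark="|")  # "aaaabb|ccccc|"
```

Lines that reach the right edge of the longest line are joined directly, so text wrapped by the page layout is rejoined into one paragraph, while short lines receive the mark.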

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910418522.6A CN110245570B (zh) 2019-05-20 2019-05-20 Scanned text segmentation method and apparatus, computer device, and storage medium
CN201910418522.6 2019-05-20

Publications (1)

Publication Number Publication Date
WO2020232866A1 (fr) 2020-11-26

Family

ID=67884469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102549 WO2020232866A1 (fr) 2019-05-20 2019-08-26 Scanned text segmentation method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110245570B (fr)
WO (1) WO2020232866A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191348A (zh) * 2021-05-31 2021-07-30 山东新一代信息产业技术研究院有限公司 Template-based structured text extraction method and tool

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
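CN113177532B (zh) * 2021-05-27 2024-04-05 中国平安人寿保险股份有限公司 Method, apparatus, device, and medium for recognizing paragraph boundaries of text in images
CN114444439B (zh) * 2022-04-08 2022-08-26 深圳市壹箭教育科技有限公司 Test question set file generation method and apparatus, electronic device, and storage medium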

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7970213B1 (en) * 2007-05-21 2011-06-28 A9.Com, Inc. Method and system for improving the recognition of text in an image
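CN107545223A (zh) * 2016-06-29 2018-01-05 腾讯科技(深圳)有限公司 Image recognition method and electronic device
CN108734089A (zh) * 2018-04-02 2018-11-02 腾讯科技(深圳)有限公司 Method, apparatus, device, and storage medium for recognizing table content in picture files
CN109697414A (zh) * 2018-12-13 2019-04-30 北京金山数字娱乐科技有限公司 Text positioning method and apparatus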

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8565474B2 (en) * 2010-03-10 2013-10-22 Microsoft Corporation Paragraph recognition in an optical character recognition (OCR) process
CN105487396A (zh) * 2015-12-29 2016-04-13 宇龙计算机通信科技(深圳)有限公司 Smart home control method and smart home control apparatus
CN106326854B (zh) * 2016-08-19 2019-09-06 掌阅科技股份有限公司 Paragraph recognition method for fixed-layout documents

Also Published As

Publication number Publication date
CN110245570B (zh) 2023-04-18
CN110245570A (zh) 2019-09-17

Similar Documents

Publication Publication Date Title
US11631050B2 (en) Syncing physical and electronic document
WO2021017260A1 (fr) Multi-language text recognition method and apparatus, computer device, and storage medium
US10339378B2 (en) Method and apparatus for finding differences in documents
CN111476227B (zh) OCR-based target field recognition method, apparatus, and storage medium
WO2019201035A1 (fr) Method and device for identifying an object node in an image, terminal, and computer-readable storage medium
WO2018233055A1 (fr) Policy information entry method and apparatus, computer device, and storage medium
RU2651144C2 (ru) Data input from images of documents with a fixed structure
WO2021012382A1 (fr) Chatbot configuration method and apparatus, computer device, and storage medium
WO2017202232A1 (fr) Business card content identification method, electronic device, and storage medium
WO2020232866A1 (fr) Scanned text segmentation method and apparatus, computer device, and storage medium
US8693790B2 (en) Form template definition method and form template definition apparatus
US8515176B1 (en) Identification of text-block frames
CN110136198B (zh) Image processing method and apparatus, device, and storage medium
WO2021017272A1 (fr) Pathology image annotation method and device, computer apparatus, and storage medium
US10679089B2 (en) Systems and methods for optical character recognition
CN109685870B (zh) Information annotation method and apparatus, annotation device, and storage medium
WO2014086287A1 (fr) Method and device for automatic text image segmentation, and method for automatic segmentation of handwritten entries
WO2018233171A1 (fr) Document information entry method and apparatus, computer device, and storage medium
WO2022166833A1 (fr) Image processing method and apparatus, electronic device, and storage medium
US9734132B1 (en) Alignment and reflow of displayed character images
WO2022206534A1 (fr) Text content recognition method and apparatus, computer device, and storage medium
WO2022166707A1 (fr) Image processing method and apparatus, electronic device, and storage medium
CN111223155B (zh) Image data processing method and apparatus, computer device, and storage medium
US9110926B1 (en) Skew detection for vertical text
CN116860747A (zh) Training sample generation method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19929509

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19929509

Country of ref document: EP

Kind code of ref document: A1