WO2020232866A1 - Scanned text segmentation method and apparatus, computer device and storage medium

Info

Publication number: WO2020232866A1
Application number: PCT/CN2019/102549
Authority: WO
Inventors: 许剑勇
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-05-20
Filing date: 2019-08-26
Publication date: 2020-11-26
Also published as: CN110245570A; CN110245570B

Abstract

A scanned text segmentation method, comprising: acquiring a picture containing text content; performing text recognition of the picture to obtain a text page, the text page containing characters, the arrangement order of which is consistent with that of the text content; acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters comprising a first group of vertex parameters and a second group of vertex parameters, and the second group of vertex parameters being vertex parameters used for determining a segmentation standard; recognizing a longest line of characters in the text page according to the vertex parameters, and acquiring the second group of vertex parameters of the longest line of characters as standard parameters; calculating a difference value between the second group of vertex parameters of each line of characters and the standard parameters; determining a target character in the line where the difference value is greater than a preset value, and adding a segmentation symbol after the target character to obtain segmented text.

Description

Scanning text segmentation method, device, computer equipment and storage medium

Cross references to related applications

This application claims priority to Chinese patent application No. 201910418522.6, filed with the Chinese Patent Office on May 20, 2019 and entitled "Scanning text segmentation method, device, computer equipment and storage medium", which is incorporated in this application by reference.

Technical field

This application relates to a method, device, computer equipment and storage medium for scanning text segmentation.

Background technique

With the development of data technology, more and more information is processed and exchanged over networks, and technologies for converting paper materials into electronic formats keep emerging.

Traditionally, paper text is scanned to obtain a picture containing text content, and the text content in the picture is recognized through intelligent recognition technology to obtain editable text. However, the inventor realized that the traditional intelligent recognition method can only recognize the text in the picture. When the corresponding paragraph of the text content needs to be located or analyzed for further processing, the method cannot determine where the paragraphs in the text content begin and end, so inaccurate segmentation of the text content may lead to errors in subsequent processing of the text content.

Summary of the invention

According to various embodiments disclosed in the present application, a method, apparatus, computer device, and storage medium for scanning text segments are provided.

A method for scanning text segmentation, including:

acquiring a picture containing text content;

performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content;

acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

identifying the longest line of characters in the text page according to the vertex parameters, and obtaining the second set of vertex parameters of the longest line of characters as standard parameters;

calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and

determining the target character in each line where the difference is greater than the preset value, and adding a segmentation symbol after the target character to obtain the segmented text.

A scanning text segmentation device, including:

Picture acquisition module, used to acquire pictures containing text content;

A content conversion module, configured to perform text recognition on the picture to obtain a text page, the text page containing characters consistent with the arrangement order of the text content;

A vertex parameter acquisition module, configured to acquire the vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

A standard parameter obtaining module, configured to identify the longest line of characters in the text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters;

A difference calculation module for calculating the difference between the second set of vertex parameters and the standard parameters of each line of characters; and

The segmentation module is used to determine the target character in the row where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.

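A computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:

acquiring a picture containing text content;

performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content;

acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second set of vertex parameters of the longest line of characters as standard parameters;

calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and

determining the target character in each line where the difference is greater than the preset value, and adding a segmentation symbol after the target character to obtain the segmented text.

One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

acquiring a picture containing text content;

performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content;

acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second set of vertex parameters of the longest line of characters as standard parameters;

calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and

determining the target character in each line where the difference is greater than the preset value, and adding a segmentation symbol after the target character to obtain the segmented text.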

The details of one or more embodiments of the application are set forth in the following drawings and description. Other features and advantages of this application will become apparent from the description, drawings and claims.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application, the following will briefly introduce the drawings needed in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

Fig. 1 is an application scenario diagram of a scanning text segmentation method according to one or more embodiments.

Fig. 2 is a schematic flowchart of a scanning text segmentation method according to one or more embodiments.

Fig. 3 is a schematic flowchart of a method for obtaining a preset value according to one or more embodiments.

Fig. 4 is a block diagram of an apparatus for scanning text segmentation according to one or more embodiments.

Fig. 5 is a block diagram of a computer device according to one or more embodiments.

Detailed description

In order to make the technical solutions and advantages of the present application clearer, the following further describes the present application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application, and not used to limit the application.

The scanning text segmentation method provided in this application can be applied to the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 through the network. After receiving a scan request from the user, the terminal 102 obtains an image of the target document, converts it into the corresponding target text, and then segments the target text. The terminal 102 sends the segmented target text and the problems found in the segmentation process to the server 104. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server 104 may be implemented by an independent server or a server cluster composed of multiple servers.

In one of the embodiments, as shown in FIG. 2, a method for scanning text segmentation is provided. Taking the method applied to the terminal 102 in FIG. 1 as an example for description, the method includes the following steps:

S202: Obtain a picture containing text content.

The picture containing text content is the picture obtained when the terminal photographs or scans the target document to be converted through a scanning device. Target documents are documents that users want to scan and convert into editable text, such as legal documents or technical documents. The scanning device is a built-in or external scanning device of the terminal, such as the camera of a mobile phone or computer, or a scanner connected to the computer. The area to be scanned is the area whose content the user wants to scan through the terminal: when the scanning device is the camera of a mobile phone or computer, the area to be scanned is the shooting area of the camera; when the scanning device is an external scanner, the area to be scanned is the scan area of the scanner.

Specifically, when the user has a scanning requirement, the user can place the document to be scanned and recognized in the area to be scanned of the scanning device that is built into or connected to the terminal, and the scanning device collects an image of the target document to obtain a picture containing text content.

S204: Perform text recognition on the picture to obtain a text page, and the text page contains characters consistent with the arrangement order of the text content.

Specifically, after the terminal collects a picture containing text content, the text content in the picture can be converted into editable text (characters), in the same arrangement order as in the picture, through the terminal's built-in or external content recognition device, to obtain a text page. A content recognition device is a device used to convert the text content in a picture into an editable text page, for example an OCR device. An OCR (Optical Character Recognition) device examines the characters in the picture, determines their shapes by detecting dark and light patterns, and then uses character recognition methods to translate the shapes into computer text.

S206: Obtain vertex parameters of each line of characters in the text page. The vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are vertex parameters used to determine a segmentation criterion.

The vertex parameters of each line of characters are parameters used to characterize the position, height and width of a line of characters in the text page. For example, when the text page is a rectangular display interface, the vertex parameters of each line of characters can be the distances between the upper, lower, left and right boundaries of the line and the corresponding boundaries of the display interface. They can be expressed as: left, the distance between the leftmost end of the first character of the line and the left border of the display interface; top, the distance between the uppermost end of the line and the upper border of the display interface; width, the left-to-right width of the line; and height, the height of the line. The tuple (left, top, width, height) then represents the vertex parameters of each line of characters. Alternatively, the vertex parameters of each line of characters can be the coordinates of the line in the text page after length and width coordinate axes are set for the page. For example, with the lower left corner of the text page as the origin of the length and width directions, the position of each line of characters can be expressed as four sets of coordinate parameters relative to this origin. Correspondingly, the vertex parameters of each line of characters include the first set of vertex parameters and the second set of vertex parameters. The second set of vertex parameters are the vertex parameters used to determine the segmentation standard. They can be the vertex parameters that represent the width of a line of characters in the above example, such as left, the distance between the leftmost end of the first character of the line and the left border of the display interface, together with width, the left-to-right width of the line; or they can be the differences in the width direction among the four coordinate parameters of a line of characters. The vertex parameters of each line of characters in the text page can be adjusted according to the needs of technicians or the recognition method of the terminal, and are not limited to the above examples.
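As an illustration of the (left, top, width, height) representation described above, the following Python sketch models one line's vertex parameters and splits them into two groups. The class name `LineBox`, the helper names, and the choice of (top, height) as the first group are assumptions made for this example; the application only states that the second group may consist of left and width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineBox:
    """Vertex parameters of one recognized line, measured from the page's top-left corner."""
    left: float    # distance from the page's left border to the first character
    top: float     # distance from the page's upper border to the top of the line
    width: float   # left-to-right extent of the line
    height: float  # top-to-bottom extent of the line

    @property
    def right(self) -> float:
        """Horizontal position of the line's rightmost end."""
        return self.left + self.width


def first_group(box: LineBox) -> tuple:
    # Assumption: the first group holds the vertical placement of the line.
    return (box.top, box.height)


def second_group(box: LineBox) -> tuple:
    # Horizontal extent, the part used to decide where paragraphs end.
    return (box.left, box.width)


box = LineBox(left=40, top=120, width=520, height=18)
print(second_group(box), box.right)
```

Keeping the horizontal values in their own group means the later comparison steps never need to touch the vertical parameters.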

Specifically, when the terminal performs text conversion through the content recognition device, the terminal needs to determine the position of each line of characters in the text page. Through the content recognition device, the terminal can directly record the vertex parameters of each line of characters in the text page. The vertex parameters may include two sets of vertex parameters for positioning each line of characters in the text page, where the second set consists of the vertex parameters related to segmentation.

S208: Identify the longest line of characters in the text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters.

The longest line of characters in a text page is the line that serves as the segmentation standard for the page. It can be the line in which the right end of the last character is closest to the right edge of the text page, or the line with the largest character line width.

Specifically, after obtaining the vertex parameters of each line of characters in the text page, the terminal can compare the widths of the lines and take the line with the largest width as the longest line of characters; alternatively, it can take the line in which the distance between the right end of the last character and the right edge of the text page is smallest. The terminal then takes the segmentation-related second set of vertex parameters of the longest line, for example the position of the rightmost end of this line relative to the right border of the text page, as the standard for subsequently segmenting the text in the page, that is, the standard parameters.
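The selection of the longest line can be sketched as follows. This is a minimal example, assuming each line is given as a (left, top, width, height) tuple measured from the left and upper page borders; it picks the line whose right end reaches furthest and uses width only to break ties, which combines the two criteria mentioned above. The names `LineBox`, `longest_line` and `standard_parameter` are hypothetical.

```python
from typing import NamedTuple


class LineBox(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def longest_line(boxes):
    """Return the line whose right end is furthest right; ties go to the wider line."""
    return max(boxes, key=lambda b: (b.left + b.width, b.width))


def standard_parameter(boxes):
    """Right-end position of the longest line, used as the segmentation standard."""
    ref = longest_line(boxes)
    return ref.left + ref.width


boxes = [
    LineBox(80, 100, 480, 18),   # indented first line of a paragraph
    LineBox(40, 122, 520, 18),   # full line
    LineBox(40, 144, 210, 18),   # short last line of the paragraph
]
print(longest_line(boxes), standard_parameter(boxes))
```

Measuring the right end as left + width assumes the page's right border is fixed, so the line closest to that border is the one with the largest right-end coordinate.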

S210: Calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters.

Specifically, after obtaining in step S208 the standard parameters for determining whether to segment, the terminal compares the second set of vertex parameters of each line in the text page with the standard parameters and calculates the difference between the two to determine whether a paragraph break is needed after the line.

S212: Determine the target character in the row where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.

The preset value is the threshold used to judge whether the difference between the second set of vertex parameters of a line and the standard parameters indicates a paragraph break. Its specific value is obtained by technicians through analysis of historical character samples. Its format is consistent with that of the second set of vertex parameters of each line of characters and of the standard parameters, and it can be set, for example, to a number of pixels.

The target character is the last character at the end of the line, which can be text or punctuation.

Specifically, if the difference between the second set of vertex parameters of a line and the standard parameters is greater than the preset value, the paragraph character "\n" is added after the target character of this line; if the difference is less than or equal to the preset value, the line is considered not to end a paragraph and no segmentation symbol needs to be added. The terminal judges line by line whether each line ends a paragraph until the whole target text has been processed, and the segmented text is obtained for use by the user or for machine recognition, for example so that a machine can locate paragraphs by recognizing the segmentation symbols.

In addition, when the difference calculated in step S210 between the second set of vertex parameters of a line of characters and the standard parameters is greater than the preset value, the terminal can further obtain the vertex parameter corresponding to the start position of the next line of characters. When the start position of the next line is greater than the vertex parameter corresponding to the start position of the longest line of characters, a paragraph break is inserted after the last character of the current line. For text in which a new paragraph begins with leading space, this step judges whether to segment according to both the end of the current line and the beginning of the next line, which improves the accuracy of segmentation.
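Steps S208 to S212 can be put together in a short Python sketch. It assumes that each line is described only by its text and the position of its right end, that the standard parameter is the largest right-end position on the page, and that lines which do not end a paragraph are concatenated without a separator (which suits Chinese text). The function name `segment_lines` and the sample values are hypothetical.

```python
def segment_lines(lines, right_edges, preset):
    """Join recognized lines, adding a newline after each line that ends a paragraph.

    lines:       text of each recognized line, top to bottom
    right_edges: right-end position of each line (left + width), same order
    preset:      threshold; a larger gap to the standard marks a paragraph end
    """
    standard = max(right_edges)           # right end of the longest line
    parts = []
    for text, right in zip(lines, right_edges):
        parts.append(text)
        if standard - right > preset:     # line stops well short of a full line
            parts.append("\n")            # segmentation symbol after the last character
    return "".join(parts)


segmented = segment_lines(["AAAA", "BB", "CCCC", "C"], [560, 410, 558, 120], preset=56)
print(repr(segmented))
```

The small 2-pixel gap on the third line stays below the threshold, which is the kind of alignment noise the preset value is meant to absorb.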
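The additional check on the start of the next line can be sketched as below. The sketch assumes per-line left and right positions are available as lists, and it treats a short last line on the page as a paragraph end because there is no following line to inspect; that last rule is an assumption of this example, not something the application specifies. The name `is_paragraph_end` is hypothetical.

```python
def is_paragraph_end(i, lefts, rights, std_left, std_right, preset):
    """Decide whether line i ends a paragraph.

    A break requires (a) line i to stop short of the standard right end by more
    than the preset value and (b) the next line to start further right than the
    longest line starts, i.e. to be indented.
    """
    if std_right - rights[i] <= preset:
        return False                      # line is (nearly) full width
    if i + 1 == len(lefts):
        return True                       # assumption: no next line to check
    return lefts[i + 1] > std_left        # next line is indented


lefts = [80, 40, 40, 80, 40]
rights = [560, 560, 400, 560, 300]
flags = [is_paragraph_end(i, lefts, rights, std_left=40, std_right=560, preset=56)
         for i in range(len(lefts))]
print(flags)
```

Requiring both conditions avoids a false break when a line ends early for another reason, for example before a long token that was wrapped to the next line.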

Optionally, when the text to be segmented spans multiple pages, the terminal can obtain the text content page by page as the text page and perform the above segmentation steps. After segmenting the text in one page, it continues to obtain the next page of the target text as the text page, until all pages of the target text are segmented. In addition, for a group of scanned texts with the same layout on multiple pages, the terminal can recognize the first page and use the standard parameters obtained there as the standard parameters of the entire text; that is, only the standard parameters of the full line on the first page need to be obtained, and the remaining pages reuse them as the segmentation standard, which improves the segmentation efficiency of this method.

In the above scanning text segmentation method, the terminal obtains the text page by recognizing a scanned picture containing text content, obtains the vertex parameters of each line of the text page, determines the longest line of characters in the text page, and uses the second set of vertex parameters of the longest line of characters as standard parameters. The second set of vertex parameters of each line of characters is then compared with the standard parameters in turn. When the difference between the two is greater than the preset value corresponding to the segmentation standard, the terminal considers this line to be the ending line of a paragraph and adds a paragraph break after the target character of this line, until the whole text is segmented. In this way, after recognizing the text content in the picture, the terminal can accurately segment the recognized text.
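Reusing the first page's standard parameter across pages with the same layout can be sketched as follows. The page representation (a list of (text, right_end) pairs), the `same_layout` flag and the function name `segment_pages` are assumptions made for this example.

```python
def segment_pages(pages, preset, same_layout=True):
    """Segment several pages; each page is a list of (text, right_end) pairs.

    With same_layout=True the standard parameter is taken from the first page
    only and reused for every later page.
    """
    standard = None
    results = []
    for page in pages:
        if standard is None or not same_layout:
            standard = max(right for _, right in page)
        chunks = []
        for text, right in page:
            chunks.append(text)
            if standard - right > preset:
                chunks.append("\n")
        results.append("".join(chunks))
    return results


pages = [
    [("AA", 560), ("B", 300)],    # page 1 contains a full line
    [("CC", 500), ("D", 498)],    # page 2 contains only short lines
]
shared = segment_pages(pages, preset=56)
per_page = segment_pages(pages, preset=56, same_layout=False)
print(shared, per_page)
```

In this example the second page has no full line, so a per-page standard would treat its lines as full width, while the shared standard from the first page still marks them as paragraph ends.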

In one of the embodiments, referring to FIG. 3, the way of obtaining the preset value in the above scanning text segmentation method may include:

S302: Obtain a character sample, and identify the type of character corresponding to the character sample.

Character samples are used to analyze the widths of non-Chinese characters relative to Chinese characters in the same document, and can be scanned documents that have been processed before. Character types are the types of non-Chinese characters, such as letters or numbers.

Specifically, the terminal calculates the preset value from historically processed segmented documents, which serve as character samples. A character sample contains characters of different character types. After obtaining the character sample, the terminal first identifies the character types it contains.

S304: Calculate the duty ratio of the character corresponding to each character type to the Chinese character, where the duty ratio is the ratio of the width of the character in the document to the width of the Chinese character.

Specifically, the terminal calculates the ratio of the width of each character type other than Chinese characters in the document to the width of a Chinese character. For example, based on the sample, the terminal can roughly treat each number or letter as half the width of a Chinese character. If a more accurate calculation is needed, the width of each digit 0-9, each letter a-z and A-Z, and other non-Chinese characters can be divided in advance by the width of a Chinese character to generate a relative-width hash table of non-Chinese characters; the terminal then looks up the duty ratio of the characters of each character type relative to Chinese characters in this table.
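A relative-width hash table of this kind can be sketched with a Python dictionary. The measured pixel widths below are made-up sample values, and the function names are hypothetical; the 0.5 fallback mirrors the rough estimate that a digit or letter is half as wide as a Chinese character.

```python
def build_relative_widths(measured, cjk_width):
    """Map each non-Chinese character to its width divided by a Chinese character's width."""
    return {ch: width / cjk_width for ch, width in measured.items()}


def relative_width(ch, table, default=0.5):
    """Look up the duty ratio of a character; Chinese characters count as 1.0."""
    if "\u4e00" <= ch <= "\u9fff":        # CJK Unified Ideographs block
        return 1.0
    return table.get(ch, default)         # fall back to the rough half-width estimate


# Hypothetical measurements in pixels, taken from a character sample.
measured = {"0": 10.0, "i": 5.0, "W": 18.0}
table = build_relative_widths(measured, cjk_width=20.0)
print(table, relative_width("中", table), relative_width("z", table))
```

A dictionary lookup keeps the per-character cost constant, which matters when the table is consulted for every character of a long document.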

S306: Calculate a preset value according to the duty ratio.

Specifically, the terminal can calculate the preset value according to the different character types in the character sample and their width ratios relative to Chinese characters. For example, historically scanned documents contain Chinese characters, letters, numbers, punctuation marks and other characters, so full lines are not perfectly aligned with one another. From this, a preset value that absorbs the segmentation errors caused by such typesetting differences is obtained; it can specifically be set to 0.1-0.15 times the width of the longest line of characters.
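The stated range of 0.1-0.15 times the width of the longest line translates directly into a small helper. How the duty ratios determine the exact factor within that range is not spelled out in the text, so this sketch simply takes the factor as an input; the function name `preset_value` and the default of 0.1 are assumptions.

```python
def preset_value(longest_width, factor=0.1):
    """Threshold for paragraph ends, as a fraction of the longest line's width."""
    if not 0.1 <= factor <= 0.15:
        raise ValueError("factor is expected to lie between 0.1 and 0.15")
    return factor * longest_width


low = preset_value(500)          # lower end of the range
high = preset_value(500, 0.15)   # upper end of the range
print(low, high)
```

A larger factor tolerates more ragged full lines but risks missing paragraphs whose last line is almost full.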

In the above-mentioned embodiment, the technician studies a large number of character samples, and calculates accurate preset values based on the difference in typesetting of different character types, and accurately segments the recognized text.

In some embodiments, after the segmentation symbol is added after the target character in step S212 to obtain the segmented text, the method may further include: sending the segmented text to the server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal with the preset value in the update instruction.

The update instruction is an instruction sent by the server to the terminal to update the preset value of the terminal, and may be an instruction to replace the local preset value of the terminal with a new preset value.

Specifically, the terminal sends the segmented text to the server for further processing or use. If the server finds that the terminal's segmentation is incorrect, it can adjust the preset value according to the cause of the error. The server may generate an update instruction and return it to the terminal to update the terminal's local preset value, which the terminal then uses to segment subsequently scanned documents.

In the foregoing embodiment, the server detects the accuracy of the preset value based on the document segmented by the terminal, and if the preset value is not accurate, it updates it to improve the accuracy of the text segmentation processed later by the terminal.
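On the terminal side, applying such an update instruction might look like the sketch below. The application does not define a message format, so the JSON payload with a `preset` field, the `SegmenterConfig` class and the `apply_update` function are all hypothetical.

```python
import json


class SegmenterConfig:
    """Holds the terminal's local preset value."""

    def __init__(self, preset):
        self.preset = preset


def apply_update(config, raw_instruction):
    """Replace the local preset value with the one carried by the update instruction.

    Returns True if the preset value was replaced, False if the instruction
    carried no usable value.
    """
    message = json.loads(raw_instruction)
    new_value = message.get("preset")
    is_number = isinstance(new_value, (int, float)) and not isinstance(new_value, bool)
    if is_number and new_value > 0:
        config.preset = float(new_value)
        return True
    return False


config = SegmenterConfig(preset=52.0)
accepted = apply_update(config, '{"preset": 60}')
rejected = apply_update(config, '{"preset": -1}')
print(accepted, rejected, config.preset)
```

Validating the value before replacing the local one keeps a malformed instruction from disabling segmentation on the terminal.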

In one of the embodiments, after the vertex parameters of each line of characters in the text page are obtained in step S206, the method may further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page according to the partitioned text; and continuing to identify the longest line of characters in the text page based on the vertex parameters.

Specifically, the terminal may recognize that the typesetting of the text in the text page varies greatly, that is, the vertex parameters of the lines differ greatly from one another: for example, the vertex parameters at the beginning of one line or several consecutive lines differ greatly from those at the beginning of the remaining lines, or the vertex parameters at the end of one line or several consecutive lines differ greatly from those at the end of the remaining lines. In that case, the terminal can partition the text page according to the magnitude of the difference, generate a new text page for each resulting area, and segment the text in each new text page in turn. Technicians can set the corresponding partitioning standard by studying historical segmented samples. For example, if one or both sets of vertex parameters of one line or several consecutive lines differ from the corresponding vertex parameters of the other lines by more than a preset pixel value, the terminal can treat these lines as one new area and the other lines in the text page as another area. The terminal generates a new text page for each area and segments the content of each new page according to the above steps. This embodiment targets pages that contain text with different layouts, such as newspapers and posters; by partitioning, text with different layouts on one page can also be segmented accurately.
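One simple way to partition a page along these lines is to group consecutive lines whose left positions stay close to the first line of the current area. This is only a sketch of one possible partitioning rule: it looks at the left position alone, and the function name `partition_by_left` and the tolerance value are assumptions. The tolerance needs to be larger than an ordinary paragraph indent so that indented first lines stay in their area.

```python
def partition_by_left(lefts, tolerance):
    """Group consecutive line indices into areas by their left position.

    A new area starts when a line's left position differs from the first line
    of the current area by more than the tolerance (in pixels).
    """
    areas = []
    for index, left in enumerate(lefts):
        if areas and abs(left - lefts[areas[-1][0]]) <= tolerance:
            areas[-1].append(index)       # same area as the previous lines
        else:
            areas.append([index])         # layout changed: open a new area
    return areas


# Four lines of body text (one indented), then three lines of a narrow column.
areas = partition_by_left([40, 40, 80, 40, 320, 320, 322], tolerance=100)
print(areas)
```

Each list of indices can then be turned into its own text page and passed through the segmentation steps with its own longest line.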

In one embodiment, after adding a segmentation character after the target character to obtain the segmented text in step S212, it may further include: saving the segmented text; deleting the vertex parameter of each line of characters in the text page.

Specifically, after the segmentation of the content in the text page is finished, the terminal saves the segmented text, and deletes the vertex parameters of each line of characters when automatically clearing the scanned text. Optionally, a user input or a delete instruction sent by the server can also be obtained at the terminal, and the terminal deletes the vertex parameter of each line of characters in the text page.

In the above embodiment, since the vertex parameters are the position characterization parameters of each line of text in the text page, the amount of data is large. After the terminal completes the segmentation of the text in the text page, the vertex parameters obtained during the segmentation process should be deleted to improve the operating speed of the terminal.

It should be understood that, although the steps in the flowcharts of FIGS. 2-3 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order for the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and they are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 4, a scanning text segmentation device is provided, which includes a picture acquisition module 100, a content conversion module 200, a vertex parameter acquisition module 300, a standard parameter acquisition module 400, a difference calculation module 500 and a segmentation module 600, of which:

The picture obtaining module 100 is used to obtain pictures containing text content.

The content conversion module 200 is configured to perform text recognition on the picture to obtain a text page, and the text page contains characters in the same order as the text content.

The vertex parameter acquisition module 300 is used to acquire the vertex parameters of each line of characters in the text page. The vertex parameters of each line of characters include a first set of vertex parameters and a second set of vertex parameters, and the second set of vertex parameters are the vertex parameters used to determine the segmentation standard.

The standard parameter obtaining module 400 is configured to identify the longest line of characters in a text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters.

The difference calculation module 500 is used to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters.

The segmentation module 600 is used to determine the target character in the row where the difference is greater than the preset value, and add a segmentation mark after the target character to obtain the segmented text.

In some embodiments, the aforementioned scanning text segmentation device may further include:

The sample acquisition module is used to acquire character samples and identify the type of characters corresponding to the character samples.

The character type analysis module is used to calculate the duty ratio of the character corresponding to each character type to the Chinese character. The duty ratio is the ratio of the width of the character in the document to the width of the Chinese character.

The preset value calculation module is used to calculate the preset value according to the duty ratio.

The sending module is used to send the segmented text to the server.

The update instruction receiving module is used to receive the update instruction returned by the server according to the segmented text.

The preset value update module is used to update the preset value according to the update instruction.

The partition module is used to partition the text according to the vertex parameters of each line of characters in the text page.

The page update module is used to generate a new text page based on the partitioned text.

The new page segmentation module is used to continue to identify the longest line of characters in the text page according to the vertex parameters.

The save module is used to save the segmented text.

The parameter deletion module is used to delete the vertex parameter of each line of characters in the text page.

For the specific definition of the scanning text segmentation device, please refer to the above definition of the scanning text segmentation method, which will not be repeated here. Each module in the aforementioned scanning text segmentation device can be implemented in whole or in part by software, hardware, and a combination thereof. The foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.

In some embodiments, a computer device is provided. The computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 5. The computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer readable instructions. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by the processor to realize a scanning text segmentation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, or it can be a button, a trackball or a touchpad set on the housing of the computer equipment , It can also be an external keyboard, touchpad, or mouse.

Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.

A computer device includes a memory and one or more processors. The memory stores computer-readable instructions. When the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps: acquiring a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content; acquiring the vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation standard; identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second set of vertex parameters of the longest line of characters as standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining the target character in each line where the difference is greater than a preset value, and adding a segmentation mark after the target character to obtain the segmented text.
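As a non-limiting illustration, the steps above can be sketched in Python as follows. The sketch assumes that text recognition has already produced one record per line, that the first set of vertex parameters can be reduced to the x coordinate of the line's left edge, and that the second set can be reduced to the x coordinate of its right edge. The names `Line`, `left_x`, `right_x`, and `segment` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left_x: float   # assumed stand-in for the first set of vertex parameters
    right_x: float  # assumed stand-in for the second set of vertex parameters


def segment(lines, preset_value, mark="\n"):
    """Append a segmentation mark after every line that ends well short of
    the longest line's right edge."""
    if not lines:
        return ""
    # The longest line is taken to be the one with the widest horizontal extent.
    longest = max(lines, key=lambda ln: ln.right_x - ln.left_x)
    standard = longest.right_x  # standard parameter
    parts = []
    for ln in lines:
        parts.append(ln.text)
        # A gap larger than the preset value is treated as the end of a paragraph;
        # the target character is the last character of that line.
        if standard - ln.right_x > preset_value:
            parts.append(mark)
    return "".join(parts)


lines = [
    Line("AAAA", 0, 100),
    Line("BB", 0, 50),
    Line("CCCC", 0, 100),
    Line("C", 0, 30),
]
result = segment(lines, preset_value=20)
```

With these sample lines, the second and fourth lines end more than 20 units short of the standard, so a mark is added after each of them.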

In some embodiments, the way of obtaining the preset value, as implemented when the processor executes the computer-readable instructions, includes: acquiring a character sample, and identifying the character type corresponding to the character sample; calculating the duty ratio of the character to a Chinese character for each character type, the duty ratio being the ratio of the width of the character in the document to the width of a Chinese character; and calculating the preset value according to the duty ratio.
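The text does not state how the duty ratios are combined into the preset value. The following Python sketch is therefore only one possible reading: it assumes the preset value is the width of the widest character type, obtained as the largest duty ratio times the Chinese character width, scaled by a hypothetical `margin` factor. The function names and the sample widths are illustrative assumptions.

```python
def duty_ratios(sample_widths, chinese_width):
    """Return, for each character type, its width divided by the Chinese character width."""
    if chinese_width <= 0:
        raise ValueError("chinese_width must be positive")
    return {ctype: width / chinese_width for ctype, width in sample_widths.items()}


def preset_value(ratios, chinese_width, margin=1.0):
    """Assumed rule: largest duty ratio, converted back to a width and scaled by margin."""
    return max(ratios.values()) * chinese_width * margin


# Hypothetical measured widths per character type, in pixels.
ratios = duty_ratios({"chinese": 20.0, "latin": 10.0, "digit": 10.0}, chinese_width=20.0)
threshold = preset_value(ratios, chinese_width=20.0, margin=2.0)
```

Under this assumed rule, a line is treated as ending a paragraph only when its right edge falls short of the standard by more than two full-width characters.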

In some embodiments, when the processor executes the computer-readable instructions, after adding a segmentation mark after the target character to obtain the segmented text, the steps further include: sending the segmented text to the server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal according to the preset value in the update instruction.
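The exchange between terminal and server can be sketched as below. This is a minimal sketch with no real network code: the transport is passed in as a callable, and the update instruction is assumed to be a dictionary with a `preset_value` key. The `Terminal` class and both stand-in servers are hypothetical.

```python
class Terminal:
    def __init__(self, preset_value, send):
        self.preset_value = preset_value
        # send: callable taking the segmented text and returning an update
        # instruction (assumed here to be a dict) or None.
        self._send = send

    def report(self, segmented_text):
        """Send the segmented text and apply any update instruction that comes back."""
        instruction = self._send(segmented_text)
        if instruction and "preset_value" in instruction:
            # Replace the local preset value with the one in the update instruction.
            self.preset_value = instruction["preset_value"]
        return self.preset_value


def fake_server(segmented_text):
    # Stand-in for a server that always asks for a larger threshold.
    return {"preset_value": 30.0}


updated = Terminal(20.0, fake_server)
updated.report("first paragraph\nsecond paragraph\n")

unchanged = Terminal(20.0, lambda text: None)
unchanged.report("text")
```

Passing the transport as a callable keeps the sketch independent of any particular protocol, which the text does not specify.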

In some embodiments, when the processor executes the computer-readable instructions, after acquiring the vertex parameters of each line of characters in the text page, the steps further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page from the partitioned text; and continuing to identify the longest line of characters in the text page according to the vertex parameters.
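The partitioning criterion is not detailed in this passage. One plausible reading, sketched below in Python, is that lines whose left edges roughly coincide belong to the same partition (for example, one column of a multi-column scan), and each partition becomes a new text page that is then segmented on its own. The `Line` record, the `tolerance` parameter, and the grouping rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left_x: float   # assumed stand-in for the first set of vertex parameters
    right_x: float  # assumed stand-in for the second set of vertex parameters


def partition_by_left_edge(lines, tolerance=5.0):
    """Group lines whose left edges lie within `tolerance` of a partition's first line."""
    pages = []  # list of (anchor_left_x, member_lines)
    for ln in lines:
        for anchor, members in pages:
            if abs(ln.left_x - anchor) <= tolerance:
                members.append(ln)
                break
        else:
            # No existing partition matches, so this line starts a new one.
            pages.append((ln.left_x, [ln]))
    pages.sort(key=lambda page: page[0])  # left-to-right order
    return [members for _, members in pages]


lines = [
    Line("a", 0, 90),
    Line("x", 200, 290),
    Line("b", 2, 50),
    Line("y", 200, 240),
]
new_pages = partition_by_left_edge(lines)
page_texts = [[ln.text for ln in page] for page in new_pages]
```

Each resulting page has its own longest line, so the standard parameters are determined per partition rather than for the whole scan.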

In some embodiments, when the processor executes the computer-readable instructions, after adding a segmentation mark after the target character to obtain the segmented text, the steps further include: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.

One or more non-volatile computer-readable storage media store computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: acquiring a picture containing text content; performing text recognition on the picture to obtain a text page, the text page containing characters whose arrangement order is consistent with that of the text content; acquiring the vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation standard; identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second set of vertex parameters of the longest line of characters as standard parameters; calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and determining the target character in each line where the difference is greater than a preset value, and adding a segmentation mark after the target character to obtain the segmented text.

In some embodiments, the way of obtaining the preset value, as implemented when the computer-readable instructions are executed by the processor, includes: acquiring a character sample, and identifying the character type corresponding to the character sample; calculating the duty ratio of the character to a Chinese character for each character type, the duty ratio being the ratio of the width of the character in the document to the width of a Chinese character; and calculating the preset value according to the duty ratio.

In some embodiments, when the computer-readable instructions are executed by the processor, after adding a segmentation mark after the target character to obtain the segmented text, the steps further include: sending the segmented text to the server; receiving an update instruction returned by the server according to the segmented text; and replacing the preset value in the terminal according to the preset value in the update instruction.

In some embodiments, when the computer-readable instructions are executed by the processor, after acquiring the vertex parameters of each line of characters in the text page, the steps further include: partitioning the text according to the vertex parameters of each line of characters in the text page; generating a new text page from the partitioned text; and continuing to identify the longest line of characters in the text page according to the vertex parameters.

In some embodiments, when the computer-readable instructions are executed by the processor, after adding a segmentation mark after the target character to obtain the segmented text, the steps further include: saving the segmented text; and deleting the vertex parameters of each line of characters in the text page.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be understood as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the scope of protection of this application shall be subject to the appended claims.

Claims

1. A scanned text segmentation method, comprising:

Acquiring a picture containing text content;

Performing text recognition on the picture to obtain a text page, the text page containing characters consistent with the arrangement order of the text content;

Acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

Identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second set of vertex parameters of the longest line of characters as standard parameters;

Calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and

Determining the target character in each line where the difference is greater than a preset value, and adding a segmentation character after the target character to obtain the segmented text.
2. The method according to claim 1, wherein obtaining the preset value comprises:

Acquiring a character sample, and identifying the character type corresponding to the character sample;

Calculating the duty ratio of the character to a Chinese character for each of the character types, where the duty ratio is the ratio of the width of the character in the document to the width of a Chinese character; and

Calculating the preset value according to the duty ratio.
3. The method according to claim 2, wherein after adding a segmentation character after the target character to obtain the segmented text, the method further comprises:

Sending the segmented text to the server;

Receiving an update instruction returned by the server according to the segmented text; and

Replacing the preset value in the terminal according to the preset value in the update instruction.
4. The method according to claim 1, wherein after acquiring the vertex parameters of each line of characters in the text page, the method further comprises:

Partitioning the text according to the vertex parameters of each line of characters in the text page;

Generating a new text page based on the partitioned text; and

Continuing to identify the longest line of characters in the text page according to the vertex parameters.
5. The method according to any one of claims 1 to 4, wherein after adding a segmentation character after the target character to obtain the segmented text, the method further comprises:

Saving the segmented text; and

Deleting the vertex parameters of each line of characters in the text page.
6. A scanned text segmentation device, comprising:

A picture acquisition module, configured to acquire a picture containing text content;

A content conversion module, configured to perform text recognition on the picture to obtain a text page, the text page containing characters consistent with the arrangement order of the text content;

A vertex parameter acquisition module, configured to acquire the vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

A standard parameter obtaining module, configured to identify the longest line of characters in the text page according to the vertex parameters, and obtain the second set of vertex parameters of the longest line of characters as standard parameters;

A difference calculation module, configured to calculate the difference between the second set of vertex parameters of each line of characters and the standard parameters; and

A segmentation module, configured to determine the target character in each line where the difference is greater than a preset value, and add a segmentation mark after the target character to obtain the segmented text.
7. The device according to claim 6, wherein the device further comprises:

A sample acquisition module, configured to acquire a character sample and identify the character type corresponding to the character sample;

A character type analysis module, configured to calculate the duty ratio of the character to a Chinese character for each of the character types, where the duty ratio is the ratio of the width of the character in the document to the width of a Chinese character; and

A preset value calculation module, configured to calculate the preset value according to the duty ratio.
8. The device according to claim 7, wherein the device further comprises:

A sending module, used to send the segmented text to the server;

An update instruction receiving module, configured to receive an update instruction returned by the server according to the segmented text;

The preset value update module is configured to replace the preset value in the terminal according to the preset value in the update instruction.
9. The device according to claim 6, wherein the device further comprises:

A partition module, configured to partition the text according to the vertex parameters of each line of characters in the text page;

The page update module is used to generate a new text page according to the partitioned text; and

The new page segmentation module is used to continue the identification of the longest line of characters in the text page according to the vertex parameter.
10. The device according to any one of claims 6 to 9, wherein the device further comprises:

A saving module, configured to save the segmented text; and

A parameter deletion module, configured to delete the vertex parameters of each line of characters in the text page.
11. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:

Acquiring a picture containing text content;

Performing text recognition on the picture to obtain a text page, the text page containing characters consistent with the arrangement order of the text content;

Acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

Identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second set of vertex parameters of the longest line of characters as standard parameters;

Calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and

Determining the target character in each line where the difference is greater than a preset value, and adding a segmentation character after the target character to obtain the segmented text.
12. The computer device according to claim 11, wherein the way of obtaining the preset value, as implemented when the processor executes the computer-readable instructions, comprises:

Acquiring a character sample, and identifying the character type corresponding to the character sample;

Calculating the duty ratio of the character to a Chinese character for each of the character types, where the duty ratio is the ratio of the width of the character in the document to the width of a Chinese character; and

Calculating the preset value according to the duty ratio.
13. The computer device according to claim 12, wherein when executing the computer-readable instructions, after adding a segmentation character after the target character to obtain the segmented text, the processor further performs the following steps:

Sending the segmented text to the server;

Receiving an update instruction returned by the server according to the segmented text; and

Replacing the preset value in the terminal according to the preset value in the update instruction.
14. The computer device according to claim 11, wherein when executing the computer-readable instructions, after acquiring the vertex parameters of each line of characters in the text page, the processor further performs the following steps:

Partitioning the text according to the vertex parameters of each line of characters in the text page;

Generating a new text page based on the partitioned text; and

Continuing to identify the longest line of characters in the text page according to the vertex parameters.
15. The computer device according to any one of claims 11 to 14, wherein when executing the computer-readable instructions, after adding a segmentation character after the target character to obtain the segmented text, the processor further performs the following steps:

Saving the segmented text; and

Deleting the vertex parameters of each line of characters in the text page.
16. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

Acquiring a picture containing text content;

Performing text recognition on the picture to obtain a text page, the text page containing characters consistent with the arrangement order of the text content;

Acquiring vertex parameters of each line of characters in the text page, the vertex parameters of each line of characters including a first set of vertex parameters and a second set of vertex parameters, the second set of vertex parameters being the vertex parameters used to determine the segmentation criterion;

Identifying the longest line of characters in the text page according to the vertex parameters, and acquiring the second set of vertex parameters of the longest line of characters as standard parameters;

Calculating the difference between the second set of vertex parameters of each line of characters and the standard parameters; and

Determining the target character in each line where the difference is greater than a preset value, and adding a segmentation character after the target character to obtain the segmented text.
17. The storage medium according to claim 16, wherein the way of obtaining the preset value, as implemented when the computer-readable instructions are executed by the processor, comprises:

Acquiring a character sample, and identifying the character type corresponding to the character sample;

Calculating the duty ratio of the character to a Chinese character for each of the character types, where the duty ratio is the ratio of the width of the character in the document to the width of a Chinese character; and

Calculating the preset value according to the duty ratio.
18. The storage medium according to claim 17, wherein when the computer-readable instructions are executed by the processor, after adding a segmentation character after the target character to obtain the segmented text, the following steps are further performed:

Sending the segmented text to the server;

Receiving an update instruction returned by the server according to the segmented text; and

Replacing the preset value in the terminal according to the preset value in the update instruction.
19. The storage medium according to claim 16, wherein when the computer-readable instructions are executed by the processor, the following steps are further performed after acquiring the vertex parameters of each line of characters in the text page:

Partitioning the text according to the vertex parameters of each line of characters in the text page;

Generating a new text page based on the partitioned text; and

Continuing to identify the longest line of characters in the text page according to the vertex parameters.
20. The storage medium according to any one of claims 16 to 19, wherein when the computer-readable instructions are executed by the processor, after adding a segmentation character after the target character to obtain the segmented text, the following steps are further performed:

Saving the segmented text; and

Deleting the vertex parameters of each line of characters in the text page.