WO2020140698A1

WO2020140698A1 - Table data acquisition method and apparatus, and server

Info

Publication number: WO2020140698A1
Application number: PCT/CN2019/124101
Authority: WO
Inventors: 张林江
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2019-01-04
Filing date: 2019-12-09
Publication date: 2020-07-09
Also published as: CN110008809B; CN110008809A

Abstract

A table data acquisition method and apparatus, and a server. The method comprises: obtaining image data of text to be processed; extracting a combined graph from the image data, the combined graph being a graph containing morphological vertical lines and morphological horizontal lines crossing each other; dividing the combined graph into a plurality of rectangular units; performing optical character recognition on the rectangular units respectively, and determining text information of the rectangular units; and according to the position coordinates of the rectangular units, combining the rectangular units containing the text information to obtain table data. By first obtaining graphic features such as morphological vertical lines and morphological horizontal lines in image data and obtaining a combined graph according to the graphic features, then dividing the combined graph into a plurality of rectangular units for optical character recognition to obtain text information of the rectangular units, and carrying out combination reduction according to the position coordinates to obtain table data, the technical problems of big errors and inaccuracy in table data extraction in an existing method are solved.

Description

Method, apparatus and server for acquiring table data

Technical field

This specification relates to the field of Internet technology, and in particular to a method, apparatus and server for acquiring table data.

Background

In everyday life and work, certain text data (for example, contract documents) often contains not only plain text characters (for example, simple text symbols) but also table data (for example, a price list). In some scenarios this table data has high information value, because it holds the content that people pay most attention to.

Existing data acquisition methods usually perform optical character recognition directly on image data, such as scanned pictures containing the text, to recognize and extract the text information and obtain electronic file data for the corresponding text.

Such methods work relatively well when recognizing and extracting plain text characters from image data. Table data, however, differs from plain text characters. Besides the text information carried by its characters, it also has graphic features, for example the dividing lines that separate its rows and columns. Compared with plain text characters, table data has a more complicated structure and is harder to recognize. As a result, errors are likely when existing data acquisition methods are applied to table data in image data. For example, a dividing line in the table may be mistaken for a digit, or the text characters in the N rows and M columns of the table may become misaligned. There is therefore an urgent need for a method that can accurately recognize and completely restore the table data in image data.

Summary of the invention

The purpose of this specification is to provide a method, apparatus and server for acquiring table data that solve the technical problem of large errors and inaccuracy when extracting table data with existing methods, so that the table content in image data can be recognized efficiently and accurately and restored completely.

The method, apparatus and server for acquiring table data provided in this specification are implemented as follows:

A method for acquiring table data, comprising: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein the rectangular units each carry position coordinates; performing optical character recognition on each of the rectangular units to determine the text information contained in each rectangular unit; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

An apparatus for acquiring table data, comprising: an acquisition module for acquiring image data of text to be processed; an extraction module for extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; a segmentation module for dividing the combined graph into a plurality of rectangular units, wherein the rectangular units each carry position coordinates; a recognition module for performing optical character recognition on each of the rectangular units to determine the text information contained in each rectangular unit; and a combination module for combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein the rectangular units each carry position coordinates; performing optical character recognition on each of the rectangular units to determine the text information contained in each rectangular unit; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

A computer-readable storage medium on which computer instructions are stored, wherein the instructions, when executed, implement: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein the rectangular units each carry position coordinates; performing optical character recognition on each of the rectangular units to determine the text information contained in each rectangular unit; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

In the method, apparatus and server for acquiring table data provided in this specification, graphic features such as morphological vertical lines and morphological horizontal lines are first obtained from the image data, and a combined graph is extracted according to those features. The combined graph is then divided into a plurality of rectangular units, and optical character recognition is performed on each rectangular unit separately to obtain the text information it contains. Finally, the rectangular units containing text information are combined according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy when extracting table data with existing methods, so that the table content in image data can be recognized efficiently and accurately and restored completely.

Brief description of the drawings

To explain the technical solutions in the embodiments of this specification more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some of the embodiments in this specification; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;

FIG. 2 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;

FIG. 3 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;

FIG. 4 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;

FIG. 5 is a schematic diagram of an embodiment of a flow of a method for acquiring table data provided by an embodiment of this specification;

FIG. 6 is a schematic diagram of an embodiment of a structure of a server provided by an embodiment of this specification;

FIG. 7 is a schematic diagram of an embodiment of a structure of an apparatus for acquiring table data provided by an embodiment of this specification.

Detailed description

To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by persons of ordinary skill in the art from the embodiments in this specification without creative work fall within the protection scope of this specification.

Most existing data acquisition methods are designed to recognize plain text characters in image data containing the text to be processed, and they are fairly accurate at recognizing and extracting the text information those characters represent. However, some types of text data, such as contract text, also contain tables. Table content is structurally more complicated than plain text characters: besides text characters, it usually has graphic features, for example graphic morphological structures. This makes recognizing, extracting and reconstructing table data more complicated and difficult. When existing data acquisition methods are applied directly to table data in image data, text characters and graphic features are easily confused and cannot be distinguished and processed accurately, which causes errors. For example, a graphic structure such as a separator bar in the table may be mistaken for a text character, or the text information recognized at different positions in the table may become misaligned. In other words, existing acquisition methods often perform poorly on table data in image data, and extracting table data with them suffers from large errors and inaccuracy.

To address the root cause of these problems, this specification analyzes the different characteristics of the two kinds of objects that table data contains at the same time: text characters and graphic structures. Graphic structural features such as morphological vertical lines and morphological horizontal lines are first obtained from the image data and used to find a combined graph that may form table data. The combined graph is then divided into a plurality of rectangular units, and optical character recognition is performed on each rectangular unit separately to obtain its text information. Finally, the rectangular units containing text information are combined according to their position coordinates to restore and reconstruct the complete table data. This solves the technical problem of large errors and inaccuracy when extracting table data with existing methods, so that the table content in image data can be recognized efficiently and accurately and restored completely.

The embodiments of this specification provide a method for acquiring table data. The method may be applied to an image data processing system that includes one or more servers, for example a legal contract processing system that handles scanned pictures.

The system may include a server that recognizes and acquires, from image data, the table data in text data. In a specific implementation, the server can extract a combined graph from the acquired image data of the text to be processed by detecting the morphological vertical lines and morphological horizontal lines in the image data. It then divides the combined graph into a plurality of rectangular units according to coordinates and performs optical character recognition on each rectangular unit to determine the text information it contains. Finally, according to the coordinates of the rectangular units, it combines and splices the rectangular units containing text information to obtain the complete table data.

In this embodiment, the server can be understood as a service server that is applied to the business system side and can implement functions such as data transmission and data processing. Specifically, the server may be an electronic device with data calculation, storage, and network interaction functions; or a software program that runs on the electronic device and provides support for data processing, storage, and network interaction. In this embodiment, the number of the servers is not specifically limited. The server may specifically be one server, or several servers, or a server cluster formed by several servers.

In one scenario example, as shown in FIG. 1, the method for acquiring table data provided in the embodiments of this specification can be used to process image data containing a contract received by a legal platform, in order to extract the table data in the contract.

In this scenario example, the legal platform can distribute the image data containing the contract to be entered by the user to the server on the platform that is responsible for acquiring table data.

The legal platform can be used to recognize and extract the text information in image data uploaded by users (such as scanned pictures or photos of a contract), converting the contract content into electronic file data. This data is stored in the database of the legal platform, where users can conveniently access and manage it.

After receiving the image data containing the contract, the server may preprocess the image as shown in FIG. 2 to reduce interference from errors and improve the accuracy of the subsequent recognition and acquisition of table data.

Specifically, the server may be configured with OpenCV (Open Source Computer Vision Library). OpenCV can be understood as an open-source API function library for computer vision. The function code in the library has been optimized, so calls and computations are relatively efficient. In a specific implementation, the server can call the corresponding functions through OpenCV to process the image data efficiently.

Specifically, the server can first convert the image data into a corresponding grayscale image and then apply Gaussian smoothing to the grayscale image to filter out the more obvious noise and improve the accuracy of the image data, which completes the preprocessing. Converting the image data into a grayscale image is only an example. In a specific implementation, depending on the scenario and the accuracy requirements, the image data may instead be converted into a binary image first, and the subsequent table data acquisition may be performed on the binary image. This specification does not limit this.
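The preprocessing step can be illustrated with a minimal sketch. It is written in plain Python rather than OpenCV so that it runs without dependencies; an OpenCV-based server would more likely call functions such as cvtColor and GaussianBlur. The function names, the luminance weights and the 3x3 kernel are assumptions chosen for illustration and are not specified by this document.

```python
# Illustrative sketch only: names, weights and kernel size are assumptions.

def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples into rows of integer gray levels."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]

# 3x3 Gaussian kernel; the weights sum to 16.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]

def gaussian_smooth(gray):
    """Smooth a grayscale image with the 3x3 kernel, replicating edge pixels."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to the image
                    xx = min(max(x + dx, 0), w - 1)
                    total += KERNEL[dy + 1][dx + 1] * gray[yy][xx]
            out[y][x] = total // 16
    return out
```

A single bright noise pixel is spread out and attenuated by the smoothing, while uniform regions are left unchanged.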

After preprocessing the image data containing the contract, the server can scan the image data for graphic structural features (such as structural elements) based on morphology. The goal is to first find the graphs that differ from text characters, have certain graphic features and may form a table. Such a graph is called a combined graph.

Take one frame of the image data as an example, say the image of the fifth page of the contract. The server can scan this frame for morphological vertical lines and morphological horizontal lines.

Morphological vertical lines and morphological horizontal lines can be understood as graphic structural elements that differ from text characters (see FIG. 3). A morphological vertical line is an image unit or structural element that contains a straight line segment along the vertical direction of the image. A morphological horizontal line is an image unit or structural element that contains a straight line segment along the horizontal direction of the image.

Specifically, the server can call the getStructuringElement function to obtain line-shaped structuring elements and use them to find all the morphological vertical lines and morphological horizontal lines in the image. Obtaining the lines by calling getStructuringElement is only an illustration. In a specific implementation, the morphological vertical lines and morphological horizontal lines may also be obtained in other suitable ways, depending on the situation. This specification does not limit this.
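As a hedged illustration of what such a line search does: in OpenCV, getStructuringElement builds a kernel (for example a 1-pixel-high, k-pixel-wide rectangle), which is then typically applied with erode and dilate so that only strokes at least as long as the kernel survive. The plain-Python sketch below reproduces that effect on a 0/1 image by keeping only sufficiently long runs of foreground pixels. The function names and the min_len parameter are assumptions for illustration, not part of this document.

```python
# Illustrative stand-in for a morphological opening with a line-shaped kernel.

def keep_horizontal_runs(binary, min_len):
    """Keep foreground pixels that lie in a horizontal run of >= min_len pixels."""
    out = [[0] * len(row) for row in binary]
    for y, row in enumerate(binary):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_len:  # long enough to count as a line
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out

def transpose(img):
    """Swap rows and columns of a rectangular image."""
    return [list(col) for col in zip(*img)]

def keep_vertical_runs(binary, min_len):
    """Same filter applied along columns, via transposition."""
    return transpose(keep_horizontal_runs(transpose(binary), min_len))
```

Short strokes, such as those belonging to text characters, are shorter than min_len and are discarded, which is what separates table rules from text in this sketch.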

In table data, each morphological horizontal line usually intersects one or more morphological vertical lines. Therefore, after obtaining the morphological vertical lines and morphological horizontal lines in the frame, the server can further search for graphs that contain intersecting morphological vertical and horizontal lines and treat them as combined graphs that may be table data. This avoids further processing of graphic structures that clearly lack the graphic features of a table, and improves processing efficiency.

In this scenario example, to avoid misalignment of the recognized morphological horizontal and vertical lines, the lines can be extracted directly from the original image, and the extracted morphological horizontal lines and morphological vertical lines can then be overlaid at the positions from which they were extracted.

After obtaining a combined graph that shows the typical features of a data table and may form table data, the server can inspect it further. By checking whether the combined graph meets preset table format requirements, the server can determine more accurately whether the combined graph is a data table.

The preset table format requirements can be understood as a set of rules describing the graphic features that distinguish a data table from other graphic structures.

For example, unlike other graphics, each grid cell of a data table (a rectangular frame, see FIG. 3) is meant to hold characters, so even the smallest cell should be able to accommodate at least one complete character. A rule on graphic area can therefore be set: the minimum area of a grid cell in the data table should be greater than a preset area threshold. In addition, people usually center a table when typesetting it. A rule on graphic position can therefore also be set: the absolute value of the difference between the distance from the left border of the data table to the left border of the image and the distance from the right border of the data table to the right border of the image is less than a preset distance threshold. Finally, a table is normally used to place at least two pieces of data side by side so that the differences between them are shown more clearly. A rule on the number of graphics can therefore also be set: the number of grid cells in the data table is greater than or equal to a preset quantity threshold (for example, 2), and so on.

Of course, it should be noted that the specific rules included in the preset table format requirements listed above are only for better explaining the implementation of this specification. During specific implementation, according to specific application scenarios and processing requirements, other types or content rules may also be introduced as the above-mentioned preset table format requirements. This specification is not limited.

In this scenario example, to determine whether the extracted combined graph meets the preset table format requirements, the server can first retrieve the points at which a morphological horizontal line and a morphological vertical line in the combined graph occupy the same image position and treat them as intersection points. It then determines the position coordinates, within the frame, of each intersection point in the combined graph.

The above-mentioned intersection point can be specifically understood as the pixel point at the position where the morphological vertical line and the morphological horizontal line intersect in the combined image in the frame image. See Figure 3 for details.

Specifically, the server can obtain the coordinates of the intersection points in the combined graph by calling the OpenCV bitwise_and function. Obtaining the intersection coordinates through bitwise_and is only an illustration. In a specific implementation, the server may also obtain the coordinates of the intersection points in other suitable ways, depending on the situation. This specification does not limit this.
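The idea behind using a bitwise AND here can be sketched as follows: given one 0/1 mask containing only the horizontal lines and another containing only the vertical lines, a pixel that is set in both masks lies on an intersection. The sketch uses plain Python lists in place of OpenCV images, and the function name and (x, y) ordering are assumptions for illustration.

```python
# Illustrative sketch: pixel-wise AND of two line masks.

def intersections(h_mask, v_mask):
    """Return the (x, y) coordinates of pixels set in both masks."""
    return [(x, y)
            for y, (h_row, v_row) in enumerate(zip(h_mask, v_mask))
            for x, (h, v) in enumerate(zip(h_row, v_row))
            if h and v]
```

With real line masks, each crossing usually yields a small cluster of pixels rather than a single one, so a practical implementation would probably merge neighbouring points before using them as cell corners.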

At the same time, the server may search the graphic structural elements of the combined graph and find the elements with a rectangular (or square) structure, that is, the grid cells of the corresponding table, as the rectangular frames in the combined graph (see FIG. 3).

Specifically, the server may obtain the rectangular frames in the combined graph by calling the findContours function. Obtaining the rectangular frames through findContours is only an illustration. In a specific implementation, the server may also obtain the rectangular frames in the combined graph in other suitable ways, depending on the situation. This specification does not limit this.

Further, by comparing the positions of the determined intersection points with the rectangular frames in the combined graph, the server may determine the coordinates of the four endpoints of each rectangular frame. Based on these endpoint coordinates, it can then determine whether the combined graph meets the preset table format requirements.

For example, the server may calculate the length and width of each rectangular frame from its endpoint coordinates, calculate the area of the rectangular frame from the length and width, and compare the area with the preset area threshold. If the area of every rectangular frame in the combined graph is greater than the preset area threshold, the combined graph can be determined to meet the preset table format requirements.
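A minimal sketch of the area check, together with the cell-count rule mentioned earlier, might look like this. Each rectangular frame is assumed to be given as a list of four (x, y) endpoints with sides parallel to the image axes; the function names and threshold values are hypothetical.

```python
# Illustrative sketch: area and count rules on axis-aligned rectangular frames.

def box_area(endpoints):
    """Area of a rectangular frame given its four (x, y) endpoints."""
    xs = [x for x, _ in endpoints]
    ys = [y for _, y in endpoints]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def meets_area_rule(boxes, area_threshold):
    """True if every frame is strictly larger than the area threshold."""
    return all(box_area(box) > area_threshold for box in boxes)

def meets_count_rule(boxes, count_threshold=2):
    """True if there are at least count_threshold frames."""
    return len(boxes) >= count_threshold
```

For a frame with endpoints (15, 60), (15, 40), (30, 40) and (30, 60), the sides are 15 and 20, so the area is 300.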

As another example, the server can compare the abscissas of the endpoints of all the rectangular frames in the combined graph, take the endpoint with the smallest abscissa as a point on the left border of the combined graph, and use its abscissa as the abscissa of the left border. From this it calculates the distance between the left border of the combined graph and the left border of the image, denoted d1. Similarly, the server takes the endpoint with the largest abscissa as a point on the right border of the combined graph and uses its abscissa as the abscissa of the right border. From this it calculates the distance between the right border of the combined graph and the right border of the image, denoted d2. The server may then calculate the absolute value of the difference between d1 and d2 and compare it with a preset distance threshold. If the absolute value is less than or equal to the preset distance threshold, the combined graph as a whole can be determined to be centered in the image, that is, the preset table format requirements are met.
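The centering check can be sketched as below. The sketch assumes that the left border of the image is at x = 0 and the right border at x = image_width, and it uses the "less than or equal to" comparison from this paragraph; the function name and parameters are hypothetical.

```python
# Illustrative sketch: is the combined graph horizontally centered in the image?

def is_centered(boxes, image_width, distance_threshold):
    """Compare the left margin d1 and the right margin d2 of the table."""
    xs = [x for box in boxes for x, _ in box]
    d1 = min(xs)                 # table left border to image left border (x = 0)
    d2 = image_width - max(xs)   # table right border to image right border
    return abs(d1 - d2) <= distance_threshold
```

For a table spanning x = 15 to x = 30 in an image 45 pixels wide, d1 = d2 = 15, so the table counts as centered for any non-negative threshold.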

The ways of judging whether the combined graph meets the preset table format requirements listed above are only intended to explain the embodiments of this specification. In a specific implementation, depending on the situation and the accuracy requirements, the two judgment methods may be combined, or other suitable judgment methods may be introduced. This specification does not limit this.

After determining that the combined graph meets the preset table format requirements, the server may conclude that the extracted combined graph is indeed a data table in the image. Text information can then be extracted from the combined graph.

Because the combined graph usually contains many grid cells or rectangular frames, recognizing and extracting its text information directly is prone to problems such as misalignment. The server may therefore first divide the combined graph into a plurality of rectangular units. Each rectangular unit corresponds one-to-one to a rectangular frame in the combined graph. Unlike a rectangular frame, which is purely a graphic structural element, each rectangular unit also contains text characters or is blank. Optical character recognition can then be performed on each rectangular unit separately to recognize its text characters accurately and determine the text information it contains.

Specifically, the server may first determine, from the endpoint coordinates of a rectangular frame, the outline enclosing the frame and use it as the dividing line. It may then cut along the outline to separate the rectangular unit corresponding to the frame from the combined graph. For example, see FIG. 4. Suppose the four endpoints of a rectangular frame in the combined graph are A (15, 60), B (15, 40), C (30, 40) and D (30, 60). Following a preset division rule, the server starts from endpoint A, keeps the abscissa 15 fixed, finds the endpoint with a different ordinate, namely B, and connects A to B. Starting from B, it keeps the ordinate 40 fixed, finds the endpoint with a different abscissa, namely C, and connects B to C. Starting from C, it keeps the abscissa 30 fixed, finds the endpoint with a different ordinate, namely D, and connects C to D. Finally, starting from D, it keeps the ordinate 60 fixed, finds the endpoint with a different abscissa, namely A, and connects D to A. The result is a closed path, A to B to C to D to A, which is the outline of the rectangular frame. The server may then use this outline as the dividing line and cut the rectangular frame, together with the text information it contains, out of the combined graph to obtain the corresponding rectangular unit.
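The tracing rule in the example above can be sketched in a few lines: alternate between holding the abscissa fixed and holding the ordinate fixed until the path returns to the starting endpoint. The sketch assumes an axis-aligned frame with exactly four distinct endpoints; the function names, and the choice to include the border pixels when cropping, are assumptions for illustration.

```python
# Illustrative sketch: trace the outline A -> B -> C -> D -> A, then crop the unit.

def trace_outline(endpoints):
    """Order four axis-aligned rectangle endpoints into a closed outline."""
    start = endpoints[0]
    path = [start]
    remaining = list(endpoints[1:])
    keep_x = True  # the first step keeps the abscissa fixed, as in A -> B
    while remaining:
        cx, cy = path[-1]
        if keep_x:
            nxt = next(p for p in remaining if p[0] == cx and p[1] != cy)
        else:
            nxt = next(p for p in remaining if p[1] == cy and p[0] != cx)
        path.append(nxt)
        remaining.remove(nxt)
        keep_x = not keep_x
    path.append(start)  # close the loop
    return path

def crop_unit(image, endpoints):
    """Cut the rectangular unit (borders included) out of a row-major image."""
    xs = [x for x, _ in endpoints]
    ys = [y for _, y in endpoints]
    return [row[min(xs):max(xs) + 1] for row in image[min(ys):max(ys) + 1]]
```

The endpoints may be supplied in any order after the first; the alternating rule recovers the outline order used in the example.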

Each rectangular unit in the combined graph can be separated in this way. This manner of division is only intended to explain the embodiments of this specification. In a specific implementation, other suitable methods may also be used to divide the combined graph into a plurality of rectangular units, depending on the circumstances. This specification does not limit this.

Note that while dividing the combined graph, the server also generates the position coordinates of each rectangular unit from the endpoint coordinates of the corresponding rectangular frame.

The position coordinates can be understood as parameter data that indicates where the rectangular unit lies in the combined graph, or that describes the positional relationship between the rectangular unit and its adjacent rectangular units in the combined graph.

Specifically, the server may calculate the coordinates of the center point of a rectangular frame from the coordinates of its four endpoints and use them as the position coordinates of the corresponding rectangular unit. Alternatively, the server may first calculate the center point of every rectangular frame and then, following a preset order such as top to bottom and left to right, determine the row number and column number of each rectangular unit from those center points and use them as its position coordinates. For example, if the center point shows that rectangular frame A lies in the first row and second column of the combined graph, its row number is 1 and its column number is 2, so "1-2" can be used as the position coordinates of the rectangular unit corresponding to rectangular frame A. These ways of determining the position coordinates are only illustrations. In a specific implementation, other suitable methods may also be used, depending on the situation. This specification does not limit this.
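The row and column numbering can be sketched as follows, assuming a regular grid in which the cells of one row share the same center ordinate and the cells of one column share the same center abscissa, and assuming image coordinates in which y increases downward. The function names and the "row-column" label format follow the "1-2" example but are otherwise hypothetical.

```python
# Illustrative sketch: derive "row-column" labels from frame center points.

def center(endpoints):
    """Center point of a rectangular frame given its four (x, y) endpoints."""
    xs = [x for x, _ in endpoints]
    ys = [y for _, y in endpoints]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

def grid_positions(boxes):
    """Label each frame "row-column", top to bottom and left to right."""
    centers = [center(box) for box in boxes]
    rows = sorted({cy for _, cy in centers})  # distinct center ordinates
    cols = sorted({cx for cx, _ in centers})  # distinct center abscissas
    return [f"{rows.index(cy) + 1}-{cols.index(cx) + 1}" for cx, cy in centers]
```

Scanned images rarely give exactly equal center coordinates, so a practical version would probably group centers that differ by only a few pixels instead of requiring equality.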

After dividing the combined graph into multiple rectangular units, the server can perform optical character recognition (OCR) on each rectangular unit separately to recognize the text characters in it and thereby determine the text information it contains. If no text characters are recognized in a rectangular unit, the text information of that unit is left blank. In this way, multiple rectangular units containing corresponding text information are obtained.

Further, the server may combine the rectangular units containing text information obtained above according to the position coordinates of each rectangular unit. For example, a rectangular unit whose position coordinates are "1-2" is placed at the first row and second column. Placing each rectangular unit containing text information at its corresponding position in this manner restores the complete data table. Of course, this manner of combination is only illustrative. In a specific implementation, other combination methods may also be used to splice the units according to other types of position coordinates. This specification imposes no limitation in this respect.
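As a minimal sketch of this combination step, assuming that the position coordinates take the "row-column" string form of the example above, the recognized text of each rectangular unit can be placed into a two-dimensional list. The function name `assemble_table` and the choice of an empty string for blank or missing units are assumptions made for illustration.

```python
from typing import Dict, List


def assemble_table(cells: Dict[str, str]) -> List[List[str]]:
    """Place each unit's text at the row and column encoded in its key.

    Keys are "row-column" strings such as "1-2" (1-based). Units without
    recognized text, or positions with no unit, are left as empty strings.
    """
    positions = {}
    for key, text in cells.items():
        row, col = (int(part) for part in key.split("-"))
        positions[(row, col)] = text
    if not positions:
        return []

    n_rows = max(row for row, _ in positions)
    n_cols = max(col for _, col in positions)
    return [
        [positions.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]
```

For example, the units `{"1-1": "Item", "1-2": "Price", "2-1": "Pen", "2-2": ""}` would be restored to a table with two rows and two columns, with the blank unit kept in place.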

In this way, the server can check each image in the image data of the contract to be processed for table data, acquire the table data whenever it is determined to exist, extract the complete table data from the image data, and feed the extracted table data back to the legal platform, which organizes it into electronic file data for the contract and stores it.

In another scenario example, in order to make the table lines in the obtained table data clearer and to improve the accuracy of the subsequent optical character recognition, the server may, after obtaining the morphological vertical lines and morphological horizontal lines, further perform feature enhancement processing on them so that they become clearer.

The above feature enhancement processing may specifically be morphological processing, and may include erosion processing and/or dilation processing. In a specific implementation, the morphological processing slides the window of a convolution kernel over the image and resets the value of the pixel at the center of the window (to 0 or 1). Specifically, erosion may be performed first, followed by dilation.

Specifically, the erosion processing can be understood as an AND operation. Pixels near the edge of the foreground are eroded according to the size of the convolution kernel (that is, the value of the corresponding pixel is reset to 0), so the foreground object becomes smaller. This shrinks the white area around a morphological vertical line or morphological horizontal line, which removes white noise, and it can also break apart structural elements that are adjacent to, or even connected to, the line.

Because erosion also shrinks the structural elements of the image, the morphological vertical lines and morphological horizontal lines may be dilated after the erosion processing.

The dilation processing can be understood as an OR operation. In contrast to erosion, dilation enlarges the eroded image and restores it, yielding relatively clear morphological vertical lines and morphological horizontal lines whose size is unchanged.
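The AND and OR interpretation above can be illustrated with a small pure-Python sketch on a binary image. This is only an illustration under stated assumptions: the names `erode` and `dilate`, the list-of-lists image representation, and the choice to ignore neighbours that fall outside the image are all assumptions of the sketch. A practical implementation would more likely call library routines such as the erode and dilate functions of OpenCV.

```python
from typing import Callable, Iterable, List

Image = List[List[int]]  # binary image: 1 = foreground (white), 0 = background


def _slide(image: Image, kh: int, kw: int,
           combine: Callable[[Iterable[int]], bool]) -> Image:
    """Slide a kh x kw window over the image and reset each center pixel."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the pixels under the window; neighbours outside the
            # image are ignored (an assumption of this sketch).
            window = [
                image[y + dy][x + dx]
                for dy in range(-(kh // 2), kh - kh // 2)
                for dx in range(-(kw // 2), kw - kw // 2)
                if 0 <= y + dy < height and 0 <= x + dx < width
            ]
            out[y][x] = 1 if combine(window) else 0
    return out


def erode(image: Image, kh: int, kw: int) -> Image:
    """AND operation: a pixel stays 1 only if every pixel in the window is 1."""
    return _slide(image, kh, kw, all)


def dilate(image: Image, kh: int, kw: int) -> Image:
    """OR operation: a pixel becomes 1 if any pixel in the window is 1."""
    return _slide(image, kh, kw, any)
```

Applying `erode` and then `dilate` with a 1 x 3 window to an image containing a five-pixel horizontal line and one isolated white pixel removes the isolated pixel and restores the line to its original length.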

As can be seen from the above scenario examples, the method for acquiring table data provided in this specification first extracts the combined graph according to graphic features such as the morphological vertical lines and morphological horizontal lines in the image data; then divides the combined graph into multiple rectangular units and performs optical character recognition on each rectangular unit to obtain the text information it contains; and finally combines the rectangular units containing text information according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy in table data extraction in existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely.

Referring to FIG. 5, an embodiment of the present specification also provides a method for acquiring table data, where the method is specifically applied to the server side. During specific implementation, the method may include the following:

S51: Acquire image data of the text to be processed.

In this embodiment, the above-mentioned to-be-processed text may specifically be a to-be-processed contract text, a to-be-processed constitution text, or a to-be-processed specification text. Correspondingly, the image data of the text to be processed may be a scanned image containing the text content, a photo containing the text content, or a video containing the text content. The specific content and form of the image data of the text to be processed above are not limited in this specification.

S53: Extract a combination graph from the image data, wherein the combination graph is a graph including vertical morphological lines and horizontal morphological lines.

In this embodiment, the above morphological vertical line and morphological horizontal line can be specifically understood as a structural element related to graphics that is different from text characters. The morphological vertical line may specifically be an image unit or a structural element that contains a straight line segment along the vertical direction in the image. The above-mentioned morphological horizontal line may specifically be an image unit or a structural element that contains a straight line segment along the horizontal direction in the image.

In this embodiment, the above-mentioned combined graph can be specifically understood as the image data having graphic features similar to the table data, for example, a combined graph including graphic structural elements of crossing morphological vertical lines and morphological horizontal lines.

In this embodiment, extracting the combined graph from the image data may, in a specific implementation, include the following: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.

In this embodiment, searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data may, in a specific implementation, include the following: calling the getStructuringElement function in OpenCV to construct structural elements, and using those structural elements to search the image for the morphological vertical lines and morphological horizontal lines in the image data. Of course, obtaining the morphological vertical lines and morphological horizontal lines from the image by calling the getStructuringElement function is only illustrative. In a specific implementation, the morphological vertical lines and morphological horizontal lines in the image may also be obtained in other suitable ways according to the specific situation. This specification imposes no limitation in this respect.
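The idea of isolating long straight segments can be sketched without OpenCV. The following Python sketch does not use getStructuringElement; instead it swaps in a simple run-length filter that keeps only horizontal (or vertical) runs of at least `min_len` foreground pixels, which is comparable in effect to a morphological opening with a long, thin structural element. The names `horizontal_lines`, `vertical_lines`, and `min_len` are assumptions of this sketch.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = foreground, 0 = background


def horizontal_lines(image: Image, min_len: int) -> Image:
    """Keep only horizontal runs of at least min_len foreground pixels."""
    out = [[0] * len(row) for row in image]
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                # Short runs (for example, strokes of text) are discarded.
                if x - start >= min_len:
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out


def vertical_lines(image: Image, min_len: int) -> Image:
    """Keep only vertical runs by filtering the transposed image."""
    transposed = [list(column) for column in zip(*image)]
    filtered = horizontal_lines(transposed, min_len)
    return [list(row) for row in zip(*filtered)]
```

With a suitable `min_len`, table rules survive in one of the two outputs while short strokes belonging to text characters appear in neither.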

In this embodiment, the morphological vertical lines and morphological horizontal lines obtained in the above manner also carry their position information in the image data, so the corresponding morphological vertical lines and morphological horizontal lines can be connected according to that position information to obtain the combined graph.

S55: Divide the combined image into a plurality of rectangular units, where the plurality of rectangular units respectively carry position coordinates.

In this embodiment, the rectangular unit can be understood as an image unit that corresponds one-to-one with a rectangular frame in the combined graph but, unlike the rectangular frame, also contains text information (either filled-in text characters or blank).

In this embodiment, the above-mentioned rectangular frame can be specifically understood as a rectangular or square-shaped graphic element composed of two morphological vertical lines and two morphological horizontal lines, which simply contain only graphic features. Among them, each rectangular frame can be regarded as a grid in the table.

In this embodiment, dividing the combined graph into a plurality of rectangular units may, in a specific implementation, include the following: obtaining the coordinates of the intersection points in the combined graph; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph; and dividing the combined graph into a plurality of rectangular units according to the endpoint coordinates of the rectangular frames.

In this embodiment, the above-mentioned intersection point can be specifically understood as the pixel point at the position where the vertical morphological line and the horizontal morphological line in the combination figure intersect.

In this embodiment, in a specific implementation, the coordinates of the intersection points in the combined graph can be searched for and obtained by calling the bitwise_and function in OpenCV. Of course, obtaining the coordinates of the intersection points through the bitwise_and function is only illustrative. In a specific implementation, the server may also obtain the coordinates of the intersection points in the combined graph in other suitable ways according to the specific situation. This specification imposes no limitation in this respect.
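The principle behind using a bitwise AND can be shown with a short pure-Python sketch: a pixel is an intersection point when it is foreground in both the mask of morphological horizontal lines and the mask of morphological vertical lines. The function name `find_intersections` and the list-of-lists masks are assumptions of this sketch; it stands in for, and does not reproduce, the OpenCV call.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = line pixel, 0 = background


def find_intersections(h_mask: Mask, v_mask: Mask) -> List[Tuple[int, int]]:
    """Return (x, y) of every pixel set in both masks, scanned row by row."""
    points = []
    for y, (h_row, v_row) in enumerate(zip(h_mask, v_mask)):
        for x, (h, v) in enumerate(zip(h_row, v_row)):
            if h & v:  # bitwise AND of the two mask values
                points.append((x, y))
    return points
```

Note that lines more than one pixel thick yield a small cluster of points at each crossing, so a real pipeline would likely need to merge nearby points into one.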

In this embodiment, during specific implementation, the rectangular frame in the combined graph can be searched and obtained by calling the findContours function in OpenCV. Of course, it should be noted that the above-mentioned enumeration of the rectangular frame in the combination diagram by the findContours function is only a schematic illustration. During specific implementation, the server may also obtain the rectangular frame in the combination diagram in other suitable ways according to specific conditions. This specification is not limited.

In this embodiment, OpenCV (Open Source Computer Vision Library) can be understood as an open-source library of API functions for computer vision. The function code in the library has been optimized, so it is relatively efficient to call and to execute. In a specific implementation, the server can call the corresponding function code through OpenCV to process the image data efficiently.

In this embodiment, dividing the combined graph into a plurality of rectangular units according to the endpoint coordinates of the rectangular frames may, in a specific implementation, include the following: determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph; determining the dividing lines according to the endpoint coordinates of the rectangular frames; and dividing the combined graph into a plurality of rectangular units along the dividing lines.

In this embodiment, determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph may, in a specific implementation, include the following: comparing the positions of the intersection points in the combined graph with the positions of the rectangular frames, so as to determine from the intersection points the four endpoints of each rectangular frame, and thereby determining the endpoint coordinates of each rectangular frame.

In this embodiment, the above-mentioned determination of the dividing line according to the coordinates of the end points of the rectangular frame may include the following content: according to the coordinates of the four end points of each rectangular frame, the outline line surrounding the rectangular frame is determined as the corresponding dividing line. Furthermore, subsequent division can be performed along the above division line, and each rectangular unit can be obtained from the combination diagram.
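The following Python sketch illustrates one simplified way to go from intersection points to rectangular units. It is a sketch under stated assumptions, not the procedure of this specification: instead of matching intersection points against frames found with findContours, it assumes one point per crossing and a regular grid, and forms a frame wherever four neighbouring intersection points exist. The names `boxes_from_intersections` and `crop_cell`, and the `border` parameter that trims the dividing lines themselves, are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]           # (x, y)
Box = Tuple[int, int, int, int]   # (left, top, right, bottom)


def boxes_from_intersections(points: List[Point]) -> List[Box]:
    """Form a frame between neighbouring grid lines when all 4 corners exist."""
    present = set(points)
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    boxes = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            corners = {(left, top), (right, top), (left, bottom), (right, bottom)}
            if corners <= present:
                boxes.append((left, top, right, bottom))
    return boxes


def crop_cell(image: List[List[int]], box: Box, border: int = 1) -> List[List[int]]:
    """Cut one rectangular unit out of the image along its dividing lines.

    border pixels are trimmed on every side so that the table rules are not
    passed on to character recognition.
    """
    left, top, right, bottom = box
    rows = image[top + border: bottom + 1 - border]
    return [row[left + border: right + 1 - border] for row in rows]
```

Each cropped unit can then be handed to optical character recognition together with the frame it came from, so that its position coordinates remain known.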

In this embodiment, while dividing the combined image to obtain multiple rectangular units, the method further includes the following content: generating position coordinates of the rectangular units according to the coordinates of the end points of the rectangular frame.

In this embodiment, the position coordinates of a rectangular unit can be understood as parameter data used to indicate the position of the rectangular unit in the combined graph, or to describe the positional relationship between the rectangular unit and other adjacent rectangular units in the combined graph.

In this embodiment, in a specific implementation, the coordinates of the center point of the rectangular frame may be calculated from the coordinates of its four endpoints and used as the position coordinates of the corresponding rectangular unit. Alternatively, the center point coordinates of every rectangular frame may be calculated first; the rectangular units are then arranged in a preset order (for example, from top to bottom and from left to right) according to those center point coordinates, and the row number and column number of each sorted rectangular unit are determined as the position coordinates of that unit. Of course, the methods listed above for determining the position coordinates of a rectangular unit are only illustrative. In a specific implementation, other suitable methods may also be used according to the specific situation. This specification imposes no limitation in this respect.

S57: Perform optical character recognition on the plurality of rectangular units to determine the text information contained in the plurality of rectangular units.

In this embodiment, in a specific implementation, optical character recognition may be performed separately on each of the plurality of rectangular units to recognize the text characters in each rectangular unit and thereby determine the text information contained in each rectangular unit.

In this embodiment, during specific implementation, when text characters are not recognized from the rectangular unit, the text information contained in the rectangular unit may be left blank.

S59: According to the position coordinates of the rectangular unit, combine the rectangular units containing text information to obtain table data.

In this embodiment, in a specific implementation, rectangular units containing text information whose position coordinates are adjacent may be spliced together according to the position coordinates of each rectangular unit, so that every rectangular unit containing text information is placed at its corresponding position, thereby obtaining the complete table data.

In this embodiment, the combined graph is extracted according to graphic features such as the morphological vertical lines and morphological horizontal lines in the image data; the combined graph is then divided into a plurality of rectangular units, and optical character recognition is performed on each rectangular unit to obtain the text information it contains; the rectangular units containing text information are then combined according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy in table data extraction in existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely.

In an embodiment, in order to reduce noise interference and improve the accuracy of acquiring table data, after the image data of the text to be processed is acquired, the method may, in a specific implementation, further include the following: preprocessing the image data of the text to be processed, wherein the preprocessing includes converting the image data into a grayscale image, and/or performing Gaussian smoothing on the image data to filter out noise interference. Of course, the preprocessing methods listed above are only intended to better explain the embodiments of this specification. In a specific implementation, other suitable processing methods may be used for preprocessing according to the specific situation and accuracy requirements. This specification imposes no limitation in this respect.
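The two preprocessing operations can be sketched in pure Python as follows. This is an illustrative sketch only: the function names, the commonly used luma weights 0.299, 0.587 and 0.114, the 3 x 3 kernel (a common small approximation of a Gaussian), and the edge handling that repeats border pixels are assumptions of the sketch rather than details given in this specification. A practical implementation would more likely use library routines such as cvtColor and GaussianBlur in OpenCV.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]

# 3 x 3 approximation of a Gaussian kernel; the weights sum to 16.
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))


def to_gray(rgb: List[List[RGB]]) -> List[List[int]]:
    """Convert an RGB image to grayscale with common luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb
    ]


def gaussian_smooth(gray: List[List[int]]) -> List[List[float]]:
    """Smooth a grayscale image with the 3 x 3 kernel above.

    Coordinates outside the image are clamped to the nearest border pixel.
    """
    height, width = len(gray), len(gray[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += KERNEL[dy + 1][dx + 1] * gray[yy][xx]
            out[y][x] = total / 16
    return out
```

Smoothing spreads an isolated bright pixel over its neighbours at reduced intensity, which is why isolated noise has less influence on the later search for morphological lines.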

In one embodiment, extracting the combined graph from the image data may, in a specific implementation, include the following: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.

In one embodiment, searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data may, in a specific implementation, include the following: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data through the getStructuringElement function.

In one embodiment, in order to make the acquired morphological vertical lines and morphological horizontal lines clearer and to reduce the impact of errors on the subsequent recognition of text information, after the morphological vertical lines and morphological horizontal lines in the image data are searched for and obtained, the method may, in a specific implementation, further include the following: performing feature enhancement processing on the obtained morphological vertical lines and morphological horizontal lines respectively, wherein the feature enhancement processing includes at least one of the following: erosion processing and dilation processing.

In this embodiment, in a specific implementation, the morphological vertical lines and morphological horizontal lines may be eroded first, and the eroded morphological vertical lines and morphological horizontal lines may then be dilated.

In this embodiment, erosion removes the white noise around the foreground of the morphological vertical lines and morphological horizontal lines and makes the lines clearer, but it also shrinks the graphic elements of the lines. Therefore, after the morphological vertical lines and morphological horizontal lines are eroded, dilation can be used to restore them to their original size, yielding clearer lines of unchanged size.

In one embodiment, it is considered that the combined graph merely has graphic features similar to table data and may not actually be table data. For example, a large text character "田" ("Tian") also has graphic features similar to table data. Therefore, the extracted combined graph can be checked to determine whether it meets the preset table format requirements, so as to determine more accurately whether the combined graph is real table data. Data processing is then performed only on combined graphs determined to be table data, which reduces the waste of resources and improves processing efficiency.

In one embodiment, after the combined graph is extracted from the image data, the method may, in a specific implementation, further include: acquiring the coordinates of the intersection points in the combined graph, where an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line in the combined graph intersect; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph; and determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph meets the preset table format requirements.

In this embodiment, in a specific implementation, the coordinates of the intersection points in the combined graph can be searched for and obtained by calling the bitwise_and function in OpenCV. Of course, obtaining the coordinates of the intersection points through the bitwise_and function is only illustrative. In a specific implementation, the server may also obtain the coordinates of the intersection points in the combined graph in other suitable ways according to the specific situation. This specification imposes no limitation in this respect.

In this embodiment, during specific implementation, the rectangular frame in the combined graph can be searched and obtained by calling the findContours function. Of course, it should be noted that the above-mentioned enumeration of the rectangular frame in the combination diagram by the findContours function is only a schematic illustration. During specific implementation, the server may also obtain the rectangular frame in the combination diagram in other suitable ways according to specific conditions. This specification is not limited.

In this embodiment, the above-mentioned preset table format requirement can be specifically understood as a rule set for describing the graphic features of the data table different from other graphic structures.

In a specific implementation, the rules included in the preset table format requirements can be set flexibly according to the specific situation. For example, unlike other graphics, each grid (or rectangular frame) in a data table is meant to be filled with characters, so even the smallest grid in a data table should be large enough to hold at least one complete character. Therefore, the following rule on graphic area may be set: the minimum area of a grid in the data table should be greater than a preset area threshold. Further, given common typesetting habits, a table is usually centered when it is edited. Therefore, the following rule on graphic position may also be set: the absolute value of the difference between the distance from the left border of the data table to the left border of the image and the distance from the right border of the data table to the right border of the image is less than a preset distance threshold. Further, a table is usually used to set two or more pieces of data side by side so that the differences between them are shown more clearly. Therefore, the following rule on graphic quantity may also be set: the number of grids in the data table is greater than or equal to a preset quantity threshold (for example, 2), and so on.

In one embodiment, determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frame may, in a specific implementation, include the following: calculating the area of the rectangular frame according to the endpoint coordinates of the rectangular frame; and detecting whether the area of the rectangular frame is greater than a preset area threshold. If the area of the rectangular frame is greater than the preset area threshold, it is determined that the combined graph meets the preset table format requirements.

In one embodiment, determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frame may, in a specific implementation, also include the following: determining the abscissa of the left border and the abscissa of the right border of the combined graph according to the endpoint coordinates of the rectangular frames in the combined graph; calculating, from the abscissa of the left border of the combined graph, the distance between the left border of the combined graph and the left border of the image data, recorded as the first distance; calculating, from the abscissa of the right border of the combined graph, the distance between the right border of the combined graph and the right border of the image data, recorded as the second distance; calculating the absolute value of the difference between the first distance and the second distance; and comparing that absolute value with a preset distance threshold. If the absolute value of the distance difference is less than the preset distance threshold, it is determined that the combined graph meets the preset table format requirements.
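The area, position and quantity rules described above can be sketched together in Python. This specification presents the rules as options that may be set individually; applying all three at once, the function name `meets_table_format`, the `(left, top, right, bottom)` box layout, and the default threshold values are assumptions made for this sketch.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def meets_table_format(boxes: List[Box], image_width: int,
                       area_threshold: int = 100,
                       distance_threshold: int = 20,
                       quantity_threshold: int = 2) -> bool:
    """Check a combined graph against three illustrative format rules."""
    # Quantity rule: the table must contain enough grids.
    if not boxes or len(boxes) < quantity_threshold:
        return False

    # Area rule: even the smallest grid must exceed the area threshold.
    smallest = min((right - left) * (bottom - top)
                   for left, top, right, bottom in boxes)
    if smallest <= area_threshold:
        return False

    # Position rule: the table should be roughly centered horizontally.
    first_distance = min(left for left, _, _, _ in boxes)
    second_distance = image_width - max(right for _, _, right, _ in boxes)
    return abs(first_distance - second_distance) < distance_threshold
```

Under these example thresholds, a small glyph such as "田" split into four tiny frames fails the area rule, while a centered two-cell table of ordinary size passes all three rules.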

In an embodiment, dividing the combined graph into a plurality of rectangular units may, in a specific implementation, include the following: determining the dividing lines according to the endpoint coordinates of the rectangular frames; dividing the combined graph into a plurality of rectangular units along the dividing lines; and generating, according to the endpoint coordinates of each rectangular frame, the position coordinates of the rectangular unit corresponding to that rectangular frame.

In one embodiment, the image data of the text to be processed may specifically include: a scanned image or a photograph containing a contract to be processed. Of course, it should be noted that the image data of the to-be-processed text listed above is only for better explaining the embodiments of the present specification. During specific implementation, according to specific application scenarios and processing requirements, the image data of the text to be processed may also include image data of other types and contents, for example, a video screenshot containing a manual to be processed, etc. This specification is not limited.

As can be seen from the above, the method for acquiring table data provided by the embodiments of this specification first extracts the combined graph according to graphic features such as the morphological vertical lines and morphological horizontal lines in the image data; then divides the combined graph into a plurality of rectangular units and performs optical character recognition on each rectangular unit to obtain the text information it contains; and then combines the rectangular units containing text information according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy in table data extraction in existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely. In addition, after the combined graph is extracted, graphic factors such as the intersection points and rectangular frames it contains are used to detect whether the extracted combined graph is table data in the text, which avoids mistakenly identifying non-table data as a table, reduces errors, and improves the accuracy of acquiring table data.

An embodiment of this specification also provides a server including a processor and a memory for storing processor-executable instructions. In a specific implementation, the processor may perform the following steps according to the instructions: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph including morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein the plurality of rectangular units respectively carry position coordinates; performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in the plurality of rectangular units; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

To carry out the above instructions more precisely, as shown in FIG. 6, this specification also provides another specific server, which includes a network communication port 601, a processor 602, and a memory 603. These structures are connected by cables so that each structure can exchange data with the others.

Among them, the network communication port 601 may be specifically used to input image data of text to be processed;

The processor 602 may be specifically used to extract a combined graph from the image data, wherein the combined graph is a graph including morphological vertical lines and morphological horizontal lines; divide the combined graph into a plurality of rectangular units, wherein the plurality of rectangular units respectively carry position coordinates; perform optical character recognition on each of the plurality of rectangular units to determine the text information contained in the plurality of rectangular units; and combine the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

The memory 603 may specifically be used to store image data of text to be processed input via the network communication port 601 and store corresponding instruction programs based on the processor 602.

In this embodiment, the network communication port 601 may be a virtual port that is bound to different communication protocols so that different data can be sent or received. For example, the network communication port may be port 80 responsible for web data communication, port 21 responsible for FTP data communication, or port 25 responsible for mail data communication. In addition, the network communication port may also be a physical communication interface or a communication chip. For example, it can be a wireless mobile network communication chip, such as GSM, CDMA, etc.; it can also be a Wifi chip; it can also be a Bluetooth chip.

In this embodiment, the processor 602 can be implemented in any suitable manner. For example, the processor may adopt, for example, a microprocessor or a processor and a computer-readable medium storing a computer-readable program code (such as software or firmware) executable by the (micro)processor, a logic gate, a switch, an application specific integrated circuit ( Application Specific (Integrated Circuit, ASIC), programmable logic controller and embedded microcontroller, etc. This manual is not limited.

In this embodiment, the memory 603 may include multiple levels. In a digital system, anything that can store binary data can serve as a memory. In an integrated circuit, a circuit that has a storage function but no physical form of its own is also called a memory, such as RAM or a FIFO; in a system, a storage device that has a physical form is also called a memory, such as a memory stick or a TF card.

The embodiments of this specification also provide a computer storage medium based on the above table data acquisition method. The computer storage medium stores computer program instructions which, when executed, implement the following: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein each of the rectangular units carries position coordinates; performing optical character recognition on each of the rectangular units to determine the text information it contains; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

In this embodiment, the storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), cache, hard disk drive (HDD), or memory card. The memory may be used to store computer program instructions. The network communication unit may be an interface configured to perform network connection communication according to the standard specified by the communication protocol.

In this embodiment, the functions and effects specifically implemented by the program instructions stored in the computer storage medium can be understood by reference to the other embodiments, and are not repeated here.

Referring to FIG. 7, at the software level, the embodiment of the present specification also provides an apparatus for acquiring table data. The apparatus may specifically include the following structural modules:

The obtaining module 71 may be specifically used to obtain image data of text to be processed;

The extraction module 72 may be specifically used to extract a combined graph from the image data, wherein the combined graph is a graph containing morphological vertical lines and morphological horizontal lines;

The segmentation module 73 may be specifically used to divide the combined graph into a plurality of rectangular units, wherein each of the rectangular units carries position coordinates;

The recognition module 74 may be specifically used to perform optical character recognition on each of the rectangular units and determine the text information it contains;

The combining module 75 may be specifically used to combine the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
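As an illustration of the combining step, the following is a minimal sketch rather than the implementation described in this specification. It assumes each recognized rectangular unit is represented as a hypothetical (x, y, text) tuple, where (x, y) is the top-left corner in pixels, and that units whose y coordinates differ by no more than a hypothetical `row_tolerance` belong to the same table row.

```python
def combine_cells(cells, row_tolerance=5):
    """Arrange recognized cells into table rows using their position coordinates.

    cells: iterable of (x, y, text) tuples; (x, y) is the top-left corner.
    A cell joins the current row when its y is within row_tolerance of the
    y of the first cell placed in that row (an assumption of this sketch).
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    for x, y, text in sorted(cells, key=lambda c: (c[1], c[0])):
        if rows and abs(y - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(row)] for _, row in rows]


if __name__ == "__main__":
    recognized = [(100, 0, "Price"), (0, 0, "Item"), (0, 42, "Pen"), (100, 40, "3.50")]
    for row in combine_cells(recognized):
        print(row)
```

Sorting by position before grouping means the order in which the rectangular units were recognized does not affect the restored table.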

In one embodiment, the extraction module 72 may specifically include the following structural units:

The first search unit may specifically be used to search for and obtain morphological vertical lines and morphological horizontal lines in the image data;

The connecting unit may specifically be used to connect the morphological vertical line and the morphological horizontal line to obtain the combined graph.
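The two units above can be sketched as follows. This is a hedged illustration, not the specification's implementation: it assumes the image has already been binarized into a list of rows of 0/1 values, and it treats a "morphological line" as a run of at least `min_len` foreground pixels, which is what a morphological opening with a line-shaped structuring element of that length keeps. The function names and the `min_len` parameter are hypothetical.

```python
def keep_long_runs(binary, min_len):
    """Keep only foreground pixels lying in a horizontal run of >= min_len pixels."""
    out = [[0] * len(row) for row in binary]
    for r, row in enumerate(binary):
        start = None
        for c, v in enumerate(list(row) + [0]):  # trailing 0 closes a run at the row end
            if v and start is None:
                start = c
            elif not v and start is not None:
                if c - start >= min_len:
                    for k in range(start, c):
                        out[r][k] = 1
                start = None
    return out


def transpose(grid):
    """Swap rows and columns so vertical lines can reuse the horizontal search."""
    return [list(col) for col in zip(*grid)]


def extract_combined_graph(binary, min_len):
    """Find horizontal and vertical lines, then join them into one graph (pixel-wise OR)."""
    horizontal = keep_long_runs(binary, min_len)
    vertical = transpose(keep_long_runs(transpose(binary), min_len))
    return [[h | v for h, v in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]
```

Short strokes such as text characters fall below `min_len` in both directions and therefore do not appear in the combined graph, while table borders survive.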

In one embodiment, the apparatus may further include a detection module, configured to detect whether the combined graph meets a preset table format requirement. The detection module may specifically include the following structural units:

The obtaining unit may be specifically used to obtain the coordinates of the intersection points in the combined graph, where an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line intersect in the combined graph;

The second search unit may be specifically used to search for and obtain a rectangular frame in the combined graph;

The first determining unit may be specifically used to determine the endpoint coordinates of the rectangular frame according to the coordinates of the intersection points in the combined graph;

The second determining unit may be specifically used to determine, according to the endpoint coordinates of the rectangular frame, whether the combined graph meets the preset table format requirement.

In one embodiment, the second determining unit may be specifically configured to calculate the area of the rectangular frame according to the coordinates of the endpoints of the rectangular frame; and detect whether the area of the rectangular frame is greater than a preset area threshold.
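A minimal sketch of this check is given below, under stated assumptions: the horizontal-line and vertical-line masks are available as separate 0/1 grids, the rectangular frame is simplified to the bounding rectangle of all intersection points, and `min_area` stands in for the preset area threshold. All names are hypothetical and the frame search in the specification may differ.

```python
def intersections(horizontal, vertical):
    """Return (x, y) of every pixel that is set in both line masks."""
    return [
        (x, y)
        for y, (h_row, v_row) in enumerate(zip(horizontal, vertical))
        for x, (h, v) in enumerate(zip(h_row, v_row))
        if h and v
    ]


def frame_endpoints(points):
    """Top-left and bottom-right corners of the rectangle bounding the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))


def meets_table_format(points, min_area):
    """True when the frame spanned by the intersections is larger than min_area."""
    if len(points) < 4:  # a rectangular frame needs at least four corners
        return False
    (x0, y0), (x1, y1) = frame_endpoints(points)
    return (x1 - x0) * (y1 - y0) > min_area
```

A graph that yields too few intersections or only a tiny frame, such as an underline or a stray box, is rejected before any segmentation or recognition is attempted.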

In one embodiment, the segmentation module 73 may specifically include the following structural units:

The third determining unit can be specifically used to determine the dividing line according to the coordinates of the end points of the rectangular frame;

The dividing unit may specifically be used to divide the combined image into a plurality of rectangular units according to the dividing line, and generate position coordinates of the rectangular unit corresponding to the rectangular frame according to the end point coordinates of the rectangular frame.
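The segmentation step can be sketched as follows. This is an illustrative simplification that assumes a regular grid with no merged cells: the distinct x and y values of the endpoint coordinates are taken as the dividing lines, and each rectangular unit is reported as a hypothetical (x, y, width, height) tuple serving as its position coordinates.

```python
def split_into_cells(points):
    """Divide the area spanned by the endpoint coordinates into rectangular units.

    points: iterable of (x, y) endpoint coordinates.
    Returns (x, y, width, height) tuples, row by row, left to right.
    """
    xs = sorted({x for x, _ in points})  # vertical dividing lines
    ys = sorted({y for _, y in points})  # horizontal dividing lines
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            cells.append((left, top, right - left, bottom - top))
    return cells


if __name__ == "__main__":
    grid_points = [(x, y) for y in (0, 20, 40) for x in (0, 50, 100)]
    print(split_into_cells(grid_points))
```

Each returned tuple can be used to crop the corresponding region for optical character recognition and later to place the recognized text back in the table.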

In one embodiment, the apparatus may further include a preprocessing module for preprocessing the image data of the text to be processed, wherein the preprocessing may specifically include: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data, etc.
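The two preprocessing operations can be sketched in plain Python as follows. The specification does not state which grayscale weights or kernel are used, so this sketch assumes the common BT.601 luma weights and a 3x3 integer approximation of a Gaussian kernel with replicated borders; the function names are hypothetical.

```python
# 3x3 integer approximation of a Gaussian kernel; the weights sum to 16.
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))


def to_grayscale(rgb):
    """Convert rows of (r, g, b) pixels to rows of gray levels (BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row] for row in rgb]


def gaussian_smooth(gray):
    """Smooth a grayscale image with KERNEL, replicating pixels at the borders."""
    h, w = len(gray), len(gray[0])

    def at(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbor.
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [
        [
            round(
                sum(
                    KERNEL[dy + 1][dx + 1] * at(y + dy, x + dx)
                    for dy in (-1, 0, 1)
                    for dx in (-1, 0, 1)
                )
                / 16
            )
            for x in range(w)
        ]
        for y in range(h)
    ]
```

Smoothing suppresses isolated noise pixels before line extraction, at the cost of slightly blurring thin table borders, so the kernel size is a trade-off.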

It should be noted that the units, devices, or modules explained in the foregoing embodiments may be specifically implemented by computer chips or entities, or by products with certain functions. For convenience of description, the functions of the above device are divided into modules that are described separately. Of course, when implementing this specification, the functions of the modules may be implemented in one or more pieces of software and/or hardware, or a module that implements a given function may be implemented by a combination of multiple submodules or subunits. The device embodiments described above are only schematic. For example, the division into units is only a division of logical functions; in actual implementation there may be other ways of dividing them. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

As can be seen from the above, in the table data acquisition apparatus provided by the embodiments of this specification, the extraction module first extracts the combined graph according to the morphological vertical lines and morphological horizontal lines in the image data; the segmentation module then divides the combined graph into a plurality of rectangular units; the recognition module performs optical character recognition on each rectangular unit to obtain the text information it contains; and the combining module combines and restores the rectangular units containing text information according to their position coordinates to obtain complete table data. This solves the technical problem of large errors and inaccuracy when extracting table data with existing methods, and achieves efficient and accurate recognition and complete restoration of the table content in the image data. In addition, after the combined graph is extracted, the detection module checks, according to graphic factors such as the intersection points and rectangular frames contained in the combined graph, whether the extracted combined graph is table data in the text, which avoids mistakenly identifying non-table data as a table, reduces errors, and improves the accuracy of the acquired table data.

Although this specification provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive means. The sequence of steps listed in the embodiments is only one of many possible execution sequences and does not represent the only one. When an actual device or client product executes the method, the steps may be executed sequentially or in parallel according to the method shown in the embodiments or the drawings (for example, in a parallel-processor or multi-threaded processing environment, or even a distributed data processing environment). The terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, product, or device. Without further restrictions, the presence of other identical or equivalent elements in the process, method, product, or device that includes the elements is not excluded. Words such as "first" and "second" are used to indicate names and do not indicate any particular order.

Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Or even, the devices for implementing various functions can be regarded as both software modules that implement the method and structures within the hardware component.

This specification can be described in the general context of computer-executable instructions executed by a computer, such as a program module. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform specific tasks or implement specific abstract data types. This specification can also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media including storage devices.

From the description of the above embodiments, those skilled in the art can clearly understand that this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of this specification, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that enable a computer device (which may be a personal computer, mobile terminal, server, network device, etc.) to perform the methods described in the embodiments of this specification or in some parts of the embodiments.

The embodiments in this specification are described in a progressive manner. The same or similar parts of the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. This specification can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.

Although this specification has been described through embodiments, those of ordinary skill in the art know that many variations and changes can be made to it without departing from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of this specification.

Claims

1. A method for acquiring table data, comprising:

acquiring image data of text to be processed;

extracting a combined graph from the image data, wherein the combined graph is a graph containing morphological vertical lines and morphological horizontal lines;

dividing the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates;

performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each of the rectangular units;

combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

2. The method according to claim 1, wherein extracting the combined graph from the image data comprises:

searching for and obtaining morphological vertical lines and morphological horizontal lines in the image data;

connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.

3. The method according to claim 1, wherein after extracting the combined graph from the image data, the method further comprises:

obtaining the coordinates of the intersection points in the combined graph, wherein an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line intersect in the combined graph;

searching for and obtaining a rectangular frame in the combined graph;

determining the endpoint coordinates of the rectangular frame according to the coordinates of the intersection points in the combined graph;

determining, according to the endpoint coordinates of the rectangular frame, whether the combined graph meets a preset table format requirement.

4. The method according to claim 3, wherein determining, according to the endpoint coordinates of the rectangular frame, whether the combined graph meets the preset table format requirement comprises:

calculating the area of the rectangular frame according to the endpoint coordinates of the rectangular frame;

detecting whether the area of the rectangular frame is greater than a preset area threshold.

5. The method according to claim 3, wherein dividing the combined graph into a plurality of rectangular units comprises:

determining a dividing line according to the endpoint coordinates of the rectangular frame;

dividing the combined graph into a plurality of rectangular units according to the dividing line, and generating the position coordinates of the rectangular unit corresponding to the rectangular frame according to the endpoint coordinates of the rectangular frame.

6. The method according to claim 1, wherein after acquiring the image data of the text to be processed, the method further comprises:

preprocessing the image data of the text to be processed, wherein the preprocessing comprises: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data.

7. The method according to claim 1, wherein the image data of the text to be processed comprises: a scanned image or a photograph containing a contract to be processed.

8. An apparatus for acquiring table data, comprising:

an obtaining module, configured to obtain image data of text to be processed;

an extraction module, configured to extract a combined graph from the image data, wherein the combined graph is a graph containing morphological vertical lines and morphological horizontal lines;

a segmentation module, configured to divide the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates;

a recognition module, configured to perform optical character recognition on each of the plurality of rectangular units and determine the text information contained in each of the rectangular units;

a combining module, configured to combine the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.

9. The apparatus according to claim 8, wherein the extraction module comprises:

a first search unit, configured to search for and obtain morphological vertical lines and morphological horizontal lines in the image data;

a connecting unit, configured to connect the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.

10. The apparatus according to claim 8, further comprising a detection module, wherein the detection module comprises:

an obtaining unit, configured to obtain the coordinates of the intersection points in the combined graph, wherein an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line intersect in the combined graph;

a second search unit, configured to search for and obtain a rectangular frame in the combined graph;

a first determining unit, configured to determine the endpoint coordinates of the rectangular frame according to the coordinates of the intersection points in the combined graph;

a second determining unit, configured to determine, according to the endpoint coordinates of the rectangular frame, whether the combined graph meets a preset table format requirement.

11. The apparatus according to claim 10, wherein the second determining unit is specifically configured to calculate the area of the rectangular frame according to the endpoint coordinates of the rectangular frame, and to detect whether the area of the rectangular frame is greater than a preset area threshold.

12. The apparatus according to claim 10, wherein the segmentation module comprises:

a third determining unit, configured to determine a dividing line according to the endpoint coordinates of the rectangular frame;

a dividing unit, configured to divide the combined graph into a plurality of rectangular units according to the dividing line, and to generate the position coordinates of the rectangular unit corresponding to the rectangular frame according to the endpoint coordinates of the rectangular frame.

13. The apparatus according to claim 8, further comprising a preprocessing module, configured to preprocess the image data of the text to be processed, wherein the preprocessing comprises: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data.

14. The apparatus according to claim 8, wherein the image data of the text to be processed comprises: a scanned image or a photograph containing a contract to be processed.

15. A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the steps of the method according to any one of claims 1 to 7.

16. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 7.