WO2020140698A1 - Table data acquisition method and apparatus, and server - Google Patents

Info

Publication number
WO2020140698A1
Authority
WO
WIPO (PCT)
Prior art keywords
morphological
rectangular
coordinates
image data
image
Prior art date
Application number
PCT/CN2019/124101
Other languages
French (fr)
Chinese (zh)
Inventor
张林江
Original Assignee
阿里巴巴集团控股有限公司
Priority date
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020140698A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Definitions

  • This specification belongs to the field of Internet technology, and particularly relates to a method, device and server for acquiring table data.
  • such a type of text data for example, contract documents
  • the data acquisition method is usually to directly perform optical character recognition on image data such as scanned pictures containing text data to recognize and extract text information in the image data to obtain electronic file data of the corresponding text.
  • the table data in the text data is different from the above-mentioned individual text characters.
  • it also has certain graphic features, for example, it includes dividers.
  • the structure of the table data is more complicated and it is more difficult to recognize.
  • when the existing data acquisition method is used to identify the table data in the image data, errors are likely to occur.
  • the dividers in the table are mistakenly recognized as numbers.
  • the text characters in the N rows and M columns of the table are misaligned and so on. Therefore, there is an urgent need for a method that can accurately identify and completely recover the table data in the image data.
  • the purpose of this specification is to provide a method, device and server for acquiring table data, so as to solve the technical problem of large error and inaccuracy in existing methods of extracting table data, and to efficiently and accurately identify and completely restore the content of the table in the image data.
  • a method for acquiring table data, comprising: acquiring image data of text to be processed; extracting a combined image from the image data, wherein the combined image is a graph including morphological vertical lines and morphological horizontal lines; dividing the combined image into multiple rectangular units, wherein the multiple rectangular units each carry position coordinates; performing optical character recognition on each of the multiple rectangular units to determine the text information contained in each rectangular unit; and combining, according to the position coordinates of the rectangular units, the rectangular units containing the text information to obtain the table data.
  • an apparatus for acquiring table data, including: an acquiring module for acquiring image data of text to be processed; an extracting module for extracting a combined image from the image data, wherein the combined image is a graph that includes intersecting morphological vertical lines and morphological horizontal lines; a segmentation module for dividing the combined image into a plurality of rectangular units, wherein the plurality of rectangular units each carry position coordinates; an identification module for performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each of them; and a combination module for combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  • a server includes a processor and a memory for storing processor-executable instructions.
  • when the processor executes the instructions, the following is implemented: the image data of the text to be processed is obtained; a combined graph is extracted from the image data, wherein the combined graph is a graph including morphological vertical lines and morphological horizontal lines; the combined graph is divided into a plurality of rectangular units, wherein the plurality of rectangular units each carry position coordinates; optical character recognition is performed on each rectangular unit to determine the text information contained in the rectangular units; and, according to the position coordinates of the rectangular units, the rectangular units containing the text information are combined to obtain table data.
  • in the method, device and server for acquiring table data provided in this specification, the combined image is first extracted from the image data according to the graphic features of morphological vertical lines and morphological horizontal lines; the combined image is then divided into multiple rectangular units, and optical character recognition is performed on each rectangular unit to obtain the text information it contains; finally, the rectangular units containing the text information are combined according to their position coordinates to restore the complete table data. This solves the technical problem of large error and inaccuracy in extracting table data with existing methods, so that the table content in the image data can be identified efficiently and accurately and restored completely.
  • FIG. 1 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 2 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 3 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 4 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 5 is a schematic diagram of an embodiment of a flow of a method for acquiring table data provided by an embodiment of this specification
  • FIG. 6 is a schematic diagram of an embodiment of a structure of a server provided by an embodiment of this specification.
  • FIG. 7 is a schematic diagram of an embodiment of a structure of an apparatus for acquiring table data provided by an embodiment of this specification.
  • a graphic structure such as a separator bar in the table data is mistakenly recognized as a text character, or a misalignment occurs in the recognition and extraction of text information at different positions in the table data. That is, when the table data in the image data is processed by the existing acquisition method, the effect is often not ideal, and there is a technical problem of large error and inaccuracy in extracting the table data.
  • this specification specifically analyzes the different characteristics of the two different attribute objects of text characters and graphic structures that the table data has at the same time.
  • image structure features such as lines to find a combined image that may form table data from the image data; then divide the combined image into multiple rectangular units, and perform optical character recognition on each rectangular unit separately to obtain the text information of the rectangular unit;
  • the rectangular unit containing the text information is combined to restore and reconstruct the complete table data of the image, thereby solving the technical problem of large error and inaccuracy in the existing method of extracting table data. It can efficiently and accurately identify and completely restore the table content in the image data.
  • the embodiments of the present specification provide a method for acquiring table data.
  • the acquisition method of the table data may be specifically applied to an image data processing system including multiple servers.
  • for example, a legal contract processing system for scanned pictures.
  • the above system may specifically include a server for identifying and acquiring form data in text data from image data.
  • the server can extract the combined image from the acquired image data of the text to be processed by detecting the morphological vertical lines and morphological horizontal lines in the image data; then divide the combined image into multiple rectangular units according to the coordinates, and perform optical character recognition on each of the rectangular units to determine the text information contained in each; then, according to the coordinates of the rectangular units, combine and splice the rectangular units containing text information to obtain the complete table data.
  • the server can be understood as a service server that is applied to the business system side and can implement functions such as data transmission and data processing.
  • the server may be an electronic device with data calculation, storage, and network interaction functions; or a software program that runs on the electronic device and provides support for data processing, storage, and network interaction.
  • the number of the servers is not specifically limited.
  • the server may specifically be one server, or several servers, or a server cluster formed by several servers.
  • the form data acquisition method provided in the embodiment of the present specification can be used to process the image data containing the contract received by the legal platform to extract the form data in the contract.
  • the legal platform can distribute the image data containing the contract to be entered by the user to the server on the platform that is used to obtain the form data.
  • the above-mentioned legal platform can be specifically used to identify and extract text information in user-uploaded image data containing contracts (such as scanned pictures or photos containing contracts) to convert contract contents into electronic file data.
  • the server may refer to FIG. 2 to pre-process the image to reduce error interference and improve the accuracy of subsequent identification and acquisition of table data.
  • the server may be specifically configured with OpenCV (the Open Source Computer Vision Library).
  • the above OpenCV can be understood as an open-source API function library for computer vision.
  • the function code contained in the library has been optimized, and the efficiency of calling and calculating is relatively high.
  • the server can call the corresponding function code through the above OpenCV to efficiently perform data processing on the image data.
  • the server can first convert the image data into the corresponding grayscale image, and then perform Gaussian smoothing on the grayscale image to filter out the more obvious noise in it and improve the accuracy of the image data, thereby completing the preprocessing of the image data.
  • the image data is converted into a grayscale image only as an example for schematic description.
  • the image data may also be converted into a binary map first, and then subsequent table data acquisition may be performed based on the binary map. This specification is not limited.
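  • as a minimal sketch of the preprocessing described above, the following pure-Python code converts an RGB image to grayscale and applies a 3x3 Gaussian smoothing. The function names, the BT.601 gray weights, the 3x3 kernel and the replicated border are assumptions chosen for illustration; with OpenCV, the same steps would typically go through cv2.cvtColor and cv2.GaussianBlur.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to rows of integer gray levels.

    Uses the common ITU-R BT.601 luma weights (an assumption of this sketch).
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb]


def gaussian_blur3(gray):
    """Smooth a grayscale image with the 3x3 kernel [[1,2,1],[2,4,2],[1,2,1]] / 16.

    Pixels outside the image are replaced by the nearest edge pixel.
    """
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to the image
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + 1][dx + 1] * gray[yy][xx]
            out[y][x] = acc // 16
    return out
```

  • an isolated bright pixel is spread over its neighbours and reduced in intensity, which is how the smoothing suppresses point noise before line detection.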
  • the server can first scan and retrieve the graphic structural features (such as structural elements) in the image data based on morphology, so as to first find in the image data a graph that differs from text characters, has certain graphic features, and may form a table: the combined graph.
  • a specific frame image in the image data is taken as an example, for example, the fifth page image in the image data including the contract is taken as an example.
  • the server can scan and search the morphological vertical line and the morphological horizontal line in the frame image.
  • the above-mentioned morphological vertical lines and morphological horizontal lines can be understood as a structural element related to graphics that is different from text characters. You can refer to Figure 3.
  • the morphological vertical line may specifically be an image unit or a structural element that contains a straight line segment along the vertical direction in the image.
  • the above-mentioned morphological horizontal line may specifically be an image unit or a structural element that contains a straight line segment along the horizontal direction in the image.
  • the server can search for the structural elements in the image by calling the getStructuringElement function, and find all the morphological vertical lines and morphological horizontal lines from it.
  • the above-listed method of obtaining the morphological vertical line and the morphological horizontal line from the image by calling the getStructuringElement function is only a schematic illustration.
  • the morphological vertical line and the morphological horizontal line in the image may also be obtained in other suitable ways. This specification is not limited.
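  • as a minimal sketch of how long horizontal and vertical line segments can be separated from text strokes in a binary image, the following pure-Python code keeps only runs of foreground pixels that reach a minimum length. This is equivalent to a morphological opening with a one-pixel-thick line-shaped kernel. In OpenCV, getStructuringElement returns a kernel of a given shape and size, which is typically passed to erode and dilate; the function names and the min_len parameter below are assumptions for illustration and are not taken from the specification.

```python
def long_runs(line, min_len):
    """Keep only the pixels of `line` that lie in a run of 1s at least `min_len` long."""
    out = [0] * len(line)
    start = None
    for i, value in enumerate(list(line) + [0]):  # trailing 0 closes a final run
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_len:
                out[start:i] = [1] * (i - start)
            start = None
    return out


def horizontal_lines(img, min_len):
    """Mask of horizontal segments: long runs along each row."""
    return [long_runs(row, min_len) for row in img]


def vertical_lines(img, min_len):
    """Mask of vertical segments: long runs along each column."""
    cols = [long_runs(col, min_len) for col in zip(*img)]
    return [list(row) for row in zip(*cols)]
```

  • short strokes such as the pixels of a text character do not reach min_len in either direction, so they drop out of both masks while table borders survive.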
  • each morphological horizontal line mostly intersects one or more of the morphological vertical lines. Therefore, after obtaining the morphological vertical lines and morphological horizontal lines in the frame image, the server can further search for a graph whose structure contains intersecting morphological vertical lines and morphological horizontal lines, and take it as a combined graph that may form table data. This avoids subsequent processing of graphic structures that obviously lack the graphic features of table data, and improves processing efficiency.
  • the morphological horizontal lines and morphological vertical lines can be extracted directly on the original image, with the extracted morphological horizontal lines and morphological vertical lines covering the positions from which they were extracted.
  • after obtaining the above combined graph, which has the more obvious graphic features of a data table and may form table data, the combined graph can be further inspected: by checking whether it meets the preset table format requirements, it can be determined more accurately whether the combined graph is a data table.
  • the above-mentioned preset table format requirements can be specifically understood as a rule set for describing graphic features of data tables different from other graphic structures.
  • each grid graphic (or rectangular frame, see Figure 3) is designed to be filled in with specific characters, that is, even the smallest grid graphic in the data table should be able to accommodate at least one complete character. Therefore, the following rule for graphic area features may be set: the minimum area of the grid graphics in the data table should be greater than a preset area threshold. Also, given people's usual typesetting habits, a table is generally placed at the center position when it is edited. Therefore, the following rule for graphic position features may also be set: the absolute value of the difference between the distance from the left border of the data table to the left border of the image and the distance from the right border of the data table to the right border of the image is less than a preset distance threshold.
  • the following rules for the quantity characteristics of graphics may also be set: the number of grid graphics in the data table is greater than or equal to a preset quantity threshold (for example, 2), and so on.
  • in order to determine whether the extracted combined graph meets the preset table format requirements, in specific implementation, the points at which a morphological horizontal line and a morphological vertical line in the combined graph occupy the same image position can first be retrieved as intersection points, and then the position coordinates of each intersection point of the combined graph in the frame image can be determined.
  • intersection point can be specifically understood as the pixel point at the position where the morphological vertical line and the morphological horizontal line intersect in the combined image in the frame image. See Figure 3 for details.
  • the server can search for and obtain the coordinates of the intersection point in the combined image in the image by calling the opencv bitwise_and function.
  • the opencv bitwise_and function listed above is only a schematic illustration.
  • the server may also obtain the coordinates of the intersection points in the combined graph in other suitable ways according to specific conditions. This specification is not limited.
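  • as a minimal sketch of the intersection step, assuming the horizontal and vertical lines are available as two binary masks of the same size, a pixel is an intersection point when it is set in both masks. The function name and the (x, y) ordering of the returned coordinates are assumptions for illustration.

```python
def intersections(h_mask, v_mask):
    """Return (x, y) coordinates of pixels set in both binary masks (bitwise AND)."""
    return [
        (x, y)
        for y, (h_row, v_row) in enumerate(zip(h_mask, v_mask))
        for x, (h, v) in enumerate(zip(h_row, v_row))
        if h & v
    ]
```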
  • the server may further search the graphic structure elements of the above combined graph, and find the graphic elements having a rectangular (or square) structure (i.e., the grids of the corresponding table) as the rectangular frames in the combined graph.
  • the server may search for and obtain the rectangular frame in the combination graph by calling the findContours function.
  • the findContours function is only a schematic illustration.
  • the server may also obtain the rectangular frame in the combination diagram in other suitable ways according to specific conditions. This specification is not limited.
  • the server may determine the endpoint coordinates at the four endpoints of each rectangular frame in the combination graph through position comparison based on the determined intersection coordinate and the rectangular frame in the combination graph. Furthermore, according to the coordinates of the endpoints of the rectangular frame in the combination diagram, it can be determined whether the combination diagram meets the preset table format requirements.
  • the server may calculate the length and width of the rectangular frame according to the coordinates of the endpoints of the rectangular frame, and then calculate the area of the rectangular frame based on the length and width. Then compare the area of the rectangular frame with the preset area threshold. If the area of each rectangular frame in the combination diagram is greater than the preset area threshold, it can be determined that the combination diagram meets the preset table format requirements.
  • the server can also compare the abscissa values of the endpoint coordinates of each rectangular frame in the combination diagram, find the endpoint with the smallest abscissa as the endpoint on the left border of the combination diagram, take the abscissa of that endpoint as the abscissa of the left border, and then calculate from it the distance between the left border of the combination diagram and the left border of the image, recorded as d1.
  • similarly, by comparing the abscissa values of the endpoints, the server finds the endpoint with the largest abscissa as the endpoint on the right border of the combination diagram, takes the abscissa of that endpoint as the abscissa of the right border, and then calculates from it the distance between the right border of the combination diagram and the right border of the image, recorded as d2.
  • the server may calculate the absolute value of the difference between d1 and d2, and compare the absolute value of the above difference with a preset distance threshold. If the absolute value of the above-mentioned difference is less than or equal to the preset distance threshold, it can be determined that the entire combination picture is located at the center of the image, that is, the preset table format requirements are met.
  • the server may determine that the currently extracted combination diagram is indeed a data table in the image. Subsequent text information can be extracted from the combined image.
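  • as a minimal sketch of the format check described above, the following code applies the three rules (number of grids, minimum grid area, and centering via d1 and d2) to a list of rectangular frames. Requiring all three rules at once, the parameter names, the (x_min, y_min, x_max, y_max) layout and the left image border at x = 0 are assumptions for illustration.

```python
def meets_table_format(rects, image_width, min_area, max_offset, min_cells=2):
    """Check a candidate table against the preset format rules.

    rects: list of (x_min, y_min, x_max, y_max) rectangular frames.
    """
    # Quantity rule: enough grid graphics to be a table.
    if len(rects) < min_cells:
        return False
    # Area rule: every grid must be larger than the area threshold.
    for x_min, y_min, x_max, y_max in rects:
        if (x_max - x_min) * (y_max - y_min) <= min_area:
            return False
    # Position rule: the table should be roughly centered horizontally.
    d1 = min(r[0] for r in rects)                # table left border to image left border
    d2 = image_width - max(r[2] for r in rects)  # table right border to image right border
    return abs(d1 - d2) <= max_offset
```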
  • the server may first divide the above combined image into a plurality of rectangular units.
  • each rectangular unit corresponds to a rectangular frame in the combination diagram one by one; however, it is different from the single graphical structure element of the rectangular frame.
  • Each rectangular unit contains text characters or blank state information.
  • separate optical character recognition can be performed on each rectangular unit to accurately identify the text characters in the rectangular unit and determine the text information contained in each rectangular unit.
  • the server may first determine the contour line enclosing the rectangular frame as the dividing line according to the endpoint coordinates of the rectangular frame, and then may cut along the contour line to divide the rectangular unit corresponding to the rectangular frame from the combined diagram. For example, see Figure 4.
  • the coordinates of the four endpoints of a rectangular frame in the combination diagram are A (15, 60), B (15, 40), C (30, 40), and D (30, 60).
  • the server can start from the endpoint A, keep the abscissa 15 unchanged, and find the endpoint with a different ordinate, namely endpoint B, and then connect endpoint A to endpoint B according to a preset division rule.
  • the server starts from the endpoint B, keeps the ordinate 40 unchanged, and finds the endpoint with different abscissas, that is, the endpoint C, and then connects the endpoint B to the endpoint C according to the preset division rule.
  • the server starts from the endpoint C, keeps the abscissa 30 unchanged according to the preset division rule, and finds the endpoint with a different ordinate, namely the endpoint D, and then connects the endpoint C to the endpoint D.
  • the server starts from the endpoint D and keeps the ordinate 60 unchanged according to the preset division rule, and finds the endpoint with different abscissas, that is, endpoint A, and then connects the endpoint D to the endpoint A.
  • a closed connecting line can be obtained: A to B to C to D to A, which is the outline of the rectangular frame.
  • the server may use the outline as a dividing line, and divide the rectangular frame containing the text information in the combined image along the outline to obtain the corresponding rectangular unit.
  • each rectangular unit in the combined graph can be divided.
  • the above-mentioned manner of dividing the rectangular unit is just to better explain the embodiments of the present specification.
  • other suitable methods may also be used to divide a plurality of rectangular units from the combined diagram according to specific circumstances. This specification is not limited.
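  • as a minimal sketch of the division rule walked through above, the following code starts from the first endpoint and alternately holds the abscissa and the ordinate fixed to find the next endpoint, closing the outline at the end. It assumes an axis-aligned rectangle given as four (x, y) endpoints; the function name is hypothetical.

```python
def trace_outline(points):
    """Order four rectangle endpoints into a closed outline.

    Starting from points[0], alternately keep x fixed, then y fixed,
    to pick the next endpoint, and finally return to the start.
    """
    path = [points[0]]
    remaining = list(points[1:])
    axis = 0  # 0: keep the abscissa fixed, 1: keep the ordinate fixed
    while remaining:
        current = path[-1]
        nxt = next(p for p in remaining if p[axis] == current[axis])
        remaining.remove(nxt)
        path.append(nxt)
        axis = 1 - axis
    path.append(path[0])  # close the outline
    return path
```

  • for the endpoints A (15, 60), B (15, 40), C (30, 40) and D (30, 60), the traced outline is A to B to C to D to A regardless of the order in which B, C and D are supplied.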
  • the server in the process of dividing the combined image, also generates position coordinates corresponding to the rectangular unit according to the coordinates of the end points of the rectangular frame.
  • the above position coordinates can be understood as a kind of parameter data used to indicate the position of the rectangular unit in the image of the combined image or describe the positional relationship between the rectangular unit in the image of the combined image and other adjacent rectangular units.
  • the server may calculate the coordinates of the center point of the rectangular frame as the position coordinates of the corresponding rectangular unit according to the endpoint coordinates of the four endpoints of the rectangular frame.
  • the server may also first calculate the coordinates of the center point of each rectangular frame, and then, following a preset arrangement order (for example, from top to bottom and from left to right), determine the row number and column number of each rectangular unit from those center point coordinates as the position coordinates of the corresponding rectangular unit.
  • for example, if the rectangular frame A is located in the first row and second column of the combined diagram, that is, the corresponding row number is 1 and the column number is 2, then "1-2" can be used as the position coordinates of the rectangular unit corresponding to the rectangular frame A.
  • the above-listed methods for determining the position coordinates of the rectangular unit are only schematic illustrations. During specific implementation, according to the specific situation, other suitable methods may also be used to determine the position coordinates of the rectangular unit. This specification is not limited.
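  • as a minimal sketch of deriving "row-column" position coordinates from center points, the following code ranks the distinct center ordinates and abscissas. It assumes a regular grid whose cell centers align exactly, and image coordinates in which y grows downward; the function name and the rectangle layout are hypothetical.

```python
def grid_positions(rects):
    """Map each (x_min, y_min, x_max, y_max) frame to a "row-column" string.

    Rows are numbered top to bottom and columns left to right, starting at 1.
    """
    centers = {r: ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2) for r in rects}
    xs = sorted({cx for cx, _ in centers.values()})  # distinct column centers
    ys = sorted({cy for _, cy in centers.values()})  # distinct row centers
    return {
        r: f"{ys.index(cy) + 1}-{xs.index(cx) + 1}"
        for r, (cx, cy) in centers.items()
    }
```

  • real scans rarely align this exactly, so a practical version would likely group centers within a tolerance rather than by exact equality.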
  • the server can perform optical character recognition (i.e., OCR, Optical Character Recognition) on each of the multiple rectangular units to recognize the text characters in each rectangular unit, and then determine the text information contained in each rectangular unit. If no text characters are recognized in a rectangular unit, the text information of that rectangular unit is left blank. In this way, multiple rectangular units containing corresponding text information can be obtained.
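  • as a minimal sketch of per-unit recognition, the following code applies an OCR function to each rectangular unit separately and leaves the text blank when nothing is recognized. The recognize argument is a hypothetical stand-in for whatever OCR engine is used; the specification does not name one.

```python
def recognize_cells(cell_images, recognize):
    """Run OCR on each rectangular unit and collect its text.

    cell_images: dict mapping position coordinates to a cropped cell image.
    recognize:   hypothetical callable taking one image and returning text or None.
    """
    texts = {}
    for position, image in cell_images.items():
        text = recognize(image)
        # A unit with no recognized characters keeps blank text information.
        texts[position] = text.strip() if text else ""
    return texts
```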
  • the server may combine and splice the rectangular units containing the text information obtained above according to the position coordinates of each rectangular unit.
  • the rectangular unit containing text information can be set at the position of the first row and the second column according to the position coordinates "1-2" of the rectangular unit.
  • a plurality of rectangular units containing text information are sequentially set to corresponding positions, so that a complete data table can be restored.
  • the above-mentioned combination mode is only a schematic illustration. During specific implementation, other combination methods can also be used to perform combination splicing according to other types of position coordinates. This specification is not limited.
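  • as a minimal sketch of the combination step, assuming position coordinates of the "row-column" form used in the example above, the following code places each unit's text at its row and column to rebuild the table as a list of rows. The function name and the choice of an empty string for missing units are assumptions.

```python
def assemble_table(cells):
    """Rebuild a table from a dict mapping "row-column" strings to cell text."""
    parsed = {
        tuple(int(n) for n in key.split("-")): text
        for key, text in cells.items()
    }
    n_rows = max(row for row, _ in parsed)
    n_cols = max(col for _, col in parsed)
    # Units without an entry are treated as blank cells.
    return [
        [parsed.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]
```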
  • the server can separately detect whether table data exists in each image of the image data containing the contract to be processed, and acquire the table data when it is determined to exist, so as to extract the complete table data from the image data and feed it back to the legal platform, so that the electronic file data of the contract can be organized and generated for storage.
  • after the server obtains the morphological vertical lines and morphological horizontal lines, further feature enhancement processing can be performed on them to make the obtained morphological vertical lines and morphological horizontal lines clearer.
  • the above feature enhancement processing may specifically be morphological processing, and may specifically include erosion and/or dilation.
  • the data value of the pixel at the center of the kernel area can be reset (to 0 or 1) as the convolution kernel slides over the frame image.
  • erosion may be performed first, followed by dilation.
  • the above erosion can be understood as an AND operation. Specifically, pixels close to the edge of the foreground are eroded according to the size of the convolution kernel (that is, the value of the corresponding pixel is reset to 0), so that the foreground object becomes smaller. This reduces the white area around the morphological vertical line or morphological horizontal line and thereby removes white noise; at the same time, it can also break apart structural elements that are adjacent or even connected to the morphological vertical line or morphological horizontal line.
  • the morphological vertical line or morphological horizontal line after erosion may then be dilated.
  • the above dilation can be understood as an OR operation.
  • the eroded image can be enlarged and restored through dilation to obtain relatively clear morphological vertical lines and morphological horizontal lines of unchanged size.
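  • as a minimal sketch of erosion followed by dilation on a binary image, the following code implements erosion as an AND over each pixel's neighbourhood and dilation as an OR. The square neighbourhood of radius r and the handling of image borders (only in-bounds neighbours are considered) are assumptions; in practice a line-shaped kernel would likely be used so that thin table lines are not erased.

```python
def _neighbourhood(img, y, x, r):
    """Values of the in-bounds pixels within distance r of (y, x)."""
    h, w = len(img), len(img[0])
    return [
        img[yy][xx]
        for yy in range(max(0, y - r), min(h, y + r + 1))
        for xx in range(max(0, x - r), min(w, x + r + 1))
    ]


def erode(img, r=1):
    """AND: a pixel stays 1 only if every neighbour is 1."""
    return [[int(all(_neighbourhood(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img, r=1):
    """OR: a pixel becomes 1 if any neighbour is 1."""
    return [[int(any(_neighbourhood(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode_then_dilate(img, r=1):
    """Erosion removes small noise; dilation restores the surviving shapes."""
    return dilate(erode(img, r), r)
```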
  • in the method for acquiring table data provided in this specification, the combined image is first extracted according to the graphic features of the morphological vertical lines and morphological horizontal lines in the image data; the combined image is then divided into multiple rectangular units, and optical character recognition is performed on each rectangular unit to obtain the text information it contains; the rectangular units containing the text information are then combined according to their position coordinates to restore the complete table data, thereby solving the technical problem of large error and inaccuracy in extracting table data with existing methods.
  • an embodiment of the present specification also provides a method for acquiring table data, where the method is specifically applied to the server side.
  • the method may include the following:
  • the above-mentioned to-be-processed text may specifically be a to-be-processed contract text, a to-be-processed constitution text, or a to-be-processed specification text.
  • the image data of the text to be processed may be a scanned image containing the text content, a photo containing the text content, or a video containing the text content.
  • the specific content and form of the image data of the text to be processed above are not limited in this specification.
  • S53 Extract a combination graph from the image data, wherein the combination graph is a graph including vertical morphological lines and horizontal morphological lines.
  • the above morphological vertical line and morphological horizontal line can be specifically understood as a structural element related to graphics that is different from text characters.
  • the morphological vertical line may specifically be an image unit or a structural element that contains a straight line segment along the vertical direction in the image.
  • the above-mentioned morphological horizontal line may specifically be an image unit or a structural element that contains a straight line segment along the horizontal direction in the image.
  • the above-mentioned combined graph can be specifically understood as the image data having graphic features similar to the table data, for example, a combined graph including graphic structural elements of crossing morphological vertical lines and morphological horizontal lines.
  • The above-mentioned extraction of the combined graph from the image data may include the following: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and morphological horizontal lines to obtain the combined graph.
  • The above searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data may include the following: calling the getStructuringElement function in OpenCV to search for structural elements in the image, so as to find the morphological vertical lines and morphological horizontal lines in the image data.
  • the above-listed method of obtaining the morphological vertical line and the morphological horizontal line from the image by calling the getStructuringElement function is only a schematic illustration.
  • the morphological vertical line and the morphological horizontal line in the image may also be obtained in other suitable ways. This specification is not limited.
  • The morphological vertical lines and morphological horizontal lines obtained in the above manner also carry position information in the image data, so the corresponding morphological vertical lines and morphological horizontal lines can be connected according to that position information to obtain the combined graph.
  • S55 Divide the combined image into a plurality of rectangular units, where the plurality of rectangular units respectively carry position coordinates.
  • The above rectangular unit can be understood as an image unit that corresponds one-to-one with a rectangular frame in the combined graph but, unlike the rectangular frame, also contains text information (for example, filled-in text characters, or blank).
  • Each rectangular frame can be understood as a rectangular or square graphic element formed by two morphological vertical lines and two morphological horizontal lines, and contains graphic features only.
  • each rectangular frame can be regarded as a grid in the table.
  • Dividing the combined graph into a plurality of rectangular units may, in specific implementation, include the following: obtaining the coordinates of the intersection points in the combined graph; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of each rectangular frame according to the coordinates of the intersection points in the combined graph; and dividing the combined graph into a plurality of rectangular units according to the endpoint coordinates of the rectangular frames.
  • The above intersection point can be understood as a pixel at a position where a morphological vertical line and a morphological horizontal line in the combined graph intersect.
  • The coordinates of the intersection points in the combined graph can be searched for and obtained by calling the bitwise_and function in OpenCV.
  • The rectangular frames in the combined graph can be searched for and obtained by calling the findContours function in OpenCV.
  • the server may also obtain the rectangular frame in the combination diagram in other suitable ways according to specific conditions. This specification is not limited.
  • The above OpenCV (Open Source Computer Vision Library) is an open-source computer vision library.
  • the server can call the corresponding function code through the above OpenCV to efficiently perform data processing on the image data.
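  • The connection of line masks into a combined graph (an OR operation) and the search for intersection points (an AND operation) can be sketched in pure Python as follows. This only mirrors the idea behind a bitwise AND of two masks; the function name, the list-of-lists mask representation, and the 5 x 5 example are assumptions, not the patent's or OpenCV's code.

```python
def combine_and_intersect(h_mask, v_mask):
    """Return (combined, intersections) for two same-sized binary masks.

    combined      : pixel-wise OR of the horizontal-line and vertical-line masks
    intersections : (x, y) pixels where both masks are set (pixel-wise AND)
    """
    combined = [
        [h | v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(h_mask, v_mask)
    ]
    intersections = [
        (x, y)
        for y, (h_row, v_row) in enumerate(zip(h_mask, v_mask))
        for x, (h, v) in enumerate(zip(h_row, v_row))
        if h & v
    ]
    return combined, intersections


size = 5
# Horizontal lines on the top and bottom rows of a 5 x 5 image.
h_mask = [[1 if y in (0, size - 1) else 0 for _ in range(size)] for y in range(size)]
# Vertical lines on the left and right columns.
v_mask = [[1 if x in (0, size - 1) else 0 for x in range(size)] for _ in range(size)]

combined, intersections = combine_and_intersect(h_mask, v_mask)
```

  • Here the combined graph is a single rectangular frame, and the four intersection points are its corners.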
  • Dividing the combined graph into a plurality of rectangular units may, in specific implementation, also include the following: determining the endpoint coordinates of each rectangular frame according to the coordinates of the intersection points in the combined graph; determining the dividing lines according to the endpoint coordinates of the rectangular frames; and dividing the combined graph into a plurality of rectangular units along the dividing lines.
  • Determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points may, in specific implementation, include the following: comparing the positions of the intersection points in the combined graph with the rectangular frames to identify, among the intersection points, the four endpoints of each rectangular frame, and thereby determining the endpoint coordinates of each rectangular frame.
  • Determining the dividing lines according to the endpoint coordinates of the rectangular frames may include the following: according to the coordinates of the four endpoints of each rectangular frame, determining the outline surrounding the rectangular frame as the corresponding dividing line. The combined graph can then be divided along these dividing lines to obtain each rectangular unit.
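  • One possible way to carry out the position comparison and division described above is sketched below. It assumes each rectangular frame is already available as a bounding box `(x, y, w, h)`, snaps each corner of the box to the nearest intersection point, and crops along the resulting outline. The nearest-point rule and all names are illustrative assumptions; the specification does not prescribe them.

```python
def snap_corners(box, intersections):
    """Pick, for each corner of a bounding box, the closest intersection point."""
    x, y, w, h = box
    corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]

    def nearest(corner):
        return min(
            intersections,
            key=lambda p: (p[0] - corner[0]) ** 2 + (p[1] - corner[1]) ** 2,
        )

    return [nearest(c) for c in corners]


def crop_unit(image, endpoints):
    """Cut the rectangular unit out of the image along the frame outline."""
    xs = [px for px, _ in endpoints]
    ys = [py for _, py in endpoints]
    return [row[min(xs):max(xs) + 1] for row in image[min(ys):max(ys) + 1]]


# Intersection points of a table with one row and two columns (hypothetical values).
intersections = [(0, 0), (10, 0), (20, 0), (0, 8), (10, 8), (20, 8)]
# A slightly shrunken bounding box for the left cell, as a contour search might return.
endpoints = snap_corners((1, 1, 8, 6), intersections)

# A dummy 9-row by 21-column image whose pixel value encodes its own position.
image = [[x + 100 * y for x in range(21)] for y in range(9)]
unit = crop_unit(image, endpoints)
```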
  • the method further includes the following content: generating position coordinates of the rectangular units according to the coordinates of the end points of the rectangular frame.
  • The position coordinates of a rectangular unit can be understood as parameter data that indicates the position of the rectangular unit in the combined graph, or that describes the positional relationship between the rectangular unit and adjacent rectangular units in the combined graph.
  • The coordinates of the center point of the rectangular frame may be calculated from the coordinates of its four endpoints and used as the position coordinates of the corresponding rectangular unit. Alternatively, the center point of each rectangular frame may be calculated first, and the rectangular units then arranged in a preset order (for example, from top to bottom and from left to right) according to those center points; the row number and column number of each rectangular unit after sorting are then used as its position coordinates.
  • the above-listed methods for determining the position coordinates of the rectangular unit are only schematic illustrations. During specific implementation, according to the specific situation, other suitable methods may also be used to determine the position coordinates of the rectangular unit. This specification is not limited.
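  • The second approach above (center points first, then row and column numbers) can be sketched as follows. The row-grouping tolerance `tol`, the function names, and the sample frames are assumptions introduced only for this illustration.

```python
def center(endpoints):
    """Center point of a rectangular frame given its four endpoints."""
    xs = [x for x, _ in endpoints]
    ys = [y for _, y in endpoints]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)


def assign_rows_cols(frames, tol=5.0):
    """Return a (row, col) pair for each frame, in the same order as the input.

    Frames are grouped into rows top to bottom (centers within `tol` pixels
    vertically share a row), then numbered left to right within each row.
    """
    centers = [center(f) for f in frames]
    by_y = sorted(range(len(frames)), key=lambda i: centers[i][1])

    rows = []
    for i in by_y:
        if rows and centers[i][1] - centers[rows[-1][0]][1] <= tol:
            rows[-1].append(i)
        else:
            rows.append([i])

    positions = [None] * len(frames)
    for r, members in enumerate(rows):
        for c, i in enumerate(sorted(members, key=lambda i: centers[i][0])):
            positions[i] = (r, c)
    return positions


# Four frames of a 2 x 2 table, deliberately listed out of order.
frames = [
    [(10, 0), (20, 0), (10, 8), (20, 8)],    # top-right
    [(0, 8), (10, 8), (0, 16), (10, 16)],    # bottom-left
    [(0, 0), (10, 0), (0, 8), (10, 8)],      # top-left
    [(10, 8), (20, 8), (10, 16), (20, 16)],  # bottom-right
]
positions = assign_rows_cols(frames)
```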
  • S57 Perform optical character recognition on the plurality of rectangular units to determine the text information contained in the plurality of rectangular units.
  • Optical character recognition may be performed separately on each of the plurality of rectangular units to recognize the text characters in each rectangular unit and thereby determine the text information contained in each rectangular unit.
  • The text information contained in a rectangular unit may also be blank.
  • According to the position coordinates of each rectangular unit, rectangular units containing text information whose position coordinates are adjacent may be stitched together, placing each rectangular unit containing text information at its corresponding position, so as to obtain the complete table data.
  • The combined graph is obtained by extracting graphic features such as the morphological vertical lines and morphological horizontal lines in the image data; the combined graph is then divided into a plurality of rectangular units, and optical character recognition is performed on each rectangular unit to obtain the text information it contains; the rectangular units containing text information are then combined and restored according to their position coordinates to obtain complete table data. This solves the technical problem of large errors and inaccuracy in table data extraction in existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely.
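  • The final combination step can be sketched as placing each recognized text at its row and column. The sketch assumes the position coordinates are `(row, col)` pairs and that the recognized text is already available as strings; the function name and the sample values are hypothetical.

```python
def combine_cells(cells):
    """Rebuild a table (list of rows) from ((row, col), text) pairs.

    Positions with no recognized text are left as empty strings.
    """
    n_rows = max(r for (r, _), _ in cells) + 1
    n_cols = max(c for (_, c), _ in cells) + 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for (r, c), text in cells:
        table[r][c] = text
    return table


# Recognition results in arbitrary order; one unit was blank.
cells = [
    ((1, 1), "12.50"),
    ((0, 0), "Item"),
    ((2, 0), "Ink"),
    ((0, 1), "Price"),
    ((1, 0), "Pen"),
    ((2, 1), ""),
]
table = combine_cells(cells)
```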
  • In specific implementation, the method may further include the following: preprocessing the image data of the text to be processed, wherein the preprocessing includes: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data to filter out noise interference.
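  • The two preprocessing operations can be sketched in pure Python as below. The luminance weights (0.299, 0.587, 0.114), the 3 x 3 integer approximation of a Gaussian kernel, and the edge-replication border handling are common choices assumed here for illustration; the specification does not fix any of them.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of gray levels (0..255)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


# Integer weights that sum to 16, approximating a small Gaussian.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_smooth(gray):
    """Smooth a grayscale image with the 3 x 3 kernel, replicating edge pixels."""
    h, w = len(gray), len(gray[0])

    def px(y, x):
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            total = sum(
                KERNEL[dy + 1][dx + 1] * px(y + dy, x + dx)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            out_row.append(total // 16)
        out.append(out_row)
    return out


gray = to_gray([[(255, 255, 255), (0, 0, 0)]])
# A single bright noise pixel is spread out and strongly attenuated.
smoothed = gaussian_smooth([[0, 0, 0], [0, 160, 0], [0, 0, 0]])
```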
  • The above-mentioned extraction of the combined graph from the image data may include the following: searching for and obtaining morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and morphological horizontal lines to obtain the combined graph.
  • The above searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data may include the following: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data through the getStructuringElement function.
  • The method may further include the following: performing feature enhancement processing on the obtained morphological vertical lines and morphological horizontal lines respectively, wherein the feature enhancement processing includes at least one of the following: erosion processing and dilation processing.
  • The morphological vertical lines and morphological horizontal lines may be eroded first, and the eroded morphological vertical lines and morphological horizontal lines may then be dilated.
  • Erosion eliminates the white noise around the foreground of the morphological vertical lines and morphological horizontal lines and makes them clearer, but it also shrinks their graphic elements. Therefore, after the morphological vertical lines and morphological horizontal lines are eroded, dilation can be applied to restore clearer morphological vertical lines and morphological horizontal lines of unchanged size.
  • The above combined graph merely has graphic features similar to table data; it is not necessarily table data.
  • For example, the large text character "Tian" (田) also has graphic features similar to table data. Therefore, the extracted combined graph can be checked to determine whether it meets the preset table format requirements, so as to determine more accurately whether the combined graph is real table data. Subsequent data processing can then be performed only on combined graphs determined to be table data, thereby reducing waste of resources and improving processing efficiency.
  • The method may further include: obtaining the coordinates of the intersection points in the combined graph, where an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line intersect in the combined graph; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph; and determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph meets the preset table format requirements.
  • The coordinates of the intersection points in the combined graph can be searched for and obtained by calling the OpenCV bitwise_and function.
  • the server may also obtain the coordinates of the intersection points in the combined graph in other suitable ways according to specific conditions. This specification is not limited.
  • the rectangular frame in the combined graph can be searched and obtained by calling the findContours function.
  • The above-listed use of the findContours function is only a schematic illustration.
  • the server may also obtain the rectangular frame in the combination diagram in other suitable ways according to specific conditions. This specification is not limited.
  • The above-mentioned preset table format requirement can be understood as a rule set describing the graphic features that distinguish a data table from other graphic structures.
  • The specific rules included in the preset table format requirements can be set flexibly according to specific conditions. For example, unlike other graphics, each grid cell (or rectangular frame) of a data table is designed to be filled with characters, so even the smallest grid cell in the data table should be able to accommodate at least one complete character. Therefore, the following rule for graphic area characteristics may be set: the minimum area of a grid cell in the data table should be greater than a preset area threshold. In addition, given people's usual typesetting habits, a table is normally centered when it is edited, so the following rule may also be set: the absolute value of the difference between the distance from the left border of the data table to the left border of the image and the distance from the right border of the data table to the right border of the image is less than a preset distance threshold.
  • the following rules for the quantity characteristics of graphics may also be set: the number of grid graphics in the data table is greater than or equal to a preset quantity threshold (for example, 2), and so on.
  • Determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frames may, in specific implementation, include the following: calculating the area of each rectangular frame according to its endpoint coordinates; and detecting whether the area of the rectangular frame is greater than a preset area threshold. If the area of the rectangular frame is greater than the preset area threshold, it is determined that the combined graph meets the preset table format requirements.
  • Determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frames may, in specific implementation, also include the following: determining the abscissas of the left border and the right border of the combined graph according to the endpoint coordinates of the rectangular frames in the combined graph; calculating, from the abscissa of the left border of the combined graph, the distance between the left border of the combined graph and the left border of the image data, recorded as the first distance; calculating, from the abscissa of the right border of the combined graph, the distance between the right border of the combined graph and the right border of the image data, recorded as the second distance; calculating the difference between the first distance and the second distance; and comparing the absolute value of that difference with a preset distance threshold to detect whether it is less than the preset distance threshold. If the absolute value of the distance difference is less than the preset distance threshold, it is determined that the combined graph meets the preset table format requirements.
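  • The area, centering, and quantity rules described above can be sketched as one check. The specification presents these rules as examples that may be set individually; requiring all three at once, as well as the function names, the use of `image_width` as the abscissa of the right image border, and the threshold values, are assumptions of this sketch.

```python
def bounds(frame):
    """(x_min, y_min, x_max, y_max) of a frame given its four endpoints."""
    xs = [x for x, _ in frame]
    ys = [y for _, y in frame]
    return min(xs), min(ys), max(xs), max(ys)


def meets_table_format(frames, image_width, area_threshold,
                       distance_threshold, min_cells=2):
    """Check the example rules: cell count, minimum cell area, centering."""
    if len(frames) < min_cells:
        return False

    boxes = [bounds(f) for f in frames]
    # Every rectangular frame must be large enough to hold a character.
    if any((x1 - x0) * (y1 - y0) <= area_threshold for x0, y0, x1, y1 in boxes):
        return False

    # First distance: left border of the table to the left border of the image.
    first_distance = min(b[0] for b in boxes)
    # Second distance: right border of the table to the right border of the image.
    second_distance = image_width - max(b[2] for b in boxes)
    return abs(first_distance - second_distance) < distance_threshold


# Two adjacent 10 x 8 cells spanning x = 40..60 (hypothetical values).
frames = [
    [(40, 0), (50, 0), (40, 8), (50, 8)],
    [(50, 0), (60, 0), (50, 8), (60, 8)],
]
centered = meets_table_format(frames, image_width=100,
                              area_threshold=50, distance_threshold=5)
off_center = meets_table_format(frames, image_width=200,
                                area_threshold=50, distance_threshold=5)
```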
  • The above-mentioned dividing of the combined graph into a plurality of rectangular units may include the following: determining the dividing lines according to the endpoint coordinates of the rectangular frames; dividing the combined graph into a plurality of rectangular units along the dividing lines; and generating the position coordinates of the rectangular unit corresponding to each rectangular frame according to the endpoint coordinates of that rectangular frame.
  • the image data of the text to be processed may specifically include: a scanned image or a photograph containing a contract to be processed.
  • the image data of the to-be-processed text listed above is only for better explaining the embodiments of the present specification.
  • the image data of the text to be processed may also include image data of other types and contents, for example, a video screenshot containing a manual to be processed, etc. This specification is not limited.
  • In the method for acquiring table data, the combined graph is obtained by extracting graphic features such as the morphological vertical lines and morphological horizontal lines in the image data; the combined graph is then divided into a plurality of rectangular units, and optical character recognition is performed on each rectangular unit to obtain the text information it contains; the rectangular units containing text information are then combined and restored according to their position coordinates to obtain complete table data. This solves the technical problem of large errors and inaccuracy in table data extraction in existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely. In addition, after the combined graph is extracted, graphic factors such as the intersection points and rectangular frames it contains are used to detect whether the extracted combined graph is table data in the text, thereby avoiding misidentifying non-table data as a table, reducing errors, and improving the accuracy of the acquired table data.
  • An embodiment of this specification also provides a server including a processor and a memory for storing processor-executable instructions.
  • In specific implementation, the processor may perform the following steps according to the instructions: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph including morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein the plurality of rectangular units each carry position coordinates; performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  • this specification also provides another specific server, where the server includes a network communication port 601, a processor 602, and a memory 603.
  • These components are connected by cables so that they can exchange data with one another.
  • the network communication port 601 may be specifically used to input image data of text to be processed
  • The processor 602 may be specifically used to: extract a combined graph from the image data, wherein the combined graph is a graph including morphological vertical lines and morphological horizontal lines; divide the combined graph into a plurality of rectangular units, wherein the plurality of rectangular units each carry position coordinates; perform optical character recognition on each of the plurality of rectangular units to determine the text information contained in each; and combine the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  • The memory 603 may be specifically used to store the image data of the text to be processed input via the network communication port 601, and to store the instruction programs on which the processor 602 operates.
  • the network communication port 601 may be a virtual port that is bound to different communication protocols so that different data can be sent or received.
  • the network communication port may be port 80 responsible for web data communication, port 21 responsible for FTP data communication, or port 25 responsible for mail data communication.
  • the network communication port may also be a physical communication interface or a communication chip.
  • For example, it may be a wireless mobile network communication chip, such as GSM or CDMA; it may also be a Wi-Fi chip or a Bluetooth chip.
  • the processor 602 can be implemented in any suitable manner.
  • For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.
  • The memory 603 may include multiple levels. In a digital system, anything that can store binary data can serve as memory. In an integrated circuit, a circuit with a storage function but no physical form is also called memory, such as RAM or FIFO; in a system, a storage device with physical form is also called memory, such as a memory stick or TF card.
  • The embodiments of the present specification also provide a computer storage medium based on the above-mentioned table data acquisition method, where the computer storage medium stores computer program instructions which, when executed, implement the following: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph including crossed morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein the plurality of rectangular units each carry position coordinates; performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  • The storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), cache, hard disk drive (HDD), or memory card.
  • the memory may be used to store computer program instructions.
  • the network communication unit may be an interface configured to perform network connection communication according to the standard specified by the communication protocol.
  • the embodiment of the present specification also provides an apparatus for acquiring table data.
  • the apparatus may specifically include the following structural modules:
  • the obtaining module 71 can be specifically used to obtain image data of text to be processed
  • the extracting module 72 may be specifically used for extracting a combined image from the image data, wherein the combined image is a graph including vertical morphological lines and horizontal morphological lines;
  • the segmentation module 73 may be specifically used to segment the combined image into multiple rectangular units, where the multiple rectangular units each carry position coordinates;
  • the recognition module 74 may be specifically configured to perform optical character recognition on the plurality of rectangular units respectively and determine the text information contained in the plurality of rectangular units respectively;
  • the combining module 75 can be specifically used to combine rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  • the extraction module 72 may specifically include the following structural units:
  • the first search unit may specifically be used to search for and obtain morphological vertical lines and morphological horizontal lines in the image data
  • the connecting unit may specifically be used to connect the morphological vertical line and the morphological horizontal line to obtain the combined graph.
  • the apparatus may further specifically include a detection module, configured to detect whether the combination graph meets a preset table format requirement.
  • the detection module may specifically include the following structural units:
  • the obtaining unit may be specifically configured to obtain the coordinates of the intersection point in the combined graph, where the intersection point may specifically be a pixel point at a position where the morphological vertical line and the morphological horizontal line intersect in the combined map;
  • the second search unit may specifically be used to search for and obtain a rectangular frame in the combination diagram
  • the first determining unit may specifically be used to determine the coordinates of the end point of the rectangular frame according to the coordinates of the intersection in the combined graph;
  • the second determining unit may be specifically configured to determine whether the combined image meets the preset table format requirements according to the endpoint coordinates of the rectangular frame.
  • the second determining unit may be specifically configured to calculate the area of the rectangular frame according to the coordinates of the endpoints of the rectangular frame; and detect whether the area of the rectangular frame is greater than a preset area threshold.
  • the segmentation module 73 may specifically include the following structural units:
  • the third determining unit can be specifically used to determine the dividing line according to the coordinates of the end points of the rectangular frame
  • the dividing unit may specifically be used to divide the combined image into a plurality of rectangular units according to the dividing line, and generate position coordinates of the rectangular unit corresponding to the rectangular frame according to the end point coordinates of the rectangular frame.
  • The apparatus may further include a preprocessing module for preprocessing the image data of the text to be processed, wherein the preprocessing may specifically include: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data, and so on.
  • the image data of the text to be processed may specifically include: a scanned image or a photograph containing a contract to be processed.
  • the image data of the to-be-processed text listed above is only for better explaining the embodiments of the present specification.
  • the image data of the text to be processed may also include image data of other types and contents, for example, a video screenshot containing a manual to be processed, etc. This specification is not limited.
  • the units, devices, or modules explained in the foregoing embodiments may be specifically implemented by computer chips or entities, or implemented by products with certain functions.
  • the functions are divided into various modules and described separately.
  • the functions of each module may be implemented in one or more software and/or hardware, or the modules that implement the same function may be implemented by a combination of multiple submodules or subunits.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a division of logical functions.
  • In actual implementation there may be other ways of division; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical, or other forms.
  • In the table data acquisition apparatus provided by the embodiments of this specification, the extraction module extracts the combined graph according to graphic features such as the morphological vertical lines and morphological horizontal lines in the image data; the segmentation module divides the combined graph into a plurality of rectangular units; the recognition module performs optical character recognition on each rectangular unit to obtain the text information it contains; and the combining module combines and restores the rectangular units containing text information according to their position coordinates to obtain complete table data. This solves the technical problem of large errors and inaccuracy in table data extraction in existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely. In addition, the detection module checks, according to graphic factors such as the intersection points and rectangular frames contained in the combined graph, whether the extracted combined graph is table data in the text, thereby avoiding misidentifying non-table data as a table, reducing errors, and improving the accuracy of the acquired table data.
  • The method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Indeed, the devices for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
  • program modules include routines, programs, objects, components, data structures, classes, etc. that perform specific tasks or implement specific abstract data types.
  • This specification can also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network.
  • program modules may be located in local and remote computer storage media including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A table data acquisition method and apparatus, and a server. The method comprises: obtaining image data of text to be processed; extracting a combined graph from the image data, the combined graph being a graph containing morphological vertical lines and morphological horizontal lines that cross each other; dividing the combined graph into a plurality of rectangular units; performing optical character recognition on each of the rectangular units and determining the text information of the rectangular units; and, according to the position coordinates of the rectangular units, combining the rectangular units containing the text information to obtain table data. The method first obtains graphic features such as morphological vertical lines and morphological horizontal lines in the image data and obtains a combined graph according to those features, then divides the combined graph into a plurality of rectangular units for optical character recognition to obtain the text information of the rectangular units, and finally combines and restores the units according to their position coordinates to obtain table data. This solves the technical problem of large errors and inaccuracy in table data extraction in existing methods.

Description

Method, apparatus, and server for acquiring table data
Technical field
This specification belongs to the field of Internet technology, and particularly relates to a method, an apparatus, and a server for acquiring table data.
Background
In daily life and work, people often encounter text data (for example, contract documents) that contains not only standalone text characters (for example, plain words and symbols) but also table data (for example, a statistical list of prices). In some scenarios such table data carries high information value and holds the content that people care about most.
Conventional data acquisition methods usually perform optical character recognition directly on image data, such as scanned pictures containing the text data, to recognize and extract the text information in the image data and obtain electronic data for the corresponding text.
Such methods work relatively well when recognizing and extracting standalone text characters in image data. Table data, however, differs from standalone text characters: in addition to the text information carried by the characters, it has graphic features such as dividing lines and dividing frames. Compared with standalone text characters, table data has a more complex structure and is harder to recognize. As a result, existing data acquisition methods are prone to errors when recognizing table data in image data. For example, a dividing line in the table may be misrecognized as a digit, or the text characters in the N rows and M columns of the table may be assigned to the wrong positions. A method that can accurately recognize and completely restore the table data in image data is therefore urgently needed.
Summary of the invention
The purpose of this specification is to provide a method, an apparatus, and a server for acquiring table data, so as to solve the technical problem of large errors and inaccuracy when extracting table data with existing methods, and to identify the table content in image data efficiently and accurately and restore it completely.
The method, apparatus, and server for acquiring table data provided in this specification are implemented as follows:
A method for acquiring table data, comprising: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates; performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each rectangular unit; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
An apparatus for acquiring table data, comprising: an acquisition module, configured to acquire image data of text to be processed; an extraction module, configured to extract a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; a segmentation module, configured to divide the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates; a recognition module, configured to perform optical character recognition on each of the plurality of rectangular units to determine the text information contained in each rectangular unit; and a combination module, configured to combine the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates; performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each rectangular unit; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
A computer-readable storage medium storing computer instructions that, when executed, implement: acquiring image data of text to be processed; extracting a combined graph from the image data, wherein the combined graph is a graph containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates; performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each rectangular unit; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
With the method, apparatus, and server for acquiring table data provided in this specification, a combined graph is first extracted according to graphic features in the image data, such as morphological vertical lines and morphological horizontal lines; the combined graph is then divided into a plurality of rectangular units, and optical character recognition is performed on each rectangular unit separately to obtain the text information it contains; finally, the rectangular units containing text information are combined according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy when extracting table data with existing methods, so that the table content in the image data can be identified efficiently and accurately and restored completely.
Brief description of the drawings
To explain the technical solutions in the embodiments of this specification more clearly, the drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below show only some of the embodiments recorded in this specification, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an embodiment in which the method for acquiring table data provided by an embodiment of this specification is applied in a scenario example;
FIG. 2 is a schematic diagram of an embodiment in which the method for acquiring table data provided by an embodiment of this specification is applied in a scenario example;
FIG. 3 is a schematic diagram of an embodiment in which the method for acquiring table data provided by an embodiment of this specification is applied in a scenario example;
FIG. 4 is a schematic diagram of an embodiment in which the method for acquiring table data provided by an embodiment of this specification is applied in a scenario example;
FIG. 5 is a schematic diagram of an embodiment of the flow of the method for acquiring table data provided by an embodiment of this specification;
FIG. 6 is a schematic diagram of an embodiment of the structure of the server provided by an embodiment of this specification;
FIG. 7 is a schematic diagram of an embodiment of the structure of the apparatus for acquiring table data provided by an embodiment of this specification.
Detailed description
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the protection scope of this specification.
Most existing data acquisition methods are designed to recognize standalone text characters in image data containing the text to be processed, and they therefore achieve good accuracy when recognizing and extracting the text information represented by those characters. However, some types of text data, such as contract texts, also contain table content. Compared with standalone text characters, table content has a more complex structure: besides text characters, it usually has graphic features, such as morphological graphic structures. This makes recognizing, extracting, and reconstructing such table data more complicated and difficult. When existing data acquisition methods are applied directly to table data in image data, text characters and graphic features are easily confused and cannot be accurately distinguished and processed, which leads to errors. For example, graphic structures such as dividing lines in the table data may be misrecognized as text characters, or the text information at different positions in the table data may be extracted out of position. In other words, existing methods often perform poorly on table data in image data and suffer from the technical problem of large errors and inaccuracy when extracting table data.
To address the root cause of this problem, this specification analyzes the different recognition characteristics of the two kinds of objects that table data contains at the same time: text characters and graphic structures. Image structure features such as morphological vertical lines and morphological horizontal lines are first obtained from the image data to find a combined graph that may form table data; the combined graph is then divided into a plurality of rectangular units, and optical character recognition is performed on each rectangular unit separately to obtain its text information; finally, the rectangular units containing text information are combined according to their position coordinates to restore and reconstruct the complete table data of the image. This solves the technical problem of large errors and inaccuracy when extracting table data with existing methods, so that the table content in the image data can be identified efficiently and accurately and restored completely.
The embodiments of this specification provide a method for acquiring table data. The method may be applied in an image data processing system that includes multiple servers, for example, a system that processes scanned pictures of legal contracts.
The system may include a server responsible for recognizing and acquiring, from image data, the table data in the text data. In a specific implementation, the server may extract a combined graph from the acquired image data of the text to be processed by detecting graphic structure features in the image data, such as morphological vertical lines and morphological horizontal lines; divide the combined graph into a plurality of rectangular units according to coordinates; perform optical character recognition on each of the rectangular units to recognize and determine the text information it contains; and then combine and splice the rectangular units containing text information according to their coordinates to obtain the complete table data.
In this embodiment, the server can be understood as a business server on the business system side that implements functions such as data transmission and data processing. Specifically, the server may be an electronic device with data computing, storage, and network interaction functions, or a software program that runs on such a device and supports data processing, storage, and network interaction. This embodiment does not limit the number of servers: the server may be a single server, several servers, or a server cluster formed by several servers.
In a scenario example, as shown in FIG. 1, the method for acquiring table data provided by the embodiments of this specification may be used to process image data containing a contract that is received by a legal platform, so as to extract the table data in the contract.
In this scenario example, the legal platform may assign the image data entered by the user, which contains the contract to be processed, to the server in the platform that is used to acquire table data.
The legal platform may be used to recognize and extract the text information in image data uploaded by the user that contains a contract (for example, a scanned picture or photo of the contract), so as to convert the contract content into electronic data and store it in the platform's database, where users can conveniently retrieve and manage it.
After receiving the image data containing the contract, the server may first preprocess the image, as shown in FIG. 2, to reduce interference from errors and improve the accuracy of the subsequent recognition and acquisition of table data.
Specifically, the server may be configured with OpenCV (Open Source Computer Vision Library). OpenCV can be understood as an open-source library of API functions for computer vision; the functions in the library have been optimized, so calling them is relatively efficient. In a specific implementation, the server can call the corresponding functions through OpenCV to process the image data efficiently.
Specifically, the server may first convert the image data to grayscale to obtain a corresponding grayscale image, and then apply Gaussian smoothing to the grayscale image to filter out the more obvious noise and improve the accuracy of the image data, which completes the preprocessing. Note that converting the image data to a grayscale image is described here only as an example. In a specific implementation, depending on the scenario and accuracy requirements, the image data may instead be converted to a binary image first, and the subsequent table data acquisition may then be based on the binary image. This specification does not limit this.
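The preprocessing described above can be sketched in a few lines of plain Python. This is only an illustrative stand-in, not the implementation used in this specification: in an OpenCV deployment the same two steps would typically be done with `cv2.cvtColor` and `cv2.GaussianBlur`. The function names `to_gray` and `gaussian_smooth`, the 3x3 kernel, the luminance weights, and the edge-replication border handling are all assumptions made for the example.

```python
# Pure-Python stand-ins for grayscale conversion and Gaussian smoothing.
# Kernel size, weights, and border handling are assumptions for illustration.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # 3x3 Gaussian approximation, sums to 16


def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray levels."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def gaussian_smooth(gray):
    """Apply the 3x3 kernel, replicating edge pixels at the borders."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to the image
                    xx = min(max(x + dx, 0), w - 1)
                    acc += KERNEL[dy + 1][dx + 1] * gray[yy][xx]
            out[y][x] = acc // 16  # normalize by the kernel sum
    return out
```

Smoothing spreads an isolated noisy pixel over its neighbors, which lowers its peak value while leaving uniform regions unchanged.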
After preprocessing the image data containing the contract, the server may first scan the image data for graphic structure features (for example, structural elements) based on morphology, so as to find graphics that differ from standalone text characters, have certain graphic features, and may form a table: the combined graph.
Take a specific frame of the image data as an example, such as the image of the fifth page of the image data containing the contract. The server may scan the frame for morphological vertical lines and morphological horizontal lines.
The morphological vertical lines and morphological horizontal lines can be understood as graphics-related structural elements that differ from text characters, as shown in FIG. 3. A morphological vertical line may be an image unit or structural element in the image that contains a straight line segment along the vertical direction. A morphological horizontal line may be an image unit or structural element in the image that contains a straight line segment along the horizontal direction.
Specifically, the server may call the getStructuringElement function to construct line-shaped structuring elements and use them to search the structural elements in the image, finding all morphological vertical lines and morphological horizontal lines. Note that obtaining the morphological vertical and horizontal lines by calling getStructuringElement is only an illustrative example. In a specific implementation, the lines may also be obtained in other suitable ways depending on the situation. This specification does not limit this.
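A minimal sketch of the line search, assuming a binary image stored as rows of 0/1 values. On a binary image, a morphological opening with a 1 x k horizontal line element keeps exactly the pixels that belong to horizontal runs of at least k set pixels, so the sketch implements that run test directly instead of calling OpenCV. The names `keep_runs`, `horizontal_lines`, and `vertical_lines` and the `min_len` parameter are hypothetical and do not come from this specification.

```python
def keep_runs(row, min_len):
    """Keep only the 1-pixels that lie in a run of at least min_len consecutive 1s."""
    out = [0] * len(row)
    start = None
    for i, v in enumerate(list(row) + [0]):  # sentinel 0 closes a trailing run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_len:
                out[start:i] = [1] * (i - start)
            start = None
    return out


def horizontal_lines(binary, min_len):
    """Mask of pixels that belong to horizontal segments of at least min_len pixels."""
    return [keep_runs(row, min_len) for row in binary]


def vertical_lines(binary, min_len):
    """Mask of vertical segments: transpose, reuse the row logic, transpose back."""
    cols = [keep_runs(col, min_len) for col in zip(*binary)]
    return [list(r) for r in zip(*cols)]
```

Choosing `min_len` longer than the strokes of ordinary characters is what separates table rules from text; the right value depends on the scan resolution and is left open here.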
In table data, each morphological horizontal line usually intersects one or more morphological vertical lines. Therefore, after obtaining the morphological vertical and horizontal lines in the frame, the server may further search for graphics whose structure contains intersecting morphological vertical and horizontal lines and take them as combined graphs that may form table data. This avoids further processing of graphic structures that clearly lack the graphic features of table data and improves processing efficiency.
In this scenario example, to keep the extracted morphological horizontal and vertical lines from becoming misaligned, the lines may be extracted directly on the original image, and the extracted lines may be overlaid at the positions from which they were extracted.
After a combined graph that has the clearer graphic features of a data table and may form table data has been obtained, the combined graph may be checked further: by checking whether it meets preset table format requirements, the server can determine more precisely whether the combined graph is a data table.
The preset table format requirements can be understood as a rule set that describes the graphic features distinguishing a data table from other graphic structures.
For example, a data table differs from other graphics in that each cell (or rectangular frame, see FIG. 3) is designed to be filled with characters, so the smallest cell in a data table should be able to hold at least one complete character. A rule on the area feature can therefore be set: the minimum area of a cell in the data table should be greater than a preset area threshold. In addition, people usually center a table when typesetting it. A rule on the position feature can therefore be set: the absolute value of the difference between the distance from the left boundary of the data table to the left boundary of the image and the distance from the right boundary of the data table to the right boundary of the image is less than a preset distance threshold. Furthermore, a table is usually used to list two or more pieces of data side by side so that the differences between them are shown more clearly. A rule on the quantity feature can therefore be set: the number of cells in the data table is greater than or equal to a preset quantity threshold (for example, 2).
Note that the specific rules listed above are given only to better explain the embodiments of this specification. In a specific implementation, rules of other types or with other content may be introduced as the preset table format requirements according to the application scenario and processing requirements. This specification does not limit this.
In this scenario example, to determine whether the extracted combined graph meets the preset table format requirements, the server may first find the points at which a morphological horizontal line and a morphological vertical line occupy the same image position, take them as intersection points, and then determine the position coordinates of each intersection point of the combined graph in the frame.
An intersection point can be understood as a pixel in the frame at which a morphological vertical line and a morphological horizontal line of the combined graph cross, as shown in FIG. 3.
Specifically, the server may call the OpenCV bitwise_and function to search for and obtain the coordinates of the intersection points of the combined graph in the image. Note that obtaining the intersection coordinates with bitwise_and is only an illustrative example. In a specific implementation, the server may also obtain the coordinates of the intersection points in other suitable ways depending on the situation. This specification does not limit this.
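The intersection search amounts to a per-pixel AND of the horizontal-line mask and the vertical-line mask. The sketch below shows that idea on plain Python lists; it is an illustration under assumptions rather than the OpenCV call itself. The function name `intersections` and the representation of the masks as rows of 0/1 values are hypothetical.

```python
def intersections(h_mask, v_mask):
    """Return (x, y) for every pixel that is set in both masks (a per-pixel AND)."""
    return [(x, y)
            for y, (h_row, v_row) in enumerate(zip(h_mask, v_mask))
            for x, (h, v) in enumerate(zip(h_row, v_row))
            if h and v]


# Two horizontal rules (rows 0 and 2) crossed by three vertical rules (columns 0, 2, 4).
h_mask = [[1] * 5, [0] * 5, [1] * 5]
v_mask = [[1, 0, 1, 0, 1]] * 3
points = intersections(h_mask, v_mask)
```

With real scans a rule is several pixels thick, so each crossing yields a small cluster of points; reducing a cluster to one coordinate is a further step that this sketch does not cover.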
At the same time, the server may further search the graphic structure elements of the combined graph to find graphic elements with a rectangular (or square) structure, each corresponding to one cell of the table, and take them as the rectangular frames of the combined graph, as shown in FIG. 3.
Specifically, the server may call the findContours function to search for and obtain the rectangular frames in the combined graph. Note that obtaining the rectangular frames with findContours is only an illustrative example. In a specific implementation, the server may also obtain the rectangular frames in other suitable ways depending on the situation. This specification does not limit this.
Further, based on the determined intersection coordinates and the rectangular frames in the combined graph, the server may determine, by position comparison, the coordinates of the four endpoints of each rectangular frame. It can then determine from these endpoint coordinates whether the combined graph meets the preset table format requirements.
For example, the server may calculate the length and width of a rectangular frame from its endpoint coordinates, calculate the area of the frame from the length and width, and compare the area with the preset area threshold. If the area of every rectangular frame in the combined graph is greater than the preset area threshold, the combined graph can be judged to meet the preset table format requirements.
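The area rule can be written down directly. The following is a minimal sketch that assumes axis-aligned rectangular frames, each given as four (x, y) endpoint coordinates; the names `rect_area` and `passes_area_rule` and the parameter `min_area` (standing for the preset area threshold) are hypothetical.

```python
def rect_area(corners):
    """Area of an axis-aligned rectangle given its four (x, y) endpoint coordinates."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def passes_area_rule(rects, min_area):
    """True when every rectangular frame is larger than the area threshold."""
    return all(rect_area(r) > min_area for r in rects)


cell = [(0, 0), (10, 0), (0, 5), (10, 5)]    # 10 wide, 5 high
speck = [(0, 0), (2, 0), (0, 2), (2, 2)]     # too small to hold a character
```

A sensible threshold is roughly the footprint of one character at the scan resolution; the value is application-specific and is not fixed here.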
As another example, the server may compare the abscissas of the endpoint coordinates of the rectangular frames in the combined graph, take the endpoint with the smallest abscissa as an endpoint on the left boundary of the combined graph, use its abscissa as the abscissa of the left boundary, and calculate from it the distance between the left boundary of the combined graph and the left boundary of the image, denoted d1. Similarly, the server may take the endpoint with the largest abscissa as an endpoint on the right boundary of the combined graph, use its abscissa as the abscissa of the right boundary, and calculate from it the distance between the right boundary of the combined graph and the right boundary of the image, denoted d2. The server may then calculate the absolute value of the difference between d1 and d2 and compare it with the preset distance threshold. If the absolute value is less than or equal to the preset distance threshold, the combined graph as a whole can be judged to be centered in the image, that is, to meet the preset table format requirements.
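The centering check compares the left margin d1 with the right margin d2. The sketch below assumes that the left boundary of the image is at x = 0 and the right boundary is at x = image_width; the name `is_centered` and the parameter `max_offset` (standing for the preset distance threshold) are hypothetical.

```python
def is_centered(rects, image_width, max_offset):
    """Check |d1 - d2| <= max_offset for the outer boundaries of the combined graph."""
    xs = [x for corners in rects for x, _ in corners]
    d1 = min(xs)                 # left image boundary to left table boundary
    d2 = image_width - max(xs)   # right table boundary to right image boundary
    return abs(d1 - d2) <= max_offset


# Two adjacent cells spanning x = 100..500 on a 600-pixel-wide page: d1 = d2 = 100.
centered = [[(100, 0), (300, 0), (100, 50), (300, 50)],
            [(300, 0), (500, 0), (300, 50), (500, 50)]]
# The same cells pushed against the left edge: d1 = 0, d2 = 200.
shifted = [[(0, 0), (200, 0), (0, 50), (200, 50)],
           [(200, 0), (400, 0), (200, 50), (400, 50)]]
```

In symbols the test is $|d_1 - d_2| \le T$, where $T$ is the preset distance threshold.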
Of course, it should be noted that the ways listed above of judging whether the combined graph meets the preset table format requirements are given only to better illustrate the embodiments of this specification. In a specific implementation, depending on the circumstances and the required accuracy, the two judging methods above may be combined, or other suitable judging methods may be introduced. This specification places no limitation on this.
After determining that the combined graph conforms to the preset table format, the server can conclude that the extracted combined graph is indeed a data table in the image, and can proceed to extract the text information from it.
Because the combined graph usually contains multiple cells, or rectangular frames, recognizing and extracting its text information directly is prone to problems such as misalignment. The server may therefore first divide the combined graph into multiple rectangular units. Each rectangular unit corresponds one-to-one with a rectangular frame in the combined graph; unlike a rectangular frame, however, which is a purely graphical structural element, each rectangular unit also contains text characters or blank-state information. Optical character recognition can then be performed on each rectangular unit separately, so that the text characters in each unit are recognized accurately and the text information contained in each unit is determined.
Specifically, the server may first use the endpoint coordinates of a rectangular frame to determine the contour line enclosing that frame as the dividing line, and then cut along the contour line to separate the corresponding rectangular unit from the combined graph. For example, refer to FIG. 4. Suppose the four endpoints of a rectangular frame in the combined graph are A(15, 60), B(15, 40), C(30, 40) and D(30, 60). In a specific implementation, the server may start from endpoint A and, following a preset division rule, keep the abscissa 15 fixed and look for the endpoint with a different ordinate, namely endpoint B, then connect A to B. Starting from endpoint B, it keeps the ordinate 40 fixed and looks for the endpoint with a different abscissa, namely endpoint C, then connects B to C. Starting from endpoint C, it keeps the abscissa 30 fixed and looks for the endpoint with a different ordinate, namely endpoint D, then connects C to D. Finally, starting from endpoint D, it keeps the ordinate 60 fixed and looks for the endpoint with a different abscissa, namely endpoint A, then connects D to A. This yields a closed connecting line, A to B to C to D to A, which is the contour line of the rectangular frame. The server can then use this contour line as the dividing line and cut the rectangular frame, together with the text information it contains, out of the combined graph along the contour line to obtain the corresponding rectangular unit.
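The tracing rule in the example above can be sketched as follows. The sketch assumes an axis-aligned frame with exactly four endpoints; the function names are hypothetical and the "preset division rule" is reduced to alternately holding the abscissa and the ordinate fixed.

```python
def trace_outline(endpoints, start):
    """Walk the four endpoints, alternately keeping x and then y unchanged."""
    remaining = [p for p in endpoints if p != start]
    path = [start]
    current = start
    hold_x = True  # first step: keep the abscissa, change the ordinate
    while remaining:
        axis = 0 if hold_x else 1
        nxt = next(p for p in remaining if p[axis] == current[axis])
        path.append(nxt)
        remaining.remove(nxt)
        current = nxt
        hold_x = not hold_x
    path.append(start)  # close the outline
    return path

def crop_box(path):
    """Bounding box (x_min, y_min, x_max, y_max) enclosed by the outline."""
    xs = [x for x, _ in path]
    ys = [y for _, y in path]
    return min(xs), min(ys), max(xs), max(ys)

A, B, C, D = (15, 60), (15, 40), (30, 40), (30, 60)
outline = trace_outline([A, B, C, D], start=A)
box = crop_box(outline)
```

For the endpoints of the example, the outline visits A, B, C, D and returns to A, and the resulting box can be used to cut the rectangular unit out of the image array.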
Each rectangular unit in the combined graph can be separated in the manner described above. Of course, it should be noted that this way of dividing out rectangular units is given only to better illustrate the embodiments of this specification. In a specific implementation, other suitable ways of dividing the combined graph into multiple rectangular units may be used depending on the circumstances. This specification places no limitation on this.
It should be noted that, while dividing the combined graph, the server also generates the position coordinates of each rectangular unit from the endpoint coordinates of the corresponding rectangular frame.
Here, the position coordinates can be understood as parameter data that indicates the position of a rectangular unit in the image of the combined graph, or that describes the positional relationship between a rectangular unit and its neighboring rectangular units in that image.
Specifically, the server may calculate the coordinates of the center point of a rectangular frame from the coordinates of its four endpoints and use them as the position coordinates of the corresponding rectangular unit. Alternatively, the server may first calculate the center point of every rectangular frame and then, following a preset ordering (for example, top to bottom and left to right), use the center points to determine a row number and a column number for each rectangular unit, which serve as its position coordinates. For example, if the center point shows that rectangular frame A lies in the first row and second column of the combined graph, its row number is 1 and its column number is 2, so "1-2" can be used as the position coordinates of the rectangular unit corresponding to frame A. Of course, it should be noted that these ways of determining the position coordinates of a rectangular unit are only illustrative. In a specific implementation, other suitable ways may be used depending on the circumstances. This specification places no limitation on this.
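One possible way to derive "row-column" position coordinates from frame centers is sketched below. It assumes image coordinates in which a smaller ordinate is higher on the page, and it groups centers into the same row when their ordinates differ by at most a hypothetical `row_tolerance`; neither assumption comes from this specification.

```python
def rect(x0, y0, x1, y1):
    """Four endpoints of an axis-aligned rectangular frame."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

def center(frame):
    xs = [x for x, _ in frame]
    ys = [y for _, y in frame]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

def assign_positions(frames, row_tolerance=5.0):
    """Map each frame center to a "row-column" label, top-down then left-right."""
    centers = sorted((center(f) for f in frames), key=lambda c: (c[1], c[0]))
    rows = []
    for c in centers:
        # Start a new row when the ordinate jumps by more than the tolerance.
        if rows and abs(c[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(c)
        else:
            rows.append([c])
    positions = {}
    for r, row in enumerate(rows, start=1):
        for col, c in enumerate(sorted(row), start=1):  # sorted by abscissa
            positions[c] = f"{r}-{col}"
    return positions

frames = [rect(10, 10, 30, 20), rect(0, 0, 10, 10),
          rect(10, 0, 30, 10), rect(0, 10, 10, 20)]
positions = assign_positions(frames)
```

Tables with merged cells would need a more careful grouping rule than this sketch provides.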
After dividing the combined graph into multiple rectangular units, the server may perform optical character recognition (OCR) on each rectangular unit separately to recognize the text characters in it and thus determine the text information it contains. If no text characters are recognized in a rectangular unit, the text information of that unit is set to empty. In this way, multiple rectangular units, each with its corresponding text information, are obtained.
Further, the server may combine and splice the rectangular units containing text information according to their position coordinates. For example, a rectangular unit whose position coordinates are "1-2" is placed in the first row and second column. Placing each rectangular unit containing text information at its corresponding position in turn restores the complete data table. Of course, it should be noted that this way of combining is only illustrative. In a specific implementation, other combination methods, based on other types of position coordinates, may also be used. This specification places no limitation on this.
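The reassembly step can be illustrated with a short sketch. It assumes that position coordinates are "row-column" strings such as "1-2" and that the recognized text of each unit is held in a dictionary; the function name and the sample cell contents are hypothetical.

```python
def assemble_table(cells):
    """Place each unit's text at the row and column given by its position."""
    parsed = {
        tuple(int(n) for n in key.split("-")): text
        for key, text in cells.items()
    }
    n_rows = max(r for r, _ in parsed)
    n_cols = max(c for _, c in parsed)
    # Positions with no unit stay as empty strings.
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for (r, c), text in parsed.items():
        table[r - 1][c - 1] = text
    return table

cells = {"1-1": "Item", "1-2": "Amount", "2-1": "Service fee", "2-2": ""}
table = assemble_table(cells)
```

The unit at "2-2" carries empty text, which corresponds to a rectangular unit in which no characters were recognized.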
In the manner described above, the server can check each image in the image data of the contract to be processed for table data and, where table data is found to exist, acquire it. The server thus extracts the complete table data from the image data and feeds it back to the legal affairs platform, which can then organize it into electronic file data for the contract and save that data.
In another scenario example, to make the table lines in the acquired table data clearer, and thereby improve the accuracy of the subsequent optical character recognition, the server may, after obtaining the morphological vertical lines and morphological horizontal lines in a frame by scanning and searching, further apply feature enhancement processing to those lines so that they become clearer.
The feature enhancement processing may be a form of morphological processing, and may include erosion and/or dilation. In a specific implementation, the morphological processing slides the window of a convolution kernel across the frame and resets the value of the pixel at the center of the window (to 0 or 1). Specifically, erosion may be performed first, followed by dilation.
Specifically, erosion can be understood as an AND operation: according to the size of the convolution kernel, pixels near the edge of the foreground are eroded (that is, their values are reset to 0), so that the foreground object shrinks. This reduces the white area around a morphological vertical or horizontal line and thus removes white noise; it can also disconnect structural elements that are adjacent to, or even connected with, the morphological vertical or horizontal line.
Because erosion shrinks the structural elements of the image, the eroded morphological vertical or horizontal lines may then be dilated.
Dilation can be understood as an OR operation. It is the opposite of erosion: dilating the eroded image enlarges and restores it, which yields relatively clear morphological vertical lines and morphological horizontal lines of unchanged size.
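Erosion followed by dilation can be illustrated on a small binary image in plain Python. The sketch below is an assumption-laden illustration rather than the implementation used in this specification: the window is a kh x kw rectangle with odd side lengths, and pixels outside the image are simply ignored.

```python
def _window(img, r, c, kh, kw):
    """Values under a kh x kw window centered on (r, c), clipped to the image."""
    h, w = len(img), len(img[0])
    return [img[rr][cc]
            for rr in range(r - kh // 2, r + kh // 2 + 1)
            for cc in range(c - kw // 2, c + kw // 2 + 1)
            if 0 <= rr < h and 0 <= cc < w]

def erode(img, kh, kw):
    """AND: a pixel stays 1 only if every pixel under the window is 1."""
    return [[int(all(_window(img, r, c, kh, kw))) for c in range(len(img[0]))]
            for r in range(len(img))]

def dilate(img, kh, kw):
    """OR: a pixel becomes 1 if any pixel under the window is 1."""
    return [[int(any(_window(img, r, c, kh, kw))) for c in range(len(img[0]))]
            for r in range(len(img))]

img = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],  # a horizontal line
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],  # an isolated noise pixel
]
eroded = erode(img, kh=1, kw=3)        # the line shrinks, the noise pixel vanishes
restored = dilate(eroded, kh=1, kw=3)  # the line regains its original length
```

In OpenCV the same sequence would typically be written with `cv2.erode` and `cv2.dilate`, using a kernel built by `cv2.getStructuringElement`.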
As the above scenario examples show, the method for acquiring table data provided in this specification extracts a combined graph on the basis of graphic features in the image data such as morphological vertical lines and morphological horizontal lines, divides the combined graph into multiple rectangular units, performs optical character recognition on each rectangular unit separately to obtain the text information it contains, and then combines the rectangular units containing text information according to their position coordinates to restore the complete table data. This solves the technical problem in existing methods that extracted table data is inaccurate and error-prone, and makes it possible to recognize the table content in image data efficiently and accurately and to restore it completely.
Referring to FIG. 5, an embodiment of this specification further provides a method for acquiring table data, which is applied on the server side. In a specific implementation, the method may include the following steps:
S51: Acquire image data of the text to be processed.
In this embodiment, the text to be processed may be a contract, a charter, a specification, or the like. Correspondingly, the image data of the text to be processed may be a scanned picture, a photograph, or a video containing the text content. This specification places no limitation on the specific content or form of the image data of the text to be processed.
S53: Extract a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines.
In this embodiment, morphological vertical lines and morphological horizontal lines can be understood as graphic structural elements, as distinct from text characters. A morphological vertical line may be an image unit or structural element in the image that contains a straight line segment running in the vertical direction. A morphological horizontal line may be an image unit or structural element in the image that contains a straight line segment running in the horizontal direction.
In this embodiment, the combined graph can be understood as a composite figure in the image data that has graphic features similar to those of table data, for example one that likewise contains intersecting morphological vertical lines and morphological horizontal lines as graphic structural elements.
In this embodiment, extracting the combined graph from the image data may, in a specific implementation, include: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
In this embodiment, searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data may, in a specific implementation, include: calling the getStructuringElement function in OpenCV to define structuring elements, and using them to search the structural elements of the image for the morphological vertical lines and morphological horizontal lines in the image data. Of course, it should be noted that obtaining the morphological vertical and horizontal lines by means of the getStructuringElement function is only illustrative. In a specific implementation, they may also be obtained in other suitable ways depending on the circumstances. This specification places no limitation on this.
In this embodiment, the morphological vertical lines and morphological horizontal lines obtained in this way also carry their position information in the image data, so the corresponding vertical and horizontal lines can be connected according to that position information to obtain the combined graph.
S55: Divide the combined graph into multiple rectangular units, where each of the rectangular units carries position coordinates.
In this embodiment, a rectangular unit can be understood as an image unit that corresponds one-to-one with a rectangular frame in the combined graph but, unlike the rectangular frame, contains text information (for example, it is filled with text characters or is empty).
In this embodiment, a rectangular frame can be understood as a rectangular or square graphic element that is formed by two segments of morphological vertical lines and two segments of morphological horizontal lines and that contains graphic features only. Each rectangular frame can be regarded as one cell of the table.
In this embodiment, dividing the combined graph into multiple rectangular units may, in a specific implementation, include: obtaining the coordinates of the intersection points in the combined graph; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the intersection coordinates; and dividing the combined graph into multiple rectangular units according to the endpoint coordinates of the rectangular frames.
In this embodiment, an intersection point can be understood as a pixel at a position where a morphological vertical line and a morphological horizontal line in the combined graph cross.
In this embodiment, in a specific implementation, the coordinates of the intersection points in the combined graph may be searched for and obtained by calling the bitwise_and function in OpenCV. Of course, it should be noted that obtaining the intersection coordinates with the bitwise_and function is only illustrative. In a specific implementation, the server may also obtain the coordinates of the intersection points in the combined graph in other suitable ways depending on the circumstances. This specification places no limitation on this.
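The idea behind using a bitwise AND to find intersections can be shown in plain Python. The sketch assumes that the horizontal lines and the vertical lines are already available as two binary masks of the same size; the mask names and the 5 x 5 example are hypothetical.

```python
def intersection_points(h_mask, v_mask):
    """(x, y) of every pixel that is set in both the horizontal and vertical mask."""
    return [
        (x, y)
        for y, (h_row, v_row) in enumerate(zip(h_mask, v_mask))
        for x, (h, v) in enumerate(zip(h_row, v_row))
        if h & v  # per-pixel AND of the two masks
    ]

size = 5
# Horizontal lines on rows 0 and 4; vertical lines on columns 0 and 4.
h_mask = [[1] * size if y in (0, 4) else [0] * size for y in range(size)]
v_mask = [[1 if x in (0, 4) else 0 for x in range(size)] for _ in range(size)]
points = intersection_points(h_mask, v_mask)
```

With OpenCV, `cv2.bitwise_and` applies the same per-pixel AND to two image arrays; the nonzero pixels of the result mark the intersections.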
In this embodiment, in a specific implementation, the rectangular frames in the combined graph may be searched for and obtained by calling the findContours function in OpenCV. Of course, it should be noted that obtaining the rectangular frames with the findContours function is only illustrative. In a specific implementation, the server may also obtain the rectangular frames in the combined graph in other suitable ways depending on the circumstances. This specification places no limitation on this.
In this embodiment, OpenCV (Open Source Computer Vision Library) can be understood as an open-source library of API functions for computer vision. The function code in the library has been optimized, so calls and computations are relatively efficient. In a specific implementation, the server can call the corresponding functions through OpenCV to process the image data efficiently.
In this embodiment, dividing the combined graph into multiple rectangular units according to the endpoint coordinates of the rectangular frames may, in a specific implementation, include: determining the endpoint coordinates of the rectangular frames according to the intersection coordinates in the combined graph; determining dividing lines according to the endpoint coordinates of the rectangular frames; and dividing the combined graph into multiple rectangular units along the dividing lines.
In this embodiment, determining the endpoint coordinates of the rectangular frames according to the intersection coordinates in the combined graph may, in a specific implementation, include: comparing the positions of the intersection points in the combined graph with the positions of the rectangular frames, so as to identify among the intersection points the four endpoints of each rectangular frame and thereby determine the endpoint coordinates of each rectangular frame.
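One hypothetical way to carry out this position comparison is to snap each corner of a frame's bounding box to the nearest intersection point. In the sketch below, the bounding box format (x, y, width, height), the Manhattan distance, and the `tolerance` parameter are all assumptions made for illustration.

```python
def frame_endpoints(box, intersections, tolerance=3):
    """Match the four corners of a frame's bounding box to intersection points."""
    x, y, w, h = box
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    endpoints = []
    for cx, cy in corners:
        nearest = min(intersections,
                      key=lambda p: abs(p[0] - cx) + abs(p[1] - cy))
        if abs(nearest[0] - cx) + abs(nearest[1] - cy) > tolerance:
            return None  # no intersection close enough: not a table cell
        endpoints.append(nearest)
    return endpoints

intersections = [(15, 40), (30, 40), (70, 40), (15, 60), (30, 60), (70, 60)]
# A detected frame lying just inside the table lines.
endpoints = frame_endpoints((16, 41, 13, 18), intersections)
```

Returning `None` for a frame without nearby intersections gives a simple way to discard contours that do not belong to the table grid.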
In this embodiment, determining the dividing lines according to the endpoint coordinates of the rectangular frames may, in a specific implementation, include: using the four endpoint coordinates of each rectangular frame to determine the contour line enclosing that frame as the corresponding dividing line. The combined graph can then be cut along these dividing lines to obtain the individual rectangular units.
In this embodiment, while dividing the combined graph into multiple rectangular units, the method further includes: generating the position coordinates of each rectangular unit according to the endpoint coordinates of the corresponding rectangular frame.
In this embodiment, the position coordinates of a rectangular unit can be understood as parameter data that indicates the position of the rectangular unit in the image of the combined graph, or that describes the positional relationship between the rectangular unit and its neighboring rectangular units in that image.
In this embodiment, in a specific implementation, the coordinates of the center point of a rectangular frame may be calculated from the coordinates of its four endpoints and used as the position coordinates of the corresponding rectangular unit. Alternatively, the center point of every rectangular frame may be calculated first; the rectangular units are then arranged in a preset order (for example, top to bottom and left to right) according to those center points, and the row number and column number of each sorted rectangular unit are used as its position coordinates. Of course, it should be noted that these ways of determining the position coordinates of a rectangular unit are only illustrative. In a specific implementation, other suitable ways may be used depending on the circumstances. This specification places no limitation on this.
S57: Perform optical character recognition on each of the rectangular units to determine the text information contained in each of them.
In this embodiment, in a specific implementation, optical character recognition may be performed separately on each of the rectangular units to recognize the text characters in that unit and thereby determine the text information it contains.
In this embodiment, in a specific implementation, when no text characters are recognized in a rectangular unit, the text information of that unit may be set to empty.
S59: Combine the rectangular units containing text information according to their position coordinates to obtain the table data.
In this embodiment, in a specific implementation, rectangular units containing text information whose position coordinates are adjacent may be spliced together, and each rectangular unit placed at the position given by its position coordinates, so that the complete table data is obtained.
In this embodiment, a combined graph is extracted on the basis of graphic features in the image data such as morphological vertical lines and morphological horizontal lines; the combined graph is divided into multiple rectangular units; optical character recognition is performed on each rectangular unit separately to obtain the text information it contains; and the rectangular units containing text information are combined according to their position coordinates to restore the complete table data. This solves the technical problem in existing methods that extracted table data is inaccurate and error-prone, and makes it possible to recognize the table content in image data efficiently and accurately and to restore it completely.
In one embodiment, to reduce noise interference and improve the accuracy of table data acquisition, the method may further include, after the image data of the text to be processed is acquired: preprocessing the image data, where the preprocessing includes converting the image data into a grayscale image and/or applying Gaussian smoothing to the image data to filter out noise. Of course, it should be noted that these preprocessing methods are given only to better illustrate the embodiments of this specification. In a specific implementation, other suitable preprocessing methods may be used depending on the circumstances and the required accuracy. This specification places no limitation on this.
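The two preprocessing steps can be illustrated in plain Python. This sketch makes several assumptions not stated in this specification: the image is a nested list of (R, G, B) tuples, the grayscale conversion uses the common 0.299 / 0.587 / 0.114 luminance weights, and the smoothing uses a 3 x 3 binomial approximation of a Gaussian kernel renormalized at the image border.

```python
def to_gray(rgb):
    """Weighted sum of the R, G and B channels for every pixel."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]

KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]  # 3 x 3 approximation of a Gaussian

def gaussian_smooth(gray):
    """Weighted average over each pixel's 3 x 3 neighborhood."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = weight = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # skip pixels off the image
                        k = KERNEL[dy + 1][dx + 1]
                        total += k * gray[yy][xx]
                        weight += k
            out[y][x] = round(total / weight)
    return out

gray = to_gray([[(255, 255, 255), (0, 0, 0)]])
noisy = [[100, 100, 100], [100, 200, 100], [100, 100, 100]]
smoothed = gaussian_smooth(noisy)  # the bright center pixel is pulled toward 100
```

In practice a library routine such as OpenCV's `cv2.cvtColor` and `cv2.GaussianBlur` would normally be used instead of hand-written loops.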
In one embodiment, extracting the combined graph from the image data may, in a specific implementation, include: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
In one embodiment, searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data may, in a specific implementation, include: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data by means of the getStructuringElement function.
In one embodiment, to make the obtained morphological vertical lines and morphological horizontal lines clear and to reduce their contribution to errors in the subsequent recognition of text information, the method may further include, after the morphological vertical lines and morphological horizontal lines in the image data are searched for and obtained: applying feature enhancement processing to the obtained morphological vertical lines and morphological horizontal lines, where the feature enhancement processing includes at least one of erosion and dilation.
In this embodiment, in a specific implementation, the morphological vertical lines and morphological horizontal lines may first be eroded, and the eroded lines then dilated.
In this embodiment, erosion removes the white noise in the foreground of the morphological vertical lines and morphological horizontal lines and so makes them clearer, but it also shrinks their graphic elements. Therefore, after the morphological vertical lines and morphological horizontal lines are eroded, dilation may be applied to restore lines that are clearer but unchanged in size.
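The erosion-then-dilation step above can be illustrated with a small pure-Python sketch. It is not the implementation described in this specification: the names `erode`, `dilate` and `extract_lines` are hypothetical, the image is assumed to be a binarized list of rows (1 = foreground), the kernel length is assumed to be odd, and pixels outside the image are simply ignored. In OpenCV, getStructuringElement would typically only build the line-shaped kernel, with the erode and dilate functions applying it.

```python
def _window(img, x, y, kw, kh):
    """Yield the in-bounds pixels under a kw x kh kernel centred on (x, y)."""
    h, w = len(img), len(img[0])
    for dy in range(-(kh // 2), kh - kh // 2):
        for dx in range(-(kw // 2), kw - kw // 2):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                yield img[yy][xx]


def erode(img, kw, kh):
    """A pixel survives only if every in-bounds pixel under the kernel is foreground."""
    return [[int(all(_window(img, x, y, kw, kh))) for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img, kw, kh):
    """A pixel is set if any in-bounds pixel under the kernel is foreground."""
    return [[int(any(_window(img, x, y, kw, kh))) for x in range(len(img[0]))]
            for y in range(len(img))]


def extract_lines(binary, length, horizontal=True):
    """Erode then dilate with a line-shaped kernel (assumed odd length).

    Erosion removes strokes shorter than the kernel; dilation restores
    the surviving lines to their original extent.
    """
    kw, kh = (length, 1) if horizontal else (1, length)
    return dilate(erode(binary, kw, kh), kw, kh)
```

Calling `extract_lines(binary, 3, horizontal=True)` keeps horizontal runs of at least three pixels, and `horizontal=False` does the same for vertical runs; isolated noise pixels are dropped in both cases.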
In one embodiment, it is taken into account that the combined graph merely has graphic features similar to those of table data and may not actually be table data. For example, a large text character "田" also has graphic features similar to those of table data. Therefore, the extracted combined graph may be checked to determine whether it meets the preset table format requirements, so as to judge more accurately whether the combined graph is genuine table data. Subsequent data processing can then be performed only on combined graphs determined to be table data, which reduces wasted resources and improves processing efficiency.
In one embodiment, after the combined graph is extracted from the image data, the method may, in a specific implementation, further include the following: acquiring the coordinates of the intersection points in the combined graph, where an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line in the combined graph intersect; searching for and acquiring the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph; and determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph meets the preset table format requirements.
In this embodiment, in a specific implementation, the coordinates of the intersection points in the combined graph may be searched for and acquired by calling the OpenCV bitwise_and function. It should be noted that acquiring the intersection coordinates through the OpenCV bitwise_and function is only an illustrative example. In a specific implementation, the server may also acquire the coordinates of the intersection points in the combined graph in another suitable manner, depending on the circumstances. This specification places no limitation on this.
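A minimal sketch of the idea behind this step, assuming the horizontal and vertical lines are available as two binary masks of equal size (lists of rows, 1 = line pixel). The function names `combine` and `intersections` are hypothetical; the sketch mirrors what a bitwise OR and a bitwise AND of the two masks would give, and is not taken from the specification.

```python
def combine(horizontal, vertical):
    """Combined graph: a pixel belongs to it if it lies on either kind of line."""
    return [[a | b for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


def intersections(horizontal, vertical):
    """Intersection points: pixels set in both masks, returned as (x, y) pairs."""
    return [(x, y)
            for y, (h_row, v_row) in enumerate(zip(horizontal, vertical))
            for x, (a, b) in enumerate(zip(h_row, v_row))
            if a & b]
```

The points are returned in row-major order, which makes it straightforward to read off the corner coordinates of each rectangular frame afterwards.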
In this embodiment, in a specific implementation, the rectangular frames in the combined graph may be searched for and acquired by calling the findContours function. It should be noted that acquiring the rectangular frames in the combined graph through the findContours function is only an illustrative example. In a specific implementation, the server may also acquire the rectangular frames in the combined graph in another suitable manner, depending on the circumstances. This specification places no limitation on this.
In this embodiment, the preset table format requirements may be understood as a rule set describing the graphic features that distinguish a data table from other graphic structures.
In a specific implementation, the specific rules contained in the preset table format requirements may be set flexibly according to the circumstances. For example, unlike other graphics, every grid cell (also called a rectangular frame) of a data table is designed to be filled with characters, so the smallest cell in a data table should be large enough to hold at least one complete character. A rule on the area feature may therefore be set: the minimum area of a grid cell in the data table should be greater than a preset area threshold. In addition, following common typesetting habits, table data is usually centered when it is edited. A rule on the position feature may therefore also be set: the absolute value of the difference between the distance from the left border of the data table to the left border of the image and the distance from the right border of the data table to the right border of the image is less than a preset distance threshold. Furthermore, table data is generally used to tabulate two or more data items for comparison, so that the differences between them are shown more clearly. A rule on the quantity feature may therefore also be set: the number of grid cells in the data table is greater than or equal to a preset quantity threshold (for example, 2), and so on.
It should be noted that the specific rules listed above for the preset table format requirements are given only to better explain the embodiments of this specification. In a specific implementation, rules of other types or with other content may also be introduced as the preset table format requirements, according to the specific application scenario and processing requirements. This specification places no limitation on this.
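Because the preset table format requirements are described as an open-ended rule set, one way to picture them is as a list of interchangeable predicates. The sketch below is only an illustration under that assumption; `check_table_format` and `min_count_rule` are hypothetical names, and the quantity rule is the only one shown.

```python
def min_count_rule(threshold):
    """Quantity rule: the number of grid cells must be at least `threshold`."""
    return lambda cells: len(cells) >= threshold


def check_table_format(cells, rules):
    """The combined graph passes only if every configured rule accepts its cells."""
    return all(rule(cells) for rule in rules)
```

Further rules (area, position, or others suited to the application scenario) could be appended to the list without changing `check_table_format`.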
In one embodiment, determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frame may, in a specific implementation, include the following: calculating the area of the rectangular frame according to its endpoint coordinates; and detecting whether the area of the rectangular frame is greater than a preset area threshold. If the area of the rectangular frame is greater than the preset area threshold, the combined graph is judged to meet the preset table format requirements.
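The area check can be sketched as follows, assuming each rectangular frame is given by its top-left and bottom-right endpoint coordinates in pixels. The function names and the threshold value used in the usage note are hypothetical; the specification does not fix a particular threshold.

```python
def rect_area(top_left, bottom_right):
    """Area of an axis-aligned rectangular frame from two opposite endpoints."""
    (x0, y0), (x1, y1) = top_left, bottom_right
    return abs(x1 - x0) * abs(y1 - y0)


def meets_area_rule(frames, area_threshold):
    """True if even the smallest frame is larger than the preset area threshold."""
    return min(rect_area(tl, br) for tl, br in frames) > area_threshold
```

For instance, with a threshold of 100 square pixels, a frame from (0, 0) to (40, 20) passes, while a frame from (0, 0) to (5, 5) does not.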
In one embodiment, determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frames may, in a specific implementation, also include the following: determining the abscissa of the left border and the abscissa of the right border of the combined graph according to the endpoint coordinates of the rectangular frames in the combined graph; calculating, from the abscissa of the left border of the combined graph, the distance between the left border of the combined graph and the left border of the image data, recorded as the first distance; calculating, from the abscissa of the right border of the combined graph, the distance between the right border of the combined graph and the right border of the image data, recorded as the second distance; and calculating the absolute value of the difference between the first distance and the second distance and comparing it with a preset distance threshold to detect whether it is less than the preset distance threshold. If the absolute value of the distance difference is less than the preset distance threshold, the combined graph is judged to meet the preset table format requirements.
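The centering check reduces to a short calculation. The sketch below assumes frames are given as ((x0, y0), (x1, y1)) endpoint pairs, that the left border of the image is at x = 0, and that its right border is at x = image_width - 1; `is_centered` and its parameter names are hypothetical.

```python
def is_centered(frames, image_width, distance_threshold):
    """Compare the left and right margins of the combined graph."""
    left = min(min(x0, x1) for (x0, _), (x1, _) in frames)
    right = max(max(x0, x1) for (x0, _), (x1, _) in frames)
    first_distance = left                       # to the image's left border
    second_distance = (image_width - 1) - right  # to the image's right border
    return abs(first_distance - second_distance) < distance_threshold
```

A graph spanning x = 10 to x = 89 in a 100-pixel-wide image has equal margins of 10 pixels and passes for any positive threshold.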
It should be noted that the ways listed above of judging whether the combined graph meets the preset table format requirements are given only to better explain the embodiments of this specification. In a specific implementation, the two judgment methods above may be combined, or other suitable judgment methods may be introduced, according to the circumstances and the accuracy requirements, to judge whether the combined graph meets the preset table format requirements. This specification places no limitation on this.
In one embodiment, dividing the combined graph into multiple rectangular units may, in a specific implementation, include the following: determining dividing lines according to the endpoint coordinates of the rectangular frames; and dividing the combined graph into multiple rectangular units according to the dividing lines, and generating, according to the endpoint coordinates of each rectangular frame, the position coordinates of the rectangular unit corresponding to that rectangular frame.
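As a rough sketch of this division step, assume the table is a regular grid with no merged cells, so that the distinct x and y values of the frame endpoints can serve directly as dividing lines. The function name `split_into_units` and the dictionary layout of a unit are hypothetical choices made for illustration.

```python
def split_into_units(endpoints):
    """Split a regular grid into rectangular units with position coordinates.

    `endpoints` is an iterable of (x, y) frame corner coordinates.
    Each unit carries its row/column index and its bounding box
    (x0, y0, x1, y1) in image coordinates.
    """
    xs = sorted({x for x, _ in endpoints})  # vertical dividing lines
    ys = sorted({y for _, y in endpoints})  # horizontal dividing lines
    units = []
    for row in range(len(ys) - 1):
        for col in range(len(xs) - 1):
            units.append({
                "row": row,
                "col": col,
                "box": (xs[col], ys[row], xs[col + 1], ys[row + 1]),
            })
    return units
```

Tables with merged cells would need the frames themselves rather than a global grid; that case is outside this sketch.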
In one embodiment, the image data of the text to be processed may specifically include a scanned image or a photograph containing a contract to be processed. It should be noted that the image data listed above is given only to better explain the embodiments of this specification. In a specific implementation, according to the specific application scenario and processing requirements, the image data of the text to be processed may also include image data of other types and content, for example, a video screenshot containing a specification document to be processed. This specification places no limitation on this.
As can be seen from the above, in the method for acquiring table data provided by the embodiments of this specification, a combined graph is extracted according to graphic features of the image data such as morphological vertical lines and morphological horizontal lines; the combined graph is then divided into multiple rectangular units, optical character recognition is performed on each rectangular unit to obtain the text information it contains, and the rectangular units containing text information are combined according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy in extracting table data with existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely. In addition, after the combined graph is extracted, whether it is table data in the text is checked according to graphic factors contained in the combined graph, such as intersection points and rectangular frames, which avoids mistaking non-table data for a table, reduces errors, and improves the accuracy of the acquired table data.
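The final combination step, in which recognized units are put back together by position, might look like the sketch below. It assumes each unit has already been recognized and is represented as an (x, y, text) tuple, where (x, y) is the top-left corner of the unit and units in the same table row share the same y value; `assemble_table` is a hypothetical name, and the optical character recognition itself is not shown.

```python
def assemble_table(units):
    """Rebuild a row-major table from (x, y, text) units.

    Rows are ordered top to bottom by y; cells inside a row are
    ordered left to right by x.
    """
    rows = {}
    for x, y, text in units:
        rows.setdefault(y, []).append((x, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items())]
```

The input order of the units does not matter, since the position coordinates alone determine where each piece of text lands in the restored table.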
An embodiment of this specification further provides a server, including a processor and a memory for storing processor-executable instructions. In a specific implementation, the processor may perform the following steps according to the instructions: acquiring image data of text to be processed; extracting a combined graph from the image data, where the combined graph is a graphic containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into multiple rectangular units, where each of the multiple rectangular units carries position coordinates; performing optical character recognition on each of the multiple rectangular units to determine the text information contained in each of them; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
In order to carry out the above instructions more accurately, and referring to FIG. 6, this specification further provides another specific server, where the server includes a network communication port 601, a processor 602, and a memory 603. These structures are connected by internal cables so that they can exchange data with one another.
The network communication port 601 may specifically be used to input image data of text to be processed;
The processor 602 may specifically be used to extract a combined graph from the image data, where the combined graph is a graphic containing intersecting morphological vertical lines and morphological horizontal lines; divide the combined graph into multiple rectangular units, where each of the multiple rectangular units carries position coordinates; perform optical character recognition on each of the multiple rectangular units to determine the text information contained in each of them; and combine the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
The memory 603 may specifically be used to store the image data of the text to be processed that is input through the network communication port 601, and to store the corresponding instruction program on which the processor 602 operates.
In this embodiment, the network communication port 601 may be a virtual port that is bound to different communication protocols so that it can send or receive different data. For example, the network communication port may be port 80, which is responsible for web data communication, port 21, which is responsible for FTP data communication, or port 25, which is responsible for mail data communication. In addition, the network communication port may also be a physical communication interface or communication chip. For example, it may be a wireless mobile network communication chip, such as a GSM or CDMA chip; it may also be a Wi-Fi chip; it may also be a Bluetooth chip.
In this embodiment, the processor 602 may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on. This specification places no limitation on this.
In this embodiment, the memory 603 may exist at multiple levels. In a digital system, anything capable of storing binary data may be a memory; in an integrated circuit, a circuit with a storage function but no physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device with a physical form is also called a memory, such as a memory module or a TF card.
An embodiment of this specification further provides a computer storage medium based on the above method for acquiring table data. The computer storage medium stores computer program instructions which, when executed, implement the following: acquiring image data of text to be processed; extracting a combined graph from the image data, where the combined graph is a graphic containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into multiple rectangular units, where each of the multiple rectangular units carries position coordinates; performing optical character recognition on each of the multiple rectangular units to determine the text information contained in each of them; and combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
In this embodiment, the storage medium includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a cache, a hard disk drive (HDD), or a memory card. The memory may be used to store computer program instructions. The network communication unit may be an interface that is configured according to the standard specified by a communication protocol and is used for network connection communication.
In this embodiment, the functions and effects specifically achieved by the program instructions stored in the computer storage medium can be explained by reference to the other embodiments and are not repeated here.
Referring to FIG. 7, at the software level, an embodiment of this specification further provides an apparatus for acquiring table data. The apparatus may specifically include the following structural modules:
The acquisition module 71 may specifically be used to acquire image data of text to be processed;
The extraction module 72 may specifically be used to extract a combined graph from the image data, where the combined graph is a graphic containing intersecting morphological vertical lines and morphological horizontal lines;
The segmentation module 73 may specifically be used to divide the combined graph into multiple rectangular units, where each of the multiple rectangular units carries position coordinates;
The recognition module 74 may specifically be used to perform optical character recognition on each of the multiple rectangular units and determine the text information contained in each of them;
The combining module 75 may specifically be used to combine the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
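The five modules listed above form a linear pipeline. The sketch below shows one possible way to wire them together, with each module supplied as a plain callable so that any concrete implementation can be plugged in. The class name `TableAcquisitionApparatus`, its field names, and the exact data passed between stages are assumptions made for illustration; the specification does not prescribe them.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class TableAcquisitionApparatus:
    acquire: Callable[[Any], Any]                       # acquisition module 71
    extract: Callable[[Any], Any]                       # extraction module 72
    split: Callable[[Any], List[Any]]                   # segmentation module 73
    recognize: Callable[[Any], str]                     # recognition module 74
    combine: Callable[[List[Tuple[Any, str]]], Any]     # combining module 75

    def run(self, source: Any) -> Any:
        image = self.acquire(source)      # image data of the text to be processed
        graph = self.extract(image)       # combined graph of crossing lines
        units = self.split(graph)         # rectangular units with positions
        recognized = [(unit, self.recognize(unit)) for unit in units]
        return self.combine(recognized)   # table data
```

Keeping the stages as injected callables matches the remark elsewhere in this specification that the division into modules is only logical and may be realized differently in practice.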
In one embodiment, the extraction module 72 may specifically include the following structural units:
The first search unit may specifically be used to search for and acquire the morphological vertical lines and morphological horizontal lines in the image data;
The connecting unit may specifically be used to connect the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
In one embodiment, the apparatus may further include a detection module for detecting whether the combined graph meets the preset table format requirements. The detection module may specifically include the following structural units:
The acquisition unit may specifically be used to acquire the coordinates of the intersection points in the combined graph, where an intersection point may specifically be a pixel at a position where a morphological vertical line and a morphological horizontal line in the combined graph intersect;
The second search unit may specifically be used to search for and acquire the rectangular frames in the combined graph;
The first determining unit may specifically be used to determine the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph;
The second determining unit may specifically be used to determine, according to the endpoint coordinates of the rectangular frames, whether the combined graph meets the preset table format requirements.
In one embodiment, the second determining unit may specifically be used to calculate the area of the rectangular frame according to the endpoint coordinates of the rectangular frame, and to detect whether the area of the rectangular frame is greater than a preset area threshold.
In one embodiment, the segmentation module 73 may specifically include the following structural units:
The third determining unit may specifically be used to determine dividing lines according to the endpoint coordinates of the rectangular frames;
The dividing unit may specifically be used to divide the combined graph into multiple rectangular units according to the dividing lines, and to generate, according to the endpoint coordinates of each rectangular frame, the position coordinates of the rectangular unit corresponding to that rectangular frame.
In one embodiment, the apparatus may further include a preprocessing module for preprocessing the image data of the text to be processed, where the preprocessing may specifically include: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data; and so on.
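The two preprocessing operations named above can be sketched in pure Python as follows. The luminance weights (0.299, 0.587, 0.114), the 3x3 kernel, and the replicated-border handling are common choices assumed here for illustration, and the function names are hypothetical; the specification does not state which weights or kernel size are used.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) pixels to rows of 0-255 gray values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def gaussian_smooth(gray):
    """Smooth a grayscale image with a 3x3 Gaussian-like kernel.

    Border pixels are handled by replicating the nearest edge pixel.
    """
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + 1][dx + 1] * gray[yy][xx]
            out[y][x] = round(acc / 16)
    return out
```

Smoothing before line extraction suppresses isolated speckles that could otherwise survive as spurious line fragments.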
In one embodiment, the image data of the text to be processed may specifically include a scanned image or a photograph containing a contract to be processed. It should be noted that the image data listed above is given only to better explain the embodiments of this specification. In a specific implementation, according to the specific application scenario and processing requirements, the image data of the text to be processed may also include image data of other types and content, for example, a video screenshot containing a specification document to be processed. This specification places no limitation on this.
It should be noted that the units, apparatuses, or modules set out in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. For convenience of description, the above apparatus is described in terms of separate modules divided by function. Of course, when this specification is implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware, or a module implementing a given function may be implemented by a combination of multiple submodules or subunits. The apparatus embodiments described above are only illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.
As can be seen from the above, in the apparatus for acquiring table data provided by the embodiments of this specification, the extraction module extracts a combined graph according to graphic features of the image data such as morphological vertical lines and morphological horizontal lines; the segmentation module and the recognition module then divide the combined graph into multiple rectangular units and perform optical character recognition on each rectangular unit to obtain the text information it contains; and the combining module combines the rectangular units containing text information according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy in extracting table data with existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely. In addition, after the combined graph is extracted, the combining module checks, according to graphic factors contained in the combined graph such as intersection points and rectangular frames, whether the extracted combined graph is table data in the text, which avoids mistaking non-table data for a table, reduces errors, and improves the accuracy of the acquired table data.
Although this specification provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included on the basis of conventional or non-inventive means. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only execution order. When an actual apparatus or client product executes the method, the steps may be executed sequentially or in parallel according to the method shown in the embodiments or the drawings (for example, in an environment with parallel processors or multi-threaded processing, or even a distributed data processing environment). The terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, the presence of additional identical or equivalent elements in the process, method, product, or device that includes the stated elements is not excluded. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the apparatuses included in it for implementing various functions may also be regarded as structures within the hardware component. Alternatively, the apparatuses for implementing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above implementations, those skilled in the art can clearly understand that this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of this specification, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to perform the methods described in the embodiments of this specification or in some parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. This specification can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
Although this specification has been described through embodiments, those of ordinary skill in the art know that it admits many variations and changes without departing from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of this specification.

Claims (16)

  1. A method for acquiring table data, comprising:
    acquiring image data of a text to be processed;
    extracting a combined graph from the image data, wherein the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines;
    dividing the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates;
    performing optical character recognition on each of the plurality of rectangular units to determine the text information contained in each rectangular unit; and
    combining the rectangular units containing text information according to the position coordinates of the rectangular units, to obtain the table data.
  2. The method according to claim 1, wherein extracting the combined graph from the image data comprises:
    searching for and acquiring the morphological vertical lines and the morphological horizontal lines in the image data; and
    connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
  3. The method according to claim 1, wherein after the combined graph is extracted from the image data, the method further comprises:
    acquiring intersection coordinates in the combined graph, wherein an intersection is a pixel at a position where a morphological vertical line and a morphological horizontal line intersect in the combined graph;
    searching for and acquiring a rectangular frame in the combined graph;
    determining endpoint coordinates of the rectangular frame according to the intersection coordinates in the combined graph; and
    determining, according to the endpoint coordinates of the rectangular frame, whether the combined graph meets a preset table format requirement.
  4. The method according to claim 3, wherein determining, according to the endpoint coordinates of the rectangular frame, whether the combined graph meets the preset table format requirement comprises:
    calculating the area of the rectangular frame according to the endpoint coordinates of the rectangular frame; and
    detecting whether the area of the rectangular frame is greater than a preset area threshold.
  5. The method according to claim 3, wherein dividing the combined graph into a plurality of rectangular units comprises:
    determining dividing lines according to the endpoint coordinates of the rectangular frame; and
    dividing the combined graph into the plurality of rectangular units according to the dividing lines, and generating, according to the endpoint coordinates of the rectangular frame, the position coordinates of the rectangular unit corresponding to the rectangular frame.
  6. The method according to claim 1, wherein after the image data of the text to be processed is acquired, the method further comprises:
    preprocessing the image data of the text to be processed, wherein the preprocessing comprises: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data.
  7. The method according to claim 1, wherein the image data of the text to be processed comprises a scanned image or a photograph containing a contract to be processed.
  8. An apparatus for acquiring table data, comprising:
    an acquisition module, configured to acquire image data of a text to be processed;
    an extraction module, configured to extract a combined graph from the image data, wherein the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines;
    a segmentation module, configured to divide the combined graph into a plurality of rectangular units, wherein each of the plurality of rectangular units carries position coordinates;
    a recognition module, configured to perform optical character recognition on each of the plurality of rectangular units to determine the text information contained in each rectangular unit; and
    a combination module, configured to combine the rectangular units containing text information according to the position coordinates of the rectangular units, to obtain the table data.
  9. The apparatus according to claim 8, wherein the extraction module comprises:
    a first search unit, configured to search for and acquire the morphological vertical lines and the morphological horizontal lines in the image data; and
    a connection unit, configured to connect the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
  10. The apparatus according to claim 8, further comprising a detection module, wherein the detection module comprises:
    an acquisition unit, configured to acquire intersection coordinates in the combined graph, wherein an intersection is a pixel at a position where a morphological vertical line and a morphological horizontal line intersect in the combined graph;
    a second search unit, configured to search for and acquire a rectangular frame in the combined graph;
    a first determining unit, configured to determine endpoint coordinates of the rectangular frame according to the intersection coordinates in the combined graph; and
    a second determining unit, configured to determine, according to the endpoint coordinates of the rectangular frame, whether the combined graph meets a preset table format requirement.
  11. The apparatus according to claim 10, wherein the second determining unit is specifically configured to calculate the area of the rectangular frame according to the endpoint coordinates of the rectangular frame, and to detect whether the area of the rectangular frame is greater than a preset area threshold.
  12. The apparatus according to claim 10, wherein the segmentation module comprises:
    a third determining unit, configured to determine dividing lines according to the endpoint coordinates of the rectangular frame; and
    a dividing unit, configured to divide the combined graph into the plurality of rectangular units according to the dividing lines, and to generate, according to the endpoint coordinates of the rectangular frame, the position coordinates of the rectangular unit corresponding to the rectangular frame.
  13. The apparatus according to claim 8, further comprising a preprocessing module, configured to preprocess the image data of the text to be processed, wherein the preprocessing comprises: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data.
  14. The apparatus according to claim 8, wherein the image data of the text to be processed comprises a scanned image or a photograph containing a contract to be processed.
  15. A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the steps of the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 7.
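
The steps recited in claims 1 to 5 can be illustrated with a small sketch. It is not the applicant's implementation. It assumes an already binarized image held as nested lists, grid lines one pixel thick, and a regular grid without merged cells. All function names are hypothetical, the `ocr` callable stands in for whatever recognizer the caller supplies, and treating "every box exceeds the area threshold" as the table-format check is one assumed reading of claim 4. Long straight strokes are found with a one-dimensional morphological opening, so short strokes belonging to characters are discarded.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]            # binarized image: 1 = ink pixel, 0 = background
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), bottom/right exclusive


def keep_runs(line: List[int], min_len: int) -> List[int]:
    """1-D morphological opening with a flat segment of length min_len:
    only runs of at least min_len consecutive ink pixels survive."""
    out, start = [0] * len(line), None
    for i, v in enumerate(line + [0]):          # trailing 0 closes an open run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_len:
                out[start:i] = [1] * (i - start)
            start = None
    return out


def extract_lines(img: Grid, min_len: int) -> Tuple[Grid, Grid]:
    """Masks of the long horizontal and long vertical strokes (claim 2)."""
    horiz = [keep_runs(row, min_len) for row in img]
    cols = [keep_runs(list(col), min_len) for col in zip(*img)]
    vert = [list(row) for row in zip(*cols)]    # transpose back to row-major
    return horiz, vert


def grid_coordinates(horiz: Grid, vert: Grid) -> Tuple[List[int], List[int]]:
    """Intersections are pixels set in both masks (claim 3); return the
    distinct row and column coordinates at which they occur."""
    pts = [(y, x) for y, row in enumerate(horiz)
           for x, v in enumerate(row) if v and vert[y][x]]
    return sorted({y for y, _ in pts}), sorted({x for _, x in pts})


def split_cells(ys: List[int], xs: List[int]) -> Dict[Tuple[int, int], Box]:
    """Rectangular units between neighbouring grid lines, keyed by their
    (row, column) position and excluding the border pixels (claim 5)."""
    return {(r, c): (top + 1, left + 1, bottom, right)
            for r, (top, bottom) in enumerate(zip(ys, ys[1:]))
            for c, (left, right) in enumerate(zip(xs, xs[1:]))}


def meets_table_format(cells: Dict[Tuple[int, int], Box], min_area: int) -> bool:
    """Assumed reading of claim 4: every box must exceed the area threshold."""
    return bool(cells) and all(
        (bottom - top) * (right - left) > min_area
        for top, left, bottom, right in cells.values())


def build_table(img: Grid, min_len: int, min_area: int,
                ocr: Callable[[Grid, Box], str]) -> List[List[str]]:
    """Run the whole pipeline; `ocr` is a caller-supplied recognizer."""
    ys, xs = grid_coordinates(*extract_lines(img, min_len))
    cells = split_cells(ys, xs)
    if not meets_table_format(cells, min_area):
        return []                               # not treated as a table
    return [[ocr(img, cells[(r, c)]) for c in range(len(xs) - 1)]
            for r in range(len(ys) - 1)]
```

A call such as `build_table(img, min_len=5, min_area=4, ocr=my_ocr)` returns the recognized text as a list of rows, or an empty list when the area check fails. Keying each unit by its (row, column) position is what lets the final step reassemble the recognized text in table order. A real scan would additionally need the grayscale conversion and smoothing of claim 6, and some tolerance for lines thicker than one pixel.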
PCT/CN2019/124101 2019-01-04 2019-12-09 Table data acquisition method and apparatus, and server WO2020140698A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910006706.1A CN110008809B (en) 2019-01-04 2019-01-04 Method and device for acquiring form data and server
CN201910006706.1 2019-01-04

Publications (1)

Publication Number Publication Date
WO2020140698A1 (en) 2020-07-09

Family

ID=67165348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124101 WO2020140698A1 (en) 2019-01-04 2019-12-09 Table data acquisition method and apparatus, and server

Country Status (2)

Country Link
CN (1) CN110008809B (en)
WO (1) WO2020140698A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881883A (en) * 2020-08-10 2020-11-03 晶璞(上海)人工智能科技有限公司 Form document extraction method based on convolution feature extraction and morphological processing
CN112364834A (en) * 2020-12-07 2021-02-12 上海叠念信息科技有限公司 Form identification restoration method based on deep learning and image processing
CN112712014A (en) * 2020-12-29 2021-04-27 平安健康保险股份有限公司 Table picture structure analysis method, system, equipment and readable storage medium
CN114926852A (en) * 2022-03-17 2022-08-19 支付宝(杭州)信息技术有限公司 Table recognition reconstruction method, device, equipment, medium and program product

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008809B (en) * 2019-01-04 2020-08-25 阿里巴巴集团控股有限公司 Method and device for acquiring form data and server
CN110675384B (en) * 2019-09-24 2022-06-07 广东博智林机器人有限公司 Image processing method and device
CN111126409B (en) * 2019-12-26 2023-08-18 南京巨鲨显示科技有限公司 Medical image area identification method and system
CN111160234B (en) * 2019-12-27 2020-12-08 掌阅科技股份有限公司 Table recognition method, electronic device and computer storage medium
CN111027521B (en) * 2019-12-30 2023-12-29 上海智臻智能网络科技股份有限公司 Text processing method and system, data processing device and storage medium
CN111325110B (en) * 2020-01-22 2024-04-05 平安科技(深圳)有限公司 OCR-based table format recovery method, device and storage medium
CN113343740B (en) * 2020-03-02 2022-05-06 阿里巴巴集团控股有限公司 Table detection method, device, equipment and storage medium
CN111460774B (en) * 2020-04-02 2023-06-30 北京易优联科技有限公司 Method and device for restoring data in curve, storage medium and electronic equipment
CN111640130A (en) * 2020-05-29 2020-09-08 深圳壹账通智能科技有限公司 Table reduction method and device
CN111757182B (en) * 2020-07-08 2022-05-31 深圳创维-Rgb电子有限公司 Image splash screen detection method, device, computer device and readable storage medium
CN111985506A (en) * 2020-08-21 2020-11-24 广东电网有限责任公司清远供电局 Chart information extraction method and device and storage medium
CN112200117B (en) * 2020-10-22 2023-10-13 长城计算机软件与系统有限公司 Form identification method and device
CN112733855B (en) * 2020-12-30 2024-04-09 科大讯飞股份有限公司 Table structuring method, table recovering device and device with storage function
CN112861736B (en) * 2021-02-10 2022-08-09 上海大学 Document table content identification and information extraction method based on image processing
CN113569677A (en) * 2021-07-16 2021-10-29 国网天津市电力公司 Paper test report generation method based on scanning piece

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130016381A1 (en) * 2011-07-12 2013-01-17 Fuji Xerox Co., Ltd. Image processing apparatus, non-transitory computer readable medium storing program and image processing method
CN104462044A (en) * 2014-12-16 2015-03-25 上海合合信息科技发展有限公司 Recognizing and editing method and device of tabular images
CN105426856A (en) * 2015-11-25 2016-03-23 成都数联铭品科技有限公司 Image table character identification method
CN108132916A (en) * 2017-11-30 2018-06-08 厦门市美亚柏科信息股份有限公司 Parse method, the storage medium of PDF list datas
CN110008809A (en) * 2019-01-04 2019-07-12 阿里巴巴集团控股有限公司 Acquisition methods, device and the server of list data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6996295B2 (en) * 2002-01-10 2006-02-07 Siemens Corporate Research, Inc. Automatic document reading system for technical drawings
CN107622230B (en) * 2017-08-30 2019-12-06 中国科学院软件研究所 PDF table data analysis method based on region identification and segmentation
CN107943857A (en) * 2017-11-07 2018-04-20 中船黄埔文冲船舶有限公司 Automatic method, apparatus, terminal device and the storage medium for reading AutoCAD forms
CN109086714B (en) * 2018-07-31 2020-12-04 国科赛思(北京)科技有限公司 Form recognition method, recognition system and computer device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881883A (en) * 2020-08-10 2020-11-03 晶璞(上海)人工智能科技有限公司 Form document extraction method based on convolution feature extraction and morphological processing
CN112364834A (en) * 2020-12-07 2021-02-12 上海叠念信息科技有限公司 Form identification restoration method based on deep learning and image processing
CN112712014A (en) * 2020-12-29 2021-04-27 平安健康保险股份有限公司 Table picture structure analysis method, system, equipment and readable storage medium
CN112712014B (en) * 2020-12-29 2024-04-30 平安健康保险股份有限公司 Method, system, device and readable storage medium for parsing table picture structure
CN114926852A (en) * 2022-03-17 2022-08-19 支付宝(杭州)信息技术有限公司 Table recognition reconstruction method, device, equipment, medium and program product

Also Published As

Publication number Publication date
CN110008809B (en) 2020-08-25
CN110008809A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
WO2020140698A1 (en) Table data acquisition method and apparatus, and server
US20210256253A1 (en) Method and apparatus of image-to-document conversion based on ocr, device, and readable storage medium
WO2019119966A1 (en) Text image processing method, device, equipment, and storage medium
CN109753953B (en) Method and device for positioning text in image, electronic equipment and storage medium
CN109410215A (en) Image processing method, device, electronic equipment and computer-readable medium
CN110942074B (en) Character segmentation recognition method and device, electronic equipment and storage medium
CN105469027A (en) Horizontal and vertical line detection and removal for document images
CN109948521B (en) Image deviation rectifying method and device, equipment and storage medium
US20180082456A1 (en) Image viewpoint transformation apparatus and method
CN112651953B (en) Picture similarity calculation method and device, computer equipment and storage medium
US20190266431A1 (en) Method, apparatus, and computer-readable medium for processing an image with horizontal and vertical text
CN113642584A (en) Character recognition method, device, equipment, storage medium and intelligent dictionary pen
CN114359932B (en) Text detection method, text recognition method and device
CN115719356A (en) Image processing method, apparatus, device and medium
CN116844177A (en) Table identification method, apparatus, device and storage medium
CN113887375A (en) Text recognition method, device, equipment and storage medium
CN112507938A (en) Geometric feature calculation method, geometric feature recognition method and geometric feature recognition device for text primitives
CN115620321B (en) Table identification method and device, electronic equipment and storage medium
CN108304840B (en) Image data processing method and device
CN114120305B (en) Training method of text classification model, and text content recognition method and device
US11570331B2 (en) Image processing apparatus, image processing method, and storage medium
JP2012003358A (en) Background determination device, method, and program
CN115019321A (en) Text recognition method, text model training method, text recognition device, text model training equipment and storage medium
CN114511862A (en) Form identification method and device and electronic equipment
CN114140805A (en) Image processing method, image processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 19907609; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase Ref country code: DE
122 Ep: pct application non-entry in european phase Ref document number: 19907609; Country of ref document: EP; Kind code of ref document: A1