WO2022247823A1 - Image detection method, device and storage medium - Google Patents

Image detection method, device and storage medium

Info

Publication number
WO2022247823A1
WO2022247823A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
detection
detection frames
cell
identification points
Prior art date
Application number
PCT/CN2022/094684
Other languages
English (en)
French (fr)
Inventor
龙如蛟
杨志博
王永攀
Original Assignee
阿里巴巴(中国)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 filed Critical 阿里巴巴(中国)有限公司
Publication of WO2022247823A1 publication Critical patent/WO2022247823A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0641: Shopping interfaces
    • G06Q30/0643: Graphical representation of items or shoppers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means

Definitions

  • the present invention relates to the technical field of image processing, in particular to an image detection method, device and storage medium.
  • OCR (Optical Character Recognition)
  • The table is detected in an image containing a table, and the structure of the table is recognized, so that the table in image form is converted into an editable Excel file, facilitating the storage and editing of the information contained in the table image.
  • The premise is that the table structure information can be accurately identified in the image.
  • Embodiments of the present invention provide an image detection method, device, and storage medium, so as to accurately analyze structured information of an object on an image.
  • In a first aspect, an embodiment of the present invention provides an image detection method, the method comprising: acquiring an image to be detected, wherein the image contains a plurality of objects; identifying a plurality of identification points corresponding to the plurality of objects in the image; determining a plurality of detection frames corresponding to the plurality of objects in the image according to the plurality of identification points; and determining the association relationship of the plurality of objects according to the correspondence between the plurality of detection frames and the plurality of identification points and the distance between the identification points corresponding to different detection frames.
  • In a second aspect, an embodiment of the present invention provides an image detection device, which includes:
  • an acquisition module configured to acquire an image to be detected, wherein the image contains a plurality of objects;
  • a detection module configured to identify a plurality of identification points corresponding to the plurality of objects in the image, determine a plurality of detection frames corresponding to the plurality of objects in the image according to the plurality of identification points, and determine the association relationship of the plurality of objects according to the correspondence between the plurality of detection frames and the plurality of identification points and the distance between the identification points corresponding to different detection frames.
  • In a third aspect, an embodiment of the present invention provides an image detection method in which the association relationship of the multiple objects is determined.
  • In a fourth aspect, an embodiment of the present invention provides an image detection method in which an editable table file is generated according to the row and column information.
  • An embodiment of the present invention provides an electronic device, including a memory and a processor, where executable code is stored in the memory; when the executable code is executed by the processor, the processor can at least implement the image detection method described in the first aspect or the fourth aspect.
  • An embodiment of the present invention provides a non-transitory machine-readable storage medium on which executable code is stored; when the executable code is executed by the processor of an electronic device, the processor can at least implement the image detection method described in the first aspect or the fourth aspect.
  • The structured information can be reflected as whether there is an association relationship between different objects, such as a location adjacency relationship, an information collocation relationship, and so on.
  • A plurality of identification points corresponding to the plurality of objects (such as center points, boundary points, etc.) are identified in the image, and the detection frames corresponding to the plurality of objects are then determined from these identification points.
  • the multiple detection frames are used to roughly represent the corresponding positions of the multiple objects in the image.
  • the association relationship of the plurality of objects is determined according to the correspondence between the plurality of detection frames and the plurality of identification points, and the distance between the identification points corresponding to different detection frames.
  • one or more identification points can be defined according to actual needs.
  • the corresponding position areas of different objects in the image can be regressed through the identification points.
  • That is, on the one hand, different detection frames are used to characterize each object included in the image; on the other hand, based on learning the distance between the identification points of different objects, the distances between the identification points corresponding to the detection frames of different objects can be used to determine the association relationship of the multiple objects.
  • The classification of the identification points, the learning of the distances between identification points of different objects, and the regression of detection frames based on identification points all reflect the extraction of rich semantic information from the image; this rich semantic information ensures that the structured information of the objects in the image can be parsed accurately.
  • FIG. 1 is a flowchart of an image detection method provided by an embodiment of the present invention
  • FIG. 2 is a schematic diagram of the composition and structure of an image detection model provided by an embodiment of the present invention.
  • FIG. 3 is a flowchart of another image detection method provided by an embodiment of the present invention.
  • Fig. 4a is a schematic diagram of a table image detection principle provided by an embodiment of the present invention.
  • Fig. 4b is a schematic diagram of a table image detection scene provided by an embodiment of the present invention.
  • Fig. 5a is a schematic diagram of a splicing result of detection frames in a table image provided by an embodiment of the present invention.
  • Fig. 5b is a schematic diagram showing detection frames and detection frame splicing results provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of the composition and structure of another image detection model provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a table structure identification process provided by an embodiment of the present invention.
  • FIG. 8 is a flow chart of another image detection method provided by an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a character detection process provided by an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of the application of an image detection method provided by an embodiment of the present invention.
  • FIG. 11 is a flowchart of another image detection method provided by an embodiment of the present invention.
  • FIG. 12 is a flowchart of another image detection method provided by an embodiment of the present invention.
  • FIG. 13 is a flowchart of another image detection method provided by an embodiment of the present invention.
  • FIG. 14 is a flowchart of another image detection method provided by an embodiment of the present invention.
  • Fig. 15 is a schematic structural diagram of an image detection device provided by an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of electronic equipment corresponding to the image detection device provided by the embodiment shown in FIG. 15 .
  • the image detection method provided by the embodiment of the present invention can be executed by an electronic device, and the electronic device can be a terminal device such as a PC, a notebook computer, a smart phone, or a server.
  • the server may be a physical server including an independent host, or may also be a virtual server, or may also be a cloud server or server cluster.
  • the main purpose of the image detection method provided by the embodiment of the present invention is to perform object detection on an image to be detected, that is, to detect the positions of multiple objects contained in the image and the relationship between the multiple objects.
  • the position of the object in the image to be detected can be represented by the position of the detection frame surrounding the object.
  • In different application scenarios, the images to be detected will differ, and the multiple objects that need to be detected in the images will also differ.
  • the image to be detected refers to the image containing the table area, and multiple objects refer to the multiple cells contained in the table area.
  • The purpose of image detection is to detect the corresponding positions of these multiple cells in the image and to determine the positional relationship between the plurality of cells.
  • the image to be detected refers to an image containing text content
  • multiple objects refer to multiple texts contained in the image.
  • The purpose of image detection is to detect the corresponding positions of these multiple texts in the image and to determine the adjacency relationship among the multiple texts.
  • the image to be detected may be an image containing key-value (key-value) pair information, and multiple objects refer to all keys and all values contained in the image.
  • The purpose of image detection here is to detect the respective positions of these keys and values in the image, and to determine the affiliation (or correspondence, matching relationship) between the keys and the values.
  • the image to be detected may be an image taken by the user himself, and the image quality is difficult to guarantee.
  • The tables in an image taken by the user may exhibit visual characteristics such as rotation, reflection, occlusion, and wrinkles, which poses a greater challenge to image detection tasks.
  • With the image detection scheme provided by the embodiment of the present invention, even if the image to be detected has some visual defects, the positions of the multiple objects in the image and the relationships between them can be accurately detected.
  • Fig. 1 is a flow chart of an image detection method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
  • According to the correspondence between the multiple detection frames and the multiple identification points, and the distance between the identification points corresponding to different detection frames, the association relationship of the multiple objects is determined.
  • The target objects to be detected can be defined in advance for the images to be detected corresponding to each application scenario.
  • the target object to be detected is each cell contained in the table area.
  • the target object to be detected is each text contained in the image.
  • an image detection model obtained through pre-training may be used to complete the position location processing of multiple objects.
  • the composition structure of the image detection model is illustrated with reference to FIG. 2 .
  • the image detection model may include a backbone network model and multiple branch network models.
  • the backbone network model is used to realize the feature extraction of the input image to be detected, and obtain a feature map of a certain scale.
  • the backbone network model can be implemented as a neural network model such as a convolutional network model composed of multiple convolutional layers, a residual network model, or the like.
  • the feature maps output by the backbone network model are input to multiple branch network models respectively.
  • In terms of their respective functions, the multiple branch network models can be called the identification point classification model, the detection frame regression model, and the splicing relationship regression model.
  • the identification point classification model is used for classifying and identifying the feature points based on the input feature map.
  • a feature map contains several feature points. Assuming that the spatial resolution of a feature map is expressed as h ⁇ w, it means that the feature map includes h ⁇ w feature points. There is a positional mapping relationship between these feature points and the pixels in the image to be detected. Therefore, when the category corresponding to a certain feature point is determined, the category corresponding to the corresponding pixel in the image to be detected is determined.
  • One or more category labels for defining identification points are preset according to actual needs. For example, if the center point and vertices of an object are defined as identification points, three types of labels can be preset: center point, vertex, and other (or background). The feature points on the feature map received from the backbone network model are classified to obtain the category label corresponding to each feature point. If the category label corresponding to a feature point is determined to be a center point or a vertex, the pixel position corresponding to that feature point in the image to be detected is determined to be an identification point, specifically the center point or a vertex of an object. A plurality of identification points can thus be identified from the image to be detected through the identification point classification model, and these identification points correspond to the plurality of objects contained in the image.
  • The identification point classification model can include multiple convolutional layers. After the input feature map is processed by the multiple convolutional layers, a two-channel feature map is obtained to detect the two types of identification points: if a certain feature point is a center point, 1 is output at the corresponding position of the first channel of the feature map; if it is a vertex, 1 is output at the corresponding position of the second channel; if it is background, 0 is output at the corresponding positions of all channels. The two channels thus correspond to the two category labels respectively.
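  • To make the multi-branch structure concrete, the following is a minimal PyTorch-style sketch of such a model. The channel counts, layer depths, and module names are illustrative assumptions and are not specified by this disclosure.

```python
# Illustrative sketch only: a backbone plus three convolutional branch heads.
# Channel counts and layer depths are hypothetical.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class ImageDetectionModel(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        # Backbone: extracts feature map F1 at 1/4 resolution (two 2x downsamplings).
        self.backbone = nn.Sequential(
            conv_block(3, 32), nn.MaxPool2d(2),
            conv_block(32, feat_ch), nn.MaxPool2d(2),
            conv_block(feat_ch, feat_ch))
        # Branch 1: identification point classification -> 2 channels
        # (e.g. center point, vertex); background is 0 on both channels.
        self.point_head = nn.Sequential(conv_block(feat_ch, feat_ch),
                                        nn.Conv2d(feat_ch, 2, 1), nn.Sigmoid())
        # Branch 2: detection frame regression -> 8 channels
        # (x/y distances from a center point to the 4 vertices of its object).
        self.box_head = nn.Sequential(conv_block(feat_ch, feat_ch),
                                      nn.Conv2d(feat_ch, 8, 1))
        # Branch 3: splicing relationship regression -> 8 channels
        # (x/y distances from a vertex to the center points of up to 4 objects sharing it).
        self.splice_head = nn.Sequential(conv_block(feat_ch, feat_ch),
                                         nn.Conv2d(feat_ch, 8, 1))

    def forward(self, img):
        f1 = self.backbone(img)  # feature map F1, shape (B, C, h/4, w/4)
        return self.point_head(f1), self.box_head(f1), self.splice_head(f1)
```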
  • Note, first, that since one object may correspond to multiple identification points, the number of finally identified identification points will be greater than the number of objects; second, the classification result only gives the positions of these identification points in the image, and the correspondence between identification points and objects cannot be known from it, that is, it is not known which identification point belongs to which object.
  • the detection frame regression model is used for regressing detection frames respectively corresponding to multiple objects in the image to be detected according to a plurality of identification points output by the identification point classification model.
  • The detection frame regression model is trained to learn the distance from the center point of an object to its vertices. On this basis, for a center point output by the identification point classification model, the distances from that center point to the vertices of the corresponding object can be predicted by the detection frame regression model; from these distances, the predicted vertex coordinates of the object are obtained, and those vertex coordinates form the detection frame corresponding to the object.
  • the number of vertices of an object is related to the shape of the object. For example, if an object is a rectangle, then the number of vertex coordinates is four; if an object is a triangle, then the number of vertex coordinates is three.
  • The detection frame regression model can include multiple convolutional layers; after the feature map input from the backbone network model is processed by the multiple convolutional layers, a multi-channel feature map is obtained, where the number of channels is twice the number of vertices of an object. The factor of two arises because the coordinates of a vertex consist of two values, the abscissa and the ordinate.
  • The multi-channel feature map output by the detection frame regression model records multiple coordinate values corresponding to each feature point, wherein, for a certain feature point, each channel of the feature map records one coordinate value corresponding to that channel.
  • The feature point corresponding to the center point coordinates can be located in this multi-channel feature map, and the channels can then be queried in turn to obtain the multiple coordinate values corresponding to that feature point.
  • These coordinate values correspond to the distances from the center point to the multiple vertices of the corresponding object; based on these distances, the coordinates of each vertex of the corresponding object can be obtained, and the corresponding detection frame can thus be formed.
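  • The decoding step described above can be sketched as follows, assuming NumPy arrays, a channel layout of four (dx, dy) pairs per feature point, and distances regressed at feature-map scale; the threshold and stride are likewise assumed values.

```python
# Illustrative decoding of detection frames: `points` is the 2-channel
# classification map, `box_reg` the 8-channel regression map described above.
import numpy as np

def decode_boxes(points, box_reg, thresh=0.5, stride=4):
    """points: (2, h, w); box_reg: (8, h, w), assumed to hold (dx, dy)
    offsets from a center point to the 4 vertices of its object."""
    centers = np.argwhere(points[0] > thresh)  # (y, x) coords of center points
    boxes = []
    for y, x in centers:
        offsets = box_reg[:, y, x].reshape(4, 2)  # 4 vertices x (dx, dy)
        # Vertex coordinates = center coordinates + regressed distances,
        # mapped back to image resolution through the downsampling stride.
        verts = (np.array([x, y], dtype=float) + offsets) * stride
        boxes.append(verts)  # (4, 2) vertex coordinates in the image
    return boxes
```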
  • The splicing relationship regression model is used to regress the distance between a certain identification point of one object and a certain identification point of another object, so as to determine, based on the distance, whether there is a splicing relationship between the detection frames corresponding to different objects, wherein the two objects refer to objects with a certain set association relationship, such as two adjacent objects or two objects forming a key-value pair.
  • the splicing relationship regression model is trained to have the ability to learn the distance between the marker point of an object and the marker points of other objects that have a set relationship with it.
  • For an identification point, the distance from the identification point to a target identification point can be predicted based on the splicing relationship regression model, wherein the object corresponding to the target identification point has a set association relationship with the object corresponding to the identification point; the detection frames with a splicing relationship can then be determined from this distance.
  • The structure of this model is similar to that of the detection frame regression model, and its working principle is also similar, so the details are not repeated here.
  • Multiple identification points corresponding to the multiple objects are identified in the image; these identification points can include a variety of different types, such as object center points and object vertices.
  • Multiple detection frames corresponding to the multiple objects are determined in the image by the detection frame regression model; these detection frames correspond to the multiple objects one by one and represent the respective location areas of the objects in the image.
  • The splicing relationship regression model is used to finally determine the splicing relationship of the multiple detection frames, that is, which detection frames have a splicing relationship with which other detection frames. If some detection frames are determined to have a splicing relationship, this indicates that there is a certain set association relationship between the objects corresponding to those detection frames; in this way, both the positioning of each object in the image and the recognition of the relationships between different objects are achieved.
  • the recognition of the relationship is equivalent to a structured analysis of the information contained in the image, which provides the necessary premise for the subsequent processing of the image.
  • The process of using the splicing relationship regression model to finally determine the splicing relationship of the multiple detection frames can be realized as follows: according to the correspondence between the multiple detection frames and the multiple identification points, and the distance between the identification points corresponding to different detection frames, the splicing relationship of the multiple detection frames is determined.
  • Specifically, for an identification point i from which a detection frame R is regressed, the splicing relationship regression model can output a set of distance values corresponding to identification point i (indicating the distance from identification point i to an identification point of another object that has an association relationship with the object corresponding to identification point i). Starting from the position of identification point i, a target position can be calculated from this set of distance values, and the identification point j matching the target position, that is, the identification point whose position is closest to the target position, is determined among the multiple identification points obtained by classification. Assuming that detection frame P is regressed based on identification point j, it can be considered that detection frame R and detection frame P have a splicing relationship.
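  • A minimal sketch of this matching step, assuming the candidate identification points and the predicted target position are 2-D coordinates; the distance cutoff is an assumed parameter.

```python
# Find the identification point j closest to the target position predicted
# from identification point i plus its regressed distance values.
import numpy as np

def match_identification_point(target_pos, candidates, max_dist=8.0):
    """candidates: (N, 2) coordinates of classified identification points.
    Returns the index of the closest point, or None beyond max_dist."""
    candidates = np.asarray(candidates, dtype=float)
    dists = np.linalg.norm(candidates - np.asarray(target_pos, dtype=float), axis=1)
    j = int(np.argmin(dists))
    return j if dists[j] <= max_dist else None
```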
  • The classification of identification points, the learning of distances between identification points of different objects, and the regression of detection frames based on identification points all reflect the extraction of rich semantic information from images; this rich semantic information ensures that the analysis of the structured information of the objects in the image can be completed accurately, and interference such as poor image quality, occlusion, and wrinkles can be overcome.
  • the image detection scheme provided by the embodiment of the present invention can be applied to different application scenarios.
  • The following takes the table detection scenario and the text detection scenario as examples to describe the implementation process of the image detection scheme in these two application scenarios.
  • Fig. 3 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in Fig. 3, the method includes the following steps:
  • For the center point of any cell, the cell vertices that belong to the same cell as that center point are determined, wherein the cell vertices that belong to the same cell as the center point of any cell constitute the detection frame corresponding to that cell center point.
  • For any cell vertex, at least two cell center points that share that cell vertex are determined, the at least two detection frames corresponding to the at least two cell center points are determined, and it is determined that the at least two detection frames have a splicing relationship.
  • the image detection scheme for an image including a table area (for convenience of description, referred to as a table image) provided by the embodiment of the present invention can be realized by means of the image detection model provided in FIG. 2 .
  • The table image detection process based on the image detection model is exemplarily described below in conjunction with Fig. 4a and Fig. 4b.
  • Here h and w represent the resolution of the table image, and h/4 and w/4 indicate that the table image is assumed to have been downsampled twice, by a factor of 2 each time.
  • the feature extraction of the table image can be realized through the backbone network model, and the feature map F1 is obtained.
  • The feature map F1 is input into the identification point classification model, which outputs the feature map F2 shown in the figure.
  • The feature map F2 is a 2-channel feature map, and the 2 channels correspond to the two types of identification points: cell center point and cell vertex. Put simply, the feature map F2 describes the category determination result corresponding to each feature point: either a cell center point, a cell vertex, or background. In this way, the multiple cell vertices and multiple cell center points contained in the table image can be identified based on the identification point classification model.
  • The detection frame regression model is used to regress the distances from the center point of a cell to the 4 cell vertices of the corresponding cell (that is, the cell it belongs to). Since the coordinates of each vertex include two values, the abscissa and the ordinate, 8 coordinate values are output. The feature map F1 is therefore input into the detection frame regression model, which outputs the feature map F3 shown in the figure.
  • The feature map F3 is an 8-channel feature map, which describes the 8 coordinate values corresponding to each feature point.
  • The feature point corresponding to each cell center point can be located in the feature map F3, and the 8 coordinate values corresponding to that feature point can then be obtained; these are the distances from the cell center point to the four cell vertices of the corresponding cell, and based on these distances the coordinates of the four cell vertices can be obtained.
  • a detection frame corresponding to any cell center point is formed by cell vertices belonging to the same cell as the cell center point. In this way, for each cell center point, the corresponding detection frame is regressed, and the position of the detection frame indicates the corresponding position of the corresponding cell in the table image.
  • The splicing relationship regression model is used to regress the distances from a cell vertex to the center points of the cells that share that cell vertex.
  • At most four cells can share one cell vertex, so at most the distances from one cell vertex to four cell center points are regressed; since the coordinates of each cell center point include two values, the abscissa and the ordinate, 8 values are output. The feature map F1 is therefore input into the splicing relationship regression model, which outputs the feature map F4 shown in the figure.
  • The feature map F4 is an 8-channel feature map, which describes the 8 coordinate values corresponding to each feature point.
  • The feature point corresponding to each cell vertex can be located in the feature map F4, and the 8 coordinate values corresponding to that feature point can then be obtained.
  • For a certain cell vertex, these 8 coordinate values are the distances from the vertex to the cell center points of the cells that share this vertex; based on these distances, the coordinates of the center points of the cells sharing this vertex can be obtained.
  • As noted above, a cell vertex is shared by at most 4 cells, so in fact, for a given cell vertex, 2, 4, or 6 of the corresponding 8 coordinate values may be 0; for example, if 4 of the 8 coordinate values corresponding to a cell vertex are 0, the cell vertex is shared by only two cells.
  • the detection frames with a splicing relationship can be spliced together, so that finally a complete table will be spliced from multiple detection frames, that is, the composition structure of the table in the table image is obtained.
  • In FIG. 4b, a part of the table area is shown with thinner lines, and four cells are included in the table area.
  • The cell center points obtained by the identification point classification model are represented by black dots, and the cell vertices obtained by the identification point classification model are represented by black triangles.
  • An auxiliary frame obtained based on the splicing relationship regression model is denoted as Q5; the vertices of the auxiliary frame Q5 indicate the positions of the center points of the cells that share the cell vertex.
  • For each vertex of the auxiliary frame Q5, a matching cell center point is respectively determined, wherein matching means the closest distance.
  • the determined result is the center points of the four cells shown in the figure.
  • In this way, the positional relationship between the objects (cells in this embodiment) corresponding to the detection frames can be expressed visually and intuitively, that is, whether there is a certain association relationship between different objects contained in the image, because the detection frames corresponding to objects with an association relationship will be spliced together.
  • In Fig. 4b, only the splicing effect of detection frames based on a single cell vertex is illustrated.
  • In practice, each cell vertex output by the identification point classification model is traversed in turn, and the above-mentioned detection frame splicing judgment and splicing are performed for each cell vertex, so that finally the multiple discrete (independent) detection frames are spliced into a complete table.
  • The "complete table" mentioned here means that the understanding of the overall structure of a complete table contained in the image is completed, that is, knowing how many cells a table in the image includes and what the positional relationship between different cells is.
  • the understanding of the structure of this complete table is the prerequisite for converting the table in image format into an editable table file, that is, generating an excel table.
  • the detection frame splicing result is exemplarily illustrated in conjunction with FIG. 5a.
  • the originally discrete four detection frames Q1, Q2, Q3, and Q4 will eventually be spliced into the effect shown in the figure: the detection frames corresponding to adjacent cells will show a co-edge relationship.
  • the splicing results of the multiple detection frames may be displayed on the table image in a first style for editing by the user.
  • Multiple detection frames obtained based on the detection frame regression model may also be displayed on the image in a second style for editing by the user.
  • multiple detection frames obtained based on the detection frame regression model are displayed, allowing users to view the position of each detection frame.
  • If the user finds that the position of a certain detection frame is inaccurate, it can be adjusted manually (such as by moving or dragging lines), so that the detection frame regression model can be optimized based on the user's adjustment results.
  • the user can find that the splicing results are inaccurate and make manual adjustments, so as to optimize the splicing relationship regression model based on the user's adjustment results.
  • the first style and the second style may be expressed as different colors, lines of different thickness, lines of different shapes, and so on.
  • the thinner lines show the initial detection frame recognition results
  • the thicker lines show the splicing results of the detection frames.
  • The collected table image is not shown in the figure; in fact, the above-mentioned detection frames and detection frame splicing results are displayed in the table image, which allows users to intuitively see the accuracy of the detection frames and the splicing results, and makes it convenient to make corresponding adjustments.
  • different detection frames may be displayed in a differentiated manner according to the confidence level corresponding to each detection frame.
  • the confidence corresponding to each detection frame is directly output by the detection frame regression model, and is used to represent the accuracy of a detection result of a detection frame.
  • Displaying different detection frames differentially according to their confidence levels may mean displaying only the detection frames whose confidence is lower than a set threshold, so that the user can focus on low-confidence detection frames and make timely corrections; alternatively, detection frames whose confidence is higher than the set threshold may be displayed in one style and detection frames whose confidence is lower than the set threshold in another style.
  • The styles can be lines of different thicknesses, lines of different colors, and so on.
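  • A minimal sketch of such confidence-based differentiated display, assuming each detection frame carries a confidence value; the style names and threshold are placeholders for whatever rendering the host application uses.

```python
# Assign a display style per detection frame according to its confidence.
def style_boxes(boxes, threshold=0.7):
    styled = []
    for box in boxes:
        # Low-confidence frames get a highlighted style so the user can
        # focus on them and correct errors in time.
        style = "thick-red" if box["confidence"] < threshold else "thin-gray"
        styled.append({**box, "style": style})
    return styled
```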
  • the table detection scene is taken as an example to illustrate the image detection process based on the image detection model shown in FIG. 2 .
  • the image detection scheme can also be completed based on the image detection model shown in FIG. 6 .
  • Fig. 6 is a schematic diagram of the composition and structure of another image detection model provided by the embodiment of the present invention.
  • Compared with the image detection model shown in FIG. 2, the image detection model in FIG. 6 adds a branch network model: the offset regression model.
  • the offset regression model is used to determine the coordinate offset of each marker point output by the marker point classification model.
  • the offset regression model performs several layers of convolution operations on the feature map F1 received from the backbone network model to obtain a 2-channel feature map F5. If a feature point in the feature map F5 is a marker point, then the offsets of the marker point’s horizontal and vertical coordinates due to downsampling will be respectively output on the 2-channel feature map F5.
  • the backbone network model will perform multiple downsampling operations on the input image to be detected during the feature extraction process.
  • The downsampling operation makes it necessary to round the coordinates of the identification points, which reduces the accuracy of the coordinate calculation results.
  • the above offset is the error value caused by the downsampling operation.
  • The reason why offset compensation is performed on the coordinates of the object center point output by the detection model is that the detection model performs multiple downsampling operations during the layer-by-layer feature extraction of the image to be detected; downsampling makes it necessary to round the coordinates of the object center point, which reduces the accuracy of the center point coordinate calculation results. To make up for the loss of accuracy caused by downsampling, the error caused by the downsampling operation must be compensated; the offset is that error.
  • The training process of the offset regression model can be described as follows: for a training sample image containing a certain object, on the basis of knowing the coordinates of the identification points (such as center points and vertices) of the object, the offset of an identification point can be calculated according to the downsampling multiple applied by the identification point classification model to the training sample image, as follows:
  • x1 = x0 / 2^n - int(x0 / 2^n), y1 = y0 / 2^n - int(y0 / 2^n)
  • where (x0, y0) are the abscissa and ordinate of the identification point of the object, (x1, y1) are the offsets corresponding to the abscissa and the ordinate respectively, int() is the rounding-down operator, and n indicates that the identification point classification model has performed 2^n-times downsampling on the training sample image.
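  • The offset formula can be transcribed directly into code as follows; the function names are illustrative only.

```python
# Offset lost when an image coordinate is mapped to a grid downsampled 2**n times.
def coordinate_offset(x0, y0, n):
    s = 2 ** n
    x1 = x0 / s - int(x0 / s)  # int() rounds down for non-negative coordinates
    y1 = y0 / s - int(y0 / s)
    return x1, y1

# Compensation at inference: map a grid coordinate back to image space,
# adding the regressed offset before scaling up.
def compensated_coordinate(grid_x, grid_y, x1, y1, n):
    s = 2 ** n
    return (grid_x + x1) * s, (grid_y + y1) * s
```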
  • In this embodiment, the coordinates of multiple cell center points and multiple cell vertices are obtained through the identification point classification model, and the coordinate offsets of these cell center points and cell vertices are obtained through the offset regression model.
  • When regressing a detection frame, the coordinate offset of a cell center point can be added to the coordinates of that cell center point to update them, and the coordinates of the corresponding four cell vertices are then obtained according to the updated center point coordinates and the above distances.
  • Similarly, the coordinate offset of a cell vertex is added to the coordinates of that cell vertex to update them, and the coordinates of the corresponding four cell center points are then obtained according to the updated vertex coordinates and the above distances.
  • the in-depth analysis of the table structure can also be performed based on the splicing results of multiple detection frames.
  • the in-depth analysis of the table structure refers to determining the row and column numbers of each cell in the table, so as to convert the table area in the form of an image into an editable table file, such as an excel table.
  • The corresponding position information of the multiple cells in the editable table file (i.e., the row and column numbers of the cells) is determined, so that an editable table file can be generated according to the position information. That is, according to the splicing relationship of the multiple detection frames, the detection frames are spliced to obtain the vertex positions of the detection frames after splicing; according to the vertex positions of the detection frames after splicing, the row and column information of the multiple cells in the editable table file is determined; and an editable table file is generated according to the row and column information.
  • The splicing result of the above multiple detection frames is equivalent to marking, in the image, a complete table area and the position of each cell in the table area; converting this table area into a corresponding editable table file based on the marking result makes it convenient for the user to store, count, and edit the data information contained in each cell.
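  • A simplified sketch of deriving row and column numbers from the spliced detection frames is shown below, assuming axis-aligned frames given by their four vertices. Real tables may require tolerance-based clustering rather than exact coordinate matching, and a CSV file stands in here for a full editable table file such as an Excel sheet.

```python
# Assign (row, col) indices to cells by clustering the top and left
# coordinates of the spliced detection frames, then write a text grid.
import csv

def assign_rows_cols(frames, tol=3.0):
    """frames: list of boxes, each a list of 4 (x, y) vertices."""
    def cluster(vals):
        vals = sorted(vals)
        reps = [vals[0]]
        for v in vals[1:]:
            if v - reps[-1] > tol:  # new row/column line beyond tolerance
                reps.append(v)
        return reps

    tops = [min(y for _, y in f) for f in frames]
    lefts = [min(x for x, _ in f) for f in frames]
    row_lines, col_lines = cluster(tops), cluster(lefts)

    def nearest(v, reps):
        return min(range(len(reps)), key=lambda i: abs(reps[i] - v))

    return [(nearest(t, row_lines), nearest(l, col_lines))
            for t, l in zip(tops, lefts)]

def write_table(frames, texts, path="table.csv"):
    rc = assign_rows_cols(frames)
    n_rows = max(r for r, _ in rc) + 1
    n_cols = max(c for _, c in rc) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for (r, c), text in zip(rc, texts):
        grid[r][c] = text  # recognized cell text goes into its (row, col) slot
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(grid)
```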
  • FIG. 8 is a flow chart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 8, the method includes the following steps:
  • a text detection scene is taken as an example to introduce an optional implementation process of the image detection method provided in the embodiment of the present invention in the text detection scene.
  • the image to be detected is an image containing multiple characters, and the above-mentioned identification point may be the center point of the characters.
  • The distance from each text center point to the vertices of its corresponding text box can be obtained through the detection frame regression model, so that the corresponding text box is obtained for each text center point.
  • In addition, the distance between a text center point k and the center point of its adjacent text can be determined through the splicing relationship regression model. Based on this distance, the coordinates of the center point of the text adjacent to text center point k can be determined; assuming that, among the multiple text center points output by the identification point classification model, the text center point matching these coordinates is determined to be text center point p, it can then be determined that the text box W1 corresponding to text center point k and the text box W2 corresponding to text center point p have a splicing relationship. This reflects that the characters corresponding to the two text boxes are positionally adjacent, for example two characters of one word, or two words in a sentence.
  • The splicing of text boxes can be implemented as follows: merging the two adjacent boundary lines between two adjacent text boxes into one, or generating a bounding box containing the adjacent text boxes as the splicing result, as shown in FIG. 9.
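  • The second splicing option, generating a bounding box that contains the adjacent text boxes, can be sketched as follows, assuming axis-aligned rectangles.

```python
# Merge spliced text boxes, each (x_min, y_min, x_max, y_max), into one
# enclosing bounding box.
def merge_text_boxes(boxes):
    x_min = min(b[0] for b in boxes)
    y_min = min(b[1] for b in boxes)
    x_max = max(b[2] for b in boxes)
    y_max = max(b[3] for b in boxes)
    return (x_min, y_min, x_max, y_max)

# e.g. merge_text_boxes([(10, 5, 40, 20), (42, 5, 80, 21)]) -> (10, 5, 80, 21)
```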
  • the image detection method provided by the present invention can be executed on the cloud, where several computing nodes can be deployed, and each computing node has computing, storage and other processing resources.
  • multiple computing nodes can be organized to provide certain services.
  • one computing node can also provide one or more services.
  • the way the cloud provides the service may be to provide a service interface externally, and the user invokes the service interface to use the corresponding service.
  • the service interface includes software development kit (Software Development Kit, referred to as SDK), application programming interface (Application Programming Interface, referred to as API) and other forms.
  • the cloud may provide a service interface of the image detection service, and the user invokes the image detection service interface through the user equipment to trigger a request to the cloud for invoking the image detection service interface.
  • The cloud determines the computing node that responds to the request, and uses the processing resources in that computing node to perform the following steps: acquiring the image to be detected, wherein the image contains a plurality of objects; identifying a plurality of identification points corresponding to the plurality of objects in the image; determining a plurality of detection frames corresponding to the plurality of objects in the image according to the plurality of identification points; and determining the association relationship of the multiple objects according to the correspondence between the detection frames and the identification points and the distance between identification points corresponding to different detection frames.
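  • A hypothetical sketch of invoking such an image detection service interface over HTTP is shown below; the endpoint URL, field names, and response format are invented for illustration and are not part of this disclosure.

```python
# Upload an image and the type of object to detect to a (hypothetical)
# image detection service endpoint, returning its JSON result.
import requests

def call_image_detection_service(image_path, object_type="table_cell"):
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://example.com/api/image-detection",  # hypothetical endpoint
            files={"image": f},
            data={"object_type": object_type},  # type of object to be detected
            timeout=30,
        )
    resp.raise_for_status()
    # Assumed response: detection frame positions plus association relationships.
    return resp.json()
```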
  • The user equipment E1 invokes the image detection service interface to send a call request to the cloud computing node E2; the call request includes the image to be detected, and may also include the type information corresponding to the objects to be detected.
  • The way of invoking the image detection service interface illustrated in Figure 10 is as follows: the user uses a specific APP, on one interface of which an "upload" button is provided; the user loads the image to be detected on the interface and clicks the upload button, triggering the above call request.
  • the APP is a client program that provides image detection services on the cloud
  • the above-mentioned upload button in the program is an application program interface for invoking the service.
  • the user can also edit the image to be inspected through a variety of image editing tools provided under the "Image Editing" menu, such as scaling, cutting and other preprocessing to enhance image quality.
  • After receiving the call request, the cloud computing node E2 determines, based on the above type information, which type of object needs to be detected in the image to be detected, and then performs the detection process.
  • For the detection process, refer to the introduction in the above-mentioned embodiments; it is not repeated here.
  • Through the detection process, the cloud computing node E2 can determine the corresponding positions of the multiple objects contained in the image to be detected (that is, the positions of the multiple detection frames) and the relationships between them (through the splicing relationship between detection frames). Optionally, the cloud computing node E2 can feed these detection results back to the user equipment E1, so that the user equipment E1 can perform subsequent processing based on them, such as the detection frame splicing, table structure recognition, and text recognition introduced above. Alternatively, after obtaining the above detection results, the cloud computing node E2 can itself perform subsequent processing on the image to be detected based on the detection results, such as detection frame splicing, table structure recognition, and text recognition, and feed the final processing result back to the user equipment E1.
  • the image to be detected uploaded by the user is an image taken from a taxi ticket.
  • The taxi ticket includes multiple key-value pairs, expressed in key:value format.
  • Each key and value here can be regarded as a word (a word is equivalent to the concept of text mentioned above, or can also be expressed as a text block).
  • After obtaining the text boxes corresponding to all keys and all values, the splicing relationship between the text boxes corresponding to keys and the text boxes corresponding to values is further determined.
  • The splicing relationship reflects the relationship between keys and values, that is, which key and which value constitute a key-value pair.
  • On this basis, each pair of key-value information included in the taxi invoice image can be recorded in the form of a document, so that a structured output of the information is obtained.
  • financial personnel can extract corresponding information based on reimbursement requirements to complete reimbursement processing.
  • In practice, the image of the taxi invoice and the information extraction result fed back by the computing node E2, that is, an information structure composed of at least one set of key-value pairs, can be displayed simultaneously on the user equipment E1. The user can check whether the extraction result is wrong by comparing the two, and correct any errors.
  • Fig. 11 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in Fig. 11, the method may include the following steps:
  • The solution provided by this embodiment can be applied to the bill recognition scenario, wherein, in this application scenario, it is assumed that the bill contains a table, such as a general invoice, a value-added tax invoice, various reports, statements, and so on.
  • The ultimate goal of performing image detection on the bill image is to convert the table in image format into an editable table file (such as an Excel table) and to fill the data content contained in the table in the image into the editable table file; with the editable table file, it is convenient to store, edit, and statistically analyze the table data.
  • the multiple identification points to be identified in this embodiment may be cell vertices and cell center points.
  • For the implementation process of the above solution provided in this embodiment, reference may be made to the descriptions in the other related embodiments above; details are not repeated here.
  • The bill image and the final table file can be displayed in the same interface, and the user can compare them to check the accuracy of the conversion result and correct any mistakes found.
  • When the table contains a large number of cells, it may be difficult for the user to find the wrong cells after the bill image and the table file are displayed in the same interface. To facilitate the user's inspection, the following scheme can optionally be used:
  • The bill image and the table file are displayed in the same interface, wherein the text content corresponding to the target detection frame is displayed in a set style in the table file.
  • the multiple detection frames mentioned above are the detection frames corresponding to multiple cells detected from the table area, and the detection of multiple detection frames can be completed through the detection frame regression model introduced above.
  • When the detection frame regression model outputs the 8 coordinate values corresponding to a certain cell center point (that is, the distances from the cell center point to the four vertices of its corresponding cell), it also outputs a degree of confidence, which indicates the probability that the distances from the cell center point to the four vertices of the corresponding cell are these 8 coordinate values; this confidence can be used as the confidence of the detection frame corresponding to that cell center point.
  • A confidence threshold can be set; if the confidence of a detection frame is lower than the set threshold, the detection frame is used as a target detection frame, and the text content corresponding to the target detection frame is highlighted in the generated table file, so that users can focus on the cells that may be wrong.
  • FIG. 12 is a flow chart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 12, the method may include the following steps:
  • the solution provided by this embodiment can be applied to an e-commerce scenario.
  • a commodity image uploaded by a merchant will contain a lot of textual information, such as identification information such as a commodity name and commodity introduction information, and the like.
  • Text recognition processing can be performed on the product image to obtain the text content contained therein.
  • The premise of text recognition is to determine the positions containing text in the product image, where a text position is represented by a text box.
  • the target image area is used as an input of a character recognition process to obtain the text content contained therein.
  • the extracted text content can be processed according to different application purposes.
  • the e-commerce platform needs to review whether the text content meets the requirements, such as whether it contains some sensitive words.
  • a sensitive word database can be constructed in advance. If the words contained in the sensitive word database are identified from the product image, the product image is considered unsuitable for release, and corresponding prompt information is given to the merchant.
  • the category of the product corresponding to the product image may be determined according to the keywords included in the text content.
  • The recognized text content will contain product introduction information, and may also contain the product name and other identification information. If preset keywords for category division, such as shoes, hats, or skirts, can be extracted from this information, the classification of commodities can be realized based on the extracted keywords.
  • Fig. 13 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in Fig. 13, the method may include the following steps:
  • the solution provided in this embodiment can be applied to the educational scene.
  • the teacher may use blackboard writing, PPT and other demonstration tools during the teaching process, and the students can take pictures of the demonstration tools to obtain teaching images.
  • There is then a need to classify and retrieve a large number of teaching images later.
  • The image detection scheme provided by the embodiment of the present invention can be used to perform text detection processing on each collected teaching image: first detect the multiple text boxes contained in each teaching image, then splice the text boxes with a splicing relationship into a target image area according to the judgment of the splicing relationship between text boxes, and finally perform text recognition on the target image area to obtain the text content contained therein. After that, the name of a required knowledge point is used as the search keyword, and the text content recognized in each teaching image is used as the search base, so as to retrieve the teaching images containing that knowledge point.
  • image detection processing can also be performed on teaching materials such as students' homework and test papers.
  • parents want to collect a lot of test questions in order to summarize and use them as a reference when they need to give their children test questions.
  • Parents can take pictures of their children's homework, test papers, and other materials to obtain the corresponding images, or they can collect homework and test paper images from the Internet.
  • The text content, that is, the content of the test questions, can be identified in the images according to the detection scheme introduced above.
  • FIG. 14 is a flow chart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 14, the method may include the following steps:
  • the solution provided by this embodiment can be applied to medical scenarios, where a large number of medical records and medical images (such as various angiographic images) can be generated, and image detection processing can be performed on these images.
  • the image detection scheme provided by the embodiment of the present invention can be used to perform text detection processing on each collected medical image to obtain the text content contained therein. Afterwards, according to the text content corresponding to each medical image, search for medical images that match the set keywords, such as a certain disease, time period, and so on.
  • the text detection process of the medical image can be implemented with reference to the detection process introduced in other embodiments above, and will not be repeated here.
  • the image detection solution provided by the embodiment of the present invention can also be used in some table detection scenarios.
  • For example, in an e-commerce scenario, table structure recognition can be performed on commodity images containing tables; in a medical scenario, table structure recognition can be performed on medical images containing tables; in an education scenario, table structure recognition can be performed on test paper images containing tables.
  • A1. Acquire a commodity image including a table area, where the table area includes a plurality of cells.
  • the product image may be an image obtained by photographing the outer package of a certain product, or a promotional image designed by a merchant to promote a certain product, and so on.
  • A2. Identify multiple identification points corresponding to the multiple cells in the commodity image, determine multiple detection frames corresponding to the multiple cells in the commodity image according to the multiple identification points, and determine the row and column information of the multiple cells in an editable table file according to the correspondence between the multiple detection frames and the multiple identification points and the distance between identification points corresponding to different detection frames.
  • the multiple identification points may be multiple cell vertices and multiple cell center points identified from within the table area.
  • the classification of marker points, the regression of the detection frame, and the prediction of the distance between marker points of adjacent cells can be realized by using various models provided in the foregoing embodiments.
  • A3. Generate an editable table file according to the row and column information.
  • B1. Acquire a medical image including a table area, where the table area includes a plurality of cells.
  • the medical image may be a medical record image, or a medical imaging image, and so on.
  • a plurality of identification points corresponding to a plurality of cells are identified in the medical image, and a plurality of detection frames corresponding to a plurality of cells are determined in the medical image according to the plurality of identification points.
  • the corresponding relationship between the identification points and the distance between the identification points corresponding to different detection frames determine the corresponding row and column information of multiple cells in the editable table file.
  • the multiple identification points may be multiple cell vertices and multiple cell center points identified from within the table area.
  • the classification of marker points, the regression of the detection frame, and the prediction of the distance between marker points of adjacent cells can be realized by using various models provided in the foregoing embodiments.
  • the teaching image can be an image obtained by photographing a test paper, or an image obtained by photographing PPT and blackboard writing when the teacher is teaching, or an image obtained by photographing a textbook document, or an image obtained by photographing a student's homework. images, etc.
  • the captured image will contain a table area, for example, the test paper includes the table, the homework answer area or the question stem area includes the table, the teaching material includes the table, and so on.
  • Identify a plurality of identification points corresponding to a plurality of cells in the teaching image determine a plurality of detection frames corresponding to a plurality of cells in the teaching image according to the plurality of identification points, and determine a plurality of detection frames corresponding to a plurality of cells in the teaching image according to the plurality of detection frames and a plurality of detection frames.
  • the corresponding relationship between the identification points and the distance between the identification points corresponding to different detection frames determine the corresponding row and column information of multiple cells in the editable table file.
  • the multiple identification points may be multiple cell vertices and multiple cell center points identified from within the table area.
  • the classification of marker points, the regression of the detection frame, and the prediction of the distance between marker points of adjacent cells can be realized by using various models provided in the foregoing embodiments.
  • Teachers, parents, and students can rewrite the content contained in some of the cells based on the generated table file (it can be an excel table, or a table inserted in a document), so as to realize the purpose of re-editing the title.
  • the generated table file it can be an excel table, or a table inserted in a document
  • image detection device of one or more embodiments of the present invention will be described in detail below. Those skilled in the art can understand that these image detection devices can be configured by using commercially available hardware components through the steps taught in this solution.
  • FIG. 15 is a schematic structural diagram of an image detection device provided by an embodiment of the present invention. As shown in FIG. 15 , the device includes: an acquisition module 11 and a detection module 12 .
  • the acquiring module 11 is configured to acquire an image to be detected, and the image contains multiple objects.
  • the detection module 12 is configured to identify a plurality of identification points corresponding to the plurality of objects in the image, and determine a plurality of detection points corresponding to the plurality of objects in the image according to the plurality of identification points frame, according to the corresponding relationship between the multiple detection frames and the multiple identification points, and the distance between the identification points corresponding to different detection frames, to determine the association relationship of the multiple objects.
  • the detection module 12 is specifically configured to: determine the multiple The splicing relationship of the detection frames; wherein, if at least two detection frames have a splicing relationship, it indicates that there is a set association relationship between the objects corresponding to the at least two detection frames.
  • the device further includes: a display module, configured to display the splicing results of the multiple detection frames on the image in a first style according to the splicing relationship of the multiple detection frames for editing by the user and/or, displaying the plurality of detection frames in a second style on the image for editing by the user.
  • a display module configured to display the splicing results of the multiple detection frames on the image in a first style according to the splicing relationship of the multiple detection frames for editing by the user and/or, displaying the plurality of detection frames in a second style on the image for editing by the user.
  • the image is an image including a table area
  • the multiple objects are multiple cells existing in the table area.
  • the device further includes: a table generation module, configured to perform splicing processing on the plurality of detection frames according to the splicing relationship, so as to obtain corresponding vertex positions of the spliced multiple detection frames; according to the splicing Determine the corresponding row and column information of the plurality of cells in the editable table file according to the corresponding vertex positions of the subsequent multiple detection frames; generate the editable table file according to the row and column information.
  • the device further includes: a character recognition module, configured to splice the multiple detection frames according to the splicing relationship; in the image The target image area is intercepted, and the target image area is composed of at least two detection frames spliced together; character recognition processing is performed on the target image area to obtain corresponding text content.
  • a character recognition module configured to splice the multiple detection frames according to the splicing relationship
  • the detection module 12 can be specifically configured to: identify a plurality of cell center points and a plurality of cell vertices contained in the image;
  • the center point belongs to the cell vertices of the same cell, wherein the detection frame corresponding to the center point of any cell is formed by the cell vertices belonging to the same cell as the center point of any cell;
  • a cell vertex determining at least two cell center points sharing any one of the cell vertices; determining at least two detection frames corresponding to the at least two cell center points; determining the at least two detection frames
  • There is a splicing relationship updating the position of the vertex corresponding to the vertex of any cell in the at least two detection frames to the coordinates of the vertex of any cell.
  • the device shown in FIG. 15 can execute the image detection method provided in the foregoing embodiments.
  • the detailed execution process and technical effects refer to the descriptions in the foregoing embodiments, which will not be repeated here.
  • the structure of the above-mentioned image detection device shown in FIG. 15 can be implemented as an electronic device.
  • the electronic device can include: a processor 21 and a memory 22 .
  • executable codes are stored in the memory 22, and when the executable codes are executed by the processor 21, the processor 21 can at least implement the image detection method provided in the foregoing embodiments.
  • the electronic device may also include a communication interface 23 for communicating with other devices.
  • an embodiment of the present invention provides a non-transitory machine-readable storage medium, the non-transitory machine-readable storage medium stores executable code, and when the executable code is executed by the processor of the electronic device , so that the processor can at least implement the image detection method provided in the foregoing embodiments.

Abstract

Embodiments of the present invention provide an image detection method, device, and storage medium. The method includes: acquiring an image to be detected, the image containing multiple objects; identifying, in the image, multiple identification points corresponding to the multiple objects; determining, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and determining the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames. In this solution, the classification of identification points, the learning of the distances between identification points of different objects, and the regression of detection frames based on the identification points all extract rich semantic information from the image, and this rich semantic information ensures that the structured information of the objects in the image can be parsed accurately.

Description

Image detection method, device, and storage medium
This application claims priority to Chinese Patent Application No. 202110573876.5, filed on May 25, 2021 and entitled "Image detection method, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of image processing technologies, and in particular to an image detection method, device, and storage medium.
Background
With the practical adoption of Optical Character Recognition (OCR) technology, image detection tasks arise in an increasing number of application scenarios.
For example, a table is detected in an image containing the table and the table structure is recognized, so that the table in image form can be converted into an editable excel file, facilitating the storage and editing of the information contained in the table image.
The prerequisite for achieving this purpose is that the table structure information can be recognized accurately in the picture.
Summary
Embodiments of the present invention provide an image detection method, device, and storage medium for accurately parsing the structured information of objects in an image.
In a first aspect, an embodiment of the present invention provides an image detection method, including:
acquiring an image to be detected, the image containing multiple objects;
identifying, in the image, multiple identification points corresponding to the multiple objects;
determining, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and
determining the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames.
In a second aspect, an embodiment of the present invention provides an image detection apparatus, including:
an acquisition module, configured to acquire an image to be detected, the image containing multiple objects; and
a detection module, configured to identify, in the image, multiple identification points corresponding to the multiple objects; determine, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and determine the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames.
In a third aspect, an embodiment of the present invention provides an image detection method, including:
receiving a request from a user device to invoke an image detection service interface, the request including an image to be detected that contains multiple objects; and
performing the following steps by using processing resources corresponding to the image detection service interface:
identifying, in the image, multiple identification points corresponding to the multiple objects;
determining, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and
determining the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames.
In a fourth aspect, an embodiment of the present invention provides an image detection method, including:
acquiring a bill image containing a table area, the table area containing multiple cells;
identifying, in the bill image, multiple identification points corresponding to the multiple cells;
determining, in the bill image and according to the multiple identification points, multiple detection frames corresponding to the multiple cells;
determining, according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, the row and column information of the multiple cells in an editable table file; and
generating the editable table file according to the row and column information.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory stores executable code, and the executable code, when executed by the processor, causes the processor to implement at least the image detection method according to the first or fourth aspect.
In a sixth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium storing executable code, where the executable code, when executed by a processor of an electronic device, causes the processor to implement at least the image detection method according to the first or fourth aspect.
In the image detection solution provided by the embodiments of the present invention, for an image containing multiple objects whose structured information needs to be recognized, the structured information may take the form of whether a certain set association relationship exists between different objects, such as positional adjacency or information pairing. First, multiple identification points corresponding to the multiple objects (such as object center points and boundary points) are identified in the image, and multiple detection frames corresponding to the objects are regressed from these identification points by frame regression. These detection frames roughly represent the positions of the objects in the image. After the discrete detection frames are obtained, the association relationships of the multiple objects are determined according to the correspondence between the detection frames and the identification points and the distances between identification points corresponding to different detection frames.
In the above solution, one or more kinds of identification points can be defined according to actual needs, and the pixels of the image are classified as to whether they are identification points. On the one hand, the position areas (detection frames) of the different objects in the image can be regressed from the identification points, i.e., the detection frames represent the individual objects contained in the image; on the other hand, based on learning the distances between identification points of different objects, the association relationships of the multiple objects can be determined by combining the distances between the identification points corresponding to the detection frames of different objects. In this solution, the classification of identification points, the learning of the distances between identification points of different objects, and the regression of detection frames based on the identification points all extract rich semantic information from the image, and this rich semantic information ensures that the structured information of the objects in the image can be parsed accurately.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an image detection method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the composition of an image detection model provided by an embodiment of the present invention;
FIG. 3 is a flowchart of another image detection method provided by an embodiment of the present invention;
FIG. 4a is a schematic diagram of a table image detection principle provided by an embodiment of the present invention;
FIG. 4b is a schematic diagram of a table image detection scenario provided by an embodiment of the present invention;
FIG. 5a is a schematic diagram of a detection frame splicing result in a table image provided by an embodiment of the present invention;
FIG. 5b is a schematic diagram of displaying detection frames and detection frame splicing results provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of the composition of another image detection model provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a table structure recognition process provided by an embodiment of the present invention;
FIG. 8 is a flowchart of another image detection method provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of a text detection process provided by an embodiment of the present invention;
FIG. 10 is a schematic diagram of an application of an image detection method provided by an embodiment of the present invention;
FIG. 11 is a flowchart of another image detection method provided by an embodiment of the present invention;
FIG. 12 is a flowchart of another image detection method provided by an embodiment of the present invention;
FIG. 13 is a flowchart of another image detection method provided by an embodiment of the present invention;
FIG. 14 is a flowchart of another image detection method provided by an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of an image detection apparatus provided by an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of an electronic device corresponding to the image detection apparatus of the embodiment shown in FIG. 15.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In addition, the order of the steps in the following method embodiments is only an example and is not strictly limited.
The image detection method provided by the embodiments of the present invention may be executed by an electronic device, which may be a terminal device such as a PC, a notebook computer, or a smartphone, or may be a server. The server may be a physical server containing an independent host, a virtual server, or a cloud server or server cluster.
The main purpose of the image detection method provided by the embodiments of the present invention is to perform target detection on an image to be detected, i.e., to detect the positions of the multiple objects contained in the image and the relationships existing between them. The position of an object in the image may be represented by the position of a detection frame surrounding that object.
In different application scenarios, the image to be detected will differ, as will the objects to be detected in the image.
For example, in a table detection scenario, the image to be detected is an image containing a table area, and the multiple objects are the multiple cells contained in the table area. Here, the purpose of image detection is to detect the positions of these cells in the image and determine the positional relationships between them.
As another example, in some text detection scenarios, the image to be detected is an image containing text, and the multiple objects are the multiple characters contained in the image. The purpose of image detection is then to detect the positions of these characters in the image and determine the adjacency relationships between them.
As yet another example, in some information extraction scenarios, the image to be detected may be an image containing key-value pair information, and the multiple objects are all the keys and all the values contained in the image. The purpose of image detection is then to detect the positions of these keys and values in the image and determine the belonging relationships (i.e., correspondence or matching relationships) between them.
In practice, the image to be detected may be captured by the user, so its quality is hard to guarantee. In real scenarios, a table in a user-taken image may exhibit rotation, reflections, occlusion, wrinkles, and other visual defects, which poses a greater challenge to the image detection task. With the image detection solution provided by the embodiments of the present invention, even if the image to be detected has some visual defects, the positions of the multiple objects and the relationships between them can still be detected accurately.
The execution of the image detection method provided herein is illustrated below with reference to the following embodiments.
FIG. 1 is a flowchart of an image detection method provided by an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
101. Acquire an image to be detected, the image containing multiple objects.
102. Identify, in the image, multiple identification points corresponding to the multiple objects.
103. Determine, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects.
104. Determine the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames.
As mentioned above, the image detection tasks to be completed differ across application scenarios. In practice, the target objects to be detected (i.e., the multiple objects above) can be defined in advance for the image to be detected in each application scenario. For example, in the table detection scenario, for an image containing a table area, the target objects to be detected are the cells contained in the table area; in the text detection scenario, for an image containing text, the target objects to be detected are the characters contained in the image.
First, for an image to be detected that contains multiple objects, the positions of these objects in the image need to be detected.
Specifically, a pre-trained image detection model may be used to locate the multiple objects. The composition of this image detection model is illustrated with reference to FIG. 2.
As shown in FIG. 2, the image detection model may include a backbone network model and multiple branch network models.
The backbone network model performs feature extraction on the input image to be detected to obtain a feature map at a certain scale. In practice, the backbone network model may optionally be implemented as a neural network model such as a convolutional network composed of multiple convolutional layers or a residual network. As shown in FIG. 2, the feature map output by the backbone network model is fed into each of the branch network models.
In terms of the functions they implement, the branch network models may be called the identification point classification model, the detection frame regression model, and the splicing relationship regression model, respectively.
The identification point classification model classifies the feature points based on the input feature map. A feature map contains a number of feature points; if the spatial resolution of a feature map is h×w, the feature map contains h×w feature points. These feature points have a positional mapping relationship with the pixels of the image to be detected; therefore, determining the category of a feature point also determines the category of the corresponding pixel in the image to be detected.
In practice, one or more category labels defining identification points are set in advance according to actual needs. For example, object center points and vertices may be defined as identification points, in which case three category labels may be preset: center point, vertex, and other (i.e., background). The feature points of the feature map received from the backbone network model are classified to obtain the category label of each feature point. If a feature point is finally determined to be a center point or a vertex, the pixel position corresponding to that feature point is determined in the image to be detected; that pixel position is an identification point, specifically the center point or a vertex of some object. Through the identification point classification model, multiple identification points corresponding to the multiple objects contained in the image can be identified.
In practice, suppose two categories of identification points are predefined: center points and vertices. The identification point classification model may include multiple convolutional layers; after the input feature map passes through these layers, a two-channel feature map is obtained for detecting the two categories of identification points: if a feature point is a center point, 1 is output at the corresponding position of the first channel; if it is a vertex, 1 is output at the corresponding position of the second channel; if it is background, 0 is output at the corresponding positions of all channels. The two channels thus correspond to the two category labels.
It should be noted that, first, when more than one kind of identification point is defined, the number of identification points finally identified will be greater than the number of objects; second, the classification result only identifies the positions of the many identification points in the image and does not reveal the correspondence between identification points and objects, i.e., which identification point belongs to which object.
The detection frame regression model regresses the detection frames corresponding to the multiple objects in the image from the identification points output by the identification point classification model. This model is trained to learn the distances from an object's center point to its vertices. Thus, for a center point output by the classification model, the detection frame regression model can predict the distances from that center point to the vertices of its object; from these distances, the predicted vertex coordinates of the object are obtained, and these vertex coordinates form the detection frame of the object. The number of vertices depends on the shape of the object: a rectangular object has four vertex coordinates, and a triangular object has three.
Specifically, the detection frame regression model may include multiple convolutional layers. After the feature map from the backbone network passes through these layers, a multi-channel feature map is obtained, where the number of channels is twice the number of vertices of an object (twice, because each vertex coordinate consists of two values, an abscissa and an ordinate). This multi-channel feature map records multiple coordinate values for each feature point; for a given feature point, one coordinate value corresponding to each channel is recorded in that channel. Based on the center point coordinates output by the classification model, the feature point corresponding to a center point can be located in this multi-channel feature map, and the multiple coordinate values of that feature point can be read channel by channel. These values correspond to the distances from the center point to the vertices of its object; from these distances, the vertex coordinates of the object are obtained, yielding the corresponding detection frame.
The splicing relationship regression model regresses the distance between an identification point of one object and an identification point of another object, so that whether a splicing relationship exists between the detection frames of different objects can be discovered from this distance. Here, the two objects are objects with a set association relationship, such as two adjacent objects or two objects forming a key-value pair. This model is trained to learn the distance from an object's identification point to the identification points of other objects that have the set relationship with it. Thus, for an identification point output by the classification model, the splicing relationship regression model can predict the distance from that identification point to a target identification point, where the object of the target identification point has the set association relationship with the object of that identification point; the detection frames with a splicing relationship can then be determined from this distance. The structure and working principle of this model are similar to those of the detection frame regression model and are not repeated here. Based on the above composition of the image detection model, when the model processes the image to be detected, the multiple identification points corresponding to the multiple objects are first identified by the identification point classification model; these identification points may include several categories, such as object center points and object vertices. Then, based on the identification results, the detection frame regression model determines the multiple detection frames corresponding to the multiple objects, one per object, representing the position areas of the objects in the image. Finally, based on the identification results, the splicing relationship regression model determines the splicing relationships of the detection frames, i.e., which detection frames are spliced with which. If several detection frames have a splicing relationship, the objects corresponding to them have a set association relationship. In this way, the positions of the objects in the image and the mutual relationships between different objects are recognized, which amounts to a structured parsing of the information contained in the image and provides the necessary premise for subsequent processing of the image.
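To make the branch structure concrete, the following is a minimal PyTorch-style sketch of such a model, assuming the channel counts described above (a 2-channel identification point classification map, an 8-channel detection frame regression map, and an 8-channel splicing relationship map). The backbone depth, layer widths, and activation choices are illustrative assumptions, not the exact network of the embodiments.

```python
import torch
import torch.nn as nn

def _branch_head(in_ch: int, out_ch: int) -> nn.Sequential:
    # A small convolutional branch head; depth and width are assumptions.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(64, out_ch, 1),
    )

class TableDetectionModel(nn.Module):
    """Backbone plus three branch network models, as described above:
    identification point classification (2 channels: cell center / cell
    vertex), detection frame regression (8 channels: center -> 4 vertex
    offsets), splicing relationship regression (8 channels: vertex ->
    up to 4 sharing cell centers)."""

    def __init__(self, feat_ch: int = 64):
        super().__init__()
        # Illustrative backbone producing a downsampled feature map F1.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.keypoint_head = _branch_head(feat_ch, 2)  # F2
        self.box_head = _branch_head(feat_ch, 8)       # F3
        self.splice_head = _branch_head(feat_ch, 8)    # F4

    def forward(self, image: torch.Tensor) -> dict:
        f1 = self.backbone(image)
        return {
            "keypoints": torch.sigmoid(self.keypoint_head(f1)),  # per-class scores
            "boxes": self.box_head(f1),       # offsets to cell vertices
            "splices": self.splice_head(f1),  # offsets to sharing cell centers
        }
```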
Optionally, the process of finally determining the splicing relationships of the multiple detection frames by the splicing relationship regression model may be implemented as follows: determine the splicing relationships of the multiple detection frames according to the correspondence between the detection frames and the identification points and the distances between identification points corresponding to different detection frames. Specifically, suppose a detection frame R is regressed from an identification point i. After the identification point i is input into the splicing relationship regression model, based on the model's learned distances between identification points of different objects, the model outputs a set of distance values for i (indicating how far from i lies the identification point of another object whose object has an association relationship with i's object). Starting from the position of i, a target position is computed from these distance values, and the identification point j matching that target position, i.e., the one whose position is closest to the target position, is determined among the classified identification points. If a detection frame P is regressed from j, then the detection frames R and P are considered to have a splicing relationship.
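This matching step can be illustrated with a short sketch: starting from identification point i and one regressed distance pair, compute the target position and pick the nearest classified identification point j. The flat coordinate-array layout is an assumption made for illustration.

```python
import numpy as np

def match_identification_point(point_i, distance_pair, candidates):
    """Return the index of the identification point j closest to the
    target position (point i shifted by the regressed distances)."""
    target = np.asarray(point_i, dtype=float) + np.asarray(distance_pair, dtype=float)
    candidates = np.asarray(candidates, dtype=float)  # shape (N, 2)
    return int(np.argmin(np.linalg.norm(candidates - target, axis=1)))
```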
The above optional solution is applicable to the case where only one category of identification point is defined.
In the above solution, the classification of identification points, the learning of the distances between identification points of different objects, and the regression of detection frames based on the identification points all extract rich semantic information from the image. This rich semantic information ensures that the structured information of the objects in the image can be parsed accurately, resisting interference such as poor shooting quality, occlusion, and wrinkles.
As mentioned above, the image detection solution provided by the embodiments of the present invention is applicable to different application scenarios. Taking the table detection scenario and the text detection scenario as examples, the implementation of the image detection solution in these two scenarios is described below.
FIG. 3 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 3, the method includes the following steps:
301. Acquire an image to be detected, the image containing a table area in which multiple cells exist.
302. Identify multiple cell center points and multiple cell vertices contained in the image.
303. For any cell center point, determine the cell vertices belonging to the same cell as that center point, where the cell vertices belonging to the same cell as that center point form the detection frame corresponding to that center point.
304. For any cell vertex, determine at least two cell center points sharing that vertex, determine the at least two detection frames corresponding to those center points, and determine that the at least two detection frames have a splicing relationship.
305. Update the vertex positions corresponding to that cell vertex in the at least two detection frames to the coordinates of that cell vertex.
The image detection solution provided by this embodiment for images containing a table area (table images for short) can be implemented with the image detection model of FIG. 2. The table image detection process based on this model is illustrated below with reference to FIGS. 4a and 4b. In FIG. 4a, h and w denote the resolution, and h/4 and w/4 assume the table image has been downsampled twice.
As shown in FIG. 4a, after the table image is input into the backbone network model, feature extraction is performed on the table image to obtain the feature map F1.
As shown in FIG. 4a, the feature map F1 is input into the identification point classification model, which outputs the feature map F2 shown in the figure. F2 is a two-channel feature map, the two channels corresponding to the two categories of identification points: cell center points and cell vertices. In short, F2 describes the category decision for each feature point: cell center point, cell vertex, or background. In this way, the multiple cell vertices and multiple cell center points contained in the table image can be identified by the identification point classification model.
As shown in FIG. 4a, in the table detection scenario, the detection frame regression model regresses the distances from a cell center point to the four vertices of its cell (i.e., the cell it belongs to). Since each vertex coordinate consists of an abscissa and an ordinate, eight coordinate values are output. Therefore, inputting F1 into the detection frame regression model produces the feature map F3 shown in the figure, an eight-channel feature map describing the eight coordinate values for each feature point. Based on the coordinates of each cell center point output by the classification model, the feature point corresponding to that center point can be located in F3 and its eight coordinate values obtained, i.e., the distances from the cell center point to the four vertices of its cell, from which the coordinates of the four cell vertices are obtained. For any cell center point, the cell vertices belonging to the same cell form the detection frame corresponding to that center point. Thus, a detection frame is regressed for each cell center point, and its position indicates the position of the corresponding cell in the table image.
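A minimal sketch of this decoding step, assuming F3 is an (8, H, W) array holding, at each center point, the offsets (dx1, dy1, ..., dx4, dy4) from the center to the four vertices of its cell; the additive sign convention of the offsets is an assumption.

```python
import numpy as np

def decode_cell_boxes(centers, f3):
    """For each cell center point (x, y) in feature-map coordinates,
    read its 8 regressed values from F3 and form the 4 vertex
    coordinates of the corresponding detection frame."""
    boxes = []
    for (cx, cy) in centers:
        offsets = f3[:, int(cy), int(cx)]                        # 8 values
        vertices = offsets.reshape(4, 2) + np.array([cx, cy], dtype=float)
        boxes.append(vertices)                                   # 4 x 2 array
    return boxes
```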
As shown in FIG. 4a, in the table detection scenario, the splicing relationship regression model regresses the distances from a cell vertex to the center points of the cells sharing that vertex. At most four cells can share one vertex, so a vertex regresses to at most four cell center points; each center point coordinate consists of two values, so eight values are output. Therefore, inputting F1 into the splicing relationship regression model produces the feature map F4 shown in the figure, an eight-channel feature map describing the eight coordinate values for each feature point. Based on the coordinates of each cell vertex output by the classification model, the feature point corresponding to that vertex can be located in F4 and its eight coordinate values obtained. For a given cell vertex, these eight values are the distances to the cell center points of the cells sharing that vertex, from which the coordinates of the four sharing cell center points are obtained.
Understandably, a cell vertex is shared by at most four cells, so in practice two, four, or six of the eight coordinate values of a vertex may be 0. If four of the eight coordinate values of a vertex are 0, that vertex is shared by only two cells.
If, for a given cell vertex, at least two cell center points sharing that vertex are finally determined, then the at least two detection frames corresponding to those center points can be determined among the detection frames obtained by the detection frame regression model, and those detection frames are determined to have a splicing relationship.
Afterwards, the detection frames with splicing relationships can be spliced together, so that the multiple detection frames are finally spliced into a complete table, yielding the structure of the table in the table image.
For ease of understanding, the detection results obtained in the above table image detection process are illustrated with reference to FIG. 4b.
In FIG. 4b, a portion of a table area containing four cells is shown with thinner lines. Black dots represent the cell center points obtained by the identification point classification model, and black triangles represent the cell vertices obtained by the model.
For the four cell center points shown in the figure, four corresponding detection frames are obtained by the detection frame regression model, denoted Q1, Q2, Q3, and Q4, corresponding to the four rectangles drawn with bold lines.
For the cell vertex shown in the figure, an auxiliary frame, denoted Q5, is obtained by the splicing relationship regression model; the vertices of Q5 represent the positions of the center points of the cells sharing that vertex.
Then, for each vertex of the auxiliary frame Q5, the matching cell center point, i.e., the one closest in distance, is determined among the cell center points obtained by the classification model. The results are the four cell center points shown in the figure. This finally yields the decision that the four detection frames corresponding to these four center points have a splicing relationship, and the four detection frames are spliced accordingly.
As shown in FIG. 4b, when the four detection frames are spliced based on the above cell vertex, the positions of the vertices of the four frames corresponding to that cell vertex are updated to that cell vertex, which amounts to pulling the corresponding vertices of the four frames to the cell vertex, where the corresponding vertex of a frame is the vertex closest to that cell vertex.
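The vertex update can be sketched as follows, assuming each detection frame is stored as a 4×2 array of vertex coordinates.

```python
import numpy as np

def snap_frames_to_shared_vertex(vertex, frames, frame_ids):
    """For a cell vertex shared by several detection frames, pull each
    frame's nearest vertex exactly onto the shared vertex coordinates,
    so that spliced frames end up sharing edges (cf. FIG. 4b).
    Modifies the frames in place."""
    v = np.asarray(vertex, dtype=float)
    for i in frame_ids:                       # indices of spliced frames
        frame = frames[i]                     # 4 x 2 array of vertices
        nearest = int(np.argmin(np.linalg.norm(frame - v, axis=1)))
        frame[nearest] = v                    # update to the shared vertex
```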
By splicing the detection frames that have splicing relationships according to the splicing decisions, the positional relationships between the objects corresponding to the detection frames (the cells, in this embodiment) can be represented visually and intuitively, i.e., whether an association relationship exists between different objects in the image, since the detection frames of associated objects are spliced together.
FIG. 4b only illustrates the splicing effect based on one cell vertex. In fact, each cell vertex output by the classification model is traversed in turn, and the above splicing decision and splicing processing are performed for each, so that the multiple discrete (independent) detection frames output by the detection frame regression model are finally spliced into a complete table. Understandably, a "complete table" here means that the overall structure of a complete table in the image has been understood: how many cells the table contains and what the positional relationships between them are. Understanding this complete table structure is the precondition for converting the image-form table into an editable table file, i.e., generating an excel table.
For ease of understanding, FIG. 5a illustrates a splicing result: the originally discrete detection frames Q1, Q2, Q3, and Q4 are finally spliced into the effect shown in the figure, with the detection frames of adjacent cells sharing edges.
In an optional embodiment, the splicing results of the multiple detection frames may be displayed on the table image in a first style according to their splicing relationships, for editing by the user. The multiple detection frames obtained by the detection frame regression model may also be displayed on the image in a second style, for editing by the user.
Displaying the detection frames obtained by the detection frame regression model lets the user see the position of each frame; when the user finds a frame's position inaccurate, the user can adjust it manually (e.g., by moving or dragging lines), so that the detection frame regression model can be optimized based on the user's adjustments.
Similarly, the user may also observe the splicing results, find inaccurate splicing, and adjust it manually, so that the splicing relationship regression model can be optimized based on the user's adjustments.
The first and second styles may be lines of different colors, thicknesses, shapes, and so on. As shown in FIG. 5b, the thinner lines indicate the initial detection frame results and the thicker lines the splicing results; the captured table image itself is not shown in the figure. In practice, displaying the detection frames and splicing results on the table image lets the user see their accuracy intuitively and make corresponding adjustments.
In an optional embodiment, after the multiple detection frames are obtained, different detection frames may also be displayed differentially according to their confidence. The confidence of each frame is output directly by the detection frame regression model and represents the accuracy of that frame's detection result. Differential display by confidence may mean displaying the frames whose confidence is below a set threshold, so that the user can focus on low-confidence frames and correct them promptly; or displaying frames with confidence above the threshold in one style and those below it in another, where the styles may be lines of different thicknesses, colors, and so on.
The above embodiments take the table detection scenario as an example to illustrate the image detection process based on the model shown in FIG. 2. In fact, the image detection solution can also be implemented based on the model shown in FIG. 6.
FIG. 6 is a schematic diagram of the composition of another image detection model provided by an embodiment of the present invention. As shown in FIG. 6, this model differs from the one in FIG. 2 in that a branch network model is added: the offset regression model, which determines the coordinate offsets of the identification points output by the identification point classification model. The offset regression model applies several convolutional layers to the feature map F1 received from the backbone network to obtain a two-channel feature map F5. If a feature point in F5 is an identification point, the offsets of its abscissa and ordinate caused by downsampling are output on the two channels of F5.
In practice, the backbone network performs several downsampling operations while extracting features from the input image layer by layer. Downsampling requires rounding the identification point coordinates, which reduces the accuracy of the computed coordinates. To make up for this loss of accuracy, the error introduced by the downsampling operations needs to be compensated back; the above offset is exactly that error value.
The training process of the offset regression model can be described as follows: for a training sample image containing an object, given the known coordinates of the object's identification points (e.g., center point, vertices), the offsets of those identification points can be computed from the downsampling factor applied by the identification point classification model to the training sample image. The formula is:
x1 = x0/2^n − int(x0/2^n),  y1 = y0/2^n − int(y0/2^n);
where (x0, y0) are the abscissa and ordinate of the object's identification point, (x1, y1) are the offsets corresponding to the abscissa and ordinate, int() is the round-down operator, and n indicates that the identification point classification model downsamples the training sample image by a factor of 2^n.
With the above offsets as supervision information, the training of the offset regression model is completed.
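A small sketch of computing these supervision targets, following the formula above; for non-negative pixel coordinates, Python's int() truncation coincides with the round-down operator int() in the formula.

```python
def downsample_offsets(x0: float, y0: float, n: int):
    """Fractional parts lost when an identification point at (x0, y0)
    is mapped onto a feature map downsampled by a factor of 2**n;
    used as supervision for the offset regression model."""
    scale = 2 ** n
    x1 = x0 / scale - int(x0 / scale)
    y1 = y0 / scale - int(y0 / scale)
    return x1, y1
```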
Still taking the table detection scenario as an example, when a table image is detected based on the model shown in FIG. 6, the coordinates of the multiple cell center points and cell vertices are obtained by the identification point classification model, and their coordinate offsets are obtained by the offset regression model. For a given cell center point, after the distances from that center point to the four vertices of its cell are obtained by the detection frame regression model, the center point's coordinates are updated by adding its coordinate offset, and the coordinates of the four cell vertices are then obtained from the updated center point coordinates and the above distances. Likewise, for a given cell vertex, after the distances from the (at most four) cell center points sharing that vertex to the vertex are obtained by the splicing relationship regression model, the vertex's coordinates are updated by adding its coordinate offset, and the coordinates of the four cell center points are then obtained from the updated vertex coordinates and the above distances.
Still taking the table detection scenario as an example, after the splicing of the detection frames corresponding to the multiple cells is completed through the splicing process described above, a deep parsing of the table structure can also be performed based on the splicing result. Deep parsing of the table structure means determining the row and column numbers of each cell in the table, so that the image-form table area can be converted into an editable table file such as an excel table.
In short: determine, from the splicing result of the detection frames, the position information of the cells in the editable table file (i.e., the cells' row and column numbers), and generate the editable table file from that position information. That is, splice the multiple detection frames according to their splicing relationships to obtain the vertex positions of the spliced detection frames; determine, from those vertex positions, the row and column information of the cells in the editable table file; and generate the editable table file from the row and column information.
Understandably, the splicing result of the detection frames only marks, in the image, a complete table area and the position of each cell within it; converting this table area into a corresponding editable table file based on the marking facilitates the user's storage, statistics, editing, and other processing of the data contained in each cell.
The table structure recognition process is briefly illustrated with reference to FIG. 7. As shown in FIG. 7, suppose the splicing of detection frames yields the complete table area shown in the figure, spliced from six detection frames, which gives the vertex positions of the spliced frames. From these vertex positions, all row lines and column lines can be identified and numbered in order: row line 1, row line 2, row line 3, row line 4; and column line 1, column line 2, column line 3. From the numbering of the row and column lines, these lines form a three-row, two-column table. Then, from the numbers of the row and column lines of each detection frame, the row and column numbers of the frame can be determined; in FIG. 7, A_ij denotes the row and column numbers. From the determined row and column numbers of each frame, the corresponding excel table can be generated.
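A simplified sketch of this row/column numbering, assuming each spliced detection frame is given as a list of (x, y) vertices and that nearby line coordinates are merged with a fixed tolerance; the tolerance value and the (absent) handling of merged cells are illustrative simplifications.

```python
def assign_row_col(frames, tol: float = 2.0):
    """Number the distinct row lines (y) and column lines (x) formed by
    the spliced frames, then give each frame the (row, col) index of its
    top-left corner, from which an editable table can be generated."""
    quant = lambda v: round(v / tol) * tol            # merge nearby lines
    row_lines = sorted({quant(y) for f in frames for (_, y) in f})
    col_lines = sorted({quant(x) for f in frames for (x, _) in f})
    cells = []
    for f in frames:
        top = min(y for (_, y) in f)
        left = min(x for (x, _) in f)
        row = min(range(len(row_lines)), key=lambda i: abs(row_lines[i] - top))
        col = min(range(len(col_lines)), key=lambda i: abs(col_lines[i] - left))
        cells.append((row, col))                      # 0-based indices
    return cells
```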
FIG. 8 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 8, the method includes the following steps:
801. Acquire an image to be detected, the image containing multiple characters.
802. Identify multiple character center points contained in the image.
803. Determine, in the image and according to the multiple character center points, multiple text boxes corresponding to the multiple characters.
804. Determine the adjacency relationships of the multiple characters according to the correspondence between the multiple text boxes and the multiple character center points and the distances between character center points corresponding to different text boxes.
805. Crop a target image area from the image and perform character recognition on it to obtain the corresponding text content, where the target image area consists of the at least two text boxes spliced together.
This embodiment takes the text detection scenario as an example to describe an optional implementation of the image detection method provided by the embodiments of the present invention in that scenario. Here, the image to be detected is an image containing multiple characters, and the identification points mentioned above may be character center points.
After the multiple character center points contained in the image are obtained by the identification point classification model, the distance from each character center point to the vertices of its text box can be obtained from the detection frame regression model, so that a corresponding text box is obtained for each character center point.
For ease of understanding, refer to FIG. 9. In FIG. 9, suppose an image contains the two characters "奶粉" (milk powder); the identification point classification model yields the character center points k and p, and the text boxes corresponding to these two center points are denoted W1 and W2.
In addition, for any character center point k output by the classification model, the distance from k to the center point of its adjacent character can be determined by the splicing relationship regression model; from this distance, the coordinates of the adjacent character's center point corresponding to k can be determined. Suppose the character center point matching those coordinates among the center points output by the classification model is the character center point p; it can then be determined that the text box W1 corresponding to k and the text box W2 corresponding to p have a splicing relationship. This splicing relationship reflects that the characters of the two text boxes are positionally adjacent and may be two characters of one word or one sentence. The two text boxes with the splicing relationship are spliced together to obtain a target image area. Afterwards, the target image area can be cropped from the image and input into a character recognition model for character recognition, obtaining the corresponding text content: 奶粉. Here, the splicing of the boxes may be implemented as merging the two adjacent middle boundary lines of the two adjacent text boxes into one, or as generating an enclosing box containing the adjacent text boxes as the splicing result, as shown in FIG. 9.
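The second splicing option, generating an enclosing box that contains the adjacent text boxes, can be sketched as follows (text boxes are assumed to be lists of (x, y) vertices).

```python
def enclosing_box(box_a, box_b):
    """Splice two adjacent text boxes by taking their common
    axis-aligned bounding rectangle (cf. FIG. 9)."""
    xs = [x for (x, _) in box_a + box_b]
    ys = [y for (_, y) in box_a + box_b]
    return min(xs), min(ys), max(xs), max(ys)   # (x_min, y_min, x_max, y_max)
```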
As mentioned above, the image detection method provided by the present invention can be executed in the cloud, where a number of computing nodes may be deployed, each with processing resources such as computation and storage. In the cloud, multiple computing nodes may be organized to provide a service; of course, one computing node may also provide one or more services. The cloud may provide the service by exposing a service interface that users invoke to use the corresponding service. Service interfaces include forms such as Software Development Kits (SDKs) and Application Programming Interfaces (APIs).
For the solution provided by the embodiments of the present invention, the cloud may provide a service interface for the image detection service. The user invokes the image detection service interface through a user device to trigger a request to the cloud for invoking that interface. The cloud determines the computing node that responds to the request and performs the following steps using the processing resources of that node:
receiving a request from a user device to invoke the image detection service interface, the request including an image to be detected that contains multiple objects; and
performing the following steps using the processing resources corresponding to the image detection service interface:
identifying, in the image, multiple identification points corresponding to the multiple objects;
determining, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and
determining the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames.
For the detailed process by which the image detection service interface performs image detection with the processing resources, refer to the related descriptions in the other embodiments above, which are not repeated here.
For ease of understanding, refer to FIG. 10. In FIG. 10, when the user wants to detect an image, the user invokes the image detection service interface in the user device E1 to send an invocation request to the cloud computing node E2; the request includes the image to be detected and may also include the category information of the objects to be detected. The invocation shown in FIG. 10 works as follows: the user uses a specific app with an "upload" button on one of its interfaces; the user loads the image to be detected on that interface and clicks the upload button, triggering the invocation request. That is, the app is the client program through which the cloud provides the image detection service, and the upload button is the application programming interface for invoking the service. After loading the original image, the user can also edit it with the image editing tools provided under the "image editing" menu, e.g., preprocessing such as scaling and cropping to enhance image quality.
In this embodiment, it is assumed that after receiving the invocation request, the cloud computing node E2 learns from the category information which type of object needs to be detected in the image and then executes the detection process; for the detection process, refer to the descriptions in the foregoing embodiments, which are not repeated here. By executing the image detection solution described above, E2 learns the positions of the objects contained in the image (i.e., the positions of the detection frames) and the relationships between them (expressed by the splicing relationships between detection frames). Optionally, E2 may feed these detection results back to the user device E1 for subsequent processing based on them, such as the detection frame splicing, table structure recognition, and character recognition described above. Alternatively, after obtaining the detection results, E2 may itself perform such subsequent processing on the image and feed the final processing results back to E1.
For ease of understanding, assume the following application scenario in FIG. 10: the image uploaded by the user is a photograph of a taxi receipt which, as shown in FIG. 10, contains multiple key-value pairs in the key:value format. From the text detection process described above, each key and each value can be regarded as a word (a word corresponds to the notion of a character above, or can be described as a text block). After the text boxes corresponding to all keys and all values are obtained, the splicing relationships between key text boxes and value text boxes are further determined; these splicing relationships reflect the belonging relationships between keys and values, i.e., which key forms a key-value pair with which value. Based on the recognition of the text content contained in each text box and the determined key-value relationships, the pairs of key-value information in the taxi receipt image can be recorded in document form, yielding a structured output of the information. In practical applications, financial staff, for example, can extract the relevant information from it for reimbursement processing.
As shown in FIG. 10, taking the above scenario as an example, the taxi receipt image and the information extraction result fed back by the computing node E2, an information structure consisting of at least one set of key-value contents, can be displayed simultaneously on the user device E1. By comparing them, the user can find whether the extraction result is wrong and correct it.
In practice, image detection needs arise in many application fields, and the technical solutions of the embodiments of the present invention can be used in all of them; several embodiments are illustrated below.
FIG. 11 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 11, the method may include the following steps:
1101. Acquire a bill image containing a table area in which multiple cells exist.
1102. Identify, in the bill image, multiple identification points corresponding to the multiple cells.
1103. Determine, in the bill image and according to the multiple identification points, multiple detection frames corresponding to the multiple cells.
1104. Determine, according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, the row and column information of the multiple cells in an editable table file.
1105. Generate the editable table file according to the row and column information.
The solution provided by this embodiment is applicable to bill recognition scenarios, where the bill is assumed to contain a table, such as ordinary invoices, VAT invoices, various reports, statements of account, and so on.
In this application scenario, the ultimate purpose of performing image detection on the bill image is to convert the image-form table into an editable table file (such as an excel table) and fill the data content of the table in the image into the editable table file accordingly, facilitating storage, editing, statistical analysis, and other processing of the table data.
From the description of the table detection embodiments above, the multiple identification points to be identified in this embodiment may be the two categories of cell vertices and cell center points. For the implementation of the above solution of this embodiment, refer to the descriptions in the other related embodiments above, which are not repeated here.
Optionally, to let the user check the accuracy of the converted table file, the bill image and the resulting table file can be displayed on the same interface; the user can compare them to check the accuracy of the conversion result and correct any errors.
In addition, when the table contains many cells, for example, the user may find it hard to spot erroneous cells after the bill image and table file are displayed on the same interface. To facilitate the user's checking, the following solution may optionally be adopted:
determine, from the confidences of the multiple detection frames, target detection frames whose confidence meets a set requirement; and
display the bill image and the table file on the same interface, where the text content corresponding to the target detection frames is displayed in the table file in a set style.
The multiple detection frames are the frames corresponding to the multiple cells detected in the table area; their detection can be completed by the detection frame regression model described above. In practice, when the detection frame regression model outputs the eight coordinate values for a cell center point (i.e., corresponding to the distances from the center point to the four vertices of its cell), it also outputs a confidence, which represents the probability that the distances from the center point to the four vertices of its cell are these eight values; this confidence can serve as the confidence of the detection frame corresponding to that center point. A threshold can be set; if a frame's confidence is below the threshold, its recognition result may be wrong, and that frame serves as a target detection frame whose corresponding text content is highlighted in the generated table file, so that the user can focus on the possibly erroneous cells.
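A minimal sketch of selecting the target detection frames by confidence; the threshold value is an assumption.

```python
def select_low_confidence(frames, confidences, threshold: float = 0.5):
    """Return the target detection frames whose confidence falls below
    the set threshold, so their cell text can be highlighted in the
    generated table file for the user to double-check."""
    return [f for f, c in zip(frames, confidences) if c < threshold]
```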
FIG. 12 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 12, the method may include the following steps:
1201. Acquire a commodity image containing multiple characters.
1202. Identify multiple character center points contained in the commodity image.
1203. Determine, in the commodity image and according to the multiple character center points, multiple text boxes corresponding to the multiple characters.
1204. Determine the adjacency relationships of the multiple characters according to the correspondence between the multiple text boxes and the multiple character center points and the distances between character center points corresponding to different text boxes.
1205. Crop a target image area from the commodity image and perform character recognition on it to obtain the corresponding text content, where the target image area consists of the at least two text boxes spliced together.
1206. Determine whether the text content contains sensitive words.
The solution provided by this embodiment is applicable to e-commerce scenarios, where the commodity images uploaded by merchants contain much textual information, such as identification information like the commodity name and commodity introduction information. Character recognition can be performed on the commodity image to obtain the text content it contains, the premise of which is determining the positions containing text in the commodity image, where text positions are represented by text boxes.
In the solution provided by this embodiment, the text box corresponding to each character is detected, and adjacent text boxes are then spliced together according to the splicing decisions between text boxes, so as to locate, in the commodity image, the target image area occupied by the spliced text boxes; the target image area serves as the input of one character recognition pass, yielding the text content it contains. If several text boxes have a splicing relationship, their characters are likely to form a word or a sentence; recognizing them as a whole also helps improve the accuracy of the character recognition result.
For the detailed text detection process above, refer to the related descriptions in the other embodiments, which are not repeated here.
Afterwards, the extracted text content can be processed according to different application purposes.
For example, the e-commerce platform needs to review whether the text content meets requirements, e.g., whether it contains sensitive words. In practice, a sensitive-word library can be built in advance; if a word contained in the sensitive-word library is recognized in a commodity image, the image is deemed unsuitable for publication, and corresponding prompt information is given to the merchant.
As another example, the category of the commodity in the commodity image can be determined from the keywords contained in the text content. The recognized text content will contain commodity introduction information and may contain identification information such as the commodity name; if preset category-classification keywords such as shoes, hats, or skirts can be extracted from this information, the commodity can be categorized based on the extracted keywords.
FIG. 13 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 13, the method may include the following steps:
1301. Acquire a teaching image containing multiple characters.
1302. Identify multiple character center points contained in the teaching image.
1303. Determine, in the teaching image and according to the multiple character center points, multiple text boxes corresponding to the multiple characters.
1304. Determine the adjacency relationships of the multiple characters according to the correspondence between the multiple text boxes and the multiple character center points and the distances between character center points corresponding to different text boxes.
1305. Crop a target image area from the teaching image and perform character recognition on it to obtain the corresponding text content, where the target image area consists of the at least two text boxes spliced together.
1306. Perform teaching image search processing according to the text content.
The solution provided by this embodiment is applicable to education scenarios. In an education scenario, teachers may use presentation tools such as blackboard writing and PPT during lectures, and students can photograph these presentation tools to obtain teaching images. When a student takes a large number of teaching images, there is a subsequent need to organize them and retrieve them on demand.
When a student needs to search the large number of collected teaching images for those related to a certain knowledge point, the image detection solution provided by the embodiments of the present invention can be used to perform text detection on each collected teaching image: first detect the multiple text boxes contained in each image; then, according to the splicing decisions between text boxes, splice the text boxes with splicing relationships into a target image area; and perform character recognition on the target image area to obtain the text content it contains. Afterwards, using the name of the desired knowledge point as the search keyword and the text content recognized from each teaching image as the search library, the teaching images containing that knowledge point can be found.
In education scenarios, besides the above need to detect teaching images, teaching materials such as students' homework and test papers can also undergo image detection processing. For example, parents may want to collect many test questions in order to summarize them and use them as a reference when setting questions for their children. In this case, parents can photograph their children's homework, test papers, and other materials to obtain corresponding images, or collect homework and test-paper images on the Internet. Afterwards, the text content, i.e., the content of the test questions, can be recognized in the images according to the detection solution described above.
FIG. 14 is a flowchart of another image detection method provided by an embodiment of the present invention. As shown in FIG. 14, the method may include the following steps:
1401. Acquire a medical image containing multiple characters.
1402. Identify multiple character center points contained in the medical image.
1403. Determine, in the medical image and according to the multiple character center points, multiple text boxes corresponding to the multiple characters.
1404. Determine the adjacency relationships of the multiple characters according to the correspondence between the multiple text boxes and the multiple character center points and the distances between character center points corresponding to different text boxes.
1405. Crop a target image area from the medical image and perform character recognition on it to obtain the corresponding text content, where the target image area consists of the at least two text boxes spliced together.
1406. Perform medical image search processing according to the text content.
The solution provided by this embodiment is applicable to medical scenarios, where a large number of medical record images and medical images (such as various angiographic images) are generated and can undergo image detection processing.
For example, when an institution needs to compile statistics on or analyze medical records, the image detection solution provided by the embodiments of the present invention can be used to perform text detection on each collected medical image to obtain the text content it contains. Afterwards, according to the text content corresponding to each medical image, medical images matching set keywords, such as a certain disease or time period, can be searched for.
The text detection process for medical images can be implemented with reference to the detection process described in the other embodiments above, which is not repeated here.
The above illustrates the application of the image detection solution provided by the embodiments of the present invention in some text detection scenarios. In fact, the solution can also be used in some table detection scenarios. For example, in e-commerce scenarios, table structure recognition can be performed on commodity images containing tables; in medical scenarios, on medical images containing tables; and in education scenarios, on test-paper images containing tables. Specific implementations of table structure recognition in these three fields are illustrated below:
(1) E-commerce scenario:
A1. Acquire a commodity image containing a table area, the table area containing multiple cells.
The commodity image may be an image obtained by photographing the outer packaging of a commodity, a promotional image designed by a merchant to promote a commodity, and so on.
A2. Identify, in the commodity image, multiple identification points corresponding to the multiple cells; determine, in the commodity image and according to the multiple identification points, multiple detection frames corresponding to the multiple cells; and determine, according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, the row and column information of the multiple cells in an editable table file.
As mentioned above, the multiple identification points may be the multiple cell vertices and multiple cell center points identified within the table area. The classification of identification points, the regression of detection frames, and the prediction of the distances between identification points of adjacent cells can be implemented with the various models provided in the foregoing embodiments.
A3. Generate the editable table file according to the row and column information.
A4. Fill the text content extracted from each detection frame into the corresponding cell of the table file.
(2) Medical scenario:
B1. Acquire a medical image containing a table area, the table area containing multiple cells.
The medical image may be a medical record image, a medical imaging image, and so on.
B2. Identify, in the medical image, multiple identification points corresponding to the multiple cells; determine, in the medical image and according to the multiple identification points, multiple detection frames corresponding to the multiple cells; and determine, according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, the row and column information of the multiple cells in an editable table file.
As mentioned above, the multiple identification points may be the multiple cell vertices and multiple cell center points identified within the table area. The classification of identification points, the regression of detection frames, and the prediction of the distances between identification points of adjacent cells can be implemented with the various models provided in the foregoing embodiments.
B3. Generate the editable table file according to the row and column information.
B4. Fill the text content extracted from each detection frame into the corresponding cell of the table file.
(3) Education scenario:
C1. Acquire a teaching image containing a table area, the table area containing multiple cells.
The teaching image may be an image obtained by photographing a test paper, the PPT or blackboard writing during a teacher's lecture, a textbook document, or a student's homework, and so on. The captured image will contain a table area; for example, the test paper includes a table, the answer area or question-stem area of the homework includes a table, the teaching material includes a table, and so on.
C2. Identify, in the teaching image, multiple identification points corresponding to the multiple cells; determine, in the teaching image and according to the multiple identification points, multiple detection frames corresponding to the multiple cells; and determine, according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, the row and column information of the multiple cells in an editable table file.
As mentioned above, the multiple identification points may be the multiple cell vertices and multiple cell center points identified within the table area. The classification of identification points, the regression of detection frames, and the prediction of the distances between identification points of adjacent cells can be implemented with the various models provided in the foregoing embodiments.
C3. Generate the editable table file according to the row and column information.
Teachers, parents, and students can rewrite the content contained in some of the cells based on the generated table file (which may be an excel table or a table inserted into a document), so as to re-edit the questions and for other purposes.
The above merely illustrates, with several application fields as examples, the application scenarios to which the image detection solution of the embodiments of the present invention is applicable; in fact, it is not limited thereto.
The image detection apparatus of one or more embodiments of the present invention is described in detail below. Those skilled in the art will understand that these image detection apparatuses can be configured from commercially available hardware components through the steps taught in this solution.
FIG. 15 is a schematic structural diagram of an image detection apparatus provided by an embodiment of the present invention. As shown in FIG. 15, the apparatus includes: an acquisition module 11 and a detection module 12.
The acquisition module 11 is configured to acquire an image to be detected, the image containing multiple objects.
The detection module 12 is configured to identify, in the image, multiple identification points corresponding to the multiple objects; determine, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and determine the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames.
Optionally, the detection module 12 is specifically configured to determine the splicing relationships of the multiple detection frames according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, where if at least two detection frames have a splicing relationship, it indicates that the objects corresponding to the at least two detection frames have a set association relationship.
Optionally, the apparatus further includes a display module configured to display, according to the splicing relationships of the multiple detection frames, the splicing results of the multiple detection frames on the image in a first style, for editing by the user; and/or display the multiple detection frames on the image in a second style, for editing by the user.
Optionally, the image is an image containing a table area, and the multiple objects are the multiple cells existing in the table area. In this case, the apparatus further includes a table generation module configured to splice the multiple detection frames according to the splicing relationships to obtain the vertex positions of the spliced detection frames; determine, according to the vertex positions of the spliced detection frames, the row and column information of the multiple cells in an editable table file; and generate the editable table file according to the row and column information.
Optionally, the multiple objects are multiple characters. In this case, the apparatus further includes a character recognition module configured to splice the multiple detection frames according to the splicing relationships; crop, from the image, a target image area consisting of at least two detection frames spliced together; and perform character recognition on the target image area to obtain the corresponding text content.
Optionally, the detection module 12 may be specifically configured to: identify multiple cell center points and multiple cell vertices contained in the image; for any cell center point, determine the cell vertices belonging to the same cell as that center point, where the cell vertices belonging to the same cell as that center point form the detection frame corresponding to that center point; for any cell vertex, determine at least two cell center points sharing that vertex; determine the at least two detection frames corresponding to those center points; determine that the at least two detection frames have a splicing relationship; and update the vertex positions corresponding to that cell vertex in the at least two detection frames to the coordinates of that cell vertex.
The apparatus shown in FIG. 15 can execute the image detection method provided in the foregoing embodiments; for the detailed execution process and technical effects, refer to the descriptions in the foregoing embodiments, which are not repeated here.
In a possible design, the structure of the image detection apparatus shown in FIG. 15 may be implemented as an electronic device. As shown in FIG. 16, the electronic device may include: a processor 21 and a memory 22, where the memory 22 stores executable code, and the executable code, when executed by the processor 21, causes the processor 21 to implement at least the image detection method provided in the foregoing embodiments.
Optionally, the electronic device may further include a communication interface 23 for communicating with other devices.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium storing executable code, where the executable code, when executed by a processor of an electronic device, causes the processor to implement at least the image detection method provided in the foregoing embodiments.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment's solution, which those of ordinary skill in the art can understand and implement without creative effort.
From the above description of the implementations, those skilled in the art can clearly understand that each implementation can be realized by means of a necessary general hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, can be embodied in the form of a computer product; the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
Finally, it should be noted that the above embodiments are only intended to illustrate, not limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or equivalently replace some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

  1. An image detection method, comprising:
    acquiring an image to be detected, the image containing multiple objects;
    identifying, in the image, multiple identification points corresponding to the multiple objects;
    determining, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and
    determining the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames.
  2. The method according to claim 1, wherein determining the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames comprises:
    determining the splicing relationships of the multiple detection frames according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames;
    wherein, if at least two detection frames have a splicing relationship, it indicates that the objects corresponding to the at least two detection frames have a set association relationship.
  3. The method according to claim 2, further comprising:
    displaying, according to the splicing relationships of the multiple detection frames, the splicing results of the multiple detection frames on the image in a first style, for editing by the user; and/or
    displaying the multiple detection frames on the image in a second style, for editing by the user.
  4. The method according to claim 2, wherein the image is an image containing a table area, and the multiple objects are multiple cells existing in the table area;
    the method further comprising:
    splicing the multiple detection frames according to the splicing relationships to obtain the vertex positions of the spliced detection frames;
    determining, according to the vertex positions of the spliced detection frames, the row and column information of the multiple cells in an editable table file; and
    generating the editable table file according to the row and column information.
  5. The method according to claim 2, wherein the multiple objects are multiple characters; the method further comprising:
    splicing the multiple detection frames according to the splicing relationships;
    cropping, from the image, a target image area consisting of at least two detection frames spliced together; and
    performing character recognition on the target image area to obtain the corresponding text content.
  6. The method according to claim 4, wherein identifying, in the image, the multiple identification points corresponding to the multiple objects comprises:
    identifying multiple cell center points and multiple cell vertices contained in the image;
    determining, in the image and according to the multiple identification points, the multiple detection frames corresponding to the multiple objects comprises:
    for any cell center point, determining the cell vertices belonging to the same cell as that center point, wherein the cell vertices belonging to the same cell as that center point form the detection frame corresponding to that center point; and
    determining the distances between identification points corresponding to different detection frames comprises:
    for any cell vertex, determining at least two cell center points sharing that vertex.
  7. The method according to claim 6, wherein determining the splicing relationships of the multiple detection frames according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different objects comprises:
    determining the at least two detection frames corresponding to the at least two cell center points; and
    determining that the at least two detection frames have a splicing relationship;
    and splicing the multiple detection frames according to the splicing relationships comprises:
    updating the vertex positions corresponding to that cell vertex in the at least two detection frames to the coordinates of that cell vertex.
  8. An image detection method, comprising:
    receiving a request from a user device to invoke an image detection service interface, the request including an image to be detected that contains multiple objects; and
    performing the following steps by using processing resources corresponding to the image detection service interface:
    identifying, in the image, multiple identification points corresponding to the multiple objects;
    determining, in the image and according to the multiple identification points, multiple detection frames corresponding to the multiple objects; and
    determining the association relationships of the multiple objects according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different objects.
  9. An image detection method, comprising:
    acquiring a bill image containing a table area, the table area containing multiple cells;
    identifying, in the bill image, multiple identification points corresponding to the multiple cells;
    determining, in the bill image and according to the multiple identification points, multiple detection frames corresponding to the multiple cells;
    determining, according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, the row and column information of the multiple cells in an editable table file; and
    generating the editable table file according to the row and column information.
  10. The method according to claim 9, wherein determining, according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, the row and column information of the multiple cells in the editable table file comprises:
    determining the splicing relationships of the multiple detection frames according to the correspondence between the multiple detection frames and the multiple identification points and the distances between identification points corresponding to different detection frames, wherein, if at least two detection frames have a splicing relationship, it indicates that the cells corresponding to the at least two detection frames are positionally adjacent;
    splicing the multiple detection frames according to the splicing relationships to obtain the vertex positions of the spliced detection frames; and
    determining, according to the vertex positions of the spliced detection frames, the row and column information of the multiple cells in the editable table file.
  11. An electronic device, comprising a memory and a processor, wherein the memory stores executable code, and the executable code, when executed by the processor, causes the processor to perform the image detection method according to any one of claims 1 to 7.
  12. A non-transitory machine-readable storage medium storing executable code, wherein the executable code, when executed by a processor of an electronic device, causes the processor to perform the image detection method according to any one of claims 1 to 7.
PCT/CN2022/094684 2021-05-25 2022-05-24 Image detection method, device, and storage medium WO2022247823A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110573876.5 2021-05-25
CN202110573876.5A CN115393837A (zh) Image detection method, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022247823A1 true WO2022247823A1 (zh) 2022-12-01

Family

ID=84113988

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094684 WO2022247823A1 (zh) 2021-05-25 2022-05-24 图像检测方法、设备和存储介质

Country Status (2)

Country Link
CN (1) CN115393837A (zh)
WO (1) WO2022247823A1 (zh)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476210A (zh) * 2020-05-11 2020-07-31 上海西井信息科技有限公司 Image-based text recognition method, system, device, and storage medium
CN112149663A (zh) * 2020-08-28 2020-12-29 北京来也网络科技有限公司 Method, apparatus, and electronic device for extracting image text combining RPA and AI
CN112633118A (zh) * 2020-12-18 2021-04-09 上海眼控科技股份有限公司 Text information extraction method, device, and storage medium
CN112287916A (zh) * 2020-12-28 2021-01-29 平安国际智慧城市科技股份有限公司 Method, apparatus, device, and medium for extracting text from video graphic-text courseware

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640401A (zh) * 2022-12-07 2023-01-24 恒生电子股份有限公司 Text content extraction method and apparatus
CN115640401B (zh) * 2022-12-07 2023-04-07 恒生电子股份有限公司 Text content extraction method and apparatus
CN116503888A (zh) * 2023-06-29 2023-07-28 杭州同花顺数据开发有限公司 Method, system, and storage medium for extracting a table from an image
CN116503888B (zh) * 2023-06-29 2023-09-05 杭州同花顺数据开发有限公司 Method, system, and storage medium for extracting a table from an image
CN117558392A (zh) * 2024-01-12 2024-02-13 富纳德科技(北京)有限公司 Electronic medical record sharing and collaboration method and system
CN117558392B (zh) * 2024-01-12 2024-04-05 富纳德科技(北京)有限公司 Electronic medical record sharing and collaboration method and system

Also Published As

Publication number Publication date
CN115393837A (zh) 2022-11-25

Similar Documents

Publication Publication Date Title
WO2022247823A1 (zh) Image detection method, device, and storage medium
KR102266529B1 (ko) Image-based data processing method, apparatus, device, and readable storage medium
US9685095B2 (en) Systems and methods for assessment administration and evaluation
CN112528963A (zh) Intelligent arithmetic-problem grading system based on MixNet-YOLOv3 and the convolutional recurrent neural network CRNN
CN108229485B (zh) Method and apparatus for testing a user interface
CN111738041A (zh) Video segmentation method, apparatus, device, and medium
CN109740515B (zh) Grading method and apparatus
CN111091538A (zh) Method and apparatus for automatic pipeline weld recognition and defect detection
CN109815955A (zh) Question assistance method and system
US11341319B2 (en) Visual data mapping
CN111652232A (zh) Bill recognition method and apparatus, electronic device, and computer-readable storage medium
CN111126486A (zh) Test statistics method, apparatus, device, and storage medium
US20220147769A1 (en) Systems and Methods for Artificial Facial Image Generation Conditioned On Demographic Information
CN112381099A (zh) Question entry system based on digital education resources
CN113673500A (zh) Certificate image recognition method, apparatus, electronic device, and storage medium
CN114663904A (zh) PDF document layout detection method, apparatus, device, and medium
Vargas Munoz et al. Deploying machine learning to assist digital humanitarians: making image annotation in OpenStreetMap more efficient
CN113822847A (zh) Artificial-intelligence-based image scoring method, apparatus, device, and storage medium
US10817582B2 (en) Systems and methods for providing concomitant augmentation via learning interstitials for books using a publishing platform
CN112925470B (zh) Touch control method, system, and readable medium for an interactive electronic whiteboard
CN114049631A (zh) Data annotation method, apparatus, computer device, and storage medium
US11386263B2 (en) Automatic generation of form application
US20200364034A1 (en) System and Method for Automated Code Development and Construction
Manasa Devi et al. Automated text detection from big data scene videos in higher education: a practical approach for MOOCs case study
CN115631374A (zh) Control operation method, training method for a control detection model, apparatus, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22810548

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE