CN112115865A - Method, apparatus, device and storage medium for processing image - Google Patents


Info

Publication number
CN112115865A
Authority
CN
China
Prior art keywords
sub
coordinate
feature vector
determining
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010987108.XA
Other languages
Chinese (zh)
Other versions
CN112115865B (en)
Inventor
曲福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010987108.XA priority Critical patent/CN112115865B/en
Publication of CN112115865A publication Critical patent/CN112115865A/en
Application granted granted Critical
Publication of CN112115865B publication Critical patent/CN112115865B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for processing images, and relates to the fields of image processing, cloud computing, deep learning and natural language processing. The specific implementation scheme is as follows: acquiring a target image; recognizing text information in the target image; determining a semantic feature vector and a visual feature vector corresponding to the text information; determining a table area in the target image according to the semantic feature vector, the visual feature vector and a pre-trained binary classification model, wherein the pre-trained binary classification model is used for judging whether the text information is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the text information; and outputting information of the table area. In this implementation, the table area is determined according to both the semantic feature vector and the visual feature vector, so that the table area can be detected more accurately and the method has wider applicability.

Description

Method, apparatus, device and storage medium for processing image
Technical Field
The present application relates to the field of image processing, in particular to the fields of cloud computing, deep learning and natural language processing, and more particularly to a method, an apparatus, a device and a storage medium for processing an image.
Background
With the continuous progress of artificial intelligence technology, artificial intelligence is increasingly used to perform intelligent analysis on image documents. Artificial intelligence can correct the orientation and skew of an image, analyze its layout, recognize its content, and so on. These capabilities can greatly assist the staff involved in entering and checking image documents, and greatly improve the degree of intelligence of various business processes.
Detecting the table areas in document images that contain tables is the basis of many intelligent table applications. At present, the accuracy of table area detection in document images is not high, and the detection effect is not ideal.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for processing an image.
According to an aspect of the present disclosure, there is provided a method for processing an image, including: acquiring a target image; recognizing text information in the target image; determining a semantic feature vector and a visual feature vector corresponding to the text information; determining a table area in the target image according to the semantic feature vector, the visual feature vector and a pre-trained binary classification model, wherein the pre-trained binary classification model is used for judging whether the text information is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the text information; and outputting information of the table area.
According to another aspect of the present disclosure, there is provided an apparatus for processing an image, including: an acquisition unit configured to acquire a target image; a recognition unit configured to recognize text information in the target image; a feature vector determining unit configured to determine a semantic feature vector and a visual feature vector corresponding to the text information; a table area determining unit configured to determine a table area in the target image according to the semantic feature vector, the visual feature vector and a pre-trained binary classification model, wherein the pre-trained binary classification model is used for judging whether the text information is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the text information; and an output unit configured to output information of the table area.
According to still another aspect of the present disclosure, there is provided an electronic device for processing an image, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for processing images as described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method for processing an image as described above.
According to the technology of the application, the problem that the accuracy of table area detection in document images is not high is solved. The table area is determined according to the semantic feature vector and the visual feature vector, so that the table area can be detected more accurately, and the method has wider applicability.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for processing an image according to the present application;
FIG. 3 is a schematic illustration of an application scenario of a method for processing an image according to the present application;
FIG. 4 is a flow diagram of another embodiment of a method for processing an image according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for processing images according to the present application;
FIG. 6 is a block diagram of an electronic device for processing images used to implement an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments in the present application and the features of the embodiments may be combined with each other in case of no conflict. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for processing images or the apparatus for processing images of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a camera 101, a scanner 102, a network 103, and a terminal device 104. The network 103 is used to provide a medium for communication links between the camera 101, the scanner 102, and the terminal device 104. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The camera 101 and the scanner 102 may interact with the terminal device 104 over the network 103 to receive or send messages. The camera 101 and the scanner 102 may capture images and transmit the captured images to the terminal device 104, or may store them locally.
The terminal device 104 may acquire the captured image from the camera 101 or the scanner 102, process the image, and output the processing result to identify information of the table area in the captured image. Various communication client applications, such as an image processing application, may be installed on the terminal device 104.
The terminal device 104 may be hardware or software. When the terminal device 104 is hardware, it may be any of various electronic devices, including but not limited to a smart phone, a tablet computer, an e-book reader, a car computer, a laptop computer, a desktop computer, and the like. When the terminal device 104 is software, it can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not specifically limited herein.
It should be noted that the method for processing an image provided by the embodiment of the present application may be executed by the terminal device 104. Accordingly, means for processing an image may be provided in the terminal device 104.
It should be understood that the number of cameras, scanners, networks, and terminal devices in fig. 1 is merely illustrative. There may be any number of cameras, scanners, networks, and terminal devices, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for processing an image according to the present application is shown. The method for processing the image of the embodiment comprises the following steps:
Step 201, acquiring a target image.
In this embodiment, the execution subject of the method for processing an image (for example, the terminal device 104 in fig. 1) may acquire a captured target image from a scanner, a digital camera, a mobile terminal with a camera, or the like, through a wired or wireless connection. The target image may be a scanned company yearbook, a slide image, or the like, and may contain tables, text, and so on. The tables and text in the target image may be white, black, red, yellow, blue, and the like, which is not specifically limited in this application. The background color of the target image is different from the colors of the tables and text in it.
Step 202, recognizing the text information in the target image.
After acquiring the target image, the execution subject can recognize the text information in the target image. Specifically, the execution subject may recognize the text information in the target image by an Optical Character Recognition (OCR) technique and convert the text information in the target image into editable text. The text information in the target image includes text content and text positions. The text content may be a title, a specific textual description, and the like, which is not specifically limited in this application. The text position may be the relative distance between pieces of text, the distance between each piece of text and the page border, and the like, which is likewise not specifically limited in this application. For example, the recognized text content may be: "Maoyan Movies", "Hualian Cinema (Huilongguan branch)", "Green Book (original 2D)", "No. 1 laser hall", and the text positions corresponding to the above text content are, respectively: width 352, height 83, left margin 588, top margin 315; width 355, height 55, left margin 603, top margin 458; width 382, height 54, left margin 607, top margin 513; width 215, height 59, left margin 615, top margin 582.
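For illustration, the following minimal sketch models each recognized line of the example above as text content plus a bounding box; the TextLine class and its field names are assumptions introduced here, not part of the original disclosure.

```python
from dataclasses import dataclass

@dataclass
class TextLine:
    """One line of OCR output: editable text plus its position on the page."""
    content: str   # recognized text content
    width: int     # bounding-box width
    height: int    # bounding-box height
    left: int      # distance from the left page border
    top: int       # distance from the top page border

# The example values given in the description above.
ocr_result = [
    TextLine("Maoyan Movies", 352, 83, 588, 315),
    TextLine("Hualian Cinema (Huilongguan branch)", 355, 55, 603, 458),
    TextLine("Green Book (original 2D)", 382, 54, 607, 513),
    TextLine("No. 1 laser hall", 215, 59, 615, 582),
]
```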
Step 203, determining the semantic feature vector and the visual feature vector corresponding to the text information.
After the execution subject recognizes the text information in the target image, it can determine the semantic feature vector and the visual feature vector corresponding to the text information. Specifically, the execution subject can extract semantic features of the text information through a pre-trained Word2vec model to obtain a feature vector representing the semantics, i.e., a semantic feature vector. The text information may be composed of a plurality of words, and the Word2vec model can be used to map each word to a vector, called a word vector, which can be used to represent relationships between words; word vectors have good semantic properties and are a common way to represent word features. The value of each dimension of a word vector represents a feature with a certain semantic and grammatical interpretation, capturing useful syntactic and semantic properties, so each dimension of a word vector may be referred to as a word feature. The word vector distributes the different syntactic and semantic features of a word across its dimensions. The execution subject can also extract the visual feature vector of the text information through a pre-trained neural network model. For example, the visual feature vector of the text information may be a feature vector corresponding to the position of each word in the text information, such as the three-dimensional position coordinates corresponding to each word, which is not specifically limited in this application.
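A minimal sketch of this step follows, assuming the TextLine structure above: the semantic feature is taken as the average of pre-trained Word2vec word vectors, and the visual feature as the bounding box normalized by the page size, as a simple stand-in for the pre-trained visual network. The function names, the averaging scheme and the normalization are assumptions for illustration only.

```python
import numpy as np

def semantic_feature(line: TextLine, word_vectors: dict, dim: int = 100) -> np.ndarray:
    """Average the pre-trained Word2vec vectors of the words in one text line.
    `word_vectors` maps each word to its embedding (a length-`dim` array)."""
    vecs = [word_vectors[w] for w in line.content.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def visual_feature(line: TextLine, page_w: int, page_h: int) -> np.ndarray:
    """Position-derived feature: the line's bounding box normalized by page size."""
    return np.array([line.left / page_w, line.top / page_h,
                     line.width / page_w, line.height / page_h])
```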
Step 204, determining a table area in the target image according to the semantic feature vector, the visual feature vector and the pre-trained binary classification model.
After determining the semantic feature vector and the visual feature vector corresponding to the text information, the execution subject can determine the table area in the target image according to the semantic feature vector, the visual feature vector and the pre-trained binary classification model. Specifically, the pre-trained binary classification model is used for judging whether the text information is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the text information. Pre-training the binary classification model may comprise the following steps: taking each semantic feature vector and each visual feature vector as input vectors of the binary classification model, taking whether the text information corresponding to each semantic feature vector and visual feature vector is in a table area as the predicted output value of the binary classification model, and training and optimizing the binary classification model to obtain a binary classification model with classification capability. The execution subject can input the semantic feature vector and the visual feature vector into the binary classification model with classification capability to obtain a predicted value indicating whether the corresponding text information is in the table area, and thereby determine the table area in the target image according to the text information whose predicted value indicates that it lies in the table area. As for the predicted value, the output of the binary classification model for text information inside/outside the table area may, for example, be 1/0, 1/-1, 2/-2, and so on, which is not specifically limited in this application.
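The disclosure does not fix a particular classifier architecture, so the following sketch uses an off-the-shelf logistic regression over the concatenated semantic and visual features with 1/0 labels, purely as one possible instantiation; the function names and the choice of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_table_classifier(semantic_vecs, visual_vecs, in_table_labels):
    """Train a binary classification model whose input is the pair of feature
    vectors and whose output predicts whether the text is inside a table area."""
    X = np.hstack([np.asarray(semantic_vecs), np.asarray(visual_vecs)])
    y = np.asarray(in_table_labels)           # 1 = inside table area, 0 = outside
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

def predict_in_table(clf, semantic_vec, visual_vec) -> int:
    """Predicted value (1/0) for one piece of text information."""
    x = np.hstack([semantic_vec, visual_vec]).reshape(1, -1)
    return int(clf.predict(x)[0])
```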
Step 205, outputting the information of the table area.
After determining the table area in the target image, the execution subject may mark the table area in the recognized text information and output information of the table area through the display screen.
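As one way of outputting the information of the table area (the description only says the area is marked and shown through the display screen), the sketch below draws the detected region on the image with Pillow; the (left, top, right, bottom) box format and the red outline are assumptions.

```python
from PIL import Image, ImageDraw

def mark_table_area(image_path: str, table_box: tuple, out_path: str) -> None:
    """Draw the detected table area on the target image and save the result.
    `table_box` is (left, top, right, bottom) in pixel coordinates."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(table_box, outline=(255, 0, 0), width=3)
    img.save(out_path)
```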
With continued reference to fig. 3, there is shown a schematic illustration of one application scenario of the method for processing an image according to the present application. In the application scenario of fig. 3, a camera 301 captures a target image 302, a computer 304 acquires the target image 302 and identifies text information 305 in the target image 302, and the computer 304 determines semantic feature vectors and visual feature vectors corresponding to the text information 305. The computer 304 determines a table area in the target image 302 according to the semantic feature vector and the visual feature vector and a pre-trained two-classification model in the computer 304, wherein the pre-trained two-classification model is used for judging whether the text information 305 is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the text information 305; the information 306 of the table area is output.
In this embodiment, the table area is determined according to both the semantic feature vector and the visual feature vector, so that the table area can be detected more accurately and the method has wider applicability.
With continued reference to FIG. 4, a flow 400 of another embodiment of a method for processing an image according to the present application is shown. As shown in fig. 4, the method for processing an image of the present embodiment may include the following steps:
Step 401, acquiring a target image.
Step 402, recognizing the text information in the target image.
Step 403, determining the semantic feature vector and the visual feature vector corresponding to the text information.
Step 404, determining a table area in the target image according to the semantic feature vector, the visual feature vector and the pre-trained binary classification model.
The principle of step 401 to step 404 is similar to that of step 201 to step 204, and is not described here again.
Specifically, step 404 can be implemented by steps 4041 to 4043 as follows:
In this embodiment, the text information includes at least one piece of sub-text information recognized as being on the same line.
Step 4041, for each piece of sub-text information, combining the semantic feature vector and the visual feature vector corresponding to the piece of sub-text information to obtain a combined feature vector.
After determining the semantic feature vector and the visual feature vector corresponding to the text information, the execution subject may concatenate, for each piece of sub-text information, the corresponding semantic feature vector and visual feature vector to obtain a combined feature vector. The order in which the semantic feature vector and the visual feature vector are spliced is not specifically limited.
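A one-line sketch of the splicing, assuming the feature functions above; the helper name is hypothetical.

```python
import numpy as np

def combine_features(semantic_vec: np.ndarray, visual_vec: np.ndarray) -> np.ndarray:
    """Concatenate the semantic and visual feature vectors of one line of
    sub-text information; the description leaves the splicing order open."""
    return np.concatenate([semantic_vec, visual_vec])
```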
Step 4042, determining the identifier corresponding to each combined feature vector according to each combined feature vector and the pre-trained binary model.
In this embodiment, each combined feature vector is input into the pre-trained binary classification model, which outputs a predicted value classifying that combined feature vector; the predicted value may be represented by a classification identifier. The pre-trained binary classification model is thus used to characterize the correspondence between feature vectors and identifiers.
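The sketch below runs the hypothetical classifier from the earlier sketch over every combined feature vector and records one identifier per line; the 1/0 encoding of the first and second identifiers is only one of the encodings the description allows.

```python
import numpy as np

FIRST_ID, SECOND_ID = 1, 0   # first identifier = table area, second = non-table area

def label_lines(clf, combined_vecs):
    """Return the identifier predicted for each line of sub-text information."""
    X = np.vstack(combined_vecs)
    return [FIRST_ID if p == 1 else SECOND_ID for p in clf.predict(X)]
```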
Step 4043, determine a table region in the target image according to the target image, the sub-text information and the identifiers.
After obtaining the identifier corresponding to each combined feature vector, the execution subject may determine the table area in the target image according to the target image, each piece of sub-text information and each identifier. Specifically, the execution subject may determine the sub-text information corresponding to the combined feature vectors whose identifiers indicate a table area, and then determine the table area in the target image according to that sub-text information and its positions and relative relationships in the target image.
In this embodiment, the semantic feature vector and the visual feature vector of the text information are combined to obtain a combined feature vector, so that when the text information is classified with the pre-trained binary classification model according to the combined feature vector, the classification result is more accurate, the classification process is more efficient, and the method is suitable for various scenarios.
In some optional implementations of this embodiment, when there is only one table in the image, the identifier includes a first identifier; and step 4043 may include the following steps: aggregating each piece of sub-text information corresponding to the first identifier; and determining the labeling box corresponding to the aggregated sub-text information as the table area.
In particular, the first identifier may be used to indicate a table area. When only one table exists in the image, the execution subject can aggregate the sub-text information corresponding to the first identifier indicating the table area, and determine the labeling box corresponding to the aggregated sub-text information as the table area. Specifically, the labeling box corresponding to the aggregated sub-text information can be determined as the area enclosed by the minimum circumscribed rectangle of the aggregated sub-text information, for example by calculating the coordinates of the four vertices of that minimum circumscribed rectangle.
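A minimal sketch of the minimum circumscribed rectangle, assuming the TextLine boxes introduced earlier; the helper name and the (left, top, right, bottom) return format are assumptions.

```python
def enclosing_box(lines):
    """Minimum circumscribed rectangle (left, top, right, bottom) of the
    aggregated sub-text information labeled with the first identifier."""
    left   = min(l.left for l in lines)
    top    = min(l.top for l in lines)
    right  = max(l.left + l.width for l in lines)
    bottom = max(l.top + l.height for l in lines)
    return (left, top, right, bottom)
```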
In this implementation, when only one table exists in the image, the table area is determined by aggregating the text information corresponding to the identifier indicating the table area and using the labeling box corresponding to the aggregated sub-text information, so that the determination of the table area is more efficient and accurate and has wide applicability, and the method is suitable for detecting both ruled (bordered) and borderless tables.
Specifically, when there are a plurality of tables in the image, step 4043 may be implemented by steps 40431 to 40433 as follows:
In this embodiment, the identifier includes a first identifier and a second identifier, and the visual feature vector includes a plurality of sub-visual feature vectors corresponding to the sub-text information.
Step 40431, determining first location information of each sub-text information corresponding to the first identifier according to each sub-visual feature vector corresponding to each sub-text information corresponding to the first identifier.
After distinguishing the first identifier and the second identifier, the execution subject may determine the first position information of each piece of sub-text information corresponding to the first identifier according to the sub-visual feature vectors corresponding to that sub-text information. Specifically, the first identifier is used to indicate the table area, and each sub-visual feature vector is used to indicate the position of the corresponding sub-text information; the position may be represented by the coordinates of that sub-text information. By determining the first position information of each piece of sub-text information corresponding to the first identifier, the positions, i.e. the coordinates, of the pieces of sub-text information located in the table area can be obtained. The first position information may therefore be the coordinates of the sub-text information located in the table area.
Step 40432, according to the sub visual feature vectors corresponding to the sub-text information corresponding to the second identifier, determining second position information of the sub-text information corresponding to the second identifier.
Specifically, the second identifier is used to indicate a non-table area, and each sub-visual feature vector is used to indicate the position of the corresponding sub-text information; the position may be represented by the coordinates of that sub-text information. By determining the second position information of each piece of sub-text information corresponding to the second identifier, the positions, i.e. the coordinates, of the pieces of sub-text information located in the non-table area can be obtained. The second position information may therefore be the coordinates of the sub-text information located in the non-table area.
Step 40433 determines a table region in the target image according to the target image, the first position information, and the second position information.
After obtaining the first position information and the second position information, the execution subject may determine the table area in the target image according to the target image, the first position information and the second position information. Specifically, the execution subject may determine the positions and the number of the table areas in the target image by determining the relative positional relationship between the first position information and the second position information, that is, whether the sub-text information in the table area and that in the non-table area interleave.
In this embodiment, whether the sub-text information in the table area and that in the non-table area interleave is determined from the position information of the sub-text information corresponding to the table area and to the non-table area in the target image, so that the table area in the target image can be determined more accurately; the method is applicable to both ruled and borderless tables and has a wide application range.
Specifically, step 40433 may be implemented by steps 404331 to 404332, and steps 404333 to 404334 as follows:
In this embodiment, the first position information includes a first coordinate and a second coordinate, and the second position information includes a third coordinate and a fourth coordinate, wherein the second coordinate is higher than the first coordinate, and the fourth coordinate is higher than the third coordinate.
Step 404331, in response to determining that the fourth coordinate is lower than the first coordinate or the third coordinate is higher than the second coordinate, aggregating each piece of sub-text information corresponding to the first identifier.
Step 404332, determining the labeling box corresponding to the aggregated sub-text information as the table area.
After determining the first coordinate, the second coordinate, the third coordinate and the fourth coordinate, the execution subject may determine their relative positional relationship in order to determine whether the sub-text information in the table area and that in the non-table area interleave, so as to accurately distinguish the table area and the non-table area in the target image. The execution subject aggregates the sub-text information corresponding to the first identifier in response to determining that the fourth coordinate is lower than the first coordinate or the third coordinate is higher than the second coordinate. That is, when the fourth coordinate is lower than the first coordinate or the third coordinate is higher than the second coordinate, the sub-text information in the table area and that in the non-table area do not interleave, and the area corresponding to the text information obtained by aggregating the sub-text information corresponding to the first identifier is the table area. Aggregating the sub-text information corresponding to the first identifier does not mean changing its original coordinates so that it is physically gathered together; rather, the area where the sub-text information corresponding to the first identifier is located is classified as the table area based on its original coordinates.
In this implementation, the table area in the target image can be determined accurately and quickly by comparing the lowest and highest coordinates of the sub-text information in the table area with the lowest and highest coordinates of the sub-text information in the non-table area, and the application range is wide.
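A sketch of this non-interleaving check follows. It assumes each line is reduced to a single vertical coordinate and that larger values mean higher on the page; both the axis convention and the function name are assumptions introduced here.

```python
def table_text_not_interleaved(table_ys, nontable_ys) -> bool:
    """Steps 404331-404332: if the non-table text lies entirely below or entirely
    above the table text, the table-labeled lines can be aggregated into one table.
    first/second coordinate = min/max of table_ys; third/fourth = min/max of nontable_ys."""
    first, second = min(table_ys), max(table_ys)
    third, fourth = min(nontable_ys), max(nontable_ys)
    return fourth < first or third > second
```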
Step 404333, in response to determining that the third coordinate is higher than the first coordinate and the fourth coordinate is lower than the second coordinate, aggregating each sub-text information corresponding to the first identifier between the first coordinate and the third coordinate to obtain a first aggregated sub-text information; and aggregating all sub-text information corresponding to the first identification between the fourth coordinate and the second coordinate to obtain second aggregated sub-text information.
After determining the first coordinate, the second coordinate, the third coordinate and the fourth coordinate, the execution subject may determine their relative positional relationship in order to determine whether the sub-text information in the table area and that in the non-table area interleave, so as to accurately distinguish the table areas and the non-table area in the target image. When the third coordinate is higher than the first coordinate and the fourth coordinate is lower than the second coordinate, the sub-text information of the table area interleaves with the sub-text information of the non-table area, indicating that at least two tables exist in the target image. Taking the case where two tables exist in the target image as an example, when the third coordinate is higher than the first coordinate and the fourth coordinate is lower than the second coordinate, the pieces of sub-text information between the first coordinate and the third coordinate are located in a table area, and the pieces of sub-text information between the fourth coordinate and the second coordinate are located in a table area. In this case, the sub-text information corresponding to the first identifier between the first coordinate and the third coordinate is aggregated to obtain the first aggregated sub-text information located in the first table area, and the sub-text information corresponding to the first identifier between the fourth coordinate and the second coordinate is aggregated to obtain the second aggregated sub-text information located in the second table area.
Step 404334, determining the labeling box corresponding to the first aggregated sub-text information and the labeling box corresponding to the second aggregated sub-text information as table areas.
After obtaining the first aggregated sub-text information located in the first table area and the second aggregated sub-text information located in the second table area, the execution subject may determine the labeling box corresponding to the first aggregated sub-text information and the labeling box corresponding to the second aggregated sub-text information as table areas. Specifically, the labeling box corresponding to the first aggregated sub-text information may be determined by calculating the coordinates of the four vertices of the circumscribed rectangle of the area where the first aggregated sub-text information is located, and the labeling box corresponding to the second aggregated sub-text information may be determined by calculating the coordinates of the four vertices of the circumscribed rectangle of the area where the second aggregated sub-text information is located.
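The interleaved (two-table) case can be sketched as below, reusing the hypothetical enclosing_box helper and the same vertical-axis assumption as the previous sketch; the grouping by coordinate ranges follows steps 404333 to 404334, with all names being assumptions.

```python
def split_into_two_tables(table_lines, nontable_lines, y_of):
    """If the non-table text sits between two groups of table text, aggregate the
    table-labeled lines on each side separately and return one labeling box per group.
    `y_of(line)` returns the line's vertical coordinate (larger = higher on the page)."""
    first  = min(y_of(l) for l in table_lines)      # lowest table coordinate
    second = max(y_of(l) for l in table_lines)      # highest table coordinate
    third  = min(y_of(l) for l in nontable_lines)   # lowest non-table coordinate
    fourth = max(y_of(l) for l in nontable_lines)   # highest non-table coordinate
    if not (third > first and fourth < second):
        return None                                 # not the interleaved two-table case
    lower_group = [l for l in table_lines if first <= y_of(l) <= third]
    upper_group = [l for l in table_lines if fourth <= y_of(l) <= second]
    return enclosing_box(lower_group), enclosing_box(upper_group)
```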
In this embodiment, the table areas in the target image can be determined accurately and quickly by comparing the lowest and highest coordinates of the sub-text information in the table areas with the lowest and highest coordinates of the sub-text information in the non-table area; the method is also applicable to the case of multiple tables and has a wide application range.
In step 405, the information of the table area is output.
The principle of step 405 is similar to that of step 205, and is not described here again.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for processing an image, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the apparatus 500 for processing an image of the present embodiment includes: an acquisition unit 501, a recognition unit 502, a feature vector determination unit 503, a table area determination unit 504, and an output unit 505.
An acquisition unit 501 configured to acquire a target image.
The recognition unit 502 is configured to recognize the text information in the target image.
The feature vector determination unit 503 is configured to determine a semantic feature vector and a visual feature vector corresponding to the text information.
A table area determining unit 504 configured to determine a table area in the target image according to the semantic feature vector, the visual feature vector, and a pre-trained binary model, wherein the pre-trained binary model is used to determine whether the text information is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the text information.
An output unit 505 configured to output information of the table area.
In some optional implementations of this embodiment, the text information includes at least one piece of sub-text information recognized as being on the same line; and the table area determination unit 504 is further configured to: for each piece of sub-text information, combine the semantic feature vector and the visual feature vector corresponding to that sub-text information to obtain a combined feature vector; determine the identifier corresponding to each combined feature vector according to each combined feature vector and the pre-trained binary classification model, wherein the pre-trained binary classification model is used for characterizing the correspondence between feature vectors and identifiers; and determine the table area in the target image according to the target image, each piece of sub-text information and each identifier.
In some optional implementations of this embodiment, the identifier includes a first identifier; and the table area determination unit 504 is further configured to: aggregate each piece of sub-text information corresponding to the first identifier; and determine the labeling box corresponding to the aggregated sub-text information as the table area.
In some optional implementations of this embodiment, the identifier includes a first identifier and a second identifier, and the visual feature vector includes a plurality of sub-visual feature vectors corresponding to the sub-text information; and the table area determination unit 504 is further configured to: determine the first position information of each piece of sub-text information corresponding to the first identifier according to the sub-visual feature vectors corresponding to that sub-text information; determine the second position information of each piece of sub-text information corresponding to the second identifier according to the sub-visual feature vectors corresponding to that sub-text information; and determine the table area in the target image according to the target image, the first position information and the second position information.
In some optional implementations of this embodiment, the first position information includes a first coordinate and a second coordinate, and the second position information includes a third coordinate and a fourth coordinate, wherein the second coordinate is higher than the first coordinate, and the fourth coordinate is higher than the third coordinate; and the table area determination unit 504 is further configured to: in response to determining that the fourth coordinate is lower than the first coordinate or the third coordinate is higher than the second coordinate, aggregate each piece of sub-text information corresponding to the first identifier; and determine the labeling box corresponding to the aggregated sub-text information as the table area.
In some optional implementations of this embodiment, the table area determination unit 504 is further configured to: in response to determining that the third coordinate is higher than the first coordinate and the fourth coordinate is lower than the second coordinate, aggregate the sub-text information corresponding to the first identifier between the first coordinate and the third coordinate to obtain first aggregated sub-text information, and aggregate the sub-text information corresponding to the first identifier between the fourth coordinate and the second coordinate to obtain second aggregated sub-text information; and determine the labeling box corresponding to the first aggregated sub-text information and the labeling box corresponding to the second aggregated sub-text information as table areas.
It should be understood that units 501 to 505, which are described in the apparatus 500 for processing an image, correspond to respective steps in the method described with reference to fig. 2. Thus, the operations and features described above for the method for processing an image are equally applicable to the apparatus 500 and the units included therein and will not be described in detail here.
According to an embodiment of the present application, an electronic device and a readable storage medium for processing an image are also provided.
As shown in fig. 6, is a block diagram of an electronic device for processing an image according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses 605 and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses 605 may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for processing images provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method for processing an image provided by the present application.
The memory 602, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and units, such as program instructions/units corresponding to the method for processing an image in the embodiment of the present application (for example, the acquisition unit 501, the recognition unit 502, the feature vector determination unit 503, the table area determination unit 504, and the output unit 505 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the method for processing an image in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of an electronic device for processing an image, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, which may be connected to an electronic device for processing images via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method of processing an image may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus 605 or other means; connection by the bus 605 is taken as an example in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic apparatus for processing images, such as an input device like a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, etc. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, the table area is determined according to the semantic feature vector and the visual feature vector, so that the table area can be detected more accurately and the method has wider applicability.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this regard as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method for processing an image, comprising:
acquiring a target image;
identifying character information in the target image;
determining semantic feature vectors and visual feature vectors corresponding to the text information;
determining a table area in the target image according to the semantic feature vector, the visual feature vector and a pre-trained binary classification model, wherein the pre-trained binary classification model is used for judging whether the character information is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the character information;
and outputting the information of the table area.
2. The method of claim 1, wherein the textual information includes at least one sub-textual information identified as a same line; and
determining a table region in the target image according to the semantic feature vector, the visual feature vector and a pre-trained binary classification model, including:
for each piece of sub-character information, combining the semantic feature vector and the visual feature vector corresponding to the sub-character information to obtain a combined feature vector;
determining identifiers corresponding to the combined feature vectors according to the combined feature vectors and the pre-trained binary classification model, wherein the pre-trained binary classification model is used for representing the corresponding relation between the feature vectors and the identifiers;
and determining a table area in the target image according to the target image, each piece of sub-character information and each identification.
3. The method of claim 2, wherein the identification comprises a first identification; and
determining a table area in the target image according to the target image, each piece of sub-character information and the identifier, including:
aggregating each sub-text information corresponding to the first identification;
and determining a marking box corresponding to each piece of sub-text information after aggregation as a table area.
4. The method of claim 2, wherein the identifier comprises a first identifier and a second identifier, and the visual feature vector comprises a plurality of sub-visual feature vectors corresponding to the sub-text information; and
determining a table area in the target image according to the target image, each piece of sub-character information and the identifier, including:
determining first position information of each sub-text information corresponding to the first identifier according to each sub-visual feature vector corresponding to each sub-text information corresponding to the first identifier;
determining second position information of each sub-text information corresponding to the second identifier according to each sub-visual feature vector corresponding to each sub-text information corresponding to the second identifier;
and determining a table area in the target image according to the target image, the first position information and the second position information.
5. The method of claim 4, wherein the first location information comprises a first coordinate and a second coordinate, the second location information comprises a third coordinate and a fourth coordinate, wherein the second coordinate is higher than the first coordinate, and the fourth coordinate is higher than the third coordinate; and
determining a table region in the target image according to the target image, the first position information, and the second position information includes:
in response to determining that the fourth coordinate is below the first coordinate or the third coordinate is above the second coordinate, aggregating respective sub-text information corresponding to the first identifier;
and determining the marking box corresponding to the aggregated sub-text information as a table area.
6. The method of claim 5, wherein the determining a table area in the target image according to the target image, the first position information and the second position information comprises:
in response to determining that the third coordinate is higher than the first coordinate and the fourth coordinate is lower than the second coordinate, aggregating the sub-text information corresponding to the first identifier between the first coordinate and the third coordinate to obtain first aggregated sub-text information; aggregating the sub-text information corresponding to the first identifier between the fourth coordinate and the second coordinate to obtain second aggregated sub-text information;
and determining the labeling box corresponding to the first aggregated sub-text information and the labeling box corresponding to the second aggregated sub-text information as table areas.
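Claims 5 and 6 turn those two coordinate pairs into either one table area or two. A self-contained sketch of that decision, using the same convention as the sketch after claim 4 (larger y means higher on the page) and with all names assumed:

```python
def table_areas(first, second, third, fourth, line_boxes, identifiers, table_id=1):
    """first/second: lowest and highest coordinate of the table lines;
    third/fourth: lowest and highest coordinate of the non-table lines."""
    def union(boxes):
        return (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))

    table_boxes = [b for b, i in zip(line_boxes, identifiers) if i == table_id]
    if not table_boxes:
        return []
    if fourth < first or third > second:
        # Non-table lines lie entirely below or entirely above the table lines:
        # aggregate every table line into a single table area (claim 5).
        return [union(table_boxes)]
    if third > first and fourth < second:
        # Non-table lines fall inside the vertical span of the table lines:
        # split into a lower and an upper table area (claim 6).
        lower = [b for b in table_boxes if b[3] <= third]
        upper = [b for b in table_boxes if b[1] >= fourth]
        return [union(group) for group in (lower, upper) if group]
    return [union(table_boxes)]  # overlapping case, treated like claim 5 here
```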
7. An apparatus for processing an image, comprising:
an acquisition unit configured to acquire a target image;
an identifying unit configured to identify text information in the target image;
a feature vector determining unit configured to determine a semantic feature vector and a visual feature vector corresponding to the text information;
a table area determining unit configured to determine a table area in the target image according to the semantic feature vector, the visual feature vector and a pre-trained binary classification model, wherein the pre-trained binary classification model is used for determining whether the text information is located in the table area according to the semantic feature vector and the visual feature vector corresponding to the text information;
an output unit configured to output information of the table area.
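Read as software, the units of claim 7 form a simple pipeline. The wiring below is only a sketch; every class and callable referenced here is a hypothetical placeholder, not the claimed apparatus.

```python
class ImageTableProcessor:
    """Composes the units named in the apparatus claim; each argument may be any
    callable implementing the corresponding unit."""

    def __init__(self, acquire, recognize, featurize, find_table_areas, output):
        self.acquire = acquire                    # acquisition unit
        self.recognize = recognize                # identifying unit (text recognition)
        self.featurize = featurize                # feature vector determining unit
        self.find_table_areas = find_table_areas  # table area determining unit
        self.output = output                      # output unit

    def run(self, source):
        image = self.acquire(source)                           # acquire the target image
        text_lines = self.recognize(image)                     # recognize text information
        semantic, visual = self.featurize(image, text_lines)   # per-line feature vectors
        areas = self.find_table_areas(image, text_lines, semantic, visual)
        self.output(areas)                                     # output table area information
        return areas
```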
8. The apparatus of claim 7, wherein the text information comprises at least one piece of sub-text information recognized as being in the same line; and
the table area determination unit is further configured to:
combine, for each piece of sub-text information, the semantic feature vector and the visual feature vector corresponding to the sub-text information to obtain a combined feature vector;
determine an identifier corresponding to each combined feature vector according to the combined feature vector and the pre-trained binary classification model, wherein the pre-trained binary classification model is used for characterizing the correspondence between feature vectors and identifiers;
and determine a table area in the target image according to the target image, each piece of sub-text information and each identifier.
9. The apparatus of claim 8, wherein the identifier comprises a first identifier; and
the table area determination unit is further configured to:
aggregate each piece of sub-text information corresponding to the first identifier;
and determine the labeling box corresponding to the aggregated sub-text information as the table area.
10. The apparatus of claim 8, wherein the identifier comprises a first identifier and a second identifier, and the visual feature vector comprises a plurality of sub-visual feature vectors corresponding to the sub-text information; and
the table area determination unit is further configured to:
determine first position information of the sub-text information corresponding to the first identifier according to the sub-visual feature vectors corresponding to that sub-text information;
determine second position information of the sub-text information corresponding to the second identifier according to the sub-visual feature vectors corresponding to that sub-text information;
and determine a table area in the target image according to the target image, the first position information and the second position information.
11. The apparatus of claim 10, wherein the first position information comprises a first coordinate and a second coordinate, the second position information comprises a third coordinate and a fourth coordinate, the second coordinate is higher than the first coordinate, and the fourth coordinate is higher than the third coordinate; and
the table area determination unit is further configured to:
in response to determining that the fourth coordinate is lower than the first coordinate or the third coordinate is higher than the second coordinate, aggregate each piece of sub-text information corresponding to the first identifier;
and determine the labeling box corresponding to the aggregated sub-text information as the table area.
12. The apparatus of claim 11, wherein the table region determination unit is further configured to:
in response to determining that the third coordinate is higher than the first coordinate and the fourth coordinate is lower than the second coordinate, aggregate the sub-text information corresponding to the first identifier between the first coordinate and the third coordinate to obtain first aggregated sub-text information; aggregate the sub-text information corresponding to the first identifier between the fourth coordinate and the second coordinate to obtain second aggregated sub-text information;
and determine the labeling box corresponding to the first aggregated sub-text information and the labeling box corresponding to the second aggregated sub-text information as table areas.
13. An electronic device for processing an image, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010987108.XA 2020-09-18 2020-09-18 Method, apparatus, device and storage medium for processing image Active CN112115865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987108.XA CN112115865B (en) 2020-09-18 2020-09-18 Method, apparatus, device and storage medium for processing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987108.XA CN112115865B (en) 2020-09-18 2020-09-18 Method, apparatus, device and storage medium for processing image

Publications (2)

Publication Number Publication Date
CN112115865A true CN112115865A (en) 2020-12-22
CN112115865B CN112115865B (en) 2024-04-12

Family

ID=73801017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987108.XA Active CN112115865B (en) 2020-09-18 2020-09-18 Method, apparatus, device and storage medium for processing image

Country Status (1)

Country Link
CN (1) CN112115865B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150278167A1 (en) * 2014-03-28 2015-10-01 Adobe Systems Incorporated Automatic measure of visual similarity between fonts
CN110390269A (en) * 2019-06-26 2019-10-29 平安科技(深圳)有限公司 PDF document table extracting method, device, equipment and computer readable storage medium
CN110569846A (en) * 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium
CN111382717A (en) * 2020-03-17 2020-07-07 腾讯科技(深圳)有限公司 Table identification method and device and computer readable storage medium
CN111611990A (en) * 2020-05-22 2020-09-01 北京百度网讯科技有限公司 Method and device for identifying table in image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于丰畅; 陆伟: "基于机器视觉的PDF学术文献结构识别" [Structure recognition of PDF academic literature based on machine vision], 情报学报 (Journal of the China Society for Scientific and Technical Information), no. 04, 24 April 2019 (2019-04-24) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612911A (en) * 2020-12-30 2021-04-06 华为技术有限公司 Image processing method, system, device and medium, and program product
CN113033431A (en) * 2021-03-30 2021-06-25 北京百度网讯科技有限公司 Optical character recognition model training and recognition method, device, equipment and medium
CN113033431B (en) * 2021-03-30 2023-08-08 北京百度网讯科技有限公司 Optical character recognition model training and recognition method, device, equipment and medium
CN114218233A (en) * 2022-02-22 2022-03-22 子长科技(北京)有限公司 Annual newspaper processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112115865B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN111860506B (en) Method and device for recognizing characters
CN113378833B (en) Image recognition model training method, image recognition device and electronic equipment
CN112115865A (en) Method, apparatus, device and storage medium for processing image
CN111598164B (en) Method, device, electronic equipment and storage medium for identifying attribute of target object
EP3869397A2 (en) Method, apparatus, device and storage medium for processing image
US20140313216A1 (en) Recognition and Representation of Image Sketches
CN111768381A (en) Part defect detection method and device and electronic equipment
CN104063683A (en) Expression input method and device based on face identification
CN111832403A (en) Document structure recognition method, and model training method and device for document structure recognition
CN112052825B (en) Method, apparatus, device and storage medium for processing image
CN112541359B (en) Document content identification method, device, electronic equipment and medium
CN111611990B (en) Method and device for identifying tables in images
CN112508003B (en) Character recognition processing method and device
CN111783645A (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN111753717A (en) Method, apparatus, device and medium for extracting structured information of text
US20210350173A1 (en) Method and apparatus for evaluating image relative definition, device and medium
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
Akinbade et al. An adaptive thresholding algorithm-based optical character recognition system for information extraction in complex images
CN113378857A (en) Target detection method and device, electronic equipment and storage medium
Salunkhe et al. Recognition of multilingual text from signage boards
CN112926700B (en) Class identification method and device for target image
CN114418124A (en) Method, device, equipment and storage medium for generating graph neural network model
Turk et al. Computer vision for mobile augmented reality
CN112560854A (en) Method, apparatus, device and storage medium for processing image
CN112328088A (en) Image presenting method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant