CN111611990B - Method and device for identifying tables in images - Google Patents

Method and device for identifying tables in images

Info

Publication number
CN111611990B
CN111611990B (application CN202010444345A)
Authority
CN
China
Prior art keywords
field
vector
field value
semantic
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010444345.1A
Other languages
Chinese (zh)
Other versions
CN111611990A (en)
Inventor
黄相凯
李乔伊
刘明浩
秦铎浩
郭江亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010444345.1A priority Critical patent/CN111611990B/en
Publication of CN111611990A publication Critical patent/CN111611990A/en
Application granted granted Critical
Publication of CN111611990B publication Critical patent/CN111611990B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/22 — Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/24 — Pattern recognition; analysing; classification techniques
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/267 — Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application disclose a method and device for identifying a table in an image, applicable to the technical field of image processing. The specific implementation scheme is as follows: acquire a picture to be processed; identify the field names and field values included in the picture; obtain semantic vectors of the field names and of the field values; determine the matching relationship between the field names and the field values based on those semantic vectors and a pre-trained matching model; and generate a table according to the matching relationship between the field names and the field values. This embodiment improves the efficiency of identifying tables in images.

Description

Method and device for identifying tables in images
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to the technical field of image processing.
Background
Tables are a very common document form in daily work, but in many scenarios a table exists only as an image, and converting such picture-form tables into a format that can be stored in a structured way has become a problem to be solved urgently.
The traditional way of structurally storing an image table is mostly manual entry: an operator reads the picture and types its information into a data system, which consumes a large amount of manpower and is highly repetitive. With the development of optical character recognition (OCR) technology, converting image text into textual data has been approaching maturity, but OCR alone cannot determine the correspondence between field names and field values.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for identifying a table in an image.
In a first aspect, some embodiments of the present application provide a method for identifying a form in an image, the method comprising: acquiring a picture to be processed; identifying a field name and a field value included in the picture to be processed; acquiring semantic vectors of field names and semantic vectors of field values; determining a matching relationship between the field names and the field values based on the semantic vectors of the field names and the semantic vectors of the field values and a pre-trained matching model; and generating a table according to the matching relation of the field names and the field values.
In a second aspect, some embodiments of the present application provide an apparatus for identifying a form in an image, the apparatus comprising: a first acquisition unit configured to acquire a picture to be processed; an identifying unit configured to identify a field name and a field value included in the picture to be processed; a second acquisition unit configured to acquire a semantic vector of a field name and a semantic vector of a field value; a determining unit configured to determine a matching relationship of the field name and the field value based on the semantic vector of the field name and the semantic vector of the field value and a pre-trained matching model; and a generation unit configured to generate a table according to the matching relationship of the field names and the field values.
In a third aspect, some embodiments of the present application provide an apparatus comprising: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors cause the one or more processors to implement the method as described in the first aspect.
In a fourth aspect, some embodiments of the application provide a computer readable medium having stored thereon a computer program which when executed by a processor implements a method as described in the first aspect.
According to the technology provided by the application, the efficiency of identifying the table in the image is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting it. In the drawings:
FIG. 1 is a diagram of an exemplary system architecture to which the present application may be applied;
FIG. 2 is a schematic diagram of a first embodiment according to the present application;
FIG. 3 is a schematic diagram of a picture to be processed according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a second embodiment according to the present application;
FIG. 5 is a schematic diagram of a third embodiment according to the present application;
FIG. 6 is a schematic diagram of an electronic device suitable for implementing a method for identifying a table in an image according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below with reference to the accompanying drawings. Various details of the embodiments are included to facilitate understanding and should be considered merely exemplary; those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Descriptions of well-known functions and constructions are omitted for clarity and conciseness. It should be noted that, in the absence of conflict, the embodiments of the present application and the features in those embodiments may be combined with each other. The application is described in detail below with reference to the drawings and in connection with the embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of a method for identifying a form in an image or an apparatus for identifying a form in an image of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various client applications, such as a text recognition class application, a social class application, a search class application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices with a display screen, including but not limited to smartphones, tablets, laptop and desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above and implemented either as multiple pieces of software or software modules or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, for example, a background server providing support for applications installed on the terminal devices 101, 102, 103, and the server 105 may acquire the pictures to be processed uploaded by the terminal devices 101, 102, 103; identifying a field name and a field value included in the picture to be processed; acquiring semantic vectors of field names and semantic vectors of field values; determining a matching relationship between the field names and the field values based on the semantic vectors of the field names and the semantic vectors of the field values and a pre-trained matching model; and generating a table according to the matching relation of the field names and the field values.
It should be noted that, the method for identifying a table in an image provided by the embodiment of the present application may be performed by the server 105, or may be performed by the terminal devices 101, 102, 103, and accordingly, the device for identifying a table in an image may be provided in the server 105, or may be provided in the terminal devices 101, 102, 103.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present application is not particularly limited herein.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for identifying a table in an image in accordance with the present application is shown. The method for identifying the table in the image comprises the following steps:
step 201, a picture to be processed is acquired.
In this embodiment, the execution body of the method for identifying a table in an image (for example, a terminal or server shown in FIG. 1) may acquire a picture to be processed. The picture to be processed contains an image of the table to be identified; it may be obtained by scanning or photographing, and may originate from a medical institution, a financial institution or another type of institution. In addition, the execution body may apply preprocessing operations to the picture, such as correcting camera distortion, cropping and rotation, to facilitate subsequent identification.
Step 202, identifying a field name and a field value included in the picture to be processed.
In this embodiment, the execution body may identify the field names and field values included in the picture to be processed acquired in step 201. A table includes at least field names and field values: a field name denotes a fixed attribute, and a field value is the content corresponding to that attribute. Taking FIG. 3 as an example, fixed attributes such as "item category", "item name", "number", "unit", "amount" and "category" are field names, and the contents corresponding to them are field values, such as "contrast catheter (japanese taylor 210)" corresponding to "item name" and "3.00" corresponding to "number".
Here, the execution body may identify the field names and field values included in the picture to be processed using OCR technology or other image detection methods based on convolutional neural networks, such as the region-based convolutional neural network (R-CNN) algorithm or the Faster R-CNN algorithm.
Optical character recognition refers to the process of analyzing, recognizing and processing an image file containing text to obtain the text and layout information; that is, the text in the image is recognized and returned in textual form. A typical OCR solution can be divided into two parts: text detection and text recognition. Text detection locates the position, extent and layout of text in the image, and generally includes layout analysis, text line detection and the like; it mainly determines where text appears in the image and how large each text region is. Text recognition then recognizes the textual content within the detected regions and converts the text information in the image into character strings; it mainly determines what each detected character is.
The Faster R-CNN algorithm uses a region proposal network (RPN) to assist in generating candidate samples. Its structure is divided into two parts: the RPN judges whether a candidate box contains a target, and a multi-task loss function for classification and localization then determines the category of the target box. The whole pipeline shares the feature maps extracted by the convolutional neural network, which saves computation. For text detection in constrained scenes the Faster R-CNN algorithm performs well, and text regions of different granularities can be determined through multiple detections.
In addition, the execution body may segment the text within a line using the text extents obtained from text detection, or segment it with a pre-trained semantic model, for example a pre-trained long short-term memory network (LSTM) or bidirectional LSTM (BiLSTM).
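As an illustrative sketch of this segmentation step (the patent itself suggests using text extents or a trained LSTM/BiLSTM), a minimal rule-based splitter can cut an OCR'd line into name/value fragments at a delimiter; the delimiter set and the sample line below are assumptions for illustration, not part of the original disclosure:

```python
def split_line(line, delimiters=(":", "：")):
    """Split one OCR'd text line into (field_name, field_value) pairs.

    Minimal rule-based sketch; a real system, as the description suggests,
    may instead use a pre-trained LSTM/BiLSTM to decide segment boundaries.
    """
    pairs = []
    for chunk in line.split():  # tokens already separated by OCR
        for d in delimiters:
            if d in chunk:
                name, _, value = chunk.partition(d)
                pairs.append((name, value))
                break
    return pairs

# Hypothetical OCR output for one line of a receipt image.
print(split_line("number:3.00 unit:pcs"))
```

A learned segmenter would replace the delimiter test with a per-character boundary classification, but the interface (line in, name/value pairs out) stays the same.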
Step 203, obtaining the semantic vector of the field name and the semantic vector of the field value.
In this embodiment, the execution body may obtain semantic vectors for the field names and field values identified in step 202. The semantic vectors of field names and field values can be determined by a bag-of-words model, a Word2Vec (word-to-vector) model, a topic model, and the like.
In some optional implementations of this embodiment, the semantic vector of a field value is determined as follows: the field value is input into a pre-trained encoding network to obtain a semantic code for each character in the field value, and the semantic codes of the individual characters are then fused to obtain the semantic vector of the field value. The semantic vector of a field name may be determined in the same way. Because a table may contain words that do not appear in a dictionary, fusing per-character semantic codes yields, compared with looking semantic vectors up in a dictionary, a vector that better fits the actual semantics of each field value, so the matching model can determine a more accurate matching relationship based on it.
By way of example, the encoding network may include a forward LSTM and a backward LSTM. A contextual representation of each character is obtained by concatenating the outputs of the forward and backward LSTMs, and the semantic vector of the field value is then obtained by max pooling, average pooling or the like. Specifically, for the current character t, the representations of the first t-1 characters serve as the preceding context of t and are encoded by the forward LSTM to give the forward hidden output at time t; likewise, the following context of t is encoded by the backward LSTM to give the backward hidden output at time t. Concatenating the two hidden outputs yields the bidirectional LSTM representation at time t, and the semantic codes of all characters in the field value are fused by a max pooling operation to obtain the semantic vector of the field value.
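The fusion step above can be sketched numerically. In the snippet below, random arrays stand in for the hidden states of a trained forward and backward LSTM (an assumption for illustration); only the concatenate-then-max-pool fusion itself mirrors the description:

```python
import numpy as np

rng = np.random.default_rng(0)

T, H = 4, 8  # characters in the field value, hidden size per direction
fwd = rng.standard_normal((T, H))  # forward-LSTM hidden state at each step
bwd = rng.standard_normal((T, H))  # backward-LSTM hidden state at each step

# Per-character semantic code: concatenation of both directions, shape (T, 2H).
codes = np.concatenate([fwd, bwd], axis=1)

# Fuse the per-character codes into one semantic vector by max pooling over time.
semantic_vec = codes.max(axis=0)

assert semantic_vec.shape == (2 * H,)  # one fixed-size vector per field value
```

The result has a fixed dimension regardless of how many characters the field value contains, which is what lets field values of different lengths feed one matching model.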
Step 204, determining a matching relationship between the field name and the field value based on the semantic vector of the field name and the semantic vector of the field value and a pre-trained matching model.
In this embodiment, the execution body may determine the matching relationship between a field name and a field value based on their semantic vectors and a pre-trained matching model. The input of the matching model may be generated from the semantic vector of the field name, the semantic vector of the field value and other related vectors: for example, the two semantic vectors may be fed to the model directly, they may be concatenated into a single input vector, or they may be concatenated together with other related vectors to form the input vector.
Here, the matching model characterizes the correspondence between the input vector and the matching relationship of the field name and the field value. It may be obtained by training an initial matching model on samples. Alternatively, it may be a correspondence table, preset by a technician based on statistics over a large number of input parameter values and matching results, that stores the correspondence between parameter values and matching results; or it may be a calculation formula, preset by a technician based on statistics over a large amount of data and stored in the electronic device, that performs a numerical calculation on the values of one or more input parameters to obtain a result characterizing the match — for example, a weighted average of the parameter values that indicates a match when the result exceeds a predetermined value.
The matching model may include classification models such as logistic regression, random forest, gradient-boosted decision trees or a support vector machine, and may also include components such as a fully connected network, softmax, and argmax.
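A minimal stand-in for such a matching model, assuming one fully connected layer followed by softmax over the two classes {no-match, match}; the weights here are random placeholders rather than trained parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def match_score(input_vec, W, b):
    """Fully connected layer + softmax over {0: no-match, 1: match}.

    Illustrative sketch of the classifier components named above;
    in practice W and b come from training the matching model.
    """
    probs = softmax(W @ input_vec + b)
    return int(np.argmax(probs)), probs  # argmax picks the predicted class

rng = np.random.default_rng(1)
x = rng.standard_normal(20)                      # hypothetical input vector
W, b = rng.standard_normal((2, 20)), np.zeros(2)
label, probs = match_score(x, W, b)
assert label in (0, 1) and abs(probs.sum() - 1.0) < 1e-9
```

Any of the other listed classifiers (logistic regression, random forest, GBDT, SVM) could replace this scoring function without changing the surrounding pipeline.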
Step 205, a table is generated according to the matching relationship between the field names and the field values.
In this embodiment, the execution body may generate the table according to the matching relationship between field names and field values. If a field name matches a field value, the field value can be attributed to that field name; if not, the execution body may continue to determine whether other field names match the field value. In addition, the execution body may determine which field values lie in the same row according to the position information of the field values or the direction of a reference text.
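A possible sketch of this table-generation step, assuming the match relation and the row grouping (derived from position information) are already available as plain Python structures; the field names and values below are illustrative:

```python
def build_table(matches, values_by_row):
    """Assemble a table from field-name/field-value matches.

    `matches` maps each field value to its matched field name;
    `values_by_row` groups the field values that the position
    information places on the same row. Both inputs are hypothetical
    stand-ins for the outputs of the earlier steps.
    """
    columns = sorted({name for name in matches.values()})
    rows = []
    for row_values in values_by_row:
        row = {c: "" for c in columns}  # empty cell when a column has no value
        for v in row_values:
            row[matches[v]] = v
        rows.append(row)
    return columns, rows

matches = {"3.00": "number", "contrast catheter": "item name"}
cols, rows = build_table(matches, [["contrast catheter", "3.00"]])
print(cols, rows)
```

The dict-of-columns layout maps directly onto structured storage formats such as CSV or a database table.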
In the process 200 of the method for identifying a table in an image of this embodiment, the matching relationship between field names and field values is determined from their semantic vectors and a pre-trained matching model, and the table is then generated automatically according to that matching relationship, which improves the efficiency of identifying tables in images.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for identifying a table in an image is shown. The process 400 of the method for identifying a form in an image comprises the steps of:
step 401, obtaining a picture to be processed.
Step 402, identifying a field name and a field value included in the picture to be processed.
Step 403, obtaining the semantic vector of the field name and the semantic vector of the field value.
Step 404, obtaining location information of the field name and location information of the field value.
In this embodiment, the execution body of the method for identifying a table in an image (for example, a terminal or server shown in FIG. 1) may acquire the position information of field names and field values through OCR or other text detection algorithms, such as Faster R-CNN. The position information indicates where a field name or field value is located; it may include the coordinates of key points of the region occupied by the field name or field value in the picture to be processed, or information on the boundary lines of that region. Key points may include center points, boundary points and so on; taking a rectangular region as an example, they may include the upper-left, lower-left, upper-right and lower-right corner points.
Step 405, generating a distance vector of the field name and the field value according to the position information of the field name and the position information of the field value.
In this embodiment, the execution body may generate a distance vector for a field name and a field value according to their position information. The distance vector characterizes the distance between the field name and the field value; its dimensions may directly contain that distance, or other information characterizing it, such as the differences between abscissas and between ordinates.
In some optional implementations of this embodiment, generating the distance vector from the position information comprises: generating the distance vector of the field name and the field value from the differences, along a predetermined direction, between the coordinates of the key points of the region where the field name is located and those of the region where the field value is located. The predetermined direction may be the abscissa direction and/or the ordinate direction. Specifically, taking a rectangular region as an example, the four dimensions of the distance vector may be: the difference between the abscissas of the upper-left corners of the field-value region and the field-name region; the difference between their ordinates; the difference between the abscissas of the lower-right corners of the two regions; and the difference between their ordinates.
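The four-dimensional distance vector described above can be sketched directly; the corner-coordinate convention and the sample boxes are assumptions for illustration:

```python
def distance_vector(value_box, name_box):
    """Four-dimensional distance vector from rectangle corner coordinates.

    Each box is (x_top_left, y_top_left, x_bottom_right, y_bottom_right);
    the dimensions are the value-minus-name differences of the upper-left
    and lower-right corners, as in the implementation described above.
    """
    vx1, vy1, vx2, vy2 = value_box
    nx1, ny1, nx2, ny2 = name_box
    return (vx1 - nx1, vy1 - ny1, vx2 - nx2, vy2 - ny2)

# Made-up example: the field value sits 5 px right of and 40 px below
# its candidate field name.
print(distance_vector((105, 60, 205, 80), (100, 20, 160, 40)))  # (5, 40, 45, 40)
```

Signed differences (rather than absolute distances) preserve whether the value lies left/right of or above/below the name, which is the extra information the next paragraph relies on.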
Compared with generating the distance vector directly from the distance between the field name and the field value, a distance vector generated from the coordinate differences of key points along predetermined directions reflects the positional relationship of the field name and the field value in each direction and carries richer position information, which helps determine a more accurate matching relationship later.
Step 406, generating an input vector of the matching model according to the semantic vector of the field name, the semantic vector of the field value and the distance vector.
In this embodiment, the execution body may generate the input vector of the matching model from the semantic vector of the field name, the semantic vector of the field value, and the distance vector. It may concatenate the three vectors directly to obtain the input vector, or process them first and then fuse the processed vectors, for example by concatenation, to obtain the input vector.
In some optional implementations of this embodiment, generating the input vector comprises: fusing the semantic vector of the field name and the semantic vector of the field value to obtain a first vector; applying a dimension transformation to the distance vector to obtain a second vector with the same dimension as the first vector; and concatenating the first and second vectors to obtain the input vector. As an example, the execution body may concatenate the two semantic vectors to obtain the first vector, or, after concatenation, fully fuse and learn their information through a fully connected layer. Because both position information and semantic information matter when determining the matching relationship, transforming the second (position) vector to the same dimension as the first (semantic) vector balances the two kinds of information in the input vector, allowing the matching model to take both into account and produce a more accurate output.
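The construction of this input vector can be sketched as follows, assuming 16-dimensional semantic vectors, a 4-dimensional distance vector, and an untrained random linear transform standing in for the learned dimension transformation:

```python
import numpy as np

rng = np.random.default_rng(2)

name_vec = rng.standard_normal(16)   # semantic vector of the field name
value_vec = rng.standard_normal(16)  # semantic vector of the field value
dist_vec = rng.standard_normal(4)    # 4-d distance vector

# First vector: fuse the two semantic vectors (plain concatenation here;
# a fully connected layer could further fuse them, as noted above).
first = np.concatenate([name_vec, value_vec])  # shape (32,)

# Second vector: project the distance vector up to the same dimension
# with an illustrative (untrained) linear transform.
W = rng.standard_normal((first.size, dist_vec.size))
second = W @ dist_vec  # shape (32,)

# Input vector of the matching model: both halves have equal dimensionality,
# so position and semantic information are balanced.
input_vec = np.concatenate([first, second])
assert input_vec.shape == (64,)
```

Without the up-projection, the 4-dimensional distance signal would be dwarfed by the 32 semantic dimensions; equalizing the dimensions is what the text means by "more balanced" information.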
Step 407, inputting the input vector into a matching model, and determining the matching relation between the field name and the field value according to the output of the matching model.
In this embodiment, the execution body may input the input vector into the matching model and determine the matching relationship between the field name and the field value according to the model's output. The output of the matching model may indicate whether the field name and the field value match.
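One way the model's output could be read as a match decision is sketched below. The single logistic layer, the 0.5 threshold, and the toy weights are all assumptions standing in for the pre-trained matching model, whose architecture the text does not specify.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def match_decision(input_vec, weights, bias, threshold=0.5):
    """Hypothetical reading of the matching model's output: a score in
    (0, 1) interpreted as the probability that the field name and the
    field value match; above the threshold they are declared a match."""
    score = sigmoid(float(weights @ input_vec) + bias)
    return score >= threshold, score

# Toy weights and input; in practice these would come from training.
rng = np.random.default_rng(1)
vec = rng.normal(size=8)
w = rng.normal(size=8)
matched, score = match_decision(vec, w, 0.0)
print(matched, round(float(score), 3))
```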
Step 408, a table is generated according to the matching relationship between the field names and the field values.
In this embodiment, the operations of step 401, step 402, step 403, and step 408 are substantially the same as those of step 201, step 202, step 203, and step 205, and are not described again here.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the process 400 of the method for identifying a table in an image in this embodiment generates the input vector of the matching model from the semantic vector of the field name, the semantic vector of the field value, and the distance vector. The input vector thus contains not only semantic information but also distance information, and combining the two kinds of information makes it possible to determine the matching relationship between the field name and the field value more accurately, thereby improving the accuracy of identifying the table in the image.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an apparatus for identifying a table in an image, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for identifying a table in an image of the present embodiment includes: a first acquisition unit 501, an identification unit 502, a second acquisition unit 503, a determination unit 504, and a generation unit 505. The first acquisition unit is configured to acquire a picture to be processed; the identification unit is configured to identify the field names and field values included in the picture to be processed; the second acquisition unit is configured to acquire semantic vectors of the field names and semantic vectors of the field values; the determination unit is configured to determine the matching relationship between the field names and the field values based on the semantic vectors of the field names, the semantic vectors of the field values, and a pre-trained matching model; and the generation unit is configured to generate a table according to the matching relationship between the field names and the field values.
In the present embodiment, specific processes of the first acquisition unit 501, the identification unit 502, the second acquisition unit 503, the determination unit 504, and the generation unit 505 of the apparatus 500 for identifying a table in an image may refer to steps 201, 202, 203, 204, and 205 in the corresponding embodiment of fig. 2.
In some optional implementations of the present embodiment, the determining unit includes: an acquisition subunit configured to acquire the position information of the field name and the position information of the field value; a first generation subunit configured to generate a distance vector of the field name and the field value from the position information of the field name and the position information of the field value; a second generation subunit configured to generate an input vector of the matching model according to the semantic vector of the field name, the semantic vector of the field value, and the distance vector; and a determination subunit configured to input the input vector into the matching model, and determine a matching relationship of the field name and the field value according to an output of the matching model.
In some optional implementations of the present embodiment, the second generating subunit is further configured to: fusing the semantic vector of the field name and the semantic vector of the field value to obtain a first vector; performing dimension transformation on the distance vector to obtain a second vector, wherein the dimension of the second vector is the same as that of the first vector; the first vector and the second vector are concatenated to obtain an input vector.
In some optional implementations of the present embodiment, the location information includes: coordinates of key points of an area where field names and field values are located in an image to be processed; and a first generation subunit further configured to: and generating a distance vector of the field name and the field value according to the difference value between the coordinates of the key point of the area where the field name is located and the coordinates of the key point of the area where the field value is located in the preset direction.
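The coordinate-difference construction of the distance vector described above can be sketched as follows. Using the top-left and bottom-right corners as the key points and x and y as the preset directions is an assumed concrete choice; the text only requires coordinate differences in preset directions.

```python
def distance_vector(name_box, value_box):
    """Distance vector from differences between key-point coordinates
    of the field-name region and the field-value region.

    Each box is (x1, y1, x2, y2): top-left and bottom-right corners in
    the picture to be processed. The components are the differences of
    these corner coordinates in x and y (an assumed instantiation of
    the "preset direction" in the text)."""
    nx1, ny1, nx2, ny2 = name_box
    vx1, vy1, vx2, vy2 = value_box
    return [vx1 - nx1, vy1 - ny1, vx2 - nx2, vy2 - ny2]

# A field value directly below its field name, in the same column.
print(distance_vector((10, 10, 60, 30), (10, 40, 60, 60)))  # [0, 30, 0, 30]
```

Such a vector encodes both the direction and the magnitude of the offset between the two regions, which is what lets the matching model distinguish a value under its name from a value belonging to a neighboring column.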
In some optional implementations of the present embodiment, the apparatus further comprises a semantic vector determination unit configured to: inputting the field value into a pre-trained coding network to obtain semantic codes of each word in the field value; the semantic codes of the individual words in the field values are fused to obtain semantic vectors of the field values.
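The fusion of per-word semantic codes into a field-value vector can be sketched as below. Mean pooling is one assumed fusion; in the described implementation the per-word codes would come from the pre-trained coding network, which is replaced here by hand-written toy codes.

```python
import numpy as np

def field_value_semantic_vector(word_codes):
    """Fuse the semantic codes of the individual words of a field value
    into one semantic vector. Mean pooling over words is an assumed
    fusion; the text leaves the fusion method open."""
    codes = np.asarray(word_codes, dtype=float)  # (num_words, dim)
    return codes.mean(axis=0)                    # (dim,)

# Toy per-word codes for a three-word field value, dimension 4.
codes = [[1.0, 0.0, 2.0, 0.0],
         [3.0, 0.0, 0.0, 0.0],
         [2.0, 0.0, 4.0, 0.0]]
vec = field_value_semantic_vector(codes)
print(vec.tolist())  # [2.0, 0.0, 2.0, 0.0]
```

Pooling yields a fixed-length vector regardless of how many words the field value contains, which keeps the matching model's input dimension constant.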
The device provided by the embodiment of the application acquires a picture to be processed; identifies the field names and field values included in the picture to be processed; acquires semantic vectors of the field names and semantic vectors of the field values; determines the matching relationship between the field names and the field values based on these semantic vectors and a pre-trained matching model; and generates a table according to the matching relationship between the field names and the field values, thereby improving the efficiency of identifying tables in images.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
Fig. 6 is a block diagram of an electronic device for the method of identifying a table in an image according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is taken as an example in fig. 6.
The memory 602 is a non-transitory computer readable storage medium provided by the present application. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for identifying a table in an image provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the method for identifying a table in an image provided by the present application.
The memory 602 is used as a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and modules, such as program instructions/modules corresponding to a method for identifying a table in an image in an embodiment of the present application (e.g., the first acquisition unit 501, the identification unit 502, the second acquisition unit 503, the determination unit 504, and the generation unit 505 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 602, i.e., implements the method for identifying tables in images in the method embodiments described above.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created from the use of the electronic device for identifying the table in the image, and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601; such remote memory may be connected via a network to the electronic device for identifying tables in images. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method of identifying a table in an image may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to the user settings and function controls of the electronic device for identifying tables in images; examples of such input devices include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special purpose or general purpose and which can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme provided by the embodiment of the application, the efficiency of identifying the table in the image is improved.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (13)

1. A method for identifying a form in an image, comprising:
acquiring a picture to be processed;
identifying a field name and a field value included in the picture to be processed;
acquiring semantic vectors of the field names and semantic vectors of the field values;
determining a matching relation between the field names and the field values based on semantic vectors of the field names, semantic vectors of the field values, distance vectors and a pre-trained matching model, wherein the matching relation is used for representing whether the field values are under the field names, the field names represent fixed attributes, the field values represent contents corresponding to the fixed attributes, and the distance vectors are determined based on position information of the field names and position information of the field values;
and generating a table according to the matching relation of the field names and the field values.
2. The method of claim 1, wherein the determining the matching relationship of the field name and the field value based on the semantic vector of the field name, the semantic vector of the field value, the distance vector, and a pre-trained matching model comprises:
acquiring the position information of the field name and the position information of the field value;
generating a distance vector of the field name and the field value according to the position information of the field name and the position information of the field value;
generating an input vector of the matching model according to the semantic vector of the field name, the semantic vector of the field value and the distance vector;
and inputting the input vector into the matching model, and determining the matching relation between the field name and the field value according to the output of the matching model.
3. The method of claim 2, wherein the generating the input vector of the matching model from the semantic vector of the field name, the semantic vector of the field value, and the distance vector comprises:
fusing the semantic vector of the field name and the semantic vector of the field value to obtain a first vector;
performing dimension transformation on the distance vector to obtain a second vector, wherein the dimension of the second vector is the same as that of the first vector;
and splicing the first vector and the second vector to obtain the input vector.
4. The method of claim 2, wherein the location information comprises: coordinates of key points of the region where the field names and the field values are located in the picture to be processed; and
the generating a distance vector of the field name and the field value according to the position information of the field name and the position information of the field value comprises the following steps:
and generating a distance vector of the field name and the field value according to the difference value between the coordinates of the key point of the area where the field name is located and the coordinates of the key point of the area where the field value is located in a preset direction.
5. The method of any of claims 1-4, wherein the semantic vector of field values comprises a semantic vector determined via:
inputting the field value into a pre-trained coding network to obtain semantic codes of each word in the field value;
and fusing semantic codes of the individual words in the field value to obtain a semantic vector of the field value.
6. An apparatus for identifying a form in an image, comprising:
a first acquisition unit configured to acquire a picture to be processed;
an identifying unit configured to identify a field name and a field value included in the picture to be processed;
a second acquisition unit configured to acquire a semantic vector of the field name and a semantic vector of the field value;
a determining unit configured to determine a matching relationship between the field name and the field value based on a semantic vector of the field name, a semantic vector of the field value, a distance vector, and a pre-trained matching model, wherein the matching relationship is used for characterizing whether the field value is under the field name, the field name characterizes a fixed attribute, the field value characterizes a content corresponding to the fixed attribute, and the distance vector is determined based on position information of the field name and position information of the field value;
and a generation unit configured to generate a table according to the matching relationship of the field names and the field values.
7. The apparatus of claim 6, wherein the determining unit comprises:
an acquisition subunit configured to acquire the location information of the field name and the location information of the field value;
a first generation subunit configured to generate a distance vector of the field name and the field value according to the position information of the field name and the position information of the field value;
a second generation subunit configured to generate an input vector of the matching model from the semantic vector of the field name, the semantic vector of the field value, and the distance vector;
a determination subunit configured to input the input vector into the matching model, and determine a matching relationship between the field name and the field value according to an output of the matching model.
8. The apparatus of claim 7, wherein the second generation subunit is further configured to:
fusing the semantic vector of the field name and the semantic vector of the field value to obtain a first vector;
performing dimension transformation on the distance vector to obtain a second vector, wherein the dimension of the second vector is the same as that of the first vector;
and splicing the first vector and the second vector to obtain the input vector.
9. The apparatus of claim 7, wherein the location information comprises: coordinates of key points of the region where the field names and the field values are located in the picture to be processed; and
the first generation subunit is further configured to:
and generating a distance vector of the field name and the field value according to the difference value between the coordinates of the key point of the area where the field name is located and the coordinates of the key point of the area where the field value is located in a preset direction.
10. The apparatus according to any of claims 6-9, wherein the apparatus further comprises a semantic vector determination unit configured to:
inputting the field value into a pre-trained coding network to obtain semantic codes of each word in the field value;
and fusing semantic codes of the individual words in the field value to obtain a semantic vector of the field value.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-5.
CN202010444345.1A 2020-05-22 2020-05-22 Method and device for identifying tables in images Active CN111611990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010444345.1A CN111611990B (en) 2020-05-22 2020-05-22 Method and device for identifying tables in images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010444345.1A CN111611990B (en) 2020-05-22 2020-05-22 Method and device for identifying tables in images

Publications (2)

Publication Number Publication Date
CN111611990A CN111611990A (en) 2020-09-01
CN111611990B true CN111611990B (en) 2023-10-31

Family

ID=72203770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010444345.1A Active CN111611990B (en) 2020-05-22 2020-05-22 Method and device for identifying tables in images

Country Status (1)

Country Link
CN (1) CN111611990B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115865B (en) * 2020-09-18 2024-04-12 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing image
CN112052825B (en) * 2020-09-18 2024-04-16 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing image
CN112270222B (en) * 2020-10-14 2024-06-28 招商银行股份有限公司 Information standardization processing method, equipment and computer readable storage medium
CN112364857B (en) * 2020-10-23 2024-04-26 中国平安人寿保险股份有限公司 Image recognition method, device and storage medium based on numerical extraction
CN113011144B (en) * 2021-03-30 2024-01-30 中国工商银行股份有限公司 Form information acquisition method, device and server
CN114049642A (en) * 2021-11-22 2022-02-15 深圳前海微众银行股份有限公司 Text recognition method and computing device for form certificate image piece

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06290272A (en) * 1993-04-02 1994-10-18 Sharp Corp High-speed matching system
CN107967482A (en) * 2017-10-24 2018-04-27 广东中科南海岸车联网技术有限公司 Icon-based programming method and device
CN109902724A (en) * 2019-01-31 2019-06-18 平安科技(深圳)有限公司 Character recognition method, device and computer equipment based on support vector machines
CN110109918A (en) * 2018-02-02 2019-08-09 兴业数字金融服务(上海)股份有限公司 For verifying the method, apparatus, equipment and computer storage medium of list data
CN110569846A (en) * 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004139484A (en) * 2002-10-21 2004-05-13 Hitachi Ltd Form processing device, program for implementing it, and program for creating form format
US10417489B2 (en) * 2015-11-19 2019-09-17 Captricity, Inc. Aligning grid lines of a table in an image of a filled-out paper form with grid lines of a reference table in an image of a template of the filled-out paper form
US10853638B2 (en) * 2018-08-31 2020-12-01 Accenture Global Solutions Limited System and method for extracting structured information from image documents

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06290272A (en) * 1993-04-02 1994-10-18 Sharp Corp High-speed matching system
CN107967482A (en) * 2017-10-24 2018-04-27 广东中科南海岸车联网技术有限公司 Icon-based programming method and device
CN110109918A (en) * 2018-02-02 2019-08-09 兴业数字金融服务(上海)股份有限公司 For verifying the method, apparatus, equipment and computer storage medium of list data
CN109902724A (en) * 2019-01-31 2019-06-18 平安科技(深圳)有限公司 Character recognition method, device and computer equipment based on support vector machines
CN110569846A (en) * 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ren Tong et al. Research on data extraction methods from electronically scanned form images. Electronic Technology R&D. 2012, 6-10. *
Huang Jinde et al. A new feature extraction method for form recognition. Computer Engineering. 2006, Vol. 32, No. 32, 215-217. *

Also Published As

Publication number Publication date
CN111611990A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN111611990B (en) Method and device for identifying tables in images
CN111860506B (en) Method and device for recognizing characters
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN111967262B (en) Determination method and device for entity tag
CN110991427B (en) Emotion recognition method and device for video and computer equipment
EP3859562A2 (en) Method, apparatus, electronic device, storage medium and computer program product for generating information
CN110569846A (en) Image character recognition method, device, equipment and storage medium
CN111221984A (en) Multimodal content processing method, device, equipment and storage medium
CN111241819B (en) Word vector generation method and device and electronic equipment
CN113656582B (en) Training method of neural network model, image retrieval method, device and medium
CN111428514A (en) Semantic matching method, device, equipment and storage medium
CN112541359B (en) Document content identification method, device, electronic equipment and medium
CN111078878B (en) Text processing method, device, equipment and computer readable storage medium
CN112507090B (en) Method, apparatus, device and storage medium for outputting information
CN111695519B (en) Method, device, equipment and storage medium for positioning key point
CN111783760A (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN111639228B (en) Video retrieval method, device, equipment and storage medium
CN111241838B (en) Semantic relation processing method, device and equipment for text entity
CN112487242A (en) Method and device for identifying video, electronic equipment and readable storage medium
CN113657395B (en) Text recognition method, training method and device for visual feature extraction model
CN111782785B (en) Automatic question and answer method, device, equipment and storage medium
CN111639234B (en) Method and device for mining core entity attention points
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
CN112328896B (en) Method, apparatus, electronic device, and medium for outputting information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant