CN112149663A - RPA and AI combined image character extraction method and device and electronic equipment

Publication number: CN112149663A
Authority: CN (China)
Prior art keywords: detection, image, detection frame, processed, text
Legal status: Pending
Application number: CN202010886737.3A
Other languages: Chinese (zh)
Inventors: 汪冠春, 胡一川, 褚瑞, 李玮, 田艳莉, 王建周
Current Assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Original Assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Application filed by Beijing Benying Network Technology Co Ltd and Beijing Laiye Network Technology Co Ltd
Priority: CN202010886737.3A
Publication: CN112149663A

Classifications

    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06N3/045 Combinations of networks
    • G06T7/187 Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06V30/10 Character recognition

Abstract

The application provides a method and an apparatus for extracting image characters by combining RPA and AI, an electronic device, and a storage medium, belonging to the technical field of image processing. The method comprises the following steps: performing target detection on an image to be processed to determine the position information of each detection frame contained in the image to be processed and the type of each detection frame, wherein the types comprise: character, non-character, text line beginning, and text line end; merging the detection frames whose type is character according to the position information and type of each detection frame, so as to determine each text box contained in the image to be processed; and performing character recognition on each text box to determine the characters contained in the image to be processed. With this extraction method combining RPA and AI, the different types of data content in an image can be determined in a single detection pass, which simplifies the image character extraction process and improves the efficiency of character extraction.

Description

RPA and AI combined image character extraction method and device and electronic equipment
Technical Field
The present application relates to the field of automation technologies, and in particular, to a method and an apparatus for extracting image characters by combining RPA and AI, an electronic device, and a storage medium.
Background
Robotic Process Automation (RPA) uses specific robot software to simulate human operations on a computer and automatically execute process tasks according to rules.
Artificial Intelligence (AI) is a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence.
With the development of AI, Optical Character Recognition (OCR) technology has been applied in various fields to help people reduce repetitive, inefficient work, especially the work of transcribing text information into a computer. Combining RPA technology with OCR technology has become a new trend in the RPA field, helping enterprises process character and image data more efficiently and improve work efficiency.
However, in the related art, documents containing multiple types of content, such as text, tables, and red chapters (official red seals), usually require multiple models to extract characters from the different types of content separately and in sequence, making the character extraction process cumbersome and inefficient.
Disclosure of Invention
The extraction method and device for image characters combining RPA and AI, the electronic device, and the storage medium provided by the application are intended to solve the problem in the related art that documents containing multiple types of content, such as text, tables, and red chapters, usually require multiple models to extract characters from the different types of content separately and in sequence, making the character extraction process cumbersome and inefficient.
An embodiment of the application provides a method for extracting image characters by combining an RPA and an AI, which includes: performing target detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, wherein the type of each detection frame comprises: characters, non-characters, text line beginning and text line ending; combining the detection frames with the types of characters according to the position information of each detection frame and the type of each detection frame to determine each text frame contained in the image to be processed; and performing character recognition on each text box to determine characters contained in the image to be processed.
Optionally, in a possible implementation manner of the embodiment of the first aspect of the present application, the performing target detection on the image to be processed to determine the position information of each detection frame and the type of each detection frame included in the image to be processed specifically includes:
extracting a plurality of dimensional features of each detection frame from the image to be processed respectively;
performing attention mechanism learning on the plurality of dimensional features to acquire adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding text line head and tail information.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, before the extracting, from the image to be processed, the multiple dimensional features of each detection frame respectively, the method further includes:
preprocessing the image to be processed to obtain a plurality of feature maps corresponding to the image to be processed;
the extracting the multiple dimensional features of each detection frame from the image to be processed specifically includes:
and extracting a plurality of dimensional features of each detection frame from the plurality of feature maps respectively.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the extracting, from the image to be processed, the multiple dimensional features of each detection frame respectively includes:
and performing convolution processing on the image to be processed by utilizing at least two filters to acquire at least two dimensional characteristics of each detection frame, wherein the receptive fields of the at least two filters are different.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, before performing attention mechanism learning on the multiple dimensional features to obtain adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame, the method further includes:
and splicing the plurality of dimensional features to generate the feature of each detection frame.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, before performing attention mechanism learning on the multiple dimensional features to obtain adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame, the method further includes:
and performing normalization processing on the plurality of dimensional features.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the position information of each detection frame includes a coordinate of each detection frame in the first direction and an offset of each detection frame in the second direction, and the combining, according to the position information of each detection frame and the type of each detection frame, the detection frames of which the types are characters to determine each text frame included in the image to be processed specifically includes:
if the type of any detection frame is the beginning of a text line, acquiring, from the detection frames, candidate detection frames whose second coordinates in the first direction match the first coordinate of the detection frame in the first direction;
acquiring an adjacent detection frame adjacent to any detection frame in a second direction from the candidate detection frames according to a first offset of the detection frame in the second direction;
and if the type of the adjacent detection frame is a character, combining the adjacent detection frame with any detection frame.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, after the merging the detection frames of which the types are characters according to the position information of each detection frame and the type of each detection frame to determine each text frame included in the image to be processed, the method further includes:
performing connected domain analysis on each text box to determine a connected domain shape corresponding to each text box;
and if the shape of the connected component corresponding to any text box is a circle, determining that the red chapter is contained in any text box.
Another embodiment of the present application provides an apparatus for extracting image text in combination with RPA and AI, including: the first determining module is configured to perform target detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, text line beginning and text line ending; the second determining module is used for combining the detection frames with the types of characters according to the position information of each detection frame and the type of each detection frame so as to determine each text frame contained in the image to be processed; and the third determining module is used for performing character recognition on each text box so as to determine characters contained in the image to be processed.
Optionally, in a possible implementation manner of the embodiment of the first aspect of the present application, the first determining module specifically includes:
the extraction unit is used for respectively extracting a plurality of dimensional features of each detection frame from the image to be processed;
a first obtaining unit, configured to perform attention mechanism learning on the multiple dimensional features to obtain adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and the determining unit is used for determining the type of each detection box according to the adjacent box information of each detection box and the corresponding text line head and tail information.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the first determining module further includes:
the second acquisition unit is used for preprocessing the image to be processed to obtain a plurality of feature maps corresponding to the image to be processed;
the extraction unit specifically comprises:
and the extracting subunit is used for extracting a plurality of dimensional features of each detection frame from the plurality of feature maps respectively.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the extracting unit specifically includes:
and the acquisition subunit is configured to perform convolution processing on the image to be processed by using at least two filters to acquire at least two dimensional features of each detection frame, where receptive fields of the at least two filters are different.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the first determining module further includes:
and the splicing unit is used for splicing the plurality of dimensional features to generate the feature of each detection frame.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the first determining module further includes:
and the normalization unit is used for performing normalization processing on the plurality of dimensional features.
Optionally, in another possible implementation manner of the embodiment of the first aspect of the present application, the position information of each detection frame includes a coordinate of each detection frame in the first direction and an offset of each detection frame in the second direction, and the second determining module specifically includes:
a third obtaining unit, configured to, when the type of any detection frame is the beginning of a text line, acquire, from the detection frames, candidate detection frames whose second coordinates in the first direction match the first coordinate of the detection frame in the first direction;
a fourth acquiring unit, configured to acquire, from the candidate detection frames, an adjacent detection frame adjacent to the any detection frame in the second direction according to a first offset amount of the any detection frame in the second direction;
and the merging unit is used for merging the adjacent detection frame and any detection frame when the type of the adjacent detection frame is a character.
Optionally, in yet another possible implementation manner of the embodiment of the first aspect of the present application, the apparatus further includes:
the fourth determining module is used for analyzing the connected domain of each text box to determine the shape of the connected domain corresponding to each text box;
and the fifth determining module is used for determining that any text box contains a red chapter when the connected domain corresponding to the text box is circular.
An embodiment of another aspect of the present application provides an electronic device, which includes: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for extracting image text in combination with RPA and AI as described above when executing the program.
In another aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for extracting image text by combining RPA and AI as described above.
In another aspect of the present application, a computer program is provided, which is executed by a processor to implement the method for extracting image text by combining RPA and AI according to the embodiment of the present application.
The method, the device, the electronic device, the computer-readable storage medium, and the computer program for extracting image characters by combining RPA and AI provided in the embodiments of the present application perform target detection on an image to be processed to determine position information of each detection box and a type of each detection box included in the image to be processed, merge the detection boxes of which the types are characters according to the position information of each detection box and the type of each detection box to determine each text box included in the image to be processed, and perform character recognition on each text box to determine characters included in the image to be processed. Therefore, the type of each detection frame is determined while the target detection is carried out on the image to be processed, so that the data contents of different types in the image can be determined through one-time detection, the process of extracting the image characters is simplified, and the efficiency of extracting the characters is improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of an image and text extraction method in combination with an RPA and an AI according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the positions of an image to be processed and a detection frame;
fig. 3 is a schematic flowchart of another method for extracting image and text in combination with RPA and AI according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another method for extracting image and text in combination with RPA and AI according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an image and text extraction device combining an RPA and an AI according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the like or similar elements throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The embodiment of the application provides an extraction method of image characters combining RPA and AI, aiming at the problems that in the related art, for documents with various types of contents, such as texts, tables, red chapters and the like, a plurality of models are generally needed to respectively and sequentially extract characters from different types of contents, so that the character extraction process is complicated, and the efficiency is low.
According to the extraction method of the image characters combining the RPA and the AI, the position information of each detection frame and the type of each detection frame contained in the image to be processed are determined by performing target detection on the image to be processed, the detection frames with the types of characters are combined according to the position information of each detection frame and the type of each detection frame, each text frame contained in the image to be processed is determined, and then character recognition is performed on each text frame, so that the characters contained in the image to be processed are determined. Therefore, the type of each detection frame is determined while the target detection is carried out on the image to be processed, so that the data contents of different types in the image can be determined through one-time detection, the process of extracting the image characters is simplified, and the efficiency of extracting the characters is improved.
The following describes in detail an extraction method, an extraction device, an electronic device, a storage medium, and a computer program for extracting image text in combination with RPA and AI provided by the present application with reference to the drawings.
Fig. 1 is a schematic flow chart of an image and text extraction method combining an RPA and an AI according to an embodiment of the present disclosure.
As shown in fig. 1, the method for extracting image text by combining RPA and AI includes the following steps:
step 101, performing target detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, wherein the type of each detection frame includes: characters, non-characters, text line beginning, and text line end.
It should be noted that RPA technology can intelligently understand the existing applications of an electronic device through the user interface and automate repetitive, rule-based operations in large batches, such as repeatedly reading emails, operating Office components, databases, web pages, and client software, collecting data, and performing complex calculations, so as to generate files and reports in large batches; RPA technology thus greatly reduces labor cost and effectively improves office efficiency. Therefore, in an image character extraction scenario, an RPA program can be configured in the electronic device used for extracting image characters, so that the electronic device automatically extracts characters from acquired images according to the rules set in the RPA program.
In practical use, the method for extracting image characters by combining RPA and AI according to the embodiment of the present application can be applied to any scene where characters in an image are extracted, and the embodiment of the present application does not limit this. For example, the method can be applied to the recording scene of paper files such as certificates and bills.
The image to be processed may refer to an image acquired by the RPA robot. For example, when the extraction method of image characters combining the RPA and the AI according to the embodiment of the present application is applied to a document information uploading scenario of an accounting department, the image to be processed may be an image of various documents, such as travel fees, transportation fees, banquet fees, and the like, which is acquired by an RPA robot and uploaded by a user through an electronic device.
The position information of the detection frame may include coordinates of each vertex of the detection frame in the image to be processed; or, when the coordinate system corresponding to the image to be processed includes the first direction and the second direction, the position information of the detection frame may also include a coordinate of the detection frame in the first direction and an offset in the second direction, so as to determine a specific position of the detection frame in the image to be processed by the position information of the detection frame.
For example, as shown in fig. 2, the positions of an image to be processed and a detection frame are schematically illustrated. Here, O is the origin of the coordinate system corresponding to the image 20 to be processed, Y is the first direction of that coordinate system, and X is the second direction; y1 is the coordinate of the detection frame 21 in the first direction, and x1 is the offset of the detection frame 21 in the second direction, i.e., the position information of the detection frame 21 is (y1, x1).
As a possible implementation manner, an OCR algorithm based on CTPN (Connectionist Text Proposal Network, proposed in "Detecting Text in Natural Image with Connectionist Text Proposal Network") may be adopted to perform target detection on the image to be processed, so as to determine the position information and type of each detection frame contained in the image to be processed. Specifically, a Transformer may be used to replace the Long Short-Term Memory (LSTM) network in the CTPN framework to perform target detection on the image to be processed and obtain the text line association information in the image, so that not only the position information of the detection frame corresponding to each target and whether the type of each detection frame is character can be determined, but also information such as whether each detection frame is the beginning or the end of a text line.
And 102, combining the detection frames with the types of the characters according to the position information of each detection frame and the type of each detection frame to determine each text frame contained in the image to be processed.
A text box contained in the image to be processed may contain one complete, independent piece of text in the image.
In the embodiment of the application, after the position information and type of each detection frame in the image to be processed are determined, the detection frames in the same row can be determined according to the position information, and whether a detection frame is the beginning of a text line can be determined according to its type. If a detection box A is the beginning of a text line, the next detection box B located in the same line and adjacent to detection box A is determined according to the position information of detection box A. If the type of detection box B is character and detection box B is the end of the text line, detection boxes A and B can be merged into one text box. If the type of detection box B is character but B is not the end of a text line, detection boxes A and B can be merged, and the next detection box C adjacent to detection box B is determined; the above steps are repeated to decide whether detection box C can be merged with detection boxes A and B, and so on, until a detection box D whose type is text line end is reached, at which point all the detection boxes between detection box A and detection box D can be merged to generate a text box. Repeating these steps determines all the text boxes contained in the image to be processed, as the sketch below illustrates.
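The merging walk can be summarized as the following minimal Python sketch (illustrative only, not code from the application): each detection box is assumed to be a dict carrying the predicted flags is_char, is_line_start, and is_line_end, and next_adjacent is a caller-supplied lookup for the next same-line neighbour (one concrete matching rule is sketched under step 303 further below).

```python
def merge_text_lines(boxes, next_adjacent):
    """Group character detection boxes into text boxes by walking each
    line from a 'text line beginning' box to a 'text line end' box."""
    text_boxes, used = [], set()
    for i, box in enumerate(boxes):
        if i in used or not box["is_line_start"]:
            continue
        line, cur = [i], i
        used.add(i)
        # walk A -> B -> C ... until a box of type "text line end" is absorbed
        while not boxes[cur]["is_line_end"]:
            j = next_adjacent(cur, boxes, used)
            if j is None or not boxes[j]["is_char"]:
                break  # nothing mergeable to the right: close the text box here
            line.append(j)
            used.add(j)
            cur = j
        text_boxes.append([boxes[k] for k in line])
    return text_boxes
```

Running the walk over every box yields every text box in the image, matching the A-to-D traversal described above.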
And 103, performing character recognition on each text box to determine characters contained in the image to be processed.
In the embodiment of the application, after the text box included in the image to be processed is determined, the characters in each text box can be identified by adopting any character identification algorithm, so as to determine the characters corresponding to each text box, and further determine the characters included in the image to be processed.
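As one concrete possibility, the sketch below crops each merged text box out of the image and hands it to an off-the-shelf recognizer; pytesseract with the chi_sim language pack is only a stand-in for the "any character recognition algorithm" mentioned above, not an engine the application specifies.

```python
from PIL import Image
import pytesseract

def recognize_text_boxes(image_path, text_boxes):
    """text_boxes: list of (x, y, w, h) pixel rectangles from the merging step."""
    image = Image.open(image_path)
    texts = []
    for x, y, w, h in text_boxes:
        crop = image.crop((x, y, x + w, y + h))  # cut out one merged text box
        texts.append(pytesseract.image_to_string(crop, lang="chi_sim").strip())
    return texts
```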
According to the extraction method of the image characters combining the RPA and the AI, the position information of each detection frame and the type of each detection frame contained in the image to be processed are determined by performing target detection on the image to be processed, the detection frames with the types of characters are combined according to the position information of each detection frame and the type of each detection frame, each text frame contained in the image to be processed is determined, and then character recognition is performed on each text frame, so that the characters contained in the image to be processed are determined. Therefore, the type of each detection frame is determined while the target detection is carried out on the image to be processed, so that the data contents of different types in the image can be determined through one-time detection, the process of extracting the image characters is simplified, and the efficiency of extracting the characters is improved.
In a possible implementation form of the method, the multiple dimensional features of the detection frame can be extracted to determine the type of the detection frame, so that the accuracy of image character extraction is further improved.
The method for extracting image and text in combination with RPA and AI provided in the embodiment of the present application is further described with reference to fig. 3.
Fig. 3 is a schematic flowchart of another method for extracting image and text by combining an RPA and an AI according to an embodiment of the present disclosure.
As shown in fig. 3, the method for extracting image text by combining RPA and AI includes the following steps:
step 201, performing target detection on the image to be processed to determine the position information of each detection frame included in the image to be processed.
The detailed implementation process and principle of step 201 may refer to the detailed description of the above embodiments, and are not described herein again.
Step 202, extracting a plurality of dimensional features of each detection frame from the image to be processed respectively.
The multiple dimensional features are features that characterize each detection frame in the image to be processed at different granularities.
In the embodiment of the application, convolution processing can be performed on the image to be processed with convolution kernels of different sizes to generate the multiple dimensional features of each detection frame, where the result of each convolution serves as one dimensional feature of each detection frame; alternatively, convolution can be performed with kernels of the same size but different convolution modes, where the convolution result corresponding to each mode serves as one dimensional feature of each detection frame. In this way, features of different granularities are extracted from the image to be processed, improving the accuracy of identifying the detection frames.
As a possible implementation, the image to be processed may be subjected to convolution processing by different filters to generate different dimensional features of each detection frame. That is, in a possible implementation form of the embodiment of the present application, the step 202 may include:
and performing convolution processing on the image to be processed by utilizing at least two filters to acquire at least two dimensional characteristics of each detection frame, wherein the receptive fields of the at least two filters are different.
The filter may be a convolution kernel with a coefficient of expansion. For example, the filter may have a size of 3 × 3, and the expansion coefficient may be 1, 2, 5, and so on.
As a possible implementation, n1 filters of size 3×3 (i.e., with a receptive field of 3×3) may be used to perform convolution on the image to be processed to generate n1 dimensional features for each detection frame; n2 filters of size 3×3 with an expansion coefficient of 2 (i.e., with a receptive field of 7×7) may be used to perform hole convolution on the image to be processed to generate n2 dimensional features for each detection frame; and n3 filters of size 3×3 with an expansion coefficient of 5 (i.e., with a receptive field of 19×19) may be used to perform hole convolution on the image to be processed to generate n3 dimensional features for each detection frame, so that n1+n2+n3 dimensional features can be generated for each detection frame of the image to be processed.
In practice, the specific values of n1, n2, and n3 can be determined according to actual needs, which is not limited by the embodiment of the application. For example, n1 may be 256, n2 may be 128, and n3 may be 128.
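In PyTorch terms, the three filter groups can be sketched as parallel dilated convolutions. This is a sketch under the assumption of a 512-channel input feature map (as in the DenseNet121 preprocessing described below); the branch widths follow the example values n1=256, n2=128, n3=128.

```python
import torch
import torch.nn as nn

class MultiDilationFeatures(nn.Module):
    """Three parallel 3x3 branches over the same input, differing only in
    expansion coefficient (dilation), i.e. in receptive field."""
    def __init__(self, in_channels=512, n1=256, n2=128, n3=128):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, n1, 3, padding=1, dilation=1)
        self.branch2 = nn.Conv2d(in_channels, n2, 3, padding=2, dilation=2)
        self.branch3 = nn.Conv2d(in_channels, n3, 3, padding=5, dilation=5)

    def forward(self, x):
        # padding is chosen so every branch keeps the input's spatial size,
        # which lets the branch outputs be spliced later
        return self.branch1(x), self.branch2(x), self.branch3(x)
```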
Furthermore, before extracting the multi-dimensional features of the image to be processed, the image to be processed can be preprocessed to generate a feature map corresponding to the image to be processed, so that the accuracy of recognizing the image to be processed is further improved. That is, in a possible implementation form of the embodiment of the present application, before the step 202, the method may further include:
preprocessing the image to be processed to obtain a plurality of feature maps corresponding to the image to be processed;
accordingly, the step 202 may include:
and extracting a plurality of dimensional features of each detection frame from the plurality of feature maps respectively.
As a possible implementation manner, DenseNet121 may be adopted to perform feature extraction on the image to be processed, so as to generate a plurality of feature maps corresponding to the image to be processed. For example, 512 feature maps, each of size (pic_height/8) × (pic_width/8), may be generated, where pic_height is the height of the image to be processed and pic_width is its width. After the feature maps corresponding to the image to be processed are determined, each feature map may be subjected to convolution processing so as to extract the multiple dimensional features of each detection frame.
Specifically, the feature maps can be convolved in the same manner described above for the image to be processed, so as to extract the multiple dimensional features of each detection frame from the feature maps. For example, if there are 512 feature maps corresponding to the image to be processed, n1 filters of size 3×3×512 may perform convolution on the 512 feature maps to generate n1 dimensional features for each detection frame; n2 filters of size 3×3×512 with an expansion coefficient of 2 may perform hole convolution on the 512 feature maps to generate n2 dimensional features for each detection frame; and n3 filters of size 3×3×512 with an expansion coefficient of 5 may perform hole convolution on the 512 feature maps to generate n3 dimensional features for each detection frame, so that n1+n2+n3 dimensional features can be generated for each detection frame of the image to be processed.
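A sketch of this preprocessing step, assuming torchvision's DenseNet121: truncating its feature extractor after denseblock2 yields exactly 512 channels at 1/8 of the input height and width, matching the sizes quoted above.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

def build_backbone():
    # keep conv0 .. denseblock2 of DenseNet121: this truncation outputs
    # 512 channels at 1/8 of the input height and width
    features = densenet121(weights=None).features
    return nn.Sequential(*list(features.children())[:7])

backbone = build_backbone()
dummy = torch.randn(1, 3, 256, 320)   # stand-in for an image to be processed
print(backbone(dummy).shape)          # torch.Size([1, 512, 32, 40])
```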
Step 203, performing attention mechanism learning on the multiple dimensional features to obtain adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame.
In this embodiment of the present application, a target detection model including a multilayer decoder and an attention mechanism may be used to learn a plurality of dimensional features to obtain adjacent box information and corresponding text line head and tail information of each detection box, that is, to obtain type information of the adjacent detection box of each detection box, and whether each detection box is a text line head and whether each detection box is a text line tail.
As a possible implementation manner, when the target detection model identifies the image to be processed, if it detects that the image contains k detection boxes, the model may output 2k coordinate values indicating the position information of the detection boxes (for example, the coordinate of each detection box in the first direction and its offset in the second direction) and 4k score values indicating the category of each detection box. That is, each detection box corresponds to 4 score values, which respectively indicate the probability that the detection box is a character, the probability that it is a non-character, the probability that it is the beginning of a text line, and the probability that it is the end of a text line.
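The decoder-plus-attention detection model is not specified in detail, but its interface can be sketched as follows: a Transformer encoder (standing in for the LSTM that CTPN would use) attends over the k per-box feature vectors, and two linear heads emit the 2k coordinate values and 4k score values. The layer count and head count here are assumptions.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.coord_head = nn.Linear(feat_dim, 2)  # per box: (coordinate, offset)
        self.score_head = nn.Linear(feat_dim, 4)  # char/non-char/line-start/line-end

    def forward(self, feats):
        # feats: (batch, k, feat_dim), one feature vector per detection box
        encoded = self.encoder(feats)      # attention over neighbouring boxes
        coords = self.coord_head(encoded)  # (batch, k, 2) -> 2k coordinate values
        scores = self.score_head(encoded).softmax(dim=-1)  # (batch, k, 4) -> 4k scores
        return coords, scores
```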
Furthermore, before the attention mechanism learning is carried out on the multiple dimension features, the multiple dimension features can be fused to identify the multiple dimension features integrally, and the accuracy of identifying the detection frame in the image to be processed is further improved. That is, in a possible implementation form of the embodiment of the present application, before step 203, the method may further include:
and splicing the plurality of dimensional features to generate the feature of each detection frame.
As a possible implementation manner, in order to enable the target detection model to recognize the image to be processed, the multiple dimensional features of each detection frame, which reflect different granularities of the image to be processed, may be spliced to generate the features of each detection frame. That is, each detection frame is represented by one overall feature vector, so that its features include feature information of the image to be processed at different granularities, thereby improving the accuracy of recognizing the detection frames in the image to be processed.
For example, the image to be processed is convolved by 256 filters of size 3×3 to generate 256-dimensional features for each detection frame; hole convolution is performed by 128 filters of size 3×3 with an expansion coefficient of 2 to generate 128-dimensional features for each detection frame; and hole convolution is performed by 128 filters of size 3×3 with an expansion coefficient of 5 to generate another 128-dimensional features for each detection frame. The generated 256-dimensional features and the two sets of 128-dimensional features are then spliced to generate the 512-dimensional features of each detection frame.
As a possible implementation manner, after the multi-dimensional features are spliced, the features generated after splicing can be compressed through the full connection layer, so as to reduce the size of the spliced features, and reduce the calculation amount for identifying the spliced features. For example, if the feature after splicing is 512 dimensions, the feature after splicing can be compressed into 256 dimensions through the full connection layer.
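Continuing the earlier multi-branch sketch, the splicing and compression might look as follows; treating the fully connected compression as a per-position linear projection from 512 to 256 dimensions is an assumption about how the description maps to code.

```python
import torch
import torch.nn as nn

class FuseAndCompress(nn.Module):
    def __init__(self, branches, fused_dim=512, out_dim=256):
        super().__init__()
        self.branches = branches              # e.g. MultiDilationFeatures(512)
        self.compress = nn.Linear(fused_dim, out_dim)  # fully connected layer

    def forward(self, x):
        f1, f2, f3 = self.branches(x)
        fused = torch.cat([f1, f2, f3], dim=1)     # splice: (batch, 512, H, W)
        tokens = fused.flatten(2).transpose(1, 2)  # (batch, H*W, 512)
        return self.compress(tokens)               # compressed: (batch, H*W, 256)
```

The (batch, positions, 256) output is in the shape expected by the Transformer-based detection head sketched above.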
Further, since the metrics of features determined in different ways may differ, features whose values are too small may easily be ignored, thereby affecting the reliability of recognizing the image to be processed. That is, in a possible implementation form of the embodiment of the present application, before step 203, the method may further include:
performing normalization processing on the multiple dimensional features.
As a possible implementation manner, after the multiple dimensional features of each detection frame in the image to be processed are determined in different manners, normalization processing may be performed on the multiple dimensional features, so that the metrics of the dimensional features are in the same numerical range, and thus, the influence of different metrics of the multiple dimensional features on the recognition accuracy of the image to be processed can be reduced.
As another possible implementation, after the multiple dimensional features are spliced, normalization processing may be performed on the spliced features.
Step 204, determining the type of each detection box according to the adjacent box information of each detection box and the corresponding text line head and tail information, wherein the type of each detection box comprises: characters, non-characters, text line beginning, and text line end.
In the embodiment of the application, after the information of the adjacent detection frame and the corresponding text line head and tail information of each detection frame in the image to be processed are determined through an attention mechanism, the type of each detection frame can be determined according to the information of the adjacent detection frame and the text line head and tail information of each detection frame.
As a possible implementation manner, the type of each detection box may also be determined only according to the text line head and tail information corresponding to each detection box. For example, if the 4 score values output by the target detection model for detection box A are [0.99, 0, 1, 0], representing in order the probability that detection box A is a character, a non-character, the beginning of a text line, and the end of a text line, then the type of detection box A can be determined as character and text line beginning.
As another possible implementation manner, the type of each detection box may also be determined by the adjacent box information of each detection box and the corresponding text line head and tail information. Specifically, the type of each detection frame may be determined according to the head and tail information of the text line corresponding to each detection frame, and then the type of each detection frame is checked according to the adjacent frame information of each detection frame, so as to assist in judging whether the determined type of the detection frame is accurate, and further improve the accuracy of determining the type of the detection frame.
For example, if the 4 score values output by the target detection model for detection box A are [0.99, 0, 1, 0], representing in order the probability that detection box A is a character, a non-character, the beginning of a text line, and the end of a text line, the type of detection box A can be preliminarily determined as character and text line beginning. If the score values of the adjacent box B located before detection box A are [0.9, 0.05, 0.1, 0.95], and the score values of the adjacent box C located after detection box A are [0.92, 0.1, 0.1, 0.1], then box B is very likely the end of a text line and box C is not, so the probability that detection box A is indeed the beginning of a text line is very high, and its type is confirmed as character and text line beginning.
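Reading the four score values off one detection box can be sketched as below; the 0.5 cut-offs for the line-start and line-end probabilities are assumptions, not values from the application.

```python
def interpret_scores(scores):
    """scores: [char_p, non_char_p, line_start_p, line_end_p] for one box."""
    char_p, non_char_p, start_p, end_p = scores
    return {
        "is_char": char_p > non_char_p,
        "is_line_start": start_p > 0.5,
        "is_line_end": end_p > 0.5,
    }

print(interpret_scores([0.99, 0, 1, 0]))
# -> {'is_char': True, 'is_line_start': True, 'is_line_end': False}
```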
Step 205, merging the detection frames with the type being the character according to the position information of each detection frame and the type of each detection frame to determine each text frame contained in the image to be processed.
And step 206, performing character recognition on each text box to determine characters contained in the image to be processed.
The detailed implementation process and principle of the steps 205 and 206 can refer to the detailed description of the above embodiments, and are not described herein again.
The method for extracting image characters by combining the RPA and the AI according to the embodiment of the present application includes extracting multiple dimensional features of each detection box from an image to be processed, and performing attention mechanism learning on the multiple dimensional features to obtain adjacent box information of each detection box and text line head and tail information corresponding to each detection box, then determining a type of each detection box according to the adjacent box information of each detection box and the text line head and tail information corresponding to each detection box, and further combining the detection boxes of which the types are characters according to position information of each detection box and the type of each detection box to determine each text box included in the image to be processed, and performing character recognition on each text box to determine characters included in the image to be processed. Therefore, by extracting the multi-dimensional features of the images to be processed with different granularities and performing feature representation on each detection frame in the images to be processed, the data contents of different types in the images can be determined by one-time detection, the image character extraction process is simplified, the character extraction efficiency is improved, and the character extraction accuracy and reliability are further improved.
In a possible implementation form of the method and the device, the character type detection frames can be combined according to the position information of the detection frames to determine each text frame included in the image, and connected domain analysis can be performed on the text frames to realize identification of the red seal in the image, so that the practicability and the universality of image character extraction are further improved.
The method for extracting image text in combination with RPA and AI provided in the embodiment of the present application is further described below with reference to fig. 4.
Fig. 4 is a schematic flowchart of another method for extracting image and text by combining RPA and AI according to an embodiment of the present disclosure.
As shown in fig. 4, the method for extracting image text by combining RPA and AI includes the following steps:
step 301, performing target detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, text line beginning, and text line end.
The detailed implementation process and principle of step 301 may refer to the detailed description of the above embodiments, and are not described herein again.
Step 302, if the type of any detection frame is the beginning of a text line, acquiring candidate detection frames matched with the second coordinate and the first coordinate in the first direction from each detection frame according to the first coordinate of any detection frame in the first direction.
The position information of the detection frames comprises the coordinates of each detection frame in the first direction and the offset of each detection frame in the second direction. It should be noted that the first direction may be the Y-axis direction of the coordinate system corresponding to the image to be processed, and the second direction may be the X-axis direction of that coordinate system, which is not limited in this embodiment of the present application. For a specific schematic diagram, reference can be made to fig. 2 and its explanation in the above embodiment, which is not repeated here.
In the embodiment of the present application, if the type of one detection box is the beginning of a text line, the detection boxes in the same independent text can be determined from the detection boxes in the same line and merged. Specifically, assuming that the type of the detection frame a is the beginning of a text line, a detection frame with a difference between the second coordinate and the first coordinate smaller than the first threshold may be determined according to the first coordinate of the detection frame a in the first direction and the second coordinate of each of the other detection frames in the image to be processed in the first direction, and the detection frame is determined as a candidate detection frame with the second coordinate matching the first coordinate, that is, the candidate detection frame and the detection frame a are in the same line.
It should be noted that, in actual use, the specific value of the first threshold may be determined according to actual needs and the height of the detection frame, which is not limited in this application. For example, the first threshold may be 1/3 of the detection frame height.
Step 303, according to the first offset of any detection frame in the second direction, acquiring an adjacent detection frame adjacent to any detection frame in the second direction from the candidate detection frames.
In the embodiment of the present application, after determining the candidate detection frames in the same line as a detection frame whose type is the beginning of a text line, an adjacent detection frame adjacent to that detection frame may be determined according to the first offset of the detection frame in the second direction and the second offset of each candidate detection frame in the second direction. Specifically, assuming that detection box A is a detection box whose type is the beginning of a text line, if the difference between the second offset of a candidate detection box B corresponding to detection box A and the first offset of detection box A is less than or equal to the width of the detection box, it may be determined that candidate detection box B is an adjacent detection box of detection box A; otherwise, it may be determined that candidate detection box B is not an adjacent detection box of detection box A. A sketch of this matching rule is given below.
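Taken together, the candidate test of step 302 and the adjacency test of step 303 give one possible next_adjacent lookup for the merging walk sketched earlier; the dict field names and the choice of the nearest right-hand candidate are assumptions.

```python
def next_adjacent(cur, boxes, used):
    """Find the unvisited box on the same line immediately to the right of
    boxes[cur]; boxes carry first-direction coordinate 'y', second-direction
    offset 'x', and pixel width 'w' / height 'h'."""
    a = boxes[cur]
    best = None
    for j, b in enumerate(boxes):
        if j == cur or j in used:
            continue
        same_line = abs(b["y"] - a["y"]) < a["h"] / 3.0  # first threshold: h/3
        adjacent = 0 < b["x"] - a["x"] <= a["w"]         # offset gap <= one width
        if same_line and adjacent and (best is None or b["x"] < boxes[best]["x"]):
            best = j  # keep the nearest candidate to the right
    return best
```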
And 304, if the type of the adjacent detection frame is a character, combining the adjacent detection frame with any detection frame.
In the embodiment of the present application, after determining the adjacent detection frame of the detection frame, if the type of the adjacent detection frame is a character, the adjacent detection frame and the detection frame may be merged. Thereafter, in the same manner, an adjacent detection frame adjacent to the adjacent detection frame is determined, and it is determined whether or not the merging process can be performed. It can be understood that after traversing all the detection boxes in the image to be processed by the above method, all the text boxes included in the image to be processed can be determined.
And 305, performing connected component analysis on each text box to determine the shape of the connected component corresponding to each text box.
In the embodiment of the disclosure, connected component analysis may further be performed on the text boxes to identify any red chapter contained in the image to be processed. Specifically, a text box whose content is characters is usually square, so even when such a text box is subjected to connected component analysis, the generated connected component is usually square. A red chapter, by contrast, is generally circular, and it usually contains different types of content, such as characters and images, whose characters are not distributed in rows; the red chapter region is therefore generally divided into a plurality of text boxes, and the text boxes corresponding to the red chapter share the same image characteristics. Consequently, when connected component analysis is performed on each text box contained in the image to be processed, the text boxes corresponding to a red chapter can be connected together to form one complete connected component, and the connected component corresponding to the red chapter is usually circular.
Step 306, if the shape of the connected component corresponding to any text box is a circle, it is determined that the text box contains a red chapter.
In the embodiment of the present application, since the shape of a red chapter in a document is generally circular, while the connected component of a text box corresponding to ordinary characters is generally square, a text box whose corresponding connected component is circular can, after the connected component analysis, be determined as a text box containing a red chapter.
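With OpenCV, steps 305 and 306 can be sketched as follows, assuming the merged text boxes have been rasterized into one binary mask; the circularity measure 4πA/P² and its 0.8 cut-off are common shape heuristics, not values from the application.

```python
import math
import cv2

def find_red_chapter_regions(mask):
    """mask: binary uint8 image in which merged text-box regions are 255."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    regions = []
    for contour in contours:
        area = cv2.contourArea(contour)
        perimeter = cv2.arcLength(contour, True)
        if perimeter == 0 or area < 100:
            continue  # skip degenerate or tiny components
        circularity = 4 * math.pi * area / perimeter ** 2  # 1.0 for a circle
        if circularity > 0.8:  # roughly circular -> likely a red chapter
            regions.append(cv2.boundingRect(contour))
    return regions
```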
Step 307, performing character recognition on each text box to determine characters contained in the image to be processed.
The detailed implementation process and principle of the step 307 may refer to the detailed description of the above embodiments, and are not described herein again.
In the method for extracting image characters by combining RPA and AI provided in the embodiment of the present application, when a detection frame is the beginning of a text line, candidate detection frames are determined according to the first coordinate of the detection frame in the first direction and the second coordinates of the other detection frames in the first direction; an adjacent detection frame whose type is character is merged with the detection frame according to the first offset of the detection frame in the second direction and the types of the candidate detection frames; the red chapter in the image to be processed is then determined through connected component analysis, and character recognition is performed on each text box to determine the characters contained in the image to be processed. Thus, by performing connected component analysis on the text boxes to identify the red chapter contained in the image, the different types of data content in the image can be determined in a single detection pass, which simplifies the image character extraction process, improves the efficiency of character extraction, and further improves the practicability and universality of image character extraction.
In order to implement the above embodiments, the present application further provides an image text extraction device combining RPA and AI.
Fig. 5 is a schematic structural diagram of an image and text extraction device combining an RPA and an AI according to an embodiment of the present disclosure.
As shown in fig. 5, the RPA and AI combined image character extracting apparatus 40 includes:
a first determining module 41, configured to perform target detection on the image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, text line beginning and text line ending;
a second determining module 42, configured to combine the detection frames with the types of characters according to the position information of each detection frame and the type of each detection frame, so as to determine each text frame included in the image to be processed;
and a third determining module 43, configured to perform character recognition on each text box to determine characters included in the image to be processed.
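As a structural illustration only, the three modules of the apparatus 40 map naturally onto a small pipeline; the class and method names in this Python sketch are hypothetical, not part of the disclosure.

```python
class ImageTextExtractor:
    """Sketch of apparatus 40: module 41 detects boxes and their types,
    module 42 merges character boxes into text boxes, and module 43
    recognizes the characters in each text box."""

    def __init__(self, detector, merger, recognizer):
        self.detect = detector        # first determining module (41)
        self.merge = merger           # second determining module (42)
        self.recognize = recognizer   # third determining module (43)

    def extract(self, image):
        boxes, types = self.detect(image)       # positions + types
        text_boxes = self.merge(boxes, types)   # merge character boxes
        return self.recognize(image, text_boxes)
```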
In practical use, the RPA and AI combined image character extraction device provided in the embodiment of the present application may be configured in any electronic device to execute the RPA and AI combined image character extraction method.
The RPA and AI combined image character extraction device provided in the embodiment of the present application performs target detection on an image to be processed to determine the position information and type of each detection frame contained in the image, merges the detection frames whose type is character according to that position information and type to determine each text box contained in the image, and then performs character recognition on each text box to determine the characters contained in the image. Because the type of each detection frame is determined at the same time as target detection is performed on the image to be processed, data content of different types in the image can be determined through a single detection pass, which simplifies the image character extraction process and improves extraction efficiency.
In a possible implementation form of the present application, the first determining module 41 specifically includes:
the extraction unit is used for respectively extracting a plurality of dimensional features of each detection frame from the image to be processed;
a first obtaining unit, configured to perform attention mechanism learning on the multiple dimensional features to obtain adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and the determining unit is used for determining the type of each detection box according to the adjacent box information of each detection box and the corresponding text line head and tail information.
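The "attention mechanism learning" performed by the first obtaining unit can be pictured as self-attention over the per-box feature vectors, so that each box's representation absorbs adjacent-box and line head/tail context before classification. A minimal PyTorch sketch, in which the feature dimension, head count, and four-way type head are assumptions:

```python
import torch
import torch.nn as nn

NUM_TYPES = 4  # character, non-character, text line beginning, text line end

class BoxTypeHead(nn.Module):
    """Self-attention over per-box features followed by a type classifier."""

    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads,
                                          batch_first=True)
        self.classify = nn.Linear(feat_dim, NUM_TYPES)

    def forward(self, box_feats):
        # box_feats: (batch, num_boxes, feat_dim) features per detection box.
        # Each box attends to every other box, so its updated representation
        # carries adjacent-box and line head/tail context.
        ctx, _ = self.attn(box_feats, box_feats, box_feats)
        return self.classify(ctx)  # (batch, num_boxes, NUM_TYPES) logits
```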
Further, in another possible implementation form of the present application, the first determining module 41 further includes:
the second acquisition unit is used for preprocessing the image to be processed so as to acquire a plurality of feature maps corresponding to the image to be processed;
correspondingly, the extraction unit specifically includes:
and the extracting subunit is used for extracting the multiple dimensional features of each detection frame from the multiple feature maps respectively.
Further, in another possible implementation form of the present application, the extracting unit specifically includes:
and the acquisition subunit is used for performing convolution processing on the image to be processed by utilizing at least two filters so as to acquire at least two dimensional characteristics of each detection frame, wherein the receptive fields of the at least two filters are different.
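The obtaining subunit's use of at least two filters with different receptive fields can be sketched as parallel convolutions with different kernel sizes; the kernel sizes and channel counts below are illustrative assumptions.

```python
import torch.nn as nn

class MultiScaleFeatures(nn.Module):
    """Two parallel convolutions with different receptive fields, yielding
    two dimensional features per spatial location of the image."""

    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        # 3x3 kernel: small receptive field, fine stroke-level detail.
        self.fine = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # 7x7 kernel: larger receptive field, coarser layout context.
        self.coarse = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=3)

    def forward(self, image):
        # image: (batch, in_ch, H, W); padding keeps both outputs at H x W
        # so per-box features can be cropped from either map.
        return self.fine(image), self.coarse(image)
```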
Further, in another possible implementation form of the present application, the first determining module 41 further includes:
and the splicing unit is used for splicing the plurality of dimensional features to generate the feature of each detection frame.
Further, in another possible implementation form of the present application, the first determining module 41 further includes:
and the normalization unit is used for performing normalization processing on the multiple dimension characteristics.
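Taken together, the splicing unit and the normalization unit amount to a concatenate-then-normalize step applied to each detection frame's dimensional features before attention learning. A minimal sketch, assuming L2 normalization (the disclosure only requires "normalization"):

```python
import torch
import torch.nn.functional as F

def splice_and_normalize(feature_list):
    """Concatenate a detection frame's dimensional features into a single
    vector and L2-normalize it.

    feature_list -- list of (num_boxes, dim_i) tensors, one per scale
    """
    spliced = torch.cat(feature_list, dim=-1)  # (num_boxes, sum of dim_i)
    return F.normalize(spliced, p=2, dim=-1)   # unit length per box
```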
Further, in another possible implementation form of the present application, the position information of each detection frame includes a coordinate of each detection frame in the first direction and an offset of each detection frame in the second direction; correspondingly, the second determining module 42 specifically includes:
a third acquiring unit, configured to, when the type of any detection frame is the beginning of a text line, acquire, from the detection frames and according to a first coordinate of the detection frame in the first direction, candidate detection frames whose second coordinate in the first direction matches the first coordinate;
a fourth acquiring unit configured to acquire, from the candidate detection frames, an adjacent detection frame adjacent to any detection frame in the second direction according to a first offset amount of the any detection frame in the second direction;
and the merging unit is used for merging the adjacent detection frame with any detection frame when the type of the adjacent detection frame is a character.
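The three units above implement a line-building rule: start at a "text line beginning" box, keep the candidates whose coordinate in the first direction matches, pick the nearest neighbour along the second direction using the offset, and merge it when its type is character. A sketch of that rule, assuming the first direction is vertical (y), the second direction is horizontal (x), reading order is left to right, and boxes are plain dictionaries:

```python
def build_text_line(start_box, boxes, y_tol=5):
    """Greedily merge character boxes to the right of a line-start box.

    Each box is a dict: {'x', 'y', 'w', 'h', 'type'}, where 'type' is one
    of 'char', 'non_char', 'line_start', 'line_end'. The tolerance value
    and the left-to-right reading order are assumptions.
    """
    line = [start_box]
    current = start_box
    while current['type'] != 'line_end':
        # Candidates: boxes whose coordinate in the first direction (y)
        # matches the current box's within a tolerance.
        candidates = [b for b in boxes
                      if b is not current
                      and abs(b['y'] - current['y']) <= y_tol
                      and b['x'] > current['x']]
        if not candidates:
            break
        # Adjacent box: the candidate with the smallest offset from the
        # current box in the second direction (x).
        nxt = min(candidates, key=lambda b: b['x'] - current['x'])
        # Merge only character boxes (a line-end box closes the line).
        if nxt['type'] not in ('char', 'line_end'):
            break
        line.append(nxt)
        current = nxt
    return line
```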
Further, in another possible implementation form of the present application, the device 40 for extracting image text in combination with RPA and AI further includes:
the fourth determining module is used for analyzing the connected domain of each text box so as to determine the shape of the connected domain corresponding to each text box;
and the fifth determining module is used for determining that a red seal is contained in any text box when the shape of the connected domain corresponding to the text box is a circle.
It should be noted that the foregoing explanation of the embodiment of the method for extracting image characters by combining RPA and AI shown in fig. 1, fig. 3, and fig. 4 is also applicable to the device 40 for extracting image characters by combining RPA and AI in this embodiment, and will not be repeated here.
The RPA and AI combined image character extraction device provided in the embodiment of the present application extracts a plurality of dimensional features of each detection frame from the image to be processed, performs attention mechanism learning on those dimensional features to obtain the adjacent-frame information and text line head and tail information corresponding to each detection frame, determines the type of each detection frame accordingly, merges the detection frames whose type is character according to the position information and type of each detection frame to determine each text box contained in the image to be processed, and performs character recognition on each text box to determine the characters contained in the image. By extracting multi-dimensional features of different granularities from the image to be processed and using them to represent each detection frame, data content of different types in the image can be determined through a single detection pass, which simplifies the image character extraction process, improves extraction efficiency, and further improves the accuracy and reliability of character extraction.
In order to implement the above embodiments, the present application further provides an electronic device.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 6, the electronic device 200 includes:
a memory 210, a processor 220, and a bus 230 connecting different components (including the memory 210 and the processor 220). The memory 210 stores a computer program, and when the processor 220 executes the program, the RPA and AI combined image character extraction method according to the embodiments of the present application is implemented.
Bus 230 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 200 typically includes a variety of electronic device readable media. Such media may be any available media that is accessible by electronic device 200 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 210 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 240 and/or cache memory 250. The electronic device 200 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 260 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 230 by one or more data media interfaces. Memory 210 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 280 having a set (at least one) of program modules 270 may be stored in, for example, the memory 210. Such program modules 270 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination of which, may comprise an implementation of a network environment. The program modules 270 generally perform the functions and/or methods of the embodiments described herein.
Electronic device 200 may also communicate with one or more external devices 290 (e.g., keyboard, pointing device, display 291, etc.), with one or more devices that enable a user to interact with electronic device 200, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 200 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 292. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 293. As shown, the network adapter 293 communicates with the other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 220 executes various functional applications and data processing by executing programs stored in the memory 210.
It should be noted that, for the implementation process and the technical principle of the electronic device of this embodiment, reference is made to the foregoing explanation of the method for extracting image and text in combination with RPA and AI in this embodiment of the application, and details are not repeated here.
The electronic device provided in this embodiment of the present application may execute the RPA and AI combined image character extraction method described above: target detection is performed on an image to be processed to determine the position information and type of each detection frame contained in the image; the detection frames whose type is character are merged according to that position information and type to determine each text box contained in the image; and character recognition is then performed on each text box to determine the characters contained in the image. Because the type of each detection frame is determined at the same time as target detection is performed on the image to be processed, data content of different types in the image can be determined through a single detection pass, which simplifies the image character extraction process and improves extraction efficiency.
In order to implement the above embodiments, the present application also proposes a computer-readable storage medium.
The computer readable storage medium stores thereon a computer program, which when executed by a processor, implements the extraction method of the image text combining the RPA and the AI according to the embodiment of the present application.
In order to implement the foregoing embodiments, a further embodiment of the present application provides a computer program, which when executed by a processor, implements the method for extracting image text by combining RPA and AI according to the embodiments of the present application.
In an alternative implementation, the embodiments may be implemented in any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic devices may be connected to the consumer electronic device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external electronic device (for example, through the Internet using an Internet service provider).
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (18)

1. An extraction method of image characters combining RPA and AI is characterized by comprising the following steps:
performing target detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, wherein the type of each detection frame comprises: characters, non-characters, text line beginning and text line ending;
combining the detection frames with the types of characters according to the position information of each detection frame and the type of each detection frame to determine each text frame contained in the image to be processed;
and performing character recognition on each text box to determine characters contained in the image to be processed.
2. The method according to claim 1, wherein the performing target detection on the image to be processed to determine the position information of each detection frame and the type of each detection frame included in the image to be processed specifically includes:
extracting a plurality of dimensional features of each detection frame from the image to be processed respectively;
performing attention mechanism learning on the plurality of dimensional features to acquire adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and determining the type of each detection frame according to the adjacent frame information of each detection frame and the corresponding text line head and tail information.
3. The method of claim 2, wherein before said extracting the plurality of dimensional features of each of the detection frames from the image to be processed, respectively, further comprises:
preprocessing the image to be processed to obtain a plurality of feature maps corresponding to the image to be processed;
the extracting the multiple dimensional features of each detection frame from the image to be processed specifically includes:
and extracting a plurality of dimensional features of each detection frame from the plurality of feature maps respectively.
4. The method according to claim 2, wherein the extracting the dimensional features of each of the detection frames from the image to be processed respectively comprises:
and performing convolution processing on the image to be processed by utilizing at least two filters to acquire at least two dimensional characteristics of each detection frame, wherein the receptive fields of the at least two filters are different.
5. The method of claim 2, before the learning of the attention mechanism on the plurality of dimensional features to obtain the adjacent box information of each detection box and the head and tail information of the text line corresponding to each detection box, further comprising:
and splicing the plurality of dimensional features to generate the feature of each detection frame.
6. The method according to any one of claims 2-5, further comprising, before the learning of the attention mechanism on the plurality of dimensional features to obtain the adjacent box information of each of the detection boxes and the text line head-tail information corresponding to each of the detection boxes, the following steps:
and carrying out normalization processing on the plurality of dimension characteristics.
7. The method according to any one of claims 1 to 5, wherein the position information of each of the detection boxes includes a coordinate of each of the detection boxes in a first direction and an offset of each of the detection boxes in a second direction, and the combining the detection boxes of which the types are characters according to the position information of each of the detection boxes and the type of each of the detection boxes to determine each text box included in the image to be processed specifically includes:
if the type of any detection frame is the beginning of a text line, acquiring candidate detection frames matched with a second coordinate in the first direction and the first coordinate from each detection frame according to the first coordinate of any detection frame in the first direction;
acquiring an adjacent detection frame adjacent to any detection frame in a second direction from the candidate detection frames according to a first offset of the detection frame in the second direction;
and if the type of the adjacent detection frame is a character, combining the adjacent detection frame with any detection frame.
8. The method according to any one of claims 1 to 5, wherein after said combining the detection boxes of which the types are characters according to the position information of each detection box and the type of each detection box to determine each text box included in the image to be processed, further comprising:
performing connected domain analysis on each text box to determine a connected domain shape corresponding to each text box;
and if the shape of the connected component corresponding to any text box is a circle, determining that a red seal is contained in the text box.
9. An extraction device for combining RPA and AI images and texts, comprising:
the first determining module is configured to perform target detection on an image to be processed to determine position information of each detection frame and a type of each detection frame included in the image to be processed, where the type of each detection frame includes: characters, non-characters, text line beginning and text line ending;
the second determining module is used for combining the detection frames with the types of characters according to the position information of each detection frame and the type of each detection frame so as to determine each text frame contained in the image to be processed;
and the third determining module is used for performing character recognition on each text box so as to determine characters contained in the image to be processed.
10. The apparatus of claim 9, wherein the first determining module specifically comprises:
the extraction unit is used for respectively extracting a plurality of dimensional features of each detection frame from the image to be processed;
a first obtaining unit, configured to perform attention mechanism learning on the multiple dimensional features to obtain adjacent frame information of each detection frame and text line head and tail information corresponding to each detection frame;
and the determining unit is used for determining the type of each detection box according to the adjacent box information of each detection box and the corresponding text line head and tail information.
11. The apparatus of claim 10, wherein the first determining module further comprises:
the second acquisition unit is used for preprocessing the image to be processed to acquire a plurality of feature maps corresponding to the image to be processed;
the extraction unit specifically comprises:
and the extracting subunit is used for extracting a plurality of dimensional features of each detection frame from the plurality of feature maps respectively.
12. The apparatus according to claim 10, wherein the extracting unit specifically comprises:
and the acquisition subunit is configured to perform convolution processing on the image to be processed by using at least two filters to acquire at least two dimensional features of each detection frame, where receptive fields of the at least two filters are different.
13. The apparatus of claim 10, wherein the first determining module further comprises:
and the splicing unit is used for splicing the plurality of dimensional features to generate the feature of each detection frame.
14. The apparatus of any of claims 10-13, wherein the first determining module further comprises:
and the normalization unit is used for performing normalization processing on the plurality of dimension characteristics.
15. The apparatus according to any one of claims 9-13, wherein the position information of each of the detection frames includes coordinates of each of the detection frames in a first direction and an offset of each of the detection frames in a second direction, and the second determining module specifically includes:
a third obtaining unit, configured to, when a type of any detection frame is a start of a text line, obtain, from each detection frame, a candidate detection frame that matches a second coordinate in a first direction with a first coordinate according to the first coordinate in the first direction of the detection frame;
a fourth acquiring unit, configured to acquire, from the candidate detection frames, an adjacent detection frame adjacent to the any detection frame in the second direction according to a first offset amount of the any detection frame in the second direction;
and the merging unit is used for merging the adjacent detection frame and any detection frame when the type of the adjacent detection frame is a character.
16. The apparatus of any of claims 9-13, further comprising:
the fourth determining module is used for analyzing the connected domain of each text box to determine the shape of the connected domain corresponding to each text box;
and the fifth determining module is used for determining that any text box contains a red seal when the shape of the connected domain corresponding to the text box is a circle.
17. An electronic device, comprising: memory, processor and program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method for extracting image text in combination with RPA and AI according to any of claims 1-8.
18. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the RPA and AI combined image text extraction method according to any one of claims 1 to 8.
CN202010886737.3A 2020-08-28 2020-08-28 RPA and AI combined image character extraction method and device and electronic equipment Pending CN112149663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010886737.3A CN112149663A (en) 2020-08-28 2020-08-28 RPA and AI combined image character extraction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112149663A 2020-12-29

Family

ID: 73890014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010886737.3A Pending CN112149663A (en) 2020-08-28 2020-08-28 RPA and AI combined image character extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112149663A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105809164A (en) * 2016-03-11 2016-07-27 北京旷视科技有限公司 Character identification method and device
US20190272438A1 (en) * 2018-01-30 2019-09-05 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for detecting text
WO2020010547A1 (en) * 2018-07-11 2020-01-16 深圳前海达闼云端智能科技有限公司 Character identification method and apparatus, and storage medium and electronic device
CN109902622A (en) * 2019-02-26 2019-06-18 中国科学院重庆绿色智能技术研究院 A kind of text detection recognition methods for boarding pass information verifying
CN110287952A (en) * 2019-07-01 2019-09-27 中科软科技股份有限公司 A kind of recognition methods and system for tieing up sonagram piece character
CN110442744A (en) * 2019-08-09 2019-11-12 泰康保险集团股份有限公司 Extract method, apparatus, electronic equipment and the readable medium of target information in image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG YAO et al.: "Graph convolutional networks for text classification", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 1, 17 July 2019 (2019-07-17), pages 7370-7377 *
XIE JINBAO et al.: "Multi-feature Fusion for Chinese Text Classification Based on Attention Neural Networks with Semantic Understanding" (title translated from Chinese), Journal of Electronics & Information Technology, vol. 40, no. 5, 31 May 2018 (2018-05-31), pages 1258-1265 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651394A (en) * 2020-12-31 2021-04-13 北京一起教育科技有限责任公司 Image detection method and device and electronic equipment
CN112651394B (en) * 2020-12-31 2023-11-14 北京一起教育科技有限责任公司 Image detection method and device and electronic equipment
CN112766892A (en) * 2021-01-11 2021-05-07 北京来也网络科技有限公司 Method and device for combining fund ratio of RPA and AI and electronic equipment
CN112926420A (en) * 2021-02-09 2021-06-08 海信视像科技股份有限公司 Display device and menu character recognition method
CN112926420B (en) * 2021-02-09 2022-11-08 海信视像科技股份有限公司 Display device and menu character recognition method
WO2022247823A1 (en) * 2021-05-25 2022-12-01 阿里巴巴(中国)有限公司 Image detection method, and device and storage medium
CN113778303A (en) * 2021-08-23 2021-12-10 深圳价值在线信息科技股份有限公司 Character extraction method and device and computer readable storage medium
CN114419636A (en) * 2022-01-10 2022-04-29 北京百度网讯科技有限公司 Text recognition method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10853638B2 (en) System and method for extracting structured information from image documents
CN112149663A (en) RPA and AI combined image character extraction method and device and electronic equipment
CN109858555B (en) Image-based data processing method, device, equipment and readable storage medium
CN108564035B (en) Method and system for identifying information recorded on document
CN110148084B (en) Method, device, equipment and storage medium for reconstructing 3D model from 2D image
CN110232340B (en) Method and device for establishing video classification model and video classification
CN110084289B (en) Image annotation method and device, electronic equipment and storage medium
CN109408829B (en) Method, device, equipment and medium for determining readability of article
CN109947924B (en) Dialogue system training data construction method and device, electronic equipment and storage medium
CN110188766B (en) Image main target detection method and device based on convolutional neural network
CN110826494A (en) Method and device for evaluating quality of labeled data, computer equipment and storage medium
CN112509661B (en) Methods, computing devices, and media for identifying physical examination reports
CN112016638A (en) Method, device and equipment for identifying steel bar cluster and storage medium
JP2022185143A (en) Text detection method, and text recognition method and device
CN111401309A (en) CNN training and remote sensing image target identification method based on wavelet transformation
CN111563429A (en) Drawing verification method and device, electronic equipment and storage medium
CN111008624A (en) Optical character recognition method and method for generating training sample for optical character recognition
CN110737770B (en) Text data sensitivity identification method and device, electronic equipment and storage medium
CN110796108A (en) Method, device and equipment for detecting face quality and storage medium
CN111414889B (en) Financial statement identification method and device based on character identification
CN113762455A (en) Detection model training method, single character detection method, device, equipment and medium
CN111898528A (en) Data processing method and device, computer readable medium and electronic equipment
CN113807416B (en) Model training method and device, electronic equipment and storage medium
CN115761778A (en) Document reconstruction method, device, equipment and storage medium
CN111428724B (en) Examination paper handwriting statistics method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination