CN108229303B - Detection recognition and training method, device, equipment and medium for detection recognition network


Info

Publication number
CN108229303B
CN108229303B
Authority
CN
China
Prior art keywords
text box
information
image
processed
detection
Prior art date
Legal status
Active
Application number
CN201711126372.9A
Other languages
Chinese (zh)
Other versions
CN108229303A (en)
Inventor
刘学博
梁鼎
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201711126372.9A
Publication of CN108229303A
Application granted
Publication of CN108229303B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the invention discloses a detection and recognition method and a training method, apparatus, equipment and medium for a detection and recognition network. The detection and recognition method comprises the following steps: inputting an image to be processed into a detection and identification network, the detection and identification network comprising a sharing network layer, a detection network layer and an identification network layer; outputting the sharing layer characteristics of the image to be processed through the sharing network layer; inputting the sharing layer characteristics into the detection network layer, outputting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring, based on the detection layer characteristics, text box information for the text boxes containing characters in the image to be processed; and inputting the sharing layer characteristics and the text box information into the identification network layer, and outputting the text content in the text boxes through the identification network layer. The embodiment of the invention reduces repeated feature extraction on the image, improving processing efficiency as well as the efficiency and speed of character detection and recognition.

Description

Detection recognition and training method, device, equipment and medium for detection recognition network
Technical Field
The invention relates to computer vision technology, and in particular to a detection and recognition method and to a training method, apparatus, equipment and medium for a detection and recognition network.
Background
Text detection and recognition in natural scenes is an important problem in the fields of image understanding and image restoration. Accurate text detection and recognition can support many applications, such as image search over large data sets, automatic translation, guidance for the blind, and robot navigation.
However, text detection and recognition in natural scenes is very challenging: varying background scenes, low resolution, diverse fonts, different lighting conditions, different size scales, different tilt directions and blur all make the problem complex and difficult.
Disclosure of Invention
The embodiment of the invention provides a technical scheme for character recognition.
According to an aspect of the embodiments of the present invention, there is provided a detection and identification method, including:
inputting an image to be processed into a detection and identification network; the detection identification network comprises a sharing network layer, a detection network layer and an identification network layer;
outputting sharing layer characteristics of the image to be processed through the sharing network layer, wherein the sharing layer characteristics are used for embodying at least one of the following features in the image: texture features, edge features and detail features of small objects;
inputting the sharing layer characteristics into the detection network layer, outputting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics;
and inputting the sharing layer characteristics and the text box information into the identification network layer, and outputting the text content in the text box through the identification network layer.
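For illustration only, a minimal sketch of this three-stage forward pass is given below. A PyTorch implementation is assumed; the module names (SharedNet, DetectNet, RecogNet), layer sizes and output conventions are placeholders introduced here, not taken from the patent.

```python
# Minimal sketch (not the patented implementation) of the shared / detection / recognition
# forward pass described above, assuming PyTorch; all names and sizes are illustrative.
import torch
import torch.nn as nn

class SharedNet(nn.Module):
    """Shared network layer: extracts low-level (texture/edge/detail) features once."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
    def forward(self, image):
        return self.backbone(image)           # sharing layer features

class DetectNet(nn.Module):
    """Detection network layer: per-pixel text/non-text score plus box geometry."""
    def __init__(self):
        super().__init__()
        self.score = nn.Conv2d(64, 1, 1)      # text-category probability per pixel
        self.geometry = nn.Conv2d(64, 5, 1)   # distances to the 4 sides + rotation angle
    def forward(self, shared):
        return torch.sigmoid(self.score(shared)), self.geometry(shared)

class RecogNet(nn.Module):
    """Recognition network layer: predicts characters from fused text-box features."""
    def __init__(self, num_classes=100):
        super().__init__()
        self.encoder = nn.Conv2d(64, 64, 3, padding=1)
        self.classifier = nn.Linear(64, num_classes)
    def forward(self, fused_box_features):
        f = self.encoder(fused_box_features)           # N x 64 x H x W
        seq = f.mean(dim=2).permute(0, 2, 1)           # collapse height -> N x W x 64
        return self.classifier(seq)                    # per-column character logits

shared_net, detect_net, recog_net = SharedNet(), DetectNet(), RecogNet()
image = torch.randn(1, 3, 256, 256)                    # image to be processed
shared = shared_net(image)                             # sharing layer features
score_map, geometry_map = detect_net(shared)           # detection layer features
# The recognition step would crop box regions from `shared`, fuse them with the text box
# features and feed them to recog_net; the cropping is sketched in a later example.
```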
In another embodiment based on the above method of the present invention, the detection layer characteristics include category information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information;
the obtaining of the text box information including the characters in the image to be processed based on the detection layer features includes:
acquiring text box information including characters in the image to be processed according to the category information of each pixel in the image to be processed, wherein the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box.
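For illustration, the text box information described above could be held in a structure like the following; the field names and units are assumptions introduced here.

```python
# Illustrative container (an assumption, not from the patent text) for the text box
# information: a category flag plus, for a pixel inside the box, the distances to the
# four sides and the box rotation angle.
from dataclasses import dataclass

@dataclass
class TextBoxInfo:
    contains_text: bool   # text box category information
    top: float            # distance from the pixel to the top side of the box
    bottom: float         # distance to the bottom side
    left: float           # distance to the left side
    right: float          # distance to the right side
    angle: float          # rotation angle of the text box

box = TextBoxInfo(contains_text=True, top=4.0, bottom=12.0, left=30.0, right=90.0, angle=0.1)
height, width = box.top + box.bottom, box.left + box.right   # 16.0, 120.0
```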
In another embodiment based on the foregoing method of the present invention, the obtaining text box information including characters in the image to be processed through the category information of each pixel in the image to be processed includes:
respectively reducing the length and the width of the image to be processed to set proportions based on the category information of the image to be processed, and segmenting the image to be processed into a plurality of rectangular frames according to the pixel position relationship; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
obtaining distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box;
and obtaining the text box information based on the obtained text box position information and the obtained text box category information.
In another embodiment of the above method according to the present invention, the inputting the sharing layer feature and the text box information into the recognition network layer, and predicting the text information in the text box via the recognition network layer includes:
acquiring corresponding text box characteristics based on the output text box information, and performing characteristic fusion on the text box characteristics and sharing layer characteristics output by the sharing network layer;
and the recognition network layer predicts the character information in the text box based on the fused features.
In another embodiment of the foregoing method based on the present invention, the obtaining the corresponding text box feature based on the output text box information includes:
and carrying out perspective transformation on the text box information, segmenting a text box from the image to be processed, and generating corresponding text box characteristics based on the segmented text box.
In another embodiment of the foregoing method according to the present invention, the segmenting the text box from the image to be processed includes:
obtaining the coordinates of the upper left corner of the text box according to the position information of the text box;
keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinate and the scaling;
and segmenting the text box from the image to be processed based on the perspective transformation matrix.
In another embodiment of the foregoing method according to the present invention, the segmenting the text box from the image to be processed based on the perspective transformation matrix includes:
and performing matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmentation image with the same size as the image to be processed, wherein each segmentation image only comprises a text box at the upper left corner.
According to another aspect of the embodiments of the present invention, there is provided a training method for detecting a recognition network, including:
inputting an image to be processed into a detection and identification network; wherein, the detection identification network comprises a sharing network layer, a detection network layer and an identification network layer; the image to be processed is marked with text box information and character information contained in the text box;
outputting a first shared layer feature via the shared network layer; inputting the first sharing layer characteristics and the information of the text box marked by the image to be processed into the identification network layer, and predicting character information included in the text box through the identification network layer; training the sharing network layer and the recognition network layer based on the predicted character information and the labeled character information until a first training completion condition is met; the shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of the small object;
inputting an image to be processed into a trained shared network layer, and outputting a second shared layer characteristic by the trained shared network layer; inputting the second sharing layer characteristics into the detection network layer, predicting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics; and training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met.
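For illustration, the two-stage training schedule above is sketched below, reusing the hypothetical SharedNet, DetectNet and RecogNet modules from the earlier sketch. The loss functions (CTC loss for recognition, binary cross-entropy plus smooth-L1 for detection), the optimizers, the stopping thresholds and the crop_and_fuse placeholder are all illustrative assumptions rather than the patented implementation.

```python
# Sketch of the two-stage schedule: first train the shared and recognition layers with the
# labelled text boxes, then keep the trained shared layer fixed and train the detection layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

def crop_and_fuse(shared, gt_boxes):
    # Placeholder: the real system crops each labelled text box from the shared-layer
    # features via a perspective transform and fuses it with the text-box features.
    return shared

def train_two_stages(shared_net, detect_net, recog_net,
                     recognition_loader, detection_loader,
                     max_iters=10000, eps=1e-3):
    # Stage 1: train the shared network layer and the recognition network layer.
    opt1 = torch.optim.Adam(list(shared_net.parameters()) + list(recog_net.parameters()))
    ctc = nn.CTCLoss(blank=0)
    for step, (image, gt_boxes, gt_texts, text_lengths) in enumerate(recognition_loader):
        shared = shared_net(image)                             # first shared-layer features
        logits = recog_net(crop_and_fuse(shared, gt_boxes))    # N x W x C
        log_probs = logits.permute(1, 0, 2).log_softmax(2)     # T x N x C, as CTCLoss expects
        input_lengths = torch.full((log_probs.size(1),), log_probs.size(0), dtype=torch.long)
        loss = ctc(log_probs, gt_texts, input_lengths, text_lengths)
        opt1.zero_grad(); loss.backward(); opt1.step()
        if loss.item() < eps or step + 1 >= max_iters:         # first training completion condition
            break

    # Stage 2: keep the trained shared layer fixed and train only the detection network layer.
    opt2 = torch.optim.Adam(detect_net.parameters())
    for step, (image, gt_score_map, gt_geometry) in enumerate(detection_loader):
        with torch.no_grad():
            shared = shared_net(image)                         # second shared-layer features
        score_map, geometry_map = detect_net(shared)
        loss = F.binary_cross_entropy(score_map, gt_score_map) \
             + F.smooth_l1_loss(geometry_map, gt_geometry)
        opt2.zero_grad(); loss.backward(); opt2.step()
        if loss.item() < eps or step + 1 >= max_iters:         # second training completion condition
            break
```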
In another embodiment of the foregoing method based on the present invention, training the shared network layer and the recognition network layer based on the predicted textual information and the labeled textual information until a first training completion condition is satisfied includes:
adjusting network parameter values in the shared network layer and the recognition network layer based on an error between the predicted text information and the labeled text information;
and iteratively recognizing the image to be processed with the shared network layer and the recognition network layer after parameter adjustment to obtain predicted text information, until the first training completion condition is met.
In another embodiment of the above method according to the present invention, the first training completion condition includes:
the error between the predicted text information and the marked text information is smaller than a first preset value; or the iterative prediction times are greater than or equal to a first preset time.
In another embodiment of the foregoing method according to the present invention, the training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met includes:
adjusting parameters of the detection network layer based on an error between predicted text box information and labeled text box information;
and iteratively executing detection on the image to be processed through the detection network layer after the parameters are adjusted to obtain predicted text box information until a second training completion condition is met.
In another embodiment of the above method according to the present invention, the second training completion condition includes:
the error between the predicted text box information and the labeled text box information is smaller than a second preset value; or the iterative prediction times are greater than or equal to a second preset time.
In another embodiment based on the above method of the present invention, the detection layer characteristics include category information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information;
obtaining text box information including words in the image to be processed based on the detection layer features comprises:
acquiring text box information including characters in the image to be processed according to the category information of each pixel in the image to be processed, wherein the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box.
In another embodiment based on the foregoing method of the present invention, the obtaining text box information including characters in the image to be processed through the category information of each pixel in the image to be processed includes:
respectively reducing the length and the width of the image to be processed to set proportions based on the category information of the image to be processed, and segmenting the image to be processed into a plurality of rectangular frames according to the pixel position relationship; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
obtaining distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box;
and obtaining the text box information based on the obtained text box position information and the obtained text box category information.
In another embodiment of the foregoing method according to the present invention, predicting, via the recognition network layer, text information included in the text box includes:
acquiring corresponding text box characteristics based on the text box information marked by the image to be processed, and performing characteristic fusion on the text box characteristics and first sharing layer characteristics output by the sharing network layer;
and the recognition network layer predicts the character information in the text box based on the fused features.
In another embodiment based on the foregoing method of the present invention, obtaining the corresponding text box feature based on the text box information labeled on the image to be processed includes:
and carrying out perspective transformation on the labeled text box information, segmenting a text box from the image to be processed, and generating corresponding text box characteristics based on the segmented text box.
In another embodiment of the foregoing method according to the present invention, segmenting a text box from the image to be processed includes:
obtaining the coordinates of the upper left corner of the text box according to the position information of the text box;
keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinate and the scaling;
and segmenting the text box from the image to be processed based on the perspective transformation matrix.
In another embodiment of the foregoing method according to the present invention, segmenting the text box from the image to be processed based on the perspective transformation matrix includes:
and performing matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmentation image with the same size as the image to be processed, wherein each segmentation image only comprises a text box at the upper left corner.
According to another aspect of the embodiments of the present invention, there is provided a detection and identification apparatus, including:
the input unit is used for inputting the image to be processed into the detection and identification network; the detection identification network comprises a sharing network layer, a detection network layer and an identification network layer;
the low-layer extraction unit is used for outputting the sharing layer characteristics of the image to be processed through the sharing network layer; the shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of the small object;
the text box detection unit is used for inputting the sharing layer characteristics into the detection network layer, outputting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics;
and the character recognition unit is used for inputting the sharing layer characteristics and the text box information into the recognition network layer and outputting the character contents in the text box through the recognition network layer.
In another embodiment of the above apparatus according to the present invention, the detection layer characteristics include category information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information;
the text box detection unit is specifically configured to obtain text box information including characters in the image to be processed through category information of each pixel in the image to be processed, where the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box.
In another embodiment of the above apparatus according to the present invention, the text box detecting unit includes:
the text box obtaining module is used for respectively reducing the length and the width of the image to be processed to set proportions based on the category information of the image to be processed and dividing the image to be processed into a plurality of rectangular boxes according to the pixel position relation; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
the information acquisition module is used for acquiring distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box; and acquiring the text box information based on the acquired text box position information and the acquired text box category information.
In another embodiment of the above apparatus according to the present invention, the character recognition unit includes:
the feature extraction module is used for obtaining corresponding text box features based on the output text box information and fusing the text box features with the sharing layer features output by the sharing network layer;
and the character prediction module is used for predicting character information in the text box by the identification network layer based on the fused features.
In another embodiment of the apparatus according to the present invention, the feature extraction module is specifically configured to perform perspective transformation on the text box information, segment a text box from the image to be processed, and generate corresponding text box features based on the segmented text box.
In another embodiment of the above apparatus according to the present invention, the feature extraction module includes:
the zooming module is used for obtaining the coordinates of the upper left corner of the text box according to the position information of the text box; keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
the transformation module is used for constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinate and the scaling;
and the text box segmentation module is used for segmenting the text box from the image to be processed based on the perspective transformation matrix.
In another embodiment of the above apparatus according to the present invention, the text box segmentation module is specifically configured to perform matrix multiplication on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, where each segmented image includes a text box only in the upper left corner.
According to another aspect of the embodiments of the present invention, there is provided a training apparatus for detecting a recognition network, including:
the image input unit is used for inputting the image to be processed into the detection and identification network; wherein, the detection identification network comprises a sharing network layer, a detection network layer and an identification network layer; the image to be processed is marked with text box information and character information contained in the text box;
a first training unit to output a first shared layer feature via the shared network layer; inputting the first sharing layer characteristics and the information of the text box marked by the image to be processed into the identification network layer, and predicting character information included in the text box through the identification network layer; training the sharing network layer and the recognition network layer based on the predicted character information and the labeled character information until a first training completion condition is met; the shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of the small object;
the second training unit is used for inputting the images to be processed into the trained shared network layer and outputting the characteristics of the second shared layer through the trained shared network layer; inputting the second sharing layer characteristics into the detection network layer, predicting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics; and training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met.
In another embodiment of the above apparatus according to the present invention, the first training unit is specifically configured to adjust the network parameter values in the shared network layer and the recognition network layer based on an error between the predicted text information and the labeled text information; and to iteratively recognize the image to be processed with the shared network layer and the recognition network layer after parameter adjustment to obtain predicted text information, until the first training completion condition is met.
In another embodiment of the above apparatus according to the present invention, the first training completion condition includes:
the error between the predicted text information and the marked text information is smaller than a first preset value; or the iterative prediction times are greater than or equal to a first preset time.
In another embodiment of the above apparatus according to the present invention, the second training unit is specifically configured to adjust a parameter of the detection network layer based on an error between the predicted text box information and the labeled text box information; and iteratively executing detection on the image to be processed through the detection network layer after the parameters are adjusted to obtain predicted text box information until a second training completion condition is met.
In another embodiment of the above apparatus according to the present invention, the second training completion condition includes:
the error between the predicted text box information and the labeled text box information is smaller than a second preset value; or the iterative prediction times are greater than or equal to a second preset time.
In another embodiment of the above apparatus according to the present invention, the detection layer characteristics include category information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information;
the second training unit is specifically configured to acquire text box information including characters in the image to be processed through category information of each pixel in the image to be processed; the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box.
In another embodiment of the above apparatus according to the present invention, the second training unit includes:
the text box obtaining module is used for respectively reducing the length and the width of the image to be processed to set proportions on the basis of the category information of the image to be processed and dividing the image to be processed into a plurality of rectangular boxes according to the pixel position relation; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
the information acquisition module is used for acquiring distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box; and acquiring the text box information based on the acquired text box position information and the acquired text box category information.
In another embodiment of the above apparatus according to the present invention, the first training unit includes:
the feature extraction module is used for obtaining corresponding text box features based on the text box information marked by the image to be processed and performing feature fusion on the text box features and first sharing layer features output by the sharing network layer;
and the character prediction module is used for predicting, by the identification network layer, the character information in the text box based on the fused features.
In another embodiment of the above apparatus according to the present invention, the feature extraction module is specifically configured to perform perspective transformation on the labeled text box information, segment a text box from the image to be processed, and generate a corresponding text box feature based on the segmented text box.
In another embodiment of the above apparatus according to the present invention, the feature extraction module includes:
the zooming module is used for obtaining the coordinates of the upper left corner of the text box according to the position information of the text box; keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
the transformation module is used for constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinate and the scaling;
and the text box segmentation module is used for segmenting the text box from the image to be processed based on the perspective transformation matrix.
In another embodiment of the above apparatus according to the present invention, the text box segmentation module is specifically configured to perform matrix multiplication on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, where each segmented image includes a text box only in the upper left corner.
According to another aspect of the embodiments of the present invention, there is provided an electronic device, including a processor, where the processor includes the detection and recognition device as described above or the training device for detecting and recognizing a network as described above.
According to an aspect of an embodiment of the present invention, there is provided an electronic device, including: a memory for storing executable instructions;
and a processor in communication with the memory to execute the executable instructions to perform operations of the detection recognition method as described above or the training method of the detection recognition network as described above.
According to an aspect of the embodiments of the present invention, there is provided a computer storage medium for storing computer readable instructions, wherein the instructions, when executed, perform the operations of the detection recognition method as described above or the training method of the detection recognition network as described above.
Based on the detection and recognition method and the training method, apparatus, equipment and medium for the detection and recognition network provided by the embodiments of the invention, the image to be processed is input into a detection and identification network; the sharing layer characteristics of the image to be processed are output through a sharing network layer; because the sharing layer characteristics output by the sharing network layer are reused, repeated feature extraction on the image is reduced and processing efficiency is improved; the sharing layer characteristics are input into a detection network layer, which outputs text box information for the text boxes containing characters in the image to be processed; the sharing layer characteristics and the text box information are input into an identification network layer, which outputs the character content in the text boxes; detection of text box information and recognition of the character information in the text boxes are thus realized by a single detection and identification network, improving the efficiency and speed of character recognition.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an embodiment of a detection and identification method of the present invention.
Fig. 2 is a schematic structural diagram of an embodiment of the detection and identification device of the present invention.
FIG. 3 is a flowchart of a training method for detecting a recognition network according to an embodiment of the present invention.
FIG. 4 is a schematic structural diagram of an embodiment of a training apparatus for detecting a recognition network according to the present invention.
Fig. 5 is a schematic structural diagram of an electronic device for implementing a terminal device or a server according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In the prior art, most methods with good results use deep learning and divide text detection and recognition into two separate parts: text detection is first performed on the whole picture to obtain the position information of the different pieces of text, and the detected text is then cropped out according to the position information and passed to recognition.
In the process of implementing the invention, the inventors found that the prior art has at least the following problems:
1. the overall accuracy of methods that divide text detection and recognition into two parts is limited by the respective accuracies of detection and of recognition; 2. such methods need to store the intermediate detection result as the input of recognition, and because the two network models for detection and recognition are relatively complex, computation and storage efficiency is low.
FIG. 1 is a flow chart of an embodiment of a detection and identification method of the present invention. As shown in fig. 1, the method of this embodiment includes:
Step 101, inputting an image to be processed into a detection and identification network.
The detection and identification network comprises a sharing network layer, a detection network layer and an identification network layer.
Step 102, outputting the sharing layer characteristics of the image to be processed through the sharing network layer.
The shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of small objects. When text box detection and character recognition are each handled as an independent task, a separate neural network is needed for each; the two networks can be regarded as a text box detection network and a character recognition network, and the processing object of both is an image. A neural network is basically formed by combining a certain number of network layers such as convolution layers, pooling layers and fully connected layers. Because both the text box detection network and the character recognition network process the character information in images, the parameters of their first network layers, which acquire the sharing layer features, can be shared. The sharing layer features capture texture features, edge features and detail features of the image in the detection and identification network, so that detection and recognition of small objects can be better handled. By separating out the network layers common to the text box detection network and the character recognition network as a shared network layer that extracts the features of the image to be processed, repeated processing of the image to be processed is avoided: subsequent text box detection and/or character recognition only needs to input the obtained sharing layer features into the corresponding network layer. Illustratively, the accuracy of text detection and recognition is improved by using a multi-scale feature cascade (the shared layer feature map output by the shared network layer is fused with the known text box information, i.e. features of different layers are fused) and CTC (Connectionist Temporal Classification, a method in deep neural networks for decoding one sequence into another), which performs well in character recognition and better handles smaller characters that are difficult to distinguish in a picture, while sharing part of the network reduces repeated feature extraction from the picture.
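As a concrete illustration of the CTC decoding mentioned above, a minimal greedy CTC decoder is sketched below. The patent only names CTC, so the greedy decoding scheme, the blank convention and the toy character set here are standard-practice assumptions rather than the patented method.

```python
# Minimal greedy CTC decoding sketch: take the best class per timestep, collapse repeats
# and drop the blank symbol. `blank` is the CTC blank index (assumed to be 0 here).
import torch

def ctc_greedy_decode(logits, charset, blank=0):
    """logits: (T, C) per-timestep character scores for one text box."""
    best = logits.argmax(dim=1).tolist()       # best class per timestep
    decoded, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:       # collapse repeats, drop blanks
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)

charset = ["-", "h", "e", "l", "o"]            # index 0 is the blank symbol
logits = torch.tensor([[0.1, 2.0, 0.0, 0.0, 0.0],   # h
                       [0.1, 0.0, 2.0, 0.0, 0.0],   # e
                       [2.0, 0.0, 0.0, 0.0, 0.0],   # blank
                       [0.1, 0.0, 0.0, 2.0, 0.0],   # l
                       [0.1, 0.0, 0.0, 2.0, 0.0],   # l (repeat, collapsed)
                       [2.0, 0.0, 0.0, 0.0, 0.0],   # blank
                       [0.1, 0.0, 0.0, 2.0, 0.0],   # l
                       [0.1, 0.0, 0.0, 0.0, 2.0]])  # o
print(ctc_greedy_decode(logits, charset))       # -> "hello"
```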
Step 103, inputting the sharing layer characteristics into the detection network layer, outputting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics.
Step 104, inputting the sharing layer characteristics and the text box information into the identification network layer, and outputting the character contents in the text box through the identification network layer.
Based on the detection and identification method provided by the embodiment of the invention, the image to be processed is input into a detection and identification network; the sharing layer characteristics of the image to be processed are output through a sharing network layer; because the sharing layer characteristics output by the sharing network layer are reused, repeated feature extraction on the image is reduced and processing efficiency is improved; the sharing layer characteristics are input into a detection network layer, which outputs text box information for the text boxes containing characters in the image to be processed; the sharing layer characteristics and the text box information are input into an identification network layer, which outputs the character content in the text boxes; detection of text box information and recognition of the character information in the text boxes are thus realized by a single detection and identification network, improving the efficiency and speed of character recognition.
The detection and recognition method provided by the invention is applicable to different languages: for each language, the detection and identification network only needs to be trained with characters of that language, and the resulting network can then detect and recognize characters of that language.
In a specific example of the above-described embodiment of the detection and identification method of the present invention, the detection layer characteristics include category information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information; alternatively, the category information may specifically indicate a non-text category by 0 and a text category by 1, or indicate a non-text category by 1 and a text category by 0.
Operation 103 includes:
and acquiring the text box information including characters in the image to be processed according to the category information of each pixel in the image to be processed.
Wherein the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distances from any pixel point in the image to be processed to the upper, lower, left and right sides of the text box and the rotation angle of the text box. In this embodiment, before the detection and identification network is trained with sample images, the sample images need to be labelled: the position of each text box is determined by labelling the category of each pixel in the sample image, where the labelled categories are generally text and non-text (which may be labelled 1 and 0), and the text box information corresponding to each text box containing characters can be determined from these text/non-text labels.
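For illustration, the pixel-level labelling described above could be produced as follows; the use of OpenCV's fillPoly and the mask conventions are assumptions introduced here.

```python
# Illustrative sketch that turns a labelled text-box quadrilateral into the per-pixel
# text / non-text (1 / 0) category map described above.
import numpy as np
import cv2

def make_category_map(image_shape, text_quads):
    """image_shape: (H, W); text_quads: list of 4x2 arrays of quadrilateral corners."""
    category_map = np.zeros(image_shape, dtype=np.uint8)      # 0 = non-text
    for quad in text_quads:
        cv2.fillPoly(category_map, [np.asarray(quad, dtype=np.int32)], 1)  # 1 = text
    return category_map

quad = [[30, 40], [180, 45], [178, 80], [28, 75]]             # one labelled text box
mask = make_category_map((256, 256), [quad])
print(mask.sum())                                              # number of text pixels
```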
In a specific example of the foregoing embodiments of the detection and identification method of the present invention, obtaining, through category information of each pixel in an image to be processed, text box information including a character in the image to be processed includes:
respectively reducing the length and the width of the image to be processed to set proportions based on the category information of the image to be processed, dividing the image to be processed into a plurality of rectangular frames according to the pixel position relation, and marking the category information of each pixel in the rectangular frames as the rectangular frames of the character information to obtain text frames;
obtaining distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box;
text box information is obtained based on the obtained text box position information and text box category information.
With the settings of this embodiment, the image to be processed is labelled as an image containing only 1s and 0s (the category information represents the character category by 1 and the non-character category by 0, or the non-character category by 1 and the character category by 0). Because the predicted positions may be inaccurate during network classification, the length and width of the text box are each reduced to a set proportion (for example, to 0.6 times the original length and width); reducing the size of the text box reduces the influence of inaccurate text positions on the algorithm. The position information of the text box is determined by finding the minimum circumscribed rectangle of the text box: from this circumscribed rectangle, the distances of each pixel in the text box from the upper, lower, left and right sides of the text box can be obtained, and the angle information of the text box is the rotation angle between the minimum circumscribed rectangle and an axis-aligned rectangle.
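A sketch of this shrink-and-describe step is given below. The 0.6 shrink factor is taken from the example in the text, while the use of OpenCV's minAreaRect and the sign conventions for the side distances are assumptions introduced here.

```python
# Shrink a labelled box about its center, read its minimum circumscribed rotated rectangle,
# and derive the per-pixel distances to the four sides plus the rotation angle.
import numpy as np
import cv2

def shrink_and_describe(quad, shrink=0.6):
    """quad: 4x2 corners of a labelled text box."""
    quad = np.asarray(quad, dtype=np.float32)
    center = quad.mean(axis=0)
    shrunk = center + shrink * (quad - center)          # reduce length and width about the center
    (cx, cy), (w, h), angle = cv2.minAreaRect(shrunk)   # minimum circumscribed rectangle
    return (cx, cy), (w, h), angle                      # angle: rotation of the text box (degrees)

def side_distances(pixel, cx, cy, w, h, angle):
    """Distances from a pixel inside the box to its top, bottom, left and right sides.
    Sign conventions depend on the coordinate system; this assumes OpenCV's
    x-right / y-down convention and is only an illustration."""
    theta = np.deg2rad(angle)
    dx, dy = pixel[0] - cx, pixel[1] - cy
    u = dx * np.cos(theta) + dy * np.sin(theta)         # offset along the box width
    v = -dx * np.sin(theta) + dy * np.cos(theta)        # offset along the box height
    return h / 2 + v, h / 2 - v, w / 2 + u, w / 2 - u   # top, bottom, left, right

quad = [[30, 40], [180, 45], [178, 80], [28, 75]]
center, size, angle = shrink_and_describe(quad)
print(side_distances((100, 60), *center, *size, angle))
```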
In another embodiment of the detection and identification method of the present invention, based on the above embodiments, operation 104 includes:
acquiring corresponding text box characteristics based on the output text box information, and performing characteristic fusion on the text box characteristics and sharing layer characteristics output by a sharing network layer;
and the recognition network layer predicts the character information in the text box based on the fused features.
In the embodiment, the feature fusion is to connect the obtained shared layer feature and the detection layer feature together, so that the fused feature not only includes the shared layer feature of the image, but also includes the semantic feature of the detection layer, and can be better used for character detection and recognition.
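A minimal sketch of this feature fusion is shown below; channel-wise concatenation is assumed as the fusion operation, and the tensor shapes are placeholders.

```python
# Fuse cropped text-box features with the corresponding shared-layer features by
# concatenating them along the channel dimension (the concrete operator is an assumption;
# the text only states that the features are connected together).
import torch

shared_layer_features = torch.randn(1, 64, 8, 32)   # cropped from the shared network layer
text_box_features     = torch.randn(1, 32, 8, 32)   # features of the segmented text box
fused = torch.cat([shared_layer_features, text_box_features], dim=1)  # 1 x 96 x 8 x 32
print(fused.shape)
```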
In a specific example of the foregoing embodiments of the detection and identification method of the present invention, obtaining the corresponding text box features based on the output text box information includes:
and carrying out perspective transformation on the text box information, segmenting a text box from the image to be processed, and generating corresponding text box characteristics based on the segmented text box.
In this embodiment, the text box is cropped from the original image according to the position information of the manual label; a perspective transformation may be adopted, that is, an arbitrary labelled quadrilateral is cropped out as a rectangle to serve as the input of the recognition network layer. The formulas are as follows:
t_x = l - x_0
t_y = t - y_0
scale = dst_h / (t + b)
dst_w = scale × (l + r)
(The perspective transformation matrix M is given as an image, Figure GDA0002982885010000121, in the original publication and is not reproduced here.)
Wherein the inputs are: t, b, l and r, the vertical distances from a given point in the arbitrary quadrilateral to its upper, lower, left and right sides; θ, the rotation angle of the arbitrary quadrilateral; dst_h and dst_w, respectively the height and width of the set output rectangular picture; and x_0, y_0, the coordinates of the point in the picture before transformation. The output: the output picture is obtained directly by multiplying the original picture by the perspective transformation matrix M, i.e. the cropped rectangular picture that is fed to the recognition network layer. The text box feature referred to in this embodiment is a text box feature map, which can be obtained from the pixel values corresponding to the segmented text box.
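For illustration, the scalar formulas above translate directly into code as follows; the helper name, example numbers and comments are assumptions added here, and the matrix M, which appears only as a figure in the original publication, is not reconstructed.

```python
# Direct transcription of the formulas above; all variable names follow the definitions
# in the preceding paragraph.
def perspective_parameters(t, b, l, r, x0, y0, dst_h):
    tx = l - x0                      # translation that moves the box's left side (at x0 - l) to x = 0
    ty = t - y0                      # translation that moves the box's top side (at y0 - t) to y = 0
    scale = dst_h / (t + b)          # maps the box height (t + b) to the output height dst_h
    dst_w = scale * (l + r)          # output width implied by the box width (l + r) and the scale
    return tx, ty, scale, dst_w

# Example: a pixel 5 px from the top and 15 px from the bottom of its box, 10 px from the
# left side and 110 px from the right side, mapped to an output of height 32.
print(perspective_parameters(t=5, b=15, l=10, r=110, x0=3, y0=2, dst_h=32))
```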
In a specific example of the foregoing embodiments of the detection and identification method of the present invention, segmenting a text box from an image to be processed includes:
obtaining the coordinates of the upper left corner of the text box according to the position information of the text box;
keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
constructing a perspective transformation matrix based on the rotation angle, the upper left corner coordinate and the scaling of the text box;
and based on the perspective transformation matrix, segmenting a text box from the image to be processed.
In this embodiment, in order to construct the perspective transformation matrix, the coordinates of the upper left corner of the text box are first obtained; and in order to process all the text boxes, the heights of all the text boxes are adjusted to be consistent, so that the adjusted text boxes can be segmented based on the same form of perspective transformation matrix.
In a specific example of the foregoing embodiments of the detection and identification method of the present invention, segmenting a text box from an image to be processed based on a perspective transformation matrix includes:
and performing matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, wherein each segmented image only comprises a text box at the upper left corner.
In this embodiment, only one text box can be segmented by each perspective transformation matrix; all the text boxes are obtained by constructing the perspective transformation matrix for each text box in turn and performing the matrix multiplication with the image to be processed.
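An illustrative reading of this per-box segmentation is sketched below with OpenCV: one perspective transform is built per text box and the full image is warped with it, so that each output has the same size as the input and contains that text box at the upper-left corner. The corner-based construction of the matrix and the helper names are assumptions introduced here.

```python
# One perspective matrix per text box; warping the whole image with it places the
# rectified text box at the upper-left corner of an output the same size as the input.
import numpy as np
import cv2

def segment_all_boxes(image, quads, dst_h=32):
    h, w = image.shape[:2]
    outputs = []
    for quad in quads:
        quad = np.asarray(quad, dtype=np.float32)          # top-left, top-right, bottom-right, bottom-left
        scale = dst_h / np.linalg.norm(quad[3] - quad[0])  # scale = dst_h / (t + b)
        dst_w = scale * np.linalg.norm(quad[1] - quad[0])  # dst_w = scale * (l + r)
        dst = np.float32([[0, 0], [dst_w, 0], [dst_w, dst_h], [0, dst_h]])
        M = cv2.getPerspectiveTransform(quad, dst)         # perspective transformation matrix
        outputs.append(cv2.warpPerspective(image, M, (w, h)))  # same size as the input image
    return outputs

image = np.zeros((256, 256, 3), dtype=np.uint8)
segments = segment_all_boxes(image, [[[30, 40], [180, 45], [178, 80], [28, 75]]])
print(len(segments), segments[0].shape)                    # 1 (256, 256, 3)
```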
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 2 is a schematic structural diagram of an embodiment of the detection and identification device of the present invention. The apparatus of this embodiment may be used to implement the method embodiments of the present invention described above. As shown in fig. 2, the apparatus of this embodiment includes:
and the input unit 21 is used for inputting the image to be processed into the detection and identification network.
The detection and identification network comprises a sharing network layer, a detection network layer and an identification network layer.
And the low-layer extraction unit 22 is used for outputting the shared layer characteristics of the image to be processed through the shared network layer.
The shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of the small object.
The text box detection unit 23 is configured to input the shared layer feature into the detection network layer, output the detection layer feature of the image to be processed through the detection network layer, and obtain text box information including characters in the image to be processed based on the detection layer feature.
And the character recognition unit 24 is used for inputting the sharing layer characteristics and the text box information into the recognition network layer and outputting the character content in the text box through the recognition network layer.
Based on the detection and identification device provided by the embodiment of the invention, the image to be processed is input into a detection and identification network; the sharing layer characteristics of the image to be processed are output through a sharing network layer; because the sharing layer characteristics output by the sharing network layer are reused, repeated feature extraction on the image is reduced and processing efficiency is improved; the sharing layer characteristics are input into a detection network layer, which outputs text box information for the text boxes containing characters in the image to be processed; the sharing layer characteristics and the text box information are input into an identification network layer, which outputs the character content in the text boxes; detection of text box information and recognition of the character information in the text boxes are thus realized by a single detection and identification network, improving the efficiency and speed of character recognition.
In a specific example of the above-described embodiment of the detection and identification device of the present invention, the detection layer characteristics include category information of each pixel in the image to be processed; the category information is used to indicate whether the corresponding pixel is a text category through different information.
The text box detecting unit 23 is specifically configured to obtain text box information including characters in the image to be processed through the category information of each pixel in the image to be processed.
Wherein the text box information includes: text box category information and text box position information; the text box type information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box.
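Purely as an illustration of the fields just listed (the class and field names below are hypothetical, not from the patent), the per-pixel text box information could be represented as:

```python
from dataclasses import dataclass

@dataclass
class TextBoxInfo:
    """Per-pixel text box annotation: category plus geometry."""
    is_text: bool   # text box category information: whether the box contains characters
    top: float      # distances from the pixel to the upper, lower, left and right sides
    bottom: float
    left: float
    right: float
    angle: float    # rotation angle of the text box
```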
In a specific example of the foregoing embodiments of the detection and identification device of the present invention, the text box detection unit 23 includes:
the text box obtaining module is used for respectively reducing the length and the width of the image to be processed to set proportions on the basis of the category information of the image to be processed and dividing the image to be processed into a plurality of rectangular boxes according to the pixel position relation; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
the information acquisition module is used for acquiring distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box; and acquiring the text box information based on the acquired text box position information and the acquired text box category information.
In another embodiment of the detection and identification device of the present invention, based on the above embodiments, the character recognition unit 24 includes:
the feature extraction module is used for obtaining corresponding text box features based on the output text box information and performing feature fusion on the text box features and sharing layer features output by a sharing network layer;
and the character prediction module is used for the recognition network layer to predict the character information in the text box based on the fused features.
In the embodiment, the feature fusion is to connect the obtained shared layer feature and the detection layer feature together, so that the fused feature not only includes the shared layer feature of the image, but also includes the semantic feature of the detection layer, and can be better used for character detection and recognition.
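A minimal sketch of this fusion step, assuming PyTorch tensors in (N, C, H, W) layout with matching spatial sizes (the function name is hypothetical):

```python
import torch

def fuse_features(shared_feat, textbox_feat):
    """Concatenate shared-layer features with the text-box (detection-side)
    features along the channel dimension, so the fused tensor carries both
    the image's shared features and the detection-side semantic features."""
    return torch.cat([shared_feat, textbox_feat], dim=1)
```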
In a specific example of the foregoing embodiments of the detection and recognition device of the present invention, the feature extraction module is specifically configured to perform perspective transformation on the text box information, segment a text box from the image to be processed, and generate a corresponding text box feature based on the segmented text box.
In a specific example of the foregoing embodiments of the detection and identification device of the present invention, the feature extraction module includes:
the zooming module is used for obtaining the coordinates of the upper left corner of the text box according to the position information of the text box; keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
the transformation module is used for constructing a perspective transformation matrix based on the rotation angle, the upper left corner coordinate and the scaling of the text box;
and the text box segmentation module is used for segmenting a text box from the image to be processed based on the perspective transformation matrix.
In a specific example of the foregoing embodiments of the detection and recognition apparatus of the present invention, the text box segmentation module is specifically configured to perform matrix multiplication on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, where each segmented image only includes a text box in an upper left corner.
FIG. 3 is a flowchart of a training method for detecting a recognition network according to an embodiment of the present invention. As shown in fig. 3, the method of this embodiment includes:
step 301, inputting the image to be processed into the detection and identification network.
The detection and recognition network comprises a shared network layer, a detection network layer and a recognition network layer; the image to be processed is annotated with text box information and with the character information contained in each text box. By inputting the image to be processed into the detection and recognition network, the two training tasks of character detection and character recognition can be completed at the same time. Compared with training a character detection network and a character recognition network separately, this is equivalent to using more annotated data and information, which effectively alleviates overfitting and improves the accuracy of the final result; moreover, when the trained network is used for character recognition, two separate networks for detection and recognition are no longer needed, which improves the efficiency and speed of character recognition.
Step 302, outputting a first sharing layer characteristic through a sharing network layer; inputting the first sharing layer characteristics and the text box information marked by the image to be processed into an identification network layer, and predicting character information included in the text box through the identification network layer; and training the shared network layer and the recognition network layer based on the predicted character information and the labeled character information until a first training completion condition is met.
For the detection and recognition network, the shared network layer and the recognition network layer are trained first, with these two layers regarded as one network; the shared layer features output by the shared network layer and the text box information annotated on the image to be processed can be input into the recognition network layer. The shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of small objects.
Step 303, inputting the image to be processed into the trained shared network layer, and outputting the second shared layer characteristic by the trained shared network layer; inputting the second sharing layer characteristics into a detection network layer, predicting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics; and training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met.
In the training method for the detection and recognition network provided by this embodiment of the invention, the shared network layer and the recognition network layer are first trained with the image to be processed; the image to be processed is then input into the trained shared network layer and the untrained detection network layer to obtain predicted text box information, and the detection network layer is trained based on the predicted text box information and the annotated text box information. When the detection network layer is trained, the shared network layer and the detection branch (the detection network layer) are taken as one network; since the shared network layer has already been trained, training this network in effect trains the detection branch. The trained shared network layer, the recognition branch (recognition network layer) and the detection branch (detection network layer) then form the trained detection and recognition network, which can detect and recognize characters at the same time. Because of the shared network layer, repeated feature extraction on the image is reduced, the network structure is lighter, the time and space complexity is reduced, and the model size is reduced.
In a specific example of the above embodiment of the training method for the detection and recognition network according to the present invention, operation 302 of training the shared network layer and the recognition network layer based on the predicted character information and the labeled character information includes:
adjusting network parameter values in the shared network layer and the identified network layer based on an error between the predicted literal information and the labeled literal information;
and iteratively executing the shared network layer and the recognition network layer after the parameters are adjusted to recognize the image to be processed to obtain the predicted character information until the first training completion condition is met.
In this embodiment, the specific process of updating the parameters according to the error may include: taking the error between the predicted character information and the known (labeled) character information as the maximum error; back-propagating the maximum error through the gradients and calculating the error of each layer in the shared network layer and the recognition network layer; calculating the gradient of the parameters of each layer from its error and correcting the parameters of the corresponding layers of the shared network layer and the recognition network layer according to the gradients; and then calculating the error between the character information predicted by the network with the optimized parameters and the known character information, and taking it as the new maximum error.

This process is iterated: the maximum error is back-propagated through the gradients, the error of each layer in the shared network layer and the recognition network layer is calculated, the gradient of the parameters of each layer is calculated from its error, and the parameters of the corresponding layers in the shared network layer and the recognition network layer are corrected according to the gradients, until the preset first training completion condition is met.
The first training completion condition in the above embodiment includes:
the error between the predicted text information and the marked text information is smaller than a first preset value; or the iterative prediction times are greater than or equal to a first preset time.
In network training, the stopping condition may be determined according to the error value, according to the number of training iterations, or by any other stopping condition that those skilled in the art may adopt; the conditions given here are only intended to help those skilled in the art implement the method of this embodiment and do not limit it.
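As a hedged sketch of this first training stage (the optimiser, loss function, data format and function names are assumptions; the patent does not prescribe them), the iterative update with the two stopping conditions above could look like:

```python
import torch

def train_stage1(shared, recognizer, loader, epochs=10, err_threshold=1e-3, max_iters=10000):
    """Stage 1: train the shared layer and the recognition layer together,
    using the labeled text boxes and labeled characters."""
    params = list(shared.parameters()) + list(recognizer.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    it = 0
    for _ in range(epochs):
        for image, gt_boxes, gt_chars in loader:
            feat = shared(image)                    # first shared layer features
            logits = recognizer(feat, gt_boxes)     # labeled boxes guide recognition
            loss = criterion(logits, gt_chars)      # error vs. labeled characters
            opt.zero_grad()
            loss.backward()                         # back-propagate through both layers
            opt.step()
            it += 1
            # first training completion condition: small error or enough iterations
            if loss.item() < err_threshold or it >= max_iters:
                return
```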
In another embodiment of the training method for the detection and recognition network according to the present invention, based on the above embodiments, operation 303 of training the detection network layer based on the predicted text box information and the labeled text box information includes:
adjusting parameters of the detection network layer based on an error between the predicted text box information and the labeled text box information;
and iteratively executing detection on the image to be processed through the detection network layer after the parameters are adjusted to obtain predicted text box information until a second training completion condition is met.
In this embodiment, the parameters of the detection network layer may also be trained by gradient back-propagation, and the specific training process may include: taking the error between the predicted text box information and the known (labeled) text box information as the maximum error; back-propagating the maximum error through the gradients and calculating the error of each layer in the detection network layer (since the shared network layer has already been trained, its parameters do not need to be trained again at this point); calculating the gradient of the parameters of each layer from its error and correcting the parameters of the corresponding layers in the detection network layer according to the gradients; and then calculating the error between the text box information predicted by the detection network layer with the optimized parameters and the known text box information, and taking it as the new maximum error.

This process is iterated: the maximum error is back-propagated through the gradients, the error of each layer in the detection network layer is calculated, the gradient of the parameters of each layer is calculated from its error, and the parameters of the corresponding layers in the detection network layer are corrected according to the gradients, until the preset second training completion condition is met.
The second training completion condition in the above embodiment includes:
the error between the predicted text box information and the labeled text box information is smaller than a second preset value; or the iterative prediction times are greater than or equal to a second preset time.
In network training, the stopping condition may be determined according to the error value, according to the number of training iterations, or by any other stopping condition that those skilled in the art may adopt; the conditions given here are only intended to help those skilled in the art implement the method of this embodiment and do not limit it.
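Analogously, a hedged sketch of this second training stage, in which the already-trained shared layer is kept fixed and only the detection layer parameters are updated (the loss, optimiser, data format and names are again assumptions):

```python
import torch

def train_stage2(shared, detector, loader, err_threshold=1e-3, max_iters=10000):
    """Stage 2: the shared layer is already trained and is not retrained;
    only the detection layer parameters are updated."""
    for p in shared.parameters():
        p.requires_grad_(False)                     # shared layer stays fixed
    opt = torch.optim.Adam(detector.parameters(), lr=1e-3)
    criterion = torch.nn.SmoothL1Loss()             # assumed regression loss on box geometry
    for it, (image, gt_geometry) in enumerate(loader, start=1):
        with torch.no_grad():
            feat = shared(image)                    # second shared layer features
        pred_geometry = detector(feat)              # predicted text box information
        loss = criterion(pred_geometry, gt_geometry)
        opt.zero_grad()
        loss.backward()                             # gradients flow only into the detector
        opt.step()
        # second training completion condition: small error or enough iterations
        if loss.item() < err_threshold or it >= max_iters:
            break
```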
In another embodiment of the training method for detecting and identifying a network, based on the above embodiments, the detection layer features include category information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information; alternatively, the category information may specifically indicate a non-text category by 0 and a text category by 1, or indicate a non-text category by 1 and a text category by 0.
Operation 303 comprises:
and acquiring the text box information including characters in the image to be processed according to the category information of each pixel in the image to be processed.
The text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters; the text box position information comprises the distances from any pixel point in the image to be processed to the upper, lower, left and right sides of the text box, and the rotation angle of the text box. In this embodiment, before the detection and recognition network is trained with the sample image to be processed, the sample image needs to be annotated: the position of each text box is determined by labeling the category of every pixel in the sample image, where the labeled categories generally include text and non-text (which may be labeled with 1 and 0); once text and non-text have been labeled, the text box information corresponding to each text box containing characters can be determined.
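As an illustration of this per-pixel 1/0 annotation (a sketch assuming OpenCV; the function name and the exact annotation format are hypothetical):

```python
import numpy as np
import cv2

def make_pixel_labels(height, width, text_quads):
    """Build the per-pixel category map: 1 for pixels inside an annotated
    text quadrilateral, 0 elsewhere (the 1/0 convention described above)."""
    labels = np.zeros((height, width), dtype=np.uint8)
    for quad in text_quads:                          # quad: four (x, y) corner points
        pts = np.asarray(quad, dtype=np.int32).reshape(-1, 1, 2)
        cv2.fillPoly(labels, [pts], 1)               # mark the text region with 1s
    return labels
```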
In a specific example of the above embodiments of the training method for detecting and identifying a network according to the present invention, obtaining text box information including characters in an image to be processed through category information of each pixel in the image to be processed includes:
respectively reducing the length and the width of the image to be processed to set proportions based on the category information of the image to be processed, and segmenting the image to be processed into a plurality of rectangular frames according to the pixel position relationship; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
obtaining distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box;
text box information is obtained based on the obtained text box position information and text box category information.
With the arrangement of this embodiment, the image to be processed is annotated as an image containing only 1s and 0s (the category information representing the character category by 1 and the non-character category by 0, or the reverse). Because the positions predicted during network classification may be inaccurate, the length and the width of the text box are reduced to a set proportion (for example, to 0.6 times the original); shrinking the text box in this way reduces the influence of inaccurate text positions on the algorithm. The position information of the text box is determined by finding the minimum circumscribed rectangle of the text region: from this rectangle, the distances from each pixel in the text box to the upper, lower, left and right sides of the text box can be obtained, and the angle information of the text box is the rotation angle between the minimum circumscribed rectangle and an axis-aligned (forward-placed) rectangle.
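A minimal sketch of this step, assuming OpenCV's minimum-area rectangle; the 0.6 shrink factor follows the example above, while the function name and point format are assumptions:

```python
import numpy as np
import cv2

def box_geometry(region_points, shrink=0.6):
    """From the pixels labeled as text, find the minimum circumscribed
    (rotated) rectangle, shrink its length and width, and report the
    shrunk corners together with the rotation angle."""
    pts = np.asarray(region_points, dtype=np.float32)      # (N, 2) array of (x, y)
    (cx, cy), (w, h), angle = cv2.minAreaRect(pts)          # minimum circumscribed rectangle
    w, h = w * shrink, h * shrink                           # shrink to reduce the effect of
                                                            # imprecise text positions
    corners = cv2.boxPoints(((cx, cy), (w, h), angle))      # 4 corners of the shrunk box
    return corners, angle
```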
In another embodiment of the training method for detecting and identifying a network, based on the foregoing embodiments, operation 302 includes:
acquiring corresponding text box characteristics based on text box information labeled by an image to be processed, and performing characteristic fusion on the text box characteristics and first sharing layer characteristics output by a sharing network layer;
and the recognition network layer predicts the character information in the text box based on the fused features.
In the embodiment, the feature fusion is to connect the obtained shared layer feature and the detection layer feature together, so that the fused feature not only includes the shared layer feature of the image, but also includes the semantic feature of the detection layer, and can be better used for character detection and recognition.
In a specific example of the foregoing embodiments of the training method for detecting and identifying a network according to the present invention, obtaining a corresponding text box feature based on text box information labeled with an image to be processed includes:
and carrying out perspective transformation on the labeled text box information, segmenting a text box from the image to be processed, and generating corresponding text box characteristics based on the segmented text box.
In this embodiment, the text box is cropped from the original image according to the manually labeled position information, and a perspective transformation may be adopted, that is, the arbitrary quadrilateral obtained from the annotation is warped into a rectangle that serves as the input of the recognition network layer. The formulas are as follows:
t_x = l - x_0

t_y = t - y_0

scale = dst_h / (t + b)

dst_w = scale × (l + r)

M: the perspective transformation matrix (given in the original text as a figure, reference GDA0002982885010000181, and not reproduced here)

where the inputs are: t, b, l and r, the vertical distances from a point in the arbitrary quadrilateral to its upper, lower, left and right sides; θ, the rotation angle of the arbitrary quadrilateral; dst_h and dst_w, the height and width of the set output rectangular image; and (x_0, y_0), the coordinates of the point in the image before transformation. Output: multiplying the original image by the perspective transformation matrix M directly yields the output image, i.e. the cropped rectangular image used as the input of the recognition network layer. The text box feature referred to in this embodiment is a text box feature map, which can be obtained from the pixel values corresponding to the text box.
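A hedged sketch of this cropping step: the patent's own transformation matrix M is only given as a figure, so the sketch below builds an equivalent matrix with OpenCV's getPerspectiveTransform from the labeled quadrilateral, which is an assumption rather than the patent's formula; the scale and output-width computations mirror the formulas above, and the function name and corner ordering are illustrative.

```python
import numpy as np
import cv2

def crop_text_box(image, quad, dst_h=32):
    """Warp one labeled quadrilateral (4 corner points, clockwise from the
    upper-left) into an axis-aligned rectangle of height dst_h, keeping the
    aspect ratio, as input for the recognition branch."""
    quad = np.asarray(quad, dtype=np.float32)
    side_w = np.linalg.norm(quad[1] - quad[0])          # approximate box width  (l + r)
    side_h = np.linalg.norm(quad[3] - quad[0])          # approximate box height (t + b)
    scale = dst_h / side_h                               # scale = dst_h / (t + b)
    dst_w = int(round(scale * side_w))                   # dst_w = scale x (l + r)
    dst = np.array([[0, 0], [dst_w, 0], [dst_w, dst_h], [0, dst_h]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(quad, dst)           # perspective transformation matrix
    return cv2.warpPerspective(image, M, (dst_w, dst_h))
```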
In a specific example of the above embodiments of the training method for detecting and identifying a network according to the present invention, segmenting a text box from an image to be processed includes:
obtaining the coordinates of the upper left corner of the text box according to the position information of the text box;
keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
constructing a perspective transformation matrix based on the rotation angle, the upper left corner coordinate and the scaling of the text box;
and segmenting the text box from the image to be processed based on the perspective transformation matrix.
In this embodiment, in order to construct the perspective transformation matrix, the coordinates of the upper left corner of the text box are obtained first; and in order to extract all of the text boxes, the heights of all the text boxes are adjusted to be consistent, so that the adjusted text boxes can all be segmented based on one perspective transformation matrix.
In a specific example of the above embodiments of the training method for detecting and identifying a network according to the present invention, segmenting a text box from an image to be processed based on a perspective transformation matrix includes:
and performing matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, wherein each segmented image only comprises a text box at the upper left corner.
In this embodiment, only one text box can be segmented by the perspective transformation matrix at a time; all the text boxes are obtained by shifting (translating) the perspective transformation matrix and performing the matrix multiplication with the image to be processed once for each text box.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by program instructions executed on relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The detection and recognition method provided by the invention is applicable to different languages: for each language, only characters of that language need to be used when training the detection and recognition network, and the resulting detection and recognition network can then detect and recognize characters of that language.
FIG. 4 is a schematic structural diagram of an embodiment of a training apparatus for detecting a recognition network according to the present invention. The apparatus of this embodiment may be used to implement the method embodiments of the present invention described above. As shown in fig. 4, the apparatus of this embodiment includes:
and an image input unit 41, configured to input the image to be processed into the detection and recognition network.
The detection and identification network comprises a shared network layer, a detection network layer and an identification network layer; the image to be processed is marked with text box information and character information included in the text box.
A first training unit 42 for outputting a first shared layer feature via the shared network layer; inputting the first sharing layer characteristics and the text box information marked by the image to be processed into an identification network layer, and predicting character information included in the text box through the identification network layer; and training the shared network layer and the recognition network layer based on the predicted character information and the labeled character information until a first training completion condition is met.
The shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of small objects.
A second training unit 43, configured to input the image to be processed into the trained shared network layer, and output a second shared layer feature by the trained shared network layer; inputting the second sharing layer characteristics into a detection network layer, predicting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics; and training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met.
Based on the training device for the detection and recognition network provided by this embodiment of the invention, the shared network layer and the recognition network layer are first trained with the image to be processed; the image to be processed is then input into the trained shared network layer and the untrained detection network layer to obtain predicted text box information, and the detection network layer is trained based on the predicted text box information and the annotated text box information. When the detection network layer is trained, the shared network layer and the detection branch (the detection network layer) are taken as one network; since the shared network layer has already been trained, training this network in effect trains the detection branch. The trained shared network layer, the recognition branch (recognition network layer) and the detection branch (detection network layer) then form the trained detection and recognition network, which can detect and recognize characters at the same time. Because of the shared network layer, repeated feature extraction on the image is reduced, the network structure is lighter, the time and space complexity is reduced, and the model size is reduced.
In a specific example of the above embodiment of the training apparatus for detecting a recognition network according to the present invention, the first training unit is specifically configured to adjust network parameter values in the shared network layer and the recognition network layer based on an error between the predicted textual information and the labeled textual information; and iteratively executing the shared network layer and the recognition network layer after the parameters are adjusted to recognize the image to be processed to obtain the predicted character information until the first training completion condition is met.
The preset first training completion condition satisfied in the above embodiment includes:
the error between the predicted text information and the marked text information is smaller than a first preset value; or the iterative prediction times are greater than or equal to a first preset time.
In another embodiment of the training apparatus for detecting and identifying a network according to the present invention, on the basis of the above embodiments, the second training unit is specifically configured to adjust a parameter of a detection network layer based on an error between predicted text box information and labeled text box information; and iteratively executing the detection of the image to be processed through the detection network layer after the parameters are adjusted to obtain the predicted text box information until a preset second training completion condition is met.
In this embodiment, the parameters of the detection network layer may also be trained by gradient back-propagation, and the specific training process may include: taking the error between the predicted text box information and the known (labeled) text box information as the maximum error; back-propagating the maximum error through the gradients and calculating the error of each layer in the detection network layer (since the shared network layer has already been trained, its parameters do not need to be trained again at this point); calculating the gradient of the parameters of each layer from its error and correcting the parameters of the corresponding layers in the detection network layer according to the gradients; and then calculating the error between the text box information predicted by the detection network layer with the optimized parameters and the known text box information, and taking it as the new maximum error.

This process is iterated: the maximum error is back-propagated through the gradients, the error of each layer in the detection network layer is calculated, the gradient of the parameters of each layer is calculated from its error, and the parameters of the corresponding layers in the detection network layer are corrected according to the gradients, until the second training completion condition is met.
The preset second training completion condition satisfied in the above embodiment includes:
the error between the predicted text box information and the labeled text box information is smaller than a second preset value; or the iterative prediction times are greater than or equal to a second preset time.
In yet another embodiment of the training apparatus for detecting a recognition network according to the present invention, based on the above embodiments,
the detection layer characteristics comprise the category information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information;
the second training unit 43 is specifically configured to obtain, through the category information of each pixel in the image to be processed, text box information including characters in the image to be processed.
The text box information includes: text box category information and text box position information; the text box type information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box. In this embodiment, before training the detection and recognition network through the image to be processed, the image to be processed needs to be labeled, and the position of the text box is determined by labeling the category of each pixel in the image to be processed, where the generally labeled category includes text and non-text (which may be labeled with 1 and 0), and the text box information corresponding to the text box including the text can be determined after labeling the text and the non-text.
In one specific example of the above-described embodiments of the training apparatus for detecting a recognition network of the present invention,
a second training unit comprising:
the text box obtaining module is used for respectively reducing the length and the width of the image to be processed to set proportions on the basis of the category information of the image to be processed and dividing the image to be processed into a plurality of rectangular boxes according to the pixel position relation; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
the information acquisition module is used for acquiring distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box; and acquiring the text box information based on the acquired text box position information and the acquired text box category information.
In another embodiment of the training apparatus for detecting and identifying a network according to the present invention, on the basis of the above embodiments, the first training unit 42 includes:
the feature extraction module is used for obtaining corresponding text box features based on the text box information marked by the image to be processed and performing feature fusion on the text box features and first sharing layer features output by a sharing network layer;
and the character prediction module is used for the recognition network layer to predict the character information in the text box based on the fused features.
In the embodiment, the feature fusion is to connect the obtained shared layer feature and the detection layer feature together, so that the fused feature not only includes the shared layer feature of the image, but also includes the semantic feature of the detection layer, and can be better used for character detection and recognition.
In a specific example of each of the above embodiments of the training apparatus for detecting and identifying a network according to the present invention, the feature extraction module is specifically configured to perform perspective transformation on the labeled text box information, segment a text box from an image to be processed, and generate a corresponding text box feature based on the segmented text box.
In a specific example of the above embodiments of the training apparatus for detecting and identifying a network of the present invention, the feature extraction module includes:
the zooming module is used for obtaining the coordinates of the upper left corner of the text box according to the position information of the text box; keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
the transformation module is used for constructing a perspective transformation matrix based on the rotation angle, the upper left corner coordinate and the scaling of the text box;
and the text box segmentation module is used for segmenting a text box from the image to be processed based on the perspective transformation matrix.
In a specific example of the above embodiments of the training apparatus for detecting and identifying a network of the present invention, the text box segmentation module is specifically configured to perform matrix multiplication on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, where each segmented image only includes a text box in an upper left corner.
According to an aspect of the embodiments of the present invention, there is provided an electronic device, including a processor, where the processor includes the detection recognition apparatus or the training apparatus for detecting a recognition network according to any of the above embodiments of the present invention.
According to an aspect of an embodiment of the present invention, there is provided an electronic apparatus including: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform the operations of the detection recognition method or the training method of the detection recognition network according to any of the above embodiments of the present invention.
According to an aspect of the embodiments of the present invention, there is provided a computer storage medium for storing computer readable instructions, which when executed, perform the operations of any one of the above-described embodiments of the method for detecting recognition or the training method for detecting a recognition network of the present invention.
The embodiment of the invention also provides an electronic device, which may be a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Referring now to fig. 5, a schematic structural diagram of an electronic device 500 suitable for implementing a terminal device or a server according to an embodiment of the present application is shown. As shown in fig. 5, the electronic device 500 includes one or more processors, a communication section, and the like, for example: one or more central processing units (CPUs) 501 and/or one or more graphics processors (GPUs) 513, which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 502 or loaded from a storage section 508 into a random access memory (RAM) 503. The communication portion 512 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card.
The processor may communicate with the read-only memory 502 and/or the random access memory 503 to execute the executable instructions, connect with the communication portion 512 through the bus 504, and communicate with other target devices through the communication portion 512, so as to complete the operations corresponding to any of the methods provided by the embodiments of the present application, for example: inputting the image to be processed into the detection and recognition network; outputting the shared layer features of the image to be processed via the shared network layer; inputting the shared layer features into the detection network layer, outputting the detection layer features of the image to be processed via the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer features; and inputting the shared layer features and the text box information into the recognition network layer, and outputting the character content in the text box via the recognition network layer.
In addition, various programs and data necessary for the operation of the apparatus can also be stored in the RAM 503. The CPU 501, the ROM 502 and the RAM 503 are connected to each other via the bus 504. Where a RAM 503 is present, the ROM 502 is an optional module: the RAM 503 stores executable instructions, or executable instructions are written into the ROM 502 at runtime, and the executable instructions cause the processor 501 to perform the operations corresponding to the above-described methods. An input/output (I/O) interface 505 is also connected to the bus 504. The communication portion 512 may be integrated, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) connected to the bus link.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
It should be noted that the architecture shown in fig. 5 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 5 may be selected, deleted, added or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication part may be separately set or integrated on the CPU or the GPU, and so on. These alternative embodiments are all within the scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart, the program code may include instructions corresponding to performing the method steps provided by embodiments of the present disclosure, e.g., inputting an image to be processed into a detection recognition network; outputting the sharing layer characteristics of the image to be processed through a sharing network layer; inputting the sharing layer characteristics into a detection network layer, outputting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics; and inputting the sharing layer characteristics and the text box information into the identification network layer, and outputting the character content in the text box through the identification network layer. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501.
The method and apparatus, device of the present invention may be implemented in a number of ways. For example, the method, apparatus and device of the present invention may be implemented by software, hardware, firmware or any combination of software, hardware and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (35)

1. A method for detection and identification, comprising:
inputting an image to be processed into a detection and identification network; the detection identification network comprises a sharing network layer, a detection network layer and an identification network layer;
outputting the sharing layer characteristics of the image to be processed through the sharing network layer; the shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of the small object;
inputting the sharing layer characteristics into the detection network layer, outputting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics;
acquiring corresponding text box characteristics based on the output text box information, and performing characteristic fusion on the text box characteristics and sharing layer characteristics output by the sharing network layer;
and the recognition network layer predicts the character information in the text box based on the fused features.
2. The method according to claim 1, wherein the detection layer features comprise class information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information;
the obtaining of the text box information including the characters in the image to be processed based on the detection layer features includes:
acquiring text box information including characters in the image to be processed according to the category information of each pixel in the image to be processed, wherein the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box.
3. The method according to claim 2, wherein the obtaining of the text box information including characters in the image to be processed through the category information of each pixel in the image to be processed comprises:
respectively reducing the length and the width of the image to be processed to set proportions based on the category information of the image to be processed, and segmenting the image to be processed into a plurality of rectangular frames according to the pixel position relationship; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
obtaining distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box;
and obtaining the text box information based on the obtained text box position information and the obtained text box category information.
4. The method according to any one of claims 1-3, wherein the obtaining the corresponding text box feature based on the output text box information comprises:
and carrying out perspective transformation on the text box information, segmenting a text box from the image to be processed, and generating corresponding text box characteristics based on the segmented text box.
5. The method of claim 4, wherein segmenting the text box from the image to be processed comprises:
obtaining the coordinates of the upper left corner of the text box according to the position information of the text box;
keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinate and the scaling;
and segmenting the text box from the image to be processed based on the perspective transformation matrix.
6. The method of claim 5, wherein the segmenting the text box from the image to be processed based on the perspective transformation matrix comprises:
and performing matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmentation image with the same size as the image to be processed, wherein each segmentation image only comprises a text box at the upper left corner.
7. A training method for detecting and identifying a network is characterized by comprising the following steps:
inputting an image to be processed into a detection and identification network; wherein, the detection identification network comprises a sharing network layer, a detection network layer and an identification network layer; the image to be processed is marked with text box information and character information contained in the text box;
outputting a first shared layer feature via the shared network layer; inputting the first sharing layer characteristics and the information of the text box marked by the image to be processed into the identification network layer, and predicting character information included in the text box through the identification network layer; training the shared network layer and the recognition network layer based on the predicted text information and the labeled text information until a first training completion condition is met, wherein the shared layer features are used for embodying at least one of the following features in the image: texture features, edge features and detail features of the small object;
predicting, via the identified network layer, textual information included in the text box, including:
acquiring corresponding text box characteristics based on the text box information marked by the image to be processed, and performing characteristic fusion on the text box characteristics and first sharing layer characteristics output by the sharing network layer;
the recognition network layer predicts the character information in the text box based on the fused features;
inputting an image to be processed into a trained shared network layer, and outputting a second shared layer characteristic by the trained shared network layer; inputting the second sharing layer characteristics into the detection network layer, predicting the detection layer characteristics of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer characteristics; and training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met.
8. The method of claim 7, wherein training the shared network layer and the recognition network layer based on the predicted textual information and the tagged textual information until a first training completion condition is met comprises:
adjusting network parameter values in the shared network layer and the identified network layer based on an error between the predicted literal information and the labeled literal information;
and iteratively executing the shared network layer and the recognition network layer after the parameters are adjusted to recognize the image to be processed to obtain the predicted character information until the first training completion condition is met.
9. The method of claim 8, wherein the first training completion condition comprises:
the error between the predicted text information and the marked text information is smaller than a first preset value; or the iterative prediction times are greater than or equal to a first preset time.
10. The method of any of claims 7-9, wherein training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met comprises:
adjusting parameters of the detection network layer based on an error between predicted text box information and labeled text box information;
and iteratively executing detection on the image to be processed through the detection network layer after the parameters are adjusted to obtain predicted text box information until a second training completion condition is met.
11. The method of claim 10, wherein the second training completion condition comprises:
the error between the predicted text box information and the labeled text box information is smaller than a second preset value; or the iterative prediction times are greater than or equal to a second preset time.
12. The method according to any one of claims 7 to 9, wherein the detection layer characteristics include class information of each pixel in the image to be processed; the category information is used for marking whether the corresponding pixel is a character category or not through different information;
obtaining text box information including words in the image to be processed based on the detection layer features comprises:
acquiring text box information including characters in the image to be processed according to the category information of each pixel in the image to be processed, wherein the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distance from any pixel point in the image to be processed to the upper part, the lower part, the left part and the right part of the text box and the rotation angle of the text box.
13. The method according to claim 12, wherein the obtaining text box information including words in the image to be processed through the category information of each pixel in the image to be processed comprises:
respectively reducing the length and the width of the image to be processed to set proportions based on the category information of the image to be processed, and segmenting the image to be processed into a plurality of rectangular frames according to the pixel position relationship; obtaining a text box based on the rectangular box in which the category information of each pixel in the rectangular box is marked as character information;
obtaining distance information of any pixel point in the image to be processed from the upper part, the lower part, the left part and the right part of the text box and rotation angle information of the text box;
and obtaining the text box information based on the obtained text box position information and the obtained text box category information.
14. The method according to any one of claims 7 to 9, wherein obtaining corresponding text box features based on the text box information of the image annotation to be processed comprises:
and carrying out perspective transformation on the labeled text box information, segmenting a text box from the image to be processed, and generating corresponding text box characteristics based on the segmented text box.
15. The method of claim 14, wherein segmenting a text box from the image to be processed comprises:
obtaining the coordinates of the upper left corner of the text box according to the position information of the text box;
keeping the ratio of the height and the width of the text box unchanged, and zooming the text box to make the height of each text box consistent;
constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinate and the scaling;
and segmenting the text box from the image to be processed based on the perspective transformation matrix.
16. The method of claim 15, wherein segmenting the text box from the image to be processed based on the perspective transformation matrix comprises:
and performing matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmentation image with the same size as the image to be processed, wherein each segmentation image only comprises a text box at the upper left corner.
17. A detection and recognition device, comprising:
the input unit is used for inputting an image to be processed into a detection recognition network; the detection recognition network comprises a shared network layer, a detection network layer and a recognition network layer;
the low-layer extraction unit is used for outputting, via the shared network layer, the shared layer features of the image to be processed, wherein the shared layer features are used to embody at least one of the following features in the image: texture features, edge features and detail features of small objects;
the text box detection unit is used for inputting the shared layer features into the detection network layer, outputting the detection layer features of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer features;
the character recognition unit is used for inputting the shared layer features and the text box information into the recognition network layer and outputting the character content in the text box through the recognition network layer;
the character recognition unit includes:
the feature extraction module is used for obtaining corresponding text box features based on the output text box information and fusing the text box features with the shared layer features output by the shared network layer;
and the character prediction module is used for predicting, by the recognition network layer, character information in the text box based on the fused features.
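To make the division of labour in claim 17 concrete, the PyTorch sketch below mirrors the three parts: a shared backbone producing low-level features, a detection head producing per-pixel text scores plus box geometry, and a recognition head fed with per-box features cut out of the shared feature map. Layer widths, the crude slicing-based feature fusion and the character vocabulary size are illustrative assumptions rather than the patented architecture.

```python
import torch
import torch.nn as nn

class DetectRecognizeNet(nn.Module):
    def __init__(self, num_chars=37):
        super().__init__()
        # Shared network layer: low-level texture / edge / small-object detail features.
        self.shared = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # Detection network layer: 1 text/non-text score, 4 edge distances, 1 angle per pixel.
        self.detect = nn.Conv2d(64, 1 + 4 + 1, 1)
        # Recognition network layer: predicts a character sequence from fused box features.
        self.recognize = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 32)),  # collapse height, keep 32 time steps
            nn.Conv2d(128, num_chars, 1))

    def forward(self, image, boxes):
        shared = self.shared(image)    # shared layer features
        det = self.detect(shared)      # detection layer features
        char_logits = []
        for (x0, y0, x1, y1) in boxes:  # boxes given in feature-map coordinates
            # Crude stand-in for the feature extraction / fusion module:
            # slice the shared feature map inside the text box.
            box_feat = shared[:, :, y0:y1, x0:x1]
            char_logits.append(self.recognize(box_feat))
        return det, char_logits

# Example usage with a dummy image and one annotated box.
net = DetectRecognizeNet()
det_map, chars = net(torch.randn(1, 3, 256, 256), [(10, 20, 120, 52)])
```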
18. The apparatus according to claim 17, wherein the detection layer features comprise category information of each pixel in the image to be processed; the category information marks, with different values, whether the corresponding pixel belongs to the character category;
the text box detection unit is specifically configured to obtain text box information including characters in the image to be processed through the category information of each pixel in the image to be processed, where the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distances from any pixel point in the image to be processed to the top, bottom, left and right edges of the text box, and the rotation angle of the text box.
19. The apparatus of claim 18, wherein the text box detecting unit comprises:
the text box obtaining module is used for reducing the length and the width of the image to be processed to set proportions respectively, based on the category information of the image to be processed, and dividing the image to be processed into a plurality of rectangular boxes according to the pixel position relation; obtaining a text box from any rectangular box in which the category information of every pixel in the rectangular box is marked as character information;
the information acquisition module is used for acquiring the distances from any pixel point in the image to be processed to the top, bottom, left and right edges of the text box and the rotation angle information of the text box; and acquiring the text box information based on the acquired text box position information and the acquired text box category information.
20. The apparatus according to any one of claims 17-19, wherein the feature extraction module is specifically configured to perform perspective transformation according to the text box information, segment a text box from the image to be processed, and generate corresponding text box features based on the segmented text box.
21. The apparatus of claim 20, wherein the feature extraction module comprises:
the scaling module is used for obtaining the coordinates of the upper left corner of the text box according to the position information of the text box; keeping the ratio of the height to the width of the text box unchanged, and scaling the text box so that the heights of all text boxes are consistent;
the transformation module is used for constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinates, and the scaling ratio;
and the text box segmentation module is used for segmenting the text box from the image to be processed based on the perspective transformation matrix.
22. The apparatus according to claim 21, wherein the text box segmentation module is specifically configured to perform a matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, and each segmented image contains a text box only at its upper left corner.
23. A training apparatus for a detection recognition network, comprising:
the image input unit is used for inputting an image to be processed into the detection recognition network; wherein the detection recognition network comprises a shared network layer, a detection network layer and a recognition network layer; the image to be processed is annotated with text box information and the character information contained in the text box;
the first training unit is used for outputting first shared layer features via the shared network layer; inputting the first shared layer features and the text box information annotated on the image to be processed into the recognition network layer, and predicting the character information included in the text box through the recognition network layer; and training the shared network layer and the recognition network layer based on the predicted character information and the labeled character information until a first training completion condition is met, wherein the shared layer features are used to embody at least one of the following features in the image: texture features, edge features and detail features of small objects;
the first training unit includes:
the feature extraction module is used for obtaining corresponding text box features based on the text box information annotated on the image to be processed, and performing feature fusion on the text box features and the first shared layer features output by the shared network layer;
the character prediction module is used for predicting, by the recognition network layer, the character information in the text box based on the fused features;
the second training unit is used for inputting the image to be processed into the trained shared network layer and outputting second shared layer features through the trained shared network layer; inputting the second shared layer features into the detection network layer, predicting the detection layer features of the image to be processed through the detection network layer, and acquiring text box information including characters in the image to be processed based on the detection layer features; and training the detection network layer based on the predicted text box information and the labeled text box information until a second training completion condition is met.
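Claim 23 describes a two-stage schedule: first the shared layer and the recognition layer are trained with the annotated text boxes and character labels, then the already-trained shared layer feeds the detection layer, which is trained against the annotated boxes. A rough PyTorch sketch, reusing the DetectRecognizeNet sketch above, is given below; the loss functions, learning rates, batch format and the decision to freeze the shared layer in stage two are assumptions made for illustration, not the patented training recipe.

```python
import torch

def train_two_stages(net, loader, stage1_steps=1000, stage2_steps=1000):
    """`loader` is assumed to yield (image, gt_boxes, gt_chars, gt_geometry) tuples."""
    rec_loss = torch.nn.CrossEntropyLoss()
    det_loss = torch.nn.SmoothL1Loss()

    # Stage 1: shared network layer + recognition network layer,
    # driven by the annotated text boxes and character labels.
    opt1 = torch.optim.Adam(
        list(net.shared.parameters()) + list(net.recognize.parameters()), lr=1e-3)
    for _, (img, gt_boxes, gt_chars, gt_geometry) in zip(range(stage1_steps), loader):
        _, char_logits = net(img, gt_boxes)  # use the annotated boxes directly
        loss = sum(rec_loss(logits.flatten(2).squeeze(0).t(), chars)
                   for logits, chars in zip(char_logits, gt_chars))
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: detection network layer on top of the trained (here: frozen) shared layer.
    for p in net.shared.parameters():
        p.requires_grad = False
    opt2 = torch.optim.Adam(net.detect.parameters(), lr=1e-3)
    for _, (img, gt_boxes, gt_chars, gt_geometry) in zip(range(stage2_steps), loader):
        det, _ = net(img, gt_boxes)
        loss = det_loss(det, gt_geometry)  # per-pixel score + geometry targets
        opt2.zero_grad(); loss.backward(); opt2.step()
```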
24. The apparatus according to claim 23, wherein the first training unit is specifically configured to adjust the network parameter values in the shared network layer and the recognition network layer based on an error between the predicted character information and the labeled character information; and iteratively recognize the image to be processed with the parameter-adjusted shared network layer and recognition network layer to obtain predicted character information, until the first training completion condition is met.
25. The apparatus of claim 24, wherein the first training completion condition comprises:
the error between the predicted character information and the labeled character information is smaller than a first preset value; or the number of iterative predictions is greater than or equal to a first preset number.
26. The apparatus according to any one of claims 23-25, wherein the second training unit is specifically configured to adjust the parameters of the detection network layer based on an error between the predicted text box information and the labeled text box information; and iteratively detect the image to be processed with the parameter-adjusted detection network layer to obtain predicted text box information, until the second training completion condition is met.
27. The apparatus of claim 26, wherein the second training completion condition comprises:
the error between the predicted text box information and the labeled text box information is smaller than a second preset value; or the number of iterative predictions is greater than or equal to a second preset number.
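Claims 25 and 27 express the same stopping rule for both training stages: training is complete once the error falls below a preset value or the number of iterative predictions reaches a preset count. A tiny illustrative helper, with placeholder threshold and iteration limit:

```python
def training_complete(error, iteration, max_error=1e-3, max_iterations=100_000):
    """Stopping rule of claims 25/27: small enough error, or enough iterations."""
    return error < max_error or iteration >= max_iterations
```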
28. The apparatus according to any one of claims 23-25, wherein the detection layer features include category information of each pixel in the image to be processed; the category information marks, with different values, whether the corresponding pixel belongs to the character category;
the second training unit is specifically configured to acquire text box information including characters in the image to be processed through the category information of each pixel in the image to be processed; the text box information includes: text box category information and text box position information; the text box category information is used for indicating whether the text box contains characters or not; the text box position information comprises the distances from any pixel point in the image to be processed to the top, bottom, left and right edges of the text box, and the rotation angle of the text box.
29. The apparatus of claim 28, wherein the second training unit comprises:
the text box obtaining module is used for reducing the length and the width of the image to be processed to set proportions respectively, based on the category information of the image to be processed, and dividing the image to be processed into a plurality of rectangular boxes according to the pixel position relation; obtaining a text box from any rectangular box in which the category information of every pixel in the rectangular box is marked as character information;
the information acquisition module is used for acquiring the distances from any pixel point in the image to be processed to the top, bottom, left and right edges of the text box and the rotation angle information of the text box; and acquiring the text box information based on the acquired text box position information and the acquired text box category information.
30. The apparatus according to any one of claims 23-25, wherein the feature extraction module is specifically configured to perform perspective transformation according to the labeled text box information, segment a text box from the image to be processed, and generate corresponding text box features based on the segmented text box.
31. The apparatus of claim 30, wherein the feature extraction module comprises:
the scaling module is used for obtaining the coordinates of the upper left corner of the text box according to the position information of the text box; keeping the ratio of the height to the width of the text box unchanged, and scaling the text box so that the heights of all text boxes are consistent;
the transformation module is used for constructing a perspective transformation matrix based on the rotation angle of the text box, the upper left corner coordinates, and the scaling ratio;
and the text box segmentation module is used for segmenting the text box from the image to be processed based on the perspective transformation matrix.
32. The apparatus according to claim 31, wherein the text box segmentation module is specifically configured to perform a matrix multiplication operation on the perspective transformation matrix and the image to be processed to obtain a segmented image with the same size as the image to be processed, and each segmented image contains a text box only at its upper left corner.
33. An electronic device, comprising a processor, wherein the processor comprises the detection and recognition device of any one of claims 17 to 22 or the training apparatus for a detection recognition network of any one of claims 23 to 32.
34. An electronic device, comprising: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform the operations of the detection and recognition method of any one of claims 1 to 6 or the training method of the detection and recognition network of any one of claims 7 to 16.
35. A computer storage medium storing computer readable instructions, wherein the instructions, when executed, perform the operations of the detection and recognition method of any one of claims 1 to 6 or the training method of the detection and recognition network of any one of claims 7 to 16.
CN201711126372.9A 2017-11-14 2017-11-14 Detection recognition and training method, device, equipment and medium for detection recognition network Active CN108229303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711126372.9A CN108229303B (en) 2017-11-14 2017-11-14 Detection recognition and training method, device, equipment and medium for detection recognition network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711126372.9A CN108229303B (en) 2017-11-14 2017-11-14 Detection recognition and training method, device, equipment and medium for detection recognition network

Publications (2)

Publication Number Publication Date
CN108229303A CN108229303A (en) 2018-06-29
CN108229303B true CN108229303B (en) 2021-05-04

Family

ID=62655785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711126372.9A Active CN108229303B (en) 2017-11-14 2017-11-14 Detection recognition and training method, device, equipment and medium for detection recognition network

Country Status (1)

Country Link
CN (1) CN108229303B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796133B (en) * 2018-08-01 2024-05-24 北京京东尚科信息技术有限公司 Text region identification method and device
CN109325494B (en) 2018-08-27 2021-09-17 腾讯科技(深圳)有限公司 Picture processing method, task data processing method and device
CN109635805B (en) * 2018-12-11 2022-01-11 上海智臻智能网络科技股份有限公司 Image text positioning method and device and image text identification method and device
CN109670458A (en) * 2018-12-21 2019-04-23 北京市商汤科技开发有限公司 A kind of licence plate recognition method and device
CN109858420A (en) * 2019-01-24 2019-06-07 国信电子票据平台信息服务有限公司 A kind of bill processing system and processing method
CN111639639B (en) * 2019-03-01 2023-05-02 杭州海康威视数字技术股份有限公司 Method, device, equipment and storage medium for detecting text area
CN110135427B (en) * 2019-04-11 2021-07-27 北京百度网讯科技有限公司 Method, apparatus, device and medium for recognizing characters in image
CN111985469B (en) * 2019-05-22 2024-03-19 珠海金山办公软件有限公司 Method and device for recognizing characters in image and electronic equipment
CN110458011A (en) * 2019-07-05 2019-11-15 北京百度网讯科技有限公司 Character recognition method and device, computer equipment and readable medium end to end
CN110458164A (en) * 2019-08-07 2019-11-15 深圳市商汤科技有限公司 Image processing method, device, equipment and computer readable storage medium
CN112446262A (en) * 2019-09-02 2021-03-05 深圳中兴网信科技有限公司 Text analysis method, text analysis device, text analysis terminal and computer-readable storage medium
CN110704619B (en) * 2019-09-24 2022-06-10 支付宝(杭州)信息技术有限公司 Text classification method and device and electronic equipment
CN110781925B (en) * 2019-09-29 2023-03-10 支付宝(杭州)信息技术有限公司 Software page classification method and device, electronic equipment and storage medium
CN111259846B (en) * 2020-01-21 2024-04-02 第四范式(北京)技术有限公司 Text positioning method and system and text positioning model training method and system
CN113449559B (en) * 2020-03-26 2023-05-26 顺丰科技有限公司 Table identification method and device, computer equipment and storage medium
CN113762292B (en) * 2020-06-03 2024-02-02 杭州海康威视数字技术股份有限公司 Training data acquisition method and device and model training method and device
CN112101165B (en) * 2020-09-07 2022-07-15 腾讯科技(深圳)有限公司 Interest point identification method and device, computer equipment and storage medium
CN112101477A (en) * 2020-09-23 2020-12-18 创新奇智(西安)科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112308051B (en) * 2020-12-29 2021-10-29 北京易真学思教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
TWI807467B (en) * 2021-11-02 2023-07-01 中國信託商業銀行股份有限公司 Key-item detection model building method, business-oriented key-value identification system and method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8385652B2 (en) * 2010-03-31 2013-02-26 Microsoft Corporation Segmentation of textual lines in an image that include western characters and hieroglyphic characters
US8588529B2 (en) * 2011-08-15 2013-11-19 Vistaprint Schweiz Gmbh Method and system for detecting text in raster images
CN103870516B (en) * 2012-12-18 2019-10-25 北京三星通信技术研究有限公司 Retrieve the method for image, paint in real time reminding method and its device
CN106407971A (en) * 2016-09-14 2017-02-15 北京小米移动软件有限公司 Text recognition method and device
CN106446899A (en) * 2016-09-22 2017-02-22 北京市商汤科技开发有限公司 Text detection method and device and text detection training method and device
CN106778852A (en) * 2016-12-07 2017-05-31 中国科学院信息工程研究所 A kind of picture material recognition methods for correcting erroneous judgement

Also Published As

Publication number Publication date
CN108229303A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108229303B (en) Detection recognition and training method, device, equipment and medium for detection recognition network
CN108549893B (en) End-to-end identification method for scene text with any shape
CN109948507B (en) Method and device for detecting table
EP3816818A2 (en) Method and apparatus for visual question answering, computer device and medium
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
US20210103763A1 (en) Method and apparatus for processing laser radar based sparse depth map, device and medium
US11120305B2 (en) Learning of detection model using loss function
US20200089985A1 (en) Character image processing method and apparatus, device, and storage medium
WO2018010657A1 (en) Structured text detection method and system, and computing device
CN110008956B (en) Invoice key information positioning method, invoice key information positioning device, computer equipment and storage medium
CN111488873B (en) Character level scene text detection method and device based on weak supervision learning
CN112749695A (en) Text recognition method and device
CN109712164A (en) Image intelligent cut-out method, system, equipment and storage medium
CN112508005B (en) Method, apparatus, device and storage medium for processing image
CN108230332B (en) Character image processing method and device, electronic equipment and computer storage medium
CN117636353A (en) Image segmentation method and device to be annotated, electronic equipment and computer readable medium
CN117831042A (en) Remote sensing image target detection and segmentation method, device, equipment and storage medium
Zheng et al. Recognition of expiry data on food packages based on improved DBNet
CN115861809A (en) Rod detection and training method and device for model thereof, electronic equipment and medium
CN113627350B (en) Table detection method, device, equipment and storage medium
CN116030295A (en) Article identification method, apparatus, electronic device and storage medium
CN115564976A (en) Image processing method, apparatus, medium, and device
CN115019321A (en) Text recognition method, text model training method, text recognition device, text model training equipment and storage medium
CN113515920B (en) Method, electronic device and computer readable medium for extracting formulas from tables
CN114419613A (en) Image sample generation method, text recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant