CN112183542A - Text image-based recognition method, device, equipment and medium

Text image-based recognition method, device, equipment and medium

Info

Publication number
CN112183542A
Authority
CN
China
Prior art keywords: coding, feature map, decoding, text image, unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010997733.2A
Other languages
Chinese (zh)
Inventor
王林武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN202010997733.2A
Publication of CN112183542A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/148 - Segmentation of character regions
    • G06V30/153 - Segmentation of character regions using recognition of characters or words
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The application relates to a text image-based recognition method, apparatus, device, and medium. The method comprises the following steps: acquiring a text image to be recognized; inputting the text image into the coding structure of a segmentation model, and sequentially encoding the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image, wherein the coding structure comprises at least one deformable convolution; acquiring the intermediate coding feature maps respectively generated by each coding unit during encoding; decoding the first feature map according to each intermediate coding feature map through the decoding structure of the segmentation model to obtain a corresponding second feature map; and performing pixel-level classification on the second feature map to recognize the text in the text image. The method improves the accuracy of feature-map extraction, effectively preserves complete semantic information, and substantially improves both segmentation accuracy and text recognition efficiency.

Description

Text image-based recognition method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text image-based recognition method, apparatus, device, and medium.
Background
With the development of computer technology, recognition technology based on text images has emerged. Existing text-image recognition is mainly implemented with OCR (Optical Character Recognition) technology.
However, existing OCR technology imposes strict requirements on the shape of the text image, the capture environment, and the capture conditions. For example, when the text is arranged along an arc, the text image is captured under intense light, or the capture distance is too great, text recognition accuracy is insufficient and recognition efficiency is therefore low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text image-based recognition method, apparatus, device, and medium capable of improving text recognition efficiency.
A text image based recognition method, the method comprising:
acquiring a text image to be recognized;
inputting the text image into a coding structure of a segmentation model, and sequentially encoding the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution;
acquiring the intermediate coding feature maps respectively generated by each coding unit during the encoding process;
decoding the first feature map according to each intermediate coding feature map through a decoding structure in the segmentation model to obtain a corresponding second feature map;
and performing pixel-level classification processing according to the second feature map to recognize the text in the text image.
An apparatus for text image based recognition, the apparatus comprising:
an acquisition module, configured to acquire a text image to be recognized;
an encoding module, configured to input the text image into an encoding structure of a segmentation model, and sequentially encode the text image through at least one encoding unit in the encoding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution;
the acquisition module is further configured to acquire the intermediate coding feature maps generated by each coding unit during the encoding process;
a decoding module, configured to decode the first feature map according to each intermediate coding feature map through a decoding structure in the segmentation model to obtain a corresponding second feature map;
and a recognition module, configured to perform pixel-level classification processing according to the second feature map to recognize the text in the text image.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a text image to be recognized;
inputting the text image into a coding structure of a segmentation model, and sequentially encoding the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution;
acquiring the intermediate coding feature maps respectively generated by each coding unit during the encoding process;
decoding the first feature map according to each intermediate coding feature map through a decoding structure in the segmentation model to obtain a corresponding second feature map;
and performing pixel-level classification processing according to the second feature map to recognize the text in the text image.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a text image to be recognized;
inputting the text image into a coding structure of a segmentation model, and sequentially encoding the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution;
acquiring the intermediate coding feature maps respectively generated by each coding unit during the encoding process;
decoding the first feature map according to each intermediate coding feature map through a decoding structure in the segmentation model to obtain a corresponding second feature map;
and performing pixel-level classification processing according to the second feature map to recognize the text in the text image.
According to the above text image-based recognition method, apparatus, device, and medium, the text image is encoded through a coding structure that includes at least one deformable convolution in the segmentation model to obtain the corresponding first feature map. The first feature map is then decoded by the decoding structure of the segmentation model according to the intermediate coding feature maps generated by each coding unit of the coding structure during encoding, to obtain the corresponding second feature map. Pixel-level classification is performed on the second feature map to recognize the text in the text image. Adding deformable convolutions to the coding structure significantly enlarges the convolutional receptive field and matches the shape of the text in the text image more closely, thereby improving the accuracy of feature-map extraction. Moreover, decoding the first feature map with the help of the intermediate coding feature maps generated in the coding structure, that is, fusing the feature maps generated in the decoding structure with the corresponding intermediate coding feature maps, effectively preserves complete semantic information and avoids the problem of semantic information loss, which greatly improves the segmentation precision and hence the text recognition efficiency.
Drawings
FIG. 1 is a diagram of an embodiment of an application environment for a text image based recognition method;
FIG. 2 is a flow diagram of a text image based recognition method in one embodiment;
FIG. 3 is a flow diagram of a text image based recognition method in an exemplary embodiment;
FIG. 4 is a flow diagram illustrating a text image based recognition method in accordance with another exemplary embodiment;
FIG. 5 is a block diagram of a text image based recognition apparatus according to an embodiment;
FIG. 6 is a block diagram showing the structure of a text image-based recognition apparatus according to another embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text image-based recognition method provided by the application can be applied to the application environment shown in fig. 1. The computer device 110 acquires an image containing text on the vehicle 120 through a network, thereby obtaining a text image to be recognized. Of course, in other application scenarios, the vehicle 120 may be replaced by another target object to be recognized, such as a building or a parking space. It is understood that the application environment shown in fig. 1 is only for illustrative purposes and is not intended to limit the specific application scenarios of the method. The computer device may specifically be a terminal or a server. The terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
It is understood that the computer device 110 captures images containing text on the vehicle 120 via the network, resulting in text images to be recognized. The computer device 110 inputs the text image into the coding structure of the segmentation model, and sequentially encodes the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image, wherein the coding structure comprises at least one deformable convolution. The computer device 110 acquires the intermediate coding feature maps generated by the respective coding units during encoding. The computer device 110 then decodes the first feature map according to each intermediate coding feature map through the decoding structure in the segmentation model to obtain a corresponding second feature map. The computer device 110 performs pixel-level classification processing based on the second feature map to recognize the text in the text image.
In one embodiment, as shown in fig. 2, a text image-based recognition method is provided, which is exemplified by the method applied to the computer device 110 in fig. 1, and includes the following steps:
s202, acquiring a text image to be recognized.
The text image is an image including text, and the text may specifically be a character, such as a single letter or a number. In particular, the computer device may obtain the text image to be recognized from a local or other computer device.
In one embodiment, the computer device may capture images of the target environment or target scene in which the text appears by a local image capture device, such as a camera, to obtain captured text images. Or, the computer device receives, through the network, the text image collected and sent by another computer device, which is not limited in this embodiment of the present application.
In one particular embodiment, when a vehicle undergoes its annual inspection, the computer device may obtain a text image associated with the vehicle, such as a text image including the vehicle identification code, via a camera. The computer device determines the corresponding vehicle by performing recognition processing on the text image that includes the vehicle identification code. A Vehicle Identification Number (VIN) generally consists of seventeen letters or digits and is a unique identifier of the vehicle; the VIN can be used to identify the manufacturer, engine, chassis serial number, and other attributes of the vehicle. The vehicle identification number is also called the frame number.
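Because downstream business logic (such as the annual-inspection lookup described later) keys on the recognized VIN, a recognized string can be sanity-checked before use. The following is a small illustrative helper added here for clarity, not part of the patent; it assumes Python and the ISO 3779 convention that VINs are seventeen characters drawn from digits and capital letters excluding I, O and Q.

```python
import re

# Hypothetical post-recognition check, not from the patent: a well-formed VIN
# is 17 characters from A-Z (excluding I, O, Q) and 0-9 per ISO 3779.
def looks_like_vin(s: str) -> bool:
    return re.fullmatch(r"[A-HJ-NPR-Z0-9]{17}", s) is not None

print(looks_like_vin("LFV3A23C8B3000001"))  # True: 17 valid characters
```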
S204, inputting the text image into a coding structure of the segmentation model, and sequentially coding the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution.
Wherein the segmentation model is a machine learning model for performing segmentation tasks. It is understood that the segmentation model treats the region corresponding to each text in the text image as a sub-region and distinguishes these sub-regions so as to recognize each text in the text image.
The segmentation model includes an encoding structure and a decoding structure. The coding structure comprises at least one coding unit, and each coding unit comprises a convolutional network and a pooling network. The pooling may also be implemented as other forms of down-sampling, which is not limited in the embodiments of the present application. Moreover, for each coding unit, the number of convolutional networks is typically at least one, and the number of pooling networks is one.
Correspondingly, the decoding structure comprises at least one decoding unit. Each decoding unit comprises a convolutional network and an up-sampling network. For each decoding unit, the number of convolutional networks is typically at least one, and the number of up-sampling networks is one. In addition, the decoding structure also comprises a classification recognition unit, which performs pixel-level classification processing on the feature map to carry out text recognition.
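To make the unit structure concrete, the following is a minimal sketch of one coding unit and one decoding unit, assuming a PyTorch implementation (the patent names no framework); the class names, channel widths, and the use of max pooling that records indices are illustrative assumptions, chosen to be consistent with the index-based up-sampling described later.

```python
import torch
import torch.nn as nn

class CodingUnit(nn.Module):
    """At least one convolutional network followed by one pooling network."""
    def __init__(self, in_ch, out_ch, num_convs=2):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        # Pooling halves the map; the argmax indices are kept for the decoder.
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        feat = self.convs(x)              # intermediate coding feature map
        pooled, indices = self.pool(feat)
        return pooled, feat, indices

class DecodingUnit(nn.Module):
    """One up-sampling network followed by at least one convolutional network."""
    def __init__(self, in_ch, out_ch, num_convs=2):
        super().__init__()
        # Unpooling expects its input channels to match the recorded indices.
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)

    def forward(self, x, indices):
        return self.convs(self.unpool(x, indices))
```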
Specifically, the computer device inputs the text image to be recognized into the coding structure of the segmentation model; the coding structure comprises at least one coding unit, and the text image is sequentially encoded through the deformable convolutions in the coding units to obtain the first feature map corresponding to the text image. It will be appreciated that the encoding process specifically includes feature extraction and pooling: feature extraction produces the corresponding feature maps, and pooling reduces the size of the feature map obtained within the same coding unit. The first feature map is the output after processing by all coding units in the coding structure.
In an embodiment, the segmentation model may be the SegNet semantic segmentation model, or another semantic segmentation model, which is not limited in the embodiments of the present application. The SegNet model is composed of an encoding structure (encoder) and a decoding structure (decoder). The coding structure, which may also be referred to as the encoder, uses the VGG16 network to parse the target information. The decoding structure, which may also be referred to as the decoder, represents the parsed information in the image in a differentiated manner, i.e., each pixel is represented by the color or label of its corresponding target information.
In one embodiment, the number of deformable convolutions in the coding structure is not necessarily related to the number of coding units. That is, in the coding structure, a deformable convolution may be introduced into every coding unit, or only into some of the coding units. It is understood that deformable convolution introduces additional model parameters for learning the pixel offsets; to balance the amount of network parameters against accuracy, the convolutional layers in only some of the coding units may be replaced with deformable convolutions. For example, the computer device may introduce deformable convolutions in the third and fourth coding units while keeping conventional convolutions in the first and second coding units, i.e., replace the convolutions in the CONV3 and CONV4 modules of the VGG16 model with deformable convolutions. Of course, other modules may be replaced instead, which is not limited in the embodiments of the present application.
For example, consider a convolution with a 3 × 3 kernel in a CNN (Convolutional Neural Network). A conventional convolution slides a window over the input feature map of each coding unit and, within the fixed-shape (3 × 3) sliding window, multiplies each kernel weight by the corresponding input pixel value and sums the products to obtain the convolution output value, which can also be understood as an output pixel value of the output feature map. For a pixel point $p_0$ on the output feature map of each coding unit, the calculation is

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$$

where $w(p_n)$ denotes the convolution kernel weight, $x(p_0 + p_n)$ denotes the input pixel value at the sampled position within the sliding window, $p_n$ is the position offset within the sliding window (enumerated over the regular grid $R$), and $y(p_0)$ is the output pixel value at $p_0$.
It can be understood that the deformable convolution makes a small modification to the conventional convolution: when traversing the pixels in the sliding window, a learnable offset $\Delta p_n$ is added, so that each sampled pixel is taken at the sum of its regular position within the sliding window and the offset:

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

That is, the deformable convolution samples pixels over regions of different shapes or extents according to the size and shape of the text image, and therefore has a larger receptive field. The offset $\Delta p_n$ corresponding to each sampling position is not necessarily the same and is obtained through model training.
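As a sketch of how such a layer can be realized, the snippet below uses torchvision's DeformConv2d, where a companion convolution predicts one (Δx, Δy) offset pair per kernel position; initializing the offset branch to zero, so the layer starts as an ordinary convolution, is a common practice and an assumption here, not a detail from the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # 2 * k * k offset channels: one (dx, dy) pair per kernel position.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)  # start as a conventional convolution
        nn.init.zeros_(self.offset.bias)
        self.conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        # y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n): the sampling grid is
        # shifted by the learned offsets before the weighted sum.
        return self.conv(x, self.offset(x))

feat = torch.randn(1, 64, 32, 128)
out = DeformableConvBlock(64, 128)(feat)  # shape: (1, 128, 32, 128)
```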
In an embodiment, step S204, namely inputting the text image into the coding structure of the segmentation model and sequentially encoding the text image through at least one coding unit in the coding structure to obtain the first feature map corresponding to the text image, specifically includes: inputting the text image into the first coding unit in the coding structure of the segmentation model, and encoding the text image through the convolutional network and the pooling network in the first coding unit to obtain the intermediate coding feature map output by the first coding unit; for each coding unit located after the first coding unit in the coding structure, determining the first input data corresponding to the current coding unit, where the first input data is the intermediate coding feature map generated by the previous coding unit during encoding; transmitting the first input data to the current coding unit, and encoding the first input data through the convolutional network and the pooling network in the current coding unit to obtain the intermediate coding feature map output by the current coding unit; and taking the intermediate coding feature map output by the current coding unit as the first input data of the next coding unit, returning to the step of transmitting the first input data to the current coding unit, and continuing until a first stop condition is met, whereupon the intermediate coding feature map output by the last coding unit of the coding structure is taken as the first feature map corresponding to the text image.
The first stop condition is a condition for stopping the data transmission; it may specifically be that the intermediate coding feature map output by the last coding unit in the coding structure has been obtained, or that the size of the intermediate coding feature map output by the last coding unit reaches a preset size. For example, when the segmentation model includes four coding units, the first stop condition may be that the intermediate coding feature map size reaches 1/16 of the size of the text image to be recognized.
The first input data is input data corresponding to other coding units except the first coding unit in the coding structure of the segmentation model, and specifically may be an intermediate coding feature map generated in the coding process of the previous coding unit, or may be understood as an input feature map. The intermediate coding feature map is a feature map corresponding to each coding unit in the coding structure of the segmentation model, and specifically may be a feature map generated by each coding unit in the process of coding processing.
Specifically, the computer device inputs the text image to be recognized to a first coding unit in a coding structure of the segmentation model, and the text image is coded through a convolutional network and a pooling network in the first coding unit to obtain an intermediate coding feature map output by the first coding unit. For the coding unit behind the first coding unit in the coding structure, the computer device transmits the intermediate coding characteristic graph output by the first coding unit to the second coding unit and uses the intermediate coding characteristic graph as first input data of the second coding unit, and the intermediate coding characteristic graph output by the first coding unit is respectively coded and processed through a convolutional network and a pooling network in the second coding unit to obtain the intermediate coding characteristic graph output by the second coding unit.
Further, the computer device transmits the intermediate coding feature map output by the second coding unit to a third coding unit, and the intermediate coding feature map is used as an input feature map of the third coding unit, and the intermediate coding feature map output by the second coding unit is respectively coded by a convolutional network and a pooling network in the third coding unit to obtain the intermediate coding feature map output by the third coding unit. And analogizing in turn until a first stop condition is met, and taking the intermediate coding feature map output by the last coding unit of the coding structure as a first feature map corresponding to the text image by the computer equipment.
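The sequential pass just described reduces to a simple loop. Below is a hedged sketch reusing the hypothetical CodingUnit from earlier; collecting the intermediate feature maps and pooling indices anticipates the decoding steps that follow.

```python
def encode(coding_units, image):
    # image: the text image to be recognized; coding_units: ordered units
    x = image
    intermediate_maps, pool_indices = [], []
    for unit in coding_units:
        # the first input data of each unit is the previous unit's output
        x, feat, idx = unit(x)
        intermediate_maps.append(feat)   # kept for decoder-side fusion
        pool_indices.append(idx)         # kept for index-based up-sampling
    first_feature_map = x                # output of the last coding unit
    return first_feature_map, intermediate_maps, pool_indices
```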
In one embodiment, at least one deformable convolution is included in the coding structure. That is to say, the convolution network in each coding unit of the coding structure may be composed of a deformable convolution, or may be composed of a deformable convolution and a conventional convolution together, which is not limited in the embodiment of the present application.
It will be appreciated that the deformable convolution may extract feature maps of different extents, shapes or orientations than conventional convolution, and thus the feature maps extracted by the deformable convolution may be more accurate and may be more adaptive to the target.
In the above embodiment, the computer device performs encoding processing on the text image by using a deformable convolution in the encoding structure of the segmentation model, so as to obtain the corresponding first feature map. By adding the deformable convolution into the coding structure, the convolution receptive field can be obviously increased, and the text shape in the text image can be more accurately matched, so that the accuracy of extracting the feature map is improved. Moreover, by introducing the deformable convolution into part of the coding units, the network parameter quantity and accuracy can be balanced, and the text recognition efficiency is greatly improved.
S206, acquiring the intermediate coding feature maps generated by each coding unit during the encoding process.
Specifically, each coding unit of the segmentation model generates a corresponding intermediate coding feature map in the respective coding process, and the computer device acquires the respective intermediate coding feature maps for performing decoding processing on the first feature map output by the coding structure through the decoding structure.
S208, decoding the first feature map according to each intermediate coding feature map through the decoding structure in the segmentation model to obtain a corresponding second feature map.
Specifically, the decoding structure in the segmentation model comprises at least one decoding unit, and the computer device performs decoding processing on the first feature map output by the coding structure through each decoding unit and based on each intermediate coding feature map, so as to obtain a corresponding second feature map. It is understood that the decoding process refers to semantic information restoration based on the first feature map, and classifies each pixel based on the feature map output by the last decoding unit.
In one embodiment, the decoding structure comprises at least one first decoding unit and at least one second decoding unit. Step S208, namely decoding the first feature map according to each intermediate coding feature map through the decoding structure in the segmentation model to obtain a corresponding second feature map, specifically includes: inputting the first feature map into the initial decoding unit in the decoding structure of the segmentation model, and performing up-sampling processing on the first feature map through that decoding unit to obtain the corresponding intermediate decoding feature map; for a decoding unit located after the initial decoding unit in the decoding structure, when the current decoding unit is a first decoding unit, determining the second input data corresponding to the first decoding unit, where the second input data is obtained by fusing the intermediate decoding feature map output by the previous decoding unit with the intermediate coding feature map of the same resolution in the coding structure; performing up-sampling processing on the second input data through the first decoding unit to obtain the corresponding intermediate decoding feature map, which either serves directly as the input data of the next decoding unit or generates the input data of the next decoding unit through fusion; when the current decoding unit is a second decoding unit, determining the third input data corresponding to the second decoding unit, where the third input data is the intermediate decoding feature map output by the previous decoding unit; performing up-sampling processing on the third input data through the second decoding unit to obtain the corresponding intermediate decoding feature map, which likewise either serves directly as the input data of the next decoding unit or generates the input data of the next decoding unit through fusion; and stopping when a second stop condition is met, taking the intermediate decoding feature map output by the last decoding unit of the decoding structure as the second feature map corresponding to the text image.
The decoding units include first decoding units and second decoding units. It can be understood that the network structures of the first decoding unit and the second decoding unit are the same: both decode their input data through an up-sampling network followed by a convolutional network. The difference is that their input data differ.
It can be understood that the first decoding unit corresponds to the second input data, which may specifically be the feature map obtained by fusing the intermediate decoding feature map output by the previous decoding unit with the intermediate coding feature map of the same resolution in the coding structure. The second decoding unit corresponds to the third input data, which may specifically be the intermediate decoding feature map output by the previous decoding unit. The intermediate coding feature map is the feature map corresponding to each coding unit in the coding structure of the segmentation model, i.e., the feature map generated by each coding unit during encoding.
The feature map obtained by fusing the intermediate decoding feature map output by the previous decoding unit with the intermediate coding feature map of the same resolution in the coding structure may also be referred to as an intermediate fusion feature map. It can be understood that the fusion specifically stacks the intermediate coding feature map from the coding structure with the intermediate decoding feature map from the decoding structure; the fusion does not change the spatial size of the feature maps, i.e., the intermediate fusion feature map has the same spatial size as the intermediate coding feature map and the intermediate decoding feature map before fusion.
Specifically, the computer device inputs the first feature map into the initial decoding unit in the decoding structure of the segmentation model, and performs up-sampling processing on the first feature map through that unit to obtain the corresponding intermediate decoding feature map. The computer device then determines the input data of each subsequent decoding unit according to that unit's class.
It is understood that when the current decoding unit is a first decoding unit, the computer device determines the second input data corresponding to the first decoding unit, that is, it takes the fused intermediate fusion feature map as the input data of the first decoding unit. The computer device performs up-sampling processing on the second input data through the first decoding unit to obtain the corresponding intermediate decoding feature map. The computer device then determines the input data of the next decoding unit according to that unit's class: when the decoding unit following the current first decoding unit is a first decoding unit, the intermediate decoding feature map is fused to generate the input data of the next decoding unit; when it is a second decoding unit, the intermediate decoding feature map is used directly as the input data of the next decoding unit.
Similarly, when the current decoding unit is a second decoding unit, the computer device determines the third input data corresponding to the second decoding unit, that is, it takes the intermediate decoding feature map output by the previous decoding unit as the input data of the second decoding unit. The computer device performs up-sampling processing on the third input data through the second decoding unit to obtain the corresponding intermediate decoding feature map, and determines the input data of the next decoding unit according to that unit's class in the same way as above.
Further, when a second stop condition is met, the computer device stops and takes the intermediate decoding feature map output by the last decoding unit of the decoding structure as the second feature map corresponding to the text image. The second stop condition is a condition for stopping the data transmission; it may specifically be that the intermediate decoding feature map output by the last decoding unit in the decoding structure has been obtained, or that the size of that intermediate decoding feature map reaches a preset size. For example, when the segmentation model includes four decoding units, the second stop condition may be that the intermediate decoding feature map size is the same as the size of the text image to be recognized.
In one embodiment, the computer device may concatenate, by channel, some of the intermediate coding feature maps output in the coding structure with the intermediate decoding feature maps output by some decoding units in the decoding structure to achieve feature fusion. For example, when the segmentation model includes 4 coding units and 4 decoding units, the computer device may merge the first decoding unit with the fourth coding unit by channel and merge the second decoding unit with the third coding unit by channel. That is, the intermediate decoding feature map output by the first decoding unit is fused with the intermediate coding feature map of the same resolution output by the corresponding coding unit to serve as the input data of the second decoding unit, and the intermediate decoding feature map output by the second decoding unit is fused with the intermediate coding feature map of the same resolution output by the corresponding coding unit to serve as the input data of the third decoding unit.
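A sketch of this decoding pass, for the 4-unit example above (fusion feeding the second and third decoding units), is shown below. The 1 × 1 "merge" convolutions are an added assumption: they restore the channel count after concatenation so that the recorded pooling indices still match the unpooling input; the patent itself only specifies channel-wise concatenation.

```python
import torch

def decode(dec_units, merge_convs, first_map, enc_maps, pool_indices):
    # enc_maps / pool_indices are ordered by coding unit (index 0 = first unit)
    x = dec_units[0](first_map, pool_indices[3])       # 1/16 -> 1/8
    for i, unit in enumerate(dec_units[1:], start=1):
        if i in (1, 2):  # second and third decoding units receive fused input
            skip = enc_maps[4 - i]                     # same-resolution coding map
            x = merge_convs[i - 1](torch.cat([x, skip], dim=1))  # 1x1 merge conv
        x = unit(x, pool_indices[3 - i])               # unpool with stored indices
    return x  # second feature map, restored to the input image size
```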
In the above embodiment, the computer device decodes the first feature map output by the coding structure through the first and second decoding units, based on each intermediate coding feature map, to obtain the corresponding second feature map. By fusing the feature maps of the coding structure with the feature maps of the same resolution in the decoding structure, the problem of semantic information loss can be solved; that is, the semantic information in the coding structure is preserved and no information is lost. Moreover, by performing feature fusion only in some decoding units, i.e., feeding the fused feature map only to the first decoding units, the amount of network parameters and the accuracy can be balanced, thereby greatly improving the efficiency of text recognition.
In one embodiment, the decoding units include at least one first decoding unit, while the number of second decoding units is not limited; that is, the decoding structure may include at least one first decoding unit and no second decoding unit at all. In that case, the computer device inputs the first feature map into the initial decoding unit in the decoding structure of the segmentation model and performs up-sampling processing on it to obtain the corresponding intermediate decoding feature map, which is then used, through fusion, to generate the input data of the next decoding unit. For each first decoding unit located after the initial decoding unit in the decoding structure, the computer device determines the second input data corresponding to that first decoding unit: the intermediate decoding feature map is fused with the intermediate coding feature map of the same resolution in the coding structure to obtain the corresponding intermediate fusion feature map, which is taken as the input data of the first decoding unit.
Further, when a third stop condition is satisfied, the computer device stops, and takes the intermediate decoding feature map output by the last first decoding unit of the decoding structure as a second feature map corresponding to the text image. The third stop condition is a condition for stopping data transmission, and specifically may be that an intermediate decoding feature map output by a last first decoding unit in the decoding structure is obtained, or a size of the intermediate decoding feature map output by the last first decoding unit in the decoding structure reaches a preset size. It is to be understood that, for example, when four decoding units are included in the segmentation model, the third stop condition may be that the intermediate decoding feature map size is the same as the size of the text image to be recognized.
In one embodiment, the computer device may concatenate, by channel, each intermediate coding feature map output in the coding structure with an intermediate decoding feature map output by a decoding unit in the decoding structure to achieve feature fusion. For example, when the segmentation model includes 4 coding units and 4 decoding units, the computer device may merge the first decoding unit with the fourth coding unit, the second decoding unit with the third coding unit, the third decoding unit with the second coding unit, and the fourth decoding unit with the first coding unit, each by channel.
S210, performing pixel-level classification processing according to the second feature map to recognize the text in the text image.
Specifically, the computer device obtains the second feature map output by the decoding structure, whose size is consistent with that of the text image to be recognized, and classifies each pixel based on the second feature map, thereby recognizing the text in the text image.
In one embodiment, the decoding structure further comprises a classification recognition unit. After the computer device acquires the second feature map output by the decoding structure, which has the same size as the text image to be recognized, each pixel can be classified through the K-class softmax classifier in the classification recognition unit, thereby recognizing the text in the text image. Of course, other classifiers are also possible, which is not limited in the embodiments of the present application.
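A minimal sketch of this pixel-level classification step follows; the 1 × 1 convolution producing the class scores, the channel width of the second feature map, and the class count K are illustrative assumptions standing in for the patent's K-class softmax classifier.

```python
import torch
import torch.nn as nn

K = 38  # e.g. digits 0-9, letters, background -- an illustrative choice
classifier = nn.Conv2d(64, K, kernel_size=1)       # per-pixel class scores

second_feature_map = torch.randn(1, 64, 128, 512)  # same size as the input image
scores = classifier(second_feature_map)            # (1, K, 128, 512)
probs = torch.softmax(scores, dim=1)               # per-pixel class probabilities
label_map = probs.argmax(dim=1)                    # (1, 128, 512) pixel labels
```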
In one embodiment, the text image in the text image based recognition method includes a text image related to a vehicle, and the text in the text image comprises a vehicle identification code. The text image-based recognition method further comprises a step of searching for vehicle annual inspection information, which specifically includes: when the vehicle identification code in the text image related to the vehicle is recognized, searching for the corresponding annual inspection information of the vehicle according to the vehicle identification code; and executing corresponding business processing based on the annual inspection information of the vehicle.
It is understood that, when the text image is an image including a vehicle identification code, the computer device classifies the pixels of the second feature map corresponding to that image through the K-class softmax classifier, thereby recognizing the vehicle identification code (VIN) in the image. The computer device can determine the corresponding vehicle through the VIN code and thus find the corresponding annual inspection information of the vehicle. Of course, the text image related to the vehicle may also be an image including the license plate number or the vehicle name plate, which is not limited in the embodiments of the present application.
Further, the computer device executes corresponding business processing for the vehicle according to the retrieved annual inspection information, for example, verifying the validity of the annual inspection information or updating it. It can be understood that when the annual inspection information is close to expiry or has expired, corresponding prompt information can be sent to the owner of the vehicle; alternatively, when new annual inspection information related to the vehicle is detected, it can be entered accordingly, which is not limited in the embodiments of the present application.
In the above embodiment, when the computer device recognizes the vehicle identification code in the text image related to the vehicle, it can search for the corresponding vehicle annual inspection information according to the vehicle identification code and execute corresponding business processing based on that information. In this way, the annual inspection information corresponding to the vehicle can be found quickly and accurately and subsequently processed, which greatly improves the efficiency of business processing.
According to the above text image-based recognition method, the text image is encoded through a coding structure that includes at least one deformable convolution in the segmentation model to obtain the corresponding first feature map. The first feature map is then decoded by the decoding structure of the segmentation model according to the intermediate coding feature maps generated by each coding unit of the coding structure during encoding, to obtain the corresponding second feature map. Pixel-level classification is performed on the second feature map to recognize the text in the text image. Adding deformable convolutions to the coding structure significantly enlarges the convolutional receptive field and matches the shape of the text in the text image more closely, thereby improving the accuracy of feature-map extraction. Moreover, decoding the first feature map with the help of the intermediate coding feature maps generated in the coding structure, that is, fusing the feature maps generated in the decoding structure with the corresponding intermediate coding feature maps, effectively preserves complete semantic information and avoids the problem of semantic information loss, which greatly improves the segmentation precision and hence the text recognition efficiency.
In one embodiment, the step of inputting the first feature map into the initial decoding unit in the decoding structure of the segmentation model and performing up-sampling processing on the first feature map through that decoding unit to obtain the corresponding intermediate decoding feature map specifically includes: acquiring an index feature map of the same size as the first feature map, and transmitting the first feature map together with the corresponding index feature map to the decoding unit; traversing the index feature map through the decoding unit, and acquiring from it the maximum-value indices corresponding to the pixels of the first feature map; acquiring a preset feature map whose size is larger than that of the first feature map; and assigning to the pixels at the positions indicated by the maximum-value indices in the preset feature map the pixel values of the first feature map, while assigning zero to all other, unassigned positions in the preset feature map, to obtain the corresponding intermediate decoding feature map.
Specifically, the computer device acquires an index feature map having the same size as the first feature map and transmits the first feature map, together with the corresponding index feature map, to the decoding unit. The computer device traverses the index feature map through the decoding unit and acquires from it the maximum-value indices corresponding to the pixels of the first feature map.
It is understood that each pixel in the first feature map has a corresponding maximum-value index. An index feature map is a feature map that stores these index values. Since the first feature map is the intermediate coding feature map output by the last coding unit, each intermediate coding feature map likewise has a corresponding index feature map.
Further, the computer device acquires a preset feature map whose size is larger than that of the first feature map, assigns to the pixels at the positions indicated by the maximum-value indices the pixel values of the first feature map, and assigns zero to all other, unassigned positions; the assigned preset feature map is the corresponding intermediate decoding feature map.
It is to be understood that, when the input data of the decoding unit is the first feature map, the preset feature map may be twice the size of the first feature map in each dimension; for example, if the first feature map is 2 × 2, the preset feature map is 4 × 4. The computer device assigns the pixel values of the input first feature map to the positions given by the maximum-value indices in the preset feature map, and assigns zero to the remaining positions, thereby obtaining the corresponding intermediate decoding feature map.
It is understood that the analysis above takes the up-sampling process in the initial decoding unit as an example; for every decoding unit in the decoding structure, the up-sampling process is implemented by the steps of the above embodiment, the only difference being that the input data of each decoding unit differ.
In one embodiment, for any decoding unit, the computer device acquires an index feature map of the same size as that unit's input feature map and transmits the input feature map, together with the corresponding index feature map, to the decoding unit. The computer device traverses the index feature map through the current decoding unit and acquires from it the maximum-value indices corresponding to the pixels of the input feature map. The computer device then acquires a preset feature map whose size is larger than that of the input feature map, assigns to the positions indicated by the maximum-value indices the pixel values of the input feature map, and assigns zero to all other, unassigned positions, obtaining the corresponding intermediate decoding feature map. The input data differ according to the type of decoding unit: for a first decoding unit, the input data is the intermediate fusion feature map obtained by fusing the intermediate decoding feature map output by the previous decoding unit with the intermediate coding feature map of the same resolution in the coding structure; for a second decoding unit, the input data is the intermediate decoding feature map output by the previous decoding unit.
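The index-based up-sampling above is exactly what max-unpooling provides. A small runnable demonstration, under the same PyTorch assumption as the earlier sketches: pooling records the argmax position of each window, and unpooling writes each pooled value back to its recorded position while filling every other position with zero.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

feat = torch.tensor([[[[1., 2., 5., 6.],
                       [3., 4., 7., 8.],
                       [1., 0., 2., 0.],
                       [0., 9., 0., 3.]]]])
pooled, indices = pool(feat)        # 2x2 maxima and their positions (the index map)
restored = unpool(pooled, indices)  # 4x4: maxima at indexed positions, zeros elsewhere
```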
In the above embodiment, the computer device decodes the first feature map together with the corresponding index feature map through the decoding unit to obtain the corresponding intermediate decoding feature map. In this way, the up-sampling processing is realized, i.e., the feature maps reduced during encoding in the coding structure are restored, enabling the semantic segmentation.
In one embodiment, the segmentation model in the text image-based recognition method is trained by the following steps: acquiring training data, the training data comprising sample text images and the label texts corresponding to the sample text images; inputting a sample text image into the coding structure of the segmentation model to be trained, and sequentially encoding the sample text image through at least one coding unit in the coding structure to obtain the first sample feature map corresponding to the sample text image, wherein the coding structure comprises at least one deformable convolution; acquiring the intermediate sample coding feature maps respectively generated by each coding unit during encoding; decoding the first sample feature map according to each intermediate sample coding feature map through the decoding structure in the segmentation model to be trained to obtain the corresponding second sample feature map; performing pixel-level classification processing according to the second sample feature map to recognize the sample text in the sample text image; and adjusting the network parameters of the segmentation model to be trained based on the difference between the sample text and the label text and continuing training until a training stop condition is met.
Wherein the training data is data for training the text recognition model. The sample text image is an image that is used to train the text recognition model and that contains text.
Specifically, the computer device acquires training data comprising sample text images and the label texts corresponding to the sample text images. A sample text image is encoded through the coding structure, which includes at least one deformable convolution, of the segmentation model to be trained, to obtain the corresponding first sample feature map. The first sample feature map is decoded through the decoding structure of the segmentation model to be trained, according to the intermediate sample coding feature maps generated by each coding unit of the coding structure during encoding, to obtain the corresponding second sample feature map. Pixel-level classification is performed on the second sample feature map to recognize the sample text in the sample text image. The computer device adjusts the network parameters of the segmentation model to be trained based on the difference between the sample text and the label text and continues training until the training stop condition is met. The training stop condition is a condition for stopping model training; it may specifically be reaching a preset number of iterations, or the trained segmentation model reaching a preset performance index.
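A hedged sketch of one such training step is given below: per-pixel cross-entropy between the model's class scores and the label map, followed by a gradient update. The loss choice and optimizer are assumptions; the patent only specifies adjusting the network parameters based on the difference between the sample text and the label text.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, sample_images, label_maps):
    # sample_images: (N, 3, 128, 512); label_maps: (N, 128, 512) integer class ids
    optimizer.zero_grad()
    scores = model(sample_images)    # (N, K, 128, 512) per-pixel class scores
    loss = nn.functional.cross_entropy(scores, label_maps)
    loss.backward()                  # difference drives the parameter adjustment
    optimizer.step()
    return loss.item()
```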
In one embodiment, the step of acquiring training data specifically includes: acquiring a first sample text image; performing image transformation or text transformation on the first sample text image to obtain a corresponding second sample text image; taking the first sample text image and the second sample text image together as the sample text images; and labeling the pixels of the regions corresponding to the texts in the sample text images as the label texts, the sample text images and their corresponding label texts together serving as the training data.
Wherein the sample text image includes a first sample text image and a second sample text image. The first sample text image is a sample text image directly acquired by the computer equipment, and the second sample text image is an image obtained by the computer equipment after image transformation or text transformation is carried out on the acquired first sample text image.
It is understood that the image transformation includes a change in shape of the image, and specifically, the first sample image may be transformed by randomly cropping, flipping, or rotating by a certain angle. The text transformation means to transform the number or shape of texts in the image, and specifically may be to increase or decrease the number of texts in the first sample text image, or modify the arrangement form of texts in the first sample text image.
In one embodiment, a computer device may obtain a first sample text image locally or from another computer device. The computer device performs transformations such as random cropping, flipping, or rotation by a certain angle on the first sample text image to obtain an image-transformed second sample text image. It is to be understood that the second sample text image may be a partial crop of the first sample text image, the horizontally flipped first sample text image, or the rotated first sample text image, which is not limited in the embodiments of the present application.
In one embodiment, a computer device may obtain a first sample text image locally or from another computer device. The computer device then obtains a text-transformed second sample text image by modifying part of the text in the first sample text image, repeating text in the first sample text image, or rearranging the text in the first sample text image from a straight line into an arc, and so on, which is not limited in the embodiments of the present application.
In one embodiment, when the first sample text image is an original VIN image of a vehicle, the computer device may modify some characters in the original VIN image, split the single-line VIN code in the original VIN image into a double-line VIN code, or deform the straight-line VIN code in the original VIN image into an arc-shaped VIN code, thereby obtaining a text-transformed second sample text image, which is not limited in the embodiments of the present application.
In one embodiment, the computer device obtains second sample text images through image transformation and text transformation of the first sample text image, and the first and second sample text images together serve as the sample text images. The computer device labels the pixels of the region corresponding to each text in a sample text image as the label text, and labels the regions outside the text regions as the background class. The sample text images and their corresponding label texts together serve as the training data.
In the above embodiment, the computer device performs image transformation or text transformation on the first sample text image to obtain second sample text images. This data augmentation greatly expands the amount of training data.
In one embodiment, the computer device may divide the augmented training data into a training set and a validation set, and uniformly scale each sample text image to a preset standard size, such as 512 x 128. The segmentation model is then trained on the sample text images scaled to the preset standard size and their corresponding label texts.
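A hedged sketch of this preparation step follows; the 9:1 split ratio is an illustrative assumption, while the 512 x 128 target size comes from the embodiment above. Nearest-neighbour resampling is used for the label masks so that class ids are not blended.

```python
import random
from PIL import Image

def prepare(samples, size=(512, 128), train_ratio=0.9):
    """Split (image, mask) pairs and scale them to the preset standard size."""
    random.shuffle(samples)
    cut = int(len(samples) * train_ratio)
    train, val = samples[:cut], samples[cut:]

    def scale(pairs):
        return [(img.resize(size, Image.BILINEAR),
                 mask.resize(size, Image.NEAREST))  # nearest keeps label ids intact
                for img, mask in pairs]

    return scale(train), scale(val)
```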
In the above embodiment, the computer device inputs the sample text image into the segmentation model to be trained, obtains the corresponding second sample feature map through the processing of the coding structure and the decoding structure of the segmentation model to be trained, and identifies the sample text in the sample text image. Based on the difference between the sample text and the label text, the computer device adjusts the network parameters of the segmentation model to be trained and continues training. In this way, the coding structure and the decoding structure can be trained in a targeted manner, which improves their processing efficiency. The computer device thus obtains a trained segmentation model, which improves text recognition efficiency.
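A minimal PyTorch sketch of one training epoch under these assumptions: the loss measuring "the difference between the sample text and the label text" is taken to be pixel-wise cross-entropy, and the model, data loader, and optimizer are placeholders rather than components specified by this application.

```python
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cuda"):
    criterion = nn.CrossEntropyLoss()  # difference between sample text and label text
    model.train()
    for images, masks in loader:  # images: (N,3,128,512), masks: (N,128,512)
        images, masks = images.to(device), masks.to(device)
        logits = model(images)           # (N, num_classes, 128, 512)
        loss = criterion(logits, masks)  # pixel-level classification loss
        optimizer.zero_grad()
        loss.backward()                  # adjust network parameters
        optimizer.step()
```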
In a specific embodiment, referring to fig. 3, the text image-based recognition method specifically includes the following steps: text recognition is performed on an image including a vehicle VIN code (such as the input image in fig. 3) by using a SegNet-based semantic segmentation algorithm. Since the VIN code is generally strip-shaped or arc-shaped, the SegNet network uniformly scales the input picture to 512 x 128 to match the aspect ratio of the original VIN-code picture and avoid losing information.
The SegNet network structure is composed of an encoding structure and a decoding structure. The encoding structure extracts a feature map from the input image including the vehicle VIN code through the base network VGG16. The coding structure includes 4 groups of convolutional layers and 4 pooling layers (kernel size 2 x 2, stride 2, such as pool1, pool2, pool3 and pool4 in fig. 3), and each group of convolutional layers together with one pooling layer constitutes one coding unit 301. Part of the convolutional layers in the VGG16 network are replaced by deformable convolutions 302; that is, the convolutions in the conv3 (conv3_1, conv3_2 and conv3_3) module and the conv4 (conv4_1, conv4_2 and conv4_3) module of VGG16 are replaced by deformable convolutions, denoted def.conv3 (def.conv3_1, def.conv3_2 and def.conv3_3) and def.conv4 (def.conv4_1, def.conv4_2 and def.conv4_3). The image including the vehicle VIN code is sequentially encoded by each coding unit in the coding structure to obtain the corresponding first feature map, whose feature size is 1/16 of the size of the originally input VIN-code image.
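A hedged sketch of one such coding unit, pairing a deformable convolution (torchvision's DeformConv2d, with a plain convolution predicting the sampling offsets) with a 2 x 2 stride-2 max pooling that also returns the max-value indices needed later by the decoding structure. The channel counts and the single-convolution-per-unit simplification are assumptions; the actual modules stack three convolutions as in VGG16.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CodingUnit(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # A plain conv predicts the 2D sampling offsets (2 per kernel tap).
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # return_indices=True keeps the max-value index for index-based upsampling.
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)

    def forward(self, x):
        feat = self.relu(self.deform(x, self.offset(x)))
        pooled, indices = self.pool(feat)
        return pooled, indices  # intermediate coding feature map + pool indices
```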
Correspondingly, the decoding structure includes 4 upsampling layers (such as upsample1, upsample2, upsample3 and upsample4 in fig. 3) and 4 groups of convolutional layers, corresponding one-to-one to the 4 pooling layers of the coding structure, and each group of convolutional layers together with one upsampling layer constitutes a decoding unit 303. Upsampling is performed according to the max-value indices of the pooling layers, so that the size of the second feature map output by the decoding structure is consistent with the size of the originally input image including the vehicle VIN code. The intermediate coding feature maps output by the def.conv3 (def.conv3_1, def.conv3_2 and def.conv3_3) and def.conv4 (def.conv4_1, def.conv4_2 and def.conv4_3) modules in the encoding structure are feature-fused with the corresponding conv3_D (conv3_3_D, conv3_2_D and conv3_1_D) and conv4_D (conv4_3_D, conv4_2_D and conv4_1_D) modules of the decoding structure; that is, the intermediate coding feature maps extracted in the encoding structure and the intermediate decoding feature maps of the same resolution obtained in the decoding structure are concatenated channel by channel to obtain the corresponding intermediate fused feature maps. In this way, semantic information in the coding structure is well preserved and no information is lost.
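A matching sketch of one decoding unit: index-based upsampling via MaxUnpool2d, optional channel-wise concatenation with the same-resolution intermediate coding feature map, then a convolution. Constructing a unit with skip_ch=0 and passing encoder_feat=None models the decoding units that do not fuse; the class itself is an assumption consistent with the structure described above.

```python
import torch
import torch.nn as nn

class DecodingUnit(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, indices, encoder_feat=None):
        up = self.unpool(x, indices)      # upsample by the stored max-value indices
        if encoder_feat is not None:      # splice by channel at the same resolution
            up = torch.cat([up, encoder_feat], dim=1)
        return self.relu(self.conv(up))   # intermediate (fused) feature map
```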
For example, let the size of the text image to be recognized that is input into the segmentation model be S, and let the coding structure comprise 4 coding units. The sizes of the intermediate coding feature maps output by the 1st to 4th coding units are then S/2, S/4, S/8 and S/16 in sequence, where the intermediate coding feature map of size S/16 output by the 4th coding unit is also the first feature map. The decoding structure comprises 4 decoding units. In the first step, the first feature map is taken as the input data of the 1st decoding unit, and the corresponding output intermediate decoding feature map has size S/8; at this time, the intermediate coding feature map of size S/8 in the coding structure is fused with this intermediate decoding feature map to obtain an intermediate fused feature map of size S/8. In the second step, the intermediate fused feature map of size S/8 is taken as the input data of the 2nd decoding unit, and the corresponding output intermediate decoding feature map has size S/4; at this time, the intermediate coding feature map of size S/4 in the coding structure is fused with this intermediate decoding feature map to obtain an intermediate fused feature map of size S/4. In the third step, the intermediate fused feature map of size S/4 is taken as the input data of the 3rd decoding unit, and the corresponding output intermediate decoding feature map has size S/2. In the fourth step, the feature map of size S/2 is taken as the input data of the 4th decoding unit, and the corresponding output intermediate decoding feature map has size S. The intermediate decoding feature map of size S output by the 4th decoding unit is the second feature map.
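A small check of the sizes above, under the assumption that each of the four 2 x 2 stride-2 pooling layers halves both spatial dimensions (shown for the 512 x 128 input of the embodiment above):

```python
def encoder_sizes(h, w, units=4):
    """Sizes of the intermediate coding feature maps after each coding unit."""
    sizes = []
    for _ in range(units):
        h, w = h // 2, w // 2
        sizes.append((h, w))
    return sizes

print(encoder_sizes(128, 512))
# [(64, 256), (32, 128), (16, 64), (8, 32)] -> the first feature map is 1/16 of S
```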
Further, the computer device classifies each pixel based on the second feature map, so as to identify the vehicle VIN code in the originally input image including the vehicle VIN code.
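A minimal sketch of this pixel-level classification, assuming the second feature map (after the classifier head) carries one channel of logits per class; the class id for the VIN code is an illustrative assumption:

```python
import torch

def classify_pixels(second_feature_map: torch.Tensor) -> torch.Tensor:
    # second_feature_map: (N, num_classes, H, W), same H x W as the input image
    return second_feature_map.argmax(dim=1)  # (N, H, W): one class id per pixel

# pred = classify_pixels(logits); vin_mask = (pred == VIN_CLASS_ID)
```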
In a specific embodiment, referring to fig. 4, the text image-based recognition method specifically includes the following steps:
s402, acquiring a text image to be recognized.
S404, inputting the text image into a first coding unit in a coding structure of the segmentation model, and respectively coding the text image through a convolutional network and a pooling network in the first coding unit to obtain an intermediate coding feature map output by the first coding unit.
S406, for the coding unit following the first coding unit in the coding structure, determining first input data corresponding to the current coding unit.
S408, transmitting the first input data to the current coding unit, and respectively coding the first input data through the convolutional network and the pooling network in the current coding unit to obtain an intermediate coding feature map output by the current coding unit.
S410, taking the intermediate coding feature map output by the current coding unit as the first input data of the next coding unit, returning to the step of transmitting the first input data to the current coding unit, and continuing until a first stop condition is met; the intermediate coding feature map output by the last coding unit of the coding structure is taken as the first feature map corresponding to the text image.
S412, acquiring the intermediate coding feature maps generated by each coding unit in the coding process.
S414, inputting the first feature map into the first decoding unit in the decoding structure of the segmentation model, and performing up-sampling processing on the first feature map through the first decoding unit to obtain a corresponding intermediate decoding feature map.
S416, for a decoding unit following the first decoding unit in the decoding structure, when the current decoding unit is a first decoding unit, determining the second input data corresponding to the first decoding unit.
S418, performing up-sampling processing on the second input data through the first decoding unit to obtain a corresponding intermediate decoding feature map.
S420, for a decoding unit following the first decoding unit in the decoding structure, when the current decoding unit is a second decoding unit, determining the third input data corresponding to the second decoding unit.
S422, performing up-sampling processing on the third input data through the second decoding unit to obtain a corresponding intermediate decoding feature map.
S424, stopping when a second stop condition is met, and taking the intermediate decoding feature map output by the last decoding unit of the decoding structure as the second feature map corresponding to the text image.
S426, performing pixel-level classification processing according to the second feature map to identify the text in the text image.
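Tying steps S404 to S426 together, a hedged end-to-end sketch follows. It reuses the CodingUnit and DecodingUnit sketches above; the fusion schedule (fusing only at the two deepest skip resolutions, as in the worked example) and the 1 x 1 classifier head are assumptions, and the decoder channel widths must mirror the encoder so the stored pooling indices line up.

```python
import torch.nn as nn

class SegModelSketch(nn.Module):
    def __init__(self, enc_units, dec_units, fuse, last_ch, num_classes):
        super().__init__()
        self.enc = nn.ModuleList(enc_units)   # e.g. 4 CodingUnit instances
        self.dec = nn.ModuleList(dec_units)   # e.g. 4 DecodingUnit instances
        self.fuse = fuse                      # e.g. [True, True, False, False]
        self.classifier = nn.Conv2d(last_ch, num_classes, kernel_size=1)

    def forward(self, x):
        feats, idxs = [], []
        for unit in self.enc:                 # S404-S410: sequential encoding
            x, idx = unit(x)
            feats.append(x)                   # intermediate coding feature maps
            idxs.append(idx)                  # max-value indices per pooling layer
        for i, unit in enumerate(self.dec):   # S414-S424: sequential decoding
            skip = feats[-(i + 2)] if self.fuse[i] else None
            x = unit(x, idxs[-(i + 1)], skip)
        return self.classifier(x)             # logits for the S426 classification
```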
According to the above text image-based recognition method, the text image is encoded through the coding structure of the segmentation model, which includes at least one deformable convolution, to obtain the corresponding first feature map. The first feature map is decoded through the decoding structure of the segmentation model according to the intermediate coding feature maps generated by each coding unit of the coding structure during encoding, to obtain the corresponding second feature map. Pixel-level classification is then performed on the second feature map to identify the text in the text image. Adding deformable convolutions to the coding structure significantly enlarges the convolutional receptive field and matches the text shapes in the text image more accurately, thereby improving the accuracy of feature-map extraction. Decoding the first feature map with the help of the intermediate coding feature maps generated in the coding structure, that is, feature-fusing the feature maps generated in the decoding structure with the corresponding intermediate coding feature maps, effectively preserves complete semantic information and avoids the loss of semantic information, so that segmentation precision and text recognition efficiency are greatly improved.
It should be understood that although the various steps in the flowcharts of fig. 2 and 4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in fig. 2 and 4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a text image-based recognition apparatus 500, including: an obtaining module 501, an encoding module 502, a decoding module 503, and an identifying module 504, wherein:
an obtaining module 501, configured to obtain a text image to be recognized.
The encoding module 502 is configured to input the text image into the encoding structure of the segmentation model, and sequentially encode the text image through at least one encoding unit in the encoding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution.
The obtaining module 501 is further configured to obtain intermediate coding feature maps generated by each coding unit during the coding process.
A decoding module 503, configured to perform decoding processing on the first feature map according to each intermediate coding feature map through the decoding structure in the segmentation model, so as to obtain a corresponding second feature map.
And the identifying module 504 is configured to perform pixel-level classification processing according to the second feature map to identify text in the text image.
In one embodiment, the encoding module 502 is further configured to input the text image to the first coding unit in the coding structure of the segmentation model, and encode the text image through the convolutional network and the pooling network in the first coding unit, respectively, to obtain the intermediate coding feature map output by the first coding unit; for coding units after the first coding unit in the coding structure, determine the first input data corresponding to the current coding unit, where the first input data is the intermediate coding feature map generated by the previous coding unit in the coding process; transmit the first input data to the current coding unit, and encode the first input data through the convolutional network and the pooling network in the current coding unit, respectively, to obtain the intermediate coding feature map output by the current coding unit; and take the intermediate coding feature map output by the current coding unit as the first input data of the next coding unit, return to the step of transmitting the first input data to the current coding unit and continue until a first stop condition is met, and take the intermediate coding feature map output by the last coding unit of the coding structure as the first feature map corresponding to the text image.
In an embodiment, the decoding structure includes at least one first decoding unit and at least one second decoding unit, and the decoding module 503 is further configured to input the first feature map to the first decoding unit in the decoding structure of the segmentation model, and perform up-sampling processing on the first feature map through that decoding unit to obtain a corresponding intermediate decoding feature map; for a decoding unit after the first decoding unit in the decoding structure, when the current decoding unit is a first decoding unit, determine the second input data corresponding to the first decoding unit, where the second input data is obtained by fusing the intermediate decoding feature map output by the previous decoding unit with the intermediate coding feature map of the same resolution in the coding structure; perform up-sampling processing on the second input data through the first decoding unit to obtain a corresponding intermediate decoding feature map, where the intermediate decoding feature map corresponding to the first decoding unit serves directly as the input data of the next decoding unit or is fused to generate the input data of the next decoding unit; when the current decoding unit is a second decoding unit, determine the third input data corresponding to the second decoding unit, where the third input data is the intermediate decoding feature map output by the previous decoding unit; perform up-sampling processing on the third input data through the second decoding unit to obtain a corresponding intermediate decoding feature map, where the intermediate decoding feature map corresponding to the second decoding unit serves directly as the input data of the next decoding unit or is fused to generate the input data of the next decoding unit; and stop when a second stop condition is met, taking the intermediate decoding feature map output by the last decoding unit of the decoding structure as the second feature map corresponding to the text image.
In one embodiment, the decoding module 503 is further configured to obtain an index feature map with the same size as the first feature map, and transmit the first feature map and the corresponding index feature map to the first decoding unit; traversing the index feature map through the first decoding unit, and acquiring maximum value indexes corresponding to all pixels in the first feature map from the index feature map; acquiring a preset feature map; the size of the preset feature map is larger than that of the first feature map; and assigning the position pixel corresponding to the maximum value index in the preset feature map as the pixel value of the first feature map, and assigning the other position pixels which are not assigned in the preset feature map as zero to obtain the corresponding intermediate decoding feature map.
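This index-based upsampling corresponds to what PyTorch's nn.MaxUnpool2d implements: each pixel value is written back at its max-value index position in a larger, zero-initialized preset feature map. A tiny demo:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.tensor([[[[1., 2.], [3., 4.]]]])  # 1x1x2x2 toy feature map
pooled, idx = pool(x)                       # pooled = [[4.]], idx marks where 4 was
restored = unpool(pooled, idx)              # 4 at its original spot, zeros elsewhere
print(restored)                             # tensor([[[[0., 0.], [0., 4.]]]])
```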
In one embodiment, referring to fig. 6, the text image-based recognition apparatus 500 further includes a training module 505 configured to obtain training data, where the training data includes sample text images and the label texts corresponding to the sample text images; input the sample text image into the coding structure of the segmentation model to be trained, and sequentially encode the sample text image through at least one coding unit in the coding structure to obtain a first sample feature map corresponding to the sample text image, where the coding structure includes at least one deformable convolution; obtain the intermediate sample coding feature maps respectively generated by each coding unit in the coding process; decode the first sample feature map according to each intermediate sample coding feature map through the decoding structure in the segmentation model to be trained to obtain a corresponding second sample feature map; perform pixel-level classification processing according to the second sample feature map to identify the sample text in the sample text image; and adjust the network parameters of the segmentation model to be trained and continue training based on the difference between the sample text and the label text until a training stop condition is met.
In one embodiment, the obtaining module 501 is further configured to acquire a first sample text image; perform image transformation or text transformation on the first sample text image to obtain a corresponding second sample text image; use the first sample text image and the second sample text image together as sample text images; and mark the pixels of the region corresponding to each text in the sample text images as label text, and use the sample text images together with their corresponding label texts as training data.
In one embodiment, the text image comprises a text image related to a vehicle, and the text in the text image includes a vehicle identification code. The text image-based recognition apparatus 500 further includes a service processing module 506, configured to, when the vehicle identification code in the text image related to the vehicle is recognized, search for the corresponding vehicle annual inspection information according to the vehicle identification code, and execute corresponding business processing based on the vehicle annual inspection information.
According to the above text image-based recognition apparatus, the text image is encoded through the coding structure of the segmentation model, which includes at least one deformable convolution, to obtain the corresponding first feature map. The first feature map is decoded through the decoding structure of the segmentation model according to the intermediate coding feature maps generated by each coding unit of the coding structure during encoding, to obtain the corresponding second feature map. Pixel-level classification is then performed on the second feature map to identify the text in the text image. Adding deformable convolutions to the coding structure significantly enlarges the convolutional receptive field and matches the text shapes in the text image more accurately, thereby improving the accuracy of feature-map extraction. Decoding the first feature map with the help of the intermediate coding feature maps generated in the coding structure, that is, feature-fusing the feature maps generated in the decoding structure with the corresponding intermediate coding feature maps, effectively preserves complete semantic information and avoids the loss of semantic information, so that segmentation precision and text recognition efficiency are greatly improved.
For specific limitations of the text image-based recognition apparatus, reference may be made to the above limitations of the text image-based recognition method, which are not repeated here. Each module in the text image-based recognition apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in hardware in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided; the computer device may specifically be a terminal or a server, and its internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, and a communication interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication may be implemented by WiFi (Wireless Fidelity), an operator network, NFC (Near Field Communication), or other technologies. The computer program is executed by the processor to implement a text image-based recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described text image based recognition method. Here, the steps of the text image-based recognition method may be the steps in the text image-based recognition methods of the above-described embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the above-mentioned text image based recognition method. Here, the steps of the text image-based recognition method may be the steps in the text image-based recognition methods of the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A text image-based recognition method, the method comprising:
acquiring a text image to be identified;
inputting the text image into a coding structure of a segmentation model, and sequentially coding the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution;
acquiring intermediate coding feature maps respectively generated by each coding unit in the coding process;
decoding the first feature map according to each intermediate coding feature map through a decoding structure in the segmentation model to obtain a corresponding second feature map;
and performing pixel-level classification processing according to the second feature map to identify texts in the text images.
2. The method according to claim 1, wherein the inputting the text image into a coding structure of a segmentation model, and sequentially coding the text image through at least one coding unit in the coding structure to obtain a first feature map corresponding to the text image comprises:
inputting the text image to a first coding unit in a coding structure of a segmentation model, and respectively coding the text image through a convolutional network and a pooling network in the first coding unit to obtain an intermediate coding feature map output by the first coding unit;
for coding units subsequent to the first coding unit in the coding structure, determining first input data corresponding to a current coding unit; the first input data is an intermediate coding feature map generated by a previous coding unit in the coding process;
transmitting the first input data to the current coding unit, and respectively coding the first input data through a convolutional network and a pooling network in the current coding unit to obtain an intermediate coding feature map output by the current coding unit;
and taking the intermediate coding feature map output by the current coding unit as first input data of the next coding unit, returning to the step of transmitting the first input data to the current coding unit and continuing to execute the step until a first stop condition is met, and taking the intermediate coding feature map output by the last coding unit of the coding structure as the first feature map corresponding to the text image.
3. The method of claim 1, wherein the decoding structure comprises at least one first decoding unit and at least one second decoding unit; the obtaining, by a decoding structure in the segmentation model, a corresponding second feature map by performing decoding processing on the first feature map according to each intermediate coding feature map includes:
inputting the first feature map into a first decoding unit in a decoding structure of the segmentation model, and performing up-sampling processing on the first feature map through the first decoding unit to obtain a corresponding intermediate decoding feature map;
for a decoding unit located after the first decoding unit in the decoding structure, when a current decoding unit is a first decoding unit, determining second input data corresponding to the first decoding unit; the second input data is obtained by fusing the intermediate decoding feature map output by the previous decoding unit and the intermediate coding feature map with the same resolution in the coding structure;
performing up-sampling processing on the second input data through the first decoding unit to obtain a corresponding intermediate decoding feature map; the intermediate decoding feature map corresponding to the first decoding unit is used as the input data of the next decoding unit directly or used for generating the input data of the next decoding unit through fusion;
for a decoding unit located after the first decoding unit in the decoding structure, when a current decoding unit is a second decoding unit, determining third input data corresponding to the second decoding unit; the third input data is the intermediate decoding feature map output by the previous decoding unit;
performing upsampling processing on the third input data through the second decoding unit to obtain a corresponding intermediate decoding feature map; the intermediate decoding feature map corresponding to the second decoding unit is used for directly serving as input data of a next decoding unit or generating input data of the next decoding unit through fusion;
and stopping when a second stop condition is met, and taking the intermediate decoding feature map output by the last decoding unit of the decoding structure as a second feature map corresponding to the text image.
4. The method according to claim 3, wherein the inputting the first feature map into a first decoding unit in the decoding structure of the segmentation model, and performing upsampling processing on the first feature map by the first decoding unit to obtain a corresponding intermediate decoded feature map comprises:
acquiring an index feature map with the same size as the first feature map, and transmitting the first feature map and the corresponding index feature map to a first decoding unit;
traversing the index feature map through the first decoding unit, and acquiring maximum value indexes corresponding to all pixels in the first feature map from the index feature map;
acquiring a preset feature map; the size of the preset feature map is larger than that of the first feature map;
and assigning the position pixel corresponding to the maximum value index in the preset feature map as the pixel value of the first feature map, and assigning other position pixels which are not assigned in the preset feature map as zero to obtain a corresponding intermediate decoding feature map.
5. The method of claim 1, wherein the segmentation model is trained by:
acquiring training data; the training data comprises a sample text image and a label text corresponding to the sample text image;
inputting the sample text image into a coding structure of a segmentation model to be trained, and sequentially coding the sample text image through at least one coding unit in the coding structure to obtain a first sample feature map corresponding to the sample text image; wherein the coding structure comprises at least one deformable convolution;
acquiring intermediate sample coding characteristic graphs respectively generated by each coding unit in the coding process;
decoding the first sample feature map according to the intermediate sample coding feature maps through a decoding structure in the segmentation model to be trained to obtain a corresponding second sample feature map;
performing pixel-level classification processing according to the second sample feature map to identify sample texts in the sample text images;
and adjusting the network parameters of the segmentation model to be trained and continuing training based on the difference between the sample text and the label text until the training stopping condition is met.
6. The method of claim 5, wherein the obtaining training data comprises:
acquiring a first sample text image;
performing image transformation or text transformation on the first sample text image to obtain a corresponding second sample text image;
the first sample text image and the second sample text image are used as sample text images together;
and marking pixels of the corresponding area of each text in the sample text image as label texts respectively, and taking the sample text image and the label texts corresponding to the sample text image as training data together.
7. The method of any one of claims 1 to 6, wherein the text image comprises a text image relating to a vehicle; the text in the text image comprises a vehicle identification code; the method further comprises the following steps:
when a vehicle identification code in a text image related to a vehicle is identified, searching corresponding vehicle annual inspection information according to the vehicle identification code;
and executing corresponding business processing based on the annual inspection information of the vehicle.
8. An apparatus for recognizing based on a text image, the apparatus comprising:
the acquisition module is used for acquiring a text image to be identified;
the encoding module is used for inputting the text image into an encoding structure of a segmentation model, and sequentially encoding the text image through at least one encoding unit in the encoding structure to obtain a first feature map corresponding to the text image; wherein the coding structure comprises at least one deformable convolution;
the obtaining module is further configured to obtain intermediate coding feature maps generated by each coding unit in a coding process;
the decoding module is used for decoding the first feature map according to each intermediate coding feature map through a decoding structure in the segmentation model to obtain a corresponding second feature map;
and the identification module is used for carrying out pixel-level classification processing according to the second feature map so as to identify the text in the text image.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010997733.2A 2020-09-21 2020-09-21 Text image-based recognition method, device, equipment and medium Withdrawn CN112183542A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010997733.2A CN112183542A (en) 2020-09-21 2020-09-21 Text image-based recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010997733.2A CN112183542A (en) 2020-09-21 2020-09-21 Text image-based recognition method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112183542A true CN112183542A (en) 2021-01-05

Family

ID=73956409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010997733.2A Withdrawn CN112183542A (en) 2020-09-21 2020-09-21 Text image-based recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112183542A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966791A (en) * 2021-04-30 2021-06-15 平安科技(深圳)有限公司 Image classification method, device, equipment and medium based on semantic segmentation
CN113221879A (en) * 2021-04-30 2021-08-06 北京爱咔咔信息技术有限公司 Text recognition and model training method, device, equipment and storage medium
CN113506310A (en) * 2021-07-16 2021-10-15 首都医科大学附属北京天坛医院 Medical image processing method and device, electronic equipment and storage medium
CN113506310B (en) * 2021-07-16 2022-03-01 首都医科大学附属北京天坛医院 Medical image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210105