CN115909336A - Text recognition method and device, computer equipment and computer-readable storage medium


Info

Publication number
CN115909336A
CN115909336A (application CN202110942358.6A)
Authority
CN
China
Prior art keywords
image
sample
attention
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110942358.6A
Other languages
Chinese (zh)
Inventor
王斌
薛莫白
曹浩宇
包志敏
姜德强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110942358.6A
Publication of CN115909336A
Legal status: Pending (current)

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a text recognition method, a text recognition device, computer equipment and a computer-readable storage medium. A text image sample is obtained; image index calculation is performed according to the image attribute information of the text image sample, and a reference sample index is determined based on the calculation result; image feature extraction processing is performed on the text image sample through a feature extraction model to obtain image feature information; attention feature extraction is performed through the feature extraction model based on the image feature information to obtain attention feature information attending to context information; a prediction sample index is predicted based on the attention feature information; and the feature extraction model is trained according to the prediction sample indexes and the corresponding reference sample indexes, so that attention feature information of a text image to be recognized can be extracted through the trained feature extraction model for image text recognition. With this scheme, a large number of unlabeled text image samples can be used to train the feature extraction model, enhancing its training effect.

Description

Text recognition method and device, computer equipment and computer-readable storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a text recognition method and apparatus, a computer device, and a computer-readable storage medium.
Background
Optical Character Recognition (OCR) refers to a process in which a computer device detects the shapes of characters, such as characters printed on paper or contained in a picture, and then translates the detected shapes into computer text using a character recognition method. In some application scenarios, such as advertising and poster publicity, fonts are often deformed, and the deformations are diverse; to improve the recognition effect, a large number of training samples corresponding to the scene must be obtained and labeled, and the model is trained on the labeled training data to improve its ability to recognize characters.
However, when the trained model is applied to other scenes, its recognition effect is poor because the font deformations differ. Obtaining training samples for different scenes and labeling a large number of them consumes a great deal of manpower, so acquiring training samples is difficult, and so is training the model.
Disclosure of Invention
The embodiment of the application provides a text recognition method, a text recognition device, computer equipment and a computer-readable storage medium, which can train a feature extraction model using unlabeled text image samples and enhance the training effect of the feature extraction model.
The text recognition method provided by the embodiment of the application comprises the following steps:
acquiring a text image sample;
performing image index calculation according to the image attribute information of the text image sample, and determining a reference sample index of the text image sample based on the calculation result;
performing image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample;
performing attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of attention context information of the text image sample;
predicting a prediction sample index of the text image sample based on the attention feature information of the text image sample;
and training the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes, and extracting attention feature information of the text image to be recognized through the trained feature extraction model to recognize the image text.
Correspondingly, an embodiment of the present application further provides a text recognition apparatus, including:
the acquisition unit is used for acquiring a text image sample;
the calculation unit is used for performing image index calculation according to the image attribute information of the text image sample and determining a reference sample index of the text image sample based on the calculation result;
the first feature extraction unit is used for performing image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample;
a second feature extraction unit, configured to perform attention feature extraction on the text image sample based on the image feature information through the feature extraction model, so as to obtain attention feature information of attention context information of the text image sample;
a prediction unit configured to predict a prediction sample index of the text image sample based on attention feature information of the text image sample;
and the training unit is used for training the feature extraction model according to the prediction sample index and the corresponding reference sample index so as to extract the attention feature information of the text image to be recognized through the trained feature extraction model for image text recognition.
Correspondingly, the embodiment of the application also provides computer equipment, which comprises a memory and a processor; the memory stores a computer program, and the processor is used for operating the computer program in the memory to execute any text recognition method provided by the embodiment of the application.
Accordingly, embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is loaded by a processor to execute any one of the text recognition methods provided in the embodiments of the present application.
The embodiment of the application obtains a text image sample; performs image index calculation according to the image attribute information of the text image sample, and determines a reference sample index of the text image sample based on the calculation result; performs image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample; performs attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of the attention context information of the text image sample; predicts a prediction sample index of the text image sample based on the attention feature information; and trains the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes, extracting attention feature information of the text image to be recognized through the trained feature extraction model for image text recognition. In this scheme, the feature extraction model is trained through the reference sample indexes and the prediction sample indexes, so a large number of unlabeled text image samples can be used for training the feature extraction model, enhancing its training effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a scene diagram of a text recognition method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a text recognition method provided in an embodiment of the present application;
FIG. 3 is a flowchart of an image restoration process provided in an embodiment of the present application;
FIG. 4 is another flow chart of a text recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a feature extraction network provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a model structure provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a text recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a text recognition method, a text recognition device, computer equipment and a computer readable storage medium. The text recognition device may be integrated in a computer device, and the computer device may be a server or a terminal.
The terminal may include a mobile phone, a wearable smart device, a tablet computer, a notebook computer, a personal computer (PC), a vehicle-mounted computer, and the like.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and big data and artificial intelligence platforms.
For example, as shown in fig. 1, the text recognition method may include an upstream task, a downstream task, and a feature extraction task. In the feature extraction task, a computer device may obtain a text image sample and perform image feature extraction processing on it through the DenseNet neural network of a feature extraction model to obtain image feature information of the text image sample. The feature extraction model performs random masking processing on the image feature information, performs attention feature extraction on the masked image feature information based on a multi-head attention mechanism to obtain initial attention feature information of the text image sample, normalizes the initial attention feature information through a first batch normalization (BN) layer, and normalizes it again through a feed-forward network and a second BN layer to obtain the attention feature information. The BN layers regulate the data in the attention feature information into a data interval, reducing the divergence of the data and the training difficulty of the feature extraction model.
The upstream task is used for training the feature extraction model to improve its feature extraction capability. The computer device may perform image index calculation on the text image sample according to its image attribute information to obtain image index information; for example, image index information such as a color histogram, boundary features, homogeneity, contrast, and entropy of the text image sample is obtained according to its color, contour, shape, and texture features, and index merging processing is performed on the image index information matched by the index value expression type of each image index to obtain the reference sample index of the text image sample. Prediction sample indexes of the text image sample are predicted based on the attention feature information through different fully connected layers, for example through feature processing modes such as image restoration processing and dimension conversion, and the feature extraction model is trained based on the error between the reference sample indexes and the prediction sample indexes to obtain a pre-trained feature extraction model.
The downstream task is used for extracting attention feature information of a small number of target image samples with sample labels through the pre-trained feature extraction model, predicting a prediction result of the target image samples based on the attention feature information through a text recognition model, and adjusting the parameters of the pre-trained feature extraction model and the text recognition model according to the prediction result and the sample labels to obtain a trained feature extraction model and a trained text recognition model, through which image text recognition is performed on the text image to be recognized. In this scheme, the feature extraction model is trained through the reference sample indexes and the prediction sample indexes, so a large number of unlabeled text image samples can be used for training the feature extraction model, enhancing its training effect.
The following are detailed descriptions. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
In this embodiment, a text recognition apparatus will be described from the perspective of the text recognition apparatus, where the text recognition apparatus may be specifically integrated in a computer device, and the computer device may be a server or a terminal, and as shown in fig. 2, a specific flow of a text recognition method is as follows:
101. A text image sample is obtained.
The text image sample may be a training sample used for training the feature extraction model; it may contain words, characters and the like, where the characters may be characters of different languages, artistic characters with various deformations, and so on; and the text image sample may be an unlabeled text image sample.
For example, the text image sample may be obtained from a database or a blockchain, or a text image to be recognized uploaded by a user at a terminal may be used as the text image sample.
102. And performing image index calculation according to the image attribute information of the text image sample, and determining a reference sample index of the text image sample based on the calculation result.
The image attribute information may be information characterizing properties of the text image sample, for example, image attribute information such as color values and luminances of each pixel in the text image sample under different color channels.
The image index calculation may be a calculation performed on the text image sample based on its image attribute information with respect to at least one feature, for example, a calculation with respect to the color feature, contour feature, shape feature, or texture feature of the text image sample.
The reference sample index is determined from the calculation result of the image index calculation; it serves as a label for the text image sample and is used for comparison with the prediction sample index predicted from the attention feature information extracted by the feature extraction model.
For example, specifically, for at least one feature, the image attribute information of the text image sample is obtained and calculated to obtain image index information (the image index information is the calculation result of performing image index calculation on the image attribute information). For the color feature, a color histogram (e.g., an RGB color histogram, an HSV color histogram, or a gray histogram), or/and a color set and color moments, may be calculated from the color values of the different channels of each pixel in the text image sample; for the contour feature, the boundary features of the text image sample can be obtained through a Hough transform, the edge direction histogram can be calculated, or/and a Fourier shape description can be performed to obtain Fourier descriptors of the text image sample; for the shape feature, the distribution of edges, corners, regions, or/and ridges in the text image sample can be calculated; for the texture feature, the homogeneity, contrast, dissimilarity, entropy, angular second moment, or/and correlation can be calculated from the brightness and color values of each pixel; and for the overall feature of the text image sample, attribute information such as the color value of each pixel under different channels is calculated to obtain image index information about the text image sample (this image index information may be a three-dimensional tensor of size Channels × H × W, where H × W is the size of the text image sample, H pixels long and W pixels wide, and Channels is the number of channels, which can be set flexibly according to the different image attribute information).
The image index information calculated according to the image attribute information of the text image sample is used as the reference sample index of the text image sample; for example, the one-dimensional tensor corresponding to an RGB color histogram calculated for the color feature may be used as a reference sample index of the text image sample.
The color histogram may be obtained by dividing a color space (e.g., the RGB color space, the HSV color space, or the gray-scale space) into a number of small color intervals and counting the number of pixels falling into each interval, yielding image index information in the form of a one-dimensional tensor; similarly, the edge direction histogram may be obtained by counting the pixels in the text image whose edge direction falls within each interval.
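As an illustration of one such index, the following is a minimal sketch of computing a normalized RGB color histogram as a one-dimensional tensor; the 16-bins-per-channel choice and the per-channel concatenation are assumptions, as the text does not fix the binning:

```python
import numpy as np

def rgb_color_histogram(image: np.ndarray, bins_per_channel: int = 16) -> np.ndarray:
    # image: (H, W, 3) uint8 text image sample.
    hists = []
    for c in range(3):  # R, G, B channels
        hist, _ = np.histogram(image[:, :, c],
                               bins=bins_per_channel, range=(0, 256))
        hists.append(hist)
    hist = np.concatenate(hists).astype(np.float32)
    return hist / hist.sum()  # normalized one-dimensional index

# A 32x128 text-line image yields a 48-dimensional index.
sample = np.random.randint(0, 256, size=(32, 128, 3), dtype=np.uint8)
assert rgb_color_histogram(sample).shape == (48,)
```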
The color set may be obtained by converting the RGB color space into a visually balanced color space (e.g., the HSV space) and quantizing it into a number of color intervals. The image is divided into a plurality of areas by an automatic color segmentation technique, and each area is indexed by a certain color component of the quantized color space, so that the image is expressed as a binary color index set; this color index set can be expressed as a one-dimensional tensor, which is image index information calculated according to the image attribute information of the text image sample.
The color moments may be extracted from the text image sample, for example the first moment, second moment, and third moment; a one-dimensional tensor can be obtained from the extracted color moments, and this one-dimensional tensor may be image index information.
The Hough transform can identify geometric shapes in the text image sample; a calculation result can be obtained based on the Hough transform, and a picture extracting the geometric shapes in the text image sample can be obtained from the result, i.e., performing the Hough transform on the text image sample yields a two-dimensional tensor of size H × W, where H × W is the size of the text image sample (H pixels long, W pixels wide). This two-dimensional tensor can be one piece of image index information. Similarly, corresponding two-dimensional tensors can be obtained by performing Fourier shape description on the text image sample and by detecting the distribution of edges, corners, regions, and ridges in it; each such two-dimensional tensor is image index information calculated from the image attribute information of the text image sample.
Homogeneity, contrast, dissimilarity, entropy, the angular second moment, correlation, and the like are measures of the texture features of the text image sample; specifically, corresponding numerical values can be calculated from the gray levels and brightness in the text image sample, and each obtained numerical value is image index information.
If image index calculation is performed for multiple features, a large number of reference sample indexes are obtained and the same number of prediction sample indexes must be predicted, entailing a large amount of data processing. Moreover, some reference sample indexes are plain numerical values (for example, the computed contrast and entropy of the text image sample) with wide value ranges, which is unfavorable for training the model. In an embodiment, the computed image index information may therefore be merged to obtain the reference sample indexes; merging multiple calculation results reduces the number of reference sample indexes and the training difficulty of the model. That is, "performing image index calculation according to image attribute information of the text image sample, and determining the reference sample indexes of the text image sample based on the calculation results" may specifically include:
performing image index calculation according to the image attribute information of the text image sample to obtain at least one image index information;
and carrying out index merging processing on at least one image index information to obtain a reference sample index of the text image sample.
For example, image index calculation may be performed according to the attribute information of the text image sample to obtain at least one piece of image index information, all the obtained image index information is spliced into a multidimensional array or tensor, and this multidimensional array or tensor is used as the reference sample index.
Optionally, the image index information in numerical form is merged, for example spliced into an array, and the obtained array, together with the image index information in other, non-numerical forms (for example, one-dimensional and multi-dimensional tensors), is used as the reference sample index.
Optionally, the image index information may be merged according to an index value expression type of the image index to obtain a reference sample index, that is, the step "merging the indexes of at least one image index information to obtain a reference sample index of the text image sample" may specifically be:
obtaining an index value expression type of at least one image index;
and carrying out index merging processing on at least one image index information according to the index value expression type to obtain a reference sample index of the text image sample.
The index value expression type may represent an expression type of the image index information, and may be, for example, a numerical value or a tensor (one-dimensional or multi-dimensional) form.
For example, specifically, each image index corresponds to an index value expression type; the corresponding index value expression type is obtained according to the image index, and the image index information whose index value expression type is a numerical value or a one-dimensional tensor is spliced to obtain a one-dimensional tensor about the text image sample containing more image index information; the image index information whose index value expression type is a two-dimensional tensor is spliced to obtain a three-dimensional tensor about the text image sample; and the obtained one-dimensional tensor, the three-dimensional tensor, and the image index information of other index value expression types (for example, image tensors) are used as the reference sample indexes.
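A minimal sketch of this merging step, assuming the image index information arrives as named NumPy arrays; the grouping rule follows the text above, while the dictionary layout and key names are assumptions:

```python
import numpy as np

def merge_index_info(index_info):
    """Group image index information by its index-value expression type."""
    one_d, two_d, passthrough = [], [], {}
    for name, value in index_info.items():
        arr = np.atleast_1d(np.asarray(value, dtype=np.float32))
        if arr.ndim == 1:          # numerical values and 1-D tensors
            one_d.append(arr)
        elif arr.ndim == 2:        # e.g. Hough / edge-direction maps, H x W
            two_d.append(arr)
        else:                      # image tensors etc. pass through unchanged
            passthrough[name] = arr
    merged = dict(passthrough)
    if one_d:
        merged["one_dim_index"] = np.concatenate(one_d)
    if two_d:
        merged["two_dim_index"] = np.stack(two_d)   # 3-D tensor (N, H, W)
    return merged
```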
103. And carrying out image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample.
The feature extraction model may be a neural model for performing image feature extraction on the text image sample.
The image feature extraction may be a process of performing image analysis and transformation on the text image sample to extract feature information of the text image sample, and the image feature information may be information obtained by image feature extraction.
For example, the feature extraction model may specifically include a Convolutional Recurrent Neural Network (CRNN), which consists of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN); the CNN performs convolution processing on the text image sample to obtain a feature map, and the RNN performs feature extraction on the feature map to obtain the image feature information of the text image sample.
Optionally, the feature extraction model may also perform image feature extraction through other neural networks, for example, encode the text image sample through a DenseNet network, and map the text image sample to image feature information capable of representing the text image sample.
104. And performing attention feature extraction on the text image sample through a feature extraction model based on the image feature information to obtain attention feature information of the attention context information of the text image sample.
The text image sample may include a plurality of image regions, each image region may correspond to region feature information, the image feature information of the text image sample may include region feature information of each image region, and the attention feature information may be information having context information obtained by fusing, for each image region, the region feature information of the image region and the region feature information of the associated image region.
For example, specifically, for each image region, the similarity between the region feature information of the image region and the region feature information of the related image region is used as a weight corresponding to the related image region, and the region feature information of the image region and the related image region is weighted and summed based on the weight to obtain the attention feature information of the attention context information corresponding to each image region.
And obtaining attention feature information of the attention context information of the text image to be recognized according to the attention feature information of the attention context information corresponding to each image area.
The associated image area may be an image area associated with the image area, for example, an adjacent image area of the image area, or all image areas of the text image sample, or other image areas in the text image sample.
In an embodiment, the image feature information may include an image feature vector, where the image feature vector may include a region feature vector of each image region, a similarity between the region feature vector of the image region and a region feature vector of an associated image region may be obtained according to a distance between the region feature vector of the image region and the region feature vector of the associated image region, and the image feature vector may further be subjected to attention space mapping, the similarity between each image region and the associated image region is determined according to the distance between the vectors in the attention space, and the attention feature information with context information is obtained according to the similarity, that is, the step "performing attention feature extraction on a text image sample based on the image feature information through a feature extraction model to obtain the attention feature information of the attention context information of the text image sample" specifically may include:
performing attention space mapping processing on the image characteristic information to obtain a corresponding space vector of each image area in the text image sample in an attention space, wherein the space vector comprises a query vector, a content vector and a key vector;
for each image area, calculating the similarity between the image area and the associated image area according to the distance between the query vector of the image area and the key vector of the associated image area;
and for each image area, performing fusion processing on the content vectors of the image area and the associated image areas according to the similarity between the image area and the associated image areas, to obtain the attention feature information of the attention context information.
The query vector, the key vector and the content vector may be space vectors obtained by linear transformation of the image feature vector according to different attention network parameters. The attention network parameters may be network parameters in a feature extraction model.
For example, the attention network parameters may comprise a first attention network parameter W_Q, a second attention network parameter W_K, and a third attention network parameter W_V. The image feature vector is mapped based on the first attention network parameter to obtain the query vector of the text image sample, denoted Query (Q for short), Q = Γ·W_Q, and the query vector of the i-th image area in the text image sample is denoted Q_i; the image feature vector is mapped based on the second attention network parameter to obtain the key vector of the text image sample, denoted Key (K for short), K = Γ·W_K, and the key vector of the i-th image area is denoted K_i; and the image feature vector is mapped based on the third attention network parameter to obtain the content vector of the text image sample, denoted Value (V for short), V = Γ·W_V, and the content vector of the i-th image area is denoted V_i. The spatial vectors of the text image sample in the attention space are thus:

Q = Γ·W_Q,  K = Γ·W_K,  V = Γ·W_V
The distance between the query vector of the i-th image area and the key vector of an associated image area j is calculated; for example, the query vector and the key vector may be dot-multiplied, Q_i·K_j, giving the similarity between image area i and image area j.
And carrying out the same processing on each image area in the text image sample to obtain the similarity between each image area and the corresponding associated area.
And performing weighted summation processing on the image areas and the content vectors of the associated image areas according to the similarity between the image areas and each associated image area to obtain area attention feature information of the attention context information of the image areas, and obtaining the attention feature information of the text image sample based on the area attention feature information of each image area.
In an embodiment, an initial similarity matrix of the text image sample may be obtained from the initial similarities between each image area and its associated image areas, denoted SCORE_0. The element score_ij in the i-th row and j-th column of the initial similarity matrix represents the initial similarity of image area i to image area j (image area j being an associated image area of image area i), and the element score_ji in the j-th row and i-th column represents the initial similarity of image area j to image area i.
In general, for each image area, the text content of image areas far away from it bears little relation to it and has no influence on predicting its text content. Therefore, for each image area, a corresponding window matrix can be applied to the obtained initial similarity matrix, so as to retain the similarities within the region range indicated by the window matrix and mask the similarities at other positions. The window matrix may be set to 0 at the window positions and to −∞, or another very large negative number, elsewhere. Adding the window matrix to the initial similarity matrix sets the similarities at the non-window positions to a very large negative number, and the normalization process then maps them to 0, which can be summarized by the formula:

SCORE_1 = softmax(SCORE_0 + WINDOW)
the first similarity matrix is obtained by performing normalization processing on each image area, and the similarity between the image area and the associated image area can be determined according to the first similarity matrix, and it can be understood that after the window matrix is added, since the similarity of the image area at the non-window position is 0, the associated image area corresponding to each image area is actually the image area at the window position of the window matrix.
The window matrix may be a matrix of the same shape as the initial similarity matrix, used to retain the similarities within the region range indicated by the window matrix (the window positions) and mask the similarities at other positions, for example by setting them to −∞. The window positions may be set for each image area; for example, for image area i, the window positions may be score_i,i−1, score_i,i and score_i,i+1, i.e., the similarities to the 3 adjacent image areas are retained and the similarities to other image areas are masked.
Ideally, if the similarity of image area i to image area j is high, the similarity of image area j to image area i is also high; that is, ideally the similarity matrix is symmetric. A transposed matrix SCORE_1^T can be obtained by interchanging the rows and columns of the first similarity matrix, and adding the transposed matrix to the first similarity matrix yields the similarity matrix SCORE = SCORE_1 + SCORE_1^T, which is a symmetric matrix whose elements satisfy score_ij = score_ji.
The similarity between the feature vector of each image area and the feature vectors of its associated areas can be determined from the similarity matrix; for example, the similarity of image area i to image area j is the element score_ij of the similarity matrix SCORE.
The content vector V of the text image sample is multiplied by the similarity matrix to obtain the attention feature information C = V·SCORE of the text image sample with context information, where V is the content vector of the text image sample and SCORE is the similarity matrix.
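Collecting this subsection into code, the following is a minimal sketch of the described attention variant, assuming T image regions with D-dimensional feature vectors and a band-shaped window; the softmax normalization and the matrix orientation are assumptions:

```python
import torch

def windowed_symmetric_attention(gamma, w_q, w_k, w_v, window=1):
    # gamma: (T, D) image feature vectors, one row per image region.
    q = gamma @ w_q          # Q = Gamma . W_Q  (query vectors)
    k = gamma @ w_k          # K = Gamma . W_K  (key vectors)
    v = gamma @ w_v          # V = Gamma . W_V  (content vectors)

    score0 = q @ k.T         # initial similarity matrix SCORE_0, (T, T)

    # Window matrix: 0 at window positions, a very large negative number
    # elsewhere, so the softmax maps non-window similarities to ~0.
    t = score0.size(0)
    idx = torch.arange(t, device=score0.device)
    outside = (idx[None, :] - idx[:, None]).abs() > window
    score1 = torch.softmax(score0.masked_fill(outside, -1e16), dim=-1)

    score = score1 + score1.T    # symmetrize: score_ij == score_ji
    # The text writes C = V . SCORE; with rows as regions this corresponds
    # to SCORE @ V (the orientation is a layout assumption).
    return score @ v
```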
In an embodiment, the method may add interference information to the text image sample to improve the training effect of the feature extraction model; that is, the step "performing attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of the attention context information of the text image sample" may specifically include:
carrying out mask processing on the image characteristic information through a characteristic extraction model to obtain masked image characteristic information of a text image sample;
and performing attention feature extraction on the masked image feature information to obtain attention feature information of the attention context information of the text image sample.
The masking processing may be a processing manner of masking or selecting some feature information in the image feature information of the text image sample to increase noise of a training sample of the feature extraction model, so that the trained feature extraction model has more generalized feature extraction capability.
For example, some feature information in the image feature information of the text image sample may be masked, the interference information of the sample is added to obtain image feature information after masking, and the attention feature information of the image feature information after masking is extracted to obtain the attention feature information of the attention context information of the text image sample.
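A minimal sketch of such masking over region features; the mask ratio and the zeroing strategy are assumptions, as the text only specifies that some feature information is masked to add interference:

```python
import torch

def random_mask_features(features: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    # features: (T, D) image feature information; zero out roughly
    # mask_ratio of the T region features to inject interference.
    keep = torch.rand(features.size(0), device=features.device) >= mask_ratio
    return features * keep.unsqueeze(-1).to(features.dtype)
```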
Optionally, step 103 may perform image feature extraction through an attention mechanism included in the feature extraction model. An attention mechanism is a special structure embedded in a machine learning model that automatically learns and calculates the contribution of input data to output data. To improve the accuracy of feature extraction, the attention mechanism may be a multi-layer attention mechanism; that is, the step "performing attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of the attention context information of the text image sample" may specifically include:
taking the image characteristic information as input characteristic information of a multi-layer characteristic extraction mechanism;
and sequentially carrying out attention feature extraction on the input feature information through a multilayer feature extraction mechanism to obtain the attention feature information of the attention context information of the text image sample.
For example, the multi-layer attention mechanism may have a certain ordering: the image feature information is used as the input feature information of the first attention layer, which performs attention feature extraction on the text image sample to obtain first attention feature information; the first attention feature information is used as the input feature information of the second attention layer, which extracts second attention feature information; and so on, with the feature information output by the last attention layer used as the attention feature information.
Optionally, each layer of attention mechanism may include a multi-head attention mechanism, and the attention characteristic information of the layer of attention mechanism is obtained according to the sub-attention characteristic information obtained by the multi-head attention mechanism, that is, the step "sequentially performing attention characteristic extraction on input characteristic information through a multi-layer characteristic extraction mechanism to obtain the attention characteristic information of the attention context information of the text image sample" may specifically include:
sequentially extracting attention characteristics of input characteristic information on the basis of a multi-head attention mechanism through a multi-layer characteristic extraction mechanism to obtain sub-attention characteristic information under each attention mechanism;
and performing fusion processing on the sub-attention feature information under each attention mechanism through a multi-layer feature extraction mechanism to obtain the attention feature information of the attention context information of the text image sample.
For example, specifically, each feature extraction layer may extract attention feature information through a multi-head attention mechanism: sub-attention feature information is obtained under each attention head, the sub-attention feature information is spliced to obtain spliced attention feature information, dimension conversion is performed on the spliced information so that its dimension matches that of the input image feature information, and the resulting attention feature information of this layer is output as the input feature information of the next attention layer.
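A minimal sketch of stacking such layers in PyTorch; the head count, dimensions, and the use of nn.MultiheadAttention (which concatenates the per-head sub-attention outputs internally) are assumptions:

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """One layer: multi-head attention, concat, and dimension conversion."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # nn.MultiheadAttention splits the input into `heads` sub-attentions
        # and concatenates their outputs internally.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # keep output dim equal to input dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return self.proj(out)

# Each layer's output feeds the next layer; the last layer's output is
# taken as the attention feature information.
extractor = nn.Sequential(*[AttentionLayer() for _ in range(3)])
```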
105. And predicting a prediction sample index of the text image sample based on the attention feature information of the text image sample.
The prediction sample index may be an index predicted from the attention feature information.
For example, the prediction sample index matching the reference sample index may be predicted based on the attention feature information.
In an embodiment, different processing manners may be adopted for the attention characteristic information according to the index type of the reference sample index to obtain a prediction sample index matching with the reference sample index, that is, the step "predicting the prediction sample index of the text image sample based on the attention characteristic information of the text image sample" may specifically be:
determining a characteristic processing mode corresponding to each index type;
and aiming at each index type, processing the attention characteristic information by adopting a corresponding characteristic processing mode to obtain a prediction sample corresponding to each index type.
The index type may be the index type of the reference sample index, for example a one-dimensional index, a two-dimensional index, or a restoration index. The one-dimensional index indicates that all the image index information it contains consists of one-dimensional tensors (a numerical value being a special one-dimensional tensor); the two-dimensional index indicates that all the image index information it contains consists of two-dimensional tensors; and the restoration index indicates that the image index information it contains is an image tensor.
For example, a feature processing manner corresponding to each index type may be specifically determined, for example, a processing manner such as dimension conversion or image restoration, and according to each index type, the attention feature is processed by using the corresponding feature processing manner, so as to obtain a prediction sample index matched with the reference sample index.
If the index type is a one-dimensional index or a two-dimensional index, dimension conversion processing is performed on the attention feature information to obtain a tensor of the same size as the reference sample index, and the obtained tensor is used as the prediction sample index.
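A minimal sketch of the fully connected prediction heads implied here, assuming mean-pooled attention features and example index sizes; all sizes and names are assumptions:

```python
import torch
import torch.nn as nn

class IndexPredictionHeads(nn.Module):
    """Fully connected heads mapping attention features to index sizes."""
    def __init__(self, dim: int = 512, one_d_size: int = 48,
                 two_d_shape: tuple = (32, 128)):
        super().__init__()
        self.two_d_shape = two_d_shape
        self.head_1d = nn.Linear(dim, one_d_size)                       # 1-D index
        self.head_2d = nn.Linear(dim, two_d_shape[0] * two_d_shape[1])  # 2-D index

    def forward(self, attn_feat: torch.Tensor):
        pooled = attn_feat.mean(dim=1)          # (B, T, D) -> (B, D)
        pred_1d = self.head_1d(pooled)          # same size as 1-D reference index
        pred_2d = self.head_2d(pooled).view(-1, *self.two_d_shape)
        return pred_1d, pred_2d
```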
The feature extraction model performs downsampling while extracting the attention feature information of a text image sample, and information it deems useless is discarded during learning; this cannot be solved merely by increasing the number of channels, because the network optimization target is influenced by the task and the distribution of the training data, so the feature extraction model cannot effectively store two-dimensional image information in the channels and treats part of the effective information as redundant. An image restoration task retains image information to the greatest extent and forces the network to keep more effective information.
If the index type is the restoration index, i.e., a three-dimensional tensor corresponding to the text image sample, the corresponding processing mode is image restoration processing, and the step "processing the attention feature information in the corresponding feature processing mode to obtain the prediction sample index corresponding to each index type" may specifically include:
performing transposition convolution processing based on the attention feature information to obtain processed attention feature information;
and carrying out normalization processing on the attention characteristic information to obtain a prediction sample index.
The transpose convolution processing may be deconvolution processing performed on the attention feature information to perform image restoration processing based on the attention feature information, so as to obtain a tensor with the same size as the text image sample.
For example, the attention feature information may be subjected to transposed convolution processing to obtain processed attention feature information, and the processed attention feature information may be normalized to obtain the prediction sample index.
Optionally, the processed attention feature information obtained by the transposed convolution is batch-normalized and passed through an activation function. A specific network structure for performing image restoration processing on the attention feature information is shown in fig. 3: the attention feature information is processed alternately by transposed convolution layers and activation layers, the processed attention feature information is output, and normalization yields the prediction sample index.
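A minimal sketch of such an alternating transposed-convolution/activation decoder in the spirit of fig. 3; the layer count, channel widths, and sigmoid normalization are assumptions:

```python
import torch.nn as nn

class RestorationHead(nn.Module):
    """Alternating transposed-convolution / activation decoder."""
    def __init__(self, in_ch: int = 512, out_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, out_ch, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),  # normalize output for comparison with the image tensor
        )

    def forward(self, feat_map):  # (B, in_ch, H/8, W/8) -> (B, out_ch, H, W)
        return self.net(feat_map)
```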
106. And training the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes, and extracting attention feature information of the text image to be recognized through the trained feature extraction model to recognize the image text.
For example, the feature extraction model may specifically be trained on the error between the prediction sample indexes and the corresponding reference sample indexes, and the network parameters of the feature extraction model are adjusted to obtain the trained feature extraction model.
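Tying the pieces together, the following is a minimal sketch of one such upstream training step, reusing the hypothetical IndexPredictionHeads and RestorationHead from the sketches above; the MSE loss, the feat_to_map reshape helper, and the batch layout are assumptions, since the text only specifies training on the error between prediction and reference sample indexes:

```python
import torch.nn.functional as F

def pretrain_step(extractor, heads, restoration, feat_to_map, batch, optimizer):
    """One upstream step: predict indexes and train on the error."""
    feat = extractor(batch["image"])            # attention feature information
    pred_1d, pred_2d = heads(feat)
    pred_img = restoration(feat_to_map(feat))   # feat_to_map: assumed reshape to (B, C, H', W')
    loss = (F.mse_loss(pred_1d, batch["ref_1d"])
            + F.mse_loss(pred_2d, batch["ref_2d"])
            + F.mse_loss(pred_img, batch["ref_image"]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```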
After the trained feature extraction model is obtained, image feature extraction can be performed on the text image to be recognized through the trained feature extraction model to obtain image feature information, and attention feature extraction is performed on the image feature information to obtain the attention feature information of the text image to be recognized. The text content in the text image to be recognized is then predicted based on the attention feature information.
Optionally, the feature extraction model is trained through the prediction sample indexes and the reference sample indexes to obtain a pre-trained feature extraction model, which can then be fine-tuned with labeled target text image samples to obtain the trained feature extraction model, improving its feature extraction capability. That is, after the step "training the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes", the method may further include:
acquiring a target text image sample, wherein the target text image sample carries a sample label;
performing image feature extraction processing on the target text image sample through a pre-training feature extraction model to obtain image feature information of the target text image sample;
performing attention feature extraction on the target text image sample through a pre-training feature extraction model based on the image feature information of the target text image sample to obtain the attention feature information of the attention context information of the target text image sample;
predicting based on attention feature information of the target text image sample through a text recognition model to obtain a prediction result of the target text image sample;
training the text recognition model and the pre-training feature extraction model based on the sample label and the prediction result to obtain a trained text recognition model and a trained feature extraction model, so as to recognize the text content of the text image to be recognized through the text recognition model.
For example, the attention feature information of the target text image sample may be extracted through a pre-training feature extraction model, the target text image sample is predicted through a text recognition model based on the attention feature information to obtain a prediction result, the pre-training feature extraction model and the text recognition model are trained according to the prediction result and a sample label to obtain a trained feature extraction model and a trained text recognition model, the feature of the text image to be recognized is extracted through the trained feature extraction model to obtain the attention feature information, and the prediction result is obtained through the trained text recognition model based on the attention feature information.
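A minimal sketch of one downstream fine-tuning step as described above; the per-position cross-entropy is a stand-in for whatever sequence loss (e.g., CTC) the actual text recognition model uses, which the text does not specify:

```python
import torch.nn.functional as F

def finetune_step(extractor, recognizer, batch, optimizer):
    """One downstream step on a labelled target text image sample."""
    feat = extractor(batch["image"])      # pre-trained feature extraction model
    logits = recognizer(feat)             # (B, T, vocab_size) prediction result
    loss = F.cross_entropy(logits.flatten(0, 1), batch["labels"].flatten())
    optimizer.zero_grad()
    loss.backward()                       # adjusts both models' parameters
    optimizer.step()
    return loss.item()
```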
As can be seen from the above, the embodiment of the application obtains a text image sample; performs image index calculation according to the image attribute information of the text image sample, and determines a reference sample index of the text image sample based on the calculation result; performs image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample; performs attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of the attention context information of the text image sample; predicts a prediction sample index of the text image sample based on the attention feature information; and trains the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes, extracting attention feature information of the text image to be recognized through the trained feature extraction model for image text recognition. In this scheme, the feature extraction model is trained through the reference sample indexes and the prediction sample indexes, so a large number of unlabeled text image samples can be used for training the feature extraction model, enhancing its training effect.
On the basis of the above-described embodiments, further details will be given below by way of example.
The embodiment will be described from the perspective of a text recognition apparatus, where the text recognition apparatus may be specifically integrated in a computer device, and the computer device may be a server or a terminal.
as shown in fig. 4, the flow of the text recognition method provided in the embodiment of the present application can be divided into a feature extraction task, an upstream task, and a downstream task, and the specific flow can be as follows:
1. and (3) feature extraction task: and performing feature extraction on the input image through a feature extraction model to obtain attention feature information, and using the attention feature information for an upstream task and a downstream task.
201. The server obtains a text image sample, and performs image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample.
For example, the server may specifically obtain a text image sample in the database, encode the text image sample through a DenseNet network of the feature extraction model, and map the text image sample into image feature information Γ capable of representing the text image sample, where the image feature information may be a feature embedding sequence.
202. And the server extracts the attention characteristics of the image characteristic information through the characteristic extraction model to obtain the attention characteristic information of the attention context information of the text image sample.
For example, through the feature extraction model and based on the attention mechanism, the server may specifically map the image feature vector with the first attention network parameter to obtain the query vector of the text image sample, denoted Query (Q for short), Q = Γ·W_Q, with the query vector of the i-th image area denoted Q_i; map the image feature vector with the second attention network parameter to obtain the key vector, denoted Key (K for short), K = Γ·W_K, with the key vector of the i-th image area denoted K_i; and map the image feature vector with the third attention network parameter to obtain the content vector, denoted Value (V for short), V = Γ·W_V, with the content vector of the i-th image area denoted V_i.
The distance between the query vector of the i-th image area and the key vector of an associated image area j is calculated; for example, the query vector and the key vector may be dot-multiplied, Q_i·K_j, giving the similarity between image area i and associated image area j.
And carrying out the same processing on each image area in the text image sample to obtain the similarity between each image area and the corresponding associated area.
And performing weighted summation processing on the image areas and the content vectors of the associated image areas according to the similarity between the image areas and each associated image area to obtain area attention feature information of the attention context information of the image areas, and obtaining the attention feature information of the text image sample based on the area attention feature information of each image area.
Optionally, in order to improve the feature extraction capability, the feature extraction model may include a feature extraction network whose structure may be as shown in fig. 1. After the attention feature information is obtained through the attention mechanism, the data may be normalized through the first BN layer to obtain C* = BN(V·SCORE) and then input into a feed-forward network, which extracts attention feature information to reduce feature redundancy; the feed-forward network may include 2 fully connected layers and 1 activation layer. After C* passes through the feed-forward network and is normalized by the second BN layer, the attention feature information obtained by the feature extraction network is output as C = BN(W_1·max(0, W_0·C* + b_0) + b_1), where W_0, W_1, b_0 and b_1 are the network parameters of the fully connected layers of the feed-forward network. The BN algorithm contained in the BN layer is as follows:
For any input sequence X = {x_1, ..., x_m}, the mean and variance are calculated:

μ = (1/m)·Σᵢ x_i

σ² = (1/m)·Σᵢ (x_i − μ)²

The input X is normalized, where ε is a small constant that prevents a variance of 0 from producing an invalid calculation:

x̂_i = (x_i − μ) / √(σ² + ε)

The final output Y is obtained through the learnable weight γ and the offset β:

y_i = γ·x̂_i + β
The input sequence X may be the image feature information of a group of text image samples, where the text image samples fetched by the feature extraction model in one training iteration form a group, and x_i is the related information of one text image sample in the group, for example, the attention feature information of that text image sample obtained through the attention mechanism.
It can be understood that, after the feature extraction model has been trained, the mean and variance used by the BN layer may be calculated from the means and variances obtained during training, for example, as an average or a moving average.
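As an illustration only, a minimal PyTorch sketch of one such feature extraction block (BN, then a 2-layer feed-forward network with ReLU, then BN); the embedding size and hidden width are assumptions, not values from this application, and PyTorch's BatchNorm1d keeps the moving averages for inference mentioned above:

```python
import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    """One feature extraction network: first BN, feed-forward (2 FC + ReLU), second BN.
    dim and hidden are illustrative assumptions."""
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(dim)   # first BN layer: C* = BN(V·SCORE)
        self.fc0 = nn.Linear(dim, hidden)
        self.fc1 = nn.Linear(hidden, dim)
        self.bn2 = nn.BatchNorm1d(dim)   # second BN layer

    def forward(self, attn_out):
        # attn_out: (batch, dim) attention feature information per sample
        c_star = self.bn1(attn_out)
        # C = BN(W_1·(max(0, W_0·C* + b_0)) + b_1)
        return self.bn2(self.fc1(torch.relu(self.fc0(c_star))))
```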
Optionally, in order to improve the capability of the feature extraction model to extract the attention feature, the feature extraction model may include multiple layers of attention mechanisms, and the multiple layers of attention mechanisms may be distributed in different feature extraction networks, for example, as shown in fig. 5, the output of the previous layer of feature extraction network is used as the input feature information of the next layer of feature extraction network, and the output of the last layer of feature extraction network is used as the attention feature information.
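Stacking the layers as in fig. 5 then reduces to feeding each network's output into the next; a sketch reusing the FeatureExtractionBlock from the previous example, with the layer count 6 being an arbitrary assumption:

```python
import torch.nn as nn

# Multi-layer attention: the output of layer k is the input feature information
# of layer k+1, and the last layer's output is the attention feature information.
stacked_extractor = nn.Sequential(*[FeatureExtractionBlock() for _ in range(6)])
```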
2. An upstream task: calculating a reference sample index of the text image sample according to the image attribute information of the text image sample, predicting to obtain a prediction sample index based on the attention feature information, and training a feature extraction model according to the reference sample index and the prediction sample index.
203. And the server performs image index calculation according to the image attribute information of the text image sample to obtain at least one piece of image index information.
For example, the server may specifically acquire image attribute information of the text image sample for at least one feature and perform calculation on it to obtain image index information (the image index information is the calculation result obtained by performing image index calculation on the image attribute information). For example, for the color feature, a color histogram (e.g., an RGB color histogram, an HSV color histogram or a gray histogram), and/or a color set and color moments, etc., may be calculated from the color values of each pixel in the text image sample under different channels; for the overall feature of the text image sample, calculation may be performed on attribute information such as the color values of each pixel of the text image sample under different channels to obtain image index information about the text image sample.
204. And the server combines the image index information according to the index value expression type of the image index to obtain a reference sample index of the text image sample, where the reference sample index includes a one-dimensional index, a two-dimensional index and a restoration index.
For example, each image index may specifically correspond to an index value expression type. The server acquires the corresponding index value expression type according to the image index, and performs splicing processing on the image index information whose index value expression type is a numerical value or a one-dimensional tensor, to obtain a one-dimensional tensor about the text image sample that contains more image index information, which is used as the one-dimensional index. The image index information whose index value expression type is a two-dimensional tensor is spliced to obtain a three-dimensional tensor of the text image sample, which is used as the two-dimensional index.
The image index information obtained by calculating the color values of the text image sample under different channels is combined into a three-dimensional tensor, which is used as the restoration index.

The one-dimensional index, the two-dimensional index, the restoration index and the like are taken as the reference sample indexes of the text image sample.
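For illustration, a NumPy sketch of computing a few image indices and merging them by index value expression type; which indices are used, the bin counts and the merge order are all assumptions, since the application leaves them open:

```python
import numpy as np

def reference_sample_indices(img):
    """img: HxWx3 uint8 text image sample. Illustrative index choices only."""
    # Per-channel color histograms (numeric / 1-D information),
    # spliced into the one-dimensional index.
    hists = [np.histogram(img[..., c], bins=16, range=(0, 255))[0]
             for c in range(3)]
    one_dim = np.concatenate([h / h.sum() for h in hists])

    # 2-D maps (here a single gray map) stacked into a 3-D tensor
    # as the two-dimensional index.
    gray = img.mean(axis=-1)
    two_dim = np.stack([gray / 255.0])

    # Per-channel color values combined into a 3-D tensor as the restoration index.
    restoration = img.astype(np.float32).transpose(2, 0, 1) / 255.0
    return one_dim, two_dim, restoration
```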
205. The server processes the attention characteristic information according to a processing mode corresponding to the index type of the reference sample index to obtain prediction index information, wherein the prediction index information comprises a one-dimensional prediction index, a two-dimensional prediction index and a prediction restoration index.
If the index type is the one-dimensional index or the two-dimensional index, dimension conversion processing is performed on the attention feature information through different network structures to obtain a tensor of the same size as the reference sample index, and the obtained tensor is used as the prediction sample index. For example, for the one-dimensional index, the attention feature is dimension-converted through two fully connected layers: Pred_1 = W_a1·(W_a0·C + b_a0) + b_a1, obtaining the prediction sample index Pred_1; for the two-dimensional index, the attention feature is dimension-converted through three fully connected layers: Pred_2 = W_b2·(W_b1·(W_b0·C + b_b0) + b_b1) + b_b2, where W_a0, W_a1, W_b0, W_b1, W_b2, b_a0, b_a1, b_b0, b_b1 and b_b2 are network parameters.
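A hedged PyTorch sketch of the two prediction heads; the input dimension and output sizes are placeholders, not values from this application, and, matching the formulas above, no nonlinearity is inserted between the fully connected layers:

```python
import torch.nn as nn

class OneDimHead(nn.Module):
    """Two fully connected layers: Pred_1 = W_a1·(W_a0·C + b_a0) + b_a1."""
    def __init__(self, dim=512, out_dim=48):       # out_dim: assumed index length
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.Linear(dim, out_dim))
    def forward(self, c):
        return self.fc(c)

class TwoDimHead(nn.Module):
    """Three fully connected layers: Pred_2 = W_b2·(W_b1·(W_b0·C + b_b0) + b_b1) + b_b2."""
    def __init__(self, dim=512, out_dim=1024):     # reshape to the 2-D size outside
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.Linear(dim, dim),
                                nn.Linear(dim, out_dim))
    def forward(self, c):
        return self.fc(c)
```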
If the index type is the restoration index, the corresponding processing mode is image restoration processing. A specific network structure for performing the image restoration processing on the attention feature information is shown in fig. 3: the attention feature information is processed alternately by transposed convolution layers and activation layers, the processed attention feature information is finally output, and the processed attention feature information is normalized to obtain the prediction sample index.
After the first transposed convolution layer, f_0 = ConvTranspose(C) may be obtained, where C is the attention feature information; the finally output processed attention feature information f* is then normalized: fake_img = (f* − min(f*)) / (max(f*) − min(f*)), where fake_img is the prediction sample index.
Alternatively, the activation function included in the last activation layer may be a Tanh function, and the activation function included in the BN & activation layer may be a ReLU function.
Optionally, the number of layers of the transposed convolution layer and the number of layers of the BN & active layer included in the network structure for image restoration processing may be flexibly set according to needs, which is not limited herein.
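A possible shape for the image restoration branch, as a PyTorch sketch; the number of transposed convolution layers and the channel widths are illustrative, per the note above that they may be set flexibly:

```python
import torch
import torch.nn as nn

class RestorationHead(nn.Module):
    """Alternating transposed convolution and BN&activation layers (fig. 3)."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(),        # BN & activation layer (ReLU)
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
            nn.Tanh(),                             # last activation layer (Tanh)
        )

    def forward(self, c):
        f = self.net(c)                            # f*: processed attention features
        # fake_img = (f* - min(f*)) / (max(f*) - min(f*))
        return (f - f.min()) / (f.max() - f.min())
```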
206. And the server trains the feature extraction model through the reference sample index and the prediction sample index to obtain a pre-training feature extraction model.
For example, the error between the predicted sample index and the corresponding reference sample index is calculated by using the MSE loss function to train the feature extraction model, and the network parameters of the feature extraction model are adjusted to obtain the pre-trained feature extraction model.
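One upstream training step might look like the following sketch, where model, heads and optimizer are assumed names for the feature extraction model, the index prediction branches and any standard optimizer:

```python
import torch.nn as nn

mse = nn.MSELoss()

def pretrain_step(model, heads, optimizer, images, ref_indices):
    """One upstream step: MSE between each prediction sample index
    and its reference sample index; no labels are required."""
    optimizer.zero_grad()
    feats = model(images)  # attention feature information
    loss = sum(mse(head(feats), ref) for head, ref in zip(heads, ref_indices))
    loss.backward()        # adjust the feature extraction model's parameters
    optimizer.step()
    return loss.item()
```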
3. A downstream task: the pre-training feature extraction model is loaded and connected to a text recognition model; the text recognition model predicts based on the attention feature information extracted from a target text image sample by the pre-training feature extraction model to obtain a prediction result; and the pre-training feature extraction model and the text recognition model are trained according to the prediction result and the sample label to obtain a trained feature extraction model.
207. And the server performs image feature extraction and attention feature extraction processing on the target text image sample through the pre-training feature extraction model to obtain attention feature information of the attention context information of the target text image sample.
For example, image feature extraction may be specifically performed on the target text image sample through the pre-training feature extraction model to obtain image feature information of the target text image sample, and then attention feature extraction is performed on the image feature information to obtain attention feature information of the attention context information of the target text image sample.
208. The server predicts based on the attention feature information of the target text image sample through the text recognition model to obtain a prediction result, and trains the pre-training feature extraction model and the text recognition model according to the prediction result and the sample label to obtain a trained feature extraction model and a trained text recognition model.
For example, the method specifically includes: predicting, through the text recognition model, based on the attention feature information of the target text image sample to obtain a prediction result; training the pre-training feature extraction model and the text recognition model according to the prediction result and the sample label to obtain a trained feature extraction model and a trained text recognition model; performing feature extraction on a text image to be recognized through the trained feature extraction model to obtain attention feature information; and obtaining a prediction result based on the attention feature information through the trained text recognition model.
For example, as shown in fig. 6, for a text recognition model based on the CRNN structure, the attention feature information is input into the fully connected classifier of the text recognition model, the probability that each image region in the target text image is each character in the dictionary is predicted to obtain a prediction result, the loss is calculated based on CTC Loss according to the prediction result and the sample label, and the pre-training feature extraction model and the text recognition model are trained.
Or, as shown in fig. 6, for a text recognition model based on the Attention mechanism, the attention feature information is input into an LSTM decoder including an attention mechanism, the probability that each image region in the target text image is each character in the dictionary is predicted to obtain a prediction result, the loss is calculated based on cross entropy (CE Loss) according to the prediction result and the sample label, and the pre-training feature extraction model and the text recognition model are trained.
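For the CRNN branch, the CTC loss computation can be sketched as follows; the sequence length, batch size and dictionary size are arbitrary assumptions:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
# (time steps = image regions, batch, dictionary size) log-probabilities
logits = torch.randn(40, 8, 1000, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, 1000, (8, 12))            # character labels per sample
input_lens = torch.full((8,), 40, dtype=torch.long)
target_lens = torch.full((8,), 12, dtype=torch.long)
loss = ctc(logits, targets, input_lens, target_lens)
loss.backward()   # gradients reach both the recognizer and the feature extractor
```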
As can be seen from the above, in the embodiment of the present application, the server obtains the text image sample, and performs image feature extraction processing on the text image sample through the feature extraction model to obtain the image feature information of the text image sample; performs attention feature extraction on the image feature information through the feature extraction model to obtain attention feature information of the attention context information of the text image sample; performs image index calculation according to the image attribute information of the text image sample to obtain at least one piece of image index information; merges the image index information according to the index value expression type of the image index to obtain a reference sample index of the text image sample, where the reference sample index includes a one-dimensional index, a two-dimensional index and a restoration index; processes the attention feature information according to the processing mode corresponding to the index type of the reference sample index to obtain prediction index information, where the prediction index information includes a one-dimensional prediction index, a two-dimensional prediction index and a prediction restoration index; trains the feature extraction model through the reference sample index and the prediction sample index to obtain a pre-training feature extraction model; performs image feature extraction and attention feature extraction processing on a target text image sample through the pre-training feature extraction model to obtain attention feature information of the attention context information of the target text image sample; predicts, through the text recognition model, based on the attention feature information of the target text image sample to obtain a prediction result; and trains the pre-training feature extraction model and the text recognition model according to the prediction result and the sample label to obtain a trained feature extraction model and a trained text recognition model.
In order to better implement the text recognition method provided by the embodiment of the application, a text recognition device is further provided in an embodiment. The meaning of the noun is the same as that in the text recognition method, and specific implementation details can refer to the description in the method embodiment.
The text recognition apparatus may be specifically integrated in a computer device, as shown in fig. 7, and the text recognition apparatus may include: the obtaining unit 301, the calculating unit 302, the first feature extracting unit 303, the second feature extracting unit 304, the predicting unit 305, and the training unit 306 are specifically as follows:
(1) Acquisition unit 301
The acquisition unit 301: for obtaining text image samples.
(2) Computing unit 302
The calculation unit 302: used for performing image index calculation according to the image attribute information of the text image sample and determining a reference sample index of the text image sample based on the calculation result.
Optionally, the calculating unit 302 may include a calculating subunit and a combining subunit, specifically:
A calculation subunit: used for performing image index calculation according to the image attribute information of the text image sample to obtain at least one piece of image index information;

A merging subunit: used for performing index merging processing on the at least one piece of image index information to obtain the reference sample index of the text image sample.
Optionally, the merging subunit may include an obtaining module and a merging module, specifically:
An acquisition module: used for obtaining the index value expression type of the at least one image index;

A merging module: used for performing index merging processing on the at least one piece of image index information according to the index value expression type to obtain the reference sample index of the text image sample.
(3) First feature extraction unit 303
The first feature extraction unit 303: used for performing image feature extraction processing on the text image sample through the feature extraction model to obtain image feature information of the text image sample.
(4) Second feature extraction unit 304
The second feature extraction unit 304: used for performing attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of the attention context information of the text image sample.
Optionally, the second feature extraction unit 304 may include a mapping subunit, a similarity calculation subunit, and a first fusion subunit, specifically:

A mapping subunit: used for performing attention space mapping processing on the image feature information to obtain a corresponding space vector of each image area in the text image sample in the attention space, where the space vector includes a query vector, a content vector and a key vector;

A similarity calculation subunit: used for calculating, for each image area, the similarity between the image area and the associated image area according to the distance between the query vector of the image area and the key vector of the associated image area;

A first fusion subunit: used for performing fusion processing on the content vectors of the image area and the associated image area according to the similarity between the image area and each associated image area to obtain the attention feature information of the attention context information.
Optionally, the second feature extraction unit 304 may include a mask subunit and an extraction subunit, specifically:
A mask subunit: used for performing mask processing on the image feature information through the feature extraction model to obtain masked image feature information of the text image sample;

An extraction subunit: used for performing attention feature extraction on the masked image feature information to obtain attention feature information of the attention context information of the text image sample.
Optionally, the second feature extraction unit 304 may include an input subunit and a first feature extraction subunit, specifically:

An input subunit: used for taking the image feature information as input feature information of the multi-layer feature extraction mechanism;

A first feature extraction subunit: used for sequentially performing attention feature extraction on the input feature information through the multi-layer feature extraction mechanism to obtain attention feature information of the attention context information of the text image sample.
Optionally, the second feature extraction unit 304 may include a second feature extraction subunit and a second fusion subunit, specifically:
A second feature extraction subunit: used for sequentially performing attention feature extraction on the input feature information based on a multi-head attention mechanism through the multi-layer feature extraction mechanism to obtain sub-attention feature information under each attention mechanism;

A second fusion subunit: used for performing fusion processing on the sub-attention feature information under each attention mechanism through the multi-layer feature extraction mechanism to obtain the attention feature information of the attention context information of the text image sample.
(5) The prediction unit 305:
The prediction unit 305: used for predicting a prediction sample index of the text image sample based on the attention feature information of the text image sample.
Optionally, the prediction unit 305 may include a determination subunit and a processing subunit, specifically:
A determining subunit: used for determining the feature processing mode corresponding to each index type;

A processing subunit: used for processing, for each index type, the attention feature information in the corresponding feature processing mode to obtain the prediction sample index corresponding to each index type.
Optionally, the processing subunit may include a transpose convolution module and a normalization module, specifically:
A transposed convolution module: used for performing transposed convolution processing on the attention feature information to obtain processed attention feature information;

A normalization module: used for performing normalization processing on the processed attention feature information to obtain the prediction sample index.
(6) The training unit 306:
The training unit 306: used for training the feature extraction model according to the prediction sample index and the corresponding reference sample index, so that attention feature information of a text image to be recognized is extracted through the trained feature extraction model for image text recognition.
Optionally, the text recognition apparatus may include a sample obtaining unit, a third feature extraction unit, a fourth feature extraction unit, a result prediction unit, and a supervised training unit, specifically:
A sample acquisition unit: used for obtaining a target text image sample, where the target text image sample carries a sample label;

A third feature extraction unit: used for performing image feature extraction processing on the target text image sample through the pre-training feature extraction model to obtain image feature information of the target text image sample;

A fourth feature extraction unit: used for performing attention feature extraction on the target text image sample based on the image feature information of the target text image sample through the pre-training feature extraction model to obtain attention feature information of the attention context information of the target text image sample;

A result prediction unit: used for predicting, through the text recognition model, based on the attention feature information of the target text image sample to obtain a prediction result of the target text image sample;

A supervised training unit: used for training the text recognition model and the pre-training feature extraction model based on the sample label and the prediction result to obtain a trained text recognition model and a trained feature extraction model, so that text content of a text image to be recognized is recognized through the text recognition model.
An embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server, as shown in fig. 8, which shows a schematic structural diagram of the computer device according to the embodiment of the present application, and specifically:
the computer device may include components such as a processor 1001 of one or more processing cores, memory 1002 of one or more computer-readable storage media, a power supply 1003, and an input unit 1004. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 8 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 1001 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 1002 and calling data stored in the memory 1002, thereby monitoring the computer device as a whole. Alternatively, processor 1001 may include one or more processing cores; preferably, the processor 1001 may integrate an application processor, which mainly handles operating systems, user interfaces, computer programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1001.
The memory 1002 may be used to store software programs and modules, and the processor 1001 executes various functional applications and data processing by operating the software programs and modules stored in the memory 1002. The memory 1002 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a computer program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 1002 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 1002 may also include a memory controller to provide the processor 1001 access to the memory 1002.
The computer device further includes a power source 1003 for supplying power to each component, and preferably, the power source 1003 may be logically connected to the processor 1001 through a power management system, so that functions of managing charging, discharging, power consumption, and the like are implemented through the power management system. The power source 1003 may also include any component including one or more of a dc or ac power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The computer device may also include an input unit 1004, and the input unit 1004 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 1001 in the computer device loads the executable file corresponding to the process of one or more computer programs into the memory 1002 according to the following instructions, and the processor 1001 runs the computer programs stored in the memory 1002, so as to implement various functions as follows:
acquiring a text image sample;
performing image index calculation according to the image attribute information of the text image sample, and determining a reference sample index of the text image sample based on the calculation result;
performing image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample;
performing attention feature extraction on the text image sample through a feature extraction model based on the image feature information to obtain attention feature information of attention context information of the text image sample;
predicting a prediction sample index of the text image sample based on the attention feature information of the text image sample;
and training the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes, and extracting attention feature information of the text image to be recognized through the trained feature extraction model to recognize the image text.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, the computer device according to the embodiment of the present application may obtain a text image sample; performing image index calculation according to the image attribute information of the text image sample, and determining a reference sample index of the text image sample based on the calculation result; performing image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample; performing attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of attention context information of the text image sample; predicting a prediction sample index of the text image sample based on the attention feature information of the text image sample; and training the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes, and extracting attention feature information of the text image to be recognized through the trained feature extraction model to recognize the image text. According to the scheme, the characteristic extraction model is trained through the reference sample indexes and the prediction sample indexes, a large number of unlabelled text image samples can be used for training the characteristic extraction model, and the training effect of the characteristic extraction model is enhanced.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the above embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program or by related hardware controlled by the computer program, and the computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute any one of the text recognition methods provided by the embodiments of the present application.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
As the computer program stored in the computer-readable storage medium can execute any text recognition method provided in the embodiments of the present application, beneficial effects that can be achieved by any text recognition method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The text recognition method, the text recognition device, the computer device, and the computer-readable storage medium provided in the embodiments of the present application are described in detail above, and specific examples are applied in the present application to explain the principles and embodiments of the present application, and the description of the above embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, the specific implementation manner and the application scope may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (13)

1. A text recognition method, comprising:
acquiring a text image sample;
performing image index calculation according to the image attribute information of the text image sample, and determining a reference sample index of the text image sample based on the calculation result;
performing image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample;
performing attention feature extraction on the text image sample based on the image feature information through the feature extraction model to obtain attention feature information of attention context information of the text image sample;
predicting a prediction sample index of the text image sample based on the attention feature information of the text image sample;
and training the feature extraction model according to the prediction sample indexes and the corresponding reference sample indexes, and extracting attention feature information of the text image to be recognized through the trained feature extraction model to recognize the image text.
2. The method according to claim 1, wherein the performing image index calculation according to the image attribute information of the text image sample, and determining a reference sample index of the text image sample based on the calculation result comprises:
performing image index calculation according to the image attribute information of the text image sample to obtain at least one piece of image index information;

and performing index merging processing on the at least one piece of image index information to obtain the reference sample index of the text image sample.
3. The method according to claim 2, wherein the performing index merging processing on the at least one piece of image index information to obtain the reference sample index of the text image sample comprises:

obtaining an index value expression type of the at least one image index;

and performing index merging processing on the at least one piece of image index information according to the index value expression type to obtain the reference sample index of the text image sample.
4. The method of claim 1, wherein the reference sample indicators comprise at least two types of reference sample indicators, and wherein predicting the prediction sample indicator for the text image sample based on the attention feature information of the text image sample comprises:
determining a characteristic processing mode corresponding to each index type;
and aiming at each index type, processing the attention characteristic information by adopting a corresponding characteristic processing mode to obtain a prediction sample index corresponding to each index type.
5. The method according to claim 4, wherein the feature processing means includes image restoration processing, and the processing the attention feature information by using the corresponding feature processing means to obtain the prediction sample index corresponding to each index type includes:
performing transposition convolution processing on the attention feature information to obtain processed attention feature information;
and carrying out normalization processing on the processed attention characteristic information to obtain the prediction sample index.
6. The method of claim 1, wherein the image feature information comprises an image feature vector, and the performing, by the feature extraction model, attention feature extraction on the text image sample based on the image feature information to obtain attention feature information of attention context information of the text image sample comprises:
performing attention space mapping processing on the image feature vectors to obtain a corresponding space vector of each image region in the text image sample in the attention space, wherein the space vector comprises a query vector, a content vector and a key vector;
for each image area, calculating the similarity between the image area and the associated image area according to the distance between the query vector of the image area and the key vector of the associated image area;
and for each image area, performing fusion processing on the content vectors of the image area and the associated image area according to the similarity between the image area and the associated image area to obtain the attention feature information of the attention context information.
7. The method of claim 1, wherein the feature extraction model comprises a multi-layer feature extraction mechanism, and wherein performing attention feature extraction on the text image sample based on the image feature information by the feature extraction model to obtain attention feature information of the attention context information of the text image sample comprises:
taking the image feature information as input feature information of the multi-layer feature extraction mechanism;
and sequentially carrying out attention feature extraction on the input feature information through the multilayer feature extraction mechanism to obtain attention feature information of the attention context information of the text image sample.
8. The method of claim 7, wherein each layer of feature extraction mechanism comprises a multi-head feature extraction mechanism, and the obtaining of the attention feature information of the attention context information of the text image sample by performing attention feature extraction on the input feature information sequentially through the multi-layer feature extraction mechanism comprises:
sequentially performing attention feature extraction on the input feature information through the multi-layer feature extraction mechanism based on a multi-head attention mechanism to obtain sub-attention feature information under each attention mechanism;
and performing fusion processing on the sub-attention feature information under each attention mechanism through the multi-layer feature extraction mechanism to obtain the attention feature information of the attention context information of the text image sample.
9. The method of claim 1, wherein the performing, by the feature extraction model, attention feature extraction on the text image sample based on the image feature information to obtain attention feature information of attention context information of the text image sample comprises:
performing mask processing on the image characteristic information through the characteristic extraction model to obtain masked image characteristic information of the text image sample;
and performing attention feature extraction on the masked image feature information to obtain attention feature information of the attention context information of the text image sample.
10. The method of claim 1, wherein after training the feature extraction model based on the predicted sample metrics and corresponding reference sample metrics, the method further comprises:
obtaining a target text image sample, wherein the target text image sample carries a sample label;
performing image feature extraction processing on the target text image sample through a pre-training feature extraction model to obtain image feature information of the target text image sample, wherein the pre-training feature extraction model is obtained through training of the prediction sample index and a corresponding reference sample index;
performing attention feature extraction on the target text image sample through the pre-training feature extraction model based on the image feature information of the target text image sample to obtain attention feature information of the attention context information of the target text image sample;
predicting based on attention feature information of the target text image sample through a text recognition model to obtain a prediction result of the target text image sample;
and training the text recognition model and the pre-training feature extraction model based on the sample label and the prediction result to obtain a trained text recognition model and a trained feature extraction model so as to perform image text recognition on a text image to be recognized.
11. A text recognition apparatus, comprising:
the acquisition unit is used for acquiring a text image sample;
the calculation unit is used for performing image index calculation according to the image attribute information of the text image sample and determining a reference sample index of the text image sample based on the calculation result;
the first feature extraction unit is used for performing image feature extraction processing on the text image sample through a feature extraction model to obtain image feature information of the text image sample;
a second feature extraction unit, configured to perform attention feature extraction on the text image sample based on the image feature information through the feature extraction model, so as to obtain attention feature information of attention context information of the text image sample;
a prediction unit, configured to predict a prediction sample index of the text image sample based on attention feature information of the text image sample;
and the training unit is used for training the feature extraction model according to the prediction sample index and the corresponding reference sample index so as to extract the attention feature information of the text image to be recognized through the trained feature extraction model for image text recognition.
12. A computer device comprising a memory and a processor; the memory stores a computer program, and the processor is configured to execute the computer program in the memory to perform the text recognition method according to any one of claims 1 to 10.
13. A computer-readable storage medium for storing a computer program which is loaded by a processor to perform the text recognition method of any one of claims 1 to 10.
CN202110942358.6A 2021-08-17 2021-08-17 Text recognition method and device, computer equipment and computer-readable storage medium Pending CN115909336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110942358.6A CN115909336A (en) 2021-08-17 2021-08-17 Text recognition method and device, computer equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110942358.6A CN115909336A (en) 2021-08-17 2021-08-17 Text recognition method and device, computer equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN115909336A true CN115909336A (en) 2023-04-04

Family

ID=86480130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110942358.6A Pending CN115909336A (en) 2021-08-17 2021-08-17 Text recognition method and device, computer equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN115909336A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630755A (en) * 2023-04-10 2023-08-22 雄安创新研究院 Method, system and storage medium for detecting text position in scene image
CN116630755B (en) * 2023-04-10 2024-04-02 雄安创新研究院 Method, system and storage medium for detecting text position in scene image
CN116645700A (en) * 2023-07-27 2023-08-25 腾讯科技(深圳)有限公司 Feature extraction model processing method and device and feature extraction method and device
CN116645700B (en) * 2023-07-27 2023-11-03 腾讯科技(深圳)有限公司 Feature extraction model processing method and device and feature extraction method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40085627

Country of ref document: HK