CN113822264A - Text recognition method and device, computer equipment and storage medium


Info

Publication number
CN113822264A
Authority
CN
China
Prior art keywords
image
image area
text
attention
content similarity
Prior art date
Legal status
Pending
Application number
CN202110712851.9A
Other languages
Chinese (zh)
Inventor
王斌
包志敏
曹浩宇
姜德强
薛莫白
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110712851.9A
Publication of CN113822264A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a text recognition method and apparatus, a computer device, and a storage medium, relating to the field of communications technology. A text image to be recognized is acquired, the text image comprising at least two image areas; feature extraction is performed on the text image to obtain the feature information of each image area; for each image area, the content similarity between the image area and the associated image area is calculated according to the feature information of the image area and the feature information of the associated image area; according to the content similarity, the feature information of the image area and the associated image area is fused to obtain attention feature information that attends to context information; and text content recognition is performed on the image to be recognized based on the attention feature information to obtain a recognition result. Because text recognition is performed according to attention feature information that attends to context information, the image areas of the image to be recognized can be recognized in parallel, improving the recognition speed of the image to be recognized.

Description

Text recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a method and an apparatus for text recognition, a computer device, and a storage medium.
Background
Optical Character Recognition (OCR) refers to the process in which a computer device detects the shapes of characters, such as characters printed on paper or contained in a picture, and then translates the detected shapes into computer text using a character recognition method. Text recognition may be performed by a convolutional recurrent neural network (CRNN) model based on the Connectionist Temporal Classification (CTC) temporal alignment algorithm.
A CTC-based CRNN model comprises a CNN network layer and an RNN network layer, where the RNN network layer adopts a Long Short-Term Memory (LSTM) network. The LSTM is a serial network structure: the prediction for the (T+1)-th image area is computed from the predictions of the previous T image areas, which makes obtaining the prediction result time-consuming.
Disclosure of Invention
The embodiments of the present application provide a text recognition method and apparatus, a computer device, and a storage medium, which can recognize the image areas of an image to be recognized in parallel, obtain the recognition result of the image, and improve the recognition speed of the image to be recognized.
The text recognition method provided by the embodiment of the application comprises the following steps:
acquiring a text image to be recognized, wherein the text image to be recognized comprises at least two image areas;
performing feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized;
for each image area, calculating the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area;
for each image area, according to the content similarity between the image area and the associated image area, performing fusion processing on the feature information of the image area and the associated image area to obtain attention feature information that attends to context information;
and performing text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result.
Correspondingly, an embodiment of the present application further provides a text recognition apparatus, including:
an acquisition unit, configured to acquire a text image to be recognized, the text image comprising at least two image areas;
an extraction unit, configured to perform feature extraction on the text image to be recognized to obtain the feature information of each image area in the text image;
a calculation unit, configured to, for each image area, calculate the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area;
a fusion unit, configured to, for each image area, perform fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area, to obtain attention feature information that attends to context information;
a recognition unit, configured to perform text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result.
Correspondingly, the embodiment of the application also provides computer equipment, which comprises a memory and a processor; the memory stores a computer program, and the processor is used for operating the computer program in the memory to execute any text recognition method provided by the embodiment of the application.
Correspondingly, the embodiment of the present application further provides a storage medium, where the storage medium is used to store a computer program, and the computer program is loaded by a processor to execute any one of the text recognition methods provided by the embodiment of the present application.
Through the embodiments of the present application, a text image to be recognized can be acquired, the text image comprising at least two image areas; feature extraction is performed on the text image to obtain the feature information of each image area; for each image area, the content similarity between the image area and the associated image area is calculated according to the feature information of the image area and the feature information of the associated image area; for each image area, the feature information of the image area and the associated image area is fused according to the content similarity between them to obtain attention feature information that attends to context information; and text content recognition is performed on the image to be recognized based on the attention feature information to obtain a recognition result. Because the attention feature information of each image area is obtained by fusing the feature information of the image area and its corresponding associated image areas, and text recognition is performed on each image area according to this attention feature information, the image areas of the image to be recognized can be recognized in parallel, which improves the recognition speed of the image to be recognized.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a scene diagram of a text recognition method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a text recognition method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a path plan provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of another path plan provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of another path plan provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of another path plan provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of another path plan provided in an embodiment of the present application;
FIG. 8 is another flowchart of a text recognition method provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a user interface of a text recognition method provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of feature vector fusion provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a text recognition apparatus provided in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a text recognition method and device, computer equipment and a storage medium. The text recognition device may be integrated in a computer device, and the computer device may be a server or a terminal.
The terminal may include a mobile phone, a wearable smart device, a tablet computer, a notebook computer, a personal computer (PC), a vehicle-mounted computer, and the like.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and artificial intelligence platform.
For example, as shown in fig. 1, the computer device may perform feature extraction on a text image to be recognized through a trained text content recognition model to obtain a feature map of the image; select features from the feature map according to the data required by the attention mechanism in the trained model; perform feature embedding on the selected features to obtain the feature information of each image region of the image to be recognized; fuse, through the attention mechanism of the trained model, the feature information of each image region and its associated image region according to the content similarity between them, to obtain attention feature information that attends to context information; perform text content recognition on the image to be recognized based on the attention feature information to obtain an initial recognition result; and map the initial recognition result based on a preset strategy (for example, deleting the space symbols in the initial recognition result) to obtain the recognition result. Because the attention feature information of each image area is obtained by fusing the feature information of the image area and its corresponding associated image areas, and text recognition is performed on each image area according to this attention feature information, the image areas of the image to be recognized can be recognized in parallel, which improves the recognition speed of the image to be recognized.
The trained text content recognition model may be a neural network model obtained through machine learning. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
In this embodiment, the text recognition method will be described from the perspective of the text recognition apparatus. The text recognition apparatus may be integrated in a computer device, and the computer device may be a server or a terminal. As shown in fig. 2, the flow of the text recognition method is as follows:
101. and acquiring a text image to be recognized, wherein the text image to be recognized comprises at least two image areas.
The text image to be recognized may be an image whose text content needs to be recognized; it may contain text content in different languages, for example Chinese characters, English, Arabic numerals, and Japanese.
The image area may be an area obtained by dividing the text image to be recognized, for example, the text image to be recognized is divided by using a preset unit as a step length, so that a plurality of image areas may be obtained.
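As a concrete illustration (not part of the patent text), the following minimal Python sketch divides a text-line image into image areas with a fixed step length; the (H, W, C) layout, image size, and step value are assumptions.

```python
# Minimal sketch: divide a text-line image into image areas of width `step`.
import numpy as np

def split_into_areas(image: np.ndarray, step: int) -> list:
    """Split an (H, W, C) text-line image into vertical strips of width `step`."""
    height, width, channels = image.shape
    return [image[:, x:x + step, :] for x in range(0, width, step)]

areas = split_into_areas(np.zeros((32, 128, 3), dtype=np.uint8), step=8)
print(len(areas))  # 16 image areas
```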
For example, the text image to be recognized sent by the terminal may be acquired, or the text image to be recognized may be acquired from a database, or the text image to be recognized stored in a blockchain may be acquired.
102. And performing feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized.
The feature extraction may be a process of performing image analysis and transformation on the text image to be recognized to extract characteristic information of the text image to be recognized, and the feature information may be information obtained by feature extraction.
For example, the text image to be recognized may be specifically analyzed, for example, color values of each pixel of the text image to be recognized in different color channels are obtained, and at least one color value matrix related to the text image to be recognized may be obtained, where the color value matrix may represent a color value of each pixel in the text image to be recognized; and converting the color value matrix of the text image to be recognized so as to perform feature extraction, and obtaining feature information of each image area in the text image to be recognized.
In an embodiment, the convolution processing may be performed on the text image to be recognized to obtain the feature information of each image region in the text image to be recognized, that is, "extracting features of the text image to be recognized to obtain the feature information of each image region in the text image to be recognized" specifically may include:
carrying out convolution processing on the image to be recognized to obtain a feature map of the image to be recognized;
and performing feature extraction on the feature map to obtain the feature information of each image area in the image to be recognized.
The feature map may be an output result of the convolution processing, and represents data of a certain feature distribution in the text image to be recognized.
For example, the feature map of the text image to be recognized may be obtained by performing convolution processing on the text image to be recognized through a convolution kernel (also referred to as a filter).
And selecting the characteristics of the characteristic graph to obtain the characteristic information of the text image to be recognized, wherein the characteristic information of the text image to be recognized is the set of the characteristic information of each image area, and therefore the characteristic information of each image area is obtained.
The above process of extracting features of the text image to be recognized to obtain the feature information of each image region in the text image to be recognized may also be implemented by training a text content recognition model, that is, in an embodiment, the step "extracting features of the text image to be recognized to obtain the feature information of each image region in the text image to be recognized" may specifically include:
and performing feature extraction on the text image to be recognized through the trained text content recognition model to obtain feature information of each image area of the text image to be recognized.
The trained text content recognition model is a network system formed by a large number of simple processing units (called neurons) that are widely connected to each other, and is used to recognize the text contained in an image.
For example, feature extraction may be performed by the Convolutional Neural Network (CNN) in the trained text recognition model. The CNN may include at least one convolutional layer, at least one sampling layer, and a classification (fully connected) layer: the convolutional layers extract features of the text image to be recognized, the sampling layers perform feature selection on the extracted features, and the fully connected layer classifies the features, thereby obtaining the feature information of each image area in the text image to be recognized.
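A minimal PyTorch sketch of such a backbone is given below. It is an illustrative assumption, not the patent's exact network: convolution and pooling collapse the image height so that each column of the final feature map serves as the feature vector of one image area. Layer sizes are made up.

```python
# Minimal sketch: a CRNN-style CNN backbone whose output columns are the
# per-image-area feature vectors. All layer sizes are illustrative.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # H/2, W/2
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # H/4, W/4
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, None)),  # collapse the height dimension to 1
)

x = torch.randn(1, 3, 32, 128)               # text image to be recognized
fmap = backbone(x)                           # (1, 256, 1, 32)
features = fmap.squeeze(2).permute(0, 2, 1)  # (1, 32, 256): 32 areas, 256-dim each
print(features.shape)
```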
The trained content recognition model may be obtained by training an initial text content recognition model. Training continuously adjusts the network parameters of the initial text content recognition model, such as the attention weight parameters and the convolution kernels, until a preset end-of-training condition is met, for example the accuracy of the prediction results output by the initial model exceeds a preset accuracy, or the number of training iterations exceeds a preset number, at which point the trained model is obtained. That is, in an embodiment, before the step "acquiring an image to be recognized", the method further includes:
acquiring a text image sample, wherein the text image sample comprises at least two image areas;
performing feature extraction on the image areas of the text image sample through an initial text content recognition model to obtain feature information of each image area of the text image sample;
for each image area, calculating the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area;
for each image area, performing fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area, to obtain attention feature information that attends to context information;
performing text content recognition on the image to be recognized based on the attention feature information to obtain a prediction result;
and training the initial text content recognition model based on the prediction result and the sample label of the text image sample to obtain the trained text content recognition model.
The text image sample may be an image input when training the initial text content recognition model.
For example, a text image sample may be obtained, and for each image area of the text image sample, the initial text content recognition model fuses the feature information of the image area and the associated image area according to the content similarity between them to obtain attention feature information that attends to context information; text content recognition is then performed based on the attention feature information to obtain a prediction result. For the specific process, reference may be made to the related description in the embodiments of the present application, which is not repeated here.
And calculating an error between the prediction result and a sample label of the text image sample, and adjusting network parameters in the initial text content recognition model according to the calculated error so as to train the initial text content recognition model until a preset training end condition is met, thereby obtaining the trained content recognition model.
Not every image area in a text image sample corresponds to a character of the sample label; for example, an image area may be a blank area containing no text content at all. If some image areas have no corresponding label, effective training cannot be performed directly. The loss function can therefore be calculated with the Connectionist Temporal Classification (CTC) temporal alignment algorithm, which adds a blank symbol on top of the sample label, so that the image areas do not need to be aligned with the characters in the sample label and labeled one by one: a prediction is correct as long as it yields the sample label through the mapping. This solves the problems that the lengths of the text image sample (its image areas) and the sample label (the text result) are inconsistent, and that the positions of the sample label relative to the image areas are uncertain (i.e., the alignment between the sample label and the image areas is unknown). That is, in an embodiment, the step "training an initial text content recognition model based on the prediction result and the sample label of the text image sample to obtain the trained text content recognition model" may specifically be:
determining a corresponding character path set according to characters in the sample label, wherein the character path set comprises at least one character path, and the character path is mapped through a preset mapping strategy to obtain the sample label;
calculating the path probability of each character path in the character path set based on the prediction probability of each character of the prediction result;
calculating an error value between the prediction result and the sample label according to the path probabilities;
and training the initial text content recognition model based on the error value to obtain a trained text content recognition model.
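As a sketch of this training step, the standard CTC loss implementation in PyTorch (torch.nn.CTCLoss) can stand in for the path-probability error value described above; the sizes, dictionary indices, and label below are illustrative assumptions.

```python
# Minimal sketch: CTC loss over per-image-area predictions. Sizes and the
# target indices (an assumed dictionary mapping, e.g. "cat") are illustrative.
import torch
import torch.nn as nn

T, N, C = 16, 1, 28                          # image areas, batch, dictionary size (incl. blank)
logits = torch.randn(T, N, C, requires_grad=True)  # prediction for each image area
log_probs = logits.log_softmax(2)

targets = torch.tensor([[3, 1, 20]])         # sample label as dictionary indices
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([3])

ctc = nn.CTCLoss(blank=0)                    # blank symbol added on top of the sample label
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # back-propagate to adjust network parameters
```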
The sample label may be text content contained in the text image sample, for example, the text content contained in the text image sample is "who my is", and then the sample label corresponding to the text image sample is "who my is".
The character path set may be a set including all character paths that can be mapped by a preset mapping policy to obtain a sample label.
The preset mapping strategy is to map repeated characters into one character, that is, to merge repeated characters into one character, and to map blank characters to empty, that is, to delete the blank characters.
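A minimal sketch of this preset mapping strategy follows (an illustration, with "-" assumed as the blank symbol):

```python
# Minimal sketch of the preset mapping strategy: merge repeated characters,
# then delete the blank symbol (written here as "-").
import itertools

def collapse(path: str, blank: str = "-") -> str:
    merged = [ch for ch, _ in itertools.groupby(path)]   # merge repeated characters
    return "".join(ch for ch in merged if ch != blank)   # delete blank characters

print(collapse("-ab"), collapse("aab"), collapse("a-b"))  # ab ab ab
print(collapse("C-CAT"))  # CCAT: the blank keeps the two C's separate
```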
A character path may represent a correct prediction result over the image areas of the text image sample: if the character predicted for each image area of the text image sample corresponds one by one to a character path, the sample label can be obtained through the mapping, and the initial text content recognition model has recognized the text image sample correctly.
For example, suppose the sample label is "C (for distinction, referred to below as the first character C) C (referred to below as the second character C) AT", that is, "CCAT". With the blank symbol written as "-", the corresponding label sequence is {-, C, -, C, -, A, -, T, -}. Suppose the text image sample has 5 image areas. The corresponding path planning diagram (character paths not shown) may be as shown in fig. 3: each character of the label sequence corresponds in order to a different row, and each image area of the text image sample corresponds to a different column.
Since the text content of the text image sample has a definite order, for example from left to right, a prediction that does not follow this order would be misaligned. Correspondingly, in the path planning diagram a character path can only move to the right (e.g., path a in fig. 4, from the first character C in image area 1 to the first character C in image area 2) or down and to the right (e.g., path b in fig. 4, from the first character C in image area 1 to the blank character in image area 2).
Because the preset mapping strategy merges repeated characters, a character can stay in place across columns of the path plan, that is, jump from the current area to the adjacent image area in the same row, as in path a in fig. 4. Identical characters cannot jump to each other directly (e.g., path c in fig. 4), because two consecutive identical characters obtained by a direct jump would be merged under the preset mapping strategy; the character path must therefore pass through a blank character after the first character C in order to reach the second character C, as in paths b and d in fig. 4. Different characters can jump to each other directly (e.g., path e in fig. 4).
In summary, the path set corresponding to the sample label "CCAT" may include all the character paths shown in fig. 5, each of which yields the sample label through the preset mapping strategy.
For another example, let the sample label be "ab". To calculate the loss function based on the CTC algorithm, a blank character is added around the characters a and b of the sample label; writing the blank as "-" gives the label sequence {-, a, -, b, -}. Assuming the text image sample comprises 3 image areas, and the preset mapping strategy maps repeated characters into one character (i.e., merges them) and maps blank characters to empty (i.e., deletes them), the character path set may include the character paths "-ab", "ab-", "a-b", "aab", and "abb"; through the preset mapping strategy, each of these paths yields the sample label "ab" after deleting its blank characters and merging its identical adjacent characters.
The prediction result includes, for each image area of the text image sample, a prediction probability for every character in a preset dictionary (e.g., a dictionary of 1k characters). The path probability of each character path in the character path set is calculated from the prediction probabilities of the characters for the corresponding image areas. The sum of the path probabilities of all character paths in the set gives the probability that the initial text content recognition model predicts the sample label, from which the error value between the prediction result and the sample label is obtained. Back propagation is then performed according to this error value to adjust the network parameters of the initial text content recognition model, training it until the trained text content recognition model is obtained.
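For tiny cases, the path probabilities and the label probability can be computed by brute force exactly as described; the 3-character dictionary and the probabilities below are made-up assumptions, and real systems replace the enumeration with dynamic programming.

```python
# Brute-force sketch: sum the probabilities of all character paths that map
# to the sample label. Dictionary and probabilities are illustrative only.
import itertools

def collapse(path, blank="-"):
    merged = [c for c, _ in itertools.groupby(path)]   # merge repeats
    return "".join(c for c in merged if c != blank)    # delete blanks

chars = ["-", "a", "b"]
# prediction probability of each character for each of 3 image areas
probs = [{"-": 0.6, "a": 0.3, "b": 0.1},
         {"-": 0.2, "a": 0.3, "b": 0.5},
         {"-": 0.1, "a": 0.1, "b": 0.8}]

label = "ab"
p_label = 0.0
for path in itertools.product(chars, repeat=3):
    if collapse("".join(path)) == label:
        p_path = 1.0
        for t, ch in enumerate(path):
            p_path *= probs[t][ch]    # path probability: product over image areas
        p_label += p_path             # label probability: sum over valid paths
print(p_label)                        # CTC loss would be -log(p_label)
```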
In an embodiment, by restricting the jumps from the current image area to the adjacent image area so that a character cannot jump to itself (e.g., path a in fig. 6), while allowing identical characters to jump to each other directly (e.g., path c in fig. 6), multiple consecutive identical characters in the text image sample can be predicted accurately, avoiding repeated characters being merged away by the prediction. That is, in an embodiment, the step "determining the corresponding character path set according to the characters in the sample label" may specifically include:
determining a label sequence according to characters contained in the sample label, wherein the label sequence comprises a first character, a second character and a space character;
determining a corresponding character path set based on the label sequence, the character path set including paths formed by jumping from the first character to the second character and by jumping from the first character to a space character.
The first character and the second character may be different characters in the label sequence. For the sample "C (referred to below as the first character C) C (referred to below as the second character C) AT", the first character may be the first character C and the second character the second character C; alternatively, the first character may be the second character C and the second character A. The first character and the second character correspond to different rows in the path planning diagram.
The space character may be a character that separates characters in the sample label, such as a blank character, or other characters.
Specifically, since a character cannot jump to itself (as in path a in fig. 6) and identical characters are allowed to jump to each other directly (as in path c in fig. 6), in one embodiment the path set corresponding to the sample label "CCAT" may include all the character paths shown in fig. 7. Because self-jumps of characters are restricted, the corresponding preset mapping strategy only maps the space characters to empty, i.e., deletes the space characters; repeated characters no longer need to be mapped into one character, i.e., repeated characters need not be merged. The character paths shown in fig. 7 yield the sample label "CCAT" through this preset mapping strategy.
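A minimal sketch of this variant mapping strategy, which only deletes space characters (assumed written as "-") and never merges repeated characters:

```python
# Minimal sketch of the variant mapping strategy of this embodiment:
# space characters are deleted but repeated characters are NOT merged.
def collapse_no_merge(path: str, space: str = "-") -> str:
    return "".join(ch for ch in path if ch != space)

print(collapse_no_merge("CC-A-T"))  # CCAT: the two C's survive without merging
```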
103. And calculating the content similarity between the image area and the related image area according to the characteristic information of the image area and the characteristic information of the related image area for each image area.
The associated image area may be an image area associated with the image area, for example, an adjacent image area of the image area, or all image areas in the text image to be recognized, or other image areas in the text image to be recognized.
The content similarity may be a similarity between the content included in the image area and the content included in the associated image area.
For example, the feature information of the image region and the feature information of the associated image region may be mapped into a target feature space, the distance between the feature information of the image region and the feature information of the associated image region in the target feature space is calculated, and the content similarity between the image region and the associated image region may be determined according to the calculated distance.
The feature information may include a feature vector, and the content similarity between the image region and the associated image region may be calculated according to a distance between the feature vector of the image region and the feature vector of the associated image region, that is, in an embodiment, the step "for each image region, calculating the content similarity between the image region and the associated image region according to the feature information of the image region and the feature information of the associated image region matched with the image region" specifically may include:
determining a target image area to be processed currently and a related image area of the target image area;
performing space mapping processing on the target characteristic vector of the target image area based on the attention weight information to obtain a first mapping vector corresponding to the target characteristic vector of the target image area in a characteristic space;
performing space mapping processing on the associated feature vector of the associated image area based on the attention weight information to obtain a corresponding second mapping vector of the associated feature vector of the associated image area in a feature space;
and calculating the content similarity between the target image area and the associated image area according to the distance between the first mapping vector and the second mapping vector to obtain the content similarity of each image area.
The target image area may be an image area to be processed currently, and the feature vector corresponding to the target image area is a target feature vector.
The attention weight information may be the information used to spatially transform the feature vectors; for example, it may take the form of attention weight matrices.
The associated feature vector may be a feature vector of an associated image region corresponding to the target image region.
The first mapping vector may be a vector obtained by mapping a target feature vector corresponding to the target image region to a feature space; the second mapping vector may be a vector obtained by mapping the associated feature vector corresponding to the associated image region to the feature space.
The feature space may be a space different from that of the target feature vector, or the same space.
For example, the method may specifically include determining a current image area to be processed, that is, a target image area, and determining a related image area corresponding to the target image area, performing linear transformation on a target feature vector based on the attention weight information, and mapping the target feature vector to a feature space to obtain a first mapping vector corresponding to the target feature vector in the feature space.
And performing linear transformation on the related feature vectors of the related image regions based on the attention weight information, and mapping the related feature vectors into a feature space to obtain corresponding second mapping vectors of the related feature vectors in the feature space.
The distance between the first mapping vector and the second mapping vector is calculated; the distance may be, for example, the Manhattan distance, the Euclidean distance, the Chebyshev distance, the cosine similarity, or the Hamming distance. The content similarity between the target image area and the associated image area is then calculated from this distance, for example by normalizing the distance to a value between 0 and 1, which gives the content similarity between the target image area and the associated image area.
And executing the operation aiming at each image area in the text image to be recognized to obtain the content similarity between each image area and the associated area.
The feature vector corresponding to an image region may also be mapped into multiple vectors, each reflecting different feature information of the feature vector: the first mapping vector includes a first query vector, a first key vector, and a first content vector, and the second mapping vector includes a second query vector, a second key vector, and a second content vector, each playing its own role, which improves the similarity calculation between each image region and the associated image region. That is, in an embodiment, the step "calculating the content similarity between the target image region and the associated image region according to the distance between the first mapping vector and the second mapping vector to obtain the content similarity of each image region" may specifically include:
and calculating the content similarity between the target image area and the associated image area according to the distance between the first query vector corresponding to the target image area and the second key vector corresponding to the associated image area to obtain the content similarity of each image area.
The first query vector, the first key vector, and the first content vector may be vectors obtained by performing linear transformation on the target feature vector of the target image region according to different attention weight matrices in the attention weight information.
The second query vector, the second key vector, and the second content vector may be vectors obtained by performing linear transformation on the relevant feature vectors of the relevant image regions according to different attention weight matrices in the attention weight information.
For example, the attention weight information may specifically include a first weight matrix, a second weight matrix, and a third weight matrix. The target feature vector is mapped based on the first weight matrix to obtain a first query vector, denoted Query1 (Q1 for short); mapped based on the second weight matrix to obtain a first key vector, denoted Key1 (K1 for short); and mapped based on the third weight matrix to obtain a first content vector, denoted Value1 (V1 for short).
The same mapping process is performed on the associated image regions, yielding the Q2, K2, and V2 corresponding to each associated image region.
The distance between the first query vector corresponding to the target image region and the second key vector corresponding to the associated image region is calculated; for example, the first query vector and the second key vector may be point-multiplied (Q·K) and the result normalized to obtain the content similarity between the target image region and the associated image region, thereby obtaining the content similarity of each image region.
And carrying out the same processing on each image area in the text image to be recognized to obtain the content similarity between each image area and the corresponding associated area.
The second content vectors of the associated image regions are weighted according to the obtained content similarities and fused with the first content vector, so that the feature information of the associated image regions is fused with the feature information of the target image region to obtain attention feature information that attends to context information. That is, in an embodiment, the step "for each image region, the feature information of the image region and the associated image region is fused according to the content similarity between the image region and the associated image region to obtain attention feature information that attends to context information" may specifically be:
for each image area, according to the content similarity between the image area and the associated image area, carrying out weighting processing on the content feature vector corresponding to the associated image area to obtain a weighted content feature vector corresponding to the associated image area;
and for each image area, performing fusion processing according to the feature vector of the image area and the weighted content feature vectors corresponding to the associated image areas, to obtain the attention feature information with context information corresponding to the image to be recognized.
For example, specifically, for each image region, the content similarity between the feature vector of the image region and the feature vector of the associated image region is used as the weight corresponding to the associated image region, and the content similarity and the content feature vector of the associated image region are subjected to point multiplication to obtain the weighted content feature vector of each associated image region.
The feature vector of the image area and the weighted content feature vectors of the associated image areas are then fused, for example by adding them together, to obtain the attention feature information that attends to context information for each image area.
The same processing is performed for each image area, giving the attention feature information that attends to context information for the text image to be recognized.
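Steps 103 and 104 can be sketched together as single-head scaled dot-product attention with a residual addition. The NumPy sketch below is an illustration under assumed shapes, not the patent's exact formulation; scaling by sqrt(D) is one common normalization choice.

```python
# Minimal sketch of steps 103-104: query/key similarity, softmax-normalized
# weights, weighted content vectors, and fusion with the region's own features.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, D = 8, 16                       # image areas, feature dimension (assumed)
X = rng.normal(size=(T, D))        # feature information of each image area
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv            # query, key, and content vectors
scores = Q @ K.T / np.sqrt(D)               # point multiplication of query and key
weights = softmax(scores, axis=-1)          # normalized content similarity
attended = weights @ V                      # weighted content feature vectors
attention_features = X + attended           # fusion with the region's own features
print(attention_features.shape)             # (8, 16): one vector per image area
```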
The content similarity of image area i to its associated image area j is denoted score_ij, with score_ij = Q_i · K_j; the content similarity of image area j to image area i (i.e., to its associated image area) is denoted score_ji, with score_ji = Q_j · K_i. Ideally, when score_ij is large, score_ji should also be large, and the content similarity can be adjusted to make score_ij and score_ji similar. That is, in an embodiment, the step "calculating the content similarity between the target image area and the associated image area according to the distance between the first query vector corresponding to the target image area and the second key vector corresponding to the associated image area, to obtain the content similarity of each image area" may specifically include:
calculating the initial content similarity between the target image area and the associated image area according to the distance between the first query vector corresponding to the target image area and the second key vector corresponding to the associated image area to obtain the initial content similarity of each image area;
obtaining an initial content similarity matrix of the image to be recognized according to the initial content similarity of each image area;
and adjusting the initial content similarity of each image area by using the initial content similarity matrix and the transpose matrix of the initial content similarity matrix to obtain the content similarity between the target image area and the associated image area.
Wherein the initial content similarity may be a value representing a degree of similarity between the image region and the associated image region, obtained from a distance between the first query vector and the second key vector.
The content similarity may be a similarity obtained by adjusting the initial content similarity.
The initial content similarity matrix may be obtained according to the initial content similarity between each image area and the related image area, and the transposed matrix may be obtained by interchanging rows and columns of the initial content similarity matrix.
For example, the first query vector of the target image area and the second key vector of the associated image area may be point-multiplied to obtain the initial content similarity between the image area and the associated image area. The initial content similarity matrix of the text image to be recognized, denoted SCORE_0, is obtained from the initial content similarities between each image area and its associated image areas: the element score_ij in row i, column j of the initial content similarity matrix represents the initial content similarity of image area i to image area j, and the element score_ji in row j, column i represents the initial content similarity of image area j to image area i.
Interchanging the rows and columns of the initial content similarity matrix gives the transposed matrix SCORE_0^T. Adding the transposed matrix to the initial content similarity matrix gives the content similarity matrix SCORE = SCORE_0 + SCORE_0^T. The content similarity matrix is symmetric, that is, its elements satisfy score_ij = score_ji.
The content similarity between the feature vector of each image region and the feature vectors of its associated regions can then be determined from the content similarity matrix; for example, the content similarity of image area i to image area j is the element score_ij of the content similarity matrix SCORE.
In the training process of the initial text content recognition model, adjusting the initial content similarity with the transposed matrix makes the resulting content similarity matrix symmetric, which can accelerate the convergence of the initial text content recognition model and thus speed up training.
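A one-line sketch of the symmetrization described above (NumPy, illustrative sizes):

```python
# Minimal sketch: add the initial content similarity matrix to its transpose.
import numpy as np

rng = np.random.default_rng(0)
SCORE0 = rng.normal(size=(8, 8))   # initial content similarities, score_ij
SCORE = SCORE0 + SCORE0.T          # symmetric: SCORE[i, j] == SCORE[j, i]
assert np.allclose(SCORE, SCORE.T)
```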
In an embodiment, a window matrix may be set on the content similarity matrix; the window matrix can shield the image regions unrelated to a given image region. That is, the step "adjust the initial content similarity by using the initial content similarity matrix and the transposed matrix of the initial content similarity matrix to obtain the content similarity between the image regions and the associated image regions" may specifically include:
adding the initial content similarity matrix and the transposed matrix of the initial content similarity matrix to obtain a first content similarity matrix;
setting, for each image area, a corresponding window matrix for the first content similarity matrix to obtain a second content similarity matrix;
and determining the content similarity between the image area and the associated image area according to the second content similarity matrix.
The window matrix may be a matrix of the same shape as the content similarity matrix. It can be used to retain the content similarities within the area range indicated by the window matrix (the window position) and to mask the content similarities at the other positions, for example by setting them to -∞. The window position may be set according to each image area; for example, for image area i, the window position may cover score_i,i-1, score_i,i, and score_i,i+1, that is, the content similarities with the 3 adjacent image areas are retained and the content similarities with the other image areas are masked.
For example, the rows and columns of the initial content similarity matrix may be interchanged to obtain the transposed matrix, and the transposed matrix added to the initial content similarity matrix to obtain the first content similarity matrix. For each image area, a corresponding window matrix is then set for the first content similarity matrix, so as to retain the content similarities of the image area within the area range indicated by the window matrix and to mask the content similarities at the other positions.
The window matrix may be set to a matrix whose value is 0 at the window positions and -∞, or another very large negative number (e.g., on the order of -10^16), at the other positions. The content similarity matrix is added to the window matrix, so that the content similarities at the non-window positions become very large negative numbers; normalization then maps the content similarities at the non-window positions to 0, giving the second content similarity matrix, and the content similarity between each image area and its associated image areas can be determined from the second content similarity matrix.
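A minimal sketch of the window matrix, assuming a window of the 3 adjacent image areas and -1e16 as the large negative number:

```python
# Minimal sketch: window matrix with 0 at window positions and a large
# negative number elsewhere, so softmax maps non-window similarities to ~0.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 8
window = np.full((T, T), -1e16)          # large negative at non-window positions
for i in range(T):
    lo, hi = max(0, i - 1), min(T, i + 2)
    window[i, lo:hi] = 0.0               # keep the 3 adjacent image areas

scores = np.random.default_rng(0).normal(size=(T, T))
masked = scores + window                 # add window matrix to similarity matrix
print(np.round(softmax(masked)[0], 3))   # similarities outside the window are 0
```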
Step 103 may be processed by a multi-head attention mechanism, which comprises multiple attention mechanisms. An attention mechanism is a special structure embedded in a machine learning model that is used to automatically learn and calculate the contribution of the input data to the output data. Each attention mechanism performs the same processing on each image area, and each attention mechanism obtains the content similarity between each image area and its corresponding associated image area. That is, in an embodiment, the step "calculating, for each image region, a content similarity between the image region and the associated image region according to the feature information of the image region and the feature information of the associated image region" may specifically include:
for each image region, performing calculation processing according to the feature information of the image region and the feature information of the associated image region based on a multi-head attention mechanism to obtain the content similarity between the image region and the associated image region under each attention mechanism;
the operation executed by each attention mechanism may specifically refer to the description of the corresponding location in the embodiment, which is not described herein again.
Each attention mechanism may output, for each image region, a content similarity for that image region to the associated image region.
Using a multi-head attention mechanism yields multiple groups of content similarities between each region and its associated image regions; accordingly, step 104 needs to process these multiple groups of content similarities. Step 104 may specifically include:
for each image region, performing fusion processing on the feature information of the image region and the associated image region based on the multi-head attention mechanism according to the content similarity between the image region and the associated image region, to obtain the attention feature information that attends to context information under each attention mechanism;
and fusing the attention feature information under each attention mechanism to obtain the attention feature information that attends to context information.
For example, each attention mechanism may fuse the feature information of the image region and the associated image region according to the content similarities it outputs (for the specific implementation, refer to the description at the corresponding position in the embodiments, which is not repeated here), so as to obtain the attention feature information under that attention mechanism.
The obtained plurality of attention feature information is spliced to obtain spliced feature information, and the spliced feature information is processed to obtain feature information with the same dimension as the attention feature information under each single attention mechanism, thereby obtaining the attention feature information that attends to context information.
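A minimal NumPy sketch of the multi-head variant: each head computes its own content similarities and fused features, the heads are spliced (concatenated), and a projection restores the original dimension. The head count and sizes are assumptions.

```python
# Minimal sketch: multi-head attention with splicing and output projection.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, D, H = 8, 16, 4                 # image areas, feature dim, attention heads
Dh = D // H
X = rng.normal(size=(T, D))

heads = []
for _ in range(H):
    Wq, Wk, Wv = (rng.normal(size=(D, Dh)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads.append(softmax(Q @ K.T / np.sqrt(Dh)) @ V)   # per-head fusion

Wo = rng.normal(size=(D, D))
concat = np.concatenate(heads, axis=-1)   # splice the per-head attention features
out = concat @ Wo                         # project back to the original dimension D
print(out.shape)                          # (8, 16)
```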
Step 103 may also perform iterative processing on the feature information through a multi-layer attention mechanism. Each layer of attention mechanism may include at least one attention mechanism, each layer of attention mechanism performs the same processing on each image region, and each layer of attention mechanism may obtain, for each image region, a content similarity of the image region to the associated image region. That is, in an embodiment, the step "calculating, for each image region, a content similarity between the image region and the associated image region according to the feature information of the image region and the feature information of the associated image region" may specifically include:
determining the currently used target layer attention mechanism from the multi-layer attention mechanism, and determining the feature information of each image area in the text image to be recognized as the target input feature information of the target layer attention mechanism;
calculating content similarity between the image area and the related image area according to the feature information of the image area and the feature information of the related image area through a target layer attention mechanism aiming at each image area;
The target layer attention mechanism may be the attention mechanism layer currently used for text recognition. The target input feature information may be the feature information input into the target layer attention mechanism for processing.
For example, the target layer attention mechanism may specifically include at least one attention mechanism, each attention mechanism may calculate, for each image region, content similarity between the image region and the associated image region according to the feature information of the image region and the feature information of the associated image region, and a specific implementation process may refer to description of a corresponding position in the embodiment and is not described herein again.
Step 103 is processed by a multi-layer attention mechanism, and the corresponding step 104 may specifically include:
performing fusion processing on the feature information of the image area and the associated image area through a target layer attention mechanism according to the content similarity between the image area and the associated image area to obtain processed attention feature information of the attention context information;
when the target layer attention mechanism is not the preset layer attention mechanism, updating the target layer attention mechanism to the associated layer attention mechanism of the target layer attention mechanism in the multi-layer attention mechanism, updating the target input feature information to the processed attention feature information, and returning to the step of calculating, for each image area, the content similarity between the image area and the associated image area through the target layer attention mechanism according to the feature information of the image area and the feature information of the associated image area;
and when the target layer attention mechanism is the preset layer attention mechanism, outputting the processed attention characteristic information to obtain the attention characteristic information of the text image to be recognized.
Wherein the associated layer attention mechanism may be the attention mechanism of the layer next to the target layer attention mechanism. The preset layer attention mechanism may be the last layer in the multi-layer attention mechanism.
For example, specifically, the feature information of the image region and the associated image region may be fused according to the content similarity between the image region and the associated image region by using a target layer attention mechanism, so as to obtain the processed attention feature information of the attention context information, and a specific implementation process may refer to the description of a corresponding position in the embodiment and is not described herein again.
When the target layer attention mechanism is not the preset layer attention mechanism, the processed attention feature information output by the target layer attention mechanism is used as the target input feature information of the next layer attention mechanism, and the next layer attention mechanism is taken as the new target layer attention mechanism to perform the same operation. That is, the target input feature information is processed in turn by each layer of the multi-layer attention mechanism, with the processed attention feature information output by one layer serving as the target input feature information of the next layer, until the target layer attention mechanism is the preset layer attention mechanism, at which point the processed attention feature information is output to obtain the attention feature information.
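By way of illustration only, the layer-by-layer iteration described above may be sketched in Python as follows; representing each attention layer as a callable, and the function name itself, are assumptions made here for clarity and are not part of the embodiment:

```python
import numpy as np

def multi_layer_attention(feature_info: np.ndarray, layers: list) -> np.ndarray:
    """Iterate a stack of attention layers: the processed attention feature
    information output by one layer becomes the target input feature
    information of the next layer. `layers` holds callables that each map
    an (n_regions, dim) array to an array of the same shape."""
    target_input = feature_info        # feature info of each image region
    for target_layer in layers:        # the currently used target layer
        target_input = target_layer(target_input)
    # the last (preset) layer's output is the attention feature information
    return target_input
```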
104. And for each image area, performing fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area to obtain the attention feature information of the attention context information.
The attention feature information may be information including feature information of each image region and the corresponding associated image region.
For example, for each image region, the feature information of the image region and the associated image regions may be weighted and summed, using the content similarity between the image region and each associated image region as the weight corresponding to that associated image region, so as to obtain the attention feature information of the attention context information corresponding to each image region.
And obtaining the attention feature information of the attention context information of the text image to be recognized according to the attention feature information of the attention context information corresponding to each image area.
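A minimal Python sketch of this weighted-sum fusion is given below; it assumes the content similarities have already been computed and are non-negative, which is not mandated by the embodiment:

```python
import numpy as np

def fuse_with_context(features: np.ndarray, similarity: np.ndarray) -> np.ndarray:
    """For each image region, weight the feature vectors of its associated
    image regions by the normalized content similarities and sum them,
    yielding attention feature information that attends to context.

    features:   (n_regions, dim) feature vector per image region
    similarity: (n_regions, n_regions); row i holds the similarities
                between region i and each associated region
    """
    # normalize each row so the weights of the associated regions sum to 1
    weights = similarity / similarity.sum(axis=1, keepdims=True)
    return weights @ features  # weighted sum per image region
```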
105. And performing text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result.
For example, text content recognition may be performed on the image to be recognized according to the attention feature information, that is, for each image region in the text image to be recognized, the prediction probability of each character in a preset dictionary is predicted. For example, suppose the preset dictionary includes the four characters A, B, C and D and the text image to be recognized includes 5 image regions; after text content recognition is performed on the text image to be recognized, the probability that image region 1 is character A may be 0.8, the probability that image region 1 is character B may be 0.05, the probability that image region 1 is character C may be 0.05, and the probability that image region 1 is character D may be 0.1, and similarly for the other image regions of the text image to be recognized.
The recognition result may be obtained according to the probability of each character for each image region; for example, the character with the highest probability is determined as the recognition result of the image region, so the recognition result of image region 1 in the above example is A.
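For illustration, assuming per-region probabilities like those above (only the numbers for image region 1 come from the example; the remaining rows are invented), the per-region result can be read off with an argmax:

```python
import numpy as np

# Toy values: 5 image regions, a preset dictionary of four characters.
dictionary = ["A", "B", "C", "D"]
probs = np.array([
    [0.80, 0.05, 0.05, 0.10],   # image region 1: most likely "A"
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.10, 0.10, 0.70],
])
# the character with the highest probability is each region's recognition result
result = "".join(dictionary[i] for i in probs.argmax(axis=1))
print(result)  # -> "ABACD" (region 3 is a tie, resolved by argmax order)
```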
As can be seen from the above, the text image to be recognized can be obtained in the embodiment of the present application, and the text image to be recognized includes at least two image areas; performing feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized; for each image area, calculating the content similarity between the image area and the associated image area according to the characteristic information of the image area and the characteristic information of the associated image area; for each image area, performing fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area to obtain attention feature information of the attention context information; and performing text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result. According to the scheme, the attention feature information of the attention context information of each image area is obtained by fusing the feature information of each image area and the corresponding associated image area, and text recognition is performed on each image area according to the attention feature information of the attention context information, so that parallel recognition of the image areas in the image to be recognized can be realized, and the recognition speed of the image to be recognized is improved.
On the basis of the above-described embodiments, further details will be given below by way of example.
The present embodiment will be described from the perspective of a text recognition apparatus, which may be specifically integrated in a computer device, which may be a server.
As shown in fig. 8, a specific process of the text recognition method provided in the embodiment of the present application may be as follows:
201. the server acquires a text image to be identified sent by the terminal.
For example, the terminal may provide a user interface corresponding to text image recognition, such as the user interface shown in fig. 9, through which the user can upload the text image to be recognized (i.e., the uploaded picture in fig. 9) and instruct the server to acquire it and perform text content recognition. The text recognition method can support recognition in multiple scenarios, such as general character recognition, card character recognition, bill and document recognition, automobile-related recognition, industry document recognition, intelligent code scanning and the like, and can also support multiple languages, such as English, Korean, Japanese, Spanish and the like.
202. The server performs feature extraction on the text image to be recognized through the trained text recognition model to obtain a feature vector sequence of the text image to be recognized, wherein the feature vector sequence comprises a feature vector of each image area in the text image to be recognized.
For example, feature extraction may be performed through a convolutional neural network (CNN) in the trained text recognition model. The CNN may include at least one convolutional layer, at least one sampling layer and a classification layer; the features of the text image to be recognized may be extracted through the convolution kernels (also referred to as filters) of the convolutional layers to obtain a feature map of the text image to be recognized.
Feature selection can then be carried out on the feature map through the sampling layer, and the features are classified through the fully connected layer, so as to obtain a feature vector sequence Γ of the text image to be recognized, where each feature vector in the feature vector sequence Γ corresponds to one image area in the text image to be recognized.
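A minimal PyTorch sketch of such a backbone follows; the layer sizes, pooling choices, and input shape are assumptions for illustration and are not the network of the embodiment:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of a CNN that turns a text-line image into a feature vector
    sequence Γ, one feature vector per horizontal image region."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # sampling layer
            nn.Conv2d(64, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),       # collapse the height axis
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 1, height, width) grayscale text image
        fmap = self.conv(image)                    # (batch, dim, 1, w')
        return fmap.squeeze(2).transpose(1, 2)     # (batch, w', dim): Γ

seq = FeatureExtractor()(torch.randn(1, 1, 32, 128))
print(seq.shape)  # each of the w' columns is one image region's feature vector
```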
203. The server performs spatial mapping processing on the feature vector sequence of the text image to be recognized through the trained text recognition model based on the attention mechanism to obtain a query vector sequence, a key vector sequence and a content vector sequence corresponding to the feature vector sequence.
For example, the attention mechanism may be a self-attention mechanism in the trained text recognition model. The attention mechanism includes attention weight information, and the attention weight information may include a first weight matrix, a second weight matrix and a third weight matrix. Through the trained text recognition model, the server maps the feature vector sequence Γ based on the first weight matrix to obtain a query vector sequence, denoted Query0 (Q0 for short); maps the feature vector sequence Γ based on the second weight matrix to obtain a key vector sequence, denoted Key0 (K0 for short); and maps the feature vector sequence Γ based on the third weight matrix to obtain a content vector sequence, denoted Value0 (V0 for short).
The query vector sequence comprises a query vector of each image region in the text image to be recognized, the key vector sequence comprises a key vector of each image region in the text image to be recognized, and the content vector sequence comprises a content vector of each image region in the text image to be recognized.
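In a toy Python sketch (random matrices stand in for the trained weight matrices, and the sizes are arbitrary), the three mappings look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, dim = 5, 8                   # toy sizes; Γ has one row per image region
gamma = rng.normal(size=(n_regions, dim))   # feature vector sequence Γ
W_q = rng.normal(size=(dim, dim))       # first weight matrix
W_k = rng.normal(size=(dim, dim))       # second weight matrix
W_v = rng.normal(size=(dim, dim))       # third weight matrix

Q0 = gamma @ W_q   # query vector sequence: one query vector per image region
K0 = gamma @ W_k   # key vector sequence
V0 = gamma @ W_v   # content (value) vector sequence
```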
204. For each image area, the server calculates the content similarity between the image area and the associated image area according to the query vector of the image area and the key vector of the associated image area to obtain a first content similarity matrix of the text image to be recognized.
For example, the following may specifically be done: the associated image areas may be all image areas of the text image to be recognized; for each image area, the dot product of the query vector corresponding to the image area and the key vector corresponding to the associated image area, for example Q · K, is computed, and the result is normalized to obtain the initial content similarity between each image area and the associated image area.
Based on the initial content similarities between each image area and the associated image areas, an initial content similarity matrix SCORE_0 of the text image to be recognized can be obtained.
The element score_ij in the i-th row and j-th column of the initial content similarity matrix may represent the initial content similarity of image area i to image area j, and the element score_ji in the j-th row and i-th column may represent the initial content similarity of image area j to image area i.
The rows and columns of the initial content similarity matrix are exchanged to obtain the transposed matrix SCORE_0^T, and the transposed matrix is added to the initial content similarity matrix to obtain the first content similarity matrix SCORE = SCORE_0 + SCORE_0^T. The first content similarity matrix is a symmetric matrix, and its elements satisfy score_ij = score_ji.
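A compact Python sketch of this step is shown below; the scaling by the square root of the vector dimension is an assumption standing in for the unspecified normalization processing:

```python
import numpy as np

def first_similarity_matrix(Q0: np.ndarray, K0: np.ndarray) -> np.ndarray:
    """Dot-product similarities, then symmetrization by adding the
    transpose: SCORE = SCORE_0 + SCORE_0^T, so score_ij == score_ji."""
    score0 = Q0 @ K0.T                  # initial content similarity matrix
    score0 /= np.sqrt(Q0.shape[1])      # scaling (an assumption here)
    return score0 + score0.T            # first content similarity matrix
```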
205. And aiming at each image area, the server sets a window matrix for the first content similarity matrix to obtain a second content similarity matrix corresponding to each image area.
For example, the server may set the window matrix to a matrix whose values are 0 at the window positions and -∞ (or another large negative number, such as -10^16) at the other positions. The window matrix is added to the first content similarity matrix, so that the content similarities at non-window positions in the first content similarity matrix become large negative numbers; through normalization processing, the content similarities at the non-window positions are then mapped to 0, yielding the second content similarity matrix, and the content similarity between the image area and the associated image area can be determined according to the second content similarity matrix.
The window matrix may be used to select a target related image area from the related image areas, where the content similarity of the window position is the content similarity between each image area and the target related image area corresponding to each image area, and the masked content similarity is the content similarity between each image area and the non-target related image area.
The target associated image areas may be a number of image areas neighboring each image area; for example, the target associated image areas of image area 5 may be image area 3, image area 4, image area 5, image area 6 and image area 7.
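The following Python sketch builds such a window matrix and applies it, with softmax playing the role of the normalization processing; the half-width of 2 matches the image-area-5 example above and is otherwise arbitrary:

```python
import numpy as np

def window_matrix(n_regions: int, half_width: int = 2) -> np.ndarray:
    """0 at window positions (each region and its neighbors), a large
    negative number elsewhere, so softmax maps masked entries to 0."""
    idx = np.arange(n_regions)
    inside = np.abs(idx[:, None] - idx[None, :]) <= half_width
    return np.where(inside, 0.0, -1e16)

def apply_window(score: np.ndarray) -> np.ndarray:
    masked = score + window_matrix(score.shape[0])
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # second content similarity matrix
```

Note that exp(-1e16) underflows to exactly 0, so the non-window similarities contribute nothing to the subsequent weighted sum.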
206. And for each image area, the server performs fusion processing on the feature vectors of the image area and the associated image area based on the second content similarity matrix to obtain an attention feature vector of the attention context information of each image area.
For example, the attention feature vectors C of the attention context information of the text image to be recognized may be calculated based on the content vector sequence of the text image to be recognized and the second content similarity matrix, for example C = SCORE′ · V0, where SCORE′ denotes the second content similarity matrix. In the second content similarity matrix, owing to the window matrix, the content similarity between each image area and its non-target associated image areas is 0; therefore, during calculation, multiplying the content vector of a non-target associated image area by the corresponding content similarity yields a zero vector, and adding this zero vector to the feature vector of the image area leaves it unchanged. In other words, the feature vector of the image area is not fused with the feature vectors of the non-target associated image areas.
Fig. 10 is a schematic diagram of determining the image area currently to be processed as the target image area and, for that target image area, fusing the feature vectors of the target image area and the corresponding associated image areas. The image areas can be processed through the self-attention mechanism, and the recognition result can be predicted from the attention feature vector corresponding to each image area, so the current image area does not need to be predicted based on the recognition result of the previous image area. This realizes parallel prediction of the text image to be recognized, improves the prediction speed, makes it unnecessary to deploy multiple servers, and greatly reduces server deployment costs.
By contrast, common prediction based on the LSTM mechanism needs to predict the current image area from the output of the previous image area, which makes the prediction process for a text image time-consuming; when a large number of text images need to be predicted, multiple servers must be deployed to recognize them, resulting in high service deployment costs.
The feature vector sequence of the text image to be recognized is processed through a multi-head, multi-layer self-attention mechanism, so that attention feature vectors that better reflect the features of the text image to be recognized can be obtained.
For example, each attention mechanism among the n attention heads may execute steps 203 to 206 above. Each attention mechanism may correspond to different attention weight information, and based on this different attention weight information an attention feature vector Ci is obtained under each attention mechanism, where i denotes the attention feature vector obtained by the i-th attention mechanism. The n attention feature vectors Ci obtained by the n-head attention mechanism are concatenated to obtain C0 = concat{C1, C2, C3, ..., Cn-1, Cn}, and C0 is subjected to dimensionality reduction through the fully connected layer of the trained text recognition model so that the result has the same dimension as each Ci, for example C = W · C0, where W is a linear transformation that maps C0 to C.
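Sketched in Python (each head is represented as a callable, and the shape of the matrix W is an assumption), the concatenation and dimensionality reduction look like:

```python
import numpy as np

def multi_head(gamma: np.ndarray, heads: list, W: np.ndarray) -> np.ndarray:
    """n attention heads each produce C_i from Γ with their own attention
    weight information; the C_i are concatenated into C0 and mapped back
    to the per-head dimension by the linear transformation W.

    gamma: (n_regions, dim);  heads: callables (n_regions, dim) -> (n_regions, dim)
    W:     (n_heads * dim, dim)
    """
    c0 = np.concatenate([head(gamma) for head in heads], axis=-1)  # C0
    return c0 @ W   # same dimension as each C_i again
```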
A single-layer attention mechanism obtains the attention feature information by performing steps 203 to 206 above once; a multi-layer attention mechanism performs steps 203 to 206 repeatedly, with the output Ci of the i-th layer attention mechanism used as the input Γi+1 of the (i+1)-th layer.
Namely: Γi+1 = Ci, with the output of the last layer taken as the attention feature information of the text image to be recognized.
207. and the server identifies the text content of the image to be identified based on the attention feature information to obtain an identification result of the text image to be identified, and sends the identification result to the terminal.
For example, text content recognition may be performed on the image to be recognized according to the attention feature information, that is, for each image region in the text image to be recognized, the prediction probability of each character in a preset dictionary is predicted. For example, suppose the preset dictionary includes the four characters A, B, C and D and the text image to be recognized includes 5 image regions; after text content recognition is performed on the text image to be recognized, the probability that image region 1 is character A may be 0.8, the probability that image region 1 is character B may be 0.05, the probability that image region 1 is character C may be 0.05, and the probability that image region 1 is character D may be 0.1, and similarly for the other image regions of the text image to be recognized.
The recognition result may be obtained according to the probability of each character for each image region; for example, the character with the highest probability is determined as the recognition result of the image region, so the recognition result of image region 1 in the above example is A.
An initial recognition result of the text image to be recognized is obtained from the recognition results of the image areas; the blank (spacer) characters in the initial recognition result are then deleted to obtain the recognition result of the text image to be recognized, and the recognition result is returned to the terminal.
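For illustration, assuming the CTC-style mapping strategy described in the training discussion below (merge consecutive repeated characters, then delete the blank characters), the final result can be obtained as:

```python
from itertools import groupby

def decode(region_chars: list[str], blank: str = "-") -> str:
    """Map per-region results to the final recognition result: merge
    consecutive repeated characters, then delete the blank characters."""
    merged = [ch for ch, _ in groupby(region_chars)]      # collapse repeats
    return "".join(ch for ch in merged if ch != blank)    # drop blanks

print(decode(["C", "-", "C", "A", "T"]))  # -> "CCAT"
```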
As can be seen from the above, the server in the embodiment of the present application may obtain the text image to be recognized sent by the terminal; perform feature extraction on the text image to be recognized through the trained text recognition model to obtain a feature vector sequence of the text image to be recognized, the feature vector sequence comprising a feature vector of each image area in the text image to be recognized; perform spatial mapping processing on the feature vector sequence of the text image to be recognized through the trained text recognition model based on the attention mechanism to obtain a query vector sequence, a key vector sequence and a content vector sequence corresponding to the feature vector sequence; for each image area, calculate the content similarity between the image area and the associated image area according to the query vector of the image area and the key vector of the associated image area to obtain a first content similarity matrix of the text image to be recognized; for each image area, set a window matrix for the first content similarity matrix to obtain a second content similarity matrix corresponding to each image area; for each image area, perform fusion processing on the feature vectors of the image area and the associated image area based on the second content similarity matrix to obtain the attention feature information of the attention context information of each image area; and perform text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result of the text image to be recognized and send the recognition result to the terminal. According to the scheme, the attention feature information of the attention context information of each image area is obtained by fusing the feature information of each image area and the corresponding associated image areas, and text recognition is performed on each image area according to the attention feature information of the attention context information, so that parallel recognition of the image areas in the image to be recognized can be realized and the recognition speed of the image to be recognized is improved.
In order to better implement the text recognition method provided by the embodiment of the application, a text recognition device is further provided in an embodiment. The meanings of the nouns are the same as those in the text recognition method, and specific implementation details can refer to the description in the method embodiment.
The text recognition apparatus may be specifically integrated in a computer device, as shown in fig. 11, and the text recognition apparatus may include: the acquiring unit 301, the extracting unit 302, the calculating unit 303, the fusing unit 304 and the identifying unit 305 are specifically as follows:
The acquisition unit 301: configured to acquire a text image to be recognized, the text image to be recognized comprising at least two image areas.
For example, the text image to be recognized sent by the terminal may be obtained, or the text image to be recognized in a database may be obtained, or the text image to be recognized stored in a blockchain may be obtained.
The extraction unit 302: configured to perform feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized.
For example, the text image to be recognized may be specifically analyzed, for example, color values of each pixel of the text image to be recognized in different color channels are obtained, and at least one color value matrix related to the text image to be recognized may be obtained, where the color value matrix may represent a color value of each pixel in the text image to be recognized; and converting the color value matrix of the text image to be recognized so as to perform feature extraction, and obtaining feature information of each image area in the text image to be recognized.
In an embodiment, the extraction unit 302 may include a convolution subunit and an extraction subunit, specifically:
A convolution subunit: configured to perform convolution processing on the image to be recognized to obtain a feature map of the image to be recognized;
An extraction subunit: configured to perform feature extraction on the feature map to obtain feature information of each image area in the image to be recognized.
For example, the feature map of the text image to be recognized may be obtained by performing convolution processing on the text image to be recognized through a convolution kernel (also referred to as a filter).
And selecting the characteristics of the characteristic graph to obtain the characteristic information of the text image to be recognized, wherein the characteristic information of the text image to be recognized is the set of the characteristic information of each image area, and therefore the characteristic information of each image area is obtained.
The above process of performing feature extraction on the text image to be recognized to obtain the feature information of each image region may also be implemented through a trained text content recognition model. That is, in an embodiment, the extraction unit 302 may include a model extraction subunit, specifically:
A model extraction subunit: configured to perform feature extraction on the image regions through the trained text content recognition model to obtain the feature information of each image region.
For example, feature extraction may be performed through a convolutional neural network (CNN) in the trained text recognition model. The CNN may include at least one convolutional layer, at least one sampling layer and a classification layer; the convolutional layers may extract features of the text image to be recognized, the sampling layer may perform feature selection on the extracted features, and the fully connected layer classifies the features, so as to obtain the feature information of each image region in the text image to be recognized.
The trained text recognition model may be obtained by training an initial text content recognition model: the network parameters in the initial text content recognition model, such as attention weight parameters and convolution kernels, are continuously adjusted until a preset training end condition is met, for example the accuracy of the prediction result output by the initial text content recognition model is greater than a preset accuracy, or the number of training iterations is greater than a preset number, thereby obtaining the trained model. That is, in an embodiment, the text recognition apparatus further includes a sample acquisition unit, a sample feature extraction unit, a similarity calculation unit, a feature fusion unit, a sample recognition unit and a training unit:
A sample acquisition unit: configured to acquire a text image sample, the text image sample comprising at least two image areas;
A sample feature extraction unit: configured to perform feature extraction on the image areas of the text image sample through the initial text content recognition model to obtain feature information of each image area of the text image sample;
A similarity calculation unit: configured to calculate, for each image area, the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area;
A feature fusion unit: configured to perform, for each image area, fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area to obtain the attention feature information of the attention context information;
A sample recognition unit: configured to perform text content recognition on the image to be recognized based on the attention feature information to obtain a prediction result;
A training unit: configured to train the initial text content recognition model based on the prediction result and the sample label of the text image sample to obtain the trained text content recognition model.
For example, a text image sample may be obtained, and for each image area of the text image sample, the feature information of the image area and the associated image area is fused by the initial text content recognition model according to the content similarity between the image area and the associated image area to obtain the attention feature information of the attention context information; text content recognition is then performed on the image to be recognized based on the attention feature information to obtain a prediction result. For the specific process, refer to the related description of the embodiments of the present application, which is not repeated here.
An error between the prediction result and the sample label of the text image sample is calculated, and the network parameters in the initial text content recognition model are adjusted according to the calculated error, so as to train the initial text content recognition model until the preset training end condition is met, thereby obtaining the trained text content recognition model.
In an embodiment, the training unit may comprise a path determination subunit, a probability calculation subunit, an error calculation subunit, and a training subunit, in particular:
A path determination subunit: configured to determine, according to a preset mapping strategy, the character path set corresponding to the sample label;
A probability calculation subunit: configured to calculate the path probability of each character path in the character path set based on the prediction probability of each character in the prediction result;
An error calculation subunit: configured to calculate the error value between the prediction result and the sample label according to the character path probabilities;
A training subunit: configured to train the initial text content recognition model based on the error value to obtain the trained text content recognition model.
The sample label may be the text content contained in the text image sample; for example, if the text content contained in the text image sample is "who am I", then the sample label corresponding to the text image sample is "who am I".
For example, the sample label is "C (for distinction, it may be referred to as a first character C hereinafter) C (for distinction, it may be referred to as a second character C hereinafter) AT", the corresponding sample sequence is { -C, -, C-, a, -, }, the text image sample has 5 image regions, the corresponding path plan (not including the character path) diagram may be as shown in fig. 3, each character in the sample sequence corresponds to a different row in sequence, and each image region of the text image sample corresponds to a different column.
Since the text content contained in the text image sample has a certain order, for example from left to right, a prediction that does not follow this order would misalign the result. Correspondingly, in the path planning diagram a character path can only move to the right (e.g., path a in fig. 4, from the first character C in image area 1 to the first character C in image area 2) or downward to the right (e.g., path b in fig. 4, from the first character C in image area 1 to the blank character in image area 2).
Because the preset mapping strategy merges repeated characters, a character in the path plan can jump within the same row, that is, from the current image area to the adjacent image area in the same row, as in path a in fig. 4. Identical characters, however, cannot jump directly to each other (as in path c in fig. 4: the two consecutive identical characters obtained by a direct jump would be merged according to the preset mapping strategy), so after the first character C the character path must pass through a blank character to reach the second character C, as in paths b and d in fig. 4, while different characters can jump directly to each other (as in path e in fig. 4).
In summary, the path set corresponding to the sample label "CCAT" may include all the character paths shown in fig. 5, and each character path shown in fig. 5 can be mapped to the sample label through the preset mapping strategy.
For example, the sample label may be "ab", and a loss function is calculated based on the CTC algorithm, and a blank character blank may be added on the basis of the character a and the character b included in the sample label, and the blank character blank is recorded as "to obtain the label sequence { -, a, -, b, - }. Assuming that the text image sample includes 3 image regions, the preset mapping policy is to map repeated characters into one character, that is, merge the repeated characters into one character, and map blank characters into blank characters, that is, delete blank characters, the character path set may include character paths "-ab", "ab-", "a-b", "aab", and "abb", and the paths may obtain a sample label "ab" by deleting the blank characters in the character paths and merging the same characters in the paths through the preset mapping policy.
The prediction result includes, for each image area in the text image sample, a prediction probability for each character in the preset dictionary (for example, 1k characters in the preset dictionary). The path probability of each character path in the character path set is calculated from these per-area prediction probabilities. The sum of the path probabilities of all character paths in the path set gives the probability that the initial text content recognition model predicts the sample label, from which the error value between the prediction result of the initial text content recognition model and the sample label is obtained. Backward propagation is then performed according to the error value, and the network parameters of the initial text content recognition model are adjusted, so as to train the initial text content recognition model and obtain the trained text content recognition model.
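As a toy illustration of this computation (the probability numbers are invented, and real CTC implementations use dynamic programming rather than this brute-force enumeration), consider the "ab" example above:

```python
import itertools
import numpy as np

# Per-region prediction probabilities over {blank '-', 'a', 'b'} for a
# toy text image sample with 3 image regions (illustrative numbers only).
chars = ["-", "a", "b"]
probs = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8]])

def mapped(path: tuple) -> str:
    """Apply the preset mapping strategy: merge repeats, delete blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != "-")

# Sum the path probabilities of every character path that maps to "ab":
# {"-ab", "a-b", "ab-", "aab", "abb"}, as in the example above.
label_prob = sum(
    np.prod([probs[t, chars.index(ch)] for t, ch in enumerate(path)])
    for path in itertools.product(chars, repeat=3)
    if mapped(path) == "ab"
)
loss = -np.log(label_prob)   # the error value to backpropagate
print(label_prob, loss)
```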
In an embodiment, the path determination subunit may comprise a sequence determination module and a path set determination module, in particular:
A sequence determination module: configured to determine a label sequence according to the characters contained in the sample label, the label sequence including a first character, a second character and blank characters;
A path set determination module: configured to determine the corresponding character path set based on the label sequence, the character path set including paths formed by jumping from the first character to the second character and by jumping from the first character to a blank character.
Specifically, since a character cannot jump to itself (as with path a in fig. 6), identical characters are allowed to jump directly to each other (as with path c in fig. 6).
In one embodiment, the sample label may be "CCAT", and the corresponding path set may include all the character paths shown in fig. 7. Since characters are restricted from jumping to themselves, the corresponding preset mapping strategy is to map the blank characters to nothing, i.e., delete the blank characters; repeated characters do not need to be mapped into one character, i.e., repeated characters do not need to be merged. Each character path shown in fig. 7 yields the sample label "CCAT" through this preset mapping strategy.
The calculation unit 303: configured to calculate, for each image area, the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area.
For example, the feature information of the image region and the feature information of the associated image region may be mapped into a target feature space, the distance between the feature information of the image region and the feature information of the associated image region in the target feature space is calculated, and the content similarity between the image region and the associated image region may be determined according to the calculated distance.
In an embodiment, the calculating unit 303 may include a region determining subunit, a first mapping subunit, a second mapping subunit, and a calculating subunit, specifically:
A region determination subunit: configured to determine the target image area currently to be processed and the associated image areas of the target image area;
A first mapping subunit: configured to perform spatial mapping processing on the target feature vector of the target image area based on the attention weight information to obtain a first mapping vector corresponding to the target feature vector of the target image area in the feature space;
A second mapping subunit: configured to perform spatial mapping processing on the associated feature vector of the associated image area based on the attention weight information to obtain a second mapping vector corresponding to the associated feature vector of the associated image area in the feature space;
A calculation subunit: configured to calculate the content similarity between the target image area and the associated image area according to the distance between the first mapping vector and the second mapping vector to obtain the content similarity of each image area.
For example, the image area currently to be processed, i.e., the target image area, may be determined together with its associated image areas; the target feature vector is then linearly transformed based on the attention weight information and mapped into the feature space to obtain the first mapping vector corresponding to the target feature vector in the feature space.
And performing linear transformation on the related feature vectors of the related image regions based on the attention weight information, and mapping the related feature vectors into a feature space to obtain corresponding second mapping vectors of the related feature vectors in the feature space.
Calculating a Distance between the first mapping vector and the second mapping vector, where the Distance between the first mapping vector and the second mapping vector may be a Manhattan Distance (Manhattan Distance), an Euclidean Distance (Euclidean Distance), a Chebyshev Distance (Chebyshev Distance), a Cosine Similarity (Cosine Similarity), or a Hamming Distance (Hamming Distance), and calculating a content Similarity between the target image region and the associated image region according to the Distance between the first mapping vector and the second mapping vector, for example, normalizing the Distance to convert the Distance into data between 0 and 1, so as to obtain the content Similarity between the target image region and the associated image region.
This operation is executed for each image area in the text image to be recognized, obtaining the content similarity between each image area and its associated areas.
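As a small illustrative sketch, one way of converting a distance into data between 0 and 1 is a reciprocal mapping; the embodiment does not fix a particular formula:

```python
import numpy as np

def content_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Map the distance between two mapping vectors to a content
    similarity in (0, 1]: closer vectors give higher similarity."""
    distance = np.linalg.norm(v1 - v2)   # Euclidean distance
    return 1.0 / (1.0 + distance)        # one plausible normalization

print(content_similarity(np.array([1.0, 0.0]), np.array([0.9, 0.1])))
```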
In an embodiment, the calculation subunit may be specifically configured to:
and calculating the content similarity between the target image area and the associated image area according to the distance between the first query vector corresponding to the target image area and the second key vector corresponding to the associated image area to obtain the content similarity of each image area.
The fusion unit 304 may comprise a weighting subunit and a fusion subunit, in particular:
A weighting subunit: configured to weight, for each image area, the content feature vector corresponding to the associated image area according to the content similarity between the image area and the associated image area to obtain a weighted content feature vector corresponding to the associated image area;
A fusion subunit: configured to perform, for each image area, fusion processing according to the feature vector of the image area and the weighted content feature vectors corresponding to the associated image areas to obtain the attention feature information with context information corresponding to the image to be recognized.
For example, the attention weight information may include a first weight matrix, a second weight matrix and a third weight matrix. The target feature vector is mapped based on the first weight matrix to obtain a first query vector, denoted Query1 (Q1 for short); mapped based on the second weight matrix to obtain a first key vector, denoted Key1 (K1 for short); and mapped based on the third weight matrix to obtain a first content vector, denoted Value1 (V1 for short).
The same mapping processing is performed on the associated image areas, giving the Q2, K2 and V2 corresponding to each associated image area.
The distance between the first query vector corresponding to the target image region and the second key vector corresponding to the associated image region is calculated, for example, the first query vector and the second key vector may be subjected to point multiplication, for example, Q · K, and the result of the point multiplication is subjected to normalization processing to obtain the content similarity between the target image region and the associated image region, so as to obtain the content similarity of each image region.
And carrying out the same processing on each image area in the text image to be recognized to obtain the content similarity between each image area and the corresponding associated area.
And regarding each image area, taking the content similarity between the feature vector of the image area and the feature vector of the associated image area as the corresponding weight of the associated image area, and performing point multiplication on the content similarity and the content feature vector of the associated image area to obtain the weighted content feature vector of each associated image area.
The feature vector of the image area and the weighted content feature vector of the related image area are subjected to fusion processing, for example, the feature vector of the image area and the weighted content feature vector of the related image area are added to obtain the attention feature information of the attention context information corresponding to each image area.
The same processing is carried out for each image area, and the attention feature information of the attention context information corresponding to the text image to be recognized can be obtained.
In an embodiment, the calculating subunit may include an initial content similarity calculating module, an obtaining module, and an adjusting module, specifically:
An initial content similarity calculation module: configured to calculate the initial content similarity between the target image area and the associated image area according to the distance between the first query vector corresponding to the target image area and the second key vector corresponding to the associated image area to obtain the initial content similarity of each image area;
An obtaining module: configured to obtain the initial content similarity matrix of the image to be recognized according to the initial content similarity of each image area;
An adjusting module: configured to adjust the initial content similarity of each image area according to the initial content similarity matrix and the transposed matrix of the initial content similarity matrix to obtain the content similarity between the target image area and the associated image area.
For example, the dot product of the first query vector of the target image area and the second key vector of the associated image area may be computed to obtain the initial content similarity between the image area and the associated image area. According to the initial content similarities between each image area and the associated image areas, the initial content similarity matrix of the text image to be recognized is obtained and denoted SCORE_0. The element score_ij in the i-th row and j-th column of the initial content similarity matrix may represent the initial content similarity of image area i to image area j, and the element score_ji in the j-th row and i-th column may represent the initial content similarity of image area j to image area i.
The rows and columns of the initial content similarity matrix are exchanged to obtain the transposed matrix SCORE_0^T, and the transposed matrix is added to the initial content similarity matrix to obtain the content similarity matrix SCORE = SCORE_0 + SCORE_0^T. The content similarity matrix is a symmetric matrix, and its elements satisfy score_ij = score_ji.
The content similarity between the feature vector of each image area and the feature vector of the associated area can be determined from the content similarity matrix; for example, the content similarity of image area i to image area j is the element score_ij of the content similarity matrix SCORE.
In the training process of the initial text content recognition model, the initial content similarity is adjusted by adopting the transposed matrix, and the obtained content similarity matrix is a symmetric matrix, so that the convergence of the initial text content recognition model can be accelerated, and the training process is accelerated.
In an embodiment, a window matrix may be set on the content similarity matrix, and the window matrix can mask out other image areas unrelated to the image area. That is, the adjusting module may specifically include a first sub-module, a second sub-module and a determining sub-module:
A first sub-module: configured to exchange the rows and columns of the initial content similarity matrix to obtain a transposed matrix, and add the transposed matrix and the initial content similarity matrix to obtain a first content similarity matrix;
A second sub-module: configured to set the corresponding window matrix for the first content similarity matrix for each image area to obtain a second content similarity matrix;
A determining sub-module: configured to determine the content similarity between the image area and the associated image area according to the second content similarity matrix.
For example, the method may specifically include interchanging rows and columns of the initial content similarity matrix to obtain a transposed matrix, and adding the transposed matrix and the initial content similarity matrix to obtain the first content similarity matrix. And setting a corresponding window matrix for the obtained content similarity matrix aiming at each image area so as to reserve the content similarity of the image area in the area range indicated by the window matrix and shield the content similarity of other positions.
The window matrix may be set to a matrix whose values are 0 at the window positions and -∞ (or another large negative number, such as -10^16) at the other positions. The window matrix is added to the content similarity matrix, so that the content similarities at non-window positions in the content similarity matrix become large negative numbers; through normalization processing, the content similarities at the non-window positions are mapped to 0 to obtain the second content similarity matrix, and the content similarity between the image area and the associated image area can be determined according to the second content similarity matrix.
In an embodiment, the calculating unit 303 may specifically be configured to:
and for each image region, performing calculation processing according to the feature information of the image region and the feature information of the associated image region based on the multi-head attention mechanism, and obtaining the content similarity between the image region and the associated image region under each attention mechanism.
The fusion unit 304 may comprise a first fusion subunit and a second fusion subunit, in particular:
a first fusion subunit: the system comprises a multi-head attention mechanism, a correlation image mechanism and a display mechanism, wherein the multi-head attention mechanism is used for carrying out fusion processing on feature information of an image region and the correlation image region according to content similarity between the image region and the correlation image region based on the multi-head attention mechanism to obtain attention feature information of attention context information under each attention mechanism;
a second fusion subunit: and the attention characteristic information fusion module is used for fusing the attention characteristic information under each attention mechanism to obtain the attention characteristic information of the attention context information.
The operation executed by each attention mechanism may specifically refer to the description of the corresponding location in the embodiment, which is not described herein again.
Each attention mechanism may output, for each image region, a content similarity for that image region to the associated image region.
Each attention mechanism performs fusion processing on the feature information of the image region and the associated image region according to the content similarity between the image region and the associated image region that it outputs (for the specific implementation process, refer to the description at the corresponding position in the embodiments, which is not repeated here), so as to obtain the attention feature information under each attention mechanism.
The obtained plurality of attention feature information are spliced to obtain spliced feature information, which is then processed to obtain feature information with the same dimension as the attention feature information under each attention mechanism, so as to obtain the attention feature information of the attention context information.
That is, in an embodiment, the calculating unit 303 may include a mechanism determining subunit and a calculating subunit, specifically:
A mechanism determination subunit: configured to determine a currently used target layer attention mechanism from the multi-layer attention mechanism, and to determine the feature information of each image region in the text image to be recognized as the target input feature information of the target layer attention mechanism;
A calculation subunit: configured to calculate, for each image region, the content similarity between the image region and the associated image region through the target layer attention mechanism according to the feature information of the image region and the feature information of the associated image region.
The fusion unit 304 may include a third fusion subunit, an iteration subunit, and an output subunit, specifically:
A third fusion subunit: configured to perform, through the target layer attention mechanism, fusion processing on the feature information of the image region and the associated image region according to the content similarity between the image region and the associated image region to obtain the processed attention feature information of the attention context information;
An iteration subunit: configured to, when the target layer attention mechanism is not the preset layer attention mechanism, update the target layer attention mechanism to the associated layer attention mechanism of the target layer attention mechanism in the multi-layer attention mechanism, update the target input feature information to the processed attention feature information, and return to the step of calculating the content similarity between the image region and the associated image region through the target layer attention mechanism according to the feature information of the image region and the feature information of the associated image region;
An output subunit: configured to, when the target layer attention mechanism is the preset layer attention mechanism, output the processed attention feature information to obtain the attention feature information of the text image to be recognized.
For example, the target layer attention mechanism may specifically include at least one attention mechanism, each attention mechanism may calculate, for each image region, content similarity between the image region and the associated image region according to the feature information of the image region and the feature information of the associated image region, and a specific implementation process may refer to description of a corresponding position in the embodiment and is not described herein again.
And performing fusion processing on the feature information of the image region and the associated image region according to the content similarity between the image region and the associated image region through a target layer attention mechanism to obtain the processed attention feature information of the attention context information, wherein the specific implementation process may refer to the description of the corresponding position in the embodiment, which is not described herein again.
When the target layer attention mechanism is not the preset layer attention mechanism, the processed attention feature information output by the target layer attention mechanism is used as the target input feature information of the next layer attention mechanism, and the next layer attention mechanism is taken as the new target layer attention mechanism to perform the same operation. That is, the target input feature information is processed in turn by each layer of the multi-layer attention mechanism, with the processed attention feature information output by one layer serving as the target input feature information of the next layer, until the target layer attention mechanism is the preset layer attention mechanism, at which point the processed attention feature information is output to obtain the attention feature information.
The fusion unit 304: configured to perform, for each image area, fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area to obtain the attention feature information of the attention context information.
For example, specifically, for each image region, the image region and the associated image region are weighted and summed according to the content similarity between the image region and the associated image region as the weight corresponding to the associated image region, so as to obtain the attention feature information of the attention context information corresponding to each image region.
And obtaining the attention feature information of the attention context information of the text image to be recognized according to the attention feature information of the attention context information corresponding to each image area.
The recognition unit 305: configured to perform text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result.
For example, text content recognition may be performed on the image to be recognized according to the attention feature information, that is, for each image region in the text image to be recognized, the prediction probability of each character in a preset dictionary is predicted. For example, suppose the preset dictionary includes the four characters A, B, C and D and the text image to be recognized includes 5 image regions; after text content recognition is performed on the text image to be recognized, the probability that image region 1 is character A may be 0.8, the probability that image region 1 is character B may be 0.05, the probability that image region 1 is character C may be 0.05, and the probability that image region 1 is character D may be 0.1, and similarly for the other image regions of the text image to be recognized.
The recognition result may be obtained according to the probability of each image region for each character, for example, the character with the highest probability is determined as the recognition result of the image region, and the recognition result of the image region 1 may be a, for example.
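Continuing the example, a minimal decoding sketch in Python; only the first probability row matches the numbers above, and the rows for image regions 2 to 5 are hypothetical:

```python
import numpy as np

dictionary = ["A", "B", "C", "D"]  # the preset dictionary

# Per-region prediction probabilities over the dictionary; row 0 is
# image region 1 from the example, the remaining rows are made up.
probs = np.array([
    [0.80, 0.05, 0.05, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.05, 0.85, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
])

# The character with the highest probability is each region's result,
# so all regions can be decoded in parallel with a single argmax.
text = "".join(dictionary[i] for i in probs.argmax(axis=1))
print(text)  # -> "ABCAD"
```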
As can be seen from the above, the text recognition apparatus in the embodiment of the present application can acquire the text image to be recognized through the acquisition unit 301, where the text image to be recognized includes at least two image areas; the extraction unit 302 performs feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized; calculating, by the calculating unit 303, for each image region, a content similarity between the image region and the associated image region from the feature information of the image region and the feature information of the associated image region; for each image region, the fusion unit 304 performs fusion processing on the feature information of the image region and the associated image region according to the content similarity between the image region and the associated image region to obtain the attention feature information of the attention context information; finally, the recognition unit 305 performs text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result. According to the scheme, the attention feature information of the attention context information of each image area is obtained by fusing the feature information of each image area and the corresponding associated image area, and text recognition is performed on each image area according to the attention feature information of the attention context information, so that parallel recognition of the image areas in the image to be recognized can be realized, and the recognition speed of the image to be recognized is improved.
An embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server, as shown in fig. 12, which shows a schematic structural diagram of the computer device according to the embodiment of the present application, and specifically:
the computer device may include components such as a processor 1001 of one or more processing cores, memory 1002 of one or more computer-readable storage media (which may also be referred to as storage media), a power supply 1003, and an input unit 1004. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 12 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 1001 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 1002 and calling data stored in the memory 1002, thereby monitoring the computer device as a whole. Optionally, processor 1001 may include one or more processing cores; preferably, the processor 1001 may integrate an application processor, which mainly handles operating systems, user interfaces, computer programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1001.
The memory 1002 may be used to store software programs and modules, and the processor 1001 executes various functional applications and data processing by running the software programs and modules stored in the memory 1002. The memory 1002 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a computer program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 1002 may also include a memory controller to provide the processor 1001 with access to the memory 1002.
The computer device further includes a power source 1003 for supplying power to each component. Preferably, the power source 1003 may be logically connected to the processor 1001 through a power management system, so that functions of managing charging, discharging, power consumption, and the like are implemented through the power management system. The power source 1003 may also include one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The computer device may also include an input unit 1004, and the input unit 1004 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 1001 in the computer device loads the executable file corresponding to the process of one or more computer programs into the memory 1002 according to the following instructions, and the processor 1001 runs the computer programs stored in the memory 1002, so as to implement various functions as follows:
acquiring a text image to be recognized, wherein the text image to be recognized comprises at least two image areas;
performing feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized;
for each image area, calculating the content similarity between the image area and the associated image area according to the characteristic information of the image area and the characteristic information of the associated image area;
for each image area, according to the content similarity between the image area and the associated image area, carrying out fusion processing on the feature information of the image area and the associated image area to obtain attention feature information of attention context information;
and performing text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
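For the similarity step, a minimal sketch of one way the content similarity could be computed via learned spatial mappings, in the spirit of the query and key vectors described in the claims below; the projection matrices and the scaling factor are assumptions:

```python
import numpy as np

def content_similarity(features, w_query, w_key):
    """Content similarity between every image region and its
    associated regions, via spatial mapping of the feature vectors.

    features: (n, d) feature vectors of the image regions.
    w_query:  (d, k) attention weight matrix for the first mapping.
    w_key:    (d, k) attention weight matrix for the second mapping.
    """
    q = features @ w_query  # first mapping vectors ("query")
    k = features @ w_key    # second mapping vectors ("key")
    # Dot product between query and key vectors, scaled by sqrt(k)
    # as in standard attention (the scaling is an assumption here).
    return (q @ k.T) / np.sqrt(k.shape[-1])
```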
As can be seen from the above, the computer device according to the embodiment of the present application may obtain a text image to be recognized, where the text image to be recognized includes at least two image areas; performing feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized; for each image area, calculating the content similarity between the image area and the associated image area according to the characteristic information of the image area and the characteristic information of the associated image area; for each image area, performing fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area to obtain attention feature information of the attention context information; and performing text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result. According to the scheme, the attention feature information of the attention context information of each image area is obtained by fusing the feature information of each image area and the corresponding associated image area, and text recognition is performed on each image area according to the attention feature information of the attention context information, so that parallel recognition of the image areas in the images to be recognized can be realized, and the recognition speed of the images to be recognized is improved.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the above embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.
To this end, the present application provides a storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute any one of the text recognition methods provided in the present application.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
As the computer program stored in the storage medium can execute any text recognition method provided in the embodiments of the present application, beneficial effects that can be achieved by any text recognition method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The text recognition method, the text recognition device, the computer device, and the storage medium provided by the embodiments of the present application are described in detail above, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (15)

1. A text recognition method, comprising:
acquiring a text image to be recognized, wherein the text image to be recognized comprises at least two image areas;
performing feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized;
for each image area, calculating the content similarity between the image area and the associated image area according to the characteristic information of the image area and the characteristic information of the associated image area;
for each image area, according to the content similarity between the image area and the associated image area, carrying out fusion processing on the feature information of the image area and the associated image area to obtain attention feature information of attention context information;
and performing text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result.
2. The method according to claim 1, wherein the feature information comprises a feature vector, and the calculating, for each image region, a content similarity between the image region and an associated image region according to the feature information of the image region and the feature information of the associated image region matched with the image region comprises:
determining a target image area to be processed currently and a related image area of the target image area;
performing space mapping processing on the target characteristic vector of the target image area based on the attention weight information to obtain a first mapping vector corresponding to the target characteristic vector of the target image area in a characteristic space;
performing space mapping processing on the associated feature vector of the associated image region based on the attention weight information to obtain a second mapping vector corresponding to the associated feature vector of the associated image region in a feature space;
and calculating the content similarity between the target image area and the associated image area according to the distance between the first mapping vector and the second mapping vector to obtain the content similarity of each image area.
3. The method of claim 2, wherein the first mapping vector comprises a first query vector, a first key vector, and a first content vector, the second mapping vector comprises a second query vector, a second key vector, and a second content vector, and the calculating the content similarity between the target image region and the associated image region according to the distance between the first mapping vector and the second mapping vector, resulting in the content similarity for each image region comprises:
calculating the content similarity between the target image area and the associated image area according to the distance between the first query vector corresponding to the target image area and the second key vector corresponding to the associated image area to obtain the content similarity of each image area;
the acquiring, for each image region, the attention feature information of the attention context information by performing fusion processing on the feature information of the image region and the associated image region according to the content similarity between the image region and the associated image region includes:
for each image area, according to the content similarity between the image area and the associated image area, performing weighting processing on the content feature vector corresponding to the associated image area to obtain a weighted content feature vector corresponding to the associated image area;
and for each image area, carrying out fusion processing according to the feature vector of the image area and the weighted content feature vector corresponding to the associated image area to obtain the attention feature information of the attention context information.
4. The method according to claim 3, wherein the calculating the content similarity between the target image region and the associated image region according to the distance between the first query vector corresponding to the target image region and the second key vector corresponding to the associated image region to obtain the content similarity of each image region comprises:
calculating initial content similarity between the target image area and the associated image area according to the distance between a first query vector corresponding to the target image area and a second key vector corresponding to the associated image area to obtain initial content similarity of each image area;
obtaining an initial content similarity matrix of the image to be recognized according to the initial content similarity of each image area;
and adjusting the initial content similarity of each image area by using the initial content similarity matrix and the transpose matrix of the initial content similarity matrix to obtain the content similarity between the target image area and the associated image area.
5. The method according to claim 4, wherein the adjusting the initial content similarity by using the initial content similarity matrix and the transpose matrix of the initial content similarity matrix to obtain the content similarity between the image area and the associated image area comprises:
adding the initial content similarity matrix and the transpose matrix of the initial content similarity matrix to obtain a first content similarity matrix;
setting, for each image area, a corresponding window matrix for the first content similarity matrix to obtain a second content similarity matrix;
and determining the content similarity between the image area and the associated image area according to the second content similarity matrix.
6. The method according to claim 1, wherein the calculating, for each image region, a content similarity between the image region and an associated image region according to the feature information of the image region and the feature information of the associated image region comprises:
for each image region, calculating the content similarity between the image region and the associated image region based on a multi-head attention mechanism and according to the feature information of the image region and the feature information of the associated image region, and obtaining the content similarity between the image region and the associated image region under each attention mechanism;
the acquiring, for each image region, the attention feature information of the attention context information by performing fusion processing on the feature information of the image region and the associated image region according to the content similarity between the image region and the associated image region includes:
for each image region, performing fusion processing on the feature information of the image region and the associated image region based on the multi-head attention mechanism according to the content similarity between the image region and the associated image region to obtain the attention feature information of the attention context information under each attention mechanism;
and fusing the attention characteristic information under each attention mechanism to obtain the attention characteristic information of the attention context information.
7. The method according to claim 1, wherein the calculating, for each image region, a content similarity between the image region and an associated image region according to the feature information of the image region and the feature information of the associated image region comprises:
determining a currently used target layer attention mechanism from the multi-layer attention mechanism, and determining the characteristic information of each image area in the text image to be recognized as target input characteristic information of the target layer attention mechanism;
for each image area, calculating, through the target layer attention mechanism, the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area;
the acquiring, for each image region, the attention feature information of the attention context information by performing fusion processing on the feature information of the image region and the associated image region according to the content similarity between the image region and the associated image region includes:
performing fusion processing on the feature information of the image area and the associated image area through the target layer attention mechanism according to the content similarity between the image area and the associated image area to obtain processed attention feature information of the attention context information;
when the target layer attention mechanism is not the preset layer attention mechanism, updating an associated layer attention mechanism of the target layer attention mechanism in the multi-layer attention mechanism to be the target layer attention mechanism, updating the processed attention feature information to be the target input feature information, and returning to the step of calculating, for each image area and through the target layer attention mechanism, the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area;
and when the target layer attention mechanism is the preset layer attention mechanism, outputting the processed attention characteristic information to obtain the attention characteristic information of the text image to be recognized.
8. The method according to claim 1, wherein the extracting features of the text image to be recognized to obtain feature information of each image area in the text image to be recognized comprises:
and performing feature extraction on the text image to be recognized through a trained text content recognition model to obtain feature information of each image area of the text image to be recognized.
9. The method of claim 8, wherein before the acquiring of the text image to be recognized, the method further comprises:
acquiring a text image sample, wherein the text image sample comprises at least two image areas;
performing feature extraction on image areas of the text image sample through an initial text content identification model to obtain feature information of each image area of the text image sample;
for each image area, calculating the content similarity between the image area and the associated image area according to the characteristic information of the image area and the characteristic information of the associated image area;
for each image area, according to the content similarity between the image area and the associated image area, carrying out fusion processing on the feature information of the image area and the associated image area to obtain attention feature information of attention context information;
performing text content recognition on the text image sample based on the attention feature information to obtain a prediction result;
and training the initial text content recognition model based on the prediction result and the sample label of the text image sample to obtain a trained text content recognition model.
10. The method of claim 9, wherein the prediction result comprises characters and a prediction probability corresponding to each character, and wherein the training the initial text content recognition model based on the prediction result and the sample label to obtain a trained text content recognition model comprises:
determining a corresponding character path set according to characters in the sample label, wherein the character path set comprises at least one character path, and the character path is mapped through a preset mapping strategy to obtain the sample label;
calculating a path probability for each character path in the set of character paths based on a prediction probability for each character of the prediction result;
calculating an error value between the prediction result and a sample label according to the character path probability;
and training the initial text content recognition model based on the error value to obtain a trained text content recognition model.
11. The method of claim 10, wherein determining a corresponding set of character paths from the characters in the sample label comprises:
determining a label sequence according to characters contained in the sample label, wherein the label sequence comprises a first character, a second character and a spacing character;
determining a corresponding set of character paths based on the sequence of labels, the set of character paths including paths formed by jumping from a first character to a second character and jumping from the first character to a space character.
12. The method according to any one of claims 1 to 11, wherein the extracting features of the text image to be recognized to obtain feature information of each image area in the text image to be recognized comprises:
performing convolution processing on the image to be recognized to obtain a feature map of the image to be recognized;
and performing feature extraction on the feature map to obtain feature information of each image area in the image to be recognized.
13. A text recognition apparatus, comprising:
an acquisition unit, configured to acquire a text image to be recognized, the text image to be recognized comprising at least two image areas;
an extraction unit, configured to perform feature extraction on the text image to be recognized to obtain feature information of each image area in the text image to be recognized;
a calculation unit, configured to calculate, for each image area, the content similarity between the image area and the associated image area according to the feature information of the image area and the feature information of the associated image area;
a fusion unit, configured to perform, for each image area, fusion processing on the feature information of the image area and the associated image area according to the content similarity between the image area and the associated image area, to obtain attention feature information of the attention context information;
and an identification unit, configured to perform text content recognition on the image to be recognized based on the attention feature information to obtain a recognition result.
14. A computer device comprising a memory and a processor; the memory stores a computer program, and the processor is configured to execute the computer program in the memory to perform the text recognition method according to any one of claims 1 to 11.
15. A storage medium for storing a computer program which is loaded by a processor to perform the text recognition method of any one of claims 1 to 11.
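As a minimal illustration of the character-path idea in claims 10 and 11, the following brute-force Python sketch sums the probabilities of every character path that maps to a sample label under a CTC-style preset mapping strategy (merge repeated characters, then drop the spacing character); real implementations use dynamic programming, and all names here are illustrative:

```python
import itertools
import numpy as np

SPACER = "-"  # the spacing character

def collapse(path):
    """Preset mapping strategy: merge repeated characters, then
    remove spacing characters, yielding the label a path maps to."""
    merged = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in merged if c != SPACER)

def path_set_probability(label, probs, alphabet):
    """Sum the path probabilities of every character path in the
    character path set that maps to `label`.

    probs: (num_regions, len(alphabet)) prediction probabilities.
    """
    total = 0.0
    for idx in itertools.product(range(len(alphabet)), repeat=len(probs)):
        path = [alphabet[i] for i in idx]
        if collapse(path) == label:
            # Path probability: product of per-region probabilities.
            total += float(np.prod([probs[t, i] for t, i in enumerate(idx)]))
    return total

# The error value of claim 10 could then be, for example,
# -np.log(path_set_probability("AB", probs, ["A", "B", SPACER])).
```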
CN202110712851.9A 2021-06-25 2021-06-25 Text recognition method and device, computer equipment and storage medium Pending CN113822264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110712851.9A CN113822264A (en) 2021-06-25 2021-06-25 Text recognition method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110712851.9A CN113822264A (en) 2021-06-25 2021-06-25 Text recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113822264A true CN113822264A (en) 2021-12-21

Family

ID=78924063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110712851.9A Pending CN113822264A (en) 2021-06-25 2021-06-25 Text recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113822264A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114170482A (en) * 2022-02-11 2022-03-11 阿里巴巴达摩院(杭州)科技有限公司 Model training method, device, equipment and medium
CN114170482B (en) * 2022-02-11 2022-05-17 阿里巴巴达摩院(杭州)科技有限公司 Document pre-training model training method, device, equipment and medium
CN114863437A (en) * 2022-04-21 2022-08-05 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN115171110A (en) * 2022-06-30 2022-10-11 北京百度网讯科技有限公司 Text recognition method, apparatus, device, medium, and product
CN115171110B (en) * 2022-06-30 2023-08-22 北京百度网讯科技有限公司 Text recognition method and device, equipment, medium and product
CN116645700A (en) * 2023-07-27 2023-08-25 腾讯科技(深圳)有限公司 Feature extraction model processing method and device and feature extraction method and device
CN116645700B (en) * 2023-07-27 2023-11-03 腾讯科技(深圳)有限公司 Feature extraction model processing method and device and feature extraction method and device
CN117173731A (en) * 2023-11-02 2023-12-05 腾讯科技(深圳)有限公司 Model training method, image processing method and related device
CN117173731B (en) * 2023-11-02 2024-02-27 腾讯科技(深圳)有限公司 Model training method, image processing method and related device

Similar Documents

Publication Publication Date Title
CN111709409B (en) Face living body detection method, device, equipment and medium
Bartz et al. STN-OCR: A single neural network for text detection and text recognition
CN113822264A (en) Text recognition method and device, computer equipment and storage medium
Cao et al. Landmark recognition with compact BoW histogram and ensemble ELM
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
US20170124432A1 (en) Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN110866140A (en) Image feature extraction model training method, image searching method and computer equipment
EP3029606A2 (en) Method and apparatus for image classification with joint feature adaptation and classifier learning
CN109063719B (en) Image classification method combining structure similarity and class information
CN110209859A (en) The method and apparatus and electronic equipment of place identification and its model training
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
CN112507912B (en) Method and device for identifying illegal pictures
CN111898703A (en) Multi-label video classification method, model training method, device and medium
CN111738355A (en) Image classification method and device with attention fused with mutual information and storage medium
CN115050064A (en) Face living body detection method, device, equipment and medium
Zhong et al. Sgbanet: Semantic gan and balanced attention network for arbitrarily oriented scene text recognition
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
Hong et al. Selective residual learning for visual question answering
US20240119743A1 (en) Pre-training for scene text detection
CN117453949A (en) Video positioning method and device
CN114612748A (en) Cross-modal video clip retrieval method based on feature decoupling
CN115731422A (en) Training method, classification method and device of multi-label classification model
Inunganbi et al. Handwritten Meitei Mayek recognition using three‐channel convolution neural network of gradients and gray
CN118097341A (en) Target detection method, model training method and related device
CN111126049A (en) Object relation prediction method and device, terminal equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination