CN114821560B - Text recognition method and device

Text recognition method and device

Info

Publication number
CN114821560B
Authority
CN
China
Prior art keywords
time step
vector
hidden state
current time
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210374091.XA
Other languages
Chinese (zh)
Other versions
CN114821560A (en)
Inventor
秦勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xingtong Technology Co., Ltd.
Original Assignee
Shenzhen Xingtong Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xingtong Technology Co., Ltd.
Priority to CN202210374091.XA
Publication of CN114821560A
Application granted
Publication of CN114821560B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The disclosure provides a text recognition method and device, belonging to the field of image processing. The method comprises the following steps: acquiring a text image to be recognized; and invoking a trained text recognition model. The text recognition model comprises: a feature extraction module, used for obtaining a feature mapping group based on the text image; a context encoding module, used for processing, at each time step, based on the feature mapping group to obtain a target context vector of the current time step; and a decoding module, used for determining, for each current time step, the text information and the original hidden state vector corresponding to the current time step based on the target context vector and the target hidden state vector corresponding to the previous time step, wherein the target hidden state vector is determined based on an adjustment vector, and the adjustment vector is determined based on the feature mappings of a plurality of reference time steps around the previous time step. With the method and device, the decoding module can strengthen the feature information of the time steps near the current time step, thereby improving the accuracy of text recognition.

Description

Text recognition method and device
Technical Field
The present invention relates to the field of image processing, and in particular, to a text recognition method and apparatus.
Background
Natural scene text recognition is the process of recognizing a character sequence from a picture containing text (for Chinese, a character is a Chinese character; for English, a character is a letter), and it is a very challenging task. In practical applications, text recognition may be based on an attention mechanism. Besides factors such as complex picture backgrounds and illumination changes, the complexity of the output space is also a major difficulty: since words consist of a variable number of characters, natural scene text recognition must recognize sequences of variable length from a picture.
There are currently two kinds of solutions. The first is based on a bottom-up strategy, which divides the recognition problem into character detection, character recognition, and character combination, and solves each in turn. The second is based on a holistic-analysis strategy, i.e., a sequence-to-sequence method, in which the image is first encoded and the sequence is then decoded to directly obtain the whole character string.
However, for long texts, the first strategy is slow and time-consuming, while the second strategy is imprecise, resulting in low text recognition accuracy.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a text recognition method and device to solve the problem of low text recognition accuracy.
According to an aspect of the present disclosure, there is provided a text recognition method, including:
acquiring a text image to be identified;
Invoking a trained text recognition model, wherein the text recognition model comprises a feature extraction module, a context coding module and a decoding module;
In the feature extraction module, processing is carried out based on the text image to obtain a feature mapping group of the text image, wherein the feature mapping group comprises feature mappings of a plurality of time steps;
in the context coding module, processing is carried out on each current time step based on the feature mapping group to obtain a target context vector of the current time step;
in the decoding module, for each current time step, determining, based on the target context vector of the current time step and the target hidden state vector corresponding to the previous time step of the current time step, the text information in the text image corresponding to the current time step and the original hidden state vector of the current time step, wherein the target hidden state vector is determined based on the original hidden state vector of the previous time step and an adjustment vector, the adjustment vector is determined based on the feature mappings of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step.
According to another aspect of the present disclosure, there is provided a text recognition apparatus, including:
the acquisition module is used for acquiring a text image to be identified;
The calling module is used for calling the trained text recognition model, wherein the text recognition model comprises a feature extraction module, a context coding module and a decoding module;
In the feature extraction module, processing is carried out based on the text image to obtain a feature mapping group of the text image, wherein the feature mapping group comprises feature mappings of a plurality of time steps;
in the context coding module, processing is carried out on each current time step based on the feature mapping group to obtain a target context vector of the current time step;
in the decoding module, for each current time step, determining, based on the target context vector of the current time step and the target hidden state vector corresponding to the previous time step of the current time step, the text information in the text image corresponding to the current time step and the original hidden state vector of the current time step, wherein the target hidden state vector is determined based on the original hidden state vector of the previous time step and an adjustment vector, the adjustment vector is determined based on the feature mappings of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step.
According to another aspect of the present disclosure, there is provided an electronic device including:
A processor; and
A memory in which a program is stored,
Wherein the program comprises instructions that, when executed by the processor, cause the processor to perform the text recognition method.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described text recognition method.
In the embodiment of the disclosure, after the text image to be recognized is acquired, a trained text recognition model is invoked, the text recognition model comprising a feature extraction module, a context encoding module, and a decoding module. In the feature extraction module, processing is performed based on the text image to obtain a feature mapping group of the text image. In the context encoding module, for each current time step, processing is performed based on the feature mapping group to obtain a target context vector of the current time step. In the decoding module, for each current time step, the text information in the text image corresponding to the current time step and the original hidden state vector of the current time step are determined based on the target context vector of the current time step and the target hidden state vector corresponding to the previous time step, wherein the target hidden state vector is determined based on the original hidden state vector of the previous time step and an adjustment vector, the adjustment vector is determined based on the feature mappings of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step. Therefore, when the decoding module processes each time step, the feature information of the time steps near the current time step can be strengthened through the feature mappings of the plurality of reference time steps; since nearby text information has more reference value for the recognition of any text, the accuracy of text recognition can be improved.
Drawings
Further details, features and advantages of the present disclosure are disclosed in the following description of exemplary embodiments, with reference to the following drawings, wherein:
FIG. 1 illustrates a flow chart of a text recognition method according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a context encoding module process flow diagram according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of a text recognition device according to an exemplary embodiment of the present disclosure;
Fig. 4 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided to afford a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Embodiments of the present disclosure provide a text recognition method that may be performed by a terminal, a server, and/or other processing-capable devices. The method provided by the embodiment of the present disclosure may be implemented by any one of the above devices, or may be implemented by a plurality of devices together, which is not limited in this disclosure.
Taking a terminal as an example in the embodiment of the present disclosure, a text recognition method will be described below with reference to a flowchart of the text recognition method shown in fig. 1.
And step 101, acquiring a text image to be recognized.
In one possible implementation, when text in an image needs to be identified, the user may trigger a signal for text identification on the terminal. For example, the user may take an image using the terminal and click on an option to identify text, thereby triggering a signal for text identification. For another example, the user may press the image displayed by the terminal for a long time and click on an option to identify text after the terminal displays the option, thereby triggering a signal to identify text. The specific scene of triggering the text recognition signal is not limited in this embodiment.
When the terminal receives a text recognition signal, the image corresponding to the signal can be acquired. This image may contain content other than the text to be recognized. Therefore, before the text recognition processing, the terminal can preprocess the image, crop out the text image, and take it as the text image to be recognized. The specific preprocessing method is not described in this embodiment.
The text image to be recognized may be a single-line text image and may include straight text, tilted text, curved text, blurred text, photocopied text, etc. The specific form and content of the text image to be recognized are not limited in the embodiments of the present disclosure.
Step 102, calling a trained text recognition model.
The text recognition model comprises a feature extraction module, a context coding module and a decoding module.
In a possible implementation manner, before text recognition is performed using the text recognition model, the text recognition model may be trained accordingly, and a specific training process will be described in another embodiment, which will not be described in detail herein. After training is completed, the trained text recognition model can be stored. When executing the text recognition task, the text recognition model may be invoked for subsequent processing.
The processing of the various modules in the text recognition model will be described in steps 103-105.
And step 103, in the feature extraction module, processing is performed based on the text image to obtain a feature mapping group of the text image.
The feature mapping group comprises feature mapping of a plurality of time steps, and the sequence of the feature mapping of the plurality of time steps is matched with the sequence of the text information.
In one possible implementation, the text image to be identified may be input to a feature extraction module, and the feature extraction module processes the text image to obtain a feature map set.
Optionally, the feature extraction module includes a first feature extraction sub-module and a second feature extraction sub-module, based on which specific processing in the feature extraction module may be as follows:
in a first feature extraction sub-module, processing a text image to be identified to obtain an initial feature map of the text image;
in the second feature extraction submodule, the initial feature mapping is subjected to sequence modeling according to time steps to obtain a feature mapping group.
In one possible implementation manner, after the text image to be recognized is input into the first feature extraction sub-module of the feature extraction module, the first feature extraction sub-module may process it to obtain a set of initial feature maps, which is output to the second feature extraction sub-module. The second feature extraction sub-module performs sequence modeling on the set of initial feature maps by time step, thereby obtaining the feature mapping group. The initial feature maps have the same dimensionality as the feature mapping group, which may be one-dimensional or two-dimensional; the embodiments of the disclosure are not limited in this respect. Further, the second feature extraction sub-module may strengthen the sequence relations in the initial feature maps output by the first feature extraction sub-module.
For example, the first feature extraction sub-module may employ a ResNet network; the ResNet may be composed of four blocks, each block composed of several convolution operations, with the output of each block serving as the input of the next. The second feature extraction sub-module may employ a two-layer bidirectional LSTM. The specific tools employed by the first and second feature extraction sub-modules are not limited in the embodiments of the present disclosure; a minimal sketch is given below.
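The following is a minimal sketch of such a feature extraction module, assuming a PyTorch implementation. The backbone stand-in, the layer sizes, and the height-collapsing pooling are illustrative assumptions; the disclosure only requires a convolutional backbone (e.g., a ResNet of four blocks) followed by a two-layer bidirectional LSTM.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, channels=256, hidden=256):
        super().__init__()
        # Stand-in for the ResNet backbone (illustrative, not the disclosed
        # four-block network); a real implementation could use a torchvision
        # ResNet trunk instead.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )
        # Second sub-module: two-layer bidirectional LSTM for sequence modeling.
        self.rnn = nn.LSTM(channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)

    def forward(self, image):                 # image: (B, 3, H, W)
        f = self.backbone(image)              # (B, C, 1, W')
        f = f.squeeze(2).permute(0, 2, 1)     # (B, W', C): one vector per time step
        feature_maps, _ = self.rnn(f)         # (B, M, 2*hidden)
        return feature_maps                   # feature mapping group, M time steps
```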
And step 104, in the context coding module, processing based on the feature mapping group for each current time step to obtain a target context vector of the current time step.
In one possible implementation manner, in the context encoding module, for each current time step, the information of the current time step and the information of the other time steps in the feature mapping group may be processed based on the feature mapping group to obtain the target context vector of the current time step, so that the target context vector carries both the information corresponding to the current time step and the information of its context.
Optionally, for long text, focusing on character information far away from the character currently being recognized may reduce recognition accuracy when recognizing that character. Based on this, as shown in the context encoding module processing flowchart of fig. 2, the processing after obtaining the target context vector of the current time step in step 104 may be as follows:
step 201, determining a past vector of a current time step;
Step 202, processing based on the target context vector and the past vector of the current time step to obtain a processed context vector;
and step 203, updating the target context vector into the processed context vector.
Wherein the past vector is used to represent context information for a historical time step prior to the current time step.
In one possible implementation, after the context encoding module obtains the target context vector of a certain time step based on the feature mapping group, a past vector of the current time step, representing the context information of the historical time steps, may be determined. Processing is then performed based on the target context vector and the past vector to obtain a processed context vector, and the target context vector of the current time step is updated to the processed context vector. In this way, information that has already been used can be marked through the past vector, so that the context information near the current time step becomes more prominent in the target context vector; the subsequent decoding module performs text recognition based on this target context vector, which can further improve the accuracy of text recognition.
Specifically, the processing in step 202 may be as follows:
The past vector is subtracted point by point from the pre-update target context vector to obtain a processed context vector.
Of course, other specific processes may be adopted to make the context information near the current time step more prominent in the target context vector; for example, the proportion of the past vector in the pre-update target context vector is computed point by point, and the pre-update target context vector is then scaled down point by point accordingly.
Optionally, the context encoding module includes an attention module, which may be configured to process based on the feature map set to obtain an attention score for each time step. The past vectors may be derived based on the attention score, based on which the specific process of determining the past vectors for each time step may be as follows:
accumulating the attention scores of a plurality of time steps before the current time step to obtain a historical attention score of the current time step;
and processing the historical attention score and the feature mapping group of the current time step through the context coding module to obtain the past vector of the current time step.
In one possible implementation manner, for each time step, the context encoding module may obtain the attention score of the current time step through the attention module. After the pre-update target context vector of the current time step is obtained in the context encoding module, the attention scores of all time steps before the current time step may be accumulated to obtain the historical attention score; the attention module then takes the feature mapping group as the value vectors and computes over the historical attention score and the value vectors to obtain the past vector of the current time step. The past vector of the first time step may be set to a zero vector.
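As a short illustration of this accumulation, the sketch below computes the past vector from the attention scores of the earlier time steps, assuming plain PyTorch tensors; the per-step score tensors are assumed to come from the attention module described above.

```python
import torch

def past_vector(score_history, feature_maps):
    # score_history: list of (M,) attention-score tensors of the time steps
    # before the current one; feature_maps: (M, D) feature mapping group,
    # used here as the value vectors.
    if not score_history:                          # 1st time step: zero vector
        return feature_maps.new_zeros(feature_maps.shape[1])
    hist = torch.stack(score_history).sum(dim=0)   # historical attention score
    return hist @ feature_maps                     # weighted sum of the values

# The processed context vector then follows by point-by-point subtraction:
# processed_context = raw_context - past_vector(score_history, feature_maps)
```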
Step 105, in the decoding module, for each current time step, determining the text information in the text image corresponding to the current time step and the original hidden state vector of the current time step based on the target context vector of the current time step and the target hidden state vector corresponding to the previous time step of the current time step.
The target hidden state vector is determined based on the original hidden state vector and the adjustment vector of the previous time step, the adjustment vector is determined based on the feature mapping of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step.
In one possible implementation manner, for each current time step, after the context encoding module obtains the target context vector of the current time step, it may output the target context vector to the decoding module. The decoding module may then process based on the target context vector and the target hidden state vector corresponding to the previous time step to identify the text information in the text image and obtain the original hidden state vector of the current time step. The adjustment vector of the current time step may be determined based on the feature mappings of a plurality of reference time steps of the current time step, and the target hidden state vector of the current time step is then determined based on the original hidden state vector and the adjustment vector of the current time step.
Therefore, when the decoding module processes each time step, the feature information of the time steps near the current time step can be strengthened through the feature mappings of the plurality of reference time steps; since nearby text information has more reference value for the recognition of any text, the accuracy of text recognition can be improved.
Optionally, the feature map set includes feature maps of M time steps, where M is an integer greater than 0, based on which specific processing in the decoding module may be as follows:
In the 1st time step, processing is performed based on the target context vector of the 1st time step and the initial value of the target hidden state vector to obtain the recognition result corresponding to the 1st time step and the original hidden state vector of the 1st time step, and the target hidden state vector of the 1st time step is determined based on the original hidden state vector of the 1st time step and the adjustment vector of the 1st time step, wherein the initial value of the target hidden state vector is set as a preset vector;
in the N-th time step, processing is performed based on the target context vector of the N-th time step and the target hidden state vector of the (N-1)-th time step to obtain the recognition result corresponding to the N-th time step and the original hidden state vector of the N-th time step, and the target hidden state vector of the N-th time step is determined based on the original hidden state vector of the N-th time step and the adjustment vector of the N-th time step, wherein N is greater than or equal to 2 and less than or equal to M;
and in the M time steps, the process of obtaining the recognition result corresponding to the N-th time step is executed in sequence until the feature mapping of the M-th time step has been recognized, and the text information in the text image is obtained based on the recognition results corresponding to the M time steps.
In one possible implementation, in the decoding module:
For the 1st time step, after the context encoding module obtains the target context vector of the 1st time step, processing may be performed based on the initial value of the target hidden state vector and the target context vector to obtain the recognition result corresponding to the 1st time step and the original hidden state vector of the 1st time step. At this point, the adjustment vector of the 1st time step may be determined based on the feature mappings of the plurality of reference time steps of the 1st time step; the target hidden state vector of the 1st time step is then determined based on the original hidden state vector and the adjustment vector of the 1st time step and output to the processing of the 2nd time step of the decoding module. The initial value of the target hidden state vector may be set as a zero vector.
For the N-th time step, if N is greater than or equal to 2 and less than M, after the context encoding module obtains the target context vector of the N-th time step, processing is first performed based on the target hidden state vector of the (N-1)-th time step and the target context vector to obtain the recognition result corresponding to the N-th time step and the original hidden state vector of the N-th time step. At this point, the adjustment vector of the N-th time step can be determined based on the feature mappings of the plurality of reference time steps of the N-th time step; the target hidden state vector of the N-th time step is then determined based on the original hidden state vector and the adjustment vector of the N-th time step and output to the processing of the (N+1)-th time step of the decoding module.
For the N-th time step, if N = M, after the context encoding module obtains the target context vector of the N-th time step, processing is performed based on the target hidden state vector of the (N-1)-th time step and the target context vector to obtain the recognition result corresponding to the N-th time step and the original hidden state vector of the N-th time step.
In the M time steps, the process of obtaining the recognition result corresponding to the N-th time step is executed in sequence until the feature mapping of the M-th time step has been recognized, and the text information in the text image is obtained based on the recognition results corresponding to the M time steps.
The recognition result may be a probability vector over a preset dictionary, where the preset dictionary includes a plurality of words. The preset dictionary may be built as follows: manually label the character sequences of a large number of text images to be recognized, and build the dictionary from the labeling information.
The terminal can process the probability vector of each time step in a preset decoding manner based on the preset dictionary, thereby determining the text information corresponding to each probability vector. For example, the word in the preset dictionary corresponding to the highest probability value in each probability vector is taken as the text information corresponding to that probability vector, and the text information in the text image is thus obtained, as illustrated in the sketch below. The embodiments of the present disclosure do not limit the specific decoding manner of the probability vectors.
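As an illustration of such a preset decoding manner, a greedy decoder over the probability vectors might look as follows; the list-based dictionary and the end-of-sequence token are assumptions for illustration, not prescribed by the disclosure.

```python
def greedy_decode(prob_vectors, dictionary, eos="<eos>"):
    # prob_vectors: list of (V,) tensors, one per time step;
    # dictionary: list of V entries indexed like the probability vectors.
    out = []
    for p in prob_vectors:
        token = dictionary[int(p.argmax())]  # entry with the highest probability
        if token == eos:                     # stop at an assumed end marker
            break
        out.append(token)
    return "".join(out)
```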
Optionally, the decoding module may include a unidirectional RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, or a GRU (Gated Recurrent Unit) network, which is not limited by the embodiments of the present disclosure.
Illustratively, the decoding module includes a unidirectional RNN in which:
For the 1st time step, after the context encoding module obtains the target context vector of the 1st time step, the decoding module splices the target context vector with a zero vector of the same number of columns and takes the spliced vector as the input of the RNN. The RNN processes this input based on the initial value of the target hidden state vector and outputs the recognition result corresponding to the 1st time step and the original hidden state vector of the 1st time step.
For each time step after the 1st, after the context encoding module obtains the target context vector of the current time step, the decoding module splices the target context vector with the probability vector of the previous time step and takes the spliced vector as the input of the RNN. The RNN processes this input based on the target hidden state vector of the previous time step and outputs the probability vector corresponding to the current time step and the original hidden state vector of the current time step.
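A sketch of one such decoder step is shown below, assuming PyTorch and a plain RNN cell (the disclosure equally allows an LSTM or GRU); the layer sizes and the softmax output head are illustrative.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, ctx_dim, vocab_size, hidden_size):
        super().__init__()
        # The input is the spliced [context ; previous probability vector].
        self.cell = nn.RNNCell(ctx_dim + vocab_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, context, prev_probs, prev_target_hidden):
        x = torch.cat([context, prev_probs], dim=-1)  # splice the two vectors
        h = self.cell(x, prev_target_hidden)          # original hidden state
        probs = self.out(h).softmax(dim=-1)           # per-step probability vector
        return probs, h

# For the 1st time step, prev_probs is a zero vector of the same width, and
# prev_target_hidden is the preset (e.g., zero) initial target hidden state.
```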
Optionally, the method for determining the target hidden state vector based on the original hidden state vector and the adjustment vector of the last time step in step 105 includes:
Adding the original hidden state vector of the previous time step and the adjustment vector of the previous time step point by point to obtain the target hidden state vector.
In one possible implementation, after obtaining the original hidden state vector and the adjustment vector for a certain time step, the process of obtaining the target hidden state vector for the time step may be as follows: and adding the original hidden state vector of the time step with the adjustment vector of the time step point by point to obtain the target hidden state vector of the time step.
Of course, other specific processes may be adopted to make the hidden-layer information near the time step more prominent in the target hidden state vector; for example, the proportion of the adjustment vector in the original hidden state vector is computed point by point, and the original hidden state vector is then scaled up point by point accordingly.
Optionally, the method for determining the adjustment vector based on the feature map of the plurality of reference time steps of the last time step in step 105 includes:
performing average pooling based on the feature mappings of the plurality of reference time steps corresponding to the previous time step to obtain the adjustment vector of the previous time step.
In one possible implementation manner, after obtaining the original hidden state vector of a certain time step, the decoding module may perform an average pooling operation on the feature mappings of the plurality of reference time steps corresponding to that time step to obtain the adjustment vector of the time step. The window size of the average pooling operation may be preset; for example, it may be set to 5, which is not limited in the embodiments of the present disclosure. The time step in question lies within the window of the average pooling operation.
Optionally, the previous time step is a center of a preset range, that is, the preset range of a certain time step is formed by a plurality of time steps taking the time step as a center, and based on this, the method for determining the adjustment vector includes: and taking the feature mapping point corresponding to the last time step as a center point, and selecting feature mapping of a plurality of reference time steps in a feature mapping group by using a preset window size to perform average pooling operation to obtain an adjustment vector of the last time step.
In one possible implementation manner, for a certain time step, the decoding module performs the average pooling operation on the feature mappings of the plurality of reference time steps corresponding to that time step, taking the feature mapping corresponding to the time step as the window center, to obtain the adjustment vector of the time step. For example, if the feature values within the window corresponding to the current time step are (5,5,4,6,5), with the center value 4 corresponding to the current time step, the average-pooled value is (5+5+4+6+5)/5 = 5, so the adjustment value at the position of the current time step is 5, i.e., (0,0,5,0,0) written out over the window positions. A sketch of this computation, combined with the point-by-point addition above, is given below.
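The following sketch combines the centered average pooling with the point-by-point addition described above; it assumes the hidden state and the feature mappings have the same width, window size 5 matches the illustrative preset, and truncating the window at the sequence edges is an assumption.

```python
import torch

def adjusted_hidden(orig_hidden, feature_maps, t, window=5):
    # orig_hidden: (D,) original hidden state of time step t;
    # feature_maps: (M, D) feature mapping group; t: index of the time step.
    half = window // 2
    lo, hi = max(0, t - half), min(feature_maps.shape[0], t + half + 1)
    adjust = feature_maps[lo:hi].mean(dim=0)  # average pooling, window centered on t
    return orig_hidden + adjust               # point-by-point addition
```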
Further, in order to improve accuracy of the target context vector, the target hidden state vector may be fed back to the context encoding module, so that the context encoding module may pay more attention to hidden layer information of a time step near the current time step when processing. Based on this, the processing in each time step of the context encoding module (i.e., step 104 described above) may be as follows:
processing based on the feature mapping group and the target hidden state vector of the previous time step to obtain the target context vector of the current time step.
In a possible implementation manner, in the context coding module, for each time step, the feature mapping group and the received target hidden state vector of the last time step are processed, so as to obtain the target context vector of the current time step.
Alternatively, corresponding to the case where the context encoding module includes the attention module, the processing in one time step of the context encoding module (i.e., the step 104) may be as follows:
taking the feature mapping group as the key vectors and value vectors and the target hidden state vector of the previous time step as the query vector, and processing through the attention module to obtain the target context vector of the current time step.
In one possible implementation in an embodiment of the present invention, after the feature extraction module outputs the feature mapping group to the context encoding module, for each time step, the attention module may take the feature mapping group as the key vectors and value vectors and the target hidden state vector of the previous time step as the query vector, and compute the attention score of the current time step based on the query vector and the key vectors. The attention score of the current time step is accumulated with the attention scores of all time steps before it to obtain the total attention score of the current time step, and the target context vector of the current time step is computed based on the total attention score and the value vectors.
Optionally, the target context vector may then be updated based on the past vector. Specifically, the attention scores of the time steps before the current time step may be accumulated to obtain the historical attention score of the current time step; the past vector of the current time step is computed based on the historical attention score and the value vectors; the past vector is then subtracted point by point from the target context vector to obtain the processed context vector, and the target context vector is updated to the processed context vector. The total attention score of the 1st time step may be set to the attention score of the 1st time step. Putting these pieces together, a sketch of one attention step follows.
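The sketch below puts these pieces together for one attention step; the additive (Bahdanau-style) scoring function and the layer sizes are assumptions, since the disclosure does not fix how the attention score is computed.

```python
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, feat_dim, hid_dim, attn_dim=128):
        super().__init__()
        self.Wk = nn.Linear(feat_dim, attn_dim)  # projects the keys
        self.Wq = nn.Linear(hid_dim, attn_dim)   # projects the query
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feature_maps, query, hist_scores):
        # feature_maps: (M, D) keys and values; query: (H,) target hidden state
        # of the previous time step; hist_scores: (M,) accumulated attention
        # scores of all earlier time steps (zeros at the 1st time step).
        e = self.v(torch.tanh(self.Wk(feature_maps) + self.Wq(query))).squeeze(-1)
        alpha = e.softmax(dim=0)             # attention score of this time step
        total = hist_scores + alpha          # total attention score
        raw_ctx = total @ feature_maps       # target context vector (pre-update)
        past = hist_scores @ feature_maps    # past vector
        return raw_ctx - past, total         # processed context, new history
```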
The embodiment of the disclosure can obtain the following technical effects:
(1) When the decoding module processes each time step, the feature information of the time steps near the current time step can be strengthened through the feature mappings of the plurality of reference time steps; since nearby text information has more reference value for the recognition of any text, more accurate decoding can be achieved and the accuracy of text recognition improved.
(2) For each time step, the context encoding module can process based on the pre-update target context vector and the past vector to obtain the updated target context vector; that is, the past vector can be used to mark the context information that has already been used, so that the context information near the current time step becomes more prominent in the target context vector. The subsequent decoding module performs text recognition based on this target context vector, which improves the accuracy of text recognition.
The text recognition model used in the above disclosed embodiments may be a machine learning model that may be trained prior to the above process using the text recognition model.
The training method of the text recognition model can be as follows: the text recognition model is trained based on the plurality of text image samples and the text information corresponding to each text image sample.
In one possible implementation manner, the training method of the text recognition model may specifically be as follows. Acquire a plurality of training samples and an initial text recognition model, where each training sample comprises a text image and the text information corresponding to the text image; the text recognition model takes each text image as input and outputs the probability vectors corresponding to the text information. Input the text image of each training sample into the initial text recognition model for training, so that the initial text recognition model outputs the corresponding probability vectors. Input the probability vectors output by the initial text recognition model and the text information of the corresponding training sample into a loss function to compute the loss, and adjust the parameters of the initial text recognition model based on the loss. When the training end condition is reached, take the current text recognition model as the trained text recognition model.
The training end condition may be that the number of training iterations reaches a first threshold, and/or the model accuracy reaches a second threshold, and/or the loss function falls below a third threshold. The first, second, and third thresholds may be set empirically. The present embodiment does not limit the specific training end condition.
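A minimal training-loop sketch under the same PyTorch assumption is given below; the Adam optimizer and the cross-entropy loss are illustrative choices, and the model is assumed to return per-step scores whose softmax gives the probability vectors, with labels encoded as dictionary indices.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()            # expects unnormalized scores
    for _ in range(epochs):
        for images, labels in loader:          # labels: (B, M) dictionary indices
            scores = model(images)             # (B, M, V) per-step scores
            loss = loss_fn(scores.flatten(0, 1), labels.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()                         # adjust the model parameters
    return model
```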
In the embodiment of the disclosure, after the text recognition model is obtained by training, it can be used to implement the above text recognition method. When the decoding module processes each time step, the feature information of the time steps near the current time step can be strengthened through the feature mappings of the plurality of reference time steps; since nearby text information has more reference value for the recognition of any text, more accurate decoding can be achieved, and the accuracy of text recognition can be improved.
The embodiment of the disclosure provides a text recognition apparatus for implementing the above text recognition method. As shown in the schematic block diagram of the text recognition apparatus of fig. 3, the text recognition apparatus 300 includes an acquisition module 301 and a calling module 302.
An acquiring module 301, configured to acquire a text image to be identified;
A calling module 302, configured to call a trained text recognition model, where the text recognition model includes a feature extraction module, a context encoding module, and a decoding module; in the feature extraction module, processing is performed based on the text image to obtain a feature mapping group of the text image, where the feature mapping group includes the feature mappings of a plurality of time steps; in the context encoding module, for each current time step, processing is performed based on the feature mapping group to obtain a target context vector of the current time step; in the decoding module, for each current time step, the text information in the text image corresponding to the current time step and the original hidden state vector of the current time step are determined based on the target context vector of the current time step and the target hidden state vector corresponding to the previous time step, where the target hidden state vector is determined based on the original hidden state vector of the previous time step and an adjustment vector, the adjustment vector is determined based on the feature mappings of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step.
Optionally, the text recognition device 300 further includes a processing module configured to:
Performing average pooling based on the feature mappings of the plurality of reference time steps corresponding to the last time step to obtain the adjustment vector of the last time step.
Optionally, the processing module is configured to:
And taking the feature mapping point corresponding to the last time step as a central point, and selecting feature mapping of a plurality of reference time steps in the feature mapping group by using a preset window size to carry out average pooling operation to obtain an adjustment vector of the last time step.
Optionally, the text recognition device 300 further includes a determining module configured to:
And adding the original hidden state vector of the last time step with the adjustment vector of the last time step point by point to obtain the target hidden state vector.
Optionally, the feature mapping group includes feature mappings of M time steps, where M is an integer greater than 0;
The calling module 302 is configured to:
In the 1st time step, processing is performed based on the target context vector of the 1st time step and the initial value of the target hidden state vector to obtain the recognition result corresponding to the 1st time step and the original hidden state vector of the 1st time step, and the target hidden state vector of the 1st time step is determined based on the original hidden state vector of the 1st time step and the adjustment vector of the 1st time step, wherein the initial value of the target hidden state vector is set as a preset vector;
in the N-th time step, processing is performed based on the target context vector of the N-th time step and the target hidden state vector of the (N-1)-th time step to obtain the recognition result corresponding to the N-th time step and the original hidden state vector of the N-th time step, and the target hidden state vector of the N-th time step is determined based on the original hidden state vector of the N-th time step and the adjustment vector of the N-th time step, wherein N is greater than or equal to 2 and less than or equal to M;
and in the M time steps, the process of obtaining the recognition result corresponding to the N-th time step is executed in sequence until the feature mapping of the M-th time step has been recognized, and the text information in the text image is obtained based on the recognition results corresponding to the M time steps.
Optionally, the calling module 302 is configured to:
determining a past vector of the current time step, wherein the past vector is used for representing context information of a historical time step before the current time step;
Processing based on the target context vector and the past vector to obtain a processed context vector;
Updating the target context vector to the processed context vector.
Optionally, the context coding module includes an attention module, where the attention module is configured to process based on the feature mapping group to obtain an attention score of each time step;
The calling module 302 is configured to:
Accumulating the attention scores of the time steps before the current time step to obtain a historical attention score;
And processing the historical attention score and the characteristic mapping group through the context coding module to obtain the past vector.
Optionally, the calling module 302 is configured to:
And subtracting the past vector point by point from the target context vector before updating to obtain the processed context vector.
Optionally, the feature extraction module includes a first feature extraction sub-module and a second feature extraction sub-module;
The calling module 302 is configured to:
In the first feature extraction sub-module, processing the text image to obtain an initial feature map of the text image;
And in the second feature extraction submodule, performing sequence modeling on the initial feature map according to time steps to obtain the feature map group.
In the embodiment of the disclosure, after the text image to be recognized is acquired, a trained text recognition model is invoked, the text recognition model comprising a feature extraction module, a context encoding module, and a decoding module. In the feature extraction module, processing is performed based on the text image to obtain a feature mapping group of the text image. In the context encoding module, for each current time step, processing is performed based on the feature mapping group to obtain a target context vector of the current time step. In the decoding module, for each current time step, the text information in the text image corresponding to the current time step and the original hidden state vector of the current time step are determined based on the target context vector of the current time step and the target hidden state vector corresponding to the previous time step, wherein the target hidden state vector is determined based on the original hidden state vector of the previous time step and an adjustment vector, the adjustment vector is determined based on the feature mappings of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step. Therefore, when the decoding module processes each time step, the feature information of the time steps near the current time step can be strengthened through the feature mappings of the plurality of reference time steps; since nearby text information has more reference value for the recognition of any text, the accuracy of text recognition can be improved.
The exemplary embodiments of the present disclosure also provide an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor for causing the electronic device to perform a method according to embodiments of the present disclosure when executed by the at least one processor.
The present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to embodiments of the disclosure.
Referring to fig. 4, a block diagram of an electronic device 400, which may be the server or the client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit the implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the electronic device 400 includes a computing unit 401 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of device 400 may also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Various components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406, an output unit 407, a storage unit 408, and a communication unit 409. The input unit 406 may be any type of device capable of inputting information to the electronic device 400; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 407 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 408 may include, but is not limited to, magnetic disks and optical disks. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the respective methods and processes described above. For example, in some embodiments, the text recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. In some embodiments, the computing unit 401 may be configured to perform the text recognition method by any other suitable means (e.g., by means of firmware).
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (12)

1. A method of text recognition, the method comprising:
acquiring a text image to be recognized;
invoking a trained text recognition model, wherein the text recognition model comprises a feature extraction module, a context coding module, and a decoding module;
in the feature extraction module, processing the text image to obtain a feature mapping group of the text image, wherein the feature mapping group comprises feature mappings of a plurality of time steps;
in the context coding module, for each current time step, processing the feature mapping group to obtain a target context vector of the current time step;
in the decoding module, for each current time step, determining, based on the target context vector of the current time step and a target hidden state vector corresponding to the previous time step of the current time step, the text information in the text image corresponding to the current time step and an original hidden state vector of the current time step, wherein the target hidden state vector is determined based on the original hidden state vector of the previous time step and an adjustment vector, the adjustment vector is determined based on the feature mappings of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step.
2. The text recognition method of claim 1, wherein determining the adjustment vector based on the feature mappings of the plurality of reference time steps of the previous time step comprises:
performing average pooling on the feature mappings of the plurality of reference time steps corresponding to the previous time step to obtain the adjustment vector of the previous time step.
3. The text recognition method of claim 2, wherein performing average pooling on the feature mappings of the plurality of reference time steps corresponding to the previous time step to obtain the adjustment vector of the previous time step comprises:
taking the feature mapping corresponding to the previous time step as the center point, selecting the feature mappings of the plurality of reference time steps from the feature mapping group with a preset window size, and performing the average pooling operation on them to obtain the adjustment vector of the previous time step.
4. The text recognition method of claim 1, wherein determining the target hidden state vector based on the original hidden state vector and the adjustment vector of the previous time step comprises:
adding the original hidden state vector of the previous time step and the adjustment vector of the previous time step point by point to obtain the target hidden state vector.
5. The text recognition method of claim 1, wherein the feature mapping group comprises feature mappings of M time steps, M being an integer greater than 0;
for each current time step, determining the text information in the text image corresponding to the current time step and the original hidden state vector of the current time step based on the target context vector of the current time step and the target hidden state vector corresponding to the previous time step comprises:
at the 1st time step, processing the target context vector of the 1st time step and an initial value of the target hidden state vector to obtain the recognition result corresponding to the 1st time step and the original hidden state vector of the 1st time step, and determining the target hidden state vector of the 1st time step based on the original hidden state vector of the 1st time step and the adjustment vector of the 1st time step, wherein the initial value of the target hidden state vector is set to a preset vector;
at the N-th time step, processing the target context vector of the N-th time step and the target hidden state vector of the (N-1)-th time step to obtain the recognition result corresponding to the N-th time step and the original hidden state vector of the N-th time step, and determining the target hidden state vector of the N-th time step based on the original hidden state vector of the N-th time step and the adjustment vector of the N-th time step, wherein 2 ≤ N ≤ M;
over the M time steps, sequentially executing the above recognition process until the feature mappings of all M time steps have been processed, and obtaining the text information in the text image based on the recognition results of the M time steps.
6. The text recognition method of claim 1, wherein, in the context coding module, after processing the feature mapping group for each current time step to obtain the target context vector of the current time step, the method further comprises:
determining a past vector of the current time step, wherein the past vector represents the context information of the historical time steps before the current time step;
processing the target context vector and the past vector to obtain a processed context vector;
updating the target context vector to the processed context vector.
7. The text recognition method of claim 6, wherein the context coding module comprises an attention module configured to process the feature mapping group to obtain an attention score for each time step;
determining the past vector of the current time step comprises:
accumulating the attention scores of the time steps before the current time step to obtain a historical attention score;
processing the historical attention score and the feature mapping group through the context coding module to obtain the past vector.
8. The text recognition method of claim 6, wherein processing the target context vector and the past vector to obtain the processed context vector comprises:
subtracting the past vector point by point from the target context vector before the update to obtain the processed context vector.
9. The text recognition method of claim 1, wherein the feature extraction module comprises a first feature extraction sub-module and a second feature extraction sub-module;
processing the text image to obtain the feature mapping group of the text image comprises:
in the first feature extraction sub-module, processing the text image to obtain an initial feature map of the text image;
in the second feature extraction sub-module, performing sequence modeling on the initial feature map by time step to obtain the feature mapping group.
10. A text recognition device, the device comprising:
an acquisition module, configured to acquire a text image to be recognized;
an invoking module, configured to invoke a trained text recognition model, wherein the text recognition model comprises a feature extraction module, a context coding module, and a decoding module;
wherein the feature extraction module processes the text image to obtain a feature mapping group of the text image, the feature mapping group comprising feature mappings of a plurality of time steps;
the context coding module, for each current time step, processes the feature mapping group to obtain a target context vector of the current time step; and
the decoding module, for each current time step, determines, based on the target context vector of the current time step and a target hidden state vector corresponding to the previous time step of the current time step, the text information in the text image corresponding to the current time step and an original hidden state vector of the current time step, wherein the target hidden state vector is determined based on the original hidden state vector of the previous time step and an adjustment vector, the adjustment vector is determined based on the feature mappings of a plurality of reference time steps of the previous time step, and each reference time step is within a preset range of the previous time step.
11. An electronic device, comprising:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1-9.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any one of claims 1-9.
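
The sketches below illustrate, in PyTorch-style Python, one possible reading of the claimed steps. They are minimal sketches under stated assumptions, not the patented implementation: every module name, cell type, and dimension is an illustrative choice. This first one covers the per-step decoding of claim 1, with the point-by-point addition of claim 4 noted at the end.

```python
# Minimal sketch of one decoding step (claim 1). Assumptions: a GRU decoder
# cell, 256-dim vectors, and a 6000-class vocabulary; none are from the patent.
import torch
import torch.nn as nn

hidden_dim, context_dim, vocab_size = 256, 256, 6000
gru_cell = nn.GRUCell(context_dim, hidden_dim)   # hypothetical decoder cell
classifier = nn.Linear(hidden_dim, vocab_size)   # hidden state -> character scores

def decode_step(context_t, h_target_prev):
    """context_t     : (B, context_dim) target context vector of the current step
    h_target_prev : (B, hidden_dim) target hidden state of the previous step
    Returns character logits and the original hidden state of the current step."""
    h_orig_t = gru_cell(context_t, h_target_prev)   # new original hidden state
    logits_t = classifier(h_orig_t)                 # text prediction for this step
    return logits_t, h_orig_t

# Claim 4: the target hidden state handed to the next step is the original
# hidden state plus the adjustment vector, added point by point:
#   h_target_t = h_orig_t + adjust_t
```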
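
Claims 2 and 3 derive the adjustment vector by average pooling the feature mappings inside a preset window centered on the previous time step. A minimal sketch, assuming a window of 5 and clamping at the sequence boundaries (both assumptions):

```python
# Minimal sketch of claims 2-3: windowed average pooling over the feature
# mapping group. Window size and boundary handling are assumptions.
import torch

def adjustment_vector(feature_maps, prev_t, window=5):
    """feature_maps : (T, C) feature mapping group, one row per time step
    prev_t       : index of the previous time step
    Returns the (C,) adjustment vector of that time step."""
    half = window // 2
    lo = max(prev_t - half, 0)                          # clamp the window at
    hi = min(prev_t + half + 1, feature_maps.size(0))   # the sequence edges
    # The reference time steps are the window around prev_t; their
    # point-wise mean is the adjustment vector.
    return feature_maps[lo:hi].mean(dim=0)
```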
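
Claim 5's recursion over M time steps then strings these pieces together. A minimal sketch reusing `decode_step` and `adjustment_vector` from the sketches above; the zero initial target hidden state and the equality of the feature channel count with `hidden_dim` are assumptions:

```python
# Minimal sketch of the claim-5 loop over M time steps (greedy readout).
import torch

def decode_sequence(context_vectors, feature_maps):
    """context_vectors : (M, B, context_dim), one target context vector per step
    feature_maps    : (M, hidden_dim) feature mapping group
    Returns (M, B) indices of the recognized characters."""
    M, B, _ = context_vectors.shape
    h_target = torch.zeros(B, hidden_dim)   # preset initial target hidden state
    logits_seq = []
    for t in range(M):
        # Decode step t from the current context vector and the target
        # hidden state carried over from step t-1.
        logits_t, h_orig = decode_step(context_vectors[t], h_target)
        logits_seq.append(logits_t)
        # Target hidden state for step t+1: original hidden state plus the
        # adjustment vector of step t, broadcast over the batch (claim 4).
        h_target = h_orig + adjustment_vector(feature_maps, t)
    # The recognized text is the highest-scoring character at each step.
    return torch.stack(logits_seq).argmax(dim=-1)
```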
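
Claims 6 to 8 refine each target context vector with a coverage-style "past vector" built from accumulated attention scores. A minimal sketch; reading "processing the historical attention score and the feature mapping group" as a plain weighted sum is an assumption:

```python
# Minimal sketch of claims 6-8: subtract a history-weighted combination of
# the feature mappings from the pre-update target context vector.
import torch

def refine_context(context_t, attention_history, feature_maps):
    """context_t         : (C,) target context vector of the current step
    attention_history : list of (T,) attention scores of earlier time steps
    feature_maps      : (T, C) feature mapping group
    Returns the processed context vector."""
    if not attention_history:
        return context_t                    # first step: no history yet
    # Historical attention score: accumulated scores of all earlier steps.
    hist_score = torch.stack(attention_history).sum(dim=0)    # (T,)
    # Past vector: history-weighted combination of the feature mappings.
    past_vector = hist_score @ feature_maps                   # (C,)
    # Claim 8: point-by-point subtraction from the context vector.
    return context_t - past_vector
```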
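
Finally, claim 9 splits feature extraction into a convolutional sub-module and a sequence-modeling sub-module. A minimal sketch; the layer layout, channel counts, and the choice of a bidirectional LSTM are all assumptions:

```python
# Minimal sketch of claim 9's two feature extraction sub-modules.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # First sub-module: CNN that collapses the image height so that each
        # remaining column is the initial feature map of one time step.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),          # squeeze height to 1
        )
        # Second sub-module: sequence modeling across time steps.
        self.rnn = nn.LSTM(channels, channels // 2,
                           bidirectional=True, batch_first=True)

    def forward(self, image):                 # image: (B, 3, H, W)
        f = self.cnn(image)                   # (B, C, 1, W')
        f = f.squeeze(2).permute(0, 2, 1)     # (B, W', C): one row per time step
        feature_maps, _ = self.rnn(f)         # the feature mapping group
        return feature_maps

# For example, FeatureExtractor()(torch.randn(1, 3, 32, 128)) returns a
# (1, 32, 256) feature mapping group: 32 time steps of 256-dim features.
```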
CN202210374091.XA (filed 2022-04-11, priority 2022-04-11): Text recognition method and device. Granted as CN114821560B (Active).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210374091.XA 2022-04-11 2022-04-11 Text recognition method and device

Publications (2)

Publication Number Publication Date
CN114821560A (en) 2022-07-29
CN114821560B (en) 2024-08-02

Family

ID=82535637

Country Status (1)

Country Link
CN (1) CN114821560B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304846A (en) * 2017-09-11 2018-07-20 腾讯科技(深圳)有限公司 Image-recognizing method, device and storage medium
CN110458344A (en) * 2019-07-26 2019-11-15 湘潭大学 A kind of super short-period wind power prediction technique of auto-adaptive time resolution ratio

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110326002B (en) * 2017-02-24 2023-09-05 谷歌有限责任公司 Sequence processing using online attention
CN113822143A (en) * 2021-07-30 2021-12-21 腾讯科技(深圳)有限公司 Text image processing method, device, equipment and storage medium
CN113344014B (en) * 2021-08-03 2022-03-08 北京世纪好未来教育科技有限公司 Text recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant