CN114283432A - Text block identification method and device and electronic equipment

Info

Publication number
CN114283432A
Authority
CN
China
Prior art keywords
feature map
text block
feature
classification
block image
Prior art date
Legal status
Pending
Application number
CN202110931940.2A
Other languages
Chinese (zh)
Inventor
郑岩
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110931940.2A
Publication of CN114283432A

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a text block identification method, a text block identification device and electronic equipment, wherein the method comprises the following steps: acquiring a text block image, wherein the text block image comprises one line or a plurality of lines of characters; extracting features of the text block image through a feature extraction network to obtain a first feature map; and identifying the characters in the first feature map through a codec to obtain an identification result of the one or more lines of characters, wherein the codec comprises an attention network. The embodiment of the invention can improve the accuracy of text block identification.

Description

Text block identification method and device and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to a text block identification method and device and electronic equipment.
Background
In many scenarios it is necessary to recognize the characters in a text block image. At present, a common method for recognizing characters in a text block image is as follows: the text block is divided into lines, the characters in each line are identified, and the identified characters of each line are spliced to obtain the final identification result. However, when the lines of characters in the text block image are distorted, vertically stuck together, or the like, characters belonging to one line may be assigned to adjacent lines during division. The divided lines are then inaccurate, the characters in each line cannot be accurately recognized, and the accuracy of text block recognition is reduced.
Disclosure of Invention
The embodiment of the invention discloses a text block identification method, a text block identification device and electronic equipment, which are used for improving the accuracy of text block identification.
A first aspect discloses a text block recognition method, which is characterized by comprising:
acquiring a text block image, wherein the text block image comprises one line or a plurality of lines of characters;
extracting features of the text block image through a feature extraction network to obtain a first feature map;
and identifying the characters in the first feature map through a codec to obtain an identification result of the one or more lines of characters, wherein the codec comprises an attention network.
As a possible implementation manner, the feature extraction network includes a feature extraction module and a classification module, and the extracting features of the text block image through the feature extraction network to obtain the first feature map includes:
extracting the features of the text block image through the feature extraction module to obtain N feature maps, wherein the N feature maps are different in size, and N is an integer greater than or equal to 4;
and selecting one feature map from the N feature maps through the classification module to obtain a first feature map.
As a possible implementation manner, the feature extraction module includes N Convolutional Neural Network (CNN) blocks, the classification module includes N classification units, and the N CNN blocks are in one-to-one correspondence with the N classification units;
the extracting the features of the text block image by the feature extraction module to obtain N feature maps comprises:
inputting the text block image into the first CNN block to obtain the 1st feature map;
performing dimension reduction on the i-th feature map by using the (i+1)-th CNN block to obtain the (i+1)-th feature map, wherein i = 1, 2, …, N-1;
the step of selecting one feature map from the N feature maps by the classification module to obtain a first feature map includes:
and under the condition that the size of a second feature map is the target feature map size corresponding to a first classification unit, determining the second feature map as the first feature map, wherein the first classification unit is any one of the N classification units, and the second feature map is, among the 1st to N-th feature maps, the feature map corresponding to the first classification unit.
As a possible implementation manner, the N classification units correspond one-to-one to N threshold ranges, and there is no intersection between any two of the N threshold ranges. The j-th classification unit includes a pooling layer, a fully connected (FC) layer, and a classification layer. The pooling layer is configured to convert the j-th feature map into a feature map of a first size; the FC layer is configured to perform dimension reduction on the converted j-th feature map; and the classification layer is configured to determine whether the number of lines of characters included in the dimension-reduced j-th feature map is within the threshold range corresponding to the j-th classification unit, and, if it is, to determine the size of the j-th feature map as the target feature map size corresponding to the j-th classification unit and determine the j-th feature map as the first feature map, where j = 1, 2, …, N.
As a possible implementation, the size of the (i+1)-th feature map is smaller than the size of the i-th feature map.
As a possible implementation, the method further comprises:
converting the text block image into an image of a second size;
the extracting the feature of the text block image through the feature extraction network to obtain a first feature map comprises:
and extracting the features of the converted text block image through a feature extraction network to obtain a first feature map.
As a possible implementation, the method further comprises:
inputting the text block image into a text block detection network to obtain a plurality of text block image segments;
the extracting the feature of the text block image through the feature extraction network to obtain a first feature map comprises:
inputting the text block image segments into a feature extraction network to obtain a plurality of feature maps;
the identifying the characters in the first feature map by the codec to obtain an identification result includes:
and inputting the plurality of feature maps into a codec to obtain a recognition result.
A second aspect discloses a text block recognition apparatus, including:
an acquisition unit, configured to acquire a text block image, wherein the text block image comprises one line or a plurality of lines of characters;
the extraction unit is used for extracting features of the text block image through a feature extraction network to obtain a first feature map;
and the identification unit is used for identifying the characters in the first feature map through a codec to obtain the identification result of the one or more lines of characters, wherein the codec comprises an attention network.
As a possible implementation manner, the feature extraction network includes a feature extraction module and a classification module, and the extraction unit is specifically configured to:
extracting the features of the text block image through the feature extraction module to obtain N feature maps, wherein the N feature maps are different in size, and N is an integer greater than or equal to 4;
and selecting one feature map from the N feature maps through the classification module to obtain a first feature map.
As a possible implementation manner, the feature extraction module includes N CNN blocks, the classification module includes N classification units, and the N CNN blocks are in one-to-one correspondence with the N classification units;
the extracting, by the extraction unit, features of the text block image through the feature extraction module to obtain N feature maps comprises:
inputting the text block image into the first CNN block to obtain the 1st feature map;
performing dimension reduction on the i-th feature map by using the (i+1)-th CNN block to obtain the (i+1)-th feature map, wherein i = 1, 2, …, N-1;
the step of selecting one feature map from the N feature maps by the classification module to obtain a first feature map includes:
and under the condition that the size of a second feature map is the target feature map size corresponding to a first classification unit, determining the second feature map as the first feature map, wherein the first classification unit is any one of the N classification units, and the second feature map is, among the 1st to N-th feature maps, the feature map corresponding to the first classification unit.
As a possible implementation manner, the N classification units correspond one-to-one to N threshold ranges, and no intersection exists between any two of the N threshold ranges. The j-th classification unit includes a pooling layer, an FC layer, and a classification layer. The pooling layer is configured to convert the j-th feature map into a feature map of a first size; the FC layer is configured to perform dimension reduction on the converted j-th feature map; and the classification layer is configured to determine whether the number of lines of characters included in the dimension-reduced j-th feature map is within the threshold range corresponding to the j-th classification unit, and, if it is, to determine the size of the j-th feature map as the target feature map size corresponding to the j-th classification unit and determine the j-th feature map as the first feature map, where j = 1, 2, …, N.
As a possible implementation, the size of the (i+1)-th feature map is smaller than the size of the i-th feature map.
As a possible implementation, the apparatus further comprises:
a conversion unit configured to convert the text block image into an image of a second size;
the extraction unit is specifically configured to extract features of the converted text block image through a feature extraction network to obtain a first feature map.
As a possible implementation, the apparatus further comprises:
the input unit is used for inputting the text block image into a text block detection network to obtain a plurality of text block image segments;
the extraction unit is specifically configured to input the text block image segments into a feature extraction network to obtain a plurality of feature maps;
the identification unit is specifically configured to input the plurality of feature maps into a codec to obtain an identification result.
A third aspect discloses an electronic device, which may comprise a processor and a memory, the memory being configured to store computer program code which, when invoked by the processor, causes the processor to perform the method disclosed in the first aspect or any possible implementation of the first aspect.
A fourth aspect discloses an electronic device that may include a processor, a memory configured to store computer program code, and a transceiver configured to receive information from and output information to electronic devices other than the electronic device. When the processor invokes the computer program code stored in the memory, the processor is caused to perform the method disclosed in the first aspect or any possible implementation of the first aspect.
A fifth aspect discloses a computer readable storage medium having stored thereon a computer program or computer instructions which, when executed, implement the method as disclosed in the first aspect or any possible implementation of the first aspect.
A sixth aspect discloses a computer program product which, when run on a computer, causes the computer to perform the method disclosed in the first aspect or any possible implementation of the first aspect.
In the embodiment of the invention, a text block image including one or more lines of characters can be acquired, features of the text block image can be extracted through the feature extraction network to obtain a first feature map, and the characters in the first feature map can be recognized through a codec including an attention network to obtain the recognition result of the one or more lines of characters. The features of the whole text block can be extracted first and then directly recognized by the codec; since the attention network can see the features at any position in the two-dimensional feature map, a codec including an attention network can directly recognize multiple lines of characters, so the accuracy of text block recognition can be improved. In addition, because multiple lines of characters can be recognized directly, there is no need to detect each line before recognizing its characters, so the efficiency of text block recognition can also be improved.
Drawings
FIG. 1 is a schematic flow chart of a text block recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a CNN block disclosed in the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a feature extraction network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a text block recognition model according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a recognition result using a text block recognition model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a text block recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention discloses a text block identification method, a text block identification device and electronic equipment, which are used for improving identification accuracy. These are described in detail below.
For a better understanding of the embodiments of the present invention, the related art will be described below. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies include speech technologies.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
The invention realizes the recognition of characters in the text block by NLP technology. The following examples are intended to illustrate the details.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a text block recognition method according to an embodiment of the present invention. The text block recognition method may be applied to an electronic device capable of image processing, and the electronic device may be provided with a Graphics Processing Unit (GPU). The text block recognition method may also be applied to an application installed on an electronic device. As shown in fig. 1, the text block recognition method may include the following steps.
101. An image of a text block comprising one or more lines of text is acquired.
In the case where the characters in a text block need to be recognized, a text block image may be acquired. The text block image may be an image including characters shot by a camera, a PDF document including characters, an electronically scanned document including characters, or a document or picture including characters obtained in another way. The text block image may include one or more lines of characters. The characters may be in different languages and may include numbers, symbols, and the like. A line of characters may be understood as a line of a mathematical formula, a line of text, a line mixing formulas and text, or another kind of line, which is not limited herein.
The text block image may be obtained from locally stored text block images, from a server, or from a specialized database. The text block image may also be obtained from an inserted storage device. For example, in the case where the electronic device is a notebook computer, a desktop computer, or the like, the text block image may be acquired from a USB flash drive, a mobile hard disk, or the like inserted into it. The text block image may also be obtained by capturing a displayed document, picture, or the like. For example, when a user needs to input some text from a PDF document into a Word document but cannot directly copy the required content from the PDF document, the user may capture the content via a screenshot to obtain a text block image.
102. Features of the text block image are extracted through a feature extraction network to obtain a first feature map.
After the text block image including one or more lines of characters is acquired, features of the text block image can be extracted through the feature extraction network to obtain a first feature map, that is, the text block image can be input into the feature extraction network, and the output of the feature extraction network is the first feature map. Where the text block image is a single whole image, the first feature map is a single feature map.
The feature extraction network may include a feature extraction module and a classification module. The feature extraction module may extract features of the text block image to obtain N feature maps, that is, the text block image may be input to the feature extraction module, and the feature extraction module outputs the N feature maps. The N feature maps differ in size, that is, any two of the N feature maps have different sizes; however, the number of lines and the content of the characters included in the N feature maps are the same, namely the number of lines and the content of the characters included in the text block image. N is an integer greater than or equal to 4. The deeper the network, the higher the level of abstract features it can learn, and the more accurately the features are extracted. Therefore, in order to improve text block identification accuracy, the number of CNN blocks included in the feature extraction module may be 4 or more.
Then, one feature map is selected from the N feature maps through the classification module to obtain the first feature map, that is, the N feature maps are input into the classification module, and the output of the classification module is the first feature map. The first feature map is one of the N feature maps: although N feature maps are input into the classification module, the classification module outputs a single feature map, which is one of the N. The codec recognizes best when the size of the text region corresponding to each unit in the feature map input to it is fixed, that is, when each unit corresponds to a fixed number of lines of text. Therefore, in order to improve the recognition accuracy of the codec, the more lines the text block image includes, the larger the size of the corresponding first feature map: a text block with more lines uses a larger feature map and a text block with fewer lines uses a smaller one, so that the size of the character region corresponding to each unit in the feature map is substantially the same and the best recognition effect can be achieved.
The feature extraction module may include N CNN blocks, and the classification module may include N classification units. The N CNN blocks correspond one-to-one to the N classification units, that is, each of the N CNN blocks is connected to one of the N classification units, and different CNN blocks are connected to different classification units. The N CNN blocks are connected in sequence; it can be understood that the output of the first CNN block is connected to the input of the second CNN block, the output of the second CNN block is connected to the input of the third CNN block, …, and the output of the (N-1)-th CNN block is connected to the input of the N-th CNN block.

The text block image may be input into the first CNN block to obtain the 1st feature map, that is, the output of the first CNN block is the 1st feature map. Then, dimension reduction is performed on the i-th feature map by the (i+1)-th CNN block to obtain the (i+1)-th feature map, that is, the i-th feature map is input into the (i+1)-th CNN block, whose output is the (i+1)-th feature map; the output of the previous CNN block serves as the input of the next CNN block. Therefore, inputting the text block image into the feature extraction module yields the 1st to N-th feature maps, N feature maps in total, i.e., each CNN block outputs one feature map. i = 1, 2, …, N-1.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a CNN block according to an embodiment of the present invention. As shown in fig. 2, the CNN block may include two 3 × 3 convolution layers, two batch normalization (BN) layers, two activation layers, and one pooling (pool) layer. The padding of the two convolutional layers may be 1, so that the size of the feature map before and after convolution remains unchanged. The convolutional layers are used to extract features. The BN layers normalize the output of the convolutional layers so that the output of the CNN block does not become excessively large. The activation layers may be linear rectification function (ReLU) layers, which convert linearity into non-linearity. The first 3 × 3 convolutional layer is connected to the first BN layer, the first BN layer to the first activation layer, the first activation layer to the second convolutional layer, the second convolutional layer to the second BN layer, the second BN layer to the second activation layer, and finally a 2 × 2 pooling layer is connected. The pooling layer performs down-sampling, i.e., dimension reduction, and can reduce the length and width of the feature map to 1/2.
Therefore, the feature map output by the second CNN block is 1/2 the size of the feature map output by the first CNN block, that is, its length is 1/2 of the length, and its width is 1/2 of the width, of the feature map output by the first CNN block. Similarly, the feature map output by the third CNN block is 1/2 the size of that output by the second CNN block, and the feature map output by the fourth CNN block is 1/2 the size of that output by the third CNN block.
It should be understood that fig. 2 is only an exemplary illustration of the structure of the CNN block and does not limit that structure. For example, a CNN block may instead include three 3 × 3 convolutional layers, three BN layers, three activation layers, and one pooling layer.
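The structure in fig. 2 maps naturally onto a few standard layers. The following PyTorch code is a minimal sketch of one way to assemble it; the class name, the choice of max pooling, and the channel arguments are illustrative assumptions, not the patent's implementation.

```python
# A minimal sketch of the CNN block in FIG. 2 (assumptions noted above).
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """Two 3x3 conv layers (padding=1 keeps H and W unchanged), each
    followed by BN and ReLU, then a 2x2 pool that halves H and W."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),  # length and width reduced to 1/2
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# e.g. a (1, 1, 64, 256) grayscale text block image -> (1, 64, 32, 128)
```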
After the first classification unit receives the second feature map input by its corresponding CNN block, it may determine whether the size of the second feature map is the target feature map size corresponding to the first classification unit, and if so, determine the second feature map as the first feature map. Here the first classification unit is any one of the N classification units, and the second feature map is, among the 1st to N-th feature maps, the one output by the CNN block corresponding to the first classification unit. It can be seen that the outputs of the N CNN blocks serve as the inputs of the N classification units respectively, the input of the j-th classification unit being the output of the j-th CNN block. The N classification units have the same structure and processing procedure, but the sizes of the feature maps input into them differ.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a feature extraction network according to an embodiment of the present invention. As shown in fig. 3, N is 4, i.e., the feature extraction module includes 4 CNN blocks and the classification module includes 4 classification units. In fig. 3, the number following "c" denotes the number of channels. It can be seen that the 4 CNN blocks are connected in sequence, and the output of each of the 4 CNN blocks is connected to one classification unit. As the CNN blocks deepen, the number of channels keeps increasing. For example, the number of channels of the first CNN block is 64, that of the second is 128, that of the third is 256, and that of the fourth is 256. It can be seen that the number of channels of a CNN block may be greater than or equal to that of the previous CNN block.
The N classification units correspond one-to-one to N threshold ranges, and no intersection exists between any two of the N threshold ranges, that is, each of the N classification units corresponds to a different threshold range and the ranges of different classification units do not intersect. As shown in fig. 3, the j-th classification unit may include a pooling layer, an FC layer, and a classification layer; that is, each of the N classification units includes a pooling layer, an FC layer, and a classification layer. The pooling layer converts the j-th feature map into a feature map of a first size; the first size is the same for all N classification units, i.e., the pooling layers in the N classification units convert the N feature maps output by the N CNN blocks into feature maps of the same size. The FC layer performs dimension reduction on the converted j-th feature map, reducing it to a 1024-channel feature map. The classification layer judges whether the number of lines of characters included in the dimension-reduced j-th feature map is within the threshold range corresponding to the j-th classification unit, i.e., whether the number of lines of characters in the feature map output by the FC layer is within that range. If it is, the size of the j-th feature map can be determined as the target feature map size corresponding to the j-th classification unit, i.e., the size of the required feature map; the j-th feature map is classified into the YES category, determined as the first feature map, and output to the codec. If it is not, the size of the j-th feature map is not the target feature map size corresponding to the j-th classification unit, i.e., not the size of the required feature map, and the j-th feature map is classified into the NO category. The classification layer is thus a binary classifier: when the size of the j-th feature map is the target feature map size corresponding to the j-th classification unit, the j-th feature map is classified as YES; otherwise it is classified as NO. Different classification results of the j-th feature map lead to different subsequent processing. j = 1, 2, …, N.
It can be seen that the structures and processing procedures of the N classification units are the same, but the sizes of the feature maps input into them differ, and their corresponding target feature map sizes differ, that is, their corresponding threshold ranges differ, that is, the basis or criterion for classification differs. For example, in the case that N is 4, the threshold range corresponding to the first classification unit may be greater than 6, that of the second classification unit greater than 4 and less than or equal to 6, that of the third classification unit greater than 2 and less than or equal to 4, and that of the fourth classification unit less than or equal to 2.
It can be seen that although each of the N classification units receives an input feature map, the N feature maps all contain the same number of lines of characters, while the target feature map size, that is, the threshold range, that is, the classification criterion, of each classification unit differs. The number of lines of characters included in the text block image falls within only one of the N threshold ranges corresponding to the N classification units. Therefore, for the same text block image, the classification result of exactly one of the N classification units is YES, and the classification results of the remaining N-1 classification units are NO.
The pooling layer in the classification unit may be an adaptive average pooling layer, which can convert the size of the feature map to 1 × 1. The pooling layer may also be another type of pooling layer, and it may convert the feature map into feature maps of other sizes, which is not limited herein. The output of the pooling layer is then input into the FC layer, after which the channel dimension becomes 1024. In addition, an activation function may be arranged between the FC layer and the classification layer to activate the feature map output by the FC layer; for example, activation may be performed via a ReLU function. The classification layer may be an FC layer with a channel dimension of 2 that performs a binary classification on the activated feature map, namely whether it is a feature map of the target size; it may also be another classifier with a binary classification function. When the number of lines of characters included in the feature map input to the classification unit is within the corresponding threshold range, the feature map is a feature map of the target feature map size and may be sent to the codec. When the number of lines is outside the corresponding threshold range, the feature map is not a feature map of the target feature map size, and the result is not sent to the codec.
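Putting the pieces together, the following sketch shows one way a classification unit and the overall selection logic could look in PyTorch, reusing the CNNBlock sketched above. The YES/NO index convention, the single-channel grayscale input, the batch-size-1 assumption, and the fallback when no unit answers YES are assumptions of the sketch, not details given by the text.

```python
# Illustrative sketch of a classification unit and of selecting the
# first feature map (assumptions noted above).
import torch
import torch.nn as nn

class ClassificationUnit(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)     # convert to the "first size" (1 x 1)
        self.fc = nn.Linear(in_channels, 1024)  # dimension reduction to 1024 channels
        self.act = nn.ReLU(inplace=True)
        self.cls = nn.Linear(1024, 2)           # binary classifier: NO / YES

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(feat).flatten(1)
        return self.cls(self.act(self.fc(x)))   # logits over {NO, YES}

class FeatureExtractionNetwork(nn.Module):
    """Chains N=4 CNN blocks (channels 64/128/256/256 as in FIG. 3);
    each block's output goes to its own classification unit, and the
    feature map whose unit answers YES is returned as the first
    feature map."""
    def __init__(self):
        super().__init__()
        chans = [64, 128, 256, 256]
        ins = [1] + chans[:-1]                  # assumes grayscale input
        self.blocks = nn.ModuleList(CNNBlock(i, o) for i, o in zip(ins, chans))
        self.units = nn.ModuleList(ClassificationUnit(c) for c in chans)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats, x = [], image
        for block in self.blocks:
            x = block(x)
            feats.append(x)                     # the 1st to N-th feature maps
        for feat, unit in zip(feats, self.units):
            if unit(feat).argmax(dim=-1).item() == 1:  # YES category (batch size 1)
                return feat
        return feats[-1]  # fallback: smallest map (not specified in the text)
```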
The feature extraction network is a trained network. The training data may include various types of text block images acquired in various ways. These text block images may carry annotations, i.e., each line of characters and the number of lines are marked.
103. The characters in the first feature map are identified through a codec to obtain an identification result.
After the features of the text block image are extracted through the feature extraction network to obtain the first feature map, the characters in the first feature map can be identified through the codec to obtain the identification result of the one or more lines of characters, that is, the first feature map can be input into the codec, and the output of the codec is the recognized one or more lines of characters included in the text block image. The codec may include an attention network.
The codec may include an encoder and a decoder. The first feature map passes through the encoder and the decoder in sequence. The encoder may comprise a layer of bidirectional Long Short Term Memory (LSTM) network. Assuming the first feature map is denoted as V, inputting the features of each line of characters in the first feature map into the encoder yields new features \tilde{V}_w for each line of characters. The calculation of \tilde{V}_w from V can be expressed as follows:

\tilde{V}_w = \mathrm{RNN}(\tilde{V}_{w-1}, V_w)

where \tilde{V}_0 is the initial state, w denotes the line number of the current line of characters in the text block, \tilde{V}_w is the encoder's encoding result for the w-th line, \tilde{V}_{w-1} is the encoder's encoding result for the (w-1)-th line, and V_w is the feature on the w-th row of the first feature map. It can be seen that the encoder's encoding result for a line is related to its encoding result for the previous line and the features of the current line. The Recurrent Neural Network (RNN) in the above formula can be understood as an LSTM network.
The decoder may include a layer of bidirectional LSTM network and an attention network. The decoder predicts the next character y_{t+1} based on its previous predictions (y_1, y_2, \ldots, y_t) and the input features \tilde{V}; the calculation can be expressed as follows:

P(y_{t+1} \mid y_1, y_2, \ldots, y_t, \tilde{V}) = \mathrm{softmax}(W o_t)

where P(y_{t+1} \mid y_1, y_2, \ldots, y_t, \tilde{V}) is the prediction of y_{t+1} based on y_1, y_2, \ldots, y_t and \tilde{V}, \tilde{V} is the output matrix of the encoder (i.e., the encoded first feature map, which is the input of the decoder), W is a linear transformation matrix, and \mathrm{softmax}(\cdot) denotes the softmax function. o_t can be expressed as follows:

o_t = \tanh(W[h_t; c_t])

where h_t is the hidden state of the decoder, calculated as follows:

h_t = \mathrm{RNN}(h_{t-1}, o_{t-1})

c_t is the context feature matrix, which can be calculated from the attention network and the encoder output \tilde{V}; its calculation can be expressed as follows:

c_t = \sum_i \alpha_{t,i} \tilde{V}_i

where i is a position in the context feature matrix, \tilde{V}_i is the feature at position i at time t, and \alpha_t is the attention weight matrix. The formula states that the context feature at time t is the sum of the features \tilde{V}_i weighted by the corresponding attention weights at each position i. \alpha_t can be calculated as follows:

\alpha_t = \mathrm{softmax}(\tanh(W_h h_{t-1} + W_v \tilde{V}))

where W_h is the weight matrix of h_{t-1} and W_v is the weight matrix of \tilde{V}.
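As a non-authoritative illustration of the formulas above, the following PyTorch sketch computes one decoding step: the attention weights \alpha_t, the context c_t, the hidden state h_t, the output o_t, and the distribution over the next character. The scalar scoring projection, the LSTMCell choice, the dimensions, and all names are assumptions added for the sketch, not the patent's implementation.

```python
# One attention-decoder step, matching the formulas above (a sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    def __init__(self, d: int, vocab: int):
        super().__init__()
        self.rnn = nn.LSTMCell(d, d)            # h_t = RNN(h_{t-1}, o_{t-1})
        self.W_h = nn.Linear(d, d, bias=False)  # weight matrix of h_{t-1}
        self.W_v = nn.Linear(d, d, bias=False)  # weight matrix of V~
        self.score = nn.Linear(d, 1, bias=False)  # assumed projection to a scalar score
        self.W_o = nn.Linear(2 * d, d)          # o_t = tanh(W [h_t; c_t])
        self.W_y = nn.Linear(d, vocab)          # softmax(W o_t)

    def forward(self, v_tilde, h_prev, cell_prev, o_prev):
        # v_tilde: (L, d) encoder outputs; h_prev, cell_prev, o_prev: (1, d)
        # alpha_t = softmax(tanh(W_h h_{t-1} + W_v V~))
        e = self.score(torch.tanh(self.W_h(h_prev) + self.W_v(v_tilde)))  # (L, 1)
        alpha = F.softmax(e, dim=0)
        c_t = (alpha * v_tilde).sum(dim=0, keepdim=True)  # c_t = sum_i alpha_{t,i} V~_i
        h_t, cell_t = self.rnn(o_prev, (h_prev, cell_prev))
        o_t = torch.tanh(self.W_o(torch.cat([h_t, c_t], dim=-1)))
        p_next = F.softmax(self.W_y(o_t), dim=-1)  # P(y_{t+1} | y_1..y_t, V~)
        return p_next, h_t, cell_t, o_t
```

In practice, \tilde{V} here would be the row-wise bidirectional-LSTM encoding of the first feature map, flattened so that the attention can range over all two-dimensional positions.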
The codec is a trained codec. The training data used in training the codec may include both real data and constructed data. The real data may be obtained from collected text block images and may include the annotated character content and number of lines. In addition, the real data can be augmented: for example, the data may be rotated, its brightness may be changed, its contrast may be changed, or noise may be added to it. The constructed data may be generated by rendering with font library files to produce text block images and corresponding label information, where the label information may include the annotated character content and number of lines. The decoder may be trained in the manner of a conditional language model.
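As a rough illustration of the constructed data, the following Pillow sketch renders lines of text with a font library file into a text block image together with label information; the font path and layout parameters are hypothetical.

```python
# Render constructed training data from a font file (a hedged sketch;
# "simsun.ttf" and the layout values are hypothetical placeholders).
from PIL import Image, ImageDraw, ImageFont

def render_text_block(lines, font_path="simsun.ttf", size=24, pad=8):
    font = ImageFont.truetype(font_path, size)
    text = "\n".join(lines)
    # measure the text, then draw it in black on a white canvas
    probe = ImageDraw.Draw(Image.new("L", (1, 1)))
    x0, y0, x1, y1 = probe.multiline_textbbox((0, 0), text, font=font)
    img = Image.new("L", (x1 + 2 * pad, y1 + 2 * pad), color=255)
    ImageDraw.Draw(img).multiline_text((pad, pad), text, font=font, fill=0)
    label = {"text": lines, "num_lines": len(lines)}  # annotated content + line count
    return img, label
```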
In the method described in fig. 1, a text block image including one or more lines of characters may be acquired, a first feature map may be obtained by extracting features of the text block image through a feature extraction network, and a recognition result of the one or more lines of characters may be obtained by recognizing the characters in the first feature map through a codec including an attention network. The features of the whole text block can be extracted first and then directly recognized by the codec; since the attention network can see the features at any position in the two-dimensional feature map, a codec including an attention network can directly recognize multiple lines of characters, so the accuracy of text block recognition can be improved. In addition, because multiple lines of characters can be recognized directly, there is no need to detect each line before recognizing its characters, so the recognition efficiency can also be improved.
Optionally, after step 101 and before step 102, the text block recognition method may convert the text block image into an image of a second size, that is, convert the size of the image into a fixed size. Then, features of the converted text block image are extracted through the feature extraction network to obtain the first feature map. Because training images of different sizes cannot be trained in batches, in the process of training the feature extraction network the training text block images can first be converted into text block images of the second size and then used for training. Accordingly, after the feature extraction network is trained, in the text block recognition process the text block image needs to be converted into a text block image of the second size before being input into the feature extraction network. Thus, the size of the text block image input to the feature extraction network is fixed regardless of the size of the acquired text block image. The text block image may be converted into an image of the second size by a size conversion module, as sketched below.
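A minimal sketch of such a size conversion module, assuming Pillow and an illustrative 256 × 256 second size (the actual fixed size is not given by the text):

```python
# Convert every text block image to a fixed "second size" (a sketch;
# the 256x256 value is an assumption for illustration).
from PIL import Image

SECOND_SIZE = (256, 256)  # assumed fixed input size

def to_second_size(img: Image.Image) -> Image.Image:
    return img.resize(SECOND_SIZE, Image.BILINEAR)
```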
Optionally, after step 101 and before step 102, the text block image may be input into a text block detection network to obtain a plurality of text block image segments, that is, the text block detection network may divide one text block image into a plurality of text block image segments. A text block image segment may include a paragraph of text, i.e., the text block image may be divided in units of paragraphs. A text block image segment may also include a formula, i.e., the text block image may be divided in units of formulas. A text block image segment may also include a fixed number of lines of text, i.e., the text block image may be divided in units of a fixed number of lines; for example, each text block image segment may include 5 lines of text. The text block image may also be divided in other ways, or in a combination of two or more of the above ways, which is not limited herein.
Then, the plurality of text block image segments are input into the feature extraction network to obtain a plurality of feature maps. The plurality of feature maps correspond one-to-one to the plurality of text block image segments, that is, one feature map is obtained from each text block image segment. The feature extraction network processes a plurality of text block image segments similarly to how it processes one text block image: it processes each of the plurality of text block image segments separately rather than mixing them together.
A plurality of text block image segments may be input into the feature extraction module; each of the N CNN blocks included in the feature extraction module then outputs a plurality of feature maps, and the feature maps output by each CNN block correspond one-to-one to the plurality of text block image segments, that is, each CNN block outputs one feature map for each of the text block image segments.
The plurality of feature maps output by each CNN block are input into the corresponding classification unit, which processes each of them separately, and finally the classification module outputs a plurality of feature maps. The plurality of feature maps output by the classification module correspond one-to-one to the plurality of text block image segments. The feature maps output by the classification module may be output by the same classification unit or by several classification units. For example, when the numbers of lines of characters included in the text block image segments fall within the same one of the N threshold ranges corresponding to the N classification units, the feature maps output by the classification module are output by the same classification unit. For another example, when the numbers of lines of characters included in the text block image segments fall within different threshold ranges of the N threshold ranges, the feature maps output by the classification module are output by several classification units. It should be understood that the feature maps corresponding to text block image segments whose numbers of lines of characters fall within the same threshold range are output by the same classification unit.
For example, if the text block detection network divides one text block image into 4 text block image segments, each of the N CNN blocks will output 4 feature maps, which correspond to the 4 text block image segments. Each classification unit receives the 4 feature maps output by its CNN block and can determine, for each of them, whether its size is the target feature map size corresponding to the classification unit. When the size of a feature map is the target feature map size corresponding to the classification unit, the feature map has the size of the required feature map, can be classified into the YES category, and can then be output to the codec. When the size of a feature map is not the target feature map size corresponding to the classification unit, the feature map does not have the size of the required feature map and can be classified into the NO category.
The plurality of text block image segments may be converted into text block image segments of the second size before being input into the feature extraction network, and the converted segments are then input into the feature extraction network. Afterwards, the plurality of feature maps output by the feature extraction network can be input into the codec to obtain the recognition result.
When the text block image includes many lines of characters, the text block recognition model recognizes it slowly. Therefore, one text block image can first be cut into a plurality of text block image segments through the text block detection network and then recognized; this reduces the number of lines of characters per text block image segment and can improve text block recognition efficiency. A sketch of this segment pipeline follows.
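A hedged end-to-end sketch of that pipeline, where detect_segments stands in for the text block detection network and decode for the codec (both hypothetical callables, since the text does not give them in code form; tensor conversion between Pillow images and the network input is omitted for brevity):

```python
# Segment-wise recognition pipeline (illustrative assumptions only).
def recognize_text_block(img, detect_segments, feature_net, decode):
    results = []
    for segment in detect_segments(img):   # text block image segments
        seg = to_second_size(segment)      # optional fixed-size conversion
        feat = feature_net(seg)            # first feature map for this segment
        results.append(decode(feat))       # codec recognizes the characters
    return "\n".join(results)
```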
In one case, the text block recognition model may include a feature extraction network and a codec. In another case, it may include a size conversion module, a feature extraction network, and a codec. In yet another case, it may include a text block detection network, a feature extraction network, and a codec. In yet another case, it may include a text block detection network, a size conversion module, a feature extraction network, and a codec. Referring to fig. 4, fig. 4 is a schematic structural diagram of a text block recognition model according to an embodiment of the present invention. As shown in fig. 4, the text block recognition model may include a feature extraction network and a codec. It should be understood that fig. 4 is a schematic illustration of the structure of the text block recognition model and does not limit that structure. For example, the text block recognition model may also include a text block detection network. As another example, it may also include a size conversion module.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating a recognition result by using a text block recognition model according to an embodiment of the present invention. As shown in fig. 5, the left part is the input text block image, and the right part is the recognition result output by the text block recognition model. Therefore, the text block recognition model can accurately recognize the characters which are stuck together.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a text block recognition apparatus according to an embodiment of the present invention. The text block recognition apparatus may be an electronic device capable of image processing, and the electronic device may be provided with a GPU. The text block recognition apparatus may also be an application installed on the electronic device. As shown in fig. 6, the text block recognition apparatus may include:
an acquiring unit 601 configured to acquire a text block image, where the text block image includes one or more lines of characters;
the extracting unit 602 is configured to extract features of the text block image through a feature extraction network to obtain a first feature map;
the identifying unit 603 is configured to identify the characters in the first feature map through a codec to obtain an identification result of the one or more lines of characters, wherein the codec includes an attention network.
In an embodiment, the feature extraction network may include a feature extraction module and a classification module, and the extraction unit 602 is specifically configured to:
extracting the features of the text block images through a feature extraction module to obtain N feature maps, wherein the N feature maps are different in size, and N is an integer greater than or equal to 4;
and selecting one feature map from the N feature maps through a classification module to obtain a first feature map.
In one embodiment, the feature extraction module may include N CNN blocks, the classification module may include N classification units, and the N CNN blocks are in one-to-one correspondence with the N classification units;
the extracting unit 602, extracting features of the text block image through the feature extracting module to obtain N feature maps, may include:
inputting the text block image into the first CNN block to obtain the 1st feature map;
performing dimension reduction on the i-th feature map by using the (i+1)-th CNN block to obtain the (i+1)-th feature map, wherein i = 1, 2, …, N-1;
the extracting unit 602 selects a feature map from the N feature maps through the classifying module, and obtaining a first feature map includes:
and under the condition that the size of the second feature map is the target feature map size corresponding to the first classification unit, determining the second feature map as the first feature map, wherein the first classification unit is any one of the N classification units, and the second feature map is, among the 1st to N-th feature maps, the feature map corresponding to the first classification unit.
In one embodiment, the N classification units correspond one-to-one to N threshold ranges, and no intersection exists between any two of the N threshold ranges. The j-th classification unit includes a pooling layer, an FC layer, and a classification layer. The pooling layer is configured to convert the j-th feature map into a feature map of a first size; the FC layer is configured to perform dimension reduction on the converted j-th feature map; and the classification layer is configured to determine whether the number of lines of characters included in the dimension-reduced j-th feature map is within the threshold range corresponding to the j-th classification unit, and, if it is, to determine the size of the j-th feature map as the target feature map size corresponding to the j-th classification unit and determine the j-th feature map as the first feature map, where j = 1, 2, …, N.
In one embodiment, the size of the (i+1)-th feature map is smaller than the size of the i-th feature map.
In one embodiment, the text block recognition apparatus may further include:
a conversion unit 604 for converting the text block image into an image of a second size;
the extracting unit 602 is specifically configured to extract features of the converted text block image through a feature extraction network to obtain a first feature map.
In one embodiment, the text block recognition apparatus may further include:
an input unit 605, configured to input the text block image into a text block detection network, so as to obtain a plurality of text block image segments;
an extracting unit 602, configured to input the multiple text block image segments into a feature extraction network to obtain multiple feature maps;
the identifying unit 603 is specifically configured to input the plurality of feature maps into the codec to obtain an identification result.
The detailed descriptions of the obtaining unit 601, the extracting unit 602, the identifying unit 603, the converting unit 604 and the inputting unit 605 can be obtained by directly referring to the method embodiment shown in fig. 1, and are not repeated here.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 7, the electronic device may include a processor 701, a memory 702, and a connection 703. The memory 702 may be separate and connected to the processor 701 through the connection 703, or may be integrated with the processor 701. The connection 703 may include a path for transmitting information between the above components. The processor 701 includes a GPU. The memory 702 stores computer program instructions, and the processor 701 is configured to call the computer program instructions stored in the memory 702 to perform the following operations:
acquiring a text block image, wherein the text block image comprises one line or a plurality of lines of characters;
extracting features of the text block image through a feature extraction network to obtain a first feature map;
and identifying the characters in the first feature map through a codec to obtain an identification result of the one or more lines of characters, wherein the codec comprises an attention network.
In one embodiment, the feature extraction network may include a feature extraction module and a classification module, and the extracting, by the processor 701, features of the text block image through the feature extraction network to obtain the first feature map includes:
extracting the features of the text block image through the feature extraction module to obtain N feature maps, wherein the N feature maps are different in size, and N is an integer greater than or equal to 4;
and selecting one feature map from the N feature maps through the classification module to obtain the first feature map.
In one embodiment, the feature extraction module may include N CNN blocks, the classification module may include N classification units, and the N CNN blocks are in one-to-one correspondence with the N classification units;
The processor 701 extracting the features of the text block image through the feature extraction module to obtain the N feature maps may include:
inputting the text block image into a first CNN block to obtain a first feature map;
performing dimensionality reduction on the ith feature map by using the (i+1)th CNN block to obtain the (i+1)th feature map, where i = 1, 2, …, N-1;
The processor 701 selecting one feature map from the N feature maps through the classification module to obtain the first feature map may include:
determining a second feature map as the first feature map in a case where the size of the second feature map is the target feature map size corresponding to a first classification unit, wherein the first classification unit is any one of the N classification units, and the second feature map is the one of the N feature maps that corresponds to the first classification unit.
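A minimal sketch of this cascade-and-select logic follows. The stride-2 convolution (which also makes each successive feature map smaller, consistent with the embodiment below) and the channel widths are assumptions; `ClassificationUnit` is the hypothetical unit sketched earlier.

```python
import torch.nn as nn

def make_cnn_block(in_ch: int, out_ch: int) -> nn.Module:
    """One illustrative CNN block: a stride-2 convolution halves the spatial
    size, so each successive feature map is smaller than the previous one."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def extract_and_select(image, cnn_blocks, classification_units):
    """Feeds the text block image through the chain of CNN blocks; the i-th
    feature map is the input of the (i+1)-th block. The first feature map
    whose paired classification unit accepts it (line count in range) is
    returned as the "first feature map" handed to the codec.

    Example wiring for N = 4 (channel widths are assumptions):
        cnn_blocks = [make_cnn_block(3, 64), make_cnn_block(64, 128),
                      make_cnn_block(128, 256), make_cnn_block(256, 512)]
    """
    fmap = image
    for block, unit in zip(cnn_blocks, classification_units):
        fmap = block(fmap)   # next, smaller feature map
        if unit(fmap):       # this unit's threshold range matched
            return fmap
    return fmap              # fall back to the deepest feature map
```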
In one embodiment, the N classification units are in one-to-one correspondence with N threshold ranges, and no two of the N threshold ranges intersect. The jth classification unit includes a pooling layer, an FC and a classification layer. The pooling layer is configured to convert the jth feature map into a feature map of a first size, the FC is configured to perform dimension reduction processing on the converted jth feature map, and the classification layer is configured to determine whether the number of lines of characters included in the jth feature map after dimension reduction is within the threshold range corresponding to the jth classification unit; if so, the size of the jth feature map is determined as the target feature map size corresponding to the jth classification unit, and the jth feature map is determined as the first feature map, where j = 1, 2, …, N.
In one embodiment, the size of the (i+1)th feature map is smaller than the size of the ith feature map.
In one embodiment, the processor 701 is further configured to invoke computer program instructions stored in the memory 702 to perform the following operations:
converting the text block image into an image of a second size;
the extracting, by the processor 701, the feature of the text block image through the feature extraction network to obtain the first feature map may include:
and extracting the features of the converted text block image through a feature extraction network to obtain a first feature map.
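As a sketch of this conversion step, the text block image could be rescaled with bilinear interpolation before feature extraction; the concrete "second size" below is an assumption, since the embodiment does not fix its value.

```python
import torch.nn.functional as F

def resize_to_second_size(text_block_image, second_size=(512, 512)):
    """Rescales a (B, C, H, W) image tensor to a fixed "second size" so the
    feature extraction network always sees a uniform input resolution. The
    512x512 default is an illustrative assumption."""
    return F.interpolate(text_block_image, size=second_size,
                         mode="bilinear", align_corners=False)
```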
In one embodiment, the processor 701 is further configured to invoke computer program instructions stored in the memory 702 to perform the following operations:
inputting the text block image into a text block detection network to obtain a plurality of text block image segments;
the extracting, by the processor 701, the feature of the text block image through the feature extraction network to obtain the first feature map may include:
inputting the plurality of text block image segments into the feature extraction network to obtain a plurality of feature maps;
The processor 701 identifying the characters in the first feature map through the codec to obtain the recognition result may include:
and inputting the plurality of feature maps into a coder-decoder to obtain a recognition result.
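Putting these pieces together, the segmented variant of the pipeline could be sketched as follows; `detector`, `feature_extractor`, and `codec` are hypothetical callables standing in for the text block detection network, the feature extraction network, and the codec described above.

```python
def recognize_text_blocks(page_image, detector, feature_extractor, codec):
    """Hypothetical end-to-end flow: detect text block image segments, extract
    one feature map per segment, then decode each feature map into text."""
    segments = detector(page_image)                # text block image segments
    feature_maps = [feature_extractor(s) for s in segments]
    return [codec(f) for f in feature_maps]        # one recognition result each
```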
The electronic device may also include a transceiver 704. The transceiver 704 is used for outputting information to, and receiving information from, electronic devices other than this electronic device.
The embodiment of the invention also discloses a computer-readable storage medium storing a computer program or computer instructions which, when executed, perform the method in the above method embodiments.
Embodiments of the present invention also disclose a computer program product comprising a computer program or computer instructions, which, when executed, perform the method in the above-described method embodiments.
The foregoing further describes the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (10)

1. A text block recognition method, comprising:
acquiring a text block image, wherein the text block image comprises one line or a plurality of lines of characters;
extracting features of the text block image through a feature extraction network to obtain a first feature map;
and identifying the characters in the first feature map through a codec to obtain an identification result of the one or more lines of characters, wherein the codec comprises an attention network.
2. The method of claim 1, wherein the feature extraction network comprises a feature extraction module and a classification module, and the extracting the features of the text block image through the feature extraction network to obtain the first feature map comprises:
extracting the features of the text block image through the feature extraction module to obtain N feature maps, wherein the N feature maps are different in size, and N is an integer greater than or equal to 4;
and selecting one feature map from the N feature maps through the classification module to obtain a first feature map.
3. The method according to claim 2, wherein the feature extraction module comprises N Convolutional Neural Network (CNN) blocks, the classification module comprises N classification units, and the N CNN blocks are in one-to-one correspondence with the N classification units;
the extracting the features of the text block image by the feature extraction module to obtain N feature maps comprises:
inputting the text block image into a first CNN block to obtain a first feature map;
performing dimensionality reduction on the ith feature map by using the (i+1)th CNN block to obtain the (i+1)th feature map, where i = 1, 2, …, N-1;
the step of selecting one feature map from the N feature maps by the classification module to obtain a first feature map includes:
and in a case where the size of a second feature map is the target feature map size corresponding to a first classification unit, determining the second feature map as the first feature map, wherein the first classification unit is any one of the N classification units, and the second feature map is the one of the N feature maps that corresponds to the first classification unit.
4. The method according to claim 3, wherein the N classification units are in one-to-one correspondence with N threshold ranges, and no two of the N threshold ranges intersect; the jth classification unit comprises a pooling layer, a fully connected layer (FC) and a classification layer; the pooling layer is configured to convert the jth feature map into a feature map of a first size; the FC is configured to perform dimension reduction processing on the converted jth feature map; and the classification layer is configured to determine whether the number of lines of characters comprised in the jth feature map after dimension reduction is within the threshold range corresponding to the jth classification unit, and, if so, to determine the size of the jth feature map as the target feature map size corresponding to the jth classification unit and to determine the jth feature map as the first feature map, where j = 1, 2, …, N.
5. The method according to claim 3 or 4, wherein the size of the (i+1)th feature map is smaller than the size of the ith feature map.
6. The method of claim 1, further comprising:
converting the text block image into an image of a second size;
the extracting the feature of the text block image through the feature extraction network to obtain a first feature map comprises:
and extracting the features of the converted text block image through a feature extraction network to obtain a first feature map.
7. The method of claim 1, further comprising:
inputting the text block image into a text block detection network to obtain a plurality of text block image segments;
the extracting the feature of the text block image through the feature extraction network to obtain a first feature map comprises:
inputting the text block image segments into a feature extraction network to obtain a plurality of feature maps, wherein the feature maps are in one-to-one correspondence with the text block image segments;
the identifying the characters in the first feature map by the codec to obtain an identification result includes:
and inputting the plurality of feature maps into a coder-decoder to obtain a recognition result.
8. A text block recognition apparatus, comprising:
an acquisition unit, configured to acquire a text block image, wherein the text block image comprises one line or a plurality of lines of characters;
an extraction unit, configured to extract features of the text block image through a feature extraction network to obtain a first feature map;
and an identification unit, configured to identify the characters in the first feature map through a codec to obtain a recognition result of the one or more lines of characters, wherein the codec comprises an attention network.
9. An electronic device comprising a processor and a memory, the memory configured to store a set of computer program code, the processor configured to invoke the computer program code stored in the memory to implement the method of any of claims 1-7.
10. A computer-readable storage medium, in which a computer program or computer instructions are stored which, when executed, implement the method according to any one of claims 1 to 7.
CN202110931940.2A 2021-08-13 2021-08-13 Text block identification method and device and electronic equipment Pending CN114283432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110931940.2A CN114283432A (en) 2021-08-13 2021-08-13 Text block identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110931940.2A CN114283432A (en) 2021-08-13 2021-08-13 Text block identification method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114283432A true CN114283432A (en) 2022-04-05

Family

ID=80868451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110931940.2A Pending CN114283432A (en) 2021-08-13 2021-08-13 Text block identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114283432A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724151A (en) * 2022-04-22 2022-07-08 厦门大学 Chinese zither numbered musical notation identification method and system based on convolutional neural network
CN117274438A (en) * 2023-11-06 2023-12-22 杭州同花顺数据开发有限公司 Picture translation method and system
CN117274438B (en) * 2023-11-06 2024-02-20 杭州同花顺数据开发有限公司 Picture translation method and system

Similar Documents

Publication Publication Date Title
CN111985369B (en) Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN111291181B (en) Representation learning for input classification via topic sparse self-encoder and entity embedding
RU2691214C1 (en) Text recognition using artificial intelligence
US11507800B2 (en) Semantic class localization digital environment
Gao et al. Reading scene text with fully convolutional sequence modeling
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111984772B (en) Medical image question-answering method and system based on deep learning
CN114283432A (en) Text block identification method and device and electronic equipment
CN113946681B (en) Text data event extraction method and device, electronic equipment and readable medium
Kaluri et al. A framework for sign gesture recognition using improved genetic algorithm and adaptive filter
Sharma et al. Deep eigen space based ASL recognition system
CN112686345A (en) Off-line English handwriting recognition method based on attention mechanism
CN115066687A (en) Radioactivity data generation
CN115761757A (en) Multi-mode text page classification method based on decoupling feature guidance
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
Islam et al. A simple and mighty arrowhead detection technique of Bangla sign language characters with CNN
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
Wu CNN-Based Recognition of Handwritten Digits in MNIST Database
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN113159053A (en) Image recognition method and device and computing equipment
Choudhary et al. An Optimized Sign Language Recognition Using Convolutional Neural Networks (CNNs) and Tensor-Flow
CN115984699A (en) Illegal billboard detection method, device, equipment and medium based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination