WO2021218706A1 - Text recognition method, apparatus, device and storage medium - Google Patents

Text recognition method, apparatus, device and storage medium

Info

Publication number
WO2021218706A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
training
neural network
text image
Prior art date
Application number
PCT/CN2021/088389
Other languages
English (en)
French (fr)
Chinese (zh)
Inventor
王文佳
刘学博
谢恩泽
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司 filed Critical 北京市商汤科技开发有限公司
Priority to JP2022520075A priority Critical patent/JP2022550195A/ja
Publication of WO2021218706A1 publication Critical patent/WO2021218706A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present disclosure relates to the field of computer vision technology, and in particular to text recognition methods, devices, equipment, and storage media.
  • Low-resolution text images are very common in daily life. For example, the resolution of a text image collected by a terminal device equipped with an image collection device such as a mobile phone may be low. These images have lost detailed content information, resulting in a low recognition accuracy of the text in the image.
  • the traditional text recognition method generally reconstructs the texture of the image first, and then performs text recognition based on the reconstructed image. However, the recognition accuracy of this method is low.
  • the present disclosure provides a text recognition method, device, equipment and storage medium.
  • a text recognition method includes: acquiring a feature map of a first text image, where the feature map includes at least one feature sequence, and the feature sequence is used to represent the correlation between at least two image blocks in the first text image; processing the first text image according to the at least one feature sequence to obtain a second text image, the resolution of the second text image being greater than the resolution of the first text image; and performing text recognition on the second text image.
  • the acquiring a feature map of the first text image includes: acquiring a plurality of channel maps of the first text image and a binary image corresponding to the first text image; and performing feature extraction on the plurality of channel maps and the binary image to obtain the feature map of the first text image.
  • the obtaining a feature map of the first text image includes: inputting the first text image into a pre-trained neural network, and obtaining a feature map output by the neural network.
  • the neural network obtains the feature map based on the following method: generating an intermediate image according to the first text image, the number of channels of the intermediate image is greater than the number of channels of the first text image; Perform feature extraction on the intermediate image to obtain the feature map.
  • the neural network includes at least one convolutional neural network and a bidirectional long short-term memory (BLSTM) network, and an output end of the at least one convolutional neural network is connected to an input end of the bidirectional long short-term memory network;
  • obtaining the feature map of the first text image includes: inputting the first text image into the at least one convolutional neural network and acquiring an intermediate image output by the at least one convolutional neural network; and inputting the intermediate image into the bidirectional long short-term memory network to obtain the feature map output by the bidirectional long short-term memory network.
  • the neural network includes multiple sub-networks connected in sequence; the inputting of the first text image into a pre-trained neural network and obtaining the feature map output by the neural network includes: inputting the i-th output image output by the i-th sub-network of the multiple sub-networks into the (i+1)-th sub-network of the multiple sub-networks, so as to generate an (i+1)-th intermediate image through the (i+1)-th sub-network, and performing feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, where the number of channels of the (i+1)-th intermediate image is greater than the number of channels of the i-th output image; and determining the N-th output image as the feature map; where i and N are positive integers, N is the total number of sub-networks, 1 ≤ i ≤ N−1, and N ≥ 2, and where the first output image is obtained as follows: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
  • the method further includes: before processing the first text image according to the at least one feature sequence, processing the first text image so that the number of channels of the first text image is the same as the number of channels of the feature map.
  • the method further includes: after obtaining the second text image, processing the second text image so that the number of channels of the second text image is the same as the number of channels of the first text image; the performing text recognition on the second text image includes: performing text recognition on the processed second text image.
  • the method further includes: training the neural network based on at least one set of training images, each set of training images includes a first training image and a second training image, the first training image and the The second training image includes the same text; wherein the resolution of the first training image is less than the first resolution threshold, the resolution of the second training image is greater than the second resolution threshold, and the first resolution threshold is less than Or equal to the second resolution threshold.
  • the training the neural network based on at least one set of training images includes: inputting the first training image into the neural network, and obtaining an output image of the neural network; based on the The second training image corresponding to the first training image and the output image determine a loss function; and the neural network is supervised and trained based on the loss function.
  • the loss function includes at least one of a first loss function and a second loss function; the first loss function is determined based on the mean square error between corresponding pixels in the first training image and the second training image, and/or the second loss function is determined based on the difference between the gradient fields of corresponding pixels in the first training image and the second training image.
  • the method further includes aligning the first training image and the second training image before training the neural network based on the at least one set of training images.
  • the aligning the first training image and the second training image includes: processing the first training image through a pre-trained spatial transformation network to transform the first training image The text in is aligned with the text in the second training image.
  • the first training image is obtained by photographing a subject at a first position with a first image acquisition device set to a first focal length, and the second training image is obtained by photographing the subject at the first position with a second image acquisition device set to a second focal length, where the first focal length is smaller than the second focal length.
  • a text recognition device, where the device includes: an acquisition module for acquiring a feature map of a first text image, the feature map including at least one feature sequence, where the feature sequence is used to represent the correlation between at least two image blocks in the first text image; a first processing module for processing the first text image according to the at least one feature sequence to obtain a second text image, the resolution of the second text image being greater than the resolution of the first text image; and a text recognition module for performing text recognition on the second text image.
  • the acquisition module includes: a first acquisition unit configured to acquire a plurality of channel maps of the first text image and a binary image corresponding to the first text image; and a feature extraction unit configured to perform feature extraction on the plurality of channel maps and the binary image to obtain the feature map of the first text image.
  • the acquisition module is configured to: input the first text image into a pre-trained neural network, and acquire a feature map output by the neural network.
  • the neural network obtains the feature map based on the following method: generating an intermediate image according to the first text image, the number of channels of the intermediate image is greater than the number of channels of the first text image; Perform feature extraction on the intermediate image to obtain the feature map.
  • the neural network includes at least one convolutional neural network and a bidirectional long short-term memory (BLSTM) network, and an output end of the at least one convolutional neural network is connected to an input end of the bidirectional long short-term memory network;
  • the acquisition module includes: a second acquisition unit configured to input the first text image into the at least one convolutional neural network and acquire an intermediate image output by the at least one convolutional neural network; and a third acquisition unit configured to input the intermediate image into the bidirectional long short-term memory network and obtain the feature map output by the bidirectional long short-term memory network.
  • the neural network includes multiple sub-networks connected in sequence; the acquisition module is configured to: input the i-th output image output by the i-th sub-network of the multiple sub-networks into the (i+1)-th sub-network of the multiple sub-networks, so as to generate an (i+1)-th intermediate image through the (i+1)-th sub-network, and perform feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, where the number of channels of the (i+1)-th intermediate image is greater than the number of channels of the i-th output image; and determine the N-th output image as the feature map; where i and N are positive integers, N is the total number of sub-networks, 1 ≤ i ≤ N−1, and N ≥ 2, and where the first output image is obtained as follows: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
  • the device further includes: a second processing module, configured to process the first text image before the first text image is processed according to the at least one feature sequence, so that the number of channels of the first text image is the same as the number of channels of the feature map.
  • the device further includes: a third processing module, configured to process the second text image after the second text image is obtained, so that the number of channels of the second text image is the same as the number of channels of the first text image; the text recognition module is used to perform text recognition on the processed second text image.
  • the device further includes: a training module for training the neural network based on at least one set of training images, where each set of training images includes a first training image and a second training image that contain the same text; the resolution of the first training image is less than a first resolution threshold, the resolution of the second training image is greater than a second resolution threshold, and the first resolution threshold is less than or equal to the second resolution threshold.
  • the training module includes: an input unit, configured to input the first training image into the neural network, and obtain an output image of the neural network; and a determining unit, configured based on the first training image The second training image corresponding to the training image and the output image determine a loss function; the training unit is configured to perform supervised training on the neural network based on the loss function.
  • the loss function includes at least one of a first loss function and a second loss function; the first loss function is determined based on the mean square error between corresponding pixels in the first training image and the second training image, and/or the second loss function is determined based on the difference between the gradient fields of corresponding pixels in the first training image and the second training image.
  • the device further includes: an alignment module for aligning the first training image and the second training image before the neural network is trained based on the at least one set of training images.
  • the alignment module is used to process the first training image through a pre-trained spatial transformation network, so as to align the text in the first training image with the text in the second training image.
  • the first training image is obtained by photographing a subject at a first position with a first image acquisition device set to a first focal length, and the second training image is obtained by photographing the subject at the first position with a second image acquisition device set to a second focal length, where the first focal length is smaller than the second focal length.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method described in any of the embodiments is implemented.
  • a computer device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the method described in any of the embodiments when executing the program.
  • a computer program wherein when the computer program is executed by a processor, the method described in any one of the embodiments is implemented.
  • the embodiments of the present disclosure obtain a feature map of a first text image and process the first text image according to at least one feature sequence included in the feature map, so as to obtain a second text image with a higher resolution than the first text image. Because the feature sequences capture the correlation between image blocks in the first text image, the correlation between pieces of text can be effectively exploited to restore the lower-resolution first text image into the higher-resolution second text image.
  • text recognition is performed on the second text image, thereby recognizing the text content in the first text image, which improves the accuracy of text recognition.
  • Fig. 1A is a first schematic diagram of a text image according to an embodiment of the present disclosure.
  • Fig. 1B is a second schematic diagram of a text image according to an embodiment of the present disclosure.
  • Fig. 1C is a third schematic diagram of a text image according to an embodiment of the present disclosure.
  • Fig. 2 is a flowchart of a text recognition method according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of the misalignment phenomenon between images in an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of the overall flow of a text recognition method according to an embodiment of the present disclosure.
  • Fig. 5 is a block diagram of a text recognition device according to an embodiment of the present disclosure.
  • Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
  • first, second, third, etc. may be used in the present disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other.
  • first information may also be referred to as second information, and similarly, the second information may also be referred to as first information.
  • the word "if" as used herein can be interpreted as "when", "upon", or "in response to determining".
  • a scene text image is an image that contains text information captured in a natural scene.
  • the text information in the scene text image may include but is not limited to at least one of an ID number, a ticket, a billboard, a license plate, and the like. Examples of text information are shown in Figs. 1A to 1C.
  • the difficulty of text recognition for scene text images is much greater than that for scanned documents, which leads to a lower recognition accuracy for scene text images than for printed text images.
  • the traditional text recognition method generally uses the color similarity of adjacent pixels in the text image to interpolate between the colors of adjacent pixels according to a predefined method, thereby reconstructing the texture of the text image, and then performs text recognition based on the reconstructed text image.
  • This text recognition method has a higher recognition accuracy rate for relatively clear text images, but the recognition accuracy rate for low-resolution text images drops sharply.
  • an embodiment of the present disclosure provides a text recognition method. As shown in FIG. 2, the method may include step 201 to step 203.
  • Step 201 Obtain a feature map of a first text image, where the feature map includes at least one feature sequence, and the feature sequence is used to represent the correlation between at least two image blocks in the first text image.
  • Step 202 Process the first text image according to the at least one feature sequence to obtain a second text image, where the resolution of the second text image is greater than the resolution of the first text image.
  • Step 203 Perform text recognition on the second text image.
  • the text in the first text image may include at least one of characters, symbols, and numbers.
  • the first text image may be an image captured in a natural scene, and the text in the first text image may be various types of text in a natural scene.
  • the first text image may be an image of an ID card, and the text in the first text image is the ID number and name on the ID card.
  • the first text image may be an image of a billboard, and the text in the first text image is a slogan on the billboard.
  • the first text image may also be an image including printed text.
  • the first text image may be a text image whose resolution is low and the accuracy of text recognition is lower than a preset accuracy threshold.
  • each pixel can be regarded as an image block, and each element in the feature sequence can represent the correlation between adjacent pixels in the first text image.
  • multiple adjacent pixels can also be used as one image block, and each element in the feature sequence can represent the correlation between adjacent image blocks in the first text image.
  • the background of the first text image is monochrome, and the color of the background is generally different from the color of the text. Therefore, the approximate position of the text in the first text image can be determined according to the binary image corresponding to the first text image. In the case of a large difference between the background color and the text color, a more accurate result can be obtained by using a binary image to determine the position of the text.
  • the color of the text in the first text image can be determined according to the channel maps of the first text image. Therefore, in some embodiments, multiple channel maps of the first text image and a binary image corresponding to the first text image can be acquired, and feature extraction is performed on the multiple channel maps and the binary image to obtain a feature map of the first text image.
  • the binary image may be obtained according to the average gray value of the first text image.
  • the average gray value of the pixels in the first text image can be calculated; the gray value of each pixel whose pixel value is greater than the average gray value is set to a first gray value, and the gray value of each pixel whose pixel value is less than or equal to the average gray value is set to a second gray value, where the first gray value is greater than the second gray value. The difference between the first gray value and the second gray value may be greater than a preset pixel value. For example, the first gray value may be 255 and the second gray value may be 0, so that each pixel in the binary image is either a black pixel or a white pixel.
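  • The following is a minimal NumPy sketch of the binarization described above; the function name and the luminance approximation are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

def binarize_by_mean(gray: np.ndarray) -> np.ndarray:
    """Binarize a grayscale image around its average gray value:
    pixels brighter than the mean get the first gray value (255),
    all others get the second gray value (0)."""
    threshold = gray.mean()
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Example: binarize the luminance of an RGB text image.
rgb = np.random.randint(0, 256, size=(32, 128, 3), dtype=np.uint8)
gray = rgb.mean(axis=2)        # simple luminance approximation
mask = binarize_by_mean(gray)  # every pixel is now 0 or 255
```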
  • the channel maps may be the channel maps of the R, G, and B channels of an RGB (Red Green Blue) image, or may be channel maps of other channels used to characterize the color of the image.
  • the first text image can be input to a pre-trained neural network, and a feature map output by the neural network can be obtained.
  • the neural network may be a convolutional neural network (Convolutional Neural Networks, CNN), a long-short-term memory network (Long-Short Term Memory, LSTM), or other types of neural networks, or a neural network composed of multiple neural networks.
  • the neural network may first generate an intermediate image according to the first text image, where the number of channels of the intermediate image is greater than the number of channels of the first text image, and then perform feature extraction on the intermediate image to obtain the feature map.
  • the neural network may include at least one convolutional neural network and a bidirectional long short-term memory (BLSTM) network; the convolutional neural networks in the at least one convolutional neural network are connected in sequence, and the bidirectional long short-term memory network is connected to the last of the at least one convolutional neural network.
  • the intermediate image may be generated through the at least one convolutional neural network, and feature extraction may be performed through the bidirectional long short-term memory network, for example as sketched below.
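  • As an illustration only, one such sub-network could be sketched in PyTorch as follows: convolutions widen the channel count to form the intermediate image, and a bidirectional LSTM then scans each row so that every position is related to its horizontal neighbours, yielding the feature sequences described above. All layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvBLSTMBlock(nn.Module):
    """Hypothetical feature-extraction sub-network: CNN + BLSTM."""

    def __init__(self, in_ch: int = 4, mid_ch: int = 64):
        super().__init__()
        # Convolutions produce the intermediate image with more channels.
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1),
        )
        # Bidirectional LSTM over the width; hidden size is chosen so
        # the two concatenated directions match mid_ch channels.
        self.blstm = nn.LSTM(mid_ch, mid_ch // 2,
                             bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x)                         # intermediate image
        b, c, h, w = feat.shape
        seq = feat.permute(0, 2, 3, 1).reshape(b * h, w, c)
        seq, _ = self.blstm(seq)                    # row-wise correlations
        return seq.reshape(b, h, w, c).permute(0, 3, 1, 2)

x = torch.randn(1, 4, 16, 64)      # e.g. RGB channels plus binary mask
feature_map = ConvBLSTMBlock()(x)  # -> (1, 64, 16, 64)
```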
  • the neural network includes a plurality of sub-networks connected in sequence, and the structure of each sub-network may be the same as the structure of a single neural network in the above-mentioned embodiment, which will not be repeated here.
  • in some embodiments, the i-th output image output by the i-th sub-network among the multiple sub-networks can be input into the (i+1)-th sub-network, so as to generate an (i+1)-th intermediate image through the (i+1)-th sub-network, and feature extraction is performed on the (i+1)-th intermediate image to obtain an (i+1)-th output image, where the number of channels of the (i+1)-th intermediate image is greater than the number of channels of the i-th output image. The N-th output image is determined as the feature map, where i and N are positive integers, N is the total number of sub-networks, 1 ≤ i ≤ N−1, and N ≥ 2. The first output image is obtained as follows: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
  • the first sub-network generates a first intermediate image based on the first text image, performs feature extraction on the first intermediate image to obtain the first output image, and inputs the first output image to the second sub-network, where the first The number of channels in the intermediate image is greater than the number of channels in the first text image.
  • the second sub-network generates a second intermediate image based on the first output image, extracts features from the second intermediate image to obtain the second output image, and inputs the second output image to the third sub-network, where the channel of the second intermediate image The number is greater than the number of channels of the first output image. And so on.
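  • A compact sketch of such a cascade, under assumed channel sizes, might look like the following; each sub-network first widens the channels (the intermediate image) and then extracts features (the output image), and the last output image serves as the feature map.

```python
import torch
import torch.nn as nn

def make_subnetwork(in_ch: int, mid_ch: int) -> nn.Module:
    # Widen channels into the intermediate image, then extract features.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
    )

channels = [4, 32, 64, 128]              # N = 3 sub-networks (illustrative)
cascade = nn.Sequential(*[
    make_subnetwork(channels[i], channels[i + 1])
    for i in range(len(channels) - 1)
])
feature_map = cascade(torch.randn(1, 4, 16, 64))  # -> (1, 128, 16, 64)
```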
  • the first text image may be up-sampled using an up-sampling method such as pixel shuffle to obtain a second text image corresponding to the first text image.
  • the first text image is processed so that the number of channels of the first text image is the same as the number of channels of the feature map.
  • the processed first text image is processed according to the feature sequence in the feature map to obtain the second text image.
  • the process of processing the first text image to increase the number of channels of the first text image can be implemented by using a convolutional neural network.
  • the second text image can also be processed so that the number of channels of the second text image is the same as the number of channels of the first text image, that is, Restore the second text image to four channels.
  • This process can also be implemented by a convolutional neural network.
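  • A hedged PyTorch sketch of these steps, with assumed channel counts: pixel shuffle trades C·r² channels for an r-times larger image, and convolutions adjust the channel count before and after.

```python
import torch
import torch.nn as nn

upscale = 2
to_pixels = nn.Sequential(
    # Map the 64-channel feature map to 4 * r^2 channels...
    nn.Conv2d(64, 4 * upscale ** 2, kernel_size=3, padding=1),
    # ...then rearrange channels into space: (B,16,H,W) -> (B,4,2H,2W).
    nn.PixelShuffle(upscale),
)
restore = nn.Conv2d(4, 4, kernel_size=1)  # restore the channel count

feature_map = torch.randn(1, 64, 16, 64)
second_text_image = restore(to_pixels(feature_map))  # -> (1, 4, 32, 128)
```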
  • the neural network used in step 201 may be trained based on multiple sets of training images.
  • each set of training images includes a first training image and a second training image that contain the same text, where the resolution of the first training image is less than a preset first resolution threshold, the resolution of the second training image is greater than a preset second resolution threshold, and the first resolution threshold is less than or equal to the second resolution threshold.
  • the first training image may be referred to as a low resolution (Low Resolution, LR) image
  • the second training image may be referred to as a high resolution (High Resolution, HR) image.
  • a text image data set may be established in advance.
  • the text image data set may include a plurality of text image pairs, and each text image pair includes a low-resolution text image and a high-resolution text image corresponding to the low-resolution text image.
  • the text in the text image pair may be text in various natural scenes, and the natural scene may include, but is not limited to, at least one of scenes such as streets, libraries, shops, and interiors of vehicles.
  • the following neural networks can also be combined into a general neural network, and the general neural network is trained directly with the first training images and the second training images: the neural network used for feature extraction to obtain the feature map, the convolutional neural network used to process the first text image to increase its number of channels before feature extraction, and the convolutional neural network used to restore the number of channels of the second text image after the second text image is obtained.
  • the first training image may be input into the neural network to obtain an output image of the neural network; a loss function may be determined based on the output image and the second training image corresponding to the first training image; and supervised training may be performed on the neural network based on the loss function.
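  • The supervised training step could be sketched as follows; the placeholder network, tensor sizes, and optimizer settings are assumptions, and the L2 pixel loss stands in for the first loss function (a gradient term can be added as sketched further below).

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(              # placeholder SR network
    torch.nn.Conv2d(3, 3 * 4, 3, padding=1),
    torch.nn.PixelShuffle(2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

lr_img = torch.rand(8, 3, 16, 64)         # first (low-res) training images
hr_img = torch.rand(8, 3, 32, 128)        # second (high-res) training images

output = model(lr_img)                    # output image of the network
loss = F.mse_loss(output, hr_img)         # first loss function (L2)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```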
  • the loss function may be various types of loss functions, or a combination of two or more loss functions.
  • the loss function includes at least one of a first loss function and a second loss function. The first loss function may be determined based on the mean square error between corresponding pixels in the first training image and the second training image; for example, it may be the L2 loss function.
  • the second loss function may be determined based on the difference between the gradient fields of each corresponding pixel in the first training image and the second training image, for example, may be a gradient profile loss function (Gradient Profile Loss, GPL) .
  • the gradient profile loss function L_GP is defined as follows:

    L_GP = E_{x_0 ≤ x ≤ x_1} ‖ ∇I_HR(x) − ∇I_SR(x) ‖_1

    where x_0 represents the lower pixel limit, x_1 represents the upper pixel limit, E represents the computed energy over the pixels, ∇I_HR and ∇I_SR denote the gradient fields of the high-resolution image and the output image respectively, and the subscript 1 indicates that the L1 loss is calculated.
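  • One way to realize such a loss in PyTorch is sketched below; the finite-difference gradient operator is an assumption, since the disclosure does not fix a particular way of computing the gradient field.

```python
import torch

def gradient_profile_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Compare the gradient fields of the output and the HR image
    under an L1 penalty, rewarding sharp text/background boundaries."""
    def grad_field(img):
        dx = img[..., :, 1:] - img[..., :, :-1]   # horizontal gradient
        dy = img[..., 1:, :] - img[..., :-1, :]   # vertical gradient
        return dx, dy

    sr_dx, sr_dy = grad_field(sr)
    hr_dx, hr_dy = grad_field(hr)
    return (sr_dx - hr_dx).abs().mean() + (sr_dy - hr_dy).abs().mean()

loss = gradient_profile_loss(torch.rand(1, 3, 32, 128),
                             torch.rand(1, 3, 32, 128))
```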
  • the gradient field vividly shows the text characteristics and background characteristics of the text image.
  • the LR image always has a wider gradient field curve, while the HR image has a narrower gradient field curve.
  • the gradient field curve can be compressed to be narrower without performing complicated mathematical operations. Therefore, by using the gradient profile loss function, the sharp boundary between the text feature and the background feature can be reconstructed, which helps to better distinguish the text and the background, and can produce clearer shapes, making the trained neural network more reliable.
  • in some related methods, low-resolution images are artificially generated by down-sampling high-resolution images (low-resolution images generated in this way are called artificial low-resolution images), and the model is then trained on these artificial low-resolution images. However, artificial low-resolution images differ considerably from real low-resolution images (for example, images that are low-resolution because of a long shooting focal length): the text in real text images has varied shapes, scattered lighting, and different backgrounds. Therefore, a model trained on artificial low-resolution images cannot obtain the feature map of a real text image well, resulting in low text recognition accuracy.
  • the first training image and the second training image used in the embodiments of the present disclosure are both real images, that is, images taken at different focal lengths.
  • the first training image is obtained by photographing the subject at a first position with a first image acquisition device provided with a first focal length, and the second training image is obtained by photographing the subject at the first position with a second image acquisition device provided with a second focal length, where the first focal length is smaller than the second focal length.
  • the first image acquisition device and the second image acquisition device may be the same image acquisition device, or may be different image acquisition devices.
  • the value of the first focal length may be between 24 mm and 120 mm, for example, it may be 70 mm.
  • the value of the second focal length may be between 120 mm and 240 mm, for example, it may be 150 mm.
  • there may be multiple first focal lengths and multiple second focal lengths, and each of the multiple first focal lengths is smaller than the smallest of the multiple second focal lengths.
  • for example, the first focal length may include 35 mm, 50 mm, 70 mm, etc., and the second focal length may include 150 mm, 170 mm, 190 mm, etc.
  • when constructing the training images, the area including the text is first cropped from the two text images in a text image pair: the image area cropped from the low-resolution text image is used as the first training image, and the image area cropped from the high-resolution text image is used as the second training image.
  • the cropped first training image and the second training image have the same size.
  • one image in the text image pair is generally used as a reference image, the position of the region to be cropped is obtained in the reference image, and the other image is then cropped according to that position.
  • the high-resolution image in the text image pair can be used as the reference image, and the low-resolution image can be cropped according to the position of the text in the high-resolution image.
  • because the two images in a pair are captured at different focal lengths, the position of the center point of each image will differ. Therefore, with the above cropping method, the position of the text in the resulting first training image and second training image will also differ; this phenomenon is called misalignment, as shown in Fig. 3. Misalignment causes the model to mistakenly match the background part of one image with the text part of the other image, thereby learning wrong pixel correspondences and causing ghosting problems.
  • the first training image and the second training image may also be aligned.
  • the first training image can be processed through a pre-trained model, so that the first training image is aligned with the second training image.
  • the model can interpolate and translate the first training image, thereby aligning the first training image with the second training image.
  • the pre-trained model may be a spatial transformation network (Spatial Transformation Networks, STN).
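  • A toy spatial transformation network in PyTorch might look like the following: a small localization network predicts an affine transform (initialized to the identity) and the image is resampled accordingly, which covers the interpolation and translation mentioned above. Everything here is illustrative and far smaller than a practical STN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySTN(nn.Module):
    def __init__(self):
        super().__init__()
        self.localize = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 6),                  # 6 affine parameters
        )
        # Start from the identity transform.
        self.localize[-1].weight.data.zero_()
        self.localize[-1].bias.data.copy_(
            torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = self.localize(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

aligned = TinySTN()(torch.rand(1, 3, 16, 64))  # LR image aligned toward HR
```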
  • each group of training images includes one first training image and one second training image. In order to better recognize the images, all images can be rotated to the horizontal direction, and the neural network can then be trained on the rotated first training images and second training images.
  • the first training image whose pixel size is smaller than a first size can be up-sampled so that the first training image reaches the first size, and the second training image whose pixel size is smaller than a second size can be up-sampled so that the second training image reaches the second size, where the first size is smaller than the second size.
  • when the pixel height of a text image reaches 16, the reconstruction of the text image can greatly improve the text recognition effect.
  • the first size may be set to a pixel size of 64 ⁇ 16.
  • the second size may be set to a pixel size of 128 ⁇ 32.
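  • Such size normalization can be done with standard interpolation; the bicubic mode below is an assumption.

```python
import torch
import torch.nn.functional as F

lr_img = torch.rand(1, 3, 12, 48)   # smaller than the first size
hr_img = torch.rand(1, 3, 24, 96)   # smaller than the second size

# Up-sample to 64x16 (first size) and 128x32 (second size) pixels.
lr_img = F.interpolate(lr_img, size=(16, 64), mode='bicubic',
                       align_corners=False)
hr_img = F.interpolate(hr_img, size=(32, 128), mode='bicubic',
                       align_corners=False)
```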
  • when testing the trained neural network, the test set can be divided into three subsets, where the resolution of the low-resolution images in the first subset is less than a preset third resolution threshold, the resolution of the low-resolution images in the second subset is greater than the third resolution threshold and less than a preset fourth resolution threshold, and the resolution of the low-resolution images in the third subset is greater than the preset fourth resolution threshold; the third resolution threshold is smaller than the fourth resolution threshold.
  • the third resolution threshold and the fourth resolution threshold may be set according to the resolution range of the low-resolution images in the test set.
  • the performance of the neural network can be tested through three subsets, and the performance of the neural network can be determined according to the test results corresponding to the three subsets.
  • Fig. 4 shows the overall flow of the text recognition method according to an embodiment of the present disclosure.
  • the first training image is input into the neural network, where the neural network includes a neural network for feature extraction and a neural network (for example, a convolutional neural network) for increasing and decreasing the number of channels of the first text image, and may also include a neural network for aligning the training images, for example, a spatial transformation network.
  • each neural network used for feature extraction may be referred to as a sequential residual block (Sequential Residual Block, SRB), and each SRB may include two convolutional neural networks and a bidirectional long short-term memory network (BLSTM).
  • the first training image processed by the neural network is input into a plurality of cascaded sequential residual blocks for feature extraction to obtain a feature map of the first training image.
  • the feature map is up-sampled by the up-sampling module, and then the number of channels of the up-sampled image is restored to the original number of channels through the convolutional neural network to obtain the output image corresponding to the first training image.
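  • Putting the pieces together, the flow of Fig. 4 could be approximated by the sketch below; the plain convolutional blocks stand in for the sequential residual blocks, and every size is an assumption.

```python
import torch
import torch.nn as nn

class TextSRPipeline(nn.Module):
    """Illustrative end-to-end flow: raise the channel count, apply
    cascaded feature-extraction blocks, up-sample via pixel shuffle,
    and restore the original channel count."""

    def __init__(self, blocks: int = 5, ch: int = 64):
        super().__init__()
        self.expand = nn.Conv2d(4, ch, 3, padding=1)      # raise channels
        self.body = nn.Sequential(*[                      # stand-ins for SRBs
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(blocks)
        ])
        self.upsample = nn.Sequential(
            nn.Conv2d(ch, 4 * 4, 3, padding=1),
            nn.PixelShuffle(2),                           # double H and W
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.upsample(self.body(self.expand(x)))

sr = TextSRPipeline()(torch.rand(1, 4, 16, 64))  # -> (1, 4, 32, 128)
```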
  • the first text image to be processed is input into the general neural network, and the output image of the general neural network is the second text image. Text recognition is then performed on the second text image to obtain a text recognition result.
  • the writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • a text recognition device is provided, which includes:
  • the acquiring module 501 is configured to acquire a feature map of a first text image, where the feature map includes at least one feature sequence, and the feature sequence is used to represent the correlation between at least two image blocks in the first text image;
  • the first processing module 502 is configured to process the first text image according to the at least one feature sequence to obtain a second text image, where the resolution of the second text image is greater than the resolution of the first text image;
  • the text recognition module 503 is used to perform text recognition on the second text image.
  • the acquisition module includes: a first acquisition unit configured to acquire a plurality of channel maps of the first text image and a binary image corresponding to the first text image; and a feature extraction unit configured to perform feature extraction on the plurality of channel maps and the binary image to obtain the feature map of the first text image.
  • the acquisition module is configured to: input the first text image into a pre-trained neural network, and acquire a feature map output by the neural network.
  • the neural network obtains the feature map based on the following method: generating an intermediate image according to the first text image, the number of channels of the intermediate image is greater than the number of channels of the first text image; Perform feature extraction on the intermediate image to obtain the feature map.
  • the neural network includes at least one convolutional neural network and a bidirectional long short-term memory (BLSTM) network, and an output end of the at least one convolutional neural network is connected to an input end of the bidirectional long short-term memory network;
  • the acquisition module includes: a second acquisition unit configured to input the first text image into the at least one convolutional neural network and acquire an intermediate image output by the at least one convolutional neural network; and a third acquisition unit configured to input the intermediate image into the bidirectional long short-term memory network and obtain the feature map output by the bidirectional long short-term memory network.
  • the neural network includes multiple sub-networks connected in sequence; the acquisition module is configured to: input the i-th output image output by the i-th sub-network of the multiple sub-networks into the (i+1)-th sub-network of the multiple sub-networks, so as to generate an (i+1)-th intermediate image through the (i+1)-th sub-network, and perform feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, where the number of channels of the (i+1)-th intermediate image is greater than the number of channels of the i-th output image; and determine the N-th output image as the feature map; where i and N are positive integers, N is the total number of sub-networks, 1 ≤ i ≤ N−1, and N ≥ 2, and where the first output image is obtained as follows: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
  • the device further includes: a second processing module, configured to process the first text image before the first text image is processed according to the at least one feature sequence, so that the number of channels of the first text image is the same as the number of channels of the feature map.
  • the device further includes: a third processing module, configured to process the second text image after the second text image is obtained, so that the number of channels of the second text image is the same as the number of channels of the first text image; the text recognition module is used to perform text recognition on the processed second text image.
  • the device further includes: a training module for training the neural network based on at least one set of training images, where each set of training images includes a first training image and a second training image that contain the same text; the resolution of the first training image is less than a first resolution threshold, the resolution of the second training image is greater than a second resolution threshold, and the first resolution threshold is less than or equal to the second resolution threshold.
  • the training module includes: an input unit, configured to input the first training image into the neural network, and obtain an output image of the neural network; and a determining unit, configured based on the first training image The second training image corresponding to the training image and the output image determine a loss function; the training unit is configured to perform supervised training on the neural network based on the loss function.
  • the loss function includes at least one of a first loss function and a second loss function; the first loss function is determined based on the mean square error between corresponding pixels in the first training image and the second training image, and/or the second loss function is determined based on the difference between the gradient fields of corresponding pixels in the first training image and the second training image.
  • the device further includes an alignment module for aligning the first training image and the second training image before training the neural network based on the at least one set of training images .
  • the alignment module is used to process the first training image through a pre-trained spatial transformation network, so as to align the text in the first training image with the text in the second training image.
  • the first training image is obtained by photographing a subject at a first position with a first image acquisition device set to a first focal length, and the second training image is obtained by photographing the subject at the first position with a second image acquisition device set to a second focal length, where the first focal length is smaller than the second focal length.
  • the functions or modules contained in the device provided in the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments.
  • the embodiments of the present specification also provide a computer device, which includes at least a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the method described above when executing the program.
  • the embodiments of the present disclosure also provide a computer device, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the processor implements the method described in any embodiment when executing the program.
  • FIG. 6 shows a more specific hardware structure diagram of a computing device provided by an embodiment of this specification.
  • the device may include a processor 601, a memory 602, an input/output interface 603, a communication interface 604, and a bus 605.
  • the processor 601, the memory 602, the input/output interface 603, and the communication interface 604 realize the communication connection between each other in the device through the bus 605.
  • the processor 601 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, for executing related programs to implement the technical solutions provided in the embodiments of this specification.
  • the memory 602 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, etc.
  • the memory 602 may store an operating system and other application programs.
  • related program codes are stored in the memory 602 and called and executed by the processor 601.
  • the input/output interface 603 is used to connect an input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and an output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 604 is used to connect a communication module (not shown in the figure) to realize the communication interaction between the device and other devices.
  • the communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
  • the bus 605 includes a path for transmitting information between various components of the device (for example, the processor 601, the memory 602, the input/output interface 603, and the communication interface 604).
  • the device may also include other components necessary for normal operation.
  • the above-mentioned device may also include only the components necessary to implement the solutions of the embodiments of the present specification, and not necessarily include all the components shown in the figures.
  • the embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method described in any of the foregoing embodiments is implemented.
  • the embodiments of this specification also provide a computer program, wherein the computer program implements the method described in any of the foregoing embodiments when the computer program is executed by a processor.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. According to the definition in this article, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • a typical implementation device is a computer.
  • the specific form of the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email receiving and sending device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • the various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the difference from other embodiments.
  • the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
  • the device embodiments described above are merely illustrative.
  • the modules described as separate components may or may not be physically separated. When implementing the solutions of the embodiments of this specification, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)
  • Image Processing (AREA)
PCT/CN2021/088389 2020-04-30 2021-04-20 Text recognition method, apparatus, device and storage medium WO2021218706A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022520075A 2020-04-30 2021-04-20 Text recognition method, apparatus, device, storage medium and computer program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010362519.XA 2020-04-30 Text recognition method, apparatus, device and storage medium
CN202010362519.X 2020-04-30

Publications (1)

Publication Number Publication Date
WO2021218706A1 true WO2021218706A1 (zh) 2021-11-04

Family

ID=72000292

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088389 WO2021218706A1 (zh) 2021-04-20 Text recognition method, apparatus, device and storage medium

Country Status (3)

Country Link
JP (1) JP2022550195A (ja)
CN (1) CN111553290A (ja)
WO (1) WO2021218706A1 (ja)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553290A (zh) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, apparatus, device and storage medium
CN112633429A (zh) * 2020-12-21 2021-04-09 安徽七天教育科技有限公司 Method for recognizing students' handwritten multiple-choice questions
CN117037136B (zh) * 2023-10-10 2024-02-23 中国科学技术大学 Scene text recognition method, system, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389091A (zh) * 2018-10-22 2019-02-26 重庆邮电大学 Character recognition system and method based on a combination of neural network and attention mechanism
CN110033000A (zh) * 2019-03-21 2019-07-19 华中科技大学 Text detection and recognition method for bill images
CN110084172A (zh) * 2019-04-23 2019-08-02 北京字节跳动网络技术有限公司 Character recognition method, apparatus and electronic device
CN110168573A (zh) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image captioning
CN111553290A (zh) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, apparatus, device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10043231B2 (en) * 2015-06-30 2018-08-07 Oath Inc. Methods and systems for detecting and recognizing text from images
CN107368831B (zh) * 2017-07-19 2019-08-02 中国人民解放军国防科学技术大学 Method for recognizing English characters and digits in natural scene images
CN109800749A (zh) * 2019-01-17 2019-05-24 湖南师范大学 Character recognition method and apparatus
CN110443239A (zh) * 2019-06-28 2019-11-12 平安科技(深圳)有限公司 Method and apparatus for recognizing text images

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110168573A (zh) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image captioning
CN109389091A (zh) * 2018-10-22 2019-02-26 重庆邮电大学 Character recognition system and method based on a combination of neural network and attention mechanism
CN110033000A (zh) * 2019-03-21 2019-07-19 华中科技大学 Text detection and recognition method for bill images
CN110084172A (zh) * 2019-04-23 2019-08-02 北京字节跳动网络技术有限公司 Character recognition method, apparatus and electronic device
CN111553290A (zh) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN111553290A (zh) 2020-08-18
JP2022550195A (ja) 2022-11-30

Similar Documents

Publication Publication Date Title
WO2021218706A1 (zh) Text recognition method, apparatus, device and storage medium
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
US20200364478A1 (en) Method and apparatus for liveness detection, device, and storage medium
CN110059728B (zh) 基于注意力模型的rgb-d图像视觉显著性检测方法
US10062195B2 (en) Method and device for processing a picture
Lu et al. Robust blur kernel estimation for license plate images from fast moving vehicles
WO2014014678A1 (en) Feature extraction and use with a probability density function and divergence|metric
US20210256657A1 (en) Method, system, and computer-readable medium for improving quality of low-light images
US20200279166A1 (en) Information processing device
JP2013541119A (ja) オブジェクト認識における特徴生成を改善するためのシステム及び方法
US8873839B2 (en) Apparatus of learning recognition dictionary, and method of learning recognition dictionary
CN108876716B (zh) 超分辨率重建方法及装置
CN109740542B (zh) 基于改进型east算法的文本检测方法
Li et al. CG-DIQA: No-reference document image quality assessment based on character gradient
CN103198299A (zh) 基于多方向尺度与Gabor相位投影特征结合的人脸识别方法
CN113436222A (zh) 图像处理方法、图像处理装置、电子设备及存储介质
CN115187456A (zh) 基于图像强化处理的文本识别方法、装置、设备及介质
US20200286254A1 (en) Information processing device
CN112348008A (zh) 证件信息的识别方法、装置、终端设备及存储介质
CN115393868B (zh) 文本检测方法、装置、电子设备和存储介质
CN116469172A (zh) 一种多时间尺度下的骨骼行为识别视频帧提取方法及系统
CN113112531B (zh) 一种图像匹配方法及装置
WO2022034678A1 (en) Image augmentation apparatus, control method, and non-transitory computer-readable storage medium
CN111402281B (zh) 一种书籍边缘检测方法及装置
CN114170589A (zh) 一种基于nas的岩石岩性识别方法、终端设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21795694

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022520075

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21795694

Country of ref document: EP

Kind code of ref document: A1