WO2023109086A1 - Character recognition method, apparatus, device, and storage medium - Google Patents

Character recognition method, apparatus, device, and storage medium

Info

Publication number
WO2023109086A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
processed
different scales
processing
Prior art date
2021-12-15
Application number
PCT/CN2022/102163
Other languages
English (en)
French (fr)
Inventor
文玉茹
卢道和
杨军
程志峰
李勋棋
罗海湾
何勇彬
陈鉴镔
胡仲臣
陈刚
周佳振
朱嘉伟
郭英亚
李兴龙
周琪
熊思清
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2023109086A1 publication Critical patent/WO2023109086A1/zh


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Definitions

  • The present application relates to image recognition technology in financial technology (Fintech), and in particular to a character recognition method, apparatus, device, and storage medium.
  • Image recognition technology mainly refers to using computers to process images captured at the front end of a system according to established goals.
  • In the field of artificial intelligence, neural networks are the most widely used approach to image recognition.
  • Neural network models can implement functions such as face recognition, image detection, image classification, object tracking, and character recognition. Among them, face recognition, image classification, and character recognition have achieved good results after a long period of development.
  • Character recognition generally refers to technology in which various devices, including computers, automatically recognize characters; it has important applications in many fields of modern society. However, once an image is deformed or its angle shifts, existing image recognition technology lacks the equivariance property, which lowers the character recognition rate and prevents an ideal recognition result.
  • To solve the problems in the prior art, the present application provides a character recognition method, apparatus, device, and storage medium.
  • In a first aspect, an embodiment of the present application provides a character recognition method, the method comprising:
  • acquiring an image to be processed, where the image to be processed carries one or more characters; performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed; obtaining, according to the image features, multiple text boxes of different scales in the image to be processed, and performing text box regression processing on the multiple text boxes of different scales; and determining, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and performing character recognition on the image to be processed based on the position of the one or more characters.
  • In a possible implementation, performing feature extraction on the image to be processed to obtain the image features corresponding to the image to be processed includes:
  • performing feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image to be processed, where the densely connected network includes one or more dense blocks, there is a direct connection between any two dense blocks in the densely connected network, and the input of each dense block is the union of the outputs of all preceding dense blocks.
  • In a possible implementation, the densely connected network further includes one or more transition connection layers, each transition connection layer includes a 1×1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers;
  • performing feature extraction on the image to be processed based on the densely connected network to obtain the image features corresponding to the image to be processed then includes: performing feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers to obtain the image features corresponding to the image to be processed.
  • In a possible implementation, obtaining, according to the image features, multiple text boxes of different scales in the image to be processed and performing text box regression processing on the multiple text boxes of different scales includes:
  • obtaining, according to the image features, multiple text boxes of different scales in the image to be processed, and determining offset data of the multiple text boxes of different scales; and performing, based on the offset data, text box regression processing on the multiple text boxes of different scales.
  • In a possible implementation, obtaining, according to the image features, multiple text boxes of different scales in the image to be processed and determining the offset data of the multiple text boxes of different scales includes:
  • performing downsampling processing on the image features, and performing downsampling and convolution processing on the downsampled image features; and taking the image features after the downsampling and convolution processing as the new downsampled image features, and re-executing the step of performing downsampling and convolution processing on the downsampled image features, until the multiple text boxes of different scales in the image to be processed are obtained and the offset data of the multiple text boxes of different scales are determined.
  • In a possible implementation, determining, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed includes:
  • obtaining, according to the multiple text boxes of different scales after the text box regression processing and a preset score model, scores of the multiple text boxes of different scales after the text box regression processing, where the preset score model is used to determine the scores of the multiple text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales and each of the multiple text boxes of different scales; and
  • calculating, according to the scores of the multiple text boxes of different scales after the text box regression processing, the positions of the multiple text boxes of different scales after the text box regression processing, and determining, based on those positions, the position of the one or more characters in the image to be processed.
  • In a possible implementation, calculating, according to the scores of the multiple text boxes of different scales after the text box regression processing, the positions of the multiple text boxes of different scales after the text box regression processing includes:
  • calculating the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales after the text box regression processing and a text box i after the text box regression processing, where the text box i is any one of the multiple text boxes of different scales after the text box regression processing, i = 1, …, n, n is an integer, and n is determined according to the number of the multiple text boxes of different scales after the text box regression processing; and
  • if the calculated ratio is smaller than a preset threshold, calculating the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
  • In a possible implementation, before performing feature extraction on the image to be processed to obtain the image features corresponding to the image to be processed, the method further includes: performing parameter reduction processing on the image to be processed.
  • Performing feature extraction on the image to be processed to obtain the image features corresponding to the image to be processed then includes:
  • performing feature extraction on the image to be processed after the parameter reduction processing to obtain the image features corresponding to the image to be processed.
  • In a possible implementation, performing parameter reduction processing on the image to be processed includes: performing parameter reduction processing on the image to be processed using three 3×3 convolutional layers and one 2×2 pooling layer, where the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
  • In a possible implementation, performing character recognition on the image to be processed based on the position of the one or more characters includes: recognizing the characters in the image to be processed based on the position of the one or more characters and a preset recognition model, where the preset recognition model is used to recognize the characters in an image according to the positions of the characters in the image.
  • In a second aspect, an embodiment of the present application provides a character recognition apparatus, the apparatus including:
  • an image acquisition module, configured to acquire an image to be processed, where the image to be processed carries one or more characters;
  • a feature extraction module, configured to perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed;
  • a text box processing module, configured to obtain, according to the image features, multiple text boxes of different scales in the image to be processed, and perform text box regression processing on the multiple text boxes of different scales; and
  • a character recognition module, configured to determine, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and perform character recognition on the image to be processed based on the position of the one or more characters.
  • In a possible implementation, the feature extraction module is specifically configured to: perform feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image to be processed, where the densely connected network includes one or more dense blocks, there is a direct connection between any two dense blocks in the densely connected network, and the input of each dense block is the union of the outputs of all preceding dense blocks.
  • In a possible implementation, the densely connected network further includes one or more transition connection layers, each transition connection layer includes a 1×1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers. The feature extraction module is then specifically configured to: perform feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers to obtain the image features corresponding to the image to be processed.
  • In a possible implementation, the text box processing module is specifically configured to: obtain, according to the image features, multiple text boxes of different scales in the image to be processed, and determine offset data of the multiple text boxes of different scales; and perform, based on the offset data, text box regression processing on the multiple text boxes of different scales.
  • In a possible implementation, the text box processing module is specifically configured to: perform downsampling processing on the image features, and perform downsampling and convolution processing on the downsampled image features; and take the image features after the downsampling and convolution processing as the new downsampled image features, and re-execute the step of performing downsampling and convolution processing on the downsampled image features, until the multiple text boxes of different scales in the image to be processed are obtained and the offset data of the multiple text boxes of different scales are determined.
  • In a possible implementation, the character recognition module is specifically configured to: obtain, according to the multiple text boxes of different scales after the text box regression processing and a preset score model, scores of the multiple text boxes of different scales after the text box regression processing, where the preset score model is used to determine the scores of the multiple text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales and each of the multiple text boxes of different scales; and calculate, according to those scores, the positions of the multiple text boxes of different scales after the text box regression processing, and determine, based on those positions, the position of the one or more characters in the image to be processed.
  • In a possible implementation, the character recognition module is specifically configured to: calculate the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales after the text box regression processing and a text box i after the text box regression processing, where the text box i is any one of the multiple text boxes of different scales after the text box regression processing, i = 1, …, n, n is an integer, and n is determined according to the number of the multiple text boxes of different scales after the text box regression processing; and, if the calculated ratio is smaller than a preset threshold, calculate the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
  • In a possible implementation, the feature extraction module is specifically configured to: perform parameter reduction processing on the image to be processed; and perform feature extraction on the image to be processed after the parameter reduction processing to obtain the image features corresponding to the image to be processed.
  • In a possible implementation, the feature extraction module is specifically configured to: perform the parameter reduction processing on the image to be processed using three 3×3 convolutional layers and one 2×2 pooling layer, where the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
  • In a possible implementation, the character recognition module is specifically configured to: recognize the characters in the image to be processed based on the position of the one or more characters and a preset recognition model, where the preset recognition model is used to recognize the characters in an image according to the positions of the characters in the image.
  • In a third aspect, an embodiment of the present application provides a character recognition device, including: a processor, a memory, and a computer program;
  • where the computer program is stored in the memory and configured to be executed by the processor, the computer program including instructions for performing the method according to the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that causes a server to execute the method according to the first aspect.
  • In a fifth aspect, an embodiment of the present application provides a computer program product including computer instructions that, when executed by a processor, perform the method according to the first aspect.
  • With the method, an image to be processed carrying one or more characters is acquired; feature extraction is performed on the image to obtain image features; multiple text boxes of different scales in the image are obtained according to the image features, and text box regression processing is performed on the multiple text boxes of different scales, addressing image deformation and angular movement; the position of the characters in the image is then determined according to the multiple text boxes of different scales after the text box regression processing, and character recognition is performed on the image based on that position, which improves the character recognition rate and achieves a better character recognition effect.
  • FIG. 1 is a schematic diagram of a character recognition system architecture provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a character recognition method provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of another character recognition method provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of downsampling and convolution processing provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the offset of a text box provided by an embodiment of the present application;
  • FIG. 6 is a schematic flowchart of yet another character recognition method provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a character recognition apparatus provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a possible structure of the character recognition device of the present application.
  • Therefore, an embodiment of the present application proposes a character recognition method.
  • After an image to be processed carrying one or more characters is acquired, feature extraction is performed on the image to obtain image features, and multiple text boxes of different scales in the image to be processed are then obtained according to the image features.
  • Text box regression processing is performed on the multiple text boxes of different scales, which addresses image deformation and angular movement and improves the recognition rate of the subsequent character recognition performed on the image to be processed based on the multiple text boxes of different scales after the text box regression processing, achieving a better character recognition effect.
  • Optionally, the character recognition method provided in the present application can be applied to the character recognition system architecture shown in FIG. 1. As shown in FIG. 1, the system may include a receiving device 101, a processing device 102, and a display device 103.
  • In a specific implementation, the receiving device 101 may be an input/output interface or a communication interface, and may be used to receive an image to be processed carrying one or more characters.
  • The processing device 102 may acquire the image to be processed through the receiving device 101, perform feature extraction on the image to obtain image features, obtain multiple text boxes of different scales in the image according to the image features, and perform text box regression processing on the multiple text boxes of different scales, addressing image deformation and angular movement; it then performs character recognition on the image according to the multiple text boxes of different scales after the text box regression processing, which improves the character recognition rate and achieves a better recognition effect.
  • the display device 103 may be used to display the above-mentioned image to be processed, multiple text boxes of different scales, and the like.
  • the display device may also be a touch screen, configured to receive user instructions while displaying the above content, so as to realize interaction with the user.
  • the processing device 102 may also send the result of character recognition on the image to be processed to the decoder, and the decoder decodes the result and outputs the corresponding character.
  • It should be understood that the processing device may be implemented by a processor that reads and executes instructions in a memory, or by a chip circuit.
  • The above system is only an exemplary system, and can be configured according to application requirements in a specific implementation.
  • In addition, the system architecture described in the embodiments of the present application is intended to illustrate the technical solutions of the embodiments more clearly and does not limit the technical solutions provided in the embodiments.
  • Those of ordinary skill in the art will appreciate that, with the evolution of the technology and the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
  • FIG. 2 is a schematic flowchart of a character recognition method provided by the embodiment of the present application.
  • the execution subject of this embodiment may be the processing device in the embodiment shown in FIG. 1 , which may be determined according to actual conditions.
  • the text recognition method provided by the embodiment of the present application includes the following steps:
  • S201: Acquire an image to be processed, where the image to be processed carries one or more characters.
  • the above-mentioned images to be processed can be set according to actual conditions, for example, images obtained in scenarios such as license plate recognition, bill recognition, and book text recognition.
  • S202: Perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed.
  • Here, before performing feature extraction, the processing device may also perform parameter reduction processing on the image to be processed, so as to reduce the number of parameters and the amount of computation and improve the efficiency of subsequent character recognition.
  • For example, the processing device may use three 3×3 convolutional layers and one 2×2 pooling layer to perform parameter reduction processing on the image to be processed, where the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
  • Parameters of the three 3×3 convolutional layers and the 2×2 pooling layer, such as the convolution kernel size (kernel_size), convolution stride (stride), and feature map padding width (padding), can be as shown in Table 1 (reproduced as an image in the original document).
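To make the parameter-reduction stem concrete, here is a minimal PyTorch-style sketch of three sequential 3×3 convolutions followed by a 2×2 pooling layer, as described above. The channel widths and activations are illustrative assumptions, not the kernel_size/stride/padding values from the patent's Table 1.

```python
import torch
from torch import nn

class ReductionStem(nn.Module):
    """Three 3x3 convolutions connected in sequence, then one 2x2
    pooling layer, per the parameter-reduction step described above.
    Channel widths and activations are illustrative assumptions."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # halves the spatial size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)
```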
  • In addition, when performing feature extraction on the image to be processed, the processing device may perform the extraction based on a densely connected network to obtain the image features corresponding to the image to be processed, where the densely connected network includes one or more dense blocks, there is a direct connection between any two dense blocks in the densely connected network, and the input of each dense block is the union of the outputs of all preceding dense blocks.
  • Here, the processing device uses a densely connected network as the feature extraction network; such a network takes the outputs of all previous layers as the input of the current layer, which makes gradient and information propagation more accurate, so that subsequent character recognition based on the features the densely connected network extracts from the image to be processed is more accurate.
  • In the embodiments of the present application, to increase the depth of the extracted features, the densely connected network may further include one or more transition connection layers, which are used to increase the number of dense blocks in the densely connected network without changing the resolution of the original feature maps.
  • Each transition connection layer includes a 1×1 convolutional layer, which not only increases the depth of feature extraction in the densely connected network but also removes the limit on the overall number of dense blocks.
  • The input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers.
  • The processing device can perform feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers, so that richer features are extracted and the accuracy of subsequent character recognition based on the extracted features is improved.
  • For example, the numbers of dense blocks and transition connection layers can be set according to the actual situation. As shown in Table 2 (reproduced as an image in the original document), the number of dense blocks is 4 and the number of transition connection layers is 2: the first transition connection layer is placed between the third dense block and the fourth dense block, and the second transition connection layer is placed after the fourth dense block. Table 2 also lists parameters such as kernel_size, stride, and padding of the 4 dense blocks and 2 transition connection layers.
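To illustrate the dense-connection idea, the following is a minimal, hedged PyTorch sketch of a dense block (each layer consuming the concatenation, i.e., the "union", of all preceding outputs) and a 1×1 transition layer that deepens the network without changing the feature-map resolution. The growth rate and layer count are illustrative assumptions, not the patent's Table 2 configuration.

```python
import torch
from torch import nn

class DenseBlock(nn.Module):
    """Every layer receives the concatenation of the outputs of all
    preceding layers, mirroring the 'union of all previous outputs'
    property of a densely connected network."""
    def __init__(self, in_ch: int, growth: int = 32, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
            ))
            ch += growth
        self.out_channels = ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class TransitionLayer(nn.Module):
    """A 1x1 convolution: it deepens feature extraction without
    changing the spatial resolution of the feature map."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)
```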
  • S203: Obtain, according to the image features, multiple text boxes of different scales in the image to be processed, and perform text box regression processing on the multiple text boxes of different scales.
  • Here, the processing device may use a preset dense layer to obtain, according to the image features, multiple text boxes of different scales in the image to be processed and to perform text box regression processing on the multiple text boxes of different scales.
  • The preset dense layer may include two blocks: one for obtaining the multiple text boxes of different scales in the image to be processed, and one for performing text box regression processing on the multiple text boxes of different scales, as sketched below.
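One hedged reading of this two-block preset dense layer is a pair of parallel convolutional branches over a shared feature map: one block scores candidate text boxes, the other predicts the regression offsets. In the sketch below, the number of default boxes per feature-map location is our own assumption for illustration; the patent does not state it.

```python
import torch
from torch import nn

class TextBoxHead(nn.Module):
    """Two parallel branches on one feature map: a score block that
    rates candidate text boxes, and an offset block that outputs
    (dx, dy, dw, dh) per box for text box regression."""
    def __init__(self, in_ch: int, boxes_per_loc: int = 4):
        super().__init__()
        self.score = nn.Conv2d(in_ch, boxes_per_loc, kernel_size=3, padding=1)
        self.offset = nn.Conv2d(in_ch, boxes_per_loc * 4, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor):
        return self.score(feat), self.offset(feat)
```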
  • In the embodiments of the present application, by performing text box regression processing on the multiple text boxes of different scales in the image to be processed, the processing device addresses image deformation and angular movement and improves the recognition rate of the subsequent character recognition performed on the image based on the multiple text boxes of different scales after the text box regression processing.
  • S204: Determine, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and perform character recognition on the image to be processed based on the position of the one or more characters.
  • For example, the processing device may obtain, according to the multiple text boxes of different scales after the text box regression processing and a preset score model, scores of the multiple text boxes of different scales after the text box regression processing; calculate, according to the scores, the positions of the multiple text boxes of different scales after the text box regression processing; and determine, based on those positions, the position of the one or more characters in the image to be processed.
  • The preset score model is used to determine the scores of the multiple text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales and each of the multiple text boxes of different scales.
  • For example, the preset score model includes an expression (reproduced as an image in the original document) that defines the score s_i of the i-th text box in terms of iou(T, c_i) and the threshold N, where s_i denotes the score of the i-th text box; iou denotes the Intersection over Union, i.e., the ratio of the intersection to the union of one text box and another; T denotes the computed highest-scoring text box; c_i denotes a candidate box; and N denotes a threshold that can be set according to the actual situation.
  • Here, the processing device may take the multiple text boxes of different scales after the text box regression processing as the candidate boxes, calculate the scores of all candidate boxes to obtain the highest-scoring text box T, and obtain, according to the above expression, the scores of the multiple text boxes of different scales after the text box regression processing.
  • Further, when calculating, according to the scores, the positions of the multiple text boxes of different scales after the text box regression processing, the processing device may use an expression that is reproduced as an image in the original document and, based on the surrounding definitions, is plausibly of the score-weighted form t′ = (Σ_i s_i t_i) / (Σ_i s_i), where t′ denotes the position of a text box among the multiple text boxes of different scales after the text box regression processing, and t_i denotes the coordinates of the i-th text box.
  • In addition, when calculating the positions according to the scores, the processing device may also calculate the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales after the text box regression processing and a text box i after the text box regression processing.
  • If the calculated ratio is smaller than a preset threshold, the processing device may calculate the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
  • Here, the text box i is any one of the multiple text boxes of different scales after the text box regression processing, i = 1, …, n, where n is an integer determined according to the number of the multiple text boxes of different scales after the text box regression processing. That is, the processing device may use a non-maximum suppression (NMS) algorithm to calculate the positions of the multiple text boxes of different scales after the text box regression processing, making the calculation results more accurate.
  • For example, the processing device may enumerate all candidate boxes a, that is, the multiple text boxes of different scales after the text box regression processing together with their computed scores s_i, and initialize a detection set Bi, which is set to empty. Then the processing device computes over all text boxes in the candidate set a to obtain the highest-scoring text box T and puts it into the set Bi, where i denotes the i-th selected box.
  • Next, the processing device sets a threshold N, traverses all remaining text boxes, computes the iou of each text box with the highest-scoring detection box, and, if the result is greater than or equal to the threshold, puts the box into the set Bi.
  • The processing device repeats the above operations until a is empty, obtaining the set Bi.
  • Then the processing device can calculate the positions of the text boxes based on the scores s_i, so that the text positions subsequently computed from these positions are more accurate.
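The grouping procedure above can be sketched in a few lines of NumPy. Collecting every box whose iou with the current winner T is at least N into the winner's set Bi follows the textual description; producing one score-weighted position per set is our hedged reading of the position calculation, not the patent's exact formula (which is rendered as an image in the original).

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = np.maximum(box_a[:2], box_b[:2])
    x2, y2 = np.minimum(box_a[2:], box_b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def group_nms(boxes: np.ndarray, scores: np.ndarray, n_thresh: float = 0.5):
    """Repeatedly take the highest-scoring candidate T, gather every
    remaining box whose iou with T is >= n_thresh into T's set Bi,
    and emit one score-weighted position per set."""
    order = list(np.argsort(scores)[::-1])  # candidates, best first
    results = []
    while order:
        top = order.pop(0)           # highest-scoring box T
        cluster, rest = [top], []
        for idx in order:
            if iou(boxes[top], boxes[idx]) >= n_thresh:
                cluster.append(idx)  # goes into the set Bi
            else:
                rest.append(idx)     # survives for later rounds
        order = rest
        w = scores[cluster]
        # fused position as the score-weighted mean of the cluster
        results.append((boxes[cluster] * w[:, None]).sum(0) / w.sum())
    return np.stack(results)
```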
  • In addition, when performing character recognition on the image to be processed based on the position of the one or more characters, the processing device may recognize the characters in the image to be processed based on the position of the one or more characters and a preset recognition model.
  • The preset recognition model is used to recognize the characters in an image according to the positions of the characters in the image.
  • In the embodiments of the present application, an image to be processed that carries one or more characters is acquired; feature extraction is performed on the image to obtain image features; multiple text boxes of different scales in the image are obtained according to the image features, and text box regression processing is performed on them, addressing image deformation and angular movement; the position of the characters in the image is then determined according to the multiple text boxes of different scales after the regression processing, and character recognition is performed on the image based on that position, which improves the character recognition rate and achieves a better recognition effect.
  • In addition, parameter reduction processing is performed on the image to be processed, which reduces the number of parameters and the amount of computation and improves the efficiency of subsequent character recognition.
  • The embodiment of the present application also uses a densely connected network as the feature extraction network; such a network takes the outputs of all previous layers as the input of the current layer, which makes gradient and information propagation more accurate, so that subsequent character recognition based on the features extracted from the image to be processed by the densely connected network is more accurate.
  • The embodiment of the present application may also use the NMS algorithm to calculate the positions of the multiple text boxes of different scales after the text box regression processing, making the calculation results more accurate.
  • It should be noted that before recognizing the characters in the image to be processed based on the position of the one or more characters and the preset recognition model, the processing device needs to train the preset recognition model so that the model can be used to recognize the characters in the image to be processed.
  • For example, the processing device may input images carrying characters into the preset recognition model, where each input image also carries the positions of the characters in the image, and then determine the output accuracy according to the characters output by the preset recognition model and the characters actually corresponding to the input images.
  • The processing device may then adjust the preset recognition model according to the output accuracy so as to improve it, take the adjusted model as the new preset recognition model, and re-execute the step of inputting images carrying characters into the preset recognition model.
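As a generic sketch of this train-evaluate-adjust cycle (the model and data interfaces below, including a model called as model(images, positions), are assumptions made for illustration; the patent does not specify them):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_recognizer(model: nn.Module, loader: DataLoader,
                     target_acc: float = 0.95, max_epochs: int = 50):
    """Feed images (with known character positions) into the recognition
    model, compare its output characters against the ground truth to get
    an output accuracy, and keep adjusting the model until acceptable."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        correct, total = 0, 0
        for images, positions, labels in loader:  # assumed batch layout
            logits = model(images, positions)
            loss = loss_fn(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()                            # "adjust" the model
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
        if total and correct / total >= target_acc:
            break                                 # accuracy acceptable
    return model
```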
  • FIG. 3 is a schematic flowchart of another character recognition method proposed in the embodiment of the present application. As shown in Figure 3, the method includes:
  • S301: Acquire an image to be processed, where the image to be processed carries one or more characters.
  • S302: Perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed.
  • steps S301-S302 are implemented in the same manner as the above-mentioned steps S201-S202, and will not be repeated here.
  • S303: Obtain, according to the image features, multiple text boxes of different scales in the image to be processed, and determine offset data of the multiple text boxes of different scales.
  • For example, the processing device may perform downsampling processing on the image features, perform downsampling and convolution processing on the downsampled image features, take the image features after the downsampling and convolution processing as the new downsampled image features, and re-execute the step of performing downsampling and convolution processing on the downsampled image features, until the multiple text boxes of different scales in the image to be processed are obtained and the offset data of the multiple text boxes of different scales are determined.
  • the above-mentioned processing device may use a down-sampling module to perform down-sampling processing on the above-mentioned image features, and the above-mentioned down-sampling module may include a 1 ⁇ 1 convolution and a 2 ⁇ 2 pooling layer.
  • the above-mentioned processing device uses a 2 ⁇ 2 pooling layer to match the size of the feature map, and uses a 1 ⁇ 1 convolution to reduce the number of channels by half.
  • In this way, each scale of the whole module incorporates both the features of the current feature map and those of the previous feature map, which reduces the number of parameters and makes the results more accurate.
  • the above-mentioned processing device can also use a convolution module to perform convolution processing on the above-mentioned image features.
  • the above-mentioned convolution module can include a 1 ⁇ 1 convolution layer and a 3 ⁇ 3 convolution layer to perform two convolution operations.
  • In this way, the feature map of one layer is passed on to the feature map of the next layer.
  • the above-mentioned processing device can obtain text frames of 6 different scales.
  • the text boxes of the above six different scales include text boxes of scale 1 , scale 2 , scale 3 , scale 4 , scale 5 and scale 6 .
  • For example, the processing device determines a text box of scale 1 according to the image features; performs downsampling processing on the scale-1 text box to obtain a text box of scale 2; performs downsampling and convolution processing on the scale-2 text box to obtain a text box of scale 3; and repeats this step, performing downsampling and convolution processing on the scale-3 text box to obtain a text box of scale 4, on the scale-4 text box to obtain a text box of scale 5, and on the scale-5 text box to obtain a text box of scale 6.
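The six-scale cascade can be sketched as follows, continuing the PyTorch style used above. The 1×1 convolution halving the channel count and the 2×2 pooling follow the text; concrete channel counts, activations, and the inline module construction are illustrative simplifications.

```python
import torch
from torch import nn

class DownsampleModule(nn.Module):
    """2x2 pooling matches the feature-map size while a 1x1 convolution
    halves the channel count, per the downsampling module described above."""
    def __init__(self, ch: int):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(ch, ch // 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.pool(x))

class ConvModule(nn.Module):
    """Two convolution operations (1x1 then 3x3), passing one layer's
    feature map on to the next layer."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

def six_scale_features(feat: torch.Tensor) -> list:
    """Scale 1 is the input feature map; scale 2 comes from downsampling;
    scales 3-6 come from repeated downsampling + convolution."""
    scales = [feat]                              # scale 1
    cur = DownsampleModule(feat.shape[1])(feat)  # scale 2
    scales.append(cur)
    for _ in range(4):                           # scales 3-6
        cur = ConvModule(cur.shape[1] // 2)(DownsampleModule(cur.shape[1])(cur))
        scales.append(cur)
    return scales
```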
  • During this processing, the processing device determines the offset data of the multiple text boxes of different scales, and then performs text box regression processing on the multiple text boxes of different scales based on the offset data.
  • Figure 5 shows a schematic diagram of the offset of a text box.
  • In FIG. 5, b0 denotes the default box; four arrows lead from b0 to Gq, indicating the offsets from the default box to the actual target; Gb denotes the minimum circumscribed rectangle of the actual target Gq.
  • This smallest enclosing rectangle represents the ground-truth value of the box and is parameterized by the center point of Gb together with its width and height.
  • After the processing device determines the offset data of a text box, it performs text box regression processing on the text box based on the offset data, addressing image deformation and angular movement and thereby improving the accuracy of subsequent character recognition.
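This default-box parameterization reads like the standard (dx, dy, dw, dh) offset encoding used in box-regression detectors; under that assumption (which the patent text does not confirm explicitly), decoding a regressed box from a default box b0 might look like:

```python
import numpy as np

def decode_box(default_box: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Apply predicted offsets (dx, dy, dw, dh) to a default box b0 given
    as (cx, cy, w, h), approximating the minimum circumscribed rectangle
    Gb of the actual target Gq."""
    cx, cy, w, h = default_box
    dx, dy, dw, dh = offsets
    gx = cx + dx * w            # shift the center point
    gy = cy + dy * h
    gw = w * np.exp(dw)         # rescale the width and height
    gh = h * np.exp(dh)
    return np.array([gx, gy, gw, gh])
```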
  • S304: Based on the offset data, perform text box regression processing on the multiple text boxes of different scales.
  • S305: Determine, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and perform character recognition on the image to be processed based on the position of the one or more characters.
  • step S305 is the same as that of the above step S204, which will not be repeated here.
  • In the embodiments of the present application, text box regression processing is performed on the text boxes to address image deformation and angular movement, and character recognition is then performed according to the multiple text boxes of different scales after the text box regression processing, which improves the character recognition rate.
  • FIG. 6 shows a schematic flow chart of another character recognition method proposed in the embodiment of the present application.
  • In a specific implementation, the processing device may use a parameter reduction module to perform the parameter reduction, where the parameter reduction module may include three 3×3 convolutional layers and one 2×2 pooling layer, the three 3×3 convolutional layers being connected in sequence and then connected to the 2×2 pooling layer.
  • the above-mentioned processing device may perform feature extraction on the image to be processed after parameter reduction processing, for example, feature extraction may be performed based on a densely connected network.
  • the densely connected network may include one or more dense blocks, and may also include one or more transitionally connected layers.
  • For example, the first transition connection layer is placed between the third dense block and the fourth dense block, and the second transition connection layer is placed after the fourth dense block.
  • Then the processing device can obtain, based on the extracted image features, multiple text boxes of different scales in the image to be processed and determine the offset data of the multiple text boxes of different scales, thereby performing, based on the offset data, text box regression processing on the multiple text boxes of different scales.
  • The processing device can use a preset dense layer to perform this processing; the preset dense layer may include two blocks, one for obtaining the multiple text boxes of different scales in the image to be processed and one for performing text box regression processing on the multiple text boxes of different scales.
  • the processing device determines the position of one or more characters in the image to be processed according to the multiple character frames of different scales after the character frame regression processing, and performs character recognition on the image to be processed based on the position.
  • the above-mentioned processing device may use an NMS algorithm to calculate the positions of multiple text boxes of different scales after the above-mentioned text box regression processing, so that the calculation results are more accurate.
  • the processing device may also send the result of character recognition on the image to be processed to the decoder, and the decoder decodes the result and outputs the corresponding character.
  • In the embodiments of the present application, the processing device performs text box regression processing on multiple text boxes of different scales in the image to be processed to address image deformation and angular movement, and then performs character recognition on the image according to the multiple text boxes of different scales after the text box regression processing, which improves the character recognition rate and achieves a better recognition effect.
  • the above-mentioned processing device also performs parameter reduction processing on the above-mentioned image to be processed, which reduces parameters and calculation amount, and improves the efficiency of subsequent character recognition.
  • In addition, the processing device uses a densely connected network as the feature extraction network, which takes the outputs of all previous layers as the input of the current layer, making gradient and information propagation more accurate, so that character recognition based on the features extracted from the image to be processed by the densely connected network is more accurate.
  • the above-mentioned processing device may also use the NMS algorithm to calculate the positions of multiple text boxes of different scales after the above-mentioned text box regression processing, so that the calculation results are more accurate.
  • FIG. 7 is a schematic structural diagram of a character recognition device provided in the embodiment of the present application.
  • the text recognition device 70 includes: an image acquisition module 701 , a feature extraction module 702 , a text frame processing module 703 and a text recognition module 704 .
  • It should be noted that the character recognition apparatus here may be the above-described processing device itself, or a chip or integrated circuit that realizes the functions of the processing device. The division into an image acquisition module, a feature extraction module, a text box processing module, and a text recognition module is only a division of logical functions; physically, these modules may be integrated or kept independent.
  • the image acquiring module 701 is configured to acquire an image to be processed, and the image to be processed carries one or more characters.
  • the feature extraction module 702 is configured to perform feature extraction on the image to be processed, and obtain image features corresponding to the image to be processed.
  • the text box processing module 703 is configured to obtain multiple text boxes of different scales in the image to be processed according to the image features, and perform text box regression processing on the multiple text boxes of different scales.
  • the character recognition module 704 is configured to determine the position of the one or more characters in the image to be processed according to a plurality of character frames of different scales after the character frame regression processing, and based on the position of the one or more characters , performing character recognition on the image to be processed.
  • In a possible implementation, the feature extraction module 702 is specifically configured to: perform feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image to be processed, where the densely connected network includes one or more dense blocks, there is a direct connection between any two dense blocks in the densely connected network, and the input of each dense block is the union of the outputs of all preceding dense blocks.
  • In a possible implementation, the densely connected network further includes one or more transition connection layers, each transition connection layer includes a 1×1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers. The feature extraction module 702 is then specifically configured to: perform feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers to obtain the image features corresponding to the image to be processed.
  • In a possible implementation, the text box processing module 703 is specifically configured to: obtain, according to the image features, multiple text boxes of different scales in the image to be processed, and determine offset data of the multiple text boxes of different scales; and perform, based on the offset data, text box regression processing on the multiple text boxes of different scales.
  • In a possible implementation, the text box processing module 703 is specifically configured to: perform downsampling processing on the image features, and perform downsampling and convolution processing on the downsampled image features; and take the image features after the downsampling and convolution processing as the new downsampled image features, and re-execute the step of performing downsampling and convolution processing on the downsampled image features, until the multiple text boxes of different scales in the image to be processed are obtained and the offset data of the multiple text boxes of different scales are determined.
  • In a possible implementation, the character recognition module 704 is specifically configured to: obtain, according to the multiple text boxes of different scales after the text box regression processing and a preset score model, scores of the multiple text boxes of different scales after the text box regression processing, where the preset score model is used to determine the scores of the multiple text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales and each of the multiple text boxes of different scales; and calculate, according to those scores, the positions of the multiple text boxes of different scales after the text box regression processing, and determine, based on those positions, the position of the one or more characters in the image to be processed.
  • In a possible implementation, the character recognition module 704 is specifically configured to: calculate the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales after the text box regression processing and a text box i after the text box regression processing, where the text box i is any one of the multiple text boxes of different scales after the text box regression processing, i = 1, …, n, n is an integer, and n is determined according to the number of the multiple text boxes of different scales after the text box regression processing; and, if the calculated ratio is smaller than a preset threshold, calculate the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
  • In a possible implementation, the feature extraction module 702 is specifically configured to: perform parameter reduction processing on the image to be processed; and perform feature extraction on the image to be processed after the parameter reduction processing to obtain the image features corresponding to the image to be processed.
  • In a possible implementation, the feature extraction module 702 is specifically configured to: perform the parameter reduction processing on the image to be processed using three 3×3 convolutional layers and one 2×2 pooling layer, where the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
  • In a possible implementation, the character recognition module 704 is specifically configured to: recognize the characters in the image to be processed based on the position of the one or more characters and a preset recognition model, where the preset recognition model is used to recognize the characters in an image according to the positions of the characters in the image.
  • the device provided in the embodiment of the present application can be used to implement the technical solution of the above method embodiment, and its implementation principle and technical effect are similar, so the embodiments of the present application will not repeat them here.
  • FIG. 8 schematically provides a possible basic hardware architecture of the character recognition device described in this application.
  • a character recognition device 800 includes at least one processor 801 and a communication interface 803 . Further optionally, a memory 802 and a bus 804 may also be included.
  • the character recognition device 800 may be the above-mentioned processing device, which is not particularly limited in this application.
  • the text recognition device 800 there may be one or more processors 801, and FIG. 8 only shows one of the processors 801.
  • The processor 801 may be a central processing unit (CPU), a graphics processing unit (GPU), or a digital signal processor (DSP). If the character recognition device 800 has multiple processors 801, the types of the multiple processors 801 may be different or the same. Optionally, the multiple processors 801 of the character recognition device 800 may also be integrated into a multi-core processor.
  • the memory 802 stores computer instructions and data; the memory 802 may store computer instructions and data required to realize the above-mentioned character recognition method provided by the present application, for example, the memory 802 stores instructions for implementing the steps of the above-mentioned character recognition method.
  • the memory 802 may be any one or any combination of the following storage media: non-volatile memory (such as read only memory (ROM), solid state disk (SSD), hard disk (HDD), optical disk), volatile memory.
  • the communication interface 803 may provide information input/output for the at least one processor. Any one or any combination of the following components may also be included: a network interface (such as an Ethernet interface), a wireless network card and other devices with network access functions.
  • the communication interface 803 may also be used for data communication between the character recognition device 800 and other computing devices or terminals.
  • In FIG. 8, the thick line represents the bus 804.
  • the bus 804 can connect the processor 801 with the memory 802 and the communication interface 803 .
  • the processor 801 can access the memory 802 through the bus 804 , and can also use the communication interface 803 to perform data interaction with other computing devices or terminals.
  • The text recognition device 800 executes the computer instructions in the memory 802, so that the text recognition device 800 implements the above character recognition method provided in this application, or so that the text recognition device 800 deploys the above-described character recognition apparatus.
  • the memory 802 may include an image acquisition module 701 , a feature extraction module 702 , a text frame processing module 703 and a text recognition module 704 .
  • the inclusion here only refers to the functions of the image acquisition module, the feature extraction module, the text box processing module and the text recognition module that can be realized respectively when the instructions stored in the memory are executed, and is not limited to the physical structure.
  • the above-mentioned character recognition device can be implemented by software as in FIG. 8 , or it can be implemented by hardware as a hardware module or as a circuit unit.
  • The present application further provides a computer-readable storage medium and a computer program product.
  • The computer program product includes computer instructions, and the computer instructions instruct a computing device to perform the above character recognition method provided by the application.
  • the present application provides a chip, including at least one processor and a communication interface, and the communication interface provides information input and/or output for the at least one processor. Further, the chip may further include at least one memory, and the memory is used to store computer instructions. The at least one processor is used to call and execute the computer instructions to execute the above-mentioned character recognition method provided by the present application.
  • In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways.
  • For example, the device embodiments described above are only illustrative.
  • The division of the units is only a logical function division; in actual implementation, there may be other division methods.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or in the form of hardware plus software functional units.

Abstract

The present application provides a character recognition method, apparatus, device, and storage medium. The method acquires an image to be processed that carries one or more characters; performs feature extraction on the image to obtain image features; obtains, according to the image features, multiple text boxes of different scales in the image and performs text box regression processing on them, addressing image deformation and angular movement; and then determines, according to the multiple text boxes of different scales after the text box regression processing, the position of the characters in the image and performs character recognition on the image based on that position, which improves the character recognition rate and achieves a better character recognition effect.

Description

Character recognition method, apparatus, device, and storage medium
This application claims priority to Chinese Patent Application No. 202111535285.5, filed with the Chinese Patent Office on December 15, 2021 and entitled "Character recognition method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to image recognition technology in financial technology (Fintech), and in particular to a character recognition method, apparatus, device, and storage medium.
Background
With the development of information technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. Image recognition technology is no exception, but the security and real-time requirements of the financial industry also place higher demands on it.
In the related art, image recognition technology mainly refers to using computers to process images captured at the front end of a system according to established goals. In the field of artificial intelligence, neural networks are the most widely used approach to image recognition. Neural network models can implement functions such as face recognition, image detection, image classification, object tracking, and character recognition; among them, face recognition, image classification, and character recognition have achieved good results after a long period of development.
Character recognition generally refers to technology in which various devices, including computers, automatically recognize characters; it has important applications in many fields of modern society. However, once an image is deformed or its angle shifts, existing image recognition technology lacks the equivariance property, which lowers the character recognition rate and prevents an ideal recognition result.
Summary
To solve the problems in the prior art, the present application provides a character recognition method, apparatus, device, and storage medium.
In a first aspect, an embodiment of the present application provides a character recognition method, the method comprising:
acquiring an image to be processed, where the image to be processed carries one or more characters;
performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed;
obtaining, according to the image features, multiple text boxes of different scales in the image to be processed, and performing text box regression processing on the multiple text boxes of different scales; and
determining, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and performing character recognition on the image to be processed based on the position of the one or more characters.
In a possible implementation, performing feature extraction on the image to be processed to obtain the image features corresponding to the image to be processed includes:
performing feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image to be processed, where the densely connected network includes one or more dense blocks, there is a direct connection between any two dense blocks in the densely connected network, and the input of each dense block is the union of the outputs of all preceding dense blocks.
In a possible implementation, the densely connected network further includes one or more transition connection layers, each transition connection layer includes a 1×1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers;
performing feature extraction on the image to be processed based on the densely connected network to obtain the image features corresponding to the image to be processed includes:
performing feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers to obtain the image features corresponding to the image to be processed.
In a possible implementation, obtaining, according to the image features, multiple text boxes of different scales in the image to be processed and performing text box regression processing on the multiple text boxes of different scales includes:
obtaining, according to the image features, multiple text boxes of different scales in the image to be processed, and determining offset data of the multiple text boxes of different scales; and
performing, based on the offset data, text box regression processing on the multiple text boxes of different scales.
In a possible implementation, obtaining, according to the image features, multiple text boxes of different scales in the image to be processed and determining the offset data of the multiple text boxes of different scales includes:
performing downsampling processing on the image features, and performing downsampling and convolution processing on the downsampled image features; and
taking the image features after the downsampling and convolution processing as the new downsampled image features, and re-executing the step of performing downsampling and convolution processing on the downsampled image features, until the multiple text boxes of different scales in the image to be processed are obtained and the offset data of the multiple text boxes of different scales are determined.
In a possible implementation, determining, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed includes:
obtaining, according to the multiple text boxes of different scales after the text box regression processing and a preset score model, scores of the multiple text boxes of different scales after the text box regression processing, where the preset score model is used to determine the scores of the multiple text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales and each of the multiple text boxes of different scales; and
calculating, according to the scores of the multiple text boxes of different scales after the text box regression processing, the positions of the multiple text boxes of different scales after the text box regression processing, and determining, based on the positions of the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed.
In a possible implementation, calculating, according to the scores of the multiple text boxes of different scales after the text box regression processing, the positions of the multiple text boxes of different scales after the text box regression processing includes:
calculating the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales after the text box regression processing and a text box i after the text box regression processing, where the text box i is any one of the multiple text boxes of different scales after the text box regression processing, i = 1, …, n, n is an integer, and n is determined according to the number of the multiple text boxes of different scales after the text box regression processing; and
if the calculated ratio is smaller than a preset threshold, calculating the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
In a possible implementation, before performing feature extraction on the image to be processed to obtain the image features corresponding to the image to be processed, the method further includes:
performing parameter reduction processing on the image to be processed;
performing feature extraction on the image to be processed to obtain the image features corresponding to the image to be processed then includes:
performing feature extraction on the image to be processed after the parameter reduction processing to obtain the image features corresponding to the image to be processed.
In a possible implementation, performing parameter reduction processing on the image to be processed includes:
performing parameter reduction processing on the image to be processed using three 3×3 convolutional layers and one 2×2 pooling layer, where the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
In a possible implementation, performing character recognition on the image to be processed based on the position of the one or more characters includes:
recognizing the characters in the image to be processed based on the position of the one or more characters and a preset recognition model, where the preset recognition model is used to recognize the characters in an image according to the positions of the characters in the image.
In a second aspect, an embodiment of the present application provides a character recognition apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image to be processed, where the image to be processed carries one or more characters;
a feature extraction module, configured to perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed;
a text box processing module, configured to obtain, according to the image features, multiple text boxes of different scales in the image to be processed, and perform text box regression processing on the multiple text boxes of different scales; and
a character recognition module, configured to determine, according to the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and perform character recognition on the image to be processed based on the position of the one or more characters.
In a possible implementation, the feature extraction module is specifically configured to:
perform feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image to be processed, where the densely connected network includes one or more dense blocks, there is a direct connection between any two dense blocks in the densely connected network, and the input of each dense block is the union of the outputs of all preceding dense blocks.
In a possible implementation, the densely connected network further includes one or more transition connection layers, each transition connection layer includes a 1×1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers.
The feature extraction module is specifically configured to:
perform feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers to obtain the image features corresponding to the image to be processed.
In a possible implementation, the text box processing module is specifically configured to:
obtain, according to the image features, multiple text boxes of different scales in the image to be processed, and determine offset data of the multiple text boxes of different scales; and
perform, based on the offset data, text box regression processing on the multiple text boxes of different scales.
In a possible implementation, the text box processing module is specifically configured to:
perform downsampling processing on the image features, and perform downsampling and convolution processing on the downsampled image features; and
take the image features after the downsampling and convolution processing as the new downsampled image features, and re-execute the step of performing downsampling and convolution processing on the downsampled image features, until the multiple text boxes of different scales in the image to be processed are obtained and the offset data of the multiple text boxes of different scales are determined.
In a possible implementation, the character recognition module is specifically configured to:
obtain, according to the multiple text boxes of different scales after the text box regression processing and a preset score model, scores of the multiple text boxes of different scales after the text box regression processing, where the preset score model is used to determine the scores of the multiple text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales and each of the multiple text boxes of different scales; and
calculate, according to the scores of the multiple text boxes of different scales after the text box regression processing, the positions of the multiple text boxes of different scales after the text box regression processing, and determine, based on the positions of the multiple text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed.
In a possible implementation, the character recognition module is specifically configured to:
calculate the ratio of the intersection to the union of the highest-scoring text box among the multiple text boxes of different scales after the text box regression processing and a text box i after the text box regression processing, where the text box i is any one of the multiple text boxes of different scales after the text box regression processing, i = 1, …, n, n is an integer, and n is determined according to the number of the multiple text boxes of different scales after the text box regression processing; and
if the calculated ratio is smaller than a preset threshold, calculate the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
In a possible implementation, the feature extraction module is specifically configured to:
perform parameter reduction processing on the image to be processed; and
perform feature extraction on the image to be processed after the parameter reduction processing to obtain the image features corresponding to the image to be processed.
In a possible implementation, the feature extraction module is specifically configured to:
perform parameter reduction processing on the image to be processed using three 3×3 convolutional layers and one 2×2 pooling layer, where the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
In a possible implementation, the character recognition module is specifically configured to:
recognize the characters in the image to be processed based on the position of the one or more characters and a preset recognition model, where the preset recognition model is used to recognize the characters in an image according to the positions of the characters in the image.
In a third aspect, an embodiment of the present application provides a character recognition device, comprising:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that causes a server to execute the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising computer instructions that, when executed by a processor, perform the method according to the first aspect.
With the character recognition method, apparatus, device, and storage medium provided by the embodiments of the present application, an image to be processed carrying one or more characters is acquired; feature extraction is performed on the image to obtain image features; multiple text boxes of different scales in the image are obtained according to the image features, and text box regression processing is performed on the multiple text boxes of different scales, addressing image deformation and angular movement; the position of the characters in the image is then determined according to the multiple text boxes of different scales after the text box regression processing, and character recognition is performed on the image based on that position, which improves the character recognition rate and achieves a better character recognition effect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a character recognition system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a character recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another character recognition method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of downsampling and convolution processing provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the offset of a text box provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of yet another character recognition method provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a character recognition apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a possible structure of the character recognition device of the present application.
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”及“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
Existing character recognition has been studied extensively in computer imaging and vision, and is in high demand in scenarios such as license plate recognition, bill recognition and book text recognition; many techniques are fairly mature and perform well. However, once an image is deformed or its angle shifts, existing image recognition techniques lack the equivariance property, so the character recognition rate drops and the ideal recognition effect cannot be achieved.
Therefore, an embodiment of the present application proposes a character recognition method. After acquiring an image to be processed that carries one or more characters, feature extraction is performed on the image to obtain image features; a plurality of text boxes of different scales in the image to be processed are then obtained according to the image features, and text box regression processing is performed on them. This solves the problem of image deformation or angle shift, improves the recognition rate of the subsequent character recognition performed on the image based on the regressed text boxes of different scales, and achieves a good character recognition effect.
Optionally, the character recognition method provided by the present application may be applied to the character recognition system architecture shown in FIG. 1. As shown in FIG. 1, the system may include a receiving apparatus 101, a processing apparatus 102 and a display apparatus 103.
In a specific implementation, the receiving apparatus 101 may be an input/output interface or a communication interface, and may be used to receive the image to be processed carrying one or more characters.
The processing apparatus 102 may acquire the image to be processed through the receiving apparatus 101, perform feature extraction on it to obtain image features, obtain a plurality of text boxes of different scales in the image according to the image features, and perform text box regression processing on them, which solves the problem of image deformation or angle shift. It then performs character recognition on the image to be processed according to the regressed text boxes of different scales, improving the character recognition rate and achieving a good recognition effect.
In addition, the display apparatus 103 may be used to display the image to be processed, the plurality of text boxes of different scales, and the like.
The display apparatus may also be a touch display screen, used to receive user instructions while displaying the above content, so as to interact with the user.
The processing apparatus 102 may also send the character recognition result of the image to be processed to a decoder, which decodes the result and outputs the corresponding characters.
It should be understood that the processing apparatus may be implemented by a processor reading and executing instructions in a memory, or by a chip circuit.
The above system is only an exemplary system and may be configured according to application requirements in specific implementations.
In addition, the system architecture described in the embodiments of the present application is intended to illustrate the technical solutions of the embodiments more clearly and does not constitute a limitation on them. A person of ordinary skill in the art will know that, as system architectures evolve and new service scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
The technical solutions of the present application are described below by taking several embodiments as examples; the same or similar concepts or processes may not be repeated in some embodiments.
FIG. 2 is a schematic flowchart of a character recognition method provided by an embodiment of the present application. The execution subject of this embodiment may be the processing apparatus in the embodiment shown in FIG. 1, and may be determined according to the actual situation. As shown in FIG. 2, the character recognition method provided by this embodiment includes the following steps:
S201: Acquire an image to be processed, the image carrying one or more characters.
The image to be processed may be set according to the actual situation, for example an image obtained in scenarios such as license plate recognition, bill recognition or book text recognition.
S202: Perform feature extraction on the image to be processed to obtain image features corresponding to the image.
Here, before performing feature extraction, the processing apparatus may also perform parameter reduction processing on the image to be processed, so as to reduce the number of parameters and the amount of computation and improve the efficiency of subsequent character recognition.
Exemplarily, the processing apparatus may perform the parameter reduction processing by using three 3×3 convolutional layers and one 2×2 pooling layer, where the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer. The kernel size (kernel_size), stride (stride) and feature map padding (padding) parameters of the three 3×3 convolutional layers and the 2×2 pooling layer may be as shown in Table 1, and a sketch of such a module is given after the table:
Table 1 (rendered as an image in the original; it lists the kernel_size, stride and padding parameters of the three 3×3 convolutional layers and the 2×2 pooling layer)
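For illustration only, a minimal PyTorch sketch of the parameter reduction module described above. Since Table 1 is only available as an image, the strides, padding and channel widths here are assumptions, not the patented parameters:

```python
import torch
import torch.nn as nn

class ParamReductionStem(nn.Module):
    """Three 3x3 convolutional layers connected in sequence, followed by
    one 2x2 pooling layer. Channel widths, strides and padding are
    illustrative assumptions (Table 1 is an image in the source)."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # halves the spatial size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)
```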
In addition, when performing feature extraction on the image to be processed, the processing apparatus may do so based on a densely connected network to obtain the image features corresponding to the image, where the densely connected network includes one or more dense blocks, any two dense blocks in the network are directly connected, and the input of each dense block is the union of the outputs of all preceding dense blocks.
Here, the processing apparatus uses the densely connected network as the feature extraction network. This network can take the outputs of all previous layers as the input of the current layer, so that gradients and information propagate more accurately; consequently, subsequent character recognition based on the features extracted by the densely connected network has a high accuracy.
In the embodiments of the present application, in order to increase the depth of the extracted features, the densely connected network may further include one or more transition connection layers. A transition connection layer is used to increase the number of dense blocks in the densely connected network without changing the resolution of the original feature maps. The transition connection layer includes a 1×1 convolutional layer, which not only increases the depth of feature extraction but also removes the limit on the overall number of dense blocks; the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers. The processing apparatus may perform feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers, making the extracted features richer and improving the accuracy of subsequent character recognition based on them.
Table 2 (rendered as an image in the original; it lists the kernel_size, stride and padding parameters of the four dense blocks and the two transition connection layers)
Exemplarily, the number of dense blocks and transition connection layers may be set according to the actual situation. For example, as shown in Table 2 above, there are four dense blocks and two transition connection layers: the first transition connection layer is placed between the third and fourth dense blocks, and the second transition connection layer is placed after the fourth dense block. Table 2 shows the kernel_size, stride and padding parameters of the four dense blocks and the two transition connection layers. A sketch of a dense block and a transition connection layer is given below.
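As an illustrative sketch only: a dense block and a 1×1 transition connection layer in PyTorch. The BN-ReLU-Conv layer composition and the growth rate are assumptions, since the exact layer parameters are only given in the image-rendered Table 2:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block: each internal layer takes the concatenation (union) of
    all previous outputs as its input, so gradients and information
    propagate directly between layers."""
    def __init__(self, in_ch: int, growth: int = 32, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1),
            )
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class TransitionLayer(nn.Module):
    """Transition connection layer: a 1x1 convolution that deepens the
    network between dense blocks without changing the feature map
    resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)
```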
S203: Obtain, according to the image features, a plurality of text boxes of different scales in the image to be processed, and perform text box regression processing on the plurality of text boxes of different scales.
Here, the processing apparatus may use a preset dense layer to obtain, according to the image features, the plurality of text boxes of different scales in the image to be processed, and to perform text box regression processing on them.
The preset dense layer may include two parts: one for obtaining the plurality of text boxes of different scales in the image to be processed, and one for performing text box regression processing on them. A sketch of such a two-branch prediction head is given below.
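For illustration, a minimal sketch of a two-branch prediction head matching this description. The number of default boxes per location and the output parameterisation (a text/non-text score plus four box offsets) are assumptions:

```python
import torch
import torch.nn as nn

class DensePredictionHead(nn.Module):
    """Two parallel branches over a feature map: one predicts scores for
    the candidate text boxes, the other predicts the offsets used for
    text box regression. k is the assumed number of default boxes per
    feature map location."""
    def __init__(self, in_ch: int, k: int = 6):
        super().__init__()
        self.score_branch = nn.Conv2d(in_ch, k * 2, kernel_size=3, padding=1)   # text / non-text
        self.offset_branch = nn.Conv2d(in_ch, k * 4, kernel_size=3, padding=1)  # dx, dy, dw, dh

    def forward(self, feat: torch.Tensor):
        return self.score_branch(feat), self.offset_branch(feat)
```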
In the embodiments of the present application, by performing text box regression processing on the plurality of text boxes of different scales in the image to be processed, the processing apparatus solves the problem of image deformation or angle shift and improves the recognition rate of the subsequent character recognition performed on the image based on the regressed text boxes of different scales.
S204: Determine, according to the plurality of text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and perform character recognition on the image to be processed based on the position of the one or more characters.
Exemplarily, the processing apparatus may obtain scores of the regressed text boxes of different scales according to the regressed text boxes and a preset score model, then calculate the positions of the regressed text boxes according to the scores, and determine the position of the one or more characters in the image to be processed based on those positions.
The preset score model is used to determine the scores of the plurality of text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among them and the plurality of text boxes of different scales.
For example, the preset score model may include an expression of the following form (the formula is rendered as an image in the original; the reconstruction below follows the surrounding definitions and corresponds to the linear soft-NMS scoring rule):

$s_i = \begin{cases} s_i, & \mathrm{iou}(T, c_i) < N \\ s_i \cdot \left(1 - \mathrm{iou}(T, c_i)\right), & \mathrm{iou}(T, c_i) \ge N \end{cases}$

where s_i denotes the score of the i-th text box, iou denotes the Intersection over Union, i.e. the ratio of the intersection to the union of a text box and other text boxes, T denotes the computed highest-scoring text box, c_i denotes a candidate box, and N denotes a threshold that may be set according to the actual situation. Here, the processing apparatus may take the regressed text boxes of different scales as the candidate boxes, compute the scores of all candidate boxes to obtain the highest-scoring box T, and obtain the scores of the regressed text boxes of different scales according to the above expression.
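A minimal NumPy sketch of this rescoring step, under the assumption that the image-rendered formula is the linear soft-NMS rule reconstructed above:

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Ratio of intersection to union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def preset_score(boxes: np.ndarray, scores: np.ndarray, n_threshold: float) -> np.ndarray:
    """Rescore candidate boxes against the highest-scoring box T: a box
    whose IoU with T reaches the threshold N has its score decayed by
    the factor (1 - iou); other scores are kept unchanged."""
    t = boxes[np.argmax(scores)]  # highest-scoring box T
    new_scores = scores.copy()
    for i, c in enumerate(boxes):
        overlap = iou(t, c)
        if overlap >= n_threshold:
            new_scores[i] = scores[i] * (1.0 - overlap)
    return new_scores
```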
Further, when calculating the positions of the regressed text boxes of different scales according to the scores, the processing apparatus may use an expression of the following form (also image-rendered in the original; a score-weighted average of the candidate coordinates is consistent with the surrounding definitions):

$t' = \frac{\sum_i s_i \, t_i}{\sum_i s_i}$

where t′ denotes the position of the regressed text boxes of different scales, and t_i denotes the coordinates of the i-th text box.
In addition, when calculating the positions of the regressed text boxes of different scales according to the scores, the processing apparatus may also calculate the ratio of the intersection to the union of the highest-scoring regressed text box and regressed text box i. If the calculated ratio is less than a preset threshold, the processing apparatus may calculate the position of regressed text box i according to its score, where regressed text box i is any one of the regressed text boxes of different scales, i=1,…,n, n is an integer, and n is determined according to the number of regressed text boxes of different scales. That is, the processing apparatus may use the non-maximum suppression (NMS) algorithm to calculate the positions of the regressed text boxes of different scales, making the calculation result more accurate.
Exemplarily, the processing apparatus may enumerate all candidate boxes a, i.e. all regressed text boxes of different scales, together with their computed scores s_i, and initialize a detection set Bi as empty. It then computes the scores of all text boxes in the candidate set a, obtains the highest-scoring box T, and places it into the set Bi, where i denotes the i-th round of selection. Further, the processing apparatus may set a threshold N and traverse all remaining text boxes, computing the iou of each with the highest-scoring detection box; if the result is greater than or equal to the threshold, the box is placed into the set Bi. The processing apparatus repeats the above operations until a is empty, obtaining the sets Bi. Finally, for each text box, the processing apparatus may calculate the position of the box based on the scores s_i, so that the box positions subsequently calculated on this basis are more precise. A sketch of this procedure is given below.
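For illustration, a NumPy sketch of the grouping procedure just described, reusing the iou helper from the previous sketch; the exact grouping details are assumptions:

```python
import numpy as np

def nms_group(boxes: np.ndarray, scores: np.ndarray, n_threshold: float):
    """Repeatedly take the highest-scoring box T from the remaining
    candidates, then move every remaining box whose IoU with T is at
    least N into T's group; stop when the candidate set is empty."""
    order = scores.argsort()[::-1]  # candidate indices, best score first
    groups = []
    while order.size > 0:
        t = order[0]  # highest-scoring box T of this round
        rest = order[1:]
        overlaps = np.array([iou(boxes[t], boxes[j]) for j in rest])
        groups.append(np.concatenate(([t], rest[overlaps >= n_threshold])))
        order = rest[overlaps < n_threshold]
    return groups
```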
In the embodiments of the present application, when performing character recognition on the image to be processed based on the position of the one or more characters, the processing apparatus may also recognize the characters in the image based on the position of the one or more characters and a preset recognition model.
The preset recognition model is used to recognize characters in an image according to the positions of the characters in the image.
In the embodiments of the present application, an image to be processed carrying one or more characters is acquired; feature extraction is performed on the image to obtain image features; a plurality of text boxes of different scales in the image are obtained according to the image features and text box regression processing is performed on them, which solves the problem of image deformation or angle shift; the position of the characters in the image is then determined according to the regressed text boxes of different scales, and character recognition is performed on the image based on that position, improving the character recognition rate and achieving a good recognition effect. Moreover, the embodiments of the present application also perform parameter reduction processing on the image to be processed, reducing the number of parameters and the amount of computation and improving the efficiency of subsequent character recognition. In addition, the embodiments of the present application use the densely connected network as the feature extraction network, which can take the outputs of all previous layers as the input of the current layer so that gradients and information propagate more accurately; consequently, subsequent character recognition based on the extracted features has a high accuracy. The embodiments of the present application may also use the NMS algorithm to calculate the positions of the regressed text boxes of different scales, making the calculation result more accurate.
Here, before recognizing the characters in the image to be processed based on the position of the one or more characters and the preset recognition model, the processing apparatus needs to train the preset recognition model so that the model can later recognize the characters in the image to be processed. During training, the processing apparatus may input an image carrying characters into the preset recognition model, where the input image also carries the positions of the characters in the image, and then determine the output accuracy according to the characters output by the model and the characters corresponding to the input image. If the output accuracy is lower than a preset accuracy threshold, the processing apparatus may adjust the preset recognition model according to the output accuracy so as to improve it, take the adjusted preset recognition model as the new preset recognition model, and re-execute the step of inputting the image carrying characters into the model. A sketch of this training loop is given below.
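For illustration only, a PyTorch-style sketch of the training loop described above; the model interface, data loader and cross-entropy loss are assumptions, not part of the original description:

```python
import torch
import torch.nn.functional as F

def train_preset_recognition_model(model, loader, optimizer, acc_threshold: float):
    """Train until the output accuracy reaches the preset threshold:
    feed images (with character positions) to the model, compare its
    output characters with the ground truth, and adjust the model."""
    accuracy = 0.0
    while accuracy < acc_threshold:
        correct, total = 0, 0
        for images, char_positions, target_chars in loader:
            logits = model(images, char_positions)
            loss = F.cross_entropy(logits, target_chars)
            optimizer.zero_grad()
            loss.backward()  # adjust the model to improve the output accuracy
            optimizer.step()
            correct += (logits.argmax(dim=1) == target_chars).sum().item()
            total += target_chars.numel()
        accuracy = correct / total
    return model
```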
In addition, when obtaining the plurality of text boxes of different scales in the image to be processed according to the image features and performing text box regression processing on them, the processing apparatus may also obtain the text boxes of different scales according to the image features and determine their offset data, then perform text box regression processing on them based on the offset data, which solves the problem of image deformation or angle shift; character recognition is then performed on the image to be processed according to the regressed text boxes of different scales, improving the character recognition rate. FIG. 3 is a schematic flowchart of another character recognition method proposed by an embodiment of the present application. As shown in FIG. 3, the method includes:
S301: Acquire an image to be processed, the image carrying one or more characters.
S302: Perform feature extraction on the image to be processed to obtain image features corresponding to the image.
Steps S301-S302 are implemented in the same way as steps S201-S202 above and are not repeated here.
S303: Obtain, according to the image features, a plurality of text boxes of different scales in the image to be processed, and determine offset data of the plurality of text boxes of different scales.
Here, the processing apparatus may perform downsampling processing on the image features, perform downsampling and convolution processing on the downsampled image features, take the result as the new downsampled image features, and re-execute the step of performing downsampling and convolution processing on the downsampled image features until the plurality of text boxes of different scales in the image to be processed are obtained and their offset data is determined.
The processing apparatus may use a downsampling module to perform the downsampling processing on the image features. The downsampling module may include a 1×1 convolution and a 2×2 pooling layer. Here, the 2×2 pooling layer is used so that the feature map sizes match, and the 1×1 convolution is used to halve the number of channels; the scale of the whole module covers the features of the current feature map and those of the previous feature map, which keeps the number of parameters small and makes the result more accurate.
In addition, the processing apparatus may use a convolution module to perform the convolution processing on the image features. The convolution module may include a 1×1 convolution and a 3×3 convolutional layer, performing two convolution operations, with the feature map of the previous layer passed into the feature map of the next layer.
In the embodiments of the present application, take the case where the processing apparatus obtains text boxes of six different scales as an example. As shown in FIG. 4, the six scales include text boxes of scale 1, scale 2, scale 3, scale 4, scale 5 and scale 6. The processing apparatus determines the scale-1 text boxes according to the image features, then performs downsampling processing on them to obtain the scale-2 text boxes, and performs downsampling and convolution processing on the scale-2 text boxes to obtain the scale-3 text boxes. Repeating the above step, downsampling and convolution processing on the scale-3 text boxes yields the scale-4 text boxes, on the scale-4 text boxes yields the scale-5 text boxes, and on the scale-5 text boxes yields the scale-6 text boxes. A sketch of these modules and the multi-scale loop is given below.
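As an illustrative sketch, the downsampling module, the convolution module and the six-scale loop in PyTorch; channel widths and the exact wiring are assumptions:

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Downsampling module: a 1x1 convolution halves the channels and a
    2x2 pooling layer matches the feature map size."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch // 2, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(x))

class ConvBlock(nn.Module):
    """Convolution module: a 1x1 convolution followed by a 3x3 convolution,
    the previous feature map feeding into the next."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        self.conv2 = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv2(self.conv1(x))

class MultiScaleExtractor(nn.Module):
    """Scale 1 is the input feature map; scale 2 is produced by
    downsampling only; each later scale applies downsampling followed by
    convolution, mirroring the scale-1..scale-6 example above."""
    def __init__(self, in_ch: int = 256, n_scales: int = 6):
        super().__init__()
        stages, ch = [nn.Sequential(Downsample(in_ch))], in_ch // 2
        for _ in range(n_scales - 2):
            stages.append(nn.Sequential(Downsample(ch), ConvBlock(ch // 2)))
            ch //= 2
        self.stages = nn.ModuleList(stages)

    def forward(self, feat: torch.Tensor):
        scales = [feat]
        for stage in self.stages:
            scales.append(stage(scales[-1]))
        return scales
```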
During the above processing, the processing apparatus determines the offset data of the plurality of text boxes of different scales and, on that basis, performs text box regression processing on them. Exemplarily, to better understand the offset of a text box, FIG. 5 gives a schematic diagram: in the figure, b0 denotes the default box, and four arrows lead from b0 to Gq, representing the regression learning process from the default box to the actual text box; Gb denotes a minimum enclosing rectangle of the actual target Gq. The notation is rendered as images in the original; following the surrounding description, it may be written as G^b = (G^b_x, G^b_y, G^b_w, G^b_h), the ground-truth rectangle and smallest enclosing rectangle of G, where (G^b_x, G^b_y) denotes the center point of Gb, G^b_w denotes the width, and G^b_h denotes the height.
Here, after determining the offset data of a text box, the processing apparatus performs text box regression processing on the box based on the offset data, which solves the problem of image deformation or angle shift and thereby improves the accuracy of subsequent character recognition. A sketch of applying such offsets is given below.
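For illustration, a NumPy sketch of applying regression offsets to default boxes. The (dx, dy, dw, dh) parameterisation is the common SSD/Faster R-CNN encoding and is an assumption here, since the patent does not spell out the encoding:

```python
import numpy as np

def apply_offsets(default_boxes: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Regress boxes given as rows (cx, cy, w, h): the offsets shift the
    default box center and rescale its width and height."""
    cx, cy, w, h = (default_boxes[:, 0], default_boxes[:, 1],
                    default_boxes[:, 2], default_boxes[:, 3])
    dx, dy, dw, dh = offsets[:, 0], offsets[:, 1], offsets[:, 2], offsets[:, 3]
    gx = cx + dx * w       # regressed center x
    gy = cy + dy * h       # regressed center y
    gw = w * np.exp(dw)    # regressed width
    gh = h * np.exp(dh)    # regressed height
    return np.stack([gx, gy, gw, gh], axis=1)
```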
S304: Perform text box regression processing on the plurality of text boxes of different scales based on the offset data.
S305: Determine, according to the regressed text boxes of different scales, the position of the one or more characters in the image to be processed, and perform character recognition on the image to be processed based on the position of the one or more characters.
Step S305 is implemented in the same way as step S204 above and is not repeated here.
In the embodiments of the present application, after the offset data of the text boxes is determined, text box regression processing is performed on the boxes based on the offset data, which solves the problem of image deformation or angle shift; character recognition is then performed according to the regressed text boxes of different scales, improving the character recognition rate.
Here, FIG. 6 gives a schematic flowchart of still another character recognition method proposed by an embodiment of the present application. In the figure, after acquiring the image to be processed carrying one or more characters, the processing apparatus may perform parameter reduction processing on it. Specifically, the processing apparatus may use a parameter reduction module, which may include three 3×3 convolutional layers and one 2×2 pooling layer, the three 3×3 convolutional layers being connected in sequence and then connected to the 2×2 pooling layer. Further, the processing apparatus may perform feature extraction on the parameter-reduced image, exemplarily based on a densely connected network. The densely connected network may include one or more dense blocks and may also include one or more transition connection layers; the figure takes four dense blocks and two transition connection layers as an example, with the first transition connection layer placed between the third and fourth dense blocks and the second placed after the fourth dense block. After feature extraction, the processing apparatus may obtain, based on the extracted image features, the plurality of text boxes of different scales in the image to be processed and determine their offset data, and then perform text box regression processing on them based on the offset data. Here, the processing apparatus may use a preset dense layer for this processing; the preset dense layer may include two parts, one for obtaining the text boxes of different scales in the image and one for performing text box regression processing on them. Finally, the processing apparatus determines the position of the one or more characters in the image to be processed according to the regressed text boxes of different scales and performs character recognition on the image based on that position. The processing apparatus may use the NMS algorithm to calculate the positions of the regressed text boxes of different scales, making the calculation result more accurate.
In addition, the processing apparatus may also send the character recognition result of the image to be processed to a decoder, which decodes the result and outputs the corresponding characters.
In the embodiments of the present application, the processing apparatus performs text box regression processing on the plurality of text boxes of different scales in the image to be processed, which solves the problem of image deformation or angle shift; it then performs character recognition on the image according to the regressed text boxes of different scales, improving the character recognition rate and achieving a good recognition effect. Moreover, the processing apparatus also performs parameter reduction processing on the image to be processed, reducing the number of parameters and the amount of computation and improving the efficiency of subsequent character recognition. In addition, the processing apparatus uses the densely connected network as the feature extraction network, which can take the outputs of all previous layers as the input of the current layer so that gradients and information propagate more accurately; consequently, subsequent character recognition based on the extracted features has a high accuracy. The processing apparatus may also use the NMS algorithm to calculate the positions of the regressed text boxes of different scales, making the calculation result more accurate.
Corresponding to the character recognition method of the above embodiments, FIG. 7 is a schematic structural diagram of a character recognition apparatus provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown. The character recognition apparatus 70 includes: an image acquisition module 701, a feature extraction module 702, a text box processing module 703 and a character recognition module 704. The character recognition apparatus here may be the above processing apparatus itself, or a chip or integrated circuit implementing its functions. It should be noted that the division into the image acquisition module, feature extraction module, text box processing module and character recognition module is only a division of logical functions; physically, they may be integrated or independent.
The image acquisition module 701 is configured to acquire an image to be processed, the image carrying one or more characters.
The feature extraction module 702 is configured to perform feature extraction on the image to be processed to obtain image features corresponding to the image.
The text box processing module 703 is configured to obtain, according to the image features, a plurality of text boxes of different scales in the image to be processed, and perform text box regression processing on them.
The character recognition module 704 is configured to determine, according to the regressed text boxes of different scales, the position of the one or more characters in the image to be processed, and perform character recognition on the image based on that position.
In a possible design, the feature extraction module 702 is specifically configured to:
perform feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image, wherein the densely connected network includes one or more dense blocks, any two dense blocks in the network are directly connected, and the input of each dense block is the union of the outputs of all preceding dense blocks.
In a possible implementation, the densely connected network further includes one or more transition connection layers, the transition connection layer includes a 1×1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers.
The feature extraction module 702 is specifically configured to:
perform feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers to obtain the image features corresponding to the image.
In a possible implementation, the text box processing module 703 is specifically configured to:
obtain, according to the image features, the plurality of text boxes of different scales in the image to be processed, and determine offset data of the plurality of text boxes of different scales;
perform text box regression processing on the plurality of text boxes of different scales based on the offset data.
In a possible implementation, the text box processing module 703 is specifically configured to:
perform downsampling processing on the image features, and perform downsampling and convolution processing on the downsampled image features;
take the image features after the downsampling and convolution processing as the new downsampled image features, and re-execute the step of performing downsampling and convolution processing on the downsampled image features until the plurality of text boxes of different scales in the image to be processed are obtained and their offset data is determined.
In a possible implementation, the character recognition module 704 is specifically configured to:
obtain scores of the regressed text boxes of different scales according to the regressed text boxes and a preset score model, wherein the preset score model is used to determine the scores of the plurality of text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among them and the plurality of text boxes of different scales;
calculate positions of the regressed text boxes of different scales according to the scores, and determine the position of the one or more characters in the image to be processed based on the calculated positions.
In a possible implementation, the character recognition module 704 is specifically configured to:
calculate the ratio of the intersection to the union of the highest-scoring regressed text box and regressed text box i, wherein regressed text box i is any one of the regressed text boxes of different scales, i=1,…,n, n is an integer, and n is determined according to the number of regressed text boxes of different scales;
if the calculated ratio is less than a preset threshold, calculate the position of regressed text box i according to its score.
In a possible implementation, the feature extraction module 702 is specifically configured to:
perform parameter reduction processing on the image to be processed;
perform feature extraction on the parameter-reduced image to obtain the image features corresponding to the image.
In a possible implementation, the feature extraction module 702 is specifically configured to:
perform parameter reduction processing on the image to be processed by using three 3×3 convolutional layers and one 2×2 pooling layer, wherein the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
In a possible implementation, the character recognition module 704 is specifically configured to:
recognize the characters in the image to be processed based on the position of the one or more characters and a preset recognition model, wherein the preset recognition model is used to recognize characters in an image according to the positions of the characters in the image.
The apparatus provided by the embodiments of the present application can be used to execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
Optionally, FIG. 8 schematically shows a possible basic hardware architecture of the character recognition device of the present application.
Referring to FIG. 8, the character recognition device 800 includes at least one processor 801 and a communication interface 803. Further optionally, it may also include a memory 802 and a bus 804.
The character recognition device 800 may be the above processing apparatus, which is not particularly limited in the present application. In the character recognition device 800, the number of processors 801 may be one or more; FIG. 8 shows only one of them. Optionally, the processor 801 may be a central processing unit (CPU), a graphics processing unit (GPU) or a digital signal processor (DSP). If the character recognition device 800 has multiple processors 801, their types may be different or the same. Optionally, the multiple processors 801 may also be integrated into a multi-core processor.
The memory 802 stores computer instructions and data; it may store the computer instructions and data required to implement the character recognition method provided by the present application, for example instructions for implementing the steps of the method. The memory 802 may be any one or any combination of the following storage media: non-volatile memory (e.g. read-only memory (ROM), solid-state drive (SSD), hard disk drive (HDD), optical disc) and volatile memory.
The communication interface 803 may provide information input/output for the at least one processor, and may also include any one or any combination of devices with network access functions, such as a network interface (e.g. an Ethernet interface) or a wireless network card.
Optionally, the communication interface 803 may also be used for data communication between the character recognition device 800 and other computing devices or terminals.
Further optionally, FIG. 8 represents the bus 804 with a thick line. The bus 804 may connect the processor 801 with the memory 802 and the communication interface 803. In this way, through the bus 804, the processor 801 can access the memory 802 and can also exchange data with other computing devices or terminals through the communication interface 803.
In the present application, the character recognition device 800 executes the computer instructions in the memory 802, causing the device to implement the character recognition method provided by the present application, or to deploy the above character recognition apparatus.
From the perspective of logical function division, exemplarily, as shown in FIG. 8, the memory 802 may include the image acquisition module 701, the feature extraction module 702, the text box processing module 703 and the character recognition module 704. This inclusion only means that, when the instructions stored in the memory are executed, the functions of these modules can be implemented respectively; it does not imply a physical structure.
In addition, the above character recognition device may be implemented in software as in FIG. 8, or in hardware as a hardware module or a circuit unit.
The present application provides a computer-readable storage medium storing computer instructions that instruct a computing device to execute the character recognition method provided by the present application.
The present application provides a chip including at least one processor and a communication interface, the communication interface providing information input and/or output for the at least one processor. Further, the chip may also include at least one memory for storing computer instructions. The at least one processor is configured to call and run the computer instructions to execute the character recognition method provided by the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; for example, the division of the units is only a division of logical functions, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

Claims (14)

  1. A character recognition method, comprising:
    acquiring an image to be processed, the image to be processed carrying one or more characters;
    performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed;
    obtaining, according to the image features, a plurality of text boxes of different scales in the image to be processed, and performing text box regression processing on the plurality of text boxes of different scales;
    determining, according to the plurality of text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and performing character recognition on the image to be processed based on the position of the one or more characters.
  2. The method according to claim 1, wherein the performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed comprises:
    performing feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image to be processed, wherein the densely connected network comprises one or more dense blocks, any two dense blocks in the densely connected network are directly connected, and the input of each dense block is the union of the outputs of all preceding dense blocks.
  3. The method according to claim 2, wherein the densely connected network further comprises one or more transition connection layers, the transition connection layer comprises a 1×1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers;
    the performing feature extraction on the image to be processed based on a densely connected network to obtain the image features corresponding to the image to be processed comprises:
    performing feature extraction on the image to be processed based on the one or more dense blocks and the one or more transition connection layers to obtain the image features corresponding to the image to be processed.
  4. The method according to any one of claims 1 to 3, wherein the obtaining, according to the image features, a plurality of text boxes of different scales in the image to be processed and performing text box regression processing on the plurality of text boxes of different scales comprises:
    obtaining, according to the image features, the plurality of text boxes of different scales in the image to be processed, and determining offset data of the plurality of text boxes of different scales;
    performing text box regression processing on the plurality of text boxes of different scales based on the offset data.
  5. The method according to claim 4, wherein the obtaining, according to the image features, the plurality of text boxes of different scales in the image to be processed and determining offset data of the plurality of text boxes of different scales comprises:
    performing downsampling processing on the image features, and performing downsampling and convolution processing on the downsampled image features;
    taking the image features after the downsampling and convolution processing as the new downsampled image features, and re-executing the step of performing downsampling and convolution processing on the downsampled image features until the plurality of text boxes of different scales in the image to be processed are obtained and the offset data of the plurality of text boxes of different scales is determined.
  6. The method according to any one of claims 1 to 5, wherein the determining, according to the plurality of text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed comprises:
    obtaining scores of the plurality of text boxes of different scales after the text box regression processing according to the plurality of text boxes of different scales after the text box regression processing and a preset score model, wherein the preset score model is used to determine the scores of the plurality of text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the plurality of text boxes of different scales and the plurality of text boxes of different scales;
    calculating positions of the plurality of text boxes of different scales after the text box regression processing according to the scores, and determining the position of the one or more characters in the image to be processed based on the calculated positions.
  7. The method according to claim 6, wherein the calculating positions of the plurality of text boxes of different scales after the text box regression processing according to the scores comprises:
    calculating the ratio of the intersection to the union of the highest-scoring text box among the plurality of text boxes of different scales after the text box regression processing and text box i after the text box regression processing, wherein text box i after the text box regression processing is any one of the plurality of text boxes of different scales after the text box regression processing, i=1,…,n, n is an integer, and n is determined according to the number of the plurality of text boxes of different scales after the text box regression processing;
    if the calculated ratio is less than a preset threshold, calculating the position of text box i after the text box regression processing according to the score of text box i after the text box regression processing.
  8. The method according to any one of claims 1 to 7, wherein before the performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed, the method further comprises:
    performing parameter reduction processing on the image to be processed;
    and the performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed comprises:
    performing feature extraction on the image to be processed after the parameter reduction processing to obtain the image features corresponding to the image to be processed.
  9. The method according to claim 8, wherein the performing parameter reduction processing on the image to be processed comprises:
    performing parameter reduction processing on the image to be processed by using three 3×3 convolutional layers and one 2×2 pooling layer, wherein the three 3×3 convolutional layers are connected in sequence and then connected to the 2×2 pooling layer.
  10. The method according to any one of claims 1 to 9, wherein the performing character recognition on the image to be processed based on the position of the one or more characters comprises:
    recognizing the characters in the image to be processed based on the position of the one or more characters and a preset recognition model, wherein the preset recognition model is used to recognize characters in an image according to the positions of the characters in the image.
  11. A character recognition apparatus, comprising:
    an image acquisition module, configured to acquire an image to be processed, the image to be processed carrying one or more characters;
    a feature extraction module, configured to perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed;
    a text box processing module, configured to obtain, according to the image features, a plurality of text boxes of different scales in the image to be processed, and perform text box regression processing on the plurality of text boxes of different scales;
    a character recognition module, configured to determine, according to the plurality of text boxes of different scales after the text box regression processing, the position of the one or more characters in the image to be processed, and perform character recognition on the image to be processed based on the position of the one or more characters.
  12. A character recognition device, comprising:
    a processor;
    a memory; and
    a computer program;
    wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method according to any one of claims 1-10.
  13. A computer-readable storage medium storing a computer program, the computer program causing a server to perform the method according to any one of claims 1-10.
  14. A computer program product comprising computer instructions which, when executed by a processor, perform the method according to any one of claims 1-10.
PCT/CN2022/102163 2021-12-15 2022-06-29 Character recognition method, apparatus, device and storage medium WO2023109086A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111535285.5 2021-12-15
CN202111535285.5A CN114495132A (zh) 2021-12-15 2021-12-15 Character recognition method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2023109086A1 true WO2023109086A1 (zh) 2023-06-22

Family

ID=81493740

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/102163 WO2023109086A1 (zh) 2021-12-15 2022-06-29 Character recognition method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN114495132A (zh)
WO (1) WO2023109086A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495132A (zh) * 2021-12-15 2022-05-13 深圳前海微众银行股份有限公司 文字识别方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583449A (zh) * 2018-10-29 2019-04-05 深圳市华尊科技股份有限公司 Character recognition method and related products
CN110443258A (zh) * 2019-07-08 2019-11-12 北京三快在线科技有限公司 Text detection method and apparatus, electronic device and storage medium
CN111476067A (zh) * 2019-01-23 2020-07-31 腾讯科技(深圳)有限公司 Character recognition method and apparatus for images, electronic device and readable storage medium
CN112364873A (zh) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and apparatus for curved text images, and computer device
CN114495132A (zh) * 2021-12-15 2022-05-13 深圳前海微众银行股份有限公司 Character recognition method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN114495132A (zh) 2022-05-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22905851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE