WO2024088269A1 - Text recognition method and apparatus, electronic device, and medium - Google Patents

Text recognition method and apparatus, electronic device, and medium

Info

Publication number
WO2024088269A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
prediction
convolution
image
sequence
Prior art date
Application number
PCT/CN2023/126280
Other languages
English (en)
French (fr)
Inventor
胡妍
Original Assignee
维沃移动通信有限公司 (Vivo Mobile Communication Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 维沃移动通信有限公司 (Vivo Mobile Communication Co., Ltd.)
Publication of WO2024088269A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/1444: Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/16: Image preprocessing
    • G06V30/166: Normalisation of pattern dimensions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/18: Extraction of features or characteristics of the image
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/191: Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147: Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • The present application belongs to the field of artificial intelligence technology, and specifically relates to a text recognition method and apparatus, an electronic device, and a medium.
  • Text recognition technology can be used to extract the text in an image.
  • The purpose of the embodiments of the present application is to provide a text recognition method and apparatus, an electronic device and a medium, which can solve the problem that the low recognition accuracy of a convolutional neural network model leads to a poor overall recognition effect.
  • In a first aspect, an embodiment of the present application provides a text recognition method, which includes: obtaining a text image that includes at least one text; inputting the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtaining, based on the text sequence prediction information, a text recognition result corresponding to the text image.
  • In a second aspect, an embodiment of the present application provides a text recognition apparatus, which includes an acquisition module, a prediction module and a processing module, wherein: the acquisition module is used to acquire a text image that includes at least one text; the prediction module is used to input the text image acquired by the acquisition module into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and the processing module is used to obtain, based on the text sequence prediction information obtained by the prediction module, a text recognition result corresponding to the text image.
  • In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, wherein the memory stores a program or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the method described in the first aspect are implemented.
  • In a fourth aspect, an embodiment of the present application provides a readable storage medium, wherein the readable storage medium stores a program or instructions which, when executed by a processor, implement the steps of the method described in the first aspect.
  • In a fifth aspect, an embodiment of the present application provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instructions to implement the method described in the first aspect.
  • In a sixth aspect, an embodiment of the present application provides a computer program product, which is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.
  • In a seventh aspect, an embodiment of the present application provides an electronic device, which is configured to execute the method described in the first aspect.
  • In the embodiments of the present application, an electronic device can obtain a text image that includes at least one text; input the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, a target text recognition result corresponding to the text image.
  • In this way, since the grouped convolutional neural network model has a small number of parameters, and the model can divide the input data into multiple groups to process the groups at the same time, the amount of calculation of the grouped convolutional neural network model can be reduced while the recognition accuracy is guaranteed, thereby improving the recognition effect of the electronic device.
  • FIG. 1 is a schematic flowchart of a text recognition method provided in an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a convolutional recurrent neural network model provided in an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a grouped convolutional neural network model provided in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a text recognition apparatus provided in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the hardware of an electronic device provided in an embodiment of the present application.
  • The terms "first", "second", etc. in the specification and claims of this application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here. The objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object can be one or more.
  • "And/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.
  • At present, text recognition technology is widely used. Compared with cloud computing, an on-device optical character recognition (OCR) algorithm can extract the text in an image offline; it has notable advantages such as low latency, protection of data privacy and security, reduced cloud energy consumption and no dependence on network stability, and is suitable for scenarios involving timeliness, cost and privacy considerations. However, the computing resources of mobile electronic devices are limited and cannot run a complex OCR algorithm model to meet users' needs for fast and accurate recognition of text in images.
  • The above OCR algorithm model adopts the network structure of a convolutional recurrent neural network (CRNN) with connectionist temporal classification (CTC), which is mainly composed of three parts: a convolutional neural network, a recurrent neural network and a transcription neural network.
  • The convolutional neural network is constructed from a series of convolutional layers, pooling layers and batch normalization (BN) layers. After an image is input into the convolutional neural network, it is converted into feature maps carrying feature information, which are output in sequence form as the input of the recurrent layer.
  • The recurrent neural network is composed of bidirectional long short-term memory (LSTM) units, which have a strong ability to capture information over a sequence and can obtain more context information to better recognize the text information in the image and obtain a predicted sequence.
  • The transcription neural network uses the CTC algorithm to convert the predicted sequence obtained by the recurrent neural network into a labeled sequence to obtain the final recognition result.
  • In the text recognition method provided in the embodiments of the present application, the electronic device can obtain a text image that includes at least one text; input the text image into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, the text recognition result corresponding to the text image.
  • In this way, since the grouped convolutional neural network model has a small number of parameters, and the model can divide the input data into multiple groups to process the groups at the same time, the amount of calculation of the grouped convolutional neural network model can be reduced while the recognition accuracy is guaranteed, thereby improving the recognition effect of the electronic device.
  • The text recognition method provided in this embodiment may be executed by a text recognition apparatus, which may be an electronic device, or a control module or processing module in the electronic device.
  • The technical solution provided in the embodiments of this application is described below using an electronic device as an example.
  • As shown in FIG. 1, the text recognition method may include the following steps 201 to 203:
  • Step 201: The electronic device obtains a text image.
  • The text image includes at least one text.
  • The text may be Chinese characters, English, or other text, which is not limited in the embodiments of the present application.
  • The text image may be a text image that has been grayscale-processed by the electronic device.
  • The electronic device may scale the sizes of the text images to adjust all text images to an equal size.
  • Step 202: The electronic device inputs the text image into the grouped convolutional neural network model for prediction, and obtains text sequence prediction information corresponding to the text image.
  • The grouped convolutional neural network model includes a group convolution layer, which is used to extract at least two groups of image feature information corresponding to the text image.
  • The text sequence prediction information is obtained based on the at least two groups of image feature information.
  • The grouped convolutional neural network model is generated by improving the CRNN+CTC network structure model.
  • Specifically, the recurrent neural network in the CRNN is removed, giving a convolutional neural network (CNN) + CTC network structure model. Then, the number of parameters in each layer of the CNN is reduced, and some of the standard convolutions are replaced with group convolutions of the same kernel size and with 1*1 convolutions, both of which have fewer parameters. Finally, to compensate for the drop in recognition accuracy caused by removing the recurrent neural network and cutting the number of parameters, the representation ability of the grouped convolutional neural network model is improved by increasing the network depth of the CNN.
  • The increase in the network depth of the CNN can be achieved with a custom convolution module in which a group convolution with a 3*3 kernel and a convolution with a 1*1 kernel alternate three times.
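  • The following is a minimal PyTorch sketch of such a module, assuming 4 groups and batch normalization between convolutions (the module and parameter names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class GroupConvBlock(nn.Module):
    """A 3*3 group convolution alternating with a 1*1 convolution, three times."""
    def __init__(self, channels: int, groups: int = 4, repeats: int = 3):
        super().__init__()
        layers = []
        for _ in range(repeats):
            # 3*3 group convolution: the channels are split into `groups` groups,
            # each convolved independently, cutting the weight count by 1/groups.
            layers += [nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
                       nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
            # 1*1 convolution: mixes information across groups with few parameters.
            layers += [nn.Conv2d(channels, channels, 1),
                       nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```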
  • The improved CNN+CTC refers to a prediction model that can be deployed on an electronic device to perform text recognition on text images.
  • The sequence positions may be multiple probability-value prediction positions set by the grouped convolutional neural network model based on the order of the text positions in the text image.
  • Step 203: The electronic device obtains a text recognition result corresponding to the text image based on the text sequence prediction information.
  • The text sequence prediction information may include a text sequence prediction matrix.
  • The text sequence is used to indicate the position order of the text in the text image.
  • Optionally, step 203, "the electronic device obtains a text recognition result corresponding to the text image based on the text sequence prediction information", may include the following steps 203a to 203c:
  • Step 203a: The electronic device calculates target prediction probability information based on the text sequence prediction information.
  • The target prediction probability information is used to represent the probability of each text index at each sequence position in the text sequence corresponding to the text sequence prediction information.
  • Each text index corresponds to a text in the character library.
  • The target prediction probability information may include a text sequence prediction probability matrix.
  • The electronic device can apply a normalized exponential function to the text sequence prediction matrix to calculate the probabilities and obtain the text sequence prediction probability matrix.
  • The normalized exponential function may be a softmax function.
  • Step 203b: The electronic device determines the text prediction result at each sequence position based on the target prediction probability information.
  • Each sequence position may correspond to multiple text prediction results, and the electronic device may determine the text prediction result with the highest prediction probability among them as the text prediction result for that sequence position.
  • The electronic device can take the prediction information corresponding to the maximum probability value at each sequence position in the text sequence prediction probability matrix as the recognition result index of that sequence position, and then look up the text prediction result corresponding to that index in the character set dictionary pre-stored in the electronic device to obtain the text recognition result at each sequence position.
  • Step 203c: The electronic device determines the text recognition result corresponding to the text image based on the text prediction result at each sequence position.
  • The electronic device may repeat the above indexing step to obtain a text recognition result sequence corresponding to the text sequence. Then, the electronic device may merge repeated recognition results at adjacent sequence positions through CTC and remove blank recognition results to obtain the final text recognition result.
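  • The decoding in steps 203a to 203c can be sketched as follows, assuming index 0 is the blank character and `charset` is the pre-stored character set dictionary (both are illustrative assumptions):

```python
import numpy as np

def ctc_greedy_decode(pred_matrix: np.ndarray, charset: list, blank: int = 0) -> str:
    # Step 203a: softmax over the character axis gives, for each sequence
    # position, the probability of every character index.
    e = np.exp(pred_matrix - pred_matrix.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    # Step 203b: take the index with the maximum probability at each position.
    indices = probs.argmax(axis=1)
    # Step 203c: merge repeated results at adjacent positions and drop blanks.
    result, prev = [], None
    for idx in indices:
        if idx != prev and idx != blank:
            result.append(charset[idx - 1])  # charset holds the non-blank characters
        prev = idx
    return "".join(result)
```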
  • To generate the character set dictionary, the electronic device can count the frequency of all characters that appear when training the grouped convolutional neural network model, and take the characters whose frequency is greater than a preset threshold as the character set dictionary.
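  • A short sketch of building such a dictionary, where the frequency threshold is an illustrative assumption:

```python
from collections import Counter

def build_charset(label_texts: list, min_freq: int = 50) -> list:
    # Count the frequency of every character over the training labels and
    # keep those whose frequency exceeds the preset threshold.
    freq = Counter(ch for text in label_texts for ch in text)
    return [ch for ch, n in freq.most_common() if n >= min_freq]
```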
  • In the text recognition method provided in the embodiments of the present application, the electronic device can obtain a text image that includes at least one text; input the text image into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the image features in the text image; and obtain, based on the text sequence prediction information, the text recognition result corresponding to the text image.
  • In this way, since the grouped convolutional neural network model has a small number of parameters, and the model can divide the input data into multiple groups to process the groups at the same time, the amount of calculation of the grouped convolutional neural network model can be reduced while the recognition accuracy is guaranteed, thereby improving the recognition effect of the electronic device.
  • Optionally, the grouped convolutional neural network model includes: a first standard convolution layer, a group convolution layer, a second standard convolution layer and a fully connected layer.
  • The first standard convolution layer, the group convolution layer, the second standard convolution layer and the fully connected layer are connected in sequence.
  • The first standard convolution layer includes a target standard convolution unit, and each convolution in the first standard convolution layer includes one convolution kernel.
  • The target standard convolution unit is used to reduce the number of parameters of the grouped convolutional neural network model.
  • For example, the first standard convolution layer may consist of a 3*3 convolution, a pooling layer, a 3*3 convolution, a pooling layer, a 1*1 convolution and a pooling layer.
  • The target standard convolution unit may be the 1*1 convolution.
  • The group convolution layer includes a target group convolution unit, and the group convolution layer includes M convolution kernels, where M is an integer greater than 1.
  • The target group convolution unit is used to reduce the amount of calculation of the grouped convolutional neural network model.
  • For example, the group convolution layer may consist of a 1*1 convolution, a 3*3 group convolution, a 1*1 convolution, a 3*3 group convolution, a 1*1 convolution, a 3*3 group convolution, a 1*1 convolution and a pooling layer.
  • The target group convolution unit may be the 3*3 group convolution.
  • The second standard convolution layer includes one convolution kernel.
  • In this way, by providing the target standard convolution unit and the target group convolution unit in the grouped convolutional neural network model, the number of parameters and the amount of calculation of the model can be reduced, thereby improving the recognition efficiency of the electronic device.
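  • A compact PyTorch sketch of this structure, reusing the GroupConvBlock sketched earlier; the channel widths and pooling placement are illustrative assumptions, since the patent does not give exact channel counts:

```python
import torch
import torch.nn as nn

class GroupedCNNRecognizer(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # First standard convolution layer: 3*3 conv, pool, 3*3 conv, pool, 1*1 conv, pool.
        self.first = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(True), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 1), nn.ReLU(True), nn.MaxPool2d((2, 1)),
        )
        # Group convolution layer: alternating 1*1 and 3*3 group convolutions,
        # ending in a pooling layer.
        self.grouped = nn.Sequential(GroupConvBlock(256), nn.MaxPool2d((2, 1)))
        # Second standard convolution layer: a 2*2 convolution reduces the
        # remaining height dimension (2 for a height-32 input here) to 1.
        self.second = nn.Conv2d(256, 256, 2)
        # Fully connected layers: dimension reduction, then projection to
        # num_classes = character set size + 1 blank character.
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.Linear(128, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.second(self.grouped(self.first(x)))  # (N, C, 1, W')
        seq = f.squeeze(2).permute(0, 2, 1)           # (N, W', C)
        return self.fc(seq)                           # per-position prediction matrix
```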
  • Optionally, step 202, "the electronic device inputs the text image into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image", may include the following steps 202a to 202d:
  • Step 202a: After the electronic device inputs the text image into the grouped convolutional neural network model, the first standard convolution layer is used to extract the first image feature information of the text image.
  • The first image feature information is used to characterize the text area features in the text image.
  • For example, the electronic device may sequentially use a 3*3 convolution, a pooling layer, a 3*3 convolution, a pooling layer, a 1*1 convolution and a pooling layer (i.e., the first standard convolution layer) to extract primary features (i.e., the first image feature information) from the text image.
  • Step 202b: The electronic device uses the group convolution layer to divide the first image feature information into M groups of image feature information, uses the M convolution kernels in the group convolution layer to respectively extract the key image feature information in each group, and fuses the resulting M groups of key image feature information to obtain the first key image feature information.
  • Each convolution kernel in the group convolution layer is used to process one group of image feature information.
  • The first key image feature information is used to represent the text feature information within the text area features.
  • For example, the electronic device may sequentially use a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution and a pooling layer (i.e., the group convolution layer) to extract intermediate features from the primary features.
  • The 1*1 convolution is used to process the irregular output of the previous pooling layer, improving the expressive ability of the network.
  • Then, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution and a pooling layer are used again in sequence to extract high-level features (i.e., the first key image feature information) from the intermediate features.
  • The above group convolution is a group convolution with a kernel size of 3*3 and a group number of 4; its parameter count is only one quarter of that of a standard 3*3 convolution.
  • The group convolution divides the first image feature information evenly into 4 groups, each of which is convolved with a 3*3 kernel to obtain its own key image feature information; the 4 groups of key image feature information are then merged to obtain one convolution output (i.e., the first key image feature information).
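  • The parameter saving can be verified directly; under the illustrative assumption of 256 input and output channels, the weight count of a 3*3 group convolution with 4 groups is one quarter that of a standard 3*3 convolution:

```python
import torch.nn as nn

std = nn.Conv2d(256, 256, 3, padding=1)            # standard 3*3 convolution
grp = nn.Conv2d(256, 256, 3, padding=1, groups=4)  # 3*3 group convolution, 4 groups

n_std = sum(p.numel() for p in std.parameters())
n_grp = sum(p.numel() for p in grp.parameters())
print(n_std, n_grp)  # the weight ratio is exactly 4 (biases aside)
```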
  • Step 202c: The electronic device uses the second standard convolution layer to extract the text sequence features of the first key image feature information.
  • The text sequence features are used to characterize the text content of the text in the text image.
  • For example, after obtaining the first key image feature information, the electronic device can first use a 1*1 convolution to process the irregular information in the first key image feature information, and then use a 2*2 convolution (i.e., the second standard convolution layer) to convert the height dimension of the processed first key image feature information to 1 (i.e., remove the height dimension), thereby extracting the text sequence features from the first key image feature information with the height dimension removed.
  • Step 202d: The electronic device uses the fully connected layer to obtain the text sequence prediction information corresponding to the text sequence features.
  • In the classic CRNN, two LSTMs are used at this stage to extract the sequence features and convert the text sequence features into a text sequence prediction matrix.
  • However, LSTM cannot perform parallel processing, and its processing efficiency on electronic devices is low, resulting in a poor text recognition effect.
  • Instead, the electronic device can use a fully connected layer to reduce the feature dimension of the text sequence features, which reduces the number of parameters of the next fully connected layer; another fully connected layer then converts the text sequence features into a text sequence prediction matrix (i.e., the text sequence prediction information).
  • The feature dimension size is equal to the number of characters in the character set dictionary plus one.
  • Specifically, the electronic device can add a blank character to all the characters included in the character set dictionary, and then set the feature dimension size according to the number of characters after the blank character is added, so that the two are equal.
  • In this way, the electronic device can obtain the corresponding text sequence prediction information more quickly, and by using fully connected layers to process the first key image feature information, the number of parameters of the grouped convolutional neural network model is further reduced, thereby improving the text recognition effect of the electronic device.
  • Optionally, the text recognition method provided in the embodiments of the present application further includes the following step 201a:
  • Step 201a: The electronic device cuts the text image into N sub-text images.
  • Each of the N sub-text images contains at least one text, and N is an integer greater than 1.
  • The heights of the N sub-text images are all equal.
  • For example, the electronic device can detect the positions of all text lines in the text image, crop out all text line images (i.e., the N sub-text images) according to the detected position coordinates, and then scale the text line images to convert them into images of equal height, as in the sketch below.
  • The height of the text line images matches the data size that can be processed by the grouped convolutional neural network model.
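  • A minimal OpenCV sketch of this cropping and scaling, assuming `boxes` holds detected text-line rectangles (x, y, w, h) from a separate text detector and an input height of 32 (both assumptions for illustration):

```python
import cv2

def crop_text_lines(image, boxes, target_height: int = 32):
    lines = []
    for (x, y, w, h) in boxes:
        line = image[y:y + h, x:x + w]
        # Scale to a uniform height while preserving the aspect ratio.
        new_w = max(1, round(w * target_height / h))
        lines.append(cv2.resize(line, (new_w, target_height)))
    return lines
```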
  • Accordingly, step 202, "the electronic device inputs the text image into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image", may include the following step 202e:
  • Step 202e: The electronic device inputs the N sub-text images into the grouped convolutional neural network model for prediction, and obtains text sequence prediction information corresponding to each of the N sub-text images.
  • For example, the electronic device can input the first of the N sub-text images into the grouped convolutional neural network model for prediction, and after obtaining the prediction result, input the second sub-text image, and so on in sequence.
  • After obtaining the text sequence prediction information corresponding to each of the N sub-text images, the electronic device can obtain the text recognition results based on the prediction information, and then typeset the text recognition results according to the detected text position coordinates to obtain the target text recognition result of the text image.
  • Optionally, the training process of the grouped convolutional neural network model may include the following steps S1 to S4:
  • Step S1: Data collection and expansion.
  • When collecting data, in order to make the grouped convolutional neural network model applicable to various scenes, the collected text images need to cover as many scenes as possible (such as cards, books and newspapers, screenshots, screens, posters, street scenes, handwriting, etc.). The collected text images then need to be manually annotated to obtain the corresponding text label files.
  • Data augmentation processes the labeled real data into new data through random geometric deformation, blurring, brightness and contrast adjustment, image compression, etc.
  • Font synthesis draws text images from font files and a corpus, and applies random backgrounds, text colors, fonts, geometric deformation, perspective changes, blurring, brightness and contrast adjustment, image compression, etc., to increase the authenticity and diversity of the synthetic images.
  • Sufficient training data can be obtained through the three methods of real collection, data augmentation and font synthesis.
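  • An illustrative augmentation pipeline covering some of the operations above; the specific probability and parameter ranges are assumptions, not values from the patent:

```python
import cv2
import numpy as np

def augment(img: np.ndarray) -> np.ndarray:
    # Random brightness and contrast adjustment.
    alpha = np.random.uniform(0.7, 1.3)  # contrast factor
    beta = np.random.uniform(-30, 30)    # brightness offset
    img = cv2.convertScaleAbs(img, alpha=alpha, beta=beta)
    # Random blur.
    if np.random.rand() < 0.3:
        img = cv2.GaussianBlur(img, (3, 3), 0)
    # Simulated JPEG compression artifacts.
    if np.random.rand() < 0.3:
        quality = int(np.random.uniform(30, 90))
        _, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
        img = cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)
    return img
```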
  • Step S2: Data preprocessing.
  • Before the collected data is sent to model training, it needs to be processed uniformly, specifically through size scaling, width sorting, and dictionary creation.
  • Text images vary in length. During training, multiple text images are often input in batches, which requires the widths and heights of the text images in a batch to be consistent. When the widths of text images in the same batch vary greatly, forcing them to a uniform width distorts the text in some images, causing a large loss of information and making a good training effect difficult to achieve. Therefore, the text images in the training set can be sorted by aspect ratio, several text images with adjacent aspect ratios can be taken as one batch, and all text images in the batch can be uniformly scaled to the size of the narrowest text image in the batch.
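  • A sketch of this width-sorting strategy, assuming the images have already been scaled to a common height (the function and variable names are illustrative):

```python
import cv2

def make_batches(samples, batch_size: int):
    # Each sample is (image, label); sort by the width/height aspect ratio.
    samples = sorted(samples, key=lambda s: s[0].shape[1] / s[0].shape[0])
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        # Uniformly scale the batch to its smallest width.
        min_w = min(img.shape[1] for img, _ in batch)
        yield [(cv2.resize(img, (min_w, img.shape[0])), label)
               for img, label in batch]
```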
  • Step S3: Model building.
  • The classic CRNN network structure is composed of a CNN based on 3*3 convolutions and a recurrent neural network (RNN) based on LSTM.
  • After a text image with a height of 32 is input into the classic model, the image feature information is first extracted by the CNN. For example, one 3*3 convolution (3*3 Conv), a pooling layer (pool), one 3*3 convolution, a pooling layer, two 3*3 convolutions, a pooling layer, two 3*3 convolutions and a pooling layer are used sequentially to extract image feature information, with the feature dimension gradually increasing from 64 to 512. The sequence features are then generated through the image-to-sequence mapping structure (Map-to-Sequence). Finally, two LSTMs extract the sequence features from the image feature information and convert them into a sequence prediction matrix as output.
  • The CNN is mainly composed of convolution and pooling layers with gradually increasing feature dimensions and 3*3 convolution kernels, and is used to extract image feature information.
  • The RNN is composed of two layers of LSTM, which extract the sequence features and convert them into a sequence prediction matrix.
  • However, LSTM is not conducive to deployment on electronic devices.
  • In the improved model, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution and a pooling layer are used to extract intermediate image feature information, and the same sequence of layers is then used again to extract high-level image feature information from the intermediate image feature information.
  • A 1*1 convolution is then used to add nonlinear excitation to the high-level image feature information.
  • A 2*2 convolution is used to convert the height dimension to size 1; the height dimension is then removed, and the feature dimension and width dimension are exchanged to meet the input requirements of the next layer, converting the four-dimensional high-level image feature information into a three-dimensional feature sequence.
  • The feature sequence is then passed through a fully connected layer with few parameters to reduce the feature dimension, which reduces the number of parameters in the next layer.
  • The sequence features after the dimension reduction are converted into a sequence prediction matrix by another fully connected layer.
  • The resulting sequence prediction matrix is the output of the entire model.
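  • The dimension handling can be sketched as follows, assuming PyTorch's (N, C, H, W) layout and illustrative sizes (256 channels, width 40, a 6000-character dictionary):

```python
import torch

features = torch.randn(8, 256, 1, 40)       # after the 2*2 convolution: height is 1
seq = features.squeeze(2).permute(0, 2, 1)  # remove height, swap feature/width axes
fc1 = torch.nn.Linear(256, 128)             # reduces the feature dimension
fc2 = torch.nn.Linear(128, 6001)            # character set size + 1 blank
prediction = fc2(fc1(seq))                  # (batch, width, classes): the sequence prediction matrix
```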
  • Step S4: Model training and quantization.
  • For model training, the training text images are divided into multiple batches, each consisting of a fixed number of text images, and the batches are randomly fed to the model.
  • The model built in step S3 is computed layer by layer to obtain a text sequence prediction matrix, and the normalized exponential function (softmax) is then used to convert the values in the text sequence prediction matrix into a text sequence prediction probability matrix with values in the range 0 to 1.
  • A greedy algorithm takes the result corresponding to the maximum probability value as the prediction result for each sequence position, and the predicted text sequence is obtained by mapping through the character set dictionary index.
  • The classic CTC loss function is used to calculate the loss value between the predicted text sequence and the corresponding label text sequence of the text image.
  • A stochastic optimizer (Adaptive Momentum, Adam) is used to update the model parameters. The initial learning rate is set to 0.0005 and then gradually decreased using cosine learning-rate decay. The above operations are repeated for the next batch of text images to update the model parameters again. After multiple rounds of parameter updates, the loss value drops to an appropriate range and stabilizes, and model training is complete.
  • For model quantization, in order to speed up model inference while maintaining good accuracy, half-precision floating point (FP16) is used to store the parameters and run model inference, yielding the grouped convolutional neural network model described above.
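  • A condensed training step matching the description above (CTC loss, Adam with initial learning rate 0.0005, cosine learning-rate decay, FP16 at the end); `model` and `batches` are assumed from the earlier sketches:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for images, targets, target_lengths in batches:
    log_probs = F.log_softmax(model(images), dim=2)  # (N, W, classes)
    log_probs = log_probs.permute(1, 0, 2)           # CTC loss expects (W, N, classes)
    input_lengths = torch.full((log_probs.size(1),), log_probs.size(0), dtype=torch.long)
    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
scheduler.step()  # cosine decay, stepped once per epoch

model.half()  # FP16 parameter storage and inference, as in the quantization step
```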
  • The text recognition method provided in the embodiments of the present application can be executed by a text recognition apparatus.
  • In the embodiments of the present application, the text recognition apparatus is described by taking the case where the text recognition apparatus executes the text recognition method as an example.
  • As shown in FIG. 4, the text recognition apparatus 400 includes an acquisition module 401, a prediction module 402 and a processing module 403, wherein: the acquisition module 401 is used to obtain a text image that includes at least one text; the prediction module 402 is used to input the text image obtained by the acquisition module 401 into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and the processing module 403 is used to obtain, based on the text sequence prediction information obtained by the prediction module 402, the text recognition result corresponding to the text image.
  • Optionally, the grouped convolutional neural network model includes: a first standard convolution layer, a group convolution layer, a second standard convolution layer and a fully connected layer;
  • the prediction module 402 is specifically used to: after the text image acquired by the acquisition module 401 is input into the grouped convolutional neural network model, use the first standard convolution layer to extract the first image feature information of the text image;
  • use the group convolution layer to divide the first image feature information into M groups of image feature information, use the M convolution kernels in the group convolution layer to respectively extract the key image feature information in each group, and fuse the resulting M groups of key image feature information to obtain the first key image feature information, where each convolution kernel in the group convolution layer is used to process one group of image feature information and M is an integer greater than 1; use the second standard convolution layer to extract the text sequence features of the first key image feature information; and use the fully connected layer to obtain the text sequence prediction information corresponding to the text sequence features.
  • The first standard convolution layer, the group convolution layer, the second standard convolution layer and the fully connected layer are connected in sequence.
  • The first standard convolution layer includes a target standard convolution unit, which is used to reduce the number of parameters of the grouped convolutional neural network model, and each convolution in the first standard convolution layer includes one convolution kernel.
  • The group convolution layer includes a target group convolution unit, which is used to reduce the amount of calculation of the grouped convolutional neural network model; the group convolution layer includes M convolution kernels, and the second standard convolution layer includes one convolution kernel.
  • Optionally, the text recognition apparatus 400 further includes a cropping module, wherein: the cropping module is used to crop the text image into N sub-text images after the acquisition module 401 acquires the text image, each sub-text image containing at least one text, and N being an integer greater than 1; the prediction module 402 is specifically used to input the N sub-text images obtained by the cropping module into the grouped convolutional neural network model for prediction, and obtain the text sequence prediction information corresponding to each of the N sub-text images.
  • Optionally, the processing module 403 is specifically used to: calculate target prediction probability information based on the text sequence prediction information obtained by the prediction module 402, the target prediction probability information being used to characterize the probability of each text index at each sequence position in the text sequence corresponding to the text sequence prediction information, each text index corresponding to a text in the character library; determine the text prediction result at each sequence position based on the target prediction probability information; and determine the text recognition result corresponding to the text image based on the text prediction result at each sequence position.
  • With the text recognition apparatus provided in the embodiments of the present application, the apparatus can obtain a text image that includes at least one text; input the text image into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, the text recognition result corresponding to the text image.
  • In this way, since the grouped convolutional neural network model has a small number of parameters, and the model can divide the input data into multiple groups to process the groups at the same time, the amount of calculation of the grouped convolutional neural network model can be reduced while the recognition accuracy is guaranteed, thereby improving the recognition effect of the text recognition apparatus.
  • The text recognition apparatus in the embodiments of the present application can be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip.
  • The electronic device can be a terminal, or a device other than a terminal.
  • For example, the electronic device may be a mobile phone, a tablet computer, a laptop computer, an in-vehicle electronic device, a mobile Internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), etc.;
  • it may also be a server, a network attached storage (NAS) device, a personal computer (PC), a television (TV), a teller machine or a self-service machine, etc., which is not specifically limited in the embodiments of the present application.
  • The text recognition apparatus in the embodiments of the present application may be a device having an operating system.
  • The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
  • The text recognition apparatus provided in the embodiments of the present application can implement each process implemented by the method embodiment of FIG. 1; to avoid repetition, details are not repeated here.
  • As shown in FIG. 5, an embodiment of the present application also provides an electronic device 600, including a processor 601 and a memory 602, where the memory 602 stores a program or instructions that can be executed on the processor 601. When the program or instructions are executed by the processor 601, the steps of the above text recognition method embodiment are implemented with the same technical effect; to avoid repetition, details are not repeated here.
  • The electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices mentioned above.
  • FIG. 6 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.
  • The electronic device 100 includes but is not limited to components such as a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.
  • The electronic device 100 may also include a power source (such as a battery) for supplying power to each component; the power source may be logically connected to the processor 110 through a power management system, so that the power management system can manage charging, discharging, and power consumption.
  • The electronic device structure shown in FIG. 6 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or arrange components differently, which will not be described in detail here.
  • The processor 110 is used to: obtain a text image that includes at least one text; input the text image into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, the text recognition result corresponding to the text image.
  • Optionally, the grouped convolutional neural network model includes: a first standard convolution layer, a group convolution layer, a second standard convolution layer and a fully connected layer. The processor 110 is specifically used to: after inputting the text image into the grouped convolutional neural network model, use the first standard convolution layer to extract the first image feature information of the text image; and use the group convolution layer to divide the first image feature information into M groups of image feature information, use the M convolution kernels in the group convolution layer to respectively extract the key image feature information in each group, and fuse the resulting M groups of key image feature information to obtain the first key image feature information.
  • Each convolution kernel in the group convolution layer is used to process one group of image feature information, and M is an integer greater than 1.
  • The second standard convolution layer is used to extract the text sequence features of the first key image feature information.
  • The fully connected layer is used to obtain the text sequence prediction information corresponding to the text sequence features.
  • The first standard convolution layer, the group convolution layer, the second standard convolution layer and the fully connected layer are connected in sequence.
  • The first standard convolution layer includes a target standard convolution unit, which is used to reduce the number of parameters of the grouped convolutional neural network model, and each convolution in the first standard convolution layer includes one convolution kernel.
  • The group convolution layer includes a target group convolution unit, which is used to reduce the amount of calculation of the grouped convolutional neural network model; the group convolution layer includes M convolution kernels, and the second standard convolution layer includes one convolution kernel.
  • Optionally, the processor 110 is further used to cut the text image into N sub-text images, each containing at least one text, where N is an integer greater than 1; the processor 110 is specifically used to input the N sub-text images into the grouped convolutional neural network model for prediction, and obtain text sequence prediction information corresponding to each of the N sub-text images.
  • Optionally, the processor 110 is specifically used to: calculate target prediction probability information based on the obtained text sequence prediction information, the target prediction probability information being used to characterize the probability of each text index at each sequence position in the text sequence corresponding to the text sequence prediction information, each text index corresponding to a text in the character library; determine the text prediction result at each sequence position based on the target prediction probability information; and determine the text recognition result corresponding to the text image based on the text prediction result at each sequence position.
  • In this way, the electronic device can obtain a text image that includes at least one text; input the text image into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, the text recognition result corresponding to the text image.
  • Since the grouped convolutional neural network model has a small number of parameters, and the model can divide the input data into multiple groups to process the groups at the same time, the amount of calculation of the grouped convolutional neural network model can be reduced while the recognition accuracy is guaranteed, thereby improving the recognition effect of the electronic device.
  • The input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042; the graphics processing unit 1041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode.
  • The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, etc.
  • The user input unit 107 includes a touch panel 1071 and at least one of other input devices 1072.
  • The touch panel 1071 is also called a touch screen.
  • The touch panel 1071 may include two parts: a touch detection device and a touch controller.
  • The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which will not be repeated here.
  • The memory 109 can be used to store software programs and various data.
  • The memory 109 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, where the first storage area may store an operating system, and application programs or instructions required for at least one function (such as a sound playback function and an image playback function), etc.
  • The memory 109 may include a volatile memory or a non-volatile memory, or the memory 109 may include both volatile and non-volatile memories.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • The volatile memory may be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM) or a direct Rambus random access memory (DRRAM).
  • The memory 109 in the embodiments of the present application includes but is not limited to these and any other suitable types of memory.
  • The processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor and a modem processor, where the application processor mainly handles operations related to the operating system, the user interface and application programs, and the modem processor mainly handles wireless communication signals, such as a baseband processor. It is understood that the modem processor may not be integrated into the processor 110.
  • An embodiment of the present application also provides a readable storage medium on which a program or instructions are stored. When the program or instructions are executed by a processor, the various processes of the above text recognition method embodiment are implemented with the same technical effect; to avoid repetition, details are not repeated here.
  • The processor is the processor in the electronic device described in the above embodiments.
  • The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
  • An embodiment of the present application further provides a chip, which includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above text recognition method embodiment with the same technical effect; to avoid repetition, details are not repeated here.
  • The chip mentioned in the embodiments of the present application may also be called a system-level chip, a system chip, a chip system or a system-on-chip.
  • An embodiment of the present application provides a computer program product, which is stored in a storage medium and is executed by at least one processor to implement the various processes of the above text recognition method embodiment with the same technical effect; to avoid repetition, details are not repeated here.
  • The technical solution of the present application can be embodied in the form of a computer software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes a number of instructions for enabling a terminal (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Discrimination (AREA)

Abstract

The present application discloses a text recognition method and apparatus, an electronic device, and a medium, belonging to the field of text recognition algorithms. The text recognition method includes: obtaining a text image that includes at least one text; inputting the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtaining, based on the text sequence prediction information, a text recognition result corresponding to the text image.

Description

Text recognition method and apparatus, electronic device, and medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 202211320472.6, filed with the China Patent Office on October 26, 2022 and entitled "Text recognition method and apparatus, electronic device, and medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application belongs to the field of artificial intelligence technology, and specifically relates to a text recognition method and apparatus, an electronic device, and a medium.
BACKGROUND
With the development of smart terminal technology, text recognition technology is used more and more widely, and can be used to extract the text in an image.
In the related art, when performing text recognition, an electronic device usually directly cuts the number of network parameters in each layer of the convolutional neural network model used, in order to reduce the amount of calculation and the number of parameters and thereby increase recognition speed. However, this approach lowers the recognition accuracy of the convolutional neural network model, resulting in a poor overall recognition effect.
SUMMARY
The purpose of the embodiments of the present application is to provide a text recognition method and apparatus, an electronic device, and a medium, which can solve the problem that the low recognition accuracy of a convolutional neural network model leads to a poor overall recognition effect.
To solve the above technical problem, the present application is implemented as follows:
In a first aspect, an embodiment of the present application provides a text recognition method, including: obtaining a text image that includes at least one text; inputting the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtaining, based on the text sequence prediction information, a text recognition result corresponding to the text image.
In a second aspect, an embodiment of the present application provides a text recognition apparatus, including an acquisition module, a prediction module and a processing module, wherein: the acquisition module is used to obtain a text image that includes at least one text; the prediction module is used to input the text image obtained by the acquisition module into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and the processing module is used to obtain, based on the text sequence prediction information obtained by the prediction module, a text recognition result corresponding to the text image.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores a program or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the method described in the first aspect are implemented.
In a fourth aspect, an embodiment of the present application provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the method described in the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, including a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is used to run a program or instructions to implement the method described in the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer program product, which is stored in a storage medium and executed by at least one processor to implement the method described in the first aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device configured to execute the method described in the first aspect.
In the embodiments of the present application, an electronic device can obtain a text image that includes at least one text; input the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, a target text recognition result corresponding to the text image. In this way, since the grouped convolutional neural network model has a small number of parameters, and the model can divide the input data into multiple groups to process the groups at the same time, the amount of calculation of the grouped convolutional neural network model can be reduced while the recognition accuracy is guaranteed, thereby improving the recognition effect of the electronic device.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of a text recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a convolutional recurrent neural network model provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a grouped convolutional neural network model provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a text recognition apparatus provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of the hardware of an electronic device provided in an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application fall within the protection scope of the present application.
The terms "first", "second", etc. in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here. The objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object can be one or more. In addition, "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.
The text recognition method and apparatus, electronic device and medium provided by the embodiments of the present application are described in detail below through specific embodiments and their application scenarios with reference to the drawings.
At present, text recognition technology is widely used. Compared with cloud computing, an on-device optical character recognition (OCR) algorithm can extract the text in an image offline; it has notable advantages such as low latency, protection of data privacy and security, reduced cloud energy consumption and no dependence on network stability, and is suitable for scenarios involving timeliness, cost and privacy considerations. However, the computing resources of mobile electronic devices are limited and cannot run a complex OCR algorithm model to meet users' needs for fast and accurate recognition of text in images.
The above OCR algorithm model adopts the network structure of a convolutional recurrent neural network (CRNN) with connectionist temporal classification (CTC), which mainly consists of three parts: a convolutional neural network, a recurrent neural network and a transcription neural network. The convolutional neural network is constructed from a series of convolutional layers, pooling layers and batch normalization (BN) layers. After an image is input into the convolutional neural network, it is converted into feature maps carrying feature information, which are output in sequence form as the input of the recurrent layer. The recurrent neural network is composed of bidirectional long short-term memory (LSTM) units, which have a strong ability to capture information over a sequence and can obtain more context information to better recognize the text information in the image and obtain a predicted sequence. The transcription neural network uses the CTC algorithm to convert the predicted sequence obtained by the recurrent neural network into a labeled sequence to obtain the final recognition result.
In the related art, when an electronic device performs text recognition, a model with a very small amount of calculation is required while a good text recognition effect must still be achieved. To apply the above CRNN network model to an electronic device, the number of parameters of the convolutional layers in the CRNN network model needs to be cut to reduce its amount of calculation, so as to achieve real-time performance and reduce the size of the CRNN network model. However, this cutting of parameters also significantly lowers the accuracy of text recognition, resulting in a poor final text recognition effect.
In the text recognition method and apparatus, electronic device and medium provided in the embodiments of the present application, the electronic device can obtain a text image that includes at least one text; input the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, the text recognition result corresponding to the text image. In this way, since the grouped convolutional neural network model has a small number of parameters, and the model can divide the input data into multiple groups to process the groups at the same time, the amount of calculation of the grouped convolutional neural network model can be reduced while the recognition accuracy is guaranteed, thereby improving the recognition effect of the electronic device.
The text recognition method provided in this embodiment may be executed by a text recognition apparatus, which may be an electronic device or a control module or processing module in the electronic device. The technical solution provided in the embodiments of the present application is described below by taking an electronic device as an example.
An embodiment of the present application provides a text recognition method. As shown in FIG. 1, the text recognition method may include the following steps 201 to 203:
Step 201: The electronic device obtains a text image.
In this embodiment of the present application, the text image includes at least one text.
Exemplarily, the text may be Chinese characters, English, or other text, which is not limited in the embodiments of the present application.
In this embodiment of the present application, the text image may be a text image that has been grayscale-processed by the electronic device.
In this embodiment of the present application, the grayscale processing unifies the red (R), green (G) and blue (B) values in the text image so that R = G = B.
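The grayscale step can be sketched as follows; the luminance weights are a common choice and an assumption here, since the description only requires R = G = B after processing:

```python
import numpy as np

def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # per-pixel luminance
    return np.repeat(gray[..., None], 3, axis=-1)  # set R = G = B
```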
示例性地,上述文字图片的尺寸高度相等。
示例性地,电子设备可以缩放上述文字图片的尺寸,将所有文字图片的尺寸都调整相等。
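By way of illustration only (this sketch is not part of the original disclosure), the grayscale processing and height unification described above can be expressed in a few lines of Python; the use of OpenCV, the function name `preprocess`, and the target height of 32 (taken from the model description later in this document) are assumptions of the sketch:

```python
import cv2

def preprocess(image_bgr, target_height=32):
    # Grayscale processing: unify the R, G and B values so that R = G = B.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Scale proportionally so that all text images share the same height.
    h, w = gray.shape
    new_w = max(1, round(w * target_height / h))
    return cv2.resize(gray, (new_w, target_height))
```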
Step 202: The electronic device inputs the text image into the grouped convolutional neural network model for prediction to obtain the text sequence prediction information corresponding to the text image.
In this embodiment of the present application, the grouped convolutional neural network model includes a group convolution layer, which is used to extract at least two groups of image feature information corresponding to the text image.
In this embodiment of the present application, the text sequence prediction information is obtained based on the at least two groups of image feature information.
In this embodiment of the present application, the grouped convolutional neural network model is generated by improving on the CRNN+CTC network structure model.
For example, the recurrent neural network in the CRNN is removed, changing it to a convolutional neural network (convolutional neural network, CNN) + CTC network structure model. Then, the parameter count of each CNN layer is cut, and some of the standard convolutions are replaced by group convolutions with the same kernel size but fewer parameters and by 1*1 convolutions. Finally, to compensate for the drop in recognition accuracy caused by removing the recurrent neural network and cutting parameters, the network depth of the CNN is increased to improve the representational capacity of the grouped convolutional neural network model.
It should be noted that increasing the network depth of the CNN may be done by customizing a convolution module consisting of 3*3 group convolutions and 1*1 convolutions alternating three times.
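As a minimal PyTorch sketch of such a module (not part of the original disclosure): the channel-count parameter, the ReLU activations, and the trailing pooling layer are assumptions here, while the group count of 4 follows the example given later in this document:

```python
import torch.nn as nn

class GroupConvBlock(nn.Module):
    """3*3 group convolutions and 1*1 convolutions alternating three
    times, closed by a final 1*1 convolution and a pooling layer.
    `channels` must be divisible by `groups`."""
    def __init__(self, channels, groups=4):
        super().__init__()
        layers = []
        for _ in range(3):
            layers += [
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=1, groups=groups),
                nn.ReLU(inplace=True),
            ]
        layers += [nn.Conv2d(channels, channels, kernel_size=1),
                   nn.ReLU(inplace=True),
                   nn.MaxPool2d(kernel_size=2, stride=2)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```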
In this embodiment of the present application, the improved CNN+CTC refers to a prediction model for recognizing the text in text images that can be deployed on an electronic device.
For example, the sequence positions may be multiple probability-value prediction positions set by the grouped convolutional neural network model based on the order of the character positions in the text image.
Step 203: The electronic device obtains, based on the text sequence prediction information, the text recognition result corresponding to the text image.
In this embodiment of the present application, the text sequence prediction information may include a text sequence prediction matrix.
For example, the text sequence is used to indicate the positional order of the characters in the text image.
Optionally, in this embodiment of the present application, "the electronic device obtains, based on the text sequence prediction information, the text recognition result corresponding to the text image" in step 203 may include the following steps 203a to 203c:
Step 203a: The electronic device calculates target prediction probability information based on the text sequence prediction information.
In this embodiment of the present application, the target prediction probability information is used to characterize the probability of each character index at each sequence position in the text sequence corresponding to the text sequence prediction information.
For example, each character index corresponds to one character in the character library.
In this embodiment of the present application, the target prediction probability information may include a text sequence prediction probability matrix.
In this embodiment of the present application, the electronic device may apply a normalized exponential function to the text sequence prediction matrix to compute probabilities, obtaining the text sequence prediction probability matrix.
In this embodiment of the present application, the normalized exponential function may be the softmax function.
It should be noted that the normalized exponential function is used to convert the values of the text sequence prediction matrix into probability values ranging from 0 to 1.
Step 203b: The electronic device determines, based on the target prediction probability information, the character prediction result at each sequence position.
In this embodiment of the present application, each sequence position may correspond to multiple character prediction results, and the electronic device may determine the character prediction result with the highest prediction probability among them as the character prediction result for that sequence position.
In this embodiment of the present application, the electronic device may take the prediction information corresponding to the maximum probability value at each sequence position in the text sequence prediction probabilities as the recognition result index for that sequence position, and then look up the character prediction result corresponding to that prediction information in the character set dictionary pre-stored in the electronic device, obtaining the character recognition result at each sequence position.
Step 203c: The electronic device determines, based on the character prediction result at each sequence position, the text recognition result corresponding to the text image.
In this embodiment of the present application, the electronic device may repeat the above indexing step to obtain the text recognition result sequence corresponding to the text sequence. Then, the electronic device may use CTC to merge duplicate recognition results at adjacent sequence positions and remove blank recognition results, obtaining the final text recognition result.
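Taken together, steps 203a to 203c amount to greedy CTC decoding. A minimal sketch (not part of the original disclosure), assuming PyTorch tensors, a per-image prediction matrix of shape (T, C), and the convention that index 0 is the blank:

```python
import torch

def greedy_ctc_decode(logits, idx_to_char, blank=0):
    """logits: (T, C) text sequence prediction matrix for one image."""
    # Softmax turns raw scores into per-position probabilities in [0, 1].
    probs = torch.softmax(logits, dim=-1)
    # Keep the most probable character index at every sequence position.
    best = probs.argmax(dim=-1).tolist()
    chars, prev = [], None
    for idx in best:
        # CTC post-processing: merge adjacent repeats, then drop blanks.
        if idx != prev and idx != blank:
            chars.append(idx_to_char[idx])
        prev = idx
    return "".join(chars)
```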
The generation of the character set dictionary used in the embodiments of the present application is explained below:
For example, the electronic device may count the frequency of all Chinese characters appearing when training the grouped convolutional neural network model, and take the Chinese characters whose frequency is greater than a preset threshold as the character set dictionary.
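A minimal sketch of such dictionary construction (not from the disclosure); the threshold value and the convention of reserving index 0 for the CTC blank, consistent with the blank character mentioned above, are illustrative assumptions:

```python
from collections import Counter

def build_charset(label_texts, min_freq=10):
    # Count character frequencies over the training labels and keep
    # the characters whose frequency exceeds a preset threshold.
    freq = Counter(ch for text in label_texts for ch in text)
    charset = sorted(ch for ch, n in freq.items() if n > min_freq)
    # Index 0 is reserved for the CTC blank; real characters start at 1.
    idx_to_char = {i + 1: ch for i, ch in enumerate(charset)}
    return idx_to_char
```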
In this way, by calculating the probability of the character recognition result at each sequence position and selecting, among the multiple recognition results, the one with the highest probability as the final text recognition result, the accuracy of text recognition is improved.
In the text recognition method provided by the embodiments of the present application, the electronic device may acquire a text image including at least one character; input the text image into the grouped convolutional neural network model for prediction to obtain the text sequence prediction information corresponding to the image features in the text image; and obtain, based on the text sequence prediction information, the text recognition result corresponding to the text image. In this way, since the grouped convolutional neural network model has relatively few parameters, and since it can divide the input data into multiple groups and process those groups simultaneously, the computation of the grouped convolutional neural network model can be reduced while the recognition accuracy is preserved, thereby improving the recognition effect of the electronic device.
Optionally, in this embodiment of the present application, the grouped convolutional neural network model includes: a first standard convolution layer, a group convolution layer, a second standard convolution layer, and a fully connected layer.
In this embodiment of the present application, the first standard convolution layer, the group convolution layer, the second standard convolution layer, and the fully connected layer are connected in sequence.
In this embodiment of the present application, the first standard convolution layer includes a target standard convolution unit, and the first standard convolution layer includes one convolution kernel.
It should be noted that the target standard convolution unit is used to reduce the parameter count of the grouped convolutional neural network model.
In this embodiment of the present application, each convolution in the first standard convolution layer includes one convolution kernel.
For example, the first standard convolution layer may be a convolution layer composed of a 3*3 convolution, a pooling layer, a 3*3 convolution, a pooling layer, a 1*1 convolution, and a pooling layer.
For example, the target standard convolution unit may be the 1*1 convolution.
It should be noted that the 1*1 convolution is used to raise the feature dimension size, avoiding an excessive parameter count in the preceding 3*3 convolution.
In this embodiment of the present application, the group convolution layer includes a target group convolution unit, and the group convolution layer includes M convolution kernels, where M is an integer greater than 1.
It should be noted that the target group convolution unit is used to reduce the computation of the grouped convolutional neural network model.
For example, the group convolution layer may be a group convolution layer composed of a 1*1 convolution, a 3*3 group convolution, a 1*1 convolution, a 3*3 group convolution, a 1*1 convolution, a 3*3 group convolution, a 1*1 convolution, and a pooling layer.
For example, the target group convolution unit may be a 3*3 group convolution.
In this embodiment of the present application, the second standard convolution layer includes one convolution kernel.
In this way, by providing the target standard convolution unit and the target group convolution unit in the grouped convolutional neural network model, the parameter count and computation of the grouped convolution model can be reduced, improving the recognition efficiency of the electronic device.
Optionally, in this embodiment of the present application, "the electronic device inputs the text image into the grouped convolutional neural network model for prediction to obtain the text sequence prediction information corresponding to the text image" in step 202 may include the following steps 202a to 202d:
Step 202a: After inputting the text image into the grouped convolutional neural network model, the electronic device extracts the first image feature information of the text image using the first standard convolution layer.
In this embodiment of the present application, the first image feature information is used to characterize the text region features in the text image.
For example, the electronic device may sequentially apply a 3*3 convolution, a pooling layer, a 3*3 convolution, a pooling layer, a 1*1 convolution, and a pooling layer (i.e., the first standard convolution layer) to extract low-level features (i.e., the first image feature information) from the text image.
Step 202b: The electronic device divides the first image feature information into M groups of image feature information using the group convolution layer, extracts the key image feature information in each group of image feature information using the M convolution kernels in the group convolution layer respectively, and fuses the obtained M groups of key image feature information to obtain the first key image feature information.
In this embodiment of the present application, each convolution kernel in the group convolution layer is used to process one group of image feature information.
In this embodiment of the present application, the first key image feature information is used to characterize the character feature information within the text region features.
For example, the electronic device may sequentially apply a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, and a pooling layer (i.e., the group convolution layer) to extract mid-level features from the low-level features, where the 1*1 convolution processes the output of the previous pooling layer to improve the expressive capacity of the network. Then, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, and a pooling layer are applied in sequence once more to extract high-level features (i.e., the first key image feature information) from the mid-level features. The group convolution here has a 3*3 kernel and 4 groups; it divides the first image feature information evenly into 4 groups, convolves each group with a 3*3 kernel to obtain each group's own key image feature information, and then merges the 4 groups of key image feature information into a single convolution output (i.e., the first key image feature information).
It should be noted that the parameter count of a group convolution with a 3*3 kernel is only a quarter of that of a standard convolution with a 3*3 kernel.
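This quarter ratio is straightforward to verify. A short PyTorch check (illustrative only; the 64-channel sizes are chosen purely for the example):

```python
import torch.nn as nn

cin, cout = 64, 64
standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(cin, cout, kernel_size=3, padding=1, groups=4, bias=False)
# Each of the 4 groups convolves cin/4 input channels to cout/4 output
# channels, so the weight count is cin*cout*3*3/4 — a quarter of standard.
print(standard.weight.numel())  # 64*64*3*3 = 36864
print(grouped.weight.numel())   # 36864 / 4 =  9216
```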
Step 202c: The electronic device extracts the text sequence features of the first key image feature information using the second standard convolution layer.
In this embodiment of the present application, the text sequence features are used to characterize the textual content of the characters in the text image.
For example, after obtaining the first key image feature information, the electronic device may first apply a 1*1 convolution to process the irregular information in the first key image feature information, then apply a 2*2 convolution (i.e., the second standard convolution layer) to convert the height dimension of the processed first key image feature information to 1 (i.e., remove the height dimension), thereby extracting the text sequence features from the first key image feature information with the height dimension removed.
Step 202d: The electronic device obtains, using the fully connected layer, the text sequence prediction information corresponding to the text sequence features.
In the related art, after the text sequence features are obtained, two LSTMs are used to extract sequence features and convert the text sequence features into a text sequence prediction matrix. However, the LSTM cannot be processed in parallel, and its processing efficiency when applied on an electronic device is low, leading to a poor text recognition effect.
In this embodiment of the present application, after obtaining the text sequence features, the electronic device may use one fully connected layer to reduce the feature dimension size of the text sequence features, so as to reduce the parameter count of the next fully connected layer, and then use another fully connected layer to convert the text sequence features into the text sequence prediction matrix (i.e., the text sequence prediction information).
It should be noted that the feature dimension size equals the number of characters in the character set dictionary plus one.
It can be understood that the electronic device may add one blank character on top of all the characters included in the character set dictionary, and then set the feature dimension size according to the number of characters after the blank character is added, so that the feature dimension size equals that character count.
In this way, by using the improved grouped convolutional neural network model to process the input text image, the electronic device can obtain the corresponding text sequence prediction information more quickly; moreover, by using the fully connected layers to process the first key image feature information, the parameter count of the grouped convolutional neural network model is further reduced, improving the text recognition effect of the electronic device.
Optionally, in this embodiment of the present application, after step 201, the text recognition method provided by the embodiments of the present application further includes the following step 201a:
Step 201a: The electronic device crops the text image into N sub-text-images.
In this embodiment of the present application, each of the N sub-text-images contains at least one character, where N is an integer greater than 1.
In this embodiment of the present application, the N sub-text-images are all equal in image height.
In this embodiment of the present application, the electronic device may detect the positions of all text lines in the text image, then crop out all text line images (i.e., the N sub-text-images) according to the detected position coordinates, and then scale the text line images into images of equal height.
It should be noted that the height of the text line images matches the data size that the grouped convolutional neural network model can process.
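A minimal sketch of this cropping and rescaling (illustrative, not from the disclosure), assuming OpenCV, axis-aligned line boxes in (x, y, w, h) form, and a model input height of 32:

```python
import cv2

def crop_text_lines(image, line_boxes, target_height=32):
    """line_boxes: detected (x, y, w, h) coordinates of every text line."""
    sub_images = []
    for x, y, w, h in line_boxes:
        line = image[y:y + h, x:x + w]
        # Scale each text line image to the fixed height the model accepts.
        new_w = max(1, round(w * target_height / h))
        sub_images.append(cv2.resize(line, (new_w, target_height)))
    return sub_images
```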
Further optionally, in this embodiment of the present application, in combination with step 201a, "the electronic device inputs the text image into the grouped convolutional neural network model for prediction to obtain the text sequence prediction information corresponding to the text image" in step 202 may include the following step 202e:
Step 202e: The electronic device inputs the N sub-text-images into the grouped convolutional neural network model for prediction to obtain the text sequence prediction information corresponding to each of the N sub-text-images.
In this embodiment of the present application, the electronic device may input the first of the N sub-text-images into the grouped convolutional neural network model for prediction, and after obtaining the prediction result, input the second sub-text-image, performing prediction in sequence.
In this embodiment of the present application, after obtaining the text sequence prediction information corresponding to each of the N sub-text-images, the electronic device may obtain text recognition results based on the prediction information, and then lay out the text recognition results according to the detected text position coordinates, so as to obtain the target text recognition result of the text image.
In this way, by cropping the text image and processing the pieces one by one, the computation of the grouped convolutional neural network model can be reduced further, which further improves the recognition speed while preserving the recognition accuracy.
The training process of the grouped convolutional neural network model used in the embodiments of the present application is illustrated below:
For example, the training process of the grouped convolutional neural network model may include the following steps S1 to S4:
Step S1: Data collection and expansion.
In this embodiment of the present application, during data collection, in order for the grouped convolutional neural network model to be general across various scenarios, the collected text images also need to cover as many scenarios as possible (such as cards and certificates, books and newspapers, screenshots, screens, posters, street views, and handwriting). The collected text images then need to be manually annotated to obtain the corresponding text label files.
Since manual data collection and annotation are inefficient, the data needs to be expanded by data synthesis. There are two ways to expand the data: data augmentation and font synthesis.
Data augmentation processes the annotated real data into new data through random geometric deformation, blurring, brightness and contrast adjustment, image compression, and the like.
Font synthesis draws text images from font files and corpora, and increases the realism and diversity of the synthesized images through random backgrounds, text colors, fonts, geometric deformation, perspective changes, blurring, brightness and contrast adjustment, image compression, and the like.
In the embodiments of the present application, sufficient training data can be obtained through the above three methods: real collection, data augmentation, and font synthesis.
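As an illustrative sketch of several of the augmentation operations named above (brightness/contrast adjustment, blurring, image compression); all parameter ranges are assumptions, and geometric deformation is omitted for brevity:

```python
import random
import cv2

def augment(image):
    # Random brightness/contrast adjustment.
    alpha = random.uniform(0.7, 1.3)   # contrast factor (assumed range)
    beta = random.uniform(-30, 30)     # brightness offset (assumed range)
    out = cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
    # Random blurring.
    if random.random() < 0.5:
        k = random.choice([3, 5])
        out = cv2.GaussianBlur(out, (k, k), 0)
    # Random JPEG compression to simulate image-compression artifacts.
    q = random.randint(40, 95)
    ok, buf = cv2.imencode(".jpg", out, [cv2.IMWRITE_JPEG_QUALITY, q])
    return cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)
```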
Step S2: Data preprocessing.
In this embodiment of the present application, before the collected data is fed into model training, the data needs uniform processing, specifically: size scaling, width sorting, and dictionary construction.
Size scaling: The model design requires the input text images to have a fixed height of 32 with unfixed width. Therefore, the data needs to be uniformly and proportionally scaled to a height of 32.
Width sorting: Text images vary in length, and during training multiple text images are usually fed in as batches, which requires the text images in one batch to have the same width and height. When the widths of the text images in one batch differ greatly, forcing the widths to be equal distorts the characters in some text images, causing large information loss and making a good training result difficult to achieve. Therefore, the text images in the training set can be sorted by aspect ratio, several text images with adjacent aspect ratios can be taken as one batch, and all text images in the batch can be uniformly scaled to the size of the smallest-width text image in the batch.
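A minimal sketch of this aspect-ratio bucketing (not from the disclosure), assuming OpenCV images of fixed height 32 and an illustrative batch size:

```python
import cv2

def make_batches(samples, batch_size=32):
    """samples: list of (image, label); images share height 32."""
    # Sort by aspect ratio so that one batch holds similar widths.
    samples = sorted(samples, key=lambda s: s[0].shape[1] / s[0].shape[0])
    batches = []
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        # Scale everything in the batch to the smallest width in the batch.
        min_w = min(img.shape[1] for img, _ in batch)
        batch = [(cv2.resize(img, (min_w, img.shape[0])), lbl)
                 for img, lbl in batch]
        batches.append(batch)
    return batches
```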
Step S3: Model building.
In this embodiment of the present application, as shown in FIG. 2, the classic CRNN network structure consists of a CNN based on 3*3 convolutions and a recurrent neural network (Recurrent Neural Network, RNN) based on LSTM. After the electronic device inputs a text image of height 32 into the model, image feature information is first extracted by a CNN. For example, one 3*3 convolution (3*3 Conv), a pooling layer (pool), one 3*3 convolution, a pooling layer, two 3*3 convolutions, a pooling layer, two 3*3 convolutions, and a pooling layer are applied in sequence to extract image feature information, while the feature dimension size is gradually increased from 64 to 512. Next, sequence features are generated through a map-to-sequence structure (Map-to-Sequence). Then, two LSTMs extract the sequence features in the image feature information and convert the sequence features into a sequence prediction matrix for output.
It should be noted that the CNN mainly consists of 3*3-kernel convolutions and pooling layers with gradually increasing feature dimension sizes, and is used to extract image feature information; the RNN consists of two LSTM layers and is used to extract sequence features and convert them into a sequence prediction matrix. However, the computation of this CRNN network structure is too large, and neither its performance nor its model size meets the requirements on the electronic device side; in addition, the LSTM is not conducive to deployment on the electronic device side.
In this embodiment of the present application, to give the model good performance and effect on electronic devices with limited computing power, as shown in FIG. 3, we substantially reduce the feature dimension sizes; remove the LSTM, which is difficult to deploy on the electronic device side, and instead use fully connected layers (Fully Connected layers, FC) to convert the sequence features into the sequence prediction matrix; furthermore, only a CNN network, rather than a CNN+RNN network, is used to extract image feature information, and the CNN network also abandons the original all-3*3-kernel convolution scheme, replacing some 3*3-kernel convolutions with group convolutions and 1*1 convolutions having smaller parameter counts, while deeper network layers improve the model's feature learning capability.
For example, to reduce the parameter count while preserving good feature learning capability, we reduce the feature dimension sizes so that they gradually increase from 32 to 192. Then, a 3*3 convolution, a pooling layer, a 3*3 convolution, a 1*1 convolution (1*1 Conv), and a pooling layer are first applied in sequence to extract low-level image feature information from the input text image, where the added 1*1 convolution raises the feature dimension size, avoiding an excessive parameter count in the preceding 3*3 convolution. Next, a 1*1 convolution, a group convolution (3*3 group Conv), a 1*1 convolution, a group convolution, a 1*1 convolution, a group convolution, a 1*1 convolution, and a pooling layer are applied in sequence to extract mid-level image feature information from the low-level image feature information, where the first 1*1 convolution adds non-linear activation to the output of the previous pooling layer to improve the network's expressive capacity. Then, the same sequence of 1*1 convolution, group convolution, 1*1 convolution, group convolution, 1*1 convolution, group convolution, 1*1 convolution, and pooling layer is applied once more to extract high-level image feature information from the mid-level image feature information. Finally, a 1*1 convolution adds non-linear activation to the high-level image feature information, a 2*2 convolution converts the height dimension size to 1, the height dimension is then removed, and the feature dimension and width dimension are swapped to meet the input requirements of the next layer, converting the four-dimensional high-level image feature information into a three-dimensional feature sequence. The feature sequence then passes through a fully connected layer with a small parameter count that reduces the feature dimension size, so as to reduce the parameter count of the next layer, and then through another fully connected layer that converts the reduced-dimension sequence features into the sequence prediction matrix. The resulting sequence prediction matrix is the output of the whole model.
It should be noted that, compared with the structure of two 3*3 convolutions in the traditional CRNN, the combination of group convolution and 1*1 convolution alternating three times reduces the parameter count while deepening the network, improving the model's representational capacity.
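Putting the pieces together, a compact PyTorch sketch of a network in this spirit follows. It is illustrative only, not the patented architecture itself: the endpoint sizes 32 and 192 follow the text, but the intermediate channel sizes, the ReLU activations, the pooling shapes that stop halving the width, and the reduced fully connected width of 96 are all assumed choices:

```python
import torch
import torch.nn as nn

def conv(cin, cout, k, groups=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2 if k == 3 else 0,
                  groups=groups),
        nn.ReLU(inplace=True))

class GroupedCNNCTC(nn.Module):
    def __init__(self, num_classes, groups=4):
        super().__init__()
        self.features = nn.Sequential(
            # Low-level features: 3*3 conv, pool, 3*3 conv, 1*1 conv, pool.
            conv(1, 32, 3), nn.MaxPool2d(2),          # grayscale input
            conv(32, 64, 3), conv(64, 96, 1), nn.MaxPool2d(2),
            # Mid-level: (1*1 conv, 3*3 group conv) * 3, 1*1 conv, pool.
            conv(96, 96, 1), conv(96, 96, 3, groups),
            conv(96, 96, 1), conv(96, 96, 3, groups),
            conv(96, 96, 1), conv(96, 96, 3, groups),
            conv(96, 128, 1), nn.MaxPool2d((2, 1)),   # keep width
            # High-level: the same alternating pattern once more.
            conv(128, 128, 1), conv(128, 128, 3, groups),
            conv(128, 128, 1), conv(128, 128, 3, groups),
            conv(128, 128, 1), conv(128, 128, 3, groups),
            conv(128, 192, 1), nn.MaxPool2d((2, 1)),
            # 1*1 conv, then a 2*2 conv that collapses the height to 1.
            conv(192, 192, 1), conv(192, 192, 2))
        self.reduce = nn.Linear(192, 96)        # shrink the feature dim
        self.head = nn.Linear(96, num_classes)  # charset size + 1 blank

    def forward(self, x):                   # x: (B, 1, 32, W)
        f = self.features(x)                # (B, 192, 1, W')
        f = f.squeeze(2).permute(0, 2, 1)   # swap feature and width dims
        return self.head(self.reduce(f))    # (B, W', num_classes)
```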
Step S4: Model training and quantization.
In this embodiment of the present application, model training: the training text images are divided into multiple batches, each consisting of a fixed number of text images, and the batches are fed into the model in random order. After a batch of text images is fed into the model, the model built in step S3 computes layer by layer to obtain a text sequence prediction matrix, and the normalized exponential function (softmax) converts the values in the text sequence prediction matrix into a text sequence prediction probability matrix with values in the range 0-1. Then, based on the text sequence prediction probability matrix, a greedy algorithm takes the result corresponding to the maximum probability value as the prediction result for each sequence position, and the predicted text sequence is obtained through index mapping against the above character set dictionary. The classic loss function (CTC loss) computes the loss value between the predicted text sequence and the label text sequence corresponding to the text image, and based on the loss value the stochastic optimizer Adam (Adaptive Moment Estimation) back-propagates through the model and updates the model parameters. The initial learning rate of the optimizer is set to 0.0005 and is then gradually reduced with a cosine learning-rate decay. The above operations are then repeated with the next batch of text images to update the model parameters again; after multiple rounds of parameter updates, the loss value drops to a suitable range and stabilizes, completing the training of the model.
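A condensed sketch of one such training loop in PyTorch; the learning rate of 0.0005 and the cosine decay follow the text, while the blank index, the epoch count, and the batch format (padded label tensors with lengths) are assumptions:

```python
import torch
import torch.nn as nn

def train(model, batches, epochs=10, device="cpu"):
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    opt = torch.optim.Adam(model.parameters(), lr=0.0005)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    model.to(device).train()
    for _ in range(epochs):
        for images, targets, target_lengths in batches:
            # Model output (B, W', C) -> (W', B, C) log-probabilities,
            # which is the layout nn.CTCLoss expects.
            logits = model(images.to(device))
            log_probs = logits.permute(1, 0, 2).log_softmax(-1)
            input_lengths = torch.full((logits.size(0),), logits.size(1),
                                       dtype=torch.long)
            loss = ctc(log_probs, targets, input_lengths, target_lengths)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()  # cosine decay of the learning rate per epoch
```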
Model quantization: To speed up model inference while maintaining good accuracy, the parameters are stored and inference is run in half-precision floating point (FP16), yielding the above grouped convolutional neural network model.
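A minimal sketch of this FP16 conversion, reusing `model` and `images` from the training sketch above (the `.half()` call is this sketch's assumption; dedicated on-device inference engines may handle the conversion differently):

```python
import torch

# Store the parameters in half precision and run inference in FP16.
model = model.half().eval()
with torch.no_grad():
    logits = model(images.half())
```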
The execution subject of the text recognition method provided by the embodiments of the present application may be a text recognition apparatus. In the embodiments of the present application, the text recognition apparatus provided by the embodiments of the present application is described by taking a text recognition apparatus executing the text recognition method as an example.
An embodiment of the present application provides a text recognition apparatus. As shown in FIG. 4, the text recognition apparatus 400 includes: an acquisition module 401, a prediction module 402, and a processing module 403, where the acquisition module 401 is configured to acquire a text image, the text image including at least one character; the prediction module 402 is configured to input the text image acquired by the acquisition module 401 into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and the processing module 403 is configured to obtain, based on the text sequence prediction information obtained by the prediction module 402, a text recognition result corresponding to the text image.
Optionally, in this embodiment of the present application, the grouped convolutional neural network model includes: a first standard convolution layer, a group convolution layer, a second standard convolution layer, and a fully connected layer; the prediction module 402 is specifically configured to: after inputting the text image acquired by the acquisition module 401 into the grouped convolutional neural network model, extract first image feature information of the text image using the first standard convolution layer; divide the first image feature information into M groups of image feature information using the group convolution layer, extract the key image feature information in each group of image feature information using the M convolution kernels in the group convolution layer respectively, and fuse the obtained M groups of key image feature information to obtain first key image feature information, where each convolution kernel in the group convolution layer is used to process one group of image feature information, and M is an integer greater than 1; extract text sequence features of the first key image feature information using the second standard convolution layer; and obtain, using the fully connected layer, the text sequence prediction information corresponding to the text sequence features.
Optionally, in this embodiment of the present application, the first standard convolution layer, the group convolution layer, the second standard convolution layer, and the fully connected layer are connected in sequence; the first standard convolution layer includes a target standard convolution unit used to reduce the parameter count of the grouped convolutional neural network model, and the first standard convolution layer includes one convolution kernel; the group convolution layer includes a target group convolution unit used to reduce the computation of the grouped convolutional neural network model, the group convolution layer includes M convolution kernels, and the second standard convolution layer includes one convolution kernel.
Optionally, in this embodiment of the present application, the text recognition apparatus 400 further includes: a cropping module, where the cropping module is configured to, after the acquisition module 401 acquires the text image, crop the text image into N sub-text-images, each sub-text-image containing at least one character, N being an integer greater than 1; and the prediction module 402 is specifically configured to input the N sub-text-images obtained by the cropping module into the grouped convolutional neural network model for prediction to obtain the text sequence prediction information corresponding to each of the N sub-text-images.
Optionally, in this embodiment of the present application, the processing module 403 is specifically configured to: calculate target prediction probability information based on the text sequence prediction information obtained by the prediction module 402, the target prediction probability information being used to characterize the probability of each character index at each sequence position in the text sequence corresponding to the text sequence prediction information, each character index corresponding to one character in the character library; determine, based on the target prediction probability information, the character prediction result at each sequence position; and determine, based on the character prediction result at each sequence position, the text recognition result corresponding to the text image.
In the text recognition apparatus provided by the embodiments of the present application, the text recognition apparatus may acquire a text image including at least one character; input the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, a text recognition result corresponding to the text image. In this way, since the grouped convolutional neural network model has relatively few parameters, and since it can divide the input data into multiple groups and process those groups simultaneously, the computation of the grouped convolutional neural network model can be reduced while the recognition accuracy is preserved, thereby improving the recognition effect of the text recognition apparatus.
The text recognition apparatus in the embodiments of the present application may be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. For example, the electronic device may be a mobile phone, a tablet computer, a laptop computer, a palmtop computer, an in-vehicle electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a robot, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook, or a personal digital assistant (personal digital assistant, PDA), or may be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a television (television, TV), a teller machine, or a self-service machine, which is not specifically limited in the embodiments of the present application.
The text recognition apparatus in the embodiments of the present application may be an apparatus with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
The text recognition apparatus provided by the embodiments of the present application can implement each process implemented by the method embodiment of FIG. 1; to avoid repetition, details are not repeated here.
Optionally, as shown in FIG. 5, an embodiment of the present application further provides an electronic device 600, including a processor 601 and a memory 602, where the memory 602 stores a program or instructions executable on the processor 601, and the program or instructions, when executed by the processor 601, implement each step of the above text recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices described above.
FIG. 6 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes, but is not limited to, components such as a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.
Those skilled in the art will understand that the electronic device 100 may further include a power supply (such as a battery) that supplies power to each component; the power supply may be logically connected to the processor 110 through a power management system, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management system. The electronic device structure shown in FIG. 6 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or use a different component arrangement, which will not be repeated here.
The processor 110 is configured to: acquire a text image, the text image including at least one character; input the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, a text recognition result corresponding to the text image.
Optionally, in this embodiment of the present application, the grouped convolutional neural network model includes: a first standard convolution layer, a group convolution layer, a second standard convolution layer, and a fully connected layer; the processor 110 is specifically configured to: after inputting the text image into the grouped convolutional neural network model, extract first image feature information of the text image using the first standard convolution layer; divide the first image feature information into M groups of image feature information using the group convolution layer, extract the key image feature information in each group of image feature information using the M convolution kernels in the group convolution layer respectively, and fuse the obtained M groups of key image feature information to obtain first key image feature information, where each convolution kernel in the group convolution layer is used to process one group of image feature information, and M is an integer greater than 1; extract text sequence features of the first key image feature information using the second standard convolution layer; and obtain, using the fully connected layer, the text sequence prediction information corresponding to the text sequence features.
Optionally, in this embodiment of the present application, the first standard convolution layer, the group convolution layer, the second standard convolution layer, and the fully connected layer are connected in sequence; the first standard convolution layer includes a target standard convolution unit used to reduce the parameter count of the grouped convolutional neural network model, and the first standard convolution layer includes one convolution kernel; the group convolution layer includes a target group convolution unit used to reduce the computation of the grouped convolutional neural network model, the group convolution layer includes M convolution kernels, and the second standard convolution layer includes one convolution kernel.
Optionally, in this embodiment of the present application, the processor 110 is further configured to crop the text image into N sub-text-images, each sub-text-image containing at least one character, N being an integer greater than 1; and the processor 110 is specifically configured to input the N sub-text-images into the grouped convolutional neural network model for prediction to obtain the text sequence prediction information corresponding to each of the N sub-text-images.
Optionally, in this embodiment of the present application, the processor 110 is specifically configured to: calculate target prediction probability information based on the text sequence prediction information, the target prediction probability information being used to characterize the probability of each character index at each sequence position in the text sequence corresponding to the text sequence prediction information, each character index corresponding to one character in the character library; determine, based on the target prediction probability information, the character prediction result at each sequence position; and determine, based on the character prediction result at each sequence position, the text recognition result corresponding to the text image.
In the electronic device provided by the embodiments of the present application, the electronic device may acquire a text image including at least one character; input the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and obtain, based on the text sequence prediction information, a text recognition result corresponding to the text image. In this way, since the grouped convolutional neural network model has relatively few parameters, and since it can divide the input data into multiple groups and process those groups simultaneously, the computation of the grouped convolutional neural network model can be reduced while the recognition accuracy is preserved, thereby improving the recognition effect of the electronic device.
It should be understood that, in this embodiment of the present application, the input unit 104 may include a graphics processing unit (Graphics Processing Unit, GPU) 1041 and a microphone 1042; the graphics processor 1041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071 is also called a touch screen. The touch panel 1071 may include two parts: a touch detection device and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which will not be repeated here.
The memory 109 may be used to store software programs and various data. The memory 109 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, where the first storage area may store an operating system, and application programs or instructions required by at least one function (such as a sound playback function and an image playback function). In addition, the memory 109 may include volatile memory or non-volatile memory, or the memory 109 may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), a static random access memory (Static RAM, SRAM), a dynamic random access memory (Dynamic RAM, DRAM), a synchronous dynamic random access memory (Synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), an enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), a synch link dynamic random access memory (Synch link DRAM, SLDRAM), or a direct rambus random access memory (Direct Rambus RAM, DRRAM). The memory 109 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
The processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor and a modem processor, where the application processor mainly handles operations involving the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication signals, such as a baseband processor. It can be understood that the modem processor may also not be integrated into the processor 110.
An embodiment of the present application further provides a readable storage medium storing a program or instructions, where the program or instructions, when executed by a processor, implement each process of the above text recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk, or an optical disc.
An embodiment of the present application further provides a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, where the processor is configured to run a program or instructions to implement each process of the above text recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be called a system-on-chip, a system chip, a chip system, or an on-chip system chip.
An embodiment of the present application provides a computer program product, where the program product is stored in a storage medium and is executed by at least one processor to implement each process of the above text recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article, or apparatus. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or apparatus that includes the element. In addition, it should be pointed out that the scope of the methods and apparatuses in the implementations of the present application is not limited to performing functions in the order shown or discussed, and may also include performing functions in a substantially simultaneous manner or in the reverse order according to the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a computer software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc), including several instructions to cause a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific implementations, which are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (15)

  1. A text recognition method, wherein the method comprises:
    acquiring a text image, the text image comprising at least one character;
    inputting the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and
    obtaining, based on the text sequence prediction information, a text recognition result corresponding to the text image.
  2. The method according to claim 1, wherein the grouped convolutional neural network model comprises: a first standard convolution layer, a group convolution layer, a second standard convolution layer, and a fully connected layer;
    the inputting the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image comprises:
    after inputting the text image into the grouped convolutional neural network model, extracting first image feature information of the text image using the first standard convolution layer;
    dividing the first image feature information into M groups of image feature information using the group convolution layer, extracting the key image feature information in each group of image feature information using the M convolution kernels in the group convolution layer respectively, and fusing the obtained M groups of key image feature information to obtain first key image feature information, wherein each convolution kernel in the group convolution layer is used to process one group of image feature information, and M is an integer greater than 1;
    extracting text sequence features of the first key image feature information using the second standard convolution layer; and
    obtaining, using the fully connected layer, the text sequence prediction information corresponding to the text sequence features.
  3. The method according to claim 2, wherein
    the first standard convolution layer, the group convolution layer, the second standard convolution layer, and the fully connected layer are connected in sequence;
    the first standard convolution layer comprises a target standard convolution unit, the target standard convolution unit is used to reduce a parameter count of the grouped convolutional neural network model, and the first standard convolution layer comprises one convolution kernel;
    the group convolution layer comprises a target group convolution unit, the target group convolution unit is used to reduce a computation amount of the grouped convolutional neural network model, and the group convolution layer comprises M convolution kernels; and
    the second standard convolution layer comprises one convolution kernel.
  4. The method according to claim 1, wherein after the acquiring a text image, the method further comprises:
    cropping the text image into N sub-text-images, each sub-text-image containing at least one character, N being an integer greater than 1; and
    the inputting the text image into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image comprises:
    inputting the N sub-text-images into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to each of the N sub-text-images.
  5. The method according to claim 1, wherein the obtaining, based on the text sequence prediction information, a text recognition result corresponding to the text image comprises:
    calculating target prediction probability information based on the text sequence prediction information, the target prediction probability information being used to characterize a probability of each character index at each sequence position in a text sequence corresponding to the text sequence prediction information, each character index corresponding to one character in a character library;
    determining, based on the target prediction probability information, a character prediction result at each sequence position; and
    determining, based on the character prediction result at each sequence position, the text recognition result corresponding to the text image.
  6. A text recognition apparatus, wherein the apparatus comprises: an acquisition module, a prediction module, and a processing module, wherein:
    the acquisition module is configured to acquire a text image, the text image comprising at least one character;
    the prediction module is configured to input the text image acquired by the acquisition module into a grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to the text image; and
    the processing module is configured to obtain, based on the text sequence prediction information obtained by the prediction module, a text recognition result corresponding to the text image.
  7. The apparatus according to claim 6, wherein the grouped convolutional neural network model comprises: a first standard convolution layer, a group convolution layer, a second standard convolution layer, and a fully connected layer;
    the prediction module is specifically configured to:
    after inputting the text image acquired by the acquisition module into the grouped convolutional neural network model, extract first image feature information of the text image using the first standard convolution layer;
    divide the first image feature information into M groups of image feature information using the group convolution layer, extract the key image feature information in each group of image feature information using the M convolution kernels in the group convolution layer respectively, and fuse the obtained M groups of key image feature information to obtain first key image feature information, wherein each convolution kernel in the group convolution layer is used to process one group of image feature information, and M is an integer greater than 1;
    extract text sequence features of the first key image feature information using the second standard convolution layer; and
    obtain, using the fully connected layer, the text sequence prediction information corresponding to the text sequence features.
  8. The apparatus according to claim 7, wherein
    the first standard convolution layer, the group convolution layer, the second standard convolution layer, and the fully connected layer are connected in sequence;
    the first standard convolution layer comprises a target standard convolution unit, the target standard convolution unit is used to reduce a parameter count of the grouped convolutional neural network model, and the first standard convolution layer comprises one convolution kernel;
    the group convolution layer comprises a target group convolution unit, the target group convolution unit is used to reduce a computation amount of the grouped convolutional neural network model, and the group convolution layer comprises M convolution kernels; and
    the second standard convolution layer comprises one convolution kernel.
  9. The apparatus according to claim 6, wherein the apparatus further comprises: a cropping module, wherein:
    the cropping module is configured to, after the acquisition module acquires the text image, crop the text image into N sub-text-images, each sub-text-image containing at least one character, N being an integer greater than 1; and
    the prediction module is specifically configured to input the N sub-text-images obtained by the cropping module into the grouped convolutional neural network model for prediction to obtain text sequence prediction information corresponding to each of the N sub-text-images.
  10. The apparatus according to claim 6, wherein
    the processing module is specifically configured to:
    calculate target prediction probability information based on the text sequence prediction information obtained by the prediction module, the target prediction probability information being used to characterize a probability of each character index at each sequence position in a text sequence corresponding to the text sequence prediction information, each character index corresponding to one character in a character library;
    determine, based on the target prediction probability information, a character prediction result at each sequence position; and
    determine, based on the character prediction result at each sequence position, the text recognition result corresponding to the text image.
  11. An electronic device, comprising a processor and a memory, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the text recognition method according to any one of claims 1 to 5.
  12. A readable storage medium, wherein the readable storage medium stores a program or instructions, and the program or instructions, when executed by a processor, implement the steps of the text recognition method according to any one of claims 1 to 5.
  13. A chip, wherein the chip comprises a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement the steps of the text recognition method according to any one of claims 1 to 5.
  14. A computer program product, wherein the program product is stored in a non-transitory storage medium, and the program product is executed by at least one processor to implement the steps of the text recognition method according to any one of claims 1 to 5.
  15. An electronic device, wherein the electronic device is configured to execute the steps of the text recognition method according to any one of claims 1 to 5.
PCT/CN2023/126280 2022-10-26 2023-10-24 文字识别方法、装置、电子设备及介质 WO2024088269A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211320472.6 2022-10-26
CN202211320472.6A CN115601752A (zh) 2022-10-26 2022-10-26 文字识别方法、装置、电子设备及介质

Publications (1)

Publication Number Publication Date
WO2024088269A1 true WO2024088269A1 (zh) 2024-05-02

Family

ID=84850315

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/126280 WO2024088269A1 (zh) 2022-10-26 2023-10-24 文字识别方法、装置、电子设备及介质

Country Status (2)

Country Link
CN (1) CN115601752A (zh)
WO (1) WO2024088269A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601752A (zh) * 2022-10-26 2023-01-13 维沃移动通信有限公司(Cn) 文字识别方法、装置、电子设备及介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210042474A1 (en) * 2019-03-29 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
CN110008961A (zh) * 2019-04-01 2019-07-12 深圳市华付信息技术有限公司 文字实时识别方法、装置、计算机设备及存储介质
CN110309836A (zh) * 2019-07-01 2019-10-08 北京地平线机器人技术研发有限公司 图像特征提取方法、装置、存储介质和设备
CN110522440A (zh) * 2019-08-12 2019-12-03 广州视源电子科技股份有限公司 基于分组卷积神经网络的心电信号识别装置
CN111666931A (zh) * 2020-05-21 2020-09-15 平安科技(深圳)有限公司 基于混合卷积文字图像识别方法、装置、设备及存储介质
CN113239949A (zh) * 2021-03-15 2021-08-10 杭州电子科技大学 一种基于1d分组卷积神经网络的数据重构方法
CN115601752A (zh) * 2022-10-26 2023-01-13 维沃移动通信有限公司(Cn) 文字识别方法、装置、电子设备及介质

Also Published As

Publication number Publication date
CN115601752A (zh) 2023-01-13

Similar Documents

Publication Publication Date Title
WO2020221013A1 (zh) 一种图像处理方法、装置、电子设备以及存储介质
WO2021008320A1 (zh) 手语识别方法、装置、计算机可读存储介质和计算机设备
CN107358262B (zh) 一种高分辨率图像的分类方法及分类装置
US20200293786A1 (en) Video identification method, video identification device, and storage medium
WO2021022521A1 (zh) 数据处理的方法、训练神经网络模型的方法及设备
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN111859912A (zh) 基于pcnn模型的带有实体感知的远程监督关系抽取方法
WO2024088269A1 (zh) 文字识别方法、装置、电子设备及介质
CN111488732B (zh) 一种变形关键词检测方法、系统及相关设备
CN113205047B (zh) 药名识别方法、装置、计算机设备和存储介质
JP2022160662A (ja) 文字認識方法、装置、機器、記憶媒体、スマート辞書ペン及びコンピュータプログラム
CN115878805A (zh) 情感分析方法、装置、电子设备及存储介质
CN115294055A (zh) 图像处理方法、装置、电子设备和可读存储介质
JP2023543964A (ja) 画像処理方法、画像処理装置、電子機器、記憶媒体およびコンピュータプログラム
WO2024012289A1 (zh) 视频生成方法、装置、电子设备及介质
CN114758054A (zh) 光斑添加方法、装置、设备及存储介质
CN113313066A (zh) 图像识别方法、装置、存储介质以及终端
WO2024041108A1 (zh) 图像矫正模型训练及图像矫正方法、装置和计算机设备
CN113095072A (zh) 文本处理方法及装置
CN116167014A (zh) 一种基于视觉和语音的多模态关联型情感识别方法及系统
WO2023173552A1 (zh) 目标检测模型的建立方法、应用方法、设备、装置及介质
CN115909408A (zh) 一种基于Transformer网络的行人重识别方法及装置
CN106469437B (zh) 图像处理方法和图像处理装置
CN111160265B (zh) 文件转换方法、装置、存储介质及电子设备
CN111967579A (zh) 使用卷积神经网络对图像进行卷积计算的方法和装置