CN116740730A - Multi-class text sequence recognition method and device and electronic equipment - Google Patents


Info

Publication number
CN116740730A
Authority
CN
China
Prior art keywords
feature
layer
feature mapping
text sequence
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310233711.2A
Other languages
Chinese (zh)
Inventor
黄威
刘正珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hanwang Technology Co Ltd
Original Assignee
Hanwang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd filed Critical Hanwang Technology Co Ltd
Priority to CN202310233711.2A
Publication of CN116740730A
Legal status: Pending

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00: Computing arrangements based on biological models
                    • G06N3/02: Neural networks
                        • G06N3/08: Learning methods
            • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V10/00: Arrangements for image or video recognition or understanding
                    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V10/82: Arrangements for image or video recognition or understanding using neural networks
                • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
                    • G06V30/10: Character recognition
                        • G06V30/18: Extraction of features or characteristics of the image
                        • G06V30/19: Recognition using electronic means
                            • G06V30/191: Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
                                • G06V30/19173: Classification techniques

Abstract

The application discloses a multi-class text sequence recognition method and device, belonging to the technical field of character recognition. The method is performed by a text sequence recognition model comprising at least one feature extraction layer, a feature mapping layer and a transcription layer, where the feature mapping layer comprises M convolution layers arranged in parallel. The method comprises: extracting features from the input image through the feature extraction layer to obtain N feature maps; distributing the N feature maps to the M convolution layers for feature mapping to obtain grouping feature mapping results; merging the grouping feature mapping results to obtain the feature mapping results of the N feature maps corresponding to a plurality of preset character categories; and transcribing the feature mapping results to obtain a text sequence recognition result. By improving the feature mapping layer into a plurality of convolution layers, mapping the feature maps output by the feature extraction layer in groups, and then merging the grouped results, the method reduces the parameter count of the text sequence recognition model and improves training and inference speed.

Description

Multi-class text sequence recognition method and device and electronic equipment
Technical Field
The present application relates to the field of text recognition technology, and in particular, to a method and apparatus for recognizing multi-class text sequences, an electronic device, and a computer readable storage medium.
Background
In the prior art, text sequence recognition models are typically built on the CRNN (Convolutional Recurrent Neural Network) architecture. CRNN is mainly used to recognize text sequences of variable length end to end, converting text recognition into a temporally dependent sequence learning problem without segmenting individual characters. During recognition, the text sequence recognition model performs feature extraction and mapping on an input image and outputs a recognition result mapped into the label space, thereby obtaining the text sequence recognition result for the input image.
In the prior art, a model for recognizing text sequences with CRNN-like techniques generally comprises: a convolution layer, a full connection layer and a transcription layer. The convolution layer extracts features from the image input to the text sequence recognition model, obtaining the features extracted by each neuron; the full connection layer performs classification mapping on the features extracted by the convolution layer to obtain a feature mapping result; the transcription layer transcribes the feature mapping result output by the full connection layer and outputs the recognition result in the corresponding label space. Because the full connection layer processes the features extracted by every neuron of the convolution layer to obtain the classification mapping result, it requires a large number of network parameters, so the text sequence recognition model has at least the following defects: a large amount of computation at run time, and slow training and inference.
It can be seen that there is still a need for improvements in the art for multi-class text sequence recognition methods.
Disclosure of Invention
The embodiments of the present application provide a multi-class text sequence recognition method and device and electronic equipment, which are used to overcome the defects of heavy computation and slow training and inference caused by the large number of parameters in the full connection layer of a text sequence recognition model.
In a first aspect, an embodiment of the present application provides a method for identifying a multi-category text sequence, including:
the pre-trained text sequence recognition model includes: at least one feature extraction layer and a feature mapping layer, wherein the feature mapping layer comprises M convolution layers which are arranged in parallel, and the method comprises the following steps:
performing feature extraction on the input image through the at least one feature extraction layer to obtain N feature maps;
distributing the N feature maps to the M convolution layers for feature mapping to obtain a grouping feature mapping result output by each convolution layer, wherein M and N are positive integers, and M is smaller than or equal to N;
combining the obtained grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to a plurality of preset character categories;
and transcribing the feature mapping result to obtain a text sequence recognition result of the input image.
In a second aspect, an embodiment of the present application provides a multi-category text sequence recognition apparatus, including:
the pre-trained text sequence recognition model includes: at least one feature extraction layer and a feature mapping layer, wherein the feature mapping layer comprises M convolution layers arranged in parallel, and the apparatus includes:
the feature extraction module, configured to perform feature extraction on the input image through the at least one feature extraction layer to obtain N feature maps;
the grouping feature mapping result acquisition module, configured to distribute the N feature maps to the M convolution layers for feature mapping to obtain a grouping feature mapping result output by each convolution layer, wherein M and N are positive integers, and M is smaller than or equal to N;
the merging module, configured to merge the obtained grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to a plurality of preset character categories;
and the transcription recognition module, configured to transcribe the feature mapping results to obtain a text sequence recognition result of the input image.
In a third aspect, the embodiment of the application further discloses an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the multi-category text sequence recognition method according to the embodiment of the application when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for recognition of a multi-category text sequence disclosed in the embodiments of the present application.
The multi-category text sequence recognition method disclosed in the embodiments of the present application is executed by a text sequence recognition model that includes at least one feature extraction layer and a feature mapping layer, where the feature mapping layer comprises M convolution layers arranged in parallel. The method performs feature extraction on an input image through the feature extraction layer to obtain N feature maps; distributes the N feature maps to the M convolution layers for feature mapping to obtain a grouping feature mapping result output by each convolution layer, where M and N are positive integers and M is smaller than or equal to N; merges the obtained grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to a plurality of preset character categories; and finally transcribes the feature mapping results to obtain the text sequence recognition result of the input image. By improving the feature mapping layer into a plurality of convolution layers, mapping the feature maps output by the feature extraction layer in groups, and then merging the grouping feature mapping results, this scheme effectively reduces the number of parameters of the text sequence recognition model while producing the same output as the full connection layer in the prior art, thereby improving model training and inference speed.
The foregoing is only an overview of the technical solutions of the present application. To enable a clearer understanding of the technical means of the present application so that it can be implemented in accordance with the description, and to make the above and other objects, features and advantages of the present application more readily apparent, specific embodiments of the present application are set forth below.
Drawings
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
FIG. 1 is a flow chart of a method for identifying multi-category text sequences according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a text recognition model in a multi-class text sequence recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a prior art text recognition model;
FIG. 4 is a schematic diagram of a multi-class text sequence recognition device according to an embodiment of the present application;
fig. 5 schematically shows a block diagram of an electronic device for performing the method according to the application; and
fig. 6 schematically shows a memory unit for holding or carrying program code for implementing the method according to the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiments of the application disclose a multi-category text sequence recognition method which, as shown in Fig. 1, comprises steps 110 to 140.
The multi-category text sequence recognition method disclosed in the embodiments of the present application is implemented through a text sequence recognition model with a specific structure, and this model needs to be trained in advance before text recognition is performed.
As shown in fig. 2, the pre-trained text sequence recognition model includes: at least one feature extraction layer 210, a feature mapping layer 220, a merging layer 230, and a transcription layer 240, wherein the feature mapping layer 220 includes M convolution layers 2201 arranged in parallel. Wherein M is an integer greater than 1.
The value of M is determined according to experimental results.
The specific technical scheme of the multi-category text sequence recognition method disclosed in the embodiments of the present application is further described below in conjunction with the structure of the text sequence recognition model.
Step 110: perform feature extraction on the input image through the at least one feature extraction layer to obtain N feature maps.
Optionally, the input image may be a text line image containing a text sequence.
The at least one feature extraction layer 210 is connected sequentially: the first feature extraction layer 210 performs feature extraction on the input image to obtain feature maps of a first size, and the second feature extraction layer 210 then performs feature extraction on those feature maps to obtain feature maps of a second size. After each feature extraction layer 210 has processed the feature maps output by the previous feature extraction layer 210, the last feature extraction layer 210 outputs N feature maps of a specified size.
In some embodiments of the application, the feature extraction layer comprises: a convolution layer, a Batch Normalization layer and an activation function. The convolution layer performs a convolution operation, based on a preset convolution kernel and stride, on the input image fed into the feature extraction layer or on the feature maps output by the previous feature extraction layer, obtaining feature maps of a specified size; the batch normalization layer normalizes the feature maps obtained by the convolution layer to stabilize their feature distribution; and the activation function performs a nonlinear mapping on the normalized feature maps to obtain the hidden-layer output of the current feature extraction layer 210.
Taking a text sequence recognition model including two feature extraction layers 210 as an example, for an input image of size C×H×W, where C denotes the number of channels, H the height and W the width of the input image, the first feature extraction layer 210 performs feature extraction on the input image to obtain 64 feature maps of reduced size, and the second feature extraction layer 210 performs feature extraction on those feature maps to obtain 512 feature maps (for example, of height 1 and width W/4).
Optionally, when the network structure of the feature extraction layer 210 changes, for example when the convolution kernel size or the stride of the convolution layer changes, the size and number of feature maps output by the feature extraction layer 210 change correspondingly; the embodiments of the present application do not limit the number or size of the feature maps output by the feature extraction layer 210. That is, the embodiments of the present application do not limit structural parameters such as the convolution kernel size and stride of the convolution layer in the feature extraction layer 210.
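How kernel size, stride and padding determine the output size follows standard convolution arithmetic; as a minimal illustration (not part of the patent's disclosure, with made-up example sizes), the output size per spatial dimension can be computed as:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    # Standard convolution arithmetic: floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# A 3x3 kernel with stride 2 and padding 1 roughly halves each spatial dimension,
# so two such layers reduce an input of width W to about W/4.
w = 32
for _ in range(2):
    w = conv_output_size(w, kernel=3, stride=2, padding=1)
print(w)  # 8
```

This makes concrete why changing the kernel or stride changes the number and size of the output feature maps.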
In some embodiments of the application, the activation function may be implemented using a GELU (Gaussian Error Linear Unit) function.
In prior-art text recognition models, a ReLU (Rectified Linear Unit) activation function is generally applied after the batch normalization layer in the feature extraction layer to map the feature maps output by the batch normalization layer into the output space of the hidden-layer neurons. The expression of the ReLU activation function is: ReLU(x) = max(x, 0); it maps inputs smaller than 0 to 0 and leaves inputs larger than 0 unchanged. Since the derivative of the ReLU activation function is 0 when x <= 0, those values do not contribute to network parameter updates during training, which affects the performance of the neural network to some extent. For example, the inventors found through experiments that when processing text images, small differences between the features of characters with similar glyphs, such as the numeral "1" and the lowercase letter "l", are easily ignored, resulting in recognition errors.
The expression of the GELU activation function is: GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. The GELU activation function maps values in combination with their context, and can take the small differences between the features of similar-looking characters into account so as to better learn language regularities.
Because the convolution layer analyzes characters at the pixel level, for characters with similar glyphs, combining the pixel features of the character's context with the language regularities learned through the GELU function can improve the accuracy of text sequence recognition. For example, the character "1" in the text sequence "0123" is then more easily recognized as the numeral "1" than as the lowercase letter "l".
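To make the contrast between the two activations concrete, a small numeric sketch (pure Python, not from the patent) compares them; note how GELU lets small negative inputs pass through attenuated instead of zeroing them, so they still carry gradient during training:

```python
import math

def relu(x):
    # ReLU(x) = max(x, 0): everything below zero is discarded
    return max(x, 0.0)

def gelu(x):
    # GELU(x) = x * Phi(x), where Phi is the standard Gaussian CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.1, 0.1, 2.0):
    print(f"x = {x:+.1f}   ReLU = {relu(x):+.4f}   GELU = {gelu(x):+.4f}")
```

A feature value of -0.1 is mapped to exactly 0 by ReLU but to a small negative number by GELU, which is the kind of small difference between similar glyphs that the text attributes recognition errors to.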
Step 120: distribute the N feature maps to the M convolution layers for feature mapping, obtaining a grouping feature mapping result output by each convolution layer.
Wherein M and N are positive integers, M is less than or equal to N.
Next, the N feature maps output by the last feature extraction layer 210 are distributed to the M convolution layers 2201, so that each convolution layer 2201 performs feature mapping on only part of the feature maps output by the feature extraction layer 210, which improves the speed of feature mapping in the text sequence recognition model.
In some embodiments of the present application, distributing the N feature maps to the M convolution layers for feature mapping to obtain the grouping feature mapping result output by each convolution layer includes: dividing the N feature maps into M groups according to a preset correspondence between the feature maps output by the feature extraction layer and the convolution layers, obtaining one group of feature maps for each convolution layer; and, for each group of feature maps, performing feature mapping through the corresponding convolution layer to obtain a grouping feature mapping result mapped to specified character categories, where the specified character categories are a subset of the preset plurality of character categories assigned to the corresponding convolution layer.
The preset plurality of character categories are the label categories of the text sequence recognition model. Taking a text sequence recognition model that needs to recognize a character set containing 10000 characters as an example, the number of preset character categories equals 10000. That is, the preset plurality of character categories are the character categories that the text sequence recognition model needs to recognize.
Optionally, at the model design stage, a preset correspondence between the feature maps output by the feature extraction layer 210 and the M convolution layers 2201 may be configured, so that each convolution layer 2201 processes part of the feature maps output by the feature extraction layer 210 according to this preset correspondence.
Optionally, the preset correspondence between the feature maps output by the feature extraction layer and the convolution layers assigns each feature map output by the feature extraction layer as the input of exactly one convolution layer. For example, when the feature extraction layer 210 outputs N feature maps, the N feature maps may be divided into M groups, each group containing different feature maps, and each group of feature maps is set to correspond to one specified convolution layer 2201, so that the N feature maps are distributed to the M convolution layers 2201 for feature mapping. Here, N is an integer greater than 1, and the value of N is determined by the network structure and parameters of the feature extraction layer 210; for example, N may take the value 512 or another integer value.
In the embodiments of the present application, the specific number of feature maps included in each group is not limited.
Optionally, at the model design stage, the classification mapping space of the feature mapping result output by each convolution layer 2201 may be preset, configuring each convolution layer 2201 to perform feature mapping only within its specified character category space. For example, the preset plurality of character categories may be divided into M groups, with each group of character categories serving as the classification mapping space of one convolution layer 2201.
Then, for each convolution layer 2201, feature mapping is performed on the feature maps assigned to that convolution layer 2201, obtaining the grouping feature mapping result it outputs. In this way, the M convolution layers produce M grouping feature mapping results in total, where each grouping feature mapping result covers part of the character categories in the recognition class space of the text sequence recognition model, and together the M grouping feature mapping results cover the entire class space of the text sequence recognition model.
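The grouping-and-mapping scheme just described can be sketched with 1×1 convolutions written as per-group matrix products. This is a hypothetical NumPy illustration with made-up sizes, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, D, W = 512, 4, 1000, 20        # feature maps, groups, character categories, sequence width
assert N % M == 0 and D % M == 0

feats = rng.standard_normal((N, W))  # N feature maps of height 1 and width W

# One independent 1x1-conv weight matrix per convolution layer 2201 (no sharing):
# each maps its N/M input channels to D/M of the character categories.
weights = [rng.standard_normal((D // M, N // M)) for _ in range(M)]

# Divide the N feature maps into M groups and map each group independently.
groups = np.split(feats, M, axis=0)                 # M blocks of shape (N/M, W)
partials = [w @ g for w, g in zip(weights, groups)] # each (D/M, W)

# Merge the grouping feature mapping results along the category axis.
logits = np.concatenate(partials, axis=0)           # (D, W): scores over all D categories
print(logits.shape)  # (1000, 20)
```

Each group's output covers a disjoint slice of the D categories, so the concatenation covers the model's whole class space, exactly as the text describes.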
The embodiments of the present application improve the feature mapping layer, which is equivalent to the full connection layer, into a network structure consisting of a plurality of parallel convolution layers, which greatly reduces the number of model parameters, reduces the amount of feature-mapping computation during training and inference, and increases the running speed of the model. Taking the replacement of the full connection layer with M convolution layers as an example, if the dimensions of the feature maps output by the feature extraction layer 210 are (512, 1, W/4), where 512 is the number of channels, 1 the height and W/4 the width of the feature maps, then after the feature maps are divided evenly by channel into M groups (assuming here that 512 is divisible by M), the input dimensions of each convolution layer can be expressed as (512/M, 1, W/4) and the output dimensions as (D/M, 1, W/4), where D is the number of categories the text sequence recognition model can recognize, M is the number of convolution layers, and W is the width of the input image. Based on these structural parameters, the number of parameters of the M convolution layers is (512/M) × (D/M) × M, i.e. 512 × D / M.
In contrast, in the prior art, as shown in Fig. 3, a text sequence recognition model generally includes: at least one feature extraction layer 310, a full connection layer 320 and a transcription layer 330, where every neuron of the full connection layer 320 is connected to all neurons of the previous layer, which increases the complexity of correlation modeling and the number of parameters. Taking feature maps of dimensions (512, 1, W/4) input to the full connection layer 320 and output features of dimensions (D, 1, W/4) as an example, the number of parameters of the full connection layer 320 is 512 × D. The parameter count is positively correlated with D: the more categories the text sequence recognition model can recognize, the more parameters the full connection layer has. A large number of parameters means heavy computation and a loss of model speed. Moreover, if D is sufficiently large, the correlations between categories gradually weaken, which is unfavorable for the generalization of the model.
For the same number of categories D, compared with the full connection layer in the prior art, the network structure of the feature mapping layer disclosed in the embodiments of the present application reduces the parameter count by (512 × D − 512 × D / M). And the larger M is, the more the number of feature mapping layer parameters decreases. Adopting the network structure of the feature mapping layer disclosed in the embodiments of the present application reduces the network parameters of the feature mapping layer, increases the running speed of the model, facilitates each convolution layer's modeling of the correlations among its categories, and benefits the generalization of the model.
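The parameter accounting above can be checked with a few lines of arithmetic. This sketch uses the example sizes from the text (512 channels; the category count is an assumed value on the scale of a large character set):

```python
def full_connection_params(channels, categories):
    # Prior-art full connection layer: every category sees every channel.
    return channels * categories

def grouped_conv_params(channels, categories, m):
    # M parallel 1x1 convolution layers, each mapping channels/m -> categories/m.
    return (channels // m) * (categories // m) * m  # = channels * categories / m

channels, categories = 512, 20000
for m in (2, 4, 8, 16):
    saved = full_connection_params(channels, categories) - grouped_conv_params(channels, categories, m)
    print(f"M = {m:2d}: {grouped_conv_params(channels, categories, m):>9,} params ({saved:,} fewer)")
```

The saving of (512 × D − 512 × D / M) parameters grows with M, matching the observation that a larger M shrinks the feature mapping layer further.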
In an embodiment of the present application, the plurality of convolution layers 2201 included in the feature mapping layer 220 are independent convolution layers, and parameters are not shared between the convolution layers 2201.
In some embodiments of the present application, when the N feature maps are distributed to the M convolution layers for feature mapping, M groups of feature maps may be generated from the N feature maps, where the number of feature maps in each group is smaller than N, the union of the M groups equals the N feature maps, and a given feature map may be assigned to more than one group. Each group of feature maps is then fed to its convolution layer for feature mapping.
Step 130: merge the obtained grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to a plurality of preset character categories.
Distributing the N feature maps to the M convolution layers for feature mapping yields M grouping feature mapping results, each corresponding to a preconfigured group of character categories, and together the M grouping feature mapping results cover the entire space of character categories the text sequence recognition model can recognize.
Next, the merging layer 230 merges the obtained M grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to the preset number of character categories.
As described above, in some embodiments of the present application, merging the obtained grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to the preset character categories includes: merging the grouping feature mapping results according to the character categories each corresponds to, obtaining the feature mapping results of the N feature maps over the preset plurality of character categories.
Taking grouping feature mapping results of dimensions (D/M, 1, W/4) output by each convolution layer 2201 as an example, where D is the total number of character categories the text sequence recognition model can recognize, M is the number of convolution layers 2201, W is the width of the input image, and D/M is the number of mapping categories per grouping feature mapping result, after the merging layer 230 concatenates the M grouping feature mapping results along the category dimension, a feature mapping result of dimensions (D, 1, W/4) is obtained. The merged feature mapping result covers all D mapping categories. In the embodiments of the present application, this merged result serves as the feature mapping result of the N feature maps corresponding to the preset plurality of character categories.
Step 140: transcribe the feature mapping results to obtain a text sequence recognition result of the input image.
Then, the merged feature mapping results corresponding to the preset plurality of character categories are input to the transcription layer 240, and the transcription layer 240 transcribes the feature mapping results to obtain the text sequence recognition result of the input image.
The specific way the transcription layer transcribes the feature mapping results to obtain the text sequence recognition result of the input image follows the prior art and is not repeated in the embodiments of the present application.
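The patent treats transcription as prior art; in CRNN-style models it is commonly CTC best-path decoding. As a hedged sketch of one such prior-art transcription (an assumption, not the patent's specified method):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC transcription: collapse consecutive repeats, then drop blanks."""
    decoded, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# frame_labels would be the per-frame argmax over the merged (D, 1, W/4)
# feature mapping result; a repeated character must be separated by a
# blank (label 0) in the frame sequence to survive collapsing.
print(ctc_greedy_decode([0, 7, 7, 0, 7, 3, 3, 0]))  # [7, 7, 3]
```

This is why the feature mapping result keeps a score for every category at every width position: the transcription layer reads one class decision per time step and collapses them into the final text sequence.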
During training of the text sequence recognition model, steps 110 to 140 are executed for each training sample to obtain the prediction result corresponding to that sample. A model error is then computed from the prediction result and the sample label (i.e., the ground-truth text sequence corresponding to the training sample), and the model parameters are optimized by gradient updates to train the model iteratively.
The training method of the text sequence recognition model follows the prior art and is not repeated in the embodiments of the present application.
For massive training sets, reducing the model parameters by adopting the feature mapping layer of the above structure effectively speeds up model convergence and thus model training. On the other hand, because grouped feature mapping is adopted, the association between categories and features is stronger, which helps improve the generalization of the model.
The multi-category text sequence recognition method disclosed in the embodiment of the application is executed by a text sequence recognition model comprising at least one feature extraction layer, a feature mapping layer and a transcription layer, the feature mapping layer comprising M convolution layers arranged in parallel. Feature extraction is performed on the input image through the feature extraction layer to obtain N feature maps; the N feature maps are distributed to the M convolution layers for feature mapping to obtain the grouping feature mapping result output by each convolution layer, where M and N are positive integers and M is less than or equal to N; the obtained grouping feature mapping results are merged to obtain the feature mapping results of the N feature maps corresponding to a plurality of preset character categories; finally, the feature mapping result is transcribed to obtain the text sequence recognition result of the input image. By replacing the feature mapping layer with a plurality of convolution layers, mapping the feature maps output by the feature extraction layer in groups, and then merging the feature mapping results, this scheme effectively reduces the number of parameters of the text sequence recognition model while producing the same output as the prior-art fully-connected layer, thereby accelerating model training and inference.
In a multi-category text sequence scenario, for example one supporting the GB18030 character set, the text sequence recognition model must perform classification mapping over nearly twenty thousand categories. Using the feature mapping layer disclosed in the embodiment of the application for feature mapping greatly reduces the model parameters compared with the prior-art fully-connected network structure, improving inference and training speed. On the other hand, grouping feature mapping shrinks the character-category space covered by any single feature mapping, so the association between categories and features is stronger, which helps improve the generalization of the model.
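The claimed parameter savings can be checked with back-of-envelope arithmetic. The channel count C and layer count M below are illustrative assumptions (the patent does not fix them); D approximates the nearly-twenty-thousand GB18030 class count mentioned above.

```python
# Back-of-envelope parameter comparison; C and M are assumed values,
# and D approximates the GB18030 character-category count.
C = 512    # feature channels entering the mapping layer (assumption)
D = 20000  # roughly the number of character categories (per the text)
M = 8      # parallel convolution layers in the mapping layer (assumption)

# A fully-connected mapping (equivalently, one 1x1 convolution over all
# channels) needs a weight per (input channel, category) pair.
fc_params = C * D

# With grouping, each of the M layers sees only C/M channels and emits
# only D/M categories, so the total shrinks by a factor of M.
grouped_params = M * (C // M) * (D // M)

print(fc_params, grouped_params, fc_params // grouped_params)
# prints: 10240000 1280000 8  (biases omitted for brevity)
```

The algebra is general: M · (C/M) · (D/M) = C·D/M, so the reduction factor equals the number of parallel convolution layers, consistent with the speed-up the embodiment describes.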
The embodiment of the application also discloses a multi-category text sequence recognition device, in which a pre-trained text sequence recognition model comprises at least one feature extraction layer and a feature mapping layer, the feature mapping layer comprising M convolution layers arranged in parallel. As shown in fig. 4, the device comprises:
a feature extraction module 410, configured to perform feature extraction on the input image through the at least one feature extraction layer to obtain N feature maps;
a grouping feature mapping result obtaining module 420, configured to distribute the N feature maps to the M convolution layers for feature mapping to obtain the grouping feature mapping result output by each convolution layer, where M and N are positive integers and M is less than or equal to N;
a merging module 430, configured to merge the obtained grouping feature mapping results to obtain the feature mapping results of the N feature maps corresponding to a plurality of preset character categories;
and a transcription recognition module 440, configured to transcribe the feature mapping result to obtain the text sequence recognition result of the input image.
Optionally, the grouping feature mapping result obtaining module 420 is further configured to:
divide the N feature maps into M groups according to a preset correspondence between the feature maps output by the feature extraction layer and the convolution layers, to obtain the group of feature maps corresponding to each convolution layer;
for each group of feature maps, perform feature mapping on the group through the corresponding convolution layer to obtain a grouping feature mapping result mapped to specified character categories, where the specified character categories are: part of the plurality of preset character categories, preset for the corresponding convolution layer.
Optionally, the preset correspondence between the feature maps output by the feature extraction layer and the convolution layers includes:
a correspondence in which each feature map output by the feature extraction layer serves as the input of one convolution layer.
Optionally, the merging module 430 is further configured to:
merge the grouping feature mapping results according to the character categories corresponding to each grouping feature mapping result, to obtain the feature mapping results of the N feature maps corresponding to the plurality of preset character categories.
Optionally, the feature extraction layer comprises: a convolution layer, a batch normalization layer and an activation function, wherein the activation function is implemented as a GELU function.
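The GELU activation named above has a standard closed form, x·Φ(x), with Φ the standard normal CDF; the patent only names the function, so the formula below is the usual definition rather than anything specific to the embodiment.

```python
import math

def gelu(x):
    """Gaussian Error Linear Unit: x * Phi(x), where Phi is the standard
    normal CDF. Standard definition; the patent names GELU without
    giving a formula."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

print(round(gelu(0.0), 4))   # 0.0
print(round(gelu(3.0), 4))   # ~2.996: large positive inputs pass nearly unchanged
print(round(gelu(-3.0), 4))  # ~-0.004: large negative inputs are gated toward zero
```

Unlike ReLU, GELU is smooth and non-monotonic near zero, which is why it is often preferred as the activation inside modern recognition backbones.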
The multi-category text sequence recognition device disclosed in the embodiment of the application is used to implement the multi-category text sequence recognition method disclosed in the embodiment of the application. The specific implementation of each module of the device is not repeated; reference may be made to the specific implementation of the corresponding steps in the method embodiment.
The multi-category text sequence recognition device disclosed in the embodiment of the application operates through a text sequence recognition model comprising at least one feature extraction layer, a feature mapping layer and a transcription layer, the feature mapping layer comprising M convolution layers arranged in parallel. The device performs feature extraction on the input image through the feature extraction layer to obtain N feature maps; distributes the N feature maps to the M convolution layers for feature mapping to obtain the grouping feature mapping result output by each convolution layer, where M and N are positive integers and M is less than or equal to N; merges the obtained grouping feature mapping results to obtain the feature mapping results of the N feature maps corresponding to a plurality of preset character categories; and finally transcribes the feature mapping result to obtain the text sequence recognition result of the input image. By replacing the feature mapping layer with a plurality of convolution layers, mapping the feature maps output by the feature extraction layer in groups, and then merging the feature mapping results, this scheme effectively reduces the number of parameters of the text sequence recognition model while producing the same output as the prior-art fully-connected layer, thereby accelerating model training and inference.
In a multi-category text sequence scenario, for example one supporting the GB18030 character set, the text sequence recognition model must perform classification mapping over nearly twenty thousand categories. Using the feature mapping layer disclosed in the embodiment of the application for feature mapping greatly reduces the model parameters compared with the prior-art fully-connected network structure, improving inference and training speed. On the other hand, grouping feature mapping shrinks the character-category space covered by any single feature mapping, so the association between categories and features is stronger, which helps improve the generalization of the model.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts reference may be made between them. Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
The multi-category text sequence recognition method and device provided by the present application have been described above in detail. Specific examples are used herein to illustrate the principles and embodiments of the present application, and the above description of the examples is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application. In view of the above, the contents of this description should not be construed as limiting the present application.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present application without undue burden.
Various component embodiments of the application may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in an electronic device according to embodiments of the present application may in practice be implemented using a microprocessor or a digital signal processor (DSP). The present application can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may take the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
For example, fig. 5 shows an electronic device in which the method according to the application may be implemented. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, etc. The electronic device conventionally comprises a processor 510, a memory 520 and program code 530 stored on the memory 520 and executable on the processor 510; the processor 510 implements the method described in the above embodiments when the program code 530 is executed. The memory 520 may be a computer program product or a computer readable medium, such as an electronic memory like a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 520 has a storage space 5201 for the program code 530 of a computer program for performing any of the method steps described above. For example, the storage space 5201 may include individual computer programs each implementing one of the steps of the above method. The program code 530 is computer readable code. These computer programs may be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform the method according to the above-described embodiments.
The embodiment of the application also discloses a computer readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the multi-category text sequence recognition method according to the embodiment of the application.
Such a computer program product may be a computer readable storage medium, which may have memory segments and memory spaces arranged similarly to the memory 520 in the electronic device shown in fig. 5. The program code may be stored in the computer readable storage medium in a suitable form. The computer readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 6. In general, the storage unit comprises computer readable code 530', i.e., code readable by a processor, which, when executed by the processor, implements the steps of the method described above.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, it is noted that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of multi-category text sequence recognition, wherein a pre-trained text sequence recognition model comprises: at least one feature extraction layer and a feature mapping layer, the feature mapping layer comprising M convolution layers arranged in parallel, and the method comprises:
performing feature extraction on an input image through the at least one feature extraction layer to obtain N feature maps;
distributing the N feature maps to the M convolution layers for feature mapping to obtain a grouping feature mapping result output by each convolution layer, wherein M and N are positive integers, and M is less than or equal to N;
merging the obtained grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to a plurality of preset character categories; and
transcribing the feature mapping result to obtain a text sequence recognition result of the input image.
2. The method of claim 1, wherein distributing the N feature maps to the M convolution layers for feature mapping to obtain the grouping feature mapping result output by each convolution layer comprises:
dividing the N feature maps into M groups according to a preset correspondence between the feature maps output by the feature extraction layer and the convolution layers, to obtain a group of feature maps corresponding to each convolution layer;
for each group of feature maps, performing feature mapping on the group of feature maps through the corresponding convolution layer to obtain a grouping feature mapping result mapped to specified character categories, wherein the specified character categories are: part of the plurality of preset character categories, preset for the corresponding convolution layer.
3. The method according to claim 2, wherein the preset correspondence between the feature maps output by the feature extraction layer and the convolution layers comprises:
a correspondence in which each feature map output by the feature extraction layer serves as the input of one convolution layer.
4. The method of claim 1, wherein each grouping feature mapping result corresponds to part of the plurality of preset character categories, and merging the obtained grouping feature mapping results to obtain the feature mapping results of the N feature maps corresponding to the plurality of preset character categories comprises:
merging the grouping feature mapping results according to the character categories corresponding to each grouping feature mapping result, to obtain the feature mapping results of the N feature maps corresponding to the plurality of preset character categories.
5. The method of claim 1, wherein the feature extraction layer comprises: a convolution layer, a batch normalization layer and an activation function, the activation function being implemented as a GELU function.
6. A multi-category text sequence recognition device, wherein a pre-trained text sequence recognition model comprises: at least one feature extraction layer and a feature mapping layer, the feature mapping layer comprising M convolution layers arranged in parallel, and the device comprises:
a feature extraction module, configured to perform feature extraction on an input image through the at least one feature extraction layer to obtain N feature maps;
a grouping feature mapping result acquisition module, configured to distribute the N feature maps to the M convolution layers for feature mapping to obtain a grouping feature mapping result output by each convolution layer, wherein M and N are positive integers, and M is less than or equal to N;
a merging module, configured to merge the obtained grouping feature mapping results to obtain feature mapping results of the N feature maps corresponding to a plurality of preset character categories; and
a transcription recognition module, configured to transcribe the feature mapping result to obtain a text sequence recognition result of the input image.
7. The apparatus of claim 6, wherein the grouping feature mapping result acquisition module is further configured to:
divide the N feature maps into M groups according to a preset correspondence between the feature maps output by the feature extraction layer and the convolution layers, to obtain a group of feature maps corresponding to each convolution layer; and
for each group of feature maps, perform feature mapping on the group of feature maps through the corresponding convolution layer to obtain a grouping feature mapping result mapped to specified character categories, wherein the specified character categories are: part of the plurality of preset character categories, preset for the corresponding convolution layer.
8. The apparatus of claim 6, wherein each grouping feature mapping result corresponds to part of the plurality of preset character categories, and the merging module is further configured to:
merge the grouping feature mapping results according to the character categories corresponding to each grouping feature mapping result, to obtain the feature mapping results of the N feature maps corresponding to the plurality of preset character categories.
9. An electronic device comprising a memory, a processor and program code stored on the memory and executable on the processor, wherein the processor implements the multi-category text sequence recognition method of any one of claims 1 to 5 when executing the program code.
10. A computer readable storage medium having stored thereon program code, which when executed by a processor realizes the steps of the multi-category text sequence recognition method of any of claims 1 to 5.
CN202310233711.2A 2023-03-03 2023-03-03 Multi-class text sequence recognition method and device and electronic equipment Pending CN116740730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310233711.2A CN116740730A (en) 2023-03-03 2023-03-03 Multi-class text sequence recognition method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN116740730A true CN116740730A (en) 2023-09-12

Family

ID=87908652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310233711.2A Pending CN116740730A (en) 2023-03-03 2023-03-03 Multi-class text sequence recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116740730A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination