CN110766050A - Model generation method, text recognition method, device, equipment and storage medium - Google Patents

Model generation method, text recognition method, device, equipment and storage medium

Info

Publication number
CN110766050A
CN110766050A (application CN201910888046.4A)
Authority
CN
China
Prior art keywords
feature map
text recognition
picture
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910888046.4A
Other languages
Chinese (zh)
Other versions
CN110766050B (en)
Inventor
李健
高大帅
张连毅
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Original Assignee
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP filed Critical BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority to CN201910888046.4A priority Critical patent/CN110766050B/en
Publication of CN110766050A publication Critical patent/CN110766050A/en
Application granted granted Critical
Publication of CN110766050B publication Critical patent/CN110766050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The embodiment of the invention provides a model generation method, a text recognition method, an apparatus, a device and a storage medium, and relates to the field of text recognition. The generation method comprises the following steps: generating sample data containing characters in a simulation mode, wherein the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of the characters, data of alternate appearance of spaces and characters, and image background interference data; and performing initial model training by adopting the sample data to generate a text recognition model. According to the model generation method provided by the embodiment of the invention, the accuracy of the text recognition model can be enhanced by generating diversified sample data through simulation; an improved Mobilenet-V2 network is used as the backbone network of the text recognition model and is combined with the LSTM and CTC architectures, so that the size of the model is reduced and the model becomes more lightweight, making it suitable for deployment on a mobile terminal and achieving real-time photographing and recognition.

Description

Model generation method, text recognition method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of text recognition, and in particular to a model generation method, a text recognition method, an apparatus, an electronic device and a storage medium.
Background
Optical Character Recognition (OCR) technology refers to a process in which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. For printed characters, the text in a paper document is converted optically into an image file with a black-and-white dot matrix, and the characters in the image are converted into a text format through recognition software for further editing and processing by word processing software.
The existing OCR recognition process generally extracts picture features from the picture to be recognized through a 3-5 layer CNN (convolutional neural network), and then performs LSTM (Long Short-Term Memory) and CTC (Connectionist Temporal Classification) operations on the extracted feature map. An LSTM network contains an input gate (Input Gate), an output gate (Output Gate) and a forgetting gate (Forget Gate). The forgetting gate controls the degree to which historical state information influences the current state; although the influence of long-term memory on the current state does not vanish or explode along the gradient, it is still attenuated by the cumulative multiplication of many values in [0, 1]. Therefore, unless these values all equal 1, long-term memory cannot be guaranteed to have a 100% influence on the current state. The calculation formulas of the LSTM hidden layer are as follows:
Input gate: i_t = sigm(W1·x_t + W2·h_{t-1})
Input value: i_t' = tanh(W3·x_t + W4·h_{t-1})
Forget gate: f_t = sigm(W5·x_t + W6·h_{t-1})
Output gate: o_t = sigm(W7·x_t + W8·h_{t-1})
State gate: m_t = m_{t-1} ⊙ f_t + i_t ⊙ i_t'
Hidden layer node output value: h_t = m_t ⊙ o_t
where ⊙ denotes element-wise multiplication.
Fig. 1 shows a schematic diagram of the LSTM hidden layer corresponding to these hidden-layer formulas, where σ_i (i = 1, 2, 3) denotes the sigm function in the formulas, each associated with its own weight parameters W.
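As an illustration, the following is a minimal NumPy sketch of one time step of the hidden-layer update defined by these formulas. It follows the notation above (weight matrices W1-W8, no bias terms, and h_t = m_t ⊙ o_t without an extra tanh); the shapes and the dict-based weight container are illustrative assumptions only.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, m_prev, W):
    """One time step of the LSTM hidden layer described above.

    x_t    : input vector at time t
    h_prev : hidden output h_{t-1}
    m_prev : state-gate (cell) value m_{t-1}
    W      : dict holding the weight matrices "W1" .. "W8"
    """
    i_t  = sigm(W["W1"] @ x_t + W["W2"] @ h_prev)       # input gate
    i_tp = np.tanh(W["W3"] @ x_t + W["W4"] @ h_prev)    # input value
    f_t  = sigm(W["W5"] @ x_t + W["W6"] @ h_prev)       # forget gate
    o_t  = sigm(W["W7"] @ x_t + W["W8"] @ h_prev)       # output gate
    m_t  = m_prev * f_t + i_t * i_tp                    # state gate update (element-wise)
    h_t  = m_t * o_t                                    # hidden layer node output value
    return h_t, m_t
```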
The OCR recognition technology in the prior art uses a large training model that is difficult to deploy on a mobile terminal. The existing OCR technology also has high requirements on the pictures, while pictures to be recognized are often complex; for example, when tilted text lines or a tilted shooting angle cause the characters in a picture to be arranged diagonally or bent in horizontal waves, the existing OCR technology often recognizes the characters in the picture poorly.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed in order to provide a model generation method, a text recognition method, an apparatus, an electronic device, and a storage medium that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a model generation method, including:
generating sample data containing characters in a simulation mode, wherein the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of the characters, alternate space and character occurrence data and image background interference data;
and performing initial model training by adopting the sample data to generate a text recognition model.
Preferably, the initial model comprises a backbone network formed by improving a Mobilenet-V2 network, an LSTM network, and a CTC module; the step of performing initial model training by using the sample data to generate a text recognition model comprises the following steps:
extracting the characteristics of the sample data by adopting the backbone network to obtain a picture characteristic diagram;
performing dimension conversion and reshape on the picture feature map, and inputting the picture feature map into an LSTM network for identification to obtain an initial identification result;
calculating a loss function value between the initial identification result and the characters of the corresponding sample data by using the CTC module;
and correcting the initial model parameters according to the loss function values to obtain the text recognition model.
Preferably, the step of extracting features of the sample data by using the backbone network to obtain a picture feature map includes:
performing feature extraction on the sample data by using a Mobilenet-V2 network to obtain an initial feature map;
carrying out deconvolution on the last layer of feature map of the initial feature map by 1x1 to obtain a first initial feature map with the same size as the feature map of the second last layer;
merging and splicing the first initial feature map and the penultimate layer feature map to obtain a first layer fusion feature map;
carrying out 1x1 deconvolution on the first-layer fusion feature map to obtain a second-layer fusion feature map with the same size as the last-but-third-layer feature map;
and after 1x1 convolution is respectively carried out on the last layer of feature map, the first layer of fusion feature map and the second layer of fusion feature map, splicing is carried out, and the picture feature map is obtained.
Preferably, the picture feature map is represented by N × H × W × C, and the step of performing dimension conversion and reshape on the picture feature map and inputting the picture feature map into the LSTM network for identification includes:
converting the representation form of the picture feature map into N × W × H × C;
reshaping the converted picture feature map into N × W × (H·C); where W is the time step input to the LSTM network and H·C is the feature dimension of each time step.
The embodiment of the invention also discloses a text recognition method, which comprises the following steps:
acquiring a picture to be identified;
recognizing the picture to be recognized by adopting a preset text recognition model to obtain a corresponding text recognition result; the text recognition model is generated by one or more model generation methods according to the embodiments of the present invention.
Preferably, after the step of recognizing the picture to be recognized by using a preset text recognition model to obtain a corresponding text recognition result, the method further includes:
and outputting the text recognition result.
The embodiment of the invention also discloses a model generation device, which comprises:
the system comprises a sample generation module, a data acquisition module and a data processing module, wherein the sample generation module is used for generating sample data containing characters in a simulation mode, and the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of the characters, alternate appearance data of blank spaces and the characters and image background interference data;
and the model training module is used for performing initial model training by adopting the sample data to generate a text recognition model.
The embodiment of the invention also discloses a text recognition device, which comprises:
and the picture acquisition module is used for acquiring the picture to be identified.
The text recognition module is used for recognizing the picture to be recognized by adopting a preset text recognition model so as to obtain a corresponding text recognition result; the text recognition model is generated by the method in the embodiment of the model generation method.
The embodiment of the invention also discloses an electronic device, which comprises:
one or more processors; and
one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the electronic device to perform one or more of the steps of the model generation method as described in embodiments of the invention or the steps of the text recognition method as described in embodiments of the invention.
Embodiments of the present invention also disclose a computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the processors to perform one or more of the steps of the model generation method according to embodiments of the present invention or cause the processors to perform one or more of the steps of the text recognition method according to embodiments of the present invention.
The embodiment of the invention has the following advantages:
in the model generation method of the embodiment of the invention, the accuracy of the text recognition model can be enhanced by generating diversified sample data through simulation; the improved Mobilenet-V2 network is used as a backbone network of a text recognition model, and the LSTM and CTC architectures are combined, so that the size of the model can be reduced, the model is lighter, and the model is suitable for being deployed at a mobile terminal, and the effect of real-time photographing and recognition is achieved.
Drawings
FIG. 1 is a schematic representation of an LSTM hidden layer in prior art OCR technology;
FIG. 2 is a flow chart of the steps of embodiment 1 of a model generation method of the present invention;
FIGS. 3a-3c are sample pictorial illustrations of a model generation method of the present invention;
FIG. 4 is a flow chart of the steps of embodiment 2 of a model generation method of the present invention;
FIG. 5 is a flow chart of the steps of a text recognition method of the present invention;
FIG. 6 is a block diagram of a model generation apparatus according to the present invention;
fig. 7 is a block diagram of a text recognition apparatus according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 2, fig. 2 is a flowchart of steps of a model generation method provided in embodiment 1 of the present invention, and as shown in fig. 2, the method may include:
step 101, generating sample data containing characters in a simulation mode, wherein the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of the characters, data of alternate appearance of spaces and the characters, and data of background interference of the characters.
In the embodiment of the invention, the sample data is generated through simulation software or a simulation program. The sample data comprises a sample picture and a txt file corresponding to the sample picture; the sample picture contains text information, the txt file records that text information, and text here means the written symbols of a language. Further, the sample pictures in the sample data are highly similar to the complex pictures that occur in complex scenes. A complex picture may be a picture in which the text information is blurred by background interference such as illumination or shooting angle, or in which the text in the picture is arranged non-horizontally. Specifically, the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of characters, data of alternate appearance of spaces and characters, and data of background interference of characters, distributed according to a preset proportion. Fig. 3a shows a picture with characters arranged diagonally, fig. 3b shows a picture with characters interfered by the background, and fig. 3c shows a picture with the text arranged in horizontal waves. In a specific example, the sample data of each category is simulated independently and distributed as follows: 40% standard printing font data, 15% non-standard printing font data, 15% data of diagonal arrangement and horizontal wave arrangement of characters, 15% data of alternate appearance of spaces and characters, and 15% data of background interference of characters. The generation of each category of sample data is completed by modifying the simulation program, and the generation process continuously varies the hyper-parameters, where hyper-parameters are the parameters specified during sample simulation, such as the blurring range of the picture and the inclination angle of the text line arrangement in the picture.
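By way of illustration, the following is a minimal Pillow-based sketch of how a diagonally arranged or horizontal-wave sample picture and its paired txt label file could be simulated. The font file, canvas size, character spacing, tilt range and blur range are illustrative assumptions and are not values taken from this patent.

```python
import math
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def simulate_sample(text, mode="diagonal", out_png="sample.png", out_txt="sample.txt"):
    # blank white canvas; size is an assumption
    img = Image.new("RGB", (640, 160), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("arial.ttf", 32)             # assumed font file on disk
    x = 10
    for i, ch in enumerate(text):
        # horizontal-wave arrangement: characters follow a sine baseline
        y = 60 + (20 * math.sin(i / 2.0) if mode == "wave" else 0)
        draw.text((x, y), ch, font=font, fill="black")
        x += 34 + random.randint(0, 8)                      # fixed advance plus random extra spaces
    if mode == "diagonal":
        # tilt the whole text line to simulate a diagonal arrangement / tilted shot
        img = img.rotate(random.uniform(-15, 15), expand=True, fillcolor="white")
    # mild Gaussian blur as a stand-in for background interference (a hyper-parameter)
    img = img.filter(ImageFilter.GaussianBlur(random.uniform(0.0, 1.5)))
    img.save(out_png)
    with open(out_txt, "w", encoding="utf-8") as f:         # txt file paired with the sample picture
        f.write(text)

simulate_sample("the quick brown fox jumps over the lazy dog", mode="wave")
```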
By contrast, the sample data used by the existing OCR recognition technology mostly consists of very clean pictures of printed characters: black characters on a white background in a clear black-and-white picture. A model trained on such sample data performs poorly when recognizing picture text in an actual scene and cannot reach the expected effect, because the simulated data differs greatly from pictures of the actual scene. Therefore, by diversifying the simulated samples, the embodiment of the invention makes the sample data closer to pictures in actual scenes, which improves the accuracy of the model when it is applied to scene recognition.
And 102, performing initial model training by adopting sample data to generate a text recognition model.
In the embodiment of the invention, in order to realize text recognition of complex pictures, an initial model is established first, and the initial model is trained by utilizing diversity sample data obtained by simulation, so that a text recognition model can be obtained.
In a specific implementation, the sample data may include training samples and test samples, where the training samples are used to train the model and the test samples are used to test the trained target model. In this example, a TensorFlow (a symbolic mathematical system based on data flow programming) framework is selected for model training, with 4.5 million training samples and 500,000 test samples. During the initial model training, the pictures of the training samples are randomly rotated within a certain range, and/or data enhancement operations such as contrast, brightness and Gaussian blurring are randomly applied to all training sample pictures. This improves the effectiveness of training the initial model and thus the accuracy of the text recognition model.
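A rough sketch of such random data enhancement in a TensorFlow pipeline is shown below; the parameter ranges, the use of tf.keras.layers.RandomRotation for the random rotation, and the mapping over a tf.data dataset are illustrative assumptions (Gaussian blurring would need an extra op, e.g. tensorflow_addons.image.gaussian_filter2d, and is only noted in a comment).

```python
import tensorflow as tf

# random rotation layer created once; factor=0.03 of a full turn is an assumed range
rotate = tf.keras.layers.RandomRotation(factor=0.03, fill_mode="constant")

def augment(image):
    """Randomly perturb one training picture (float32 in [0, 1], shape H x W x 3)."""
    image = tf.image.random_contrast(image, lower=0.7, upper=1.3)   # random contrast
    image = tf.image.random_brightness(image, max_delta=0.2)        # random brightness
    image = rotate(image[None, ...], training=True)[0]              # random rotation in a small range
    # Gaussian blurring could be added here, e.g. with tensorflow_addons.image.gaussian_filter2d
    return tf.clip_by_value(image, 0.0, 1.0)

# typical use inside an input pipeline:
# dataset = dataset.map(lambda img, label: (augment(img), label))
```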
In summary, the model generation method provided in the embodiment of the present invention generates diverse sample data through simulation, and can enhance the accuracy of the text recognition model, so that the text recognition model can be used to accurately recognize a complex picture.
Fig. 4 is a flowchart of steps of a model generation method provided in embodiment 2 of the present invention, and as shown in fig. 4, the method may include:
step 201, generating sample data containing characters in a simulation mode, wherein the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of the characters, data of alternate appearance of spaces and the characters, and image background interference data.
Step 201 is the same as step 101, and the relevant description refers to the relevant steps of the above embodiments, which are not repeated herein.
Step 202, performing feature extraction on the sample data by adopting the backbone network formed by improving the Mobilenet-V2 network in the initial model, to obtain a picture feature map.
The initial model comprises a backbone network formed by improving a Mobilenet-V2 network, an LSTM network, and a CTC module.
The Mobilenet network is designed for deep learning applications on mobile and embedded terminals; it has few model parameters, a small amount of network computation, and is convenient to deploy on a mobile terminal. The Mobilenet-V2 network is an improvement on the Mobilenet-V1 network, mainly adding residual connections. The network input picture size is 224 × 224 × 3. A first convolution operation outputs 32 channels, followed by seven successive bottleneck stages; each bottleneck stage has an expansion factor t, an output channel number c, a repetition count n and a stride s, where the stride s is applied on the first repetition of a stage and the remaining repetitions use stride 1, so different blocks differ in stride. The network structure of the Mobilenet-V2 network is as follows:
Input      Operator       t    c     n    s
224²×3     conv2d         -    32    1    2
112²×32    bottleneck     1    16    1    1
112²×16    bottleneck     6    24    2    2
56²×24     bottleneck     6    32    3    2
28²×32     bottleneck     6    64    4    2
28²×64     bottleneck     6    96    3    1
14²×96     bottleneck     6    160   3    2
7²×160     bottleneck     6    320   1    1
7²×320     conv2d 1×1     -    1280  1    1
7²×1280    avgpool 7×7    -    -     1    -
1×1×1280   conv2d 1×1     -    k     -    -
From the network structure table of the Mobilenet-V2 network, after the seven bottleneck stages the network outputs a feature map of size 7 × 7 × 320, and the final result is output after a 1 × 1 convolution, an avgpool (average pooling) layer and a final conv2d layer, where k is the number of categories to be predicted.
The embodiment of the invention extracts the picture features through the backbone network formed by improving the Mobilenet-V2 network, which makes the model lightweight and facilitates deploying the model on a mobile terminal.
Specifically, the backbone network formed by improving the Mobilenet-V2 network is adopted to perform feature extraction on the sample pictures in the sample data to obtain the corresponding picture feature map, which is realized by the following specific steps (a code sketch follows the steps):
extracting features from the picture to be recognized by using the Mobilenet-V2 network to obtain an initial feature map, where the last-layer feature map of the initial feature map has size N × H × W × C, the penultimate-layer feature map has size N × 2H × 2W × 2C, and the third-from-last-layer feature map has size N × 4H × 4W × 4C; here N is the batch size, H is the height of the picture feature map, W is its width, and C is its channel number;
performing a 1×1 deconvolution on the last-layer feature map of the initial feature map to obtain a first initial feature map with the same size as the penultimate-layer feature map;
merging and splicing the first initial feature map and the penultimate-layer feature map in the C dimension to obtain a first-layer fused feature map;
performing a 1×1 deconvolution on the first-layer fused feature map to obtain a second-layer fused feature map with the same size as the third-from-last-layer feature map;
and performing a 1×1 convolution on the last-layer feature map, the first-layer fused feature map and the second-layer fused feature map respectively, and then splicing them in the C dimension to obtain the picture feature map, whose size is N × H × W × C.
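The following Keras sketch illustrates one way such a fused backbone could be assembled. The intermediate layer names from tf.keras.applications.MobileNetV2, the input shape, the transposed-convolution strides, and the strided 1×1 convolutions used to bring all three branches back to a common H × W before the final splice are all assumptions; the patent only states the size relations between the feature maps, not the exact layers.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_backbone(input_shape=(64, 384, 3)):
    base = tf.keras.applications.MobileNetV2(input_shape=input_shape,
                                             include_top=False, weights=None)
    # two intermediate feature maps of the Mobilenet-V2 network (assumed layer names)
    c_last = base.get_layer("block_16_project_BN").output      # N x H  x W  x 320
    c_pen = base.get_layer("block_12_add").output               # N x 2H x 2W x 96
    # 1x1 "deconvolution" (transposed convolution, stride 2) up to the penultimate size
    p1 = layers.Conv2DTranspose(int(c_pen.shape[-1]), 1, strides=2, padding="same")(c_last)
    fused1 = layers.Concatenate(axis=-1)([p1, c_pen])           # first-layer fused feature map
    # second 1x1 deconvolution up to the third-from-last size
    fused2 = layers.Conv2DTranspose(int(fused1.shape[-1]), 1, strides=2, padding="same")(fused1)
    # 1x1 convolutions on each branch (strides assumed so all end up at H x W), then splice on C
    o1 = layers.Conv2D(128, 1)(c_last)
    o2 = layers.Conv2D(128, 1, strides=2)(fused1)
    o3 = layers.Conv2D(128, 1, strides=4)(fused2)
    picture_feature_map = layers.Concatenate(axis=-1)([o1, o2, o3])
    return tf.keras.Model(base.input, picture_feature_map)

backbone = build_backbone()
print(backbone.output_shape)   # (None, 2, 12, 384) for a 64 x 384 x 3 input
```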
Step 203, performing dimension conversion and reshape (a recombination operation) on the picture feature map, inputting the result into the LSTM network of the initial model for recognition, and obtaining an initial recognition result.
The LSTM network is a recurrent neural network that combines the useful information of the current input and of the history when producing its final output, and is widely used in fields such as character recognition and speech recognition. Throughout the prediction process the LSTM network maintains a state control unit called the cell state, which keeps being updated over the whole model run: the network input and the useless information at the current time point are conveyed to the cell state through the input gate and the forgetting gate, and after updating its own state information the cell state is passed on to the next time point. The forgetting gate controls, with a certain probability, how much of the hidden state information passed in from the previous time point is forgotten. The input gate updates the cell state of the previous time point with the current time point's data after it has been processed by the activation function, providing the input for the next time point. The output of the hidden state at the current moment is obtained by point-wise multiplication of the activation function output of the current input with the cell state.
After dimension conversion and reshape, the picture feature map is used as the input of the LSTM, and an initial recognition result is obtained by exploiting the sequence recognition strengths of the LSTM.
For example, for the sample picture shown in fig. 3b, picture features are extracted through the backbone network, that is, the part of the sample picture containing text information is extracted to obtain a picture feature map, which encodes "he quick brown fox jumps over the lay dog". After the picture feature map is input to the LSTM network, the LSTM network obtains the feature sequence "the quick brown fox jumps over the lay dog" from the picture feature map, performs probability prediction on the feature sequence, sends the predicted result in the form of feature codes to the corresponding decoding end, and decodes it to obtain the initial recognition result "he quick brown fox jumps over the lay dog".
And step 204, calculating a loss function value between the initial recognition result and the characters of the corresponding sample data by using the CTC module in the initial model.
The CTC module is a loss function commonly used in OCR recognition networks and can be used to solve the problem that input sequences and output sequences are difficult to align one to one. During model training there is a certain error between the predicted result and the actual target result, that is, between the initial recognition result and the characters of the actual sample data. The accuracy of the model being trained can be evaluated by using the CTC module to calculate the loss function value of the initial recognition result against the actual sample data; specifically, the smaller the output value of the loss function, the higher the accuracy of the model.
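A minimal sketch of computing such a CTC loss in TensorFlow is given below; the dense (padded) label encoding, the use of the last class as the CTC blank, and the batch-time-class layout of the LSTM outputs are assumptions.

```python
import tensorflow as tf

def ctc_loss_value(labels, logits, label_length, logit_length):
    """labels: [batch, max_label_len] int32 character indices, padded
       logits: [batch, time_steps, num_classes] raw outputs of the LSTM/Dense head"""
    per_example = tf.nn.ctc_loss(labels=labels,
                                 logits=logits,
                                 label_length=label_length,
                                 logit_length=logit_length,
                                 logits_time_major=False,
                                 blank_index=-1)        # last class reserved for the CTC blank
    return tf.reduce_mean(per_example)                  # average loss function value over the batch
```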
And step 205, correcting the initial model parameters according to the loss function values to obtain a text recognition model.
The initial model corrects its parameters according to the loss function value, and the target text recognition model is obtained by continuously correcting the parameters during training; this ensures the accuracy of the recognition results of the target text recognition model in practical application.
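As an illustration, a minimal sketch of one such parameter-correction step via gradient descent is given below; the model, the loss_fn (e.g. a wrapper around the CTC loss sketched above), the optimizer and the learning rate are assumptions.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)   # assumed optimizer and learning rate

def train_step(model, loss_fn, images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_fn(labels, predictions)                 # loss function value from the CTC module
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # correct the model parameters
    return loss
```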
In a specific example, the picture feature map is represented by N × H × W × C, and the step of performing dimension conversion and reshape on the picture feature map and inputting it into the LSTM network for recognition includes:
converting the representation form of the picture feature map into N × W × H × C;
reshaping the converted picture feature map into N × W × (H·C); where W is the time step input to the LSTM network and H·C is the feature dimension of each time step.
In a specific implementation, it is judged whether the width of the picture feature map lies in the second dimension; if it does not, dimension conversion is performed on the picture feature map so that its width lies in the second dimension, and reshape is then performed on the picture feature map whose width lies in the second dimension.
In this example, once the width of the picture feature map lies in the second dimension and reshape has been performed, the map can be input directly into the LSTM network for processing, the input size of the LSTM network being the converted dimension. For example, when the size of the picture feature map is batch_size × feature_h × feature_w × output_channels, dimension conversion is first performed to obtain batch_size × feature_w × feature_h × output_channels, so that the width lies in the second dimension, and the result is then recombined into batch_size × feature_w × (feature_h · output_channels). For example, if the input size before the recombination operation is 1 × 200 × 4 × 256, the picture feature map size after recombination is 1 × 200 × 1024; the number of time steps of the LSTM operation is then 200, and the feature dimension input at each time step is 1024.
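Written as TensorFlow ops, the conversion and recombination described above could look like the following sketch; the concrete sizes mirror the 1 × 200 × 4 × 256 numeric example (starting here from the N × H × W × C form 1 × 4 × 200 × 256).

```python
import tensorflow as tf

x = tf.random.normal([1, 4, 200, 256])      # N x H x W x C: batch, height, width, channels
x = tf.transpose(x, perm=[0, 2, 1, 3])      # dimension conversion -> N x W x H x C = 1 x 200 x 4 x 256
x = tf.reshape(x, [1, 200, 4 * 256])        # reshape -> N x W x (H*C) = 1 x 200 x 1024
# x now feeds the LSTM with 200 time steps and a 1024-dimensional feature per time step
```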
Further, the method further comprises:
and performing performance evaluation on the text recognition model through the test sample.
After the initial model is trained with the training samples, a text recognition model is obtained; at this point the parameters of the text recognition model have been tuned. The text recognition model is then used to recognize the pictures of the test samples, and the error between the recognition results and the target results is compared, from which information about the text recognition model such as its accuracy, recognition speed and generalization can be obtained.
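A rough sketch of such a performance evaluation, measuring sequence-level accuracy over the test samples, might look as follows; the decode_prediction helper and the (picture, label) structure of the test set are assumptions.

```python
def evaluate(model, test_samples, decode_prediction):
    """test_samples: iterable of (image_tensor, target_text) pairs."""
    total, correct = 0, 0
    for image, target_text in test_samples:
        predicted_text = decode_prediction(model(image[None, ...]))   # assumed decoding helper
        correct += int(predicted_text == target_text)
        total += 1
    return correct / max(total, 1)   # fraction of test pictures recognized exactly
```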
In conclusion, the embodiment of the invention generates diversified sample data through simulation, so that the accuracy of the text recognition model can be enhanced; the improved Mobilenet-V2 network is adopted as the backbone network of the text recognition model to replace the CNN network of the existing OCR recognition technology, and is combined with the LSTM and CTC architectures, so that the size of the model can be reduced and the model becomes more lightweight, which makes it suitable for deployment on a mobile terminal and realizes real-time photographing and recognition.
Fig. 5 is a flowchart of steps of a text recognition method according to an embodiment of the present invention, and as shown in fig. 5, the method is applied to a mobile terminal, and the method includes:
step 501, obtaining a picture to be identified.
In a specific implementation, the mobile terminal may include a mobile phone, a tablet computer, a PC (Personal Computer), a wearable device (such as a bracelet, glasses or a watch), and the like, and the operating system of the device may include Android, iOS, Windows Phone, Windows, and the like; this is only an example and the embodiment of the present invention is not limited thereto.
In the embodiment of the invention, the picture to be recognized comprises character information. The picture to be recognized may be a picture shot by a camera of the mobile terminal in real time, or may be a picture already stored by the mobile terminal, or a picture generated by the mobile terminal through a screenshot, or a picture downloaded by the mobile terminal through a network, and so on.
Step 502, identifying a picture to be identified by adopting a preset text identification model to obtain a corresponding text identification result; the text recognition model is generated by the method in the model generation method embodiment.
In the example, the picture to be recognized is used as the input of the text recognition model, and when the text recognition model is used for recognizing the characters in the picture to be recognized, the corresponding text recognition result can be automatically obtained based on each parameter in training and learning; because the optimal parameters of the text recognition model are determined in the model training process, the accuracy of the text recognition result can be ensured when the text recognition model recognizes the picture to be recognized in practical application. Specifically, the recognition process of the text recognition model on the picture to be recognized is similar to the step of recognizing sample data by the initial model in the embodiment of the model generation method.
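For illustration, the following sketch shows how a trained text recognition model could be applied to a picture to be recognized and, for deployment on a mobile terminal, converted to TensorFlow Lite. The model path, the input size, greedy CTC decoding and the character table (not shown) are assumptions.

```python
import tensorflow as tf

model = tf.keras.models.load_model("text_recognition_model")        # assumed saved model path
image = tf.io.decode_png(tf.io.read_file("to_recognize.png"), channels=3)
image = tf.image.resize(image, (64, 384)) / 255.0                    # assumed input size
logits = model(image[None, ...])                                     # [1, time_steps, num_classes]
decoded, _ = tf.nn.ctc_greedy_decoder(tf.transpose(logits, [1, 0, 2]),  # time-major for the decoder
                                      sequence_length=[int(logits.shape[1])])
# map the indices in decoded[0] back to characters with the training charset (not shown)

# optional: convert the model for deployment on a mobile terminal
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
open("text_recognition.tflite", "wb").write(tflite_model)
```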
Further, the method further comprises:
and outputting the text recognition result.
In the embodiment of the invention, after the text recognition result corresponding to the picture to be recognized is obtained by the text recognition model, the text recognition result is output to the display screen of the mobile terminal for displaying, so that the mobile terminal can obtain the picture to be recognized in real time and obtain the text recognition result corresponding to the picture to be recognized in real time, and the use convenience of a user is improved.
In summary, in the text recognition method provided by the embodiment of the present invention, because the adopted text recognition model uses diversified sample data during model training, the accuracy of character recognition can be enhanced, and using the improved Mobilenet-V2 network as the backbone network reduces the size of the model, making it more lightweight and deployable on the mobile terminal; the mobile terminal can acquire the picture to be recognized in real time as the input of the text recognition model, the text recognition model performs character recognition on the picture in time to obtain the corresponding text recognition result, and the result is output to the display device of the mobile terminal, thereby achieving real-time acquisition and real-time recognition.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of a model generation apparatus according to an embodiment of the present invention is shown, and may specifically include the following modules:
the sample generation module 601 is configured to generate sample data including a text in a simulation manner, where the sample data includes standard print font data, non-standard print font data, data of diagonal arrangement and horizontal wave arrangement of characters, alternate appearance data of spaces and characters, and image background interference data;
and the model training module 602 is configured to perform initial model training by using the sample data to generate a text recognition model.
In a preferred embodiment of the present invention, the initial model comprises a backbone network formed by a Mobilenet-V2 network improvement, an LSTM network, and a CTC module; the model training module 602 includes:
the characteristic extraction module is used for extracting the characteristics of the sample data by adopting the backbone network to obtain a picture characteristic diagram;
the characteristic identification module is used for performing dimension conversion and reshape on the picture characteristic graph and then inputting the picture characteristic graph into an LSTM network for identification to obtain an initial identification result;
the comparison module is used for calculating a loss function value between the initial identification result and the text of the corresponding sample data by utilizing the CTC module;
and the correction module is used for correcting the initial model parameters according to the loss function values to obtain the text recognition model.
In a preferred embodiment of the present invention, the feature extraction module includes:
the first processing module is used for extracting the features of the sample data by adopting a Mobilenet-V2 network to obtain an initial feature map; the size of the last-layer feature map of the initial feature map is N × H × W × C, the size of the penultimate-layer feature map is N × 2H × 2W × 2C, and the size of the third-from-last-layer feature map is N × 4H × 4W × 4C; wherein N is the batch number, H is the height of the picture feature map, W is the width of the picture feature map, and C is the channel number of the picture feature map;
the second processing module is used for carrying out 1x1 deconvolution on the last layer of feature map of the initial feature map to obtain a first initial feature map with the same size as the feature map of the second last layer;
the third processing module is used for merging and splicing the first initial feature map and the feature map of the penultimate layer to obtain a first layer fusion feature map;
the fourth processing module is used for carrying out 1x1 deconvolution on the first-layer fusion feature map to obtain a second-layer fusion feature map with the same size as the last-but-third-layer feature map;
the fifth processing module is used for performing 1x1 convolution on the last layer of feature map, the first layer of fusion feature map and the second layer of fusion feature map respectively, and then splicing in the C dimension to obtain the picture feature map; the size of the picture feature map is N H W C.
In a preferred embodiment of the present invention, the picture feature map is represented by N × H × W × C, and the model training module 602 includes:
a conversion submodule for converting the representation form of the picture feature map into N × W × H × C;
a recombination submodule for reshaping the converted picture feature map into N × W × (H·C); where W is the time step input to the LSTM network and H·C is the feature dimension of each time step.
In a preferred embodiment of the present invention, the apparatus further comprises:
and the model testing module is used for evaluating the performance of the text recognition model through the test sample.
Referring to fig. 7, a block diagram of a structure of an embodiment of a text recognition apparatus of the present invention is shown, which may specifically include the following modules:
the image obtaining module 701 is configured to obtain an image to be identified.
A text recognition module 702, configured to recognize the picture to be recognized by using a preset text recognition model to obtain a corresponding text recognition result; the text recognition model is generated by the method in the embodiment of the model generation method.
In a preferred embodiment of the present invention, the apparatus further comprises:
and the output module is used for outputting the text recognition result.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention further provides an electronic device, including:
one or more processors; and
one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the electronic device to perform steps of a method as described by embodiments of the invention.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon instructions, which, when executed by one or more processors, cause the processors to perform the steps of the method according to embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above detailed description is provided for a model generation method, a text recognition method, a model generation device and a text recognition device, and the specific examples are applied herein to explain the principle and the implementation of the present invention, and the descriptions of the above examples are only used to help understand the method and the core ideas of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of model generation, the method comprising:
generating sample data containing characters in a simulation mode, wherein the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of the characters, alternate space and character occurrence data and image background interference data;
and performing initial model training by adopting the sample data to generate a text recognition model.
2. The method of claim 1, wherein the initial model comprises a backbone network formed by improving a Mobilenet-V2 network, an LSTM network, and a CTC module; the step of performing initial model training by using the sample data to generate a text recognition model comprises the following steps:
extracting the characteristics of the sample data by adopting the backbone network to obtain a picture characteristic diagram;
performing dimension conversion and reshape on the picture feature map, and inputting the picture feature map into an LSTM network for identification to obtain an initial identification result;
calculating a loss function value between the initial identification result and the characters of the corresponding sample data by using the CTC module;
and correcting the initial model parameters according to the loss function values to obtain the text recognition model.
3. The method according to claim 2, wherein the step of extracting features of the sample data by using the backbone network to obtain a picture feature map comprises:
performing feature extraction on the sample data by using a Mobilenet-V2 network to obtain an initial feature map;
carrying out deconvolution on the last layer of feature map of the initial feature map by 1x1 to obtain a first initial feature map with the same size as the feature map of the second last layer;
merging and splicing the first initial feature map and the penultimate layer feature map to obtain a first layer fusion feature map;
carrying out 1x1 deconvolution on the first-layer fusion feature map to obtain a second-layer fusion feature map with the same size as the last-but-third-layer feature map;
and after 1x1 convolution is respectively carried out on the last layer of feature map, the first layer of fusion feature map and the second layer of fusion feature map, splicing is carried out, and the picture feature map is obtained.
4. The method of claim 2, wherein the picture feature map is represented by N × H × W × C, and the step of inputting the picture feature map into the LSTM network for recognition after performing dimension conversion and reshape includes:
converting the representation form of the picture feature map into N × W × H × C;
reshaping the converted picture feature map into N × W × (H·C); where W is the time step input to the LSTM network and H·C is the feature dimension of each time step.
5. A text recognition method, comprising:
acquiring a picture to be identified;
recognizing the picture to be recognized by adopting a preset text recognition model to obtain a corresponding text recognition result; the text recognition model is generated using the method of any one of claims 1 to 4.
6. The method according to claim 5, wherein after the step of recognizing the picture to be recognized by using a preset text recognition model to obtain a corresponding text recognition result, the method further comprises:
and outputting the text recognition result.
7. A model generation apparatus, comprising:
the system comprises a sample generation module, a data acquisition module and a data processing module, wherein the sample generation module is used for generating sample data containing characters in a simulation mode, and the sample data comprises standard printing font data, non-standard printing font data, data of diagonal arrangement and horizontal wave arrangement of the characters, alternate appearance data of blank spaces and the characters and image background interference data;
and the model training module is used for performing initial model training by adopting the sample data to generate a text recognition model.
8. A text recognition apparatus, comprising:
and the picture acquisition module is used for acquiring the picture to be identified.
The text recognition module is used for recognizing the picture to be recognized by adopting a preset text recognition model so as to obtain a corresponding text recognition result; the text recognition model is generated using the method of any one of claims 1 to 4.
9. An electronic device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform any of the methods of claims 1-4 or cause the electronic device to perform any of the methods of claims 5-6.
10. A computer-readable storage medium having stored thereon instructions, which when executed by one or more processors, cause the processors to perform any of the methods of claims 1-4 or cause the processors to perform any of the methods of claims 5-6.
CN201910888046.4A 2019-09-19 2019-09-19 Model generation method, text recognition method, device, equipment and storage medium Active CN110766050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910888046.4A CN110766050B (en) 2019-09-19 2019-09-19 Model generation method, text recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910888046.4A CN110766050B (en) 2019-09-19 2019-09-19 Model generation method, text recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110766050A true CN110766050A (en) 2020-02-07
CN110766050B CN110766050B (en) 2023-05-23

Family

ID=69329957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910888046.4A Active CN110766050B (en) 2019-09-19 2019-09-19 Model generation method, text recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110766050B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019071660A1 (en) * 2017-10-09 2019-04-18 平安科技(深圳)有限公司 Bill information identification method, electronic device, and readable storage medium
CN109086678A (en) * 2018-07-09 2018-12-25 天津大学 A kind of pedestrian detection method extracting image multi-stage characteristics based on depth supervised learning
CN109840521A (en) * 2018-12-28 2019-06-04 安徽清新互联信息科技有限公司 A kind of integrated licence plate recognition method based on deep learning
CN109753968A (en) * 2019-01-11 2019-05-14 北京字节跳动网络技术有限公司 Generation method, device, equipment and the medium of character recognition model
CN109886105A (en) * 2019-01-15 2019-06-14 广州图匠数据科技有限公司 Price tickets recognition methods, system and storage medium based on multi-task learning
CN110147786A (en) * 2019-04-11 2019-08-20 北京百度网讯科技有限公司 For text filed method, apparatus, equipment and the medium in detection image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jiaming Huang et al.: "Consecutive Feature Network for Object Detection", 2018 IEEE International Conference on Mechatronics and Automation (ICMA)
Kartik Dutta et al.: "Offline Handwriting Recognition on Devanagari Using a New Benchmark Dataset", 2018 13th IAPR International Workshop on Document Analysis Systems (DAS)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111652093A (en) * 2020-05-21 2020-09-11 中国工商银行股份有限公司 Text image processing method and device
CN111652093B (en) * 2020-05-21 2023-10-24 中国工商银行股份有限公司 Text image processing method and device
CN112200182A (en) * 2020-09-25 2021-01-08 杭州加速科技有限公司 Deep learning-based wafer ID identification method and device
CN112215178A (en) * 2020-10-19 2021-01-12 南京大学 Chemical experiment recording system based on pen type interaction
CN111967459A (en) * 2020-10-21 2020-11-20 北京易真学思教育科技有限公司 Model training method, image recognition method, device, equipment and storage medium
CN112446342A (en) * 2020-12-07 2021-03-05 北京邮电大学 Key frame recognition model training method, recognition method and device
CN112446342B (en) * 2020-12-07 2022-06-24 北京邮电大学 Key frame recognition model training method, recognition method and device
CN112948203A (en) * 2021-02-03 2021-06-11 刘靖宇 Elevator intelligent inspection method based on big data

Also Published As

Publication number Publication date
CN110766050B (en) 2023-05-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant