CN115661845A - Image text recognition method, and method and device for training image text recognition model


Info

Publication number: CN115661845A
Application number: CN202110767483.8A
Authority: CN (China)
Prior art keywords: matrix, images, spatial, transition matrix, migration
Other languages: Chinese (zh)
Inventors: 武维, 牛佩, 孙弋淼
Applicant and current assignee: Huawei Technical Service Co Ltd
Classification: Image Analysis (AREA)
Legal status: Pending


Abstract

Embodiments of the application provide an image text recognition method, a method for training an image text recognition model, and a corresponding apparatus, relating to the field of artificial intelligence and in particular to image text recognition. The training method comprises the following steps: acquiring data to be trained, wherein the data to be trained comprises n images from a target domain and m images from a synthetic domain; extracting features of the n images and the m images through a first neural network to construct a spatial migration feature matrix; extracting features of the n images and the m images through a second neural network to construct a sequence migration feature matrix; and adjusting parameters of the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained. By jointly training on images from the synthetic domain and the target domain, the method improves both training efficiency and the text recognition rate.

Description

Image text recognition method, and method and device for training image text recognition model
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to an image text recognition method, an image text recognition model training method, and an image text recognition model training apparatus.
Background
A traditional image text recognition model is usually pre-trained on synthetic data and then fine-tuned on real images. In practical business scenarios, however, a model obtained in this way tends to overfit and has low robustness. How to better recognize image text and how to better train an image text recognition model are therefore problems that urgently need to be solved.
Disclosure of Invention
The image text recognition method, the method for training an image text recognition model, and the corresponding apparatus provided in this application can improve the accuracy of text recognition and the efficiency of model training, and offer high robustness.
In a first aspect, a method of training an image text recognition model is provided, the method comprising:
acquiring data to be trained, wherein the data to be trained comprises n images from a target domain and m images from a synthetic domain, n ≥ 1, m ≥ 1, and n and m are positive integers; extracting features of the n images and the m images through a first neural network to construct a spatial migration feature matrix; extracting features of the n images and the m images through a second neural network to construct a sequence migration feature matrix; and adjusting parameters of the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained.
The images in the target domain described above may be understood as real images, and the images in the synthetic domain may be understood as synthesized images. Both the real images and the synthesized images carry annotations of the text they contain.
In the embodiment of the application, spatial-feature and sequence-feature migration between the synthetic-domain images and the target-domain images is used to construct the training data, and the two domains are trained jointly; this improves the training effect and effectively avoids under-fitting and over-fitting of the model.
With reference to the first aspect, in certain implementations of the first aspect, extracting features of the n images and the m images through the first neural network to construct the spatial migration feature matrix includes: performing convolution processing on the n images through the first neural network to obtain a first spatial feature map, and performing convolution processing on the m images through the first neural network to obtain a second spatial feature map; performing a matrix transformation on the first spatial feature map to obtain a first transition matrix, and performing a matrix transformation on the second spatial feature map to obtain a second transition matrix, wherein the dimensionality of the first transition matrix is smaller than that of the first spatial feature map, and the dimensionality of the second transition matrix is smaller than that of the second spatial feature map; stacking the first transition matrix and the second transition matrix to determine a third transition matrix; constructing a fourth transition matrix from the third transition matrix and the transpose of the third transition matrix; and determining the spatial migration feature matrix by taking the elements of the fourth transition matrix as the input of a kernel function.
With reference to the first aspect, in certain implementations of the first aspect, determining the spatial migration feature matrix by taking the elements of the fourth transition matrix as the input of a kernel function includes: the kernel function is a Gaussian kernel function, and the spatial migration feature matrix is determined according to the following formula:

k(i,j) = exp(−(R_ii + R_jj − 2R_ij) / σ²)

where R is the fourth transition matrix, R_ij is its element in row i and column j, and σ² is the variance of the fourth transition matrix.
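For illustration only, the following Python (PyTorch) sketch shows one way to realize the construction described above. Several details are assumptions not stated in the text: the flattening used as the dimension-lowering matrix transformation, the product of the third transition matrix with its own transpose as the fourth transition matrix, and the feature-map shapes are placeholders chosen so that the kernel above applies directly.

```python
import torch

def gaussian_kernel_matrix(R: torch.Tensor) -> torch.Tensor:
    """Apply k(i,j) = exp(-(R_ii + R_jj - 2*R_ij) / sigma^2) element-wise.

    R is the (n+m) x (n+m) "fourth transition matrix"; sigma^2 is taken here
    as the variance of R's elements, following the description above.
    """
    diag = torch.diagonal(R)                                 # R_ii
    sq_dist = diag.unsqueeze(1) + diag.unsqueeze(0) - 2.0 * R
    return torch.exp(-sq_dist / R.var())

def spatial_migration_matrix(feat_tgt: torch.Tensor,
                             feat_syn: torch.Tensor) -> torch.Tensor:
    """feat_tgt: CNN feature maps of the n target-domain images, (n, C, H, W).
    feat_syn: CNN feature maps of the m synthetic-domain images, (m, C, H, W).
    """
    # "Matrix transformation" that lowers the dimensionality of each feature map:
    # here simply flatten every image's feature map into one row vector
    # (an assumption; the text does not spell out the exact transformation).
    first_transition = feat_tgt.flatten(start_dim=1)     # (n, C*H*W)
    second_transition = feat_syn.flatten(start_dim=1)    # (m, C*H*W)

    # Stack the two transition matrices -> third transition matrix.
    third_transition = torch.cat([first_transition, second_transition], dim=0)

    # Fourth transition matrix from the third transition matrix and its transpose.
    fourth_transition = third_transition @ third_transition.T   # (n+m, n+m)

    # Elements of the fourth transition matrix feed the Gaussian kernel.
    return gaussian_kernel_matrix(fourth_transition)

if __name__ == "__main__":
    n, m = 4, 6
    k = spatial_migration_matrix(torch.randn(n, 512, 4, 25),
                                 torch.randn(m, 512, 4, 25))
    print(k.shape)  # torch.Size([10, 10])
```

With R = Z·Zᵀ (Z being the stacked third transition matrix), the quantity R_ii + R_jj − 2R_ij equals the squared Euclidean distance between rows i and j of Z, which is why the Gaussian kernel form above arises naturally under this interpretation.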
With reference to the first aspect, in certain implementations of the first aspect, the second neural network is a recurrent neural network, and extracting features of the n images and the m images through the second neural network to construct the sequence migration feature matrix includes: extracting sequence features of the first spatial feature map through the recurrent neural network to obtain a first sequence feature map, and extracting sequence features of the second spatial feature map through the recurrent neural network to obtain a second sequence feature map; performing a matrix transformation on the first sequence feature map to obtain a fifth transition matrix, and performing a matrix transformation on the second sequence feature map to obtain a sixth transition matrix, wherein the dimensionality of the fifth transition matrix is smaller than that of the first sequence feature map, and the dimensionality of the sixth transition matrix is smaller than that of the second sequence feature map; stacking the fifth transition matrix and the sixth transition matrix to determine a seventh transition matrix; constructing an eighth transition matrix from the seventh transition matrix and the transpose of the seventh transition matrix; and determining the sequence migration feature matrix by taking the elements of the eighth transition matrix as the input of a kernel function.
With reference to the first aspect, in certain implementations of the first aspect, determining the sequence migration feature matrix by taking the elements of the eighth transition matrix as the input of a kernel function includes: the kernel function is a Gaussian kernel function, and the sequence migration feature matrix is determined according to the following formula:

t(i,j) = exp(−(D_ii + D_jj − 2D_ij) / σ₁²)

where D is the eighth transition matrix, D_ij is its element in row i and column j, and σ₁² is the variance of the eighth transition matrix.
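A corresponding sketch for the sequence branch, under the same assumptions as the spatial example (flattening as the matrix transformation, product with the transpose as the eighth transition matrix); only the input features now come from the recurrent network:

```python
import torch

def _gaussian_kernel(D: torch.Tensor) -> torch.Tensor:
    # t(i,j) = exp(-(D_ii + D_jj - 2*D_ij) / sigma1^2), sigma1^2 = variance of D.
    d = torch.diagonal(D)
    return torch.exp(-(d.unsqueeze(1) + d.unsqueeze(0) - 2.0 * D) / D.var())

def sequence_migration_matrix(seq_tgt: torch.Tensor,
                              seq_syn: torch.Tensor) -> torch.Tensor:
    """seq_tgt: recurrent-network features of the n target-domain images, (n, T, H).
    seq_syn: recurrent-network features of the m synthetic-domain images, (m, T, H).
    """
    fifth = seq_tgt.flatten(start_dim=1)        # fifth transition matrix
    sixth = seq_syn.flatten(start_dim=1)        # sixth transition matrix
    seventh = torch.cat([fifth, sixth], dim=0)  # stacking -> seventh transition matrix
    eighth = seventh @ seventh.T                # eighth transition matrix D
    return _gaussian_kernel(eighth)             # sequence migration feature matrix
```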
With reference to the first aspect, in certain implementations of the first aspect, adjusting the parameters of the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained includes: constructing a loss function according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained; and adjusting the hyperparameters of the loss function.
With reference to the first aspect, in certain implementations of the first aspect, the constructing a loss function according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained includes: the loss function is constructed according to the following formula:
L_total = α₁·L_cnn + α₂·L_rnn + α₃·L_tgt + α₄·L_src

wherein L_total is the loss function; L_cnn is the loss function of the spatial migration feature; L_rnn is the loss function of the sequence migration feature; L_tgt is the loss function for classifying the target-domain text; L_src is the loss function for classifying the synthetic-domain text; and α₁, α₂, α₃, α₄ are hyperparameters that balance the component loss functions.
With reference to the first aspect, in certain implementations of the first aspect, the loss function for the spatial migration feature is constructed according to the following formula:
(formula reproduced only as an image in the original publication)
wherein k (i, j) is an element of the spatial migration feature matrix;
constructing a loss function of the sequence migration characteristics according to the following formula:
(formula reproduced only as an image in the original publication)
wherein t (i, j) is an element of the sequence migration feature matrix.
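Because the formulas for L_cnn and L_rnn are reproduced only as images in the published text, the following Python sketch uses a standard MMD-style reduction of the migration feature matrices as an assumption; the classification losses L_tgt and L_src and the hyperparameter values are likewise placeholders.

```python
import torch

def mmd_from_kernel(K: torch.Tensor, n: int) -> torch.Tensor:
    """Assumed MMD-style reduction of an (n+m)x(n+m) migration feature matrix K,
    whose first n rows/columns belong to the target domain. The published text
    only states that the loss is built from the elements k(i,j) (or t(i,j))."""
    k_tt = K[:n, :n].mean()     # target-target block
    k_ss = K[n:, n:].mean()     # synthetic-synthetic block
    k_ts = K[:n, n:].mean()     # cross-domain block
    return k_tt + k_ss - 2.0 * k_ts

def total_loss(k_spatial, t_sequence, loss_tgt, loss_src, n,
               alphas=(1.0, 1.0, 1.0, 1.0)):
    """L_total = a1*L_cnn + a2*L_rnn + a3*L_tgt + a4*L_src.
    loss_tgt / loss_src are the text-classification losses on the target and
    synthetic domains (e.g. CTC or cross-entropy; not specified here), and the
    alphas are the balancing hyperparameters."""
    a1, a2, a3, a4 = alphas
    l_cnn = mmd_from_kernel(k_spatial, n)      # spatial migration feature loss
    l_rnn = mmd_from_kernel(t_sequence, n)     # sequence migration feature loss
    return a1 * l_cnn + a2 * l_rnn + a3 * loss_tgt + a4 * loss_src
```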
In a second aspect, an image text recognition method is provided, the method comprising: acquiring an image to be recognized; and processing the image to be recognized by using an image text recognition model to obtain a recognition result of the image to be recognized, wherein the image text recognition model is obtained by a method for training an image text recognition model, and the method for training the image text recognition model comprises: extracting features of n images and m images through a first neural network to construct a spatial migration feature matrix; extracting features of the n images and the m images through a second neural network to construct a sequence migration feature matrix; and adjusting the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, the prediction data of the n images, and the prediction data of the m images; wherein the n images are images in the target domain and the m images are images in the synthetic domain.
In the embodiment of the application, spatial-feature and sequence-feature migration between the synthetic-domain images and the target-domain images is used to construct the training data, and the two domains are trained jointly; this improves the training effect, and the resulting image text recognition model recognizes image text with high accuracy.
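As an illustration of the second aspect, the following sketch runs a trained model on a single image; the preprocessing, the model's output shape, and the greedy CTC-style decoding are assumptions, not details taken from this application.

```python
import torch
from PIL import Image
import torchvision.transforms as T

def recognize_text(model: torch.nn.Module, image_path: str, charset: str) -> str:
    """Run a trained image text recognition model on one image.
    The 32x128 grayscale preprocessing, the (1, T, num_classes) output shape and
    the greedy blank-collapsing decoding are illustrative assumptions."""
    preprocess = T.Compose([T.Grayscale(), T.Resize((32, 128)), T.ToTensor()])
    x = preprocess(Image.open(image_path)).unsqueeze(0)   # (1, 1, 32, 128)

    model.eval()
    with torch.no_grad():
        logits = model(x)                  # assumed shape (1, T, num_classes)
    ids = logits.argmax(dim=-1).squeeze(0).tolist()

    # Greedy decoding: collapse repeated indices and drop the blank index 0.
    chars, prev = [], None
    for i in ids:
        if i != prev and i != 0:
            chars.append(charset[i - 1])
        prev = i
    return "".join(chars)
```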
With reference to the second aspect, in some implementations of the second aspect, extracting features of the n images and the m images through the first neural network to construct the spatial migration feature matrix includes: performing convolution processing on the n images through the first neural network to obtain a first spatial feature map, and performing convolution processing on the m images through the first neural network to obtain a second spatial feature map; performing a matrix transformation on the first spatial feature map to obtain a first transition matrix, and performing a matrix transformation on the second spatial feature map to obtain a second transition matrix, wherein the dimensionality of the first transition matrix is smaller than that of the first spatial feature map, and the dimensionality of the second transition matrix is smaller than that of the second spatial feature map; stacking the first transition matrix and the second transition matrix to determine a third transition matrix; constructing a fourth transition matrix from the third transition matrix and the transpose of the third transition matrix; and determining the spatial migration feature matrix by taking the elements of the fourth transition matrix as the input of a kernel function.
With reference to the second aspect, in some implementations of the second aspect, determining the spatial migration feature matrix by taking the elements of the fourth transition matrix as the input of a kernel function includes: the kernel function is a Gaussian kernel function, and the spatial migration feature matrix is determined according to the following formula:

k(i,j) = exp(−(R_ii + R_jj − 2R_ij) / σ²)

where R is the fourth transition matrix, R_ij is its element in row i and column j, and σ² is the variance of the fourth transition matrix.
With reference to the second aspect, in some implementations of the second aspect, the second neural network is a recurrent neural network, and extracting features of the n images and the m images through the second neural network to construct the sequence migration feature matrix includes: extracting sequence features of the first spatial feature map through the recurrent neural network to obtain a first sequence feature map, and extracting sequence features of the second spatial feature map through the recurrent neural network to obtain a second sequence feature map; performing a matrix transformation on the first sequence feature map to obtain a fifth transition matrix, and performing a matrix transformation on the second sequence feature map to obtain a sixth transition matrix, wherein the dimensionality of the fifth transition matrix is smaller than that of the first sequence feature map, and the dimensionality of the sixth transition matrix is smaller than that of the second sequence feature map; stacking the fifth transition matrix and the sixth transition matrix to determine a seventh transition matrix; constructing an eighth transition matrix from the seventh transition matrix and the transpose of the seventh transition matrix; and determining the sequence migration feature matrix by taking the elements of the eighth transition matrix as the input of a kernel function.
With reference to the second aspect, in some implementations of the second aspect, determining the sequence migration feature matrix by taking the elements of the eighth transition matrix as the input of a kernel function includes: the kernel function is a Gaussian kernel function, and the sequence migration feature matrix is determined according to the following formula:

t(i,j) = exp(−(D_ii + D_jj − 2D_ij) / σ₁²)

where D is the eighth transition matrix, D_ij is its element in row i and column j, and σ₁² is the variance of the eighth transition matrix.
With reference to the second aspect, in some implementations of the second aspect, adjusting the parameters of the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained includes: constructing a loss function according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained; and adjusting the hyperparameters of the loss function.
With reference to the second aspect, in some implementations of the second aspect, the constructing a loss function according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained includes: the loss function is constructed according to the following formula:
L_total = α₁·L_cnn + α₂·L_rnn + α₃·L_tgt + α₄·L_src

wherein L_total is the loss function; L_cnn is the loss function of the spatial migration feature; L_rnn is the loss function of the sequence migration feature; L_tgt is the loss function for classifying the target-domain text; L_src is the loss function for classifying the synthetic-domain text; and α₁, α₂, α₃, α₄ are hyperparameters that balance the component loss functions.
With reference to the second aspect, in some implementations of the second aspect, the loss function for the spatial migration feature is constructed according to the following formula:
(formula reproduced only as an image in the original publication)
wherein k (i, j) is an element of the spatial migration feature matrix;
constructing a loss function of the sequence migration characteristics according to the following formula:
(formula reproduced only as an image in the original publication)
wherein t (i, j) is an element of the sequence migration feature matrix.
A third aspect is a training apparatus for training an image text recognition model according to an embodiment of the present application, the training apparatus including a module/unit for performing the method according to the above aspect or any one of the possible designs of the above aspect; these modules/units may be implemented by hardware, or by hardware executing corresponding software.
A fourth aspect is an image text recognition apparatus according to an embodiment of the present application, the apparatus including a module/unit that performs the method according to the above aspect or any one of the possible designs of the above aspects; these modules/units may be implemented by hardware, or by hardware executing corresponding software.
A fifth aspect is a computer-readable storage medium according to an embodiment of the present application, where the computer-readable storage medium includes a computer program, and when the computer program runs on an electronic device, the electronic device is caused to execute the technical solution according to any one of the above aspects or any possible design thereof.
A sixth aspect is a chip according to an embodiment of the present application, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to execute the foregoing aspects and any design solutions of the foregoing aspects.
A seventh aspect is a computer program according to an embodiment of the present application, where the computer program includes instructions that, when executed on a computer, cause the computer to perform the technical solution according to any one of the above aspects or any possible design thereof.
For the beneficial effects of the fourth aspect to the seventh aspect, please refer to the beneficial effects of the first aspect and the second aspect, which are not repeated here.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence body framework.
Fig. 2 is a block diagram illustrating a system architecture according to an embodiment of the present disclosure.
FIG. 3 is a schematic block diagram of a convolutional neural network model provided in an implementation of the present application.
FIG. 4 is a schematic block diagram of a recurrent neural network model provided in the practice of the present application.
Fig. 5 is a schematic diagram of a chip hardware structure according to an embodiment of the present application.
Fig. 6 is a structural block diagram of a system architecture provided in an embodiment of the present application.
Fig. 7 is a schematic flow chart of a method for training an image text recognition model provided in an embodiment of the present application.
Fig. 8 is a schematic flowchart of an image text recognition method according to an embodiment of the present application.
Fig. 9 is a schematic block diagram of a training apparatus for an image text recognition model according to an embodiment of the present application.
Fig. 10 is a schematic block diagram of an image text recognition apparatus provided in an embodiment of the present application.
Fig. 11 is a schematic hardware structure diagram of a training apparatus for an image text recognition model according to an embodiment of the present application.
Fig. 12 is a schematic hardware configuration diagram of an image text recognition apparatus according to an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an" and "the" are intended to include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of the present application, "at least one" and "one or more" mean one, two, or more. The term "and/or" is used to describe the association relationship of associated objects and means that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The technical solution in the present application will be described below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system, adapted to the general artificial intelligence field requirements.
The artificial intelligence topic framework described above is described in detail below in two dimensions, "intelligent information chain" (horizontal axis) and "Information Technology (IT) value chain" (vertical axis).
The "smart information chain" reflects a list of processes processed from the acquisition of data. For example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making and intelligent execution and output can be realized. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.
The 'IT value chain' reflects the value of the artificial intelligence to the information technology industry from the bottom infrastructure of the human intelligence, information (realization of providing and processing technology) to the industrial ecological process of the system.
(1) Infrastructure:
the infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform.
The infrastructure may communicate with the outside through sensors, and the computing power of the infrastructure may be provided by a smart chip.
The intelligent chip may be a hardware acceleration chip such as a Central Processing Unit (CPU), a neural-Network Processing Unit (NPU), a Graphics Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC), and a Field Programmable Gate Array (FPGA).
The infrastructure platform of the infrastructure may include distributed computing framework, network and other related platform guarantees and supports, and may include cloud storage and computing, interconnection networks and the like.
For example, for an infrastructure, data may be obtained through sensors and external communications and then provided to an intelligent chip in a distributed computing system provided by the base platform for computation.
(2) Data:
data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing:
the data processing generally includes processing modes such as data training, machine learning, deep learning, searching, reasoning, decision making and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General-purpose capability:
after the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent products and industrial applications:
Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they encapsulate the overall artificial intelligence solution, commercialize intelligent information decision-making, and realize practical applications. The main application fields include: intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, intelligent terminals, and the like.
The embodiment of the application can be applied to many fields in artificial intelligence, such as intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving and the like.
In particular, the embodiments of the present application may be specifically applied to fields requiring the use of (deep) neural networks, such as automatic driving, image classification, image retrieval, image semantic segmentation, image quality enhancement, image super-resolution, and natural language processing.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the sake of understanding, the following description will be made first of all with respect to terms and concepts of the neural networks to which the embodiments of the present application may relate.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

h_{W,b}(x) = f(W^T·x) = f( Σ_{s=1}^{n} W_s·x_s + b )

where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many of these single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
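As a concrete, non-normative illustration, the single neural unit above can be written in a few lines of Python:

```python
import numpy as np

def neural_unit(x, w, b):
    """Output of one neural unit: f(sum_s W_s * x_s + b), with f = sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), b=0.2))
```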
(2) Deep neural network
Deep Neural Networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple hidden layers. The DNNs are divided according to the positions of different layers, and neural networks inside the DNNs can be divided into three categories: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer.
Although a DNN appears complex, the work of each layer is actually not complex; it is simply the following linear relational expression:

y = α(W·x + b)

where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.

In summary, the coefficient from the kth neuron of layer L−1 to the jth neuron of layer L is defined as W^L_{jk}.

Note that the input layer has no W parameter. In a deep neural network, more hidden layers make the network better able to describe complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means that it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
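The layer-by-layer computation can be illustrated with a small Python sketch; the layer sizes and the ReLU activation are arbitrary choices, not prescribed by this application:

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Fully connected DNN: every layer computes y = alpha(W @ x + b).
    W[j, k] of layer L is the coefficient from neuron k of layer L-1 to neuron j of layer L."""
    alpha = lambda z: np.maximum(z, 0.0)   # ReLU as the example activation
    for W, b in zip(weights, biases):
        x = alpha(W @ x + b)
    return x

# Example: input 4 -> hidden 8 -> hidden 8 -> output 2 (the input layer has no W).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 4)), rng.standard_normal((8, 8)), rng.standard_normal((2, 8))]
bs = [np.zeros(8), np.zeros(8), np.zeros(2)]
print(dnn_forward(rng.standard_normal(4), Ws, bs))
```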
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolution processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The convolution kernel may be formalized as a matrix of random size, and may be learned to obtain reasonable weights during the training of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
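A minimal PyTorch sketch of a convolutional layer with shared weights; the channel counts and input size are arbitrary:

```python
import torch
import torch.nn as nn

# One convolutional layer: 3 input channels, 16 learned 3x3 kernels (shared weight matrices).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 100)   # a batch containing one 32x100 RGB image
y = conv(x)                      # the same 16 kernels are shared over every spatial position
print(y.shape)                   # torch.Size([1, 16, 32, 100])
```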
(4) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is truly desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the truly desired target value (of course, there is usually an initialization process before the first update, namely pre-configuring parameters for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
(5) Back propagation algorithm
A neural network can use a back propagation (BP) algorithm to correct the values of the parameters in the neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, and aims to obtain optimal parameters of the neural network model, such as the weight matrices.
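The interplay of the loss function and back propagation can be illustrated with a minimal PyTorch training step; the toy model and data are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                               # toy network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                        # measures prediction vs. target

x, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = loss_fn(model(x), target)   # forward pass produces the error loss
loss.backward()                    # error loss propagated backwards -> gradients
optimizer.step()                   # parameters updated to reduce the loss
optimizer.zero_grad()
```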
(6) Recurrent neural network
Recurrent neural networks (RNNs) are used to process sequence data. In a traditional neural network model, the layers from the input layer through the hidden layer to the output layer are fully connected, while the nodes within each layer are unconnected. Although this ordinary neural network solves many problems, it is still incapable of handling many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the words before and after in a sentence are not independent. The RNN is called a recurrent neural network because the current output of a sequence is also related to the previous output. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes in the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. Training an RNN is the same as training a conventional CNN or DNN.

Given that convolutional neural networks already exist, why are recurrent neural networks needed? The reason is simple: in a convolutional neural network there is a premise that the elements are independent of each other and the inputs and outputs are also independent, but in the real world many elements are connected with each other. RNNs aim to give machines the ability to remember like humans. Therefore, the output of an RNN needs to depend on both the current input information and the historical memory information.
(7) Pixel value
The pixel value of an image may be a red, green, blue (RGB) color value, and the pixel value may be a long integer identifying the color. For example, a pixel value is 256×Red + 100×Green + 76×Blue, where Red, Green and Blue represent the red, green and blue components respectively. In each color component, a smaller value means lower luminance and a larger value means higher luminance. For a grayscale image, the pixel value may be a grayscale value.
(8) Transfer learning
In the field of artificial intelligence, implementing an artificial intelligence method usually requires a model. When a model from one application scenario needs to be transferred to another scenario, a model suitable for the new scenario must be established. In this process, in order to reuse pre-trained models that were obtained at great cost in time and computing resources, it is common to use a pre-trained model (the model of the original application scenario) as the initial model of the new model (the model of the new application scenario); this is where transfer learning (transfer learning) comes from. That is, transfer learning can migrate learned, powerful skills to a related problem, or can be understood as reusing the model of one task on another task.
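A typical fine-tuning pattern of the kind described above, sketched in PyTorch; the backbone, the frozen layers and the 37-class head are illustrative assumptions, not part of this application:

```python
import torch.nn as nn
import torchvision.models as models

# Start from a model pre-trained in the original scenario (here: ImageNet weights,
# assuming torchvision >= 0.13 for the `weights` argument).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the transferred layers; only the task-specific head is re-learned.
for p in backbone.parameters():
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, 37)  # e.g. 36 characters + 1 blank

# Training then updates only backbone.fc on the new scenario's data.
```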
As shown in fig. 2, the present embodiment provides a system architecture 200. In fig. 2, a data acquisition device 260 is used to acquire training data. For the image text recognition method in the embodiment of the application, the training data may include training images and classification results corresponding to the training images, where the results of the training images may be results of manual pre-labeling.
After the training data is collected, the data collection device 260 stores the training data in the database 230, and the training device 220 trains the target model/rule 201 based on the training data maintained in the database 230.
The following describes that the training device 220 obtains the target model/rule 201 based on the training data, and the training device 220 processes the input original image, compares the output image with the original image until the difference between the output image of the training device 220 and the original image is smaller than a certain threshold, thereby completing the training of the target model/rule 201.
The above target model/rule 201 can be used to implement the image text recognition method of the embodiment of the present application. The target model/rule 201 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 230 does not necessarily come from the acquisition of the data acquisition device 260, and may also be received from other devices. It should be noted that, the training device 220 does not necessarily perform the training of the target model/rule 201 based on the training data maintained by the database 230, and may also obtain the training data from the cloud or other places for performing the model training, and the above description should not be taken as a limitation to the embodiments of the present application.
The target model/rule 201 obtained by training with the training device 220 may be applied to different systems or devices, for example, the execution device 210 shown in fig. 2. The execution device 210 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 2, the execution device 210 is configured with an input/output (I/O) interface 212 for data interaction with external devices, and a user may input data to the I/O interface 212 through the client device 240, where the input data may include the image to be recognized input by the client device.
The preprocessing module 213 and the preprocessing module 214 are configured to perform preprocessing according to input data (such as an image to be processed) received by the I/O interface 212, and in this embodiment of the application, the preprocessing module 213 and the preprocessing module 214 may not be provided (or only one of them may be provided), and the computing module 211 is directly used to process the input data.
In the process that the execution device 210 preprocesses the input data or in the process that the calculation module 211 of the execution device 210 executes the calculation or other related processes, the execution device 210 may call the data, the code, and the like in the data storage system 250 for corresponding processes, or store the data, the instruction, and the like obtained by corresponding processes in the data storage system 250. The calculation module 211 processes the input data using the object model/rule 201, for example, in this embodiment, processes the input image to be processed, and obtains an image processing result (e.g., a text recognition result).
Finally, the I/O interface 212 returns the processing result, such as the text recognition result of the image obtained as described above, to the client device 240, thereby providing it to the user.
It should be noted that the training device 220 may generate corresponding target models/rules 201 for different targets or different tasks based on different training data, and the corresponding target models/rules 201 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 2, the user may manually specify the input data, which may be operated through an interface provided by the I/O interface 212. Alternatively, the client device 240 may automatically send the input data to the I/O interface 212; if automatic sending of the input data by the client device 240 requires authorization from the user, the user may set the corresponding permission in the client device 240. The user may view the result output by the execution device 210 at the client device 240, and the specific presentation form may be display, sound, action, and the like. The client device 240 may also serve as a data collection terminal that collects the input data of the I/O interface 212 and the output result of the I/O interface 212 as new sample data and stores them in the database 230. Of course, the input data input to the I/O interface 212 and the output result output from the I/O interface 212 as shown in the figure may also be stored directly in the database 230 as new sample data by the I/O interface 212 without being collected by the client device 240.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.
As shown in fig. 2, a target model/rule 201 is obtained by training with the training device 220. In this embodiment, the target model/rule 201 may be the neural network in the present application; specifically, the neural network constructed in the embodiment of the present application may be a CNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), or the like.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 3. As described in the introduction of the basic concept above, CNN is a deep neural network with a convolution structure, i.e. a deep learning (deep learning) architecture, which refers to learning at multiple levels at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 3, CNN300 may include an input layer 310, a convolutional/pooling layer 320 (where the pooling layer is optional), and a neural network layer 330. The relevant contents of these layers are described in detail below.
Convolutional layer/pooling layer 320:
a convolutional layer:
the convolutional/pooling layer 320 as shown in fig. 3 may include layers as in examples 321-326, for example: in one implementation, 321 layers are convolutional layers, 322 layers are pooling layers, 323 layers are convolutional layers, 323 layers are pooling layers, 323 layers are convolutional layers, 326 layers are pooling layers; in another implementation, the 321 and 322 layers are convolutional layers, the 323 layer is a pooling layer, the 324 and 325 layers are convolutional layers, and the 326 layer is a pooling layer. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
The inner operation of a layer of convolutional layer will be described below by taking convolutional layer 321 as an example.
Convolutional layer 321 may include many convolution operators, also called kernels, whose role in image text recognition is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on the image, the weight matrix is usually processed along the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride), so as to complete the work of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends to the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), i.e. multiple homotypic matrices, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices may be used to extract different features from the image: for example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract feature information from the input image, so that the convolutional neural network 300 can perform correct prediction.
When convolutional neural network 300 has multiple convolutional layers, the initial convolutional layer (e.g., 321) tends to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 300 increases, the more convolutional layers (e.g., 326) that follow further are extracted with more complex features, such as features with higher levels of semantics, which are more suitable for the problem to be solved.
A pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 321-326 illustrated as 320 in fig. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. The purpose of the pooling layer during image processing is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may compute the pixel values in the image within a specific range to produce an average value as the result of average pooling. The max pooling operator may take the pixel with the largest value within a specific range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents an average value or a maximum value of the corresponding sub-region of the image input to the pooling layer.
The neural network layer 330:
After processing by the convolutional layer/pooling layer 320, the convolutional neural network 300 is not yet sufficient to output the required output information, because, as described above, the convolutional layer/pooling layer 320 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 300 needs to use the neural network layer 330 to generate an output of one required class or a group of required classes. Therefore, the neural network layer 330 may include multiple hidden layers (331, 332 to 33n shown in fig. 3) and an output layer 340, and the parameters contained in the hidden layers may be obtained by pre-training based on relevant training data of a specific task type; for example, the task type may include image text recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 330, the last layer of the entire convolutional neural network 300 is the output layer 340. The output layer 340 has a loss function similar to categorical cross-entropy, specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 300 is completed (the propagation from 310 to 340 in fig. 3 is the forward propagation), the backward propagation (the propagation from 340 to 310 in fig. 3 is the backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 300 and the error between the result output by the convolutional neural network 300 through the output layer and the ideal result.
It should be noted that the convolutional neural network 300 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
In this application, the image text recognition model may include the convolutional neural network 300 shown in fig. 3, and the image text recognition model may process the image to be processed to obtain a text recognition result of the image to be processed.
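For illustration, a compact PyTorch sketch of a convolutional feature extractor of the kind fig. 3 describes, producing a per-column feature sequence that a recurrent part could consume; the layer and channel choices are assumptions, not taken from this application:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Convolution/pooling stack that turns a text-line image into a feature sequence."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 32x128 -> 16x64
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x64 -> 8x32
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((8, 1)),                                          # collapse the height
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)                  # (B, 256, 1, W')
        return f.squeeze(2).permute(0, 2, 1)  # (B, W', 256): one feature vector per column

print(TextCNN()(torch.randn(2, 1, 32, 128)).shape)  # torch.Size([2, 32, 256])
```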
Since RNN is a very common neural network, the structure of RNN will be described in detail below with reference to fig. 4.
Fig. 4 is a schematic structural diagram of an RNN model provided in an embodiment of the present application. Each circle in the figure can be regarded as a unit, and every unit does the same thing, so the diagram can be folded into the form shown in the left half. Summed up in one sentence, an RNN is the repeated use of a single unit structure.

An RNN is a sequence-to-sequence (seq2seq) model. Suppose the inputs x_{t-1}, x_t, x_{t+1} are the successive words of a sentence, in the original example "I", "am", "China"; then o_{t-1} and o_t should correspond to "am" and "China". What is the most likely rightmost word? The probability that o_{t+1} should be "person" (completing the phrase "Chinese person") is relatively large.

Therefore, we can make the following definitions:

x_t denotes the input at time t, o_t denotes the output at time t, and s_t denotes the memory at time t. The output at the current time is determined by the memory and the input at the current time. By analogy, a fourth-year university student's knowledge is a combination of the knowledge learned in the fourth year (the current input) and the knowledge learned in the third year and earlier (the memory). An RNN is similar: the neural network can integrate many pieces of content together through a series of parameters and then learn these parameters. This defines the basis of the RNN:

s_t = f(U·x_t + W·s_{t-1})

The function f() is an activation function in the neural network, which is used to filter information; it may be tanh or another function. U and W are weight matrices.

When the RNN makes a prediction, it predicts with the memory s_t at the current time; softmax can be applied to predict the content, which can be expressed as:

o_t = softmax(V·s_t)

where o_t denotes the output at time t and V is a weight matrix.
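The two formulas can be turned into a minimal Python sketch of an unrolled RNN; the dimensions are arbitrary:

```python
import numpy as np

def rnn_forward(xs, U, W, V):
    """Unrolled RNN: s_t = tanh(U @ x_t + W @ s_{t-1}), o_t = softmax(V @ s_t)."""
    s = np.zeros(W.shape[0])
    outputs = []
    for x_t in xs:
        s = np.tanh(U @ x_t + W @ s)                            # memory at time t
        logits = V @ s
        outputs.append(np.exp(logits) / np.exp(logits).sum())   # softmax output o_t
    return outputs

# Toy sizes: 3 time steps, 5-dim inputs, 8-dim hidden state, 4-way output.
rng = np.random.default_rng(0)
outs = rnn_forward(rng.standard_normal((3, 5)),
                   rng.standard_normal((8, 5)),
                   rng.standard_normal((8, 8)),
                   rng.standard_normal((4, 8)))
print(outs[-1])
```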
It should be noted that the RNN shown in fig. 4 is only an example of a recurrent neural network, and in a specific application, the recurrent neural network may also exist in the form of other network models, which is not limited in this embodiment of the present application.
Fig. 5 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 50. The chip may be provided in an execution device 210 as shown in fig. 2 to complete the calculation work of the calculation module 211. The chip may also be disposed in the training device 220 as shown in fig. 2 to complete the training work of the training device 220 and output the target model/rule 201. The algorithm for each layer in the convolutional neural network shown in fig. 3 can be implemented in a chip as shown in fig. 5.
The neural network processor NPU 50 is mounted as a coprocessor on a main CPU (host CPU) and tasks are distributed by the main CPU. The core portion of the NPU is an arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data in a memory (weight memory or input memory) and perform arithmetic.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 503 fetches the data corresponding to the matrix B from the weight memory 502 and buffers it in each PE in the arithmetic circuit 503. The arithmetic circuit 503 takes the matrix a data from the input memory 501 and performs matrix arithmetic on the matrix B, and stores a partial result or a final result of the obtained matrix in an accumulator (accumulator) 508.
The vector calculation unit 507 may further process the output of the operation circuit 503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, and the like.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 503, for example, for use in subsequent layers in a neural network.
The unified memory 506 is used to store input data as well as output data.
A storage unit access controller 505 (direct memory access controller, DMAC) is used to transfer the input data in the external memory to the input memory 501 and/or the unified memory 506, to store the weight data in the external memory into the weight memory 502, and to store the data in the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a bus.
An instruction fetch buffer 509, connected to the controller 504, is used to store instructions used by the controller 504;
the controller 504 is configured to invoke the instructions cached in the instruction fetch buffer 509 to control the working process of the operation accelerator.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are On-Chip memories, and the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
The operation of each layer in the convolutional neural network shown in fig. 3 may be performed by the operation circuit 503 or the vector calculation unit 507.
The training device 220 in fig. 2 described above can perform the steps of the training method in the embodiments of the present application, and the execution device 210 in fig. 2 can perform the steps of the image text recognition method in the embodiments of the present application. The CNN model shown in fig. 3, the RNN model shown in fig. 4, and the chip shown in fig. 5 can also be used to perform the steps of the image text recognition method in the embodiments of the present application, and the chip shown in fig. 5 can likewise be used to perform the steps of the training method in the embodiments of the present application.
As shown in fig. 6, the present embodiment provides a system architecture 600, which includes a local device 601, a local device 602, an execution device 610, and a data storage system 650, wherein the local device 601 and the local device 602 are connected to the execution device 610 through a communication network.
The execution device 610 may be implemented by one or more servers. Optionally, the execution device 610 may cooperate with other computing devices, such as data storage, routers, load balancers, and the like. The execution device 610 may be disposed on one physical site or distributed across multiple physical sites. The execution device 610 may use data in the data storage system 650 or call program code in the data storage system 650 to implement the method of training a neural network or the image text recognition method of the embodiments of the present application.
Specifically, the execution device 610 may perform the following processes:
acquiring data to be trained, wherein the data to be trained comprises n images from a target domain and m images from a synthesis domain, n is more than or equal to 1, m is more than or equal to 1, and n and m are positive integers;
determining a spatial migration characteristic matrix of data to be trained according to a first neural network;
determining a sequence migration characteristic matrix of the data to be trained according to the second neural network;
and training by using the space migration characteristic matrix and the sequence migration characteristic matrix to obtain a trained image text recognition model.
Through the above process, the execution device 610 can obtain a trained model, which can be used for image text recognition and the like.
A user may operate respective user devices (e.g., local device 601 and local device 602) to interact with the execution device 610. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.
The local devices of each user may interact with the execution device 610 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, or any combination thereof.
In one implementation, the local device 601 and the local device 602 acquire relevant parameters of the neural network from the execution device 610, deploy the neural network on the local device 601 and the local device 602, and perform image text recognition by using the neural network.
In another implementation, the neural network may be directly deployed on the execution device 610, and the execution device 610 obtains the image to be recognized from the local device 601 and the local device 602 and performs image text recognition by using the neural network.
In one implementation manner, the local device 601 and the local device 602 acquire relevant parameters of the image text recognition apparatus from the execution device 610, deploy the image text recognition apparatus on the local device 601 and the local device 602, and perform image text processing on an image to be processed by using the image text recognition apparatus.
In another implementation, the image text recognition apparatus may be directly deployed on the execution device 610, and the execution device 610 obtains the image to be processed from the local device 601 and the local device 602 and performs text recognition on the image to be processed by using the image text recognition apparatus.
That is to say, the execution device 610 may also be a cloud device, and in this case, the execution device 610 may be deployed in the cloud; alternatively, the execution device 610 may also be a terminal device, in which case, the execution device 610 may be deployed at a user terminal side, which is not limited in this embodiment of the application.
The image text recognition method provided by the embodiments of the present application can be executed on a server, in the cloud, or on a terminal device. Taking a terminal device as an example, the technical solution of the embodiments of the present invention can be applied to a terminal device, and the image text recognition method in the embodiments of the present invention can perform text recognition on an input image to obtain the text present in the input image. The terminal device may be mobile or fixed; for example, the terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a camcorder, a smart watch, a wearable device (WD), an autonomous vehicle, or the like, which is not limited in the embodiments of the present invention.
Correctly recognizing text information in images is an important part of building digital management: it saves the time spent on tedious manual recording and manual reading, reduces labor cost, and helps improve the efficiency of building a digital site. However, training an image text recognition model requires a large amount of labeled training data, and for images with few training samples, the text in the images cannot be recognized effectively. In many cases, it is very difficult to obtain valid data.
Based on the above problems, the embodiment of the application provides an image text recognition method and a training method of an image text recognition model, which can well realize image text recognition and improve the generalization capability of the model under the condition of less training samples (real pictures).
FIG. 7 is a schematic flow chart of an image text recognition model training method 700 according to an embodiment of the present application. The method shown in fig. 7 may be executed by a training system of a neural network model; the training system may be a service device or a mobile terminal, for example, a device with strong computing capability such as a computer device, a server device, or a computing device. The process includes the following steps.
the method 700 includes steps S710 to S750, and the steps S710 to S750 are described in detail below.
In some embodiments, the method 700 may be performed by the execution device 210 in fig. 2, the training device 220, the chip shown in fig. 5, or the execution device 610 in fig. 6.
S710, acquiring n real images and m composite images.
Here n ≥ 1, m ≥ 1, and n and m are positive integers; in some embodiments, n = m.
Specifically, n real images may be acquired from the target domain, and m composite images may be acquired from the composite domain.
The target domain can be understood as a data set space consisting of images that were actually acquired and have already been annotated. Illustratively, an image of a telecommunication device that contains the text information "DCDU" (the type of the device) and has been labeled "DCDU" can be understood as an image in the target domain.
A composite domain may be understood as a data set space made up of images generated by a composite technique. For example, a plurality of images containing the "DCDU" text information may be generated by a synthesis technique, wherein the plurality of images may differ in terms of size, font, and the like of the "DCDU" text information. The image generation by the synthesis technique can refer to the prior art, and is not described in detail herein.
In summary, the n real images and the m composite images all have labels.
Optionally, the number of images in the synthesis domain is greater than the number of images in the target domain. Because real images are difficult to acquire, in some embodiments the target domain contains fewer images than the synthesis domain when the two domains are generated. For example, there are 100 images in the target domain and 10000 images in the synthesis domain.
Alternatively, after acquiring the n real images and the m composite images, the n real images and the m composite images may be preprocessed, for example, randomly rotated, normalized, and the like.
Optionally, the n real images and the m composite images may be subjected to image expansion processing such as cropping, flipping, and/or data hallucination (hallucinator), so as to obtain more new images.
S720, determining a first image matrix and a second image matrix according to the n real images and the m composite images.
Specifically, a first image matrix may be determined according to the acquired n real images, where the size of the first image matrix is [n, h_1, w_1, c_1], h_1 is the height of the real images, w_1 is the width of the real images, and c_1 is the number of color channels of the real images. When c_1 = 3, the image may be considered to consist of the three RGB components: R is the red channel, G is the green channel, and B is the blue channel.
It should be noted that, in the embodiment of the present application, only the number of color channels is 3 as an example, but the present application is not limited thereto, for example, when the image is preprocessed and exists in the form of a gray scale map, the number of color channels is not equal to 3.
It should be understood that, reference may be made to the prior art for determining the first image matrix after acquiring n real images, which is not described herein in detail.
The size of the matrix can be understood as the overall dimension of the matrix or also as the length of the matrix in the individual dimensions. Illustratively, [ n, h, w, c ] may indicate that the first image matrix is a 4-dimensional matrix, where the first dimension is n, the second dimension is h, the third dimension is w, and the fourth dimension is c.
Alternatively, the dimension of the first image matrix may also be expressed as n × h_1 × w_1 × c_1.
The present embodiment relates to a multidimensional matrix, and a representation of the multidimensional matrix will be described below.
A one-dimensional matrix can be represented as [1, 2, 3, 4];
a two-dimensional matrix can be represented as [[1, 2, 3], [4, 5, 6]], and its size is [2, 3];
a three-dimensional matrix can be represented as [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], and its size is [2, 2, 3];
a four-dimensional matrix can be represented as [[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]], and its size is [2, 2, 2, 3].
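For illustration only, the sizes listed above can be checked directly; the following sketch uses NumPy purely to print the shape of each example matrix and is not part of the embodiments.

```python
import numpy as np

m1 = np.array([1, 2, 3, 4])                                    # 1-D matrix
m2 = np.array([[1, 2, 3], [4, 5, 6]])                          # 2-D matrix, size [2, 3]
m3 = np.array([[[1, 2, 3], [4, 5, 6]],
               [[7, 8, 9], [10, 11, 12]]])                     # 3-D matrix, size [2, 2, 3]
m4 = np.array([[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
               [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]])  # 4-D matrix

print(m1.shape, m2.shape, m3.shape, m4.shape)
# (4,) (2, 3) (2, 2, 3) (2, 2, 2, 3)
```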
A second image matrix may be determined according to the obtained m composite images, where the size of the second image matrix is [m, h_2, w_2, c_2], h_2 is the height of the composite images, w_2 is the width of the composite images, and c_2 is the number of color channels of the composite images. When c_2 = 3, the image may be considered to consist of the three RGB components: R is the red channel, G is the green channel, and B is the blue channel.
For convenience of description, the embodiments of the present application take h_1 = h_2, w_1 = w_2, and c_1 = c_2 as an example, but the application is not limited thereto: the three pairs of parameters may all differ, or some may be the same and some different.
And S730, determining a space migration characteristic matrix according to the first neural network, the first image matrix and the second image matrix.
The first neural network may be a convolutional neural network. Through convolution and down-sampling of the convolutional neural network, the first image matrix may be converted into a spatial feature map of size [n, h_3, w_3, channels_1], where h_3 is the height of the down-sampled image, w_3 is the width of the down-sampled image, and channels_1 is the number of spatial feature maps.
The spatial feature map can be understood as the features extracted after the first neural network performs convolution over the height and width of the image; channels_1 can then be understood as the number of extracted features, and channels_1 is related to the number of convolution kernels of the convolutional neural network.
The spatial feature map may be used to indicate image features of the n real images and the m composite images.
Alternatively, the feature vectors of the n real images may be extracted according to the first neural network to construct the first spatial feature map without going through step S720, and similarly, the feature vectors of the m synthetic images may be extracted according to the first neural network to construct the second spatial feature map without going through step S720.
In one embodiment, after down-sampling, the height h_3 of the first image matrix is 1. For example, a first image matrix of size [16, 32, 100, 3] may be converted into a spatial feature map of size [16, 1, 25, 512]. For convenience of explanation and presentation, the following description takes 1 as the height of the down-sampled image, but the application is not limited thereto.
Similarly, the second image matrix may be converted into a spatial feature map of size [m, h_3, w_3, channels_1] through convolution and down-sampling of the convolutional neural network.
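As a rough sketch of how the shapes in S730 can arise, the PyTorch snippet below stacks a few convolution and pooling layers so that an input of height 32 and width 100 is reduced to a feature map whose height is 1, matching the example [16, 32, 100, 3] → [16, 1, 25, 512]. The layer choices (channel counts, kernel sizes, pooling strides) are assumptions for illustration, not the specific network of the embodiments.

```python
import torch
import torch.nn as nn

# assumed backbone; reduces the image height to 1 while keeping 25 positions along the width
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),       # 32x100 -> 16x50
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),     # 16x50  ->  8x25
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),  #  8x25  ->  4x25
    nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),  #  4x25  ->  2x25
    nn.MaxPool2d((2, 1)),                                                #  2x25  ->  1x25
)

real = torch.randn(16, 3, 32, 100)    # n real images (PyTorch uses NCHW layout)
synth = torch.randn(16, 3, 32, 100)   # m composite images
first_map = backbone(real)            # first spatial feature map
second_map = backbone(synth)          # second spatial feature map
print(first_map.shape)                # torch.Size([16, 512, 1, 25]); the text's [n, h_3, w_3, channels_1]
                                      # convention corresponds to [16, 1, 25, 512] after permuting axes
```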
Dimension conversion is then performed on the first spatial feature map and the second spatial feature map obtained through convolution and down-sampling, so as to facilitate kernel function processing.
In one embodiment, when h_3 = 1, the first spatial feature map is converted into a first transition matrix of size [n, 1*w_3*channels_1], and the second spatial feature map is converted into a second transition matrix of size [m, 1*w_3*channels_1]. It should be understood that the first transition matrix and the second transition matrix are two-dimensional matrices.
Alternatively, the dimension of the first transition matrix may also be expressed as n × (1 × w_3 × channels_1), and the dimension of the second transition matrix as m × (1 × w_3 × channels_1).
In the embodiments of the present application, a first transition matrix of size [n, 1*w_3*channels_1] and a second transition matrix of size [m, 1*w_3*channels_1] are taken as an example, but the application is not limited thereto. The length of the second dimension of the first and second transition matrices may be obtained through different mathematical operations on the height, the width, and the number of spatial feature maps; for example, in another embodiment, the length of the second dimension of the first and second transition matrices may also be h_3 + w_3 + channels_1, and in yet another embodiment it may be h_3 + w_3*channels_1.
The spatial migration feature matrix is determined according to the obtained first transition matrix and second transition matrix: the first transition matrix and the second transition matrix are stacked to obtain a third transition matrix of size [n+m, 1*w_3*channels_1], and then, based on a kernel function, the third transition matrix may be converted into the spatial migration feature matrix.
It should be understood that matrix stacking is prior art and will not be described in detail herein.
The present application relates to a kernel function in the embodiments, and the kernel function will be described in detail below.
Kernel function: a support vector machine maps the input space to a high-dimensional feature space through some non-linear transformation φ(x). The dimension of the feature space may be very high. If the solution of the support vector machine only involves inner product operations, and there exists a function K(x, x') in the low-dimensional input space that is exactly equal to the inner product in the high-dimensional space, that is,

K(x, x') = <φ(x), φ(x')>,

then the support vector machine does not need to compute the complex non-linear transformation; the inner product is obtained directly from the function K(x, x'), which greatly simplifies the calculation. Such a function K(x, x') is called a kernel function.
The kernel function includes a linear kernel function, a polynomial kernel function, a gaussian kernel function, and the like.
Illustratively, the kernel function in the embodiments of the present application may be a Gaussian kernel function. A Gaussian kernel, also called a radial basis function, is a commonly used kernel that can map a finite-dimensional space into a high-dimensional space. Define the third transition matrix as M and the fourth transition matrix as R = M*M^T. A spatial migration feature matrix K is constructed based on the fourth transition matrix R, where each element in the matrix K satisfies formula (1):
k(i, j) = exp(-(R_ii + R_jj - 2R_ij)/σ²)    (1)
where σ² is the variance of the fourth transition matrix R, and i and j are subscripts of the fourth transition matrix R.
According to the operation rules of multidimensional matrices and formula (1), the size of the spatial migration feature matrix K is [n+m, n+m].
it should be noted that, in the embodiment of the present application, the kernel function is taken as a gaussian kernel function for example, but is not limited thereto, and the kernel function may also be other types of kernel functions, for example, a linear kernel function, a polynomial kernel function, and the like, and details are not described herein again.
In summary, the method provided by the embodiment of the present application migrates the spatial features of the synthesized domain to the target domain for combination, and constructs a spatial migration feature matrix.
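Under the shape assumptions above (h_3 = 1), the construction of the spatial migration feature matrix K from the two spatial feature maps can be sketched as follows; the variable names are illustrative, and σ² is taken as the variance of R as stated in formula (1).

```python
import numpy as np

def spatial_migration_matrix(first_map, second_map):
    """first_map: [n, 1, w3, channels_1]; second_map: [m, 1, w3, channels_1]."""
    n, m = first_map.shape[0], second_map.shape[0]
    t1 = first_map.reshape(n, -1)          # first transition matrix  [n, 1*w3*channels_1]
    t2 = second_map.reshape(m, -1)         # second transition matrix [m, 1*w3*channels_1]
    M = np.vstack([t1, t2])                # third transition matrix  [n+m, 1*w3*channels_1]
    R = M @ M.T                            # fourth transition matrix [n+m, n+m]
    sigma2 = R.var()                       # variance of R, used in formula (1)
    diag = np.diag(R)
    # k(i, j) = exp(-(R_ii + R_jj - 2*R_ij) / sigma^2)
    K = np.exp(-(diag[:, None] + diag[None, :] - 2.0 * R) / sigma2)
    return K                               # spatial migration feature matrix [n+m, n+m]

K = spatial_migration_matrix(np.random.randn(4, 1, 25, 512),
                             np.random.randn(6, 1, 25, 512))
print(K.shape)   # (10, 10), i.e. [n+m, n+m]
```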
And S740, determining a sequence migration characteristic matrix according to the second neural network, the first image matrix and the second image matrix.
Optionally, dimension conversion is performed on the first spatial feature map and the second spatial feature map obtained in S730, so as to obtain a third spatial feature map and a fourth spatial feature map.
In one embodiment, the height of the down-sampled image is 1. Since the sequence migration feature matrix is determined in S740, the second dimension of the first spatial feature map and the second spatial feature map can be ignored; the size of the third spatial feature map is then [n, w_3, channels_1], and the size of the fourth spatial feature map is [m, w_3, channels_1].
In another embodiment, the height of the down-sampled image is 1 and the second dimension of the first and second spatial feature maps is not ignored; the size of the third spatial feature map is then [n, 1, w_3, channels_1], and the size of the fourth spatial feature map is [m, 1, w_3, channels_1].
After the third spatial feature map and the fourth spatial feature map are obtained, the first sequence feature map and the second sequence feature map can be extracted based on the second neural network.
In one embodiment, the second neural network may be a recurrent neural network, the height of the down-sampled image is 1, and the second dimension is omitted. Through extraction by the recurrent neural network, the size of the first sequence feature map is [n, w_3, char_num] and the size of the second sequence feature map is [m, w_3, char_num], where w_3 may be the maximum length of the text in the image to be recognized, and char_num may be the number of characters of the text in the image to be recognized.
In another embodiment, the features of the first spatial feature map and the second spatial feature map can be directly extracted based on the second neural network, so as to obtain the first sequence feature map and the second sequence feature map.
Dimension conversion is then performed on the obtained first sequence feature map and second sequence feature map, so as to facilitate kernel function processing.
In one embodiment, the first sequence feature map may be converted into a fifth transition matrix of size [n, w_3*char_num], and the second sequence feature map may be converted into a sixth transition matrix of size [m, w_3*char_num]. Similarly, the sizes of the fifth transition matrix and the sixth transition matrix are not limited in the embodiments of the present application.
The sequence migration feature matrix is determined according to the obtained fifth transition matrix and sixth transition matrix: the fifth transition matrix and the sixth transition matrix are stacked to obtain a seventh transition matrix of size [n+m, w_3*char_num], and then, based on a kernel function, the seventh transition matrix may be converted into the sequence migration feature matrix. Define the seventh transition matrix as N and the eighth transition matrix as D = N*N^T. A sequence migration feature matrix T is constructed based on the eighth transition matrix D, where each element in the matrix T satisfies formula (2):
T(i, j) = exp(-(D_ii + D_jj - 2D_ij)/σ²)    (2)
where σ² is the variance of the eighth transition matrix D, and i and j are subscripts of the eighth transition matrix D.
According to the operation rules of multidimensional matrices and formula (2), the size of the sequence migration feature matrix T is [n+m, n+m].
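The sequence branch can be sketched in the same spirit: a recurrent layer turns each spatial feature map into a sequence feature map of width w_3 with char_num outputs per position, after which the kernel construction mirrors formula (2). The GRU layer and the char_num value below are assumptions for illustration, not the specific network of the embodiments.

```python
import numpy as np
import torch
import torch.nn as nn

char_num = 37                                    # assumed character-set size
rnn = nn.GRU(input_size=512, hidden_size=char_num, batch_first=True)

def sequence_features(spatial_map):
    """spatial_map: [batch, 1, w3, channels_1] -> sequence feature map [batch, w3, char_num]."""
    seq = spatial_map.squeeze(1)                 # drop the height dimension: [batch, w3, channels_1]
    out, _ = rnn(seq)                            # recurrent pass over the w3 positions
    return out

def sequence_migration_matrix(first_seq, second_seq):
    """Builds T from the fifth/sixth transition matrices, mirroring formula (2)."""
    t5 = first_seq.reshape(first_seq.shape[0], -1)    # fifth transition matrix  [n, w3*char_num]
    t6 = second_seq.reshape(second_seq.shape[0], -1)  # sixth transition matrix  [m, w3*char_num]
    N = np.vstack([t5, t6])                           # seventh transition matrix
    D = N @ N.T                                       # eighth transition matrix
    sigma2 = D.var()
    diag = np.diag(D)
    return np.exp(-(diag[:, None] + diag[None, :] - 2.0 * D) / sigma2)

with torch.no_grad():
    f1 = sequence_features(torch.randn(4, 1, 25, 512)).numpy()   # from the target-domain maps
    f2 = sequence_features(torch.randn(6, 1, 25, 512)).numpy()   # from the synthesis-domain maps
print(sequence_migration_matrix(f1, f2).shape)                   # (10, 10), i.e. [n+m, n+m]
```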
And S750, adjusting the parameters of the model according to the space migration characteristic matrix and the sequence migration characteristic matrix.
In the embodiments of the present application, the image features of the synthesis domain are migrated to the target domain for learning, so as to improve the training effect on the target domain. In the training and learning process, a loss function may be introduced to measure the difference between the predicted value and the target value; the higher the output value of the loss function, the larger the difference.
In one embodiment, where the first neural network is a convolutional neural network and the second neural network is a recurrent neural network, the loss function may be as shown in equation (2.1), including 4 parts.
L_total = α_1*L_cnn + α_2*L_rnn + α_3*L_tgt + α_4*L_src    (2.1)

where L_total represents the loss function of the image text recognition model in the embodiments of the present application; L_cnn represents the loss function of the spatial migration features; L_rnn represents the loss function of the sequence migration features; L_tgt represents the loss function of the target domain text classification; L_src represents the loss function of the synthesis domain text classification; and α_1, α_2, α_3, α_4 are hyper-parameters that balance the individual loss terms.
Hyper-parameters can be understood as parameters that define the structure of the model, the optimization strategy, or the operating state of the model; the loss function can be tuned through the hyper-parameters, ensuring that the model is neither under-fitted nor over-fitted. Common hyper-parameters include the number of layers of a neural network, the kernel function, and so on, and hyper-parameters may also be used in combination. The hyper-parameters in the embodiments of the present application are not limited; they may be the two hyper-parameters mentioned above or other hyper-parameters.
In the embodiments of the present application, by minimizing the output value of L_total, the target domain features and the synthesis domain features obtained through the image text recognition model are brought close to each other, and the predicted values of the target domain and the synthesis domain can approach the true values.
The loss functions of the portions of the loss function of the image text recognition model will be described below.
In one embodiment, the loss function of the spatial migration features may be constructed according to formula (3). When constructing this loss function, the maximum mean discrepancy (MMD) is used, the spatial migration feature matrix is taken as the input, and the output value represents the difference between the spatial feature distributions of the target domain and the synthesis domain. MMD is a loss function mainly used in transfer learning, in particular in domain adaptation, to measure the distance between two different but related distributions.
L_cnn = (1/n²)·Σ_{i=1..n} Σ_{j=1..n} K(i,j) + (1/m²)·Σ_{i=n+1..n+m} Σ_{j=n+1..n+m} K(i,j) − (2/(n·m))·Σ_{i=1..n} Σ_{j=n+1..n+m} K(i,j)    (3)

where K(i, j) is an element of the spatial migration feature matrix K, the first n indices correspond to the images of the target domain, and the remaining m indices correspond to the images of the synthesis domain.
L_cnn measures the distance between the spatial migration features of the target domain and the synthesis domain in a high-dimensional space; the smaller the output value of this loss function, the more similar the spatial features of the target domain and the synthesis domain.
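Reading formula (3) off the kernel matrix directly, a sketch of the MMD-style loss might look as follows, assuming the first n rows/columns of the kernel matrix come from the target domain and the remaining m from the synthesis domain.

```python
import numpy as np

def mmd_from_kernel(K, n, m):
    """Empirical MMD computed from an (n+m) x (n+m) kernel matrix."""
    k_tt = K[:n, :n]          # target-target block,   averaged with 1/n^2
    k_ss = K[n:, n:]          # synthesis-synthesis block, averaged with 1/m^2
    k_ts = K[:n, n:]          # target-synthesis block, averaged with 1/(n*m)
    return k_tt.mean() + k_ss.mean() - 2.0 * k_ts.mean()

# usage: L_cnn from the spatial migration feature matrix K,
#        L_rnn from the sequence migration feature matrix T (formula (4))
# L_cnn = mmd_from_kernel(K, n, m)
# L_rnn = mmd_from_kernel(T, n, m)
```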
In one embodiment, the loss function of the sequence migration feature may be constructed according to formula (4), and when constructing the loss function of the sequence migration feature, the MMD is adopted, and the sequence migration feature matrix is used as an input value, and an output value is used to represent the difference in the distribution of the sequence features of the target domain and the synthesized domain.
L_rnn = (1/n²)·Σ_{i=1..n} Σ_{j=1..n} T(i,j) + (1/m²)·Σ_{i=n+1..n+m} Σ_{j=n+1..n+m} T(i,j) − (2/(n·m))·Σ_{i=1..n} Σ_{j=n+1..n+m} T(i,j)    (4)

where T(i, j) is an element of the sequence migration feature matrix T.
L_rnn measures the distance between the sequence migration features of the target domain and the synthesis domain in a high-dimensional space; the smaller the output value of this loss function, the more similar the sequence features of the target domain and the synthesis domain.
In one implementation, the loss function of the target domain text classification may be constructed according to formula (5). When constructing this loss function, connectionist temporal classification (CTC) is used, and the real images and the real text sequences in the real images are taken as input.
L_tgt = − Σ_{(I_i, S_i) ∈ Q} log( Σ_{l: B(l)=S_i} p(l | I_i) )    (5)

where Q = {I_i, S_i} denotes a training data set comprising the n real images, I denotes an image (one of the n real images), and S denotes the real text sequence of image I, namely the text sequence in the n real images; l denotes a prediction sequence of the training model for the input image I, and B(·) denotes the CTC mapping from a prediction sequence to a text sequence. The inner sum Σ_{l: B(l)=S_i} p(l | I_i) represents the sum of the probabilities that the text sequence of the input image I is predicted correctly; the larger this sum of probabilities, the better the performance of the training model, or, equivalently, the more accurate the training model, the smaller the classification loss of the target domain text sequences.
In one embodiment, the loss function of the synthesis domain text classification may be constructed according to formula (6). When constructing this loss function, CTC is likewise used, and the synthetic images and the synthetic text sequences in the synthetic images are taken as input.
L_src = − Σ_{(U_i, V_i) ∈ W} log( Σ_{g: B(g)=V_i} p(g | U_i) )    (6)

where W = {U_i, V_i} denotes a training data set comprising the m synthetic images, U denotes an image (one of the m synthetic images), and V denotes the synthetic text sequence of image U, namely the text sequence in the m synthetic images; g denotes a prediction sequence of the training model for the input image U. The inner sum Σ_{g: B(g)=V_i} p(g | U_i) represents the sum of the probabilities that the text sequence of the input image U is predicted correctly; the larger this sum of probabilities, the better the performance of the training model, or, equivalently, the more accurate the training model, the smaller the classification loss of the synthesis domain text sequences.
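For the two text-classification terms, a hedged sketch using PyTorch's built-in CTC loss is shown below; the blank index, sequence lengths, and label encoding are assumptions for the example and are not prescribed by the embodiments.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_domain_loss(log_probs, targets, target_lengths):
    """log_probs: [w3, batch, char_num] log-softmax outputs of the sequence branch."""
    steps, batch, _ = log_probs.shape
    input_lengths = torch.full((batch,), steps, dtype=torch.long)
    return ctc(log_probs, targets, input_lengths, target_lengths)

# toy example: batch of 4 images, w3 = 25 time steps, char_num = 37 classes
log_probs = torch.randn(25, 4, 37).log_softmax(2)
targets = torch.randint(1, 37, (4, 10))            # padded label sequences (0 reserved for blank)
target_lengths = torch.tensor([10, 7, 9, 5])
L_tgt = ctc_domain_loss(log_probs, targets, target_lengths)   # the same form is used for L_src
```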
The output value of the loss function is reduced through continuous training, and a training result is finally output; the training result includes the neural network model obtained through training. The training result may also include the results of processing parts of the training data by the neural network model, as well as highlighted marks of the parts of each piece of training data that most affect the processing result. For example, the pixels in a training image that most affect the processing result may be highlighted.
According to the highlighting mark of the part which has the greatest influence on the processing result in each training data, the reason influencing the precision of the neural network model obtained by training can be judged manually. The reasons may include, for example, poor training data, and/or the need for further optimization of the hyper-parameters under which training is performed, etc.
With the above training method of the image text recognition model, the features of the synthesis domain and the target domain are extracted and learned jointly according to those features; the training efficiency and robustness are high, and the problems of model under-fitting and over-fitting are effectively alleviated.
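Putting the four terms together as in formula (2.1), a joint training step could be sketched as below; the α values are illustrative hyper-parameter settings, not values taken from the embodiments, and mmd_from_kernel and ctc_domain_loss refer to the sketches given earlier.

```python
# illustrative hyper-parameter settings for balancing the four loss terms
alpha1, alpha2, alpha3, alpha4 = 0.5, 0.5, 1.0, 1.0

def total_loss(L_cnn, L_rnn, L_tgt, L_src):
    """L_total = a1*L_cnn + a2*L_rnn + a3*L_tgt + a4*L_src, minimized during joint training."""
    return alpha1 * L_cnn + alpha2 * L_rnn + alpha3 * L_tgt + alpha4 * L_src

# one joint training step (sketch):
#   1. forward the n target-domain and m synthesis-domain images through the CNN and RNN,
#   2. build K and T, then compute L_cnn and L_rnn with mmd_from_kernel,
#   3. compute L_tgt and L_src with ctc_domain_loss,
#   4. back-propagate total_loss(...) and update the model parameters.
```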
Fig. 8 shows a schematic flowchart of an image text recognition method 800 provided in an embodiment of the present application, where the method shown in fig. 8 may be executed by a device with strong computing power, such as a computer device, a server device, or a computing device, and the flowchart includes:
the method 800 includes steps S810 to S820, and the steps S810 to S820 are described in detail below.
In some embodiments, the method 800 may be performed by the execution device 210 in fig. 2, the chip shown in fig. 5, the execution device 610 in fig. 6, and the like.
And S810, acquiring an image to be identified.
Alternatively, an image to be recognized may be acquired from the client device 240 in fig. 2, and the image to be recognized may be an image captured by the client device 240 through a camera; alternatively, the image to be recognized may be acquired from the data storage system 250, for example, an image stored in the data storage system, or may also be acquired from a cloud.
And S820, processing the image to be recognized by using the image text recognition model to obtain a recognition result.
The image text recognition model can be obtained by the method 700, and is not described herein again.
The apparatus of the embodiment of the present application will be described with reference to fig. 9 to 10. It should be understood that the apparatus described below is capable of performing the method of the foregoing embodiments of the present application, and in order to avoid unnecessary repetition, the repeated description is appropriately omitted below when describing the apparatus of the embodiments of the present application.
Fig. 9 is a schematic block diagram of a training apparatus for an image text recognition model according to an embodiment of the present application. The training apparatus 900 shown in fig. 9 includes an acquisition unit 910 and a processing unit 920.
The obtaining unit 910 and the processing unit 920 may be configured to perform a training method of an image text recognition model according to an embodiment of the present application, and in particular, may be configured to perform the method 700.
An obtaining unit 910 is configured to obtain data to be trained. The data to be trained comprise n images from the target domain and m images from the synthesis domain, wherein n is more than or equal to 1, m is more than or equal to 1, and n and m are positive integers.
The processing unit 920 is configured to determine a spatial migration feature matrix of the data to be trained according to the first neural network;
and is further configured to determine a sequence migration feature matrix of the data to be trained according to the second neural network;
and to train the image text recognition model by using the spatial migration feature matrix and the sequence migration feature matrix, so as to obtain a trained image text recognition model.
Optionally, the first neural network is a convolutional neural network, and the processing unit 920 is specifically configured to determine a first image matrix and a second image matrix according to the data to be trained, where the first image matrix is used to indicate the number, height, width, and color channels of n images, and the second image matrix is used to indicate the number, height, width, and color channels of m images;
the convolution neural network performs convolution processing on the first image matrix to obtain a first spatial characteristic diagram, and the convolution neural network performs convolution processing on the second image matrix to obtain a second spatial characteristic diagram;
carrying out matrix change on the first spatial feature map to obtain a first transition matrix, and carrying out matrix change on the second spatial feature map to obtain a second transition matrix;
determining a third transition matrix according to the first transition matrix and the second transition matrix;
determining a fourth transition matrix according to the third transition matrix;
and determining a spatial migration characteristic matrix by using the kernel function and taking the elements of the fourth transition matrix as input.
Optionally, the kernel function is a gaussian kernel function, and the spatial migration feature matrix is determined according to the following formula;
k(i, j) = exp(-(R_ii + R_jj - 2R_ij)/σ²);

where R is the fourth transition matrix, σ² is the variance of the fourth transition matrix, and i and j are subscripts of the fourth transition matrix.
Optionally, the first neural network is a convolutional neural network, the second neural network is a cyclic neural network, and the processing unit 920 is specifically configured to determine a first image matrix and a second image matrix according to the data to be trained, where the first image matrix is used to indicate the number, height, width, and color channel of n images, and the second image matrix is used to indicate the number, height, width, and color channel of m images;
the convolution neural network performs convolution processing on the first image matrix to obtain a first spatial feature map, and the convolution neural network performs convolution processing on the second image matrix to obtain a second spatial feature map;
determining a third spatial feature map according to the first spatial feature map, and determining a fourth spatial feature map according to the second spatial feature map;
the recurrent neural network extracts the characteristics of the third spatial feature map to obtain a first sequence feature map, and the recurrent neural network extracts the characteristics of the fourth spatial feature map to obtain a second sequence feature map;
carrying out matrix change on the first sequence characteristic diagram to obtain a fifth transition matrix, and carrying out matrix change on the second sequence characteristic diagram to obtain a sixth transition matrix;
stacking the fifth transition matrix and the sixth transition matrix to obtain a seventh transition matrix;
determining an eighth transition matrix according to the seventh transition matrix;
and determining a sequence migration characteristic matrix by using the kernel function and taking the element of the eighth transition matrix as an input.
Optionally, the kernel function is a gaussian kernel function, and the sequence migration feature matrix is determined according to the following formula;
T(i, j) = exp(-(D_ii + D_jj - 2D_ij)/σ_1²);

where σ_1² is the variance of the eighth transition matrix, and i and j are subscripts of the eighth transition matrix.
Optionally, the processing unit 920 is specifically configured to construct a loss function according to the spatial migration feature matrix, the sequence migration feature matrix, and the data to be trained;
and adjusting the hyper-parameters of the loss function, determining the minimum output value of the loss function, and obtaining the trained image text recognition model.
Fig. 10 is a schematic block diagram of an image text recognition apparatus provided in an embodiment of the present application. The apparatus 1000 shown in fig. 10 comprises an acquisition unit 1010 and a processing unit 1020.
The obtaining unit 1010 and the processing unit 1020 may be configured to execute the image text recognition method according to the embodiment of the present application, for example, may be configured to execute the method 800.
An acquiring unit 1010 for acquiring an image to be recognized.
And the processing unit 1020 is configured to process the image to be recognized by using the image text recognition model to obtain a recognition result.
The image text recognition model can be obtained by the method 700, and is not described herein again.
The training apparatus 900 and the apparatus 1000 are embodied as functional units. The term "unit" herein may be implemented in software and/or hardware, and is not particularly limited thereto.
For example, a "unit" may be a software program, a hardware circuit, or a combination of both that implement the above-described functions. The hardware circuitry may include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared processor, a dedicated processor, or a group of processors) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.
Accordingly, the units of the respective examples described in the embodiments of the present application can be realized in electronic hardware, or a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 11 is a schematic hardware structure diagram of a training apparatus for an image text recognition model according to an embodiment of the present application. The training apparatus 5000 shown in fig. 11 (the apparatus 5000 may be a computer device) includes a memory 5001, a processor 5002, a communication interface 5003 and a bus 5004. The memory 5001, the processor 5002 and the communication interface 5003 are connected to each other via a bus 5004.
The memory 5001 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 5001 may store a program, and the processor 5002 is configured to perform the steps of the training method of the neural network model of the embodiments of the present application when the program stored in the memory 5001 is executed by the processor 5002. In particular, the processor 5002 may perform the method 700 illustrated in fig. 7 above.
The processor 5002 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the method for training the image text recognition model according to the embodiment of the present invention.
The processor 5002 may also be an integrated circuit chip having signal processing capabilities, such as the chip shown in fig. 5. In implementation, the steps of the training method for image-text recognition model of the present application may be implemented by integrated logic circuits of hardware in the processor 5002 or instructions in the form of software.
The processor 5002 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in ram, flash, rom, prom, or eprom, registers, etc. as is well known in the art. The storage medium is located in the memory 5001, and the processor 5002 reads information in the memory 5001 and completes functions required to be performed by the units included in the training apparatus shown in fig. 9 in combination with hardware thereof.
The communication interface 5003 enables communication between the apparatus 5000 and other devices or communication networks using transceiver means such as, but not limited to, a transceiver.
The bus 5004 may include a pathway to transfer information between the various components of the apparatus 5000 (e.g., the memory 5001, the processor 5002, the communication interface 5003).
Fig. 12 is a schematic hardware configuration diagram of an image text recognition apparatus according to an embodiment of the present application. The apparatus 6000 shown in fig. 12 includes a memory 6001, a processor 6002, a communication interface 6003, and a bus 6004. The memory 6001, the processor 6002, and the communication interface 6003 are connected to each other in a communication manner via a bus 6004.
Memory 6001 can be ROM, static storage device, and RAM. The memory 6001 may store programs, and the processor 6002 and the communication interface 6003 are used to execute the steps of the image processing method according to the embodiment of the present application when the programs stored in the memory 6001 are executed by the processor 6002. In particular, the processor 6002 may perform the method 800 illustrated in fig. 8 above.
The processor 6002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions that the units in the image text recognition apparatus according to the embodiments of the present application need to perform.
Processor 6002 could also be an integrated circuit chip having signal processing capabilities and be, for example, the chip shown in FIG. 5. In implementation, the steps of the image processing method according to the embodiment of the present application may be implemented by integrated logic circuits of hardware in the processor 6002 or instructions in the form of software.
The processor 6002 could also be a general purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in ram, flash, rom, prom, or eprom, registers, etc. as is well known in the art. The storage medium is located in the memory 6001, and the processor 6002 reads information in the memory 6001, and completes functions to be executed by the cells included in the image text recognition apparatus shown in fig. 10 in combination with hardware thereof.
The communication interface 6003 enables communications between the apparatus 6000 and other devices or communication networks using transceiver means such as, but not limited to, a transceiver. For example, the image to be recognized may be acquired through the communication interface 6003.
The bus 6004 may include a pathway for transferring information between various components of the device 6000 (e.g., memory 6001, processor 6002, communication interface 6003).
It should be noted that although the above-described devices 5000 and 6000 only show memories, processors, and communication interfaces, in particular implementation, those skilled in the art will appreciate that the devices 5000 and 6000 may also include other components necessary for normal operation. Also, the apparatus 5000 and the apparatus 6000 may also include hardware components for performing other additional functions, as may be appreciated by those skilled in the art according to particular needs. Furthermore, it should be understood by those skilled in the art that the apparatus 5000 and the apparatus 6000 may also include only the components necessary to implement the embodiments of the present application, and not necessarily all of the components shown in fig. 11 and 12.
It should be understood that the processor in the embodiments of the present application may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory in the embodiments of the subject application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of Random Access Memory (RAM) are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and direct bus RAM (DR RAM).
The above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer instructions or the computer program are loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, data center, etc., that contains one or more collections of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. In addition, the "/" in this document generally indicates that the former and latter associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which may be understood with particular reference to the former and latter text.
In the present application, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A method of training an image text recognition model, comprising:
acquiring data to be trained, wherein the data to be trained comprises n images from a target domain and m images from a synthesis domain, and n and m are positive integers with n ≥ 1 and m ≥ 1;
extracting the features of the n images and the m images through a first neural network to construct a spatial migration feature matrix;
extracting the features of the n images and the m images through a second neural network to construct a sequence migration feature matrix;
and adjusting parameters of the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained.
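Read as a procedure rather than a claim, claim 1 describes one joint training step over a batch mixing target-domain and synthesis-domain images. The sketch below is a hypothetical illustration, not the patented implementation: it assumes PyTorch, and the names model.cnn, model.rnn, model.head, build_spatial_migration_matrix, build_sequence_migration_matrix and compute_total_loss are placeholders introduced only for this example.

```python
import torch

def training_step(model, optimizer, target_imgs, synth_imgs, target_labels, synth_labels):
    """One parameter update following the claimed procedure (illustrative sketch).

    target_imgs: n images from the target domain, shape (n, C, H, W)
    synth_imgs:  m images from the synthesis domain, shape (m, C, H, W)
    """
    # First neural network (convolutional backbone): spatial features of both domains
    spat_tgt = model.cnn(target_imgs)
    spat_src = model.cnn(synth_imgs)

    # Second neural network (recurrent): sequence features on top of the spatial maps
    seq_tgt = model.rnn(spat_tgt)
    seq_src = model.rnn(spat_src)

    # Migration feature matrices built jointly from the two domains (claims 2 and 4)
    k_spatial = build_spatial_migration_matrix(spat_tgt, spat_src)
    t_sequence = build_sequence_migration_matrix(seq_tgt, seq_src)

    # Prediction data for the data to be trained
    pred_tgt = model.head(seq_tgt)
    pred_src = model.head(seq_src)

    # Adjust the model parameters from the migration matrices and the predictions
    loss = compute_total_loss(k_spatial, t_sequence, pred_tgt, target_labels, pred_src, synth_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```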
2. The method according to claim 1, wherein the extracting features of the n images and the m images through a first neural network to construct a spatial migration feature matrix comprises:
performing convolution processing on the n images through the first neural network to obtain a first spatial feature map, and performing convolution processing on the m images through the first neural network to obtain a second spatial feature map;
performing a matrix transformation on the first spatial feature map to obtain a first transition matrix, and performing a matrix transformation on the second spatial feature map to obtain a second transition matrix, wherein the dimensionality of the first transition matrix is smaller than that of the first spatial feature map, and the dimensionality of the second transition matrix is smaller than that of the second spatial feature map;
stacking the first transition matrix and the second transition matrix to determine a third transition matrix;
constructing a fourth transition matrix according to the third transition matrix and the transposed matrix of the third transition matrix;
and determining the spatial migration feature matrix by taking the elements of the fourth transition matrix as the input of a kernel function.
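A minimal sketch of the construction in claim 2, assuming the unspecified matrix transformation is a flattening of each spatial feature map into a row vector (an assumption; the claim only requires the transition matrices to have lower dimensionality); the fourth transition matrix is then the product of the stacked matrix with its transpose.

```python
import torch

def build_fourth_transition_matrix(spat_tgt, spat_src):
    """Claim 2 sketch: n target-domain and m synthesis-domain spatial feature maps
    in, one (n + m) x (n + m) fourth transition matrix out."""
    n, m = spat_tgt.size(0), spat_src.size(0)

    # First and second transition matrices: lower-dimensional views of the feature maps
    z_tgt = spat_tgt.reshape(n, -1)
    z_src = spat_src.reshape(m, -1)

    # Third transition matrix: both domains stacked along the batch dimension
    z = torch.cat([z_tgt, z_src], dim=0)

    # Fourth transition matrix from the stacked matrix and its transpose
    return z @ z.t()
```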
3. The method according to claim 2, wherein determining the spatial migration feature matrix by using the elements of the fourth transition matrix as input of a kernel function comprises:
the kernel function is a Gaussian kernel function, and the spatial migration feature matrix is determined according to the following formula:
k(i, j) = exp(-(R_ii + R_jj - 2R_ij) / σ²);
wherein R is the fourth transition matrix and σ² is the variance of the fourth transition matrix.
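With the fourth transition matrix R in hand, the Gaussian-kernel step of claim 3 is elementwise; the sketch below takes σ² as the variance of R's elements, matching the claim, and is otherwise a direct transcription of the formula under the same placeholder assumptions as above.

```python
import torch

def spatial_migration_matrix(r):
    """Claim 3 sketch: k(i, j) = exp(-(R_ii + R_jj - 2*R_ij) / sigma^2)."""
    sigma2 = r.var()            # variance of the fourth transition matrix
    diag = torch.diag(r)        # the R_ii terms
    return torch.exp(-(diag[:, None] + diag[None, :] - 2.0 * r) / sigma2)
```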
4. The method according to claim 2, wherein the second neural network is a recurrent neural network, and the extracting features of the n images and the m images through the second neural network to construct a sequence migration feature matrix comprises:
extracting sequence features of the first spatial feature map through the recurrent neural network to obtain a first sequence feature map, and extracting sequence features of the second spatial feature map through the recurrent neural network to obtain a second sequence feature map;
performing a matrix transformation on the first sequence feature map to obtain a fifth transition matrix, and performing a matrix transformation on the second sequence feature map to obtain a sixth transition matrix, wherein the dimension of the fifth transition matrix is smaller than that of the first sequence feature map, and the dimension of the sixth transition matrix is smaller than that of the second sequence feature map;
stacking the fifth transition matrix and the sixth transition matrix to determine a seventh transition matrix;
constructing an eighth transition matrix according to the seventh transition matrix and the transposed matrix of the seventh transition matrix;
and determining the sequence migration feature matrix by taking the elements of the eighth transition matrix as the input of a kernel function.
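For the sequence branch of claim 4, a hypothetical sketch: it assumes the recurrent network reads each spatial feature map column by column (a common arrangement for text images, but not stated in the claim), and reuses the same stack-and-transpose construction as the spatial branch; the class name and layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class SequenceBranch(nn.Module):
    """Claim 4 sketch: sequence features over the spatial maps, then the
    fifth-to-eighth transition matrices built as in the spatial branch."""

    def __init__(self, channels, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)

    def sequence_features(self, spat):
        b, c, h, w = spat.shape
        seq = spat.mean(dim=2).permute(0, 2, 1)   # (b, w, c): one step per image column
        out, _ = self.rnn(seq)                    # (b, w, 2 * hidden)
        return out

    def eighth_transition_matrix(self, seq_tgt, seq_src):
        # Fifth and sixth transition matrices: flattened sequence feature maps
        z_tgt = seq_tgt.reshape(seq_tgt.size(0), -1)
        z_src = seq_src.reshape(seq_src.size(0), -1)
        # Seventh transition matrix: both domains stacked
        z = torch.cat([z_tgt, z_src], dim=0)
        # Eighth transition matrix from the stacked matrix and its transpose
        return z @ z.t()
```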
5. The method according to claim 4, wherein determining the sequence migration feature matrix using the elements of the eighth transition matrix as inputs to a kernel function comprises:
the kernel function is a Gaussian kernel function, and the sequence migration feature matrix is determined according to the following formula:
T(i, j) = exp(-(D_ii + D_jj - 2D_ij) / σ₁²);
wherein D is the eighth transition matrix and σ₁² is the variance of the eighth transition matrix.
6. The method according to any one of claims 1 to 5, wherein the adjusting parameters of the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, and the prediction data of the data to be trained comprises:
constructing a loss function according to the spatial migration feature matrix, the sequence migration feature matrix and the prediction data of the data to be trained;
adjusting a hyperparameter of the loss function.
7. The method according to claim 6, wherein the constructing a loss function according to the spatial migration feature matrix, the sequence migration feature matrix, and the predicted data of the data to be trained comprises:
constructing the loss function according to the following formula:
L_total = α₁L_cnn + α₂L_rnn + α₃L_tgt + α₄L_src;
wherein L_total is the loss function;
L_cnn is the loss function of the spatial migration feature;
L_rnn is the loss function of the sequence migration feature;
L_tgt is the loss function for classifying the target-domain text;
L_src is the loss function for classifying the synthesis-domain text;
and α₁, α₂, α₃, α₄ are hyperparameters that balance the individual loss terms.
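The weighted sum in claim 7 translates directly into code; only the four hyperparameters need choosing. The function name and default weights below are placeholders, and how each individual term is computed is left to the other claims. In practice L_tgt and L_src would typically be CTC or cross-entropy losses over the two domains' predictions, but that choice is not fixed by the claim.

```python
def combine_losses(l_cnn, l_rnn, l_tgt, l_src, alphas=(1.0, 1.0, 1.0, 1.0)):
    """Claim 7 sketch: L_total = a1*L_cnn + a2*L_rnn + a3*L_tgt + a4*L_src."""
    a1, a2, a3, a4 = alphas   # hyperparameters balancing the individual loss terms
    return a1 * l_cnn + a2 * l_rnn + a3 * l_tgt + a4 * l_src
```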
8. The method of claim 7, wherein the loss function for the spatial migration feature is constructed according to the following equation:
[formula for the spatial migration loss, given as image FDA0003151305090000021 in the original claim]
wherein k(i, j) is an element of the spatial migration feature matrix;
constructing a loss function for the sequence migration feature according to the following formula:
[formula for the sequence migration loss, given as image FDA0003151305090000022 in the original claim]
wherein t(i, j) is an element of the sequence migration feature matrix.
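The two migration-loss formulas in claim 8 are published only as images, so their exact form is not recoverable from this text. Purely as an illustration of how a loss can be built from the elements k(i, j) of a joint kernel matrix, the sketch below uses an MMD-style combination of the within-domain and cross-domain blocks; this is an assumption, not the claimed formula.

```python
import torch

def migration_loss(k, n, m):
    """Hypothetical MMD-style aggregation over an (n + m) x (n + m) kernel matrix k,
    with the first n rows/columns from the target domain and the last m from the
    synthesis domain. The claim's own formula is given only as an image."""
    k_tt = k[:n, :n].mean()    # target-target block
    k_ss = k[n:, n:].mean()    # synthesis-synthesis block
    k_ts = k[:n, n:].mean()    # cross-domain block
    return k_tt + k_ss - 2.0 * k_ts
```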
9. A method of image text recognition, the method comprising:
acquiring an image to be identified;
processing the image to be recognized by using an image text recognition model to obtain a recognition result of the image to be recognized,
wherein the image text recognition model is obtained by a method of training an image text recognition model, the method comprising: extracting features of n images and m images through a first neural network to construct a spatial migration feature matrix, extracting features of the n images and the m images through a second neural network to construct a sequence migration feature matrix, and adjusting the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, prediction data of the n images and prediction data of the m images;
the n images are images in a target domain, and the m images are images in a synthesis domain.
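At inference time (claim 9 and the claims that depend on it), the trained model is applied to a single image to be recognized. The sketch below assumes the same hypothetical model layout as the training sketch and a greedy CTC-style decode with blank index 0; the decoding scheme is an assumption, since the claim only states that a recognition result is obtained.

```python
import torch

def recognize(model, image, charset):
    """Claim 9 sketch: run the trained image text recognition model on one image."""
    model.eval()
    with torch.no_grad():
        spat = model.cnn(image.unsqueeze(0))   # spatial features
        seq = model.rnn(spat)                  # sequence features
        logits = model.head(seq)               # per-step character scores
    ids = logits.argmax(dim=-1).squeeze(0).tolist()
    # Greedy CTC-style decode: drop repeated ids and the blank id 0 (an assumption)
    chars, prev = [], None
    for i in ids:
        if i != prev and i != 0:
            chars.append(charset[i - 1])
        prev = i
    return "".join(chars)
```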
10. The method according to claim 9, wherein the extracting features of the n images and the m images through a first neural network to construct a spatial migration feature matrix comprises:
performing convolution processing on the n images through the first neural network to obtain a first spatial feature map, and performing convolution processing on the m images through the first neural network to obtain a second spatial feature map;
performing a matrix transformation on the first spatial feature map to obtain a first transition matrix, and performing a matrix transformation on the second spatial feature map to obtain a second transition matrix, wherein the dimensionality of the first transition matrix is smaller than that of the first spatial feature map, and the dimensionality of the second transition matrix is smaller than that of the second spatial feature map;
stacking the first transition matrix and the second transition matrix to determine a third transition matrix;
constructing a fourth transition matrix according to the third transition matrix and a transposed matrix of the third transition matrix;
and determining the spatial migration feature matrix by taking the elements of the fourth transition matrix as the input of a kernel function.
11. The method according to claim 10, wherein determining the spatial migration feature matrix using the elements of the fourth transition matrix as inputs to a kernel function comprises:
the kernel function is a Gaussian kernel function, and the spatial migration feature matrix is determined according to the following formula:
k(i, j) = exp(-(R_ii + R_jj - 2R_ij) / σ²);
wherein R is the fourth transition matrix and σ² is the variance of the fourth transition matrix.
12. The method of claim 10, wherein the second neural network is a recurrent neural network, and the extracting features of the n images and the m images through the second neural network to construct a sequence migration feature matrix comprises:
extracting sequence features of the first spatial feature map through the recurrent neural network to obtain a first sequence feature map, and extracting sequence features of the second spatial feature map through the recurrent neural network to obtain a second sequence feature map;
performing a matrix transformation on the first sequence feature map to obtain a fifth transition matrix, and performing a matrix transformation on the second sequence feature map to obtain a sixth transition matrix, wherein the dimension of the fifth transition matrix is smaller than that of the first sequence feature map, and the dimension of the sixth transition matrix is smaller than that of the second sequence feature map;
stacking the fifth transition matrix and the sixth transition matrix to determine a seventh transition matrix;
constructing an eighth transition matrix according to the seventh transition matrix and a transposed matrix of the seventh transition matrix;
and determining the sequence migration feature matrix by taking the elements of the eighth transition matrix as the input of a kernel function.
13. The method according to claim 12, wherein determining the sequence migration feature matrix by using the elements of the eighth transition matrix as inputs to a kernel function comprises:
the kernel function is a Gaussian kernel function, and the sequence migration feature matrix is determined according to the following formula:
T(i, j) = exp(-(D_ii + D_jj - 2D_ij) / σ₁²);
wherein D is the eighth transition matrix and σ₁² is the variance of the eighth transition matrix.
14. The method according to any one of claims 9 to 13, wherein the adjusting parameters of the image text recognition model according to the spatial migration feature matrix, the sequence migration feature matrix, the prediction data of the n images and the prediction data of the m images comprises:
constructing a loss function according to the spatial migration feature matrix, the sequence migration feature matrix, the prediction data of the n images and the prediction data of the m images;
adjusting a hyperparameter of the loss function.
15. The method of claim 14, wherein constructing a loss function from the spatial migration feature matrix, the sequence migration feature matrix, the prediction data for the n images, and the prediction data for the m images comprises:
constructing the loss function according to the following equation:
L_total = α₁L_cnn + α₂L_rnn + α₃L_tgt + α₄L_src;
wherein L_total is the loss function;
L_cnn is the loss function of the spatial migration feature;
L_rnn is the loss function of the sequence migration feature;
L_tgt is the loss function for classifying the target-domain text;
L_src is the loss function for classifying the synthesis-domain text;
and α₁, α₂, α₃, α₄ are hyperparameters that balance the individual loss terms.
16. The method of claim 15, wherein the loss function for the spatial migration feature is constructed according to the following formula:
[formula for the spatial migration loss, given as image FDA0003151305090000041 in the original claim]
wherein k(i, j) is an element of the spatial migration feature matrix;
constructing a loss function of the sequence migration features according to the following formula:
[formula for the sequence migration loss, given as image FDA0003151305090000042 in the original claim]
wherein t(i, j) is an element of the sequence migration feature matrix.
17. A training apparatus for training an image text recognition model, comprising a processor and a memory, the memory being adapted to store program instructions, the processor being adapted to invoke the program instructions to perform the method of any of claims 1 to 8.
18. An image text recognition apparatus comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of any of claims 9 to 16.
19. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising instructions for performing the method of any of claims 1 to 8 or 9 to 16.
20. A chip comprising a processor and a data interface, the processor reading instructions stored on a memory through the data interface to perform the method of any one of claims 1 to 8 or 9 to 16.
CN202110767483.8A 2021-07-07 2021-07-07 Image text recognition method, and method and device for training image text recognition model Pending CN115661845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110767483.8A CN115661845A (en) 2021-07-07 2021-07-07 Image text recognition method, and method and device for training image text recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110767483.8A CN115661845A (en) 2021-07-07 2021-07-07 Image text recognition method, and method and device for training image text recognition model

Publications (1)

Publication Number Publication Date
CN115661845A true CN115661845A (en) 2023-01-31

Family

ID=85015027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110767483.8A Pending CN115661845A (en) 2021-07-07 2021-07-07 Image text recognition method, and method and device for training image text recognition model

Country Status (1)

Country Link
CN (1) CN115661845A (en)

Similar Documents

Publication Publication Date Title
CN110175671B (en) Neural network construction method, image processing method and device
CN110222717B (en) Image processing method and device
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
CN111507378A (en) Method and apparatus for training image processing model
CN112418392A (en) Neural network construction method and device
WO2022001805A1 (en) Neural network distillation method and device
WO2022052601A1 (en) Neural network model training method, and image processing method and device
CN113011575A (en) Neural network model updating method, image processing method and device
WO2021218517A1 (en) Method for acquiring neural network model, and image processing method and apparatus
CN113705769A (en) Neural network training method and device
CN112215332B (en) Searching method, image processing method and device for neural network structure
CN111914997B (en) Method for training neural network, image processing method and device
CN110222718B (en) Image processing method and device
CN112529146B (en) Neural network model training method and device
CN111695673B (en) Method for training neural network predictor, image processing method and device
CN113326930A (en) Data processing method, neural network training method, related device and equipment
CN111368972A (en) Convolution layer quantization method and device thereof
CN111797882A (en) Image classification method and device
CN112257759A (en) Image processing method and device
CN113592060A (en) Neural network optimization method and device
CN111340190A (en) Method and device for constructing network structure, and image generation method and device
CN113011562A (en) Model training method and device
CN111797881A (en) Image classification method and device
CN113537462A (en) Data processing method, neural network quantization method and related device
CN115081588A (en) Neural network parameter quantification method and device

Legal Events

Date Code Title Description
PB01 Publication