WO2022236959A1 - Image data processing method, apparatus, device and storage medium - Google Patents

Image data processing method, apparatus, device and storage medium

Info

Publication number
WO2022236959A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
recognition model
text recognition
encoding
image
Prior art date
Application number
PCT/CN2021/107653
Other languages
English (en)
French (fr)
Inventor
王斌
薛莫白
姜德强
Original Assignee
腾讯云计算(北京)有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯云计算(北京)有限责任公司
Priority to EP21941528.8A (published as EP4339831A1)
Publication of WO2022236959A1
Priority to US18/306,208 (published as US20230260304A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19127 - Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/19007 - Matching; Proximity measures
    • G06V30/19093 - Proximity measures, i.e. similarity or distance measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 - Classification techniques

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular, to an image data processing method, apparatus, device, and storage medium.
  • OCR (Optical Character Recognition) refers to the recognition of optical characters through image processing and pattern recognition technology to output the text information in an image.
  • an OCR model can be used to recognize an image containing text information, and extract the text information in the image.
  • the labeling of images is a very labor-intensive and time-consuming task, which will cause the cost of labeling image data to be too high.
  • In addition, when labeling images, it is usually images from a specific scene that are labeled. When an OCR model trained on such labeled images is applied to other scenes, the text recognition accuracy tends to be too low.
  • Embodiments of the present application provide an image data processing method, apparatus, device, and storage medium, which can reduce data labeling costs and improve the recognition effect of a text recognition model.
  • An embodiment of the present application provides, in one aspect, an image data processing method, including:
  • inputting image data containing text information into a text recognition model, and obtaining image representation information corresponding to the image data according to a feature extraction component in the text recognition model;
  • encoding the image representation information according to an image encoding component in the text recognition model to obtain semantic encoding information corresponding to the image representation information, the semantic encoding information being associated with the text information in the image data;
  • obtaining discrete encoding information corresponding to the image representation information according to a code table contained in a discrete encoding component of the text recognition model, the code table including learnable encoding vectors for characterizing text features, and the discrete encoding information being used as a fitting target for unsupervised learning;
  • correcting network parameters of the text recognition model according to the encoding similarity between the semantic encoding information and the discrete encoding information, and determining the parameter-corrected feature extraction component and the parameter-corrected image encoding component as a target text recognition model; the target text recognition model is used to recognize text information in image data to be processed.
  • An embodiment of the present application provides, in another aspect, an image data processing device, including:
  • the feature extraction module is used to input image data containing text information into the text recognition model, and obtain image representation information corresponding to the image data according to the feature extraction component in the text recognition model;
  • the semantic encoding module is used to encode the image representation information according to the image encoding component in the text recognition model to obtain the semantic encoding information corresponding to the image representation information; the semantic encoding information is associated with the text information in the image data;
  • the discrete encoding module is used to obtain the discrete encoding information corresponding to the image representation information according to the code table included in the discrete encoding component of the text recognition model; the code table includes learnable encoding vectors for characterizing text features, and the discrete encoding information is used for as a fitting target for unsupervised learning;
  • the parameter correction module is used to correct the network parameters of the text recognition model according to the encoding similarity between the semantic encoding information and the discrete encoding information, and to determine the parameter-corrected feature extraction component and the parameter-corrected image encoding component as the target text recognition model; the target text recognition model is used to recognize text information in the image data to be processed.
  • An embodiment of the present application provides, in another aspect, a computer device, including a memory and a processor, where the memory is connected to the processor, the memory is used to store a computer program, and the processor is used to call the computer program, so that the computer device executes the method described in the embodiments of the present application.
  • An embodiment of the present application provides, in another aspect, a computer-readable storage medium in which a computer program is stored, where the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method described in the embodiments of the present application.
  • An embodiment of the present application provides, in another aspect, a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium.
  • The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the method provided in the above aspect.
  • the text recognition model may include a feature extraction component, an image encoding component, and a discrete encoding component; the image representation information of the image data can be obtained through the feature extraction component, and the semantic encoding information can be obtained by passing the image representation information through the image encoding component.
  • the discrete encoding information can be obtained by passing the image representation information through the discrete encoding component, and then the network parameters of the text recognition model can be corrected through the encoding similarity between the semantic encoding information and the discrete encoding information.
  • That is, the discrete encoding information can serve as the fitting target of the text recognition model during training, so the labeling information of the image data is not needed in the above training process, which reduces the cost of data labeling. Because unlabeled image data has diversity characteristics such as a large data volume and wide coverage, directly training with unlabeled image data can improve the generalization ability of the target text recognition model, thereby improving the recognition effect of the target text recognition model.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • FIG. 2 is a training scene diagram of a text recognition model provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of an image data processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the processing of a discrete encoding component provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an unsupervised training method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a supervised training method provided by an embodiment of the present application.
  • FIG. 7 is a text recognition scene diagram provided by an embodiment of the present application.
  • FIG. 8 is a text recognition scene diagram provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an image data processing device provided in an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • This application involves artificial intelligence (AI) technology, blockchain (Block Chain) technology and cloud technology.
  • This application specifically relates to the computer vision technology (Computer Vision, CV) under the artificial intelligence technology.
  • the image data and text recognition results involved in this application can be stored on the block chain to ensure that the image data and text recognition results cannot be tampered with.
  • This application also relates to the artificial intelligence cloud service under cloud technology, and the artificial intelligence cloud service may also be called AIaaS (AI as a Service).
  • This application can use the AI framework and AI infrastructure provided by the platform to deploy the OCR service. After the OCR model is trained, the trained OCR model can be applied to the OCR service in the cloud artificial intelligence service.
  • OCR technology converts the text of various bills, newspapers, books, manuscripts and other printed materials into image information through optical input methods such as scanning, and then uses text recognition technology to convert the image information into text that can be used by a computer.
  • In this way, important data such as amounts, account numbers, and other text can be directly extracted from an image to generate the text needed in daily life, thereby replacing manual entry of text data.
  • Unsupervised training (also known as unsupervised learning or self-supervised learning): unsupervised training is used to process sample sets without labeled categories. In unsupervised training, the sample data is not labeled and there is no predetermined result; since the categories of the sample data are unknown, the sample set needs to be classified according to the similarity between the sample data, trying to minimize the gap within the same category and maximize the gap between different categories.
  • Supervised training can use a set of sample data of known categories to adjust the parameters of the network model to achieve the required performance.
  • In supervised training, the training data set is required to include inputs (features) and outputs (targets), and the targets in the training data set can be manually labeled. An optimal model is obtained by training on the existing training data set (known sample data and their corresponding outputs); using this optimal model, all inputs can be mapped to corresponding outputs, and simple judgments can be made on the outputs to achieve the purpose of classification.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • the network architecture may include a server 10d and a user terminal cluster, and the user terminal cluster may include one or more user terminals, and the number of user terminals is not limited here.
  • the user terminal cluster may specifically include a user terminal 10a, a user terminal 10b, and a user terminal 10c.
  • The server 10d can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • The user terminal 10a, the user terminal 10b, the user terminal 10c, etc. may include: smart phones, tablet computers, notebook computers, palmtop computers, mobile Internet devices (Mobile Internet Device, MID), wearable devices (such as smart watches, smart bracelets, etc.), smart TVs, and other smart terminals with a text recognition function.
  • the user terminal 10a, user terminal 10b, and user terminal 10c can respectively be connected to the server 10d through a network, so that each user terminal can perform data interaction with the server 10d through the network connection.
  • The user terminal 10a can collect image data in different scenes through electronic equipment (such as scanners, camera equipment, etc.). The above image data can include business promotion pictures taken by a camera (for example, advertisement pictures on a wall, banner slogan pictures, poster pictures, etc.), text and data pictures scanned by a scanner, and so on.
  • the user terminal 10a can obtain an initialized text recognition model (also referred to as an initial text recognition model or an initial OCR model), and train the text recognition model through the collected image data;
  • The text recognition model can include a feature extraction component, an image encoding component, and a discrete encoding component. For each image data input to the text recognition model, feature extraction (image downsampling) can be performed on the image data through the feature extraction component to output the image representation information corresponding to the image data; the image encoding component can output the semantic encoding information corresponding to the image representation information, and the discrete encoding component can output the discrete encoding information corresponding to the image representation information. The encoding similarity between the semantic encoding information and the discrete encoding information, together with the code table index confidence, can then be used to correct the network parameters of the text recognition model to obtain a trained text recognition model (also referred to as a target text recognition model), which can be used to recognize text information in image data to be processed.
  • the training of the text recognition model does not need to use the labeling information of the image data, which can reduce the cost of labeling the image data; due to the diversity of the collected image data, the text recognition model obtained through training has a better recognition effect.
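  • As an orientation aid, the overall training flow just described can be sketched as follows. This is a minimal sketch assuming PyTorch; the component modules, the loss function and the optimizer are placeholders (illustrative assumptions), and their concrete forms are described later in this description rather than fixed by this sketch.

```python
# Minimal sketch of the unsupervised training flow described above (PyTorch assumed;
# the three components are placeholders whose concrete architectures come later).
import torch
import torch.nn as nn

class TextRecognitionModel(nn.Module):
    def __init__(self, feature_extractor, image_encoder, discrete_encoder):
        super().__init__()
        self.feature_extractor = feature_extractor   # e.g. a ResNet/DenseNet backbone
        self.image_encoder = image_encoder           # e.g. a Transformer encoder with a mask module
        self.discrete_encoder = discrete_encoder     # code-table component (sketched further below)

    def forward(self, images):
        z = self.feature_extractor(images)           # image representation information Z
        c = self.image_encoder(z)                    # semantic encoding information C
        q, code_probs = self.discrete_encoder(z)     # discrete encoding information Q and index confidences
        return c, q, code_probs

def unsupervised_step(model, images, loss_fn, optimizer):
    # images: a batch of unlabeled images containing text information
    c, q, code_probs = model(images)
    loss = loss_fn(c, q, code_probs)                 # contrastive + diversity loss (sketched later)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```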
  • FIG. 2 is a training scene diagram of a text recognition model provided by the embodiment of the present application.
  • Taking the user terminal 10a in the user terminal cluster shown in FIG. 1 as an example, the following describes the way in which the text recognition model is trained.
  • the user terminal 10a can use image data that does not carry label information to train the initialized text recognition model, and the text recognition model can include a feature extraction component 20a, an image encoding component 20d, and a discrete encoding component 20f.
  • a target text recognition model including a feature extraction component with updated parameters and an image encoding component with updated parameters can be obtained.
  • the main purpose is to train the network parameters of the feature extraction component 20a and the network parameters of the image encoding component 20d.
  • For a sample image 20b input into the text recognition model (the sample image 20b can contain text information), the sample image 20b is first input to the feature extraction component 20a in the text recognition model, and image preprocessing can be performed on the sample image 20b.
  • The image preprocessing may include, but is not limited to: image grayscale conversion, image normalization, image size adjustment, and image denoising. If the sample image 20b is a color image, it can be converted to grayscale; converting the sample image 20b into a grayscale image can reduce the computational complexity of the text recognition model during training. Of course, to further reduce the computational complexity of training, image normalization can also be performed on the grayscaled sample image 20b. When the feature extraction component 20a imposes size requirements on the input image, the size of the sample image 20b can be adjusted to the size specified by the feature extraction component 20a. The sample image 20b can also be denoised to optimize the sample image 20b, and so on.
  • the user terminal 10a may preprocess the sample image 20b by using one or more image preprocessing methods described above to obtain a preprocessed image 20c.
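  • A minimal sketch of such preprocessing is shown below, assuming OpenCV; the target height of 32 pixels, the Gaussian-blur denoising and the [0, 1] normalization are illustrative choices rather than values specified by this embodiment.

```python
# Sketch of the preprocessing steps mentioned above: grayscale conversion, denoising,
# resizing to the size expected by the feature extractor, and normalization.
import cv2
import numpy as np

def preprocess(image_bgr, target_height=32):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)      # image grayscale conversion
    gray = cv2.GaussianBlur(gray, (3, 3), 0)                # simple image denoising
    h, w = gray.shape
    new_w = max(1, int(round(w * target_height / h)))
    resized = cv2.resize(gray, (new_w, target_height))      # image size adjustment
    normalized = resized.astype(np.float32) / 255.0         # image normalization to [0, 1]
    return normalized
```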
  • When the image 20c is downsampled by the feature extraction component 20a, the number of image representation features can be calculated from the ratio between the adjusted image width and the image downsampling ratio, rounded up. It can be understood that by downsampling the image 20c through the feature extraction component 20a, the image representation information of the region where the text information is located can be extracted from the image 20c; that is, the purpose of the feature extraction component 20a is to detect the area of the image 20c where the text information is located and to extract the image representation information of that area.
  • FIG. 3 is a schematic flowchart of an image data processing method provided in an embodiment of the present application. It can be understood that the image data processing method can be executed by a computer device, and the computer device can be a user terminal, a server, a system composed of a user terminal and a server, or a computer program application (including program code), which is not specifically limited here. As shown in FIG. 3, the image data processing method may include the following steps:
  • Step S101: Input image data containing text information into the text recognition model, and obtain image representation information corresponding to the image data according to the feature extraction component in the text recognition model.
  • a text recognition model (also referred to as an OCR model) may be used to detect and recognize text information contained in image data, so as to output the text content contained in the image data.
  • the text recognition model can be trained in an unsupervised training manner to obtain a trained text recognition model.
  • the initialized text recognition model may be referred to as an initial text recognition model
  • the trained text recognition model may be referred to as a target text recognition model.
  • the text recognition model may include a feature extraction component, an image encoding component, and a discrete encoding component.
  • After collecting the sample data set for training the text recognition model, the computer device (for example, the user terminal 10a in the embodiment corresponding to FIG. 1 above) can obtain the text recognition model and input the image data contained in the sample data set into the text recognition model in batches; the above-mentioned sample data set is used for unsupervised training of the text recognition model.
  • the image data contained in the sample data set are all images containing text information.
  • For example, the sample data set can include image data such as business promotion pictures (advertisement pictures), document scan pictures, and screenshots.
  • All the image data included in the above sample data set can be unlabeled data (that is, image data without label information). In the unsupervised training process, the image data in the sample data set can be processed in batches; after the image data contained in a batch is input into the text recognition model, it is first input into the feature extraction component of the text recognition model. Through the feature extraction component, the image data can be downsampled, the area where the text information in the image data is located can be detected, and the image representation information of that area can be extracted.
  • The above-mentioned feature extraction component may include, but is not limited to: VGGNet (a convolutional neural network model, which may include 16 to 19 network layers; the size of the convolution kernel used in the convolutional layers may be 3x3, and the size of the filter used in the pooling layers may be 2x2), GoogLeNet (a convolutional neural network model that can include 22 network layers and introduces an inception structure, which can be used to improve computational efficiency), ResNet (a convolutional neural network model that can include up to 152 network layers by introducing a residual structure), and DenseNet (a convolutional neural network model in which each network layer takes as input the outputs of all previous network layers).
  • When the feature extraction component is a ResNet network, assume that the feature extraction component includes L network layers, where L is a positive integer, such as 1, 2, 3, and so on.
  • The input of the l-th network layer can be expressed as x_{l-1}, where x_{l-1} can also be expressed as the output of the (l-1)-th network layer, and H_l can be expressed as the nonlinear transformation function (non-linear transformation) of the l-th network layer.
  • H_l can be understood as a combined operation, such as a series of operations including BN (Batch Normalization), an activation function, pooling, and convolution.
  • In the ResNet network, the output of the l-th network layer can be the sum of the output of the (l-1)-th network layer and the nonlinear transformation of that output, that is, x_l = H_l(x_{l-1}) + x_{l-1}, and the output x_L of the L-th (last) network layer in the ResNet network is used as the image representation information output by the feature extraction component. It should be noted that one or more convolutional layers may be included between the l-th network layer and the (l-1)-th network layer.
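  • A minimal sketch of such a residual connection is shown below, assuming PyTorch; the channel count and the exact composition of H_l are illustrative.

```python
# Sketch of the residual connection described above: x_l = H_l(x_{l-1}) + x_{l-1}.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # H_l: a combined operation of BN, activation and convolution
        self.h = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x_prev):
        return self.h(x_prev) + x_prev      # x_l = H_l(x_{l-1}) + x_{l-1}
```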
  • When the feature extraction component is a DenseNet network, assume that the feature extraction component includes L network layers.
  • The computer device can obtain the output results of the image data at the first (L-1) network layers and combine the output results corresponding to the first (L-1) network layers into a joint output result; then, according to the weight matrix corresponding to the L-th network layer in the feature extraction component, the target output result corresponding to the joint output result is obtained, and the target output result is determined as the image representation information corresponding to the image data.
  • In other words, the output of the l-th network layer can be the nonlinear transformation of the concatenation of the outputs of the first (l-1) network layers, that is, x_l = H_l([x_0, x_1, ..., x_{l-1}]), and the output x_L of the L-th network layer in the DenseNet network is used as the image representation information output by the feature extraction component.
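  • A minimal sketch of this dense connectivity is shown below, assuming PyTorch; the number of layers and channel counts are illustrative, and only the last layer's output is returned as the image representation information.

```python
# Sketch of the dense connectivity described above: the l-th layer receives the
# concatenation of the outputs of all previous layers, and the last layer's output
# is used as the image representation information.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels=16, growth=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # H_l: BN -> ReLU -> convolution (the combined nonlinear transformation)
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth, kernel_size=3, padding=1),
            ))
            channels += growth                      # the next layer sees all previous outputs

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            joint = torch.cat(features, dim=1)      # joint output of the preceding layers
            features.append(layer(joint))
        return features[-1]                         # x_L: image representation information
```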
  • Step S102: According to the image encoding component in the text recognition model, encode the image representation information to obtain the semantic encoding information corresponding to the image representation information; the semantic encoding information is associated with the text information in the image data.
  • Specifically, the computer device can input the image representation information into the image encoding component of the text recognition model, and the semantic encoding information corresponding to the image representation information is obtained through the image encoding component.
  • the image coding component can be used to learn semantic information between text information contained in image data, and can pay more attention to semantically related words and weaken irrelevant words.
  • The above-mentioned image encoding component may include, but is not limited to: a Transformer encoder (an encoder model structure) with a mask module, an LSTM (Long Short-Term Memory network), and an RNN (Recurrent Neural Network).
  • the Transformer encoder may include a self-attention layer and a coding layer
  • In the image encoding component of the text recognition model, the computer device can multiply the image representation information by the weight matrices corresponding to the self-attention layer in the image encoding component to obtain the attention output vectors corresponding to the image representation information; then, text position encoding can be performed on the attention output vectors according to the encoding layer in the image encoding component to obtain the semantic encoding information corresponding to the image representation information.
  • For each image representation feature in the image representation information Z = {z_1, z_2, z_3, ..., z_T}, a corresponding query vector (Query), key vector (Key) and value vector (Value) can be generated.
  • the query vectors corresponding to T image representation features can form a query matrix QU
  • the key vectors corresponding to T image representation features can form a key matrix KE
  • the value vectors corresponding to the T pieces of image representation information can form the value matrix VA.
  • The query matrix QU can be obtained by multiplying the image representation information Z by the query weight matrix W_QU,
  • the key matrix KE can be obtained by multiplying the image representation information Z by the key weight matrix W_KE, and
  • the value matrix VA can be obtained by multiplying the image representation information Z by the value weight matrix W_VA.
  • The output result of the self-attention layer (that is, the above-mentioned attention output vectors; the attention output vectors corresponding to the T image representation features can form the attention output matrix) can be expressed as: softmax(QU · KEᵀ / √d_ke) · VA, where d_ke can be expressed as the dimension of the key vectors, softmax can be expressed as a classifier, and · can be expressed as a matrix dot multiplication operation.
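  • A minimal sketch of this self-attention computation is shown below, assuming PyTorch; the hidden dimension and d_ke are illustrative, and the text position encoding performed by the encoding layer is omitted.

```python
# Sketch of the self-attention computation: QU, KE and VA are obtained by multiplying
# the image representation Z with learnable weight matrices, and the attention output
# is softmax(QU · KE^T / sqrt(d_ke)) · VA.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim=512, d_ke=64):
        super().__init__()
        self.w_qu = nn.Linear(dim, d_ke, bias=False)   # query weight matrix W_QU
        self.w_ke = nn.Linear(dim, d_ke, bias=False)   # key weight matrix W_KE
        self.w_va = nn.Linear(dim, d_ke, bias=False)   # value weight matrix W_VA
        self.d_ke = d_ke

    def forward(self, z):                              # z: (batch, T, dim)
        qu, ke, va = self.w_qu(z), self.w_ke(z), self.w_va(z)
        scores = torch.matmul(qu, ke.transpose(-2, -1)) / math.sqrt(self.d_ke)
        attn = torch.softmax(scores, dim=-1)
        return torch.matmul(attn, va)                  # attention output vectors
```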
  • Step S103: According to the code table contained in the discrete encoding component of the text recognition model, obtain the discrete encoding information corresponding to the image representation information.
  • Specifically, according to the code table contained in the discrete encoding component of the text recognition model, the code table index confidence corresponding to the image representation information is obtained, and according to the code table index confidence, the discrete encoding information corresponding to the image representation information is obtained from the code table;
  • the code table includes learnable encoding vectors for characterizing text features.
  • the code table index confidence refers to the reliability of representing the image representation information by using the learnable coding vector.
  • In other words, a code table may be included in the discrete encoding component of the text recognition model, and the code table may include learnable encoding vectors used to characterize text features.
  • the computer device can obtain the confidence degree of the code table index between each image representation feature in the image representation information and the code table, and according to the order of the code table index confidence from large to small , the discrete coding information corresponding to the image representation information can be obtained from the code table.
  • the code table in the discrete coding component can include V learnable coding vectors
  • the image representation information can include T image representation features, where V is a positive integer, such as V can be 1, 2, 3... ;
  • The computer device can obtain the code table contained in the discrete encoding component of the text recognition model and an image representation feature z_i in the image representation information, where i is a positive integer less than or equal to T and the image representation feature z_i can be any one of the image representation features contained in the image representation information. The code table index confidences between the image representation feature z_i and the V learnable encoding vectors can then be obtained, where the code table index confidence corresponding to the i-th learnable encoding vector among the V learnable encoding vectors refers to the reliability of using the i-th learnable encoding vector to represent the image representation feature z_i.
  • In this way, the code table index confidences corresponding to the V learnable encoding vectors can be obtained; optionally, the sum of the V code table index confidences is 1, and each code table index confidence lies within [0, 1].
  • The learnable encoding vector corresponding to the largest code table index confidence is determined as the target encoding vector, and the discrete encoding feature q_i corresponding to the image representation feature z_i can then be determined according to the target encoding vector; the discrete encoding features corresponding to the T image representation features together form the discrete encoding information.
  • the method for obtaining the confidence degree of the code table index may include but not limited to: a sampling method based on softmax, a sampling method based on gumbel (Gumbel)-max, and a sampling method based on gumbel-softmax.
  • Taking the sampling method based on gumbel-softmax as an example, the computer device can obtain the distribution random number corresponding to each feature value in the image representation feature z_i (the distribution random number here can be a Gumbel distribution random number), and then add the feature values in the image representation feature z_i to the distribution random numbers to obtain the candidate representation feature corresponding to the image representation feature z_i; according to the feature values in the candidate representation feature, the code table index confidences between the candidate representation feature and the V learnable encoding vectors are obtained.
  • The number of code tables is G, and each code table corresponds to one target encoding vector, where G is a positive integer (for example, G can take values of 1, 2, 3, and so on), and the values of G and V are related to the number of characters in the text recognition scene. The target encoding vectors in the G code tables are spliced to obtain the joint feature corresponding to the image representation feature z_i; the joint feature can then be input to a fully connected network layer, and the discrete encoding feature q_i corresponding to the image representation feature z_i is output according to the weight matrix in the fully connected network layer.
  • When the value of G is 1, there is only one code table, and that code table corresponds to one target encoding vector, so there is no step of splicing multiple target encoding vectors to obtain a joint feature.
  • In this case, the target encoding vector corresponding to this code table can be directly used as the joint feature and input to the fully connected network layer, which outputs the discrete encoding feature q_i corresponding to the image representation feature z_i.
  • Each code table can include V learnable encoding vectors e, and each learnable encoding vector e is d-dimensional (e ∈ R^d), so the size of the code tables can be expressed as G × V × d.
  • In other words, a learnable tensor codebook ∈ R^{G×V×d} can be created as the code tables, which is continuously updated during the training process of the text recognition model. Any image representation feature z_i ∈ R^{G·V} in the image representation information can be expanded into a matrix S of size G × V.
  • When the image representation feature z_i is expanded into the matrix S, no numerical transformation is required, that is, z_i → S, S ∈ R^{G×V}; in other words, the image representation feature z_i expressed in vector form is simply rearranged into matrix form.
  • The code table index confidence between the image representation feature z_i and the V learnable encoding vectors in the code table can be obtained through the sampling method based on gumbel-softmax, and the code table index confidence can be expressed as:
  • p_{g,v} = exp((s_{g,v} + n_v) / τ) / Σ_{k=1..V} exp((s_{g,k} + n_k) / τ)    (1)
  • In the above formula (1), p_{g,v} can be expressed as the code table index confidence corresponding to the feature value in the g-th row and v-th column of the matrix S, and the dimension of p_{g,v} can be G;
  • s_{g,v} can be expressed as the feature value in the g-th row and v-th column of the matrix S, and
  • n_v can be expressed as the Gumbel distribution random number corresponding to the feature value s_{g,v} (that is, the above-mentioned distribution random number).
  • The Gumbel distribution random number can be expressed as n_v = -log(-log(U(0,1))), where U(0,1) represents the uniform distribution, and (s_{g,v} + n_v) can be called a candidate representation feature.
  • τ can be a non-negative constant involved in the gumbel-softmax sampling method, and the constant τ can be used to control the smoothness of gumbel-softmax sampling.
  • At the beginning of training, the constant τ can be set to a larger value (for example, the constant τ is set to 2).
  • As training proceeds, the constant τ can be gradually reduced: in each round of iteration (epoch, the number of times the sample data set is fully trained), τ can be multiplied by a coefficient less than 1 (for example, the coefficient can be set to 0.9995), which makes the Gumbel distribution gradually approach the real discrete distribution while ensuring stable convergence of the network.
  • Subsequently, the code table index Idx corresponding to the maximum code table index confidence can be taken, where the dimension of the code table index Idx is the same as that of the code table index confidence p_{g,v}, and the code table index Idx can be expressed as: Idx_g = argmax_v p_{g,v}    (2)
  • a learnable code vector can be selected from each of the G code tables.
  • The discrete encoding feature can then be obtained by formula (3): q_i = W · (joint feature) + b, where W in formula (3) can be expressed as the weight matrix in the fully connected network layer, b can be expressed as the bias in the fully connected network layer, and the size of W is (G · d) × (G · V).
  • Here G · d is the dimension of the joint feature; for example, when G · d is 8, the joint feature can be understood as an 8-dimensional vector.
  • FIG. 4 is a schematic diagram of processing of a discrete encoding component provided by an embodiment of the present application.
  • As shown in FIG. 4, the image representation information Z can be input to the discrete encoding component, and the discrete encoding component contains G learnable code tables, each containing V learnable encoding vectors. Each image representation feature contained in the image representation information Z can be expanded into a matrix of size G × V, and V code table index confidences can then be calculated for each code table by the above formula (1). The largest code table index confidence is selected among the V code table index confidences corresponding to each code table, and the code table index corresponding to the maximum code table index confidence is determined (the above formula (2)); according to this code table index, a learnable encoding vector can be selected from each of the G code tables as the target encoding vector, the G target encoding vectors are spliced into the joint feature, and the joint feature is input to the fully connected network layer to output the discrete encoding feature.
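  • Putting the pieces of this discrete encoding component together, a minimal sketch is given below, assuming PyTorch. The values of G, V, d, the output dimension and τ are illustrative, and the sketch only illustrates formulas (1) to (3) above; the τ annealing described earlier is indicated only by a comment.

```python
# Sketch of the discrete encoding component: each z_i (dimension G*V) is reshaped
# into a G x V matrix S, Gumbel noise is added, index confidences p_{g,v} are
# computed with temperature tau, the highest-confidence vector is picked from each
# code table, and the spliced joint feature is projected by a fully connected layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteEncoder(nn.Module):
    def __init__(self, G=2, V=320, d=128, out_dim=512, tau=2.0):
        super().__init__()
        self.G, self.V, self.d, self.tau = G, V, d, tau      # tau can be annealed, e.g. *0.9995 per epoch
        self.codebook = nn.Parameter(torch.randn(G, V, d))   # G learnable code tables of V vectors (dim d)
        self.proj = nn.Linear(G * d, out_dim)                 # fully connected layer (weight W, bias b)

    def forward(self, z):                                     # z: (batch, T, G*V) image representation features
        b, T, _ = z.shape
        s = z.view(b, T, self.G, self.V)                      # expand each z_i into a G x V matrix S
        u = torch.rand_like(s).clamp(1e-9, 1 - 1e-9)
        gumbel = -torch.log(-torch.log(u))                    # Gumbel distribution random numbers n_v
        probs = F.softmax((s + gumbel) / self.tau, dim=-1)    # code table index confidences, formula (1)
        idx = probs.argmax(dim=-1)                            # code table index Idx, formula (2)
        one_hot = F.one_hot(idx, self.V).type_as(z)           # one target encoding vector per code table
        selected = torch.einsum('btgv,gvd->btgd', one_hot, self.codebook)
        joint = selected.reshape(b, T, self.G * self.d)       # joint feature: G target vectors spliced
        q = self.proj(joint)                                  # discrete encoding features q_i, formula (3)
        return q, probs
```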
  • Step S104: Correct the network parameters of the text recognition model according to the encoding similarity between the semantic encoding information and the discrete encoding information, and determine the parameter-corrected feature extraction component and the parameter-corrected image encoding component as the target text recognition model; the target text recognition model is used to recognize the text information in the image data to be processed.
  • the network optimization objective of the text recognition model can be determined according to the encoding similarity between the semantic encoding information and the discrete encoding information, and the network optimization objective can also be called a model loss function.
  • Specifically, the computer device can take a semantic encoding feature c_i from the semantic encoding information (the semantic encoding feature c_i can be any semantic encoding feature in the semantic encoding information), determine the discrete encoding feature q_i in the discrete encoding information as the positive sample of the semantic encoding feature c_i, and determine a discrete encoding feature q_j in the discrete encoding information as a negative sample of the semantic encoding feature c_i, where i and j are both positive integers less than or equal to T and i and j are not equal. Then, according to the first similarity between the semantic encoding feature c_i and the positive sample, the second similarity between the semantic encoding feature c_i and the negative sample, and the code table index confidence, the model loss function corresponding to the text recognition model can be determined (the first similarity and the second similarity here can be called the encoding similarity). According to the model loss function, the network parameters of the text recognition model are corrected, and the parameter-corrected feature extraction component and the parameter-corrected image encoding component are determined as the target text recognition model.
  • the model loss function of the text recognition model is determined according to the coding similarity between the semantic coding information and the discrete coding information, and the code table index confidence.
  • The above model loss function may include two parts, namely a contrastive loss function and a diversity loss function. The contrastive loss function enables the semantic encoding information C to find the correct discrete encoding information Q in a set of representation information containing positive samples and negative samples (the above-mentioned code table); by calculating the angle information between the semantic encoding information C and the discrete encoding information Q, the network parameters of the text recognition model are optimized so that the angle between the semantic encoding feature c_i and the positive sample becomes smaller and the angle with the negative sample becomes larger. The diversity loss function can improve the utilization rate of the code table in the discrete encoding component, and improves the diversity of the generated code table indices by optimizing the information entropy of the code table indices.
  • Specifically, the computer device can obtain the first similarity between the semantic encoding feature c_i and the positive sample, obtain the second similarity between the semantic encoding feature c_i and the negative samples, and determine the contrastive loss function according to the first similarity and the second similarity. The contrastive loss function can be expressed as:
  • L_m = -log( exp(sim(c_i, q_i) / K) / Σ_j exp(sim(c_i, q_j) / K) )    (4)
  • In the above formula (4), L_m can be expressed as the contrastive loss,
  • sim(c_i, q_i) can be expressed as the first similarity between the semantic encoding feature c_i and the positive sample q_i, and
  • sim(c_i, q_j) can be expressed as the second similarity between the semantic encoding feature c_i and the negative sample q_j.
  • the first similarity and the second similarity can both be cosine similarity
  • K can be expressed as a constant
  • Formula (5) represents the cosine similarity between two vectors a and b: sim(a, b) = (a · b) / (||a||_2 · ||b||_2), where ||a||_2 can be expressed as the 2-norm of vector a, that is, the square root of the sum of the squares of the absolute values of the elements of vector a.
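  • A minimal sketch of formula (4) is shown below, assuming PyTorch, treating all other positions of the same image as negatives, and using the constant K as a temperature-like scaling; these batching choices and the value of K are illustrative assumptions rather than values fixed by this embodiment.

```python
# Sketch of the contrastive loss: for each position i, q_i is the positive sample,
# the q_j (j != i) are negatives, and sim(.,.) is cosine similarity scaled by K.
import torch
import torch.nn.functional as F

def contrastive_loss(c, q, K=0.1):
    # c, q: (T, dim) -- semantic and discrete encoding features of one image
    c_norm = F.normalize(c, dim=-1)
    q_norm = F.normalize(q, dim=-1)
    sim = torch.matmul(c_norm, q_norm.t()) / K           # sim(c_i, q_j) / K for all i, j
    targets = torch.arange(c.size(0), device=c.device)
    # cross-entropy over each row selects the diagonal (positive) entry, as in formula (4)
    return F.cross_entropy(sim, targets)
```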
  • Optionally, the computer device can obtain the logarithm of the code table index confidence and determine the diversity loss function according to the product between that logarithm and the code table index confidence. The diversity loss function can be expressed as:
  • L_d = -(1 / (G · V)) Σ_{g=1..G} H(p̄_g) = (1 / (G · V)) Σ_{g=1..G} Σ_{v=1..V} p̄_{g,v} · log(p̄_{g,v})
  • where L_d can be expressed as the diversity loss function,
  • H(·) can be expressed as information entropy, and
  • p̄_{g,v} can be expressed as the code table index confidence calculated in the training process.
  • α is a hyperparameter of the loss function, and the model loss function L is the contrastive loss function L_m plus the product of the hyperparameter α and the diversity loss function L_d, that is, L = L_m + α · L_d.
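  • A minimal sketch of the diversity loss and of combining it with the contrastive loss is shown below, assuming PyTorch; averaging the confidences over the batch and the value of α are illustrative assumptions.

```python
# Sketch of the diversity loss (negative entropy of the code table index confidences)
# and of the combined model loss L = L_m + alpha * L_d described above.
import torch

def diversity_loss(code_probs, eps=1e-7):
    # code_probs: (batch, T, G, V) -- code table index confidences from the discrete encoder
    avg = code_probs.mean(dim=(0, 1))                     # average usage per code table entry
    return (avg * torch.log(avg + eps)).sum() / avg.numel()

def total_loss(L_m, code_probs, alpha=0.1):
    return L_m + alpha * diversity_loss(code_probs)       # model loss function L
```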
  • The training termination condition may be that the number of training iterations of the text recognition model reaches the set maximum number of training iterations, or that the training of the text recognition model reaches the convergence condition.
  • the feature extraction component and the image coding component meeting the training termination condition are determined as the target text recognition model.
  • It should be noted that the calculation of the above contrastive loss function is an operation between features of the same image data and does not involve operations between different image data, while the discrete encoding component can learn multiple code tables.
  • Similar to the role of label information in a supervised training process, the learned code tables can guide the training direction of the text recognition model.
  • the features obtained from different image data can move closer to the code table, and the different code tables will move away from each other, so that similar features between different image data are close to each other, and dissimilar features are far away from each other.
  • the discrete coding component avoids the surge in calculations caused by operations between different image data (which can be called cross-image operations), reduces memory requirements, and reduces the time cost of training the network.
  • the above-mentioned entire training process of the text recognition model can be called an unsupervised training process, and the feature extraction component and image coding component obtained when the training termination condition is met can be called the target text recognition model after training , the target text recognition model at this time can be applied in the text recognition scene to recognize the text information in the image data to be processed.
  • FIG. 5 is a schematic diagram of an unsupervised training method provided by an embodiment of the present application.
  • the text recognition model may include a feature extraction component, an image encoding component, and a discrete encoding component
  • the text recognition model can also include a classification network layer (also known as a feed-forward network);
  • After the network parameters are corrected through unsupervised training, the parameter-corrected feature extraction component and the parameter-corrected image encoding component are determined as a candidate text recognition model; that is, the feature extraction component and the image encoding component obtained from the aforementioned unsupervised training are called the candidate text recognition model.
  • The computer device can obtain labeled image data containing text information and input the labeled image data to the candidate text recognition model, where the labeled image data carries label information. The parameter-corrected feature extraction component and the parameter-corrected image encoding component in the candidate text recognition model output the annotation semantic information corresponding to the labeled image data; then, the annotation semantic information can be predicted according to the classification network layer to obtain the predicted text recognition result associated with the text information in the labeled image data. According to the error between the label information and the predicted text recognition result, the network parameters of the candidate text recognition model and the classification network layer are corrected, and the parameter-corrected candidate text recognition model and the parameter-corrected classification network layer are determined as the target text recognition model.
  • the label information of labeled image data can be used as the expected output result of the candidate text recognition model, and the predicted text recognition result output by the candidate text recognition model can be understood as the actual output result.
  • The error between the expected output result and the actual output result is back-propagated in the candidate text recognition model to update the network parameters of the candidate text recognition model and the network parameters of the classification network layer, finally obtaining the trained target text recognition model.
  • the target text recognition model refers to the network model obtained after unsupervised training and supervised fine-tuning.
  • the classification network layer can include but not limited to: softmax (a kind of multi-classifier), artificial neural network (Artificial Neural Networks, ANNs), support vector machine (Support Vector Machines, SVM).
  • FIG. 6 is a schematic diagram of a supervised training method provided by an embodiment of the present application.
  • As shown in FIG. 6, a part of the labeled data can also be used for supervised fine-tuning of the unsupervised-trained model (also called supervised training); that is, supervised training is performed after unsupervised training.
  • The forward calculation process of the labeled image in the feature extraction component and the image encoding component is the same as the processing of unlabeled images in the aforementioned unsupervised training process, and will not be repeated here.
  • the annotation semantic information can be input to the feedforward network (which can be understood as a classification network layer), and the predicted text recognition result corresponding to the annotation image can be output through the feedforward network , where the input of the feedforward network is the annotation semantic information output by the image encoding component, and the output of the feedforward network is a vector whose dimension is equal to the number of categories of text characters.
  • For example, if the candidate text recognition model is suitable for recognizing 300 kinds of text character categories, the output of the feedforward network can be a vector with a dimension of 300.
  • The output vector of the feedforward network can be used as the predicted text recognition result of the labeled image in the candidate text recognition model; the loss between the label information of the labeled image and the predicted text recognition result can then be calculated, and the network parameters of the candidate text recognition model can be optimized to obtain the final trained target text recognition model.
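  • A minimal sketch of this supervised fine-tuning step is shown below, assuming PyTorch and a simple per-position cross-entropy between the predicted character categories and the label information; the alignment of labels to positions and the hidden size are illustrative assumptions, not a decoding scheme specified by this embodiment.

```python
# Sketch of supervised fine-tuning: the feedforward (classification) layer maps the
# annotation semantic information to per-position character-category predictions,
# and the error against the label information is back-propagated.
import torch
import torch.nn as nn

def finetune_step(candidate_model, feedforward, images, labels, optimizer):
    # candidate_model: parameter-corrected feature extraction + image encoding components
    # feedforward: e.g. nn.Linear(hidden_dim, num_character_classes), e.g. 300 classes
    semantics = candidate_model(images)                       # annotation semantic information (batch, T, hidden)
    logits = feedforward(semantics)                           # (batch, T, num_classes)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```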
  • The target text recognition model can be applied to any text recognition scenario, such as a delivery address recognition scenario when sending a courier (a picture containing address information is recognized by using the target text recognition model to obtain the address content in the picture, and the recognized address information is automatically filled into the input area where the receiving address is located, which can simplify the input of the receiving address and improve delivery speed), a business promotion recognition scenario (an advertisement picture is recognized by using the target text recognition model to obtain the advertisement text content in the advertisement picture), a document data entry scenario (when the text data in a written document needs to be entered into an electronic system, the written document can be scanned or photographed, the scanned or photographed pictures are then recognized by using the target text recognition model to obtain the document content, and the recognized document content is automatically entered into the electronic system for storage, which can save human resources and further improve the entry efficiency of document content), an account entry scenario (when a bank card account number or ID number needs to be entered, a photo of the bank card or ID card can be taken and recognized by using the target text recognition model to obtain and automatically enter the account number), and so on.
  • the computer device can obtain a business promotion picture containing text information, and determine the business promotion picture (for example, an advertisement picture) containing text information as an image to be processed
  • the image data to be processed is input to the target text recognition model
  • the feature extraction component corrected by the parameters in the target text recognition model outputs the promotion representation information corresponding to the image data to be processed
  • The parameter-corrected image encoding component in the target text recognition model outputs the promotional text semantic information corresponding to the promotion representation information; according to the classification network layer in the target text recognition model, the promotional text semantic information is predicted, and the promotional text content corresponding to the promotional text semantic information is obtained. That is, text recognition is performed on the business promotion image to output the promotional text contained in it.
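  • A minimal sketch of this inference flow is shown below, assuming PyTorch, a greedy per-position argmax decoding and a charset list mapping class indices back to characters; these decoding details are illustrative assumptions.

```python
# Sketch of inference with the target text recognition model: the picture passes
# through the parameter-corrected components and the classification layer, and the
# per-position predictions are mapped back to characters.
import torch

@torch.no_grad()
def recognize(target_model, feedforward, image_tensor, charset):
    semantics = target_model(image_tensor.unsqueeze(0))    # promotion representation -> semantic information
    logits = feedforward(semantics)                         # (1, T, num_classes)
    indices = logits.argmax(dim=-1).squeeze(0).tolist()     # predicted character category per position
    return "".join(charset[i] for i in indices)             # promotional text content
```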
  • FIG. 7 is a text recognition scene diagram provided by an embodiment of the present application.
  • the user terminal 30a shown in FIG. 7 may be the above-mentioned computer device, and the user terminal 30a may be a terminal device used by user A, and a search application is installed in the user terminal 30a.
  • the current display interface as shown in Figure 7 is the main page of the search application.
  • a search box can be displayed on the main page, and a camera entry 30b can be included in the search box.
  • the user terminal 30a can respond to the trigger operation for the camera entrance 30b, start the camera in the user terminal 30a, and take the user terminal 30a close to the physical advertisement sheet 30c to take pictures.
  • the user terminal 30a can use the aforementioned pre-trained target recognition model to perform text recognition on the photo 30d, and output the text content 30e contained in the photo 30d.
  • the text content 30e includes: "2020 Ocean Day”, “Limited Edition Essence Cream", "Brand A”.
  • Search results associated with the above text content 30e can then be retrieved in the search application and displayed on the search page 30f of the search application; the search results can be sorted and displayed on the search page 30f according to their degree of relevance to the above-mentioned text content 30e.
  • The search results can include a result display column 30g. When user A is interested in the content of a certain result display column (for example, the result display column 30g), user A can click the result display column to view the content details.
  • FIG. 8 is a text recognition scene diagram provided by an embodiment of the present application.
  • The user terminal 40a shown in FIG. 8 may be the above-mentioned computer device, and the user terminal 40a may be a terminal device used by user A; a courier mailing application (or a courier mailing applet) is integrated in the user terminal 40a.
  • When user A wants to send a courier to user B, user A can open the courier mailing application (or courier mailing applet) to enter the delivery information page 40b, and user A is asked to fill in information such as the name of the sender, the contact information of the sender, the name of the recipient, the contact information of the recipient, the delivery address of the recipient, and the zip code on the delivery information page 40b.
  • If user A is not familiar with the delivery address of user B, user A would need to write down the address of user B on paper or elsewhere in advance and then manually enter the delivery address on the delivery information page 40b, or repeatedly switch the display page of the user terminal 40a in order to input the address.
  • Instead, a trigger operation can be performed on the picture recognition control 40c, and the user terminal 40a can respond to the trigger operation on the picture recognition control 40c and open the gallery application of the user terminal 40a.
  • After the picture 40d containing the delivery address of user B is selected in the gallery application and a trigger operation is performed on the confirmation control, the user terminal 40a can respond to the trigger operation on the confirmation control, use the pre-trained target text recognition model to perform text recognition on the picture 40d, output the text content contained in the picture 40d, match the recognized text content with the keywords in the delivery information page 40b, and automatically fill the matched text content into the corresponding input boxes. For example, "Little B" is automatically filled in the recipient column, "130xxxxxx14" is automatically filled in the recipient contact information column, and "xx county, xx city, xx province..." is automatically filled in the delivery address column. User A can confirm and submit the information after checking that it is correct, which can improve the user's mailing efficiency.
  • the text recognition model may include a feature extraction component, an image encoding component, and a discrete encoding component; the image representation information of the image data can be obtained through the feature extraction component, the semantic encoding information can be obtained by passing the image representation information through the image encoding component, and the discrete encoding information can be obtained by passing the image representation information through the discrete encoding component; the network parameters of the text recognition model can then be corrected through the encoding similarity between the semantic encoding information and the discrete encoding information, that is, the discrete encoding information can be used as the fitting target of the text recognition model in the training process.
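  • To make the data flow between these components concrete, the following is a minimal PyTorch-style sketch of the forward pass; the module names (feature_extractor, image_encoder, discrete_encoder) and tensor shapes are illustrative assumptions and are not taken from the embodiments.

```python
# Minimal sketch of the data flow in the text recognition model during
# unsupervised training; module names and shapes are assumptions.
import torch
import torch.nn as nn

class TextRecognitionModel(nn.Module):
    def __init__(self, feature_extractor, image_encoder, discrete_encoder):
        super().__init__()
        self.feature_extractor = feature_extractor   # outputs image representation Z
        self.image_encoder = image_encoder           # outputs semantic encoding C
        self.discrete_encoder = discrete_encoder     # outputs discrete encoding Q

    def forward(self, images):
        z = self.feature_extractor(images)   # (batch, T, dim) image representation info
        c = self.image_encoder(z)            # (batch, T, dim) semantic encoding info
        q = self.discrete_encoder(z)         # (batch, T, dim) discrete encoding info
        return c, q                          # Q serves as the fitting target for C
```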
  • FIG. 9 is a schematic structural diagram of an image data processing device provided in an embodiment of the present application.
  • the image data processing device can be a computer program (including program code) applied in computer equipment; for example, the image data processing device can be OCR character recognition application software, and the image data processing device can be used to execute the corresponding steps in the method provided by the embodiments of the present application.
  • the image data processing device 1 may include: a feature extraction module 11, a semantic coding module 12, a discrete coding module 13, and a parameter correction module 14;
  • the feature extraction module 11 is used to input image data containing text information to the text recognition model, and obtain image representation information corresponding to the image data according to the feature extraction component in the text recognition model;
  • the semantic encoding module 12 is used to encode the image representation information according to the image encoding component in the text recognition model to obtain the semantic encoding information corresponding to the image representation information; the semantic encoding information is associated with the text information in the image data;
  • the discrete encoding module 13 is used to obtain the discrete encoding information corresponding to the image representation information according to the code table included in the discrete encoding component of the text recognition model; the code table includes a learnable encoding vector for characterizing text features, and the discrete encoding information is used as a fitting target for unsupervised learning;
  • the parameter modification module 14 is used to modify the network parameters of the text recognition model according to the coding similarity between the semantic coding information and the discrete coding information, and determine the feature extraction component after parameter correction and the image coding component after parameter correction as the target text recognition model; the target text recognition model is used to recognize text information in the image data to be processed.
  • the specific function implementation of the feature extraction module 11, the semantic coding module 12, the discrete coding module 13, and the parameter correction module 14 can refer to steps S101-S104 in the above-mentioned embodiment corresponding to FIG. 3, and will not be repeated here.
  • the discrete coding module 13 is used to obtain the code table index confidence corresponding to the image representation information according to the code table included in the discrete coding component of the text recognition model; the code table index confidence refers to the reliability with which the learnable encoding vector represents the image representation information; according to the code table index confidence, the discrete coding information corresponding to the image representation information is obtained from the code table.
  • the image representation information includes T image representation features, the code table includes V learnable encoding vectors, and both T and V are positive integers;
  • the discrete encoding module 13 may include: a code table acquisition unit 131, a confidence degree acquisition unit 132, an encoding vector selection unit 133, and a discrete feature determination unit 134;
  • the code table acquisition unit 131 is used to obtain the code table included in the discrete encoding component of the text recognition model and the image representation feature z_i in the image representation information; i is a positive integer less than or equal to T;
  • the confidence degree acquisition unit 132 is used to obtain the code table index confidences between the image representation feature z_i and the V learnable encoding vectors respectively;
  • the encoding vector selection unit 133 is configured to determine, among the V learnable encoding vectors, the learnable encoding vector corresponding to the maximum code table index confidence as the target encoding vector;
  • the discrete feature determination unit 134 is configured to determine the discrete encoding feature q_i corresponding to the image representation feature z_i according to the target encoding vector, and compose the discrete encoding features corresponding to the T image representation features into the discrete encoding information.
  • the specific function implementation of the code table acquisition unit 131, the confidence degree acquisition unit 132, the encoding vector selection unit 133, and the discrete feature determination unit 134 can refer to step S103 in the above-mentioned embodiment corresponding to FIG. 3 , and will not be repeated here.
  • the confidence degree acquisition unit 132 may include: a random number acquisition subunit 131 and an index confidence acquisition subunit 132;
  • the random number acquisition subunit 131 is used to obtain the distribution random numbers corresponding to the feature values in the image representation feature z_i, and add the feature values in the image representation feature z_i to the distribution random numbers to obtain the candidate characterization feature corresponding to the image representation feature z_i;
  • the index confidence acquisition subunit 132 is configured to acquire the code table index confidences between the candidate characterization feature and the V learnable encoding vectors according to the exponential values corresponding to the feature values in the candidate characterization feature.
  • the specific function implementation manners of the random number obtaining subunit 131 and the index confidence degree obtaining subunit 132 can refer to step S103 in the above embodiment corresponding to FIG. 3 , which will not be repeated here.
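  • As an illustration of how the code table index confidences might be computed, the sketch below follows the gumbel-softmax style sampling described above; the reshape into a G-by-V matrix, the temperature value and the variable names are assumptions for the example.

```python
# Sketch: code table index confidences for one image representation feature z_i.
# The feature values are perturbed with Gumbel-distributed random numbers,
# normalized into confidences, and the entry with the maximum confidence
# per code table is selected.
import torch

def code_table_index_confidence(z_i, G, V, tau=2.0):
    s = z_i.view(G, V)                                # expand z_i into a G x V matrix S
    u = torch.rand_like(s).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))                # distribution random numbers
    candidate = s + gumbel                            # candidate characterization feature
    p = torch.softmax(candidate / tau, dim=-1)        # confidences, one row per code table
    idx = p.argmax(dim=-1)                            # index of the maximum confidence
    return p, idx

# Example: G = 2 code tables with V = 256 entries each.
p, idx = code_table_index_confidence(torch.randn(2 * 256), G=2, V=256)
```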
  • the number of code tables is G, each code table corresponds to a target encoding vector, and G is a positive integer;
  • the discrete feature determination unit 134 may include: a splicing subunit 1341 and a network output subunit 1342;
  • the splicing subunit 1341 is used to splice the target encoding vectors in the G code tables to obtain the joint feature corresponding to the image representation feature z_i;
  • the network output subunit 1342 is used to input the joint feature to the fully connected network layer, and output the discrete encoding feature q_i corresponding to the image representation feature z_i according to the weight matrix in the fully connected network layer.
  • the specific function implementation manners of the splicing subunit 1341 and the network output subunit 1342 can refer to step S103 in the above embodiment corresponding to FIG. 3 , which will not be repeated here.
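  • The sketch below illustrates how the selected target encoding vectors could be assembled into the discrete encoding feature q_i; the code table sizes and the shape of the fully connected layer follow the description above, while the variable names are assumptions.

```python
# Sketch: one target encoding vector is looked up per code table, the G vectors
# are spliced into a joint feature, and a fully connected layer maps the joint
# feature to the discrete encoding feature q_i.
import torch
import torch.nn as nn

G, V, d = 2, 256, 128
codebooks = nn.Parameter(torch.randn(G, V, d))   # G learnable code tables
fc = nn.Linear(G * d, G * V)                     # weight matrix W with bias b

def discrete_encoding_feature(idx):
    # idx: LongTensor of shape (G,), index of the target vector in each code table
    targets = [codebooks[g, idx[g]] for g in range(G)]
    joint = torch.cat(targets, dim=-1)           # joint feature of size G * d
    return fc(joint)                             # discrete encoding feature q_i

q_i = discrete_encoding_feature(torch.tensor([3, 17]))
```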
  • the semantic coding information includes T semantic coding features, the discrete coding information includes T discrete coding features, the coding similarity includes a first similarity and a second similarity, and T is a positive integer;
  • the parameter correction module 14 may include: a positive and negative sample determination unit 141, a model loss determination unit 142, and a network parameter correction unit 143;
  • the positive and negative sample determination unit 141 is used to obtain the semantic encoding feature c_i in the semantic encoding information, determine the discrete encoding feature q_i in the discrete encoding information as a positive sample of the semantic encoding feature c_i, and determine the discrete encoding feature q_j in the discrete encoding information as a negative sample of the semantic encoding feature c_i; both i and j are positive integers less than or equal to T, and i and j are not equal;
  • the model loss determination unit 142 is configured to determine the model loss function corresponding to the text recognition model according to the first similarity between the semantic coding feature c_i and the positive sample and the second similarity between the semantic coding feature c_i and the negative sample;
  • the network parameter correction unit 143 is configured to correct the network parameters of the text recognition model according to the model loss function, and determine the feature extraction component after parameter correction and the image coding component after parameter correction as the target text recognition model.
  • the specific function implementation of the positive and negative sample determination unit 141, the model loss determination unit 142, and the network parameter correction unit 143 can refer to step S104 in the above-mentioned embodiment corresponding to FIG. 3 , and will not be repeated here.
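  • A compact way to realize the positive/negative sample construction and the similarity-based loss is sketched below; treating the other time steps of the same image as negatives and using cosine similarity with a temperature constant follow the description, while the exact batch handling is an assumption.

```python
# Sketch: contrastive loss for one image. q_i is the positive sample of c_i,
# and the remaining discrete features q_j (j != i) are negative samples.
import torch
import torch.nn.functional as F

def contrastive_loss(c, q, K=0.1):
    # c, q: (T, dim) semantic and discrete encoding features of one image
    sim = F.cosine_similarity(c.unsqueeze(1), q.unsqueeze(0), dim=-1)  # (T, T)
    targets = torch.arange(c.size(0))       # row i: the positive sits at column i
    return F.cross_entropy(sim / K, targets)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```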
  • the model loss determination unit 142 may include: a comparison loss determination subunit 1421, a diversity loss determination subunit 1422, and a loss connection subunit 1423;
  • the comparison loss determination subunit 1421 is used to obtain the first similarity between the semantic coding feature c_i and the positive sample, obtain the second similarity between the semantic coding feature c_i and the negative sample, and determine the comparison loss function according to the first similarity and the second similarity;
  • the diversity loss determination subunit 1422 is used to obtain the logarithmic value corresponding to the code table index confidence according to the code table index confidence corresponding to the image representation information, and determine the diversity loss function according to the product between the logarithmic value and the code table index confidence;
  • the loss connection subunit 1423 is configured to determine a model loss function corresponding to the initial text recognition model according to the comparison loss function and the diversity loss function.
  • the specific function implementation manners of the comparison loss determination subunit 1421, the diversity loss determination subunit 1422, and the loss connection subunit 1423 can refer to step S104 in the above-mentioned embodiment corresponding to FIG. 3, and will not be repeated here.
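  • The diversity term can be read as the (negative) entropy of the averaged code table index confidences; a sketch under that assumption follows, with the averaging over batch and time steps and the normalization added for illustration.

```python
# Sketch: diversity loss built from the code table index confidences.
# More uniform usage of the code table entries lowers this loss.
import torch

def diversity_loss(p):
    # p: confidences of shape (batch, T, G, V)
    p_bar = p.mean(dim=(0, 1))                           # averaged confidences, (G, V)
    return (p_bar * torch.log(p_bar + 1e-7)).sum() / p_bar.numel()

# The model loss can then combine both terms, e.g. loss = L_m + alpha * L_d.
```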
  • the network parameter correction unit 143 may include: a training subunit 1431, a model determination subunit 1432;
  • the training subunit 1431 is used to modify the network parameters of the feature extraction component, the network parameters of the image coding component and the code table in the discrete coding component according to the model loss function;
  • the model determination subunit 1432 is configured to determine the feature extraction component and the image encoding component that meet the training termination condition as the target text recognition model when the number of training iterations corresponding to the text recognition model meets the training termination condition.
  • the specific function implementation manners of the training subunit 1431 and the model determination subunit 1432 can refer to step S104 in the above embodiment corresponding to FIG. 3, and will not be repeated here.
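  • Put together, the unsupervised correction of the network parameters might look like the loop below; the optimizer choice and the maximum number of steps standing in for the training termination condition are assumptions.

```python
# Sketch: unsupervised training loop that corrects the network parameters of
# the feature extraction component, the image encoding component and the code
# tables until the training termination condition is met.
import torch

def unsupervised_training(model, data_loader, model_loss_fn, max_steps=100_000, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    for images in data_loader:
        c, q = model(images)
        # model loss; the contrast term is shown here, and the diversity term
        # would additionally use the code table index confidences
        loss = model_loss_fn(c, q)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= max_steps:            # training termination condition
            return model
    return model
```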
  • the text recognition model also includes a classification network layer
  • the parameter modification module 14 may include: an unsupervised training unit 144, an annotation data acquisition unit 145, a semantic information output unit 146, an annotation data prediction unit 147, and a supervision fine-tuning unit 148;
  • the unsupervised training unit 144 is used to modify the network parameters of the text recognition model according to the semantic coding information and the discrete coding information, and determine the feature extraction component after parameter correction and the image coding component after parameter correction as the candidate text recognition model;
  • the annotation data acquisition unit 145 is used to acquire annotated image data containing text information and input the annotated image data to the candidate text recognition model; the annotated image data carries label information;
  • the semantic information output unit 146 is used to output the annotation semantic information corresponding to the annotated image data according to the parameter-corrected feature extraction component and the parameter-corrected image encoding component in the candidate text recognition model;
  • the annotation data prediction unit 147 is used to predict the annotation semantic information according to the classification network layer, and obtain the predicted text recognition result associated with the text information in the annotated image data;
  • the supervision fine-tuning unit 148 is used to modify the network parameters of the candidate text recognition model and the classification network layer according to the error between the label information and the predicted text recognition result, and determine the parameter-corrected candidate text recognition model and the parameter-corrected classification network layer as the target text recognition model.
  • the specific function implementation of the unsupervised training unit 144, the annotation data acquisition unit 145, the semantic information output unit 146, the annotation data prediction unit 147, and the supervision fine-tuning unit 148 can refer to step S104 in the above-mentioned embodiment corresponding to FIG. 3, and will not be repeated here.
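  • The supervised fine-tuning stage can be sketched as follows, assuming per-time-step character labels and a simple linear classification network layer; the label format, the hidden dimension and the number of character classes are illustrative assumptions.

```python
# Sketch: fine-tuning the candidate text recognition model with a small amount
# of annotated image data; the classification network layer maps each semantic
# encoding feature to a distribution over text character classes.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, num_classes = 512, 300
classifier = nn.Linear(hidden_dim, num_classes)     # classification network layer

def finetune_step(candidate_model, images, labels, optimizer):
    # labels: (batch, T) character class per time step (assumed label format)
    semantics = candidate_model(images)              # (batch, T, hidden_dim)
    logits = classifier(semantics)                   # predicted text recognition result
    loss = F.cross_entropy(logits.transpose(1, 2), labels)   # error vs. label information
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```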
  • the feature extraction component includes L network layers, and L is a positive integer
  • the feature extraction module 11 may include: an output result combination unit 111, an image representation acquisition unit 112;
  • the output result combination unit 111 is used to obtain the output results of the image data in the first L-1 network layers in the feature extraction component of the text recognition model, and combine the output results corresponding to the first L-1 network layers into a joint output result;
  • the image representation acquisition unit 112 is configured to obtain the target output result corresponding to the joint output result according to the weight matrix corresponding to the Lth network layer in the feature extraction component, and determine the target output result as the image representation information corresponding to the image data.
  • the specific function implementation of the output result combination unit 111 and the image representation acquisition unit 112 can refer to step S101 in the above embodiment corresponding to FIG. 3, which will not be repeated here.
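  • The densely connected variant of the feature extraction component described above can be sketched roughly as below; the channel arithmetic of the individual layers and the form of the final layer are assumptions.

```python
# Sketch: the output results of the first L-1 network layers are concatenated
# into a joint output result, and the L-th layer's weight matrix maps the joint
# result to the image representation information.
import torch
import torch.nn as nn

class DenselyConnectedExtractor(nn.Module):
    def __init__(self, first_layers, last_layer):
        super().__init__()
        self.first_layers = nn.ModuleList(first_layers)   # the first L-1 network layers
        self.last_layer = last_layer                       # the L-th network layer

    def forward(self, x):
        outputs = [x]
        for layer in self.first_layers:
            outputs.append(layer(torch.cat(outputs, dim=1)))   # dense connections
        joint = torch.cat(outputs, dim=1)                      # joint output result
        return self.last_layer(joint)                          # target output result
```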
  • the semantic encoding module 12 may include: an attention layer calculation unit 121, a text position encoding unit 122;
  • the attention layer calculation unit 121 is used to, in the image encoding component of the text recognition model, perform a product operation on the image representation information according to the weight matrix corresponding to the self-attention layer of the image encoding component, so as to obtain the attention output vector corresponding to the image representation information;
  • the text position encoding unit 122 is configured to perform text position encoding on the attention output vector according to the encoding layer in the image encoding component, so as to obtain semantic encoding information corresponding to the image representation information.
  • the specific function implementation of the attention layer calculation unit 121 and the text position encoding unit 122 can refer to the step S101 in the embodiment corresponding to the above-mentioned FIG. 3 , which will not be repeated here.
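  • A rough sketch of the attention layer calculation and the subsequent text position encoding is given below; it uses the standard scaled dot-product form of self-attention and treats the encoding layer as a simple learned mapping, both of which are simplifying assumptions.

```python
# Sketch: query, key and value matrices are obtained from the image
# representation with the self-attention weight matrices, attention output
# vectors are computed, and an encoding layer adds text position information.
import math
import torch
import torch.nn as nn

class ImageEncodingBlock(nn.Module):
    def __init__(self, dim, max_len=512):
        super().__init__()
        self.w_qu = nn.Linear(dim, dim, bias=False)    # query weight matrix W_QU
        self.w_ke = nn.Linear(dim, dim, bias=False)    # key weight matrix W_KE
        self.w_va = nn.Linear(dim, dim, bias=False)    # value weight matrix W_VA
        self.positions = nn.Parameter(torch.zeros(max_len, dim))  # learned positions
        self.encoding_layer = nn.Linear(dim, dim)

    def forward(self, z):
        # z: (batch, T, dim) image representation information
        qu, ke, va = self.w_qu(z), self.w_ke(z), self.w_va(z)
        attn = torch.softmax(qu @ ke.transpose(-2, -1) / math.sqrt(ke.size(-1)), dim=-1)
        attention_output = attn @ va                        # attention output vectors
        pos = self.positions[: z.size(1)]                   # text position encoding
        return self.encoding_layer(attention_output + pos)  # semantic encoding info
```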
  • the image data processing device may also include: a data to be processed acquisition module 15, a promotion feature extraction module 16, a promotion text semantics acquisition module 17, and a text recognition result acquisition module 18;
  • the data to be processed acquisition module 15 is used to determine the business promotion picture containing text information as the image data to be processed, and input the image data to be processed to the target text recognition model;
  • the promotion feature extraction module 16 is used to output the promotion representation information corresponding to the image data to be processed through the parameter-corrected feature extraction component in the target text recognition model;
  • the promotion text semantics acquisition module 17 is used to output the promotion text semantic information corresponding to the promotion representation information through the parameter-corrected image encoding component in the target text recognition model;
  • the text recognition result acquisition module 18 is configured to predict the promotion text semantic information according to the classification network layer in the target text recognition model, and obtain the promotion text content corresponding to the promotion text semantic information.
  • the specific function implementation of the data to be processed acquisition module 15, the promotion feature extraction module 16, the promotion text semantics acquisition module 17, and the text recognition result acquisition module 18 can refer to step S104 in the above-mentioned embodiment corresponding to FIG. 3, and will not be repeated here.
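  • End to end, recognizing the promotion text content in a business promotion picture with the trained model might look like the sketch below; the character dictionary and the greedy per-step decoding are assumptions for the example.

```python
# Sketch: inference on a business promotion picture with the target text
# recognition model (parameter-corrected feature extraction and image encoding
# components plus the classification network layer).
import torch

def recognize_promotion_text(feature_extractor, image_encoder, classifier, image, id_to_char):
    with torch.no_grad():
        z = feature_extractor(image)          # promotion representation information
        semantics = image_encoder(z)          # promotion text semantic information
        logits = classifier(semantics)        # (1, T, num_classes)
        char_ids = logits.argmax(dim=-1).squeeze(0)
    return "".join(id_to_char[int(i)] for i in char_ids)
```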
  • the text recognition model may include a feature extraction component, an image encoding component, and a discrete encoding component; the image representation information of the image data can be obtained through the feature extraction component, the semantic encoding information can be obtained by passing the image representation information through the image encoding component, and the discrete encoding information can be obtained by passing the image representation information through the discrete encoding component; the network parameters of the text recognition model can then be corrected through the encoding similarity between the semantic encoding information and the discrete encoding information, that is, the discrete encoding information can be used as the fitting target of the text recognition model in the training process.
  • the computer device 1000 may include: a processor 1001 , a network interface 1004 and a memory 1005 .
  • the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002 .
  • the communication bus 1002 is used to realize connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 can be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located far away from the aforementioned processor 1001 .
  • the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 can provide network communication functions; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the above image data processing method.
  • the computer device 1000 described in the embodiment of the present application can execute the description of the image data processing method in the embodiment corresponding to FIG. 3 above, and can also execute the description of the image data processing device 1 in the embodiment corresponding to FIG. 9 above, which will not be repeated here. In addition, the description of the beneficial effects of adopting the same method will not be repeated here.
  • the embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores the computer program executed by the aforementioned image data processing device 1, and the computer program includes program instructions.
  • when the processor executes the program instructions, it can execute the description of the image data processing method in the embodiment corresponding to FIG. 3 above, so details will not be repeated here.
  • the description of the beneficial effect of adopting the same method will not be repeated here.
  • for the technical details not disclosed in the embodiments of the computer-readable storage medium involved in the present application, please refer to the description of the method embodiments of the present application.
  • program instructions may be deployed to execute on one computing device, or on multiple computing devices located at one site, or, alternatively, on multiple computing devices distributed across multiple sites and interconnected by a communication network .
  • Multiple computing devices distributed in multiple locations and interconnected by a communication network can form a blockchain system.
  • the embodiment of the present application also provides a computer program product or computer program, where the computer program product or computer program may include computer instructions, and the computer instructions may be stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor can execute the computer instructions, so that the computer device executes the description of the image data processing method in the embodiment corresponding to FIG. 3 above; therefore, no further details will be given here. In addition, the description of the beneficial effects of adopting the same method will not be repeated here.
  • for the technical details not disclosed in the computer program product or computer program embodiments involved in this application, please refer to the description of the method embodiments in this application.
  • the modules in the device of the embodiment of the present application can be combined, divided and deleted according to actual needs.
  • the computer programs can be stored in a computer-readable storage medium; when the computer program is executed, it may include the processes of the embodiments of the above-mentioned methods.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM) or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

一种图像数据处理方法、装置、设备以及存储介质,该方法包括:将包含文本信息的图像数据输入至文本识别模型,根据文本识别模型中的特征提取组件,获取图像数据对应的图像表征信息;根据图像编码组件得到图像表征信息对应的语义编码信息;根据离散编码组件中所包含的码表,获取图像表征信息对应的离散编码信息;根据语义编码信息与离散编码信息之间的编码相似度,对文本识别模型的网络参数进行修正,得到目标文本识别模型;采用该方法可以降低数据的标注成本,并提高文本识别模型的识别效果。

Description

图像数据处理方法、装置、设备以及存储介质
本申请要求于2021年05月12日提交的申请号为202110518209.7、发明名称为“图像数据处理方法、装置、设备以及介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其涉及一种图像数据处理方法、装置、设备以及存储介质。
背景技术
OCR(Optical Character Recognition,光学字符识别)文字识别是指通过图像处理和模式识别技术对光学的字符进行识别,以输出图像中的文字信息。
目前,可以使用OCR模型对包含文字信息的图像进行识别,提取该图像中的文字信息。在采用OCR模型对图像进行识别之前,需要使用大量的标注图像数据进行模型训练。然而图像的标注是一项非常耗费人力和时间的工作,这会造成图像数据的标注成本过高,在对图像进行标注时,通常是对某个特定场景中的图像进行标注,将基于上述标注图像训练得到的OCR模型应用在其余场景时,容易造成文字识别的识别准确性过低。
发明内容
本申请实施例提供一种图像数据处理方法、装置、设备以及存储介质,能够降低数据的标注成本,并提高文本识别模型的识别效果。
本申请实施例一方面提供了一种图像数据处理方法,包括:
将包含文本信息的图像数据输入至文本识别模型,根据文本识别模型中的特征提取组件,获取图像数据对应的图像表征信息;
根据文本识别模型中的图像编码组件,对图像表征信息进行编码,得到图像表征信息对应的语义编码信息;语义编码信息与图像数据中的文本信息相关联;
根据文本识别模型的离散编码组件中所包含的码表,获取图像表征信息对应的离散编码信息;码表包括用于表征文本特征的可学习编码向量,离散编码信息用于作为无监督学习的拟合目标;
根据语义编码信息与离散编码信息之间的编码相似度,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型;目标文本识别模型用于识别待处理图像数据中的文本信息。
本申请实施例一方面提供了一种图像数据处理装置,包括:
特征提取模块,用于将包含文本信息的图像数据输入至文本识别模型,根据文本识别模型中的特征提取组件,获取图像数据对应的图像表征信息;
语义编码模块,用于根据文本识别模型中的图像编码组件,对图像表征信息进行编码,得到图像表征信息对应的语义编码信息;语义编码信息与图像数据中的文本信息相关联;
离散编码模块,用于根据文本识别模型的离散编码组件中所包含的码表,获取图像表征信息对应的离散编码信息;码表包括用于表征文本特征的可学习编码向量,离散编码信息用于作为无监督学习的拟合目标;
参数修正模块,用于根据语义编码信息与离散编码信息之间的编码相似度,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型;目标文本识别模型用于识别待处理图像数据中的文本信息。
本申请实施例一方面提供了一种计算机设备,包括存储器和处理器,存储器与处理器相连,存储器用于存储计算机程序,处理器用于调用计算机程序,以使得该计算机设备执行本申请实施例中上述一方面提供的方法。
本申请实施例一方面提供了一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序适于由处理器加载并执行,以使得具有处理器的计算机设备执行本申请实施例中上述一方面提供的方法。
根据本申请的一个方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述一方面提供的方法。
本申请实施例中,文本识别模型可以包括特征提取组件、图像编码组件以及离散编码组件;通过特征提取组件可以获取图像数据的图像表征信息,将图像表征信息通过图像编码组件可以得到语义编码信息,将图像表征信息通过离散编码组件可以得到离散编码信息,进而可以通过语义编码信息与离散编码信息之 间的编码相似度,对文本识别模型的网络参数进行修正,也就是说,该离散编码信息可以作为文本识别模型在训练过程中的拟合目标,在上述训练过程中无需使用图像数据的标注信息,可以降低数据的标注成本;由于未标注的图像数据具有数据量大,覆盖范围广等多样性特点,直接使用无标注的图像数据进行训练,可以提高目标文本识别模型的泛化能力,从而提高目标文本识别模型的识别效果。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种网络架构的结构示意图;
图2是本申请实施例提供的一种文本识别模型的训练场景图;
图3是本申请实施例提供的一种图像数据处理方法的流程示意图;
图4是本申请实施例提供的一种离散编码组件的处理示意图;
图5是本申请实施例提供的一种无监督训练方式的示意图;
图6是本申请实施例提供的一种监督训练方式的示意图;
图7是本申请实施例提供的一种文本识别场景图;
图8是本申请实施例提供的一种文本识别场景图:
图9是本申请实施例提供的一种图像数据处理装置的结构示意图;
图10是本申请实施例提供的一种计算机设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请涉及人工智能(Artificial Intelligence,AI)技术、区块链(Block Chain)技术以及云技术。本申请具体涉及人工智能技术下属的计算机视觉技术(Computer Vision,CV)。
可选地,本申请所涉及的图像数据以及文本识别结果均可以存储在区块链上,以确保图像数据以及文本识别结果的不可篡改性。
可选地,本申请涉及云技术下属的人工智能云服务,该人工智能云服务也可以称为AIaaS(AI as a Service,中文为“AI即服务”)。本申请可以使用平台提供的AI框架和AI基础设施来部署OCR服务,在训练得到OCR模型后,可以将训练完成的OCR模型应用在云人工智能服务中的OCR服务中。
本申请还涉及以下几个概念:
OCR技术:OCR技术是通过扫描等光学输入方式将各种票据、报刊、书籍、文稿以及其它印刷品的文字转化为图像信息,再利用文字识别技术将图像信息转化为可以使用的计算机输入技术。换言之,通过OCR技术可以直接从影像中提取金额、帐号、文字资料等重要数据,生成日常生活中所需的新文本,进而代替人的手工录入文本数据。
无监督训练(unsupervised learning,也可以称为无监督学习,或者自监督学习,或者非监督学习):无监督训练用于处理未被标记类别的样本集。在无监督训练中,样本数据没有被标记,也没有确定的结果;由于样本数据类别未知,需要根据样本数据之间的相似度对样本集进行分类,试图使相同类别内的差距最小化,不同类别之间的差距最大化。
监督训练(supervised learning,也可以称为监督学习,或者有教师学习):监督训练可以利用一组已知类别的样本数据调整网络模型的参数,使其达到所要求性能的过程。在监督训练中,训练数据集要求包括输入(特征)和输出(目标),训练数据集中的目标可以进行人工标注;通过已有的训练数据集(已知样本数据及其对应输出)去训练得到一个最佳模型,利用这个最佳模型可以将所有的输入映射为相应的输出,对输出进行简单的判断从而实现分类的目的。
请参见图1,图1是本申请实施例提供的一种网络架构的结构示意图。如图1所示,该网络架构可以包括服务器10d和用户终端集群,该用户终端集群可以包括一个或者多个用户终端,这里不对用户终端的数量进行限制。如图1所示,该用户终端集群可以具体包括用户终端10a、用户终端10b以及用户终端10c等。其中,服务器10d可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器。用户终端10a、用户终端10b以及用户终端10c等均可以包括:智能手机、平板电脑、笔记本电脑、掌上电脑、移动互联网设备(Mobile Internet Device,MID)、可穿戴设备(例如智能手表、智能手环等)以及智能电视等具有文本识别功能的智能终端。如图1所示,用户终端10a、用户 终端10b以及用户终端10c等可以分别与服务器10d进行网络连接,以便于每个用户终端可以通过该网络连接与服务器10d之间进行数据交互。
如图1所示,以用户终端集群中的用户终端10a为例,该用户终端10a可以通过电子设备(例如,扫描仪、照相设备等)采集不同场景中的图像数据,上述图像数据可以包括照相设备(可以为用户终端10a中自带的照相机,或者与用户终端10a具有数据传输通道的外接照相设备)拍摄的业务推广图片(例如,墙壁上的广告图片、横幅宣传语图片、宣传画报图片等)、扫描仪所扫描的文字资料图片等。该用户终端10a可以获取初始化的文本识别模型(也可以称为初始文本识别模型或初始OCR模型),并通过采集到的图像数据对文本识别模型进行训练;该文本识别模型可以包括特征提取组件、图像编码组件以及离散编码组件,对于输入至文本识别模型的每个图像数据,可以通过特征提取组件对图像数据进行特征提取(图像下采样),输出图像数据对应的图像表征信息,通过图像编码组件可以输出图像表征信息对应的语义编码信息,还可以通过离散编码组件输出图像表征信息对应的离散编码信息,进而可以通过语义编码信息与离散编码信息之间的编码相似度,以及码表索引置信度,对文本识别模型的网络参数进行修正,得到训练完成的文本识别模型(也可以称为目标文本识别模型),该训练完成的文本识别模型可以用于识别待处理图像数据中的文本信息。本申请实施例中,文本识别模型的训练无需使用图像数据的标注信息,可以降低图像数据的标注成本;由于所采集图像数据的多样性,使得训练得到的文本识别模型具有更好的识别效果。
请一并参见图2,图2是本申请实施例提供的一种文本识别模型的训练场景图,下面以上述图1所示的用户终端集群中的用户终端10a为例,采用无监督训练的方式对文本识别模型进行训练。如图2所示,用户终端10a可以采用未携带标签信息的图像数据对初始化的文本识别模型进行训练,该文本识别模型可以包括特征提取组件20a、图像编码组件20d以及离散编码组件20f,在训练完成后,可以得到包含参数更新后的特征提取组件和参数更新后的图像编码组件的目标文本识别模型。换言之,采用无监督训练的方式对文本识别模型进行训练的过程中,最主要的目的在于训练上述特征提取组件20a的网络参数和图像编码组件20d的网络参数。
如图2所示,对于输入文本识别模型的样本图像20b(该样本图像20b可以包含文本信息),首先输入至文本识别模型20b中的特征提取组件20a,可以对样本图像20b进行图像预处理,该图像预处理可以包括但不限于:图像灰度化、图像归一化、图像尺寸调整、图像去噪处理;如样本图像20b为彩色图像时,可以对该样本图像20b进行图像灰度化,将该样本图像20b转换为灰度图像,可以降低文本识别模型在训练过程中的计算复杂度;当然,为了进一步减少训练过程中的计算复杂度,还可以对图像灰度化处理后的样本图像20b进行图像归一化;当特征提取组件20a对输入图像的尺寸具有尺寸规定时,可以将样本图像20b的尺寸调整至特征提取组件20a所规定的尺寸;可以对样本图像20b进行去噪处理,以优化该样本图像20b等。用户终端10a可以采用上述一种或者多种图像预处理方式对样本图像20b进行预处理,得到预处理后的图像20c。通过特征提取组件20a可以对图像20c进行下采样,从该图像20c中提取出可以用于描述该图像20c的图像表征信息Z={z 1,z 2,z 3,…,z T},其中T可以是指图像20c的图像宽度与图像下采样倍率之间的比值,该T可以取值为正整数,当图像宽度与图像下采样倍率之间的比值为非正整数时,可以对图像宽度与图像下采样倍率之间的比值进行取整。可以理解地,通过特征提取组件20a对图像20c进行下采样,可以从图像20c中提取出文本信息所在区域的图像表征信息,也就是说,特征提取组件20a的目的在于检测图像20c中所包含的文本信息所在的区域,并提取文本信息所在区域的图像表征信息。
进一步地,可以将特征提取组件20a输出的图像表征信息Z={z 1,z 2,z 3,…,z T}作为图像编码组件20d的输入数据,该图像编码组件20d中可以包括编码器20e,通过该编码器20e可以对图像表征信息Z={z 1,z 2,z 3,…,z T}进行文本位置编码,得到语义编码信息C={c 1,c 2,c 3,…,c T}。
其中,特征提取组件20a输出的图像表征信息Z={z 1,z 2,z 3,…,z T}还可以作为离散编码组件20f的输入数据,该离散编码组件20f可以包括一个或者多个可学习的码表,每个码表内部可以包含V个可学习编码向量,其中V为正整数,如V可以取值为1,2,3,……,每个码表中所包含的可学习编码向量在训练过程中是可以进行不断更新的。对于图像表征信息Z={z 1,z 2,z 3,…,z T}中的任意一个图像表征特征z i(其中i为小于或等于T的正整数),均可以计算得到图像表征特征z i分别与码表中所包含的V个可学习编码向量之间的码表索引置信度,如图像表征特征z i与每个码表之间均可以计算得到V个码表索引置信度,在V个码表索引置信度中选择最大的码表索引置信度,根据最大的码表索引置信度所对应的可学习编码向量,获取离散编码信息Q={q 1,q 2,q 3,…,q T},其中离散编码信息Q={q 1,q 2,q 3,…,q T}中的离散编码特征q i与上述语义编码信息C={c 1,c 2,c 3,…,c T}中的语义编码特征c i具有相同的尺寸。
进一步地,本申请可以将离散编码信息Q={q 1,q 2,q 3,…,q T}作为无监督训练的拟合目标,如可以根据语义编码信息C={c 1,c 2,c 3,…,c T}和离散编码信息Q={q 1,q 2,q 3,…,q T}之间的编码相似度,以及码表索引置信度,计算文本识别模型对应的模型损失函数,通过最小化模型损失函数优化文本识别模型的网络参数(可以包括特征提取组件20a的网络参数、图像编码组件20d的网络参数以及离散编码组件 20f中的码表),当初始文本识别模型达到训练终止条件时,可以保存此时的特征提取组件20a的网络参数以及图像编码组件20d的网络参数,作为目标文本识别模型。在无监督训练中,对于语义编码信息C={c 1,c 2,c 3,…,c T}中的语义编码特征c i,可以将离散编码信息Q={q 1,q 2,q 3,…,q T}中的离散编码特征q i作为语义编码特征c i的正样本,将离散编码信息Q={q 1,q 2,q 3,…,q T}中的离散编码特征q j(i不等于j,j为小于或等于T的正整数)作为语义编码特征c i的负样本,根据语义编码特征c i与正负样本之间的编码相似度,以及码表索引置信度,计算模型损失函数,根据该模型损失函数对文本识别模型的网络参数进行修正,以得到最终训练完成的目标文本识别模型,该目标文本识别模型可以用于识别待处理图像数据中的文本信息。
请参见图3,图3是本申请实施例提供的一种图像数据处理方法的流程示意图。可以理解地,该图像数据处理方法可以由计算机设备执行,该计算机设备可以为用户终端,或者服务器,或者为用户终端和服务器组成的系统,或者为一个计算机程序应用(包括程序代码),这里不做具体限定。如图3所示,该图像数据处理方法可以包括以下步骤:
步骤S101,将包含文本信息的图像数据输入至文本识别模型,根据文本识别模型中的特征提取组件,获取图像数据对应的图像表征信息。
具体的,在OCR文本识别应用中,可以使用文本识别模型(也可以称为OCR模型)对图像数据中所包含的文本信息进行检测并识别,以输出图像数据中所包含的文本内容。在使用文本识别模型之前,可以采用无监督训练的方式对文本识别模型进行训练,以得到训练完成的文本识别模型。为方便描述,本申请实施例中可以将初始化的文本识别模型称为初始文本识别模型,将训练完成的文本识别模型称为目标文本识别模型。另外,文本识别模型可以包括特征提取组件、图像编码组件以及离散编码组件。
在收集到用于训练文本识别模型的样本数据集后,计算机设备(例如,上述图1所对应实施例中的用户终端10a)可以获取文本识别模型,将样本数据集中所包含的图像数据分批次(batch)输入至文本识别模型,利用上述样本数据集对文本识别模型进行无监督训练。其中,样本数据集中所包含的图像数据均为包含文本信息的图像,如样本数据集可以包括业务推广图片(广告图片)、文件资料扫描图片、证件扫描图片、截图等图像数据,上述样本数据集中所包含的所有图像数据均可以为无标注数据(即不携带标签信息的图像数据);在无监督训练过程中,可以将样本数据集中的图像数据进行分批次处理,batch中所包含的图像数据输入文本识别模型后,首先输入至文本识别模型中的特征提取组件,通过该特征提取组件可以对图像数据进行下采样,检测图像数据中的文本信息所在的区域,并提取文本信息所在区域中的图像表征信息,此处的图像表征信息可以表示为Z={z 1,z 2,z 3,…,z T},T=图像宽度(width)/图像下采样倍率,该图像表征信息可以包括T个图像表征特征,每个图像表征特征的维度可以表示为R G.V,如G=2,V=256时,每个图像表征特征的维度为512。
其中,文本识别模型对所输入的图像数据设置有数据输入格式(此时的数据输入格式也可以理解为特征提取组件所规定的数据输入格式),该数据输入格式可以表示为:shape=batch size*channels*height*width,此时的shape可以表示为数据输入格式,batch size可以表示为批次大小,channels可以表示为通道数,height可以表示为图像数据对应的图像高度,width可以表示为图像数据对应的图像宽度;任何图像数据在输入特征提取组件中时,均需要满足上述数据输入格式,若图像数据不满足上述数据输入格式,则可以将图像数据转换为上述数据输入格式。图像表征信息的格式可以表示为:shape=batch size*Time step*channel,此时的shape可以表示为图像表征信息的格式,Time step可以表示为文本识别场景中所涉及的文本序列长度。上述特征提取组件可以包括但不限于:VGGNet(一种卷积神经网络模型,可以包括16至19个网络层,在卷积层中所使用的卷积核的尺寸可以为3*3,在池化层中所使用的滤波器的尺寸可以为2*2)、GoogLeNet(一种卷积神经网络模型,可以包括22个网络层,并在网络模型中引入inception(起始)结构,该inception结构可以用于提供计算效率)、ResNet(一种卷积神经网络模型,通过引入残差结构可以包括152个网络层)、DenseNet(一种卷积神经网络模型,该网络模型中每一个网络层的输入来自前面所有网络层的输出)。
可选的,当特征提取组件为ResNet网络时,假设该ResNet网络包括L个网络层,L为正整数,如L可以取值为1,2,3……;对于ResNet网络中的第l个网络层(l可以为小于或等于L的正整数),该第l个网络层的输入可以表示为x l-1,输出可以表示为:x l=H l(x l-1)+x l-1,其中x l-1也可以表示为第(l-1)个网络层的输出,H l可以表示为第l个网络层的非线性变换函数(non-liear transformation),该H l可以理解为一个组合操作,如一系列的BN(Batch Normalization,分批归一化)、激活函数、池化、卷积等操作。换言之,ResNet网络中第l个网络层的输出可以为第(l-1)个网络层的输出与第(l-1)个网络层输出的非线性变换,将ResNet网络中第L个网络层(最后一个网络层)的输出x L作为特征提取组件输出的图像表征信息。需要说明的是,此处的第l个网络层与第(l-1)个网络层之间可以包括一个或者多个卷积层。
可选的,当特征提取组件为DenseNet网络时,同样假设特征提取组件包括L个网络层,在文本识别模型的特征提取组件(DenseNet网络)中,计算机设备可以获取图像数据在前(L-1)个网络层中的输出 结果,将前(L-1)个网络层所对应的输出结果组合为联合输出结果;进而可以根据特征提取组件中的第L个网络层所对应的权重矩阵,得到联合输出结果对应的目标输出结果,将目标输出结果确定为图像数据对应的图像表征信息。其中,DenseNet网络中第L个网络层(最后一个网络层)的输出可以作为特征提取组件输出的图像表征信息,第L个网络层的输出可以表示为:x L=H L([x 0,x 1,…,x L-1]),其中x 0可以表示为DenseNet网络中的初始化值,x 1可以表示为DenseNet网络中第1个网络层的输出,……,x L-1可以表示为DenseNet网络中第(L-1)个网络层的输出,H L可以表示为第L个网络层所对应的权重矩阵,也可以理解为非线性变换函数中所涉及到的权重矩阵,与上述非线性变换函数H l类似;[x 0,x 1,…,x L-1]可以表示为将DenseNet网络中第0层至第(L-1)个网络的输出做拼接(concatenation),concatenation是指做通道的合并(即上述联合输出结果)。换言之,对于DenseNet网络中的任意一个网络层l(即第l个网络层),该第l个网络层的输出可以为前(l-1)个网络层的输出进行concatenation后的非线性变换,进而将DenseNet网络中第L个网络层的输出x L作为特征提取组件输出的图像表征信息。
步骤S102,根据文本识别模型中的图像编码组件,对图像表征信息进行编码,得到图像表征信息对应的语义编码信息;语义编码信息与图像数据中的文本信息相关联。
具体的,在上述特征提取组件输出图像表征信息Z={z 1,z 2,z 3,…,z T}后,计算机设备可以将图像表征信息输入文本识别模型的图像编码组件中,通过图像编码组件,可以对图像表征信息编码,以得到图像表征信息对应的语义编码信息,该语义编码信息与图像数据中所包含的文本信息相关联,该语义编码信息可以表示为C={c 1,c 2,c 3,…,c T}。其中,图像编码组件可以用于学习图像数据中所包含的文本信息之间的语义信息,可以更关注语义上相关的词语,并弱化不相关的词语。上述图像编码组件可以包括但不限于:含mask(掩码)模块的Transformer编码器(一种编码器模型结构)、LSTM(Long Short-Term Memory,长短期记忆网络)、RNN(Recurrent Neural Network,循环神经网络)。
可选的,当图像编码组件为含mask模块的Transformer编码器时,该Transformer编码器可以包括自注意力层和编码层,计算机设备可以在文本识别模型的图像编码组件中,可以根据图像编码组件的自注意力层所对应的权重矩阵,对图像表征信息进行乘积运算,得到图像表征信息对应的注意力输出向量;进而可以根据图像编码组件中的编码层,对注意力输出向量进行文本位置编码,得到图像表征信息对应的语义编码信息。对于图像表征信息Z={z 1,z 2,z 3,…,z T}中的每个图像表征特征,通过Transformer编码器中的自注意力层,均可以生成每个图像表征特征分别对应的查询向量(Query)、键向量(Key)以及值向量(Value),T个图像表征特征分别对应的查询向量可以组成查询矩阵QU,T个图像表征特征分别对应的键向量可以组成键矩阵KE,T个图像表征信息分别对应的值向量可以组成值矩阵VA。将图像表征信息Z与查询权阵矩阵W QU进行相乘后可以得到查询矩阵QU,将图像表征信息Z与键权阵矩阵W KE进行相乘后可以得到键矩阵KE,将图像表征信息Z与值权阵矩阵W VA进行相乘后可以得到值矩阵VA,根据上述查询矩阵QU、键矩阵KE以及值矩阵VA,得到自注意力层的输出结果(即上述注意力输出向量,T个图像表征特征分别对应的注意力输出向量可以组成注意力输出矩阵),该自注意力层的输出结果可以表示为:
$$\mathrm{Attention}(QU,KE,VA)=\mathrm{softmax}\left(\frac{QU\cdot KE^{T}}{\sqrt{d_{ke}}}\right)\odot VA$$
其中d ke可以表示为键向量的维数,softmax可以表示为分类器,⊙可以表示为矩阵点乘运算。进一步地,为了理解图像表征信息中每个图像表征特征所对应的文本顺序,可以采用根据图像编码组件中的编码层,对自注意力层的输出结果进行文本位置编码,得到图像表征信息对应的语义编码信息C={c 1,c 2,c 3,…,c T}。
步骤S103,根据文本识别模型的离散编码组件中所包含的码表,获取图像表征信息对应的离散编码信息。
可选的,根据文本识别模型的离散编码组件中所包含的码表,获取图像表征信息对应的码表索引置信度,根据码表索引置信度,在码表中获取图像表征信息对应的离散编码信息;码表包括用于表征文本特征的可学习编码向量。其中,码表索引置信度是指采用可学习编码向量表示图像表征信息的可靠度。
具体的,在文本识别模型的离散编码组件中可以包括码表,该码表可以包括用于表征文本特征的可编码学习向量。计算机设备可以根据离散编码组件中的码表,获取图像表征信息中的每个图像表征特征分别与码表之间的码表索引置信度,根据该码表索引置信度从大到小的排列顺序,可以从码表中获取图像表征信息所对应的离散编码信息。
可选的,离散编码组件中的码表均可以包括V个可学习编码向量,图像表征信息可以包括T个图像表征特征,V为正整数,如V可以取值为1,2,3……;计算机设备可以获取文本识别模型的离散编码组件中所包含的码表,在图像表征信息中的图像表征特征z i,其中i为小于或等于T的正整数,其中图像表征特征z i可以表示为图像表征信息中所包含的任意一个图像表征特征;进而可以获取图像表征特征z i分别与V个可学习编码向量之间的码表索引置信度,其中,V个可学习编码向量中的第i个可学习编码向量对应 的码表索引置信度,是指采用该第i个可学习编码向量表示图像表征特征z i的可靠度,对于图像表征特征z i来说,能够获取V个可学习编码向量分别对应的码表索引置信度,可选地,该V个码表索引置信度的取值之和为1,且每一个码表索引置信度均是取值在[0,1]之间的数值;在V个可学习编码向量中,将最大的码表索引置信度所对应的可学习编码向量确定为目标编码向量,进而可以根据目标编码向量确定图像表征特征z i对应的离散编码特征q i,将T个图像表征特征分别对应的离散编码特征组成离散编码信息。
其中,获取码表索引置信度的方法可以包括但不限于:基于softmax的采样方法、基于gumbel(耿贝尔)-max的采样方法、基于gumbel-softmax的采样方法。本申请实施例中,若采用基于gumbel-softmax的采样方法获取码表索引置信度,则计算机设备可以获取图像表征特征z i中的特征值所对应的分布随机数(此处的分布随机数可以为gumbel分布随机数),进而可以将图像表征特征z i中的特征值与分布随机数进行相加,得到图像表征特征z i对应的候选表征特征;根据候选表征特征中的特征值所对应的指数值,获取候选表征特征分别与V个可学习编码向量之间的码表索引置信度。
可选的,码表的数量为G个,每个码表均对应一个目标编码向量,G为正整数,G可以取值为1,2,3……,其中G和V的取值与文本识别场景中的文字数量相关联;对G个码表中的目标编码向量进行拼接,得到图像表征特征z i对应的联合特征;进而可以将联合特征输入至全连接网络层,根据全连接网络层中的权重矩阵,输出图像表征特征z i对应的离散编码特征q i。应当理解的是,在G的取值为1时,仅有1个码表,且1个码表对应1个目标编码向量,因此不存在对多个目标编码训练进行拼接得到联合特征的步骤,可以直接将这1个码表对应的1个目标编码向量作为联合特征,输入至全连接网络层,输出图像表征特征z i对应的离散编码特征q i
可选的,若离散编码组件中包括G个码表,每个码表均可以包括V个可学习编码向量e,每个可学习编码向量e的维度均为d维(e∈R d),则该码表的尺寸可以表示为:G×V×d。在实际应用中,可以创建一个可学习张量codebook∈R G×V×d作为码表,在文本识别模型的训练过程中,可以对该码表进行不断更新。对于图像表征信息中的任意一个图像表征特征z i∈R G.V,均可以展开为尺寸是G×V的矩阵S,由于离散编码组件中可能存在截断梯度的操作,因此在将图像表征特征z i展开为矩阵S时,可以不做任何数值上的变换,即z i—>S,S∈R G×V,即可以将向量形式表示的图像表征特征z i展开为矩阵形式。
进一步地,可以通过基于gumbel-softmax的采样方法得到图像表征特征z i分别与码表中的V个可学习编码向量之间的码表索引置信度,如码表索引置信度可以表示为:
$$p_{g,v}=\frac{\exp\big((s_{g,v}+n_v)/\tau\big)}{\sum_{k=1}^{V}\exp\big((s_{g,k}+n_k)/\tau\big)}\qquad(1)$$
其中,上述公式(1)中的p g,v可以表示为矩阵S中第g行第v列特征值所对应的码表索引置信度,对于G个码表而言,该p g,v的维度可以为G;s g,v可以表示为矩阵S中的第g行第v列的特征值,n v可以表示为特征值s g,v对应的Gumbel分布随机数(即上述分布随机数),Gumbel分布随机数可以表示为:n v=-log(-log(U(0,1)),其中(U(0,1)可以表示服从均匀分布,(s g,v+n v)可以称为候选表征特征;τ可以为gumbel-softmax采样方法中所涉及的非负常数,该常数τ可以用于控制gumbel-softmax采样的平滑程度,常数τ越大,生成的gumbel分布越平滑,常数τ越小,生成的gumbel分布越接近离散的one-hot分布。在文本识别模型的训练初期,由于网络不稳定,为了避免梯度爆炸或消失等情况出现,可以将常数τ设置大一些(如将常数τ设置为2)。在文本识别模型的训练过程中,可以逐渐减小常数τ,每一轮迭代(epoch,完整训练一次样本数据集的次数)均可以乘以小于1的系数(例如该系数可以设置为0.9995),这样可以在保证网络稳定收敛的情况下,使得Gumbel分布逐步逼近真实的离散分布。
进一步地,根据上述公式(1)计算得到与图像表征特征z i相关联的V个码表索引置信度后,可以取最大的码表索引置信度所对应的码表索引Idx,码表索引Idx的维度与码表索引置信度p g,v的维度相同,该码表索引Idx可以表示为:
$$Idx=\arg\max_{v}\;p_{g,v}\qquad(2)$$
通过上述公式(2)确定码表索引Idx后,可以从G个码表中个取一个可学习编码向量,此时从G个码表中所选取的可学习编码向量均可以称为目标编码向量,即基于码表索引Idx可以分别从各个码表中获取一个目标编码向量,将G个码表中所获得的目标编码向量进行拼接,得到图像表征特征z i对应的联合特征E,即通过G个码表,码表索引Idx可以得到联合特征E={e 1,e 2,…,e G},E∈R G.d;将该联合特征输入至全连接网络层,根据全连接网络层中的权重矩阵,输出图像表征特征z i对应的离散编码特征q i,该离散编码特征q i可以表示为:
q i=EW+b,W∈R G.d×G.V             (3)
其中,上述公式(3)中的W可以表示为全连接网络层中的权重矩阵,b可以表示为全连接网络层中的偏置,其中W的尺寸为G·d×G·V。通过上述公式(1)至公式(3),可以计算得到每个图像表征特征分别对应的离散编码特征,将每个图像表征特征分别对应的离散编码特征进行组合,可以得到离散编码信息Q={q 1,q 2,q 3,…,q T}。
需要说明的是,本申请实施例中所涉及的形如G·d的描述均表示为两个数值的乘积,如G=2,d=4时,G·d表示8,此时可以理解为是一个8维的向量,而形如d×G的描述可以表示为二维矩阵,如G=2,d=4时,d×G表示尺寸为4×2的矩阵。在实际应用中,可以根据实际需求构建码表的尺寸,例如,常用字符的类别数量大约在40000上下时,可以将G设置为2,V设置为256,这样码表所能表达的文本特征数量为V G=256 2
请一并参见图4,图4是本申请实施例提供的一种离散编码组件的处理示意图。如图4所示,在通过特征提取组件输出图像表征信息Z={z 1,z 2,z 3,…,z T}后,可以将该图像表征信息Z输入至离散编码组件,该离散编码组件中包含有G个可学习的码表,每个码表内部均包含有V个可学习编码向量,将图像表征信息Z中所包含的每个图像表征特征均可以展开为尺寸是G×V的矩阵,进而可以通过上述公式(1)计算得到V个维度为2的码表索引置信度,在每个码表所对应的V个码表索引置信度中选择最大的码表索引置信度,进而确定最大的码表索引置信度所对应的码表索引(上述公式(2)),根据该码表索引可以从G个码表中各选择一个可学习编码向量作为目标编码向量,将G个目标编码向量进行连接,在通过一个全连接网络层可以得到每个图像表征特征对应的离散编码特征(根据上述公式(3)可以计算得到),如图像表征特征z 1所对应离散编码特征可以表示为q 1,将T个图像表征特征分别对应的离散编码特征进行组合,可以得到离散编码信息Q={q 1,q 2,q 3,…,q T}。
步骤S104,根据语义编码信息与离散编码信息之间的编码相似度,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型;目标文本识别模型用于识别待处理图像数据中的文本信息。
具体的,在文本识别模型的训练过程中,可以根据语义编码信息与离散编码信息之间的编码相似度,确定文本识别模型的网络优化目标,该网络优化目标也可以称为模型损失函数。计算机设备可以在语义编码信息中获取语义编码特征c i(该语义编码特征c i可以为语义编码信息中的任意一个语义编码特征),可以将离散编码信息中的离散编码特征q i,确定为语义编码特征c i的正样本,将离散编码信息中的离散编码特征q j,确定为语义编码特征c i的负样本,其中i和j均为小于或等于T的正整数,且i和j不相等;进而可以根据语义编码特征c i与正样本之间的第一相似度、语义编码特征c i与负样本之间的第二相似度以及码表索引置信度,确定文本识别模型对应的模型损失函数(此时的第一相似度和第二相似度可以称为编码相似度);根据模型损失函数,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型。
可选的,在文本识别模型的训练过程中,根据语义编码信息与离散编码信息之间的编码相似度,以及码表索引置信度,确定文本识别模型的模型损失函数。
可选的,上述模型损失函数可以包括两部分,分别为对比损失函数和多样性损失函数,其中对比损失函数可以让语义编码信息C在一组含正样本和负样本的表征信息(上述码表)中可以找到正确的离散编码信息Q,通过计算语义编码信息C与离散编码信息Q之间的夹角信息,优化文本识别模型的网络参数,使得语义编码特征c i与正样本的夹角变小,与负样本的夹角变大;多样性损失函数可以提升离散编码组件中码表的利用率,通过优化码表索引的信息熵,提升生成的码表索引的多样性。具体的,计算机设备可以获取语义编码特征c i与正样本之间的第一相似度,获取语义编码特征c i与负样本之间的第二相似度,根据第一相似度和第二相似度确定对比损失函数,该对比损失函数可以表示为:
$$L_{m}=-\log\frac{\exp\big(\mathrm{sim}(c_i,q_i)/K\big)}{\sum_{j=1}^{T}\exp\big(\mathrm{sim}(c_i,q_j)/K\big)}\qquad(4)$$
$$\mathrm{sim}(a,b)=\frac{a^{T}b}{\|a\|\cdot\|b\|}\qquad(5)$$
其中,上述公式(4)中的L m可以表示为对比函数损失,sim(c i,q i)可以表示为语义编码特征c i与正样本q i之间的第一相似度,sim(c i,q j)可以表示为语义编码特征c i与正样本q j之间的第二相似度,此处的第一相似度和第二相似度均可以为余弦相似度,K可以表示为常数,可以根据实际需求进行设置, 公式(5)表示两个向量a和向量b之间的余弦相似度计算方式,||a||可以表示为向量a的2范数,即向量a中的元素绝对值的平方和再开方。当sim(c i,q i)增大或sim(c i,q j)减小时,对比损失函数L m降低。
可选的,计算机设备可以获取码表索引置信度对应的对数值,根据对数值和码表索引置信度之间的乘积,确定多样性损失函数,该多样性损失函数可以表示为:
$$L_{d}=\frac{1}{G\cdot V}\sum_{g=1}^{G}\sum_{v=1}^{V}\bar{p}_{g,v}\,\log\bar{p}_{g,v}\qquad(6)$$
其中，上述公式(6)中，L_d可以表示为多样性损失函数，H{}可以表示为信息熵，$\bar{p}_{g,v}$可以表示为训练过程中所计算得到的码表索引置信度。当码表索引置信度趋于均匀分布时，多样性损失函数L_d降低。
进一步地,根据对比损失函数和多样性损失函数,确定文本识别模型对应的模型损失函数,该模型损失函数可以表示为L=L m+αL d,其中α为用于连接对比损失函数和多样性损失函数的超参数,模型损失函数L为超参数α与多样性损失函数L d的乘积再与对比损失函数L m求和。
进一步地,根据模型损失函数L=L m+αL d,对特征提取组件的网络参数、图像编码组件的网络参数以及离散编码组件中的码表进行修正;当文本识别模型对应的训练次数满足训练终止条件(文本识别模型的训练次数达到设置的最大训练次数,或者文本识别模型的训练达到收敛条件)时,将满足训练终止条件的特征提取组件和图像编码组件,确定为目标文本识别模型。
需要说明的是,上述对比损失函数的计算中,是对同一个图像数据内部的特征之间的运算,不涉及不同图像数据之间的运算,而离散编码组件可以学习多个码表,在无监督训练过程中可以对文本识别模型的训练方向起指导作用。在无监督训练过程中,由不同图像数据所得到的特征可以向码表靠拢,不同码表之间会相互远离,使得不同图像数据之间相似的特征互相靠近,不相似的特征互相远离。离散编码组件避免了不同图像数据之间的运算(可以称为跨图运算)带来的计算量暴增,减少了内存需求,降低了训练网络的时间成本。
本申请实施例中,可以将上述对文本识别模型的整个训练过程称为无监督训练过程,在满足训练终止条件时所获得的特征提取组件和图像编码组件可以称为训练完成的目标文本识别模型,此时的目标文本识别模型可以应用在文本识别场景中,用于识别待处理图像数据中的文本信息。请一并参见图5,图5是本申请实施例提供的一种无监督训练方式的示意图。如图5所示,文本识别模型可以包括特征提取组件、图像编码组件以及离散编码组件,通过特征提取组件可以输出图像数据对应的图像表征信息Z={z 1,z 2,z 3,…,z T},该图像表征信息Z可以输入两个分支,一个分支为图像编码组件,一个分支为离散编码组件,通过图像编码组件可以输出图像表征信息Z对应的语义编码信息C={c 1,c 2,c 3,…,c T},通过离散编码组件可以输出图像表征信息Z对应的离散编码信息Q={q 1,q 2,q 3,…,q T},该离散编码信息Q可以作为无监督训练的拟合目标;对于语义编码信息C中的任意一个语义编码特征c i,可以将离散编码信息Q中的离散编码特征q i标记为正样本,将离散编码特征q j标记为负样本,根据正负样本可以计算损失、优化文本识别模型的网络参数。很显然,在无监督训练中,进行训练的图像数据为未携带标签信息的图像,可以降低图像数据的标注成本。
可选的,在无监督训练得到的模型基础上,可以利用少量的标注图像数据微调网络模型(该微调过程可以称为监督训练过程),以增强目标文本识别模型的鲁棒性,进而提高目标文本识别模型的识别效果。在对模型进行微调的过程中,该文本识别模型还可以包括分类网络层(也可以称为前馈网络);根据语义编码信息、离散编码信息以及码表索引置信度,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为候选文本识别模型,即将前述无监督训练完成的特征提取组件和图像编码组件称为候选文本识别模型。计算机设备可以获取包含文本信息的标注图像数据,将标注图像数据输入至候选文本识别模型,该标注图像数据可以携带标签信息;根据候选文本识别模型中的参数修正后的特征提取组件,以及参数修正后的图像编码组件,输出标注图像数据对应的标注语义信息;进而可以根据分类网络层,对标注语义信息进行预测,得到与标注图像数据中的文本信息相关联的预测文本识别结果;根据标签信息与预测文本识别结果之间的误差,对候选文本识别模型和分类网络层的网络参数进行修正,将参数修正后的候选文本识别模型和参数修正后的分类网络层确定为目标文本识别模型。换言之,在监督训练中,可以将标注图像数据的标签信息作为候选文本识别模型的期望输出结果,而候选文本识别模型输出的预测文本识别结果可以理解为实际输出结果,通过计算期望输出结果与实际输出结果之间的误差,在候选文本识别模型中进行反向传播,以对候选文本识别模型的网络参数以及分类网络层的网络参数 进行更新,最终得到训练完成的目标文本识别模型,此时的目标文本识别模型是指经过无监督训练和有监督微调之后所得到的网络模型。其中,分类网络层可以包括但不限于:softmax(一种多分类器)、人工神经网络(Artificial Neural Networks,ANNs)、支持向量机(Support Vector Machines,SVM)。
请一并参见图6,图6是本申请实施例提供的一种监督训练方式的示意图。如图6所示,在无监督训练完成之后,还可以使用一部分标注数据对无监督训练的模型进行有监督微调(也可以称为监督训练),即监督训练是在无监督训练之后进行的,对于输入候选文本识别模型(无监督训练完成后所得到的文本识别模型)的标注图像,在特征提取组件和图像编码组件中对标注图像的处理过程(前向计算过程)与前述无监督训练过程中对无标注图像的处理过程相同,这里不再进行赘述。通过候选文本识别模型中的图像编码组件输出标注语义信息后,可以将标注语义信息输入至前馈网络(可以理解为分类网络层),通过该前馈网络可以输出标注图像对应的预测文本识别结果,其中该前馈网络的输入为图像编码组件输出的标注语义信息,该前馈网络的输出为一个向量,该向量的维度等于文本字符的类别数量,如该候选文本识别适用于识别300种文本字符类别,则该前馈网络的输出可以为一个维度为300的向量。该前馈网络的输出向量可以作为标注图像在候选文本识别模型中的预测文本识别结果,进而可以计算标注图像的标签信息与预测文本识别结果之间的误差计算损失、优化候选文本识别模型的网络参数,以得到最终训练完成的目标文本识别模型。
可选的,在训练得到目标文本识别模型之后,可以将该目标文本识别模型应用在任何文本识别场景中,例如寄送快递时的收货地址识别场景(通过利用目标文本识别模型对包含地址信息的图片进行识别,以获取图片中的地址内容,并将识别到的地址信息自动填入收货地址所在输入区域,可以简化收货地址输入操作,提高寄件速度)、业务推广识别场景(通过利用目标文本识别模型对广告图片进行识别,以获取广告图片中的广告文本内容)、文件资料录入场景(当书面文件中的文字资料需要录入到电子系统中时,可以对书面文件进行扫描或拍照,进而利用目标文本识别模型对扫描或拍照所得到的图片进行识别,以获取图片中的文件资料内容,并将识别到的文件资料内容自动录入到电子系统中进行保存,可以减少人力资源,进而提高文件资料内容的录入效率)、账号录入场景(当需要输入银行卡账号或者身份证号时,可以对银行卡或身份证进行拍照,并利用目标文本识别模型对拍摄的照片进行识别,以自动输入银行卡账号或者身份证号,避免人工输入时出现错误)、内容审核场景(通过目标文本识别模型对图片中所包含的文本信息进行识别,自动进行内容审核,减少人为工作量,提高审核效率)、图片搜索场景(通过目标文本识别模型对图片中所包含的文本信息进行识别,将识别到的文本内容作为关键词进行搜索)等。
举例来说,将目标文本识别模型应用在业务推广识别场景时,计算机设备可以获取包含文本信息的业务推广图片,将包含文本信息的业务推广图片(例如,可以为广告图片)确定为待处理图像数据,将待处理图像数据输入至目标文本识别模型;通过目标文本识别模型中的参数修正后的特征提取组件,输出待处理图像数据对应的推广表征信息;通过目标文本识别模型中的参数修正后的图像编码组件,输出推广表征信息对应的推广文本语义信息;根据目标文本识别模型中的分类网络层,对推广文本语义信息进行预测,得到推广文本语义信息对应的推广文本内容,即对业务推广图片进行文本识别,以输出业务推广图片中所包含的推广文本内容。
请一并参见图7,图7是本申请实施例提供的一种文本识别场景图。如图7所示的用户终端30a可以为上述计算机设备,该用户终端30a可以为用户小A所使用的终端设备,该用户终端30a中安装有搜索应用。如图7所示的当前显示界面为搜索应用的主页面,在该主页面中可以显示搜索框,在该搜索框中可以包括照相入口30b,当用户小A对搜索框中的照相入口30b执行触发操作时,用户终端30a可以响应针对照相入口30b的触发操作,启动用户终端30a中的照相机,将用户终端30a靠近实物广告单30c进行拍摄,在用户小A拍摄了照片30d且对控件30e执行触发操作时,用户终端30a可以利用前述预先训练完成的目标识别模型对照片30d进行文本识别,输出照片30d中所包含的文本内容30e,该文本内容30e包括:“2020海洋日”、“限量版精华面霜”、“品牌A”。
进一步地,在识别得到文本内容30e后,以文本内容30e作为搜索关键词进行检索,可以在搜索应用中检索与上述文本内容30e相关联的检索结果,并在搜索应用的搜索页面30f中显示检索结果,该检索结果可以按照与上述文本内容30e的关联度,在搜索页面30f中进行排序显示,如该检索结果可以包括结果展示栏30g,当用户小A对某个结果展示栏(例如,结果展示栏30g)中的内容感兴趣时,可以点击该结果展示栏查看内容详情。
请一并参加图8,图8是本申请实施例提供的一种文本识别场景图。如图8所示的用户终端40a可以为上述计算机设备,该用户终端40a可以为用户小A所使用的终端设备,该用户终端30a中集成有快递寄件应用(或者快递寄件小程序)。当用户小A想要给用户小B寄快递时,可以打开快递寄件应用(或快递寄件小程序)进入寄件信息页面40b,在该寄件信息页面40b中要求用户小A填写寄件人姓名、寄件人联系方式、收件人姓名、收件人联系方式、收件人收货地址、邮政编码等信息。若用户小A对用户小B的收 货地址不太熟悉,则用户小A需要预先记下用户小B的地址在纸上或其余地方,再在寄件信息页面40b中手动输入收货地址,或者在用户终端40a反复切换显示页面进行地址输入。
可选的,当寄件信息页面40b包括图片识别控件40c时,可以对图片识别控件40c执行触发操作,此时的用户终端40a可以响应针对图片识别控件40c的触发操作,打开用户终端40a中的本地图库应用,在该图库应用中选择包含用户小B收货地址的图片40d,并对确认控件执行触发操作,用户终端40a可以响应针对确认控件的触发操作,利用前述预先训练完成的目标识别模型对图片40d进行文本识别,输出图片40d中所包含的文本内容,并将识别出的文本内容与寄件信息页面40b中的关键词进行匹配,将匹配到的文本内容自动填入对应输入框,如在收件人那一栏自动填入“小B”,在收件人联系方式那一栏自动填入“130xxxxxx14”,在收件地址那一栏自动填入“xx省xx市xx县…”,用户小A在检查没问题后,可以确认提交信息,可以提高用户的寄件效率。
本申请实施例中,该文本识别模型可以包括特征提取组件、图像编码组件以及离散编码组件;通过特征提取组件可以获取图像数据的图像表征信息,将图像表征信息通过图像编码组件可以得到语义编码信息,将图像表征信息通过离散编码组件可以得到离散编码信息,进而可以通过语义编码信息与离散编码信息之间的编码相似度,对文本识别模型的网络参数进行修正,也就是说,该离散编码信息可以作为文本识别模型在训练过程中的拟合目标,在上述训练过程中无需使用图像数据的标注信息,可以降低数据的标注成本;由于未标注的图像数据具有数据量大,覆盖范围广等多样性特点,直接使用无标注的图像数据进行训练,可以提高目标文本识别模型的泛化能力,从而提高目标文本识别模型的识别效果,并且可以提高目标文本识别模型的适用性。
请参见图9,图9是本申请实施例提供的一种图像数据处理装置的结构示意图。可以理解地,图像数据处理装置可以是应用于计算机设备中的一个计算机程序(包括程序代码),例如该图像数据处理装置可以为一个OCR文字识别应用软件,该图像数据处理装置可以用于执行本申请实施例提供的方法中的相应步骤。如图9所示,图像数据处理装置1可以包括:特征提取模块11,语义编码模块12,离散编码模块13,参数修正模块14;
特征提取模块11,用于将包含文本信息的图像数据输入至文本识别模型,根据文本识别模型中的特征提取组件,获取图像数据对应的图像表征信息;
语义编码模块12,用于根据文本识别模型中的图像编码组件,对图像表征信息进行编码,得到图像表征信息对应的语义编码信息;语义编码信息与图像数据中的文本信息相关联;
离散编码模块13,用于根据文本识别模型的离散编码组件中所包含的码表,获取图像表征信息对应的离散编码信息;码表包括用于表征文本特征的可学习编码向量,离散编码信息用于作为无监督学习的拟合目标;
参数修正模块14,用于根据语义编码信息与离散编码信息之间的编码相似度,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型;目标文本识别模型用于识别待处理图像数据中的文本信息。
其中,特征提取模块11,语义编码模块12,离散编码模块13,参数修正模块14的具体功能实现方式可以参见上述图3所对应实施例中的步骤S101-步骤S104,这里不再进行赘述。
在一些可行的实施方式中,离散编码模块13,用于根据文本识别模型的离散编码组件中所包含的码表,获取图像表征信息对应的码表索引置信度;码表索引置信度是指采用可学习编码向量表示图像表征信息的可靠度;根据码表索引置信度,在码表中获取图像表征信息对应的离散编码信息。
在一些可行的实施方式中,图像表征信息包括T个图像表征特征,码表包括V个可学习编码向量,T和V均为正整数;
离散编码模块13可以包括:码表获取单元131,置信度获取单元132,编码向量选取单元133,离散特征确定单元134;
码表获取单元131,用于获取文本识别模型的离散编码组件中所包含的码表,在图像表征信息中的图像表征特征z i;i为小于或等于T的正整数;
置信度获取单元132,用于获取图像表征特征z i分别与V个可学习编码向量之间的码表索引置信度;
编码向量选取单元133,用于在V个可学习编码向量中,将最大的码表索引置信度所对应的可学习编码向量确定为目标编码向量;
离散特征确定单元134,用于根据目标编码向量确定图像表征特征z i对应的离散编码特征q i,将T个图像表征特征分别对应的离散编码特征组成离散编码信息。
其中,码表获取单元131,置信度获取单元132,编码向量选取单元133,离散特征确定单元134的具体功能实现方式可以参见上述图3所对应实施例中的步骤S103,这里不再进行赘述。
在一些可行的实施方式中,置信度获取单元13可以包括:随机数获取子单元131,索引置信度获取子 单元132;
随机数获取子单元131,用于获取图像表征特征z i中的特征值所对应的分布随机数,将图像表征特征z i中的特征值与分布随机数进行相加,得到图像表征特征z i对应的候选表征特征;
索引置信度获取子单元132,用于根据候选表征特征中的特征值所对应的指数值,获取候选表征特征分别与V个可学习编码向量之间的码表索引置信度。
其中,随机数获取子单元131,索引置信度获取子单元132的具体功能实现方式可以参见上述图3所对应实施例中的步骤S103,这里不再进行赘述。
在一些可行的实施方式中,码表的数量为G个,每个码表均对应一个目标编码向量,G为正整数;
离散特征确定单元134可以包括:拼接子单元1341,网络输出子单元1342;
拼接子单元1341,用于对G个码表中的目标编码向量进行拼接,得到图像表征特征z i对应的联合特征;
网络输出子单元1342,用于将联合特征输入至全连接网络层,根据全连接网络层中的权重矩阵,输出图像表征特征z i对应的离散编码特征q i
其中,拼接子单元1341,网络输出子单元1342的具体功能实现方式可以参见上述图3所对应实施例中的步骤S103,这里不再进行赘述。
在一些可行的实施方式中,语义编码信息包括T个语义编码特征,离散编码信息包括T个离散编码特征,编码相似度包括第一相似度和第二相似度,T为正整数;
参数修正模块14可以包括:正负样本确定单元141,模型损失确定单元142,网络参数修正单元143;
正负样本确定单元141,用于在语义编码信息中获取语义编码特征c i,将离散编码信息中的离散编码特征q i,确定为语义编码特征c i的正样本,将离散编码信息中的离散编码特征q j,确定为语义编码特征c i的负样本;i和j均为小于或等于T的正整数,且i和j不相等;
模型损失确定单元142,用于根据语义编码特征c i与正样本之间的第一相似度、语义编码特征c i与负样本之间的第二相似度,确定文本识别模型对应的模型损失函数;
网络参数修正单元143,用于根据模型损失函数,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型。
其中,正负样本确定单元141,模型损失确定单元142,网络参数修正单元143的具体功能实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
在一些可行的实施方式中,模型损失确定单元142可以包括:对比损失确定子单元1421,多样性损失确定子单元1422,损失连接子单元1423;
对比损失确定子单元1421,用于获取语义编码特征c i与正样本之间的第一相似度,获取语义编码特征c i与负样本之间的第二相似度,根据第一相似度和第二相似度确定对比损失函数;
多样性损失确定子单元1422,用于根据图像表征信息对应的码表索引置信度,获取码表索引置信度对应的对数值,根据对数值和码表索引置信度之间的乘积,确定多样性损失函数;
损失连接子单元1423,用于根据对比损失函数和多样性损失函数,确定初始文本识别模型对应的模型损失函数。
其中,对比损失确定子单元1421,多样性损失确定子单元1422,损失连接子单元1423的具体功能实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
在一些可行的实施方式中,网络参数修正单元143可以包括:训练子单元1431,模型确定子单元1432;
训练子单元1431,用于根据模型损失函数,对特征提取组件的网络参数、图像编码组件的网络参数以及离散编码组件中的码表进行修正;
模型确定子单元1432,用于当文本识别模型对应的训练次数满足训练终止条件时,将满足训练终止条件的特征提取组件和图像编码组件,确定为目标文本识别模型。
其中,训练子单元1431,模型确定子单元1432的具体功能实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
在一些可行的实施方式中,文本识别模型还包括分类网络层;
参数修正模块14可以包括:无监督训练单元144,标注数据获取单元145,语义信息输出单元146,标注数据预测单元147,监督微调单元148;
无监督训练单元144,用于根据语义编码信息和离散编码信息,对文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为候选文本识别模型;
标注数据获取单元145,用于获取包含文本信息的标注图像数据,将标注图像数据输入至候选文本识别模型;标注图像数据携带标签信息;
语义信息输出单元146,用于根据候选文本识别模型中的参数修正后的特征提取组件,以及参数修正 后的图像编码组件,输出标注图像数据对应的标注语义信息;
标注数据预测单元147,用于根据分类网络层,对标注语义信息进行预测,得到与标注图像数据中的文本信息相关联的预测文本识别结果;
监督微调单元148,用于根据标签信息与预测文本识别结果之间的误差,对候选文本识别模型和分类网络层的网络参数进行修正,将参数修正后的候选文本识别模型和参数修正后的分类网络层确定为目标文本识别模型。
其中,无监督训练单元144,标注数据获取单元145,语义信息输出单元146,标注数据预测单元147,监督微调单元148的具体功能实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
在一些可行的实施方式中,特征提取组件包括L个网络层,L为正整数;
特征提取模块11可以包括:输出结果联合单元111,图像表征获取单元112;
输出结果联合单元111,用于在文本识别模型的特征提取组件中,获取图像数据在前L-1个网络层中的输出结果,将前L-1个网络层所对应的输出结果组合为联合输出结果;
图像表征获取单元112,用于根据特征提取组件中的第L个网络层所对应的权重矩阵,得到联合输出结果对应的目标输出结果,将目标输出结果确定为图像数据对应的图像表征信息。
其中,输出结果联合单元111,图像表征获取单元112的具体功能实现方式可以参见上述图3所对应实施例中的步骤S101,这里不再进行赘述。
在一些可行的实施方式中,语义编码模块12可以包括:注意力层计算单元121,文本位置编码单元122;
注意力层计算单元121,用于在文本识别模型的图像编码组件中,根据图像编码组件的自注意力层所对应的权重矩阵,对图像表征信息进行乘积运算,得到图像表征信息对应的注意力输出向量;
文本位置编码单元122,用于根据图像编码组件中的编码层,对注意力输出向量进行文本位置编码,得到图像表征信息对应的语义编码信息。
其中,注意力层计算单元121,文本位置编码单元122的具体功能实现方式可以参见上述图3所对应实施例中的步骤S101,这里不再进行赘述。
在一些可行的实施方式中,该图像数据处理装置还可以包括:待处理数据获取模块15,推广特征提取模块16,推广文本语义获取模块17,文本识别结果获取模块18;
待处理数据获取模块15,用于将包含文本信息的业务推广图片确定为待处理图像数据,将待处理图像数据输入至目标文本识别模型;
推广特征提取模块16,用于通过目标文本识别模型中的参数修正后的特征提取组件,输出待处理图像数据对应的推广表征信息;
推广文本语义获取模块17,用于通过目标文本识别模型中的参数修正后的图像编码组件,输出推广表征信息对应的推广文本语义信息;
文本识别结果获取模块18,用于根据目标文本识别模型中的分类网络层,对推广文本语义信息进行预测,得到推广文本语义信息对应的推广文本内容。
其中,待处理数据获取模块15,推广特征提取模块16,推广文本语义获取模块17,文本识别结果获取模块18的具体功能实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
本申请实施例中,该文本识别模型可以包括特征提取组件、图像编码组件以及离散编码组件;通过特征提取组件可以获取图像数据的图像表征信息,将图像表征信息通过图像编码组件可以得到语义编码信息,将图像表征信息通过离散编码组件可以得到离散编码信息,进而可以通过语义编码信息与离散编码信息之间的编码相似度,对文本识别模型的网络参数进行修正,也就是说,该离散编码信息可以作为文本识别模型在训练过程中的拟合目标,在上述训练过程中无需使用图像数据的标注信息,可以降低数据的标注成本;由于未标注的图像数据具有数据量大,覆盖范围广等多样性特点,直接使用无标注的图像数据进行训练,可以提高目标文本识别模型的泛化能力,从而提高目标文本识别模型的识别效果,并且可以提高目标文本识别模型的适用性。
请参见图10,图10是本申请实施例提供的一种计算机设备的结构示意图。如图10所示,该计算机设备1000可以包括:处理器1001,网络接口1004和存储器1005,此外,上述计算机设备1000还可以包括:用户接口1003,和至少一个通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。其中,用户接口1003可以包括显示屏(Display)、键盘(Keyboard),可选用户接口1003还可以包括标准的有线接口、无线接口。可选的,网络接口1004可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。可选的,存储器1005还可以是至少一个位于远离前述处理器1001的存储装置。如图10所示,作为一种计算机可读存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设 备控制应用程序。
在如图10所示的计算机设备1000中,网络接口1004可提供网络通讯功能;而用户接口1003主要用于为用户提供输入的接口;而处理器1001可以用于调用存储器1005中存储的设备控制应用程序,以实现上述图像数据处理方法。
应当理解,本申请实施例中所描述的计算机设备1000可执行前文图3所对应实施例中对图像数据处理方法的描述,也可执行前文图9所对应实施例中对图像数据处理装置1的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
此外,这里需要指出的是:本申请实施例还提供了一种计算机可读存储介质,且计算机可读存储介质中存储有前文提及的图像数据处理装置1所执行的计算机程序,且计算机程序包括程序指令,当处理器执行程序指令时,能够执行前文图3所对应实施例中对图像数据处理方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述。作为示例,程序指令可被部署在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行,分布在多个地点且通过通信网络互连的多个计算设备可以组成区块链系统。
此外,需要说明的是:本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或者计算机程序可以包括计算机指令,该计算机指令可以存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器可以执行该计算机指令,使得该计算机设备执行前文图3所对应实施例中对图像数据处理方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机程序产品或者计算机程序实施例中未披露的技术细节,请参照本申请方法实施例的描述。
需要说明的是,对于前述的各个方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某一些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减。
本申请实施例装置中的模块可以根据实际需要进行合并、划分和删减。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,计算机程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,存储介质可为磁碟、光盘、只读存储器(Read-Only Memory,ROM)或随机存储器(Random Access Memory,RAM)等。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (15)

  1. 一种图像数据处理方法,由计算机设备执行,包括:
    将包含文本信息的图像数据输入至文本识别模型,根据所述文本识别模型中的特征提取组件,获取所述图像数据对应的图像表征信息;
    根据所述文本识别模型中的图像编码组件,对所述图像表征信息进行编码,得到所述图像表征信息对应的语义编码信息;所述语义编码信息与所述图像数据中的文本信息相关联;
    根据所述文本识别模型的离散编码组件中所包含的码表,获取所述图像表征信息对应的离散编码信息;所述码表包括用于表征文本特征的可学习编码向量,所述离散编码信息用于作为无监督学习的拟合目标;
    根据所述语义编码信息与所述离散编码信息之间的编码相似度,对所述文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型;所述目标文本识别模型用于识别待处理图像数据中的文本信息。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述文本识别模型的离散编码组件中所包含的码表,获取所述图像表征信息对应的离散编码信息,包括:
    根据所述文本识别模型的离散编码组件中所包含的码表,获取所述图像表征信息对应的码表索引置信度;所述码表索引置信度是指采用所述可学习编码向量表示所述图像表征信息的可靠度;
    根据所述码表索引置信度,在所述码表中获取所述图像表征信息对应的离散编码信息。
  3. 根据权利要求2所述的方法,其特征在于,所述图像表征信息包括T个图像表征特征,所述码表包括V个可学习编码向量,T和V均为正整数;
    所述根据所述文本识别模型的离散编码组件中所包含的码表,获取所述图像表征信息对应的码表索引置信度,根据所述码表索引置信度,在所述码表中获取所述图像表征信息对应的离散编码信息,包括:
    获取所述文本识别模型的离散编码组件中所包含的码表,在所述图像表征信息中的图像表征特征z i;i为小于或等于T的正整数;
    获取所述图像表征特征z i分别与所述V个可学习编码向量之间的码表索引置信度;
    在所述V个可学习编码向量中,将最大的码表索引置信度所对应的可学习编码向量确定为目标编码向量;
    根据所述目标编码向量确定所述图像表征特征z i对应的离散编码特征q i,将所述T个图像表征特征分别对应的离散编码特征组成所述离散编码信息。
  4. 根据权利要求3所述的方法,其特征在于,所述获取所述图像表征特征z i分别与所述V个可学习编码向量之间的码表索引置信度,包括:
    获取所述图像表征特征z i中的特征值所对应的分布随机数,将所述图像表征特征z i中的特征值与所述分布随机数进行相加,得到所述图像表征特征z i对应的候选表征特征;
    根据所述候选表征特征中的特征值所对应的指数值,获取所述候选表征特征分别与所述V个可学习编码向量之间的码表索引置信度。
  5. 根据权利要求3所述的方法,其特征在于,所述码表的数量为G个,每个码表均对应一个目标编码向量,G为正整数;
    所述根据所述目标编码向量确定所述图像表征特征z i对应的离散编码特征q i,包括:
    对G个码表中的目标编码向量进行拼接,得到所述图像表征特征z i对应的联合特征;
    将所述联合特征输入至全连接网络层,根据所述全连接网络层中的权重矩阵,输出所述图像表征特征z i对应的离散编码特征q i
  6. 根据权利要求1所述的方法,其特征在于,所述语义编码信息包括T个语义编码特征,所述离散编码信息包括T个离散编码特征,所述编码相似度包括第一相似度和第二相似度,T为正整数;
    所述根据所述语义编码信息与所述离散编码信息之间的编码相似度,对所述文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型,包括:
    在所述语义编码信息中获取语义编码特征c i,将所述离散编码信息中的离散编码特征q i,确定为所述 语义编码特征c i的正样本,将所述离散编码信息中的离散编码特征q j,确定为所述语义编码特征c i的负样本;i和j均为小于或等于T的正整数,且i和j不相等;
    根据所述语义编码特征c i与所述正样本之间的第一相似度、所述语义编码特征c i与所述负样本之间的第二相似度,确定所述文本识别模型对应的模型损失函数;
    根据所述模型损失函数,对所述文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为所述目标文本识别模型。
  7. 根据权利要求6所述的方法,其特征在于,所述根据所述语义编码特征c i与所述正样本之间的第一相似度、所述语义编码特征c i与所述负样本之间的第二相似度,确定所述文本识别模型对应的模型损失函数,包括:
    获取所述语义编码特征c i与所述正样本之间的第一相似度,获取所述语义编码特征c i与所述负样本之间的第二相似度,根据所述第一相似度和所述第二相似度确定对比损失函数;
    根据所述图像表征信息对应的码表索引置信度,获取所述码表索引置信度对应的对数值,根据所述对数值和所述码表索引置信度之间的乘积,确定多样性损失函数;
    根据所述对比损失函数和所述多样性损失函数,确定所述文本识别模型对应的模型损失函数。
  8. 根据权利要求6所述的方法,其特征在于,所述根据所述模型损失函数,对所述文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为所述目标文本识别模型,包括:
    根据所述模型损失函数,对所述特征提取组件的网络参数、所述图像编码组件的网络参数以及所述离散编码组件中的码表进行修正;
    当所述文本识别模型对应的训练次数满足训练终止条件时,将满足训练终止条件的特征提取组件和图像编码组件,确定为所述目标文本识别模型。
  9. 根据权利要求1所述的方法,其特征在于,所述文本识别模型还包括分类网络层;
    所述根据所述语义编码信息与所述离散编码信息之间的编码相似度,对所述文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型,包括:
    根据所述语义编码信息和所述离散编码信息,对所述文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为候选文本识别模型;
    获取包含文本信息的标注图像数据,将所述标注图像数据输入至所述候选文本识别模型;所述标注图像数据携带标签信息;
    根据所述候选文本识别模型中的所述参数修正后的特征提取组件,以及所述参数修正后的图像编码组件,输出所述标注图像数据对应的标注语义信息;
    根据所述分类网络层,对所述标注语义信息进行预测,得到与所述标注图像数据中的文本信息相关联的预测文本识别结果;
    根据所述标签信息与所述预测文本识别结果之间的误差,对所述候选文本识别模型和所述分类网络层的网络参数进行修正,将参数修正后的候选文本识别模型和参数修正后的分类网络层确定为目标文本识别模型。
  10. 根据权利要求1所述的方法,其特征在于,所述特征提取组件包括L个网络层,L为正整数;
    所述根据所述文本识别模型中的特征提取组件,获取所述图像数据对应的图像表征信息,包括:
    在所述文本识别模型的特征提取组件中,获取所述图像数据在前L-1个网络层中的输出结果,将所述前L-1个网络层所对应的输出结果组合为联合输出结果;
    根据所述特征提取组件中的第L个网络层所对应的权重矩阵,得到所述联合输出结果对应的目标输出结果,将所述目标输出结果确定为所述图像数据对应的图像表征信息。
  11. 根据权利要求1所述的方法,其特征在于,所述根据所述文本识别模型中的图像编码组件,对所述图像表征信息进行编码,得到所述图像表征信息对应的语义编码信息,包括:
    在所述文本识别模型的图像编码组件中,根据所述图像编码组件的自注意力层所对应的权重矩阵,对所述图像表征信息进行乘积运算,得到所述图像表征信息对应的注意力输出向量;
    根据所述图像编码组件中的编码层,对所述注意力输出向量进行文本位置编码,得到所述图像表征信息对应的语义编码信息。
  12. 根据权利要求1所述的方法,其特征在于,还包括:
    将包含文本信息的业务推广图片确定为所述待处理图像数据,将所述待处理图像数据输入至所述目标文本识别模型;
    通过所述目标文本识别模型中的所述参数修正后的特征提取组件,输出所述待处理图像数据对应的推广表征信息;
    通过所述目标文本识别模型中的所述参数修正后的图像编码组件,输出所述推广表征信息对应的推广文本语义信息;
    根据所述目标文本识别模型中的分类网络层,对所述推广文本语义信息进行预测,得到所述推广文本语义信息对应的推广文本内容。
  13. 一种图像数据处理装置,包括:
    特征提取模块,用于将包含文本信息的图像数据输入至文本识别模型,根据所述文本识别模型中的特征提取组件,获取所述图像数据对应的图像表征信息;
    语义编码模块,用于根据所述文本识别模型中的图像编码组件,对所述图像表征信息进行编码,得到所述图像表征信息对应的语义编码信息;所述语义编码信息与所述图像数据中的文本信息相关联;
    离散编码模块,用于根据所述文本识别模型的离散编码组件中所包含的码表,获取所述图像表征信息对应的离散编码信息;所述码表包括用于表征文本特征的可学习编码向量,所述离散编码信息用于作为无监督学习的拟合目标;
    参数修正模块,用于根据所述语义编码信息与所述离散编码信息之间的编码相似度,对所述文本识别模型的网络参数进行修正,将参数修正后的特征提取组件和参数修正后的图像编码组件确定为目标文本识别模型;所述目标文本识别模型用于识别待处理图像数据中的文本信息。
  14. 一种计算机设备,包括存储器和处理器;
    所述存储器与所述处理器相连,所述存储器用于存储计算机程序,所述处理器用于调用所述计算机程序,以使得所述计算机设备执行权利要求1-12任一项所述的方法。
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序适于由处理器加载并执行,以使得具有所述处理器的计算机设备执行权利要求1-12任一项所述的方法。
PCT/CN2021/107653 2021-05-12 2021-07-21 图像数据处理方法、装置、设备以及存储介质 WO2022236959A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21941528.8A EP4339831A1 (en) 2021-05-12 2021-07-21 Image data processing method and apparatus, and device and storage medium
US18/306,208 US20230260304A1 (en) 2021-05-12 2023-04-24 Image data processing method, apparatus and device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110518209.7A CN113762050B (zh) 2021-05-12 2021-05-12 图像数据处理方法、装置、设备以及介质
CN202110518209.7 2021-05-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/306,208 Continuation US20230260304A1 (en) 2021-05-12 2023-04-24 Image data processing method, apparatus and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022236959A1 true WO2022236959A1 (zh) 2022-11-17

Family

ID=78787063

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107653 WO2022236959A1 (zh) 2021-05-12 2021-07-21 图像数据处理方法、装置、设备以及存储介质

Country Status (4)

Country Link
US (1) US20230260304A1 (zh)
EP (1) EP4339831A1 (zh)
CN (1) CN113762050B (zh)
WO (1) WO2022236959A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645700A (zh) * 2023-07-27 2023-08-25 腾讯科技(深圳)有限公司 特征提取模型处理方法、装置和特征提取方法、装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449345B (zh) * 2022-02-08 2023-06-23 Tencent Technology (Shenzhen) Company Limited Video processing method, apparatus, device and storage medium
CN117194605B (zh) * 2023-11-08 2024-01-19 Central South University Hash encoding method, terminal and medium for missing multimodal medical data

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590192B (zh) * 2017-08-11 2023-05-05 Shenzhen Tencent Computer Systems Company Limited Mathematical processing method, apparatus, device and storage medium for text questions
CN109492679A (zh) * 2018-10-24 2019-03-19 Hangzhou Dianzi University Character recognition method based on attention mechanism and connectionist temporal classification loss
CN109993109A (zh) * 2019-03-29 2019-07-09 Chengdu University of Information Technology Image character recognition method
CN111753822B (zh) * 2019-03-29 2024-05-24 Beijing SenseTime Technology Development Co., Ltd. Text recognition method and apparatus, electronic device and storage medium
CN111797834B (zh) * 2020-05-28 2021-06-15 South China University of Technology Text recognition method and apparatus, computer device and storage medium
CN112016543A (zh) * 2020-07-24 2020-12-01 Huawei Technologies Co., Ltd. Text recognition network, neural network training method and related device
CN112257716A (zh) * 2020-12-08 2021-01-22 Zhejiang Lab Scene text recognition method based on scale-adaptive and direction attention network
CN112598000A (zh) * 2021-03-03 2021-04-02 Beijing Century TAL Education Technology Co., Ltd. Question recognition method and apparatus, electronic device and computer storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200142994A1 (en) * 2018-11-07 2020-05-07 Adobe Inc. Guided content discovery in visual search
CN112148870A (zh) * 2019-06-26 2020-12-29 Alibaba Group Holding Limited Abstract generation method and apparatus, electronic device and computer-readable storage medium
CN110569846A (zh) * 2019-09-16 2019-12-13 Beijing Baidu Netcom Science and Technology Co., Ltd. Image character recognition method, apparatus, device and storage medium
CN112633290A (zh) * 2021-03-04 2021-04-09 Beijing Century TAL Education Technology Co., Ltd. Text recognition method, electronic device and computer-readable medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645700A (zh) * 2023-07-27 2023-08-25 Tencent Technology (Shenzhen) Company Limited Feature extraction model processing method and apparatus, and feature extraction method and apparatus
CN116645700B (zh) * 2023-07-27 2023-11-03 Tencent Technology (Shenzhen) Company Limited Feature extraction model processing method and apparatus, and feature extraction method and apparatus

Also Published As

Publication number Publication date
CN113762050B (zh) 2024-05-24
CN113762050A (zh) 2021-12-07
US20230260304A1 (en) 2023-08-17
EP4339831A1 (en) 2024-03-20

Similar Documents

Publication Publication Date Title
WO2022236959A1 (zh) Image data processing method, apparatus, device and storage medium
US11657602B2 (en) Font identification from imagery
US9501724B1 (en) Font recognition and font similarity learning using a deep neural network
CN111931664A (zh) Method and apparatus for processing mixed-pasted bill images, computer device and storage medium
CN111738169B (zh) Handwritten formula recognition method based on an end-to-end network model
CN113792741B (zh) Character recognition method, apparatus, device and storage medium
CN113051914A (zh) Method and apparatus for extracting hidden enterprise labels based on multi-feature dynamic profiling
US20230035366A1 (en) Image classification model training method and apparatus, computer device, and storage medium
CN112712086A (zh) Data processing method and apparatus, computer device and storage medium
CN114330514B (zh) Data reconstruction method and system based on deep features and gradient information
Kawabe et al. Application of deep learning to classification of braille dot for restoration of old braille books
CN110705572B (zh) Image recognition method
CN113806536B (zh) Text classification method and apparatus, device, medium and product
CN114913339A (zh) Training method and apparatus for a feature map extraction model
CN114692715A (zh) Sample annotation method and apparatus
Bastida et al. Multimodal object recognition using deep learning representations extracted from images and smartphone sensors
CN116168398B (zh) Examination paper review method, apparatus and device based on image recognition
CN117058437B (zh) Flower classification method, system, device and medium based on knowledge distillation
US20220237692A1 (en) Method and system for providing financial process automation to financial organization
Zhumakhan REAL-TIME FACE RECOGNITION USING A DEEP LEARNING MODEL
CN117635236A (zh) Method, apparatus, device and medium for generating promotional posters for financial services
CN116935130A (zh) Joint image classification method and apparatus based on ResNet and OCR, electronic device and medium
CN118035174A (zh) Document management processing method and apparatus
Umarhayat et al. Automation of College Work using Artificial Intelligence
CN115617951A (zh) Contract information extraction method and apparatus, computer device, medium and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941528

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2021941528

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021941528

Country of ref document: EP

Effective date: 20231212