WO2020199730A1 - Text recognition method and apparatus, electronic device, and storage medium - Google Patents

Text recognition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020199730A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
semantic vector
neural network
convolutional neural
target semantic
Prior art date
Application number
PCT/CN2020/072804
Other languages
English (en)
French (fr)
Inventor
Liu Xuebo (刘学博)
Original Assignee
Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SenseTime Technology Development Co., Ltd.
Priority to JP2020561646A (patent JP7153088B2)
Priority to SG11202010916SA
Publication of WO2020199730A1
Priority to US17/081,758 (patent US12014275B2)

Classifications

    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 40/30 Semantic analysis
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V 10/24 Aligning, centring, orientation detection or correction of the image
    • G06V 10/40 Extraction of image or video features
    • G06V 10/82 Image or video recognition or understanding using neural networks

Definitions

  • the present disclosure relates to computer vision technology, in particular to a text recognition method and device, electronic equipment and storage medium.
  • Text recognition in natural scenes is an important issue in the field of image understanding and image restoration.
  • Accurate text recognition can be used for, for example, picture understanding, automatic translation, guidance for blind people, robot navigation, etc.
  • Text recognition systems based on encoder-decoder frameworks usually use recurrent neural networks as the encoder and the decoder.
  • a text recognition method, which includes: performing feature extraction processing on an image to be detected to obtain a plurality of semantic vectors, wherein the plurality of semantic vectors respectively correspond to multiple characters of the text sequence in the image to be detected; and sequentially recognizing the multiple semantic vectors through a convolutional neural network to obtain the recognition result of the text sequence.
  • the accuracy of text recognition can be improved.
  • sequentially performing recognition processing on the plurality of semantic vectors to obtain the recognition result of the text sequence includes: processing prior information of the target semantic vector through the convolutional neural network to obtain the weight parameter of the target semantic vector, wherein the target semantic vector is one of the plurality of semantic vectors; and determining, according to the weight parameter and the target semantic vector, a text recognition result corresponding to the target semantic vector.
  • the target semantic vector can be weighted using the weight parameter obtained according to the prior information, and the prior information can be referred to in the process of recognizing the target semantic vector, thereby improving the recognition accuracy of the target semantic vector.
  • the prior information includes the text recognition result corresponding to the previous semantic vector of the target semantic vector and/or a start symbol.
  • processing the prior information to obtain the weight parameter of the target semantic vector includes: encoding the target semantic vector through at least one first convolutional layer in the convolutional neural network to obtain the first vector of the target semantic vector; encoding the prior information of the target semantic vector through at least one second convolutional layer in the convolutional neural network to obtain a second vector corresponding to the prior information; and determining the weight parameter based on the first vector and the second vector.
  • in this way, the prior information can be incorporated into the weight parameter to provide a basis for recognizing the target semantic vector.
  • performing encoding processing on the prior information to obtain a second vector corresponding to the prior information includes: in response to the prior information including the text recognition result corresponding to the previous semantic vector of the target semantic vector, performing word embedding processing on the text recognition result corresponding to the previous semantic vector to obtain a feature vector corresponding to the prior information, and encoding the feature vector to obtain the second vector.
  • a convolutional neural network can be used to recognize the character corresponding to the current target semantic vector according to the recognition result of the previous character, thereby avoiding the problem of uncontrollable long dependence and improving the accuracy of recognition.
  • performing encoding processing on the prior information to obtain a second vector corresponding to the prior information includes: encoding an initial vector corresponding to the start symbol in the prior information to obtain the second vector.
  • determining the text recognition result corresponding to the target semantic vector includes: obtaining an attention distribution vector corresponding to the target semantic vector based on the weight parameter and the target semantic vector; and decoding the attention distribution vector through at least one deconvolution layer in the convolutional neural network to determine a text recognition result corresponding to the target semantic vector.
  • performing feature extraction processing on the image to be detected to obtain multiple semantic vectors includes: performing feature extraction on the image to be detected to obtain feature information; and performing down-sampling processing on the feature information to obtain the multiple semantic vectors.
  • a text recognition device, which includes: an extraction module, configured to perform feature extraction processing on an image to be detected to obtain a plurality of semantic vectors, wherein the plurality of semantic vectors respectively correspond to multiple characters of the text sequence in the image to be detected; and a recognition module, configured to sequentially recognize the multiple semantic vectors through a convolutional neural network to obtain the recognition result of the text sequence.
  • an electronic device, including: a processor; and a memory for storing instructions executable by the processor, wherein the processor executes the instructions stored in the memory to implement the above text recognition method.
  • a computer-readable storage medium having computer program instructions stored thereon, where the computer program instructions, when executed by a processor, cause the processor to implement the above text recognition method.
  • Fig. 1 shows a flowchart of a text recognition method according to an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of a coding and decoding framework based on a convolutional neural network for text recognition according to an embodiment of the present disclosure
  • Figure 3 shows a block diagram of a text recognition device according to an embodiment of the present disclosure
  • Figure 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure
  • FIG. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
  • a and/or B can represent the following three situations: A alone, A and B at the same time, and B alone.
  • “Including at least one of A, B, and C” may mean including any one or more elements selected from the set consisting of A, B, and C.
  • first, second, third, etc. may be used to describe various information, these information should not be limited by these terms. These terms are only used to distinguish the same type of information from each other.
  • first information may also be referred to as second information, and similarly, the second information may also be referred to as first information.
  • the word "if" as used herein can be interpreted as "when", "upon", or "in response to".
  • Fig. 1 shows a flowchart of a text recognition method according to an embodiment of the present disclosure. As shown in Fig. 1, the method may include steps S11 and S12.
  • In step S11, feature extraction processing is performed on the image to be detected to obtain multiple semantic vectors, where the multiple semantic vectors respectively correspond to multiple characters of the text sequence in the image to be detected.
  • In step S12, the multiple semantic vectors are sequentially recognized through a convolutional neural network to obtain a recognition result of the text sequence.
  • the accuracy of text recognition can be improved.
  • the text recognition method may be executed by a terminal device.
  • Terminal devices can be user equipment (UE), mobile devices, user terminals, terminals, cellular phones, cordless phones, personal digital assistants (PDAs), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc.
  • the method can be implemented by a processor in a terminal device calling computer-readable instructions stored in a memory.
  • the terminal device may acquire the image to be detected and send the image to be detected to the server, so that the method is executed by the server.
  • step S11 may include: performing feature extraction on the image to be detected to obtain feature information; and performing down-sampling processing on the feature information to obtain the multiple semantic vectors.
  • the feature information may include, but is not limited to, feature maps or feature vectors.
  • the image to be detected may have, for example, a text sequence composed of multiple text characters.
  • the text characters of the text sequence may have a certain semantic connection relationship, that is, the text sequence may have a certain semantics.
  • a feature extraction network can be used to extract multiple semantic vectors in the image to be detected.
  • the feature extraction network may be a neural network such as a convolutional neural network.
  • Performing feature extraction processing on the image to be detected to obtain multiple semantic vectors may include: inputting the image to be detected into a feature extraction network to obtain the multiple semantic vectors.
  • the feature extraction network may obtain one or more feature matrices of the image to be detected through encoding processing or the like.
  • the dimension of the feature matrix can be M × P.
  • P can be 32, and the ratio of M to P can correspond to the aspect ratio of the image to be detected.
  • if the resolution of the image to be detected is 1024 × 768, one or more 43 × 32 feature maps can be obtained through encoding processing.
  • the feature extraction network may perform down-sampling processing on the feature matrix to obtain one or more feature vectors as semantic vectors.
  • one or more 43 × 1 feature vectors can be obtained through down-sampling processing.
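The down-sampling step described above can be sketched in a toy example. This is a hypothetical illustration only: the patent does not specify which pooling operation the down-sampling uses, so mean pooling over each row of the feature map is assumed, and a small 4 × 3 map stands in for the 43 × 32 feature map.

```python
# Hypothetical sketch of the down-sampling step: an M x P feature map is
# reduced to an M x 1 feature vector by averaging each row.
# The patent does not state the pooling operation; mean pooling is assumed.

def downsample(feature_map):
    """Collapse each row of an M x P feature map to a single value."""
    return [sum(row) / len(row) for row in feature_map]

# A toy 4 x 3 "feature map" standing in for the 43 x 32 example.
fmap = [
    [0.0, 1.0, 2.0],
    [3.0, 3.0, 3.0],
    [1.0, 2.0, 3.0],
    [4.0, 0.0, 2.0],
]
vec = downsample(fmap)
print(vec)  # [1.0, 3.0, 2.0, 2.0]
```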
  • the feature extraction network may be trained before performing feature extraction processing on the image to be detected using the feature extraction network.
  • multiple images with multiple backgrounds, resolutions, fonts, lighting conditions, size scales, tilt directions, and blur levels can be taken as first sample images to train the feature extraction network.
  • the text in the first sample image may be annotated according to the probability dictionary to obtain the annotation semantic vector of the first sample image (hereinafter may also be referred to as the true semantic vector of the first sample image).
  • the probability dictionary may include a user-defined probability distribution on the text. For example, a vector including multiple elements may be used to represent the probability distribution information of each text in the probability dictionary.
  • the text in the probability dictionary can be determined according to the probability distribution information of each text, or the probability distribution information of the text in the probability dictionary can be determined, so as to determine the semantic vector corresponding to the text.
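The probability-dictionary idea above can be sketched as follows. The dictionary contents, the one-hot annotation, and the argmax lookup are illustrative assumptions; the patent only says the dictionary associates each text with probability distribution information.

```python
# Illustrative sketch of a probability dictionary: each character has an
# index, and a probability-distribution vector over the dictionary
# identifies a character via its highest-probability element.
# The dictionary contents are invented for illustration.

prob_dict = ["A", "B", "C", "D"]  # index -> character

def char_from_distribution(dist):
    """Return the dictionary character with the highest probability."""
    best = max(range(len(dist)), key=lambda i: dist[i])
    return prob_dict[best]

def distribution_for_char(ch):
    """One-hot 'true' distribution used to annotate a sample image."""
    return [1.0 if c == ch else 0.0 for c in prob_dict]

print(char_from_distribution([0.1, 0.7, 0.1, 0.1]))  # B
print(distribution_for_char("C"))  # [0.0, 0.0, 1.0, 0.0]
```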
  • the probability distribution information of the text in the first sample image in the probability dictionary can be determined, so as to determine the semantic vector corresponding to the text in the first sample image, and the first sample image can be annotated according to the semantic vector to obtain annotation information.
  • the annotation information may represent the real semantic vector of the first sample image.
  • the first sample image may be input into the feature extraction network for processing to obtain a sample semantic vector corresponding to the first sample image.
  • the sample semantic vector is the output result of the feature extraction network for the first sample image, and the output result may have errors.
  • the network loss of the feature extraction network can be determined according to the annotation information and the output result for the first sample image.
  • the real semantic vector (ie, the annotation information) of the first sample image can be compared with the sample semantic vector (ie, the output result), and the difference between the two can be determined as the loss function of the feature extraction network.
  • the cross entropy loss function of the feature extraction network can be determined according to the labeling information and the output result.
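The cross-entropy loss between the annotation (the true distribution) and the network's output can be written out directly; the concrete distributions below are made up for illustration.

```python
# Minimal sketch of the cross-entropy loss between the annotated (true)
# distribution and the network's output distribution.
import math

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), with eps guarding log(0)."""
    return -sum(p * math.log(q + eps) for p, q in zip(true_dist, pred_dist))

true_dist = [0.0, 1.0, 0.0, 0.0]   # annotation: the character is "B"
pred_dist = [0.1, 0.7, 0.1, 0.1]   # network output
loss = cross_entropy(true_dist, pred_dist)
print(round(loss, 4))  # 0.3567
```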
  • a regularized loss function can be used as the network loss of the feature extraction network, so as to avoid overfitting of the network parameters of the feature extraction network during the iterative training process.
  • the network parameters of the feature extraction network can be adjusted according to the network loss.
  • the network parameters can be adjusted to minimize the network loss, so that the adjusted feature extraction network has a higher goodness of fit while avoiding overfitting.
  • the gradient descent method can be used to backpropagate the network loss to adjust the network parameters of the feature extraction network. For example, for a feature extraction network with tree connections between neurons, stochastic gradient descent can be used to adjust network parameters, which reduces the complexity of the adjustment process, improves its efficiency, and avoids overfitting of the adjusted network parameters.
  • training may be performed on the feature extraction network, and the feature extraction network that meets the training conditions can be used in the process of obtaining the semantic vector.
  • Training conditions may include the number of adjustments, the size of the network loss, or the convergence and divergence of the network loss.
  • a predetermined number of first sample images may be input to the feature extraction network, that is, the network parameters of the feature extraction network are adjusted a predetermined number of times. When the number of adjustments reaches the predetermined number of times, the training condition is satisfied.
  • the number of adjustments may not be limited; when the network loss decreases to a certain degree or converges within a certain threshold, the adjustment is stopped to obtain the adjusted feature extraction network, and the adjusted feature extraction network can be used in the process of obtaining the semantic vectors of the image to be detected. Training the feature extraction network by the difference between the annotation information and the output result can reduce the complexity of the loss function and increase the training speed.
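The stopping conditions described for training (a fixed number of adjustments, or convergence of the loss within a threshold) can be sketched as a small check; the step limit and tolerance values are illustrative assumptions, not values from the patent.

```python
# Sketch of the training stop conditions: halt when a preset number of
# parameter adjustments is reached, or when the loss change falls within
# a convergence threshold. Limits and tolerance are illustrative.

def should_stop(step, losses, max_steps=1000, tol=1e-4):
    if step >= max_steps:
        return True                      # adjustment count reached
    if len(losses) >= 2 and abs(losses[-1] - losses[-2]) < tol:
        return True                      # loss has converged
    return False

print(should_stop(1000, [0.5, 0.4]))     # True (step limit)
print(should_stop(10, [0.30, 0.30005]))  # True (converged)
print(should_stop(10, [0.5, 0.4]))       # False
```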
  • a graphics processing unit (Graphics Processing Unit, GPU) may be used to accelerate the convolutional neural network to improve the processing efficiency of the convolutional neural network.
  • the prior information of the target semantic vector may be processed through a convolutional neural network to obtain the weight parameter of the target semantic vector, wherein the target semantic vector is one of the multiple semantic vectors; and the text recognition result corresponding to the target semantic vector can be determined according to the weight parameter and the target semantic vector.
  • the prior information includes the text recognition result corresponding to the previous semantic vector of the target semantic vector and/or the start symbol. If the target semantic vector is the first of the multiple semantic vectors, the prior information may be the start symbol; if the target semantic vector is not the first of the multiple semantic vectors, the prior information may be the text recognition result corresponding to the previous semantic vector of the target semantic vector.
  • the target semantic vector may be encoded by at least one first convolutional layer in the convolutional neural network to obtain the first vector of the target semantic vector.
  • the prior information of the target semantic vector may be encoded by at least one second convolution layer in the convolutional neural network to obtain the second vector corresponding to the prior information. Then, the weight parameter of the target semantic vector may be determined based on the first vector and the second vector.
  • the first vector may have semantic information of the target semantic vector, and the first vector has the semantic connection relationship of the characters corresponding to the target semantic vector.
  • if the target semantic vector is the first of the multiple semantic vectors, the initial vector corresponding to the start symbol in the prior information can be encoded to obtain the second vector corresponding to the prior information.
  • the initial vector corresponding to the start symbol may be a vector whose elements are preset values (for example, all elements are 1).
  • for example, if the characters in the text sequence are A, B, C, and D, the initial vector corresponding to the start symbol S can be encoded to obtain the second vector.
  • if the target semantic vector is not the first of the plurality of semantic vectors, the prior information includes the text recognition result corresponding to the previous semantic vector of the target semantic vector. In this case, word embedding processing can be performed on the text recognition result corresponding to the previous semantic vector to obtain the feature vector corresponding to the prior information, and the feature vector can be encoded to obtain the second vector corresponding to the prior information.
  • the text recognition result corresponding to the previous semantic vector of the target semantic vector may be subjected to word embedding processing to determine the feature vector corresponding to the text recognition result.
  • for word embedding processing, an algorithm such as Word2Vec or GloVe can be used on the text recognition result corresponding to the previous semantic vector to obtain the feature vector corresponding to the prior information.
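At its simplest, the word embedding step maps the previous recognition result to a dense feature vector through a lookup table, as in the toy stand-in below. A real system would use learned embeddings (e.g. Word2Vec or GloVe); the vectors and the `<s>` start-symbol convention here are invented for illustration.

```python
# Toy stand-in for the word-embedding step: the previous recognition
# result is mapped to a dense feature vector via a lookup table.
# These vectors are invented for illustration only.

embeddings = {
    "A": [0.2, 0.1, 0.9],
    "B": [0.8, 0.3, 0.1],
    "<s>": [1.0, 1.0, 1.0],  # start symbol: all elements preset to 1
}

def embed(token):
    return embeddings[token]

print(embed("A"))  # [0.2, 0.1, 0.9]
```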
  • the feature vector corresponding to the text recognition result corresponding to the previous semantic vector can be used as a basis for recognizing its subsequent characters.
  • the feature vector corresponding to the text recognition result of the previous semantic vector carries the semantic information of that recognition result, together with its semantic connection relationship.
  • the weight parameter may be determined according to the first vector and the second vector, and the weight parameter may be a weight matrix. For example, vector multiplication may be performed on the first vector and the second vector to obtain the weight matrix.
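The "vector multiplication" of the first and second vectors above is read here as an outer product, which yields a weight matrix. This reading, and all dimensions and values below, are illustrative assumptions.

```python
# Sketch of forming the weight matrix from the first vector (encoded
# target semantic vector) and second vector (encoded prior information),
# interpreting "vector multiplication" as an outer product.

def outer(u, v):
    """Outer product: W[i][j] = u[i] * v[j]."""
    return [[ui * vj for vj in v] for ui in u]

first = [1.0, 2.0]        # from the first convolutional layer
second = [0.5, 0.0, 1.0]  # from the second convolutional layer
W = outer(first, second)
print(W)  # [[0.5, 0.0, 1.0], [1.0, 0.0, 2.0]]
```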
  • an attention distribution vector corresponding to the target semantic vector may be obtained based on the weight parameter and the target semantic vector.
  • the attention distribution vector may be decoded through at least one deconvolution layer in the convolutional neural network to determine a text recognition result corresponding to the target semantic vector.
  • the weight parameter and the target semantic vector can be processed through a residual network to obtain the attention distribution vector, or the weight parameter (weight matrix) and the target semantic vector can be multiplied as matrices (that is, the target semantic vector is weighted) to obtain the attention distribution vector.
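The matrix-multiplication variant of the weighting step can be sketched as a plain matrix-vector product; shapes and values are illustrative, and the residual-network formulation the patent also allows is not shown.

```python
# Sketch of weighting the target semantic vector with the weight matrix
# (matrix-vector multiplication) to obtain the attention distribution
# vector. Shapes and values are illustrative assumptions.

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

W = [[0.5, 0.0, 1.0],
     [1.0, 0.0, 2.0]]       # weight matrix from the prior information
target = [1.0, 2.0, 3.0]    # target semantic vector
attention = matvec(W, target)
print(attention)  # [3.5, 7.0]
```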
  • the attention distribution vector can have information such as the background, shooting angle, size, lighting conditions, and font of the image to be detected, as well as the semantic information of the target semantic vector.
  • the probability distribution information about the probability dictionary can be determined according to the attention distribution vector.
  • the attention distribution vector may be decoded through at least one deconvolution layer in the convolutional neural network to obtain probability distribution information about a probability dictionary.
  • the text in the probability dictionary can be determined according to the probability distribution information, that is, the text recognition result corresponding to the target semantic vector can be determined.
  • the text recognition result can be used in the process of recognizing the next character, and so on, until all characters in the text sequence are recognized.
  • an end vector can be input to the convolutional neural network, and the elements of the end vector can be preset (for example, all elements are 1).
  • the recognition of the text sequence in the image to be detected is completed, and the recognition result of the text sequence is obtained.
  • the semantic information of the text recognition result corresponding to the previous semantic vector may be incorporated, via the feature vector corresponding to that text recognition result, into the weight parameter (weight matrix).
  • the elements in the weight parameter can have information such as the background, shooting angle, size, lighting conditions, and font of the image to be detected, which can be used as a basis for identifying subsequent characters in the text sequence.
  • the semantic information contained in the weight parameter can also be used as a basis for identifying subsequent characters. For example, if the target semantic vector is the second semantic vector, the previous semantic vector of the target semantic vector is the first semantic vector, and the corresponding character is the first character in the text sequence.
  • the recognition result of the first character can be used as the basis for recognizing the character corresponding to the target semantic vector, and the recognition result of the target semantic vector can in turn be used as the basis for recognizing the character corresponding to the third semantic vector (that is, the next semantic vector after the target semantic vector).
  • when the first character in the text sequence is recognized, no previously recognized character exists in the image to be recognized, so the first character is recognized using the start symbol as the prior information.
  • for example, the start symbol S is used as the prior information, and the initial vector corresponding to the start symbol S is used to recognize the character A; the recognition result of the first character is A.
  • the recognized character A is then used to recognize the character B, yielding the recognition result B of the second character, and so on, until all the characters A, B, C, and D are recognized and the recognition result of the text sequence is obtained.
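The sequential recognition loop just described, matching the A, B, C, D example, can be sketched end to end. The "network" below is a lookup table invented purely for illustration; a real system would run the convolutional weighting, attention, and decoding steps inside `predict`.

```python
# Toy sketch of the sequential recognition loop: each character is
# predicted from its semantic vector plus the previous prediction
# (or the start symbol for the first character).

def recognize(semantic_vectors, predict, start="<s>"):
    prior = start
    out = []
    for vec in semantic_vectors:
        char = predict(vec, prior)  # CNN step: weight, attend, decode
        out.append(char)
        prior = char                # feed recognition result forward
    return "".join(out)

# Stand-in for the trained convolutional neural network.
table = {(1, "<s>"): "A", (2, "A"): "B", (3, "B"): "C", (4, "C"): "D"}
result = recognize([1, 2, 3, 4], lambda v, p: table[(v, p)])
print(result)  # ABCD
```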
  • a convolutional neural network can be used to recognize the character corresponding to the current target semantic vector according to the recognition result of the previous character, thereby avoiding the problem of uncontrollable long dependence and improving the accuracy of recognition.
  • the convolutional neural network may be trained before the text recognition result is determined using the convolutional neural network.
  • multiple images with multiple backgrounds, resolutions, fonts, lighting conditions, size scales, tilt directions, and blur levels can be taken as second sample images, and the multiple second sample images can be used to train the convolutional neural network.
  • the probability distribution information of the characters in each second sample image can be obtained according to the probability dictionary, and the second sample image can be annotated according to the probability distribution information to obtain the label information of each character in the second sample image; that is, the label information is the actual probability distribution information of the corresponding character in the second sample image.
  • feature extraction processing can be performed on any second sample image to obtain multiple semantic vectors respectively corresponding to multiple characters in the second sample image.
  • the first semantic vector can be input to the first convolutional layer of the convolutional neural network and the start symbol can be input to the second convolutional layer to obtain the weight parameter of the first semantic vector.
  • the weight parameter can be used to weight the first semantic vector (ie, perform matrix multiplication) to obtain the sample attention distribution vector corresponding to the first semantic vector.
  • the sample attention distribution vector can be decoded by the deconvolution layer of the convolutional neural network to obtain the probability distribution information output by the convolutional neural network, that is, the output result of the convolutional neural network.
  • the network loss of the convolutional neural network can be determined according to the label information (true probability distribution information) and the output result (probability distribution information output by the convolutional neural network).
  • the label information of the characters in the second sample image can be compared with the output result of the convolutional neural network, and the difference between the two can be determined as the loss function of the convolutional neural network.
  • the cross-entropy loss function of the convolutional neural network can be determined according to the labeling information and the output result.
  • the regularized loss function can be used as the network loss of the convolutional neural network, so as to avoid overfitting of the network parameters of the convolutional neural network during the iterative training process.
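A regularized loss can be sketched by adding a penalty on the parameter magnitudes to the base loss. The patent states only that a regularized loss function is used; the L2 form and the strength `lam` below are assumptions for illustration.

```python
# Sketch of an L2-regularized network loss used to discourage
# overfitting during iterative training. The L2 form and the
# regularization strength lam are assumptions.

def regularized_loss(base_loss, params, lam=0.01):
    """base_loss + lam * sum(w^2) over all network parameters."""
    return base_loss + lam * sum(w * w for w in params)

params = [3.0, -4.0]
print(round(regularized_loss(1.0, params), 6))  # 1.25
```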
  • the network parameters of the convolutional neural network can be adjusted according to the network loss.
  • the network parameters can be adjusted to minimize the network loss, so that the adjusted convolutional neural network has a higher goodness of fit while avoiding overfitting.
  • the gradient descent method can be used to backpropagate the network loss to adjust the network parameters of the convolutional neural network. For example, for a convolutional neural network with tree connections between neurons, the stochastic gradient descent method can be used to adjust network parameters, which reduces the complexity of the adjustment process, improves its efficiency, and avoids overfitting of the adjusted network parameters.
  • the character recognized by the convolutional neural network can be determined according to the probability distribution information output by the convolutional neural network and the probability dictionary, and word embedding processing is performed on the character to obtain a feature vector corresponding to the character.
  • the feature vector can be input into the second convolutional layer of the convolutional neural network, and the second semantic vector in the second sample image can be input into the first convolutional layer of the convolutional neural network, to obtain the weight parameter of the second semantic vector. The weight parameter can be used to weight the second semantic vector to obtain the sample attention distribution vector corresponding to the second semantic vector.
  • the sample attention distribution vector can be decoded through the deconvolution layer of the convolutional neural network to obtain probability distribution information.
  • the network loss can be determined based on the probability distribution information and the label information of the second character, and the network loss can be used to adjust the network parameters of the convolutional neural network again.
  • iterative adjustments can be made in this way.
  • the weight parameter of the third semantic vector can be obtained according to the feature vector corresponding to the second character recognized by the convolutional neural network and the third semantic vector, and then the sample attention distribution vector corresponding to the third semantic vector can be obtained; after it is decoded, the network loss can be determined, and the convolutional neural network can be adjusted again according to the network loss.
  • by analogy, the convolutional neural network can be adjusted according to the third character and the fourth semantic vector, then according to the fourth character and the fifth semantic vector, and so on, until all characters in the second sample image are recognized. In this way, the network parameters of the convolutional neural network are adjusted multiple times.
  • when the convolutional neural network meets the training condition, it can be used in the process of recognizing the text sequence in the image to be detected.
  • training conditions may include the number of adjustments, the magnitude of the network loss, or the convergence of the network loss.
  • the network parameters of the convolutional neural network can be adjusted a predetermined number of times, and when the number of adjustments reaches the predetermined number of times, the training condition is satisfied.
  • the number of adjustments may not be limited, and when the network loss decreases to a certain extent or converges within a certain threshold, the adjustment is stopped, and the adjusted convolutional neural network is obtained.
  • semantic vectors can be extracted from the image to be detected, which reduces the complexity of text recognition and improves its efficiency.
  • a convolutional neural network can be used to recognize the character corresponding to the current target semantic vector based on the recognition result of the previous character, thereby avoiding the problem of uncontrollable long dependence and improving the accuracy of recognition.
  • the GPU can be used to accelerate the convolutional neural network and improve the processing efficiency of the convolutional neural network.
  • Fig. 2 schematically shows a coding and decoding framework based on a convolutional neural network for text recognition according to an embodiment of the present disclosure.
  • feature extraction processing can be performed on the image to be detected to obtain multiple semantic vectors.
  • the prior information of the target semantic vector can be processed through the convolutional neural network to obtain the weight parameter of the target semantic vector, and the text recognition result corresponding to the target semantic vector can be determined according to the weight parameter and the target semantic vector.
  • the target semantic vector is any one of multiple semantic vectors.
  • the multiple semantic vectors may correspond to multiple characters of the text sequence.
  • each character in the multiple characters of the text sequence corresponds to one of the multiple semantic vectors.
  • the embodiments of the present disclosure are not limited thereto.
  • if the target semantic vector is the first semantic vector among the plurality of semantic vectors (that is, the semantic vector corresponding to the first character in the text sequence in the image to be detected), the target semantic vector is input to the first convolutional layer of the convolutional neural network for encoding processing to obtain a first vector, and the initial vector corresponding to the start symbol is input into the second convolutional layer of the convolutional neural network for encoding processing to obtain a second vector.
  • vector multiplication can be performed on the first vector and the second vector to obtain the weight parameter of the first semantic vector, that is, the weight matrix.
  • the weight matrix can be used to perform weighting processing on the first semantic vector to obtain the attention distribution vector corresponding to the first semantic vector, and the attention distribution vector can be decoded through at least one deconvolution layer in the convolutional neural network to obtain probability distribution information about the probability dictionary.
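The weight-parameter computation just described can be sketched numerically. Here the "vector multiplication" of the first and second vectors is taken to be an outer product, and the weighted semantic vector is normalized with a softmax to give an attention distribution; both choices are assumptions for illustration, since the disclosure does not fix the exact operations.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 43                              # length of a semantic vector (e.g. 43x1)

semantic = rng.normal(size=M)       # target semantic vector from the encoder
first = rng.normal(size=M)          # "first vector": encoded target semantic vector
second = rng.normal(size=M)         # "second vector": encoded prior information

# Vector multiplication of the two encodings yields the weight parameter
# (weight matrix).
weight = np.outer(first, second)    # (M, M)

# Weighting the semantic vector (matrix multiplication) and normalizing
# gives an attention distribution vector over the M positions.
scores = weight @ semantic
attention = np.exp(scores - scores.max())
attention /= attention.sum()

assert attention.shape == (M,)
assert np.isclose(attention.sum(), 1.0)
```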
  • the text in the probability dictionary can be determined according to the probability distribution information, that is, the text recognition result corresponding to the first semantic vector can be determined, so as to obtain the recognition result of the first character.
  • word embedding processing may be performed on the recognition result of the first character to obtain the feature vector corresponding to the first character.
  • the feature vector corresponding to the first character may be input to the second convolutional layer of the convolutional neural network for encoding processing, to obtain the second vector corresponding to the first character.
  • the second semantic vector (i.e., the semantic vector corresponding to the second character in the character sequence in the image to be detected) may be input to the first convolutional layer of the convolutional neural network for encoding processing, to obtain the first vector of the second semantic vector.
  • vector multiplication may be performed on the first vector of the second semantic vector and the second vector corresponding to the first character to obtain the weight matrix of the second semantic vector.
  • the weight matrix can be used to weight the second semantic vector (i.e., matrix multiplication), and the weighted second semantic vector can be input to the fully connected layer of the convolutional neural network to obtain the attention distribution vector corresponding to the second semantic vector.
  • the attention distribution vector corresponding to the second semantic vector can be decoded through at least one deconvolution layer in the convolutional neural network to obtain probability distribution information about the probability dictionary (i.e., the probability distribution of the recognition result of the second character).
  • the text in the probability dictionary can be determined according to the probability distribution information, that is, the recognition result of the second character can be obtained. Further, the recognition result of the second character can be used to determine the recognition result of the third character, the recognition result of the third character can be used to determine the recognition result of the fourth character, and so on.
  • when the first character in the text sequence is recognized, no recognized character yet exists in the image to be recognized, so the first character is recognized using the start symbol as the prior information.
  • for example, when the text sequence contains characters A, B, C, and D, in the first step the start symbol S is used as the prior information, and the initial vector corresponding to the start symbol S is used to recognize character A, so that the recognition result of the first character of the text sequence is A.
  • the semantic vectors in the image to be processed can be iteratively processed in the above-mentioned manner to obtain the recognition result of each character in the image to be detected until all characters in the text sequence are recognized.
  • the end vector can be input to the convolutional neural network to complete the recognition of the text sequence in the image to be detected, and obtain the recognition result of the text sequence.
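The per-character loop above — the start symbol serves as the prior for the first character, and each recognition result is embedded and fed back as the prior for the next — can be sketched as follows. The decode step, dictionary, and embedding table are toy stand-ins (assumptions for illustration), not the actual network of the disclosure.

```python
import numpy as np

dictionary = ["A", "B", "C", "D"]             # hypothetical probability dictionary
rng = np.random.default_rng(2)
W_dec = rng.normal(size=(len(dictionary), 4))

def decode(semantic, prior_vec):
    # Stand-in for: encode the target semantic vector and the prior
    # information, form the weight parameter, weight into an attention
    # distribution vector, and decode it into probability distribution
    # information over the dictionary.
    scores = W_dec @ (semantic * prior_vec)   # toy combination of the inputs
    e = np.exp(scores - scores.max())
    return e / e.sum()

embed = {c: rng.normal(size=4) for c in dictionary}       # word-embedding table
semantic_vectors = [rng.normal(size=4) for _ in range(3)]  # one per character

result = []
prior = np.ones(4)                            # initial vector for the start symbol
for sem in semantic_vectors:
    probs = decode(sem, prior)                # probability distribution information
    char = dictionary[int(np.argmax(probs))]  # text chosen from the dictionary
    result.append(char)
    prior = embed[char]                       # previous result becomes the prior

print("".join(result))                        # a 3-character recognition result
assert len(result) == 3
```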
  • Fig. 3 shows a block diagram of a text recognition device that can implement the text recognition method according to any of the above embodiments.
  • the device may include an extraction module 11 and an identification module 12.
  • the extraction module 11 may perform feature extraction processing on the image to be detected to obtain multiple semantic vectors, wherein the multiple semantic vectors respectively correspond to multiple characters of the text sequence in the image to be detected.
  • the recognition module 12 may sequentially perform recognition processing on the multiple semantic vectors through a convolutional neural network to obtain the recognition result of the text sequence.
  • the recognition module may be used to: process the prior information of the target semantic vector through a convolutional neural network to obtain the weight parameter of the target semantic vector, wherein the target semantic vector is one of the plurality of semantic vectors; and determine, according to the weight parameter and the target semantic vector, a text recognition result corresponding to the target semantic vector.
  • the a priori information includes the text recognition result and/or the start symbol corresponding to the previous semantic vector of the target semantic vector.
  • the recognition module may be used to: encode the target semantic vector through at least one first convolutional layer in the convolutional neural network to obtain the first vector of the target semantic vector; encode the prior information of the target semantic vector through at least one second convolutional layer in the convolutional neural network to obtain a second vector corresponding to the prior information; and determine the weight parameter based on the first vector and the second vector.
  • the recognition module may be configured to: in response to the prior information including the text recognition result corresponding to the previous semantic vector of the target semantic vector, perform word embedding processing on the text recognition result corresponding to the previous semantic vector to obtain a feature vector corresponding to the prior information; and encode the feature vector to obtain the second vector.
  • the recognition module may be used to encode the initial vector corresponding to the start symbol in the prior information to obtain the second vector.
  • the recognition module may be used to: obtain an attention distribution vector corresponding to the target semantic vector based on the weight parameter and the target semantic vector; and decode the attention distribution vector through at least one deconvolution layer in the convolutional neural network to determine a text recognition result corresponding to the target semantic vector.
  • the extraction module may be used to: perform feature extraction on the image to be detected to obtain feature information; and perform down-sampling processing on the feature information to obtain the multiple semantic vectors.
  • Fig. 4 is a block diagram of an electronic device 800 according to an exemplary embodiment.
  • the electronic device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
  • the electronic device 800 may include one or more of the following: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • the processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to execute all or part of the steps of any text recognition method described above.
  • the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
  • the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
  • the memory 804 can store various types of data to support operations on the electronic device 800. Examples of these data include instructions for any application or method executed on the electronic device 800, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the power supply component 806 can provide power for various components of the electronic device 800.
  • the power supply component 806 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 800.
  • the multimedia component 808 may include a screen that provides an interface (eg, a graphical user interface (GUI)) between the electronic device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • the touch panel may include one or more sensors to sense touch, sliding, and/or other gestures on the touch panel. The sensor may not only sense the boundary of the touch or sliding motion, but also detect the duration and pressure related to the touch or sliding motion.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode such as a shooting mode or a video mode, the front camera and/or the rear camera can collect external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
  • the audio component 810 can output and/or input audio signals.
  • the audio component 810 may include a microphone.
  • the microphone can collect external audio signals.
  • the collected audio signal may be stored in the memory 804 or sent via the communication component 816.
  • the audio component 810 further includes a speaker for outputting audio signals.
  • the I/O interface 812 may provide an interface between the processing component 802 and peripheral devices.
  • peripheral devices may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
  • the sensor component 814 may include one or more sensors for providing the electronic device 800 with various aspects of status information.
  • the sensor component 814 may include a proximity sensor to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 814 may also include a light sensor, such as a complementary metal oxide semiconductor (CMOS) or charge coupled device (CCD) image sensor, for imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 816 can facilitate wired or wireless communication between the electronic device 800 and other devices.
  • the electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 may receive broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology or other technologies.
  • the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, to implement any of the above text recognition methods.
  • a non-transitory computer-readable storage medium (for example, the memory 804) may also be provided, on which computer program instructions are stored.
  • the computer program instructions when executed by a processor (for example, the processor 820), enable the processor to implement any of the foregoing text recognition methods.
  • Fig. 5 is a block diagram of an electronic device 1900 according to an exemplary embodiment.
  • the electronic device 1900 may be a server.
  • the electronic device 1900 may include: a processing component 1922, which may include one or more processors; and a memory resource represented by the memory 1932, for storing instructions executable by the processing component 1922, such as application programs.
  • the processing component 1922 can execute the instruction to implement any of the foregoing text recognition methods.
  • the electronic device 1900 may further include: a power supply component 1926 for performing power management of the electronic device 1900; a wired or wireless network interface 1950 for connecting the electronic device 1900 to a network; and an input/output (I/O) interface 1958.
  • the electronic device 1900 may work based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
  • a non-transitory computer-readable storage medium (for example, the memory 1932) may also be provided, on which computer program instructions are stored.
  • the computer program instructions when executed by the processor (for example, the processing component 1922), enable the processor to implement any of the foregoing text recognition methods.
  • the present disclosure can be implemented as an apparatus (system), method and/or computer program product.
  • the computer program product may include a computer readable storage medium loaded with computer readable program instructions for enabling a processor to implement the text recognition method of the present disclosure.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, and the module, program segment, or part of an instruction contains one or more executable instructions for implementing the specified logical function. The functions noted in the blocks may also occur in an order different from that marked in the drawings; for example, two consecutive blocks can actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a text recognition method and apparatus, an electronic device, and a storage medium. The method includes: performing feature extraction processing on an image to be detected to obtain multiple semantic vectors, where the multiple semantic vectors respectively correspond to multiple characters of a text sequence in the image to be detected; and sequentially performing recognition processing on the multiple semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence.

Description

Text recognition method and apparatus, electronic device, and storage medium
Cross-reference to related applications
The present disclosure claims priority to Chinese patent application No. 201910251661.4, filed on March 29, 2019 and entitled "Text recognition method and apparatus, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to computer vision technology, and in particular to a text recognition method and apparatus, an electronic device, and a storage medium.
Background
Text recognition in natural scenes is an important problem in the fields of image understanding and image restoration. Accurate text recognition can be used for, e.g., image understanding, automatic translation, guidance for the blind, and robot navigation. At present, text recognition systems based on an encoder-decoder framework usually use recurrent neural networks as the encoder and decoder.
Summary
According to one aspect of the present disclosure, a text recognition method is provided, including: performing feature extraction processing on an image to be detected to obtain multiple semantic vectors, where the multiple semantic vectors respectively correspond to multiple characters of a text sequence in the image to be detected; and sequentially performing recognition processing on the multiple semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence.
The text recognition method according to the embodiments of the present disclosure can improve the accuracy of text recognition.
In some embodiments, sequentially performing recognition processing on the multiple semantic vectors to obtain the recognition result of the text sequence includes: processing prior information of a target semantic vector through the convolutional neural network to obtain a weight parameter of the target semantic vector, where the target semantic vector is one of the multiple semantic vectors; and determining, according to the weight parameter and the target semantic vector, a text recognition result corresponding to the target semantic vector.
In this way, the target semantic vector can be weighted with the weight parameter obtained from the prior information, so that the prior information is referenced during recognition of the target semantic vector, thereby improving recognition accuracy.
In some embodiments, the prior information includes the text recognition result corresponding to the previous semantic vector of the target semantic vector and/or a start symbol.
In some embodiments, processing the prior information to obtain the weight parameter of the target semantic vector includes: encoding the target semantic vector through at least one first convolutional layer in the convolutional neural network to obtain a first vector of the target semantic vector; encoding the prior information of the target semantic vector through at least one second convolutional layer in the convolutional neural network to obtain a second vector corresponding to the prior information; and determining the weight parameter based on the first vector and the second vector.
In this way, the weight parameter can incorporate the prior information, providing a basis for recognizing the target semantic vector.
In some embodiments, encoding the prior information to obtain the second vector corresponding to the prior information includes: in response to the prior information including the text recognition result corresponding to the previous semantic vector of the target semantic vector, performing word embedding processing on the text recognition result corresponding to the previous semantic vector to obtain a feature vector corresponding to the prior information; and encoding the feature vector to obtain the second vector.
In this way, the convolutional neural network can recognize the character corresponding to the current target semantic vector based on the recognition result of the previous character, thereby avoiding the problem of uncontrollable long-range dependence and improving recognition accuracy.
In some embodiments, encoding the prior information to obtain the second vector corresponding to the prior information includes: encoding the initial vector corresponding to the start symbol in the prior information to obtain the second vector.
In some embodiments, determining the text recognition result corresponding to the target semantic vector includes: obtaining an attention distribution vector corresponding to the target semantic vector based on the weight parameter and the target semantic vector; and decoding the attention distribution vector through at least one deconvolution layer in the convolutional neural network to determine the text recognition result corresponding to the target semantic vector.
In some embodiments, performing feature extraction processing on the image to be detected to obtain the multiple semantic vectors includes: performing feature extraction on the image to be detected to obtain feature information; and down-sampling the feature information to obtain the multiple semantic vectors.
According to another aspect of the present disclosure, a text recognition apparatus is provided, including: an extraction module configured to perform feature extraction processing on an image to be detected to obtain multiple semantic vectors, where the multiple semantic vectors respectively correspond to multiple characters of a text sequence in the image to be detected; and a recognition module configured to sequentially perform recognition processing on the multiple semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence.
According to another aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor, where the processor, when executing the instructions stored in the memory, implements the above text recognition method.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, where the computer program instructions, when executed by a processor, cause the processor to implement the above text recognition method.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 shows a flowchart of a text recognition method according to an embodiment of the present disclosure;
Fig. 2 shows a schematic diagram of a convolutional-neural-network-based encoder-decoder framework for text recognition according to an embodiment of the present disclosure;
Fig. 3 shows a block diagram of a text recognition apparatus according to an embodiment of the present disclosure;
Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed description
Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions.
The terms used in the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. Singular forms such as "a", "said", and "the" used in the present disclosure are also intended to include plural forms, unless the context clearly indicates otherwise. "A and/or B" may represent three cases: A alone, both A and B, and B alone. "Including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
Although the present disclosure may use the terms "first", "second", "third", etc. to describe various pieces of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to".
In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art should understand that the present disclosure may also be practiced without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
Fig. 1 shows a flowchart of a text recognition method according to an embodiment of the present disclosure. As shown in Fig. 1, the method may include steps S11 and S12.
In step S11, feature extraction processing is performed on an image to be detected to obtain multiple semantic vectors, where the multiple semantic vectors respectively correspond to multiple characters of a text sequence in the image to be detected.
In step S12, recognition processing is sequentially performed on the multiple semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence.
The text recognition method according to the embodiments of the present disclosure can improve the accuracy of text recognition.
In some embodiments, the text recognition method may be executed by a terminal device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc. The method may be implemented by a processor in the terminal device invoking computer-readable instructions stored in a memory. Alternatively, the terminal device may acquire the image to be detected and send it to a server, so that the method is executed by the server.
In some embodiments, step S11 may include: performing feature extraction on the image to be detected to obtain feature information; and down-sampling the feature information to obtain the multiple semantic vectors.
In some embodiments, the feature information may include, but is not limited to, a feature map or a feature vector.
In an example, the image to be detected may contain a text sequence composed of, e.g., multiple text characters. The text characters of the text sequence may have certain semantic connections, i.e., the text sequence may carry certain semantics.
In some embodiments, a feature extraction network may be used to extract the multiple semantic vectors from the image to be detected. The feature extraction network may be a neural network such as a convolutional neural network. Performing feature extraction processing on the image to be detected to obtain multiple semantic vectors may include: inputting the image to be detected into the feature extraction network to obtain the multiple semantic vectors.
In an example, the feature extraction network may obtain one or more feature matrices of the image to be detected through encoding processing or the like. The dimension of the feature matrix may be M×P. For example, P may be 32, and the ratio of M to P may correspond to the aspect ratio of the image to be detected. For example, if the resolution of the image to be detected is 1024×768, one or more 43×32 feature maps may be obtained through the encoding processing.
In an example, the feature extraction network may down-sample the feature matrix to obtain one or more feature vectors as semantic vectors. For example, an M×P feature matrix may be down-sampled to obtain one or more M×1 feature vectors as one-dimensional semantic vectors. For the 43×32 feature map obtained through encoding in the preceding example, one or more 43×1 feature vectors may be obtained through the down-sampling.
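The down-sampling step above can be sketched numerically: an M×P feature matrix is reduced to M×1 semantic vectors. Mean pooling over the width axis is used here as an assumed form of the down-sampling, which the disclosure does not specify.

```python
import numpy as np

# A 1024x768 image encodes to a 43x32 feature map in the example above
# (M = 43, P = 32); down-sampling over the P axis yields a 43x1 vector.
feature_map = np.random.default_rng(3).normal(size=(43, 32))

semantic_vector = feature_map.mean(axis=1, keepdims=True)   # assumed mean pooling

assert semantic_vector.shape == (43, 1)
```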
Through the above feature extraction processing, the complexity of text recognition can be reduced, thereby improving the efficiency of text recognition.
In some embodiments, the feature extraction network may be trained before it is used to perform feature extraction processing on the image to be detected.
In some embodiments, multiple images with various backgrounds, resolutions, fonts, lighting conditions, sizes, tilt directions, and degrees of blur may be captured as first sample images to train the feature extraction network.
In some embodiments, the text in a first sample image may be labeled according to a probability dictionary to obtain a labeled semantic vector of the first sample image (hereinafter also referred to as the real semantic vector of the first sample image). The probability dictionary may include a user-defined probability distribution over text. For example, a vector including multiple elements may be used to represent the probability distribution information of each piece of text in the probability dictionary. Text in the probability dictionary may be determined according to the probability distribution information of each piece of text; alternatively, the probability distribution information of text in the probability dictionary may be determined, so as to determine the semantic vector corresponding to the text. In an example, the probability distribution information of the text in a first sample image in the probability dictionary may be determined, so as to determine the semantic vector corresponding to the text in the first sample image, and the first sample image may be labeled according to the semantic vector to obtain labeling information. In this way, the labeling information can represent the real semantic vector of the first sample image.
In some embodiments, a first sample image may be input into the feature extraction network for processing to obtain a sample semantic vector corresponding to the first sample image. The sample semantic vector is the output result of the feature extraction network for the first sample image, and this output result may contain errors.
In some embodiments, the network loss of the feature extraction network may be determined according to the labeling information and the output result for the first sample image. In an example, the real semantic vector of the first sample image (i.e., the labeling information) may be compared with the sample semantic vector (i.e., the output result), and the difference between the two may be determined as the loss function of the feature extraction network. As another example, the cross-entropy loss function of the feature extraction network may be determined according to the labeling information and the output result. In an example, a regularized loss function may be used as the network loss of the feature extraction network, so as to avoid overfitting of the network parameters of the feature extraction network during iterative training.
In some embodiments, the network parameters of the feature extraction network may be adjusted according to the network loss. In an example, the network parameters may be adjusted to minimize the network loss, so that the adjusted feature extraction network has a high goodness of fit while avoiding overfitting. In an example, the gradient descent method may be used to back-propagate the network loss to adjust the network parameters of the feature extraction network. For example, for a feature extraction network with tree-structured connections between neurons, the stochastic gradient descent method may be used to adjust the network parameters, so as to reduce the complexity of the adjustment process, improve its efficiency, and avoid overfitting of the adjusted network parameters.
In some embodiments, the feature extraction network may be trained iteratively, and a feature extraction network satisfying a training condition may be used in the processing for obtaining the semantic vectors. The training condition may include the number of adjustments, the magnitude of the network loss, or the convergence of the network loss. A predetermined number of first sample images may be input to the feature extraction network, i.e., the network parameters of the feature extraction network are adjusted a predetermined number of times; when the number of adjustments reaches the predetermined number, the training condition is satisfied. Alternatively, the number of adjustments may be unlimited, and the adjustment is stopped when the network loss decreases to a certain extent or converges within a certain threshold, yielding the adjusted feature extraction network, which can then be used in the processing for obtaining the semantic vectors of the image to be detected. Training the feature extraction network via the difference between the labeling information and the output result can reduce the complexity of the loss function and speed up training.
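The two stopping conditions just described — a predetermined number of adjustments, or the network loss converging within a threshold — can be sketched as a generic training loop. The quadratic "loss" below is a toy stand-in (an assumption) used only to exercise the loop, not the feature extraction network's actual loss.

```python
def train(step_fn, params, max_steps=100, tol=1e-4):
    # Stop after a predetermined number of adjustments, or earlier
    # once the network loss has converged within the threshold `tol`.
    prev_loss = float("inf")
    loss = None
    for i in range(max_steps):
        params, loss = step_fn(params)
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return params, loss, i + 1

def step(w, lr=0.2):
    # One gradient-descent adjustment on a toy loss = w**2.
    w = w - lr * 2 * w
    return w, w ** 2

w, final_loss, n_steps = train(step, 5.0)
assert final_loss < 1e-3 and n_steps <= 100
```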
In some embodiments, in step S12, a graphics processing unit (GPU) may be used to accelerate the convolutional neural network and improve its processing efficiency.
In some embodiments, in step S12, the prior information of a target semantic vector may be processed through the convolutional neural network to obtain a weight parameter of the target semantic vector, where the target semantic vector is one of the multiple semantic vectors; and a text recognition result corresponding to the target semantic vector may be determined according to the weight parameter and the target semantic vector.
In some embodiments, the prior information includes the text recognition result corresponding to the previous semantic vector of the target semantic vector and/or a start symbol. If the target semantic vector is the first of the multiple semantic vectors, the prior information may be the start symbol; if not, the prior information may be the text recognition result corresponding to the previous semantic vector of the target semantic vector.
In some embodiments, the target semantic vector may be encoded through at least one first convolutional layer in the convolutional neural network to obtain a first vector of the target semantic vector. The prior information of the target semantic vector may be encoded through at least one second convolutional layer in the convolutional neural network to obtain a second vector corresponding to the prior information. Then, the weight parameter of the target semantic vector may be determined based on the first vector and the second vector.
In some embodiments, the first vector may carry the semantic information of the target semantic vector, as well as the semantic connections of the character corresponding to the target semantic vector.
In some embodiments, if the target semantic vector is the first of the multiple semantic vectors, i.e., the semantic vector corresponding to the first character in the text sequence, the second vector corresponding to the prior information may be obtained by encoding the initial vector corresponding to the start symbol in the prior information. In an example, the initial vector corresponding to the start symbol may be a vector whose elements take preset values (e.g., all 1s). In an example, for a text sequence with characters A, B, C, and D, the initial vector corresponding to the start symbol S may be encoded to obtain the second vector.
In some embodiments, if the target semantic vector is not the first of the multiple semantic vectors, then in response to the prior information including the text recognition result corresponding to the previous semantic vector of the target semantic vector, word embedding processing may be performed on that text recognition result to obtain a feature vector corresponding to the prior information, and the feature vector may be encoded to obtain the second vector corresponding to the prior information. For example, if the characters in the text sequence are A, B, C, and D and the target semantic vector is the semantic vector corresponding to B, C, or D, word embedding processing may be performed on the text recognition result of the previous semantic vector to obtain the feature vector corresponding to the prior information, and the feature vector may be encoded to obtain the second vector corresponding to the prior information.
In some embodiments, if the target semantic vector is not the first of the multiple semantic vectors, word embedding processing may be performed on the text recognition result corresponding to the previous semantic vector to determine the feature vector corresponding to that text recognition result. In an example, an algorithm such as Word2Vec or GloVe may be used for the word embedding processing to obtain the feature vector corresponding to the prior information.
In some embodiments, during recognition of the text corresponding to the previous semantic vector, information such as the background, shooting angle, size, lighting conditions, and font of the image to be detected may be identified. That is, the text recognition result corresponding to the previous semantic vector is based on such information. Therefore, the feature vector corresponding to that text recognition result can serve as a basis for recognizing subsequent characters. In addition, this feature vector carries the semantic information of the text recognition result corresponding to the previous semantic vector, as well as its semantic connections.
In some embodiments, the weight parameter may be determined according to the first vector and the second vector, and the weight parameter may be a weight matrix. For example, vector multiplication may be performed on the first vector and the second vector to obtain the weight matrix.
In some embodiments, an attention distribution vector corresponding to the target semantic vector may be obtained based on the weight parameter and the target semantic vector. The attention distribution vector may be decoded through at least one deconvolution layer in the convolutional neural network to determine the text recognition result corresponding to the target semantic vector.
In some embodiments, the weight parameter and the target semantic vector may be processed through a residual network to obtain the attention distribution vector; alternatively, matrix multiplication may be performed with the weight parameter (weight matrix) and the target semantic vector (i.e., the target semantic vector is weighted) to obtain the attention distribution vector. In this way, the attention distribution vector can carry information such as the background, shooting angle, size, lighting conditions, and font of the image to be detected, as well as the semantic information of the target semantic vector.
In some embodiments, probability distribution information about the probability dictionary may be determined according to the attention distribution vector. For example, the attention distribution vector may be decoded through at least one deconvolution layer in the convolutional neural network to obtain probability distribution information about the probability dictionary. Then, the text in the probability dictionary may be determined according to the probability distribution information, i.e., the text recognition result corresponding to the target semantic vector is determined. This text recognition result can be used in the processing for recognizing the next character, and so on, until all characters in the text sequence are recognized. When all characters in the text sequence are recognized, an end vector may be input to the convolutional neural network; the elements of the end vector may be preset (e.g., all 1s). When the end vector is input, recognition of the text sequence in the image to be detected is completed, and the recognition result of the text sequence is obtained.
In some embodiments, the semantic information of the text recognition result corresponding to the previous semantic vector may be contained in the weight parameter (weight matrix) of the feature vector corresponding to that text recognition result. The elements of the weight parameter may carry information such as the background, shooting angle, size, lighting conditions, and font of the image to be detected, which can serve as a basis for recognizing subsequent characters in the text sequence. The semantic information contained in the weight parameter can also serve as a basis for recognizing subsequent characters. For example, if the target semantic vector is the second semantic vector, its previous semantic vector is the first semantic vector, whose corresponding character is the first character in the text sequence. The recognition result of the first character can serve as a basis for recognizing the character corresponding to the target semantic vector, and the recognition result of the target semantic vector can serve as a basis for recognizing the character corresponding to the third semantic vector (i.e., the next semantic vector after the target semantic vector).
In an example, when the first character in the text sequence is recognized, no recognized character yet exists in the image to be recognized, so the first character is recognized using the start symbol as the prior information. For example, for a text sequence with characters A, B, C, and D, in the first step, the start symbol S is used as the prior information, and the initial vector corresponding to S is used to recognize character A, so that the recognition result of the first character of the text sequence is A. Then, the recognized character A is used to recognize character B, yielding the recognition result B for the second character. This continues until all characters A, B, C, and D are recognized and the recognition result of the text sequence is obtained.
In this way, the convolutional neural network can recognize the character corresponding to the current target semantic vector based on the recognition result of the previous character, thereby avoiding the problem of uncontrollable long-range dependence and improving recognition accuracy.
In some embodiments, the convolutional neural network may be trained before it is used to determine text recognition results.
In some embodiments, multiple images with various backgrounds, resolutions, fonts, lighting conditions, sizes, tilt directions, and degrees of blur may be captured as second sample images, and the convolutional neural network may be trained using the multiple second sample images.
In some embodiments, the probability distribution information of the characters in each second sample image may be obtained according to the probability dictionary, and the second sample image may be labeled according to the probability distribution information to obtain labeling information for each character in the second sample image, i.e., the labeling information is the real probability distribution information of the corresponding character in the second sample image.
In some embodiments, feature extraction processing may be performed on any second sample image to obtain multiple semantic vectors respectively corresponding to multiple characters in the second sample image. The first semantic vector may be input to the first convolutional layer of the convolutional neural network, and the start symbol may be input to the second convolutional layer, to obtain the weight parameter of the first semantic vector. Further, this weight parameter (weight matrix) may be used to weight the first semantic vector (i.e., via matrix multiplication) to obtain the sample attention distribution vector corresponding to the first semantic vector.
In some embodiments, the sample attention distribution vector may be decoded through the deconvolution layer of the convolutional neural network to obtain the probability distribution information output by the convolutional neural network, i.e., the output result of the convolutional neural network. Further, the network loss of the convolutional neural network may be determined according to the labeling information (the real probability distribution information) and the output result (the probability distribution information output by the convolutional neural network). In an example, the labeling information of the characters in the second sample image may be compared with the output result of the convolutional neural network, and the difference between the two may be determined as the loss function of the convolutional neural network. As another example, the cross-entropy loss function of the convolutional neural network may be determined according to the labeling information and the output result. In an example, a regularized loss function may be used as the network loss of the convolutional neural network, so as to avoid overfitting of the network parameters of the convolutional neural network during iterative training.
In some embodiments, the network parameters of the convolutional neural network may be adjusted according to the network loss. In an example, the network parameters may be adjusted to minimize the network loss, so that the adjusted convolutional neural network has a high goodness of fit while avoiding overfitting. In an example, the gradient descent method may be used to back-propagate the network loss to adjust the network parameters of the convolutional neural network. For example, for a convolutional neural network with tree-structured connections between neurons, the stochastic gradient descent method may be used to adjust the network parameters, so as to reduce the complexity of the adjustment process, improve its efficiency, and avoid overfitting of the adjusted network parameters.
In some embodiments, the character recognized by the convolutional neural network may be determined according to the probability distribution information output by the convolutional neural network and the probability dictionary, and word embedding processing may be performed on the character to obtain the feature vector corresponding to the character. Further, this feature vector may be input to the second convolutional layer of the convolutional neural network, and the second semantic vector of the second sample image may be input to the first convolutional layer, to obtain the weight parameter of the second semantic vector. This weight parameter may be used to weight the second semantic vector to obtain the sample attention distribution vector corresponding to the second semantic vector. Then, this sample attention distribution vector may be decoded through the deconvolution layer of the convolutional neural network to obtain probability distribution information. The network loss may be determined according to this probability distribution information and the labeling information of the second character, and the network loss may be used to adjust the network parameters of the convolutional neural network again. In an example, iterative adjustment may proceed in this manner. For example, the weight parameter of the third semantic vector may be obtained according to the feature vector corresponding to the second character recognized by the convolutional neural network and the third semantic vector, and then the sample attention distribution vector corresponding to the third semantic vector may be obtained; after it is decoded, the network loss may be determined, and the convolutional neural network may be adjusted again according to the network loss. By analogy, the convolutional neural network may be adjusted according to the third character and the fourth semantic vector, then according to the fourth character and the fifth semantic vector, and so on, until all characters in the second sample image are recognized. In this way, the network parameters of the convolutional neural network are adjusted multiple times.
In some embodiments, when the convolutional neural network satisfies a training condition, it may be used in the processing for recognizing the text sequence in the image to be detected. The training condition may include the number of adjustments, the magnitude of the network loss, or the convergence of the network loss. The network parameters of the convolutional neural network may be adjusted a predetermined number of times; when the number of adjustments reaches the predetermined number, the training condition is satisfied. Alternatively, the number of adjustments may be unlimited, and the adjustment is stopped when the network loss decreases to a certain extent or converges within a certain threshold, yielding the adjusted convolutional neural network.
According to the text recognition method of the embodiments of the present disclosure, semantic vectors can be extracted from the image to be detected, reducing the complexity of text recognition and improving its efficiency. A convolutional neural network can be used to recognize the character corresponding to the current target semantic vector based on the recognition result of the previous character, thereby avoiding the problem of uncontrollable long-range dependence and improving recognition accuracy. A GPU can be used to accelerate the convolutional neural network and improve its processing efficiency.
图2示意性示出了根据本公开实施例的用于文本识别的基于卷积神经网络的编解码框架。
在一些实施例中,可对待检测图像进行特征提取处理,获得多个语义向量。可通过卷积神经网络对目标语义向量的先验信息进行处理,获得所述目标语义向量的权值参数,并且可以根据所述权值参数和所述目标语义向量,确定与所述目标语义向量对应的文本 识别结果。所述目标语义向量为多个语义向量中的任意一个。
在一些实施例中,多个语义向量可以对应于文本序列的多个字符,例如,文本序列的多个字符中每个字符对应于多个语义向量中的一个语义向量,但本公开实施例不限于此。如果目标语义向量为多个语义向量中的第一个语义向量(即,与待检测图像中的文本序列中的第一个字符对应的语义向量),则将目标语义向量输入所述卷积神经网络的第一卷积层进行编码处理,获得第一向量,并将起始符对应的初始向量输入所述卷积神经网络的第二卷积层进行编码处理,获得第二向量。进一步地,可对第一向量和第二向量进行向量乘法,获得第一个语义向量的权值参数,即权值矩阵。
In some embodiments, the weight matrix may be used to weight the first semantic vector to obtain an attention distribution vector corresponding to the first semantic vector, and the attention distribution vector may be decoded by at least one deconvolution layer of the convolutional neural network to obtain probability distribution information over the probability dictionary. Further, the text in the probability dictionary may be determined according to the probability distribution information, i.e., the text recognition result corresponding to the first semantic vector may be determined, thereby obtaining the recognition result of the first character.
In some embodiments, word-embedding processing may be performed on the recognition result of the first character to obtain a feature vector corresponding to the first character. The feature vector corresponding to the first character may be input to the second convolutional layer of the convolutional neural network for encoding to obtain a second vector corresponding to the first character. The second semantic vector (i.e., the semantic vector corresponding to the second character of the character sequence in the image to be detected) may be input to the first convolutional layer of the convolutional neural network for encoding to obtain a first vector of the second semantic vector. Further, vector multiplication may be performed on the first vector of the second semantic vector and the second vector corresponding to the first character to obtain the weight matrix of the second semantic vector. The weight matrix may be used to weight the second semantic vector (i.e., by matrix multiplication), and the weighted second semantic vector may be input to a fully connected layer of the convolutional neural network to obtain the attention distribution vector corresponding to the second semantic vector. The attention distribution vector corresponding to the second semantic vector may be decoded by at least one deconvolution layer of the convolutional neural network to obtain probability distribution information over the probability dictionary (i.e., the probability distribution of the recognition result of the second character). The text in the probability dictionary may be determined according to the probability distribution information, i.e., the recognition result of the second character may be obtained. Further, the recognition result of the second character may be used to determine the recognition result of the third character, the recognition result of the third character to determine that of the fourth character, and so on.
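The iterative decoding described above can be sketched as a compact loop: each step encodes the current semantic vector and the previous step's result (the start token first, then the word embedding of the last recognized character), multiplies the two encodings into a weight matrix, weights the semantic vector, and decodes a distribution over the dictionary. Every matrix here stands in for a trained layer and is a random placeholder; sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, vocab = 8, 5

W_first = rng.standard_normal((dim, dim))       # stands in for the first convolutional layer
W_second = rng.standard_normal((dim, dim))      # stands in for the second convolutional layer
W_decode = rng.standard_normal((vocab, dim))    # stands in for the deconvolution decoder
embeddings = rng.standard_normal((vocab, dim))  # word-embedding table

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def recognize(semantic_vectors, start_vec):
    prior, result = start_vec, []
    for sem in semantic_vectors:
        # Encode both inputs, multiply into a weight matrix, weight the vector.
        weight_matrix = np.outer(W_first @ sem, W_second @ prior)
        attention_vec = weight_matrix @ sem
        probs = softmax(W_decode @ attention_vec)  # distribution over the dictionary
        char = int(np.argmax(probs))               # recognized character index
        result.append(char)
        prior = embeddings[char]  # this result feeds the next step as prior information
    return result

chars = recognize([rng.standard_normal(dim) for _ in range(4)],
                  rng.standard_normal(dim))
```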
In an example, when the first character of the text sequence is recognized, no recognized character yet exists in the image to be recognized, so the start token serves as the prior information for recognizing the first character. For example, when the text sequence contains characters A, B, C, and D, in the first step the start token S serves as prior information: the initial vector corresponding to S is used to recognize character A, giving A as the recognition result of the first character of the text sequence. Then the recognized character A is used to recognize character B, giving B as the recognition result of the second character. By analogy, all of the characters A, B, C, and D are recognized, yielding the recognition result of the text sequence.
In some embodiments, each semantic vector of the image to be processed may be iteratively processed in the above manner to obtain the recognition result of each character in the image to be detected, until all characters of the text sequence have been recognized. When all characters of the text sequence have been recognized, an end vector may be input to the convolutional neural network to complete the recognition of the text sequence in the image to be detected, obtaining the recognition result of the text sequence.
Fig. 3 shows a block diagram of a text recognition apparatus that can implement the text recognition method of any of the above embodiments. As shown in Fig. 3, the apparatus may include an extraction module 11 and a recognition module 12.
The extraction module 11 may perform feature extraction processing on an image to be detected to obtain a plurality of semantic vectors, wherein the plurality of semantic vectors respectively correspond to a plurality of characters of a text sequence in the image to be detected. The recognition module 12 may sequentially perform recognition processing on the plurality of semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence.
In some embodiments, the recognition module may be configured to: process prior information of a target semantic vector through the convolutional neural network to obtain a weight parameter of the target semantic vector, wherein the target semantic vector is one of the plurality of semantic vectors; and determine, according to the weight parameter and the target semantic vector, a text recognition result corresponding to the target semantic vector.
In some embodiments, the prior information includes a text recognition result corresponding to the semantic vector preceding the target semantic vector and/or a start token.
In some embodiments, the recognition module may be configured to: encode the target semantic vector through at least one first convolutional layer of the convolutional neural network to obtain a first vector of the target semantic vector; encode the prior information of the target semantic vector through at least one second convolutional layer of the convolutional neural network to obtain a second vector corresponding to the prior information; and determine the weight parameter based on the first vector and the second vector.
In some embodiments, the recognition module may be configured to: in response to the prior information including the text recognition result corresponding to the semantic vector preceding the target semantic vector, perform word-embedding processing on that text recognition result to obtain a feature vector corresponding to the prior information; and encode the feature vector to obtain the second vector.
In some embodiments, the recognition module may be configured to: encode an initial vector corresponding to the start token in the prior information to obtain the second vector.
In some embodiments, the recognition module may be configured to: obtain, based on the weight parameter and the target semantic vector, an attention distribution vector corresponding to the target semantic vector; and decode the attention distribution vector through at least one deconvolution layer of the convolutional neural network to determine the text recognition result corresponding to the target semantic vector.
In some embodiments, the extraction module may be configured to: perform feature extraction on the image to be detected to obtain feature information; and downsample the feature information to obtain the plurality of semantic vectors.
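The extraction module's two steps above can be sketched as follows. Feature extraction is taken as given (a feature map), and downsampling splits the map along its width into one semantic vector per character position; average pooling and all sizes are illustrative assumptions, not the disclosed implementation.

```python
import numpy as np

def to_semantic_vectors(feature_map, num_chars):
    """Downsample a (channels, height, width) feature map into one
    semantic vector per character position."""
    channels, height, width = feature_map.shape
    pooled = feature_map.mean(axis=1)              # collapse height -> (channels, width)
    columns = np.array_split(pooled, num_chars, axis=1)
    return [col.mean(axis=1) for col in columns]   # one (channels,) vector per character

vectors = to_semantic_vectors(np.random.rand(32, 8, 40), num_chars=5)
```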
Fig. 4 is a block diagram of an electronic device 800 according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 4, the electronic device 800 may include one or more of the following: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions, so as to perform all or some of the steps of any of the above text recognition methods. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 may store various types of data to support operation on the electronic device 800. Examples of such data include instructions for any application or method executed on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 may supply power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 may include a screen that provides an interface (e.g., a graphical user interface (GUI)) between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel may include one or more sensors to sense touches, swipes, and/or other gestures on the touch panel. The sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode such as a shooting mode or a video mode, the front camera and/or the rear camera may collect external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 may output and/or input audio signals. For example, the audio component 810 may include a microphone. When the electronic device 800 is in an operation mode such as a call mode, a recording mode, or a speech recognition mode, the microphone may collect external audio signals. The collected audio signals may be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 may provide an interface between the processing component 802 and peripheral devices. The peripheral devices may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 814 may include one or more sensors for providing status information of various aspects for the electronic device 800. For example, the sensor component 814 may include a proximity sensor for detecting the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 may facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 may receive broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented as one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, to implement any of the above text recognition methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium (e.g., the memory 804) may also be provided, on which computer program instructions are stored. The computer program instructions, when executed by a processor (e.g., the processor 820), cause the processor to implement any of the above text recognition methods.
Fig. 5 is a block diagram of an electronic device 1900 according to an exemplary embodiment. For example, the electronic device 1900 may be a server.
Referring to Fig. 5, the electronic device 1900 may include: a processing component 1922, which may include one or more processors; and memory resources represented by a memory 1932, for storing instructions executable by the processing component 1922, such as an application. The processing component 1922 may execute the instructions to implement any of the above text recognition methods.
The electronic device 1900 may further include: a power component 1926 for performing power management of the electronic device 1900; a wired or wireless network interface 1950 for connecting the electronic device 1900 to a network; and an input/output (I/O) interface 1958.
The electronic device 1900 may operate based on an operating system stored in the memory 1932 (e.g., Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.).
In an exemplary embodiment, a non-transitory computer-readable storage medium (e.g., the memory 1932) may also be provided, on which computer program instructions are stored. The computer program instructions, when executed by a processor (e.g., the processing component 1922), cause the processor to implement any of the above text recognition methods.
The present disclosure may be implemented as an apparatus (system), a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement the text recognition method of the present disclosure.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of implementations of apparatuses (systems), methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending upon the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary and is not intended to limit the present disclosure. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the scope of the present disclosure.

Claims (18)

  1. A text recognition method, comprising:
    performing feature extraction processing on an image to be detected to obtain a plurality of semantic vectors, wherein the plurality of semantic vectors respectively correspond to a plurality of characters of a text sequence in the image to be detected;
    sequentially performing recognition processing on the plurality of semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence.
  2. The method according to claim 1, wherein the sequentially performing recognition processing on the plurality of semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence comprises:
    processing prior information of a target semantic vector through the convolutional neural network to obtain a weight parameter of the target semantic vector, wherein the target semantic vector is one of the plurality of semantic vectors;
    determining, according to the weight parameter and the target semantic vector, a text recognition result corresponding to the target semantic vector.
  3. The method according to claim 2, wherein the prior information comprises a text recognition result corresponding to a semantic vector preceding the target semantic vector and/or a start token.
  4. The method according to claim 2 or 3, wherein the processing prior information of a target semantic vector through the convolutional neural network to obtain a weight parameter of the target semantic vector comprises:
    encoding the target semantic vector through at least one first convolutional layer of the convolutional neural network to obtain a first vector of the target semantic vector;
    encoding the prior information of the target semantic vector through at least one second convolutional layer of the convolutional neural network to obtain a second vector corresponding to the prior information;
    determining the weight parameter based on the first vector and the second vector.
  5. The method according to claim 4, wherein the encoding the prior information of the target semantic vector through at least one second convolutional layer of the convolutional neural network to obtain a second vector corresponding to the prior information comprises:
    in response to the prior information comprising the text recognition result corresponding to the semantic vector preceding the target semantic vector, performing word-embedding processing on the text recognition result corresponding to the preceding semantic vector to obtain a feature vector corresponding to the prior information;
    encoding the feature vector through the at least one second convolutional layer of the convolutional neural network to obtain the second vector.
  6. The method according to claim 4 or 5, wherein the encoding the prior information of the target semantic vector through at least one second convolutional layer of the convolutional neural network to obtain a second vector corresponding to the prior information comprises:
    encoding, through the at least one second convolutional layer of the convolutional neural network, an initial vector corresponding to a start token in the prior information to obtain the second vector.
  7. The method according to any one of claims 2 to 6, wherein the determining, according to the weight parameter and the target semantic vector, a text recognition result corresponding to the target semantic vector comprises:
    obtaining, based on the weight parameter and the target semantic vector, an attention distribution vector corresponding to the target semantic vector;
    decoding the attention distribution vector through at least one deconvolution layer of the convolutional neural network to determine the text recognition result corresponding to the target semantic vector.
  8. The method according to any one of claims 1 to 7, wherein the performing feature extraction processing on an image to be detected to obtain a plurality of semantic vectors comprises:
    performing feature extraction on the image to be detected to obtain feature information;
    downsampling the feature information to obtain the plurality of semantic vectors.
  9. A text recognition apparatus, comprising:
    an extraction module, configured to perform feature extraction processing on an image to be detected to obtain a plurality of semantic vectors, wherein the plurality of semantic vectors respectively correspond to a plurality of characters of a text sequence in the image to be detected;
    a recognition module, configured to sequentially perform recognition processing on the plurality of semantic vectors through a convolutional neural network to obtain a recognition result of the text sequence.
  10. The apparatus according to claim 9, wherein the recognition module is configured to:
    process prior information of a target semantic vector through the convolutional neural network to obtain a weight parameter of the target semantic vector, wherein the target semantic vector is one of the plurality of semantic vectors;
    determine, according to the weight parameter and the target semantic vector, a text recognition result corresponding to the target semantic vector.
  11. The apparatus according to claim 10, wherein the prior information comprises a text recognition result corresponding to a semantic vector preceding the target semantic vector and/or a start token.
  12. The apparatus according to claim 10 or 11, wherein the recognition module is configured to:
    encode the target semantic vector through at least one first convolutional layer of the convolutional neural network to obtain a first vector of the target semantic vector;
    encode the prior information of the target semantic vector through at least one second convolutional layer of the convolutional neural network to obtain a second vector corresponding to the prior information;
    determine the weight parameter based on the first vector and the second vector.
  13. The apparatus according to claim 12, wherein the recognition module is configured to:
    in response to the prior information comprising the text recognition result corresponding to the semantic vector preceding the target semantic vector, perform word-embedding processing on the text recognition result corresponding to the preceding semantic vector to obtain a feature vector corresponding to the prior information;
    encode the feature vector through the at least one second convolutional layer of the convolutional neural network to obtain the second vector.
  14. The apparatus according to claim 12 or 13, wherein the recognition module is configured to:
    encode, through the at least one second convolutional layer of the convolutional neural network, an initial vector corresponding to a start token in the prior information to obtain the second vector.
  15. The apparatus according to any one of claims 10 to 14, wherein the recognition module is configured to:
    obtain, based on the weight parameter and the target semantic vector, an attention distribution vector corresponding to the target semantic vector;
    decode the attention distribution vector through at least one deconvolution layer of the convolutional neural network to determine the text recognition result corresponding to the target semantic vector.
  16. The apparatus according to any one of claims 9 to 15, wherein the extraction module is configured to:
    perform feature extraction on the image to be detected to obtain feature information;
    downsample the feature information to obtain the plurality of semantic vectors.
  17. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor,
    wherein the processor, when executing the instructions stored in the memory, implements the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, cause the processor to implement the method according to any one of claims 1 to 8.
PCT/CN2020/072804 2019-03-29 2020-01-17 Text recognition method and apparatus, electronic device, and storage medium WO2020199730A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020561646A JP7153088B2 (ja) 2019-03-29 2020-01-17 Text recognition method and text recognition apparatus, electronic device, storage medium, and computer program
SG11202010916SA SG11202010916SA (en) 2019-03-29 2020-01-17 Text recognition method and apparatus, electronic device and storage medium
US17/081,758 US12014275B2 (en) 2019-03-29 2020-10-27 Method for text recognition, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910251661.4 2019-03-29
CN201910251661.4A CN111753822B (zh) 2019-03-29 2019-03-29 Text recognition method and apparatus, electronic device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/081,758 Continuation US12014275B2 (en) 2019-03-29 2020-10-27 Method for text recognition, electronic device and storage medium
