WO2020228519A1 - Character recognition method and apparatus, computer device, and storage medium - Google Patents

Character recognition method and apparatus, computer device, and storage medium

Info

Publication number
WO2020228519A1
WO2020228519A1 · PCT/CN2020/087010
Authority
WO
WIPO (PCT)
Prior art keywords
image
attention
image feature
feature vectors
feature vector
Prior art date
Application number
PCT/CN2020/087010
Other languages
English (en)
French (fr)
Inventor
吕鹏原
杨志成
冷欣航
李睿宇
沈小勇
戴宇荣
贾佳亚
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2020228519A1
Priority to US17/476,327 (published as US20220004794A1)

Classifications

    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G06V30/18057 Detecting partial patterns by matching or filtering, integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, by matching or filtering
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G06V30/10 Character recognition

Definitions

  • This application relates to the field of network technology, and in particular to character recognition technology.
  • With growing user demand for intelligent recognition technology, character recognition has broad application requirements. It can be used in any scenario where text information needs to be extracted from images, such as document digitization, ID photo recognition, image-based public opinion monitoring, and illegal image filtering.
  • In practice, the text to be recognized is often irregular. For this reason, it is often necessary in the character recognition process to first convert the irregular text into regular text and then recognize the regular text.
  • The embodiments of the present application provide a character recognition method, apparatus, computer device, and readable storage medium, which improve character recognition efficiency.
  • The technical solution is as follows:
  • In one aspect, a character recognition method includes:
  • extracting image features of an image to be recognized, the image features including multiple image feature vectors;
  • obtaining a target number of attention weights through parallel calculation based on the multiple image feature vectors, where one attention weight is used to indicate the importance of the multiple image feature vectors to the character corresponding to that attention weight;
  • obtaining at least one character according to the multiple image feature vectors and the target number of attention weights.
  • In one aspect, a character recognition apparatus includes:
  • a feature extraction unit, configured to extract image features of an image to be recognized, the image features including multiple image feature vectors;
  • a parallel processing unit, configured to obtain a target number of attention weights through parallel calculation based on the multiple image feature vectors, where one attention weight is used to indicate the importance of the multiple image feature vectors to the character corresponding to that attention weight;
  • a character obtaining unit, configured to obtain at least one character according to the multiple image feature vectors and the target number of attention weights.
  • the device further includes:
  • a dependency acquisition unit, configured to acquire a dependency feature vector of each image feature vector in the two-dimensional image features, the dependency feature vector being used to represent the image information of the image feature vector and its dependency relationships with the other image feature vectors;
  • the parallel processing unit is specifically configured to obtain the target number of attention weights through parallel calculation based on the dependency feature vectors of the multiple image feature vectors.
  • the feature extraction unit is configured to input the image into a convolutional neural network, perform feature extraction on the image through each channel of the backbone network in the convolutional neural network, and output the image features.
  • the dependency acquisition unit is configured to input the multiple image feature vectors into the relational attention module of the character recognition model; the conversion units in each layer of the relational attention module perform similarity calculation between each image feature vector and the other image feature vectors in the attention mapping space to obtain the weights corresponding to the image feature vector and the other image feature vectors, perform calculation based on the obtained weights, and output the dependency feature vector of the image feature vector.
  • the feature extraction unit is configured to: concatenate each image feature vector in the image features to obtain a feature sequence; determine a corresponding position vector for each image feature vector based on its position in the feature sequence; and obtain the multiple position-encoded image feature vectors according to each image feature vector and its corresponding position vector.
  • the parallel processing unit is configured to input the dependency feature vectors of the multiple image feature vectors into the parallel attention module, calculate the input feature vectors in parallel through the target number of output nodes in the parallel attention module, and output the target number of attention weights.
  • the character acquisition unit includes:
  • a feature determination subunit, configured to obtain at least one attention feature according to the multiple image feature vectors and the target number of attention weights;
  • the decoding subunit is used to decode the at least one attention feature to obtain the at least one character.
  • the decoding subunit is configured to input the at least one attention feature into the decoding module of the character recognition model; for each attention feature, obtain the dependency feature corresponding to the attention feature through the decoding module, decode the dependency feature vector corresponding to the attention feature, and output the character with the highest probability among the decoded candidates as the character corresponding to the attention feature.
  • In one aspect, a computer device includes a processor and a memory, and at least one instruction is stored in the memory; the instruction is loaded and executed by the processor to implement the operations performed by the character recognition method described above.
  • In one aspect, a computer-readable storage medium is provided, and at least one instruction is stored in the computer-readable storage medium; the instruction is loaded and executed by a processor to implement the operations performed by the above character recognition method.
  • In one aspect, a computer program product including instructions is provided which, when run on a computer, causes the computer to execute the above method.
  • The technical solution of the embodiments of the present application can be used to extract a target number of characters from an image to be recognized.
  • Image features of the image are extracted, the image features including multiple image feature vectors.
  • An attention mechanism is adopted to calculate in parallel, based on the multiple image feature vectors, the attention weights corresponding to the target number of output characters.
  • Each attention weight may indicate the importance of the multiple image feature vectors for the character corresponding to that attention weight.
  • Fig. 1 shows a structural block diagram of a character recognition system provided by an exemplary embodiment of the present application
  • Figure 2a is a brief flow chart of the character recognition process involved in an embodiment of the present application.
  • Figure 2b is a brief flowchart of the character recognition process involved in an embodiment of the present application.
  • FIG. 3 is a flowchart of a character recognition method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a relational attention module provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a two-stage decoder provided by an embodiment of the present application.
  • Fig. 6a is a schematic structural diagram of a character recognition device provided by an embodiment of the present application.
  • FIG. 6b is a schematic structural diagram of a character recognition apparatus provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • Attention Mechanism: a means of quickly filtering out high-value information from a large amount of information using limited attention resources.
  • The visual attention mechanism is a brain signal processing mechanism unique to human vision. By quickly scanning the global image, human vision obtains the target area that needs to be focused on, commonly referred to as the focus of attention, and then invests more attention resources in this area to obtain more detailed information about the target while suppressing other useless information.
  • The attention mechanism is widely used in various types of deep learning tasks, such as natural language processing, image recognition, and speech recognition. It is one of the core technologies in deep learning most worthy of attention and in-depth understanding.
  • the attention mechanism mainly has two aspects: one is to decide which part of the input needs to be paid attention to; the other is to allocate limited information processing resources to important parts.
  • the attention mechanism in deep learning is essentially similar to the selective visual attention mechanism of humans.
  • the core goal is to select information that is more critical to the current task from a large number of information.
  • Fig. 1 shows a structural block diagram of a character recognition system 100 provided by an exemplary embodiment of the present application.
  • the character recognition system 100 includes a terminal 110 and a character recognition platform 140.
  • The terminal 110 is connected to the character recognition platform 140 through a wireless network or a wired network.
  • the terminal 110 may be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 player, an MP4 player, and a laptop portable computer.
  • An application program that supports character recognition is installed and running on the terminal 110.
  • the application can be any of social applications, instant messaging applications, translation applications, shopping applications, browser programs, and video programs.
  • The terminal 110 is a terminal used by a first user, and a first user account is logged into the application program running in the terminal 110.
  • the terminal 110 is connected to the character recognition platform 140 through a wireless network or a wired network.
  • the character recognition platform 140 includes at least one of a server, multiple servers, a cloud computing platform, and a virtualization center.
  • the character recognition platform 140 is used to provide background services for application programs that support character recognition.
  • The character recognition platform 140 is responsible for the main recognition work and the terminal 110 is responsible for the secondary recognition work; or the character recognition platform 140 is responsible for the secondary recognition work and the terminal 110 is responsible for the main recognition work; or the character recognition platform 140 or the terminal 110 can each undertake the recognition work alone.
  • the character recognition platform 140 includes: an access server, a character recognition server, and a database.
  • The access server is used to provide access services for the terminal 110.
  • the character recognition server is used to provide background services related to character recognition.
  • When there are multiple character recognition servers, there are at least two character recognition servers that provide different services, and/or there are at least two character recognition servers that provide the same service, for example, providing the same service in a load-balancing manner.
  • a character recognition model may be set in the character recognition server.
  • the character recognition model is a recognition model constructed based on the attention mechanism.
  • the terminal 110 may generally refer to one of multiple terminals, and this embodiment only uses the terminal 110 as an example for illustration.
  • the type of the terminal 110 includes at least one of a smart phone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 player, an MP4 player, and a laptop portable computer.
  • the number of the aforementioned terminals may be more or less.
  • the foregoing terminal may be only one, or the foregoing terminal may be tens or hundreds, or a greater number.
  • the foregoing character recognition system may also include other terminals.
  • the embodiments of this application do not limit the number of terminals and device types.
  • Figure 2a is a brief flow chart of the character recognition process involved in an embodiment of the present application.
  • the entire character recognition model includes a total of three modules: an image feature extraction module, a parallel attention module, and a decoding module.
  • h and w can be used to represent the size of the input image, c is the number of channels of the image features obtained by feature extraction, and n, the number of output nodes of the parallel attention module, is the target number.
  • h, w, n, and c are all positive integers greater than 1.
  • the aforementioned attention mechanism is used to assign different weights to multiple image feature vectors.
  • Image feature vectors with higher importance are assigned higher weights, and image feature vectors with lower importance are assigned lower weights, thereby reducing the influence of less important image feature vectors on decoding.
  • Figure 2b is a brief flow chart of another character recognition process involved in an embodiment of the present application.
  • the character recognition model includes a total of 4 modules: image feature extraction module, relational attention module, parallel attention module, and decoding module.
  • This solution can input the image feature vectors in the image features to the relational attention module to obtain the dependency relationships between the image feature vectors.
  • Each dependency relationship can be expressed as a c-dimensional vector.
  • The output of the relational attention module is input to the parallel attention module to obtain n attention weights; the attention feature corresponding to each character is then obtained from the image feature vectors and the attention weights, and finally the attention features are decoded into characters by the decoding module.
  • The aforementioned attention mechanism is used to assign different weights to multiple image feature vectors: image feature vectors with higher importance (higher similarity) are assigned higher weights, and image feature vectors with lower importance (lower similarity) are assigned lower weights, thereby reducing the influence of less important image feature vectors on decoding.
  • A specific implementation process of a character recognition method is provided, as shown in FIG. 3.
  • the embodiment of the present application only takes a computer device as the execution subject for illustration.
  • the computer device can be implemented as a terminal or a server in the implementation environment. Referring to Fig. 3, the method includes:
  • a computer device extracts image features of an image to be recognized.
  • the image feature includes multiple image feature vectors.
  • the computer device can input the image to be recognized into the image feature extraction module of the character recognition model, perform feature extraction on the image through each channel of the image feature extraction module, and output image features including multiple image feature vectors .
  • the embodiment of the present application does not limit the dimension of the image feature, and the image feature may be a one-dimensional image feature or a two-dimensional image feature.
  • the dimensions of the two-dimensional image feature may be the dimensions for the two directions of the image width and height.
  • The process of extracting the two-dimensional image features of the image to be recognized can be implemented using the backbone network of a convolutional neural network.
  • The backbone network can be based on a residual structure (ResNet); of course, the backbone network includes but is not limited to ResNet, and other convolutional neural networks, such as Inception-ResNet-V2, NASNet, MobileNet, etc., may also be used, which is not limited here.
  • The backbone network may be the remaining structure of the convolutional neural network excluding the classification module, which may include multiple convolutional layers; for example, the backbone network may be the convolutional neural network retained up to and including the last convolutional layer.
  • the output of the backbone network can be a feature map of the image.
  • the two-dimensional image features of the image can be extracted through the image feature extraction module.
  • The total size of the two-dimensional image features output by the image feature extraction module can be denoted as k×c.
  • That is, the two-dimensional image features consist of k c-dimensional image feature vectors, and each image feature vector can be expressed as I i , where i is a positive integer less than or equal to k.
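  • To make this step concrete, the following is a minimal sketch of the feature-extraction stage, assuming a torchvision ResNet-50 truncated before its pooling and classification layers; the patent does not fix the backbone choice, input resolution, or channel count c, so the concrete numbers here are illustrative only.

```python
import torch
import torchvision

# Backbone: ResNet-50 kept up to the last convolutional stage (drop avgpool + fc).
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
)

image = torch.randn(1, 3, 64, 256)  # (batch, channels, h, w) -- sizes are assumptions
feature_map = backbone(image)       # (1, c, h', w'), here (1, 2048, 2, 8)

b, c, hp, wp = feature_map.shape
k = hp * wp                         # k two-dimensional positions -> k image feature vectors
feature_vectors = feature_map.flatten(2).transpose(1, 2)  # (1, k, c): I_1 .. I_k
```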
  • The computer device obtains a target number of attention weights through parallel calculation based on the multiple image feature vectors, where one attention weight is used to indicate the importance of the multiple image feature vectors to the character corresponding to that attention weight.
  • the computer device obtains at least one attention feature according to the plurality of image feature vectors and the attention weight of the target quantity.
  • The target number mentioned above may be a preset number of output characters for the character recognition model.
  • The target number may be set for the character recognition model, so that the target number of characters is output from the image to be recognized through the character recognition model.
  • Each character obtained by recognizing the image may correspond to one attention weight, and the attention weight is used to indicate the importance of the multiple image feature vectors to the character corresponding to that attention weight.
  • the character corresponding to the attention weight can be obtained and output.
  • The number of characters in some images may be less than the target number. For the characters included in the image, some of the values in the calculated attention weights are relatively high; for characters not included in the image, all of the values in the corresponding attention weights are relatively low, so the character feature calculated according to such an attention weight is close to 0, and the output character corresponding to that attention weight is empty.
  • For example, the image to be recognized includes one letter, 'a', and 3 image feature vectors are extracted from the image, namely m1, m2, and m3, corresponding to 3 regions in the image.
  • The target number is 2, and the two calculated attention weights are x1 and x2.
  • x1 corresponds to the first character that should be output by recognizing the image; the importance values in x1 for the three image feature vectors are 0.8, 0, and 0 (corresponding to m1, m2, and m3).
  • Based on the attention weight x1 of the first character, the importance of the image feature vector m1 is higher, that is, the first character is more likely to be in the region corresponding to m1; therefore, according to the three image feature vectors and the attention weight x1, the first character is 'a'.
  • The attention weight x2 corresponds to the second character that should be output by recognizing the image, and the importance values in x2 for the three image feature vectors are 0, 0, and 0 (corresponding to m1, m2, and m3). Based on the attention weight x2 of the second character, the importance of all three image feature vectors is low, and it can be determined that a second character is not included in the image; therefore, according to the three image feature vectors and the attention weight x2, the second character obtained is empty.
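  • As a toy numeric illustration of the x1/x2 example above (the 4-dimensional vectors and all values are made up for readability; the real vectors are c-dimensional):

```python
import numpy as np

# Hypothetical stand-ins for the image feature vectors m1, m2, m3.
m = np.array([[0.9, 0.1, 0.0, 0.2],   # m1: region containing the letter 'a'
              [0.0, 0.1, 0.0, 0.1],   # m2: background region
              [0.1, 0.0, 0.1, 0.0]])  # m3: background region

x1 = np.array([0.8, 0.0, 0.0])  # attention weight of the first character
x2 = np.array([0.0, 0.0, 0.0])  # attention weight of the second character

g1 = x1 @ m  # strong attention feature -> decodes to 'a'
g2 = x2 @ m  # near-zero attention feature -> decodes to the empty character
print(g1)    # [0.72 0.08 0.   0.16]
print(g2)    # [0. 0. 0. 0.]
```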
  • The calculation of the target number of attention weights can be performed through the aforementioned parallel attention module.
  • The parallel attention module includes a target number of output nodes, denoted as n, where n is an integer less than k, and each output node can calculate an attention weight in parallel according to the multiple input image feature vectors.
  • In a possible implementation, the parallel attention module uses the following formula to calculate the input image feature vectors and output the target number of attention weights:
  • α = softmax(W2 · tanh(W1 · O^T))
  • where α is used to represent the attention weights, tanh() is the hyperbolic tangent function, softmax() is the normalization function, O^T is the input of the output nodes (that is, the image feature vectors), and W1 and W2 are learned parameters.
  • The above step 302 is the process of obtaining the target number of attention weights based on the dependency feature vectors of the multiple image feature vectors.
  • A parallel attention module is used to perform the specific calculation process.
  • The parallel attention module differs from the traditional attention module in that it no longer determines the attention weight at the current moment based on the value at the previous moment; instead, it removes the dependency between output nodes, so that the calculation for each node is independent, realizing parallel calculation.
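  • The following is a minimal sketch of such a parallel attention module, implementing α = softmax(W2 · tanh(W1 · O^T)); the hidden size and all tensor shapes are assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    def __init__(self, c: int, hidden: int, n: int):
        super().__init__()
        self.w1 = nn.Linear(c, hidden, bias=False)  # W1
        self.w2 = nn.Linear(hidden, n, bias=False)  # W2: one row per output node

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        # o: (batch, k, c) image feature vectors (or dependency feature vectors)
        e = self.w2(torch.tanh(self.w1(o)))  # (batch, k, n)
        alpha = torch.softmax(e, dim=1)      # normalize over the k feature vectors
        return alpha.transpose(1, 2)         # (batch, n, k): n attention weights

attn = ParallelAttention(c=512, hidden=256, n=25)
alpha = attn(torch.randn(2, 16, 512))        # all n output nodes computed in parallel
```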
  • The computer device can obtain at least one attention feature according to the multiple image feature vectors and the target number of attention weights, and decode it through the decoding module in the character recognition model of FIG. 2a or FIG. 2b.
  • The technical solution of the embodiments of the present application can be used to extract a target number of characters from an image to be recognized.
  • Image features of the image are extracted, the image features including multiple image feature vectors.
  • An attention mechanism is adopted to calculate in parallel, based on the multiple image feature vectors, the attention weights corresponding to the target number of output characters.
  • Each attention weight may indicate the importance of the multiple image feature vectors for the character corresponding to that attention weight.
  • The attention mechanism is used to determine the dependencies between the image feature vectors in the two-dimensional image features, and the parallel calculation of attention weights is then used to determine the importance of the features.
  • Character recognition can be performed directly based on the two-dimensional image features and the importance of each feature vector in the two-dimensional image features.
  • Because the above processing based on the two-dimensional image features retains the spatial information of the features, it can greatly improve the accuracy of character recognition.
  • the computer device inputs the image to be recognized into the image feature extraction module of the character recognition model, performs feature extraction on the image through each channel of the image feature extraction module, and outputs image features including multiple image feature vectors.
  • The computer device inputs the multiple image feature vectors in the two-dimensional image features output by the image feature extraction module into the relational attention module of the character recognition model; the conversion units in each layer of the relational attention module perform similarity calculation between each image feature vector and the other image feature vectors in the attention mapping space to obtain the weight of each image feature vector, perform calculation based on the obtained weights, and output the dependency feature vector of each image feature vector.
  • Linear weighting can be performed based on the weights, and the feature vector obtained by the linear weighting can be non-linearly processed to obtain the dependency feature vector of the image feature vector.
  • the relational attention module is composed of many conversion units, and is a multi-layer bidirectional structure.
  • the number of conversion units in each layer is equal to the number of input image feature vectors.
  • Figure 4(a) represents the internal structure of the relational attention module; the relational attention module includes multiple layers, and each layer includes the same number of conversion units as input image feature vectors.
  • Figure 4 (b) shows the internal structure of a conversion unit.
  • In Figure 4(b), dotmat is used to indicate dot-product calculation, softmax is used to indicate normalization processing, matmul is used to indicate matrix multiplication, layernorm is used to indicate normalization processing in the channel direction, linear is used to indicate linear calculation, and GELU is used to indicate transformation processing based on the GELU activation function.
  • Each conversion unit includes three inputs: Query, Key, and Value. This can be understood as a dictionary lookup process: the Key-Value pairs constitute a dictionary, the user gives a Query, and the computer device finds the matching Key and returns the corresponding Value.
  • In the attention mechanism, the similarity between the Query and each input Key is calculated separately to assign weights to all Values, and their weighted sum is output as the value of this output.
  • The inputs of the conversion unit, Q l i , K l i , and V l i , are represented by formulas (1), (2), and (3), where l represents the layer where the conversion unit is located, and i represents the i-th conversion unit of that layer.
  • Q l i , the Query of the i-th conversion unit of the l-th layer, can be a 1×c vector; K l i and V l i represent the corresponding Key and Value, and their sizes are both k×c.
  • O l-1 is the output of all conversion units in the previous layer, and its shape and size are also k×c.
  • The output of the conversion unit in each layer of the relational attention module is a weighted sum of its inputs, and the weight is represented by formula (4), a softmax-normalized similarity between the Query and each Key, where W is the learned parameter and the denominator of the formula represents the normalization term summed over the k conversion units.
  • The output of the conversion unit is represented by formula (5), where Func() is a non-linear function: based on the non-linear function, a linear result with limited representation ability is subjected to non-linear processing to improve its representation ability. It should be noted that Func() can adopt any non-linear function, which is not limited in the embodiments of the present application.
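  • As a sketch of one conversion unit, the following assumes a standard transformer-style layout (dot-product attention with softmax, matrix multiplication, layernorm, and a linear + GELU feed-forward) matching the blocks named in Figure 4(b); the exact wiring in the patent figure may differ.

```python
import torch
import torch.nn as nn

class ConversionUnit(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.q = nn.Linear(c, c)      # projects the Query into the attention mapping space
        self.k = nn.Linear(c, c)      # Key projection
        self.v = nn.Linear(c, c)      # Value projection
        self.norm1 = nn.LayerNorm(c)  # layernorm: normalization in the channel direction
        self.ff = nn.Sequential(nn.Linear(c, c), nn.GELU(), nn.Linear(c, c))
        self.norm2 = nn.LayerNorm(c)

    def forward(self, o_prev: torch.Tensor) -> torch.Tensor:
        # o_prev: (batch, k, c), outputs of all conversion units in the previous layer
        q, k, v = self.q(o_prev), self.k(o_prev), self.v(o_prev)
        sim = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        attended = self.norm1(o_prev + sim @ v)          # weighted sum of Values + residual
        return self.norm2(attended + self.ff(attended))  # non-linear Func() processing

layer = ConversionUnit(c=512)
deps = layer(torch.randn(2, 16, 512))  # (batch, k, c) dependency feature vectors
```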
  • In a possible implementation, the position sensitivity of the image feature vectors can be improved as follows: each image feature vector in the two-dimensional image features is concatenated to obtain a feature sequence; a corresponding position vector is determined for each image feature vector based on its position in the feature sequence, where the position vector may be a vector with the same dimension as the image feature vector; then, according to each image feature vector and its corresponding position vector, for example by adding each image feature vector and the corresponding position vector, the processed multiple image feature vectors are obtained. Since the position vector represents the position of the feature vector, the value of the resulting image feature vector at the corresponding position changes significantly, thereby achieving the purpose of improving position sensitivity.
  • The above processing of the image feature vectors can be understood as the following process: the total size of the feature vectors output by the channels is k×c, so they can be expanded into a feature sequence that includes k c-dimensional feature vectors. Encoding can be based on the position of each feature vector in the feature sequence; for example, the first feature vector in the sequence can be encoded to obtain a c-dimensional position vector E i equal to (1,0,0,…,0), and each feature vector I i is then added to its corresponding position vector E i to obtain a position-sensitive image feature vector.
  • The processed image feature vector can be represented by F i .
  • The computer device can use the processed multiple image feature vectors as the input of the first layer of the relational attention module and continue with the weighting calculation and other processing, so as to output the dependency feature vector of each image feature vector.
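  • A sketch of this position-encoding step, assuming one-hot position vectors as in the (1,0,0,…,0) example above (which requires c ≥ k; learned position embeddings would fit the same slot):

```python
import torch

k, c = 16, 512         # number of feature vectors and channel count (assumed)
I = torch.randn(k, c)  # feature sequence I_1 .. I_k from the feature extraction module
E = torch.eye(k, c)    # E_i: one-hot position vector of dimension c (assumes c >= k)
F = I + E              # F_i = I_i + E_i: position-sensitive image feature vectors
```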
  • This method calculates the attention weights based on the dependency feature vectors corresponding to the image feature vectors. Since the dependency feature vector reflects both the image information corresponding to the image feature vector and the dependency relationships between the image feature vector and other image feature vectors, these dependencies are taken into account in the process of calculating the attention weights, thereby improving the accuracy of the attention weight calculation and, in turn, the efficiency of character recognition.
  • The computer device inputs the dependency feature vectors of the multiple image feature vectors output by the relational attention module into the parallel attention module in the character recognition model, calculates the input feature vectors in parallel through each output node in the parallel attention module, and outputs the target number of attention weights.
  • Each output node calculates an attention weight for the input feature vectors in parallel.
  • In a possible implementation, the parallel attention module uses formula (6), α = softmax(W2 · tanh(W1 · O^T)), to calculate the input image feature vectors and output the target number of attention weights; here its input O^T can be the output of the relational attention module.
  • The computer device obtains at least one attention feature according to the multiple image feature vectors and the target number of attention weights.
  • G i is the attention feature output by the i-th output node, obtained by the weighted summation of the attention weight and the image feature vectors.
  • The attention feature may be the feature used to obtain, through decoding, the i-th character in the image to be recognized.
  • The attention weight α can be understood as the degree of importance of the multiple image feature vectors for the i-th character, or the degree of attention paid to each part of the input image at the current moment; from an image-processing perspective it can also be understood as a mask, and the attention feature obtained by the weighted summation of the attention weights and the image feature vectors can be understood as the result of the network selectively observing the input image.
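  • Under the mask interpretation above, step 404 reduces to a batched weighted sum; a minimal sketch, with all shapes assumed for illustration:

```python
import torch

alpha = torch.rand(2, 25, 16)                # (batch, n, k) attention weights
alpha = alpha / alpha.sum(-1, keepdim=True)  # each weight normalized over the k vectors
F = torch.randn(2, 16, 512)                  # (batch, k, c) image feature vectors
G = alpha @ F                                # (batch, n, c): attention feature G_i per character
```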
  • the decoding can be performed in the following step 405.
  • the computer device inputs the at least one attention feature into a two-stage decoder in the character recognition model for decoding, and outputs the at least one character.
  • the computer device can convert the attention feature into characters by decoding at least one attention feature, thereby realizing character recognition.
  • A two-stage decoder may be used to capture the interdependence between output nodes.
  • The embodiment of the present application adopts a two-stage decoder to realize the function of the decoding module.
  • The at least one attention feature is input into the two-stage decoder of the character recognition model; for each attention feature, the dependency feature vector of the attention feature is obtained through the relational attention module in the two-stage decoder, the dependency feature vector corresponding to the attention feature is then decoded by the decoder, and the character with the highest probability among the decoded candidates is output as the character corresponding to the attention feature.
  • In a possible implementation, the probability of each character can be calculated through a trained linear classification followed by normalization, for example softmax(W·G + b), where W is the weight obtained by training and b is the bias obtained during the training process; the above two-stage decoder can be obtained through training.
  • For a character sequence shorter than n, '-' can be used to pad it into a sequence of length n; a sequence longer than n is truncated to length n. '-' is a special character that can be used to indicate the end of a character sequence (end of sequence).
  • The candidate characters form a character set.
  • For an attention feature G with a size of n×c, it can be decoded by the decoder in the first branch to obtain a probability matrix corresponding to that decoder; each element in the probability matrix represents the probability that the attention feature is a given character in the character set.
  • The attention feature G can also be processed by the relational attention module in the second branch to obtain the dependencies between the attention feature and the other attention features.
  • The dependency relationships can be represented by a dependency feature vector of size n×c, which is then passed through the decoder of the second branch to obtain the probability matrix corresponding to that decoder.
  • The character with the highest probability across the probability matrix output by the first branch and the probability matrix output by the second branch is used as the character obtained by decoding the attention feature.
  • The second branch can be used as the decoding module of the character recognition model to output the final decoded character sequence.
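  • The following is a sketch of such a two-stage decoder, reusing the hypothetical ConversionUnit above for the relational attention in the second branch; the per-branch classifiers and the highest-probability fusion follow the text, while all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TwoStageDecoder(nn.Module):
    def __init__(self, c: int, num_classes: int):
        super().__init__()
        self.classify1 = nn.Linear(c, num_classes)  # decoder of the first branch
        self.relation = ConversionUnit(c)           # relational attention, second branch
        self.classify2 = nn.Linear(c, num_classes)  # decoder of the second branch

    def forward(self, G: torch.Tensor) -> torch.Tensor:
        # G: (batch, n, c) attention features
        p1 = torch.softmax(self.classify1(G), dim=-1)                 # direct decoding
        p2 = torch.softmax(self.classify2(self.relation(G)), dim=-1)  # dependency-aware decoding
        # per position, keep the branch whose best character has the higher probability
        keep1 = p1.max(-1, keepdim=True).values >= p2.max(-1, keepdim=True).values
        return torch.where(keep1, p1, p2)

decoder = TwoStageDecoder(c=512, num_classes=37)     # e.g. 36 characters + the '-' padding
chars = decoder(torch.randn(2, 25, 512)).argmax(-1)  # (batch, n) character indices
```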
  • In a possible implementation, the attention feature can also be obtained directly based on the image feature vectors using a serial calculation method, which is not specifically limited in the embodiments of the present application.
  • The applicable network structures and optimization methods of the above technical solution include, but are not limited to, the structures and loss functions provided by the figures and formulas.
  • The method provided by the embodiments of the present application extracts the two-dimensional image features of an image and uses an attention mechanism to determine the dependency relationships between image feature vectors in the two-dimensional image features, and then determines the importance of the features through parallel calculation of attention weights.
  • In the process of character recognition, character recognition can thus be performed directly based on the two-dimensional image features and the importance of each feature vector in the two-dimensional image features.
  • Because the above processing based on two-dimensional image features retains the spatial information of the features, the accuracy of character recognition can be greatly improved; moreover, through the above attention-based recognition, characters of any shape can be effectively recognized through a simple process, avoiding cyclic operations and greatly improving operational efficiency.
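  • An end-to-end sketch tying the pieces above together, reusing the hypothetical modules from the earlier snippets; the module names, layer counts, and sizes are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class CharRecognizer(nn.Module):
    def __init__(self, c=512, hidden=256, n=25, num_classes=37, layers=2):
        super().__init__()
        self.relational = nn.Sequential(*[ConversionUnit(c) for _ in range(layers)])
        self.parallel = ParallelAttention(c, hidden, n)
        self.decoder = TwoStageDecoder(c, num_classes)

    def forward(self, feature_vectors: torch.Tensor) -> torch.Tensor:
        # feature_vectors: (batch, k, c) position-encoded image feature vectors F_i
        deps = self.relational(feature_vectors)  # dependency feature vectors (step 402)
        alpha = self.parallel(deps)              # (batch, n, k) attention weights (step 403)
        G = alpha @ feature_vectors              # (batch, n, c) attention features (step 404)
        return self.decoder(G)                   # (batch, n, num_classes) probabilities (step 405)
```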
  • Fig. 6a is a schematic structural diagram of a character recognition device provided by an embodiment of the present application. Referring to Fig. 6a, the device includes:
  • the feature extraction unit 601 is configured to extract image features of an image to be recognized, where the image features include multiple image feature vectors;
  • the parallel processing unit 603 is configured to obtain a target number of attention weights through parallel calculation based on the multiple image feature vectors, where one attention weight is used to indicate the importance of the multiple image feature vectors to the character corresponding to that attention weight;
  • the character obtaining unit 604 is configured to obtain at least one character according to the multiple image feature vectors and the target number of attention weights.
  • Figure 6b shows a schematic structural diagram of a character recognition device provided by an embodiment of the present application, and the device further includes:
  • the dependency acquisition unit 602 is configured to acquire the dependency feature vector of each image feature vector in the two-dimensional image features, where the dependency feature vector is used to represent the image information of the image feature vector and its dependency relationships with other image feature vectors;
  • the parallel processing unit 603 is specifically configured to obtain the target number of attention weights through parallel calculation based on the dependency feature vectors of the multiple image feature vectors.
  • the feature extraction unit is configured to input the image into a convolutional neural network, perform feature extraction on the image through each channel of the backbone network in the convolutional neural network, and output the image feature.
  • the dependency acquisition unit is configured to input the multiple image feature vectors into the relational attention module of the character recognition model; the conversion units in each layer of the relational attention module calculate each image feature vector against the other image feature vectors in the attention mapping space to obtain the weights corresponding to the image feature vector and the other image feature vectors, perform calculation based on the obtained weights, and output the dependency feature vector of each image feature vector.
  • The inputs of the conversion unit in each layer of the relational attention module, Q l i , K l i , and V l i , are expressed by formulas (1), (2), and (3) respectively, where l represents the layer where the conversion unit is located, i represents the i-th conversion unit of that layer, Q l i , K l i , and V l i represent the inputs of the i-th conversion unit of the l-th layer, F i represents the i-th image feature vector, F is the set of the multiple image feature vectors, and O l-1 is the output of all the conversion units of the previous layer.
  • The output of the conversion unit in each layer of the relational attention module is a weighted sum of its inputs, where the weight is represented by formula (4): W is the learned parameter, a l i,j represents the weight of the j-th Key corresponding to the i-th conversion unit of the l-th layer, and the denominator of the formula is used for normalization over the k conversion units.
  • The output of the conversion unit is represented by formula (5), where Func() is a non-linear function.
  • the feature extraction unit is configured to: concatenate each image feature vector in the two-dimensional image features to obtain a feature sequence; determine a corresponding position vector for each image feature vector based on its position in the feature sequence; and obtain the multiple position-vector-processed image feature vectors according to each image feature vector and its corresponding position vector.
  • the parallel processing unit 603 is configured to input the dependency feature vectors of the multiple image feature vectors into a parallel attention module, calculate the input feature vectors in parallel through the target number of output nodes in the parallel attention module, and output the target number of attention weights.
  • In a possible implementation, the parallel attention module uses the formula α = softmax(W2 · tanh(W1 · O^T)) to calculate the input features and output the target number of attention weights, where α is used to represent the attention weights, tanh() is the hyperbolic tangent function, softmax() is the normalization function, O^T is the input of the output nodes, and W1 and W2 are the learned parameters.
  • the character acquisition unit includes:
  • a feature determination subunit, configured to obtain at least one attention feature according to the multiple image feature vectors and the target number of attention weights;
  • the decoding subunit is used to decode the at least one attention feature to obtain the at least one character.
  • the decoding subunit is configured to input the at least one attention feature into the decoding module of the character recognition model; for each attention feature, obtain the dependency feature vector of the attention feature through the decoding module, decode the dependency feature vector corresponding to the attention feature, and output the character with the highest probability among the decoded candidates as the character corresponding to the attention feature.
  • FIG. 7 is a schematic structural diagram of a server provided in an embodiment of the present application.
  • The server 700 may vary considerably due to different configurations or performance, and may include one or more processors (central processing units, CPU) 701 and one or more memories 702, where at least one instruction is stored in the memory 702, and the at least one instruction is loaded and executed by the processor 701 to implement the character recognition method provided by the foregoing method embodiments.
  • The server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, and the server may also include other components for implementing device functions, which are not described here.
  • FIG. 8 is a schematic structural diagram of a terminal provided in an embodiment of the present application.
  • The terminal 800 can be a portable mobile terminal, such as a smart phone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a notebook computer, a desktop computer, a head-mounted device, or any other intelligent terminal.
  • the terminal 800 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 800 includes a processor 801 and a memory 802.
  • the processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • The processor 801 can be implemented in at least one hardware form among digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA).
  • the processor 801 may also include a main processor and a coprocessor.
  • The main processor is a processor used to process data in the awake state, also called a central processing unit (CPU); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 801 may be integrated with a graphics processor (Graphics Processing Unit, GPU), and the GPU is used to render and draw content that needs to be displayed on the display screen.
  • the processor 801 may further include an artificial intelligence (AI) processor, and the AI processor is used to process computing operations related to machine learning.
  • the memory 802 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 802 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 802 is used to store at least one instruction, and the at least one instruction is used by the processor 801 to implement the character recognition provided in the method embodiment of the present application. method.
  • the terminal 800 may optionally further include: a peripheral device interface 803 and at least one peripheral device.
  • the processor 801, the memory 802, and the peripheral device interface 803 may be connected by a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 803 through a bus, a signal line or a circuit board.
  • the peripheral device includes at least one of a radio frequency circuit 804, a touch display screen 805, a camera 806, an audio circuit 807, a positioning component 808, and a power supply 809.
  • the peripheral device interface 803 may be used to connect at least one peripheral device related to an input/output (I/O) to the processor 801 and the memory 802.
  • the processor 801, the memory 802, and the peripheral device interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one of the processor 801, the memory 802 and the peripheral device interface 803 or The two can be implemented on separate chips or circuit boards, which are not limited in this embodiment.
  • the radio frequency circuit 804 is used to receive and transmit radio frequency (RF) signals, also called electromagnetic signals.
  • the radio frequency circuit 804 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 804 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on.
  • the radio frequency circuit 804 can communicate with other terminals through at least one wireless communication protocol.
  • The wireless communication protocol includes but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or wireless fidelity (Wireless Fidelity, WiFi) networks.
  • the radio frequency circuit 804 may also include a circuit related to Near Field Communication (NFC), which is not limited in this application.
  • the display screen 805 is used to display a user interface (User Interface, UI).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the display screen 805 also has the ability to collect touch signals on or above the surface of the display screen 805.
  • the touch signal can be input to the processor 801 as a control signal for processing.
  • the display screen 805 may also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • There may be one display screen 805, provided on the front panel of the terminal 800; in other embodiments, there may be at least two display screens 805, respectively arranged on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display screen 805 may be a flexible display screen arranged on the curved surface or the folding surface of the terminal 800. Furthermore, the display screen 805 can also be set as a non-rectangular irregular pattern, that is, a special-shaped screen.
  • the display screen 805 may be made of materials such as liquid crystal display (LCD) and organic light-emitting diode (OLED).
  • the camera assembly 806 is used to capture images or videos.
  • the camera assembly 806 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • the camera assembly 806 may also include a flash.
  • The flash can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, which can be used for light compensation under different color temperatures.
  • the audio circuit 807 may include a microphone and a speaker.
  • The microphone is used to collect sound waves from the user and the environment and convert the sound waves into electrical signals to be input to the processor 801 for processing, or input to the radio frequency circuit 804 to implement voice communication. For the purpose of stereo collection or noise reduction, there may be multiple microphones, respectively set in different parts of the terminal 800.
  • the microphone can also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert the electrical signal from the processor 801 or the radio frequency circuit 804 into sound waves.
  • the speaker can be a traditional thin-film speaker or a piezoelectric ceramic speaker.
  • When the speaker is a piezoelectric ceramic speaker, it can not only convert the electrical signal into sound waves audible to humans, but also convert the electrical signal into sound waves inaudible to humans for distance measurement and other purposes.
  • the audio circuit 807 may also include a headphone jack.
  • the positioning component 808 is used to locate the current geographic location of the terminal 800 to implement navigation or location-based service (LBS).
  • The positioning component 808 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 809 is used to supply power to various components in the terminal 800.
  • the power source 809 may be alternating current, direct current, disposable batteries, or rechargeable batteries.
  • the rechargeable battery may support wired charging or wireless charging.
  • the rechargeable battery can also be used to support fast charging technology.
  • the terminal 800 further includes one or more sensors 810.
  • the one or more sensors 810 include, but are not limited to, an acceleration sensor 811, a gyroscope sensor 812, a pressure sensor 813, a fingerprint sensor 814, an optical sensor 815, and a proximity sensor 816.
  • The acceleration sensor 811 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 800. For example, the acceleration sensor 811 can be used to detect the components of the gravitational acceleration on the three coordinate axes.
  • the processor 801 may control the touch screen 805 to display the user interface in a horizontal view or a vertical view according to the gravity acceleration signal collected by the acceleration sensor 811.
  • the acceleration sensor 811 may also be used for the collection of game or user motion data.
  • the gyroscope sensor 812 can detect the body direction and rotation angle of the terminal 800, and the gyroscope sensor 812 can cooperate with the acceleration sensor 811 to collect the user's 3D actions on the terminal 800.
  • the processor 801 can implement the following functions according to the data collected by the gyroscope sensor 812: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 813 may be disposed on the side frame of the terminal 800 and/or the lower layer of the touch screen 805.
  • The processor 801 performs left/right hand recognition or shortcut operations according to the holding signal collected by the pressure sensor 813.
  • the processor 801 controls the operability controls on the UI interface according to the user's pressure operation on the touch display screen 805.
  • the operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
  • the fingerprint sensor 814 is used to collect the user's fingerprint.
  • the processor 801 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the user's identity according to the collected fingerprint. When the user's identity is recognized as a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 814 may be provided on the front, back or side of the terminal 800. When a physical button or a manufacturer logo is provided on the terminal 800, the fingerprint sensor 814 can be integrated with the physical button or a manufacturer logo.
  • the optical sensor 815 is used to collect the ambient light intensity.
  • the processor 801 may control the display brightness of the touch screen 805 according to the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch screen 805 is decreased.
  • the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 according to the ambient light intensity collected by the optical sensor 815.
  • the proximity sensor 816, also called a distance sensor, is usually arranged on the front panel of the terminal 800.
  • the proximity sensor 816 is used to collect the distance between the user and the front of the terminal 800.
  • when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually decreases, the processor 801 controls the touch display screen 805 to switch from the bright-screen state to the off-screen state; when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually increases, the processor 801 controls the touch display screen 805 to switch from the off-screen state to the bright-screen state.
  • the structure shown in FIG. 8 does not constitute a limitation on the terminal 800, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • a computer-readable storage medium such as a memory including instructions, which may be executed by a processor in a terminal or a server to complete the character recognition method in the foregoing embodiment.
  • the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • the embodiments of the present application also provide a computer program product including instructions, which when run on a computer, cause the computer to execute the above method.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium can be a read-only memory, a magnetic disk, an optical disc, or the like.

Abstract

This application discloses a character recognition method and apparatus, a computer device, and a storage medium, and belongs to the field of image processing technologies. The technical solutions of the embodiments of this application can be used to extract a target quantity of characters from a to-be-recognized image: image features of the image are extracted, the image features including a plurality of image feature vectors, and an attention mechanism is used to compute and output in parallel, based on the plurality of image feature vectors, the attention weights corresponding to the target quantity of characters, where one attention weight can represent the importance of the plurality of image feature vectors to the character corresponding to that attention weight. In the character recognition process, the foregoing attention-based recognition can effectively recognize characters of arbitrary shapes through a simple procedure, avoiding recurrent computation and greatly improving computational efficiency.

Description

Character recognition method and apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 201910387655.1, entitled "Character recognition method and apparatus, computer device, and storage medium", filed with the China National Intellectual Property Administration on May 10, 2019, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of network technologies, and in particular, to a character recognition technology.
Background
With users' demand for intelligent recognition technologies, character recognition has enormous application demand. It can be used in any scenario where text information needs to be extracted from an image, for example, document digitization, ID photo recognition, image-based public opinion monitoring, and illegal image filtering.
In actual use, the text to be recognized is often irregular text. Therefore, in the character recognition process, the irregular text often needs to be converted into regular text, and the regular text is then recognized.
Summary
The embodiments of this application provide a character recognition method and apparatus, a computer device, and a readable storage medium, which improve character recognition efficiency. The technical solutions are as follows:
In one aspect, a character recognition method is provided, the method including:
extracting image features of a to-be-recognized image, the image features including a plurality of image feature vectors;
obtaining a target quantity of attention weights through parallel computation based on the plurality of image feature vectors, where one attention weight is used for representing the importance of the plurality of image feature vectors to the character corresponding to that attention weight; and
obtaining the at least one character according to the plurality of image feature vectors and the target quantity of attention weights.
In one aspect, a character recognition apparatus is provided, the apparatus including:
a feature extraction unit, configured to extract image features of a to-be-recognized image, the image features including a plurality of image feature vectors;
a parallel processing unit, configured to obtain a target quantity of attention weights through parallel computation based on the plurality of image feature vectors, where one attention weight is used for representing the importance of the plurality of image feature vectors to the character corresponding to that attention weight; and
a character obtaining unit, configured to obtain the at least one character according to the plurality of image feature vectors and the target quantity of attention weights.
In a possible implementation, the apparatus further includes:
a dependency obtaining unit, configured to obtain a dependency feature vector of each image feature vector in the two-dimensional image features, the dependency feature vector being used for representing image information and the dependencies between an image feature vector and other image feature vectors; and
the parallel processing unit is specifically configured to obtain the target quantity of attention weights through parallel computation based on the dependency feature vectors of the plurality of image feature vectors.
In a possible implementation, the feature extraction unit is configured to input the image into a convolutional neural network, perform feature extraction on the image through the channels of a backbone network in the convolutional neural network, and output the image feature vectors.
In a possible implementation, the dependency obtaining unit is configured to input the plurality of image feature vectors into a relation attention module of a character recognition model, compute, through the transformation units in each layer of the relation attention module, the similarities between an image feature vector and other image feature vectors in an attention mapping space to obtain the weights respectively corresponding to the image feature vector and the other image feature vectors, perform computation based on the obtained weights, and output the dependency feature vector of the image feature vector.
In a possible implementation, the feature extraction unit is configured to: concatenate the image feature vectors in the image features to obtain a feature sequence; determine, based on the position of each image feature vector in the feature sequence, a corresponding position vector for each image feature vector; and obtain, according to each image feature vector and the corresponding position vector, the plurality of image feature vectors processed with the position vectors.
In a possible implementation, the parallel processing unit is configured to input the dependency feature vectors of the plurality of image feature vectors into a parallel attention module, compute the input feature vectors in parallel through a target quantity of output nodes in the parallel attention module, and output the target quantity of attention weights.
In a possible implementation, the character obtaining unit includes:
a feature determining subunit, configured to obtain at least one attention feature according to the plurality of image feature vectors and the target quantity of attention weights; and
a decoding subunit, configured to decode the at least one attention feature to obtain the at least one character.
In a possible implementation, the decoding subunit is configured to input the at least one attention feature into a decoding module of the character recognition model, obtain, for each attention feature through the decoding module, the dependency feature corresponding to the attention feature, decode the dependency feature vector corresponding to the attention feature, and output, as the character corresponding to the attention feature, the character with the highest probability among the characters obtained through decoding.
In one aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the operations performed by the foregoing character recognition method.
In one aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing at least one instruction, the instruction being loaded and executed by a processor to implement the operations performed by the foregoing character recognition method.
In one aspect, a computer program product including instructions is provided, which, when run on a computer, causes the computer to perform the foregoing method.
The beneficial effects brought by the technical solutions provided in the embodiments of this application include at least the following:
The technical solutions of the embodiments of this application can be used to extract a target quantity of characters from a to-be-recognized image. In these solutions, image features of the to-be-recognized image are extracted, the image features including a plurality of image feature vectors, and an attention mechanism is used to compute and output in parallel, based on the plurality of image feature vectors, the attention weights corresponding to the target quantity of characters, where one attention weight can represent the importance of the plurality of image feature vectors to the character corresponding to that attention weight. In the character recognition process, the foregoing attention-based recognition can effectively recognize characters of arbitrary shapes through a simple procedure, avoiding recurrent computation and greatly improving computational efficiency.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is a structural block diagram of a character recognition system according to an exemplary embodiment of this application;
FIG. 2a is a brief flowchart of a character recognition process involved in an embodiment of this application;
FIG. 2b is a brief flowchart of a character recognition process involved in an embodiment of this application;
FIG. 3 is a flowchart of a character recognition method according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of a relation attention module according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a two-stage decoder according to an embodiment of this application;
FIG. 6a is a schematic structural diagram of a character recognition apparatus according to an embodiment of this application;
FIG. 6b is a schematic structural diagram of a character recognition apparatus according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a server according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a terminal according to an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
Attention mechanism: a means of quickly screening high-value information out of a large amount of information using limited attention resources. The visual attention mechanism is a brain signal processing mechanism unique to human vision. By quickly scanning the global image, human vision obtains the target region that needs to be focused on, generally referred to as the focus of attention, and then devotes more attention resources to this region to obtain more detailed information about the target of interest while suppressing other useless information. The attention mechanism is widely used in various types of deep learning tasks, such as natural language processing, image recognition, and speech recognition, and is one of the core technologies in deep learning most worthy of attention and in-depth understanding.
In summary, the attention mechanism has two main aspects: deciding which part of the input needs to be focused on, and allocating the limited information processing resources to the important part. The attention mechanism in deep learning is essentially similar to the human selective visual attention mechanism, and its core goal is also to select, from a large amount of information, the information more critical to the current task.
FIG. 1 is a structural block diagram of a character recognition system 100 according to an exemplary embodiment of this application. The character recognition system 100 includes a terminal 110 and a character recognition platform 140.
The terminal 110 is connected to the character recognition platform 140 through a wireless network or a wired network. The terminal 110 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 player, an MP4 player, and a laptop portable computer. An application supporting character recognition is installed and run on the terminal 110. The application may be any one of a social application, an instant messaging application, a translation application, a shopping application, a browser, and a video application. Exemplarily, the terminal 110 is a terminal used by a first user, and a first user account is logged in to the application running on the terminal 110.
The terminal 110 is connected to the character recognition platform 140 through a wireless network or a wired network.
The character recognition platform 140 includes at least one of one server, a plurality of servers, a cloud computing platform, and a virtualization center. The character recognition platform 140 is configured to provide a background service for the application supporting character recognition. Optionally, the character recognition platform 140 undertakes the primary recognition work and the terminal 110 undertakes the secondary recognition work; or the character recognition platform 140 undertakes the secondary recognition work and the terminal 110 undertakes the primary recognition work; or the character recognition platform 140 or the terminal 110 may separately undertake the recognition work.
Optionally, the character recognition platform 140 includes an access server, a character recognition server, and a database. The access server is configured to provide an access service for the terminal 110. The character recognition server is configured to provide background services related to character recognition. There may be one or more character recognition servers. When there are a plurality of character recognition servers, at least two character recognition servers are configured to provide different services, and/or at least two character recognition servers are configured to provide the same service, for example, provide the same service in a load-balancing manner, which is not limited in the embodiments of this application. A character recognition model may be set in the character recognition server. In the embodiments of this application, the character recognition model is a recognition model built based on the attention mechanism.
The terminal 110 may generally refer to one of a plurality of terminals, and this embodiment uses only the terminal 110 as an example for description. The types of the terminal 110 include at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 player, an MP4 player, and a laptop portable computer.
A person skilled in the art may know that there may be more or fewer terminals. For example, there may be only one terminal, or there may be dozens, hundreds, or more terminals, in which case the character recognition system further includes other terminals. The embodiments of this application do not limit the quantity and the device types of the terminals.
FIG. 2a is a brief flowchart of a character recognition process involved in an embodiment of this application. Referring to FIG. 2a, the entire character recognition model includes three modules in total: an image feature extraction module, a parallel attention module, and a decoding module.
Based on the foregoing modules, when an image is input into the character recognition model, h and w may respectively represent the size of the input image, c is the channel quantity of the image features obtained through feature extraction, and n is the quantity of output nodes of the parallel attention module, that is, the target quantity, where h, w, n, and c are all positive integers greater than 1. First, the image feature extraction module is used to extract the image features of the image; then, the image feature vectors in the image features are input into the parallel attention module to obtain n attention weights (masks); next, the image feature vectors and the attention weights are used to obtain the attention feature (glimpse) corresponding to each character; and finally, the decoding module decodes the attention features into characters.
In the embodiments of this application, the attention mechanism is used to assign different weights to the plurality of image feature vectors. For example, in the parallel attention module, a higher weight is assigned to an image feature vector of higher importance, and a lower weight is assigned to an image feature vector of lower importance, thereby reducing the influence of the less important image feature vectors on decoding.
FIG. 2b is a brief flowchart of another character recognition process involved in an embodiment of this application. Referring to FIG. 2b, compared with the character recognition model in FIG. 2a, this character recognition model includes four modules in total: an image feature extraction module, a relation attention module, a parallel attention module, and a decoding module.
Compared with performing character recognition through the character recognition model corresponding to FIG. 2a, in this solution, after the image features of the image are extracted, the image feature vectors in the image features may be input into the relation attention module to obtain the interdependencies between the image feature vectors, where a dependency may be represented as a c-dimensional vector. After that, the output of the relation attention module is input into the parallel attention module to obtain n attention weights, so that the image feature vectors and the attention weights are used to obtain the attention feature corresponding to each character; finally, the decoding module decodes the attention features into characters.
In the embodiments of this application, the attention mechanism is used to assign different weights to the plurality of image feature vectors. For example, in the relation attention module, a higher weight is assigned to an image feature vector with a higher similarity to the other features in the attention mapping space, and a lower weight is assigned to an image feature vector with a lower similarity; in the parallel attention module, a higher weight is assigned to an image feature vector of higher importance, and a lower weight is assigned to an image feature vector of lower importance, thereby reducing the influence of the less important image feature vectors on decoding.
Based on the model architecture and brief flow shown in FIG. 2a, a specific implementation process of a character recognition method as shown in FIG. 3 is provided below. The embodiments of this application are described using a computer device as the execution entity by way of example only; in the implementation environment, the computer device may be implemented as a terminal or a server. Referring to FIG. 3, the method includes the following steps:
301. The computer device extracts image features of a to-be-recognized image.
The image features include a plurality of image feature vectors.
In the embodiments of this application, the computer device may input the to-be-recognized image into the image feature extraction module of the character recognition model, perform feature extraction on the image through the channels of the image feature extraction module, and output image features including a plurality of image feature vectors.
It should be noted that the embodiments of this application do not limit the dimensionality of the image features: the image features may be one-dimensional image features or two-dimensional image features, where the dimensions of the two-dimensional image features may be the dimensions in the width and height directions of the image. The following describes the technical solutions provided in the embodiments of this application using two-dimensional image features as an example.
To improve the image feature extraction speed, in a possible implementation, the process of extracting the plurality of two-dimensional image features of the to-be-recognized image may be implemented using the backbone network of a convolutional neural network. For example, the backbone network may be a backbone network based on a residual structure (ResNet). Certainly, the backbone network includes but is not limited to ResNet; it may also adopt other convolutional neural networks, such as Inception-ResNet-V2, NASNet, and MobileNet, which is not limited here.
In a possible implementation, the backbone network may be the remaining structure of a convolutional neural network with the classification module removed, and may include a plurality of convolution layers. For example, the backbone network may be a convolutional neural network retained up to the last convolution layer. The output of the backbone network may be a feature map of the image.
For example, based on the model structure shown in FIG. 2b, the two-dimensional image features of the image can be extracted through the image feature extraction module. To retain sufficient spatial information, the total size of the two-dimensional image features output by the image feature extraction module may be $\frac{h}{4} \times \frac{w}{4} \times c$. In fact, the two-dimensional image features consist of $k = \frac{h}{4} \times \frac{w}{4}$ c-dimensional image feature vectors, and each image feature vector may be denoted as $I_i$, where $i$ is a positive integer less than or equal to $k$.
302. The computer device obtains a target quantity of attention weights through parallel computation based on the plurality of image feature vectors, where one attention weight is used for representing the importance of the plurality of image feature vectors to the character corresponding to that attention weight.
303. The computer device obtains at least one attention feature according to the plurality of image feature vectors and the target quantity of attention weights.
The target quantity above may be a quantity of output characters preset for the character recognition model. In the embodiments of this application, the target quantity may be set for the character recognition model, so that a target quantity of characters are output for an image on which character recognition is performed through the character recognition model.
In the embodiments of this application, each character obtained by recognizing the image may correspond to one attention weight, and the attention weight identifies the importance of the plurality of image feature vectors to the character corresponding to the attention weight. By performing computation according to the plurality of image feature vectors and one attention weight, the character corresponding to the attention weight can be obtained and output.
It can be understood that some images may contain fewer characters than the target quantity. Consequently, for a character included in the image, some of the values in the computed attention weight are relatively high, while for a character not included in the image, all of the values in the correspondingly computed attention weight are relatively low, so that the character computed from that attention weight is close to 0 and the character output for that attention weight is empty.
An example is given below. Assume that the target quantity is 2, the to-be-recognized image includes one letter, 'a', and three image feature vectors m1, m2, and m3 are extracted from the image, corresponding to three regions of the image. The computed target quantity of attention weights, that is, the two attention weights, are x1 and x2.
Assume that x1 corresponds to the first character that should be output by recognizing the image, and the importance degrees in x1 for the three image feature vectors are 0.8, 0, and 0 (corresponding to m1, m2, and m3, respectively). Based on the attention weight x1 of the first character, the importance of the image feature vector m1 is high, that is, the first character is more likely to be located in the region corresponding to that image feature vector, so the first character 'a' is obtained according to the three image feature vectors and the attention weight x1. In addition, the attention weight x2 corresponds to the second character that should be output by recognizing the image, and the importance degrees in x2 for the three image feature vectors are 0, 0, and 0 (corresponding to m1, m2, and m3, respectively). Based on the attention weight x2 of the second character, the importance of all three image feature vectors is low, so it can be determined that the image does not include a second character, and an empty second character is obtained according to the three image feature vectors and the attention weight x2.
In specific implementation, the target quantity of attention weights can be computed through the foregoing parallel attention module. The parallel attention module includes a target quantity of output nodes, denoted as n, where n is an integer less than k. According to the plurality of input image feature vectors, each output node can compute an attention weight in parallel. In step 302, the parallel attention module computes the input image feature vectors using the following formula to output the target quantity of attention weights:

$$\alpha = \operatorname{softmax}\left(W_2 \tanh\left(W_1 O^{T}\right)\right)$$

where α represents the attention weights, tanh() is the hyperbolic tangent function, softmax() is the normalization function, $O^{T}$ is the input of the output node, that is, the image feature vectors, and $W_1$ and $W_2$ are learned parameters.
The foregoing step 302 is a process of obtaining a target quantity of attention weights based on the dependency feature vectors of the plurality of image feature vectors. In this process, the parallel attention module is used for the specific computation. Different from a conventional attention module, the parallel attention module no longer determines the attention weight at the current moment based on the value at the previous moment; instead, the interrelations between the output nodes are removed, so that the computation of each node is independent, implementing parallel computation.
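For illustration only, the following is a minimal PyTorch sketch of such a parallel attention module in the spirit of the formula above; the hidden size d and all tensor shapes and names are assumptions of the example rather than the patented implementation:

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Computes n attention weight maps over k input vectors in one shot,
    following alpha = softmax(W2 tanh(W1 O^T)); there is no recurrence, so
    the n output nodes are computed independently and in parallel."""
    def __init__(self, c: int, d: int, n: int):
        super().__init__()
        self.W1 = nn.Linear(c, d, bias=False)  # W1: projects each c-dim vector to d dims
        self.W2 = nn.Linear(d, n, bias=False)  # W2: one row per output node

    def forward(self, O: torch.Tensor) -> torch.Tensor:
        # O: (batch, k, c) input feature vectors
        scores = self.W2(torch.tanh(self.W1(O)))  # (batch, k, n)
        alpha = scores.softmax(dim=1)             # normalize over the k positions
        return alpha.transpose(1, 2)              # (batch, n, k): n attention weights

# Example: n = 25 output nodes (the target quantity) over k = 480 vectors, c = 512.
module = ParallelAttention(c=512, d=256, n=25)
alpha = module(torch.randn(2, 480, 512))          # (2, 25, 480)
```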
Thus, according to the decoding module in the character recognition model of FIG. 2a or FIG. 2b, the computer device can obtain at least one attention feature according to the plurality of image feature vectors and the target quantity of attention weights.
The technical solutions of the embodiments of this application can be used to extract a target quantity of characters from a to-be-recognized image. In these solutions, image features of the to-be-recognized image are extracted, the image features including a plurality of image feature vectors, and an attention mechanism is used to compute and output in parallel, based on the plurality of image feature vectors, the attention weights corresponding to the target quantity of characters, where one attention weight can represent the importance of the plurality of image feature vectors to the character corresponding to that attention weight. In the character recognition process, the foregoing attention-based recognition can effectively recognize characters of arbitrary shapes through a simple procedure, avoiding recurrent computation and greatly improving computational efficiency.
In addition, by extracting the two-dimensional image features of the image, using the attention mechanism to determine the dependencies between the image feature vectors in the two-dimensional image features, and then determining the importance of the features through parallel computation of the attention weights, character recognition can be performed directly based on the two-dimensional image features and the importance of each feature vector in them. Because this processing based on two-dimensional image features retains the spatial information of the features, the accuracy of character recognition can be greatly improved.
Next, based on the model architecture and brief flow shown in FIG. 2b, another character recognition method is described. The method includes the following steps:
401. The computer device inputs a to-be-recognized image into the image feature extraction module of the character recognition model, performs feature extraction on the image through the channels of the image feature extraction module, and outputs image features including a plurality of image feature vectors.
Step 401 is described in the foregoing step 301, and details are not repeated here.
402. The computer device inputs the plurality of image feature vectors in the two-dimensional image features output by the image feature extraction module into the relation attention module of the character recognition model, computes, through the transformation units in each layer of the relation attention module, the similarity between each image feature vector and the other image feature vectors in the attention mapping space to obtain the weight of each image feature vector, performs computation based on the obtained weights, and outputs the dependency feature vectors of the image feature vectors.
In a possible implementation, linear weighting may be performed based on the weights, and nonlinear processing may be performed on the feature vector obtained through the linear weighting to obtain the dependency feature vector of an image feature vector.
The relation attention module consists of many transformation units and has a multi-layer bidirectional structure. The quantity of transformation units in each layer is equal to the quantity of input image feature vectors.
Referring to FIG. 4, part (a) of FIG. 4 shows the internal structure of the relation attention module: the relation attention module includes a plurality of layers, and each layer includes the same quantity of transformation units as the input image feature vectors. Part (b) of FIG. 4 shows the internal structure of one transformation unit, where dotmat denotes dot-product computation, softmax denotes normalization, matmul denotes matrix multiplication, layernorm denotes normalization in the channel direction, linear denotes linear computation, and GELU denotes transformation based on an activation function. Each transformation unit has three inputs: Query, Key, and Value. This can be understood as a dictionary lookup process: the Key-Value pairs form a dictionary, the user gives a Query, and the computer device finds the Key identical to it and returns the corresponding Value. In the relation attention module, the similarity between the Query and each input Key can be computed and used as the weight assigned to all the Values, and their weighted sum is output as the Value of this lookup.
The inputs of a transformation unit are expressed by the following formulas (1), (2), and (3), respectively:

$$Q_i^{l} = \begin{cases} F_i, & l = 1 \\ O_i^{l-1}, & l > 1 \end{cases} \quad (1)$$

$$K_i^{l} = \begin{cases} F, & l = 1 \\ O^{l-1}, & l > 1 \end{cases} \quad (2)$$

$$V_i^{l} = \begin{cases} F, & l = 1 \\ O^{l-1}, & l > 1 \end{cases} \quad (3)$$

where $l$ denotes the layer where the transformation unit is located, $i$ denotes which transformation unit of the layer it is, $Q_i^{l}$ denotes the Query of the i-th transformation unit in the l-th layer, which may be a vector of size 1 × c, $K_i^{l}$ and $V_i^{l}$ denote the corresponding Key and Value, both of size k × c, and $O^{l-1}$ is the output of all transformation units in the previous layer, whose shape is also k × c.
It can be seen from the foregoing formulas that the input of a transformation unit in the first layer comes from the plurality of image feature vectors in the two-dimensional image features output by the image feature extraction module, and the input of a transformation unit not in the first layer comes from the outputs of all transformation units in the previous layer.
The output of a transformation unit in each layer of the relation attention module is the weighted sum of its inputs, and the weights are expressed by the following formula (4):

$$\alpha_{i,j}^{l} = \frac{\exp\left(Q_i^{l} W \left(K_j^{l}\right)^{T}\right)}{\sum_{j'=1}^{k} \exp\left(Q_i^{l} W \left(K_{j'}^{l}\right)^{T}\right)} \quad (4)$$

where $W$ is a learned parameter, $\alpha_{i,j}^{l}$ denotes the weight of the j-th Key corresponding to the i-th transformation unit in the l-th layer, and the denominator of the formula normalizes over the k transformation units.
The output of the transformation unit is expressed by the following formula (5):

$$O_i^{l} = \operatorname{Func}\left(\sum_{j=1}^{k} \alpha_{i,j}^{l} V_j^{l}\right) \quad (5)$$

In formula (5), Func() is a nonlinear function. Based on the nonlinear function, a linear function with limited representation capability is processed nonlinearly to improve its representation capability. It should be noted that any nonlinear function may be used, which is not limited in the embodiments of this application.
Taking the i-th transformation unit of the first layer as an example, the working principle of a transformation unit is introduced as follows: with F as the input, the Query of the i-th transformation unit is $F_i$, and its Key and Value are both $\{F_1, F_2, \ldots, F_k\}$; the similarities between $F_i$ and the image feature vectors in $\{F_1, F_2, \ldots, F_k\}$ in the attention mapping space are computed and normalized through softmax to obtain the weights; the normalized weights and the Values are linearly weighted, and the result of the linear weighting is processed nonlinearly through the structure in part (b) of FIG. 4 to output $O_i$, which serves as one input of the i-th transformation unit in the next layer.
Determining the dependencies directly based on the image feature vectors in this way can avoid the loss of spatial information caused when features are converted from two dimensions to one dimension. The computation load of the foregoing process is relatively small, so the computational efficiency of the character recognition process can also be improved accordingly.
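As an illustrative sketch only, one transformation-unit layer in the spirit of formulas (4) and (5) could be written as follows in PyTorch; the use of a single linear layer for the similarity parameter W, and of linear + GELU + layernorm for the nonlinear Func(), are assumptions of the example:

```python
import torch
import torch.nn as nn

class TransformationUnitLayer(nn.Module):
    """One relation attention layer: each of the k positions queries all k
    inputs, takes a softmax-weighted (linear) sum of the Values, and applies
    a nonlinearity, loosely following formulas (4) and (5)."""
    def __init__(self, c: int):
        super().__init__()
        self.W = nn.Linear(c, c, bias=False)                   # learned similarity parameter W
        self.func = nn.Sequential(nn.Linear(c, c), nn.GELU())  # assumed nonlinear Func()
        self.norm = nn.LayerNorm(c)                            # normalization in the channel direction

    def forward(self, Q: torch.Tensor, KV: torch.Tensor) -> torch.Tensor:
        # Q: (batch, k, c) Queries; KV: (batch, k, c) serves as both Key and Value.
        scores = Q @ self.W(KV).transpose(1, 2)  # (batch, k, k): Q_i . W . K_j^T
        alpha = scores.softmax(dim=-1)           # formula (4): weights over the k Keys
        weighted = alpha @ KV                    # linear weighting of the Values
        return self.norm(self.func(weighted))    # formula (5): nonlinear processing

# First layer: Query, Key, and Value all come from the image feature vectors F;
# each later layer consumes the outputs O of the previous layer.
F = torch.randn(2, 480, 512)
layer1 = TransformationUnitLayer(c=512)
layer2 = TransformationUnitLayer(c=512)
O1 = layer1(F, F)
O2 = layer2(O1, O1)   # dependency feature vectors after two layers
```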
In a possible implementation, before the image feature vectors are input into the relation attention module, the position sensitivity of the image feature vectors may be improved. That is, the image feature vectors in the two-dimensional image features are concatenated to obtain a feature sequence; based on the position of each image feature vector in the feature sequence, a corresponding position vector is determined for each image feature vector, where the position vector may be a vector with the same dimensionality as the image feature vector; then, according to each image feature vector and the corresponding position vector, for example, by adding each image feature vector to its corresponding position vector, the processed plurality of image feature vectors are obtained. Because a position vector can represent the position of a feature vector, the value of the resulting image feature vector changes significantly at the corresponding position, thereby achieving the goal of improving position sensitivity.
The foregoing processing of the image feature vectors can be understood as the following process: the total size of the feature vectors output by the channels is $\frac{h}{4} \times \frac{w}{4} \times c$, so they can be unfolded into a c-dimensional feature sequence including $k = \frac{h}{4} \times \frac{w}{4}$ c-dimensional feature vectors. Encoding can be performed based on the position of each feature vector in the feature sequence. For example, for the first feature vector in the feature sequence, a c-dimensional position vector $E_i$ of $(1, 0, 0, \ldots, 0)$ can be encoded for it; each feature vector $I_i$ is then added to the corresponding position vector $E_i$ to obtain a position-sensitive image feature vector, which may be denoted as $F_i$.
After obtaining the processed plurality of image feature vectors, the computer device may use the processed plurality of image feature vectors as the input of the first layer of the relation attention module to continue processes such as the weighted computation, so as to output the dependency feature vector of each image feature vector.
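A minimal sketch of this position processing, assuming one-hot position vectors like the $(1, 0, 0, \ldots, 0)$ example above (a learned or sinusoidal encoding could be used instead):

```python
import torch

def add_position_vectors(I: torch.Tensor) -> torch.Tensor:
    """Adds a c-dimensional one-hot position vector E_i to each feature
    vector I_i, yielding position-sensitive vectors F_i = I_i + E_i.
    Assumes k <= c so every position gets a distinct one-hot code."""
    batch, k, c = I.shape
    E = torch.eye(k, c).unsqueeze(0)  # (1, k, c); row i is one-hot at position i
    return I + E                      # broadcast over the batch

F = add_position_vectors(torch.randn(2, 480, 512))  # k = 480 <= c = 512
```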
In this method, the attention weights are computed according to the dependency feature vectors corresponding to the image feature vectors. Because a dependency feature vector reflects the image information corresponding to the image feature vector and the dependencies between the image feature vector and the other image feature vectors, the dependencies are taken into account in the attention weight computation, which improves the accuracy of the attention weight computation and further improves the character recognition efficiency.
403. The computer device inputs the dependency feature vectors of the plurality of image feature vectors output by the relation attention module into the parallel attention module of the character recognition model, computes the input feature vectors in parallel through the output nodes in the parallel attention module, and outputs a target quantity of attention weights.
Each output node computes the attention weights of the input dependency feature vectors in parallel. In step 403, the parallel attention module computes the input image feature vectors using the following formula (6) to output the target quantity of attention weights:

$$\alpha = \operatorname{softmax}\left(W_2 \tanh\left(W_1 O^{T}\right)\right) \quad (6)$$

where, for an output node, the input $O^{T}$ may be the output of the relation attention module.
404. The computer device obtains at least one attention feature according to the plurality of image feature vectors and the target quantity of attention weights.

$$G_i = \sum_{j=1}^{k} \alpha_{i,j} O_j \quad (7)$$

where $G_i$ is the attention feature output by the i-th output node. The attention feature may be the feature used to obtain, through decoding, the i-th character in the to-be-recognized image. Here, α can be understood as the importance of the plurality of image feature vectors to the i-th character, or as the degree of attention paid to each local part of the input image at the current moment; from an image processing perspective, it can also be understood as a mask. The attention feature, obtained as the weighted sum of the attention weights and the image feature vectors, can be understood as the result of the network's selective observation of the input image.
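As a short illustrative sketch, all n attention features can be obtained with a single batched matrix product over the attention weights and the input feature vectors, following formula (7); the shapes follow the earlier examples and are assumptions:

```python
import torch

# alpha: (batch, n, k) attention weights from the parallel attention module;
# O: (batch, k, c) input feature vectors (the dependency feature vectors in
# the FIG. 2b flow). Each glimpse is G_i = sum_j alpha_{i,j} * O_j, so one
# matrix product computes all n attention features at once.
alpha = torch.rand(2, 25, 480).softmax(dim=-1)
O = torch.randn(2, 480, 512)
G = alpha @ O   # (batch, n, c): one c-dimensional attention feature per character slot
```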
Thus, the at least one character is obtained by decoding the at least one attention feature.
The decoding may be performed in the manner of the following step 405.
405. The computer device inputs the at least one attention feature into the two-stage decoder in the character recognition model for decoding, and outputs the at least one character.
By decoding the at least one attention feature, the computer device can convert the attention features into characters, thereby implementing character recognition. In the embodiments of this application, to improve recognition accuracy, a two-stage decoder may be used to capture the interdependencies between the output nodes.
To capture the interdependencies between the output nodes, the embodiments of this application use a two-stage decoder to implement the function of the decoding module. Specifically, the at least one attention feature is input into the two-stage decoder of the character recognition model; for each attention feature, the dependency feature vector of the attention feature is obtained through the relation attention module in the two-stage decoder, the dependency feature vector corresponding to the attention feature is then decoded by the decoder, and the character with the highest probability among the characters obtained through decoding is output as the character corresponding to the attention feature.
In a possible implementation, the probability of each character can be computed in the following way:

$$P_i = \operatorname{softmax}\left(W G_i + b\right) \quad (8)$$

Certainly, the foregoing two-stage decoder can be obtained through training, where W is a weight value obtained through training, and b is a bias value obtained during training. During training, when the training samples are initialized, a character sequence whose length is less than n may be padded with '-' into a sequence of length n, and a sequence whose length is greater than n is truncated into a sequence of length n. '-' is a special character that can be used to indicate the end of the character sequence (end of sequence).
For the process of training the two-stage decoder, refer to the training architecture of the two-stage decoder shown in FIG. 5, where Ω is the character set. For an attention feature G of size n × c, decoding can be performed by the decoder in the first branch to obtain the probability matrix corresponding to the decoder of the first branch, where each element in the probability matrix represents the probability that the attention feature is a particular character in the character set. Meanwhile, the attention feature G can be processed by the relation attention module in the second branch to obtain the dependencies between this attention feature and the other attention features, which can be represented by dependency feature vectors of size n × c; the dependency feature vectors are then passed through the decoder of the second branch to obtain the probability matrix corresponding to that decoder. For an attention feature, the character with the highest probability in the probability matrix output by the first branch and the probability matrix output by the second branch is used as the character obtained by decoding the attention feature.
The two decoders can be optimized simultaneously during training, with the following optimization loss function:

$$L = -\left(\sum_{i=1}^{n} \log P_i^{(1)}\left(y_i\right) + \sum_{i=1}^{n} \log P_i^{(2)}\left(y_i\right)\right) \quad (9)$$

where y denotes the ground truth of the character string corresponding to a training sample, and P is the probability of a character, with $P^{(1)}$ and $P^{(2)}$ denoting the probabilities output by the decoders of the two branches. Through the foregoing training process, the weight and bias values can be obtained, so that in the application process, the second branch can be used as the decoding module of the character recognition model to finally output the decoded character sequence.
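For illustration only, jointly optimizing the two decoder branches with the loss above might be sketched as follows; the vocabulary size, the stand-in for the relation attention module, and all shapes are assumptions of the example:

```python
import torch
import torch.nn as nn

n, c, vocab_size = 25, 512, 37   # assumed: n character slots, c-dim glimpses, 37-symbol set
decoder1 = nn.Linear(c, vocab_size)                    # first branch: decode G directly
relation = nn.Sequential(nn.Linear(c, c), nn.GELU())   # stand-in for the relation attention module
decoder2 = nn.Linear(c, vocab_size)                    # second branch: decode dependency features

G = torch.randn(8, n, c)                   # attention features for a batch of 8 samples
y = torch.randint(0, vocab_size, (8, n))   # ground-truth labels, padded with '-' to length n

# P_i = softmax(W G_i + b) in each branch; CrossEntropyLoss applies log-softmax
# internally, so the two terms below are the sums of -log P(y_i) of formula (9).
ce = nn.CrossEntropyLoss()
loss = ce(decoder1(G).flatten(0, 1), y.flatten()) + \
       ce(decoder2(relation(G)).flatten(0, 1), y.flatten())
loss.backward()   # both branches are optimized simultaneously
```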
For the foregoing implementations, if high computational efficiency is not required, the attention features may also be obtained directly based on the image feature vectors through serial computation, which is not specifically limited in the embodiments of this application. In addition, the network structures, optimization methods, and the like to which the foregoing technical solutions are applicable include but are not limited to the structures and loss functions provided by the foregoing figures and formulas.
In the method provided in the embodiments of this application, the two-dimensional image features of the image are extracted, and the attention mechanism is used to determine the dependencies between the image feature vectors in the two-dimensional image features; the importance of the features is then determined through parallel computation of the attention weights, so that in the character recognition process, character recognition can be performed directly based on the two-dimensional image features and the importance of each feature vector in them. Because the foregoing processing based on two-dimensional image features retains the spatial information of the features, the accuracy of character recognition can be greatly improved. Moreover, the foregoing attention-based recognition can effectively recognize characters of arbitrary shapes through a simple procedure, avoiding recurrent computation and greatly improving computational efficiency.
FIG. 6a is a schematic structural diagram of a character recognition apparatus according to an embodiment of this application. Referring to FIG. 6a, the apparatus includes:
a feature extraction unit 601, configured to extract image features of a to-be-recognized image, the image features including a plurality of image feature vectors;
a parallel processing unit 603, configured to obtain a target quantity of attention weights through parallel computation based on the plurality of image feature vectors, where one attention weight is used for representing the importance of the plurality of image feature vectors to the character corresponding to that attention weight; and
a character obtaining unit 604, configured to obtain the at least one character according to the plurality of image feature vectors and the target quantity of attention weights.
In a possible implementation, referring to FIG. 6b, which is a schematic structural diagram of a character recognition apparatus according to an embodiment of this application, the apparatus further includes:
a dependency obtaining unit 603, configured to obtain a dependency feature vector of each image feature vector in the two-dimensional image features, the dependency feature vector being used for representing image information and the dependencies between an image feature vector and other image feature vectors; and
the parallel processing unit 604 is specifically configured to obtain the target quantity of attention weights through parallel computation based on the dependency feature vectors of the plurality of image feature vectors.
In a possible implementation, the feature extraction unit is configured to input the image into a convolutional neural network, perform feature extraction on the image through the channels of a backbone network in the convolutional neural network, and output the image features.
In a possible implementation, the dependency obtaining unit is configured to input the plurality of image feature vectors into the relation attention module of a character recognition model, compute, through the transformation units in each layer of the relation attention module, the similarities between an image feature vector and other image feature vectors in the attention mapping space to obtain the weights respectively corresponding to the image feature vector and the other image feature vectors, perform computation based on the obtained weights, and output the dependency feature vector of the image feature vector.
In a possible implementation, the inputs $Q_i^{l}$, $K_i^{l}$, and $V_i^{l}$ of a transformation unit in each layer of the relation attention module are expressed by the following formulas (1), (2), and (3), respectively:

$$Q_i^{l} = \begin{cases} F_i, & l = 1 \\ O_i^{l-1}, & l > 1 \end{cases} \quad (1)$$

$$K_i^{l} = \begin{cases} F, & l = 1 \\ O^{l-1}, & l > 1 \end{cases} \quad (2)$$

$$V_i^{l} = \begin{cases} F, & l = 1 \\ O^{l-1}, & l > 1 \end{cases} \quad (3)$$

where $l$ denotes the layer where the transformation unit is located, $i$ denotes which transformation unit of the layer it is, $Q_i^{l}$, $K_i^{l}$, and $V_i^{l}$ denote the inputs of the i-th transformation unit in the l-th layer, $F_i$ denotes the i-th image feature vector, $F$ is the set of the plurality of image feature vectors, and $O^{l-1}$ is the output of all transformation units in the previous layer.
In a possible implementation, the output of a transformation unit in each layer of the relation attention module is the weighted sum of its inputs, where the weights are expressed by the following formula (4):

$$\alpha_{i,j}^{l} = \frac{\exp\left(Q_i^{l} W \left(K_j^{l}\right)^{T}\right)}{\sum_{j'=1}^{k} \exp\left(Q_i^{l} W \left(K_{j'}^{l}\right)^{T}\right)} \quad (4)$$

where $W$ is a learned parameter, $\alpha_{i,j}^{l}$ denotes the weight of the j-th Key corresponding to the i-th transformation unit in the l-th layer, and the denominator of the formula normalizes over the k transformation units; and
the output of the transformation unit is expressed by the following formula (5):

$$O_i^{l} = \operatorname{Func}\left(\sum_{j=1}^{k} \alpha_{i,j}^{l} V_j^{l}\right) \quad (5)$$

where Func() is a nonlinear function, and $O_i^{l}$ is the output of the i-th transformation unit in the l-th layer.
In a possible implementation, the feature extraction unit is configured to: concatenate the image feature vectors in the two-dimensional image features to obtain a feature sequence; determine, based on the position of each image feature vector in the feature sequence, a corresponding position vector for each image feature vector; and obtain, according to each image feature vector and the corresponding position vector, the plurality of image feature vectors processed with the position vectors.
In a possible implementation, the parallel processing unit 603 is configured to input the dependency feature vectors of the plurality of image feature vectors into a parallel attention module, compute the input feature vectors in parallel through the target quantity of output nodes in the parallel attention module, and output the target quantity of attention weights.
In a possible implementation, the parallel attention module computes the input features using the following formula to output the target quantity of attention weights:

$$\alpha = \operatorname{softmax}\left(W_2 \tanh\left(W_1 O^{T}\right)\right) \quad (6)$$

where α represents the attention weights, tanh() is the hyperbolic tangent function, softmax() is the normalization function, $O^{T}$ is the input of the output node, and $W_1$ and $W_2$ are learned parameters.
In a possible implementation, the character obtaining unit includes:
a feature determining subunit, configured to obtain at least one attention feature according to the plurality of image feature vectors and the target quantity of attention weights; and
a decoding subunit, configured to decode the at least one attention feature to obtain the at least one character.
In a possible implementation, the decoding subunit is configured to input the at least one attention feature into the decoding module of the character recognition model, obtain, for each attention feature through the decoding module, the dependency feature vector of the attention feature, decode the dependency feature vector corresponding to the attention feature, and output, as the character corresponding to the attention feature, the character with the highest probability among the characters obtained through decoding.
The method provided in the embodiments of this application may be implemented on a computer device, and the computer device may be implemented as a server. For example, FIG. 7 is a schematic structural diagram of a server according to an embodiment of this application. The server 700 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPU) 701 and one or more memories 702, where the memory 702 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 701 to implement the character recognition method provided in each of the foregoing method embodiments. Certainly, the server may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions. Details are not described here.
The method provided in the embodiments of this application may be implemented on a computer device, and the computer device may be implemented as a terminal. For example, FIG. 8 is a schematic structural diagram of a terminal according to an embodiment of this application. The terminal 800 may be a portable mobile terminal, for example, a smartphone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a notebook computer, a desktop computer, a head-mounted device, or any other smart terminal. The terminal 800 may also be referred to as user equipment, a portable terminal, a laptop terminal, a desktop terminal, or another name.
Generally, the terminal 800 includes a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, for example, a 4-core processor or an 8-core processor. The processor 801 may be implemented in at least one hardware form of digital signal processing (DSP), a field-programmable gate array (FPGA), and a programmable logic array (PLA). The processor 801 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also called a central processing unit (CPU); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a graphics processing unit (GPU), and the GPU is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 801 may further include an artificial intelligence (AI) processor, and the AI processor is used for processing computing operations related to machine learning.
The memory 802 may include one or more computer-readable storage media, which may be non-transitory. The memory 802 may further include a high-speed random access memory and a nonvolatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 802 is used to store at least one instruction, and the at least one instruction is executed by the processor 801 to implement the character recognition method provided in the method embodiments of this application.
In some embodiments, the terminal 800 may optionally further include a peripheral device interface 803 and at least one peripheral device. The processor 801, the memory 802, and the peripheral device interface 803 may be connected through a bus or a signal line. Each peripheral device may be connected to the peripheral device interface 803 through a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 804, a touch display screen 805, a camera 806, an audio circuit 807, a positioning component 808, and a power supply 809.
The peripheral device interface 803 may be used to connect at least one input/output (I/O)-related peripheral device to the processor 801 and the memory 802. In some embodiments, the processor 801, the memory 802, and the peripheral device interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral device interface 803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 804 is used to receive and transmit radio frequency (RF) signals, also called electromagnetic signals. The radio frequency circuit 804 communicates with a communication network and other communication devices through electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 804 can communicate with other terminals through at least one wireless communication protocol. The wireless communication protocol includes but is not limited to a metropolitan area network, mobile communication networks of various generations (2G, 3G, 4G, and 5G), a wireless local area network, and/or a wireless fidelity (WiFi) network. In some embodiments, the radio frequency circuit 804 may further include a circuit related to near field communication (NFC), which is not limited in this application.
The display screen 805 is used to display a user interface (UI). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 805 is a touch display screen, the display screen 805 also has the capability to collect touch signals on or above the surface of the display screen 805. The touch signal may be input to the processor 801 as a control signal for processing. In this case, the display screen 805 may also be used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 805, disposed on the front panel of the terminal 800; in some other embodiments, there may be at least two display screens 805, respectively disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display screen 805 may be a flexible display screen, disposed on a curved surface or a folded surface of the terminal 800. The display screen 805 may even be set to a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 805 may be made of materials such as a liquid crystal display (LCD) and an organic light-emitting diode (OLED).
The camera assembly 806 is used to capture images or videos. Optionally, the camera assembly 806 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, to realize a background blur function through fusion of the main camera and the depth-of-field camera, panoramic shooting and virtual reality (VR) shooting functions through fusion of the main camera and the wide-angle camera, or other fusion shooting functions. In some embodiments, the camera assembly 806 may further include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. The dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, which can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 801 for processing or to the radio frequency circuit 804 to implement voice communication. For stereo collection or noise reduction purposes, there may be a plurality of microphones, respectively disposed at different parts of the terminal 800. The microphone may also be an array microphone or an omnidirectional collection microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 807 may further include a headphone jack.
The positioning component 808 is used to locate the current geographic location of the terminal 800 to implement navigation or location-based services (LBS). The positioning component 808 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 809 is used to supply power to the components in the terminal 800. The power supply 809 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 809 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charging technology.
In some embodiments, the terminal 800 further includes one or more sensors 810. The one or more sensors 810 include but are not limited to an acceleration sensor 811, a gyroscope sensor 812, a pressure sensor 813, a fingerprint sensor 814, an optical sensor 815, and a proximity sensor 816.
The acceleration sensor 811 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 can be used to detect the components of the gravitational acceleration on the three coordinate axes. The processor 801 may control the touch display screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for the collection of game or user motion data.
The gyroscope sensor 812 can detect the body direction and rotation angle of the terminal 800, and can cooperate with the acceleration sensor 811 to collect the user's 3D actions on the terminal 800. Based on the data collected by the gyroscope sensor 812, the processor 801 can implement the following functions: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 813 may be disposed on the side frame of the terminal 800 and/or the lower layer of the touch display screen 805. When the pressure sensor 813 is disposed on the side frame of the terminal 800, the user's holding signal on the terminal 800 can be detected, and the processor 801 performs left/right-hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed on the lower layer of the touch display screen 805, the processor 801 controls the operability controls on the UI according to the user's pressure operation on the touch display screen 805. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 814 is used to collect the user's fingerprint. The processor 801 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the user's identity according to the collected fingerprint. When the user's identity is recognized as a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 814 may be disposed on the front, back, or side of the terminal 800. When a physical button or a manufacturer logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the manufacturer logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch display screen 805 according to the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 805 is decreased. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 according to the ambient light intensity collected by the optical sensor 815.
The proximity sensor 816, also called a distance sensor, is usually disposed on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually decreases, the processor 801 controls the touch display screen 805 to switch from the bright-screen state to the off-screen state; when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually increases, the processor 801 controls the touch display screen 805 to switch from the off-screen state to the bright-screen state.
A person skilled in the art can understand that the structure shown in FIG. 8 does not constitute a limitation on the terminal 800, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
In an exemplary embodiment, a computer-readable storage medium is further provided, for example, a memory including instructions, where the instructions may be executed by a processor in a terminal or a server to complete the character recognition method in the foregoing embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
The embodiments of this application further provide a computer program product including instructions, which, when run on a computer, causes the computer to perform the foregoing method.
A person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of this application, and are not intended to limit this application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (18)

  1. A character recognition method, the method comprising:
    extracting image features of a to-be-recognized image, the image features comprising a plurality of image feature vectors;
    obtaining a target quantity of attention weights through parallel computation based on the plurality of image feature vectors, wherein one attention weight is used for representing the importance of the plurality of image feature vectors to a character corresponding to that attention weight; and
    obtaining the at least one character according to the plurality of image feature vectors and the target quantity of attention weights.
  2. The method according to claim 1, the method further comprising:
    obtaining a dependency feature vector of each image feature vector in the image features, the dependency feature vector being used for representing image information and dependencies between the image feature vector and other image feature vectors;
    wherein the obtaining a target quantity of attention weights through parallel computation based on the plurality of image feature vectors comprises:
    obtaining the target quantity of attention weights through parallel computation based on the dependency feature vectors of the plurality of image feature vectors.
  3. The method according to claim 1, wherein the extracting image features of a to-be-recognized image comprises:
    inputting the image into a convolutional neural network, performing feature extraction on the image through channels of a backbone network in the convolutional neural network, and outputting the image features.
  4. The method according to claim 3, wherein the backbone network in the convolutional neural network comprises the remaining structure of the convolutional neural network with a classification module removed.
  5. The method according to claim 2, wherein the obtaining a dependency feature vector of each image feature vector in the image features comprises:
    inputting the plurality of image feature vectors into a relation attention module of a character recognition model, computing, through transformation units in each layer of the relation attention module, similarities between the image feature vector and other image feature vectors in an attention mapping space to obtain weights respectively corresponding to the image feature vector and the other image feature vectors, performing computation based on the obtained weights, and outputting the dependency feature vector of the image feature vector.
  6. The method according to claim 5, wherein before the outputting the dependency feature vector of the image feature vector, the method further comprises:
    performing linear weighting based on the weights, and performing nonlinear processing on the feature vector obtained through the linear weighting to obtain the dependency feature vector of the image feature vector.
  7. The method according to any one of claims 1 to 6, wherein the image features are two-dimensional image features.
  8. The method according to claim 7, wherein before the obtaining a dependency feature vector of each image feature vector in the image features, the method further comprises:
    concatenating the image feature vectors in the image features to obtain a feature sequence;
    determining, based on the position of each image feature vector in the feature sequence, a corresponding position vector for each image feature vector; and
    obtaining, according to each image feature vector and the corresponding position vector, the plurality of image feature vectors processed with the position vectors.
  9. The method according to claim 2, wherein the obtaining the target quantity of attention weights through parallel computation based on the dependency feature vectors of the plurality of image feature vectors comprises:
    inputting the dependency feature vectors of the plurality of image feature vectors into a parallel attention module, computing the input feature vectors in parallel through a target quantity of output nodes in the parallel attention module, and outputting the target quantity of attention weights.
  10. The method according to claim 1, wherein the obtaining the at least one character according to the plurality of image feature vectors and the target quantity of attention weights comprises:
    obtaining at least one attention feature according to the plurality of image feature vectors and the target quantity of attention weights; and
    decoding the at least one attention feature to obtain the at least one character.
  11. The method according to claim 10, wherein the decoding the at least one attention feature to obtain the at least one character comprises:
    inputting the at least one attention feature into a decoding module of a character recognition model, obtaining, for each attention feature through the decoding module, a dependency feature vector corresponding to the attention feature, decoding the dependency feature vector corresponding to the attention feature, and outputting, as the character corresponding to the attention feature, the character with the highest probability among the characters obtained through decoding.
  12. A character recognition apparatus, the apparatus comprising:
    a feature extraction unit, configured to extract image features of a to-be-recognized image, the image features comprising a plurality of image feature vectors;
    a parallel processing unit, configured to obtain a target quantity of attention weights through parallel computation based on the plurality of image feature vectors, wherein one attention weight is used for representing the importance of the plurality of image feature vectors to a character corresponding to that attention weight; and
    a character obtaining unit, configured to obtain the at least one character according to the plurality of image feature vectors and the target quantity of attention weights.
  13. The apparatus according to claim 12, the apparatus further comprising:
    a dependency obtaining unit, configured to obtain a dependency feature vector of each image feature vector in the two-dimensional image features, the dependency feature vector being used for representing image information and dependencies between the image feature vector and other image feature vectors;
    wherein the parallel processing unit is specifically configured to obtain the target quantity of attention weights through parallel computation based on the dependency feature vectors of the plurality of image feature vectors.
  14. The apparatus according to claim 12, wherein the feature extraction unit is configured to input the image into a convolutional neural network, perform feature extraction on the image through channels of a backbone network in the convolutional neural network, and output the image features.
  15. The apparatus according to claim 12, wherein the dependency obtaining unit is configured to input the plurality of image feature vectors into a relation attention module of a character recognition model, compute, through transformation units in each layer of the relation attention module, similarities between the image feature vector and other image feature vectors in an attention mapping space to obtain weights respectively corresponding to the image feature vector and the other image feature vectors, perform computation based on the obtained weights, and output the dependency feature vector of the image feature vector.
  16. A computer device, comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the operations performed by the character recognition method according to any one of claims 1 to 11.
  17. A computer-readable storage medium, storing at least one instruction, the instruction being loaded and executed by a processor to implement the operations performed by the character recognition method according to any one of claims 1 to 11.
  18. A computer program product comprising instructions, which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 11.
PCT/CN2020/087010 2019-05-10 2020-04-26 Character recognition method and apparatus, computer device, and storage medium WO2020228519A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/476,327 US20220004794A1 (en) 2019-05-10 2021-09-15 Character recognition method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910387655.1 2019-05-10
CN201910387655.1A CN110097019B (zh) 2019-05-10 2019-05-10 Character recognition method and apparatus, computer device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/476,327 Continuation US20220004794A1 (en) 2019-05-10 2021-09-15 Character recognition method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020228519A1 true WO2020228519A1 (zh) 2020-11-19

Family

ID=67447583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087010 WO2020228519A1 (zh) 2019-05-10 2020-04-26 字符识别方法、装置、计算机设备以及存储介质

Country Status (3)

Country Link
US (1) US20220004794A1 (zh)
CN (1) CN110097019B (zh)
WO (1) WO2020228519A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966140A (zh) * 2021-03-10 2021-06-15 Beijing Baidu Netcom Science and Technology Co., Ltd. Field recognition method and apparatus, electronic device, storage medium, and program product

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097019B (zh) * 2019-05-10 2023-01-10 Tencent Technology (Shenzhen) Co., Ltd. Character recognition method and apparatus, computer device, and storage medium
CN110502612A (zh) * 2019-08-08 2019-11-26 Nanjing Yijie Software Technology Co., Ltd. Traffic information release security detection method based on intelligent blacklist recognition
CN110659640B (zh) * 2019-09-27 2021-11-30 Shenzhen SenseTime Technology Co., Ltd. Text sequence recognition method and apparatus, electronic device, and storage medium
CN111414962B (zh) * 2020-03-19 2023-06-23 AInnovation (Chongqing) Technology Co., Ltd. Image classification method introducing object relationships
CN111725801B (zh) * 2020-05-06 2022-05-24 National Computer Network and Information Security Administration Center Attention-mechanism-based method and system for identifying vulnerable nodes in a power distribution system
CN111899292A (zh) * 2020-06-15 2020-11-06 Beijing Sankuai Online Technology Co., Ltd. Text recognition method and apparatus, electronic device, and storage medium
CN111814796A (zh) * 2020-06-29 2020-10-23 Beijing SenseTime Technology Development Co., Ltd. Character sequence recognition method and apparatus, electronic device, and storage medium
CN112070079B (zh) * 2020-07-24 2022-07-05 South China University of Technology X-ray contraband package detection method and apparatus based on feature map re-weighting
CN112069841B (zh) * 2020-07-24 2022-07-05 South China University of Technology X-ray contraband package tracking method and apparatus
CN112148124A (zh) * 2020-09-10 2020-12-29 Vivo Mobile Communication Co., Ltd. Image processing method and apparatus, and electronic device
CN112488094A (zh) * 2020-12-18 2021-03-12 Beijing ByteDance Network Technology Co., Ltd. Optical character recognition method and apparatus, and electronic device
CN112801103B (zh) * 2021-01-19 2024-02-27 NetEase (Hangzhou) Network Co., Ltd. Text orientation recognition method and apparatus, and text orientation recognition model training method and apparatus
CN114663896B (zh) * 2022-05-17 2022-08-23 Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co., Ltd. Document information extraction method, apparatus, device, and medium based on image processing
CN115546901B (zh) * 2022-11-29 2023-02-17 City Cloud Technology (China) Co., Ltd. Object detection model and method for pet behavior compliance detection
CN116630979B (zh) * 2023-04-10 2024-04-30 Xiongan Innovation Research Institute OCR recognition method and system, storage medium, and edge device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944174A (zh) * 2009-07-08 2011-01-12 Xidian University License plate character recognition method
CN109389091A (zh) * 2018-10-22 2019-02-26 Chongqing University of Posts and Telecommunications Text recognition system and method combining neural network and attention mechanism
CN109492679A (zh) * 2018-10-24 2019-03-19 Hangzhou Dianzi University Text recognition method based on attention mechanism and connectionist temporal classification loss
CN110097019A (zh) * 2019-05-10 2019-08-06 Tencent Technology (Shenzhen) Co., Ltd. Character recognition method and apparatus, computer device, and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354168B2 (en) * 2016-04-11 2019-07-16 A2Ia S.A.S. Systems and methods for recognizing characters in digitized documents
CN107018410B (zh) * 2017-05-10 2019-02-15 Beijing Institute of Technology No-reference image quality assessment method based on pre-attention mechanism and spatial dependency
CN109658455B (zh) * 2017-10-11 2023-04-18 Alibaba Group Holding Ltd. Image processing method and processing device
CN108875722A (zh) * 2017-12-27 2018-11-23 Beijing Megvii Technology Co., Ltd. Character recognition and recognition model training method, apparatus, system, and storage medium
CN108615036B (zh) * 2018-05-09 2021-10-01 University of Science and Technology of China Natural scene text recognition method based on convolutional attention network
CN109447115A (zh) * 2018-09-25 2019-03-08 Tianjin University Fine-grained zero-shot classification method based on multi-layer semantically supervised attention model
CN109543667B (zh) * 2018-11-14 2023-05-23 Beijing University of Technology Attention-mechanism-based text recognition method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944174A (zh) * 2009-07-08 2011-01-12 Xidian University License plate character recognition method
CN109389091A (zh) * 2018-10-22 2019-02-26 Chongqing University of Posts and Telecommunications Text recognition system and method combining neural network and attention mechanism
CN109492679A (zh) * 2018-10-24 2019-03-19 Hangzhou Dianzi University Text recognition method based on attention mechanism and connectionist temporal classification loss
CN110097019A (zh) * 2019-05-10 2019-08-06 Tencent Technology (Shenzhen) Co., Ltd. Character recognition method and apparatus, computer device, and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966140A (zh) * 2021-03-10 2021-06-15 Beijing Baidu Netcom Science and Technology Co., Ltd. Field recognition method and apparatus, electronic device, storage medium, and program product
CN112966140B (zh) * 2021-03-10 2023-08-08 Beijing Baidu Netcom Science and Technology Co., Ltd. Field recognition method and apparatus, electronic device, storage medium, and program product

Also Published As

Publication number Publication date
CN110097019B (zh) 2023-01-10
CN110097019A (zh) 2019-08-06
US20220004794A1 (en) 2022-01-06


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20806016

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20806016

Country of ref document: EP

Kind code of ref document: A1