WO2022257578A1 - Method and apparatus for recognizing text - Google Patents
Method and apparatus for recognizing text
- Publication number: WO2022257578A1 (PCT application PCT/CN2022/085317)
- Authority: WO (WIPO (PCT))
Classifications
- G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06F40/30: Handling natural language data; semantic analysis
- G06F40/40: Handling natural language data; processing or translation of natural language
- G06N3/04: Neural networks; architecture, e.g. interconnection topology
- G06N3/08: Neural networks; learning methods
- G06V10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/74: Image or video pattern matching; proximity measures in feature spaces
- G06V10/764: Recognition or understanding using classification, e.g. of video objects
- G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V20/62: Scene-specific elements; text, e.g. of license plates, overlay texts or captions on TV images
- G06V30/148: Character recognition; segmentation of character regions
- G06V30/153: Segmentation of character regions using recognition of characters or words
- G06V30/19: Character recognition; recognition using electronic means
- G06V30/19127: Extracting features by transforming the feature space, e.g. multidimensional scaling; mappings, e.g. subspace methods
- G06V30/1918: Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
Definitions
- the embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and device for recognizing text.
- OCR: Optical Character Recognition
- STR: Scene Text Recognition
- the image background in an OCR recognition scene is simple, the text is arranged neatly, and the fonts are standard, while the image background in an STR recognition scene is more complex, the text is arranged randomly, and the fonts are diverse. Therefore, STR is far more difficult than OCR.
- STR has important utility in many domains, such as aided navigation for the visually impaired, autonomous driving applications, and text reading and translation in augmented reality, and has attracted increasing attention in the computer vision community.
- the current STR recognition method usually first locates the text region from the image, and then recognizes the text in the text region.
- Embodiments of the present disclosure propose methods and apparatuses for recognizing text.
- embodiments of the present disclosure provide a method for recognizing text, the method comprising: acquiring a feature map, wherein the feature map is obtained by performing text instance segmentation on an image presenting the text to be recognized; constructing a relational graph based on the feature map, wherein the nodes in the relational graph represent the pixels in the feature map, an edge represents that the similarity of the spatial semantic features of the two connected nodes is greater than a target threshold, and the spatial semantic features include the positional features and category features of the pixels indicated by the nodes; using a pre-trained graph convolutional network to process the relational graph to obtain the first text feature corresponding to the image; and generating the text recognition result of the image according to the first text feature.
- an embodiment of the present disclosure provides an apparatus for recognizing text, the apparatus including: a feature map acquisition unit configured to acquire a feature map, wherein the feature map is obtained by performing text instance segmentation on an image presenting the text to be recognized; a relational graph construction unit configured to construct a relational graph according to the feature map, wherein the nodes in the relational graph represent the pixels in the feature map, an edge represents that the similarity of the spatial semantic features of the two connected nodes is greater than a target threshold, and the spatial semantic features include the position features and category features of the pixels indicated by the nodes; a graph convolution processing unit configured to use a pre-trained graph convolutional network to process the relational graph to obtain the first text feature corresponding to the image; and a recognition unit configured to generate a text recognition result of the image according to the first text feature.
- the embodiments of the present disclosure provide an electronic device, which includes: one or more processors; and a storage device for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation manner of the first aspect.
- embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, and when the computer program is executed by a processor, the method described in any implementation manner in the first aspect is implemented.
- the embodiments of the present disclosure provide a computer program product including a computer program, when the computer program is executed by a processor, the method described in any implementation manner in the first aspect can be implemented.
- FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
- FIG. 2 is a flowchart of one embodiment of a method for recognizing text according to the present disclosure
- FIG. 3 is a flowchart of an embodiment of a method for generating a feature map in a method for recognizing text according to the present disclosure
- FIG. 4 is a flowchart of another embodiment of a method for recognizing text according to the present disclosure.
- FIG. 5a, FIG. 5b and FIG. 5c are schematic diagrams of an exemplary application scenario of the method for recognizing text according to the present disclosure;
- FIG. 6 is a flowchart of an embodiment of a training method of a graph convolutional network, a language model, and a segmentation network in a method for recognizing text according to the present disclosure
- Fig. 7 is a schematic structural diagram of an embodiment of a device for recognizing text according to the present disclosure.
- FIG. 8 is a schematic structural diagram of an electronic device suitable for implementing an embodiment of the present disclosure.
- FIG. 1 shows an exemplary architecture 100 to which embodiments of the method for recognizing text or the apparatus for recognizing text of the present disclosure can be applied.
- a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
- the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
- Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
- the terminal devices 101, 102, 103 interact with the server 105 via the network 104 to receive or send messages and the like.
- Various client applications may be installed on the terminal devices 101, 102, 103. For example, browser applications, search applications, image processing applications, deep learning frameworks, etc.
- the terminal devices 101, 102, and 103 may be hardware or software.
- When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, e-book readers, laptop computers, desktop computers and so on.
- When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple software or software modules (for example, multiple software or software modules for providing distributed services) or as a single software or software module. No specific limitation is made here.
- the server 105 may be a server that provides various services, for example, a server that provides backend support for client applications installed on the terminal devices 101 , 102 , 103 .
- the server 105 can be hardware or software.
- When the server 105 is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
- When the server 105 is software, it can be implemented as multiple software or software modules (for example, multiple software or software modules for providing distributed services), or as a single software or software module. No specific limitation is made here.
- the method for recognizing text provided by the embodiments of the present disclosure is generally executed by the server 105 , and correspondingly, the device for recognizing text is generally disposed in the server 105 .
- image processing applications may also be installed in terminal devices 101, 102, and 103, and terminal devices 101, 102, and 103 may also process images presenting text to be recognized based on image processing applications.
- the method for recognizing text can also be executed by the terminal devices 101 , 102 , 103 , and correspondingly, the means for recognizing text can also be set in the terminal devices 101 , 102 , 103 .
- in the case where the method is executed by the terminal devices, the server 105 and the network 104 may not exist in the exemplary system architecture 100.
- the numbers of terminal devices, networks and servers in FIG. 1 are only illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
- the method and device for recognizing text first obtain the feature map obtained by performing text instance segmentation on the image presenting the text to be recognized, then use the pixels in the feature map as nodes and establish edges according to the similarity of the spatial semantic features of the nodes, so as to construct the relationship graph corresponding to the feature map, then use the graph convolutional network to process the relationship graph to extract the first text feature of the text to be recognized in the image, and finally use the first text feature to generate the text recognition result of the image.
- This graph-based text recognition method can take into account the two-dimensional spatial information of the text in the image, avoiding directly compressing the text features in the image into one-dimensional features and ignoring the two-dimensional spatial information, which helps to improve the text recognition effect.
- FIG. 2 shows a flow 200 of an embodiment of the method for recognizing text according to the present disclosure.
- the method for recognizing text includes the following steps:
- Step 201 obtain a feature map.
- the feature map can be obtained by segmenting the text instance of the image presenting the text to be recognized.
- the text to be recognized may be text of various contents.
- the text to be recognized may include one or more characters (such as letters, numbers, special symbols, Chinese characters, etc.).
- the image presenting the text to be recognized may be various types of images.
- the quality of the image presenting the text to be recognized may be different, and various attributes such as the position and writing style of the text to be recognized presented by the image may be different.
- Text instance segmentation may refer to detecting each character included in the text to be recognized from an image and distinguishing each character.
- the feature maps (MASK) obtained after text instance segmentation may correspond to the characters included in the text to be recognized.
- the number of feature maps can be flexibly set according to actual application scenarios. For example, if the text to be recognized in the image includes only one character, the number of feature maps may be one.
- the number of feature maps is at least two.
- the number of feature maps may be specified in advance by a skilled person.
- the number of characters included in the text to be recognized in the image can be estimated according to the actual application scenario, and the number of feature maps set can be greater than the estimated number of characters to avoid situations such as missing recognition.
- the above-mentioned execution subject or other electronic devices may use various existing instance segmentation methods to perform text instance segmentation on the image presenting the text to be recognized, so as to obtain the feature map.
- the execution subject of the method for recognizing text may pre-store locally the feature map obtained by performing text instance segmentation on the image presenting the text to be recognized; in this case, the execution subject may obtain the feature map directly from the local storage.
- the execution subject can also acquire the feature map from other storage devices (such as a connected database, a third-party data platform, or the terminal devices 101, 102, 103 shown in FIG. 1).
- Step 202 constructing a relationship graph according to the feature graph.
- each pixel in the feature map can be used as a node, and an edge can be constructed according to the similarity between the spatial semantic features of the pixels, so as to obtain a relationship graph constructed based on the feature map.
- the spatial semantic feature of a pixel may include a position feature and a category feature of a pixel.
- the position feature of the pixel can be used to represent the position of the pixel in the feature map.
- the category feature of the pixel can be used to indicate the text category to which the pixel belongs.
- Text categories can be pre-set according to actual application scenarios. For example, if the text to be recognized is a number, 11 types of text can be pre-classified to represent 0-9 and the background respectively.
- the position feature of a pixel point can be characterized by the coordinates (such as abscissa and ordinate) of the pixel point in the feature map.
- the category feature of the pixel point can be represented by a vector, and the vector can represent the probability that the pixel point belongs to each preset text category.
- the target threshold can be preset by technicians, and can also be flexibly determined during the establishment of the relationship diagram. For example, the target threshold can be determined according to the spatial semantic similarity between each pixel.
- the similarity of spatial semantic features can be flexibly determined by various methods. For example, the similarity of the positional features and the similarity of the category features of two pixels can be calculated separately, and the two similarities can then be weighted and summed to obtain the similarity of the spatial semantic features of the two pixels.
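- As an illustration, a minimal sketch of this weighted-sum similarity is given below (Python; the weights w_pos and w_cat and the reciprocal-distance form of the positional similarity are assumptions for illustration, not values fixed by the present disclosure):

```python
import numpy as np

def spatial_semantic_similarity(pos_a, pos_b, cat_a, cat_b, w_pos=0.5, w_cat=0.5):
    """pos_*: (x, y) pixel coordinates; cat_*: category probability vectors."""
    # Positional similarity: inversely related to the Euclidean distance.
    pos_sim = 1.0 / (1.0 + np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b)))
    # Category similarity: cosine similarity of the category probability vectors.
    cat_a, cat_b = np.asarray(cat_a, float), np.asarray(cat_b, float)
    cat_sim = cat_a @ cat_b / (np.linalg.norm(cat_a) * np.linalg.norm(cat_b) + 1e-8)
    # Weighted sum of the two similarities, as described above.
    return w_pos * pos_sim + w_cat * cat_sim
```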
- a corresponding relationship subgraph may be constructed for each feature map, and then the relationship subgraphs corresponding to each feature map may be combined to obtain a final relationship graph.
- the merging of the relationship graphs can be realized by sequentially connecting each relationship sub-graph according to the corresponding positional relationship of each feature map in the image.
- a node may be selected from the two relational subgraphs for connection to realize the connection of the two relational subgraphs.
- the method of selecting nodes from the relational subgraph can be flexibly set, such as selecting the root node, etc.
- Step 203 using the pre-trained graph convolutional network to process the relationship graph to obtain the first text feature corresponding to the image.
- Graph Convolutional Networks can generally be regarded as a first-order Chebyshev polynomial approximation of a spectral convolution operation based on the graph Laplacian matrix. From the perspective of spectral graph convolution, a graph convolutional network can be seen as a special form of graph Laplacian smoothing.
- the convolution operation of the graph convolutional network can be regarded as transforming the feature information of each node and sending it to the neighbor nodes of the node, and then fusing the feature information of the neighbor nodes to update the feature information of each node.
- the updated feature information of each node can be used to generate the first text feature corresponding to the image using various methods.
- the first text feature can be used to characterize the feature of the text to be recognized in the image.
- the feature information of each node may be averaged or maximized, and then the processing result may be used as the first text feature.
- the updated feature information of the node corresponding to the target pixel in the feature map may be selected as the first text feature.
- the target pixel can be flexibly set.
- the target pixel point may be pre-specified by a technician, for example, the target pixel point may be the geometric center point of the feature map. As yet another example, the target pixel point may be a pixel point whose similarity between the corresponding node and each neighboring node is greater than a preset threshold.
- step 204 a text recognition result of the image is generated according to the first text feature.
- various existing text recognition methods can be used to generate the text recognition result corresponding to the first text feature, that is, the text recognition result corresponding to the image presenting the text to be recognized.
- the feature maps obtained by performing text instance segmentation on the image presenting the text to be recognized can represent the image features of the image regions where the characters of the text to be recognized are located, as well as the sequence features relative to the other feature maps.
- the sequence feature can represent the sequential relationship between the feature maps, so that in the subsequent recognition process the context of each character can be recognized more accurately, which helps to improve the accuracy of the recognition result.
- FIG. 3 shows a flowchart 300 of an embodiment of a method for generating a feature map in a method for recognizing text according to an embodiment of the present disclosure.
- the feature map obtained by segmenting the text instance of the image presenting the text to be recognized can be generated by the following steps:
- Step 301 input the image into the pre-trained convolutional neural network to obtain an initial feature map.
- the convolutional neural network can be used to perform a convolution operation on the image to extract various features of the image (such as texture features, color features, etc.) to obtain an initial feature map.
- the convolutional neural network can be implemented based on a feature pyramid network (FPN) and a residual network (ResNet); the stride of at least one convolutional layer before the output layer of the residual network can be set to 1, and the feature map output by the residual network can be processed by deformable convolution to generate the input feature map of the feature pyramid network.
- ResNet50 generally processes the input image through 5 stages to obtain the output feature map. Assuming the feature maps output by the 5 stages are S1-S5 respectively, the convolution stride of stage 4 and stage 5 can be set to 1, so that S4 and S5 retain more low-level image information such as text textures and text boundaries.
- deformable convolution processing can be performed on the feature map S5, and the feature map after the deformable convolution processing is input into the FPN, and the final feature map output by the FPN is used as the initial feature map.
- deformable convolution can make the convolutional neural network better adapt to the irregular boundary of the text to be processed, thereby improving the accuracy of subsequent text recognition.
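- As an illustration, the following sketch shows the idea of applying a deformable convolution to the deepest residual-stage feature map before feeding it into the FPN (PyTorch/torchvision is an assumed framework choice, and the channel and spatial sizes are illustrative, not taken from the present disclosure):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableNeck(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        # A plain conv predicts the sampling offsets (2 per kernel position).
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, s5):
        # s5: deepest residual-stage feature map (stride kept at 1 in stages 4-5).
        return self.deform(s5, self.offset(s5))

neck = DeformableNeck()
s5 = torch.randn(1, 2048, 32, 32)
fpn_input = neck(s5)  # fed into the feature pyramid network
```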
- Step 302 performing text instance segmentation on the initial feature map to obtain an instance feature map.
- various existing instance segmentation methods can be used to perform text instance segmentation on the initial feature map to obtain the instance feature map.
- a network built based on a PPM (Physical Pooling Module) can be used to perform text instance segmentation on the initial feature map.
- parallel 1*1, 3*3, and 5*5 convolutional layers can be used to extract features from the initial feature map, the features extracted by each convolutional layer can then be concatenated, and a 1*1 convolutional layer can be used to transform the dimensionality for subsequent processing.
- multiple stacked convolutional layers can then be used to further perform feature conversion based on a spatial attention mechanism to obtain spatially enhanced features.
- specifically, 3*3 and 1*1 convolutional layers can further be used to obtain the spatial attention feature map, and text instance segmentation is then performed based on the obtained spatially enhanced features.
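- A minimal sketch of the parallel 1*1/3*3/5*5 feature extraction and a simple spatial-attention step is given below (channel sizes are assumptions; the exact attention design of the present disclosure may differ):

```python
import torch
import torch.nn as nn

class ParallelConvBlock(nn.Module):
    def __init__(self, in_ch=256, mid_ch=64):
        super().__init__()
        # Parallel 1x1, 3x3 and 5x5 convolutional branches.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, k, padding=k // 2) for k in (1, 3, 5)]
        )
        # 1x1 conv transforms the dimensionality after concatenation.
        self.project = nn.Conv2d(3 * mid_ch, in_ch, 1)
        # 3x3 then 1x1 convs produce a single-channel spatial attention map.
        self.attn = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.project(torch.cat([b(x) for b in self.branches], dim=1))
        return x * self.attn(x)  # spatially enhanced features
```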
- Step 303 perform sequential text segmentation on the initial feature map to obtain a sequential feature map.
- various existing text sequence segmentation methods can be used to process the initial feature map to obtain a sequence feature map.
- the initial feature map can be input to a network built based on the convolutional encoder-decoder structure to perform simple convolutional downsampling and deconvolutional upsampling on the initial feature map to obtain a sequential feature map.
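- A minimal sketch of such a convolutional encoder-decoder is shown below (depths and channel counts are illustrative assumptions):

```python
import torch.nn as nn

# Simple convolutional downsampling followed by deconvolutional upsampling.
seq_net = nn.Sequential(
    nn.Conv2d(256, 128, 3, stride=2, padding=1), nn.ReLU(),           # downsample
    nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),           # downsample
    nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1), nn.ReLU(),  # upsample
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),              # sequential feature map
)
```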
- Step 304 fusing the instance feature map and the sequential feature map to obtain a feature map obtained by segmenting the text instance of the image presenting the text to be recognized.
- the obtained instance feature map and sequential feature map may be fused to obtain a feature map obtained by segmenting the text instance of the image presenting the text to be recognized.
- the instance feature map may be fused with the sequential feature map corresponding to each text instance to obtain at least two fused feature maps.
- various feature fusion methods can be used to fuse instance feature maps and sequential feature maps.
- the fusion of the instance feature map and the sequential feature map can be obtained by multiplying the corresponding pixels in the instance feature map and the sequential feature map.
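- As an illustration, a sketch of this pixel-wise fusion is given below (the number of sequential maps and the tensor shapes are assumptions):

```python
import torch

instance_map = torch.rand(1, 1, 64, 64)   # one text-instance mask
order_maps = torch.rand(20, 1, 64, 64)    # e.g. 20 preset sequential feature maps
fused_maps = instance_map * order_maps    # broadcasts to 20 fused feature maps
```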
- the spatial semantic features of the nodes in the relationship graph constructed based on the feature map may also include the order feature of the feature map where the pixel indicated by the node is located.
- the spatial semantic features of the node can be generated by the following steps:
- Step 1 Obtain the sequence value of the feature map corresponding to the node in each feature map.
- the order value can represent the sorted position of the feature map in each feature map. For example, if 20 feature maps are preset, the sort number (one of 1-20) of each feature map can be used as the sequence value of the feature map.
- sequence values corresponding to each pixel in the same feature map are the same.
- Step 2 According to the sequence value of the feature map corresponding to the node, the sequence feature of the feature map corresponding to the node is determined.
- the sequential features of the feature map can be represented by vectors; therefore, the ordinal feature corresponding to each feature map can be obtained by mapping each ordinal value into the same vector space, and various existing mapping methods may be used for this mapping.
- a vector representation of sequential values can be performed using sine and cosine functions of different wavelengths.
- the vector representation of each sequence value can be obtained using the following formula:
- PE(z)_2i = sin( z / 10000^(2i/C) ), PE(z)_2i+1 = cos( z / 10000^(2i/C) )
- where z represents an ordinal value, C represents the vector dimension (which can be preset by a technician), i represents the serial number of the element in the vector, the subscript 2i denotes an even-numbered element of the vector, and 2i+1 denotes an odd-numbered element of the vector.
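- The following sketch realizes this sine/cosine order encoding (the 10000 base follows the standard Transformer positional encoding and is an assumption here; C is taken to be even):

```python
import numpy as np

def order_feature(z, C=64):
    """Vector representation of the ordinal value z with dimension C (even)."""
    i = np.arange(C // 2)
    freq = 1.0 / (10000.0 ** (2 * i / C))
    feat = np.empty(C)
    feat[0::2] = np.sin(z * freq)  # even-numbered elements
    feat[1::2] = np.cos(z * freq)  # odd-numbered elements
    return feat
```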
- Step 3 splicing the position feature, category feature and determined order feature of the pixel indicated by the node to obtain the spatial semantic feature of the node.
- the position feature, category feature and determined sequence feature of the pixel point indicated by the node can be spliced sequentially to obtain the spatial semantic feature of the node.
- the position feature and category feature of the pixel point can be represented by vector.
- the position feature and category feature of the pixel can first be vectorized, and the vector representations of the node's position feature and category feature can then be concatenated with the vector representation of the node's order feature to obtain the spatial semantic features of the node.
- as an example, assume the abscissa of the node is X, the ordinate is Y, and the category feature is P; X, Y and P can be mapped to vector representations, and the vectors corresponding to X, Y and P can then be concatenated to form the vector representation corresponding to the position feature and category feature of the node.
- the dimensions of the vector representations can be kept consistent.
- 1*1 convolution can be used to flexibly adjust the dimensionality of each vector representation.
- various normalization methods can be used to normalize the vector representations corresponding to the position features and category features of the nodes, so as to reduce the magnitudes of the vector elements involved in subsequent calculations.
- normalization may be achieved by subtracting, from the vector representation of each node, the vector representation corresponding to the position feature and category feature of the target node in the corresponding feature map. It can be understood that after normalization the vector representation corresponding to the position feature and category feature of the target node itself is 0.
- the target node can be pre-established by technicians, and can also be flexibly set during the calculation process.
- the target node can be the root node of the feature map.
- the geometric center point of the feature map can be set as the root node.
- the root node of the feature map can be determined by the following steps: for a node in the feature map, determine the intersection-over-union ratios between the node and the other nodes, and in response to determining that none of the intersection-over-union ratios corresponding to the node is greater than a preset threshold, determine the node as the root node.
- the intersection-over-union ratio may represent the ratio of the number of elements in the intersection of the neighbor-node sets of two nodes to the number of elements in the union of their neighbor-node sets.
- the preset threshold can be preset by technicians. Through this method, the pixel at the center of each text instance can be effectively selected as the root node, and the distribution of nodes in the relational subgraph corresponding to the feature map can be balanced.
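- A minimal sketch of this root-node selection is shown below (the threshold value is an assumption; `neighbors` maps each node to the set of its neighbor nodes):

```python
def neighbor_iou(neigh_a, neigh_b):
    """Intersection-over-union of two neighbor sets."""
    union = len(neigh_a | neigh_b)
    return len(neigh_a & neigh_b) / union if union else 0.0

def find_root(neighbors, threshold=0.5):
    """Return a node whose neighbor-set IoU with every other node stays small."""
    for node, neigh in neighbors.items():
        ious = (neighbor_iou(neigh, n2) for k, n2 in neighbors.items() if k != node)
        if all(iou <= threshold for iou in ious):
            return node
    return None
```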
- the similarity of spatial semantic features may be determined through the following steps:
- Step 1 Determine the Euclidean distance between two nodes, and determine the first similarity according to the determined Euclidean distance.
- the Euclidean distance between two nodes is generally inversely proportional to the similarity of the spatial semantic features of the nodes.
- the first similarity can be determined using various methods. As an example, the first similarity can be determined using the following formula:
- E_s(p,q) = 1 - D(p,q) / √(H_m² + W_m²)
- where D(p,q) represents the Euclidean distance between nodes p and q, H_m and W_m denote the height and width of the feature map respectively, and E_s(p,q) represents the first similarity.
- Step 2 Determine the cosine similarity of the category features corresponding to the two nodes.
- the existing cosine similarity calculation method can be used to calculate the cosine similarity by using the vector representations of the category features corresponding to the two nodes.
- Step 3 according to the first similarity and the cosine similarity, determine the similarity of the spatial semantic features of the two nodes.
- the spatial semantic feature similarity can generally be directly proportional to the first similarity, and also proportional to the cosine similarity. Based on this, various methods can be used to determine the similarity of the spatial semantic features of two nodes. For example, the product of the first similarity and the cosine similarity can be directly calculated as the similarity of the spatial semantic features of two nodes. As a result, the similarity between nodes in terms of spatial location, category, etc. can be fully considered, which helps to build a more accurate relationship graph.
- for a given node, the similarity of the spatial semantic features between the node and each of the other nodes can be calculated, and a target number of nodes can then be selected as the neighbor nodes of the node in descending order of similarity, that is, edges are established between the selected nodes and the node.
- the target number can be flexibly set according to specific application scenarios.
- the target number can be 8.
- the complexity and accuracy of the constructed relationship graph can be flexibly controlled to assist subsequent calculations.
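- Combining the similarity measure described above with the neighbor selection, a sketch of edge construction is given below (k = 8 follows the example above; array shapes are assumptions):

```python
import numpy as np

def build_adjacency(pos, cat, h, w, k=8):
    """pos: (N, 2) node coordinates; cat: (N, C) category vectors; h, w: map size."""
    n = len(pos)
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    e_s = 1.0 - d / np.sqrt(h ** 2 + w ** 2)            # first similarity
    cat_n = cat / (np.linalg.norm(cat, axis=1, keepdims=True) + 1e-8)
    sim = e_s * (cat_n @ cat_n.T)                       # combined similarity
    adj = np.zeros((n, n))
    for i in range(n):                                  # keep the k most similar
        nearest = [j for j in np.argsort(-sim[i]) if j != i][:k]
        adj[i, nearest] = adj[nearest, i] = 1           # undirected edges
    return adj
```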
- the foregoing graph convolutional network may include a first graph convolutional network and an output network.
- the first graph convolutional network can be used to transform the feature matrix of the constructed relationship graph.
- the output network can be used to select nodes from each relational subgraph according to the output of the first graph convolutional network, and aggregate the converted features corresponding to the nodes respectively selected from each relational subgraph to obtain the first text feature.
- the characteristic matrix and adjacency matrix of the relational graph can be used to represent the relational subgraph.
- the elements in the feature matrix are used to represent the features of the nodes in the relationship subgraph.
- the adjacency matrix is used to represent the connection relationship between each node in the relational subgraph (such as whether there is an edge, etc.).
- the first graph convolutional network can adopt various existing convolutional network structures to realize the transformation of the feature matrix of the relational graph.
- the first graph convolutional network may include a first graph convolutional subnetwork and a second graph convolutional subnetwork.
- the first graph convolutional sub-network can be used to convert the feature matrix of the relationship graph.
- the second graph convolutional subnetwork may be used to transform the feature matrix output by the first graph convolutional subnetwork according to the relational subgraph constructed based on the output of the first graph convolutional subnetwork.
- after the first graph convolutional sub-network converts the feature matrix of the relationship graph, the similarity between nodes can be recalculated and edges re-established according to the features of each node in the converted feature matrix, that is, the relational subgraphs are updated.
- the second graph convolutional subnetwork can process the updated relational subgraph.
- the structures of the first graph convolutional subnetwork and the second graph convolutional subnetwork can be flexibly set by technicians according to actual application requirements.
- as an example, the transformation performed by each layer of the first graph convolutional sub-network can be expressed as:
- Ã = A + I_N, D̃_ii = Σ_j Ã_ij, Y_l = σ( D̃^(-1/2) Ã D̃^(-1/2) X_l W_l )
- where l is the serial number of the convolutional layer of the first graph convolutional sub-network, Y_l represents the output of layer l, X_l represents the input of layer l, W_l is the network parameter learned by the first graph convolutional sub-network, A represents the adjacency matrix of the relationship graph, I_N represents a matrix of the same size as A whose main diagonal elements are all 1, D̃ is a diagonal matrix, i and j represent the serial numbers of the row and column respectively, σ denotes a non-linear activation function, and [·;·] represents matrices concatenated by dimension.
- the second graph convolutional sub-network applies a layer update of the same form on the updated relational subgraphs, where l is the serial number of the convolutional layer of the second graph convolutional sub-network and Y_l correspondingly indicates the output of its layer l.
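- A minimal sketch of this layer update is given below (PyTorch is an assumed framework choice; ReLU stands in for the unspecified non-linear activation σ):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One layer computing Y_l = relu(D^-1/2 (A + I) D^-1/2 X_l W_l)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W_l

    def forward(self, x, adj):
        a_hat = adj + torch.eye(adj.size(0))                  # A + I_N
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)               # diagonal of D^-1/2
        norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return torch.relu(norm @ self.weight(x))
```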
- the output network can recalculate the distances between nodes according to the updated feature information of each node in the relationship graph represented by the output of the first graph convolutional network, and then discard some of the nodes in ascending order of distance (for example, discarding half of the nodes), thereby realizing a pooling operation and reducing the size of the corresponding feature map.
- the process of updating features and filtering nodes of the first graph convolutional network and output network may be iteratively performed until only one node remains in each relational subgraph. Then, according to the order of the relational subgraphs, the feature information of the nodes filtered out in each relational subgraph can be concatenated sequentially to form the first text feature.
- in this way, the node that best represents the corresponding text instance in terms of spatial relationships can be screened out from each relational subgraph, and using the feature information of these nodes for subsequent text recognition can help improve the efficiency and accuracy of text recognition.
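- A conceptual sketch of this iterative update-and-pool loop is given below (the distance scoring and the "drop half" rule follow the example above; `gcn_layer` can be a layer like the GCNLayer sketch earlier, and `distance_fn` is a hypothetical per-node scoring function):

```python
def pool_subgraph(features, adj, gcn_layer, distance_fn):
    """Iteratively update node features and drop low-distance nodes until one remains."""
    while features.size(0) > 1:
        features = gcn_layer(features, adj)          # update node features
        dist = distance_fn(features)                 # per-node distance score
        keep = dist.argsort(descending=True)[: max(1, features.size(0) // 2)]
        features, adj = features[keep], adj[keep][:, keep]
    return features[0]  # the feature representing this text instance
```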
- the methods provided by the above embodiments of the present disclosure propose a graph-based text recognition method.
- the relationship graph is constructed from the feature maps obtained by text instance segmentation, so that the spatial semantic information of the text is expressed using a graph structure; graph convolution is then performed on the relationship graph to extract the two-dimensional spatial feature information of the text to be recognized in the image, and, combined with the sequential relationship between the text instances, recognition of the text to be recognized in the image is realized.
- FIG. 4 it shows a flow 400 of another embodiment of the method for recognizing text according to the present disclosure.
- the method for recognizing text includes the following steps:
- Step 401 acquiring a feature map obtained by segmenting a text instance of an image presenting text to be recognized.
- Step 402 constructing a relationship graph according to the feature graph.
- Step 403 using the pre-trained graph convolutional network to process the relationship graph to obtain the first text feature corresponding to the image.
- Step 404 using the pre-trained language model to process the feature map to obtain the second text feature corresponding to the image.
- the language model may be any of various existing language models used for text recognition, such as the N-Gram model, HMM (Hidden Markov Model), BERT (Bidirectional Encoder Representations from Transformers), and so on.
- the feature representation of the text sequence generated by the language model before the output layer can be selected as the second text feature.
- the pre-trained semantic feature extraction network can be used to process the feature map to obtain the semantic features corresponding to the feature map. Then input the semantic features into the language model to obtain the second text features.
- the semantic feature extraction network can adopt the structure of various existing feature extraction networks.
- a semantic feature extraction network may include a pooling layer that pools feature maps and a linear layer that linearly transforms the output of the pooling layer. Specifically, the input feature map is pooled first, and then the pooled result is linearly transformed to reduce the resolution of the image space and generate semantic features.
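- A minimal sketch of such a pooling-plus-linear extraction network is shown below (sizes are illustrative assumptions):

```python
import torch.nn as nn

class SemanticFeatureNet(nn.Module):
    def __init__(self, in_ch=64, out_dim=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # pool each feature map
        self.linear = nn.Linear(in_ch, out_dim)   # linear transformation

    def forward(self, feature_maps):              # (T, C, H, W) per-instance maps
        x = self.pool(feature_maps).flatten(1)    # (T, C): reduced spatial resolution
        return self.linear(x)                     # semantic features for the language model
```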
- the language model can adopt the structure of various existing models based on natural language processing.
- the translation model can use several text instances before and after each text instance to predict the semantics of the text, so as to realize text recognition.
- Step 405 generating a text recognition result of the image according to the first text feature and the second text feature.
- various methods may be used in combination with the first text feature and the second text feature to generate the text recognition result of the image.
- various existing feature fusion methods may be used to fuse the first text feature and the second text feature to obtain a fused text feature, and then recognize the fused text feature to obtain a recognition result.
- since the text recognition process of the language model uses one-dimensionally compressed feature information and ignores two-dimensional spatial feature information, combining the first text feature and the second text feature allows richer feature information to be used, enabling more reliable text recognition.
- a text recognition result of an image is generated according to the first text feature, the second text feature, and the feature map.
- the feature map itself can represent the image features of the text to be recognized; using the first text feature and the second text feature for text recognition while also incorporating the feature map obtained by text instance segmentation can further enhance the ability to represent text features and improve text recognition performance.
- the specific identification method can be flexibly set according to actual application requirements. For example, various existing feature fusion methods can be used to first fuse the first text feature, the second text feature and the feature map, and then use the fused features to predict the text recognition result.
- the first text feature, the second text feature and the feature map can be input to the pre-trained feature fusion network to generate the text recognition result of the image.
- the feature fusion network can be used to concatenate the first text feature, the second text feature and the feature map, and then linearly transform the obtained splicing result to obtain the text recognition result of the image.
- the structure of the feature fusion network can be preset by technicians.
- as an example, the fusion performed by the feature fusion network can be expressed as:
- z_t = Sigmoid( W_z [v_t ; l_t ; g_t] ), f_t = z_t ⊙ W_f [v_t ; l_t ; g_t]
- where f_t represents the fusion result of the feature fusion network; v_t, l_t and g_t denote the feature map, the second text feature and the first text feature, respectively; W_z and W_f represent the linear transformation parameters to be learned by the feature fusion network; ⊙ represents element-wise multiplication; ";" indicates the concatenation operation according to dimension; Sigmoid is the activation function; and t is the serial number of the feature map among the feature maps obtained by performing text instance segmentation on the image presenting the text to be recognized.
- since the feature maps have an order relationship, and the first text feature and the second text feature are also feature representations of the text sequence generated based on this order relationship, each fusion combines the feature map, the second text feature and the first text feature that correspond to the same text instance, so as to obtain the final fusion result.
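- A sketch matching the gated fusion written above is given below (PyTorch is an assumed framework; the equal feature dimension `dim` for v_t, l_t and g_t is an assumption):

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_z = nn.Linear(3 * dim, dim)  # W_z
        self.w_f = nn.Linear(3 * dim, dim)  # W_f

    def forward(self, v, l, g):             # each: (T, dim), T = number of feature maps
        cat = torch.cat([v, l, g], dim=-1)  # ';' concatenation by dimension
        z = torch.sigmoid(self.w_z(cat))    # gate z_t
        return z * self.w_f(cat)            # element-wise product, f_t
```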
- because the feature maps have a contextual sequence relationship, the present disclosure can recognize the text instance corresponding to each feature map, improving processing efficiency.
- FIG. 5a, FIG. 5b and FIG. 5c are schematic diagrams of an exemplary application scenario of the method for recognizing text according to this embodiment.
- as shown in FIG. 5a, the image presenting the text to be recognized (i.e., the text image) can be obtained first; the text image is then input to the pre-trained convolutional network to extract the initial feature map; the initial feature map is input separately to the text instance segmentation network and the text sequential segmentation network to extract the instance feature maps and sequential feature maps; and the obtained instance feature maps and sequential feature maps are fused to form multiple feature maps corresponding to the text image.
- a relationship graph can be constructed based on the obtained multiple feature maps, and then the pre-trained graph convolutional network can be used to process the constructed relationship graph to obtain the first text feature of the text image.
- the language model is used to process the obtained multiple feature maps to obtain the second text feature of the text image.
- text recognition is performed by synthesizing the multiple feature maps obtained, the first text feature and the second text feature, and a text recognition result corresponding to the text image is obtained.
- the process of using the graph convolutional network and the language model to process the obtained multiple feature maps can refer to Fig. 5b for details.
- the relationship subgraph corresponding to each feature map can be constructed first, and then the relationship subgraphs are merged to obtain the relationship graph, and then the merged relationship graph is processed by the graph convolution network to obtain the feature representation of the nodes in the relationship graph .
- the feature map can be input into the semantic feature extraction network to extract semantic features, and then the translation model can be used to form a feature representation of the text sequence corresponding to the text to be recognized based on the semantic features.
- a linear layer can be used to perform dimension transformation and other processing, so as to fuse the feature representation of the nodes in the relationship graph with the feature representation of the text sequence, and a text recognition result of the text image is generated according to the fusion result.
- the specific process of constructing a corresponding relational subgraph for each feature map can refer to FIG. 5c.
- the order feature of the feature map, as well as the position feature and category feature of each pixel can be determined first.
- the sequence feature and position feature can be fused and the vector representation of each node can be formed by mapping and other methods, so as to obtain the feature matrix of the relational subgraph composed of the vector representation of each node.
- the similarity between the nodes indicated by each pixel can be determined according to the category features, and neighbor nodes can be searched for each node to build edges according to the similarity between nodes, so as to form the adjacency matrix of the relational subgraph.
- the obtained feature matrix and adjacency matrix can be used to represent the relation subgraph corresponding to the feature subgraph.
- the method provided by the above-mentioned embodiments of the present disclosure dynamically integrates the graph-based text recognition method and the language-model-based text recognition method, so that the two recognition methods can complement each other and use more information for text recognition, thereby improving the text recognition effect and better adapting to the various situations encountered in natural scene text recognition, such as complex image backgrounds, uneven lighting, blurred images, and varied text shapes.
- the methods described in the above embodiments of the present disclosure can be applied to text recognition in practical business scenarios such as navigation assistance for the visually impaired, automatic driving, and text reading and translation in augmented reality; other actual business scenarios to which the methods of the embodiments of the present disclosure can also be applied are not listed here one by one.
- FIG. 6 it shows a flow 600 of an embodiment of a training method for a graph convolutional network, a language model and a segmentation network in a method for recognizing text according to the present disclosure.
- the graph convolutional network, language model and segmentation network can be trained through the following steps:
- Step 601 obtain a labeled training data set and an unlabeled training data set.
- the training of the graph convolutional network, the language model and the segmentation network can be performed by the executing subject of the above-mentioned method for recognizing text, or can be performed by other electronic devices.
- the execution subjects of training graph convolutional network, language model and segmentation network can obtain labeled training datasets and unlabeled training datasets from local or other storage devices.
- the labeled training data set is composed of labeled training data
- the unlabeled training data set is composed of unlabeled training data.
- annotated training datasets can include annotations of different granularities; for example, for the recognition of character-based text, the annotations can include character-level annotations and word-level annotations.
- Step 602 constructing the teacher-student network corresponding to the graph convolutional network, the language model and the segmentation network, and using the labeled training data set, the unlabeled training data set and the preset loss function to train the teacher-student network.
- the teacher-student network is a network structure in transfer learning.
- the structure of teacher network and student network can be identical or different.
- parameter sharing between the teacher network and the student network can be realized based on techniques such as EMA (Exponential Moving Average).
- the input of the student network may include labeled training data and unlabeled training data
- the input of the teacher network may only include unlabeled training data.
- the loss function may include a first loss function, a second loss function and a third loss function.
- the first loss function and the second loss function may respectively represent the difference between the output result of the student network for the labeled training data and the labels of different granularities.
- the first loss function can be used to measure the difference between the character-level recognition results output by the student network and the real character-level annotations.
- the second loss function can be used to measure the difference between the word-level recognition results output by the student network and the real word-level annotations.
- the third loss function may represent the difference between the output results of the student network and the teacher network respectively for the unlabeled training data.
- the first loss function, the second loss function, and the third loss function can be combined (for example, by optimizing the sum of the three loss functions) to adjust the parameters of the graph convolutional network, the language model, and the segmentation network, thereby completing the training of the graph convolutional network, the language model, and the segmentation network.
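- As an illustration, a sketch of the combined objective and the EMA parameter sharing is given below (the equal loss weights and the decay value are assumptions, not values fixed by the present disclosure):

```python
import torch

def total_loss(l_char, l_word, l_consistency, w=(1.0, 1.0, 1.0)):
    # first (character-level) + second (word-level) + third (teacher-student) loss
    return w[0] * l_char + w[1] * l_word + w[2] * l_consistency

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Share parameters from the student to the teacher via exponential moving average."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1 - decay)
```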
- the training method for the graph convolutional network, language model, and segmentation network uses labeled training data and unlabeled training data to jointly train the three networks, alleviating the practical problems of scarce training data and the difficulty of labeling real data; it can also improve the generalization and robustness of the overall network composed of the graph convolutional network, language model, and segmentation network, thereby helping to improve the accuracy of text recognition results.
- the present disclosure provides an embodiment of a device for recognizing text, which corresponds to the method embodiment shown in FIG. 2; the device can be applied to various electronic devices.
- the apparatus 700 for recognizing text includes a feature map acquisition unit 701 , a relationship graph construction unit 702 , a graph convolution processing unit 703 and a recognition unit 704 .
- the feature map acquisition unit 701 is configured to acquire a feature map, wherein the feature map is obtained by segmenting the text instance of the image presenting the text to be recognized;
- the relational map construction unit 702 is configured to construct a relational graph according to the feature map, wherein the nodes in the relational graph represent the pixels in the feature map, an edge represents that the similarity of the spatial semantic features of the two connected nodes is greater than a target threshold, and the spatial semantic features include the position features and category features of the pixels indicated by the nodes.
- the graph convolution processing unit 703 is configured to use the pre-trained graph convolutional network to process the relational graph to obtain the first text feature corresponding to the image; the recognition unit 704 is configured to generate a text recognition result of the image according to the first text feature.
- for the specific processing of the feature map acquisition unit 701, the relationship graph construction unit 702, the graph convolution processing unit 703, and the recognition unit 704, and the technical effects they bring, reference may be made to the related descriptions of steps 201, 202, 203, and 204 in the embodiment corresponding to FIG. 2; details are not repeated here. A compact sketch of how the four units cooperate follows.
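- Illustrative only (not part of the original disclosure): a compact Python sketch of how the four units might be chained; all four callables are hypothetical stand-ins for units 701 to 704.

```python
def recognize_text(image, acquire_feature_map, build_relationship_graph, gcn, recognize):
    fmap = acquire_feature_map(image)        # feature map acquisition unit 701
    graph = build_relationship_graph(fmap)   # relationship graph construction unit 702
    first_text_feature = gcn(graph)          # graph convolution processing unit 703
    return recognize(first_text_feature)     # recognition unit 704
```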
- the apparatus 700 for recognizing text further includes: a language model processing unit (not shown in the figure) configured to process the feature map using a pre-trained language model to obtain a second text feature corresponding to the image; and the recognition unit 704 is further configured to generate the text recognition result of the image according to the first text feature and the second text feature.
- the recognition unit 704 is further configured to: generate the text recognition result of the image according to the first text feature, the second text feature, and the feature map.
- the feature map is at least two feature maps; and the relationship graph construction unit 702 is further configured to: for each feature map in the at least two feature maps, construct a relationship subgraph corresponding to that feature map; and merge the relationship subgraphs respectively corresponding to the feature maps to obtain the relationship graph.
- each feature map in the at least two feature maps is used to characterize the image features of the image region where a respective character in the text to be recognized is located, and its sequence features relative to the other feature maps.
- the feature map is generated through the following steps: input the image to a pre-trained convolutional neural network to obtain an initial feature map; perform text instance segmentation on the initial feature map to obtain an instance feature map; perform text sequence segmentation on the initial feature map to obtain a sequence feature map; and fuse the instance feature map and the sequence feature map to obtain the feature map obtained by performing text instance segmentation on the image presenting the text to be recognized.
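- Illustrative only: a sketch of the generation steps just listed, with backbone, instance_head, and order_head as hypothetical stand-ins for the pre-trained convolutional neural network and the two segmentation branches; element-wise addition is assumed as one possible fusion.

```python
import torch

def build_feature_map(image: torch.Tensor,
                      backbone: torch.nn.Module,       # pre-trained CNN (hypothetical)
                      instance_head: torch.nn.Module,  # text instance segmentation (hypothetical)
                      order_head: torch.nn.Module      # text sequence segmentation (hypothetical)
                      ) -> torch.Tensor:
    initial = backbone(image)          # initial feature map
    instance = instance_head(initial)  # instance feature map
    order = order_head(initial)        # sequence (order) feature map
    # Fuse the two maps; simple addition is assumed here for illustration.
    return instance + order
```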
- the convolutional neural network is implemented based on a feature pyramid network and a residual network, wherein the stride of at least one convolutional layer before the output layer of the residual network is 1, and the feature map output by the residual network is passed through a deformable convolution to generate the input feature map of the feature pyramid network.
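- Illustrative only: a sketch, assuming PyTorch and torchvision, of passing a residual-network output through a deformable convolution to produce a feature pyramid network input, as described above; the channel count and kernel size are arbitrary assumptions.

```python
import torch
from torch import nn
from torchvision.ops import DeformConv2d

class DeformableBridge(nn.Module):
    """Transforms a residual-network output map into an FPN input map via deformable convolution."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # The offset branch predicts 2 offsets (x, y) per position of a 3x3 kernel.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offset(x))
```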
- the above-mentioned language model processing unit is further configured to: process the feature map using a pre-trained semantic feature extraction network to obtain the semantic feature corresponding to the feature map, wherein the semantic feature extraction network includes a pooling layer for pooling the feature map and a linear layer for linearly transforming the output of the pooling layer; and input the semantic feature to the language model to obtain the second text feature.
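- Illustrative only: a minimal PyTorch sketch of a semantic feature extraction network built from a pooling layer followed by a linear layer, as described above; adaptive average pooling and the dimensions are assumptions.

```python
import torch
from torch import nn

class SemanticFeatureExtractor(nn.Module):
    def __init__(self, channels: int = 256, out_dim: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)         # pool the feature map to 1x1
        self.linear = nn.Linear(channels, out_dim)  # linearly transform the pooled output

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(feature_map).flatten(1)  # (batch, channels)
        return self.linear(pooled)                  # semantic feature fed to the language model
```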
- the spatial semantic features also include the sequence feature of the feature map in which the pixel indicated by the node is located; and for a node in a relationship subgraph, the spatial semantic feature of the node is generated through the following steps: obtain the sequence value of the feature map corresponding to the node among the at least two feature maps; determine the sequence feature of the feature map corresponding to the node according to the sequence value; and concatenate the position feature and category feature of the pixel indicated by the node with the determined sequence feature to obtain the spatial semantic feature of the node.
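- Illustrative only: a NumPy sketch of assembling a node's spatial semantic feature by concatenation, as described above; the one-hot encoding of the sequence value is an assumption.

```python
import numpy as np

def node_spatial_semantic_feature(position, category, order_value: int, num_maps: int):
    """Concatenate position, category, and sequence features for one node."""
    position = np.asarray(position, dtype=float)  # e.g. (x, y) of the pixel
    category = np.asarray(category, dtype=float)  # per-class scores of the pixel
    order_feat = np.eye(num_maps)[order_value]    # one-hot encoding assumed for the sequence value
    return np.concatenate([position, category, order_feat])
```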
- the similarity of the spatial semantic features of two nodes is determined through the following steps: determine the Euclidean distance between the two nodes, and determine a first similarity according to the determined Euclidean distance; determine the cosine similarity of the category features respectively corresponding to the two nodes; and determine the similarity of the spatial semantic features of the two nodes according to the first similarity and the cosine similarity.
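- Illustrative only: a NumPy sketch of one plausible reading of these steps; the mapping from Euclidean distance to the first similarity (a negative exponential here) and the combination of the two similarities (a plain average) are assumptions, since the disclosure leaves them open.

```python
import numpy as np

def spatial_semantic_similarity(pos_a, pos_b, cat_a, cat_b):
    # First similarity: derived from the Euclidean distance between node positions.
    dist = np.linalg.norm(np.asarray(pos_a, dtype=float) - np.asarray(pos_b, dtype=float))
    first_sim = np.exp(-dist)  # assumed mapping from distance to similarity
    # Cosine similarity of the two nodes' category features.
    cat_a = np.asarray(cat_a, dtype=float)
    cat_b = np.asarray(cat_b, dtype=float)
    cos_sim = cat_a @ cat_b / (np.linalg.norm(cat_a) * np.linalg.norm(cat_b) + 1e-8)
    # Combine the two similarities; a simple average is assumed here.
    return 0.5 * (first_sim + cos_sim)

# An edge is added when the similarity exceeds the target threshold, e.g.:
# if spatial_semantic_similarity(p_i, p_j, c_i, c_j) > target_threshold: graph.add_edge(i, j)
```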
- the relationship graph construction unit 702 is further configured to: connect the root nodes of adjacent relationship subgraphs in sequence according to the sequence relationship between the relationship subgraphs.
- the root node of a feature map is determined through the following steps: for a node in the feature map, determine the intersection-over-union ratio of the node with each of the other nodes, where the intersection-over-union ratio represents the ratio of the number of elements included in the intersection of the neighbor sets of two nodes to the number of elements included in the union of their neighbor sets; and in response to determining that none of the intersection-over-union ratios corresponding to the node is greater than a preset threshold, determine that the node is the root node.
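- Illustrative only: a sketch of the root-node rule just described, assuming the subgraph is given as a dict mapping each node to the set of its neighbor nodes; the threshold is left as a parameter.

```python
def find_root(neighbors: dict, threshold: float):
    """Return the first node whose neighbor-set IoU with every other node
    does not exceed the threshold, per the rule described above."""
    for node, n_set in neighbors.items():
        ious = []
        for other, o_set in neighbors.items():
            if other == node:
                continue
            union = n_set | o_set
            ious.append(len(n_set & o_set) / len(union) if union else 0.0)
        if all(iou <= threshold for iou in ious):
            return node
    return None  # no node satisfies the condition
```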
- the graph convolutional network includes a first graph convolutional network and an output network, wherein the first graph convolutional network is used to transform the feature matrix of the relationship graph, and the output network is used to select nodes from each relationship subgraph according to the output of the first graph convolutional network, and to aggregate the transformed features corresponding to the nodes respectively selected from the relationship subgraphs to obtain the first text feature.
- the first graph convolutional network includes a first graph convolutional subnetwork and a second graph convolutional subnetwork, wherein the first graph convolutional subnetwork is used to transform the feature matrix of the relationship graph, and the second graph convolutional subnetwork is used to transform the feature matrix output by the first graph convolutional subnetwork according to a relationship graph constructed based on the output of the first graph convolutional subnetwork.
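- Illustrative only: a minimal PyTorch sketch of a two-stage graph convolution in this spirit, using the common propagation rule relu(A H W); adjacency normalization and the concrete rule for rebuilding the relationship graph from intermediate features are assumptions delegated to the caller.

```python
import torch
from torch import nn

class TwoStageGraphConv(nn.Module):
    """First subnetwork transforms the relationship graph's feature matrix;
    the second transforms that output using a graph rebuilt from it."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, adj: torch.Tensor, x: torch.Tensor, rebuild_graph):
        h = torch.relu(self.w1(adj @ x))      # first graph convolutional subnetwork
        adj2 = rebuild_graph(h)               # relationship graph rebuilt from its output
        return torch.relu(self.w2(adj2 @ h))  # second graph convolutional subnetwork
```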
- the recognition unit 704 is further configured to: input the first text feature, the second text feature, and the feature map to a pre-trained feature fusion network to generate the text recognition result of the image, where the feature fusion network is used to concatenate the first text feature, the second text feature, and the feature map, and to linearly transform the resulting concatenation to obtain the text recognition result of the image.
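- Illustrative only: a minimal PyTorch sketch of a fusion network that concatenates the first text feature, the second text feature, and a flattened feature map and applies a linear transformation, as described above; all dimensions are assumptions.

```python
import torch
from torch import nn

class FeatureFusion(nn.Module):
    def __init__(self, dim_first: int, dim_second: int, dim_map: int, num_classes: int):
        super().__init__()
        # Linear transformation applied to the concatenation of the three inputs.
        self.linear = nn.Linear(dim_first + dim_second + dim_map, num_classes)

    def forward(self, first_feat, second_feat, map_feat):
        # map_feat is assumed to be flattened to match the last dimension.
        fused = torch.cat([first_feat, second_feat, map_feat], dim=-1)
        return self.linear(fused)  # logits forming the recognition result
```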
- performing text instance segmentation on the initial feature map to obtain the instance feature map includes: performing text instance segmentation on the initial feature map using a pre-trained text instance segmentation network to obtain the instance feature map; and performing text sequence segmentation on the initial feature map to obtain a sequence feature map includes: performing text sequence segmentation on the initial feature map using a pre-trained text sequence segmentation network to obtain the sequence feature map; and the above-mentioned graph convolutional network, language model, and segmentation network are trained through the following steps, wherein the segmentation network includes the convolutional neural network, the text instance segmentation network, and the text sequence segmentation network: obtain a labeled training data set and an unlabeled training data set, wherein the training data in the labeled training data set includes annotations of different granularities; construct the teacher-student network corresponding to the graph convolutional network, language model, and segmentation network; and train the teacher-student network using the labeled training data set, the unlabeled training data set, and the preset loss functions.
- in this way, the feature map acquisition unit obtains the feature map obtained by performing text instance segmentation on the image presenting the text to be recognized; the relationship graph construction unit constructs the relationship graph according to the feature map, wherein the nodes in the relationship graph represent the pixels in the feature map, an edge indicates that the similarity of the spatial semantic features of the two connected nodes is greater than the target threshold, and the spatial semantic features include the position features and category features of the pixels indicated by the nodes; the graph convolution processing unit processes the relationship graph using the pre-trained graph convolutional network to obtain the first text feature corresponding to the image; and the recognition unit generates the text recognition result of the image according to the first text feature, thereby realizing graph-based text recognition. This approach takes into account the two-dimensional spatial information of the text in the image, which helps to improve the text recognition effect.
- referring to FIG. 8, it shows a schematic structural diagram of an electronic device (such as the server in FIG. 1) 800 suitable for implementing embodiments of the present disclosure.
- the terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
- the server shown in FIG. 8 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
- an electronic device 800 may include a processing device (such as a central processing unit or a graphics processing unit) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic device 800 are also stored.
- the processing device 801, ROM 802, and RAM 803 are connected to each other through a bus 804.
- An input/output (I/O) interface 805 is also connected to the bus 804 .
- the following devices may be connected to the I/O interface 805: an input device 806 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 807 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 808 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 809.
- the communication means 809 may allow the electronic device 800 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 8 shows electronic device 800 having various means, it is to be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided. Each block shown in FIG. 8 may represent one device, or may represent multiple devices as required.
- embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, where the computer program includes program codes for executing the methods shown in the flowcharts.
- the computer program may be downloaded and installed from a network via communication means 809, or from storage means 808, or from ROM 802.
- when the computer program is executed by the processing device 801, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
- the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
- a computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
- the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
- the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device is caused to: acquire a feature map, wherein the feature map is obtained by performing text instance segmentation on an image presenting text to be recognized; construct a relationship graph according to the feature map, wherein the nodes in the relationship graph represent the pixels in the feature map, an edge indicates that the similarity of the spatial semantic features of the two connected nodes is greater than the target threshold, and the spatial semantic features include the position features and category features of the pixels indicated by the nodes; process the relationship graph using the pre-trained graph convolutional network to obtain the first text feature corresponding to the image; and generate the text recognition result of the image according to the first text feature.
- Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
- each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware.
- the described units may also be provided in a processor; for example, it may be described as: a processor including a feature map acquisition unit, a relationship graph construction unit, a graph convolution processing unit, and a recognition unit.
- the names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the feature map acquisition unit may also be described as "a unit that acquires a feature map, wherein the feature map is obtained by performing text instance segmentation on an image presenting the text to be recognized".
Claims (20)
- 1. A method for recognizing text, comprising: acquiring a feature map, wherein the feature map is obtained by performing text instance segmentation on an image presenting text to be recognized; constructing a relationship graph according to the feature map, wherein nodes in the relationship graph represent pixels in the feature map, an edge indicates that the similarity of the spatial semantic features of the two connected nodes is greater than a target threshold, and the spatial semantic features include position features and category features of the pixels indicated by the nodes; processing the relationship graph using a pre-trained graph convolutional network to obtain a first text feature corresponding to the image; and generating a text recognition result of the image according to the first text feature.
- 2. The method according to claim 1, wherein the method further comprises: processing the feature map using a pre-trained language model to obtain a second text feature corresponding to the image; and the generating a text recognition result of the image according to the first text feature comprises: generating the text recognition result of the image according to the first text feature and the second text feature.
- 3. The method according to claim 2, wherein the generating the text recognition result of the image according to the first text feature and the second text feature comprises: generating the text recognition result of the image according to the first text feature, the second text feature, and the feature map.
- 4. The method according to claim 3, wherein the feature map comprises at least two feature maps; and the constructing a relationship graph according to the feature map comprises: for each feature map in the at least two feature maps, constructing a relationship subgraph corresponding to that feature map; and merging the relationship subgraphs respectively corresponding to the feature maps to obtain the relationship graph.
- 5. The method according to claim 4, wherein each feature map in the at least two feature maps is used to characterize the image features of the image region in which a respective character of the text to be recognized is located and its sequence features relative to the other feature maps.
- 6. The method according to claim 5, wherein the feature map is generated through the following steps: inputting the image to a pre-trained convolutional neural network to obtain an initial feature map; performing text instance segmentation on the initial feature map to obtain an instance feature map; performing text sequence segmentation on the initial feature map to obtain a sequence feature map; and fusing the instance feature map and the sequence feature map to obtain the feature map obtained by performing text instance segmentation on the image presenting the text to be recognized.
- 7. The method according to claim 6, wherein the convolutional neural network is implemented based on a feature pyramid network and a residual network, wherein the stride of at least one convolutional layer before the output layer of the residual network is 1, and the feature map output by the residual network is passed through a deformable convolution to generate the input feature map of the feature pyramid network.
- 8. The method according to claim 2, wherein the processing the feature map using a pre-trained language model to obtain a second text feature corresponding to the image comprises: processing the feature map using a pre-trained semantic feature extraction network to obtain a semantic feature corresponding to the feature map, wherein the semantic feature extraction network includes a pooling layer for pooling the feature map and a linear layer for linearly transforming the output of the pooling layer; and inputting the semantic feature to the language model to obtain the second text feature.
- 9. The method according to claim 4, wherein the spatial semantic features further include a sequence feature of the feature map in which the pixel indicated by a node is located; and for a node in a relationship subgraph, the spatial semantic feature of the node is generated through the following steps: acquiring the sequence value of the feature map corresponding to the node among the at least two feature maps; determining the sequence feature of the feature map corresponding to the node according to the sequence value; and concatenating the position feature and category feature of the pixel indicated by the node with the determined sequence feature to obtain the spatial semantic feature of the node.
- 10. The method according to claim 4, wherein the similarity of the spatial semantic features of two nodes is determined through the following steps: determining the Euclidean distance between the two nodes, and determining a first similarity according to the determined Euclidean distance; determining the cosine similarity of the category features respectively corresponding to the two nodes; and determining the similarity of the spatial semantic features of the two nodes according to the first similarity and the cosine similarity.
- 11. The method according to claim 4, wherein the merging the relationship subgraphs respectively corresponding to the feature maps comprises: connecting the root nodes of adjacent relationship subgraphs in sequence according to the sequence relationship between the relationship subgraphs.
- 12. The method according to claim 11, wherein the root node of a feature map is determined through the following steps: for a node in the feature map, determining the intersection-over-union ratio of the node with each of the other nodes, wherein the intersection-over-union ratio represents the ratio of the number of elements included in the intersection of the neighbor sets of two nodes to the number of elements included in the union of their neighbor sets; and in response to determining that none of the intersection-over-union ratios corresponding to the node is greater than a preset threshold, determining that the node is the root node.
- 13. The method according to claim 4, wherein the graph convolutional network includes a first graph convolutional network and an output network, wherein the first graph convolutional network is used to transform the feature matrix of the relationship graph, and the output network is used to select nodes from each relationship subgraph according to the output of the first graph convolutional network, and to aggregate the transformed features corresponding to the nodes respectively selected from the relationship subgraphs to obtain the first text feature.
- 14. The method according to claim 13, wherein the first graph convolutional network includes a first graph convolutional subnetwork and a second graph convolutional subnetwork, wherein the first graph convolutional subnetwork is used to transform the feature matrix of the relationship graph, and the second graph convolutional subnetwork is used to transform the feature matrix output by the first graph convolutional subnetwork according to a relationship graph constructed based on the output of the first graph convolutional subnetwork.
- 15. The method according to claim 3, wherein the generating the text recognition result of the image according to the first text feature, the second text feature, and the feature map comprises: inputting the first text feature, the second text feature, and the feature map to a pre-trained feature fusion network to generate the text recognition result of the image, wherein the feature fusion network is used to concatenate the first text feature, the second text feature, and the feature map, and to linearly transform the resulting concatenation to obtain the text recognition result of the image.
- 16. The method according to claim 6, wherein the performing text instance segmentation on the initial feature map to obtain an instance feature map comprises: performing text instance segmentation on the initial feature map using a pre-trained text instance segmentation network to obtain the instance feature map; the performing text sequence segmentation on the initial feature map to obtain a sequence feature map comprises: performing text sequence segmentation on the initial feature map using a pre-trained text sequence segmentation network to obtain the sequence feature map; and the graph convolutional network, the language model, and a segmentation network are trained through the following steps, wherein the segmentation network includes the convolutional neural network, the text instance segmentation network, and the text sequence segmentation network: acquiring a labeled training data set and an unlabeled training data set, wherein the training data in the labeled training data set includes annotations of different granularities; and constructing a teacher-student network corresponding to the graph convolutional network, the language model, and the segmentation network, and training the teacher-student network using the labeled training data set, the unlabeled training data set, and preset loss functions, wherein the input of the student network includes labeled training data and unlabeled training data, the input of the teacher network includes unlabeled training data, the loss functions include a first loss function, a second loss function, and a third loss function, the first loss function and the second loss function respectively represent the differences between the output results of the student network for the labeled training data and the annotations of different granularities, and the third loss function represents the difference between the output results of the student network and the teacher network respectively for the unlabeled training data.
- 17. An apparatus for recognizing text, wherein the apparatus comprises: a feature map acquisition unit configured to acquire a feature map, wherein the feature map is obtained by performing text instance segmentation on an image presenting text to be recognized; a relationship graph construction unit configured to construct a relationship graph according to the feature map, wherein nodes in the relationship graph represent pixels in the feature map, an edge indicates that the similarity of the spatial semantic features of the two connected nodes is greater than a target threshold, and the spatial semantic features include position features and category features of the pixels indicated by the nodes; a graph convolution processing unit configured to process the relationship graph using a pre-trained graph convolutional network to obtain a first text feature corresponding to the image; and a recognition unit configured to generate a text recognition result of the image according to the first text feature.
- 18. An electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-16.
- 19. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-16.
- 20. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-16.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/567,243 US20240273932A1 (en) | 2021-06-07 | 2022-04-06 | Method for recognizing text, and apparatus |
JP2023575611A JP2024526065A (ja) | 2021-06-07 | 2022-04-06 | テキストを認識するための方法および装置 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110632180.5A CN115457531A (zh) | 2021-06-07 | 2021-06-07 | Method and apparatus for recognizing text |
CN202110632180.5 | 2021-06-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022257578A1 (zh) | 2022-12-15 |
Family
ID=84295273
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/085317 WO2022257578A1 (zh) | 2021-06-07 | 2022-04-06 | 用于识别文本的方法和装置 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240273932A1 (zh) |
JP (1) | JP2024526065A (zh) |
CN (1) | CN115457531A (zh) |
WO (1) | WO2022257578A1 (zh) |
- 2021-06-07: CN CN202110632180.5A patent/CN115457531A/zh active Pending
- 2022-04-06: WO PCT/CN2022/085317 patent/WO2022257578A1/zh active Application Filing
- 2022-04-06: US US18/567,243 patent/US20240273932A1/en active Pending
- 2022-04-06: JP JP2023575611 patent/JP2024526065A/ja active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106022363A (zh) * | 2016-05-12 | 2016-10-12 | 南京大学 | Chinese character recognition method suitable for natural scenes |
CN108549893A (zh) * | 2018-04-04 | 2018-09-18 | 华中科技大学 | End-to-end recognition method for scene text of arbitrary shapes |
CN111414913A (zh) * | 2019-01-08 | 2020-07-14 | 北京地平线机器人技术研发有限公司 | Character recognition method, recognition apparatus, and electronic device |
WO2020254924A1 (en) * | 2019-06-16 | 2020-12-24 | Way2Vat Ltd. | Systems and methods for document image analysis with cardinal graph convolutional networks |
CN111209398A (zh) * | 2019-12-30 | 2020-05-29 | 北京航空航天大学 | Text classification method and system based on graph convolutional neural network |
Non-Patent Citations (2)
Title |
---|
DU CHEN; WANG CHUNHENG; WANG YANNA; FENG ZIPENG; ZHANG JIYUAN: "TextEdge: Multi-oriented Scene Text Detection via Region Segmentation and Edge Classification", 2019 INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), IEEE, 20 September 2019 (2019-09-20), pages 375 - 380, XP033701254, DOI: 10.1109/ICDAR.2019.00067 * |
YAO LIANG, MAO CHENGSHENG, LUO YUAN: "Graph Convolutional Networks for Text Classification", 13 November 2018 (2018-11-13), XP055775123, Retrieved from the Internet <URL:https://arxiv.org/pdf/1809.05679.pdf> [retrieved on 20210211], DOI: 10.1609/aaai.v33i01.33017370 * |
Also Published As
Publication number | Publication date |
---|---|
JP2024526065A (ja) | 2024-07-17 |
US20240273932A1 (en) | 2024-08-15 |
CN115457531A (zh) | 2022-12-09 |
Legal Events
Code | Title | Description
---|---|---
121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22819183; Country of ref document: EP; Kind code of ref document: A1
WWE | WIPO information: entry into national phase | Ref document number: 18567243; Country of ref document: US
ENP | Entry into the national phase | Ref document number: 2023575611; Country of ref document: JP; Kind code of ref document: A
NENP | Non-entry into the national phase | Ref country code: DE
32PN | Ep: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/03/2024)