CN114495101A - Text detection method, and training method and device of text detection network - Google Patents
Text detection method, and training method and device of text detection network
- Publication number: CN114495101A
- Application number: CN202210034256.9A
- Authority: CN (China)
- Prior art keywords: text, sample, image, sequence, determining
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/22 — Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06F18/243 — Pattern recognition; Analysing; Classification techniques relating to the number of classes
- G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
Abstract
The disclosure provides a text detection method and a training method of a text detection network, and relates to the field of image processing technology, in particular to the field of artificial intelligence. The specific implementation scheme is as follows: determining sequence features of an image to be detected; determining a decoded sequence vector based on the sequence features and instance features corresponding to text instances; determining the type of the image to be detected based on the decoded sequence vector; and in response to the type of the image to be detected being that the image to be detected includes text, determining position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence features.
Description
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a text detection method for artificial intelligence, a training method for a text detection network, and an apparatus for the same.
Background
Text detection refers to determining the position of text in an image and finding its bounding box. It is a preliminary step of many visual tasks, such as character recognition and scene understanding, and can be widely applied to business scenarios such as identity card recognition and bill recognition, greatly saving manual input time and improving efficiency in various application scenarios.
Disclosure of Invention
The disclosure provides a text detection method, a training method of a text detection network, and corresponding apparatuses.
According to an aspect of the present disclosure, there is provided a text detection method including:
determining the sequence characteristics of an image to be detected;
determining a decoded sequence vector based on the sequence feature and an instance feature corresponding to the text instance;
determining the type of the image to be detected based on the decoded sequence vector;
and in response to the type of the image to be detected being that the image to be detected includes text, determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature.
According to another aspect of the present disclosure, there is provided a training method of a text detection network including an encoding subnetwork, a decoding subnetwork, and an output subnetwork;
determining sequence sample features of sample images in a training sample set based on the coding subnetwork;
determining the output of the cross-layer attention layer as a decoded sample sequence vector by taking the sequence sample characteristics and the example sample characteristics corresponding to the text example samples as the input of the cross-layer attention layer of the decoding sub-network;
taking the decoded sample sequence vector as the input of the output sub-network, and determining the prediction type of the sample image according to the output of the output sub-network;
in response to the prediction type of the sample image being that the sample image comprises text, taking the decoded sample sequence vector and the sample sequence features as the input of the output sub-network, and determining the predicted position information of the text in the sample image according to the output of the output sub-network;
and matching the prediction type of the sample image with the annotation type of the sample image, and the prediction position information and the annotation position information of the text in the sample image, and adjusting the parameters of the text detection network based on the matching result.
A third aspect of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text detection method or the training method of the text detection network described above.
A fourth aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the text detection method or the training method of the text detection network described above.
A fifth aspect of the present disclosure provides a computer program product comprising a computer program/instructions which, when executed by a processor, implements the text detection method or the training method of a text detection network described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart illustrating an alternative text detection method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating an alternative method for training a text detection network according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating an alternative text detection method provided by an embodiment of the present application;
FIG. 4 is a data diagram illustrating a text detection method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an alternative structure of a text detection apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an alternative structure of a text detection network training apparatus according to an embodiment of the present application;
FIG. 7 illustrates a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The character detection technology determines the position of text in an image and finds its bounding boxes. It is a preliminary step of many visual tasks, such as character recognition and scene understanding, and can be widely applied to business scenarios such as identity card recognition and bill recognition, greatly saving manual input time and improving efficiency in various application scenarios. Text detection in natural scenes differs from general object detection: text, as the main visual target, varies widely in font, size, color, orientation and shape, and is therefore harder to detect than ordinary objects. In recent years, text detection techniques have developed rapidly and achieve good results on conventional text detection data sets, but their performance remains unsatisfactory on challenging natural scene data sets containing text of arbitrary shapes.
Existing methods are mainly divided into regression-based methods and segmentation-based methods. Regression-based methods often need additional shape modeling to handle curved text and still cannot effectively handle text of arbitrary shape; segmentation-based methods handle arbitrary shapes naturally, but often require post-processing rules to distinguish different text instances, so their effect is not optimal.
The former (regression-based methods) locate the target text by regressing the corresponding bounding box; the latter generally adopt a fully convolutional network to perform pixel-by-pixel classification on the image, dividing it into text and non-text regions, and convert the pixel-level output into bounding boxes through specific post-processing operations. Segmentation-based text detection algorithms mainly use Mask R-CNN as the base neural network to generate segmentation maps, but they usually require complex post-processing steps to generate the corresponding bounding boxes, which consumes a lot of memory and time in the inference stage because the generated regions must be refined and labeled. In addition, because of the segmentation formulation, the detection effect is poor when text boxes overlap. Regression-based detection methods usually predict a bounding box directly; common algorithms include EAST and CTPN. Their inference speed is clearly superior to that of segmentation-based methods thanks to the simple post-processing, but their detection effect suffers in complex natural scenes where fonts vary widely and scene interference is severe.
In summary, existing text detection methods have the following disadvantages:
1) Most existing methods achieve satisfactory detection precision on horizontal text boxes. However, when facing text of arbitrary shape in natural scenes, a large number of missed and false detections still occur because of limited modeling capability.
2) Most existing text detection models are based on a CNN architecture and require a large number of artificial prior assumptions and complex post-processing steps, such as anchor design and non-maximum suppression (NMS), which makes the whole pipeline complex.
The disclosure provides a text detection method and a training method of a text detection network, which at least address the above shortcomings of text detection in the prior art.
Fig. 1 shows an alternative flowchart of a text detection method provided in an embodiment of the present application, which will be described according to various steps.
Step S101, determining the sequence features of the image to be detected.
In some embodiments, a text detection device (hereinafter referred to as a first device) converts the image matrix corresponding to the image to be detected into a one-dimensional vector, and determines the features corresponding to the one-dimensional vector as the sequence features of the image to be detected.
In a specific implementation, the first device may determine the sequence features of the image to be detected based on a coding subnetwork included in the text detection network; namely, the image to be detected is input into the coding sub-network, and the sequence characteristics of the image to be detected are obtained.
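For illustration, a minimal coding sub-network might be sketched as follows in PyTorch; the backbone, dimensions, and class name are assumptions and are not taken from the patent:

```python
import torch
import torch.nn as nn

class CodingSubnetworkSketch(nn.Module):
    """Illustrative coding sub-network: a tiny CNN backbone whose h x w feature
    map is flattened row by row into a sequence of h*w position features."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=3, stride=4, padding=1),
            nn.ReLU(),
        )

    def forward(self, image):                    # image: (B, 3, H, W)
        fmap = self.backbone(image)              # (B, C, h, w)
        seq = fmap.flatten(2).permute(0, 2, 1)   # (B, h*w, C): sequence features
        return seq

seq_feat = CodingSubnetworkSketch()(torch.randn(1, 3, 128, 128))   # (1, 1024, 256)
```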
Step S102, determining a decoded sequence vector based on the sequence features and the instance features corresponding to the text instances.
In some embodiments, before determining the decoded sequence vector, the first device obtains an embedded value corresponding to the text instance, where the embedded value is an instance feature corresponding to the text instance; wherein the text examples comprise character strings and/or phrases, and the embedded values corresponding to different text examples are different.
In specific implementation, the first device may obtain an embedded value corresponding to the text instance based on a self-attention layer of a decoding subnetwork included in the text detection network; namely, the text instance is used as the input of the self-attention layer, and the output of the self-attention layer is the embedded value corresponding to the text instance.
In some embodiments, the first device may obtain the decoded sequence vector based on the cross-layer attention layer included in the decoding sub-network; that is, the sequence features and the instance features corresponding to the text instance are used as the input of the cross-layer attention layer, and the output of the cross-layer attention layer is the decoded sequence vector.
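A hedged sketch of this decoding step, assuming PyTorch's multi-head attention; the learnable Object Query embeddings, head count, and sizes are illustrative rather than the patent's actual design:

```python
import torch
import torch.nn as nn

dim, num_queries, seq_len = 256, 25, 1024
query_embed = nn.Embedding(num_queries, dim)   # learnable text instances (Object Queries)
self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

seq_feat = torch.randn(1, seq_len, dim)        # sequence features from the coding sub-network
q = query_embed.weight.unsqueeze(0)            # (1, 25, 256)
inst_feat, _ = self_attn(q, q, q)              # embedded values / instance features
decoded, _ = cross_attn(inst_feat, seq_feat, seq_feat)   # decoded sequence vectors, (1, 25, 256)
```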
Step S103, determining the type of the image to be detected based on the decoded sequence vector.
In some embodiments, the first device may determine the type of the image to be detected based on an output subnetwork included in the text detection network; that is, the decoded sequence vector is used as the input of the fully connected layer included in the output sub-network, and the output of the fully connected layer is the type of the image to be detected.
The type of the image to be detected may include that the image to be detected includes text or that the image to be detected does not include text.
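A minimal sketch of this classification step, assuming the fully connected layer maps each decoded vector to two classes; the two-class convention shown is an assumption:

```python
import torch
import torch.nn as nn

decoded = torch.randn(1, 25, 256)        # decoded sequence vectors from the decoding sub-network
fc = nn.Linear(256, 2)                   # fully connected layer of the output sub-network
logits = fc(decoded)                     # (1, 25, 2)
contains_text = logits.argmax(dim=-1)    # assumed convention: 1 = "includes text", 0 = "no text"
```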
Step S104, in response to the type of the image to be detected being that the image to be detected includes text, determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence features.
In some embodiments, in response to that the type of the image to be detected is that the image to be detected includes text, the first device determines the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature.
In some embodiments, the first device multiplies the decoded sequence vector by the vector corresponding to the sequence features to obtain a product result; determines the position information of the text in the sequence features based on the product result; and determines the position information of the text in the image to be detected based on the position information of the text in the sequence features. Optionally, a tensor multiplication operation may be performed between the decoded sequence vector and the vector corresponding to the sequence features; when performing the tensor multiplication, the two vectors can be made the same length by zero-padding, copying, or similar means.
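One possible reading of the tensor multiplication above is a dot product between each decoded vector and each position of the sequence features, giving a per-position score map for every text instance; the sketch below follows that reading and is an assumption, not the patent's implementation:

```python
import torch

h, w, dim, num_queries = 32, 32, 256, 25
seq_feat = torch.randn(1, h * w, dim)        # vector corresponding to the sequence features
decoded = torch.randn(1, num_queries, dim)   # decoded sequence vectors

scores = torch.einsum('bqc,bpc->bqp', decoded, seq_feat)   # product result, (1, Q, h*w)
pos_in_seq = scores.sigmoid() > 0.5                        # text positions in the sequence features
pos_in_img = pos_in_seq.reshape(1, num_queries, h, w)      # text positions in the image
```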
In some optional embodiments, after the first device determines the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature, the first device may further determine a connected domain in the image to be detected based on the position information of the text in the image to be detected; determining a text bounding box based on the boundaries of the connected domain; the text bounding box is used for identifying the text in the image to be detected.
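A hedged sketch of the connected-domain step using SciPy (the patent does not name a library; the mask values are invented for illustration):

```python
import numpy as np
from scipy import ndimage

mask = np.zeros((32, 32), dtype=np.uint8)   # predicted text positions for one instance (made up)
mask[10:14, 5:20] = 1

labeled, num = ndimage.label(mask)          # connected-domain analysis
for sl in ndimage.find_objects(labeled):    # one bounding box per connected domain
    ys, xs = sl
    print('text bounding box (x1, y1, x2, y2):', xs.start, ys.start, xs.stop, ys.stop)
```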
Therefore, the text detection method provided by the embodiments of the present disclosure can directly predict, at the text-instance level, the position of text of arbitrary shape in an image, and can adapt to detection tasks for text of various shapes in complex scenes. In particular, when combined with optical character recognition (OCR), it can improve OCR's text recognition capability in natural scenes.
Fig. 2 is a schematic flow chart illustrating an alternative method for training a text detection network according to an embodiment of the present application, which will be described according to various steps.
Step S201, determining the sequence sample characteristics of the sample images in the training sample set based on the coding sub-network.
In some embodiments, the text detection network comprises: an encoding sub-network, a decoding sub-network and an output sub-network; wherein the decoding sub-network comprises a self-attention layer and a cross-layer attention layer, and the output sub-network comprises a fully connected layer.
In some embodiments, a training device (hereinafter referred to as a second device) of the text detection network converts an image matrix corresponding to the sample image into a one-dimensional vector characterizing the sample image based on the coding sub-network; and the characteristic corresponding to the one-dimensional vector of the sample image is the sequence sample characteristic of the sample image.
Step S202, taking the sequence sample features and the instance sample features corresponding to the text instance samples as the input of the cross-layer attention layer of the decoding sub-network, and determining the output of the cross-layer attention layer as a decoded sample sequence vector.
In some embodiments, before the second device determines the decoded sample sequence vector, the text instance sample may also be input to the self-attention layer of the decoding sub-network, and an embedded value corresponding to the text instance sample is obtained based on the output of the self-attention layer, where the embedded value is the instance sample feature corresponding to the text instance sample; the text instance samples comprise character strings and/or phrases, and the embedded values corresponding to different text instance samples are different; optionally, the instance sample features may be one-dimensional vectors.
In some embodiments, the second device inputs the sequence sample features and the example sample features to the cross-layer attention layer; determining an output of the cross-layer attention layer as the decoded sample sequence vector.
Step S203, using the decoded sample sequence vector as the input of the output sub-network, and determining the prediction type of the sample image according to the output of the output sub-network.
In some embodiments, the second device determines a prediction type of the sample image based on an output of a fully-connected layer included in the output sub-network using the decoded sample sequence vector as an input to the fully-connected layer.
Wherein the prediction type may include: the sample image includes text, or the sample image does not include text.
Step S204, in response to the prediction type of the sample image being that the sample image comprises text, taking the decoded sample sequence vector and the sample sequence features as the input of the output sub-network, and determining the predicted position information of the text in the sample image according to the output of the output sub-network.
In some embodiments, in response to the prediction type of the sample image being that the sample image includes text, the second device multiplies the decoded sample sequence vector by the vector corresponding to the sample sequence features to obtain a product result; determines the predicted position information of the text in the sample sequence features based on the product result; and determines the predicted position information of the text in the sample image based on the predicted position information of the text in the sample sequence features.
In some optional embodiments, a tensor multiplication operation may be performed between the decoded sample sequence vector and the vector corresponding to the sample sequence features; when performing the tensor multiplication, the two vectors can be made the same length by zero-padding, copying, or similar means.
In specific implementation, after the tensor multiplication of the vector corresponding to the sample sequence features and the decoded sample sequence vector, the entries of the resulting vector that correspond to text are larger (or smaller) than the entries that correspond to non-text, so the predicted position information of the text in the sample sequence features can be determined from the product result. The sample sequence features are a one-dimensional vector obtained by transforming the sample image; through a matrix transformation, the predicted position information of the text in the sample image can therefore be determined based on the predicted position information of the text in the sample sequence features.
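The matrix transformation mentioned above can be illustrated as mapping indices of the 1 × hw sequence back to row/column coordinates of the sample image; the threshold and sizes below are assumptions:

```python
import numpy as np

h, w = 32, 32
scores_1d = np.random.rand(h * w)                 # text scores along the 1 x hw sequence
text_idx = np.flatnonzero(scores_1d > 0.5)        # predicted text positions in the sample sequence features
rows, cols = np.unravel_index(text_idx, (h, w))   # matrix transformation back to image coordinates
```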
Step S205, matching the prediction type of the sample image and the annotation type of the sample image, and the prediction position information and the annotation position information of the text in the sample image, and adjusting the parameters of the text detection network based on the matching result.
In some embodiments, if the prediction type and the annotation type are the same and a loss value between the predicted location information and the annotation location information is less than a preset threshold, the second device determines not to adjust a parameter of the text detection network.
In other embodiments, if the prediction type and the annotation type are different, or the loss value between the predicted position information and the annotation position information is greater than or equal to the preset threshold, the parameter of the text detection network is adjusted based on a difference between the prediction type of the sample image and the annotation type of the sample image, and/or a difference between the predicted position information and the annotation position information of the text in the sample image.
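The adjustment rule of the two paragraphs above can be sketched as follows (PyTorch-style, with an illustrative threshold; this is not the patent's code):

```python
import torch

def maybe_adjust(optimizer, type_loss, position_loss, types_match, threshold=0.1):
    """Illustrative update rule: keep the parameters when the prediction already
    matches the annotation closely, otherwise adjust them from the differences."""
    if types_match and position_loss.item() < threshold:
        return
    loss = type_loss + position_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```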
In specific implementation, the prediction type and the labeling type or the prediction position information and the labeling position information can be matched through a bipartite graph matching algorithm, and type loss and position information loss (or mask loss) are calculated respectively.
The loss value may include at least one of a two-class (binary) cross-entropy loss value and a dice loss value.
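A hedged sketch of the bipartite matching and mask loss, using the Hungarian solver from SciPy and a standard dice loss; the cost definition and sizes are assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    inter = (pred_mask * gt_mask).sum()
    return 1.0 - (2.0 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

num_queries, num_gt, hw = 25, 3, 1024
pred_masks = np.random.rand(num_queries, hw)          # predicted position scores per query
gt_masks = (np.random.rand(num_gt, hw) > 0.5) * 1.0   # annotated position masks

# Cost matrix from mask disagreement; type costs could be added analogously.
cost = np.array([[dice_loss(pred_masks[q], gt_masks[g]) for g in range(num_gt)]
                 for q in range(num_queries)])
rows, cols = linear_sum_assignment(cost)               # bipartite graph matching
total_mask_loss = sum(cost[q, g] for q, g in zip(rows, cols))
```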
Therefore, through the training method of the text detection network provided by the embodiments of the present disclosure, a neural network capable of directly predicting, at the text-instance level, the position of text of arbitrary shape in an image can be obtained, providing strong support for detection tasks of text of various shapes in complex scenes.
Fig. 3 is a schematic flow chart illustrating another alternative text detection method provided in the embodiment of the present application, which will be described according to various steps; fig. 4 shows a data diagram of a text detection method provided in an embodiment of the present application.
In the text detection method provided by the embodiments of the present disclosure, a natural scene image containing text of arbitrary shape (the image to be detected) first passes through the coding sub-network of the text detection network, which extracts the sequence features of the image. The decoding sub-network then attends to different text-instance information in the image through text vectors corresponding to different learnable text instances (Object Queries), and the text detection network outputs the position information of different instance-level text instances. Based on this position information, the bounding box of each text instance can be obtained by simple connected-domain analysis, which provides strong support for subsequent character recognition and improves its accuracy.
The following describes the text detection process performed after the text detection network has been constructed and trained (e.g., through steps S201 to S205):
step S301, inputting the image to be detected into a text detection network.
Step S302, the coding subnetwork included in the text detection network determines the sequence characteristics of the image to be detected.
In some embodiments, the text detection network includes a coding sub-network (or encoding module) that may be based on a convolutional neural network (CNN), a Transformer (a machine learning model), or a network structure mixing CNN and Transformer; the purpose of the coding sub-network is to extract the sequence features of the image to be detected. The sequence consists of the vectors obtained by arranging the image row by row (if the image to be detected is represented as an h × w matrix, the corresponding sequence is a one-dimensional vector of size 1 × hw).
In some embodiments, the coding subnetwork converts the image matrix corresponding to the image to be detected into a one-dimensional vector, and determines the features corresponding to the one-dimensional vector as the sequence features of the image to be detected.
In step S303, the decoding subnetwork included in the text detection network determines the decoded sequence vector.
In some embodiments, the text detection network includes a decoding subnetwork (or decoder module) constructed based on self-attention and cross-layer attention mechanisms, used for decoding the sequence features extracted by the encoding subnetwork. Specifically, the decoding subnetwork comprises a self-attention layer and a cross-layer attention layer; the self-attention layer is used for converting at least one learnable text instance (Object Queries) into at least one text feature, wherein the at least one learnable text instance comprises character strings and/or phrases with different lengths; the cross-layer attention layer is used for outputting the decoded sequence vector.
In some embodiments, the at least one text instance is input to the self-attention layer of the decoding sub-network, and an embedded value corresponding to the at least one text instance is obtained based on the output of the self-attention layer, wherein the embedded value is the instance feature corresponding to the text instance; the sequence features and the instance features corresponding to the text instances are then used as the input of the cross-layer attention layer of the decoding sub-network, and the output of the cross-layer attention layer is determined as the decoded sequence vector.
Step S304, determining the type of the image to be detected and/or the position information of the text in the image to be detected based on the decoded sequence vector.
In some embodiments, the text detection network includes an output subnetwork comprising a fully connected layer; the full connection layer is used for determining the type of the image to be detected based on the decoded sequence vector; the text detection network is further used for determining the position information of the text in the image to be detected.
In specific implementation, the output sub-network may multiply the decoded sequence vector and the sequence features through a tensor multiplication operation, determine the position information of the text in the sequence features based on the multiplication result, and determine the position information of the text in the image to be detected based on the position information of the text in the sequence features.
As shown in fig. 4, the image to be detected is input into the coding sub-network to obtain the sequence features; the text instances are input into the decoding sub-network, which, together with the sequence features, produces the decoded sequence vectors; the output sub-network then multiplies these vectors with the sequence features to obtain the position information of the text in the image to be detected.
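Putting the pieces of fig. 4 together, a minimal end-to-end sketch might look as follows; the module names, dimensions, and the choice of PyTorch are assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

class TextDetectionSketch(nn.Module):
    """End-to-end illustration of the data flow in Fig. 4 (assumed modules and sizes)."""
    def __init__(self, dim=256, num_queries=25):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=3, stride=4, padding=1)  # coding sub-network
        self.queries = nn.Embedding(num_queries, dim)                          # learnable text instances
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.fc = nn.Linear(dim, 2)                                            # output sub-network: type head

    def forward(self, image):                                   # (B, 3, H, W)
        seq = self.backbone(image).flatten(2).permute(0, 2, 1)  # sequence features, (B, hw, C)
        q = self.queries.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        q, _ = self.self_attn(q, q, q)                           # instance features
        dec, _ = self.cross_attn(q, seq, seq)                    # decoded sequence vectors
        types = self.fc(dec)                                     # type of each predicted instance
        positions = torch.einsum('bqc,bpc->bqp', dec, seq)       # per-position text scores
        return types, positions

types, positions = TextDetectionSketch()(torch.randn(1, 3, 128, 128))
```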
Therefore, the text detection method provided by the embodiments of the present application solves the problem that existing text detection methods cannot effectively detect text of arbitrary shape, by converting the arbitrary-shape detection problem into a mask classification problem (i.e., segmenting instances and distinguishing different instances of different classes). Specifically, the method uses the coding sub-network of the text detection network to extract features from the input image to be detected and obtain the sequence features, uses the decoding sub-network to output the position information of at least one text in the image to be detected, and finally performs simple connected-domain analysis on the corresponding positions to obtain the bounding box of each text instance. Compared with conventional regression-based and segmentation-based text detection methods, the method simplifies arbitrary-shape modeling by directly predicting instance-level text position information in the image to be detected, requires no complex manual post-processing, and can effectively improve the detection of arbitrary text in complex natural scenes in a data-driven manner.
Fig. 5 is a schematic diagram illustrating an alternative structure of a text detection apparatus according to an embodiment of the present application, which will be described in detail according to various parts.
In some embodiments, the text detection apparatus 500 includes an encoding unit 501, a decoding unit 502, an image type determination unit 503, and an output unit 504.
The encoding unit 501 is configured to determine a sequence feature of an image to be detected;
a decoding unit 502, configured to determine a decoded sequence vector based on the sequence feature and an instance feature corresponding to a text instance;
the image type determining unit 503 is configured to determine the type of the image to be detected based on the decoded sequence vector;
the output unit 504 is configured to determine, in response to that the type of the image to be detected is that the image to be detected includes a text, position information of the text in the image to be detected based on the decoded sequence vector and a vector corresponding to the sequence feature.
The encoding unit 501 is specifically configured to convert the image matrix corresponding to the image to be detected into a one-dimensional vector, and determine the features corresponding to the one-dimensional vector as the sequence features of the image to be detected.
The decoding unit 502 is further configured to obtain an embedded value corresponding to the text instance, where the embedded value is the instance feature corresponding to the text instance; wherein the text instances comprise character strings and/or phrases, and the embedded values corresponding to different text instances are different.
The output unit 504 is specifically configured to multiply the decoded sequence vector with the vector corresponding to the sequence features to obtain a product result; determine the position information of the text in the sequence features based on the product result; and determine the position information of the text in the image to be detected based on the position information of the text in the sequence features.
In some optional embodiments, the text detection apparatus 500 may further include: a bounding box determination unit 505.
The bounding box determining unit 505 is configured to determine, after determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature, a connected domain in the image to be detected based on the position information of the text in the image to be detected; determining a text bounding box based on the boundaries of the connected domain; the text bounding box is used for identifying the text in the image to be detected.
Fig. 6 is a schematic diagram illustrating an alternative structure of a text detection network training apparatus according to an embodiment of the present application, which will be described according to various parts.
In some embodiments, the training apparatus 600 of the text detection network comprises a first determining unit 601, a second determining unit 602, a third determining unit 603, a responding unit 604 and an adjusting unit 605.
The first determining unit 601 is configured to determine, based on the coding subnetwork, a sequence sample feature of a sample image in a training sample set;
the second determining unit 602, taking the sequence sample feature and an example sample feature corresponding to a text example sample as an input of a cross-layer attention layer of the coding sub-network, and determining an output of the cross-layer attention layer as a decoded sample sequence vector;
the third determining unit 603 is configured to determine the prediction type of the sample image from the output of the output sub-network, using the decoded sample sequence vector as the input of the output sub-network;
the response unit 604 is configured to, in response to that the prediction type of the sample image is that the sample image includes text, use the decoded sample sequence vector and the sample sequence feature as inputs of the output sub-network, and determine predicted position information of the text in the sample image according to an output of the output sub-network;
the adjusting unit 605 is configured to match the prediction type of the sample image and the annotation type of the sample image, and the prediction position information and the annotation position information of the text in the sample image, and adjust the parameter of the text detection network based on the matching result.
The first determining unit 601 is specifically configured to convert, through a coding sub-network, an image matrix corresponding to the sample image into a one-dimensional vector characterizing the sample image; and the characteristic corresponding to the one-dimensional vector of the sample image is the sequence sample characteristic of the sample image.
The second determining unit 602 is further configured to input the text instance sample to the self-attention layer of the decoding sub-network, and obtain an embedded value corresponding to the text instance sample based on the output of the self-attention layer, where the embedded value is the instance sample feature corresponding to the text instance sample; wherein the text instance samples comprise character strings and/or phrases, and the embedded values corresponding to different text instance samples are different.
The third determining unit 603 is specifically configured to use the decoded sample sequence vector as the input of the fully-connected layer included in the output sub-network, and determine the prediction type of the sample image based on the output of the fully-connected layer.
The response unit 604 is specifically configured to multiply the decoded sample sequence vector with the vector corresponding to the sample sequence features to obtain a product result; determine the predicted position information of the text in the sample sequence features based on the product result; and determine the predicted position information of the text in the sample image based on the predicted position information of the text in the sample sequence features.
The adjusting unit 605 is specifically configured to determine not to adjust the parameter of the text detection network if the prediction type is the same as the annotation type, and a loss value between the predicted position information and the annotation position information is smaller than a preset threshold; if the prediction type is different from the annotation type, or the loss value between the prediction position information and the annotation position information is greater than or equal to the preset threshold value, adjusting parameters of the text detection network based on the difference between the prediction type of the sample image and the annotation type of the sample image, and/or the difference between the prediction position information and the annotation position information of the text in the sample image.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 7 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (25)
1. A text detection method, comprising:
determining the sequence characteristics of an image to be detected;
determining a decoded sequence vector based on the sequence feature and an instance feature corresponding to the text instance;
determining the type of the image to be detected based on the decoded sequence vector;
and in response to the type of the image to be detected being that the image to be detected comprises text, determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature.
2. The method of claim 1, the determining sequence characteristics of the image to be detected comprising:
converting an image matrix corresponding to an image to be detected into a one-dimensional vector;
and determining the characteristics corresponding to the one-dimensional vectors as the sequence characteristics of the image to be detected.
3. The method of claim 1, wherein prior to said determining the decoded sequence vector, the method further comprises:
acquiring an embedded value corresponding to the text instance, wherein the embedded value is an instance feature corresponding to the text instance;
wherein the text examples comprise character strings and/or phrases, and the embedded values corresponding to different text examples are different.
4. The method of claim 1, wherein the determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature comprises:
multiplying the decoded sequence vector by the vector corresponding to the sequence feature to obtain a product result;
determining position information of the text in the sequence feature based on the multiplication result;
and determining the position information of the text in the image to be detected based on the position information of the text in the sequence features.
5. The method of claim 1, wherein after determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature, the method comprises:
determining a connected domain in the image to be detected based on the position information of the text in the image to be detected;
determining a text bounding box based on the boundaries of the connected domain;
the text bounding box is used for identifying the text in the image to be detected.
6. A training method of a text detection network, wherein the text detection network comprises an encoding sub-network, a decoding sub-network and an output sub-network;
determining sequence sample features of sample images in a training sample set based on the coding subnetwork;
determining the output of the cross-layer attention layer as a decoded sample sequence vector by taking the sequence sample characteristics and the example sample characteristics corresponding to the text example samples as the input of the cross-layer attention layer of the decoding sub-network;
taking the decoded sample sequence vector as the input of the output sub-network, and determining the prediction type of the sample image according to the output of the output sub-network;
in response to the prediction type of the sample image being that the sample image comprises text, taking the decoded sample sequence vector and the sample sequence features as the input of the output sub-network, and determining the predicted position information of the text in the sample image according to the output of the output sub-network;
and matching the prediction type of the sample image with the annotation type of the sample image, and the prediction position information and the annotation position information of the text in the sample image, and adjusting the parameters of the text detection network based on the matching result.
7. The method of claim 6, wherein the determining a sequence sample characteristic of a sample image in a training sample set based on the coding sub-network comprises:
the coding sub-network converts an image matrix corresponding to the sample image into a one-dimensional vector representing the sample image;
and the characteristic corresponding to the one-dimensional vector of the sample image is the sequence sample characteristic of the sample image.
8. The method of claim 6, wherein prior to said determining the decoded sample sequence vector, the method further comprises:
inputting the text example sample into a self-attention layer of the decoding sub-network, and obtaining an embedded value corresponding to the text example sample based on the output of the self-attention layer, wherein the embedded value is an example sample feature corresponding to the text example sample;
wherein the text example sample comprises character strings and/or phrases, and the embedded values corresponding to different text example samples are different.
9. The method of claim 6, wherein the determining the prediction type of the sample image from the output of the output sub-network using the decoded sample sequence vector as an input of the output sub-network comprises:
and determining the prediction type of the sample image based on the output of the fully-connected layer by taking the decoded sample sequence vector as the input of the fully-connected layer included in the output sub-network.
10. The method of claim 6, wherein the determining predicted location information of the text in the sample image from the output of the output sub-network using the decoded sample sequence vector and the sample sequence feature as inputs to the output sub-network comprises:
multiplying the decoded sample sequence vector by a vector corresponding to the sample sequence characteristic to obtain a product result;
determining predicted position information of the text in the sample sequence features based on the multiplication result;
and determining the predicted position information of the text in the sample image based on the predicted position information of the text in the sample sequence feature.
11. The method of claim 6, wherein matching the prediction type of the sample image and the annotation type of the sample image, and the prediction location information and the annotation location information of the text in the sample image, adjusting parameters of the text detection network based on the matching results comprises:
if the prediction type is the same as the marking type and the loss value between the predicted position information and the marking position information is smaller than a preset threshold value, determining not to adjust the parameters of the text detection network;
if the prediction type is different from the annotation type, or the loss value between the prediction position information and the annotation position information is greater than or equal to the preset threshold value, adjusting parameters of the text detection network based on the difference between the prediction type of the sample image and the annotation type of the sample image, and/or the difference between the prediction position information and the annotation position information of the text in the sample image.
12. A text detection apparatus comprising:
the encoding unit is used for determining the sequence characteristics of the image to be detected;
a decoding unit, configured to determine a decoded sequence vector based on the sequence feature and an instance feature corresponding to a text instance;
the image type determining unit is used for determining the type of the image to be detected based on the decoded sequence vector;
and the output unit is used for responding to the type of the image to be detected, determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature, wherein the image to be detected comprises the text.
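For orientation only, a compact sketch of how the four units might fit together in a single forward pass; the layer choices (patch convolution, learned instance embeddings, multi-head cross-attention) and all dimensions are assumptions, not the claimed network.

```python
import torch
import torch.nn as nn

class TextDetector(nn.Module):
    """Illustrative composition of the claimed units: encode, decode against text-instance
    features, determine the image type, then score text positions."""
    def __init__(self, embed_dim=256, num_instances=100):
        super().__init__()
        self.encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)        # encoding unit
        self.instance_embed = nn.Embedding(num_instances, embed_dim)             # text-instance features
        self.cross_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)  # decoding unit
        self.type_head = nn.Linear(embed_dim, 2)                                 # image type determining unit

    def forward(self, image):
        seq = self.encoder(image).flatten(2).transpose(1, 2)           # (B, N, D) sequence feature
        queries = self.instance_embed.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        decoded, _ = self.cross_attn(queries, seq, seq)                # (B, Q, D) decoded sequence vectors
        type_logits = self.type_head(decoded.mean(dim=1))              # type of the image to be detected
        position_scores = torch.einsum("bqd,bnd->bqn", decoded, seq)   # output unit: per-position scores
        return type_logits, position_scores

detector = TextDetector()
type_logits, position_scores = detector(torch.randn(1, 3, 224, 224))  # (1, 2), (1, 100, 196)
```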
13. The apparatus according to claim 12, wherein the encoding unit is specifically configured to:
convert an image matrix corresponding to the image to be detected into a one-dimensional vector;
and determine the feature corresponding to the one-dimensional vector as the sequence feature of the image to be detected.
14. The apparatus of claim 12, wherein the decoding unit is further configured to:
acquire, before determining the decoded sequence vector, an embedded value corresponding to the text instance, wherein the embedded value is the instance feature corresponding to the text instance;
wherein the text instance comprises a character string and/or a phrase, and different text instances correspond to different embedded values.
15. The apparatus of claim 12, wherein the output unit is specifically configured to:
multiply the decoded sequence vector by the vector corresponding to the sequence feature to obtain a product;
determine position information of the text in the sequence feature based on the product;
and determine the position information of the text in the image to be detected based on the position information of the text in the sequence feature.
16. The apparatus of claim 12, further comprising:
a bounding box determining unit, configured to, after the position information of the text in the image to be detected is determined based on the decoded sequence vector and the vector corresponding to the sequence feature, determine a connected domain in the image to be detected based on the position information of the text in the image to be detected, and determine a text bounding box based on the boundary of the connected domain;
wherein the text bounding box is used to identify the text in the image to be detected.
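A small post-processing sketch of the connected-domain step, using OpenCV purely as one possible implementation; the map resolution and binarization are assumptions.

```python
import cv2
import numpy as np

def boxes_from_text_map(text_map: np.ndarray) -> list:
    """text_map: (H, W) array, 1 where text positions were predicted, 0 elsewhere."""
    mask = (text_map > 0).astype(np.uint8)
    num_labels, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = []
    for label in range(1, num_labels):                          # label 0 is the background
        x, y, w, h, _area = stats[label]
        boxes.append((int(x), int(y), int(x + w), int(y + h)))  # bounding box of one connected domain
    return boxes

text_map = np.zeros((14, 14), dtype=np.uint8)
text_map[3:6, 2:10] = 1
print(boxes_from_text_map(text_map))  # [(2, 3, 10, 6)]
```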
17. An apparatus for training a text detection network, comprising:
a first determining unit, configured to determine, based on the coding sub-network, a sequence sample feature of a sample image in a training sample set;
a second determining unit, configured to take the sequence sample feature and an instance sample feature corresponding to a text instance sample as inputs of a cross-layer attention layer of the coding sub-network, and determine an output of the cross-layer attention layer as a decoded sample sequence vector;
a third determining unit, configured to take the decoded sample sequence vector as an input of the output sub-network, and determine a prediction type of the sample image according to an output of the output sub-network;
a response unit, configured to, in response to the prediction type of the sample image indicating that the sample image includes text, take the decoded sample sequence vector and the sequence sample feature as inputs of the output sub-network, and determine predicted position information of the text in the sample image according to an output of the output sub-network;
and an adjusting unit, configured to match the prediction type of the sample image against the annotation type of the sample image and the predicted position information against the annotation position information of the text in the sample image, and adjust the parameters of the text detection network based on the matching result.
18. The apparatus according to claim 17, wherein the first determining unit is specifically configured to:
convert, through the coding sub-network, an image matrix corresponding to the sample image into a one-dimensional vector representing the sample image;
and take the feature corresponding to the one-dimensional vector of the sample image as the sequence sample feature of the sample image.
19. The apparatus of claim 17, wherein the second determining unit is further configured to:
input the text instance sample into a self-attention layer of the coding sub-network, and obtain an embedded value corresponding to the text instance sample based on the output of the self-attention layer, wherein the embedded value is the instance sample feature corresponding to the text instance sample;
wherein the text instance sample comprises a character string and/or a phrase, and different text instance samples correspond to different embedded values.
20. The apparatus according to claim 17, wherein the third determining unit is specifically configured to:
take the decoded sample sequence vector as the input of a fully-connected layer included in the output sub-network, and determine the prediction type of the sample image based on the output of the fully-connected layer.
21. The apparatus according to claim 17, wherein the response unit is specifically configured to:
multiply the decoded sample sequence vector by the vector corresponding to the sequence sample feature to obtain a product;
determine the predicted position information of the text in the sequence sample feature based on the product;
and determine the predicted position information of the text in the sample image based on the predicted position information of the text in the sequence sample feature.
22. The apparatus according to claim 17, wherein the adjusting unit is specifically configured to:
if the prediction type is the same as the annotation type and the loss value between the predicted position information and the annotation position information is smaller than a preset threshold, determine not to adjust the parameters of the text detection network;
and if the prediction type is different from the annotation type, or the loss value between the predicted position information and the annotation position information is greater than or equal to the preset threshold, adjust the parameters of the text detection network based on the difference between the prediction type and the annotation type of the sample image, and/or the difference between the predicted position information and the annotation position information of the text in the sample image.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5,
or to perform the method of any one of claims 6-11.
24. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5,
or for causing the computer to perform the method of any one of claims 6-11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-5,
or, when executed by a processor, implements the method of any one of claims 6-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210034256.9A CN114495101A (en) | 2022-01-12 | 2022-01-12 | Text detection method, and training method and device of text detection network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210034256.9A CN114495101A (en) | 2022-01-12 | 2022-01-12 | Text detection method, and training method and device of text detection network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114495101A true CN114495101A (en) | 2022-05-13 |
Family
ID=81512118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210034256.9A Pending CN114495101A (en) | 2022-01-12 | 2022-01-12 | Text detection method, and training method and device of text detection network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114495101A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020199730A1 (en) * | 2019-03-29 | 2020-10-08 | 北京市商汤科技开发有限公司 | Text recognition method and apparatus, electronic device and storage medium |
WO2021052358A1 (en) * | 2019-09-16 | 2021-03-25 | 腾讯科技(深圳)有限公司 | Image processing method and apparatus, and electronic device |
CN112528621A (en) * | 2021-02-10 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Text processing method, text processing model training device and storage medium |
CN112633290A (en) * | 2021-03-04 | 2021-04-09 | 北京世纪好未来教育科技有限公司 | Text recognition method, electronic device and computer readable medium |
CN113033534A (en) * | 2021-03-10 | 2021-06-25 | 北京百度网讯科技有限公司 | Method and device for establishing bill type identification model and identifying bill type |
CN113591719A (en) * | 2021-08-02 | 2021-11-02 | 南京大学 | Method and device for detecting text with any shape in natural scene and training method |
CN113657390A (en) * | 2021-08-13 | 2021-11-16 | 北京百度网讯科技有限公司 | Training method of text detection model, and text detection method, device and equipment |
Non-Patent Citations (3)
Title |
---|
X. GAO, S. et al.: "A Detection and Verification Model Based on SSD and Encoder-Decoder Network for Scene Text Detection", IEEE ACCESS, 31 December 2019 (2019-12-31) *
张运超; 陈靖; 王涌天: "A fast massive image retrieval method fusing gravity information", Acta Automatica Sinica (自动化学报), no. 10, 15 October 2016 (2016-10-15) *
王涛; 江加和: "Arbitrary-direction text recognition based on semantic segmentation", Applied Science and Technology (应用科技), no. 03, 4 July 2017 (2017-07-04) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115422389A (en) * | 2022-11-07 | 2022-12-02 | 北京百度网讯科技有限公司 | Method for processing text image, neural network and training method thereof |
CN115438214A (en) * | 2022-11-07 | 2022-12-06 | 北京百度网讯科技有限公司 | Method for processing text image, neural network and training method thereof |
Similar Documents
Publication | Title |
---|---|
CN113657390B (en) | Training method of text detection model and text detection method, device and equipment | |
CN115063875B (en) | Model training method, image processing method and device and electronic equipment | |
CN113642583B (en) | Deep learning model training method for text detection and text detection method | |
CN114495102B (en) | Text recognition method, training method and device of text recognition network | |
CN114863437B (en) | Text recognition method and device, electronic equipment and storage medium | |
CN114495101A (en) | Text detection method, and training method and device of text detection network | |
CN113657483A (en) | Model training method, target detection method, device, equipment and storage medium | |
CN114022887B (en) | Text recognition model training and text recognition method and device, and electronic equipment | |
CN112966744A (en) | Model training method, image processing method, device and electronic equipment | |
CN113947700A (en) | Model determination method and device, electronic equipment and memory | |
CN114511743B (en) | Detection model training, target detection method, device, equipment, medium and product | |
CN114612651B (en) | ROI detection model training method, detection method, device, equipment and medium | |
CN114724133A (en) | Character detection and model training method, device, equipment and storage medium | |
CN113553428B (en) | Document classification method and device and electronic equipment | |
CN114549904A (en) | Visual processing and model training method, apparatus, storage medium, and program product | |
CN114445682A (en) | Method, device, electronic equipment, storage medium and product for training model | |
CN114724144B (en) | Text recognition method, training device, training equipment and training medium for model | |
CN116363429A (en) | Training method of image recognition model, image recognition method, device and equipment | |
CN113361522B (en) | Method and device for determining character sequence and electronic equipment | |
CN114973333A (en) | Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium | |
CN113947195A (en) | Model determination method and device, electronic equipment and memory | |
CN115809687A (en) | Training method and device for image processing network | |
CN114330576A (en) | Model processing method and device, and image recognition method and device | |
CN113379592A (en) | Method and device for processing sensitive area in picture and electronic equipment | |
CN114093006A (en) | Training method, device and equipment of living human face detection model and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||