CN115578735B - Text detection method and training method and device of text detection model

Info

Publication number
CN115578735B
Authority
CN
China
Prior art keywords
position information
feature
characters
information
features
Prior art date
Legal status
Active
Application number
CN202211205551.2A
Other languages
Chinese (zh)
Other versions
CN115578735A (en)
Inventor
吕鹏原
范森
章成全
姚锟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211205551.2A
Publication of CN115578735A
Application granted
Publication of CN115578735B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image

Abstract

The disclosure provides a text detection method and a training method and device of a text detection model. It relates to the field of artificial intelligence, in particular to the technical fields of computer vision, deep learning, and image processing, and can be applied to scenes such as OCR. The text detection method comprises the following steps: extracting image features of a text image; decoding the image features with a decoder according to a predetermined query feature sequence to obtain a decoded feature sequence; predicting a plurality of prediction results from the decoded feature sequence, where each prediction result includes position information, classification information corresponding to the position information, and association information between the character at the position indicated by the position information and the characters at the positions indicated by the position information in the plurality of prediction results; and determining, according to the association information and the classification information, the position information indicating positions where characters exist, and integrating the position information of characters having an association relationship to obtain a text detection result.

Description

Text detection method and training method and device of text detection model
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the technical fields of computer vision, deep learning, and image processing, and may be applied to scenes such as OCR.
Background
With the development of computer and network technology, deep learning is widely used in many fields. For example, images may be processed using deep learning techniques to detect the text they contain. In natural scenes, the spacing between the characters of a text and the shape of the text often affect detection accuracy.
Disclosure of Invention
The present disclosure aims to provide a text detection method that improves text detection accuracy, together with a training method, apparatus, device, and medium for a text detection model.
According to one aspect of the present disclosure, there is provided a text detection method comprising: extracting image features of a text image; decoding the image features with a decoder according to a predetermined query feature sequence to obtain a decoded feature sequence, wherein the decoded features in the decoded feature sequence correspond one-to-one with the query features in the predetermined query feature sequence; predicting a plurality of prediction results from the decoded feature sequence, the plurality of prediction results corresponding one-to-one with the decoded features in the decoded feature sequence, wherein each prediction result includes position information, classification information corresponding to the position information, and association information between the character at the position indicated by the position information and the characters at the positions indicated by the position information in the plurality of prediction results, the classification information indicating whether there is a character at the position indicated by the position information; and determining, according to the association information and the classification information, the position information indicating positions where characters exist, and integrating the position information of characters having an association relationship to obtain a text detection result.
According to another aspect of the present disclosure, there is provided a training method of a text detection model, wherein the text detection model includes a feature extraction network, a decoder, and a prediction network. The method comprises: extracting image features of a text image serving as a sample with the feature extraction network, the text image having corresponding indication information that indicates a character detection result for the text image; decoding the image features with the decoder according to a predetermined query feature sequence to obtain a decoded feature sequence, wherein the decoded features in the decoded feature sequence correspond one-to-one with the query features in the predetermined query feature sequence; predicting a plurality of prediction results from the decoded feature sequence with the prediction network, the plurality of prediction results corresponding one-to-one with the decoded features in the decoded feature sequence, wherein each prediction result includes predicted position information, classification information corresponding to the predicted position information, and predicted association information between the character at the position indicated by the predicted position information and the characters at the positions indicated by the predicted position information in the plurality of prediction results, the classification information indicating whether there is a character at the position indicated by the predicted position information; and training the text detection model according to the plurality of prediction results and the character detection result.
According to another aspect of the present disclosure, there is provided a text detection apparatus comprising: a feature extraction module for extracting image features of a text image; a feature decoding module for decoding the image features with a decoder according to a predetermined query feature sequence to obtain a decoded feature sequence, wherein the decoded features in the decoded feature sequence correspond one-to-one with the query features in the predetermined query feature sequence; a prediction module for predicting a plurality of prediction results from the decoded feature sequence, the plurality of prediction results corresponding one-to-one with the decoded features in the decoded feature sequence, wherein each prediction result includes position information, classification information corresponding to the position information, and association information between the character at the position indicated by the position information and the characters at the positions indicated by the position information in the plurality of prediction results, the classification information indicating whether there is a character at the position indicated by the position information; and a detection result obtaining module for determining, according to the association information and the classification information, the position information indicating positions where characters exist, and integrating the position information of characters having an association relationship to obtain a text detection result.
According to another aspect of the present disclosure, there is provided a training apparatus for a text detection model, wherein the text detection model includes a feature extraction network, a decoder, and a prediction network. The apparatus comprises: a feature extraction module for extracting image features of a text image serving as a sample with the feature extraction network, the text image having corresponding indication information that indicates a character detection result for the text image; a feature decoding module for decoding the image features with the decoder according to a predetermined query feature sequence to obtain a decoded feature sequence, wherein the decoded features in the decoded feature sequence correspond one-to-one with the query features in the predetermined query feature sequence; a prediction module for predicting a plurality of prediction results from the decoded feature sequence with the prediction network, the plurality of prediction results corresponding one-to-one with the decoded features in the decoded feature sequence, wherein each prediction result includes predicted position information, classification information corresponding to the predicted position information, and predicted association information between the character at the position indicated by the predicted position information and the characters at the positions indicated by the predicted position information in the plurality of prediction results, the classification information indicating whether there is a character at the position indicated by the predicted position information; and a model training module for training the text detection model according to the plurality of prediction results and the character detection result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text detection method and/or training method of the text detection model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text detection method and/or training method of the text detection model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the text detection method and/or training method of the text detection model provided by the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is an application scenario schematic diagram of a text detection method and training method and apparatus of a text detection model according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a text detection method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of text detection according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of decoding image features according to an embodiment of the present disclosure;
FIG. 5 is a flow diagram of a training method for a text detection model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a training text detection model according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a text detection device according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a training device of a text detection model according to an embodiment of the present disclosure; and
fig. 9 is a block diagram of an electronic device for implementing a text detection method and/or a training method for a text detection model in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Text detection in natural scenes is widely applicable across industries, for example in education, medical care, and finance. With its development, technologies derived from it, such as card and ticket recognition, automatic document entry, and photo-based question search, have greatly improved the degree of intelligence and the production efficiency of traditional industries, and bring great convenience to people's daily study and life.
For example, text detection may be implemented with a two-stage scheme based on candidate boxes or with a scheme based on instance segmentation. The two-stage candidate-box scheme first detects text candidate boxes and then performs text segmentation within them, yielding a text envelope, i.e., the position information of the text lines. It may be implemented, for example, with the Mask Text Spotter series of models or the Look More Than Once (LOMO) model. The instance-segmentation scheme directly segments the text region and then computes connected components on the segmentation mask to obtain the text envelope. For example, the PixelLink algorithm or the Differentiable Binarization (DB) algorithm may be employed to implement the instance-segmentation scheme.
It is understood that these schemes mostly perform text detection at text-line granularity. However, in natural scenes, if the text has large character spacing or its rows and columns are difficult to distinguish, such schemes often struggle to detect text accurately from the visual information of the image alone. On this basis, to improve detection accuracy, the present disclosure provides a text detection method and a training method and apparatus for a text detection model. An application scenario of the method and apparatus provided by the present disclosure is described below with reference to fig. 1.
Fig. 1 is an application scenario schematic diagram of a text detection method and a training method and device of a text detection model according to an embodiment of the disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110, and the electronic device 110 may be various electronic devices with processing functions, including but not limited to a smart phone, a tablet computer, a laptop computer, a desktop computer, a server, and the like.
The electronic device 110 may, for example, detect the input text image 120 to obtain a text detection result 130 at text-line granularity. For example, the electronic device 110 may employ any of the schemes described above to obtain the text detection result 130.
To improve detection accuracy, the electronic device 110 may instead detect, from the text image 120, character boxes at character granularity together with association information between the characters, and then integrate the character boxes based on the association information to obtain the text detection result 130 at text-line granularity. To facilitate this detection, a text detection model 140 may, for example, be provided to the electronic device 110 in advance in the application scenario 100. The electronic device can take the text image 120 as input to the text detection model 140, which detects the character boxes at character granularity and the association information between the characters.
In an embodiment, the electronic device 110 may be installed with various client applications, such as a client application of a document understanding class, a communication class application, a browser class application, and the like. The electronic device 110 may have the ability to recognize and understand the detected text, for example, in addition to the ability to detect text in a document image, which is not limited by the present disclosure.
In an embodiment, as shown in fig. 1, the application scenario 100 may further include a server 150, where the server 150 may be, for example, a background management server supporting the operation of a client application in the electronic device 110. Electronic device 110 may be communicatively coupled to server 150 via a network, which may include wired or wireless communication links.
For example, server 150 may train a text detection model based on the text image as a sample and send trained text detection model 140 to electronic device 110 in response to a request from electronic device 110.
In an embodiment, the electronic device 110 may also send the text image 120 to the server 150, and the server 150 processes the text image 120 with the trained text detection model 140 to obtain the text detection result 130.
It should be noted that, the text detection method provided in the present disclosure may be executed by the electronic device 110 or may be executed by the server 150. Accordingly, the text detection apparatus provided by the present disclosure may be disposed in the electronic device 110 or may be disposed in the server 150. The training method of the text detection model provided by the present disclosure may be performed by the server 150. Accordingly, the training device of the text detection model provided by the present disclosure may be provided in the server 150.
It should be understood that the number and type of electronic devices 110 and servers 150 in fig. 1 are merely illustrative. There may be any number and type of electronic devices 110 and servers 150 as desired for implementation.
The text detection method provided by the present disclosure will be described in detail below with reference to fig. 2 to 4.
Fig. 2 is a flow diagram of a text detection method according to an embodiment of the present disclosure.
As shown in fig. 2, the text detection method 200 of this embodiment includes operations S210 to S240.
In operation S210, image features of a text image are extracted.
According to embodiments of the present disclosure, the text image may be obtained, for example, by scanning or photographing a common card ticket, paper document, billboard, or promotional word, or the like.
According to embodiments of the present disclosure, a feature extraction network constructed based on a convolutional neural network (CNN) may be employed to extract the image features of the text image. The feature extraction network may be, for example, the deep convolutional network VGG proposed by the Visual Geometry Group, a residual network (ResNet), a densely connected convolutional network (DenseNet), or the lightweight network MobileNet. This embodiment may also employ operators that improve network performance when extracting the image features, such as deformable convolution (DeformConv), dilated convolution, or the Inception operator.
It will be appreciated that the network or operator employed to extract image features described above is merely an example to facilitate an understanding of the present disclosure, which is not limited in this disclosure. For example, the embodiment may also employ a feature pyramid network (Feature Pyramid Networks, FPN) or the like capable of acquiring local features and more advanced semantic features simultaneously to extract image features, so as to improve the accuracy of the extracted image features.
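As an illustration of this step, the following is a minimal sketch of feature extraction with a CNN backbone, assuming a torchvision ResNet-50; the patent does not prescribe a specific backbone, input size, or feature dimension.

```python
import torch
from torchvision.models import resnet50

# A ResNet-50 truncated before its pooling/classification head serves as
# the feature extraction network; VGG, DenseNet, MobileNet, or an FPN
# would fit the description equally well.
backbone = torch.nn.Sequential(*list(resnet50(weights=None).children())[:-2])

image = torch.randn(1, 3, 640, 640)                       # text image, NCHW
feature_map = backbone(image)                             # (1, 2048, 20, 20)
# Flatten the spatial grid into a token sequence for the decoder.
image_features = feature_map.flatten(2).permute(0, 2, 1)  # (1, 400, 2048)
```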
In operation S220, a decoder is employed to decode image features according to a predetermined query feature sequence, resulting in a decoded feature sequence.
According to an embodiment of the present disclosure, the predetermined query feature sequence may include a plurality of query features, and the number of the plurality of query features may be set according to actual requirements. For example, the number of the plurality of query features may be greater than or equal to the number of characters included in the text image.
According to embodiments of the present disclosure, the predetermined query feature sequence may be derived by encoding a random data sequence. For example, if the number of query features is N, this embodiment may obtain the predetermined query feature sequence by embedding the data sequence {0, 1, 2, ..., N-1}. The network parameters used for the encoding may, for example, be learned.
According to embodiments of the present disclosure, sine and cosine functions may instead be employed to derive the predetermined query feature sequence. They can be expressed by the following formula:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $PE_{(pos, 2i)}$ denotes the even-indexed components of a feature in the predetermined query feature sequence and $PE_{(pos, 2i+1)}$ the odd-indexed components; $pos$ denotes the position of a predetermined query feature in the sequence (for example, the position of the first predetermined query feature in the sequence may be denoted as 1); and $d_{model}$ denotes the preset dimension of a predetermined query feature.
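A minimal sketch of this encoding, implementing the formula above; the query count and feature dimension are illustrative assumptions.

```python
import math
import torch

def sinusoidal_queries(n_queries: int, d_model: int) -> torch.Tensor:
    """Sine for even-indexed dimensions, cosine for odd-indexed ones."""
    pe = torch.zeros(n_queries, d_model)
    pos = torch.arange(n_queries, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)  # PE_(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)  # PE_(pos, 2i+1)
    return pe

query_sequence = sinusoidal_queries(n_queries=100, d_model=256)  # (100, 256)
```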
According to the embodiment of the disclosure, each predetermined query feature in the predetermined query feature sequence may be concatenated with the image features, the concatenated feature is used as input to the decoder, and the decoder outputs the decoded feature corresponding to that predetermined query feature. The plurality of decoded features corresponding one-to-one with the predetermined query features constitute the decoded feature sequence: the predetermined query feature at any position in the predetermined query feature sequence corresponds to the decoded feature at the corresponding position in the decoded feature sequence.
According to an embodiment of the present disclosure, the decoder may be the decoder of any sequence network, for example one constructed based on a recurrent neural network or on an attention mechanism, which is not limited in this disclosure.
In operation S230, a plurality of prediction results are predicted according to the decoded feature sequence.
According to an embodiment of the present disclosure, each prediction result may include position information and classification information corresponding to that position information, where the classification information indicates whether there is a character at the position indicated by the position information. In this embodiment, a regression branch of a general object detection model may be used to process each decoded feature in the decoded feature sequence and predict one piece of position information, and a classification branch of the general object detection model may be used to process each decoded feature and predict one piece of classification information. The position information and classification information obtained for each decoded feature together constitute the prediction result corresponding to that decoded feature.
In an embodiment, each prediction result may further include association information between the character at the position indicated by its position information and the characters at the positions indicated by the position information in the plurality of prediction results. For example, this embodiment may employ an association prediction branch, similar in structure to the classification branch described above, to process each decoded feature and obtain an association information vector. Each element of the vector represents the association information between the character at the position indicated by the position information predicted from this decoded feature and one of the characters at the positions indicated by all the position information predicted from all the decoded features. For example, if the number of prediction results is set to N, there are N characters at the positions indicated by the N pieces of position information, and the association information vector obtained by processing each decoded feature with the association prediction branch contains N pieces of association information. It will be appreciated that the association prediction branch differs from the classification branch in that it completes a multi-label classification task, while the classification branch completes a binary classification task. For each decoded feature in the decoded feature sequence, each label of the multi-label task predicts the association information between the character at the position indicated by the position information derived from that decoded feature and one of the N characters.
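The three branches can be sketched as follows. The single linear layers and sigmoid activations are assumptions for illustration; the patent fixes only the outputs: one box, one has-character score, and one N-element association vector per decoded feature.

```python
import torch
from torch import nn

class PredictionHeads(nn.Module):
    def __init__(self, d_model: int = 256, n_queries: int = 100):
        super().__init__()
        self.classify = nn.Linear(d_model, 1)           # has-character score
        self.regress = nn.Linear(d_model, 4)            # box (x1, y1, x2, y2)
        self.associate = nn.Linear(d_model, n_queries)  # multi-label association

    def forward(self, decoded):                         # decoded: (B, N, d_model)
        cls = torch.sigmoid(self.classify(decoded))     # (B, N, 1)
        box = self.regress(decoded)                     # (B, N, 4)
        assoc = torch.sigmoid(self.associate(decoded))  # (B, N, N): one label
        return box, cls, assoc                          # per (query, query) pair
```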
In operation S240, the position information indicating positions where characters exist is determined according to the association information and the classification information, and the position information of characters having an association relationship is integrated to obtain a text detection result.
According to an embodiment of the present disclosure, the classification information may be either a has-character class or a no-character class. This embodiment may take the position information whose corresponding classification information is the has-character class, among the plurality of prediction results, as the position information indicating positions where characters exist.
According to an embodiment of the present disclosure, each element of the association information vector described above may indicate either an association relationship or no association relationship. For the position information whose indicated positions all have characters, this embodiment can determine, from the association information in the plurality of prediction results, whether the characters at those positions are associated with one another. The position information of the characters having an association relationship can then be integrated to obtain the text detection result.
For example, the position information may be represented by the position, in an image coordinate system constructed on the text image, of a character box surrounding the character; it may specifically include the coordinates of the upper-left and lower-right corners of the character box, or the coordinates of the upper-left corner together with the height and width of the character box. When integrating the position information of characters having an association relationship, this embodiment may first determine the extrema of the position information to be integrated along the two coordinate axes of the image coordinate system, and then determine the integrated position information from those extrema. For example, if the position information to be integrated has minimum value x1 and maximum value x2 along the X axis of the image coordinate system, and minimum value y1 and maximum value y2 along the Y axis, the integrated position information may be determined to include (x1, y1) and (x2, y2). This embodiment may take (x1, y1) as the top-left vertex of a rectangle and (x2, y2) as the bottom-right vertex, and take the rectangular box uniquely determined by the two vertices as the text detection result. The text detection result thus represents a detection box at text-line granularity.
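A minimal sketch of this integration step, assuming boxes in (x1, y1, x2, y2) form:

```python
def merge_boxes(boxes):
    """Integrate associated character boxes by taking coordinate extrema."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

# Two character boxes merged into one text-line detection box.
print(merge_boxes([(10, 20, 30, 40), (35, 22, 60, 38)]))  # (10, 20, 60, 40)
```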
In an embodiment, the classification information comprises a has-character probability value and the association information comprises a degree of association. In this embodiment, when determining the position information indicating positions where characters exist and integrating the position information of associated characters, the target position information may first be determined from among the position information in the plurality of prediction results according to the probability values and a predetermined probability threshold: position information whose corresponding classification information has a probability value greater than or equal to the threshold is taken as target position information. The embodiment may then group the target characters at the positions indicated by the target position information into character groups according to their degrees of association, obtaining at least one character group; for example, two target characters whose degree of association is greater than or equal to a predetermined association threshold may be placed in the same character group. Finally, the embodiment can determine the position information of the text line corresponding to each character group from the position information of the characters in that group, yielding the text detection result.
It will be appreciated that when determining the position information of the text line corresponding to each character group, the position information of the characters in the group may be integrated with the method described above, yielding a rectangular box surrounding the text line. For the at least one character group, at least one rectangular box is obtained, and this embodiment may take the position information of the at least one rectangular box as the text detection result.
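A minimal sketch of this grouping step follows, forming character groups as connected components over the thresholded association degrees; treating association as transitive within a group is an assumption about how the groups are built.

```python
def group_characters(assoc, keep, threshold=0.5):
    """assoc: N x N association degrees; keep: indices of target positions."""
    groups, seen = [], set()
    for start in keep:
        if start in seen:
            continue
        group, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first search over sufficiently associated pairs
            u = stack.pop()
            group.append(u)
            for v in keep:
                if v not in seen and assoc[u][v] >= threshold:
                    seen.add(v)
                    stack.append(v)
        groups.append(group)
    return groups

# Text detection result: one merged box per character group, e.g.
# lines = [merge_boxes([boxes[i] for i in g]) for g in groups]
```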
In this embodiment, by performing fine-grained detection at character granularity, predicting the association relationships between characters, and integrating the prediction results based on the predicted associations, text detection at text-line granularity is achieved on top of the fine-grained detection. Compared with algorithms that compute connected components and the like, the detection result is unaffected by line spacing and inter-character spacing, which effectively improves detection accuracy. The method provided by embodiments of the present disclosure can therefore detect text of arbitrary shape with high accuracy and provide more reliable detection results for downstream applications such as character recognition and understanding, helping to bring those applications more traffic and better user experience.
The implementation principle of operation S230 and of the text detection method as a whole will be further explained and expanded upon below with reference to fig. 3 and fig. 4.
Fig. 3 is a schematic diagram of text detection according to an embodiment of the present disclosure.
As shown in fig. 3, this embodiment 300 may employ an end-to-end text detection model to process a text image 301 and obtain a plurality of prediction results 302.
For example, the text detection model may include a feature extraction network 310, a decoding network 320, and a prediction network 330. The prediction network 330 consists of three prediction branches, which may be the classification branch 331, regression branch 332, and association prediction branch 333 described above.
The feature extraction network 310 may be a CNN-based feature extraction network as described above, and the decoding network 320 may include, for example, a decoder with a Transformer architecture. The classification branch 331 may be constructed, for example, based on a support vector machine (SVM) or any other binary classifier. The regression branch 332 may be constructed based on a regression neural network. The association prediction branch 333 may be constructed based on a plurality of support vector machines, or from a fully connected network or the like. It will be appreciated that the architectures of the networks described above are merely examples to facilitate an understanding of the present disclosure, which is not limited thereto.
In one embodiment, the text image 301 is input to the feature extraction network 310, which outputs the image features 303 after processing. This embodiment may take the image features 303 and a predetermined query feature sequence 304 as inputs to the decoding network 320, whose decoder outputs the decoded feature sequence 305 after processing. The decoded features in the decoded feature sequence 305 may then be input in turn into the three branches of the prediction network 330. For each decoded feature, the classification branch 331 outputs classification information 306, the regression branch 332 outputs position information 307, and the association prediction branch 333 outputs association information 308 between the character at the position indicated by the position information 307 and the characters at the positions indicated by all the position information obtained for all the decoded features. The classification information 306, position information 307, and association information 308 obtained for each decoded feature constitute the prediction result 302 corresponding to that decoded feature.
According to an embodiment of the present disclosure, when the decoder adopts a Transformer architecture, the decoding network 320 may first obtain the Key feature and the Value feature from the image features. For example, the decoding network 320 may multiply the image features by a pre-trained weight matrix W_K to obtain the Key feature, and multiply the image features by a pre-trained weight matrix W_V to obtain the Value feature. Similarly, the decoding network 320 may obtain the Query feature from the predetermined query feature sequence, for example by multiplying the predetermined query feature sequence by a pre-trained weight matrix W_Q.
After obtaining the Query, Key, and Value features, the decoding network 320 may input them to the decoder, which outputs the decoded feature sequence 305. It will be appreciated that the decoder may employ a self-attention mechanism to operate on the Query, Key, and Value features to obtain the decoded feature sequence 305.
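A minimal sketch of one such decoding step follows, with W_Q, W_K, and W_V modelled as bias-free linear layers and a single attention operation standing in for the full decoder stack; all shapes are illustrative assumptions.

```python
import torch
from torch import nn

d_model = 256
w_q = nn.Linear(d_model, d_model, bias=False)  # W_Q
w_k = nn.Linear(d_model, d_model, bias=False)  # W_K
w_v = nn.Linear(d_model, d_model, bias=False)  # W_V

image_features = torch.randn(1, 400, d_model)  # from the feature extractor
query_sequence = torch.randn(1, 100, d_model)  # predetermined query features

q = w_q(query_sequence)       # Query from the predetermined query sequence
k = w_k(image_features)       # Key from the image features
v = w_v(image_features)       # Value from the image features

# Each query attends over every image position, so even widely separated
# pixels can contribute to the same decoded feature.
attn = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
decoded_sequence = attn @ v   # (1, 100, 256): one decoded feature per query
```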
By decoding the image features 303 with this attention-based principle, the embodiment can consider the image features more comprehensively during decoding and capture association relationships between widely separated pixels in the image, which helps improve the accuracy of the predicted association information and classification information.
Fig. 4 is a schematic diagram of decoding image features according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, the image features can be decoded by combining the position information of each pixel in the text image, so that more abundant context information is provided for the decoding process, the expression capability of the obtained decoding feature sequence is improved, and the precision of the associated information and the classification information obtained by prediction is improved.
As shown in fig. 4, in embodiment 400, while extracting the image features 402 of the text image 401, the text image 401 may also be position-encoded to obtain position features 403.
Illustratively, the text image 401 may be position-encoded using sine and cosine functions as described above. During position encoding, the text image 401 may be expanded row by row into a sequence of pixels, with the sine function applied to even-indexed pixels in the sequence and the cosine function to odd-indexed pixels. It will be appreciated that this position encoding method is merely an example to facilitate an understanding of the present disclosure, and any predetermined function or method may be employed to position-encode the text image 401.
After obtaining the position features 403 and the image features 402, this embodiment may fuse them to obtain a fused feature 404, for example by concatenating the position features 403 and the image features 402 directly, in the manner of a concat() function.
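A minimal sketch of the fusion, assuming the position features have already been flattened to the same token layout as the image features:

```python
import torch

image_features = torch.randn(1, 400, 256)     # flattened CNN features
position_features = torch.randn(1, 400, 256)  # sine/cosine position encoding

# Direct concatenation along the channel axis, as with a concat() function;
# element-wise addition is a common alternative that keeps the width fixed.
fused = torch.cat([image_features, position_features], dim=-1)  # (1, 400, 512)
```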
After the fused feature 404 is obtained, it may be taken as an input to the decoding network 410, which derives the key features and value features from it. The predetermined query feature sequence 405 is also taken as input to the decoding network 410, and the decoder in the decoding network processes the query features derived from the predetermined query feature sequence 405 together with the key and value features to obtain the decoded feature sequence 406.
In order to facilitate implementation of the text detection method provided in the present disclosure, the present disclosure further provides a training method of the text detection model, and the training method will be described in detail below with reference to fig. 5 to 6.
Fig. 5 is a flow diagram of a training method of a text detection model according to an embodiment of the present disclosure.
As shown in fig. 5, the training method 500 of the text detection model of this embodiment may include operations S510 to S540. The text detection model may include, among other things, a feature extraction network, a decoder, and a prediction network.
In operation S510, image features of a text image as a sample are extracted using a feature extraction network.
According to an embodiment of the present disclosure, a text image with corresponding indication information may be taken as a sample. The indication information may indicate the character detection result corresponding to the text image. The character detection result differs from the text detection result described above in granularity: the text detection result is at text-line granularity, while the character detection result is at character granularity. The character detection result may include, for example, the actual position information of the M characters contained in the text image, which may be represented by the positions, in an image coordinate system constructed on the text image, of character boxes surrounding the characters.
It will be appreciated that the decoded features in the decoded feature sequence correspond one-to-one with the query features in the predetermined query feature sequence. The feature extraction network may employ a CNN or the like as described above, and the implementation principle of operation S510 is similar to that of operation S210 described above and is not repeated here.
In operation S520, a decoder is employed to decode image features according to a predetermined query feature sequence, resulting in a decoded feature sequence. The implementation principle of this operation S520 is similar to that of the operation S220 described above, and will not be described here again.
In operation S530, a prediction network is employed to predict a plurality of prediction results from the decoded feature sequence.
According to embodiments of the present disclosure, the prediction network may include the regression branch and the classification branch described above. In an embodiment, the prediction network may further comprise the associated prediction branches described above. The implementation principle of this operation S530 is similar to that of the operation S230 described above, and will not be described here again.
In one embodiment, the plurality of prediction results correspond one-to-one with the decoded features in the decoded feature sequence. Each prediction result includes predicted position information, classification information corresponding to the predicted position information, and predicted association information between the character at the position indicated by the predicted position information and the characters at all the positions indicated by the predicted position information in the plurality of prediction results. The predicted position information is similar to the position information obtained in operation S230 described above, and the predicted association information is similar to the association information obtained in operation S230, so they are not described again. The classification information indicates whether there is a character at the position indicated by the predicted position information.
In operation S540, the text detection model is trained according to the plurality of prediction results and the character detection result.
According to an embodiment of the present disclosure, the plurality of pieces of predicted position information (N, for example) in the plurality of prediction results may be compared with the actual position information of the M characters in the character detection result, and the loss of the text detection model determined from the comparison. The embodiment can adjust the network parameters of the text detection model with the aim of minimizing this loss, thereby training the text detection model.
For example, when comparing the N pieces of predicted position information with the actual position information of the M characters, each of the M pieces of actual position information may be used as a cluster center, and the N pieces of predicted position information may be clustered to obtain M predicted-position-information groups, each centered on one of the M pieces of actual position information. This embodiment may then determine the loss of the text detection model from the differences between the predicted position information in each group and the actual position information at the center of that group. The difference between two pieces of position information can be determined, for example, from their intersection-over-union (IoU), and the magnitude of the loss is inversely related to the IoU.
In an embodiment, the N pieces of predicted position information may instead be matched against the actual position information of the M characters, with each matched piece of predicted position information and piece of actual position information forming one position information pair; for example, P position information pairs may be obtained in total. This embodiment may then determine a first loss of the text detection model from the difference between the predicted position information and the actual position information in each pair, and train the text detection model by minimizing this loss. For example, an L1 loss function may be used to determine the first loss from the difference between the predicted and actual position information.
For example, a matching algorithm may be used to match the predicted position information with the actual position information. The matching algorithm may be one capable of matching two groups with different numbers of elements, such as the Hungarian algorithm.
For example, during matching, this embodiment may first compute the intersection-over-union between the character box represented by each of the N pieces of predicted position information and the character box represented by each of the M pieces of actual position information, giving N x M IoU values in total. These may form an IoU matrix of N rows and M columns. The embodiment may then process the IoU matrix with the Hungarian algorithm to obtain the P matched position information pairs.
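A minimal sketch of this matching, using torchvision's box_iou and SciPy's linear_sum_assignment (a Hungarian-style solver); the box values are dummy data.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_iou

pred_boxes = torch.tensor([[0., 0., 10., 10.],    # N = 3 predicted boxes
                           [20., 0., 30., 10.],
                           [50., 0., 60., 10.]])
gt_boxes = torch.tensor([[1., 1., 11., 11.],      # M = 2 actual boxes
                         [48., 0., 59., 9.]])

iou = box_iou(pred_boxes, gt_boxes)               # N x M IoU matrix
# Hungarian matching maximises total IoU, so negate the matrix as the cost.
rows, cols = linear_sum_assignment(-iou.numpy())
pairs = list(zip(rows.tolist(), cols.tolist()))   # P pairs, P <= min(N, M)
```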
The principles of training the text detection model will be further explained and expanded upon below with reference to fig. 6.
Fig. 6 is a schematic diagram of a training text detection model according to an embodiment of the present disclosure.
As shown in fig. 6, in this embodiment 600, a text image 601 serving as a sample may be input to the feature extraction network 611 of the text detection model, and the feature extraction network 611 outputs the image features 602. The embodiment may then input the image features 602 and the predetermined query feature sequence 603 into the decoding network 612 of the text detection model, which outputs the decoded feature sequence 604. The decoded feature sequence 604 is input to the three branches of the prediction network 613 of the text detection model, and the information output by the three branches constitutes the N prediction results 630.
Illustratively, the decoding network 612 may include a decoder with a Transformer architecture. When obtaining the decoded feature sequence in this embodiment, key features and value features may first be obtained from the image features 602, and query features from the predetermined query feature sequence. The query, key, and value features are then input into the decoder included in the decoding network 612, which outputs the decoded feature sequence. It will be appreciated that the principle of obtaining the decoded feature sequence is similar to that shown in fig. 3 and is not repeated here.
In one embodiment, the image features may also be decoded in conjunction with the location information of each pixel in the text image. Specifically, the text image may be position encoded to obtain the position features while the image features 602 are obtained. And then, fusing the position features and the image features to obtain fused features. Finally, key features and value features are derived from the fusion features. In this way, the decoded feature sequence may be decoded using a decoder based on the key feature, the value feature, and the query feature. The principle of decoding image features in this embodiment is similar to the decoding principle shown in fig. 4 described above, and will not be described again here.
According to an embodiment of the present disclosure, after obtaining the N prediction results 630, this embodiment may match the N pieces of predicted position information 631 they include against the actual position information 641 of the M characters corresponding to the text image 601, obtaining the P position information pairs 650 described above. The model is then trained based on the matching relationships of the position information in the P position information pairs 650.
For example, it may be stipulated that there should be a character at a position indicated by predicted position information that matches some actual position information, i.e., the true value of the classification information corresponding to that predicted position information should be the has-character class; and that there should be no character at a position indicated by predicted position information that matches no actual position information, i.e., the true value of the corresponding classification information should be the no-character class. On this basis, the embodiment can determine a second loss of the text detection model from the difference between the classification information corresponding to the predicted position information and its true value, and train the text detection model by minimizing the second loss.
For example, the classification information may include a predicted has-character probability value. The embodiment may set the probability value corresponding to a true value of the has-character class to a first value, and that corresponding to a true value of the no-character class to a second value. The second loss of the text detection model may then be determined from the difference between the predicted probability value and the first value for the first position information, i.e., the part of the N pieces of predicted position information that matches actual position information, and the difference between the predicted probability value and the second value for the remaining position information. The first value is greater than the second value; for example, the first value may be 1, or any value near and less than 1, and the second value may be 0, or any value near and greater than 0, which the present disclosure does not limit. For example, a cross-entropy loss function may be employed to determine the loss of the text detection model from the predicted probability values.
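A minimal sketch of the second loss with binary cross-entropy, assuming the first value is 1 and the second value is 0; the probabilities and matched indices are dummy data.

```python
import torch
import torch.nn.functional as F

pred_probs = torch.rand(8)       # has-character probability for N = 8 queries
matched = torch.tensor([0, 2])   # queries matched to actual position information

targets = torch.zeros(8)         # second value (no character) everywhere ...
targets[matched] = 1.0           # ... first value (has character) where matched
second_loss = F.binary_cross_entropy(pred_probs, targets)
```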
For example, to realize text detection at text-line granularity, the character detection result corresponding to the sample text image may further include actual association information indicating the association relationships among the M characters. The actual association information may indicate the line membership of the M characters, i.e., which of the M characters belong to the same line. This embodiment may, for example, represent the actual association information between characters belonging to the same line as 1, and that between characters not belonging to the same line as 0. After determining the P position information pairs 650, the embodiment may determine a third loss of the text detection model from the difference between the predicted association information among the P characters at the positions indicated by the P pieces of predicted position information and the actual association information among the P characters at the positions indicated by the P pieces of actual position information, and train the text detection model by minimizing the third loss.
Illustratively, suppose the text image $I$ serving as a sample contains a text line sequence $S = \{L_1, L_2, \ldots, L_n\}$, where $L_i$ denotes the $i$-th text line in the text image $I$ and $i$ ranges over $[1, n]$. $L_i$ may, for example, consist of a plurality of characters whose positions form the actual position information sequence $\{L_{i1}, L_{i2}, \ldots, L_{im}\}$, where $n$ and $m$ are natural numbers greater than 1 and $L_{ij}$ denotes the actual position information of the $j$-th character in the $i$-th text line. For actual position information matched to predicted position information, the true value of the corresponding classification information is 1, indicating that there is a character at the indicated position; and the actual association information between the two characters indicated by any two entries of $\{L_{i1}, L_{i2}, \ldots, L_{im}\}$ indicates an association relationship.
In an embodiment, the text detection model may be trained by taking at least two of the first loss, the second loss, and the third loss described above into account. For example, a weighted sum of the three losses may be used as the total loss of the text detection model, and the model may then be trained with the goal of minimizing the total loss. The weights used in calculating the weighted sum may be set according to actual needs, which is not limited by the present disclosure.
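As an illustration of the weighted combination, with the weights treated as freely chosen hyper-parameters:

```python
def total_loss(first_loss, second_loss, third_loss, w1=1.0, w2=1.0, w3=1.0):
    # The weights are set according to actual needs; the values here are illustrative.
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```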
In an embodiment, the text detection model may further include, for example, an embedded network for encoding a random data sequence to obtain the predetermined query feature sequence based on which the image features are decoded. The random data sequence may be {0, 1, 2, ..., N-1} as described above, and the embedded network may be an embedding layer for mapping the sparse random data sequence into a dense space. It will be appreciated that the network parameters of the embedded network may participate in training together with the network parameters of the feature extraction network, the decoder, and the prediction network.
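For example, such an embedded network may be sketched as follows in PyTorch; the number of queries N and the embedding dimension are illustrative values, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

N, dim = 100, 256                            # illustrative number of queries and feature size
embedding = nn.Embedding(N, dim)             # the embedded network: an embedding layer
random_sequence = torch.arange(N)            # the random data sequence {0, 1, ..., N-1}
query_features = embedding(random_sequence)  # (N, dim) predetermined query feature sequence
# embedding.weight is a network parameter trained jointly with the feature
# extraction network, the decoder, and the prediction network.
```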
Based on the text detection method provided by the present disclosure, the present disclosure further provides a text detection device, which will be described in detail below with reference to fig. 7.
Fig. 7 is a block diagram of a text detection device according to an embodiment of the present disclosure.
As shown in fig. 7, the text detection device 700 of this embodiment may include a feature extraction module 710, a feature decoding module 720, a prediction module 730, and a detection result obtaining module 740.
The feature extraction module 710 is used to extract image features of the text image. In an embodiment, the feature extraction module 710 may be configured to perform the operation S210 described above, which is not described herein.
The feature decoding module 720 is configured to decode the image feature according to the predetermined query feature sequence by using a decoder, to obtain a decoded feature sequence. Wherein the decoding features in the decoding feature sequence are in one-to-one correspondence with the query features in the predetermined query feature sequence. In an embodiment, the feature decoding module 720 may be configured to perform the operation S220 described above, which is not described herein.
The prediction module 730 is configured to predict a plurality of prediction results according to the decoded feature sequence; the plurality of prediction results are in one-to-one correspondence with decoding features in the decoding feature sequence. Each of the prediction results includes position information, classification information corresponding to the position information, and association information between the character at the position indicated by the position information and the plurality of characters at the positions indicated by the position information in the plurality of prediction results. The classification information is used to indicate whether the position information indicates that a character exists at the position. In an embodiment, the prediction module 730 may be configured to perform the operation S230 described above, which is not described herein.
The detection result obtaining module 740 is configured to determine, according to the association information and the classification information, position information indicating that a character exists at a position, and integrate position information of characters having an association relationship among the plurality of characters, to obtain a text detection result. In an embodiment, the detection result obtaining module 740 may be configured to perform the operation S240 described above, which is not described herein.
According to an embodiment of the present disclosure, the feature decoding module 720 may include a first feature obtaining sub-module, a second feature obtaining sub-module, and a decoding sub-module. The first feature obtaining submodule is used for obtaining key features and value features according to image features. The second feature obtaining submodule is used for obtaining query features according to a preset query feature sequence. The decoding submodule is used for inputting the query feature, the key feature and the value feature into the decoder to obtain a decoding feature sequence output by the decoder.
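For illustration, the decoding performed by this module may be sketched with a single cross-attention layer; a practical decoder would typically stack several such layers, and all names and shapes here are assumptions of the sketch.

```python
import torch.nn as nn

dim = 256
cross_attention = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

def decode(query_features, key_features, value_features):
    # query_features: (B, N, dim); key/value features: (B, H*W, dim) from the image.
    decoded, _ = cross_attention(query_features, key_features, value_features)
    return decoded  # decoded feature sequence: one decoded feature per query feature
```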
According to an embodiment of the present disclosure, the first feature obtaining sub-module may include a position encoding unit, a feature fusion unit, and a feature obtaining unit. The position coding unit is used for performing position coding on the text image to obtain position characteristics. The feature fusion unit is used for fusing the position features and the image features to obtain fusion features. The feature obtaining unit is used for obtaining key features and value features according to the fusion features.
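A minimal sketch of this sub-module, under the assumptions that the position features have the same shape as the flattened image features, that fusion is element-wise addition, and that the key and value features are linear projections of the fused features:

```python
import torch.nn as nn

class KeyValueFromImage(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)

    def forward(self, image_features, position_features):
        # image_features, position_features: (B, H*W, dim)
        fused = image_features + position_features       # feature fusion
        return self.to_key(fused), self.to_value(fused)  # key features, value features
```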
According to an embodiment of the present disclosure, the classification information includes a probability value of a character, and the association information includes a degree of association. The detection result obtaining module 740 may include a target information determining sub-module, a character set determining sub-module, and a detection result determining sub-module. The target information determining sub-module is used for determining, according to the probability value and a predetermined probability threshold, position information whose indicated position has a character, among the plurality of position information included in the plurality of prediction results, as target position information. The character set determining sub-module is used for grouping, according to the degree of association between the target characters at the positions indicated by the target position information, the target characters having an association relationship into character sets, to obtain at least one character set. The detection result determining sub-module is used for determining the position information of the text line corresponding to each character set according to the position information of the characters in that character set, to obtain a text detection result.
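For illustration, this post-processing may be sketched as follows; the union-find grouping, the (x1, y1, x2, y2) box format, and both thresholds are assumptions of the sketch rather than requirements of the disclosure.

```python
import numpy as np

def group_text_lines(boxes, probs, assoc, prob_thr=0.5, assoc_thr=0.5):
    # boxes: (N, 4) numpy array of (x1, y1, x2, y2); probs: (N,); assoc: (N, N).
    keep = np.where(probs > prob_thr)[0]        # target position information
    parent = {i: i for i in keep}

    def find(i):                                # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in keep:                              # union characters whose association
        for b in keep:                          # degree exceeds the threshold
            if a < b and assoc[a, b] > assoc_thr:
                parent[find(b)] = find(a)

    groups = {}
    for i in keep:
        groups.setdefault(find(i), []).append(i)

    lines = []
    for members in groups.values():             # one text-line box per character group
        m = boxes[np.array(members)]
        lines.append([m[:, 0].min(), m[:, 1].min(), m[:, 2].max(), m[:, 3].max()])
    return lines
```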
Based on the training method of the text detection model provided by the present disclosure, the present disclosure further provides a training device of the text detection model, and the device will be described in detail below with reference to fig. 8.
Fig. 8 is a block diagram of a training device of a text detection model according to an embodiment of the present disclosure.
As shown in fig. 8, the training apparatus 800 for a text detection model of this embodiment may include a feature extraction module 810, a feature decoding module 820, a prediction module 830, and a model training module 840. The text detection model comprises a feature extraction network, a decoder and a prediction network.
The feature extraction module 810 is configured to extract image features of a text image as a sample using a feature extraction network. The text image has corresponding indication information, and the indication information indicates a character detection result corresponding to the text image. In an embodiment, the feature extraction module 810 may be configured to perform the operation S510 described above, which is not described herein.
The feature decoding module 820 is configured to decode the image feature according to the predetermined query feature sequence using a decoder to obtain a decoded feature sequence. Wherein the decoding features in the decoding feature sequence are in one-to-one correspondence with the query features in the predetermined query feature sequence. In an embodiment, the feature decoding module 820 may be used to perform the operation S520 described above, which is not described herein.
The prediction module 830 is configured to predict a plurality of prediction results according to the decoding feature sequence by using the prediction network, where the plurality of prediction results are in one-to-one correspondence with the decoding features in the decoding feature sequence. Each of the prediction results includes predicted position information, classification information corresponding to the predicted position information, and predicted association information between the character at the position indicated by the predicted position information and the plurality of characters at the positions indicated by the predicted position information in the plurality of prediction results. The classification information is used to indicate whether the predicted position information indicates that a character exists at the position. In an embodiment, the prediction module 830 may be configured to perform the operation S530 described above, which is not described herein.
The model training module 840 is configured to train a text detection model according to the plurality of prediction results and the character detection result. In an embodiment, the model training module 840 may be configured to perform the operation S540 described above, which is not described herein.
According to an embodiment of the present disclosure, the character detection result may include actual position information of M characters included in the text image, and the classification information may include a predicted probability value of a character. The model training module 840 may include a position information matching sub-module and a first training sub-module. The position information matching sub-module is used for matching N pieces of predicted position information included in the plurality of prediction results with the actual position information of the M characters to obtain P position information pairs, where each position information pair includes one piece of first position information among the N pieces of predicted position information and a piece of second position information, among the actual position information of the M characters, matched with the first position information. The first training sub-module is used for training the text detection model according to the difference between the first value and the predicted probability value included in the classification information corresponding to the first position information, and the difference between the second value and the predicted probability value included in the classification information corresponding to the position information other than the first position information among the N pieces of predicted position information. The second value is less than the first value.
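As an illustration of the position information matching sub-module, a Hungarian-style assignment may be used; the L1 position cost and the use of SciPy's linear_sum_assignment are assumptions of this sketch, not techniques fixed by the disclosure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_positions(pred_boxes, gt_boxes):
    # pred_boxes: (N, 4) predicted position information; gt_boxes: (M, 4) actual positions.
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M) L1 cost
    pred_idx, gt_idx = linear_sum_assignment(cost)
    # P = min(N, M) pairs of (first position information, second position information).
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))
```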
According to an embodiment of the present disclosure, the character detection result includes actual position information of M characters included in the text image. The model training module 840 may include a location information matching sub-module and a second training sub-module. The position information matching submodule is used for matching N predicted position information included in the plurality of predicted results with actual position information of M characters to obtain P position information pairs, wherein each position information pair comprises one first position information in the N predicted position information and second position information matched with the first position information in the actual position information of the M characters. The second training submodule is used for training the text detection model according to the difference between the first position information and the second position information in the position information pair.
According to an embodiment of the present disclosure, the character detection result includes actual position information of M characters included in the text image and actual association information indicating the association relationship of the M characters with each other. The model training module 840 may include a position information matching sub-module and a third training sub-module. The position information matching sub-module is used for matching N pieces of predicted position information included in the plurality of prediction results with the actual position information of the M characters to obtain P position information pairs, where each position information pair includes one piece of first position information among the N pieces of predicted position information and a piece of second position information, among the actual position information of the M characters, matched with the first position information. The third training sub-module is used for training the text detection model according to the difference between the predicted association information among the P first characters at the positions indicated by the P pieces of first position information and the actual association information among the P second characters corresponding to the P pieces of second position information.
According to an embodiment of the present disclosure, the text detection model further includes an embedded network, and the apparatus 800 may further include a sequence encoding module configured to encode the random data sequence using the embedded network to obtain a predetermined query feature sequence.
According to an embodiment of the present disclosure, the above-described feature decoding module 820 may include a first feature obtaining sub-module, a second feature obtaining sub-module, and a decoding sub-module. The first feature obtaining submodule is used for obtaining key features and value features according to image features. The second feature obtaining submodule is used for obtaining query features according to a preset query feature sequence. The decoding submodule is used for inputting the query feature, the key feature and the value feature into the decoder to obtain a decoding feature sequence output by the decoder.
According to an embodiment of the present disclosure, the first feature obtaining sub-module may include a position encoding unit, a feature fusion unit, and a feature obtaining unit. The position coding unit is used for performing position coding on the text image to obtain position characteristics. The feature fusion unit is used for fusing the position features and the image features to obtain fusion features. The feature obtaining unit is used for obtaining key features and value features according to the fusion features.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of users' personal information all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated. In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the text detection method and/or the training method of the text detection model of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as a text detection method and/or a training method of a text detection model. For example, in some embodiments, the text detection method and/or the training method of the text detection model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text detection method and/or the training method of the text detection model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text detection method and/or the training method of the text detection model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved, which is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (24)

1. A text detection method, comprising:
extracting image features of the text image;
adopting a decoder to decode the image features according to a predetermined query feature sequence to obtain a decoded feature sequence; wherein the decoding features in the decoding feature sequence are in one-to-one correspondence with the query features in the predetermined query feature sequence;
predicting to obtain a plurality of prediction results according to the decoding feature sequence; the prediction results are in one-to-one correspondence with decoding features in the decoding feature sequence; each of the prediction results includes position information, classification information corresponding to the position information, and association information between a character at the position indicated by the position information and a plurality of characters at the positions indicated by the position information in the plurality of prediction results; and
Determining position information indicating that characters exist at positions according to the association information and the classification information, integrating the position information of the characters with association relation among the plurality of characters to obtain a text detection result,
wherein the classification information is used for indicating whether the position information indicates that the position has characters.
2. The method of claim 1, wherein said employing a decoder to decode the image features from a predetermined query feature sequence, resulting in a decoded feature sequence comprises:
obtaining key features and value features according to the image features;
obtaining query features according to the predetermined query feature sequence; and
and inputting the query feature, the key feature and the value feature into the decoder to obtain a decoding feature sequence output by the decoder.
3. The method of claim 2, wherein the deriving key features and value features from the image features comprises:
performing position coding on the text image to obtain position characteristics;
fusing the position features and the image features to obtain fused features; and
and obtaining the key feature and the value feature according to the fusion feature.
4. The method of claim 1, wherein the classification information includes a probability value of a character; the association information comprises association degree; determining the position information indicating that the character exists at the position according to the association information and the classification information, integrating the position information of the character with the association relation in the plurality of characters, and obtaining a text detection result comprises the following steps:
determining position information with characters at indicated positions in a plurality of position information included in a plurality of prediction results according to the probability value and a preset probability threshold value as target position information;
forming characters with association relations in target characters into character groups according to the association degree of the target characters at the target position information indication position, and obtaining at least one character group; and
and determining the position information of the text line corresponding to each character group according to the position information of the characters in each character group, and obtaining the text detection result.
5. A training method of a text detection model, wherein the text detection model comprises a feature extraction network, a decoder and a prediction network; the method comprises the following steps:
extracting image features of the text image serving as a sample by adopting the feature extraction network; the text image is provided with corresponding indication information, and the indication information indicates a character detection result corresponding to the text image;
adopting the decoder to decode the image features according to a predetermined query feature sequence to obtain a decoded feature sequence; wherein the decoding features in the decoding feature sequence are in one-to-one correspondence with the query features in the predetermined query feature sequence;
the prediction network is adopted to predict and obtain a plurality of prediction results according to the decoding feature sequence, and the plurality of prediction results are in one-to-one correspondence with the decoding features in the decoding feature sequence; each of the prediction results includes predicted position information, classification information corresponding to the predicted position information, and predicted association information between a character at the position indicated by the predicted position information and a plurality of characters at the positions indicated by the predicted position information in the plurality of prediction results; and
training the text detection model according to a plurality of the prediction results and the character detection results,
wherein the classification information is used for indicating whether the predicted position information indicates that the position has characters.
6. The method of claim 5, wherein the character detection result includes actual position information of M characters included in the text image; the classification information comprises a predicted probability value of a character; the training the text detection model according to the plurality of prediction results and the character detection result comprises the following steps:
Matching N predicted position information included in the plurality of predicted results with actual position information of the M characters to obtain P position information pairs, wherein each position information pair comprises one first position information in the N predicted position information and second position information matched with the first position information in the actual position information of the M characters; and
training the text detection model according to a difference between a predicted probability value included in the classification information corresponding to the first position information and a first value and a difference between a predicted probability value included in the classification information corresponding to the other position information except the first position information among the N pieces of predicted position information and a second value,
wherein the second value is less than the first value.
7. The method of claim 5, wherein the character detection result includes actual position information of M characters included in the text image; the training the text detection model according to the plurality of prediction results and the character detection result comprises the following steps:
matching N predicted position information included in the plurality of predicted results with actual position information of the M characters to obtain P position information pairs, wherein each position information pair comprises one first position information in the N predicted position information and second position information matched with the first position information in the actual position information of the M characters; and
Training the text detection model according to the difference between the first position information and the second position information in the position information pair.
8. The method of claim 5, wherein the character detection result includes actual position information of M characters included in a text image and actual association information indicating an association relationship between the M characters; the training the text detection model according to the plurality of prediction results and the character detection result comprises the following steps:
matching N predicted position information included in the plurality of predicted results with actual position information of the M characters to obtain P position information pairs, wherein each position information pair comprises one first position information in the N predicted position information and second position information matched with the first position information in the actual position information of the M characters; and
and training the text detection model according to the difference between the predicted association information among the P first characters at the positions indicated by the P pieces of first position information and the actual association information among the P second characters corresponding to the P pieces of second position information.
9. The method of claim 5, wherein the text detection model further comprises an embedded network; the method further comprises the steps of:
and adopting the embedded network to encode the random data sequence to obtain the predetermined query feature sequence.
10. The method of claim 5, wherein said employing the decoder to decode the image features according to a predetermined query feature sequence, resulting in a decoded feature sequence comprises:
obtaining key features and value features according to the image features;
obtaining query features according to the predetermined query feature sequence; and
and inputting the query feature, the key feature and the value feature into the decoder to obtain a decoding feature sequence output by the decoder.
11. The method of claim 10, wherein the deriving key features and value features from the image features comprises:
performing position coding on the text image to obtain position characteristics;
fusing the position features and the image features to obtain fused features; and
and obtaining the key feature and the value feature according to the fusion feature.
12. A text detection device, comprising:
the feature extraction module is used for extracting image features of the text image;
The feature decoding module is used for decoding the image features according to a preset query feature sequence by adopting a decoder to obtain a decoded feature sequence; wherein the decoding features in the decoding feature sequence are in one-to-one correspondence with the query features in the predetermined query feature sequence;
the prediction module is used for predicting and obtaining a plurality of prediction results according to the decoding feature sequence; the prediction results are in one-to-one correspondence with decoding features in the decoding feature sequence; each of the prediction results includes position information, classification information corresponding to the position information, and association information between a character at the position indicated by the position information and a plurality of characters at the positions indicated by the position information in the plurality of prediction results; and
a detection result obtaining module for determining the position information indicating that the character exists at the position according to the association information and the classification information and integrating the position information of the character with association relation among the plurality of characters to obtain a text detection result,
wherein the classification information is used for indicating whether the position information indicates that the position has characters.
13. The apparatus of claim 12, wherein the feature decoding module comprises:
The first feature obtaining submodule is used for obtaining key features and value features according to the image features;
the second feature obtaining submodule is used for obtaining query features according to the predetermined query feature sequence; and
and the decoding submodule is used for inputting the query feature, the key feature and the value feature into the decoder to obtain a decoding feature sequence output by the decoder.
14. The apparatus of claim 13, wherein the first feature derivation submodule comprises:
the position coding unit is used for carrying out position coding on the text image to obtain a position characteristic;
the feature fusion unit is used for fusing the position features and the image features to obtain fusion features; and
and a feature obtaining unit configured to obtain the key feature and the value feature from the fusion feature.
15. The apparatus of claim 12, wherein the classification information comprises a probability value of a character; the association information comprises association degree; the detection result obtaining module comprises:
a target information determining sub-module, configured to determine, as target position information, position information having characters at indicated positions among a plurality of position information included in a plurality of prediction results according to the probability value and a predetermined probability threshold;
The character set determining submodule is used for forming character sets of characters with association relations in the target characters according to the association degree of the target characters at the target position information indication position to obtain at least one character set; and
and the detection result determining submodule is used for determining the position information of the text row corresponding to each character group according to the position information of the characters in each character group to obtain the text detection result.
16. A training device of a text detection model, wherein the text detection model comprises a feature extraction network, a decoder and a prediction network; the device comprises:
a feature extraction module for extracting image features of the text image as a sample using the feature extraction network; the text image is provided with corresponding indication information, and the indication information indicates a character detection result corresponding to the text image;
the feature decoding module is used for decoding the image features according to a preset query feature sequence by adopting the decoder to obtain a decoded feature sequence; wherein the decoding features in the decoding feature sequence are in one-to-one correspondence with the query features in the predetermined query feature sequence;
The prediction module is used for predicting a plurality of prediction results according to the decoding feature sequence by adopting the prediction network, and the plurality of prediction results are in one-to-one correspondence with the decoding features in the decoding feature sequence; each of the prediction results includes predicted position information, classification information corresponding to the predicted position information, and predicted association information between a character at the position indicated by the predicted position information and a plurality of characters at the positions indicated by the predicted position information in the plurality of prediction results; and
a model training module for training the text detection model according to a plurality of the prediction results and the character detection results,
wherein the classification information is used for indicating whether the predicted position information indicates that the position has characters.
17. The apparatus of claim 16, wherein the character detection result includes actual position information of M characters included in the text image; the classification information comprises a predicted probability value of a character; the model training module comprises:
a position information matching sub-module, configured to match N predicted position information included in the plurality of prediction results with actual position information of the M characters, to obtain P position information pairs, where each position information pair includes a first position information in the N predicted position information and a second position information matched with the first position information in the actual position information of the M characters; and
A first training sub-module for training the text detection model according to a difference between a predicted probability value and a first value included in the classification information corresponding to the first position information and a difference between a predicted probability value and a second value included in the classification information corresponding to other position information than the first position information among the N pieces of predicted position information,
wherein the second value is less than the first value.
18. The apparatus of claim 16, wherein the character detection result includes actual position information of M characters included in the text image; the model training module comprises:
a position information matching sub-module, configured to match N predicted position information included in the plurality of predicted results with actual position information of the M characters, to obtain P position information pairs, where each position information pair includes a first position information in the N predicted position information and a second position information matched with the first position information in the actual position information of the M characters; and
and the second training sub-module is used for training the text detection model according to the difference between the first position information and the second position information in the position information pair.
19. The apparatus of claim 16, wherein the character detection result includes actual position information of M characters included in a text image and actual association information indicating an association relationship of the M characters with each other; the model training module comprises:
a position information matching sub-module, configured to match N predicted position information included in the plurality of prediction results with actual position information of the M characters, to obtain P position information pairs, where each position information pair includes a first position information in the N predicted position information and a second position information matched with the first position information in the actual position information of the M characters; and
and the third training sub-module is used for training the text detection model according to the difference between the predicted association information among the P first characters at the positions indicated by the P pieces of first position information and the actual association information among the P second characters corresponding to the P pieces of second position information.
20. The apparatus of claim 16, wherein the text detection model further comprises an embedded network; the apparatus further comprises:
and the sequence coding module is used for coding the random data sequence by adopting the embedded network to obtain the predetermined query feature sequence.
21. The apparatus of claim 16, wherein the feature decoding module comprises:
the first feature obtaining submodule is used for obtaining key features and value features according to the image features;
the second feature obtaining submodule is used for obtaining query features according to the predetermined query feature sequence; and
and the decoding submodule is used for inputting the query feature, the key feature and the value feature into the decoder to obtain a decoding feature sequence output by the decoder.
22. The apparatus of claim 21, wherein the first feature derivation submodule comprises:
the position coding unit is used for carrying out position coding on the text image to obtain a position characteristic;
the feature fusion unit is used for fusing the position features and the image features to obtain fusion features; and
and a feature obtaining unit configured to obtain the key feature and the value feature from the fusion feature.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
24. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-11.
CN202211205551.2A 2022-09-29 2022-09-29 Text detection method and training method and device of text detection model Active CN115578735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211205551.2A CN115578735B (en) 2022-09-29 2022-09-29 Text detection method and training method and device of text detection model

Publications (2)

Publication Number Publication Date
CN115578735A 2023-01-06
CN115578735B 2023-09-15

Family ID: 84582937









Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant