WO2023202197A1 - Text recognition method and related device - Google Patents

Text recognition method and related device

Info

Publication number
WO2023202197A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
model
image
recognition
target
Prior art date
Application number
PCT/CN2023/076411
Other languages
English (en)
French (fr)
Inventor
姜媚
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023202197A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/263 Language identification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence technology, especially text recognition technology.
  • Image-based text recognition technology is used to identify text information contained in images.
  • In the related art, the target language to which the text information in the image to be recognized belongs is usually determined first, and then the text information contained in the image to be recognized is determined through the text recognition model corresponding to that target language.
  • Embodiments of the present application provide a text recognition method and related devices, which can perform text recognition on images containing text in multiple languages, and can improve the accuracy of text recognition.
  • embodiments of the present application provide a text recognition method, which is executed by an electronic device, including:
  • inputting the image to be recognized containing text into the target classification model to obtain the language distribution information and the original text presentation direction of the image to be recognized, wherein the language distribution information includes multiple languages corresponding to the text and text position information corresponding to each of the multiple languages; correcting the image to be recognized based on the original text presentation direction and a preset target text presentation direction to obtain a target recognition image; determining, based on the text position information corresponding to each of the multiple languages, text area image sets corresponding to the multiple languages in the target recognition image; and processing, based on the text area image sets corresponding to the multiple languages respectively, with the target text recognition model associated with the corresponding language, to obtain a text recognition result corresponding to the image to be recognized.
  • a text recognition device including:
  • the image classification unit is used to input the image to be recognized containing text into the target classification model, and obtain the language distribution information and the original text presentation direction of the image to be recognized, wherein the language distribution information includes the multiple languages corresponding to the text and the text position information corresponding to each of the multiple languages;
  • an image correction unit configured to correct the image to be recognized based on the original text presentation direction and the preset target text presentation direction, to obtain a target recognition image;
  • a text positioning unit configured to determine a text area image set corresponding to each of the multiple languages in the target recognition image based on the text position information corresponding to each of the multiple languages;
  • the image recognition unit is configured to perform processing based on the text area image sets corresponding to the plurality of languages, respectively, using the target text recognition model associated with the corresponding language, to obtain a text recognition result corresponding to the image to be recognized.
  • embodiments of the present application provide an electronic device, including a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of the above text recognition method.
  • embodiments of the present application provide a computer-readable storage medium, which includes a computer program.
  • when the computer program is run on an electronic device, it causes the electronic device to execute the steps of the above text recognition method.
  • embodiments of the present application provide a computer program product.
  • the program product includes a computer program.
  • the computer program is stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer program from the computer-readable storage medium and executes it, causing the electronic device to perform the steps of the above text recognition method.
  • In the embodiments of the present application, the image to be recognized containing text is input into the target classification model to obtain the corresponding language distribution information and the original text presentation direction. Then, based on the original text presentation direction and the preset target text presentation direction, the image to be recognized is corrected to obtain a target recognition image. Next, based on the text position information corresponding to the multiple languages in the language distribution information, the text area image sets corresponding to the multiple languages are determined in the target recognition image. Finally, based on each text area image set, the target text recognition model associated with the corresponding language is used to obtain the text recognition result corresponding to the image to be recognized.
  • Figure 1 is a schematic diagram of an application scenario provided in the embodiment of this application.
  • Figure 2 is a schematic flow chart of the text recognition method provided in the embodiment of the present application.
  • Figure 3A is a schematic diagram of each language provided in the embodiment of the present application.
  • Figure 3B is a schematic diagram of each text presentation direction provided in the embodiment of the present application.
  • Figure 4 is a schematic diagram of obtaining language distribution information and original text presentation direction provided in the embodiment of the present application.
  • Figure 5 is a schematic diagram of the image correction process provided in the embodiment of the present application.
  • Figure 6 is a schematic diagram of a target recognition image provided in an embodiment of the present application.
  • Figure 7 is a logical schematic diagram of the text recognition method provided in the embodiment of the present application.
  • Figure 8A is a schematic structural diagram of the target text line detection model provided in the embodiment of the present application.
  • Figure 8B is a schematic diagram of the shape correction process provided in the embodiment of the present application.
  • Figure 9 is a schematic diagram of the target classification model provided in the embodiment of the present application.
  • Figure 10A is a schematic flowchart of the language recognition sub-model training method provided in the embodiment of the present application.
  • Figure 10B is a logical schematic diagram for determining the first model loss provided in the embodiment of the present application.
  • Figure 11A is a schematic flow chart of the language recognition sub-model training method provided in the embodiment of the present application.
  • Figure 11B is a logical schematic diagram for determining the loss of the second model provided in the embodiment of the present application.
  • Figure 11C is a schematic diagram of contrast loss and cross-entropy loss provided in the embodiment of the present application.
  • Figure 12 is a schematic diagram of two text recognition results provided in the embodiment of the present application.
  • Figure 13 is a schematic structural diagram of the pre-trained language recognition model provided in the embodiment of the present application.
  • Figure 14 is a schematic structural diagram of the image-text distance recognition model provided in the embodiment of the present application.
  • Figure 15 is a schematic flowchart of the text recognition model training method provided in the embodiment of the present application.
  • Figure 16 is a schematic diagram of the spatial distribution characteristics of Thai provided in the embodiment of the present application.
  • Figure 17A is a schematic diagram of the text recognition model based on SAR and CTC provided in the embodiment of the present application;
  • Figure 17B is a schematic diagram of several text recognition results provided in the embodiment of the present application.
  • Figure 18A is a logical schematic diagram of the data synthesis method provided in the embodiment of the present application.
  • Figure 18B is a schematic diagram of the synthesized annotated sample provided in the embodiment of the present application.
  • Figure 19 is a logical schematic diagram of data text style migration provided in the embodiment of this application.
  • Figure 20 is a schematic diagram of a sample obtained by text style migration provided in the embodiment of the present application.
  • Figure 21 is a schematic structural diagram of a text recognition device provided in an embodiment of the present application.
  • Figure 22 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • Text detection: locating the position of text in an image.
  • Text recognition: based on the text position obtained by text detection, converting the image content to obtain the text recognition result, that is, the text information.
  • the solutions provided by the embodiments of this application involve artificial intelligence machine learning technology.
  • it mainly involves the training process of the text detection model, classification model, and text recognition model, as well as the corresponding model application process.
  • the model training process can be either offline training or online training, and there is no restriction on this.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • This application scenario includes at least the terminal device 110 and the server 120 .
  • the number of terminal devices 110 may be one or more, and the number of servers 120 may also be one or more.
  • This application does not specifically limit the number of terminal devices 110 and servers 120 .
  • the terminal device 110 may be installed with a client related to text recognition, and the server 120 may be a server related to data processing.
  • the client in this application can be software, a web page, an applet, etc.
  • the server is a backend server corresponding to the software, web page, or applet, or a server specifically used for data processing; this application does not specifically limit this.
  • the terminal device 110 may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, an Internet of Things device, a smart home appliance, a vehicle-mounted terminal, etc., but is not limited thereto.
  • Server 120 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms.
  • the terminal device 110 and the server 120 can be connected directly or indirectly through wired or wireless communication methods, which is not limited in this application.
  • the text recognition method in the embodiment of the present application can be executed by the server or the terminal device alone, or can be executed by the server and the terminal device in collaboration.
  • In one example, the terminal device inputs the image to be recognized into the target classification model to obtain the corresponding language distribution information and the original text presentation direction. Then, based on the original text presentation direction and the preset target text presentation direction, the image to be recognized is corrected to obtain the target recognition image. Next, based on the text position information of each language, the text area image sets corresponding to the multiple languages are determined from the target recognition image. Finally, based on each text area image set, the target text recognition model associated with the corresponding language is used to determine the text recognition result corresponding to the image to be recognized.
  • the server performs the above text recognition process.
  • In another example, the terminal device, in response to a text recognition operation on the image to be recognized, obtains the image to be recognized and transmits it to the server. The server then inputs the image to be recognized into the target classification model to obtain the corresponding language distribution information and the original text presentation direction; corrects the image to be recognized based on the original text presentation direction and the preset target text presentation direction to obtain the target recognition image; determines, based on the text position information of each language, the text area image sets corresponding to the multiple languages from the target recognition image; and finally, based on each text area image set, uses the target text recognition model associated with the corresponding language to determine the text recognition result corresponding to the image to be recognized.
  • the text recognition method in the embodiments of the present application can be applied to any scenario where text in an image needs to be extracted, for example, picture text extraction, scan translation, picture translation, reading, literature retrieval, sorting of letters and packages, editing and proofreading of manuscripts, summarization and analysis of reports and cards, statistical summary of commodity invoices, identification of commodity codes, and management of commodity warehouses, but is not limited thereto.
  • FIG. 2 is a schematic flowchart of the text recognition method provided in the embodiment of the present application.
  • This method can be executed by an electronic device, and the electronic device can be a terminal device or a server.
  • the specific process is as follows:
  • S201: Input the image to be recognized containing text into the target classification model, and obtain the language distribution information and the original text presentation direction of the image to be recognized, where the language distribution information includes the multiple languages corresponding to the text and the text position information corresponding to each of the multiple languages.
  • the target classification model may also be called a multi-task architecture text language/orientation classification model (Lingual & Orientation Prediction Network, LOPN).
  • the target classification model can detect the language distribution and judge the text presentation direction of the image to be recognized.
  • the multiple languages can be several of the following: Chinese, Japanese, Korean, Latin, Thai, Arabic, Hindi, and symbol identification, where symbol identification includes one or more of numbers and symbols, but is not limited thereto.
  • Figure 3A is a schematic diagram of various languages provided by the embodiment of the present application, in which the texts corresponding to Chinese, Japanese, Korean, Latin, Thai, Arabic, and Hindi all mean "Hello", and the symbol identification represents 11 o'clock (11:00) to 12 o'clock (12:00).
  • the text presentation direction is used to characterize the layout direction of the text.
  • the text presentation direction includes but is not limited to 0°, 90°, 180°, and 270°. Taking Chinese as an example, as shown in Figure 3B, when the text presentation direction is 0° or 180°, the text is laid out horizontally; when the text presentation direction is 90° or 270°, the text is laid out vertically.
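  • As an illustrative sketch (assuming OpenCV, and assuming the direction value is the counter-clockwise angle by which the text deviates from the upright 0° layout), the later correction step can rotate the image so that its text presentation direction matches the preset target text presentation direction:

```python
import cv2

# Rotation needed (clockwise) to undo a counter-clockwise text deviation.
_ROTATE_CODES = {
    90: cv2.ROTATE_90_CLOCKWISE,
    180: cv2.ROTATE_180,
    270: cv2.ROTATE_90_COUNTERCLOCKWISE,
}

def correct_orientation(image, original_direction, target_direction=0):
    """Rotate the image so its text presentation direction equals target_direction.
    Both directions are assumed to be one of 0, 90, 180, 270 degrees."""
    angle = (original_direction - target_direction) % 360
    if angle == 0:
        return image
    return cv2.rotate(image, _ROTATE_CODES[angle])
```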
  • the image to be recognized may be an original image, or a partial image containing text extracted from the original image through text detection.
  • Specific text detection methods are described below.
  • the image 1 to be recognized contains the Chinese word "outing” and the English word "spring".
  • the image 1 to be recognized is input into the target classification model to obtain the language distribution information and the original text presentation direction corresponding to the image 1 to be recognized, where the original text presentation direction is 180°.
  • the language distribution information includes the multiple languages corresponding to the text (i.e., Chinese and English), as well as the text position information corresponding to Chinese and the text position information corresponding to English; the text position information corresponding to Chinese is used to represent the text position of "outing", and the text position information corresponding to English is used to represent the text position of "spring".
  • the target recognition image 2 is obtained after correcting the image 2 to be recognized.
  • the target recognition image 2 contains Chinese, Japanese and English.
  • Figure 6 contains a dotted box 61, a dotted box 62, a dotted box 63, and a dotted box 64.
  • the dotted box 61 and the dotted box 64 both represent the text area image corresponding to Japanese
  • the dotted box 62 represents the text area image corresponding to Chinese
  • the dotted box 63 represents the text area image corresponding to English.
  • the text area images represented by the dotted box 61 and the dotted box 64 constitute the text area image set corresponding to Japanese, the text area image represented by the dotted box 62 constitutes the text area image set corresponding to Chinese, and the text area image represented by the dotted box 63 constitutes the text area image set corresponding to English.
  • the five target text recognition models are the text recognition models corresponding to Chinese, Japanese, Korean, English, and mixed Latin respectively, where mixed Latin includes, but is not limited to, Latin, Thai, Vietnamese, Russian, Arabic, and Hindi. The character set sizes of Chinese, Japanese, Korean, Thai, and mixed Latin are approximately 10,000+, 9,000+, 8,000+, 200+, and 1,000+ respectively.
  • the above five target text recognition models are used as examples for explanation.
  • For the text area image set corresponding to each of the multiple languages, the text area image set is input into the target text recognition model associated with that language, and the text recognition sub-result corresponding to the text area image set is obtained;
  • based on the obtained text recognition sub-results, the text recognition result corresponding to the image to be recognized is obtained.
  • the dotted box 61 and the dotted box 64 both represent the text area image corresponding to Japanese
  • the dotted box 62 represents the text area image corresponding to Chinese
  • the dotted box 63 represents the text area image corresponding to English.
  • the text area image represented by the dotted box 61 is input into the Japanese-associated target text recognition model to obtain the corresponding text recognition sub-result 61; the text area image represented by the dotted box 64 is input into the Japanese-associated target text recognition model to obtain the corresponding text recognition sub-result 64; the text area image represented by the dotted box 62 is input into the Chinese-associated target text recognition model to obtain the corresponding text recognition sub-result 62; and the text area image represented by the dotted box 63 is input into the English-associated target text recognition model to obtain the corresponding text recognition sub-result 63. Then, based on the text recognition sub-result 61, the text recognition sub-result 62, the text recognition sub-result 63, and the text recognition sub-result 64, the text recognition result is obtained.
  • the image to be recognized can be obtained through text detection, specifically but not limited to the following methods:
  • Method 1: Obtain the original image and extract at least one sub-image containing text from the original image.
  • the original image can be input into the target text line detection model to obtain at least one sub-image containing text.
  • the target text line detection model can be implemented based on the Differentiable Binarization (DB) algorithm, but is not limited to this.
  • the backbone network part of the target text line detection model can adopt a fully convolutional network (Fully Convolutional Network, FCN) architecture based on a lightweight network architecture; the multi-stream branches at the head are used to determine whether each pixel of the original image is text and to learn the binarization threshold.
  • the target text line detection model can be composed of a 3×3 convolution operator and two deconvolution operators with a stride of 2, where 1/2, 1/4, 1/8, 1/16, and 1/32 respectively represent the size ratios relative to the input original image.
  • the lightweight network architecture can adopt but is not limited to mobilenetv2, mobilenetv3, and shufflenet.
  • the input original image passes through the resnet50-vd based Feature Pyramid Network (FPN), and the outputs of the FPN are transformed to the same size by upsampling and then concatenated to generate the feature map.
  • the probability map and threshold map can be predicted.
  • the probability map is used to characterize the probability that each pixel in the original image belongs to text, and the threshold map is used to characterize the threshold corresponding to each pixel in the original image; then, based on the probability map and the threshold map, the binary map (approximate binary map) can be obtained.
  • the corresponding sub-images can be obtained.
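  • A minimal sketch of the DB-style binarization step described above (k and the inference threshold are typical values for the DB algorithm rather than values specified here):

```python
import numpy as np

def approximate_binary_map(prob_map, thresh_map, k=50.0):
    """Differentiable binarization: fuse the probability map P and the learned
    threshold map T into an approximate binary map B = 1 / (1 + exp(-k * (P - T)))."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

def text_mask(prob_map, thresh=0.3):
    """At inference time, thresholding the probability map gives the text-region
    mask from which the sub-images (text boxes) are cropped."""
    return (prob_map > thresh).astype(np.uint8)
```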
  • Each sub-image includes sub-image 1, sub-image 2 and sub-image 3, where sub-image 1 contains the text "Harvesting", sub-image 2 contains the text "great”, and sub-image 3 contains the text "skills".
  • Method 2: Obtain the original image, extract at least one sub-image containing text from the original image, and perform shape correction processing on each extracted sub-image so that its image shape matches a preset image shape; any one of the corrected sub-images is used as the image to be recognized. The preset image shape can be set to a regular shape such as a rectangle, but is not limited thereto. In practical applications, to facilitate subsequent image processing operations, the preset image shape is usually set to a rectangle, that is, the image shape of the sub-image is corrected to a rectangle. Since the process of extracting sub-images in method 2 is the same as the process of extracting sub-images in method 1, it is not described again here.
  • the preset image shape is a rectangle
  • sub-image 1 contains the text "Harvesting”
  • if the image shape of sub-image 1 is a curved shape, shape correction processing is performed on the extracted sub-image 1 to obtain the corrected sub-image 1; the corrected sub-image 1 is a rectangular image containing the text "Harvesting", and this rectangular image can be used as the image to be recognized. In this way, a curved text area is corrected into a rectangular text area, and subsequent text recognition is performed based on the rectangular text area, so that text areas of any shape can be detected and recognized while the text recognition accuracy is further improved.
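  • The shape correction algorithm is not fixed here; as an illustrative sketch, a quadrilateral text region can be warped into a rectangle with a perspective transform (a curved region would need a piecewise or thin-plate-spline variant):

```python
import cv2
import numpy as np

def rectify_quad(image, quad, out_size=(256, 32)):
    """Warp a quadrilateral text region (4 corner points ordered top-left,
    top-right, bottom-right, bottom-left) into a fixed-size rectangular image."""
    quad = np.asarray(quad, dtype=np.float32)
    out_w, out_h = out_size
    dst = np.array([[0, 0], [out_w - 1, 0], [out_w - 1, out_h - 1], [0, out_h - 1]],
                   dtype=np.float32)
    M = cv2.getPerspectiveTransform(quad, dst)
    return cv2.warpPerspective(image, M, (out_w, out_h))
```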
  • the target classification model can include a target feature extraction network, a target language recognition sub-model, and a target direction recognition sub-model.
  • the target feature extraction network can use but is not limited to a convolutional neural network (CNN).
  • the target feature extraction network includes S1, S2, S3, and S4 layers.
  • the S1 layer includes depthwise (DW) convolution, conventional convolution, activation function, matrix multiplication, and pointwise (PW) convolution.
  • the target language recognition sub-model can be trained through the following operations:
  • the initial language recognition sub-model included in the initial recognition model is iteratively trained to obtain the target language recognition sub-model, wherein, as shown in Figure 10A, in an iterative process, the following operations are performed :
  • the training data x can be any training data included in the first training data set.
  • Image a contains the text "20th anniversary of a certain food company.”
  • Image a is input into the initial language recognition sub-model to obtain the predicted language distribution information corresponding to image a.
  • the predicted language distribution information corresponding to image a includes Chinese and character identifiers, as well as predicted text position information corresponding to Chinese and character identifiers.
  • the predicted text position information corresponding to Chinese is used to characterize the predicted position information of the text "0th anniversary of a certain food company", and the predicted text position information corresponding to the character identifier is used to represent the predicted position information of the text "2".
  • the real language distribution information corresponding to image a includes Chinese and character identifiers, as well as the real text position information corresponding to Chinese and to the character identifiers; the real text position information corresponding to Chinese is used to represent the real position information of the texts "XX Food Company" and "Anniversary", and the real text position information corresponding to the character identifier is used to represent the real position information of the text "20".
  • the model is trained using the predicted language distribution information and the real language distribution information corresponding to the training data, thereby improving the language classification accuracy of the model and, in turn, the accuracy of text recognition.
  • the text contained in the image may not entirely belong to a single type.
  • Chinese characters often appear in Japanese and Korean.
  • Latin and symbolic logos often appear mixed with characters of any language.
  • the predicted language distribution information is fitted to a soft target by optimizing the model parameters.
  • the soft target refers to the statistics of the probability of occurrence of various types of characters in each text string. Specifically, as shown in Figure 10A, when performing S1002, the following operations may be used but are not limited to:
  • the predicted distribution probability includes the predicted probability corresponding to each language.
  • the predicted probability is used to represent the proportion of the predicted text length of the corresponding language among all languages.
  • the predicted distribution probability is determined based on the predicted language distribution information corresponding to image a.
  • the predicted probability of Chinese is 90%
  • the predicted probability of symbol logo is 10%.
  • the real distribution probability includes the real probability corresponding to each language.
  • the real probability is used to represent the proportion of the real text length of the corresponding language among all languages.
  • the real distribution probability is determined.
  • the real probability of Chinese is 80%
  • the real probability of symbol logo is 20%.
  • the distribution of mixed characters can be effectively described, thereby improving the prediction accuracy of language distribution, and thereby improving the recognition accuracy of multilingual texts.
  • when performing S10023, the first model loss can be a cross-entropy loss (Cross Entropy loss, CE loss) or a KL divergence loss (Kullback-Leibler divergence loss).
  • the KL divergence loss can be used as the target loss for language prediction by the network.
  • the calculation method of the KL divergence loss is shown in formula (1):
  • KL(P‖Q) = Σ_x P(x) · log( P(x) / Q(x) )    (1)
  • where P is used to represent the predicted distribution probability, Q is used to represent the true distribution probability, P(x) is used to represent the predicted probability corresponding to language x, Q(x) is used to represent the real probability corresponding to language x, and x is a certain language among the languages.
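  • A minimal sketch of the soft target and the KL divergence loss described above (the language set and tensor shapes are illustrative):

```python
import torch

LANGUAGES = ["zh", "ja", "ko", "latin", "thai", "arabic", "hindi", "symbol"]  # illustrative set

def soft_target(char_langs):
    """Soft target Q: fraction of characters of each language in one text string,
    e.g. 18 Chinese characters + 2 symbol characters -> 90% zh, 10% symbol."""
    q = torch.zeros(len(LANGUAGES))
    for lang in char_langs:                      # one language tag per character
        q[LANGUAGES.index(lang)] += 1.0
    return q / q.sum()

def kl_language_loss(pred_logits, q, eps=1e-8):
    """First model loss: KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), with P the
    predicted language distribution and Q the soft target.
    Shapes: (batch, num_languages); build q with e.g.
    q = torch.stack([soft_target(tags) for tags in batch_char_langs])."""
    p = torch.softmax(pred_logits, dim=-1)
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1).mean()
```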
  • in an iterative process, the training data x can be input into the initial direction recognition sub-model to obtain the predicted text presentation direction of the training data x; then, based on the predicted text presentation direction and the real text presentation direction of the training data x, the second model loss is determined; and then, based on the second model loss, the model parameters of the initial direction recognition sub-model are adjusted.
  • in the input layer, images with different text presentation directions can be obtained, and by maximizing the feature distance between images with different text presentation directions during model training, this difference is increased, allowing the model to better learn to understand the text presentation direction.
  • the target direction identification sub-model is obtained through the following operations:
  • the initial direction recognition sub-model included in the initial recognition model is iteratively trained to obtain the target direction recognition sub-model.
  • the training data x is still used as an example to illustrate an iterative process; referring to Figure 11A, during one iteration, the following operations are performed:
  • the preset image rotation angle can be set to 180°; it should be noted that the image rotation angle can be set according to the actual application scenario and is not limited to 180°.
  • Image b contains the Korean word "Hello”.
  • the training data x is rotated according to the preset image rotation angle of 180° to obtain the comparison data y.
  • taking the training data x as image b as an example, the training data x is input into the initial direction recognition sub-model to obtain the predicted text presentation direction corresponding to the training data x, for example, 0°.
  • the comparison data y is input into the initial direction recognition sub-model, and the predicted text presentation direction corresponding to the comparison data y is obtained.
  • the predicted text presentation direction corresponding to the comparison data y is 180°.
  • the model can learn the differences between images in different text presentation directions, thereby improving the recognition accuracy of text in different text presentation directions.
  • the second model loss may be model prediction loss or contrast loss, or a weighted result of model prediction loss and contrast loss.
  • the model prediction loss can be cross-entropy loss or focal loss, but is not limited to this. The following only uses cross-entropy loss as an example for explanation.
  • the cross-entropy loss is calculated based on the obtained predicted text presentation direction and the corresponding real text presentation direction, and the calculated cross-entropy loss is used as the second model loss.
  • the second model loss uses contrast loss, then based on the obtained prediction text presentation directions, the contrast loss between each prediction text presentation direction is calculated, and the calculated contrast loss is used as the second model loss.
  • if the second model loss uses the weighted result of the cross-entropy loss and the contrast loss, the second model loss can be determined, for example, as follows: the contrast loss is calculated as C_loss = max(margin − d, 0), where C_loss represents the contrast loss, d is used to characterize the Euclidean distance between the image features corresponding to the training data x and the comparison data y, margin is the set threshold, and the max() function is used to obtain the maximum value.
  • image features corresponding to the training data x and the comparison data y refer to the image features obtained after image feature extraction of the image.
  • the image features can also be called image embedding.
  • the second model loss can then be determined based on, but not limited to, the sum or the average of the cross-entropy losses, the cross-entropy loss weight, the contrast loss, and the contrast loss weight.
  • In this way, this difference can be increased by maximizing the distance between images with different text presentation directions at the feature layer during model training, so that the model can better learn to understand the text presentation direction, thereby improving the model's recognition accuracy of the text presentation direction and, in turn, the accuracy of text recognition.
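  • A minimal sketch of the second model loss, assuming the contrast loss takes the max(margin − d, 0) form discussed above and the two weights are hyperparameters:

```python
import torch
import torch.nn.functional as F

def second_model_loss(feat_x, feat_y, logits_x, logits_y, dir_x, dir_y,
                      margin=1.0, ce_weight=1.0, contrast_weight=1.0):
    """Weighted combination of (a) cross-entropy on the predicted text presentation
    direction for the training data x and the comparison data y, and (b) a contrast
    loss pushing apart the image features of x and its 180-degree rotated copy y."""
    ce = F.cross_entropy(logits_x, dir_x) + F.cross_entropy(logits_y, dir_y)
    d = F.pairwise_distance(feat_x, feat_y)              # Euclidean distance
    contrast = torch.clamp(margin - d, min=0.0).mean()   # C_loss = max(margin - d, 0)
    return ce_weight * ce + contrast_weight * contrast
```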
  • the target direction recognition sub-model and the target language recognition sub-model can be included in the same target classification model, or they can be configured separately to achieve the different functions of the target classification model; this is not described in detail here.
  • both image c and image d contain the text "codep”.
  • the text presentation direction of image c is 180°
  • the text presentation direction of image d is 0°.
  • the text recognition result corresponding to image d is "codep”
  • the text recognition result corresponding to image c is "dapos”. Obviously, the text recognition is wrong.
  • a multi-stream recognition model using the same backbone network as the target classification model is introduced as a pre-training task.
  • the multi-stream recognition model is used to realize the recognition of multi-language text content; after the training of the multi-stream recognition model is completed, its trained backbone network is used as the target feature extraction network, and then, based on the target feature extraction network, the initial language recognition sub-model and the initial direction recognition sub-model are trained to obtain the target language recognition sub-model and the target direction recognition sub-model.
  • the target feature extraction network is trained through the following operations:
  • the pre-trained recognition network is iteratively trained to obtain the target feature extraction network.
  • the pre-trained language recognition model may also be called a multi-stream recognition model.
  • the pre-trained language recognition model includes an input layer, a backbone network, a timing model, a multi-stream decoder and an output layer.
  • the backbone network has the same structure as the backbone network in the target classification model; the backbone network is used to learn the apparent features of images, and the temporal model is used to learn the contextual information of the text.
  • Multi-stream decoders include but are not limited to decoders corresponding to Chinese, Japanese, Korean, Latin, Thai, Arabic, Hindi, and symbol logos.
  • the timing model can use the Long Short-Term Memory network (Long Short-Term Memory, LSTM).
  • In the LSTM, x_t represents the input value at time t, y_t represents the output value at time t, and σ represents the gate activation function; by forgetting and memorizing new information in the cell state, information useful for calculations at subsequent moments can be retained.
  • the forward LSTM and the reverse LSTM are combined to form a bidirectional long short-term memory network (BiLSTM).
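  • A minimal sketch of such a bidirectional LSTM temporal model over the backbone's width-wise feature sequence (layer sizes are illustrative, not specified here):

```python
import torch.nn as nn

class TemporalModel(nn.Module):
    """Bidirectional LSTM over the width-wise feature sequence produced by the
    backbone, used to learn the contextual information of the text."""
    def __init__(self, in_channels=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(in_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)

    def forward(self, feat):          # feat: (batch, width, channels)
        out, _ = self.rnn(feat)       # out: (batch, width, 2 * hidden)
        return out
```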
  • the model convergence speed can be significantly improved.
  • the classification accuracy of text presentation direction and language distribution can be improved, thereby improving text recognition accuracy.
  • Moreover, the target classification model introduces additional overhead into the entire text recognition process; once the number of text lines in the image is large, the time overhead will multiply accordingly.
  • model compression and cropping can be performed on the target classification model.
  • the target feature extraction network can use a lightweight network architecture, and the lightweight network architecture can use but is not limited to mobilenetv2, mobilenetv3, and shufflenet.
  • an SE (squeeze-and-excitation) layer can be added to the target feature extraction network, and a dimensionality-reduction layer can be added to at least one of the target direction recognition sub-model and the target language recognition sub-model, to further reduce the amount of computation.
  • PyTorch quantization-aware training can be used, adapted to TensorRT int8, to perform 8-bit model quantization and tuning, so as to improve the inference speed while ensuring only a slight loss in classification accuracy.
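  • A minimal sketch of 8-bit quantization-aware training with the PyTorch eager-mode quantization API (the fine-tuning loop and the subsequent TensorRT export are assumed to exist elsewhere):

```python
import torch
from torch import quantization

def quantize_aware_finetune(model, train_fn, backend="fbgemm"):
    """Insert fake-quantization observers, fine-tune, then convert to an int8 model
    that can subsequently be adapted for TensorRT int8 deployment."""
    model.train()
    model.qconfig = quantization.get_default_qat_qconfig(backend)
    quantization.prepare_qat(model, inplace=True)   # add fake-quant / observer modules
    train_fn(model)                                 # short fine-tuning with fake quantization
    model.eval()
    return quantization.convert(model)              # produce the quantized int8 model
```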
  • the target text recognition model includes an image feature encoder, a timing model, and a decoder.
  • the decoder can use a Connectionist Temporal Classification (CTC) decoder and an attention mechanism to implement multi-task decoding, where the attention mechanism is used to assist CTC learning.
  • the image feature encoder can use but is not limited to ResNet-50VD
  • the temporal model can use but is not limited to bidirectional LSTM.
  • the temporal model is used to enhance the learning of text context information in images. Since the structure of the text recognition model is similar to that of the multi-stream recognition model, we will not go into details here.
  • text recognition performance can be improved through semi-supervised learning (Semi-supervised Learning, SSL); as in image classification tasks, it can be used to alleviate the shortage of labeled training data. In addition, text recognition performance can also be improved through data generation.
  • the target text recognition model is trained through the following operations:
  • the second text recognition model is iteratively trained to obtain the target text recognition model.
  • the first training data set, the second training data set, and the third training data set may be the same or different, and there is no limit to this.
  • each labeled sample contains the corresponding real text recognition result, and each unlabeled sample does not contain the corresponding real text recognition result.
  • the first text recognition model refers to an untrained text recognition model.
  • the image feature extraction network can use, but is not limited to, a CNN, and the image-text distance recognition model uses the image feature extraction network included in the second text recognition model as its own image feature extraction network.
  • the input of the image-text distance recognition model is each labeled sample and the corresponding sample label.
  • the model loss can be a ranking loss, whose purpose is to minimize the image-text distance between matched pairs of labeled samples and sample labels, and to maximize the image-text distance between mismatched pairs of labeled samples and sample labels.
  • the labeled sample 1 and its corresponding sample label are input into the image-text distance recognition model to obtain the image features corresponding to the labeled sample 1 and the text features corresponding to the sample label.
  • the image-text distance recognition model minimizes the image-text distance between the image features corresponding to labeled sample 1 and the text features corresponding to sample labels.
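  • A minimal sketch of one possible ranking-loss formulation for the image-text distance recognition model (the exact form is not fixed here; embeddings are assumed to be L2-normalized, and matched image-label pairs lie on the diagonal of the batch):

```python
import torch

def ranking_loss(img_feat, txt_feat, margin=0.2):
    """Hinge ranking loss: each matched image-text pair should be closer than the
    hardest mismatched pair for that image by at least `margin`.
    img_feat, txt_feat: (batch, dim), row i of txt_feat is the label of image i."""
    dist = torch.cdist(img_feat, txt_feat)                           # pairwise Euclidean distances
    pos = dist.diag().unsqueeze(1)                                   # matched image-text distance
    neg = dist + torch.eye(dist.size(0), device=dist.device) * 1e6   # mask the diagonal
    hardest_neg = neg.min(dim=1, keepdim=True).values
    return torch.clamp(margin + pos - hardest_neg, min=0.0).mean()
```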
  • N and M are both positive integers.
  • the values of N and M can be the same or different.
  • the obtained N labeled samples and M unlabeled samples can be called a batch of image data (batch).
  • unlabeled sample 2 is input into the second text recognition model, and the predicted text recognition result 2 corresponding to the unlabeled sample 2 is obtained.
  • the predicted text recognition result 2 is "BA CHU LON CON".
  • each model sub-loss corresponding to at least one unlabeled sample is determined, including:
  • each acquired target enhanced sample is input into the second text recognition model respectively to obtain the model sub-loss corresponding to each target enhanced sample, and the model sub-loss corresponding to each target enhanced sample is used as each model sub-loss corresponding to the M unlabeled samples.
  • the target enhanced sample refers to sample data obtained after applying data enhancement methods such as rotation, flipping, scaling, contrast changes, and noise disturbance to the corresponding sample to be enhanced.
  • the sample label of the target enhanced sample can also be called the pseudo label of the target enhanced sample.
  • unlabeled sample 2 can be used as a sample to be enhanced. Then, the target enhanced sample 2 corresponding to the unlabeled sample 2 is obtained, and the predicted text recognition result 2, "BA CHU LON CON", is used as the pseudo label of the target enhanced sample 2. After that, the obtained target enhanced sample 2 is input into the second text recognition model to obtain the predicted text recognition result corresponding to the target enhanced sample 2. Then, based on this predicted text recognition result and the pseudo label, the model sub-loss corresponding to the target enhanced sample 2 is obtained.
  • when the image-text distance corresponding to each of the M unlabeled samples is not greater than the preset distance threshold, the model sub-losses corresponding to the M unlabeled samples can be obtained.
  • the model sub-loss can use cross-entropy loss, focal loss, etc.
  • labeled sample 1 is input into the second text recognition model to obtain the predicted text recognition result corresponding to labeled sample 1; then, based on the difference between the predicted text recognition result and the sample label corresponding to labeled sample 1, the model sub-loss corresponding to labeled sample 1, namely model sub-loss 1, is obtained.
  • the sum of the model sub-losses corresponding to the N labeled samples and the M unlabeled samples can be used as the third model loss; gradient backpropagation is then performed based on the third model loss to optimize the model parameters.
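  • A minimal sketch of the semi-supervised step for the unlabeled samples; model.recognize, model.recognition_loss, augment, img_txt_distance, and dist_threshold are illustrative names rather than interfaces defined here:

```python
import torch

def unlabeled_sub_losses(model, unlabeled_imgs, augment, img_txt_distance,
                         dist_threshold=0.5):
    """Predict pseudo labels on unlabeled images, keep only those whose image-text
    distance is small enough, and compute the recognition loss on augmented copies
    against the pseudo labels (a FixMatch-style consistency objective)."""
    losses = []
    with torch.no_grad():
        pseudo_labels = [model.recognize(img) for img in unlabeled_imgs]  # e.g. "BA CHU LON CON"
    for img, pseudo in zip(unlabeled_imgs, pseudo_labels):
        if img_txt_distance(img, pseudo) > dist_threshold:
            continue                                  # unreliable pseudo label, skip
        enhanced = augment(img)                       # rotate / flip / scale / contrast / noise
        logits = model(enhanced)
        losses.append(model.recognition_loss(logits, pseudo))
    return losses
```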
  • For some languages, the characters usually include vowels, tone marks, and the like, which are often located above or below the base characters; therefore, text recognition for such languages needs to consider not only the temporal order but also two-dimensional spatial information. Taking Thai as an example, as shown in Figure 16, Thai characters contain vowel appendages and tone symbols, which are usually located above or below the base character.
  • the decoder in the target text recognition model can use the two-dimensional spatial attention decoder from the irregular text recognition method SAR (Show, Attend and Read).
  • the decoder in the recognition model can also use the dual-stream decoding structure of CTC and SAR.
  • the two decoders share the temporal features of the LSTM; only the CTC branch results are used for prediction during decoding, and SAR is used to assist CTC learning.
  • the CNN module can use a 31-layer ResNet to obtain a feature map; the feature map passes through the LSTM-based encoder-decoder framework and the 2D attention module connected to the decoder, and finally outputs text based on SAR recognition. In addition, the feature map can output text based on CTC recognition through the CTC decoder.
  • the image is downsampled to 1/8 of the original height after being processed by the CNN, which retains more spatial information and improves the recognition accuracy of the text recognition model without introducing extra forward-pass time overhead.
  • SAR and CTC are used as decoders respectively in the Thai recognition model, where the Thai text in image e means "hello" and the Thai text in image f means "summer"; the CTC result contains incorrectly recognized characters, so the recognition accuracy of SAR is clearly higher than that of CTC.
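  • A minimal sketch of the dual-stream decoding idea, with the SAR-style attention branch abstracted as a precomputed auxiliary loss (layer sizes and the blank-index convention are illustrative):

```python
import torch
import torch.nn as nn

class DualHeadDecoder(nn.Module):
    """CTC + attention dual-stream decoding: both heads share the LSTM temporal
    features; the SAR-style attention head only assists training, while inference
    uses the CTC branch alone."""
    def __init__(self, feat_dim=512, num_classes=1000):
        super().__init__()
        self.ctc_head = nn.Linear(feat_dim, num_classes + 1)   # +1 for the CTC blank
        self.ctc_loss = nn.CTCLoss(blank=num_classes, zero_infinity=True)

    def forward(self, feats):                                  # feats: (batch, time, feat_dim)
        return self.ctc_head(feats).log_softmax(dim=-1)

    def loss(self, feats, targets, target_lengths, sar_loss=None, sar_weight=1.0):
        log_probs = self.forward(feats).permute(1, 0, 2)       # (time, batch, classes)
        input_lengths = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
        loss = self.ctc_loss(log_probs, targets, input_lengths, target_lengths)
        if sar_loss is not None:                               # attention branch assists CTC learning
            loss = loss + sar_weight * sar_loss
        return loss
```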
  • Operation A: Data synthesis.
  • each text corpus, each font format, and each background image are obtained, and based on each text corpus, each font format, and each background image, each annotated sample is synthesized.
  • data synthesis can be performed based on the TextRenderer architecture.
  • the input is text corpus, font format, and background image in any language, and the output is a synthesized annotation sample.
  • for example, text corpus 1 is "rberinMann", font format 1 is Robot.ttf, and background image 1 is a textured background; the synthesized annotation sample is as shown in Figure 18A.
  • the font size, color, spacing, thickness, and other attributes of the text corpus can be configured, and horizontal or vertical text rendering can also be configured; for the background image, processing operations such as cropping and image enhancement transformation can be performed, and the processed background image is then superimposed with the text corpus. As an example, the superimposed images can be directly used as synthetic annotation samples.
  • Further, one or more of the following operations can be performed: Poisson fusion, perspective transformation, alpha-channel image overlay, image highlight enhancement, image printing enhancement, image enhancement, interference, and image size transformation, where interference includes but is not limited to blur, noise, and horizontal linear superposition interference.
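  • A minimal, simplified sketch of the synthesis input and output (a full TextRenderer-style pipeline additionally applies the configuration and fusion operations listed above; the paths below are placeholders):

```python
from PIL import Image, ImageDraw, ImageFont

def synthesize_sample(corpus, font_path, background_path,
                      font_size=32, fill=(20, 20, 20), margin=8):
    """Render a text corpus with a given font onto a (pre-cropped / augmented)
    background image, producing a synthetic annotated sample (image, label)."""
    font = ImageFont.truetype(font_path, font_size)
    bg = Image.open(background_path).convert("RGB")
    ImageDraw.Draw(bg).text((margin, margin), corpus, font=font, fill=fill)
    return bg, corpus            # the corpus itself is the annotation

# Example with placeholder paths:
# image, label = synthesize_sample("rberinMann", "Robot.ttf", "background_1.png")
```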
  • Operation B: Data style migration.
  • each text corpus is obtained, and each text corpus is input into the target data style transfer model to obtain each labeled sample.
  • the target data style transfer model includes a text transfer module, a background extraction module and a text fusion module.
  • the target data style transfer model uses a Generative Adversarial Network (GAN). After the text corpus and the image of the target text style are input into the target data style transfer model, the text corpus is processed by the text transfer module to obtain the text corpus Osk corresponding to the target text font in the target text style; the text transfer module and the background extraction module output the image background Ob contained in the target text style image; in addition, the text transfer module can also output the text corpus Ot corresponding to the target text font containing the original image background; the corresponding labeled sample is then generated.
  • Figure 19 also includes L_T, L_B, and L_F, where L_T represents the model loss of the text transfer module, L_B represents the model loss of the background extraction module, and L_F represents the model loss of the text fusion module.
  • the Hinge loss in SN-GAN can be used to replace the original adversarial loss to stabilize the training process and avoid large oscillations of the gradient.
  • the L1 loss weighted by the text mask area can be used, thereby reducing the model's over-learning of the background and strengthening the constraints on pixels in the text area.
  • the trained data style transfer model performs style transfer learning on the input text and target text style images, thereby generating recognition data that is closer to real data.
  • the font style of real data can be learned.
  • the domain difference between synthetic data and real data can be further addressed.
  • subsequently, real data can be used to perform style transfer on synthesized data, and real data can also be used to perform style transfer on other real data, thereby increasing sample diversity.
  • reinforcement learning can also be used to train the model, where the text recognition model serves as the agent part in a DQN, the image and the predicted text serve as the environment part in the DQN, and the feedback can be represented by the image-text distance and an edit-distance reward.
  • the text language/direction classification network (LOPN) proposed in the embodiment of this application can quickly and accurately predict the language distribution and direction of a text line image.
  • the experimental results on the constructed test set are shown in Table 1. As shown in Table 1, using the soft target probability to model the language distribution and using the KL divergence loss can achieve nearly 7% performance gain in language classification accuracy.
  • the classification model pre-trained using the multi-stream text recognition model has greatly improved in language classification and direction classification.
  • the introduction of the recognition task strengthens the model's understanding of the image text, making up for the shortcomings of classification that relies solely on the apparent features of the image.
  • adding dual-stream data supervision to direction classification helps the model better distinguish the direction of image text and improves the performance of direction classification.
  • TensorRT is used to deploy the above classification network model, and the NVIDIA-T4 GPU model is used online.
  • the comparison results of the prediction speed of the original model and the quantified model are shown in Table 2.
  • the text recognition model in the embodiment of this application integrates data synthesis optimization and semi-supervised training, and achieves high recognition performance on the constructed language test sets; compared with open-source models, its recognition accuracy substantially exceeds existing open-source models.
  • The evaluation metrics are the normalized edit distance (NED) and the sequence accuracy (SeqACC):
  • NED = (1/N) · Σ_i D(s_i, ŝ_i) / max(|s_i|, |ŝ_i|)
  • SeqACC = (1/N) · Σ_i [s_i = ŝ_i]
  • where D represents the Levenshtein distance, s_i represents the predicted text (i.e., the text recognition result), ŝ_i represents the corresponding ground-truth text, N represents the total number of images to be recognized, and [x] is a statistical operation that adds 1 when the condition x is true.
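  • A minimal sketch of how these two metrics can be computed:

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ned(preds, gts):
    """Normalized edit distance averaged over all images."""
    return sum(levenshtein(p, g) / max(len(p), len(g), 1)
               for p, g in zip(preds, gts)) / len(preds)

def seq_acc(preds, gts):
    """Sequence accuracy: fraction of images whose prediction matches exactly."""
    return sum(p == g for p, g in zip(preds, gts)) / len(preds)
```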
  • As shown in FIG. 21, a schematic structural diagram of a text recognition device 2100 is provided, which may include:
  • the image classification unit 2101 is used to input the image to be recognized containing text into the target classification model, and obtain the language distribution information and the original text presentation direction of the image to be recognized, wherein the language distribution information includes the multiple languages corresponding to the text and the text position information corresponding to each of the multiple languages;
  • the image correction unit 2102 is configured to correct the image to be recognized based on the original text presentation direction and the preset target text presentation direction to obtain a target recognition image;
  • the text positioning unit 2103 is configured to determine the text area image set corresponding to each of the multiple languages in the target recognition image based on the text position information corresponding to the multiple languages;
  • the image recognition unit 2104 is configured to perform processing based on the text area image sets corresponding to the multiple languages, respectively, using the target text recognition model associated with the corresponding language, to obtain a text recognition result corresponding to the image to be recognized.
  • the image recognition unit 2104 is specifically used to:
  • For the text area image set corresponding to each of the multiple languages, the text area image set is input into the target text recognition model associated with that language, and the text recognition sub-result corresponding to the text area image set is obtained;
  • the text recognition device 2100 also includes a model training unit 2105.
  • the target classification model includes a target language recognition sub-model, and the model training unit 2105 is used to:
  • the initial language recognition sub-model included in the initial recognition model is iteratively trained to obtain the target language recognition sub-model, wherein during one iteration, the following operations are performed:
  • a first model loss is determined, and based on the first model loss, the model parameters of the initial language recognition sub-model are adjusted.
  • the model training unit 2105 when determining the first model loss based on the predicted language distribution information and the real language distribution information corresponding to the training data, the model training unit 2105 is specifically used to:
  • the predicted distribution probability includes the predicted probability corresponding to each language, where the predicted probability is used to represent the proportion of the predicted text length of the corresponding language among all languages;
  • the real distribution probability includes the real probability corresponding to each language, where the real probability is used to represent the proportion of the real text length of the corresponding language among all languages;
  • the first model loss is determined based on the predicted distribution probability and the true distribution probability.
  • the text recognition device 2100 also includes a model training unit 2105.
  • the target classification model includes a target direction recognition sub-model, and the model training unit 2105 is used to:
  • the training data and the comparison data are respectively input into the initial direction identification sub-model, and the predicted text presentation directions corresponding to the training data and the comparison data are obtained;
  • a second model loss is determined, and based on the second model loss, the model parameters of the initial direction recognition sub-model are adjusted.
  • the model training unit 2105 when determining the second model loss based on the predicted text presentation directions corresponding to the training data and the comparison data, is specifically used to:
  • the second model loss is determined.
  • the target classification model also includes a target feature extraction network
  • the model training unit 2105 is also used to:
  • the pre-trained recognition model is iteratively trained to obtain the target feature extraction network.
  • the image classification unit 2101 before inputting the image to be recognized containing text into the target classification model and obtaining the language distribution information and original text presentation direction of the image to be recognized, the image classification unit 2101 is also used to:
  • shape correction processing is performed on each extracted sub-image, and any one of the sub-images obtained after the correction processing is used as the image to be recognized.
  • model training unit 2105 is also used to:
  • the third training data set includes each labeled sample and each unlabeled sample
  • a first text recognition model including an image feature extraction network is trained to obtain a second text recognition model, and based on the image feature extraction network included in the second text recognition model, an image-text distance recognition model is constructed;
  • the second text recognition model is iteratively trained to obtain the target text recognition model.
  • when iteratively training the second text recognition model based on each labeled sample, each unlabeled sample, and the image-text distance recognition model to obtain the target text recognition model, the model training unit 2105 is specifically used to:
  • a third model loss is determined, and based on the third model loss, the model parameters of the second text recognition model are adjusted.
  • when determining each model sub-loss corresponding to the at least one unlabeled sample based on the obtained image-text distances, the model training unit 2105 is specifically used to:
  • each acquired target enhanced sample is input into the second text recognition model respectively to obtain the model sub-loss corresponding to each target enhanced sample, and the model sub-losses corresponding to the target enhanced samples are used as the model sub-losses corresponding to the at least one unlabeled sample.
  • model training unit 2105 is also used to perform at least one of the following operations:
  • Each text corpus is obtained, and each text corpus is input into the target data style transfer model respectively to obtain each labeled sample.
  • for ease of description, the above parts are divided into modules (or units) according to their functions; when implementing the present application, the functions of the modules (or units) can be implemented in one or more pieces of software or hardware.
  • the electronic device may be a server or a terminal device.
  • FIG. 22 is a schematic structural diagram of a possible electronic device provided in the embodiment of the present application.
  • the electronic device 2200 includes: a processor 2210 and a memory 2220 .
  • the memory 2220 stores a computer program that can be executed by the processor 2210.
  • the processor 2210 can execute the steps of the above text recognition method by executing instructions stored in the memory 2220.
  • the memory 2220 can be a volatile memory, such as a random-access memory (RAM); the memory 2220 can also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 2220 can be any other medium capable of carrying or storing the desired program code in the form of instructions or data structures and capable of being accessed by a computer, without limitation; the memory 2220 may also be a combination of the above memories.
  • the processor 2210 may include one or more central processing units (CPUs) or be a digital processing unit or the like.
  • the processor 2210 is configured to implement the above text recognition method when executing the computer program stored in the memory 2220.
  • processor 2210 and the memory 2220 may be implemented on the same chip, and in some embodiments, they may also be implemented on separate chips.
  • the connection medium between the above-mentioned processor 2210 and the memory 2220 is not limited in the embodiments of the present application; in the embodiments of the present application, the connection between the processor 2210 and the memory 2220 through a bus is taken as an example, the bus is depicted as a thick line in Figure 22, and the connection manner between other components is only schematically illustrated and is not limiting.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on; for ease of description, only one thick line is shown in Figure 22, but this does not mean that there is only one bus or only one type of bus.
  • embodiments of the present application provide a computer-readable storage medium, which includes a computer program.
  • the computer program When the computer program is run on an electronic device, the computer program is used to cause the electronic device to perform the steps of the above text recognition method.
  • various aspects of the text recognition method provided by this application can also be implemented in the form of a program product, which includes a computer program.
  • when the program product is run on an electronic device, the computer program is used to make the electronic device perform the steps in the above text recognition method; for example, the electronic device can perform the steps shown in Figure 2.
  • the Program Product may take the form of one or more readable media in any combination.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof; more specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the program product of the embodiment of the present application can adopt a CD-ROM and include a computer program, and can be run on an electronic device.
  • the program product of the present application is not limited thereto.
  • a readable storage medium may be any tangible medium containing or storing a computer program that may be used by or in combination with an instruction execution system, apparatus, or device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

本申请涉及人工智能技术领域,提供了一种文本识别方法及相关装置,用以提高文本识别准确率,该方法包括:将包含文本的待识别图像输入至目标分类模型中,获得语种分布信息和原始文本呈现方向,然后,基于原始文本呈现方向,对待识别图像进行矫正,获得目标识别图像,之后,确定多个语种各自对应的文本区域图像集,最后,基于各文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到文本识别结果。这样,通过对语种分布信息和文本呈现方向进行准确判断和预测,提高了文本识别精度。

Description

文本识别方法及相关装置
本申请要求于2022年04月18日提交中国专利局、申请号为2022104029338、申请名称为“文本识别方法及相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其涉及文本识别技术。
背景技术
随着计算机技术的不断发展,基于图像的文本识别技术应用广泛,基于图像的文本识别技术用于识别图像中包含的文本信息。
相关技术中,通常先确定待识别图像中文本信息所属的目标语种,再通过目标语种对应的文本识别模型,确定待识别图像中包含的文本信息。
然而,采用上述文本识别方式,无法对包含多个语种的文本的图像进行文本识别。此外,一旦目标语种识别错误,会直接影响文本识别结果,导致识别准确率较低。
发明内容
本申请实施例提供一种文本识别方法及相关装置,能够对包含多个语种的文本的图像进行文本识别,并且能够提高文本识别准确率。
第一方面,本申请实施例提供一种文本识别方法,由电子设备执行,包括:
将包含文本的待识别图像输入至目标分类模型中,获得待识别图像的语种分布信息和原始文本呈现方向,其中,所述语种分布信息中包含所述文本对应的多个语种、以及所述多个语种各自对应的文本位置信息;
基于所述原始文本呈现方向和预设的目标文本呈现方向,对所述待识别图像进行矫正,得到目标识别图像;
基于所述多个语种各自对应的文本位置信息,在所述目标识别图像中,确定所述多个语种各自对应的文本区域图像集;
基于所述多个语种各自对应的文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到所述待识别图像对应的文本识别结果。
第二方面,本申请实施例提供一种文本识别装置,包括:
图像分类单元,用于将包含文本的待识别图像输入至目标分类模型中,获得所述待识别图像的语种分布信息和原始文本呈现方向,其中,所述语种分布信息中包含所述文本对应的多个语种、以及所述多个语种各自对应的文本位置信息;
图像矫正单元,用于基于所述原始文本呈现方向和预设的目标文本呈现 方向,对所述待识别图像进行矫正,得到目标识别图像;
文本定位单元,用于基于所述多个语种各自对应的文本位置信息,在所述目标识别图像中,确定所述多个语种各自对应的文本区域图像集;
图像识别单元,用于基于所述多个语种各自对应的文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到所述待识别图像对应的文本识别结果。
第三方面,本申请实施例提供一种电子设备,包括处理器和存储器,其中,所述存储器存储有计算机程序,当所述计算机程序被所述处理器执行时,使得所述处理器执行上述文本识别方法的步骤。
第四方面,本申请实施例提供一种计算机可读存储介质,其包括计算机程序,当所述计算机程序在电子设备上运行时,所述计算机程序用于使所述电子设备执行上述文本识别方法的步骤。
第五方面,本申请实施例提供一种计算机程序产品,所述程序产品包括计算机程序,所述计算机程序存储在计算机可读存储介质中,电子设备的处理器从所述计算机可读存储介质中读取并执行所述计算机程序,使得电子设备执行上述文本识别方法的步骤。
本申请实施例中，将包含文本的待识别图像输入至目标分类模型中，获得相应的语种分布信息和原始文本呈现方向，然后，基于原始文本呈现方向和预设的目标文本呈现方向，对待识别图像进行矫正，得到目标识别图像，之后，基于语种分布信息中的多个语种各自对应的文本位置信息，在目标识别图像中，确定多个语种各自对应的文本区域图像集，最后，基于各文本区域图像集，分别采用对应语种关联的目标文本识别模型进行处理，得到待识别图像对应的文本识别结果。
这样,一方面,通过语种分布信息,可以对待识别文本中包含的多个语种进行定位,进而在一定程度上解决图像中多语言混排问题,另一方面,结合文本呈现方向进行图像矫正,可以提高文本识别效率和识别准确率,此外,通过对语种分布信息和文本呈现方向进行准确判断,使得各文本区域图像能够正确分发到对应语种的文本识别模型,进一步提高了文本识别精度。
附图说明
图1为本申请实施例中提供的一种应用场景示意图;
图2为本申请实施例中提供的文本识别方法的流程示意图;
图3A为本申请实施例中提供的各语种的示意图;
图3B为本申请实施例中提供的各文本呈现方向的示意图;
图4为本申请实施例中提供的获取语种分布信息和原始文本呈现方向的示意图;
图5为本申请实施例中提供的图像矫正过程的示意图;
图6为本申请实施例中提供的目标识别图像的示意图;
图7为本申请实施例中提供的文本识别方法的逻辑示意图;
图8A为本申请实施例中提供的目标文本行检测模型的结构示意图;
图8B为本申请实施例中提供的形状矫正处理过程的示意图;
图9为本申请实施例中提供的目标分类模型的示意图;
图10A为本申请实施例中提供的语种识别子模型训练方法的流程示意图;
图10B为本申请实施例中提供的确定第一模型损失的逻辑示意图;
图11A为本申请实施例中提供的语种识别子模型训练方法的流程示意图;
图11B为本申请实施例中提供的确定第二模型损失的逻辑示意图;
图11C为本申请实施例中提供的对比损失和交叉熵损失的示意图;
图12为本申请实施例中提供的两种文本识别结果的示意图;
图13为本申请实施例中提供的预训练语种识别模型的结构示意图；
图14为本申请实施例中提供的图文距离识别模型的结构示意图;
图15为本申请实施例中提供的文本识别模型训练方法的流程示意图;
图16为本申请实施例中提供的泰语的空间分布特性的示意图;
图17A为本申请实施例中提供的基于SAR和CTC的文本识别模型的示意图;
图17B为本申请实施例中提供的几种文本识别结果的示意图;
图18A为本申请实施例中提供的数据合成方法的逻辑示意图;
图18B为本申请实施例中提供的合成的标注样本的示意图;
图19为本申请实施例中提供的数据文本风格迁移的逻辑示意图;
图20为本申请实施例中提供的文本风格迁移得到的样本的示意图;
图21为本申请实施例中提供的一种文本识别装置的结构示意图;
图22为本申请实施例中提供的一种电子设备的结构示意图。
具体实施方式
下面对本申请实施例中涉及的部分概念进行介绍。
文本检测:定位出图像中文本所在位置。
文本识别:基于文本检测得到的文本位置,转换得到文本识别结果,即转换得到文本信息。
本申请实施例提供的方案涉及人工智能的机器学习技术。在本申请实施例中,主要涉及文本检测模型、分类模型、文本识别模型的训练过程,以及相应的模型应用过程。需要说明的是,本申请实施例中,模型训练过程可以采用离线训练,也可以采用在线训练,对此不做限制。
以下结合说明书附图对本申请的优选实施例进行说明,应当理解,此处所描述的优选实施例仅用于说明和解释本申请,并不用于限定本申请,并且在不冲突的情况下,本申请实施例及实施例中的特征可以相互组合。
参阅图1所示,其为本申请实施例提供的一种应用场景的示意图。该应用场景中至少包括终端设备110以及服务器120。终端设备110的数量可以 是一个或多个,服务器120的数量也可以是一个或多个,本申请对终端设备110和服务器120的数量不做具体限定。终端设备110上可以安装有与文本识别相关的客户端,服务器120可以是与数据处理相关的服务器。另外,本申请中的客户端可以是软件,也可以是网页、小程序等,服务器则是与软件、网页、小程序等相对应的后台服务器,或者是专门用于进行数据处理的服务器,本申请不做具体限定。
本申请实施例中,终端设备110可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、物联网设备、智能家电、车载终端等,但并不局限于此。
服务器120可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器。终端设备110与服务器120可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
需要说明的是,本申请实施例中的文本识别方法可以由服务器或者终端设备单独执行,也可以由服务器和终端设备协同执行。
例如,由终端设备将待识别图像输入至目标分类模型,获得相应的语种分布信息和原始文本呈现方向,然后,基于原始文本呈现方向和预设的目标文本呈现方向,对待识别图像进行矫正,得到目标识别图像,之后,基于各文本位置信息,从目标识别图像中,确定多个语种各自对应的文本区域图像集,最后,基于各文本区域图像集,分别采用对应语种关联的目标文本识别模型,确定待识别图像对应的文本识别结果。或者,由服务器执行上述文本识别过程。
再或者,由终端设备响应针对待识别图像的文本识别操作,获得待识别图像,并将待识别图像传输给服务器,再由服务器将待识别图像输入至目标分类模型,获得相应的语种分布信息和原始文本呈现方向,然后,基于原始文本呈现方向和预设的目标文本呈现方向,对待识别图像进行矫正,得到目标识别图像,之后,基于各文本位置信息,从目标识别图像中,确定多个语种各自对应的文本区域图像集,最后,基于各文本区域图像集,分别采用对应语种关联的目标文本识别模型,确定待识别图像对应的文本识别结果。
需要说明的是,本申请实施例中的文本识别方法可应用于需要提取图像中文字的任意场景,例如,图片文字提取、扫一扫翻译、图片翻译、阅读、文献资料检索、信件和包裹的分拣、稿件的编辑和校对、报表和卡片的汇总与分析、商品发票的统计汇总、商品编码的识别、商品仓库的管理等,但不局限于此。
下面结合上文描述的应用场景,参考附图来描述本申请示例性实施方式 提供的文本识别方法,需要注意的是,上述应用场景仅是为了便于理解本申请的精神和原理而示出,本申请的实施方式在此方面不受任何限制。
参阅图2所示,其为本申请实施例中提供的文本识别方法的流程示意图,该方法可以由电子设备执行,该电子设备可以是终端设备或服务器,具体流程如下:
S201、将包含文本的待识别图像输入至目标分类模型中,获得该待识别图像的语种分布信息和原始文本呈现方向,其中,语种分布信息中包含文本对应的多个语种、以及多个语种各自对应的文本位置信息。
本申请实施例中,目标分类模型也可以称为多任务架构的文本语种/方向分类模型(Lingual & Orientation Prediction Network,LOPN)。目标分类模型可以对待识别图像进行语种分布检测和文本呈现方向判断。
其中,多个语种可以是以下语种中的多项:中文、日语、韩语、拉丁语、泰语、阿拉伯语、印地语、符号标识,其中符号标识包含数字和符号中的一项或多项,但不局限于此。参阅图3A所示,其为本申请实施例提供的各种语种的示意图,其中,中文、日语、韩语、拉丁语、泰语、阿拉伯语、印地语对应的文本的语义均为“你好”,符号标识表征11时(11:00)至12时(12:00)。
文本呈现方向用于表征文本的排版方向,示例性的,文本呈现方向包括但不限于0°、90°、180°、270°。以中文为例,参阅图3B所示,当文本呈现方向为0°和180°时,文字的排版方向均为水平排版,当文本呈现方向为90°和270°时,文字的排版方向均为竖直排版。
需要说明的是,本申请实施例中,待识别图像可以是原始图像,也可以是通过文本检测从原始图像中提取出的包含文字的部分图像,具体的文本检测方式参见下文。
以待识别图像1为例,参阅图4所示,待识别图像1中包含中文“踏青”和英文“spring”,将待识别图像1输入至目标分类模型中,获得待识别图像1对应的语种分布信息和原始文本呈现方向,其中,原始文本呈现方向为180°,语种分布信息中包含文本对应的多个语种(即中文和英文)、以及包含中文对应的文本位置信息和英文对应的文本位置信息,中文对应的文本位置信息用于表征“踏青”所在的文本位置,英文对应的文本位置信息用于表征“spring”所在的文本位置。
S202、基于原始文本呈现方向和预设的目标文本呈现方向,对待识别图像进行矫正,得到目标识别图像。
仍以待识别图像1为例,参阅图5所示,假设,预设的目标文本呈现方向为0°,原始文本呈现方向为180°,基于原始文本呈现方向和预设的目标文本呈现方向,对待识别图像1进行矫正,将得到目标识别图像1,其中,目标识别图像1的文本呈现方向为0°。
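作为一个便于理解的示意性草图（并非本申请实施例的实际实现，函数名与方向约定均为示例性假设），下述代码展示了基于原始文本呈现方向与目标文本呈现方向对图像进行旋转矫正的基本过程：

```python
import cv2
import numpy as np

def correct_orientation(image: np.ndarray, original_direction: int,
                        target_direction: int = 0) -> np.ndarray:
    """按原始文本呈现方向(0/90/180/270)将图像旋转到目标文本呈现方向。
    此处假设方向角为顺时针角度, 实际约定以具体实现为准。"""
    angle = (original_direction - target_direction) % 360
    if angle == 90:
        return cv2.rotate(image, cv2.ROTATE_90_COUNTERCLOCKWISE)
    if angle == 180:
        return cv2.rotate(image, cv2.ROTATE_180)
    if angle == 270:
        return cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    return image  # 已与目标方向一致, 无需矫正
```

例如，对原始文本呈现方向为180°的待识别图像1调用该函数，即可得到文本呈现方向为0°的目标识别图像1。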
S203、基于多个语种各自对应的文本位置信息,在目标识别图像中,确定多个语种各自对应的文本区域图像集。
以目标识别图像2为例,参阅图6所示,目标识别图像2是对待识别图像2进行矫正后得到的,目标识别图像2中包含中文、日语和英文,图6中包含虚线框61、虚线框62、虚线框63、虚线框64,其中,虚线框61和虚线框64均表征日语对应的文本区域图像,虚线框62表征中文对应的文本区域图像,虚线框63表征英文对应的文本区域图像,即,虚线框61和虚线框64表征的文本区域图像构成了日语对应的文本区域图像集,虚线框62表征的文本区域图像构成了中文对应的文本区域图像集,虚线框63表征的文本区域图像构成了英文对应的文本区域图像集。
S204、基于多个语种各自对应的文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到待识别图像对应的文本识别结果。
为了提高文本识别效率,基于不同语种对应的字符集的特点和规模,可以设计五种目标文本识别模型,五种目标文本识别模型分别为中文、日语、韩语、英语、混合拉丁语各自对应的目标文本识别模型,混合拉丁语包含拉丁语、泰语、越文、俄语、阿拉伯语、印地语,但不局限于此,其中,中文、日语、韩语、泰语、混合拉丁语各自对应的字符集的规模分别为1w+、9000+、8000+、200+和1000+。下文中,以上述五种目标文本识别模型为例进行说明。
具体的,执行S204时,可以采用但不限于以下操作:
针对多个语种中每个语种对应的文本区域图像集,将该文本区域图像集输入至该语种关联的目标文本识别模型中,获得该文本区域图像集对应的文本识别子结果;
基于各个文本区域图像集各自对应的文本识别子结果,得到待识别图像对应的文本识别结果。
仍以目标识别图像2为例,参阅图7所示,图7中虚线框61和虚线框64均表征日语对应的文本区域图像,虚线框62表征中文对应的文本区域图像,虚线框63表征英文对应的文本区域图像,将虚线框61表征的文本区域图像输入至日语关联的目标文本识别模型中,获得对应的文本识别子结果61,将虚线框64表征的文本区域图像输入至日语关联的目标文本识别模型中,获得对应的文本识别子结果64,将虚线框62表征的文本区域图像输入至中文关联的目标文本识别模型中,获得对应的文本识别子结果62,将虚线框63表征的文本区域图像输入至英语关联的目标文本识别模型中,获得对应的文本识别子结果63,进而,基于文本识别子结果61、文本识别子结果62、文本识别子结果63和文本识别子结果64,得到文本识别结果。
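下述代码给出按语种分发文本区域图像集并汇总识别子结果的示意性草图（recognizers 等对象及其 recognize 接口均为示例性假设，并非本申请实施例的实际接口）：

```python
def recognize_by_language(region_sets: dict, recognizers: dict) -> dict:
    """region_sets: {语种: [文本区域图像, ...]};
    recognizers: {语种: 该语种关联的目标文本识别模型}。
    返回 {语种: [文本识别子结果, ...]}, 各子结果可进一步汇总为最终的文本识别结果。"""
    results = {}
    for lang, images in region_sets.items():
        model = recognizers[lang]  # 取对应语种关联的目标文本识别模型
        results[lang] = [model.recognize(img) for img in images]
    return results
```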
由于不同语种的字符集数量分布差异性较大,因此,通过上述实现方式,相对可以避免识别结果偏向到大字符集语种(如中文、日语、韩语等),而忽略小字符集语种(如拉丁语,阿拉伯语、泰语等),从而提高文本识别精 度,且文本识别模型根据语种的文字特性进行适配性优化,实现模型的灵活更新。
在一些实施例中,待识别图像可以通过文本检测得到的,具体可以采用但不限于以下方式:
方式1:获取原始图像,并从原始图像中,提取出至少一个包含文本的子图像。
具体的,可以将原始图像输入至目标文本行检测模型中,获得至少一个包含文本的子图像。其中,目标文本行检测模型可以采用基于可微分二值化(Differentiable Binarization,DB)算法实现,但不局限于此。
参阅图8A所示,目标文本行检测模型的主干网络部分可以采用基于轻量级网络构架的全卷积网络(Fully Convolution Network,FCN)架构,头部的多流分支用于判断原始图像中的像素是否为文字以及用于二值化阈值学习。该目标文本行检测模型可以由1个3×3卷积算子和2个步幅为2的去卷积算子组成,1/2、1/4、1/8、1/16和1/32分别表示与输入的原始图像相比的比例。其中,轻量级网络构架可以采用但不限于mobilenetv2,mobilenetv3,以及shufflenet等。
输入的原始图像通过特征图金字塔网络(Feature Pyramid Networks,FPN)的resnet50-vd层,同时,通过上采样的方式将特征图金字塔网络的输出变换为同一尺寸,并级联(cascade)产生特征图,然后,基于特征图,可以预测得到概率图(probability map)和阈值图(threshold map),其中,概率图用于表征原始图像中各像素属于文本的概率,阈值图用于表征原始图像中各像素对应的阈值,进而,基于概率图和阈值图,可以得到二值图(approximate binary map),最后,基于二值图,可以得到相应的各个子图像,各个子图像包括子图像1、子图像2和子图像3,其中,子图像1中包含文本“Harvesting”,子图像2中包含文本“great”,子图像3中包含文本“skills”。
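作为补充说明，可微分二值化（DB）中由概率图与阈值图得到近似二值图的常见计算形式如下（示意性草图，k 的取值为常用默认值，属示例性假设）：

```python
import numpy as np

def approximate_binary_map(prob_map: np.ndarray, thresh_map: np.ndarray,
                           k: float = 50.0) -> np.ndarray:
    """近似二值图 B = 1 / (1 + exp(-k * (P - T))),
    其中P为概率图, T为阈值图, k控制近似阶跃函数的陡峭程度。"""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```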
方式2:
获取原始图像,并从原始图像中,提取出至少一个包含文本的子图像,基于预设的图像形状,对提取出的各个子图像分别进行形状矫正处理,以及将矫正处理后得到的各个子图像中的任意一个子图像,作为待识别图像。
其中,预设的图像形状可以设置为矩形等规则图形,但不局限于此。在实际应用过程中,为了便于进行后续图像处理操作,预设的图像形状通常设置为矩形,也就是说,将子图像的图像形状矫正为矩形。由于方式2中提取子图像的过程,与方式1中提取子图像的过程相同,在此不再赘述。
例如,参阅图8B所示,假设,预设的图像形状为矩形,子图像1中包含文本“Harvesting”,子图像1的图像形状为曲形,基于预设的图像形状,对提取出的子图像1进行形状矫正处理,将得到矫正处理后的子图像1,矫正处理后的子图像1为包含文本“Harvesting”的矩形图像,该矩形图像即可作 为待识别图像。
显然,本申请实施例中,通过对提取出的子图像进行多边形拟合,将曲型文本区域矫正为矩形文本区域,进而基于矩形文本区域进行后续的文本识别,可以检测识别出任意形状的文本区域,同时,可以进一步提高文本识别准确率。
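下述代码给出将四边形文本区域透视变换为矩形子图像的简化草图（仅示意矩形矫正这一步；曲形文本区域通常需先做多边形拟合并分段矫正，此处从略）：

```python
import cv2
import numpy as np

def rectify_quad(image: np.ndarray, quad: np.ndarray,
                 out_w: int, out_h: int) -> np.ndarray:
    """quad: 4x2的四边形顶点(按左上、右上、右下、左下排列),
    将其所围区域透视变换为 out_w x out_h 的矩形子图像。"""
    dst = np.array([[0, 0], [out_w - 1, 0],
                    [out_w - 1, out_h - 1], [0, out_h - 1]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(quad.astype(np.float32), dst)
    return cv2.warpPerspective(image, matrix, (out_w, out_h))
```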
接下来,分别对本申请实施例中涉及的目标分类模型、目标文本识别模型进行介绍:
第一、目标分类模型
参阅图9所示,目标分类模型可以包含目标特征提取网络、目标语种识别子模型、目标方向识别子模型,其中,目标特征提取网络可以采用但不限于卷积神经网络(Convolutional Neural Network,CNN),目标特征提取网络中包含S1、S2、S3、S4层,其中,S1层中包含深度(Depthwise,DW)卷积、常规卷积、激活函数、矩阵相乘、逐点(Pointwise,PW)卷积等操作,常规卷积也可以直接称为卷积。
首先,对目标语种识别子模型的训练过程进行介绍。
在一些实施例中,目标语种识别子模型可以通过以下操作训练得到:
基于获取的第一训练数据集,对初始识别模型中包含的初始语种识别子模型迭代进行模型训练,得到目标语种识别子模型,其中,参阅图10A所示,在一次迭代过程中,执行以下操作:
S1001、将训练数据x输入至初始语种识别子模型中,得到训练数据x对应的预测语种分布信息。训练数据x可以是第一训练数据集中包含的任意一个训练数据。
以训练数据为图像a为例,参阅图10B所示,图像a中包含文本“某某食品公司20周年”,将图像a输入至初始语种识别子模型中,得到图像a对应的预测语种分布信息,图像a对应的预测语种分布信息中包含中文和字符标识、以及中文和字符标识各自对应的预测文本位置信息,其中,中文对应的预测文本位置信息用于表征文本“某某食品公司0周年”的预测位置信息,字符标识对应的预测文本位置信息用于表征文本“2”的预测位置信息。
S1002、基于预测语种分布信息、以及训练数据x对应的真实语种分布信息,确定第一模型损失。
仍以训练数据为图像a为例,参阅图10B所示,图像a对应的真实语种分布信息中包含中文和字符标识、以及中文和字符标识各自对应的真实文本位置信息,其中,中文对应的真实文本位置信息用于表征文本“某某食品公司”和“周年”的真实位置信息,字符标识对应的真实文本位置信息用于表征文本“20”的真实位置信息。
S1003、基于第一模型损失,对初始语种识别子模型的模型参数进行调整。
通过上述实现方式,可以基于训练数据对应的预测语种分布信息和真实 语种分布信息对模型进行训练,从而提高模型的语种分类准确性,进而提高文本识别的准确性。
在实际应用过程中,图像中包含的文本可能并非完全属于单一种类,例如,日语和韩语中经常出现中文字符,又例如,拉丁语和符号标识经常与任意语种字符混合出现,为了准确地描述混合字符分布的情况,本申请实施例中,通过优化模型参数,使得预测的语种分布信息拟合软目标,软目标(soft target)是指统计每个文本字符串中各类字符出现的概率。具体的,参阅图10A所示,执行S1002时,可以采用但不限于以下操作:
S10021、基于预测语种分布信息,确定预测分布概率,预测分布概率中包含有各语种各自对应的预测概率,预测概率用于表征其对应的语种在各语种中的预测文本长度占比。
以训练数据为图像a为例,基于图像a对应的预测语种分布信息,确定预测分布概率,该预测分布概率中,中文的预测概率为90%,符号标识的预测概率为10%。
S10022、基于真实语种分布信息,确定真实分布概率,真实分布概率中包含各语种各自对应的真实概率,真实概率用于表征其对应的语种在各语种中的真实文本长度占比。
仍以训练数据为图像a为例,基于真实语种分布信息,确定真实分布概率,该真实分布概率中,中文的真实概率为80%,符号标识的真实概率为20%。
S10023、基于预测分布概率和真实分布概率,确定第一模型损失。
通过上述实现方式,可以有效地描述混合字符分布的情况,从而提高语种分布的预测准确率,进而提高多语种文本的识别准确率。
在一些实施例中,执行S10023时,第一模型损失可以采用交叉熵损失(Cross Entropy loss,CE loss),也可以采用KL散度损失(Kullback-Leibler divergence loss)。
为了描述预测分布概率和真实分布概率的相似度，本申请实施例中可以采用KL散度损失作为网络对语种预测的目标损失，KL散度损失的计算方式参阅公式(1)所示：
KL(P||Q) = Σ_x P(x)·log(P(x)/Q(x))    公式(1)
其中，P用于表征预测分布概率，Q用于表征真实分布概率，P(x)用于表征语种x对应的预测概率，Q(x)用于表征语种x对应的真实概率，x为各语种中的某一语种。
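以PyTorch为例，基于软目标（各语种真实文本长度占比）的KL散度损失可写成如下示意性草图（张量形状与归一化方式均为示例性假设；注意 F.kl_div 以真实分布为基准计算KL散度，与公式(1)的书写方向可能不同）：

```python
import torch
import torch.nn.functional as F

def language_kl_loss(pred_logits: torch.Tensor, soft_target: torch.Tensor) -> torch.Tensor:
    """pred_logits: [batch, 语种数]的预测输出;
    soft_target: [batch, 语种数]的真实分布概率(各语种真实文本长度占比, 每行和为1)。"""
    log_p = F.log_softmax(pred_logits, dim=-1)
    # F.kl_div(input=log P, target=Q) 计算 sum Q * (log Q - log P)
    return F.kl_div(log_p, soft_target, reduction="batchmean")
```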
接着,对目标方向识别子模型的获取过程进行介绍。
作为一种可能的实现方式,本申请实施例中,在一次迭代过程中,可以将训练数据x输入至初始方向识别子模型中,获得训练数据x的预测文本呈现方向,进而,基于训练数据x的预测文本呈现方向和真实文本呈现方向,确定第二模型损失,进而基于第二模型损失,对初始方向识别子模型的模型 参数进行调整。
作为另一种可能的实现方式,为了使模型更好地学习不同文本呈现方向的图像之间的差异性,本申请实施例中,可以在输入层获取不同文本呈现方向的图像,通过在模型训练中最大化不同文本呈现方向的图像的特征距离来增大这一差异性,让模型更好地学习对文本呈现方向的理解。具体的,目标方向识别子模型通过以下操作得到:
基于获取的第一训练数据集,对初始识别模型中包含的初始方向识别子模型迭代进行模型训练,得到目标方向识别子模型。
下面,仍以训练数据x为例对一次迭代过程进行说明,参阅图11A所示,在一次迭代过程中,执行以下操作:
S1101、获取训练数据x,并按照预设的图像旋转角度,对训练数据x进行旋转,得到对比数据y。
本申请实施例中,为了使模型更好地学习相反文字方向图像的差异性,即文本呈现方向0°和180°,以及文本呈现方向90°和270°,因此,预设的图像旋转角度可以设置为180°。需要说明的是,图像旋转角度可以根据实际应用场景设定,并不局限于180°。
以训练数据x为图像b为例,参阅图11B所示,图像b中包含韩语“你好”,获取图像b之后,并按照预设的图像旋转角度180°,对训练数据x进行旋转,得到对比数据y。
S1102、将训练数据x和对比数据y,分别输入至初始方向识别子模型中,得到训练数据x和对比数据y各自对应的预测文本呈现方向。
仍以训练数据x为图像b为例,将训练数据x输入至初始方向识别子模型中,得到训练数据x对应的预测文本呈现方向,训练数据x对应的预测文本呈现方向为0°,以及将对比数据y输入至初始方向识别子模型中,得到对比数据y对应的预测文本呈现方向,对比数据y对应的预测文本呈现方向为180°。
S1103、基于训练数据x和对比数据y各自对应的预测文本呈现方向,确定第二模型损失,并基于第二模型损失,对初始方向识别子模型的模型参数进行调整。
通过上述实现方式,可以使模型学习不同文本呈现方向的图像之间的差异性,从而提高不同文本呈现方向下文本的识别准确率。
在一些实施例中,第二模型损失可以采用模型预测损失或对比损失(contrast loss),也可以采用模型预测损失和对比损失的加权结果。模型预测损失可以是交叉熵损失或焦点损失(focal loss),但不局限于此,下面仅以交叉熵损失为例进行说明。
若第二模型损失采用交叉熵损失,则基于得到的各预测文本呈现方向,以及相应的真实文本呈现方向,计算交叉熵损失,并将计算出的交叉熵损失, 作为第二模型损失。
若第二模型损失采用对比损失,则基于得到的各预测文本呈现方向,计算各预测文本呈现方向之间的对比损失,并将计算出的对比损失,作为第二模型损失。
若第二模型损失采用交叉熵损失和对比损失的加权结果,则可以采用以下方式确定第二模型损失:
基于训练数据x和对比数据y各自对应的图像特征,确定对比损失;
基于训练数据x和对比数据y各自对应的真实文本呈现方向、以及训练数据x和对比数据y各自对应的预测文本呈现方向,确定训练数据x和对比数据y各自对应的交叉熵损失,即模型预测损失;
基于得到的各交叉熵损失、对比损失、以及交叉熵损失权重、对比损失权重,确定第二模型损失。
其中,对比损失的计算公式参阅公式(2)所示:
Closs = max(margin - d, 0)^2    公式(2)
其中,Closs表示对比损失,d用于表征训练数据x和对比数据y各自对应的图像特征之间的欧式距离,margin为设定的阈值,max()函数用于取最大值。
需要说明的是,训练数据x和对比数据y各自对应的图像特征是指对图像进行图像特征提取后得到的图像特征,图像特征也可以称为图像嵌入。
基于得到的各交叉熵损失、对比损失、以及交叉熵损失权重、对比损失权重,确定第二模型损失时,可以基于各交叉熵损失之和或者各交叉熵损失的平均值,采用交叉熵损失权重、对比损失权重,确定第二模型损失,但不局限于此。
仍以训练数据x为图像b为例,参阅图11C所示,基于训练数据x和对比数据y各自对应的图像特征,确定对比损失为0.2,然后,基于训练数据x对应的真实文本呈现方向和预测文本呈现方向,确定训练数据x对应的交叉熵损失为0.1,基于对比数据y对应的真实文本呈现方向和预测文本呈现方向,确定对比数据y对应的交叉熵损失为0.1,接着,假设交叉熵损失、对比损失各自对应的权重均为0.5,采用交叉熵损失权重和对比损失权重,对各交叉熵损失之和以及对比损失进行加权求和处理,确定第二模型损失为0.2。
显然,本申请实施例中,可以通过在模型训练中最大化不同文本呈现方向的图像在特征层的距离来增大这一差异性,让模型更好地学习对文本呈现方向的理解,从而提高模型对于文本呈现方向的识别准确率,进而提高文本识别的准确率。
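下述代码给出第二模型损失（交叉熵损失与对比损失加权求和）的示意性草图（margin 与各损失权重取值均为示例性假设）：

```python
import torch
import torch.nn.functional as F

def orientation_loss(feat_x, feat_y, logits_x, logits_y, label_x, label_y,
                     margin: float = 1.0, w_ce: float = 0.5, w_con: float = 0.5):
    """feat_x/feat_y: 训练数据与对比数据的图像特征;
    logits_*: 方向分类输出; label_*: 真实文本呈现方向类别。"""
    d = F.pairwise_distance(feat_x, feat_y)                    # 特征间欧式距离
    contrast = torch.clamp(margin - d, min=0.0).pow(2).mean()  # 对比损失 max(margin-d,0)^2
    ce = F.cross_entropy(logits_x, label_x) + F.cross_entropy(logits_y, label_y)
    return w_ce * ce + w_con * contrast                        # 第二模型损失
```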
需要说明的是,本申请实施例中,目标方向识别子模型和目标语种识别子模型可以包含于同一目标分类模型中,目标方向识别子模型和目标语种识别子模型也可以单独配置,以实现目标分类模型的不同功能,具体不再赘述。
最后,对目标特征提取网络的获取过程进行介绍。
对于一张包含文本的图像,对语种的判断可以依赖图像表观特征,但对文本呈现方向的判断涉及到对文本内容的识别和理解,单纯地基于图像表观特征并不能处理一些特殊图片。例如,参阅图12所示,图像c与图像d中均包含文字“codep”,图像c的文本呈现方向为180°,图像d的文本呈现方向为0°,但是,图像d对应的文本识别结果为“codep”,而图像c对应的文本识别结果为“dapos”,显然,文本识别有误。
为了辅助模型对文本呈现方向的学习,提升文本呈现方向分类的精度,本申请实施例中,引入一个与目标分类模型采用相同主干网络的多流识别模型作为预训练任务,多流识别模型用于实现多语言文本内容的识别,对多流识别模型训练完成后,将训练得到多流识别模型的主干网络,作为目标特征提取网络。进一步的,基于目标特征提取网络,训练初始语种识别子模型、初始方向识别子模型,得到目标语种识别子模型、目标方向识别子模型。
具体的,目标特征提取网络通过以下操作训练得到:
基于初始特征提取网络,构建预训练语种识别模型;
基于获取的第二训练数据集，对预训练语种识别模型进行迭代训练，得到目标特征提取网络。
需要说明的是,本申请实施例中,预训练语种识别模型也可以称为多流识别模型。
参阅图13所示,预训练语种识别模型中包含输入层、主干网络、时序模型、多流解码器和输出层,其中,主干网络与目标分类模型中的主干网络的结构相同,主干网络用于学习图像表观特征,时序模型用于学习文本的上下文信息,多流解码器包括但不限于中文、日语、韩语、拉丁语、泰语、阿拉伯语、印地语、符号标识各自对应的解码器。
其中,时序模型可以采用长短期记忆网络(Long Short-Term Memory,LSTM)。其中,xt表示t时刻的输入值,yt表示t时刻的输出值,σ表示门激活函数,LSTM中通过对细胞状态中信息遗忘和记忆新的信息,使得对后续时刻计算有用的信息得以传递,而无用的信息被丢弃,而前向的LSTM与反向的LSTM结合组成双向长短期记忆网络(BiLSTM)。
通过上述实现方式,可以显著提升模型收敛速度,同时,可以提高文本呈现方向和语种分布的分类精度,进而提高文本识别精度。
在一些实施例中，目标分类模型会在整个文字识别过程中引入额外的开销，一旦图像中的文本行数目较多，时间开销将成倍增大。为了最大化地降低目标分类模型的时间代价，本申请实施例中，可以对目标分类模型进行模型压缩裁剪。具体的，目标特征提取网络可以采用轻量级网络构架，轻量级网络构架可以采用但不限于mobilenetv2、mobilenetv3以及shufflenet等。此外，为了增强特征注意力，可以在目标特征提取网络中增加SE（squeeze-and-excitation）层，并在目标方向识别子模型和目标语种识别子模型中的至少一个中增加降维层，以进一步降低运算量。
为了进一步提高模型线上预测速度,本申请实施例中,可以采用pytorch量化感知训练(QAT)适配TensorRT int8进行8比特量化模型调优,从而在保证分类精度仅有微弱损失的情况下,提升线上图形处理器(Graphics Processing Unit,GPU)的预测速度。
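下述代码给出PyTorch eager模式下量化感知训练（QAT）的最小示意流程（模型结构与超参数均为示例性假设，实际部署还需导出并适配TensorRT int8）：

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """仅用于演示QAT流程的极简分类网络。"""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))   # 量化区间内的卷积+激活
        x = self.dequant(x)
        return self.fc(self.pool(x).flatten(1))

model = TinyClassifier().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model_prepared = torch.quantization.prepare_qat(model)
# 在此对 model_prepared 进行若干轮带伪量化的微调
model_int8 = torch.quantization.convert(model_prepared.eval())
```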
第二、目标文本识别模型
本申请实施例中,目标文本识别模型中包含图像特征编码器、时序模型、解码器,其中,解码器可以采用连接时序分类(Connectionist Temporal Classification,CTC)解码器和注意力机制(attention)实现多任务解码,注意力机制用于辅助CTC学习。图像特征编码器可以采用但不限于ResNet-50VD,时序模型可以采用但不限于双向LSTM,时序模型用于增强对图像中文本上下文信息的学习。由于文本识别模型的结构与多流识别模型的结构类似,在此不再赘述。
由于模型训练采用的真实数据和合成数据规模差异较大,例如,真实数据是万级,而合成数据是百万级,此外,由于不同于中英文识别任务,多语种识别标注数据极度稀缺,成本较高,数据检查困难,且需要专业的语言学者进行辅助标注,对标注者的能力要求较高,短时间无法获取大量的训练数据。
为了避免多语种识别标注数据的严重稀缺对文本识别性能产生影响,本申请实施例中,一方面,可以通过半监督学习(Semi-supervised Learning,SSL)来提升文本识别性能,SSL被广泛应用于图像分类任务中,可用于解决带标签的训练数据短缺的问题,另一方面,还可以通过数据生成来提升文本识别性能。
下面,先对半监督学习过程进行介绍。
具体的,本申请实施例中,目标文本识别模型通过以下操作训练得到:
获取第三训练数据集,第三训练数据集中包含各标注样本和各未标注样本;
基于各标注样本,对包含图像特征提取网络的第一文本识别模型进行训练,得到第二文本识别模型,并基于第二文本识别模型中包含的图像特征提取网络,构建图文距离识别模型;
基于各标注样本和各未标注样本、以及图文距离识别模型,对第二文本识别模型进行迭代训练,获得目标文本识别模型。
需要说明的是,本申请实施例中,第一训练数据集、第二训练数据集、第三训练数据集可以相同,也可以不同,对此不做限制。
其中,每个标注样本中包含对应的真实文本识别结果,每个未标注样本中不包含对应的真实文本识别结果。
第一文本识别模型是指未经训练的文本识别模型。图像特征提取网络可 以采用但不限于CNN,图文距离识别模型采用第二文本识别模型中包含的图像特征提取网络,作为自身的图像特征提取网络。
图文距离识别模型的输入是各标注样本和对应的样本标签,模型损失可以采用排序损失,目的是最小化同一对标注样本和样本标签的图像文本距离,最大化不同对标注样本和样本标签的图像文本距离。
例如,参阅图14所示,针对标注样本1,将标注样本1及其对应的样本标签输入至图文距离识别模型中,得到标注样本1对应的图像特征和样本标签对应的文本特征,通过优化模型参数,最小化标注样本1对应的图像特征与样本标签对应的文本特征之间的图像文本距离。
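下述代码给出图文距离识别模型排序损失的示意性草图（以批内余弦相似度为例；嵌入网络、距离度量与 margin 均为示例性假设）：

```python
import torch
import torch.nn.functional as F

def pair_ranking_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                      margin: float = 0.2) -> torch.Tensor:
    """img_emb/txt_emb: [B, D]的图像特征与文本特征, 第i行互为匹配对。
    目标: 最小化匹配对的图像文本距离, 最大化批内非匹配对的图像文本距离。"""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                    # [B, B]相似度矩阵, 对角线为匹配对
    pos = sim.diag().unsqueeze(1)          # 匹配对相似度
    loss = torch.clamp(margin + sim - pos, min=0.0)
    loss = loss - torch.diag(loss.diag())  # 去掉对角线上的自身项
    return loss.mean()
```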
具体的,参阅图15所示,对第二文本识别模型进行迭代训练时,可以针对各标注样本和各未标注样本,迭代执行以下操作:
S1501、获取N个标注样本和M个未标注样本,并将M个未标注样本分别输入至第二文本识别模型中,得到M个未标注样本各自对应的预测文本识别结果。
需要说明的是,本申请实施例中,N、M的取值均为正整数。N、M的取值可以相同,也可以不同。获取的N个标注样本和M个未标注样本可以称为一批图像数据(batch)。
以未标注样本2为例,参阅图14所示,将未标注样本2输入至第二文本识别模型中,得到未标注样本2对应的预测文本识别结果2,预测文本识别结果2为“BA CHU LON CON”。
S1502、将M个未标注样本及其对应的预测文本识别结果输入至图文距离识别模型中,获得M个未标注样本各自对应的图像文本距离,并基于获得的各图像文本距离,确定至少一个未标注样本对应的各模型子损失。
仍以未标注样本2为例,参阅图14所示,将未标注样本2和对应的预测文本识别结果2,输入至图文距离识别模型中,获得未标注样本2对应的图像文本距离D1,D1的取值为0.33。
具体的,基于获得的各图像文本距离,确定至少一个未标注样本对应的各模型子损失,包括:
基于获得的各图像文本距离,从M个未标注样本中,筛选出图像文本距离不大于预设距离阈值的未标注样本,并将筛选出的各未标注样本,作为各待增强样本;
获取各待增强样本各自对应的目标增强样本,将各待增强样本对应的预测文本识别结果,作为对应的目标增强样本的样本标签;
将获取的各目标增强样本分别输入至第二文本识别模型中,获得各目标增强样本各自对应的模型子损失,并将各目标增强样本各自对应的模型子损失,作为M个未标注样本对应的各模型子损失。
需要说明的是,本申请实施例中,目标增强样本是指通过旋转、翻转、 缩放、对比度变化、噪声扰动等数据增强方式,对相应的待增强样本进行数据增强后获取的样本数据。目标增强样本的样本标签也可以称为目标增强样本的伪标签。
仍以未标注样本2为例,参阅图14所示,假设,预设距离阈值为0.5,未标注样本2对应的图像文本距离D1不大于0.5,因此,可以将未标注样本2作为一个待增强样本。然后,获取未标注样本2对应的目标增强样本2,并将预测文本识别结果2为“BA CHU LON CON”作为目标增强样本2的伪标签。之后,将获取的目标增强样本2输入至第二文本识别模型中,得到目标增强样本2对应的预测文本识别结果,进而,根据该预测文本识别结果和伪标签,获得目标增强样本2对应的模型子损失。类似的,可以获取各目标增强样本各自对应的模型子损失,在M个未标注样本各自对应的图像文本距离均不大于预设距离阈值的情况下,可以得到M个未标注样本对应的各模型子损失。
S1503、将N个标注样本分别输入至第二文本识别模型中,获得N个标注样本各自对应的模型子损失。
需要说明的是,模型子损失可以采用交叉熵损失、焦点损失等。
以标注样本1为例,将标注样本1输入至第二文本识别模型中,得到标注样本1对应的预测文本识别结果,进而,根据该预测文本识别结果与标注样本1对应的样本标签之间的差异,获得标注样本1对应的模型子损失,标注样本1对应的模型子损失为1。
S1504、基于获得的各模型子损失,确定第三模型损失,并基于第三模型损失,对第二文本识别模型的模型参数进行调整。
本申请实施例中,在一个输入batch中,可以将N个标注样本和M个未标注样本各自对应的模型子损失的总和,作为第三模型损失。进而基于第三模型损失进行梯度反向传播,从而达到优化模型参数的目的。
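下述代码给出一次半监督迭代的示意性草图（model、dist_model、augment、criterion 等对象及其接口均为示例性假设，用于说明"预测、图文距离过滤、伪标签增强、联合计算损失"的流程）：

```python
import torch

def semi_supervised_step(model, dist_model, labeled_batch, unlabeled_imgs,
                         augment, criterion, dist_threshold: float = 0.5):
    imgs_l, labels_l = labeled_batch
    with torch.no_grad():
        pseudo_texts = model.predict(unlabeled_imgs)      # 未标注样本的预测文本识别结果
        dists = dist_model(unlabeled_imgs, pseudo_texts)  # 对应的图像文本距离
    keep = dists <= dist_threshold                        # 距离不大于阈值的样本作为待增强样本
    loss = criterion(model(imgs_l), labels_l)             # 标注样本对应的模型子损失
    if keep.any():
        aug_imgs = augment(unlabeled_imgs[keep])          # 目标增强样本
        pseudo_labels = [t for t, k in zip(pseudo_texts, keep) if k]  # 伪标签
        loss = loss + criterion(model(aug_imgs), pseudo_labels)       # 未标注样本对应的模型子损失
    return loss  # 第三模型损失, 用于梯度反向传播
```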
模型训练采用的真实数据和合成数据规模差异较大,例如,真实数据是万级,而合成数据是百万级,通过上述实现方式,一方面,通过真实数据和合成数据双流输入协同训练,可以避免真实数据被合成数据“淹没”,另一方面,基于半监督的学习方式,可以提升文本识别性能。
在一些特殊语种中,其字符通常包含元音附标和声调符号等,而这些字符通常位于基字字符的上面和下面,因此对这类语种的文本识别不仅需要考虑时序顺序,还要考虑二维空间信息。以泰语为例,参阅图16所示,泰语的字符中存在元音附标和声调符号,元音附标和调类符号通常位于基字字符的上面和下面。
为了提升此类语种的识别精度,在一些实施例中,目标文本识别模型中的解码器可以采用不规则文字识别方法(SAR)中的二维空间注意力解码器。二维空间注意力解码器的引入使得每一步的解码不再仅关注时序信息,而是 考虑了空间图像特征信息,对于一些不规则的具有空间分布性的文本的识别性能相对更好。
由于SAR在解码时每一步都要计算2D注意力权值,因此解码速度是CTC的近15倍,此外对于长文本,SAR的识别精度也容易受限,因此,在一些实施例中,目标文本识别模型中的解码器也可以采用CTC和SAR的双流解码结构,两路解码器共享LSTM的时序特征,解码时只预测CTC分支结果,SAR用于辅助CTC学习。
例如,参阅图17A所示,CNN模块可以采用31层的ResNet,进而得到特征图(feature map),之后,特征图经过基于LSTM的编码器-解码器框架(encoder-decoder framework),以及与解码器相连的2D attention模块,最终输出基于SAR识别的文本。特征图经过CTC解码器可以输出基于CTC识别的文本。
通过上述实现方式,图像经CNN后下采样至原高度的1/8,可以保留更多空间信息,提高了文本识别模型的识别精度,同时没有带来任何前向耗时开销。
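下述代码给出CTC+SAR双流解码训练损失的示意性草图（张量形状、空白符索引与损失权重均为示例性假设；推理时仅使用CTC分支的结果，SAR分支用于辅助学习）：

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def dual_decoder_loss(ctc_logits, ctc_targets, target_lengths,
                      sar_logits, sar_targets,
                      w_ctc: float = 1.0, w_sar: float = 1.0):
    """ctc_logits: [T, B, C]的时序输出; sar_logits: [B, L, C]的注意力解码输出;
    ctc_targets/target_lengths/sar_targets为对应的标签张量。"""
    T, B, _ = ctc_logits.shape
    log_probs = ctc_logits.log_softmax(dim=-1)
    input_lengths = torch.full((B,), T, dtype=torch.long)
    loss_ctc = ctc_criterion(log_probs, ctc_targets, input_lengths, target_lengths)
    loss_sar = F.cross_entropy(sar_logits.flatten(0, 1), sar_targets.flatten())
    return w_ctc * loss_ctc + w_sar * loss_sar
```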
本申请实施例在泰语识别模型中引入了基于SAR解码流程的二维注意力模块，实验数据表明，SAR相比CTC的识别精度提升了近7%。
在SAR的解码流程中，二维注意力模块的引入使得每一步的解码不再仅关注时序信息，而是考虑了空间图像特征信息，因此对于一些不规则的具有空间分布性的文本的识别性能相对更好。参阅图17B所示，在泰语识别模型中分别使用SAR和CTC作为解码器，其中，图像e中的泰语表征"你好"，图像f中的泰语表征"夏天"，CTC的识别结果中存在识别错误的字符，显然，SAR相比CTC的识别精度更高。
接下来,对数据生成过程进行介绍。
具体的,可以执行以下操作中的至少一种:
操作A:数据合成。
具体的,获取各文本语料、各字体格式和各背景图像,并基于各文本语料、各字体格式和各背景图像,合成各标注样本。
本申请实施例中,可以基于TextRenderer的架构进行数据合成,其输入为任意语种的文本语料、字体格式、背景图像,输出为合成的标注样本。
例如,参阅图18A所示,文本语料1为rberinMann,字体格式1为Robot.ttf、背景图像1为带纹理的背景,基于文本语料1、字体格式1和背景图像1,合成的标注样本如图18A所示。
为了进一步增加样本数量,参阅图18B所示,本申请实施例中,在合成标注样本的过程中,针对文本语料,可以对其字体尺寸、颜色、间隙、粗细等信息进行配置,还可以进行水平文本渲染或者竖直文本渲染。针对背景图像,可以对背景图像进行截取、图像增强变换等处理操作后,将处理后得到 的背景图像与文本语料进行叠加。作为一种示例,可以直接将叠加后得到的图像,作为合成的标注样本。作为另一种示例,针对叠加后得到的图像,可以执行以下操作中的一项或多项:泊松融合、透视变换、alpha通道图像叠加、图像高光增强、图像印花增强,图像增强、干扰、图像尺寸变换,其中,干扰包括但不限于模糊、噪声、水平线性叠加干扰。
通过上述实现方式,针对多语种数据的特点,能够生成包含中、日、韩、泰、越、俄、阿、拉丁、印等多语种的图像,同时支持水平和竖直文本图像生成,以及一些特殊语言,如阿拉伯或印地等(从右向左,特殊变形)文本图像的生成。另一方面,通过在数据合成中引入了高光、印花干扰,数据拼贴等操作,可以使合成图片更接近真实数据。
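下述代码给出"文本语料+字体格式+背景图像→标注样本"合成流程的简化草图（仅包含基本的文字叠加；泊松融合、透视变换、高光与印花增强等操作此处从略，字号与颜色范围均为示例性假设）：

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synthesize_sample(text: str, font_path: str, background: Image.Image) -> Image.Image:
    """在背景图像上按给定字体渲染文本语料, 得到一张合成的标注样本。"""
    bg = background.copy().convert("RGB")
    font = ImageFont.truetype(font_path, size=random.randint(24, 48))
    draw = ImageDraw.Draw(bg)
    x = random.randint(0, max(1, bg.width // 4))
    y = random.randint(0, max(1, bg.height // 2))
    color = tuple(random.randint(0, 120) for _ in range(3))  # 偏深的前景色
    draw.text((x, y), text, font=font, fill=color)
    return bg
```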
操作B:数据风格迁移。
具体的,获取各文本语料,并将各文本语料,分别输入至目标数据风格迁移模型中,得到各标注样本。
参阅图19所示,目标数据风格迁移模型中包含文本转移模块、背景提取模块和文本融合模块,目标数据风格迁移模型采用生成对抗网络(Generative Adversarial Networks,GAN)。其中,将文本语料和目标文本风格的图像输入至目标数据风格迁移模型中后,文本语料经过文本转移模块的处理,获取目标文本风格中的目标文本字体对应的文本语料Osk,目标文本风格经文本转移模块和背景提取模块,输出目标文本风格中包含的图像背景Ob,此外,目标文本风格经文本转移模块,还可以输出包含原图像背景的目标文本字体对应的文本语料Ot,进而,生成相应的标注样本。图19中还包含LT、LB和LF,其中,LT表示文本转移模块的模型损失,LB表示背景提取模块的模型损失,LF表示文本融合模块的模型损失。
参阅图20所示,当文本语料为“requires”、目标文本风格如图像g所示时,将文本语料“requires”和目标文本风格输入至目标数据风格迁移模型中,得到对应的标注样本,当文本语料为“crisis”、目标文本风格如图像h所示时,将文本语料“crisis”和目标文本风格输入至目标数据风格迁移模型中,得到对应的标注样本,当文本语料为“beyond”、目标文本风格如图像i所示时,将文本语料“beyond”和目标文本风格输入至目标数据风格迁移模型中,得到对应的标注样本。
由于GAN的模型训练不够稳定,一方面,本申请实施例中,可以采用SN-GAN中的Hinge损失,替换原始的对抗损失以稳定训练过程,避免梯度的大幅度震荡。另一方面,在计算生成的标注样本和目标文本风格的L1损失时,本申请实施例中,可以采用文本蒙版区域加权的L1损失,从而减少模型对背景的过度学习,增强对文字区域像素的约束。在测试阶段,采用已训练的数据风格迁移模型对输入的文本预料和目标文本风格的图像进行风格迁移学习,从而生成更多地接近真实数据的识别数据。
通过上述实现方式,通过数据风格迁移,学习真实数据的字体风格,一方面,可以进一步解决合成数据与真实数据之间的域差异性,另一方面,后续可以利用真实数据,对合成得到的数据进行风格迁移,也可以利用真实数据对其他真实数据进行风格迁移,从而增加样本多样性。
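下述代码给出Hinge对抗损失与文本蒙版区域加权L1损失的示意性草图（文字区域权重 w_text 为示例性假设）：

```python
import torch

def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """判别器的Hinge损失(SN-GAN常用形式)。"""
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """生成器的Hinge损失。"""
    return -d_fake.mean()

def masked_l1_loss(fake: torch.Tensor, real: torch.Tensor,
                   text_mask: torch.Tensor, w_text: float = 5.0) -> torch.Tensor:
    """文本蒙版区域加权的L1损失: 文字区域像素的权重高于背景, 约束生成结果的文字细节。"""
    weight = 1.0 + (w_text - 1.0) * text_mask
    return (weight * (fake - real).abs()).mean()
```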
需要说明的是,本申请实施例中,基于半监督的文本识别模型中,还可以采用强化学习(DQN)对模型进行训练,其中,文本识别模型作为DQN中的代理(agent)部分,图像和预测文本可以作为DQN中的环境(environment)部分,反馈(reward)则可以用图像文本距离和编辑距离奖励来表示。
本申请实施例中提出的文本语种/方向分类网络(LOPN)能够快速准确地对一张文本行图像进行语种分布预测和方向判断,在设定测试集上的实验结果如表1所示。从表1中可以看出,利用软目标概率对语种分布进行建模,采用KL散度损失,对语种分类的精度有近7%的性能增益。同时,采用多流文本识别模型预训练后的分类模型在语种分类和方向分类上都有了较大的提升,识别任务的引入强化了模型对图像文本的理解,弥补了仅靠图像表观特征进行分类的不足。再者,在方向分类中加入双流数据监督,帮助模型更好地区分图像文字的方向,提升了方向分类的性能。
表1 LOPN测试集精度
本申请实施例中采用TensorRT对上述分类网络模型进行部署,线上采用NVIDIA-T4GPU机型,原始模型和量化模型的预测速度对比结果如表2所示。
表2 LOPN TensorRT模型预测精度速度对比
参阅表3所示,本申请实施例中的文本识别模型融合了数据合成优化和半监督训练,在设定语种测试集上取得了较高的识别性能,同时与开源模型相比,识别精度大幅度超越现有开源模型。
具体的，可以采用归一化编辑距离（NED）和序列精度（SeqACC）作为识别任务评价指标，NED可以采用公式(3)计算得到，SeqACC可以采用公式(4)计算得到：
NED = (1/N)·Σ_{i=1..N} D(s_i, ŝ_i) / max(|s_i|, |ŝ_i|)    公式(3)
SeqACC = (1/N)·Σ_{i=1..N} |s_i = ŝ_i|_+    公式(4)
在公式(3)和公式(4)中，D表示莱温斯坦距离，s_i表示预测文本（即文本识别结果），ŝ_i表示真实值，N表示待识别图像的总数目，|x|_+表示统计操作，当x的值为真时加1。
表3 多语种文本识别模型精度
显然,通过表3可知,在不同的语种数据集训练中,本申请实施例中采用的多流数据训练、多任务解码、半监督文本识别训练、数据风格迁移和强化的数据增强均对文本识别任务带来了有效的性能增益,具有较高的实用性和普适性。
参阅表4所示,其为模型采用CTC解码器、采用SAR解码器、以及采用CTC+SAR双流解码器三种情况下,泰语识别精度对比结果,显然,采用CTC+SAR双流解码器时,识别速度和NED两项评估指标的表现最佳。
表4 泰语识别精度对比
基于相同的发明构思,本申请实施例提供一种文本识别装置。如图21所示,其为文本识别装置2100的结构示意图,可以包括:
图像分类单元2101,用于将包含文本的待识别图像输入至目标分类模型中,获得所述待识别图像的语种分布信息和原始文本呈现方向,其中,所述语种分布信息中包含所述文本对应的多个语种、以及所述多个语种各自对应的文本位置信息;
图像矫正单元2102,用于基于所述原始文本呈现方向和预设的目标文本呈现方向,对所述待识别图像进行矫正,得到目标识别图像;
文本定位单元2103,用于基于所述多个语种各自对应的文本位置信息,在所述目标识别图像中,确定所述多个语种各自对应的文本区域图像集;
图像识别单元2104,用于基于所述多个语种各自对应的文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到所述待识别图像对应的文本识别结果。
作为一种可能的实现方式,所述图像识别单元2104具体用于:
针对所述多个语种中每个语种对应的文本区域图像集,将所述文本区域图像集输入至所述语种关联的目标文本识别模型中,获得所述文本区域图像集对应的文本识别子结果;
基于各个所述文本区域图像集各自对应的文本识别子结果,得到所述待识别图像对应的文本识别结果。
作为一种可能的实现方式,文本识别装置2100还包含模型训练单元2105,所述目标分类模型中包含目标语种识别子模型,则模型训练单元2105用于:
基于获取的第一训练数据集,对初始识别模型中包含的初始语种识别子模型迭代进行模型训练,得到所述目标语种识别子模型,其中,在一次迭代过程中,执行以下操作:
将所述第一训练数据集中包含的训练数据,输入至所述初始语种识别子模型中,得到所述训练数据对应的预测语种分布信息;
基于所述预测语种分布信息、以及所述训练数据对应的真实语种分布信息,确定第一模型损失,并基于所述第一模型损失,对所述初始语种识别子模型的模型参数进行调整。
作为一种可能的实现方式,所述基于所述预测语种分布信息、以及所述训练数据对应的真实语种分布信息,确定第一模型损失时,模型训练单元2105具体用于:
基于所述预测语种分布信息,确定预测分布概率,所述预测分布概率中包含有各语种各自对应的预测概率,所述预测概率用于表征其对应的语种在所述各语种中的预测文本长度占比;
基于所述真实语种分布信息,确定真实分布概率,所述真实分布概率中包含各语种各自对应的真实概率,所述真实概率用于表征其对应的语种在所述各语种中的真实文本长度占比;
基于所述预测分布概率和所述真实分布概率,确定所述第一模型损失。
作为一种可能的实现方式,文本识别装置2100还包含模型训练单元2105,所述目标分类模型中包含目标方向识别子模型,则模型训练单元2105用于:
基于获取的第一训练数据集,对初始识别模型中包含的初始方向识别子模型迭代进行模型训练,输出所述目标方向识别子模型,其中,在一次迭代过程中,执行以下操作:
获取所述第一训练数据集中包含的训练数据,并按照预设的图像旋转角度,对所述训练数据进行旋转,得到所述训练数据对应的对比数据;
将所述训练数据和所述对比数据,分别输入至所述初始方向识别子模型中,得到所述训练数据和所述对比数据各自对应的预测文本呈现方向;
基于所述训练数据和所述对比数据各自对应的预测文本呈现方向，确定第二模型损失，并基于所述第二模型损失，对所述初始方向识别子模型的模型参数进行调整。
作为一种可能的实现方式,所述基于所述训练数据和所述对比数据各自对应的预测文本呈现方向,确定第二模型损失时,模型训练单元2105具体用于:
基于所述训练数据和所述对比数据各自对应的图像特征,确定对比损失;
基于所述训练数据和所述对比数据各自对应的真实文本呈现方向、以及所述训练数据和所述对比数据各自对应的预测文本呈现方向,确定所述训练数据和所述对比数据各自对应的模型预测损失;
基于得到的各模型预测损失、所述对比损失、以及模型预测损失权重、对比损失权重,确定第二模型损失。
作为一种可能的实现方式,所述目标分类模型中还包含目标特征提取网络,模型训练单元2105还用于:
基于初始特征提取网络,构建预训练语种识别模型;
基于获取的第二训练数据集,对所述预训练识别模型进行迭代训练,得到所述目标特征提取网络。
作为一种可能的实现方式,所述将包含文本的待识别图像输入至目标分类模型中,获得所述待识别图像的语种分布信息和原始文本呈现方向之前,图像分类单元2101还用于:
获取原始图像,并从所述原始图像中,提取出至少一个包含文本的子图像;
基于预设的图像形状,对提取出的各个子图像分别进行形状矫正处理,并将矫正处理后得到的各个子图像中的任意一个子图像,作为所述待识别图像。
作为一种可能的实现方式,模型训练单元2105还用于:
获取第三训练数据集,所述第三训练数据集中包含各标注样本和各未标注样本
基于所述各标注样本,对包含图像特征提取网络的第一文本识别模型进行训练,得到第二文本识别模型,并基于所述第二文本识别模型中包含的图像特征提取网络,构建图文距离识别模型;
基于各标注样本和各未标注样本、以及所述图文距离识别模型,对所述第二文本识别模型进行迭代训练,获得所述目标文本识别模型。
作为一种可能的实现方式,所述基于各标注样本和各未标注样本、以及所述图文距离识别模型,对所述第二文本识别模型进行迭代训练,获得所述目标文本识别模型时,模型训练单元2105具体用于:
针对所述各标注样本和所述各未标注样本,迭代执行以下操作:
获取至少一个标注样本和至少一个未标注样本,并将所述至少一个未标注样本分别输入至所述第二文本识别模型中,得到所述至少一个未标注样本各自对应的预测文本识别结果;
将所述至少一个未标注样本及其对应的预测文本识别结果输入至所述图文距离识别模型中,获得所述至少一个未标注样本各自对应的图像文本距离,并基于获得的各图像文本距离,确定所述至少一个未标注样本对应的各模型子损失;
将所述至少一个标注样本分别输入至所述第二文本识别模型中,获得所述至少一个标注样本各自对应的模型子损失;
基于获得的各模型子损失,确定第三模型损失,并基于所述第三模型损失,对所述第二文本识别模型的模型参数进行调整。
作为一种可能的实现方式,所述基于获得的各图像文本距离,确定所述至少一个未标注样本对应的各模型子损失时,模型训练单元2105具体用于:
基于获得的各图像文本距离,从所述至少一个未标注样本中,筛选出图像文本距离不大于预设距离阈值的未标注样本,并将筛选出的各未标注样本,作为各待增强样本;
获取所述各待增强样本各自对应的目标增强样本,并将所述各待增强样本对应的预测文本识别结果,作为对应的目标增强样本的样本标签;
将获取的各目标增强样本分别输入至所述第二文本识别模型中,获得所述各目标增强样本各自对应的模型子损失,并将所述各目标增强样本各自对应的模型子损失,作为所述至少一个未标注样本对应的各模型子损失。
作为一种可能的实现方式,模型训练单元2105还用于执行以下至少一种操作:
获取各文本语料、各字体格式和各背景图像,基于所述各文本语料、各字体格式和各背景图像,合成所述各标注样本;
获取各文本语料,并将所述各文本语料,分别输入至目标数据风格迁移模型中,得到所述各标注样本。
为了描述的方便,以上各部分按照功能划分为各模块(或单元)分别描 述。当然,在实施本申请时可以把各模块(或单元)的功能在同一个或多个软件或硬件中实现。
关于上述实施例中的装置,其中各个单元执行请求的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
所属技术领域的技术人员能够理解,本申请的各个方面可以实现为系统、方法或程序产品。因此,本申请的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。
基于相同的发明构思,本申请实施例还提供一种电子设备。在一种实施例中,该电子设备可以是服务器,也可以是终端设备。参阅图22所示,其为本申请实施例中提供的一种可能的电子设备的结构示意图,图22中,电子设备2200包括:处理器2210和存储器2220。
其中,存储器2220存储有可被处理器2210执行的计算机程序,处理器2210通过执行存储器2220存储的指令,可以执行上述文本识别方法的步骤。
存储器2220可以是易失性存储器(volatile memory),例如随机存取存储器(random-access memory,RAM);存储器2220也可以是非易失性存储器(non-volatile memory),例如只读存储器(Read-Only Memory,ROM),快闪存储器(flash memory),硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);或者存储器2220是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器2220也可以是上述存储器的组合。
处理器2210可以包括一个或多个中央处理单元(central processing unit,CPU)或者为数字处理单元等等。处理器2210,用于执行存储器2220中存储的计算机程序时实现上述文本识别方法。
在一些实施例中,处理器2210和存储器2220可以在同一芯片上实现,在一些实施例中,它们也可以在独立的芯片上分别实现。
本申请实施例中不限定上述处理器2210和存储器2220之间的具体连接介质。本申请实施例中以处理器2210和存储器2220之间通过总线连接为例,总线在图22中以粗线描述,其它部件之间的连接方式,仅是进行示意性说明,并不引以为限。总线可以分为地址总线、数据总线、控制总线等。为便于描述,图22中仅用一条粗线描述,但并不描述仅有一根总线或一种类型的总线。
基于同一发明构思,本申请实施例提供了一种计算机可读存储介质,其包括计算机程序,当计算机程序在电子设备上运行时,计算机程序用于使电子设备执行上述文本识别方法的步骤。在一些可能的实施方式中,本申请提供的文本识别方法的各个方面还可以实现为一种程序产品的形式,其包括计算机程序,当程序产品在电子设备上运行时,计算机程序用于使电子设备执行上述文本识别方法中的步骤,例如,电子设备可以执行如图2中所示的步 骤。
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以是但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、RAM、ROM、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(Compact Disk Read Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
本申请的实施方式的程序产品可以采用CD-ROM并包括计算机程序,并可以在电子设备上运行。然而,本申请的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储计算机程序的有形介质,该计算机程序可以被命令执行系统、装置或者器件使用或者与其结合使用。
显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (16)

  1. 一种文本识别方法,由电子设备执行,所述方法包括:
    将包含文本的待识别图像输入至目标分类模型中,获得所述待识别图像的语种分布信息和原始文本呈现方向,其中,所述语种分布信息中包含所述文本对应的多个语种、以及所述多个语种各自对应的文本位置信息;
    基于所述原始文本呈现方向和预设的目标文本呈现方向,对所述待识别图像进行矫正,得到目标识别图像;
    基于所述多个语种各自对应的文本位置信息,在所述目标识别图像中,确定所述多个语种各自对应的文本区域图像集;
    基于所述多个语种各自对应的文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到所述待识别图像对应的文本识别结果。
  2. 如权利要求1所述的方法,所述基于所述多个语种各自对应的文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到所述待识别图像对应的文本识别结果,包括:
    针对所述多个语种中每个语种对应的文本区域图像集,将所述文本区域图像集输入至所述语种关联的目标文本识别模型中,获得所述文本区域图像集对应的文本识别子结果;
    基于各个所述文本区域图像集各自对应的文本识别子结果,得到所述待识别图像对应的文本识别结果。
  3. 如权利要求1所述的方法,所述目标分类模型中包含目标语种识别子模型,所述目标语种识别子模型通过以下操作得到:
    基于获取的第一训练数据集,对初始识别模型中包含的初始语种识别子模型迭代进行模型训练,得到所述目标语种识别子模型,其中,在一次迭代过程中,执行以下操作:
    将所述第一训练数据集中包含的训练数据,输入至所述初始语种识别子模型中,得到所述训练数据对应的预测语种分布信息;
    基于所述预测语种分布信息、以及所述训练数据对应的真实语种分布信息,确定第一模型损失,并基于所述第一模型损失,对所述初始语种识别子模型的模型参数进行调整。
  4. 如权利要求3所述的方法,所述基于所述预测语种分布信息、以及所述训练数据对应的真实语种分布信息,确定第一模型损失,包括:
    基于所述预测语种分布信息,确定预测分布概率,所述预测分布概率中包含有各语种各自对应的预测概率,所述预测概率用于表征其对应的语种在所述各语种中的预测文本长度占比;
    基于所述真实语种分布信息,确定真实分布概率,所述真实分布概率中包含各语种各自对应的真实概率,所述真实概率用于表征其对应的语种在所述各语种中的真实文本长度占比;
    基于所述预测分布概率和所述真实分布概率,确定所述第一模型损失。
  5. 如权利要求1-4中任一项所述的方法,所述目标分类模型中包含目标方向识别子模型,所述目标方向识别子模型通过以下操作得到:
    基于获取的第一训练数据集,对初始识别模型中包含的初始方向识别子模型迭代进行模型训练,得到所述目标方向识别子模型,其中,在一次迭代过程中,执行以下操作:
    获取所述第一训练数据集中包含的训练数据,并按照预设的图像旋转角度,对所述训练数据进行旋转,得到所述训练数据对应的对比数据;
    将所述训练数据和所述对比数据,分别输入至所述初始方向识别子模型中,得到所述训练数据和所述对比数据各自对应的预测文本呈现方向;
    基于所述训练数据和所述对比数据各自对应的预测文本呈现方向，确定第二模型损失，并基于所述第二模型损失，对所述初始方向识别子模型的模型参数进行调整。
  6. 如权利要求5所述的方法,所述基于所述训练数据和所述对比数据各自对应的预测文本呈现方向,确定第二模型损失,包括:
    基于所述训练数据和所述对比数据各自对应的图像特征,确定对比损失;
    基于所述训练数据和所述对比数据各自对应的真实文本呈现方向、以及所述训练数据和所述对比数据各自对应的预测文本呈现方向,确定所述训练数据和所述对比数据各自对应的模型预测损失;
    基于各所述模型预测损失、所述对比损失、以及模型预测损失权重、对比损失权重,确定所述第二模型损失。
  7. 如权利要求5所述的方法,所述目标分类模型中还包含目标特征提取网络,所述目标特征提取网络通过以下操作训练得到:
    基于初始特征提取网络,构建预训练语种识别模型;
    基于获取的第二训练数据集,对所述预训练识别模型进行迭代训练,得到所述目标特征提取网络。
  8. 如权利要求1-4中任一项所述的方法,所述将包含文本的待识别图像输入至目标分类模型中,获得所述待识别图像的语种分布信息和原始文本呈现方向之前,还包括:
    获取原始图像,并从所述原始图像中,提取出至少一个包含文本的子图像;
    基于预设的图像形状,对提取出的各个所述子图像分别进行形状矫正处理,并将矫正处理后得到的各个子图像中的任意一个子图像,作为所述待识别图像。
  9. 如权利要求1-4中任一项所述的方法,所述目标文本识别模型通过以下操作训练得到:
    获取第三训练数据集,所述第三训练数据集中包含各标注样本和各未标 注样本;
    基于所述各标注样本,对包含图像特征提取网络的第一文本识别模型进行训练,得到第二文本识别模型,并基于所述第二文本识别模型中包含的图像特征提取网络,构建图文距离识别模型;
    基于所述各标注样本和所述各未标注样本、以及所述图文距离识别模型,对所述第二文本识别模型进行迭代训练,获得所述目标文本识别模型。
  10. 如权利要求9所述的方法,所述基于所述各标注样本和所述各未标注样本、以及所述图文距离识别模型,对所述第二文本识别模型进行迭代训练,获得所述目标文本识别模型,包括:
    针对所述各标注样本和所述各未标注样本,迭代执行以下操作:
    获取至少一个标注样本和至少一个未标注样本,并将所述至少一个未标注样本分别输入至所述第二文本识别模型中,得到所述至少一个未标注样本各自对应的预测文本识别结果;
    将所述至少一个未标注样本及其对应的预测文本识别结果输入至所述图文距离识别模型中,获得所述至少一个未标注样本各自对应的图像文本距离,并基于获得的各图像文本距离,确定所述至少一个未标注样本对应的各模型子损失;
    将所述至少一个标注样本分别输入至所述第二文本识别模型中,获得所述至少一个标注样本各自对应的模型子损失;
    基于获得的各所述模型子损失,确定第三模型损失,并基于所述第三模型损失,对所述第二文本识别模型的模型参数进行调整。
  11. 如权利要求10所述的方法,所述基于获得的各图像文本距离,确定所述至少一个未标注样本对应的各模型子损失,包括:
    基于获得的各图像文本距离,从所述至少一个未标注样本中,筛选出图像文本距离不大于预设距离阈值的未标注样本,并将筛选出的各未标注样本,作为各待增强样本;
    获取所述各待增强样本各自对应的目标增强样本,并将所述各待增强样本对应的预测文本识别结果,作为对应的目标增强样本的样本标签;
    将获取的各目标增强样本分别输入至所述第二文本识别模型中,获得所述各目标增强样本各自对应的模型子损失,并将所述各目标增强样本各自对应的模型子损失,作为所述至少一个未标注样本对应的各模型子损失。
  12. 如权利要求9所述的方法,所述各标注样本采用以下至少一种操作得到:
    获取各文本语料、各字体格式和各背景图像,基于所述各文本语料、各字体格式和各背景图像,合成所述各标注样本;
    获取各文本语料,并将所述各文本语料,分别输入至目标数据风格迁移模型中,得到所述各标注样本。
  13. 一种文本识别装置,包括:
    图像分类单元,用于将包含文本的待识别图像输入至目标分类模型中,获得所述待识别图像的语种分布信息和原始文本呈现方向,其中,所述语种分布信息中包含所述文本对应的多个语种、以及所述多个语种各自对应的文本位置信息;
    图像矫正单元,用于基于所述原始文本呈现方向和预设的目标文本呈现方向,对所述待识别图像进行矫正,得到目标识别图像;
    文本定位单元,用于基于所述多个语种各自对应的文本位置信息,在所述目标识别图像中,确定所述多个语种各自对应的文本区域图像集;
    图像识别单元,用于基于所述多个语种各自对应的文本区域图像集,分别采用对应语种关联的目标文本识别模型进行处理,得到所述待识别图像对应的文本识别结果。
  14. 一种电子设备,其包括处理器和存储器,其中,所述存储器存储有计算机程序,当所述计算机程序被所述处理器执行时,使得所述处理器执行权利要求1~12中任一所述方法的步骤。
  15. 一种计算机可读存储介质,其包括计算机程序,当所述计算机程序在电子设备上运行时,所述计算机程序用于使所述电子设备执行权利要求1~12中任一所述方法的步骤。
  16. 一种计算机程序产品,其包括计算机程序,所述计算机程序存储在计算机可读存储介质中,电子设备的处理器从所述计算机可读存储介质读取并执行所述计算机程序,使得所述电子设备执行权利要求1~12中任一项所述方法的步骤。
PCT/CN2023/076411 2022-04-18 2023-02-16 文本识别方法及相关装置 WO2023202197A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210402933.8A CN114596566B (zh) 2022-04-18 2022-04-18 文本识别方法及相关装置
CN202210402933.8 2022-04-18

Publications (1)

Publication Number Publication Date
WO2023202197A1 true WO2023202197A1 (zh) 2023-10-26

Family

ID=81813293

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/076411 WO2023202197A1 (zh) 2022-04-18 2023-02-16 文本识别方法及相关装置

Country Status (2)

Country Link
CN (1) CN114596566B (zh)
WO (1) WO2023202197A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780276B (zh) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 一种结合文本分类的文本识别方法及系统
CN114596566B (zh) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 文本识别方法及相关装置
CN114998897B (zh) * 2022-06-13 2023-08-29 北京百度网讯科技有限公司 生成样本图像的方法以及文字识别模型的训练方法
CN114758339B (zh) * 2022-06-15 2022-09-20 深圳思谋信息科技有限公司 字符识别模型的获取方法、装置、计算机设备和存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569830A (zh) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 多语言文本识别方法、装置、计算机设备及存储介质
CN111488826A (zh) * 2020-04-10 2020-08-04 腾讯科技(深圳)有限公司 一种文本识别方法、装置、电子设备和存储介质
CN112101367A (zh) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 文本识别方法、图像识别分类方法、文档识别处理方法
WO2021081562A2 (en) * 2021-01-20 2021-04-29 Innopeak Technology, Inc. Multi-head text recognition model for multi-lingual optical character recognition
CN113780276A (zh) * 2021-09-06 2021-12-10 成都人人互娱科技有限公司 一种结合文本分类的文本检测和识别方法及系统
CN114596566A (zh) * 2022-04-18 2022-06-07 腾讯科技(深圳)有限公司 文本识别方法及相关装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428717B (zh) * 2020-03-26 2024-04-26 京东方科技集团股份有限公司 文本识别方法、装置、电子设备及计算机可读存储介质
US11495014B2 (en) * 2020-07-22 2022-11-08 Optum, Inc. Systems and methods for automated document image orientation correction
CN111898696B (zh) * 2020-08-10 2023-10-27 腾讯云计算(长沙)有限责任公司 伪标签及标签预测模型的生成方法、装置、介质及设备
CN112508015A (zh) * 2020-12-15 2021-03-16 山东大学 一种铭牌识别方法、计算机设备、存储介质
CN113537187A (zh) * 2021-01-06 2021-10-22 腾讯科技(深圳)有限公司 文本识别方法、装置、电子设备及可读存储介质
CN112926684B (zh) * 2021-03-29 2022-11-29 中国科学院合肥物质科学研究院 一种基于半监督学习的文字识别方法
CN113919330A (zh) * 2021-10-14 2022-01-11 携程旅游信息技术(上海)有限公司 语种识别方法、信息分发方法以及设备、介质
CN114330483A (zh) * 2021-11-11 2022-04-12 腾讯科技(深圳)有限公司 数据处理方法及模型训练方法、装置、设备、存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569830A (zh) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 多语言文本识别方法、装置、计算机设备及存储介质
CN111488826A (zh) * 2020-04-10 2020-08-04 腾讯科技(深圳)有限公司 一种文本识别方法、装置、电子设备和存储介质
CN112101367A (zh) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 文本识别方法、图像识别分类方法、文档识别处理方法
WO2021081562A2 (en) * 2021-01-20 2021-04-29 Innopeak Technology, Inc. Multi-head text recognition model for multi-lingual optical character recognition
CN113780276A (zh) * 2021-09-06 2021-12-10 成都人人互娱科技有限公司 一种结合文本分类的文本检测和识别方法及系统
CN114596566A (zh) * 2022-04-18 2022-06-07 腾讯科技(深圳)有限公司 文本识别方法及相关装置

Also Published As

Publication number Publication date
CN114596566A (zh) 2022-06-07
CN114596566B (zh) 2022-08-02

Similar Documents

Publication Publication Date Title
WO2023202197A1 (zh) 文本识别方法及相关装置
US11645826B2 (en) Generating searchable text for documents portrayed in a repository of digital images utilizing orientation and text prediction neural networks
US11113599B2 (en) Image captioning utilizing semantic text modeling and adversarial learning
Borisyuk et al. Rosetta: Large scale system for text detection and recognition in images
CN108288078B (zh) 一种图像中字符识别方法、装置和介质
US11544474B2 (en) Generation of text from structured data
WO2020114429A1 (zh) 关键词提取模型训练方法、关键词提取方法及计算机设备
Diaz et al. Rethinking text line recognition models
US20210209297A1 (en) Table detection in spreadsheet
CN113205047B (zh) 药名识别方法、装置、计算机设备和存储介质
Inunganbi et al. Handwritten Meitei Mayek recognition using three‐channel convolution neural network of gradients and gray
CN111008624A (zh) 光学字符识别方法和产生光学字符识别的训练样本的方法
CN113269009A (zh) 图像中的文本识别
Sharma et al. [Retracted] Optimized CNN‐Based Recognition of District Names of Punjab State in Gurmukhi Script
Ma et al. Modal contrastive learning based end-to-end text image machine translation
CN116306906A (zh) 一种翻译模型训练方法、语音翻译方法及相关设备
CN113807326B (zh) 制式表格文字识别方法和装置
US20230036812A1 (en) Text Line Detection
CN110222693B (zh) 构建字符识别模型与识别字符的方法和装置
Xie et al. Enhancing multimodal deep representation learning by fixed model reuse
Asadi-zeydabadi et al. IDPL-PFOD2: A New Large-Scale Dataset for Printed Farsi Optical Character Recognition
CN114399782B (zh) 文本图像处理方法、装置、设备、存储介质及程序产品
US11853393B2 (en) Method and system for generating synthetic documents for layout recognition and information retrieval
CN114758339B (zh) 字符识别模型的获取方法、装置、计算机设备和存储介质
CN118015644B (zh) 基于图片和文字的社交媒体关键词数据分析方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23790860

Country of ref document: EP

Kind code of ref document: A1