CN111488826B - Text recognition method and device, electronic equipment and storage medium

Info

Publication number
CN111488826B
Authority
CN
China
Prior art keywords
text
language
image
identified
classification model
Prior art date
Legal status
Active
Application number
CN202010278046.5A
Other languages
Chinese (zh)
Other versions
CN111488826A (en)
Inventor
王洪振
黄珊
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010278046.5A
Publication of CN111488826A
Application granted
Publication of CN111488826B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G06V30/413 - Classification of content, e.g. text, photographs or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The application relates to the technical field of computer vision and provides a text recognition method, a device, electronic equipment and a storage medium for recognizing texts of different languages in an image and improving recognition accuracy. The method comprises the following steps: extracting features of an image to be identified to obtain a feature map corresponding to the image to be identified, wherein the image to be identified contains text of at least one language category; according to the feature map, performing position detection on the text in the image to be identified to obtain position information of the text, and performing language identification on the text to obtain language information of the text; and recognizing the text in the image to be identified based on the obtained position information and language information. Because language prediction is performed on each text in the image to be identified, the user does not need to specify a language; each text is flexibly recognized with the recognition method corresponding to its predicted language, so the recognition accuracy is higher.

Description

Text recognition method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computer vision, in particular to the technical field of machine learning, and provides a text recognition method, a text recognition device, electronic equipment and a storage medium.
Background
OCR (Optical Character Recognition) is an important research hotspot in the field of computer vision. It involves two specific tasks: text detection and text recognition. Both are indispensable, and text detection in particular is a precondition for the overall task: it aims at locating the positions where text appears, so that a recognition algorithm can then be used to acquire the text information.
Due to the variety of languages, the text to be detected may involve many different languages, possibly several at once, and sometimes even minority languages that are not widely known. How to accurately recognize such text is therefore a problem that OCR has to consider.
Disclosure of Invention
The embodiment of the application provides a text recognition method, a text recognition device, electronic equipment and a storage medium, which are used for realizing text recognition of different languages in an image and improving the text recognition accuracy.
The text recognition method provided by the embodiment of the application comprises the following steps:
Extracting features of an image to be identified, and obtaining a feature map corresponding to the image to be identified, wherein the image to be identified contains texts of at least one language class;
according to the feature map, performing position detection on the text in the image to be identified, obtaining position information of the text in the image to be identified, and performing language identification on the text in the image to be identified, obtaining language information of the text in the image to be identified;
and identifying the text in the image to be identified based on the acquired position information and the language information.
The text recognition device provided by the embodiment of the application comprises:
the feature extraction unit is used for extracting features of the image to be identified and obtaining a feature map corresponding to the image to be identified, wherein the image to be identified contains texts of at least one language class;
the detection unit is used for carrying out position detection on the text in the image to be identified according to the feature map, obtaining the position information of the text in the image to be identified, carrying out language identification on the text in the image to be identified, and obtaining the language information of the text in the image to be identified;
And the identification unit is used for identifying the text in the image to be identified based on the acquired position information and the language information.
Optionally, the identification unit is specifically configured to:
generating at least one text box for identifying the area where the text in the image to be identified is located based on the position information;
determining the language category of the text in each text box according to the language information corresponding to the text in each text box;
and identifying the text in each text box according to the text identification mode corresponding to the language category to which the text in each text box belongs.
Optionally, the language information includes at least two probability maps, each probability map corresponds to a preset language in a preset language set one by one, and the probability map is used for indicating the probability that the content of the position of each pixel point in the image to be identified belongs to the preset language corresponding to the probability map;
for each text box, the identification unit is specifically configured to:
acquiring a target area corresponding to the text box in each probability map;
taking the average value of probability values of all pixel points in a target area of each probability map as the prediction probability of the text in the text box belonging to the preset language corresponding to the corresponding probability map;
And taking a preset language corresponding to the maximum probability value in each prediction probability as the language category to which the text in the text box belongs.
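As an illustration of the optional per-text-box decision described above, the following is a minimal NumPy sketch; the function name, the array shapes and the boolean-mask representation of a text box are assumptions for illustration, not details from the patent.

```python
import numpy as np

def language_of_text_box(prob_maps, box_mask):
    """prob_maps: (K, H, W) array, one probability map per preset language.
    box_mask: (H, W) boolean mask of the target area covered by one text box.
    Returns the index of the preset language with the highest mean probability over the box."""
    mean_probs = [prob_maps[k][box_mask].mean() for k in range(prob_maps.shape[0])]
    return int(np.argmax(mean_probs))
```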
Optionally, the feature extraction unit is specifically configured to:
inputting the image to be identified into a trained classification model, and carrying out feature extraction on the image features of the image to be identified based on an image feature extraction part in the classification model to obtain a feature map corresponding to the image to be identified;
the trained classification model is obtained through training according to a training sample set, and the training sample set comprises sample images marked with text position information and language information.
Optionally, the trained classification model further includes a text detection portion and a language identification portion;
the detection unit is specifically used for:
inputting the feature map into the text detection part, and carrying out feature extraction on text position features in the feature map based on the text detection part to acquire the position information of the text in the image to be identified, which is output by the text detection part; and
and inputting the feature map into the language identification part, and carrying out feature extraction on the text language features in the feature map based on the language identification part to acquire language information of the text in the image to be identified, which is output by the language identification part.
Optionally, the language identification part is a full convolutional neural network, and the language information includes at least two probability graphs, each probability graph corresponds to a preset language in a preset language set one by one;
the number of convolution kernels of the last layer of convolution layers in the language identification part is the same as the number of preset languages in the preset language set.
Optionally, the device further comprises a training unit;
the training unit is used for training to obtain the trained classification model by the following modes:
selecting a sample image from the training sample set;
inputting the sample image into an untrained classification model to obtain position information and language information of a text in the sample image output by the untrained classification model;
and constructing a target loss function based on the position information and the language information output by the untrained classification model, and optimizing parameters in the untrained classification model according to the target loss function until the untrained classification model converges to obtain the trained classification model.
Optionally, the objective loss function includes a text classification loss term, a distance loss term, and a language classification loss term;
The text classification loss term is used for representing the difference between the text classification result in the position information predicted by the untrained classification model and the actual text classification result, wherein the text classification result is used for representing whether the content at the position of a pixel point belongs to text content or not;
the distance loss term is used for representing the difference between the boundary distance in the position information predicted by the untrained classification model and the actual boundary distance, wherein the boundary distance represents, when the content at the position of a pixel point belongs to text content, the distance between the position of the pixel point and the boundary of the text to which the pixel point belongs;
the language classification loss term is used for representing the difference between the language information predicted by the untrained classification model and the actual language information.
The electronic device provided by the embodiment of the application comprises a processor and a memory, wherein the memory stores program codes, and when the program codes are executed by the processor, the processor is caused to execute the steps of any one of the text recognition methods.
An embodiment of the application provides a computer readable storage medium comprising program code which, when run on an electronic device, causes the electronic device to perform the steps of any one of the text recognition methods described above.
The application has the following beneficial effects:
The embodiments of the application provide a text recognition method and device, electronic equipment and a storage medium. Based on the feature map obtained by feature extraction from the image to be identified, the embodiments not only perform position detection on the texts in the image to be identified, but also perform language identification on those texts, without requiring the user to designate a language category in advance. The language of each text in the image to be identified is extracted from the feature map, and each text is then recognized with the text recognition method corresponding to its predicted language.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1A is a schematic diagram of a text detection model of the related art;
FIG. 1B is a schematic diagram of another text detection model in the related art;
FIG. 2 is a schematic diagram of a text recognition method in the related art;
fig. 3 is a schematic diagram of an application scenario according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative text recognition method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative classification model in accordance with an embodiment of the application;
FIG. 6 is a schematic diagram of a language prediction process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a complete text recognition method implementation timing sequence in an embodiment of the present application;
fig. 8 is a schematic diagram of a composition structure of a text recognition device according to an embodiment of the present application;
fig. 9 is a schematic diagram of a composition structure of an electronic device according to an embodiment of the present application;
fig. 10 is a schematic diagram of a hardware configuration of a computing device to which an embodiment of the present application is applied.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to the accompanying drawings of the embodiments of the present application. It is apparent that the described embodiments are some, but not all, embodiments of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art, based on the embodiments described in this document and without any creative effort, fall within the scope of protection of the technical solutions of the present application.
Some of the concepts involved in the embodiments of the present application are described below.
1. Position information: the text position in the image obtained by text detection, i.e. the text position prediction. In the embodiment of the application, the position information mainly comprises two parts: a text classification result and boundary distances. The pixel-level text detection method extracts the feature information of each pixel point in the image through a pixel-level image segmentation technique, analyses the features of each pixel point based on the local and global characteristics of the image, and determines the position information of each pixel point. The text classification result therefore indicates whether the content at the position of each pixel point belongs to text content; when the content at the position of a pixel point belongs to text content, the boundary distances represent the distances between the position of the pixel point and the boundaries of the text to which the pixel point belongs. Taking a text line containing 3 characters as an example, each character is composed of a plurality of pixel points; for one pixel point on a character, the boundary distances are the distances from that pixel point to the upper, lower, left and right edges of the text line. In addition, the position information may further include the rotation angle of the text to which the pixel point belongs; for a pixel point on a character, this is the rotation angle of the text line containing that character.
2. Language information: in the embodiment of the application, the language information may be represented in the form of probability maps, where each probability map corresponds to one preset language, and each pixel point in a probability map has a corresponding probability value representing the probability that the content at the position of that pixel point belongs to the preset language corresponding to that probability map. Judging which language the content at the position of each pixel point belongs to is in fact similar to the way the position information is determined: the language features of each pixel point are obtained with a pixel-level language identification method, and after the language features of each pixel point are obtained, the features of each pixel point are analysed with the local and global characteristics of the image, so as to determine the language information of each pixel point.
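As a concrete illustration of how these per-pixel outputs could be laid out, here is a minimal NumPy sketch under assumed shapes; the array names, the output resolution and the number of preset languages K are hypothetical choices for illustration, not details taken from the patent.

```python
import numpy as np

H, W, K = 256, 256, 5                    # assumed output resolution and number of preset languages

# Position information (pixel level):
text_score = np.zeros((H, W))            # probability that each pixel lies on text
boundary_dist = np.zeros((4, H, W))      # distances to the top/bottom/left/right edges of the text line
rotation = np.zeros((H, W))              # rotation angle of the text line each pixel belongs to

# Language information (pixel level): one probability map per preset language.
language_prob = np.zeros((K, H, W))

# For a single pixel, the predicted language is simply the map with the highest probability.
y, x = 120, 80
predicted_language = int(np.argmax(language_prob[:, y, x]))
```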
3. OCR: refers to the process in which an electronic device (such as a scanner or a digital camera) examines characters printed on paper, determines the shapes of the characters by detecting patterns of dark and bright, and then translates the shapes into computer characters with a character recognition method; that is, the technique of converting the characters of a paper document into an image file of a black-and-white dot matrix by optical means, and converting the characters in the image into a text format with recognition software so that they can be further edited and processed by word-processing software. It mainly comprises two specific tasks: detection of text and recognition of text. The text recognition method in the embodiment of the application covers both the text detection task and the text recognition task of OCR.
4. Convolution layer: each convolution layer (Convolutional layer) in the convolution neural network is composed of a plurality of convolution units, and parameters of each convolution unit are optimized through a back propagation algorithm. The purpose of convolution operations is to extract different features of the input, and a first layer of convolution may only extract some low-level features, such as edges, lines, and corners, from which a network of more layers can iteratively extract more complex features. The classification model in the embodiment of the application is mainly based on a convolution layer to perform feature extraction.
5. Convolution kernel: that is, given an input image, a weighted average of pixels in a small region of the input image becomes each corresponding pixel in the output image, where the weights are defined by a function called a convolution kernel. Specifically, certain features (such as edge information) of the original image can be obtained after convolution kernel convolution, and the position information and language information in the embodiment of the application are also obtained based on convolution kernel convolution.
6. Loss function: typically, every machine learning algorithm has an objective function, and the algorithm is solved by optimizing this objective function. In classification or regression problems, a loss function (cost function) is typically used as this objective function. The loss function is used to evaluate how different the model's predicted values and the actual values are; in general, the better the loss function, the better the model performs. Different algorithms use different loss functions.
7. Cross entropy loss function: suitable for binary or multi-class models; taking the binary case as an example, it is mainly used for measuring the difference between two probability distributions. The cross entropy loss function can be used as a loss function in a neural network (machine learning): if p denotes the distribution of the real labels and q the label distribution predicted by the trained model, the cross entropy loss function can be used to measure the similarity of p and q. The text classification loss term in the embodiment of the application can be expressed as a cross entropy loss function so as to measure the similarity between the predicted text classification result and the real text classification result; the same holds for the language classification loss term, since language identification is in fact a multi-classification process.
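For reference, the standard cross entropy between the real distribution p and the predicted distribution q (a textbook formula, not quoted from the patent) can be written as:

H(p, q) = -\sum_{k} p_k \log q_k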
8. Euclidean distance loss function: also referred to as the L2 norm loss function, it refers to the square of the distance to the target; when the target value is 0, the loss is simply the square of the prediction. The L2 norm loss is strongly curved near the target, and an algorithm can exploit this property to converge more and more slowly as it approaches the target. The distance loss term in the embodiment of the application can take the form of a Euclidean distance loss function to measure the difference between the predicted boundary distance and the actual boundary distance.
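Written out (again as a textbook formula rather than the patent's own notation), the L2 loss between a target value y and a prediction \hat{y} is:

L_2(y, \hat{y}) = (y - \hat{y})^2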
9. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive subject that involves a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning, among other directions.
Machine Learning (ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
The text recognition method provided in the embodiment of the application can be divided into two parts: a training part and an application part. The training part relates to the technical field of machine learning. In the training part, the classification model is trained with machine learning techniques: a sample image from a training sample passes through the image feature extraction part of the classification model to obtain the corresponding feature map; the feature map is then input into the text detection part and the language identification part; after pixel-level text position detection is carried out by the text detection part, a text composed of a plurality of pixel points can be determined from the position information of each pixel point and the position of the text in the sample image is located, while the probability that each pixel point in the sample image belongs to each language is identified by the language identification part; the model parameters are continuously adjusted by an optimization algorithm to obtain the trained classification model. In the application part, the classification model obtained in the training part is used to obtain the position information and language information of the texts in the image to be recognized; the language category to which each text in the image to be recognized belongs is then determined according to the position information and language information, and each text in the image to be recognized is recognized with the corresponding text recognition method.
The following briefly describes the design concept of the embodiment of the present application:
With the development of deep learning technology in the video field, multi-angle text detection based on deep learning has become a mainstream trend, and text at various angles in images can be detected. In text detection, the related technologies are mainly divided into Two-stage CNN (Convolutional Neural Network) networks based on the Faster RCNN (Faster Region-based Convolutional Neural Network) framework and One-stage CNN networks based on the SSD (Single Shot MultiBox Detector) model.
The representative Two-stage CNN network is RRPN (Rotation Region Proposal Network). Its core idea is shown in FIG. 1A: first, a feature map of the image is obtained through convolution; the RPN (Region Proposal Network) used in object detection is then extended with rotation information to generate candidate text boxes with rotation angles, so that text in any direction can be detected; learning of an RROI (Rotated Region Of Interest) pooling layer and of the rotated candidate boxes is added to the candidate-box-based structure, and text regions and background regions are classified on this basis, the output being the Classifier shown in FIG. 1A; multi-angle text detection can thus be realized with this network model. The representative One-stage CNN network is the TextBoxes model. Its core idea is shown in FIG. 1B: the framework of SSD is adopted, and the aspect ratio of the default text box is modified so that the model better fits the characteristic that text lines are long and narrow; in addition, the size of the convolution kernels in the feature extraction network (VGG-16) is modified to make the model more suitable for text line detection and to avoid non-text noise. These two models are mainly used for distinguishing text from non-text.
In addition, multi-angle text detection algorithms based on the segmentation idea are also commonly used, such as the EAST (Efficient and Accurate Scene Text detection pipeline) model, the PixelLink model, the PSENet (Progressive Scale Expansion Network) model and the Pixel-Anchor model. The core idea is to divide each pixel point into two categories, text and background, and then assemble the pixel points belonging to the text class into text lines through a post-processing algorithm.
However, the models proposed in the related art can only deal with one specific language: the user is required to specify a language category, and the text detection result is input into the recognition algorithm for that language. For example, as shown in fig. 2, after text is detected, the user specifies one language category for the three texts shown in the figure, so only the algorithm corresponding to the specified language can be invoked, and that algorithm cannot recognize text line content in other languages; for example, when Chinese is specified, the text line content of the other two languages cannot be identified. With population flows and economic globalization, pictures containing multiple languages have increased dramatically; moreover, some languages cannot be clearly distinguished, so OCR algorithms in the related art face more and more challenges.
In view of this, the embodiment of the application provides a text recognition method and device, an electronic apparatus and a storage medium. Based on the feature map of the image to be identified obtained by feature extraction, the text recognition method not only detects the position of the text in the image to be identified but also recognizes the language of each text, so the user does not need to specify a language category in advance; the language of each text in the image to be identified is extracted from the feature map, and, based on each predicted language, the text recognition method corresponding to that language can be invoked to recognize the corresponding text.
In addition, the language identification part in the embodiment of the application can be flexibly added to the text detection algorithms listed above, for example by attaching a language identification branch to the end of the PixelLink detection network; the whole network can then be trained end to end and used once training is completed.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and embodiments of the present application and features of the embodiments may be combined with each other without conflict.
Fig. 3 is a schematic diagram of an application scenario according to an embodiment of the present application. The application scenario includes two terminal devices 310 and a server 330, and the related interface 320 for content recommendation can be accessed through the terminal devices 310. Communication between the terminal devices 310 and the server 330 may take place through a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
In the embodiment of the present application, the terminal device 310 is an electronic device used by a user, and the electronic device may be a computer device having a certain computing capability, such as a personal computer, a mobile phone, a tablet computer, a notebook computer, an electronic book reader, etc., and running instant messaging software and a website or social software and a website. Each terminal device 310 is connected to a server 330 through a wireless network, where the server 330 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network ), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
The classification model in the embodiment of the present application may be deployed on the server 330 or the terminal device 310, and is typically deployed on the server 330.
Optionally, the server 330 may include a server for extracting features of the image to be identified, and implementing a classification platform 331 for text position prediction and language prediction; optionally, the server 330 may further include a server for implementing the image management platform 332.
Optionally, the image management platform 332 is further configured to maintain and push images to the terminal device 310, so that the terminal device 310 presents to the user.
It should be noted that the two terminal devices mentioned above are only an example; in practice, the embodiments of the present application may involve any number of terminal devices. The servers of the classification platform 331 and the image management platform 332 may be independent servers, or may be implemented in the same server. When the platforms are implemented in a plurality of servers, these servers are connected to each other through a communication network.
Referring to fig. 4, a flowchart of an implementation of a text recognition method according to an embodiment of the present application is shown, where the implementation flow of the method is as follows:
S41: and extracting features of the image to be identified, and obtaining a feature map corresponding to the image to be identified, wherein the image to be identified contains texts of at least one language type.
In the embodiment of the application, the language category refers to a category of language defined by country, region or ethnicity; for example, Chinese, Mongolian, Tibetan and Uyghur belong to different languages, and Chinese, English, French and Russian also belong to different languages.
In the process of text recognition, different language categories require different text recognition modes, so the language category to which a text belongs needs to be determined before the text is recognized, and text recognition is then performed based on the recognized language category. Before this, however, image features need to be extracted from the image to be identified.
In an alternative embodiment, the extraction of image features may be implemented based on a neural network model. The embodiment of the application provides a neural network model for extracting image features, hereinafter simply referred to as the classification model, which comprises three parts: an image feature extraction part, a text detection part and a language identification part. These three parts are divided into two execution stages: in the first stage, the image features of the image to be identified are acquired by the image feature extraction part; the second stage is divided into two image segmentation branches, namely the positions of the texts in the image to be identified are detected by the text detection part, and the languages of the texts in the image to be identified are detected by the language identification part. Image segmentation refers to dividing an image into a number of mutually non-overlapping regions according to features such as grey level, colour, texture and shape, so that these features are similar within the same region and clearly different between different regions. In the embodiment of the application, the text detection part extracts the position features of the texts in the image to be identified by means of image segmentation: the position information of pixel points within the same text region of the image to be identified is similar, while the position information of pixel points in a text region and in the background region differs clearly. The language identification part behaves similarly: the language information of pixel points within the same text region is similar, while the language information of pixel points in a text region and in the background region differs clearly.
The process of extracting features of an image to be identified based on the image feature extraction section will be described in detail below:
inputting the image to be identified into a trained classification model, and carrying out feature extraction on the image features of the image to be identified based on an image feature extraction part in the classification model to obtain a feature map corresponding to the image to be identified; the trained classification model is obtained by training according to a training sample set, wherein the training sample set comprises sample images marked with text position information and language information.
In the embodiment, the image features of the image to be identified are extracted through the trained classification model, the feature images corresponding to the image can be accurately and rapidly extracted based on the learning ability of the model, the model is simple in structure, the training process is simple, and the feature extraction efficiency can be effectively improved.
In an embodiment of the present application, the training sample set may contain a plurality of sample images, including both positive and negative samples. The text position information marked on a sample image is given per pixel: for example, if the content at the position of a pixel point is text it can be marked as 1, and if the content at that position is background it can be marked as 0. Similarly, the language information marked on a sample image is also given per pixel; assuming 5 preset languages, if the content at the position of a pixel point is text of the 1st language it can be marked as 10000, meaning the probability that this content belongs to the 1st language is 1 and the probability for the other languages is 0; if the content at the position of the pixel point is text of the 2nd language it can be marked as 01000, meaning the probability that this content belongs to the 2nd language is 1 and the probability for the other languages is 0; and so on.
It should be noted that the above-listed labeling modes are only examples, and any labeling mode is applicable to the embodiments of the present application, and is not specifically limited herein.
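A minimal sketch of what such pixel-level annotations could look like in code (NumPy; the helper name, the image size, the number of languages and the axis-aligned region are assumptions for illustration, not details from the patent):

```python
import numpy as np

H, W, K = 1024, 1024, 5                           # assumed image size and number of preset languages

text_mask = np.zeros((H, W), dtype=np.uint8)      # 1 = text, 0 = background
language_onehot = np.zeros((K, H, W), dtype=np.uint8)

def mark_text_region(rows, cols, language_index):
    """Mark a rectangular text region (assumed axis-aligned here for simplicity)
    as text of the given preset language."""
    text_mask[rows, cols] = 1
    language_onehot[language_index, rows, cols] = 1

# Example: a text line of the 1st preset language occupying rows 100..140, cols 200..600.
mark_text_region(slice(100, 140), slice(200, 600), language_index=0)
```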
Fig. 5 is a schematic structural diagram of a classification model according to an embodiment of the present application, wherein the image feature extraction part is a CNN+FPN (Convolutional Neural Network + Feature Pyramid Network) network, and the size of the input image to be identified is 1024x1024. The first stage extracts a feature map using the CNN+FPN network.
The convolutional neural network selected in the embodiment of the application is a VGG network formed by sixteen convolution layers. The 1024x1024 image to be identified is first input into the VGG convolutional neural network; after the image features are extracted through convolution operations, a 512-channel feature map of size 32x32 is output. This feature map is then input into the FPN, and after 3 stages of up-sampling a 32-channel feature map of size 256x256 is finally output, i.e. the feature map corresponding to the image to be identified obtained by the image feature extraction part.
It should be noted that the CNN+FPN structure listed in the embodiment of the present application is merely illustrative; any CNN+FPN combination may in fact be used, and the embodiment of the application merely selects one of them, which is not specifically limited herein.
In the above embodiment, pixel-level classification of the image can be implemented based on an FCN, which solves the semantic-level image segmentation problem. An FCN can accept an input image of any size and uses deconvolution layers to up-sample the feature map of the last convolution layer, restoring it to the same size as the input image; a prediction is thus generated for each pixel while the spatial information of the original input image is retained, and pixel-by-pixel classification is finally performed on the up-sampled feature map. The loss is then calculated pixel by pixel, which is equivalent to having one training sample per pixel.
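To make the shape flow above concrete, the following is a rough PyTorch-style sketch of a CNN+FPN-like backbone producing a 32-channel 256x256 feature map from a 1024x1024 input; the layer configuration is a simplified assumption for illustration, not the patent's exact VGG-16+FPN network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleBackbone(nn.Module):
    """Illustrative stand-in for the VGG-style CNN: five stride-2 stages,
    1024x1024x3 -> 32x32x512 (the real VGG configuration is more elaborate)."""
    def __init__(self):
        super().__init__()
        channels = [3, 64, 128, 256, 512, 512]
        self.stages = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)           # keep per-stage features for the FPN-style fusion
        return feats                  # spatial sizes: 512, 256, 128, 64, 32

class SimpleFPNHead(nn.Module):
    """Three up-sampling/merge steps, 32x32 -> 256x256, ending with 32 channels."""
    def __init__(self):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, 64, 1) for c in (512, 512, 256, 128)])
        self.out_conv = nn.Conv2d(64, 32, 3, padding=1)

    def forward(self, feats):
        # feats[-1] is the 32x32 map; merge it with the 64, 128 and 256 resolution maps.
        x = self.lateral[0](feats[-1])
        for lat, skip in zip(self.lateral[1:], reversed(feats[1:-1])):
            x = F.interpolate(x, scale_factor=2, mode="nearest") + lat(skip)
        return self.out_conv(x)       # shape: (N, 32, 256, 256)

if __name__ == "__main__":
    feats = SimpleBackbone()(torch.randn(1, 3, 1024, 1024))
    fmap = SimpleFPNHead()(feats)
    print(fmap.shape)                 # torch.Size([1, 32, 256, 256])
```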
S42: according to the feature diagram, position detection is carried out on the text in the image to be identified, position information of the text in the image to be identified is obtained, language identification is carried out on the text in the image to be identified, and language information of the text in the image to be identified is obtained.
Optionally, in the second stage introduced above, the text detection part of the classification model is used to detect the position of the text, and the language of the text is predicted based on the language identification part of the classification model, as follows:
inputting the feature map into a text detection part, and extracting features of text position features in the feature map based on the text detection part to acquire the position information of the text in the image to be identified output by the text detection part; and inputting the feature map into a language identification part, and carrying out feature extraction on the text language features in the feature map based on the language identification part to acquire language information of the text in the image to be identified output by the language identification part.
The position information mainly comprises two parts: text classification results and boundary distances. Because the text detection part listed in the embodiment of the application is pixel-level text detection, the characteristic information of each pixel point in the image is extracted through a pixel-level image segmentation technology, and the characteristics of each pixel point are analyzed based on the local and global characteristics of the image to determine the position information of each pixel point, so that a text classification result indicates whether the content of the position of each pixel point belongs to text content or not; and the boundary distance represents the distance between the position of the pixel point and the boundary of the text to which the pixel point belongs when the content of the position of the pixel point belongs to the text content.
Taking the image to be recognized input in fig. 5 as an example, the text line shown contains 3 characters in total, and each character is composed of a plurality of pixel points. For one pixel point on a character, the boundary distances are the distances from that pixel point to the upper, lower, left and right edges of the text line. In addition, the position information may further include the rotation angle of the text to which the pixel point belongs; for a pixel point on a character, this is the rotation angle of the text line containing that character.
The language information is used for describing the language category to which the text in the image to be identified belongs. In the embodiment of the application, the language information can be expressed in the form of probability maps, where each probability map corresponds to one preset language and each pixel point in a probability map has a corresponding probability value, indicating the probability that the content at the position of the pixel point belongs to the preset language corresponding to that probability map. Judging which language the content at the position of each pixel point belongs to is in fact similar to the way the position information is determined: it is obtained with a pixel-level language identification method; after the language features of each pixel point are obtained, the features of each pixel point are analysed with the local and global characteristics of the image, and the language information of each pixel point is determined.
In the embodiment, two target tasks of text position detection and text language identification are directly realized based on the trained classification model, and the position information and language information of the text in the image can be accurately and efficiently acquired based on the learning ability of the model, so that convenience is provided for the next text identification. In addition, the text detection part and the language identification part adopt a pixel level detection mode to refine the detection result to the pixel level, so that when the position or language of a text is determined, the error of the identification result of partial pixel points in the text area can be tolerated, and the detection accuracy can be effectively improved.
As shown in fig. 5, the input to the text detection part and the language identification part is the feature map output by the image feature extraction part (CNN+FPN). Both parts shown in the figure are pixel-level network models, which can realize pixel-level classification of the image and thereby solve the semantic-level image segmentation problem.
The first branch in the second stage, the pixel-level text position prediction branch in the figure, i.e. the text detection part of the classification model, serves to predict the positions of text lines in the image to be identified. Taking the text detection part being an EAST model as an example, this branch can predict whether the content at the position of each pixel point belongs to text content and the distances between the current position and the four edges of the text line, and can also predict the rotation angle of the text line; that is, the text classification result, the boundary distances and the rotation angle.
In this embodiment, the position of a text line in the picture can be extracted by this branch together with its rotation angle, so that the text line can be rectified according to the rotation angle, which can greatly improve the accuracy of the text recognition system.
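As an illustration of how a single pixel's predicted geometry (four boundary distances plus a rotation angle) could be turned back into a rotated text box, here is a small NumPy sketch; the function name and the corner convention are assumptions, and this is a simplified version of EAST-style post-processing rather than the patent's exact procedure.

```python
import numpy as np

def pixel_to_rotated_box(x, y, d_top, d_bottom, d_left, d_right, theta):
    """Recover the four corners of the rotated text box described by one pixel's
    prediction: distances are measured in the box's own frame, theta is the
    rotation angle of the text line (radians)."""
    # Corners in the box's local frame, relative to the pixel position.
    corners = np.array([
        [-d_left,  -d_top],     # top-left
        [ d_right, -d_top],     # top-right
        [ d_right,  d_bottom],  # bottom-right
        [-d_left,   d_bottom],  # bottom-left
    ], dtype=np.float32)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]], dtype=np.float32)
    return corners @ rot.T + np.array([x, y], dtype=np.float32)
```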
The second branch in the second stage is the pixel-level language prediction branch in the figure, that is, the language identification part of the classification model, which is used for judging the language category of the text line to which each pixel belongs, that is, judging which language in the preset language set each pixel belongs to.
Optionally, when the language information is in the form of a probability map, the language information output by the language identification part includes at least two probability maps, and each probability map corresponds to a preset language in the preset language set one by one.
Wherein, the category number K of the preset languages can be set according to different tasks. For example, the languages mainly related to task 1 are language a, language B, language C, and language D, and these 4 languages may be used as predetermined languages to form predetermined language set 1. Correspondingly, the language information is 4 probability maps, and each probability map corresponds to one preset language in the preset language set 1. For example, the languages mainly related to task 2 include language a, language B, language E, language F, and language G, and these 5 languages are used as predetermined languages at this time and combined to form predetermined language set 2. Correspondingly, the language information is 5 probability maps, and each probability map corresponds to one preset language in the preset language set 2.
Alternatively, the text detection part and the language identification part may each be an FCN (Fully Convolutional Network) composed of a plurality of convolution layers, and the number of convolution kernels of the last convolution layer is determined by the classification task. For example, the last convolution layer of the text detection part corresponds to two types of convolution kernels, one used to extract the text classification result and the other used to extract the boundary distances; their numbers are 2 and 5 respectively, where 2 corresponds to whether the pixel is text or not, and 5 corresponds to the distances between the position of the pixel point and the four edges (upper, lower, left and right) of the text line plus the rotation angle of the text line. The number of convolution kernels of the last convolution layer of the language identification part is K, i.e. the number of preset languages in the preset language set. One convolution kernel corresponds to one preset language; the probability that each pixel belongs to each preset language is determined by the corresponding convolution kernel, and in the probability map formed by combining all pixels, each pixel has a corresponding probability value representing the probability that the pixel belongs to the preset language corresponding to that probability map.
In this embodiment, the fully convolutional network model has a simple structure and is easy to train. In addition, the number of convolution kernels of the last convolution layer of the language identification part in the embodiment of the application is the same as the number of preset languages in the preset language set, so multi-class classification can be realized on this basis: the content at the position of each pixel point is classified by language, and on this basis the languages of the texts in the image can be identified.
In addition, the classification model in the embodiment of the application can be continuously adjusted. As the number of preset languages in the preset language set increases, the classification model can be adjusted accordingly, for example by increasing the number of convolution kernels of the last convolution layer of the language identification part, so that the applicability of the model is continuously improved; the classification model may also be continuously adjusted as the number of sample images grows, which improves the accuracy of the model.
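A rough PyTorch-style sketch of what the two prediction heads on top of the shared 32-channel feature map could look like; the channel counts follow the description above, while the module name and the exact layer layout are assumptions.

```python
import torch
import torch.nn as nn

class DetectionAndLanguageHeads(nn.Module):
    def __init__(self, in_channels=32, num_languages=5):
        super().__init__()
        # Text detection branch: 2 kernels for text/non-text, 5 for the four
        # boundary distances plus the rotation angle.
        self.text_cls = nn.Conv2d(in_channels, 2, kernel_size=1)
        self.geometry = nn.Conv2d(in_channels, 5, kernel_size=1)
        # Language identification branch: one kernel (probability map) per preset language.
        self.language = nn.Conv2d(in_channels, num_languages, kernel_size=1)

    def forward(self, feature_map):
        text_prob = torch.softmax(self.text_cls(feature_map), dim=1)
        geometry = self.geometry(feature_map)
        language_prob = torch.softmax(self.language(feature_map), dim=1)
        return text_prob, geometry, language_prob

if __name__ == "__main__":
    heads = DetectionAndLanguageHeads()
    fmap = torch.randn(1, 32, 256, 256)
    text_prob, geometry, language_prob = heads(fmap)
    print(text_prob.shape, geometry.shape, language_prob.shape)
    # torch.Size([1, 2, 256, 256]) torch.Size([1, 5, 256, 256]) torch.Size([1, 5, 256, 256])
```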
In an alternative embodiment, the trained classification model is trained by:
selecting a sample image from the training sample set; inputting the sample image into an untrained classification model to obtain the position information and language information of the text in the sample image output by the untrained classification model; and constructing a target loss function based on the position information and the language information output by the untrained classification model, and optimizing parameters in the untrained classification model according to the target loss function until the untrained classification model converges to obtain a trained classification model.
In the process of continuously adjusting model parameters, an objective loss function is optimized mainly through an optimization algorithm, and at least one stage of training is carried out on a classification model by utilizing the objective loss function until the model converges, so that the best model is trained.
The optimization algorithm can be gradient descent method, genetic algorithm, newton method, quasi-Newton method, etc.
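A bare-bones sketch of the training loop described above (PyTorch-style; the optimizer choice, the data source and the loss helper are placeholders assumed for illustration):

```python
import torch

def train_classification_model(model, sample_loader, objective_loss, num_steps=10000, lr=1e-3):
    """model: backbone + text detection head + language identification head.
    sample_loader yields (image, position_gt, language_gt) drawn from the training sample set."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient descent, one of the listed optimizers
    model.train()
    for step, (image, position_gt, language_gt) in zip(range(num_steps), sample_loader):
        position_pred, language_pred = model(image)           # forward pass on the sample image
        loss = objective_loss(position_pred, language_pred, position_gt, language_gt)
        optimizer.zero_grad()
        loss.backward()                                        # optimize parameters w.r.t. the objective loss
        optimizer.step()
    return model
```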
Optionally, the objective loss function includes a text classification loss term, a distance loss term, and a language classification loss term; the text classification loss term and the distance loss term belong to the loss function of the text detection part, and the language classification loss term is the loss function of the language identification part. These three loss terms are described in detail below:
and the first loss term and the text classification loss term are used for representing the difference between the text classification result and the actual text classification result in the position information predicted by the untrained classification model.
In an embodiment of the application, this loss term may be a two-class cross entropy loss, denoted L_{cls}(p_{cls}, g_{cls}), where p_{cls} is the predicted value of the text classification result and g_{cls} is the true value of the text classification result.
The true value of the text classification result can be obtained from the position information marked in the sample image. For example, if the content at the position of a certain pixel point is text, a probability value is marked accordingly: a true value marked as 1 represents text and a true value marked as 0 represents non-text. The predicted value is then a probability value between 0 and 1, for example 0.5.
Specifically, L_{cls}(p_{cls}, g_{cls}) is calculated as:
L_{cls}(p_{cls}, g_{cls}) = -\sum_i \left[ \beta\, g_i \log p_i + (1-\beta)(1-g_i) \log (1-p_i) \right]
where \beta is a preset value that can represent a weight, whose value range can be set to 0-1, for example 0.5; g_i represents the real text classification result corresponding to the i-th pixel point, and p_i represents the predicted text classification result corresponding to the i-th pixel point. Based on this text classification loss term, the difference between the real text classification result and the predicted text classification result of each pixel point in the image to be identified is calculated.
Loss term 2, the distance loss term, is used for representing the difference between the boundary distance in the position information predicted by the untrained classification model and the actual boundary distance.
In the embodiment of the application, this loss term refers to a Euclidean distance loss function over the four directions, denoted L_dis(p_dis, g_dis), where p_dis represents the predicted distances between a pixel point and the 4 edges of the text line to which it belongs, and g_dis represents the corresponding true values. That is, p_dis contains the predicted distances from the pixel position to the upper, lower, left and right edges of the text line it belongs to, and g_dis contains the corresponding true distances, which can be obtained from the position information marked in the sample image.
Specifically, L_dis(p_dis, g_dis) is calculated by applying a smooth-L1 penalty to the difference between the predicted and true distances in each of the four directions and summing over the pixel points:

L_dis(p_dis, g_dis) = Σ_i Σ_{d ∈ {upper, lower, left, right}} smooth_L1(p_i^d − g_i^d)

where, during the calculation, x = p_i^d − g_i^d is substituted into smooth_L1: if the absolute value of x is smaller than 1, then smooth_L1(x) = 0.5x²; otherwise, smooth_L1(x) = |x| − 0.5.
Loss term three, the language classification loss term, is used to represent the difference between the language information predicted by the untrained classification model and the actual language information.
This loss function may be a K-class cross-entropy loss function, i.e. a multi-class cross-entropy loss function, denoted L_lang(p_lang, g_lang), where p_lang represents the predicted probabilities of the language to which the content at the pixel position belongs. Since there are K kinds of preset languages, a predicted probability that the content at the pixel position belongs to each preset language can be obtained, i.e. K probability values p_k, with k ranging from 1 to K. Correspondingly, g_lang represents the true probabilities of the language to which the content at the pixel position belongs; one pixel likewise corresponds to K true probability values g_k.
Specifically, L_lang(p_lang, g_lang) can be calculated as a multi-class cross-entropy summed over the pixel points:

L_lang(p_lang, g_lang) = -Σ_i Σ_{k=1}^{K} g_k^i · log(p_k^i)

where i in each of the above loss terms denotes the i-th pixel point.
Based on the above-mentioned loss terms, the specific calculation manner of the overall loss function (i.e., the objective loss function) of the classification model in the embodiment of the present application is as follows:
L(p_cls, p_dis, p_lang, g_cls, g_dis, g_lang) = L_cls(p_cls, g_cls) + [g_cls > 0]·L_dis(p_dis, g_dis) + [g_cls > 0]·L_lang(p_lang, g_lang);
where [g_cls > 0] means that the loss is calculated only in the text region and the background region is ignored, so the two loss functions of the distance loss term and the language classification loss term in the objective loss function compute losses only over the text region. For example, g_cls of a pixel point in the text region can be represented as 1 and g_cls of a pixel point in the background region as 0, so that only the text region satisfies the condition g_cls > 0.
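A minimal sketch of how the objective loss might be assembled is given below, assuming per-pixel tensors of shape (N, C, H, W) with p_cls and p_lang already normalized (sigmoid/softmax); the normalization of each term over pixels is an illustrative choice, not specified by the formulas above.

```python
import torch
import torch.nn.functional as F

def objective_loss(p_cls, p_dis, p_lang, g_cls, g_dis, g_lang, beta=0.5, eps=1e-6):
    # Text classification loss: balanced binary cross-entropy over all pixels.
    l_cls = -(beta * g_cls * torch.log(p_cls + eps)
              + (1 - beta) * (1 - g_cls) * torch.log(1 - p_cls + eps)).mean()

    # [g_cls > 0]: compute the remaining terms only on text-region pixels.
    text_mask = (g_cls > 0).float()
    n_text = text_mask.sum() + eps

    # Distance loss: smooth-L1 over the four border distances, text pixels only.
    l_dis = (F.smooth_l1_loss(p_dis, g_dis, reduction="none").sum(dim=1, keepdim=True)
             * text_mask).sum() / n_text

    # Language classification loss: K-class cross-entropy, text pixels only.
    l_lang = (-(g_lang * torch.log(p_lang + eps)).sum(dim=1, keepdim=True)
              * text_mask).sum() / n_text

    return l_cls + l_dis + l_lang
```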
In the embodiment of the application, using cross-entropy as the loss function avoids the slowdown in learning that a mean squared error loss suffers when gradients pass through a sigmoid function, because the size of the update is controlled by the output error. After the objective loss function is constructed from the above loss terms, the deviation between the true values and the predicted values can be calculated from this loss function, and the parameters of the classification model are then adjusted continuously until the model converges, i.e. until the deviation between the true values and the values output by the model is within an allowable error range. At that point, the values predicted by the model are close to the true values, and the accuracy is high.
It should be noted that, in the embodiment of the present application, the language identification part in the classification model is a single model that can identify different languages, similar to a multi-class classification model. Alternatively, the language identification part may take the form of several binary classification models, where different models correspond to different languages, for example model 1 corresponds to language 1, model 2 corresponds to language 2, and so on. In that form, the number of convolution kernels of the last layer of each model is 2, indicating whether or not the text belongs to the language corresponding to that model, and each model outputs a probability map. The training and use of this form are similar to the process described above and are not repeated here.
S43: and identifying the text in the image to be identified based on the acquired position information and language information.
Optionally, based on the obtained position information and language information, when recognizing the text in the image to be recognized, the specific process is as follows:
generating at least one text box for identifying the area where the text in the image to be identified is located based on the position information; determining the language category of the text in each text box according to the language information corresponding to the text in each text box; and identifying the text in each text box according to the text identification mode corresponding to the language category to which the text in each text box belongs.
In the embodiment of the present application, the text recognition mode refers to a recognition algorithm or recognition method for recognizing texts in different languages. For example, text recognition mode A may be used for recognizing text in the Chinese language, and may also be referred to as text recognition method A in the embodiment of the application; text recognition mode B may be used for recognizing text in the Tibetan language, and may be called text recognition method B; text recognition mode C may be used for recognizing text in the Mongolian language, and may be called text recognition method C, and so on.
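The dispatch from a predicted language to its recognition method can be expressed as a simple lookup. A minimal sketch follows, assuming hypothetical recognizer functions (recognize_chinese, recognize_tibetan, recognize_mongolian) that stand in for text recognition methods A, B and C and are not defined in the patent.

```python
# Minimal dispatch sketch: choose a text recognition method by language category.
# The recognizer functions are assumed placeholders for methods A, B and C.
def recognize_text(text_crop, language):
    recognizers = {
        "chinese": recognize_chinese,      # text recognition method A
        "tibetan": recognize_tibetan,      # text recognition method B
        "mongolian": recognize_mongolian,  # text recognition method C
    }
    return recognizers[language](text_crop)
```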
When generating text boxes from the position information, NMS (Non-Maximum Suppression) may be used. For example, the four vertices of each predicted text line are first obtained from the position information predicted in the first stage (the distances from the current position to the upper and lower edges of the text line, the distances to the left and right edges, and the rotation angle). Since a text line is predicted for every pixel point that belongs to it, a large number of overlapping text boxes are produced; the redundant text boxes can then be filtered out by NMS, yielding the coordinate information box_loc = (x1, y1, x2, y2, x3, y3, x4, y4) of the 4 corners of each text.
NMS suppresses the elements that are not local maxima. In the text detection process of the embodiment of the application, after sliding-window feature extraction and text classification by the fully convolutional neural network of the text detection part, each window obtains a classification (text or not) and a score. Sliding windows produce many windows that are contained in, or mostly overlap with, other windows. NMS is then used to keep the windows with the highest scores (the highest probability of being text) in each neighborhood and to suppress the windows with low scores. The process of filtering out redundant text boxes by NMS is illustrated below:
Assume there are 4 candidate boxes, sorted according to the text classification probability of the text detection part; in increasing order of the probability of belonging to text they are A, B, C and D. The following procedure is then performed (a code sketch is given after the steps):
1) Starting from the rectangular box with the maximum probability, D, determine for each of A–C whether its IOU (Intersection over Union, degree of overlap) with D is larger than a set threshold;
2) Assuming that the overlap of B and D exceeds the threshold, B is discarded, and D is marked as the first retained rectangular box.

3) From the remaining rectangular boxes A and C, select the one with the highest probability, C, and repeat steps 1) and 2) until all retained rectangular boxes are found.
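The following is a minimal sketch of the NMS filtering described above, written for axis-aligned boxes for simplicity; the boxes predicted in the patent are quadrilaterals with four corners, so this is an illustrative simplification rather than the exact procedure.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep high-scoring boxes and suppress boxes that overlap them too much.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) text probabilities.
    """
    order = np.argsort(scores)[::-1]  # highest probability of being text first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # IOU (Intersection over Union) between the best box and the remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter + 1e-6)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap the kept box too much
    return keep
```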
Based on the above process, text boxes identifying the regions where the texts in the image to be identified are located can be generated, and each detected text is identified. Taking the input image to be identified shown in fig. 4 as an example, three text boxes can actually be detected, corresponding to the texts in three languages, i.e. the three texts detected in fig. 2.
Optionally, the language information may be in the form of a probability map, where in the embodiment of the present application, the probability map is actually used to represent the probability that the content at the position of each pixel point in the image to be identified belongs to the preset language corresponding to the probability map, and each probability map corresponds to the preset language in the preset language set one by one; when determining the language category to which the text in each text box belongs based on the probability map, the specific determination process for each text box is as follows:
acquiring a target area corresponding to a text box in each probability map; taking the average value of probability values of all pixel points in a target area of each probability map as the prediction probability that the text in the text box belongs to the preset language corresponding to the corresponding probability map; and taking the preset language corresponding to the maximum probability value in each prediction probability as the language category to which the text in the text box belongs.
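A minimal sketch of this per-text-box language decision follows, assuming the probability maps are stacked into a (K, H, W) array and the text box has already been rasterized into a boolean mask of its target area.

```python
import numpy as np

def predict_language(prob_maps, box_mask, languages):
    """prob_maps: (K, H, W) per-language probability maps; box_mask: (H, W) boolean
    mask of the text box's target area; languages: list of the K preset languages."""
    # Average the probability values of all pixels in the target area of each map.
    means = np.array([prob_maps[k][box_mask].mean() for k in range(len(languages))])
    # The preset language with the maximum predicted probability is the text's language.
    best = int(np.argmax(means))
    return languages[best], float(means[best])
```

For example, with languages = ["chinese", "tibetan", "mongolian"] and mean probabilities of 0.9, 0.05 and 0.001 inside the box, the function returns ("chinese", 0.9), matching the example below.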
As shown in fig. 6, the feature map obtained from the image feature extraction part is input into the language identification part. The language identification part shown in fig. 6 is a fully convolutional neural network composed of three convolution layers; after language feature extraction by these three convolution layers, K probability maps corresponding to the K preset languages are obtained. Then, combining the text position box_loc obtained by the first branch of detection in the second stage, the part of each probability map at the position corresponding to text 1, 'between stores', i.e. the probability map of the target area, is extracted, such as the gray part of each probability map in the text language prediction probability maps of fig. 6. The probability values of all pixel points in the target area of each probability map are then averaged, giving the predicted probability of the text for each of the K preset languages, and the language of the text is finally obtained by taking the maximum probability value.
For example, as shown in fig. 6, for text 1, 'between stores', the predicted probability values for the corresponding preset languages are: Chinese 0.9, Tibetan 0.05, Mongolian 0.001, …; the maximum probability value is 0.9, and the preset language corresponding to it is Chinese, so the language class to which the text belongs is Chinese.
Assume that, for text 2, the predicted probability values of the corresponding preset languages are: Chinese 0.1, Tibetan 0.95, Mongolian 0.001, …; the maximum probability value is 0.95, and the preset language corresponding to it is Tibetan, so the language class to which text 2 belongs is Tibetan. Similarly, assume that the language class to which text 3 belongs is Mongolian.
If the texts in the image to be identified are texts 1 to 3, the language categories to which they belong are, respectively, Chinese, Tibetan and Mongolian. When recognizing these texts, the text recognition method A corresponding to Chinese needs to be called to recognize text 1, the text recognition method B corresponding to Tibetan to recognize text 2, and the text recognition method C corresponding to Mongolian to recognize text 3.
In this embodiment, the user does not need to specify a language; instead, the language category of each text in the image to be identified is identified based on feature extraction, and each text is then recognized by the corresponding text recognition method. This avoids the situation in which only text of a specified language can be recognized, noticeably improves recognition accuracy because each language is handled by its matching text recognition method, and improves the user experience.
Referring to fig. 7, a flow chart of the complete text recognition method is shown. The specific implementation flow of the method is as follows (a code sketch of this flow is given after the steps):
step S71: acquiring an image to be identified;
step S72: inputting the image to be identified into a trained classification model;
step S73: based on the image feature extraction part in the classification model, performing feature extraction on the image features of the image to be identified to obtain the feature map corresponding to the image to be identified;
step S74: inputting the feature map into a text detection part in the classification model, and carrying out feature extraction on text position features in the feature map based on the text detection part to obtain the position information of the text in the image to be identified, which is output by the text detection part;
step S75: inputting the feature map into a language identification part in the classification model, and carrying out feature extraction on text language features in the feature map based on the language identification part to obtain language information of texts in the image to be identified, which is output by the language identification part;
step S76: generating a text box for identifying the area where the text in the image to be identified is located based on the position information;
step S77: determining the language category of the text in each text box according to the language information corresponding to the text in each text box;
step S78: and identifying the text in each text box according to the text identification mode corresponding to the language category to which the text in each text box belongs.
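A minimal sketch that strings steps S71–S78 together follows; decode_boxes, box_to_mask and crop_box are assumed helper functions (decoding the score/distance maps into boxes, rasterizing a box into a pixel mask, and cropping a box from the image) and are not defined in the patent, while nms, predict_language and recognize_text are the sketches given earlier.

```python
def recognize_image(image, model, languages):
    # S72–S75: one forward pass yields the score map, distance maps and probability maps.
    p_cls, p_dis, p_lang = model(image)
    # S76: decode candidate boxes and filter redundant ones with NMS.
    boxes, scores = decode_boxes(p_cls, p_dis)            # assumed decoder
    kept = nms(boxes, scores)
    results = []
    for idx in kept:
        # S77: language category from the probability maps over the box's target area.
        mask = box_to_mask(boxes[idx], p_lang.shape[-2:])  # assumed rasterizer
        language, _ = predict_language(p_lang[0].detach().numpy(), mask, languages)
        # S78: recognize the text with the method matching its language.
        results.append(recognize_text(crop_box(image, boxes[idx]), language))
    return results
```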
As shown in fig. 8, which is a schematic structural diagram of the text recognition device 800, the device may include: a feature extraction unit 801, a detection unit 802, and an identification unit 803;
the feature extraction unit 801 is configured to perform feature extraction on an image to be identified, and obtain a feature map corresponding to the image to be identified, where the image to be identified includes text of at least one language class;
the detecting unit 802 is configured to perform position detection on a text in an image to be identified according to the feature map, obtain position information of the text in the image to be identified, and perform language identification on the text in the image to be identified, and obtain language information of the text in the image to be identified;
and a recognition unit 803 for recognizing the text in the image to be recognized based on the acquired position information and language information.
Optionally, the identifying unit 803 is specifically configured to:
generating at least one text box for identifying the area where the text in the image to be identified is located based on the position information;
determining the language category of the text in each text box according to the language information corresponding to the text in each text box;
and identifying the text in each text box according to the text identification mode corresponding to the language category to which the text in each text box belongs.
Optionally, the language information includes at least two probability maps, each probability map corresponds to a preset language in the preset language set one by one, and the probability map is used for representing the probability that the content of the position of each pixel point in the image to be identified belongs to the preset language corresponding to the probability map;
for each text box, the recognition unit 803 specifically functions to:
acquiring a target area corresponding to a text box in each probability map;
taking the average value of probability values of all pixel points in a target area of each probability map as the prediction probability that the text in the text box belongs to the preset language corresponding to the corresponding probability map;
and taking the preset language corresponding to the maximum probability value in each prediction probability as the language category to which the text in the text box belongs.
Optionally, the feature extraction unit 801 is specifically configured to:
inputting the image to be identified into a trained classification model, and carrying out feature extraction on the image features of the image to be identified based on an image feature extraction part in the classification model to obtain a feature map corresponding to the image to be identified;
the trained classification model is obtained by training according to a training sample set, wherein the training sample set comprises sample images marked with text position information and language information.
Optionally, the trained classification model further comprises a text detection portion and a language recognition portion;
the detection unit 802 specifically is configured to:
inputting the feature map into a text detection part, and extracting features of text position features in the feature map based on the text detection part to acquire the position information of the text in the image to be identified output by the text detection part; and
and inputting the feature map into a language identification part, and carrying out feature extraction on the text language features in the feature map based on the language identification part to obtain language information of the text in the image to be identified output by the language identification part.
Optionally, the language identification part is a full convolutional neural network, and the language information includes at least two probability graphs, each probability graph corresponds to a preset language in the preset language set one by one;
the number of convolution kernels of the last convolution layer in the language identification part is the same as the number of preset languages in the preset language set.
Optionally, the apparatus further comprises a training unit 804;
the training unit is used for training to obtain a trained classification model by the following modes:
selecting a sample image from the training sample set;
inputting the sample image into an untrained classification model to obtain the position information and language information of the text in the sample image output by the untrained classification model;
And constructing a target loss function based on the position information and the language information output by the untrained classification model, and optimizing parameters in the untrained classification model according to the target loss function until the untrained classification model converges to obtain a trained classification model.
Optionally, the objective loss function includes a text classification loss term, a distance loss term, and a language classification loss term;
the text classification loss item is used for representing the difference between a text classification result and an actual text classification result in position information obtained through prediction of an untrained classification model, wherein the text classification result is used for representing whether the content of the position of the pixel point belongs to text content or not;
the distance loss item is used for representing the difference between the boundary distance in the position information obtained through prediction of the untrained classification model and the actual boundary distance, wherein the boundary distance is used for representing the distance between the position of the pixel point and the boundary of the text to which the pixel point belongs when the content of the position of the pixel point belongs to the text content;
the language classification loss term is used to represent the difference between the language information predicted by the untrained classification model and the actual language information.
For convenience of description, the above parts are described as being functionally divided into modules (or units) respectively. Of course, the functions of each module (or unit) may be implemented in the same piece or pieces of software or hardware when implementing the present application.
Having described the text recognition method and apparatus of an exemplary embodiment of the present application, next, an apparatus for text detection according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, a text recognition device according to the present application may include at least a processor and a memory. Wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps in the text recognition method according to various exemplary embodiments of the application described in this specification. For example, the processor may perform the steps as shown in fig. 4.
The text recognition device of this embodiment is similar in structure to the text recognition device 800 shown in fig. 8, and will not be described again.
In some possible implementations, a computing device according to the application may include at least one processor, and at least one memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps in the text recognition method according to various exemplary embodiments of the application described above in this specification. For example, the processor may perform the steps shown in fig. 4.
In an exemplary embodiment, a storage medium is also provided, such as a memory 902, comprising instructions executable by the processor 901 of the electronic device 900 to perform the above-described method. Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In some possible implementations, embodiments of the present application further provide a computing device that may include at least one processing unit, and at least one storage unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps in the service invocation method according to various exemplary embodiments of the application described in the present specification. For example, the processing unit may perform the steps as shown in fig. 4.
A computing device 100 according to such an embodiment of the application is described below with reference to fig. 10. The computing device 100 of fig. 10 is only one example and should not be taken as limiting the functionality and scope of use of embodiments of the application.
As shown in fig. 10, the computing device 100 is in the form of a general purpose computing device. Components of computing device 100 may include, but are not limited to: the at least one processing unit 101, the at least one storage unit 102, and a bus 103 connecting the different system components (including the storage unit 102 and the processing unit 101).
Bus 103 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures.
The storage unit 102 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 1021 and/or cache memory unit 1022, and may further include Read Only Memory (ROM) 1023.
Storage unit 102 may also include program/utility 1025 having a set (at least one) of program modules 1024, such program modules 1024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The computing device 100 may also communicate with one or more external devices 104 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the computing device 100, and/or any devices (e.g., routers, modems, etc.) that enable the computing device 100 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 105. Moreover, computing device 100 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 106. As shown, network adapter 106 communicates with other modules for computing device 100 over bus 103. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with computing device 100, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In some possible embodiments, aspects of the text recognition method provided by the present application may also be implemented in the form of a program product comprising program code for causing a computer device to carry out the steps of the text recognition method according to the various exemplary embodiments of the application described above when the program product is run on a computer device, for example, the computer device may carry out the steps as shown in fig. 4.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code and may run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method of text recognition, the method comprising:
inputting an image to be identified into a trained classification model, and carrying out feature extraction on image features of the image to be identified based on an image feature extraction part in the classification model to obtain a feature map corresponding to the image to be identified, wherein the image to be identified contains texts of at least one type of language class; the trained classification model also comprises a text detection part and a language identification part;
inputting the feature map into the text detection part, and carrying out feature extraction on text position features in the feature map based on the text detection part to acquire the position information of the text in the image to be identified, which is output by the text detection part; inputting the feature map into the language identification part, and carrying out feature extraction on text language features in the feature map based on the language identification part to obtain language information of the text in the image to be identified, which is output by the language identification part;
and identifying the text in the image to be identified based on the acquired position information and the language information.
2. The method of claim 1, wherein the identifying text in the image to be identified based on the obtained location information and the language information comprises:
Generating at least one text box for identifying the area where the text in the image to be identified is located based on the position information;
determining the language category of the text in each text box according to the language information corresponding to the text in each text box;
and identifying the text in each text box according to the text identification mode corresponding to the language category to which the text in each text box belongs.
3. The method of claim 2, wherein the language information includes at least two probability maps, each probability map corresponds to a preset language in a preset language set one by one, and the probability map is used for indicating a probability that a content of a position of each pixel in the image to be identified belongs to the preset language corresponding to the probability map;
when determining the language category to which the text in each text box belongs according to the language information corresponding to the text in each text box, the method specifically comprises the following steps for each text box:
acquiring a target area corresponding to the text box in each probability map;
taking the average value of probability values of all pixel points in a target area of each probability map as the prediction probability of the text in the text box belonging to the preset language corresponding to the corresponding probability map;
And taking a preset language corresponding to the maximum probability value in each prediction probability as the language category to which the text in the text box belongs.
4. The method of claim 1, wherein the trained classification model is trained from a training sample set comprising sample images labeled with text location information and language information.
5. The method of claim 1 wherein the language identification portion is a full convolutional neural network and the language information includes at least two probability maps, each probability map corresponding one-to-one to a predetermined language in a predetermined language set;
the number of convolution kernels of the last layer of convolution layers in the language identification part is the same as the number of preset languages in the preset language set.
6. The method of claim 4, wherein the trained classification model is trained by:
selecting a sample image from the training sample set;
inputting the sample image into an untrained classification model to obtain position information and language information of a text in the sample image output by the untrained classification model;
and constructing a target loss function based on the position information and the language information output by the untrained classification model, and optimizing parameters in the untrained classification model according to the target loss function until the untrained classification model converges to obtain the trained classification model.
7. The method of claim 6, wherein the objective loss function includes a text classification loss term, a distance loss term, and a language classification loss term;
the text classification loss item is used for representing the difference between a text classification result and an actual text classification result in position information predicted by the untrained classification model, wherein the text classification result is used for representing whether the content of the position of the pixel point belongs to text content or not;
the distance loss item is used for representing the difference between the boundary distance in the position information predicted by the untrained classification model and the actual boundary distance, wherein the boundary distance is used for representing the distance between the position of the pixel point and the boundary of the text to which the pixel point belongs when the content of the position of the pixel point belongs to the text content;
the language classification loss term is used for representing the difference between the language information predicted by the untrained classification model and the actual language information.
8. A text recognition device, comprising:
the feature extraction unit is used for inputting the image to be identified into a trained classification model, and carrying out feature extraction on the image features of the image to be identified based on an image feature extraction part in the classification model to obtain a feature map corresponding to the image to be identified, wherein the image to be identified contains texts of at least one language class; the trained classification model also comprises a text detection part and a language identification part;
The detection unit is used for inputting the feature map into the text detection part, extracting features of text position features in the feature map based on the text detection part, and acquiring the position information of the text in the image to be identified, which is output by the text detection part; inputting the feature map into the language identification part, and carrying out feature extraction on text language features in the feature map based on the language identification part to obtain language information of the text in the image to be identified, which is output by the language identification part;
and the identification unit is used for identifying the text in the image to be identified based on the acquired position information and the language information.
9. An electronic device comprising a processor and a memory, wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that it comprises a program code for causing an electronic device to perform the steps of the method according to any one of claims 1-7, when said program code is run on the electronic device.
CN202010278046.5A 2020-04-10 2020-04-10 Text recognition method and device, electronic equipment and storage medium Active CN111488826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010278046.5A CN111488826B (en) 2020-04-10 2020-04-10 Text recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010278046.5A CN111488826B (en) 2020-04-10 2020-04-10 Text recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111488826A CN111488826A (en) 2020-08-04
CN111488826B true CN111488826B (en) 2023-10-17

Family

ID=71792786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010278046.5A Active CN111488826B (en) 2020-04-10 2020-04-10 Text recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111488826B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287763A (en) * 2020-09-27 2021-01-29 北京旷视科技有限公司 Image processing method, apparatus, device and medium
CN114330400B (en) * 2020-10-12 2023-12-08 珠海格力电器股份有限公司 Two-dimensional code image processing method, system, device, electronic equipment and storage medium
CN112232354A (en) * 2020-11-23 2021-01-15 中国第一汽车股份有限公司 Character recognition method, device, equipment and storage medium
CN112396054A (en) * 2020-11-30 2021-02-23 泰康保险集团股份有限公司 Text extraction method and device, electronic equipment and storage medium
CN112434970A (en) * 2020-12-12 2021-03-02 广东电力信息科技有限公司 Qualification data verification method and device based on intelligent data acquisition
CN112347262B (en) * 2021-01-11 2021-04-13 北京江融信科技有限公司 Text classification method and system, intention classification system and robot
CN113159020B (en) * 2021-03-10 2023-06-06 国网河北省电力有限公司 Text detection method based on kernel scale expansion
CN113076814B (en) * 2021-03-15 2022-02-25 腾讯科技(深圳)有限公司 Text area determination method, device, equipment and readable storage medium
WO2022227218A1 (en) * 2021-04-30 2022-11-03 平安科技(深圳)有限公司 Drug name recognition method and apparatus, and computer device and storage medium
CN113221718B (en) * 2021-05-06 2024-01-16 新东方教育科技集团有限公司 Formula identification method, device, storage medium and electronic equipment
CN113822275A (en) * 2021-09-27 2021-12-21 北京有竹居网络技术有限公司 Image language identification method and related equipment thereof
CN114359932B (en) * 2022-01-11 2023-05-23 北京百度网讯科技有限公司 Text detection method, text recognition method and device
CN114596566B (en) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 Text recognition method and related device
CN114842482B (en) * 2022-05-20 2023-03-17 北京百度网讯科技有限公司 Image classification method, device, equipment and storage medium
CN116563840B (en) * 2023-07-07 2023-09-05 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Scene text detection and recognition method based on weak supervision cross-mode contrast learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08272905A (en) * 1995-03-31 1996-10-18 Seiko Epson Corp Method and processor for image processing
CN106570521A (en) * 2016-10-24 2017-04-19 中国科学院自动化研究所 Multi-language scene character recognition method and recognition system
CN107622271A (en) * 2016-07-15 2018-01-23 科大讯飞股份有限公司 Handwriting text lines extracting method and system
CN109948696A (en) * 2019-03-19 2019-06-28 上海七牛信息技术有限公司 A kind of multilingual scene character recognition method and system
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN110569830A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Multi-language text recognition method and device, computer equipment and storage medium
CN110569835A (en) * 2018-06-06 2019-12-13 北京搜狗科技发展有限公司 Image identification method and device and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08272905A (en) * 1995-03-31 1996-10-18 Seiko Epson Corp Method and processor for image processing
CN107622271A (en) * 2016-07-15 2018-01-23 科大讯飞股份有限公司 Handwriting text lines extracting method and system
CN106570521A (en) * 2016-10-24 2017-04-19 中国科学院自动化研究所 Multi-language scene character recognition method and recognition system
CN110569835A (en) * 2018-06-06 2019-12-13 北京搜狗科技发展有限公司 Image identification method and device and electronic equipment
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN109948696A (en) * 2019-03-19 2019-06-28 上海七牛信息技术有限公司 A kind of multilingual scene character recognition method and system
CN110569830A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Multi-language text recognition method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video text region localization and recognition based on deep learning; 刘明珠 et al.; 哈尔滨理工大学学报 (Journal of Harbin University of Science and Technology) (06); full text *
A survey of scene text detection and recognition based on deep learning; 艾合麦提江・麦提托合提 et al.; 电视技术 (Video Engineering) (14); full text *

Also Published As

Publication number Publication date
CN111488826A (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
CN114155543B (en) Neural network training method, document image understanding method, device and equipment
US20190385054A1 (en) Text field detection using neural networks
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
US20180114071A1 (en) Method for analysing media content
CN114399769B (en) Training method of text recognition model, and text recognition method and device
US11816883B2 (en) Region proposal networks for automated bounding box detection and text segmentation
CN111488873B (en) Character level scene text detection method and device based on weak supervision learning
US11915500B2 (en) Neural network based scene text recognition
CN109934229B (en) Image processing method, device, medium and computing equipment
CN114596566B (en) Text recognition method and related device
CN115422389B (en) Method and device for processing text image and training method of neural network
CN112001394A (en) Dictation interaction method, system and device based on AI vision
WO2021237227A1 (en) Method and system for multi-language text recognition model with autonomous language classification
CN114898266A (en) Training method, image processing method, device, electronic device and storage medium
CN113223011A (en) Small sample image segmentation method based on guide network and full-connection conditional random field
WO2022127333A1 (en) Training method and apparatus for image segmentation model, image segmentation method and apparatus, and device
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
CN114120305A (en) Training method of text classification model, and recognition method and device of text content
CN113221718A (en) Formula identification method and device, storage medium and electronic equipment
CN115004261A (en) Text line detection
CN111753836A (en) Character recognition method and device, computer readable medium and electronic equipment
CN110807452A (en) Prediction model construction method, device and system and bank card number identification method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40027400

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant