CN117727037A - Text recognition method, text recognition device, computer equipment, storage medium and product


Info

Publication number: CN117727037A
Application number: CN202311301253.8A
Authority: CN (China)
Legal status: Pending
Prior art keywords: text, image, information, region, sample
Other languages: Chinese (zh)
Inventor: 隆超
Current and original assignee: Shuhang Technology Beijing Co ltd

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a text recognition method, a text recognition device, computer equipment, a storage medium and a product. The method comprises the following steps: acquiring a sample image data set, wherein the sample image data set comprises a plurality of sample images, and each sample image comprises text information and a text category label added for each piece of text information; acquiring one or more text region images in each sample image, and acquiring region information of each text region image in the corresponding sample image, wherein the region information comprises position information and/or size information; inputting each text region image, each piece of region information and each text category label into a preset neural network model for training to obtain a text recognition model; and acquiring an image to be processed and inputting it into the text recognition model to obtain the text and text category of the image to be processed. In this way, the text and the text categories in an image can be effectively recognized, improving the accuracy of text recognition.

Description

Text recognition method, text recognition device, computer equipment, storage medium and product
Technical Field
The present disclosure relates to the field of image recognition technologies, and in particular, to a text recognition method, apparatus, computer device, storage medium, and product.
Background
With the development of computer technology, image recognition is increasingly widely applied, for example to note pictures, video screenshots, advertisement pictures, commodity pictures and the like. Most such images contain characters, and character recognition greatly assists scene recognition, commodity content recognition, search recommendation and the like. Currently, the main method for recognizing text in an image is optical character recognition (OCR). However, although OCR can recognize the text information in an image, it does not know what text category the recognized text information belongs to. For example, OCR may recognize two texts from one image, respectively "622 5541" and "Guangdong", without knowing that the first text represents a bank card number and the second represents a home address. Therefore, how to recognize text in an image more effectively is very important.
Disclosure of Invention
The embodiment of the application provides a text recognition method, a device, computer equipment, a storage medium and a product, which can effectively recognize texts and text types in images and improve the accuracy of text recognition.
In a first aspect, an embodiment of the present application provides a text recognition method, including:
acquiring a sample image data set, wherein the sample image data set comprises a plurality of sample images, and each sample image comprises text information and text category labels added for each text information;
acquiring one or more text region images in each sample image, and acquiring region information of each text region image in each corresponding sample image in the one or more text region images, wherein the region information comprises position information and/or size information;
inputting the text region images, the region information of the text region images in the corresponding sample images and the text category labels of the text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model;
and acquiring an image to be processed, and inputting the image to be processed into the text recognition model to obtain text information of the image to be processed, wherein the text information comprises texts and text categories.
In a second aspect, an embodiment of the present application provides a text recognition apparatus, including:
a first acquisition unit, configured to acquire a sample image dataset including a plurality of sample images, each sample image including text information and a text category label added for each text information;
a second obtaining unit, configured to obtain one or more text region images in each sample image, and obtain region information of each text region image in each corresponding sample image in the one or more text region images, where the region information includes position information and/or size information;
the training unit is used for inputting the text region images, the region information of the text region images in the corresponding sample images and the text category labels of the text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model;
the recognition unit is used for acquiring an image to be processed, inputting the image to be processed into the text recognition model, and obtaining text information of the image to be processed, wherein the text information comprises texts and text categories.
In a third aspect, embodiments of the present application provide a computer device, the computer device comprising: a processor and a memory, the processor being configured to perform the method according to the first aspect.
In a fourth aspect, embodiments of the present application further provide a computer readable storage medium, where program instructions are stored, the program instructions when executed implement the method according to the first aspect.
In a fifth aspect, embodiments of the present application disclose a computer program product or computer program comprising program instructions which, when executed by a processor, implement the method of the first aspect described above.
According to the embodiment of the application, the sample image data set can be obtained, the sample image data set comprises a plurality of sample images, and each sample image comprises text information and a text category label added for each text information; one or more text region images in each sample image are acquired, as well as the region information of each text region image in the corresponding sample image, wherein the region information comprises position information and/or size information; each text region image, the region information of each text region image in the corresponding sample image and the text category labels of each text information corresponding to each sample image are input into a preset neural network model for training to obtain a text recognition model; and the image to be processed is input into the text recognition model to obtain the text and the text category of the image to be processed. In this way, the text and the text category in an image can be effectively recognized, thereby improving the accuracy of text recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that a person skilled in the art may obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a text recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a text region image provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a text recognition provided by an embodiment of the present application;
FIG. 5 is a flowchart of another text recognition method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of model training based on region information provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a structure for determining a loss function value;
FIG. 8 is a flowchart of yet another text recognition method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a text recognition device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The text recognition method can be applied to various scenes in which text is recognized from images (such as note pictures, video screenshots, advertisement pictures, commodity pictures and the like); recognizing text from images assists scene recognition, commodity content recognition, search recommendation and the like.
A sample image data set is obtained, wherein the sample image data set comprises a plurality of sample images, and each sample image comprises text information and a text category label added for each text information; one or more text region images in each sample image are acquired, as well as the region information of each text region image in the corresponding sample image, wherein the region information comprises position information and/or size information; each text region image, the region information of each text region image in the corresponding sample image and the text category labels of each text information corresponding to each sample image are input into a preset neural network model for training to obtain a text recognition model; and the image to be processed is input into the text recognition model to obtain the text and the text category of the image to be processed. In this way, the text and the text category in an image can be effectively recognized, thereby improving the accuracy of text recognition.
The text recognition method provided in the embodiments of the present application may be applied to a text recognition device, where the text recognition device may be disposed in a computer device, and in some embodiments, the computer device may include, but is not limited to, a smart terminal device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, an on-vehicle smart terminal, a smart watch, and the like.
The text recognition method provided by the embodiment of the application is schematically described below with reference to the accompanying drawings.
Referring specifically to fig. 1, fig. 1 is a schematic flow chart of a text recognition method provided in an embodiment of the present application, where the text recognition method in the embodiment of the present application may be executed by a text recognition device, where the text recognition device may be disposed in a computer apparatus.
S101: a sample image dataset is obtained, the sample image dataset comprising a plurality of sample images, each sample image comprising text information and a text category label added for each text information.
In an embodiment of the present application, the computer device may obtain a sample image dataset, where the sample image dataset includes a plurality of sample images, and each sample image includes text information and a text category label added for each piece of text information. In some embodiments, the text category label may include, but is not limited to, any one or more characters such as letters, numbers, etc., and is used to indicate the category of the text information in the corresponding sample image; for example, text category labels may include a bank name, a sign name, a person name, etc.
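For illustration, the following is a minimal sketch of what one annotated sample might look like, assuming a simple dict-based format; the field names and example values are illustrative assumptions, not the patent's own data format.

```python
# Hypothetical annotation for one sample image: each piece of text
# information carries a text category label added for it.
sample = {
    "image_path": "samples/storefront_001.jpg",   # assumed path
    "annotations": [
        {"text": "622 5541",  "category": "bank card number"},
        {"text": "Guangdong", "category": "home address"},
    ],
}

# The sample image dataset is then simply a collection of such samples.
sample_image_dataset = [sample]
```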
S102: one or more text region images in each sample image are acquired, and region information of each text region image in the one or more text region images in the corresponding each sample image is acquired, wherein the region information comprises position information and/or size information.
In this embodiment of the present application, the computer device may obtain one or more text region images in each sample image, and obtain region information of each text region image in each corresponding sample image in the one or more text region images, where the region information may include location information and/or size information, where the location information may include, but is not limited to, coordinate information, longitude and latitude information, and the like, and the size information may include, but is not limited to, one or more of length, width, height, area, and the like.
In one embodiment, the region information includes location information, and the computer device may obtain coordinate information of each text region image in each corresponding sample image when obtaining the region information of each text region image in each corresponding sample image in the one or more text region images; and determining the position information of each text region image in each corresponding sample image according to the coordinate information of each text region image in each corresponding sample image.
In one embodiment, when one sample image includes a plurality of text region images, the computer device may acquire the coordinate information of each text region image in that sample image, as well as the length, width, height and area of each text region image.
In some embodiments, the coordinate information includes vertex coordinate information of each text region image in the corresponding sample image, the vertex coordinate information being determined from a coordinate system established with the geometric center of the image region of each text region image as the origin. In some embodiments, the vertex coordinate information may include upper-left vertex coordinate information and lower-right vertex coordinate information of the image region in which each text region image is located in the coordinate system, where the upper-left vertex coordinate and the lower-right vertex coordinate are determined with the origin of the coordinate system as the midpoint (i.e. the reference point): the upper-left vertex lies in the quadrant of negative X coordinates and positive Y coordinates, and the lower-right vertex lies in the quadrant of positive X coordinates and negative Y coordinates. The upper-left vertex coordinate information includes the X coordinate and the Y coordinate of the upper-left vertex, and the lower-right vertex coordinate information includes the X coordinate and the Y coordinate of the lower-right vertex. For example, the text region image is a regular rectangular text region cropped out using the upper-left point (x1, y1) and the lower-right point (x2, y2).
In some embodiments, the vertex coordinate information may include upper left vertex coordinate information, lower left vertex coordinate information, upper right vertex coordinate information, and lower right vertex coordinate information of the image region in which the respective text region image is located in the coordinate system. Specifically, taking fig. 2 as an example, fig. 2 is a schematic diagram of a text region image provided in the embodiment of the present application, and as shown in fig. 2, the vertex coordinate information is determined according to a coordinate system established with the geometric center 22 of the image region 21 of the text region image as an origin, and the vertex coordinate information includes upper left vertex coordinate information 231, lower left vertex coordinate information 232, upper right vertex coordinate information 233, and lower right vertex coordinate information 234.
In one embodiment, the region information includes size information, and the computer device may determine an image height and an image width of each text region image in each sample image when acquiring the region information in which the text region image in each sample image is located in each corresponding sample image; determining the area of each text region image according to the image height and the image width; and determining the image height, the image width and the area of each text region image as the size information of each text region image in the corresponding sample image.
In one embodiment, the computer device may calculate a product of the image height and the image width of each text region image and determine the product of the image height and the image width of each text region image as the area of the corresponding each text region image when determining the area of each text region image from the image height and the image width. Taking fig. 2 as an example, assume that the computer device determines that the image height of the text region image 20 in the corresponding sample image is h and the image width is w; the area of the text region image 20 is determined to be h×w according to the image height h and the image width w.
In one embodiment, the region information includes position information and size information, and the computer device may acquire coordinate information of each text region image in each corresponding sample image when acquiring the region information of the text region image in each corresponding sample image; determining position information of each text region image according to coordinate information of each text region image in each corresponding sample image, and determining image height and image width of each text region image in each corresponding sample image; determining the area of each text region image according to the image height and the image width; and determining the image height, the image width and the area of each text region image as the size information of each text region image in the corresponding sample image. The embodiment of the application is helpful for more accurately confirming the area where the text information in the sample image is located.
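To make the region information concrete, the following minimal sketch (an illustration under stated assumptions, not the patent's code) derives the position information (upper-left and lower-right vertex coordinates in a coordinate system whose origin is the geometric center of the text region image) and the size information (image width, image height and area) from a pixel-space bounding box.

```python
def region_info(box):
    """Compute position and size information for a text region image.

    `box` is (left, top, right, bottom) in pixel coordinates of the
    sample image. Vertex coordinates are returned in a coordinate
    system whose origin is the geometric center of the text region
    image, with Y growing upward; all names are illustrative.
    """
    left, top, right, bottom = box
    w, h = right - left, bottom - top                    # image width and height
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0  # geometric center

    # The upper-left vertex falls in the negative-X / positive-Y
    # quadrant, the lower-right vertex in the positive-X / negative-Y
    # quadrant, with the origin as the reference point.
    position = {
        "upper_left":  (left - cx, cy - top),      # (-w/2, +h/2)
        "lower_right": (right - cx, cy - bottom),  # (+w/2, -h/2)
    }
    size = {"width": w, "height": h, "area": w * h}  # area = h x w
    return position, size
```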
In one embodiment, the computer device may obtain target scene information and select one or more text region images from each sample image based on the target scene before obtaining the one or more text region images from each sample image. In some embodiments, the target scene information may be information of any scene pre-defined by a user, including but not limited to scenes such as data identification and data query, e.g., a user address query scene, a bank category query scene, a commodity category identification scene, etc.
In one embodiment, when one or more text region images are selected from each sample image according to the target scene, the computer device may determine data information associated with the target scene according to the target scene, and select one or more text region images corresponding to the data information associated with the target scene from each sample image.
For example, assuming that the target scene is a user address query scene, the computer device may determine the user information and address information associated with that scene and select one or more text region images that include the user information and the address information from the respective sample images.

For another example, assuming that the target scene is a bank category query scene, the computer device may determine information such as a bank identifier and a bank name according to that scene, and select one or more text region images associated with such information from the respective sample images.

For another example, assuming that the target scene is a commodity category identification scene, the computer device may determine information such as a commodity name, a commodity shape and a commodity color according to that scene, and select one or more text region images corresponding to such information from the respective sample images.
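A sketch of this scene-based selection follows; the keyword-matching strategy is an assumption — the patent only states that regions corresponding to the scene's associated data information are selected.

```python
def select_regions_for_scene(regions, scene_keywords):
    """Keep only the text region images whose (assumed) content hint
    matches the data information associated with the target scene."""
    return [r for r in regions
            if any(k in r["content_hint"] for k in scene_keywords)]

# e.g. for a bank category query scene:
bank_regions = select_regions_for_scene(
    regions=[{"content_hint": "bank name", "box": (10, 20, 200, 60)},
             {"content_hint": "commodity color", "box": (10, 80, 200, 120)}],
    scene_keywords={"bank identifier", "bank name"},
)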
According to the method and the device for identifying the text region images, the one or more text region images corresponding to the target scene are selected from the sample images by combining the target scene, the text region images corresponding to the target scene can be identified from the sample images more quickly and accurately, training is facilitated, a text identification model capable of identifying text information required under different scenes can be obtained, and accuracy and efficiency of identifying the text identification model are improved.
S103: and inputting each text region image, region information of each text region image in each corresponding sample image and text category labels of each text information corresponding to each sample image into a preset neural network model for training to obtain a text recognition model.
In the embodiment of the application, the computer device may input each text region image, the region information of each text region image in the corresponding sample image, and the text category label of each piece of text information corresponding to each sample image into a preset neural network model for training, so as to obtain a text recognition model. In certain embodiments, the preset neural network model may include, but is not limited to, a convolutional recurrent neural network (CRNN) model or the like. Further, as shown in fig. 3, fig. 3 is a schematic structural diagram of a convolutional recurrent neural network model provided in the embodiment of the present application. As shown in fig. 3, the model includes a convolutional layer 31, a recurrent layer 32 and an output layer 33 (i.e. a recognition layer). An image 34 to be recognized containing the text "state" is input into the convolutional layer 31 to obtain a convolutional feature map 311; the convolutional feature map 311 is then input into the recurrent layer 32 to obtain a feature vector 321 (i.e. a feature sequence); the feature vector 321 is passed through a bidirectional Long Short-Term Memory (LSTM) module 322 in the recurrent layer 32, and the output data of the bidirectional LSTM is input into the output layer 33 to obtain the text 331 "state" in the image 34 to be recognized.
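For concreteness, a minimal PyTorch sketch of a CRNN in the spirit of fig. 3 follows; layer sizes and hyperparameters are illustrative assumptions, not the patent's configuration.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Minimal CRNN: a convolutional layer stack (cf. layer 31), a
    bidirectional LSTM recurrent layer (cf. layer 32 / module 322),
    and an output layer (cf. layer 33)."""

    def __init__(self, num_classes, img_h=32, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(                # convolutional layer
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_h = img_h // 4                       # height after two poolings
        self.rnn = nn.LSTM(128 * feat_h, hidden,  # recurrent layer (BiLSTM)
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # output layer

    def forward(self, x):                         # x: (B, 1, H, W)
        f = self.conv(x)                          # convolutional feature map
        b, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # feature sequence
        out, _ = self.rnn(seq)
        return self.fc(out)                       # per-timestep class scores
```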
In one embodiment, when the computer device inputs each text region image, region information of each text region image in each corresponding sample image, and text category labels of each text information corresponding to each sample image into a preset neural network model to train, and obtains a text recognition model, text image feature information can be extracted from each text region image; extracting region characteristic information from region information of each text region image; determining text category labels of the text region images according to the text category labels of each text message corresponding to each sample image; and inputting the extracted characteristic information of each text image, the extracted characteristic information of each region and the text category labels of each text region image into a preset neural network model for training to obtain the text recognition model.
In one embodiment, when the computer device inputs each text region image, the region information of each text region image in the corresponding sample image and the text category labels of each text information corresponding to each sample image into a preset neural network model for training to obtain a text recognition model, it may acquire target scene information and extract scene feature information from the target scene information; further extract text image feature information from each text region image; extract region feature information from the region information of each text region image; determine the text category label of each text region image according to the text category labels of each text message corresponding to each sample image; and input the extracted scene feature information, each text image feature information, each region feature information and the text category labels of each text region image into the preset neural network model for training to obtain the text recognition model.
By introducing scene information to train the text recognition model, the text information under different scenes can be accurately and efficiently recognized through the text recognition model.
S104: and acquiring an image to be processed, and inputting the image to be processed into the text recognition model to obtain text information of the image to be processed, wherein the text information comprises texts and text categories.
In this embodiment of the present application, the computer device may acquire an image to be processed, and input the image to be processed into the text recognition model to obtain text information of the image to be processed, where the text information includes text and a text category.
In one embodiment, the computer device may obtain a text region image in the image to be processed and region information of the text region image, where the region information of the text region image in the image to be processed may include position information and/or size information of the text region, and input the text region image of the image to be processed and the region information of the text region image into the text recognition model to obtain text and text category of the image to be processed.
Taking fig. 4 as an example, fig. 4 is a schematic diagram of text recognition provided in the embodiment of the present application. As shown in fig. 4, it is assumed that the region information 42 of the text region image is obtained from the to-be-processed image 41; the to-be-processed image 41 and the region information 42 of its text region image are input into the text recognition model 43, and the text 44 of the to-be-processed image is recognized as "front gate big fence general store" with the text category 45 of the text 44 being "shop sign".
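A minimal inference sketch in the spirit of fig. 4, reusing the region_info helper above; crop, text_recognition_model and the box values are hypothetical stand-ins, not the patent's API.

```python
# Hypothetical inference flow: crop the text region image from the
# image to be processed, compute its region information, and feed both
# to the trained text recognition model.
box = (120, 40, 520, 110)                      # assumed detected text box
region_img = crop(image_to_process, box)       # hypothetical crop helper
position, size = region_info(box)
text, category = text_recognition_model(region_img, position, size)
# e.g. text == "front gate big fence general store", category == "shop sign"
```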
According to the method and the device for identifying the text and the text category in the image to be processed, the corresponding text is selected according to the text category under different identification scenes, and therefore a user can quickly and accurately acquire the corresponding text under different identification scenes.
In one embodiment, after the text recognition model recognizes the text and the text category of the image to be processed, the computer device may generate, according to the text and the text category of the image to be processed, a prompt message, where the prompt message is used to indicate each text and the text category corresponding to each text, so that a user can view the text and the text category more conveniently and quickly, and user experience is improved.
In one embodiment, the computer device may obtain the target scene information corresponding to the image to be processed, and input the target scene information corresponding to the image to be processed and the image to be processed into the text recognition model, to obtain the text and the text category of the image to be processed in the target scene.
In one embodiment, when the target scene information corresponding to the image to be processed and the image to be processed are input into the text recognition model, the computer device may determine one or more text region images corresponding to the target scene information corresponding to the image to be processed from the image to be processed according to the target scene information corresponding to the image to be processed, obtain region information (i.e. position information and/or size information) of each text region image in the image to be processed, further input each text region image and region information of each text region image in the image to be processed into the text recognition model, and recognize and obtain texts and text categories required by the image to be processed under the target scene.
Taking fig. 4 as an example, assuming that the target scene information corresponding to the image to be processed is a shop name, the computer device may determine one text region image 41 corresponding to the shop name from the image to be processed, obtain the region information 42 of the text region image 41 in the image to be processed, then input the text region image 41 and the region information 42 into the text recognition model 43, and recognize the text 44 "front gate big fence general store" and the text category 45 "shop sign" required by the image to be processed in the shop name recognition scene.
According to the method and the device for identifying the text recognition model, the text recognition model obtained through scene information training is introduced to identify the image to be processed, different types of texts in the same image to be processed can be identified under different scenes, unnecessary text information of a certain scene is prevented from being identified, recognition workload is reduced, and efficiency and accuracy of text recognition are improved.
In one embodiment, after the computer device identifies the text and the text category of the image to be processed in the target scene through the text identification model, the computer device can generate prompt information according to the text and the text category of the image to be processed in the target scene, where the prompt information is used for indicating each text and the text category corresponding to each text of the image to be processed in the target scene, so that a user can check more conveniently and quickly, and user experience is improved.
According to the embodiment of the application, the sample image data set can be obtained, the sample image data set comprises a plurality of sample images, and each sample image comprises text information and a text category label added for each text information; one or more text region images in each sample image are acquired, as well as the region information of each text region image in the corresponding sample image, wherein the region information comprises position information and/or size information; each text region image, each piece of region information and each text category label are input into a preset neural network model for training to obtain a text recognition model; and an image to be processed is obtained and input into the text recognition model to obtain the text and text category of the image to be processed. In this way, the text and the text category in an image can be effectively recognized, improving the accuracy of text recognition.
Referring specifically to fig. 5, fig. 5 is a schematic flow chart of another text recognition method provided in an embodiment of the present application, where the text recognition method of the embodiment of the present application may be performed by a text recognition device, where the text recognition device is disposed in a computer device, and a specific explanation of the computer device is as described above. Specifically, the method of the embodiment of the application comprises the following steps.
S501: a sample image dataset is obtained, the sample image dataset comprising a plurality of sample images, each sample image comprising text information and a text category label added for each text information.
S502: one or more text region images in each sample image are acquired, and region information of each text region image in the one or more text region images in the corresponding each sample image is acquired, wherein the region information comprises position information and/or size information.
S503: text image feature information is extracted from each text region image, and region feature information is extracted from region information of each text region image.
In the embodiment of the application, the computer device can extract the text image characteristic information from each text region image and extract the region characteristic information from the region information of each text region image.
In one embodiment, the computer device may extract location feature information and/or size feature information from the region information of each text region image when extracting the region feature information from the region information of each text region image.
Further, the computer device may extract the position feature information from the position information of each text region image when extracting the position feature information from the region information of each text region image.
Further, the computer device may extract the height feature information from the height information in the size information of each text region image, extract the width feature information from the width information in the size information of each text region image, and extract the area feature information from the area information in the size information of each text region image when extracting the size feature information from the region information of each text region image.
S504: and inputting the extracted characteristic information of each text image, the extracted characteristic information of each region and the text category labels of each text region image into a preset neural network model for training to obtain the text recognition model.
In the embodiment of the application, the computer device may input the extracted feature information of each text image, the extracted feature information of each region, and the text category label of each text region image into a preset neural network model to train, so as to obtain the text recognition model.
In one embodiment, the area information includes location information and size information; the computer equipment can extract position characteristic information and size characteristic information from the area information of each text area image, and input the extracted text image characteristic information, position characteristic information and size characteristic information of each text area image and text category labels of each text area image into a preset neural network model for training to obtain the text recognition model.
Taking fig. 6 as an example for illustration, fig. 6 is a schematic diagram of model training based on region information. As shown in fig. 6, the position information 61 extracted from the region information of a text region image includes the upper-left vertex X coordinate, the upper-left vertex Y coordinate, the lower-right vertex X coordinate and the lower-right vertex Y coordinate, and the size information 62 extracted from the region information of the text region image includes the image width, the image height and the area (i.e. the area of the image occupied by the text). The extracted upper-left vertex X coordinate, upper-left vertex Y coordinate, lower-right vertex X coordinate and lower-right vertex Y coordinate of each text region image, together with the image width, image height and area, are input into an embedding layer 63 of the preset neural network model for embedding, yielding position feature information (e.g. a position feature vector) and size feature information (e.g. a size feature vector); the position feature information and the size feature information are then input into the fully connected layer FC 64, where they are fused with the text image feature information input into that layer, and training is performed on the fused result to obtain the text recognition model.
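A minimal sketch of the fig. 6 fusion, assuming PyTorch and illustrative dimensions: the seven region scalars pass through an embedding layer, and the resulting position/size features are fused with the text image features in the layer FC. This is a sketch under stated assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class RegionFeatureFusion(nn.Module):
    """Fuse region information (upper-left X/Y, lower-right X/Y,
    image width, image height, area) with text image features."""

    def __init__(self, img_feat_dim=512, region_dim=7, emb_dim=64):
        super().__init__()
        self.embed = nn.Linear(region_dim, emb_dim)                 # cf. embedding layer 63
        self.fc = nn.Linear(img_feat_dim + emb_dim, img_feat_dim)   # cf. layer FC 64

    def forward(self, img_feat, region_vec):
        region_feat = torch.relu(self.embed(region_vec))   # position/size features
        fused = torch.cat([img_feat, region_feat], dim=-1)
        return torch.relu(self.fc(fused))                  # fused training feature
```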
In one embodiment, the area information includes location information; the computer equipment can extract position characteristic information from the area information of each text area image, and input the extracted text image characteristic information and position characteristic information of each text area image and text category labels of each text area image into a preset neural network model for training to obtain the text recognition model.
In one embodiment, when the computer device inputs the extracted text image feature information and region feature information of each text region image and the text category label of each text region image into a preset neural network model for training to obtain a text recognition model, the computer device may input the obtained text image feature information, region feature information and text category label of each text region image into the preset neural network model for training to obtain the sample text corresponding to each text region image and the category information of each sample text; determine a loss function value according to each sample text, the category information of each sample text and the text category label of each text region image, and adjust the model parameters of the preset neural network model according to the loss function value; and input the feature information of each text image, the feature information of each region and the text category label of each text region image into the neural network model with the adjusted model parameters for retraining, determining that the text recognition model is obtained when the loss function value obtained by retraining is smaller than a function threshold.
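A minimal training-loop sketch of this train / adjust / retrain-until-below-threshold procedure; model, loss_fn, train_loader, the optimizer choice and the threshold value are all assumptions (loss_fn is sketched further below).

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_threshold = 0.05   # hypothetical function threshold

while True:
    total = 0.0
    for region_imgs, region_vecs, labels in train_loader:
        logits = model(region_imgs, region_vecs)   # forward pass
        loss = loss_fn(logits, labels)             # loss function value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                           # adjust model parameters
        total += loss.item()
    if total / len(train_loader) < loss_threshold:
        break   # loss below threshold: the text recognition model is obtained
```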
In one embodiment, the computer device may compare the category information of each sample text with the text category label corresponding to each text region image when determining the loss function value based on each sample text, the category information of each sample text, and the text category label of each text region image; determining the category similarity between the category of each sample text obtained by the neural network model and the corresponding text category label of each text region image according to the comparison result; and determining the loss function value according to the category similarity.
In one embodiment, when comparing the category information of each sample text with the text category label corresponding to each text region image, and determining the category similarity between the category of each sample text obtained by the neural network model and the text category label corresponding to each text region image according to the comparison result, the computer device may calculate the similarity between the category information of each sample text and the corresponding text category label using a preset similarity algorithm. The preset similarity algorithm may include, but is not limited to, a cosine similarity algorithm.
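A sketch of the cosine-similarity comparison, assuming the predicted category and the label are both available as feature vectors (this representation is an assumption):

```python
import torch.nn.functional as F

def category_similarity_loss(pred_cat_feat, label_cat_feat):
    """Category similarity via cosine similarity between the predicted
    category embedding and the label embedding; the loss contribution
    shrinks as the two categories agree."""
    sim = F.cosine_similarity(pred_cat_feat, label_cat_feat, dim=-1)
    return (1.0 - sim).mean()
```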
In one embodiment, the loss function value determined by the computer device includes a loss function value obtained by performing text recognition on a sample image and a structured cross-entropy loss function value obtained from the text recognition model, wherein the loss function value derived from text recognition includes a Connectionist Temporal Classification loss (CTC Loss) based on neural-network time-series classification and a Center Loss for text similarity. CTC Loss is a loss calculation method designed specifically for such scenes that does not require alignment, and Center Loss takes the distance between each feature and its feature center, together with the Softmax loss, as the loss function. Specifically, as shown in fig. 7, fig. 7 is a schematic diagram of a structure for determining a loss function value; as shown in fig. 7, the CTC Loss 711 and the Center Loss 712 are obtained by performing text recognition 71 on a sample image 70, and the structured cross-entropy loss 721 is obtained from the structure 72 produced by the text recognition model.
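A sketch combining the three terms in PyTorch; the loss weights and tensor shapes are assumptions, and the Center Loss is written out directly as the squared distance of each feature to its class center.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)     # alignment-free recognition loss (CTC Loss)
ce = nn.CrossEntropyLoss()    # structured cross-entropy loss

def total_loss(log_probs, targets, in_lens, tgt_lens,
               feats, centers, cat_logits, cat_labels,
               w_ctc=1.0, w_center=0.1, w_ce=1.0):
    """Sum of CTC Loss and Center Loss from text recognition of the
    sample image, plus the structured cross-entropy loss; the weights
    are illustrative assumptions."""
    # Center Loss: distance between each feature and its class center.
    center_loss = ((feats - centers[cat_labels]) ** 2).sum(dim=1).mean()
    return (w_ctc * ctc(log_probs, targets, in_lens, tgt_lens)
            + w_center * center_loss
            + w_ce * ce(cat_logits, cat_labels))
```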
Determining the loss function value for training the text recognition model by combining the loss function values derived from text recognition with the structured cross-entropy loss function value helps to train a better text recognition model, and improves the reliability and accuracy of the text recognition model obtained by training.
According to the embodiment of the application, the text recognition model capable of recognizing the text information in the image and the category of the text information can be obtained through training in the mode, the reliability of the text recognition model can be improved, and the accuracy and the effectiveness of recognizing the text from the image can be improved.
The embodiment of the application can acquire a sample image data set, wherein the sample image data set comprises a plurality of sample images, and each sample image comprises text information and a text category label added for each text information; acquire the region information of the text region images in each sample image in the corresponding sample image, wherein the region information comprises position information and/or size information; extract text image feature information from each text region image, and extract region feature information from the region information of each text region image; input the extracted text image feature information and region feature information of each text region image, together with the text category labels of each text region image, into a preset neural network model for training to obtain the text recognition model; and acquire an image to be processed and input it into the text recognition model to obtain the text information of the image to be processed, wherein the text information comprises text and a text category, thereby improving the accuracy and effectiveness of text recognition.
Referring specifically to fig. 8, fig. 8 is a flowchart of yet another text recognition method provided in an embodiment of the present application, where the text recognition method of the embodiment of the present application may be performed by a text recognition device, where the text recognition device is disposed in a computer device, and where a specific explanation of the computer device is as described above. Specifically, the method of the embodiment of the application comprises the following steps.
S801: a sample image dataset is obtained, the sample image dataset comprising a plurality of sample images, each sample image comprising text information and a text category label added for each text information.
S802: acquiring target scene information, selecting one or more text region images from each sample image according to the target scene, and acquiring region information of each text region image in the one or more text region images in the corresponding sample images, wherein the region information comprises position information and/or size information.
S803: scene feature information is extracted from the target scene information, text image feature information is extracted from each text region image, and region feature information is extracted from region information of each text region image.
In the embodiment of the application, the computer device may extract scene feature information from the target scene information, extract text image feature information from each text region image, and extract region feature information from the region information of each text region image. In some embodiments, the target scene information may be text description information.
When extracting the scene feature information from the target scene information, the computer device may extract the scene feature information, i.e. text description feature information, from the text description information used to describe the target scene.
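A minimal sketch of turning the text description of the target scene into scene feature information; the hashing bag-of-words encoder is purely an illustrative assumption — the patent does not fix a particular encoder.

```python
import torch

def scene_feature(description: str, dim: int = 64) -> torch.Tensor:
    """Encode the scene's text description as a fixed-size feature
    vector via a hashed bag of words (assumed encoding)."""
    vec = torch.zeros(dim)
    for token in description.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / max(float(vec.norm()), 1e-6)

feat = scene_feature("bank category query")
```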
The computer device may extract the position feature information and/or the size feature information from the region information of each text region image when extracting the region feature information from the region information of each text region image.
Further, the computer device may extract the position feature information from the position information of each text region image when extracting the position feature information from the region information of each text region image.
Further, the computer device may extract the height feature information from the height information in the size information of each text region image, extract the width feature information from the width information in the size information of each text region image, and extract the area feature information from the area information in the size information of each text region image when extracting the size feature information from the region information of each text region image.
S804: and inputting the extracted scene characteristic information, each text image characteristic information, each region characteristic information and the text category labels of each text region image into a preset neural network model for training to obtain the text recognition model.
In the embodiment of the application, the computer device may input the extracted scene feature information, each text image feature information, each region feature information, and the text category label of each text region image into a preset neural network model to train, so as to obtain the text recognition model. Wherein each region characteristic information includes position characteristic information and/or size characteristic information.
In one embodiment, when the computer device inputs the extracted scene feature information, the text image feature information and the region feature information of each text region image and the text category label of each text region image into a preset neural network model for training to obtain a text recognition model, the computer device may input the obtained scene feature information, each text image feature information, each region feature information and the text category label of each text region image into the preset neural network model for training to obtain the sample text corresponding to each text region image in the target scene and the category information of each sample text; determine a loss function value according to each sample text, the category information of each sample text and the text category label of each text region image, and adjust the model parameters of the preset neural network model according to the loss function value; and input the feature information of each text image, the feature information of each region and the text category label of each text region image into the neural network model with the adjusted model parameters for retraining, determining that the text recognition model is obtained when the loss function value obtained by retraining is smaller than a function threshold.
According to the method for training the text recognition model by introducing the target scene, the text recognition model can recognize the corresponding text information aiming at different scenes, and the accuracy and the efficiency of text recognition in different scenes are improved.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a text recognition device according to an embodiment of the present application. Specifically, the device is arranged in computer equipment, and the device comprises: a first acquisition unit 901, a second acquisition unit 902, a training unit 903, and an identification unit 904;
a first obtaining unit 901 for obtaining a sample image dataset including a plurality of sample images, each sample image including text information and a text category label added for each text information;
a second obtaining unit 902, configured to obtain one or more text region images in the respective sample images, and obtain region information of each text region image in the one or more text region images in the respective sample images, where the region information includes location information and/or size information;
the training unit 903 is configured to input the text region images, region information of the text region images in the corresponding sample images, and text category labels of each text information corresponding to the sample images into a preset neural network model for training, so as to obtain a text recognition model;
The identifying unit 904 is configured to obtain an image to be processed, and input the image to be processed into the text identifying model to obtain text information of the image to be processed, where the text information includes text and text category.
Further, the area information includes location information; the second acquiring unit 902 is specifically configured to, when acquiring the area information of each text area image in the one or more text area images in the corresponding sample images:
acquiring coordinate information of each text region image in each corresponding sample image;
and determining the position information of each text region image in each corresponding sample image according to the coordinate information of each text region image in each corresponding sample image.
Further, the coordinate information includes vertex coordinate information of each text region image in the corresponding sample image, the vertex coordinate information is determined according to a coordinate system established by taking a geometric center of an image region of each text region image as an origin, and the vertex coordinate information includes upper left vertex coordinate information and lower right vertex coordinate information of the image region of each text region image in the coordinate system.
Further, the area information includes size information; the second acquiring unit 902 is specifically configured to, when acquiring the area information of each text area image in the one or more text area images in the corresponding sample images:
determining the image height and the image width of each text region image in each corresponding sample image;
determining the area of each text region image according to the image height and the image width;
and determining the image height, the image width and the area of each text region image as the size information of each text region image in the corresponding sample image.
Further, when the training unit 903 inputs the text region images, the region information of the text region images in the corresponding sample images, and the text category labels of each text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model, it is specifically configured to:
extracting text image characteristic information from each text region image;
extracting region characteristic information from the region information of each text region image;
Determining text category labels of the text region images according to the text category labels of each text message corresponding to the sample images;
and inputting the extracted characteristic information of each text image, the characteristic information of each region and the text category labels of each text region image into a preset neural network model for training to obtain the text recognition model.
Further, when the training unit 903 inputs the extracted feature information of each text image, feature information of each region, and text category labels of each text region image into a preset neural network model for training to obtain the text recognition model, it is specifically configured to:
inputting the obtained characteristic information of each text image, the characteristic information of each region and the text category label of each text region image into the preset neural network model for training to obtain sample texts corresponding to each text region image and category information of each sample text;
determining a loss function value according to the sample texts, the category information of the sample texts and the text category labels of the text region images, and adjusting model parameters of the preset neural network model according to the loss function value;
And inputting the characteristic information of each text image, the characteristic information of each region and the text type label of each text region image into a neural network model after adjusting model parameters for retraining, and determining to obtain the text recognition model when the loss function value obtained by retraining is smaller than a function threshold value.
Further, when the training unit 903 determines the loss function value according to the respective sample texts, the category information of the respective sample texts, and the text category labels of the respective text region images, the training unit is specifically configured to:
comparing the category information of each sample text with the text category labels corresponding to each text region image;
determining the category similarity between the category of each sample text obtained by the neural network model and the corresponding text category label of each text region image according to the comparison result;
and determining the loss function value according to the category similarity.
According to the embodiment of the application, the sample image data set can be obtained, the sample image data set comprises a plurality of sample images, and each sample image comprises text information and a text category label added for each text information; one or more text region images in each sample image are acquired, as well as the region information of each text region image in the corresponding sample image, wherein the region information comprises position information and/or size information; each text region image, each piece of region information and each text category label are input into a preset neural network model for training to obtain a text recognition model; and an image to be processed is obtained and input into the text recognition model to obtain the text and text category of the image to be processed. In this way, the text and the text category in an image can be effectively recognized, improving the accuracy of text recognition.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. Specifically, the computer device includes: memory 1001, processor 1002.
In one embodiment, the computer device further comprises a data interface 1003, the data interface 1003 being used for transferring data information between the computer device and other devices.
The memory 1001 may include volatile memory; the memory 1001 may also include non-volatile memory; the memory 1001 may also include a combination of the above types of memory. The processor 1002 may be a central processing unit (CPU). The processor 1002 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), or any combination thereof.
The memory 1001 is configured to store a program, and the processor 1002 may call the program stored in the memory 1001, for performing the steps of:
acquiring a sample image data set, wherein the sample image data set comprises a plurality of sample images, and each sample image comprises text information and text category labels added for each text information;
acquiring one or more text region images in each sample image, and acquiring region information of each text region image in each corresponding sample image in the one or more text region images, wherein the region information comprises position information and/or size information;
inputting the text region images, the region information of the text region images in the corresponding sample images and the text category labels of the text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model;
and acquiring an image to be processed, and inputting the image to be processed into the text recognition model to obtain text information of the image to be processed, wherein the text information comprises texts and text categories.
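Before turning to the refinements below, the data the first two steps assemble can be sketched as follows. The data classes and field names are hypothetical placeholders, since the application does not prescribe a schema; the sketch only shows each text region image being paired with its region information source and its text category label.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextRegion:
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) corners in the sample image
    text: str                       # annotated text information
    category_label: str             # text category label added for this text

@dataclass
class SampleImage:
    image_path: str
    regions: List[TextRegion]

def build_training_examples(dataset: List[SampleImage]):
    """Pair each text region image with its region information source and its
    text category label, ready to be fed to the preset neural network model."""
    examples = []
    for sample in dataset:
        for region in sample.regions:
            examples.append({
                "image_path": sample.image_path,  # crop region.box from this image
                "box": region.box,                # source of position/size information
                "label": region.category_label,
            })
    return examples
```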
Further, the region information includes position information; when acquiring the region information of each of the one or more text region images in its corresponding sample image, the processor 1002 is specifically configured to:
acquiring coordinate information of each text region image in each corresponding sample image;
and determining the position information of each text region image in each corresponding sample image according to the coordinate information of each text region image in each corresponding sample image.
Further, the coordinate information includes vertex coordinate information of each text region image in the corresponding sample image, the vertex coordinate information is determined according to a coordinate system established by taking a geometric center of an image region of each text region image as an origin, and the vertex coordinate information includes upper left vertex coordinate information and lower right vertex coordinate information of the image region of each text region image in the coordinate system.
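A small sketch of the coordinate convention just described, assuming pixel corner coordinates and the usual image-axis orientation (y increasing downward); both assumptions go beyond what the application states.

```python
def vertex_coordinates(box):
    """Express a text region's upper-left and lower-right vertices in a
    coordinate system whose origin is the region's geometric center."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # geometric center of the image region
    upper_left = (x0 - cx, y0 - cy)              # upper-left vertex coordinate information
    lower_right = (x1 - cx, y1 - cy)             # lower-right vertex coordinate information
    return upper_left, lower_right
```

For a region with box (10, 20, 110, 60), this yields (-50.0, -20.0) and (50.0, 20.0).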
Further, the region information includes size information; when acquiring the region information of each of the one or more text region images in its corresponding sample image, the processor 1002 is specifically configured to:
determining the image height and the image width of each text region image in each corresponding sample image;
determining the area of each text region image according to the image height and the image width;
and determining the image height, the image width and the area of each text region image as the size information of each text region image in the corresponding sample image.
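The size information follows directly from the region's corner coordinates; a minimal sketch, with pixel units assumed:

```python
def size_information(box):
    """Height, width and area of a text region image; corner coordinates in
    pixels are an assumed input format for this sketch."""
    x0, y0, x1, y1 = box
    width = abs(x1 - x0)     # image width
    height = abs(y1 - y0)    # image height
    area = width * height    # area determined from the height and width
    return {"height": height, "width": width, "area": area}
```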
Further, when inputting the text region images, the region information of the text region images in the corresponding sample images, and the text category labels of the text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model, the processor 1002 is specifically configured to:
extracting text image characteristic information from each text region image;
extracting region characteristic information from the region information of each text region image;
determining text category labels of the text region images according to the text category labels of each piece of text information corresponding to the sample images;
and inputting the extracted characteristic information of each text image, the characteristic information of each region and the text category labels of each text region image into a preset neural network model for training to obtain the text recognition model.
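As a hedged illustration of feeding both feature types into one network, the sketch below encodes the cropped text region image and its region information separately and fuses the two feature vectors by concatenation before classification. The backbone, feature sizes, and fusion-by-concatenation strategy are all assumptions; the application leaves the "preset neural network model" unspecified.

```python
import torch
import torch.nn as nn

class TextRegionClassifier(nn.Module):
    """Encodes a text region image and its region information separately,
    then fuses both feature vectors for text category prediction."""
    def __init__(self, num_categories: int, region_dim: int = 5, feat_dim: int = 128):
        super().__init__()
        # text image feature information (a tiny CNN stands in for the unspecified backbone)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim),
        )
        # region feature information, e.g. (x0, y0, x1, y1, area) gives region_dim = 5
        self.region_encoder = nn.Linear(region_dim, feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_categories)

    def forward(self, region_image: torch.Tensor, region_info: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(region_image)      # (batch, feat_dim)
        reg_feat = self.region_encoder(region_info)      # (batch, feat_dim)
        fused = torch.cat([img_feat, reg_feat], dim=-1)  # fusion by concatenation
        return self.classifier(fused)                    # category scores per text region
```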
Further, when inputting the extracted feature information of each text image, the feature information of each region, and the text category labels of each text region image into a preset neural network model for training to obtain the text recognition model, the processor 1002 is specifically configured to:
inputting the obtained characteristic information of each text image, the characteristic information of each region and the text category label of each text region image into the preset neural network model for training to obtain sample texts corresponding to each text region image and category information of each sample text;
determining a loss function value according to the sample texts, the category information of the sample texts and the text category labels of the text region images, and adjusting model parameters of the preset neural network model according to the loss function value;
and inputting the characteristic information of each text image, the characteristic information of each region and the text category label of each text region image into the neural network model with adjusted model parameters for retraining, and determining that the text recognition model is obtained when the loss function value obtained by retraining is smaller than a function threshold.
Further, when the processor 1002 determines the loss function value according to the respective sample text, the category information of the respective sample text, and the text category label of the respective text region image, it is specifically configured to:
comparing the category information of each sample text with the text category labels corresponding to each text region image;
determining the category similarity between the category of each sample text obtained by the neural network model and the corresponding text category label of each text region image according to the comparison result;
and determining the loss function value according to the category similarity.
According to the embodiment of the application, a sample image data set can be obtained, wherein the sample image data set comprises a plurality of sample images, and each sample image comprises text information and a text category label added for each text information; one or more text region images in each sample image are acquired, together with the region information of each text region image in its corresponding sample image, the region information comprising position information and/or size information; each text region image, each piece of region information and each text category label are input into a preset neural network model for training to obtain a text recognition model; and an image to be processed is acquired and input into the text recognition model to obtain the texts and text categories of the image to be processed. In this way, texts and text categories in an image can be effectively recognized, and the accuracy of text recognition is improved.
The embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described in the corresponding embodiments of the present application and may also implement the apparatus of the corresponding embodiments, which is not described herein again.
The computer readable storage medium may be an internal storage unit of the device according to any of the foregoing embodiments, for example, a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the device. Further, the computer readable storage medium may also include both internal storage units and external storage devices of the device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the steps performed in the method embodiments provided in the various implementations described above.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure presents only a few examples of the present application and is not intended to limit the scope of the claims; those skilled in the art will understand that all or part of the above-described embodiments may be implemented with equivalents substituted for elements thereof, and such variations remain within the scope of the present application.

Claims (11)

1. A method of text recognition, comprising:
acquiring a sample image data set, wherein the sample image data set comprises a plurality of sample images, and each sample image comprises text information and text category labels added for each text information;
acquiring one or more text region images in each sample image, and acquiring region information of each text region image in each corresponding sample image in the one or more text region images, wherein the region information comprises position information and/or size information;
inputting the text region images, the region information of the text region images in the corresponding sample images and the text category labels of the text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model;
and acquiring an image to be processed, and inputting the image to be processed into the text recognition model to obtain text information of the image to be processed, wherein the text information comprises texts and text categories.
2. The method of claim 1, wherein the region information comprises position information, and acquiring the region information of each of the one or more text region images in its corresponding sample image comprises:
acquiring coordinate information of each text region image in each corresponding sample image;
and determining the position information of each text region image in each corresponding sample image according to the coordinate information of each text region image in each corresponding sample image.
3. The method of claim 2, wherein the coordinate information comprises vertex coordinate information of each text region image in the corresponding sample image, the vertex coordinate information is determined according to a coordinate system established with the geometric center of the image region of each text region image as the origin, and the vertex coordinate information comprises upper-left vertex coordinate information and lower-right vertex coordinate information of the image region of each text region image in the coordinate system.
4. The method of claim 1, wherein the region information comprises size information, and acquiring the region information of each of the one or more text region images in its corresponding sample image comprises:
determining the image height and the image width of each text region image in each corresponding sample image;
determining the area of each text region image according to the image height and the image width;
and determining the image height, the image width and the area of each text region image as the size information of each text region image in the corresponding sample image.
5. The method according to claim 1, wherein inputting the text region images, the region information of the text region images in the corresponding sample images, and the text category labels of the text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model comprises:
extracting text image characteristic information from each text region image;
extracting region characteristic information from the region information of each text region image;
determining text category labels of the text region images according to the text category labels of each piece of text information corresponding to the sample images;
and inputting the extracted characteristic information of each text image, the characteristic information of each region and the text category labels of each text region image into a preset neural network model for training to obtain the text recognition model.
6. The method according to claim 5, wherein inputting the extracted feature information of each text image, the extracted feature information of each region, and the text category labels of each text region image into a preset neural network model for training, to obtain the text recognition model, includes:
inputting the obtained characteristic information of each text image, the characteristic information of each region and the text category label of each text region image into the preset neural network model for training to obtain sample texts corresponding to each text region image and category information of each sample text;
determining a loss function value according to the sample texts, the category information of the sample texts and the text category labels of the text region images, and adjusting model parameters of the preset neural network model according to the loss function value;
and inputting the characteristic information of each text image, the characteristic information of each region and the text category label of each text region image into the neural network model with adjusted model parameters for retraining, and determining that the text recognition model is obtained when the loss function value obtained by retraining is smaller than a function threshold.
7. The method of claim 6, wherein said determining a loss function value from said respective sample text, category information of said respective sample text, and text category labels of said respective text region image comprises:
comparing the category information of each sample text with the text category labels corresponding to each text region image;
determining the category similarity between the category of each sample text obtained by the neural network model and the corresponding text category label of each text region image according to the comparison result;
and determining the loss function value according to the category similarity.
8. A text recognition device, comprising:
a first acquisition unit configured to acquire a sample image dataset including a plurality of sample images, each sample image including text information and a text category label added for each text information;
a second obtaining unit, configured to obtain one or more text region images in each sample image, and obtain region information of each text region image in each corresponding sample image in the one or more text region images, where the region information includes position information and/or size information;
the training unit is used for inputting the text region images, the region information of the text region images in the corresponding sample images and the text category labels of the text information corresponding to the sample images into a preset neural network model for training to obtain a text recognition model;
the recognition unit is used for acquiring an image to be processed, inputting the image to be processed into the text recognition model, and obtaining text information of the image to be processed, wherein the text information comprises texts and text categories.
9. A computer device comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is adapted to store a computer program, the processor being configured to invoke the computer program to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein program instructions which, when executed, implement the method according to any of claims 1-7.
11. A computer program product, characterized in that it comprises program instructions which, when executed by a processor, implement the method of any of claims 1-7.