CN115035530A - Image processing method, image text obtaining method, device and electronic equipment - Google Patents

Image processing method, image text obtaining method, device and electronic equipment

Info

Publication number
CN115035530A
Authority
CN
China
Prior art keywords
image
characters
target image
attribute
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210422501.3A
Other languages
Chinese (zh)
Inventor
马傲
王莽
赵永飞
王章成
唐铭谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210422501.3A priority Critical patent/CN115035530A/en
Publication of CN115035530A publication Critical patent/CN115035530A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The application provides an image processing method, an image text obtaining method, and the devices, electronic equipment and computer storage media corresponding to them. According to the image processing method, the target image is processed to extract character areas, character recognition is performed on the character areas to obtain a character recognition result, and the attribute type characters and attribute result characters in that result are then distinguished and matched with each other. The method can clearly reflect the logical relation between the texts appearing in the image, so that the logical relation of the text recognition result in the image is clearly perceived. With this image processing method, the corresponding relation between the attribute type characters and the attribute result characters in a movie or television video can be determined quickly and accurately, which makes the method particularly suitable for processing cast-and-crew credit text.

Description

Image processing method, image text obtaining method, device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method, an image text obtaining method, and an apparatus, an electronic device, and a computer storage medium corresponding to the image processing method and the image text obtaining method.
Background
Recognizing texts in images has become an important processing direction in the field of image processing and has been applied in daily life; for example, recognizing license plate numbers in the traffic field and recognizing lines or title text in movie and television videos both involve recognizing texts in images.
Most existing methods for recognizing texts in images adopt optical character recognition. These methods are mostly suitable for simple text recognition, but they cannot perceive the logical relationship between the characters in an image, and the recognized result is usually a meaningless pile of characters. For some special text recognition occasions, such a result greatly reduces the value of text recognition. For example, for the cast-and-crew credits shown at the beginning or end of a film or television work, if only ordinary character recognition is performed, it is usually difficult to understand the exact meaning the credits express.
Disclosure of Invention
The application provides an image processing method, which aims to solve the technical problem of how to process images containing numerous and complex types of text, so that a user can more accurately recognize the texts in the images. The application also provides an image processing device, an electronic device and a computer storage medium.
The application provides an image processing method, comprising the following steps:
acquiring a target image to be processed;
processing the target image and extracting a character area;
performing character recognition on the character area to obtain a character recognition result;
distinguishing attribute type characters and attribute result characters in the character recognition result aiming at the character recognition result;
and corresponding the attribute type characters to the attribute result characters to obtain the corresponding relation between the attribute type characters and the attribute result characters.
Optionally, the processing the target image and extracting a text region includes:
based on the probability of whether each pixel point in the target image belongs to a character region, carrying out binarization processing on the pixel points in the target image to obtain a binary image for representing whether each pixel point in the target image belongs to the character region;
and extracting a character area in the target image based on the binary image and the target image.
Optionally, the method further includes: based on the target image, obtaining a feature map used for representing the pixel features of the target image, and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character region; the characteristic graph is an image obtained after convolution operation is carried out on the target image, and the characteristic graph is used for describing the characteristics of pixel points in the target image;
the binarizing processing is performed on the pixel points in the target image based on the probability of whether each pixel point in the target image belongs to a character region, so as to obtain a binary image for representing whether each pixel point in the target image belongs to a character region, and the binarizing processing includes:
based on the probability graph and the feature graph, obtaining the probability that each pixel point in the target image belongs to a character area;
and carrying out binarization processing on each pixel point according to a preset probability threshold value and the obtained probability that each pixel point belongs to the character area to obtain a binary image for representing whether each pixel point in the target image belongs to the character area.
Optionally, the performing binarization processing on each pixel point according to a predetermined probability threshold and the obtained probability that each pixel point belongs to a text region to obtain a binary image used for indicating whether each pixel point in the target image belongs to the text region includes:
according to the obtained probability that each pixel point in the target image belongs to the text region, comparing the probability that each pixel point belongs to the text region with a preset probability threshold value, and judging whether the probability that each pixel point belongs to the text region is larger than the probability threshold value;
if yes, setting the gray value corresponding to the pixel point in the target image to be 255; if not, setting the gray value corresponding to the pixel point in the target image as 0;
according to the mode, the gray value of each pixel point in the target image is reset, and a binary image used for representing whether each pixel point in the target image belongs to a character area or not is obtained.
Optionally, the obtaining, based on the target image, a feature map used for representing pixel features of the target image, and obtaining a probability map used for representing probabilities that respective positions in the feature map belong to a text region includes:
taking the target image as input data of a first convolution neural network model, obtaining a feature map for describing pixel features in the target image, and obtaining a probability map for representing the probability that each position in the feature map belongs to a character region; the first convolution neural network model is used for obtaining a feature map used for representing pixel features in the target image according to the target image and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character area.
Optionally, the first convolutional neural network model is obtained by adopting the following training mode:
obtaining an image sample, a feature map sample for representing pixel features in the image sample, and a probability map sample for representing the probability that each position in the feature map sample belongs to a character region;
providing the image sample to an initial convolutional neural network model, wherein the initial convolutional neural network model generates an estimated characteristic map sample and an estimated probability map sample corresponding to the image sample;
comparing the characteristic diagram sample with the estimated characteristic diagram sample, comparing the probability diagram sample with the estimated probability diagram sample, and performing parameter adjustment on the initial convolutional neural network model according to a comparison result until a difference value of the comparison result is within a preset threshold range;
and taking the initial convolutional neural network model adjusted by the parameters as the first convolutional neural network model.
Optionally, the distinguishing, for the text recognition result, the attribute type text and the attribute result text in the text recognition result includes:
and taking the character recognition result as input data of a second machine learning model to obtain attribute type characters and attribute result characters in the character recognition result, wherein the second machine learning model is used for obtaining the attribute type characters and the attribute result characters in the character recognition result according to the character recognition result.
Optionally, the corresponding the attribute type text to the attribute result text to obtain a corresponding relationship between the attribute type text and the attribute result text includes:
and taking the attribute type characters and the attribute result characters as input data of a third machine learning model to obtain the corresponding relation between the attribute type characters and the attribute result characters, wherein the third machine learning model is used for obtaining the corresponding relation between the attribute type characters and the attribute result characters according to the attribute type characters and the attribute result characters.
Optionally, the acquiring a target image to be processed includes: acquiring a video containing a target image to be processed; and extracting a video frame containing the target image from the video.
Optionally, the method further includes: if a plurality of attribute result characters correspond to the same attribute category character, a fifth machine learning model is further included, and the fifth machine learning model is used for distinguishing the plurality of attribute result characters corresponding to the same attribute category character.
Optionally, the performing character recognition on the character region to obtain a character recognition result includes:
obtaining a feature vector for vector representation of the features of the character region;
coding the feature vector, and extracting a global feature vector; the global feature vector is a vector obtained by removing spatial information from the feature vector;
and decoding the global feature vector to obtain a character recognition result.
Optionally, the obtaining a feature vector for performing vector representation on features of the text region includes:
and taking the character region as input data of a fourth machine learning model to obtain a feature vector for performing vector representation on the features of the character region, wherein the fourth machine learning model is a model for obtaining the feature vector for performing vector representation on the features of the character region.
Optionally, the encoding the feature vector to extract a global feature vector includes:
and taking the feature vector as input data of a time cycle neural network encoder, and extracting a global feature vector.
Optionally, the decoding the global feature vector to obtain a character recognition result includes:
and taking the global feature vector as input data of a time cycle neural network decoder to obtain a character recognition result.
The application provides a movie and television play video processing method, which comprises the following steps:
acquiring a movie and television play video to be processed;
acquiring a target video frame containing attribute type characters and attribute result characters in the movie and television video;
processing the target video frame and extracting a character area;
performing character recognition on the character area to obtain a character recognition result;
distinguishing attribute type characters and attribute result characters in the character recognition result aiming at the character recognition result;
and corresponding the attribute type characters to the attribute result characters to obtain the corresponding relation between the attribute type characters and the attribute result characters.
The application provides an image processing apparatus, including:
a target image acquisition unit for acquiring a target image to be processed;
the extraction unit is used for processing the target image and extracting a character area;
the recognition unit is used for carrying out character recognition on the character area to obtain a character recognition result;
the distinguishing unit is used for distinguishing attribute type characters and attribute result characters in the character recognition result aiming at the character recognition result;
and the corresponding unit is used for corresponding the attribute category characters with the attribute result characters to obtain the corresponding relation between the attribute category characters and the attribute result characters.
The application provides an electronic device, including:
a processor;
a memory for storing a computer program to be executed by the processor for performing the above-mentioned image processing method.
The present application provides a computer storage medium storing a computer program that is executed by a processor to execute the above-described image processing method.
Compared with the prior art, the embodiment of the application has the following advantages:
the application provides an image processing method, comprising the following steps: acquiring a target image to be processed; processing the target image and extracting a character area; performing character recognition on the character area to obtain a character recognition result; distinguishing attribute type characters and attribute result characters in the character recognition result aiming at the character recognition result; and corresponding the attribute type characters to the attribute result characters to obtain the corresponding relation between the attribute type characters and the attribute result characters. According to the image processing method, the target image is processed, the character area is extracted, character recognition is carried out on the basis of the character area, the character recognition result is obtained, after the character recognition result is obtained, the attribute type characters and the attribute result characters in the character recognition result are distinguished, the attribute type characters and the attribute result characters are corresponded, the corresponding relation between the attribute type characters and the attribute result characters is obtained, actually, the recognized characters in the target image are classified according to attributes and are corresponded, and further, the logical relation between the texts appearing in the image can be clearly reflected on the basis of the corresponding relation between the attribute type characters and the attribute result characters, so that the logical relation of the text recognition result in the image is clearly perceived. In addition, by using the image processing method, when a large amount of videos are processed in the field of media asset processing, the calculation amount is reduced, and meanwhile, the corresponding relation between the attribute category characters and the attribute result characters in the videos can be rapidly and accurately determined from the large amount of videos; as a typical application scene, the image processing method is particularly suitable for processing the typical role characters in the film and television videos.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art from these drawings.
Fig. 1 is a first schematic diagram of a scene of an image processing method provided in the present application.
Fig. 2 is a second schematic diagram of a scene of an image processing method provided in the present application.
Fig. 3 is a flowchart of an image processing method according to a first embodiment of the present application.
Fig. 4 is a flowchart of an image text obtaining method according to a second embodiment of the present application.
Fig. 5 is a schematic diagram of an image processing apparatus according to a third embodiment of the present application.
Fig. 6 is a schematic diagram of an image text obtaining apparatus according to a fourth embodiment of the present application.
Fig. 7 is a schematic diagram of an electronic device provided in a fifth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways different from those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; therefore, the application is not limited to the specific implementations disclosed below.
The application provides an image processing method, an image text obtaining method, an image processing device, an image text obtaining device, an electronic device and a computer storage medium. An image processing method, an image text obtaining method, an image processing apparatus, an image text obtaining apparatus, an electronic device, and a computer storage medium are respectively described below by specific embodiments.
The image processing method can be applied to scenes in which text recognition is performed on images, and such scenes arise in many fields. For example, in the traffic field, by processing a video or image of a motor vehicle and recognizing the text in it, the license plate number of the vehicle can be recognized and then used to record vehicles that violate regulations. For another example, in the movie and television field, by processing a video frame or image and recognizing the lines or the text of the relevant credits, it becomes easier to organize the lines or credits of a movie or television work. For another example, in the field of media asset processing, where a large number of videos need to be processed and the required amount of calculation is large, the image processing method can quickly and accurately obtain the corresponding relation between the attribute type characters and the attribute result characters in the character recognition results of the videos. Specifically, taking the application of the image processing method to movie and television videos as an example: after a movie or television video is acquired, it can be detected which video frames contain attribute type characters and attribute result characters; after such a video frame is detected, it is taken as a target video frame and the character areas are extracted from it; character recognition is performed on the character areas to obtain a character recognition result; finally, the attribute type characters and the attribute result characters in the character recognition result are distinguished and matched with each other to obtain the corresponding relation between them.
Of course, the image processing and image text recognition scenes listed above are only for facilitating understanding of specific applications of the image processing method, and the image processing method of the present application may also be used in other scenes.
In the present application, the image processing method of the present application is described in detail by taking the processing of a movie video frame or an image as an example.
The opening or closing credits of an existing movie or television work usually contain text information about the staff who participated in the production, for example information about the photographer, the make-up staff, the field staff and the cast. The texts corresponding to this information may appear in the same video frame or the same image. In order to recognize the characters of the relevant staff information in the same video frame or image, the image processing method of the application is adopted to process the video frames of the opening or closing credits of the movie or television work.
In a video frame, the information about the people involved in a movie or television work generally consists of two parts: one part is the position or title information, and the other part is the name information. In a video frame, the title information and the name information correspond to each other. Specifically, please refer to fig. 1, which is a first schematic diagram of a scene of the image processing method provided in the present application. Fig. 1 includes a video frame to be processed, namely image A, and it can be seen from fig. 1 that the text content of image A is: "Photography Zhang San", "Field affairs Zhao Si", "Executive production Wang Wu, Li Liu". From this text the following can be known: the photography is by Zhang San; the field affairs are handled by Zhao Si; the executive production is carried out by two people, Wang Wu and Li Liu. The position information in image A includes: photography, field affairs and executive production; the name information includes Zhang San, Zhao Si, Wang Wu and Li Liu. The texts in such video frame images are of various types and are complex: they contain not only position or title information but also name information, and the number of people corresponding to different positions differs.
If an existing image text recognition method (for example, optical character recognition) is used to recognize the text in image A, the recognized characters are relatively disordered, so the names of the staff who participated in the production do not correspond to their position information.
The image processing method of the application can match the characters of the positions with the characters of the names. Specifically, the process of matching the position text with the name text can be explained with continued reference to fig. 1; in this scene, recognizing the text in the movie video frame (image A) is taken as an example.
In this scenario, the image processing method is executed at a server, for example. The server is a computing device that provides services such as data processing and storage for a client; in general it may be a single server or a server cluster. In the application, the movie video frames are processed by the server, and the processing result is provided to the client, so that the user can perform other subsequent processing (e.g., archiving) on the result obtained at the client. Of course, the image processing method may also be executed at the client; specifically, a program or software implementing the image processing method provided by the present application may be configured in advance in the electronic device corresponding to the client, or a module implementing the method may be configured in advance in a target application installed on it. Such electronic devices are typically smart phones or various types of computers, including tablet computers. The target application is generally an APP (Application program) or a computer application.
Specifically, referring to fig. 1, the server first obtains a movie video including characters, and after obtaining the movie video, extracts a video frame including a target image from the video, and uses the video frame to be processed as the target image. How to process the image a will be described in detail by taking the video frame (image a) as an example of the target image.
In the present application, the image processing method actually processes the image a based on the following idea. In order to enable a user to clearly perceive a text recognition result of an image, the image processing method extracts a character area in the image A after the image A is acquired; and then, carrying out character recognition on the character area to obtain a character recognition result. Then, according to the character recognition result, distinguishing position information and name information in the character recognition result; and then, corresponding the position information with the name information to obtain the corresponding relation between the position information and the name information. Therefore, the characters of the position and the characters of the name of the person are corresponding, and the character recognition result obtained by the subsequent user is actually the information after the characters of the position and the characters of the name of the person are corresponding, so that the user can clearly perceive the text recognition result of the image.
The above-mentioned extracting the text region in the image a after acquiring the image a may refer to: performing binarization processing on the pixel points in the image A based on the probability of whether each pixel point in the image A belongs to a character region or not to obtain a binary image for representing whether each pixel point in the image A belongs to the character region or not; then, based on the binary image and the image A, a character area in the image A is determined, and the character area in the image A is extracted.
Before performing binarization processing on the pixel points in the image A based on the probability of whether each pixel point in the image A belongs to a character region, a feature map for representing the pixel features of the image A is obtained based on the image A, and a probability map for representing the probability that each position in the feature map belongs to the character region is obtained.
In the present application, the feature map refers to a vector representation obtained by performing convolution operations on image A; it encodes the visual features of image A so that they can be used in subsequent calculations. In practice, the feature map can be understood as a reduced representation of the original image (image A). In general, an original image consists of the three RGB color channels, and obtains its various colors through the variation and superposition of these three channels. The feature map is the feature tensor obtained by convolving the original image: relative to the original image, its number of channels increases while its spatial size decreases. In other words, the feature map is a more "refined" representation of the original image. The feature map extracted through the convolution operations helps obtain the probability that each pixel in the target image A belongs to a character area, and therefore helps find the position of the text boxes.
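To make the shape relationship between the original image and the feature map concrete, the following sketch (an illustration only; the application does not specify any particular backbone, layer sizes or framework) shows a small convolutional stack turning a 3-channel RGB image into a feature map with more channels and a smaller spatial size:

```python
# Illustrative sketch only: the application does not fix a backbone architecture.
# It shows how convolution turns a 3-channel RGB image into a smaller,
# higher-channel feature map, as described in the text above.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # halve H and W, 3 -> 64 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),   # halve again, 64 -> 256 channels
    nn.ReLU(inplace=True),
)

image_a = torch.randn(1, 3, 720, 1280)        # a stand-in for the RGB target image A
feature_map = backbone(image_a)
print(feature_map.shape)                       # torch.Size([1, 256, 180, 320])
```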
Specifically, based on the image a, obtaining a feature map for representing the pixel features of the image a, and obtaining a probability map for representing the probability that each position in the feature map belongs to a text region may refer to: the image A is used as input data of a first convolution neural network model, a feature map used for representing the pixel feature of the image A is obtained, and a probability map used for representing the probability that each position in the feature map belongs to a character area is obtained. In the application, the first convolution neural network model is used for obtaining a feature map used for representing pixel features in a target image according to the target image and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character region.
After obtaining the feature map corresponding to the image a and the probability map, based on the probability of whether each pixel in the image a belongs to the text region, the binarization processing is performed on the pixels in the image a to obtain a binary map for indicating whether each pixel in the image a belongs to the text region, which may be: based on the probability graph and the characteristic graph, obtaining the probability that each pixel point in the image A belongs to the character area; then, according to a predetermined probability threshold value and the obtained probability that each pixel belongs to the character region, binarization processing is performed on each pixel to obtain a binary image for representing whether each pixel in the image A belongs to the character region.
The above-mentioned performing binarization processing on each pixel point according to the predetermined probability threshold and the obtained probability that each pixel point belongs to the text region to obtain a binary image for indicating whether each pixel point in the image a belongs to the text region may refer to: according to the obtained probability that each pixel point in the image A belongs to the character region, comparing the probability that each pixel point belongs to the character region with a preset probability threshold value, and judging whether the probability that each pixel point belongs to the character region is larger than the probability threshold value; if yes, setting the gray value of the pixel point in the image A to be 255; if not, setting the gray value of the pixel point in the image A to be 0; according to the mode, the gray value of each pixel point in the image A is reset, and a binary image used for indicating whether each pixel point in the image A belongs to a character area or not is obtained.
The binary image actually visually distinguishes whether each pixel point in the image A is in the region of the character or not in an image mode, so that the region of the character in the image A can be quickly determined by subsequently combining the binary image with the image A.
In the binarization process, the gray value of each pixel in image A whose probability is greater than the probability threshold is set to 255, and the gray value of each pixel whose probability is not greater than the threshold is set to 0. In effect, all pixels in image A are divided into two classes according to whether their probability of belonging to a character area is greater than the threshold, and the gray values of the two classes are set accordingly. Of course, setting the gray values to 255 and 0 is merely an example, and they may be set to any two different values.
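A minimal sketch of this thresholding step is given below, assuming the per-pixel probabilities have already been obtained from the probability map and the feature map; the threshold value 0.3 is only an illustrative choice, not a value fixed by the application:

```python
import numpy as np

def binarize(prob_per_pixel: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Turn a per-pixel text-region probability map into a 255/0 binary image.

    prob_per_pixel: array of shape (H, W) with values in [0, 1], giving for
    every pixel of image A the probability that it belongs to a character area.
    threshold: the predetermined probability threshold (0.3 is an assumed,
    purely illustrative value).
    """
    binary = np.where(prob_per_pixel > threshold, 255, 0).astype(np.uint8)
    return binary

# Example: a toy 2x3 probability map.
probs = np.array([[0.9, 0.1, 0.8],
                  [0.2, 0.7, 0.05]])
print(binarize(probs))
# [[255   0 255]
#  [  0 255   0]]
```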
Referring to fig. 1, after the areas where the characters in image A are located have been determined, those areas are extracted. As shown in fig. 1, the extracted areas where the characters in image A are located comprise 7 areas, namely: the area corresponding to "Photography", the area corresponding to "Zhang San", the area corresponding to "Field affairs", the area corresponding to "Zhao Si", the area corresponding to "Executive production", the area corresponding to "Wang Wu" and the area corresponding to "Li Liu".
Then, after the areas where the characters are located in image A have been extracted, character recognition is performed on each of these areas (any one of the seven areas described above) to obtain the character recognition result in each area. Taking the area corresponding to "Zhang San" as an example, after character recognition is carried out, the obtained character recognition result is the characters "Zhang San".
Specifically, performing character recognition on the region where each character is located in the image a to obtain the character recognition result in each character region, which may be: obtaining a feature vector for vector representation of the features of the character region; coding the feature vector, and extracting a global feature vector; the global feature vector is a vector obtained by removing spatial information from the feature vector; and decoding the global feature vector to obtain a character recognition result.
The above-mentioned obtaining of the feature vector for vector-representing the features of the text region (actually, each text region) may refer to: the character region is used as input data of a fourth machine learning model to obtain a feature vector for vector representation of the feature of the character region, and the fourth machine learning model is a model for obtaining a feature vector for vector representation of the feature of the character region. In the present application, the region where the character is located (character region) is actually a part of the image a, and is also an image in nature.
Specifically, the encoding processing on the feature vector to extract the global feature vector may refer to: the feature vectors are used as input data of a time-cycle neural network Encoder (LSTM Encoder) to extract global feature vectors. Similarly, decoding the global feature to obtain a character recognition result in each region may refer to: and taking the global features as input data of a time-cycle neural network Decoder (LSTM Decoder) to obtain a character recognition result in each region.
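A highly simplified sketch of such an encode-decode step is shown below. It assumes the character area has already been turned into a sequence of feature vectors (for example, one vector per horizontal slice of the cropped region), uses the final hidden state of the LSTM Encoder as the global feature vector, and decodes greedily with an LSTM cell; the layer sizes, vocabulary size and greedy decoding are illustrative assumptions rather than details taken from the application:

```python
import torch
import torch.nn as nn

class SimpleRecognizer(nn.Module):
    """Toy LSTM encoder-decoder for one cropped character area (illustrative only)."""

    def __init__(self, feat_dim=256, hidden=256, num_chars=6000, max_len=25):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)   # LSTM Encoder
        self.decoder = nn.LSTMCell(hidden, hidden)                    # LSTM Decoder
        self.embed = nn.Embedding(num_chars, hidden)
        self.classifier = nn.Linear(hidden, num_chars)
        self.max_len = max_len

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        _, (h, c) = self.encoder(feats)             # h: (1, B, hidden)
        global_feat = h.squeeze(0)                  # the "global feature vector"
        hx, cx = global_feat, c.squeeze(0)
        token = torch.zeros(feats.size(0), dtype=torch.long)   # start token id 0 (assumed)
        outputs = []
        for _ in range(self.max_len):
            hx, cx = self.decoder(self.embed(token), (hx, cx))
            logits = self.classifier(hx)
            token = logits.argmax(dim=-1)           # greedy decoding
            outputs.append(token)
        return torch.stack(outputs, dim=1)          # (B, max_len) character ids

region_feats = torch.randn(1, 32, 256)              # 32 feature vectors for one area
print(SimpleRecognizer()(region_feats).shape)       # torch.Size([1, 25])
```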
The above-described text recognition of the region where each text is located, and obtaining the text recognition result in each region is merely an example, and other manners may also be adopted to perform text recognition of the region where each text is located, and obtain the text recognition result in each region.
After the character recognition results of all the areas are obtained, they are collected together; the collected character recognition results include "Photography, Field affairs, Zhang San, Executive production, Zhao Si, Li Liu, Wang Wu". Clearly these collected results are rather disordered, and the user cannot tell which positions correspond to which names. Therefore, the character recognition results are further classified, that is: it is distinguished which of the character recognition results are position information and which are name information.
Specifically, the above distinguishing process may refer to: and the collected character recognition results are used as input data of a second machine learning model to respectively obtain position information and name information in the character recognition results, and the second machine learning model is used for obtaining attribute type characters and attribute result characters in the character recognition results according to the character recognition results. Of course, in this scenario, the position information is merely an example of the attribute category text, and the attribute category text may also refer to a category text other than the position information. Similarly, the name information is also just one example of the attribute result information. The reason why the second machine learning model can obtain the attribute type characters and the attribute result characters in the character recognition result respectively based on the character recognition result is that the second machine learning model is a trained model, and the training process is introduced in the subsequent process. Similarly, the first convolutional neural network model, the third machine learning model, the fourth machine learning model, and the fifth machine learning model presented in this application are all trained neural network models or machine learning models.
Referring to fig. 1, in the process of separately obtaining the position information and the name information from the character recognition results, the obtained position information is: photography, field affairs and executive production; the obtained name information is: Zhang San, Zhao Si, Wang Wu and Li Liu.
After obtaining the position information and the name information, the position information and the name information need to be corresponded to obtain a corresponding relationship between the position information and the name information, specifically, the position information and the name information are corresponded to obtain a corresponding relationship between the position information and the name information, which may be: and using the obtained position information and the obtained name information as input data of a third machine learning model to obtain the corresponding relation between the position information and the name information, wherein the third machine learning model is used for obtaining the corresponding relation between the attribute type characters and the attribute result characters according to the attribute type characters and the attribute result characters.
After the server obtains the corresponding relation between the position information and the name information, it provides this corresponding relation to the client. Specifically, what the server provides to the client is a written display form that reflects the correspondence between the position information and the name information. In the application, this written display form may be, for example: "Photography: Zhang San; Field affairs: Zhao Si; Executive production: Wang Wu, Li Liu". Here, if one position corresponds to a plurality of names, the written display form must allow the individual names to be distinguished, for example by separating the names with a delimiter. In this embodiment, if a plurality of attribute result words correspond to the same attribute type word, the method further includes a fifth machine learning model for distinguishing the plurality of attribute result words corresponding to that attribute type word.
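As a small illustration of such a written display form (the function name and the use of a comma between multiple names are assumptions made for this example, not requirements of the application), the correspondence could be rendered like this:

```python
def format_credits(correspondence: dict) -> str:
    """Render position -> names correspondences as display text.

    Separating multiple names with ", " is an assumed choice for illustration;
    the application only requires that multiple names for one position be
    distinguishable.
    """
    lines = []
    for position, names in correspondence.items():
        lines.append(f"{position}: {', '.join(names)}")
    return "\n".join(lines)

print(format_credits({
    "Photography": ["Zhang San"],
    "Field affairs": ["Zhao Si"],
    "Executive production": ["Wang Wu", "Li Liu"],
}))
# Photography: Zhang San
# Field affairs: Zhao Si
# Executive production: Wang Wu, Li Liu
```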
To facilitate understanding of the data interaction process between the server and the client, please refer to fig. 2, which is a second schematic diagram of a scenario of the image processing method provided in the present application. Firstly, a client provides an image A to be processed to a server, and after the server obtains the image A, the server processes the image A and extracts a plurality of character areas in the image A; then, identifying the plurality of character areas to obtain character identification results corresponding to the plurality of character areas; then, based on the character recognition result, distinguishing position information and name information in the character recognition result; and finally, corresponding the position information and the name information to obtain the corresponding relation between the position information and the name information.
And after the server acquires the corresponding relation between the position information and the name information, acquiring a writing display mode capable of reflecting the corresponding relation between the position information and the name information, and providing the writing display mode for the client.
Figs. 1 and 2 introduced above illustrate an application scenario of the image processing method of the present application. The embodiments of the present application do not specifically limit the application scenario; the scenario described above is only one example, provided to facilitate understanding of the image processing method, and it is not intended to limit the method. Other application scenarios of the image processing method are not described in detail in the embodiments of the application.
First embodiment
A first embodiment of the present application provides an image processing method, which is described below with reference to fig. 3. It should be noted that the above scenario embodiment is a further example and a detailed description of the present embodiment, and please refer to the above scenario embodiment for some detailed descriptions of the present embodiment.
Please refer to fig. 3, which is a flowchart illustrating an image processing method according to a first embodiment of the present application.
The image processing method of the embodiment of the application comprises the following steps:
step S301: and acquiring a target image to be processed.
In this embodiment, since in the above-described scene embodiment, it has been described in detail that the target image processed by the image processing method may be a video frame, as a way of acquiring the target image to be processed, it may refer to: acquiring a video containing a target image to be processed; and extracting a video frame containing the target image from the video.
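For example, using OpenCV, the video frames can be sampled as candidate target images in roughly the following way (a sketch under the assumption that frames are sampled at a fixed interval; deciding which sampled frames actually contain attribute type characters and attribute result characters would be a separate detection step, and the file name is purely hypothetical):

```python
import cv2

def sample_frames(video_path: str, every_n: int = 25):
    """Yield every n-th frame of a video as a candidate target image.

    This sketch only handles frame extraction; screening the sampled frames
    for attribute type / attribute result characters is done elsewhere.
    """
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            yield index, frame           # frame is a BGR numpy array
        index += 1
    cap.release()

for frame_index, frame in sample_frames("movie_ending_credits.mp4"):
    print(frame_index, frame.shape)
```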
Because the entity executing the image processing method of the present application may be a server or a client, when the executing entity is the server, acquiring the target image to be processed may refer to: obtaining the target image from a client.
Step S302: and processing the target image and extracting a character area.
In this embodiment, as an embodiment of processing the target image and extracting the text region, the following may be mentioned: firstly, carrying out binarization processing on pixel points in a target image based on the probability of whether each pixel point in the target image belongs to a character region or not to obtain a binary image for representing whether each pixel point in the target image belongs to the character region or not; and then, extracting a character area in the target image based on the binary image and the target image.
Before performing binarization processing on pixel points in a target image based on the probability of whether each pixel point in the target image belongs to a text region, the probability of whether each pixel point in the target image belongs to the text region needs to be determined. In order to obtain the probability of whether each pixel point in the target image belongs to the text region, a feature map for representing the visual features of the target image can be obtained in advance based on the target image by utilizing the calculation of a convolutional neural network. Through the visual characteristic difference of the character area and the non-character area in the training process, the character area and the non-character area can be distinguished through the information in the characteristic diagram, and therefore a probability diagram for representing the probability that each position in the characteristic diagram belongs to the character area is obtained.
After the feature map and the probability map are obtained, the probability of whether each pixel point in the target image belongs to the character region can be obtained by combining the feature map and the probability map, and based on the probability, binarization processing can be performed on the pixel points in the target image to obtain a binary map for representing whether each pixel point in the target image belongs to the character region. Based on the probability, the binarization processing is performed on the pixel points in the target image to obtain a binary image used for representing whether each pixel point in the target image belongs to a character region, which may be: and carrying out binarization processing on each pixel point according to a preset probability threshold value and the obtained probability that each pixel point belongs to the character area to obtain a binary image for representing whether each pixel point in the target image belongs to the character area.
As an implementation manner of performing binarization processing on each pixel point according to a predetermined probability threshold and the obtained probability that each pixel point belongs to a text region, to obtain a binary map used for indicating whether each pixel point in a target image belongs to a text region, the implementation manner may be: firstly, according to the probability that each pixel point in the obtained target image belongs to the character region, comparing the probability that each pixel point belongs to the character region with a preset probability threshold value, and judging whether the probability that each pixel point belongs to the character region is greater than the probability threshold value; if yes, setting the gray value corresponding to the pixel point in the target image to be 255; if not, setting the gray value corresponding to the pixel point in the target image as 0; and then, resetting the gray value of each pixel point in the target image according to the mode to obtain a binary image for representing whether each pixel point in the target image belongs to the character region.
In this embodiment, the obtaining, from the target image, a feature map representing pixel features of the target image and a probability map representing probabilities that respective positions in the feature map belong to the text region may be: the method comprises the steps of taking a target image as input data of a first convolution neural network model, obtaining a feature map used for describing pixel features in the target image and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character area, wherein the first convolution neural network model is used for obtaining the feature map used for representing the pixel features in the target image according to the target image and obtaining the probability map used for representing the probability that each position in the feature map belongs to the character area.
The first convolutional neural network model is obtained by training an initial convolutional neural network model. The first convolution neural network model is obtained by adopting the following training mode:
first, an image sample, a feature map sample for representing the pixel feature in the image sample, and a probability map sample for representing the probability that each position in the feature map sample belongs to a character region are obtained.
And then, providing the image sample for an initial convolutional neural network model, and generating an estimated characteristic image sample and an estimated probability image sample corresponding to the image sample by the initial convolutional neural network model.
And then, comparing the characteristic diagram sample with the estimated characteristic diagram sample, comparing the probability diagram sample with the estimated probability diagram sample, and carrying out parameter adjustment on the initial convolutional neural network model according to the comparison result until the difference value of the comparison result is within a preset threshold range.
And finally, taking the initial convolutional neural network model subjected to the parameter adjustment as a first convolutional neural network model.
The training process of the initial convolutional neural network model is essentially: comparing the standard results corresponding to an image sample (i.e. the feature map sample and the probability map sample) with the outputs the initial convolutional neural network model produces for that image sample, and adjusting the parameters of the initial model according to this comparison until the difference between the standard results and the outputs satisfies a preset condition.
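A compressed sketch of such a training loop is given below. The two-headed toy network, the use of an L1 loss on both maps, and the stopping threshold are all illustrative assumptions; the application does not prescribe a specific architecture or loss function:

```python
import torch
import torch.nn as nn

class DetectorNet(nn.Module):
    """Toy stand-in for the initial convolutional neural network model."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.feat_head = nn.Conv2d(32, 32, 3, padding=1)                    # estimated feature map
        self.prob_head = nn.Sequential(nn.Conv2d(32, 1, 1), nn.Sigmoid())   # estimated probability map

    def forward(self, x):
        h = self.backbone(x)
        return self.feat_head(h), self.prob_head(h)

model = DetectorNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()
stop_threshold = 0.05                      # assumed "preset threshold range"

# The image sample, feature map sample and probability map sample would come
# from the annotated training set; random tensors stand in for them here.
image_sample = torch.randn(4, 3, 64, 64)
feature_map_sample = torch.randn(4, 32, 64, 64)
probability_map_sample = torch.rand(4, 1, 64, 64)

for step in range(1000):
    est_feat, est_prob = model(image_sample)
    loss = loss_fn(est_feat, feature_map_sample) + loss_fn(est_prob, probability_map_sample)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < stop_threshold:       # difference within the preset range
        break
```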
After the binary image is obtained, based on the binary image and the target image, the region where the characters in the target image are located can be determined; and extracting the character area in the area where the characters in the target image are located.
After obtaining the binary image, as a way of determining the region where the characters in the target image are located based on the binary image and the target image: firstly, determining pixel points belonging to a character region in a target image based on pixel points belonging to the character region in a binary image and a mapping relation between the binary image and the target image; and then, determining the region where the characters in the target image are located based on the pixel points belonging to the character region in the target image.
In this embodiment, since the obtained binary image has a mapping relationship with the target image, when determining the region where the text in the target image is located, the pixel points in the text region in the target image may be determined based on the pixel points in the binary image that belong to the text region and the mapping relationship between the binary image and the target image, and then the region where the text in the target image is located may be determined based on the pixel points in the text region in the target image.
After the areas where the characters in the target image are located have been determined, these areas, i.e. the character areas, can be extracted. Extracting a character area can be a matting (cropping) operation on the target image. From the preceding scene embodiment it can be seen that image A contains a plurality of character areas; each character area in image A can be extracted in the manner described above. It should be understood that in image A there is a certain distance between the different character areas, so when the character areas are extracted, the plurality of character areas can be extracted separately. Meanwhile, because the distance between characters within the same character area is smaller than the distance between different character areas, characters that belong to the same character area will not be split across two character areas; for example, the characters "Zhang" and "San" of "Zhang San" will not be assigned to two different character areas.
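One possible way to cut the individual character areas out of the target image, given the binary map, is a connected-component analysis, as sketched below with OpenCV. The light horizontal dilation that keeps the characters of one entry together while keeping separate areas apart is an assumption made for this illustration; the application does not spell out the grouping step:

```python
import cv2
import numpy as np

def crop_text_regions(binary_map: np.ndarray, target_image: np.ndarray, min_area: int = 50):
    """Cut each connected white region of the binary map out of the target image.

    Because characters of the same entry (e.g. "Zhang" and "San") lie much
    closer together than separate entries, a light horizontal dilation keeps
    them in one component while distinct character areas stay separate.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    merged = cv2.dilate(binary_map, kernel)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(merged, connectivity=8)
    regions = []
    for i in range(1, num):                # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            regions.append(target_image[y:y + h, x:x + w])
    return regions
```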
Step S303: and performing character recognition on the character area to obtain a character recognition result.
Specifically, the method for recognizing the characters in the character area to obtain the character recognition result comprises the following steps: firstly, obtaining a feature vector V-F for vector representation of the feature of a character area through convolution calculation; then, coding the feature vector, and extracting a global feature vector V-S; the global feature vector is a vector obtained by removing spatial information from the feature vector; and finally, decoding the global feature vector to obtain a character recognition result. The obtained character recognition result actually refers to a character recognition result corresponding to the character region.
In the above character recognition process, the spatial information in the image is of little significance for character recognition, while the part of the feature vector V-F corresponding to the spatial information is large. Therefore, in order to reduce the amount of data in the character recognition process and improve recognition efficiency, the spatial information is removed during recognition.
As one way to obtain a feature vector for vector representation of the features of a text region: the character region is used as input data of a fourth machine learning model to obtain a feature vector for vector representation of the feature of the character region, and the fourth machine learning model is a model for obtaining a feature vector for vector representation of the feature of the character region. The fourth machine learning model is also obtained by training based on the initial machine learning model and the training samples, and the specific process of training the initial machine learning model to obtain the fourth machine learning model is similar to the process of obtaining the first convolutional neural network model, and is not repeated here, specifically please refer to the obtaining process of the first convolutional neural network model.
In this embodiment, the encoding processing on the feature vector and the extraction of the global feature vector may refer to: taking the feature vector as input data of a recurrent neural network encoder and extracting the global feature vector.
Similarly, decoding the global feature vector to obtain a character recognition result may be: taking the global feature vector as input data of a recurrent neural network decoder to obtain the character recognition result in the character region.
That is, the encoder and decoder of the recurrent neural network can be used to encode and decode the feature vector so as to recognize the characters. Of course, it can be understood that the recurrent neural network is also a trained neural network model; for its training process, please refer to the obtaining process of the first convolutional neural network model.
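As a concrete illustration of the convolution, encoding and decoding steps described above, the following is a greatly simplified sketch in PyTorch; the layer sizes, the way the spatial (height) dimension is pooled away, and the non-autoregressive decoder are assumptions made for brevity rather than details specified by the application.

```python
import torch
import torch.nn as nn

class TextRecognizer(nn.Module):
    """Illustrative CNN + recurrent encoder-decoder text recognizer."""

    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        # Convolutional stage: extracts the features represented by the vector V-F.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Recurrent encoder over the width dimension: yields the global feature vector V-S.
        self.encoder = nn.GRU(128, hidden, batch_first=True, bidirectional=True)
        # Recurrent decoder head: emits one character distribution per step.
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(crop)       # (B, C, H, W)
        feat = feat.mean(dim=2)          # pool out the height axis: removes spatial information
        feat = feat.permute(0, 2, 1)     # (B, W, C): a sequence over image width
        encoded, _ = self.encoder(feat)  # global sequence features
        decoded, _ = self.decoder(encoded)
        return self.classifier(decoded)  # per-step character logits
```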
Step S304: and aiming at the character recognition result, distinguishing attribute type characters and attribute result characters in the character recognition result.
After the character recognition result in the character area is obtained, attribute type characters and attribute result characters in the character recognition result are distinguished. In practice, this process may refer to distinguishing attribute type words and attribute result words in corresponding word recognition results in a plurality of word regions.
The above-mentioned distinguishing the attribute type characters and the attribute result characters in the character recognition result with respect to the character recognition result may be: and taking the character recognition result as input data of a second machine learning model to obtain attribute type characters and attribute result characters in the character recognition result, wherein the second machine learning model is used for obtaining the attribute type characters and the attribute result characters in the character recognition result according to the character recognition result.
The second machine learning model is also obtained by training based on the initial machine learning model and the training samples, and the specific process of training the initial machine learning model to obtain the second machine learning model is similar to the process of obtaining the first convolutional neural network model, and is not repeated here, specifically please refer to the obtaining process of the first convolutional neural network model.
Step S305: and corresponding the attribute type characters to the attribute result characters to obtain the corresponding relation between the attribute type characters and the attribute result characters.
After the attribute type characters and the attribute result characters in the character recognition result are distinguished in step S304, the attribute type characters and the attribute result characters are associated with each other to obtain the association relationship between the attribute type characters and the attribute result characters. As one way of obtaining the correspondence between the attribute type characters and the attribute result characters by associating the attribute type characters with the attribute result characters, the following may be mentioned: and taking the attribute type characters and the attribute result characters as input data of a third machine learning model to obtain the corresponding relation between the attribute type characters and the attribute result characters, wherein the third machine learning model is used for obtaining the corresponding relation between the attribute type characters and the attribute result characters according to the attribute type characters and the attribute result characters.
The third machine learning model is also obtained by training based on the initial machine learning model and the training samples, and the specific process of training the initial machine learning model to obtain the third machine learning model is similar to the process of obtaining the first convolutional neural network model, and is not repeated here, specifically please refer to the obtaining process of the first convolutional neural network model.
In this embodiment, the position information of the different character regions in the target image is taken into account in the process of training the initial machine learning model to obtain the third machine learning model. Therefore, when associating the attribute type characters with the attribute result characters, the third machine learning model considers the distribution, in the target image, of the character region where the attribute type characters are located and of the character region where the attribute result characters are located.
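To make the role of the position information concrete, the following is a minimal heuristic sketch that pairs each attribute type region with the nearest attribute result region by box centers; it is only a stand-in for the learned third machine learning model, and the (text, x, y, w, h) box format is an assumed representation.

```python
def pair_attributes(type_boxes, result_boxes):
    """Pair each attribute type region with the nearest attribute result region.

    Boxes are (text, x, y, w, h) tuples (an assumed representation); the
    nearest-center rule is only a heuristic stand-in for the learned third
    machine learning model, which uses the regions' distribution in the image.
    """
    if not result_boxes:
        return []
    pairs = []
    for t_text, tx, ty, tw, th in type_boxes:
        t_cx, t_cy = tx + tw / 2.0, ty + th / 2.0
        # Pick the result region whose center is closest to this type region.
        best = min(
            result_boxes,
            key=lambda r: (r[1] + r[3] / 2.0 - t_cx) ** 2 + (r[2] + r[4] / 2.0 - t_cy) ** 2,
        )
        pairs.append((t_text, best[0]))
    return pairs

# Example: pair_attributes([("Photography", 10, 10, 80, 20)],
#                          [("Zhang San", 100, 10, 80, 20)])
# -> [("Photography", "Zhang San")]
```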
In this embodiment, if the execution subject of the method is the server, after the correspondence between the attribute type characters and the attribute result characters is obtained, the step of providing the correspondence to the client may be: providing, to the client, a writing display mode capable of reflecting the correspondence between the attribute type characters and the attribute result characters.
Specifically, the writing display mode includes: if the same attribute type characters correspond to a plurality of attribute result characters, a writing display mode capable of distinguishing the plurality of attribute result characters is provided, for example, "Photography: Zhang San; Set affairs: Zhao Si; Executive producer: Wang Wu, Li Liu", where a separator is placed between the plurality of names. In this embodiment, if the same attribute type characters correspond to a plurality of attribute result characters, the method of this embodiment further includes a fifth machine learning model for distinguishing the plurality of attribute result characters corresponding to the same attribute type characters. For example, when the attribute type characters "Executive producer" correspond to the recognition result "Wang Wu Li Liu", "Wang Wu Li Liu" is actually two person names; when "Wang Wu Li Liu" is input into the fifth machine learning model, it can be separated into the two person names "Wang Wu" and "Li Liu".
The fifth machine learning model is also obtained by training based on the initial machine learning model and the training samples, and a specific process of training the initial machine learning model to obtain the fifth machine learning model is similar to a process of obtaining the first convolutional neural network model, which is not repeated here, and please refer to the obtaining process of the first convolutional neural network model specifically.
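By way of illustration only, a trivial rule-based stand-in for the fifth machine learning model might split a run of person names at known surname characters; the surname list below is a placeholder assumption, and a real model would learn this segmentation from training samples.

```python
def split_result_names(result_text: str, known_surnames=("王", "李", "赵", "张")) -> list:
    """Split a run of person names such as "王五李六" (Wang Wu Li Liu) into
    ["王五", "李六"] by starting a new name at each known surname character.
    The surname list is a placeholder; this is only a stand-in for the
    learned fifth machine learning model."""
    names, current = [], ""
    for ch in result_text:
        if ch in known_surnames and current:
            names.append(current)  # a new surname starts a new name
            current = ch
        else:
            current += ch
    if current:
        names.append(current)
    return names

# Example: split_result_names("王五李六") -> ["王五", "李六"]
```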
If the execution subject of the method is a client, the client obtains the correspondence between the attribute type characters and the attribute result characters, and then obtains, according to this correspondence, a writing display mode capable of reflecting the correspondence between the attribute type characters and the attribute result characters.
In addition, in the present embodiment, the process of extracting the character region and recognizing the characters may also be performed by using an OCR (Optical Character Recognition) technique; it is not necessary to use a machine learning model or a neural network model. The use of machine learning models or neural network models in this embodiment is merely an example.
According to the image processing method described above, the target image is processed, the character region is extracted, and character recognition is performed on the basis of the character region to obtain a character recognition result. After the character recognition result is obtained, the attribute type characters and the attribute result characters in the character recognition result are distinguished, and the attribute type characters are associated with the attribute result characters to obtain the correspondence between them. In effect, the recognized characters in the target image are classified by attribute and associated with one another, so that the logical relation between the texts appearing in the image can be clearly reflected on the basis of the correspondence between the attribute type characters and the attribute result characters, and the logical relation of the text recognition result in the image can be clearly perceived.
Second embodiment
Corresponding to the image processing method provided in the first embodiment of the present application, a second embodiment of the present application provides an image text obtaining method. The execution subject of this embodiment is a client; it is mainly used, when the execution subject of the first embodiment is a server, to obtain the correspondence between the attribute type characters and the attribute result characters that the server provides to the client. The embodiments described below are merely illustrative.
Please refer to fig. 4, which is a flowchart illustrating an image text obtaining method according to a second embodiment of the present application.
The image text obtaining method comprises the following steps:
Step S401: and sending the target image to be processed to the server.
Step S402: and obtaining a writing display mode which is provided by the server and can reflect the corresponding relation between the attribute type characters and the attribute result characters.
In the present embodiment, the correspondence between the attribute type characters and the attribute result characters is obtained based on a character recognition result obtained by recognizing the character region in the target image.
According to the image text obtaining method, the target image to be processed is sent to the server; the server then processes the target image, extracts the character region, performs character recognition on the basis of the character region to obtain a character recognition result, distinguishes the attribute type characters and the attribute result characters in the character recognition result, and associates the attribute type characters with the attribute result characters to obtain the correspondence between them. In this embodiment, the client can obtain the writing display mode, provided by the server, that reflects the correspondence between the attribute type characters and the attribute result characters, so that the logical relation of the text recognition result in the image is clearly perceived.
Third embodiment
Corresponding to the image processing method provided in the first embodiment of the present application, a third embodiment of the present application also provides an image processing apparatus. Since the device embodiment is substantially similar to the first embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the first embodiment for relevant points. The device embodiments described below are merely illustrative.
Fig. 5 is a schematic view of an image processing apparatus according to the third embodiment of the present application.
The image processing apparatus 500 includes:
a target image obtaining unit 501, configured to obtain a target image to be processed;
an extracting unit 502, configured to process the target image and extract a text region;
an identifying unit 503, configured to perform character identification on the character region to obtain a character identification result;
a distinguishing unit 504, configured to distinguish, for a text recognition result, an attribute type text and an attribute result text in the text recognition result;
a corresponding unit 505, configured to correspond the attribute type text to the attribute result text, and obtain a corresponding relationship between the attribute type text and the attribute result text.
Optionally, the extracting unit is specifically configured to:
based on the probability of whether each pixel point in the target image belongs to a character region, carrying out binarization processing on the pixel points in the target image to obtain a binary image for representing whether each pixel point in the target image belongs to the character region;
and extracting a character area in the target image based on the binary image and the target image.
Optionally, the method further includes: the feature map and probability map obtaining unit is specifically configured to:
based on the target image, obtaining a feature map used for representing the pixel features of the target image, and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character region; the characteristic graph is an image obtained after convolution operation is carried out on the target image, and the characteristic graph is used for describing the characteristics of pixel points in the target image;
the extraction unit is specifically configured to:
based on the probability graph and the feature graph, obtaining the probability that each pixel point in the target image belongs to a character area;
and carrying out binarization processing on each pixel point according to a preset probability threshold value and the obtained probability that each pixel point belongs to the character area to obtain a binary image for representing whether each pixel point in the target image belongs to the character area.
Optionally, the extracting unit is specifically configured to:
according to the obtained probability that each pixel point in the target image belongs to the text region, comparing the probability that each pixel point belongs to the text region with a preset probability threshold, and judging whether the probability that each pixel point belongs to the text region is larger than the probability threshold;
if yes, setting the gray value corresponding to the pixel point in the target image to be 255; if not, setting the gray value corresponding to the pixel point in the target image as 0;
In the above manner, the gray value of each pixel point in the target image is reset, so that a binary image indicating whether each pixel point in the target image belongs to a character region is obtained; an illustrative sketch of this binarization is given below.
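A minimal sketch of the thresholding just described, assuming the probability map is a NumPy array of per-pixel probabilities aligned with the target image; the 0.5 value is only an example of the predetermined probability threshold.

```python
import numpy as np

def binarize_probability_map(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Set pixels whose text-region probability exceeds the threshold to 255
    and all other pixels to 0. The 0.5 threshold is illustrative; the
    application only requires a predetermined probability threshold."""
    return np.where(prob_map > threshold, 255, 0).astype(np.uint8)
```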
Optionally, the feature map and probability map obtaining unit is specifically configured to:
taking the target image as input data of a first convolution neural network model, obtaining a feature map for describing pixel features in the target image, and obtaining a probability map for representing the probability that each position in the feature map belongs to a character region; the first convolution neural network model is used for obtaining a feature map used for representing pixel features in the target image according to the target image and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character area.
Optionally, the apparatus further includes: a training unit, configured to obtain the first convolutional neural network model by using the following training method:
obtaining an image sample, a feature map sample for representing pixel features in the image sample, and a probability map sample for representing the probability that each position in the feature map sample belongs to a character region;
providing the image sample to an initial convolutional neural network model, wherein the initial convolutional neural network model generates an estimated characteristic map sample and an estimated probability map sample corresponding to the image sample;
comparing the characteristic diagram sample with the pre-estimated characteristic diagram sample, comparing the probability diagram sample with the pre-estimated probability diagram sample, and carrying out parameter adjustment on the initial convolutional neural network model according to a comparison result until a difference value of the comparison result is within a preset threshold range;
and taking the parameter-adjusted initial convolutional neural network model as the first convolutional neural network model (an illustrative training-step sketch follows).
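The following is a minimal sketch of one such parameter-adjustment step, assuming a PyTorch model that returns a predicted feature map and a predicted probability map (with values in [0, 1]); the choice of loss functions is an assumption, since the application only requires comparing the predictions with the samples and adjusting the parameters accordingly.

```python
import torch.nn.functional as F

def training_step(model, optimizer, image_sample, feature_map_sample, prob_map_sample):
    """One parameter-adjustment step for the initial convolutional neural
    network model: compare the predicted feature map and probability map
    with the samples and update the parameters. The loss terms are
    illustrative choices, not specified by the application."""
    optimizer.zero_grad()
    pred_feature_map, pred_prob_map = model(image_sample)  # assumed two-output model
    loss = (
        F.mse_loss(pred_feature_map, feature_map_sample)
        + F.binary_cross_entropy(pred_prob_map, prob_map_sample)
    )
    loss.backward()
    optimizer.step()
    return loss.item()
```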
Optionally, the distinguishing unit is specifically configured to:
and taking the character recognition result as input data of a second machine learning model to obtain attribute type characters and attribute result characters in the character recognition result, wherein the second machine learning model is used for obtaining the attribute type characters and the attribute result characters in the character recognition result according to the character recognition result.
Optionally, the corresponding unit is specifically configured to:
and taking the attribute type characters and the attribute result characters as input data of a third machine learning model to obtain the corresponding relation between the attribute type characters and the attribute result characters, wherein the third machine learning model is used for obtaining the corresponding relation between the attribute type characters and the attribute result characters according to the attribute type characters and the attribute result characters.
Optionally, the target image obtaining unit is specifically configured to: acquiring a video containing a target image to be processed; and extracting a video frame containing the target image from the video.
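As an illustration of obtaining the target image from a video, the following sketch samples candidate frames with OpenCV; the fixed sampling interval is an assumption, since the application does not specify how the video frame containing the target image is selected.

```python
import cv2

def extract_candidate_frames(video_path: str, every_n_frames: int = 30):
    """Sample frames from a video as candidate target images. Sampling every
    30th frame is an illustrative choice only."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```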
Optionally, the target image obtaining unit is specifically configured to: obtaining the target image to be processed from a client;
the device further comprises: a providing unit, configured to provide, to the client, the corresponding relationship between the attribute category text and the attribute result text after the step of obtaining the corresponding relationship between the attribute category text and the attribute result text, where the providing unit includes: and providing a writing display mode capable of reflecting the corresponding relation to the client according to the corresponding relation between the attribute type characters and the attribute result characters.
Optionally, the providing unit is specifically configured to: and if the same attribute type character corresponds to a plurality of attribute result characters, providing a writing display mode capable of distinguishing the attribute result characters.
Optionally, if there are multiple attribute result words corresponding to the same attribute type word, the apparatus further includes a fifth machine learning model, configured to distinguish the multiple attribute result words corresponding to the same attribute type word.
Optionally, the identification unit is specifically configured to:
obtaining a feature vector for vector representation of the features of the character region;
coding the feature vectors and extracting global feature vectors; the global feature vector is a vector obtained by removing spatial information from the feature vector;
and decoding the global feature vector to obtain a character recognition result.
Optionally, the identification unit is specifically configured to:
and taking the character region as input data of a fourth machine learning model to obtain a feature vector for performing vector representation on the features of the character region, wherein the fourth machine learning model is a model for obtaining the feature vector for performing vector representation on the features of the character region.
Optionally, the identification unit is specifically configured to:
and taking the feature vector as input data of a recurrent neural network encoder, and extracting a global feature vector.
Optionally, the identification unit is specifically configured to:
and taking the global feature vector as input data of a recurrent neural network decoder to obtain a character recognition result.
Fourth embodiment
Corresponding to the image text obtaining method provided in the second embodiment of the present application, a fourth embodiment of the present application further provides an image text obtaining apparatus. Since the apparatus embodiment is substantially similar to the second embodiment, it is described in a relatively simple manner, and reference may be made to the partial description of the second embodiment for relevant points. The device embodiments described below are merely illustrative.
Please refer to fig. 6, which is a schematic diagram of an image text obtaining apparatus according to a fourth embodiment of the present application.
The image text obtaining apparatus 600 includes:
a sending unit 601, configured to send a target image to be processed to a server;
a result obtaining unit 602, configured to obtain a writing display mode, provided by the server, capable of reflecting the correspondence between the attribute type characters and the attribute result characters; wherein the correspondence between the attribute type characters and the attribute result characters is obtained based on a character recognition result obtained by recognizing the character region in the target image.
Fifth embodiment
Corresponding to the methods of the first and second embodiments of the present application, a fifth embodiment of the present application further provides an electronic device.
As shown in fig. 7, fig. 7 is a schematic view of an electronic device provided in a fifth embodiment of the present application.
In this embodiment, an alternative hardware structure of the electronic device 700 may be as shown in fig. 7, including: at least one processor 701, at least one memory 702, and at least one communication bus 705; the memory 702 contains a program 703 and data 704.
The bus 705 may be a communication device that transfers data between components within the electronic device 700, such as an internal bus (e.g., a CPU-memory bus), an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port), and so forth.
In addition, the electronic device further includes: at least one network interface 706, and at least one peripheral interface 707. A network interface 706 to provide wired or wireless communication with external networks 708 (e.g., the internet, intranets, local area networks, mobile communication networks, etc.); in some embodiments, network interface 706 may include any number of Network Interface Controllers (NICs), Radio Frequency (RF) modules, repeaters, transceivers, modems, routers, gateways, any combination of wired network adapters, wireless network adapters, bluetooth adapters, infrared adapters, near field communication ("NFC") adapters, cellular network chips, and the like.
Peripheral interface 707 is used to connect peripherals, such as peripheral 1 (709 in FIG. 7), peripheral 2 (710 in FIG. 7), and peripheral 3 (711 in FIG. 7). Peripherals may include, but are not limited to, cursor control devices (e.g., a mouse, touchpad, or touch screen), keyboards, displays (e.g., cathode ray tube displays, liquid crystal displays, or light emitting diode displays), video input devices (e.g., a camera or an input interface communicatively coupled to a video archive), and the like.
The processor 701 may be a CPU, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 702 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory. The memory 702 stores the program (or instruction set) and data required to execute the image processing method or the image text obtaining method provided by the embodiments of the present application.
The processor 701 calls the program and data stored in the memory 702 to execute the image processing method or the image text obtaining method provided by the embodiment of the present application.
Sixth embodiment
A sixth embodiment of the present application, corresponding to the methods of the first and second embodiments of the present application, further provides a computer storage medium for storing a computer program and related data, wherein the computer program is used for executing the image processing method or the image text obtaining method of the first and/or second embodiment of the present application, and the computer program and the data can be read by a processor to execute the methods provided by the first and second embodiments.
Although the present application has been described with reference to the preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the claims that follow.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (17)

1. An image processing method, characterized by comprising:
acquiring a target image to be processed;
processing the target image and extracting a character area;
performing character recognition on the character area to obtain a character recognition result;
distinguishing attribute type characters and attribute result characters in the character recognition result aiming at the character recognition result;
and corresponding the attribute type characters to the attribute result characters to obtain the corresponding relation between the attribute type characters and the attribute result characters.
2. The image processing method of claim 1, wherein the processing the target image to extract a text region comprises:
based on the probability of whether each pixel point in the target image belongs to a character region, carrying out binarization processing on the pixel points in the target image to obtain a binary image for representing whether each pixel point in the target image belongs to the character region;
and extracting a character area in the target image based on the binary image and the target image.
3. The image processing method according to claim 2, further comprising: based on the target image, obtaining a feature map used for representing the pixel features of the target image, and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character region; the characteristic graph is an image obtained after convolution operation is carried out on the target image, and the characteristic graph is used for describing the characteristics of pixel points in the target image;
the method for performing binarization processing on the pixel points in the target image based on the probability of whether each pixel point in the target image belongs to the text region or not to obtain a binary image for representing whether each pixel point in the target image belongs to the text region or not comprises the following steps:
based on the probability graph and the feature graph, obtaining the probability that each pixel point in the target image belongs to a character area;
and carrying out binarization processing on each pixel point according to a preset probability threshold value and the obtained probability that each pixel point belongs to the character area to obtain a binary image for representing whether each pixel point in the target image belongs to the character area.
4. The image processing method according to claim 3, wherein the binarizing processing on each pixel point according to a predetermined probability threshold and the obtained probability that each pixel point belongs to a text region to obtain a binary map for indicating whether each pixel point in the target image belongs to a text region comprises:
according to the obtained probability that each pixel point in the target image belongs to the text region, comparing the probability that each pixel point belongs to the text region with a preset probability threshold value, and judging whether the probability that each pixel point belongs to the text region is larger than the probability threshold value;
if yes, setting the gray value corresponding to the pixel point in the target image to be 255; if not, setting the gray value corresponding to the pixel point in the target image as 0;
according to the mode, the gray value of each pixel point in the target image is reset, and a binary image used for representing whether each pixel point in the target image belongs to a character area or not is obtained.
5. The image processing method according to claim 3, wherein the obtaining a feature map representing pixel features of the target image based on the target image and obtaining a probability map representing probabilities that respective positions in the feature map belong to a text region comprises:
taking the target image as input data of a first convolutional neural network model, obtaining a feature map for describing pixel features in the target image, and obtaining a probability map for representing the probability that each position in the feature map belongs to a text region; the first convolution neural network model is used for obtaining a feature map used for representing pixel features in a target image according to the target image and obtaining a probability map used for representing the probability that each position in the feature map belongs to a character area.
6. The image processing method of claim 5, wherein the first convolutional neural network model is obtained by using the following training method:
obtaining an image sample, a feature map sample for representing pixel features in the image sample, and a probability map sample for representing the probability that each position in the feature map sample belongs to a character region;
providing the image sample to an initial convolutional neural network model, wherein the initial convolutional neural network model generates an estimated characteristic map sample and an estimated probability map sample corresponding to the image sample;
comparing the characteristic diagram sample with the pre-estimated characteristic diagram sample, comparing the probability diagram sample with the pre-estimated probability diagram sample, and carrying out parameter adjustment on the initial convolutional neural network model according to a comparison result until a difference value of the comparison result is within a preset threshold range;
and taking the initial convolutional neural network model adjusted by the parameters as the first convolutional neural network model.
7. The image processing method according to claim 1, wherein the distinguishing between the attribute type text and the attribute result text in the text recognition result comprises:
and taking the character recognition result as input data of a second machine learning model to obtain attribute type characters and attribute result characters in the character recognition result, wherein the second machine learning model is used for obtaining the attribute type characters and the attribute result characters in the character recognition result according to the character recognition result.
8. The image processing method according to claim 1, wherein said associating the attribute type text with the attribute result text to obtain a corresponding relationship between the attribute type text and the attribute result text comprises:
and taking the attribute type characters and the attribute result characters as input data of a third machine learning model to obtain the corresponding relation between the attribute type characters and the attribute result characters, wherein the third machine learning model is used for obtaining the corresponding relation between the attribute type characters and the attribute result characters according to the attribute type characters and the attribute result characters.
9. The image processing method according to claim 1, wherein the acquiring the target image to be processed comprises: acquiring a video containing a target image to be processed; and extracting a video frame containing the target image from the video.
10. The image processing method according to claim 1, further comprising: if the same attribute type characters correspond to a plurality of attribute result characters, distinguishing, by a fifth machine learning model, the plurality of attribute result characters corresponding to the same attribute type characters.
11. The image processing method of claim 1, wherein the performing character recognition on the character region to obtain a character recognition result comprises:
obtaining a feature vector for performing vector representation on the features of the character area;
coding the feature vector, and extracting a global feature vector; the global feature vector is a vector obtained by removing spatial information from the feature vector;
and decoding the global feature vector to obtain a character recognition result.
12. The method according to claim 11, wherein the obtaining a feature vector for vector representation of the feature of the text region comprises:
and taking the character region as input data of a fourth machine learning model to obtain a feature vector for performing vector representation on the features of the character region, wherein the fourth machine learning model is a model for obtaining the feature vector for performing vector representation on the features of the character region.
13. The image processing method according to claim 11, wherein said encoding the feature vector and extracting a global feature vector comprises:
and taking the feature vector as input data of a recurrent neural network encoder, and extracting a global feature vector.
14. The image processing method of claim 13, wherein the decoding the global feature vector to obtain a text recognition result comprises:
and taking the global feature vector as input data of a recurrent neural network decoder to obtain a character recognition result.
15. A method for processing a video of a movie and TV play is characterized by comprising the following steps:
acquiring a movie and television play video to be processed;
acquiring a target video frame containing attribute type characters and attribute result characters in the movie and television video;
processing the target video frame and extracting a character area;
performing character recognition on the character area to obtain a character recognition result;
distinguishing attribute type characters and attribute result characters in the character recognition result aiming at the character recognition result;
and corresponding the attribute type characters to the attribute result characters to obtain the corresponding relation between the attribute type characters and the attribute result characters.
16. An electronic device, comprising:
a processor;
a memory for storing a computer program for execution by the processor to perform the method of any one of claims 1 to 14.
17. A computer storage medium, characterized in that it stores a computer program that is executed by a processor to perform the method of any one of claims 1-14.
CN202210422501.3A 2022-04-21 2022-04-21 Image processing method, image text obtaining method, device and electronic equipment Pending CN115035530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422501.3A CN115035530A (en) 2022-04-21 2022-04-21 Image processing method, image text obtaining method, device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210422501.3A CN115035530A (en) 2022-04-21 2022-04-21 Image processing method, image text obtaining method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115035530A true CN115035530A (en) 2022-09-09

Family

ID=83118710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422501.3A Pending CN115035530A (en) 2022-04-21 2022-04-21 Image processing method, image text obtaining method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115035530A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168396A (en) * 2022-10-27 2023-05-26 深圳市超时代软件有限公司 Character recognition device and character recognition method

Similar Documents

Publication Publication Date Title
US11367282B2 (en) Subtitle extraction method and device, storage medium
US9602728B2 (en) Image capturing parameter adjustment in preview mode
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
US9589363B2 (en) Object tracking in encoded video streams
US9183429B2 (en) Method and apparatus for facial recognition
US10430694B2 (en) Fast and accurate skin detection using online discriminative modeling
US20130004076A1 (en) System and method for recognizing text information in object
CN111126108B (en) Training and image detection method and device for image detection model
US10062195B2 (en) Method and device for processing a picture
JP2010218551A (en) Face recognition method, computer readable medium, and image processor
US9082039B2 (en) Method and apparatus for recognizing a character based on a photographed image
WO2022227218A1 (en) Drug name recognition method and apparatus, and computer device and storage medium
CN113436222A (en) Image processing method, image processing apparatus, electronic device, and storage medium
US11709914B2 (en) Face recognition method, terminal device using the same, and computer readable storage medium
CN115035530A (en) Image processing method, image text obtaining method, device and electronic equipment
CN111291619A (en) Method, device and client for on-line recognition of characters in claim settlement document
CN112380940B (en) Processing method and device of high-altitude parabolic monitoring image, electronic equipment and storage medium
CN113887481A (en) Image processing method and device, electronic equipment and medium
WO2019205400A1 (en) Image rotation method and device, computer apparatus, and storage medium
CN114387315A (en) Image processing model training method, image processing device, image processing equipment and image processing medium
CN115019307A (en) Image processing method, image text obtaining method and device and electronic equipment
CN113117341B (en) Picture processing method and device, computer readable storage medium and electronic equipment
WO2023060207A1 (en) Detection and virtualization of handwritten objects
CN115439676A (en) Image clustering method and device and electronic equipment
CN118158465A (en) Method and system for improving screen picture definition of screen throwing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination