CN115019307A - Image processing method, image text obtaining method and device and electronic equipment - Google Patents


Info

Publication number
CN115019307A
Authority
CN
China
Prior art keywords
target image
image
probability
vertical
belongs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210422521.0A
Other languages
Chinese (zh)
Inventor
马傲
王莽
赵永飞
王章成
唐铭谦
Current Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210422521.0A
Publication of CN115019307A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/28 - Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V 30/10 - Character recognition
    • G06V 30/1444 - Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G06V 30/146 - Aligning or centring of the image pick-up or image-field
    • G06V 30/147 - Determination of region of interest
    • G06V 30/162 - Quantising the image signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image processing method, an image text obtaining method, and the corresponding apparatuses, electronic devices and computer storage media. The image processing method accurately determines the region where the vertical text in a target image is located, so that the vertical text can subsequently be recognized accurately from that region; the method thus provides a basis for accurately recognizing vertical text in images. In addition, when large volumes of video are processed in the media-asset field, the method reduces the amount of computation while quickly and accurately locating the vertical-text regions in the videos.

Description

Image processing method, image text obtaining method, device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular to an image processing method, an image text obtaining method, and the corresponding apparatuses, electronic devices and computer storage media.
Background
Recognizing text in images has become an important direction in the field of image processing and is already applied in daily life: recognizing license plate numbers in the traffic field and recognizing subtitles or film names in movie videos, for example, both involve recognizing text in images.
Text in images is mostly recognized with optical character recognition. That method is suited to horizontal text and recognizes it accurately, but when the text in an image is vertical its applicability is poor and the accuracy of the result is hard to guarantee, mainly because the vertical text is located inaccurately in the image. How to accurately determine the position of vertical text in an image, so that the text can subsequently be recognized, has therefore become an urgent technical problem.
Disclosure of Invention
The application provides an image processing method that aims to solve the technical problem of accurately determining the position of vertical text in an image so that the vertical text can subsequently be recognized. The application also provides an image processing apparatus, an electronic device and a computer storage medium.
The application provides an image processing method, comprising the following steps:
acquiring a target image to be processed;
determining, based on the target image, the probability that each pixel in the target image belongs to a vertical-text region;
processing the target image based on the probability that each pixel belongs to a vertical-text region, to obtain a binary image indicating whether each pixel in the target image belongs to a vertical-text region;
and determining the region where the vertical text in the target image is located, based on the binary image and the target image.
Optionally, the determining, based on the target image, the probability that each pixel in the target image belongs to a vertical-text region includes:
obtaining, based on the target image, a feature map representing the pixel features of the target image, and obtaining a probability map representing the probability that each position in the feature map belongs to a vertical-text region; the feature map is an image obtained by applying convolution operations to the target image and describes the features of the pixels in the target image.
Optionally, the processing the target image based on the probability that each pixel belongs to a vertical-text region, to obtain a binary image indicating whether each pixel in the target image belongs to a vertical-text region, includes:
obtaining, based on the probability map and the feature map, the probability that each pixel in the target image belongs to a vertical-text region;
and binarizing each pixel according to a predetermined probability threshold and the obtained probability that the pixel belongs to a vertical-text region, to obtain a binary image indicating whether each pixel in the target image belongs to a vertical-text region.
Optionally, the binarizing each pixel according to a predetermined probability threshold and the obtained probability that the pixel belongs to a vertical-text region includes:
comparing, for each pixel in the target image, the probability that the pixel belongs to a vertical-text region with the predetermined probability threshold, and judging whether that probability is greater than the threshold;
if so, setting the gray value of the corresponding pixel in the target image to 255; if not, setting it to 0;
resetting the gray value of each pixel in the target image in this way yields a binary image indicating whether each pixel in the target image belongs to a vertical-text region.
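As a concrete illustration of this thresholding step, the following sketch binarizes a small probability map into the 255/0 gray values described above (the probability values and the 0.5 threshold are made-up assumptions, not taken from the patent):

```python
import numpy as np

# Hypothetical per-pixel probabilities for a 4x4 target image (values invented).
prob = np.array([[0.1, 0.2, 0.90, 0.8],
                 [0.1, 0.3, 0.95, 0.7],
                 [0.2, 0.1, 0.85, 0.6],
                 [0.1, 0.1, 0.20, 0.1]])

threshold = 0.5
# Pixels whose probability exceeds the threshold get gray value 255 (text);
# the rest get 0 (background), yielding the binary image described above.
binary = np.where(prob > threshold, 255, 0).astype(np.uint8)
```

The two text columns on the right of `prob` come out as a solid 255 stripe, which is exactly the "visual distinction" the binary image is meant to provide.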
Optionally, the obtaining, based on the target image, a feature map representing the pixel features of the target image and a probability map representing the probability that each position in the feature map belongs to a vertical-text region includes:
using the target image as input data of a first convolutional neural network model to obtain the feature map and the probability map; the first convolutional neural network model obtains, from a target image, a feature map representing the image's pixel features and a probability map representing the probability that each position in the feature map belongs to a vertical-text region.
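The first model's two outputs can be mimicked with a toy stand-in. The sketch below is purely illustrative: a single matrix multiplication plus ReLU stands in for the convolutional backbone, a sigmoid head produces the per-position probabilities, and all function and weight names are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def first_model(image, backbone_w, head_w):
    """Toy stand-in for the 'first convolutional neural network model':
    returns a feature map for the image and a probability map giving, for
    each position, the probability of belonging to a vertical-text region."""
    # "Backbone": project each pixel's RGB vector into a feature space
    # (a real model would stack convolutions, as the description says).
    feature_map = np.maximum(image @ backbone_w, 0.0)   # (H, W, C) ReLU features
    # "Head": per-position probability of lying in a vertical-text region.
    prob_map = sigmoid(feature_map @ head_w)            # (H, W), values in (0, 1)
    return feature_map, prob_map

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                           # small RGB "target image"
feat, prob = first_model(image,
                         rng.standard_normal((3, 16)),  # 3 -> 16 channels
                         rng.standard_normal(16))
```

The key property mirrored here is that the model yields both outputs in one forward pass: the feature map for downstream use and a probability map the binarization step can threshold.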
Optionally, the first convolutional neural network model is trained as follows:
obtaining an image sample, a feature-map sample representing the pixel features of the image sample, and a probability-map sample representing the probability that each position in the feature-map sample belongs to a vertical-text region;
providing the image sample to an initial convolutional neural network model, which generates an estimated feature-map sample and an estimated probability-map sample for the image sample;
comparing the feature-map sample with the estimated feature-map sample and the probability-map sample with the estimated probability-map sample, and adjusting the parameters of the initial model according to the comparison results until the difference is within a predetermined threshold range;
and using the parameter-adjusted initial convolutional neural network model as the first convolutional neural network model.
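The training scheme above can be sketched with a toy linear model in place of the CNN. Everything here is illustrative: the data are synthetic, the model is a pair of weight matrices, and the gradient of the probability loss with respect to the feature weights is dropped for brevity; only the compare-and-adjust loop mirrors the described procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 3))            # 50 "pixels" of an image sample (RGB)
true_wf = rng.standard_normal((3, 4))
true_wp = rng.standard_normal(4)
feat_sample = x @ true_wf                   # feature-map sample (ground truth)
prob_sample = feat_sample @ true_wp         # probability-map sample (ground truth)

def train(x, feat_sample, prob_sample, lr=0.01, steps=500):
    w_feat = np.zeros((x.shape[1], feat_sample.shape[1]))
    w_prob = np.zeros(feat_sample.shape[1])
    losses = []
    for _ in range(steps):
        feat = x @ w_feat                   # estimated feature-map sample
        prob = feat @ w_prob                # estimated probability-map sample
        feat_err = feat - feat_sample       # compare with the feature-map sample
        prob_err = prob - prob_sample       # compare with the probability-map sample
        losses.append(float((feat_err ** 2).mean() + (prob_err ** 2).mean()))
        # Adjust parameters according to the two comparison results.
        w_feat -= lr * 2.0 * (x.T @ feat_err) / feat_err.size
        w_prob -= lr * 2.0 * (feat.T @ prob_err) / prob_err.size
    return w_feat, w_prob, losses

w_feat, w_prob, losses = train(x, feat_sample, prob_sample)
```

In a real implementation the stopping condition would be the claimed "difference within a predetermined threshold range" rather than a fixed step count.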
Optionally, the method further includes:
extracting the region where the vertical text in the target image is located;
and performing character recognition on that region to obtain a character recognition result.
Optionally, the performing character recognition on the region where the vertical text is located to obtain a character recognition result includes:
obtaining a feature vector that represents the features of the region where the vertical text is located;
encoding the feature vector and extracting a global feature vector, i.e. a vector obtained by removing the spatial information from the feature vector;
and decoding the global feature vector to obtain the character recognition result.
Optionally, the obtaining a feature vector that represents the features of the region where the vertical text is located includes:
using that region as input data of a second machine learning model, which produces the feature vector representing the features of the region where the vertical text is located.
Optionally, the encoding the feature vector and extracting a global feature vector include:
using the feature vector as input data of a recurrent neural network encoder to extract the global feature vector.
Optionally, the decoding the global feature vector to obtain a character recognition result includes:
using the global feature vector as input data of a recurrent neural network decoder to obtain the character recognition result.
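The encode/decode stage can be illustrated with a deliberately tiny stand-in. The patent specifies recurrent encoder and decoder models; in the sketch below, mean pooling stands in for the encoder that strips spatial information, and a single nearest-embedding lookup stands in for one decoder step. The vocabulary, embeddings and feature values are all invented for the example.

```python
import numpy as np

def recognize(region_feats, char_embeddings, vocab):
    """region_feats: (T, d) feature vectors along the vertical-text region.
    Encoder: collapse the spatial axis into one global feature vector.
    Decoder: score each vocabulary character against the global vector and
    return the best match (one greedy decoding step)."""
    global_vec = region_feats.mean(axis=0)      # spatial information removed
    scores = char_embeddings @ global_vec       # similarity to each character
    return vocab[int(np.argmax(scores))]

vocab = ["A", "B"]
char_embeddings = np.eye(2)                     # toy 2-D character embeddings
region_feats = np.array([[0.2, 0.9],
                         [0.1, 0.8]])           # two positions, 2-D features
result = recognize(region_feats, char_embeddings, vocab)
```

A real recurrent decoder would emit one character per time step until an end token; the single lookup here only shows how the global vector drives character selection.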
Optionally, the determining, based on the binary image and the target image, the region where the vertical text in the target image is located includes:
determining the pixels in the target image that belong to a vertical-text region, based on the pixels in the binary image that belong to the vertical-text region and the mapping relationship between the binary image and the target image;
and determining the region where the vertical text in the target image is located from those pixels.
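Because the binary image is pixel-aligned with the target image, the mapping step reduces to reading off the coordinates of the text pixels; a minimal sketch (the blob position is made up for illustration):

```python
import numpy as np

# Hypothetical binary image, same size as the target image: 255 marks pixels
# judged to lie in a vertical-text region, 0 marks background.
binary = np.zeros((10, 8), dtype=np.uint8)
binary[2:7, 5:7] = 255                  # a tall, narrow blob of "text" pixels

# The binary image and the target image are pixel-aligned, so text pixels map
# one-to-one onto target-image coordinates; the vertical-text region is then
# the bounding box of those pixels in the target image.
ys, xs = np.nonzero(binary)
region = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
# region is (left, top, right, bottom) in target-image coordinates
```

With several separate text columns one would first split the mask into connected components and take a bounding box per component.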
Optionally, the acquiring a target image to be processed includes: acquiring a video containing a target image to be processed; and extracting a video frame containing the target image from the video.
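When the target images come from a video, frames are typically sampled rather than processed exhaustively (with OpenCV's `cv2.VideoCapture`, for instance, one would read only the chosen indices). The index-selection step can be sketched in pure Python; the one-frame-per-second policy is an assumption for illustration:

```python
def sample_frame_indices(total_frames, fps, every_seconds=1.0):
    """Return the indices of the video frames to extract, e.g. one frame per
    second, before running vertical-text detection on each extracted frame."""
    step = max(1, int(round(fps * every_seconds)))
    return list(range(0, total_frames, step))

# A 4-second clip at 25 fps yields one candidate target frame per second.
indices = sample_frame_indices(100, 25)
```

Sampling this way is one source of the computation savings the abstract claims for large video collections.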
The application further provides a film and television video processing method, comprising:
acquiring a film or television video to be processed;
obtaining a target video frame in the video that contains vertical text;
determining, based on the target video frame, the probability that each pixel in the frame belongs to a vertical-text region;
processing the target video frame based on those probabilities to obtain a binary image indicating whether each pixel in the frame belongs to a vertical-text region;
and determining the region where the vertical text in the target video frame is located, based on the binary image and the target video frame.
The application provides an image processing apparatus, including:
a target image acquisition unit for acquiring a target image to be processed;
a probability determining unit for determining, based on the target image, the probability that each pixel in the target image belongs to a vertical-text region;
a binary image obtaining unit for processing the target image based on those probabilities to obtain a binary image indicating whether each pixel in the target image belongs to a vertical-text region;
and a region determining unit for determining the region where the vertical text in the target image is located, based on the binary image and the target image.
The application provides an electronic device, including:
a processor;
a memory for storing a computer program to be executed by the processor for performing the above-mentioned image processing method.
The present application provides a computer storage medium storing a computer program to be executed by a processor to execute the above-described image processing method.
Compared with the prior art, the embodiments of the application have the following advantages:
The image processing method acquires a target image to be processed; determines, based on the target image, the probability that each pixel belongs to a vertical-text region; processes the target image based on those probabilities to obtain a binary image indicating whether each pixel belongs to a vertical-text region; and determines the region where the vertical text is located based on the binary image and the target image. Because the per-pixel probabilities drive the binarization, and the binary image together with the target image pins down the vertical-text region, that region can be determined accurately. This in turn allows the vertical text to be recognized accurately from the determined region, so the method provides a basis for accurate vertical-text recognition in images. In addition, when large volumes of video are processed in the media-asset field, the method reduces the amount of computation while quickly and accurately locating the vertical-text regions in the videos.
Drawings
To illustrate the embodiments of the present application or the prior-art solutions more clearly, the drawings needed in their description are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them.
Fig. 1 is a first schematic diagram of a scene of an image processing method provided in the present application.
Fig. 2 is a second schematic diagram of a scene of an image processing method provided in the present application.
Fig. 3 is a flowchart of an image processing method according to a first embodiment of the present application.
Fig. 4 is a flowchart of an image text obtaining method according to a second embodiment of the present application.
Fig. 5 is a schematic diagram of an image processing apparatus according to a third embodiment of the present application.
Fig. 6 is a schematic diagram of an image text obtaining apparatus according to a fourth embodiment of the present application.
Fig. 7 is a schematic diagram of an electronic device provided in a fifth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The application provides an image processing method, an image text obtaining method, an image processing device, an image text obtaining device, an electronic device and a computer storage medium. An image processing method, an image text obtaining method, an image processing apparatus, an image text obtaining apparatus, an electronic device, and a computer storage medium are described below by specific embodiments, respectively.
The image processing method can be applied to scenes in which text in an image is recognized, and such recognition has many applications. In the traffic field, for example, processing a video or image of a motor vehicle and recognizing its text yields the license plate number, which can then be used to record traffic violations. In the film and television field, processing video frames or images and recognizing subtitle text or the text of titles and credits helps organize subtitles and related names. In the media-asset field, where large volumes of video must be processed and the computational load is heavy, the method can quickly and accurately detect the vertical-text regions in the videos for subsequent recognition of the vertical text.
Specifically, take the application of the image processing method to film and television video as an example. After the video is obtained, the video frames that contain vertical text are detected, and each such frame is taken as a target video frame. Based on the target video frame, the probability that each pixel in the frame belongs to a vertical-text region is determined; the frame is then processed based on those probabilities to obtain a binary image indicating whether each pixel belongs to a vertical-text region. Finally, the region where the vertical text in the frame is located is determined from the binary image and the frame, and can be used for subsequent vertical-text recognition.
Of course, the image processing and image text recognition scenes listed above are only for facilitating understanding of specific applications of the image processing method, and the image processing method of the present application may also be used in other scenes.
Further, the image processing method is in practice used in scenes where vertical text in an image is to be recognized: it determines the region where the vertical text in a video frame or image is located, so that the text can subsequently be recognized from that region to produce a recognition result.
The opening or closing credits of existing films and dramas usually contain text about the people involved in the production, for example the photographer, the make-up staff, the set staff, and the cast. This text often appears in vertical form in the video. To recognize it, the image processing method of the application is applied to the credit frames to determine the region where the vertical text corresponding to the personnel information is located, and the text is then recognized from that region.
The image processing method is mainly used to determine the region where vertical text in an image is located. The image may contain only vertical text, or it may contain both vertical and horizontal text; in either case, the method determines the region where the vertical text is located.
Specifically, please refer to fig. 1, which is a first schematic diagram of a scene of the image processing method provided in the present application, in which, for example, vertical characters in a video frame are identified, and naturally, the image processing method is introduced by taking an area where the vertical characters in the video frame are located as an example.
In this scenario, the image processing method is executed by a server, i.e. a computing device that provides data processing, storage and similar services to clients; in general it may be a single server or a server cluster. The server recognizes the text in the image and provides the recognition result to the client, where the user can perform further processing (for example, filing). Of course, the method may also be executed on the client side: a program or software implementing the method can be pre-configured in the client's electronic device, or a module implementing it can be pre-configured in an installed target application. Such electronic devices are typically smartphones or computers of various types, including tablets; the target application is generally a mobile app or a computer application.
Specifically, referring to fig. 1, the server first obtains a film or television video containing vertical text, extracts from it the video frames to be processed, and uses each such frame as a target image. Taking one video frame (target image A) as an example, the following describes in detail how target image A is processed to determine the region where its text is located. As can be seen from fig. 1, the text contained in target image A is "Photography: Zhang San", and it appears in the image as a vertical column.
In the present application, the image processing method actually processes the target image a based on the following idea. Firstly, determining the probability of whether each pixel point in a target image A belongs to a vertical text area or not based on the target image A; then, processing the target image A based on the probability of whether each pixel point in the target image A belongs to the vertical arrangement character region, and obtaining a binary image for representing whether each pixel point in the target image A belongs to the vertical arrangement character region; finally, based on the binary image and the target image A, the area where the vertical characters in the target image A are located is determined.
Determining, based on target image A, the probability that each pixel in target image A belongs to a vertical-text region may mean: obtaining, from target image A, a feature map representing the pixel features of target image A, and a probability map representing the probability that each position in the feature map belongs to a vertical-text region.
In the present application, the feature map is a vector representation obtained by applying convolution operations to target image A; it encodes the visual features of the image for subsequent computation. In practice, the feature map can be understood as a reduced version of the original image (target image A). An original image generally has three color channels, R, G and B, whose values and combinations produce its colors. The feature map is the feature tensor obtained by convolving the original image: relative to the original image its number of channels increases while its spatial size shrinks, making it a more "refined" representation of the original image. The feature map extracted by convolution makes it possible to obtain, for each pixel of target image A, the probability that it belongs to a vertical-text region, and thereby to locate the vertical-text box.
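The channel-increase and size-decrease behaviour described here can be verified with a minimal strided convolution, written out in NumPy purely for illustration (the kernel size, stride and channel counts are arbitrary choices, not the patent's):

```python
import numpy as np

def conv2d(x, w, stride=2):
    """Minimal valid strided 2-D convolution: x is (H, W, C_in),
    w is (k, k, C_in, C_out); returns an (H', W', C_out) feature map."""
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w, w.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

image = np.random.rand(32, 32, 3)        # RGB "target image", 3 channels
kernel = np.random.rand(3, 3, 3, 16)     # 3x3 kernel, 3 -> 16 channels
feature_map = conv2d(image, kernel)
# The spatial size shrinks (32x32 -> 15x15) while channels grow (3 -> 16),
# matching the "more refined representation" described in the text.
```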
Specifically, based on the target image a, obtaining a feature map for representing the pixel features of the target image a, and obtaining a probability map for representing the probability that each position in the feature map belongs to the vertical text region may refer to: the target image A is used as input data of a first convolution neural network model, a feature map used for representing the pixel features of the target image A is obtained, and a probability map used for representing the probability that each position in the feature map belongs to a vertical text region is obtained. In the application, the first convolution neural network model is used for obtaining a feature map used for representing the pixel features of a target image according to the target image and obtaining a probability map used for representing the probability that each position in the feature map belongs to a vertical-row character area.
After the feature map corresponding to the target image a and the probability map are obtained, based on the probability of whether each pixel in the target image a belongs to the vertical text region, binarization processing is performed on the pixels in the target image a to obtain a binary map for indicating whether each pixel in the target image a belongs to the vertical text region, where the binary map may be: based on the probability graph and the characteristic graph, obtaining the probability that each pixel point in the target image A belongs to the vertical text area; then, according to a predetermined probability threshold value and the obtained probability that each pixel belongs to the vertical arrangement character region, binarization processing is performed on each pixel to obtain a binary image for representing whether each pixel in the target image A belongs to the vertical arrangement character region.
The above-mentioned performing binarization processing on each pixel point according to the predetermined probability threshold and the obtained probability that each pixel point belongs to the vertical text region, to obtain a binary map indicating whether each pixel point in the target image A belongs to the vertical text region, may refer to: comparing the obtained probability that each pixel point in the target image A belongs to the vertical text region with the predetermined probability threshold, and judging whether that probability is greater than the probability threshold; if so, setting the gray value of the corresponding pixel point in the target image A to 255; if not, setting the gray value of the pixel point in the target image A to 0; resetting the gray value of every pixel point in the target image A in this manner yields a binary map indicating whether each pixel point in the target image A belongs to the vertical text region.
The binary image actually visually distinguishes whether each pixel point in the target image A is in the area of the vertical texts in an image mode, and therefore the area of the vertical texts in the target image A can be quickly determined by subsequently combining the binary image with the target image A.
In the binarization process, the gray value of a pixel point whose probability of belonging to the vertical text region in the target image A is greater than the probability threshold is set to 255, and the gray value of a pixel point whose probability is not greater than the probability threshold is set to 0. In effect, all pixel points in the target image A are divided into two classes according to whether their probability of belonging to the vertical text region exceeds the probability threshold, and the gray values of the two classes are set accordingly. Of course, setting the gray values to 255 and 0 is merely an example; any two different values may be used.
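The thresholding described above can be sketched in a few lines of numpy. The function name and the threshold value 0.5 are illustrative assumptions, not details taken from the application:

```python
import numpy as np

def binarize_probability_map(prob_map, threshold=0.5, high=255, low=0):
    """Turn a per-pixel probability map into a binary map.

    Pixels whose probability of belonging to the vertical text
    region exceeds the threshold get gray value `high` (255); all
    others get `low` (0), as described above. The threshold value
    is an illustrative assumption, not taken from the application.
    """
    prob_map = np.asarray(prob_map, dtype=np.float32)
    return np.where(prob_map > threshold, high, low).astype(np.uint8)

# Toy 3x3 probability map: the middle column "looks like" vertical text.
probs = np.array([[0.1, 0.9, 0.2],
                  [0.0, 0.8, 0.1],
                  [0.2, 0.7, 0.0]])
binary = binarize_probability_map(probs, threshold=0.5)
```

As the paragraph above notes, 255 and 0 are only the conventional gray values for a binary image; any two distinct values would serve the same purpose.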
With reference to fig. 1, after determining the area where the vertical characters in the target image a are located, extracting the area where the vertical characters in the target image a are located; and performing character recognition on the area where the vertically arranged characters in the target image A are located to obtain a character recognition result in the target image A.
Specifically, performing character recognition on the region where the vertical text in the target image A is located, to obtain the character recognition result in the target image A, may refer to: obtaining a feature vector for vector representation of the features of the region where the vertical text is located; performing encoding processing on the feature vector and extracting a global feature vector, the global feature vector being a vector obtained by removing the spatial information from the feature vector; and performing decoding processing on the global feature vector to obtain the character recognition result.
The obtaining of the feature vector for vector representation of the features of the region where the vertical text is located may refer to: taking the region where the vertical text is located as input data of a second machine learning model, where the second machine learning model is used for obtaining, from that region, a feature vector for vector representation of its features. In the present application, the region where the vertical text is located is actually a part of the target image A, and is in essence also an image.
Specifically, performing encoding processing on the feature vector to extract the global feature vector may refer to: taking the feature vector as input data of a recurrent neural network encoder (LSTM encoder) to extract the global feature vector. Similarly, decoding the global feature vector to obtain the character recognition result may refer to: taking the global feature vector as input data of a recurrent neural network decoder (LSTM decoder) to obtain the character recognition result.
The above-described character recognition of the region where the vertically arranged characters are located to obtain the character recognition result is merely an example, and other manners may be adopted to perform character recognition of the region where the vertically arranged characters are located to obtain the character recognition result.
After the server obtains the vertical text recognition result, it provides the result to the client. Specifically, what the server provides to the client is a writing display mode reflecting the character recognition result. In the present application, the writing display mode provided by the server may be, for example, of the style "Photography: Zhang San".
To facilitate understanding of the data interaction process between the server and the client, please refer to fig. 2, which is a second schematic diagram of a scenario of the image processing method provided in the present application. First, the client provides the target image A to be processed to the server; after obtaining the target image A, the server determines, based on the target image A, the probability that each pixel point in the target image A belongs to the vertical text region; then, the target image A is processed based on this probability to obtain a binary map indicating whether each pixel point in the target image A belongs to the vertical text region; next, the region where the vertical text in the target image A is located is determined based on the binary map and the target image A; finally, character recognition is performed based on that region to obtain the character recognition result.
And after the server side obtains the character recognition result, the server side provides the character recognition result to the client side.
Fig. 1 to fig. 2 introduced above illustrate application scenarios of the image processing method of the present application. The embodiments of the present application do not specifically limit the application scenario; the scenario above is only one example, provided to facilitate understanding of the image processing method, and is not intended to limit it. Other application scenarios of the image processing method are not described in detail in the embodiments of the present application.
First embodiment
A first embodiment of the present application provides an image processing method, which is described below with reference to fig. 3. It should be noted that the scenario embodiment above further exemplifies and details the present embodiment; for some detailed descriptions of this embodiment, please refer to the scenario embodiment above.
Please refer to fig. 3, which is a flowchart illustrating an image processing method according to a first embodiment of the present application.
The image processing method of the embodiment of the application comprises the following steps:
step S301: and acquiring a target image to be processed.
In this embodiment, since the scenario embodiment above has already described in detail that the target image processed by the image processing method may be a video frame, one way of acquiring the target image to be processed may refer to: acquiring a video containing the target image to be processed, and extracting the video frame containing the target image from the video.
Because the execution main body of the image processing method of the present application may be a server or a client, when the execution main body is the server, acquiring the target image to be processed may refer to: a target image is obtained from a client.
Step S302: and determining, based on the target image, the probability that each pixel point in the target image belongs to the vertical text region.
In this embodiment, as an implementable manner, determining, based on the target image, the probability that each pixel point in the target image belongs to the vertical text region may refer to: obtaining, based on the target image, a feature map for representing the pixel features of the target image, and obtaining a probability map for representing the probability that each position in the feature map belongs to the vertical text region; the feature map is an image obtained by performing a convolution operation on the target image and is used for describing the features of the pixel points in the target image.
As an embodiment, obtaining, based on the target image, a feature map for representing the pixel features of the target image and a probability map for representing the probability that each position in the feature map belongs to the vertical text region may refer to: taking the target image as input data of a first convolutional neural network model, obtaining a feature map for describing the pixel features in the target image, and obtaining a probability map for representing the probability that each position in the feature map belongs to the vertical text region; the first convolutional neural network model is used for obtaining, from the target image, a feature map for representing the pixel features of the target image and a probability map for representing the probability that each position in the feature map belongs to the vertical text region.
The above process is actually a calculation using a convolutional neural network to obtain a feature map representing the visual features of the target image. Because the training process exploits the difference in visual features between vertical text regions and non-vertical-text regions, the information in the feature map can distinguish the two, so that a probability map representing the probability that each position in the feature map belongs to the vertical text region can be obtained.
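The data flow through the first convolutional neural network model can be illustrated with a toy numpy stand-in: random, untrained weights, a single pooling stage in place of a real convolution stack, and a sigmoid head for the probability map. It only demonstrates the shape behaviour described above (channels increase, spatial size decreases) and is in no way the trained model of the application:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_feature_extractor(image, out_channels=8):
    """Stand-in for the first convolutional neural network model.

    Downsamples the H x W x 3 image by 2 (spatial size shrinks)
    while expanding to `out_channels` channels, mimicking the
    behaviour described above. Weights are random, so this only
    illustrates shapes and data flow, not a trained detector.
    """
    h, w, c = image.shape
    # 2x2 average pooling: the "reduction processing" of the original image.
    pooled = image[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2, c).mean(axis=(1, 3))
    # 1x1 convolution implemented as a per-pixel matrix multiply, plus ReLU.
    w1 = rng.standard_normal((c, out_channels)) * 0.1
    feature_map = np.maximum(pooled @ w1, 0.0)
    # Prediction head: per-position probability of belonging to vertical text.
    w2 = rng.standard_normal((out_channels, 1)) * 0.1
    prob_map = 1.0 / (1.0 + np.exp(-(feature_map @ w2)))   # sigmoid
    return feature_map, prob_map[..., 0]

image = rng.random((32, 48, 3))           # H x W x RGB target image
feat, prob = toy_feature_extractor(image) # feat: (16, 24, 8), prob: (16, 24)
```

The feature map has 8 channels at half the spatial resolution, and every entry of the probability map lies strictly between 0 and 1, ready for the thresholding step described later.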
The first convolutional neural network model is obtained by training an initial convolutional neural network model, for example in the following manner:
firstly, an image sample, a feature map sample for representing the pixel feature of the image sample, and a probability map sample for representing the probability that each position in the feature map sample belongs to a vertical text region are obtained.
And then, providing the image sample for the initial convolutional neural network model, and generating an estimated characteristic map sample and an estimated probability map sample corresponding to the image sample by the initial convolutional neural network model.
And then, comparing the characteristic diagram sample with the estimated characteristic diagram sample, comparing the probability diagram sample with the estimated probability diagram sample, and carrying out parameter adjustment on the initial convolutional neural network model according to the comparison result until the difference value of the comparison result is within a preset threshold range.
And finally, taking the initial convolutional neural network model subjected to the parameter adjustment as a first convolutional neural network model.
The training process of the initial convolutional neural network model is actually based on comparing the standard results corresponding to an image sample (such as the feature map sample and the probability map sample) with the output results produced by the initial convolutional neural network model for that image sample, and adjusting the parameters of the initial convolutional neural network model according to the comparison until the difference between the standard results and the output results satisfies a preset condition.
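The compare-and-adjust loop can be sketched with a deliberately tiny stand-in model: a single scalar parameter is compared against a probability-map sample and nudged until the difference falls within a preset threshold. A real implementation would instead backpropagate through the weights of the initial convolutional neural network; all names and values here are illustrative:

```python
import numpy as np

# Standard result for the image sample (a uniform probability-map sample).
prob_map_sample = np.full((4, 4), 0.8)
param = 0.0                 # the single "model parameter" of the stand-in
lr, threshold = 0.5, 1e-3   # learning rate and preset difference threshold

for step in range(1000):
    estimated = np.full((4, 4), param)   # model's estimated probability map
    diff = estimated - prob_map_sample   # compare with the standard result
    loss = float((diff ** 2).mean())
    if loss < threshold:                 # difference within preset range: stop
        break
    param -= lr * 2 * diff.mean()        # parameter adjustment step
```

The loop converges to `param ≈ 0.8`, the value of the probability-map sample; the same compare-adjust-repeat structure is what the paragraph above describes for the full network.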
Step S303: and processing the target image based on the probability that each pixel point in the target image belongs to the vertical text region, to obtain a binary map indicating whether each pixel point in the target image belongs to the vertical text region.

After the feature map for representing the pixel features of the target image and the probability map for representing the probability that each position in the feature map belongs to the vertical text region are obtained, processing the target image based on this probability, to obtain a binary map indicating whether each pixel point in the target image belongs to the vertical text region, may refer to: first, obtaining, based on the probability map and the feature map, the probability that each pixel point in the target image belongs to the vertical text region; then, performing binarization processing on each pixel point according to a predetermined probability threshold and the obtained probabilities, to obtain a binary map indicating whether each pixel point in the target image belongs to the vertical text region.

Specifically, performing binarization processing on each pixel point according to the predetermined probability threshold and the obtained probability that each pixel point belongs to the vertical text region, to obtain a binary map indicating whether each pixel point in the target image belongs to the vertical text region, may refer to: first, comparing the obtained probability that each pixel point in the target image belongs to the vertical text region with the predetermined probability threshold, and judging whether that probability is greater than the probability threshold; if so, setting the gray value corresponding to the pixel point in the target image to 255; if not, setting the gray value corresponding to the pixel point in the target image to 0; then, resetting the gray value of every pixel point in the target image in this manner, to obtain a binary map indicating whether each pixel point in the target image belongs to the vertical text region.
Step S304: and determining the area where the vertical characters in the target image are located based on the binary image and the target image.
After the binary map is obtained, determining, based on the binary map and the target image, the region where the vertical text in the target image is located may refer to: first, determining the pixel points belonging to the vertical text region in the target image, based on the pixel points belonging to the vertical text region in the binary map and the mapping relationship between the binary map and the target image; then, determining the region where the vertical text in the target image is located based on those pixel points.
In this embodiment, since the obtained binary image has a mapping relationship with the target image, when determining the region where the vertical text in the target image is located, the pixel points in the vertical text region in the target image may be determined based on the pixel points in the binary image that belong to the vertical text region and the mapping relationship between the binary image and the target image, and then the region where the vertical text in the target image is located may be determined based on the pixel points in the vertical text region in the target image.
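Assuming the simplest mapping between the binary map and the target image (same size, identity correspondence), the region lookup reduces to collecting the coordinates of the 255-valued pixels; a minimal numpy sketch, with all names illustrative:

```python
import numpy as np

def vertical_text_region(binary_map):
    """Locate the region where the vertical text sits, given a binary
    map in which gray value 255 marks vertical-text pixels.

    Assumes the binary map has the same size as the target image
    (an identity mapping between the two) and a single text region,
    returning its bounding box as (top, left, bottom, right).
    Connected-component analysis would be needed to separate
    multiple distinct text columns.
    """
    ys, xs = np.nonzero(binary_map == 255)
    if ys.size == 0:
        return None  # no vertical text detected
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())

binary = np.zeros((8, 8), dtype=np.uint8)
binary[2:7, 3] = 255                 # a single vertical column of text pixels
box = vertical_text_region(binary)   # bounding box of that column
```

The returned box can then be used to crop (matte out) the corresponding region of the target image for the recognition step.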
After the area where the vertical characters in the target image are located is determined, the area where the vertical characters in the target image are located can be extracted, character recognition is carried out on the area where the vertical characters are located, and a character recognition result is obtained. Extracting the region where the vertically arranged characters are located in the target image may be an operation of performing matting processing on the target image.
Specifically, performing character recognition on the region where the vertical text is located to obtain the character recognition result may refer to: first, obtaining a feature vector V-F for vector representation of the features of the region where the vertical text is located; then, performing encoding processing on the feature vector and extracting a global feature vector V-S, the global feature vector being a vector obtained by removing the spatial information from the feature vector; and finally, performing decoding processing on the global feature vector to obtain the character recognition result.
In the above character recognition process, the spatial information in the image is of little significance for character recognition, while the vectors corresponding to the spatial information in the feature vector V-F account for a large amount of data; therefore, to reduce the amount of data in the character recognition process and improve recognition efficiency, the spatial information is removed during recognition.
As one way to obtain the feature vector for vector representation of the features of the region where the vertical text is located: the region where the vertical text is located is taken as input data of a second machine learning model to obtain the feature vector, where the second machine learning model is used for obtaining, from that region, a feature vector for vector representation of its features. The second machine learning model is obtained by training an initial machine learning model on training samples; the specific training process is similar to the process of obtaining the first convolutional neural network model and is not repeated here.
In this embodiment, performing encoding processing on the feature vector to extract the global feature vector may refer to: taking the feature vector as input data of a recurrent neural network (LSTM) encoder and extracting the global feature vector.

Similarly, decoding the global feature vector to obtain the character recognition result may refer to: taking the global feature vector as input data of a recurrent neural network (LSTM) decoder to obtain the character recognition result.

The encoder and decoder of the recurrent neural network can thus be used to encode and decode the feature vectors to recognize the characters. Of course, it is understood that the recurrent neural network is also a trained neural network model; for the training process, please refer to the obtaining process of the first convolutional neural network model.
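A heavily simplified numpy sketch of the encoder path: a standard LSTM cell consumes the sequence of per-position feature vectors, and its final hidden state serves as the global feature vector, with the spatial (per-position) axis consumed, matching the "remove spatial information" description above. Weights are random, so this illustrates only the data flow; the decoder side would run a second LSTM conditioned on this global vector. All names and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (input/forget/output/candidate gates)."""
    z = W @ x + U @ h + b
    n = h.size
    i = 1.0 / (1.0 + np.exp(-z[:n]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[n:2*n]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*n:3*n]))   # output gate
    g = np.tanh(z[3*n:])                    # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def encode(seq, hidden=16):
    """Run a sequence of feature vectors through an LSTM encoder.

    The final hidden state is returned as the global feature vector:
    the per-position axis of the input has been consumed. Weights are
    random and untrained (illustration only).
    """
    d = seq.shape[1]
    W = rng.standard_normal((4 * hidden, d)) * 0.1
    U = rng.standard_normal((4 * hidden, hidden)) * 0.1
    b = np.zeros(4 * hidden)
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
    return h

# 10 per-position feature vectors (V-F) -> one global feature vector (V-S).
features = rng.random((10, 8))
global_feat = encode(features)
```

Note how the (10, 8) sequence collapses to a single 16-dimensional vector: this is the shape-level meaning of removing spatial information before decoding.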
In this embodiment, if the execution main body of the method is the server, then after the character recognition result is obtained, providing the character recognition result to the client may refer to: providing the client with a writing display mode reflecting the character recognition result, for example of the style "Photography: Zhang San".
If the execution main body of the method is the client, the client obtains a writing display mode capable of reflecting the character recognition result according to the character recognition result after obtaining the character recognition result.
According to the image processing method, the probability that each pixel point in the target image belongs to the vertical text region is determined based on the target image, so that, based on this probability, the target image can be processed to obtain a binary map indicating whether each pixel point belongs to the vertical text region; based on the binary map and the target image, the region where the vertical text in the target image is located can then be accurately determined, so that the vertical text can subsequently be accurately recognized based on that region. The image processing method thus provides a basis for accurately recognizing vertical text in images.
Second embodiment
Corresponding to the image processing method provided in the first embodiment of the present application, a second embodiment of the present application provides an image text obtaining method. The execution main body of the embodiment is a client, and is mainly used for showing a character recognition result provided by the server to the client when the execution main body of the first embodiment is the server. The embodiments described below are merely illustrative.
Please refer to fig. 4, which is a flowchart illustrating an image text obtaining method according to a second embodiment of the present application.
The image text obtaining method comprises the following steps:
step S401: and sending a target image to be processed to the server, wherein the target image comprises vertical texts.
Step S402: obtaining a writing display mode which is provided by a server and can reflect the recognition result of the vertically arranged characters in the target image; the vertical arrangement character recognition result is obtained based on character recognition of the area where the vertical arrangement characters in the target image are located.
According to the image text obtaining method, the target image to be processed is sent to the server, so that the server can subsequently determine, based on the target image, the probability that each pixel point in the target image belongs to the vertical text region; based on this probability, the target image can be processed to obtain a binary map indicating whether each pixel point belongs to the vertical text region, and based on the binary map and the target image, the region where the vertical text in the target image is located can be accurately determined, so that the vertical text can subsequently be accurately recognized based on that region. According to this embodiment, the writing display mode provided by the server and reflecting the vertical text recognition result in the target image can be obtained, which facilitates subsequent work such as filing based on that writing display mode.
Third embodiment
Corresponding to the image processing method provided in the first embodiment of the present application, a third embodiment of the present application also provides an image processing apparatus. Since the device embodiment is substantially similar to the first embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the first embodiment for relevant points. The device embodiments described below are merely illustrative.
Fig. 5 is a schematic view of an image processing apparatus according to a third embodiment of the present application.
The image processing apparatus includes:
a target image obtaining unit 501, configured to obtain a target image to be processed;
a probability determining unit 502, configured to determine, based on the target image, the probability that each pixel point in the target image belongs to a vertically arranged text region;

a binary image obtaining unit 503, configured to process the target image based on the probability that each pixel point in the target image belongs to the vertically arranged text region, so as to obtain a binary image indicating whether each pixel point in the target image belongs to the vertically arranged text region;
an area determining unit 504, configured to determine, based on the binary image and the target image, an area where a vertically-arranged character in the target image is located.
Optionally, the probability determining unit is specifically configured to:
based on the target image, obtaining a feature map used for representing the pixel features of the target image, and obtaining a probability map used for representing the probability that each position in the feature map belongs to a vertical text region; the characteristic graph is an image obtained after convolution operation is carried out on the target image, and the characteristic graph is used for describing the characteristics of pixel points in the target image.
Optionally, the binary image obtaining unit is specifically configured to:
obtaining the probability that each pixel point in the target image belongs to a vertical text area based on the probability graph and the feature graph;
and carrying out binarization processing on each pixel point according to a preset probability threshold value and the obtained probability that each pixel point belongs to the vertical text area to obtain a binary image for representing whether each pixel point in the target image belongs to the vertical text area.
Optionally, the binary image obtaining unit is specifically configured to:
according to the obtained probability that each pixel point in the target image belongs to the vertical text region, comparing the probability that each pixel point belongs to the vertical text region with a preset probability threshold value, and judging whether the probability that each pixel point belongs to the vertical text region is larger than the probability threshold value;
if yes, setting the gray value corresponding to the pixel point in the target image to 255; if not, setting the gray value corresponding to the pixel point in the target image to 0;
according to the mode, the gray value of each pixel point in the target image is reset, and a binary image used for indicating whether each pixel point in the target image belongs to a vertically-arranged character area or not is obtained.
Optionally, the probability determining unit is specifically configured to:
taking the target image as input data of a first convolutional neural network model, obtaining a feature map for describing the pixel features in the target image, and obtaining a probability map for representing the probability that each position in the feature map belongs to the vertical text region; the first convolutional neural network model is used for obtaining, from the target image, a feature map for representing the pixel features of the target image and a probability map for representing the probability that each position in the feature map belongs to the vertical text region.
Optionally, the method further includes: a training unit, configured to obtain the first convolutional neural network model by using the following training method:
obtaining an image sample, a feature map sample for representing the pixel feature of the image sample, and a probability map sample for representing the probability that each position in the feature map sample belongs to a vertical text region;
providing the image sample to an initial convolutional neural network model, wherein the initial convolutional neural network model generates an estimated characteristic map sample and an estimated probability map sample corresponding to the image sample;
comparing the characteristic diagram sample with the estimated characteristic diagram sample, comparing the probability diagram sample with the estimated probability diagram sample, and performing parameter adjustment on the initial convolutional neural network model according to a comparison result until a difference value of the comparison result is within a preset threshold range;
and taking the initial convolutional neural network model adjusted by the parameters as the first convolutional neural network model.
Optionally, the method further includes: an identification unit; the identification unit is specifically configured to:
extracting an area where the vertically arranged characters in the target image are located;
and performing character recognition on the area where the vertically arranged characters are located to obtain a character recognition result.
Optionally, the identification unit is specifically configured to:
obtaining a characteristic vector for carrying out vector representation on the characteristics of the area where the vertically arranged characters are located;
coding the feature vector, and extracting a global feature vector; the global feature vector is a vector obtained by removing spatial information from the feature vector;
and decoding the global feature vector to obtain a character recognition result.
Optionally, the identification unit is specifically configured to:
and taking the area where the vertical arrangement characters are located as input data of a second machine learning model to obtain a feature vector for performing vector representation on the features of the area where the vertical arrangement characters are located, wherein the second machine learning model is used for obtaining the feature vector for performing vector representation on the features of the area where the vertical arrangement characters are located according to the area where the vertical arrangement characters are located.
Optionally, the identification unit is specifically configured to:
and taking the feature vector as input data of a time cycle neural network encoder, and extracting a global feature vector.
Optionally, the identification unit is specifically configured to:
and taking the global feature vector as input data of a time cycle neural network decoder to obtain a character recognition result.
Optionally, the area determining unit is specifically configured to:
determining pixel points belonging to a vertical arrangement character area in the target image based on the pixel points belonging to the vertical arrangement character area in the binary image and the mapping relation between the binary image and the target image;
and determining the area of the vertical texts in the target image based on the pixel points belonging to the vertical text area in the target image.
Optionally, the target image obtaining unit is specifically configured to: acquire a video containing the target image to be processed; and extract a video frame containing the target image from the video.
Fourth embodiment
Corresponding to the image text obtaining method provided in the second embodiment of the present application, a fourth embodiment of the present application further provides an image text obtaining apparatus. Since the apparatus embodiment is substantially similar to the second embodiment, it is described relatively briefly; for relevant details, reference may be made to the description of the second embodiment. The apparatus embodiments described below are merely illustrative.
Please refer to fig. 6, which is a diagram illustrating an image text obtaining apparatus according to a fourth embodiment of the present application.
The image text obtaining device comprises:
a sending unit 601, configured to send a target image to be processed to a server, where the target image includes vertically arranged characters;
a result obtaining unit 602, configured to obtain, from the server, a writing display manner capable of reflecting a vertically arranged character recognition result in the target image; the vertically arranged character recognition result is obtained by performing character recognition on the region where the vertically arranged characters in the target image are located.
Fifth embodiment
Corresponding to the methods of the first and second embodiments of the present application, a fifth embodiment of the present application further provides an electronic device.
As shown in fig. 7, fig. 7 is a schematic view of an electronic device provided in a fifth embodiment of the present application. The electronic device includes: a processor 701; and a memory 702 for storing a computer program, which is executed by the processor to perform the methods of the first and second embodiments.
Sixth embodiment
In correspondence with the methods of the first and second embodiments of the present application, a sixth embodiment of the present application also provides a computer storage medium storing a computer program which is executed by a processor to perform the methods of the first and second embodiments.
Although the present application has been described with reference to preferred embodiments, these are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (16)

1. An image processing method, comprising:
acquiring a target image to be processed;
determining, based on the target image, the probability that each pixel point in the target image belongs to a vertically arranged text region;
processing the target image based on the probability that each pixel point in the target image belongs to a vertically arranged text region, to obtain a binary image indicating whether each pixel point in the target image belongs to a vertically arranged text region;
and determining the area where the vertically arranged characters in the target image are located based on the binary image and the target image.
2. The image processing method according to claim 1, wherein the determining, based on the target image, the probability that each pixel point in the target image belongs to a vertically arranged text region comprises:
based on the target image, obtaining a feature map for representing pixel features of the target image, and obtaining a probability map for representing the probability that each position in the feature map belongs to a vertically arranged text region; the feature map is an image obtained by performing a convolution operation on the target image, and is used for describing the features of the pixel points in the target image.
3. The image processing method according to claim 2, wherein the processing the target image based on the probability that each pixel point in the target image belongs to a vertically arranged text region to obtain a binary image indicating whether each pixel point in the target image belongs to a vertically arranged text region comprises:
obtaining the probability that each pixel point in the target image belongs to a vertically arranged text region based on the probability map and the feature map;
and performing binarization processing on each pixel point according to a predetermined probability threshold and the obtained probability that each pixel point belongs to a vertically arranged text region, to obtain a binary image indicating whether each pixel point in the target image belongs to the vertically arranged text region.
4. The image processing method according to claim 3, wherein the performing binarization processing on each pixel point according to a predetermined probability threshold and the obtained probability that each pixel point belongs to a vertically arranged text region, to obtain a binary image indicating whether each pixel point in the target image belongs to a vertically arranged text region, comprises:
comparing, for each pixel point in the target image, the obtained probability that the pixel point belongs to a vertically arranged text region with the predetermined probability threshold, and judging whether that probability is greater than the probability threshold;
if so, setting the grey value corresponding to the pixel point in the target image to 255; if not, setting the grey value corresponding to the pixel point in the target image to 0;
and resetting the grey value of each pixel point in the target image in this manner to obtain a binary image indicating whether each pixel point in the target image belongs to the vertically arranged text region.
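As an illustration of the thresholding described in claim 4, the following minimal sketch resets each grey value to 255 or 0. The threshold value of 0.5 is an assumption for demonstration; the claim only requires "a predetermined probability threshold".

```python
def binarize(prob_map, threshold=0.5):
    """Set a pixel's grey value to 255 if its probability of belonging to a
    vertically arranged text region exceeds the threshold, otherwise 0."""
    return [[255 if p > threshold else 0 for p in row] for row in prob_map]

# A 2x2 probability map: two pixels exceed the assumed threshold.
print(binarize([[0.9, 0.2], [0.4, 0.7]]))  # [[255, 0], [0, 255]]
```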
5. The image processing method according to claim 2, wherein the obtaining, based on the target image, a feature map representing pixel features of the target image and a probability map representing the probability that each position in the feature map belongs to a vertically arranged text region comprises:
taking the target image as input data of a first convolutional neural network model to obtain a feature map describing pixel features of the target image and a probability map representing the probability that each position in the feature map belongs to a vertically arranged text region; the first convolutional neural network model is used for obtaining, according to the target image, the feature map representing the pixel features of the target image and the probability map representing the probability that each position in the feature map belongs to the vertically arranged text region.
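The feature map of claims 2 and 5 is produced by convolution. As a hedged illustration of that single operation only (the publication does not specify the network architecture, and the image and kernel values below are invented), a valid 2D convolution over a greyscale image can be written as:

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation: each output cell is the kernel-weighted
    sum of the image patch under it -- the basic operation from which a
    convolutional feature map is built."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

# A 3x3 "image" and a 2x2 diagonal kernel.
print(conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]]))
# [[6, 8], [12, 14]]
```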
6. The image processing method according to claim 5, wherein the first convolutional neural network model is obtained by the following training method:
obtaining an image sample, a feature map sample representing pixel features of the image sample, and a probability map sample representing the probability that each position in the feature map sample belongs to a vertically arranged text region;
providing the image sample to an initial convolutional neural network model, wherein the initial convolutional neural network model generates a predicted feature map sample and a predicted probability map sample corresponding to the image sample;
comparing the feature map sample with the predicted feature map sample, comparing the probability map sample with the predicted probability map sample, and adjusting the parameters of the initial convolutional neural network model according to the comparison results until the difference in the comparison results falls within a preset threshold range;
and taking the parameter-adjusted initial convolutional neural network model as the first convolutional neural network model.
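The predict/compare/adjust loop of claim 6 can be sketched with a toy one-parameter linear model standing in for the convolutional network; the model, learning rate, and tolerance below are all invented for illustration, not taken from the patent:

```python
def fit_parameter(sample_x, sample_y, lr=0.1, tol=1e-3, max_steps=1000):
    """Toy version of the claim-6 loop: predict, compare against the sample,
    adjust the parameter, and stop once the difference is within tolerance."""
    w = 0.0  # initial model parameter
    for _ in range(max_steps):
        predicted = w * sample_x       # predicted output for the sample
        diff = predicted - sample_y    # comparison result
        if abs(diff) <= tol:           # difference within preset threshold range
            break
        w -= lr * diff * sample_x      # parameter adjustment (gradient step)
    return w

print(round(fit_parameter(2.0, 4.0), 2))  # ~2.0, since 2.0 * 2.0 == 4.0
```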
7. The image processing method according to claim 1, further comprising:
extracting an area where the vertically arranged characters in the target image are located;
and performing character recognition on the area where the vertically arranged characters are located to obtain a character recognition result.
8. The image processing method of claim 7, wherein the performing character recognition on the region where the vertically arranged characters are located to obtain a character recognition result comprises:
obtaining a feature vector for performing vector representation on the features of the area where the vertically arranged characters are located;
coding the feature vector, and extracting a global feature vector; the global feature vector is a vector obtained by removing spatial information from the feature vector;
and decoding the global feature vector to obtain a character recognition result.
9. The image processing method according to claim 8, wherein the obtaining a feature vector for performing vector representation on the features of the area where the vertically arranged characters are located comprises:
taking the area where the vertically arranged characters are located as input data of a second machine learning model to obtain a feature vector for performing vector representation on the features of that area; the second machine learning model is used for obtaining, according to the area where the vertically arranged characters are located, the feature vector representing the features of that area.
10. The image processing method according to claim 8, wherein said encoding the feature vector and extracting a global feature vector comprises:
and taking the feature vector as input data of a recurrent neural network encoder to extract a global feature vector.
11. The image processing method of claim 10, wherein the decoding the global feature vector to obtain a text recognition result comprises:
and taking the global feature vector as input data of a recurrent neural network decoder to obtain a character recognition result.
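The encode/decode pipeline of claims 8 to 11 can be sketched as follows. This is only an illustrative reading: mean pooling stands in for the recurrent encoder (it likewise discards spatial information), and a nearest-prototype lookup stands in for the recurrent decoder; both stand-ins and the prototype vectors are invented for demonstration.

```python
def encode(feature_vectors):
    """Collapse per-position feature vectors into one global feature vector,
    discarding spatial information (here by simple mean pooling)."""
    n, dim = len(feature_vectors), len(feature_vectors[0])
    return [sum(v[d] for v in feature_vectors) / n for d in range(dim)]

def decode(global_vector, prototypes):
    """Return the character whose prototype vector is nearest the global vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda ch: sq_dist(prototypes[ch], global_vector))

g = encode([[1.0, 0.0], [0.0, 1.0]])                   # -> [0.5, 0.5]
print(decode(g, {"a": [0.5, 0.5], "b": [1.0, 1.0]}))   # "a"
```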
12. The image processing method according to claim 1, wherein the determining, based on the binary image and the target image, the area where the vertically arranged characters in the target image are located comprises:
determining the pixel points belonging to the vertically arranged text region in the target image based on the pixel points belonging to the vertically arranged text region in the binary image and the mapping relation between the binary image and the target image;
and determining the area where the vertically arranged characters in the target image are located based on the pixel points belonging to the vertically arranged text region in the target image.
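The mapping step of claim 12 might look like the following sketch, with two assumptions not stated in the publication: a uniform scale factor stands in for "the mapping relation between the binary image and the target image", and the text region is taken as a bounding box of the flagged pixels.

```python
def text_region(binary_image, scale=1):
    """Map pixels flagged 255 in the binary image back to target-image
    coordinates (an assumed uniform scale stands in for the mapping relation)
    and return their bounding box as the vertically arranged text region."""
    pts = [(i * scale, j * scale)
           for i, row in enumerate(binary_image)
           for j, v in enumerate(row) if v == 255]
    if not pts:
        return None  # no vertically arranged text detected
    rows = [i for i, _ in pts]
    cols = [j for _, j in pts]
    return (min(rows), min(cols), max(rows), max(cols))

print(text_region([[0, 255], [255, 255]], scale=2))  # (0, 0, 2, 2)
```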
13. The image processing method according to claim 1, wherein the acquiring the target image to be processed comprises: acquiring a video containing the target image to be processed; and extracting a video frame containing the target image from the video.
14. A method for processing a film and television video, characterized by comprising:
acquiring a film and television video to be processed;
obtaining a target video frame containing vertically arranged characters in the film and television video;
determining, based on the target video frame, the probability that each pixel point in the target video frame belongs to a vertically arranged text region;
processing the target video frame based on the probability that each pixel point in the target video frame belongs to a vertically arranged text region, to obtain a binary image indicating whether each pixel point in the target video frame belongs to the vertically arranged text region;
and determining the area where the vertically arranged characters in the target video frame are located based on the binary image and the target video frame.
15. An electronic device, comprising:
a processor;
a memory for storing a computer program for execution by the processor to perform the method of any one of claims 1 to 13.
16. A computer storage medium, characterized in that it stores a computer program that is executed by a processor to perform the method of any one of claims 1-13.
CN202210422521.0A 2022-04-21 2022-04-21 Image processing method, image text obtaining method and device and electronic equipment Pending CN115019307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422521.0A CN115019307A (en) 2022-04-21 2022-04-21 Image processing method, image text obtaining method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN115019307A true CN115019307A (en) 2022-09-06

Family

ID=83066419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422521.0A Pending CN115019307A (en) 2022-04-21 2022-04-21 Image processing method, image text obtaining method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115019307A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination