CN113436222A - Image processing method, image processing apparatus, electronic device, and storage medium - Google Patents

Image processing method, image processing apparatus, electronic device, and storage medium

Info

Publication number
CN113436222A
Authority
CN
China
Prior art keywords
image
text
text content
present disclosure
input
Prior art date
Legal status
Pending
Application number
CN202110597964.9A
Other languages
Chinese (zh)
Inventor
麻凯利
马志国
张飞飞
Current Assignee
New Oriental Education Technology Group Co ltd
Original Assignee
New Oriental Education Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by New Oriental Education Technology Group Co ltd
Priority to CN202110597964.9A
Publication of CN113436222A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/136 - Segmentation; Edge detection involving thresholding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/194 - Segmentation; Edge detection involving foreground-background segmentation

Abstract

An image processing method, an image processing apparatus, an electronic device, and a storage medium. The image processing method comprises the following steps: acquiring an input image; determining a first image based on the input image, wherein the first image is an area of the input image that includes text content, and the text in the text content is arranged in a line in the first image; extracting features of the first image by using a dense convolutional network to obtain target features, wherein the dense convolutional network does not include a pooling layer; obtaining the text content based on the target features; obtaining a second image based on the first image and the text content; and replacing the first image in the input image with the second image to generate an output image comprising the second image. The image processing method can preserve the background of the input image while making the text content in the input image more regular, clearer, and of higher resolution, thereby improving the viewing quality of the text content; and the feature information of the first image can be retained to a greater extent, improving the accuracy of text recognition.

Description

Image processing method, image processing apparatus, electronic device, and storage medium
Technical Field
Embodiments of the present disclosure relate to an image processing method, an image processing apparatus, an electronic device, and a non-transitory storage medium.
Background
With the continuous development of educational informatization, video courses are increasingly widely used in teaching. For example, during face-to-face classroom teaching, a recording system can be used to record the classroom teaching content into a classroom teaching video, so that students can watch the video online to review the related teaching content. In addition, classroom teaching videos are also widely used in teaching evaluation, the recording of demonstration classes, teaching observation, remote teaching, and the like.
Disclosure of Invention
At least one embodiment of the present disclosure provides an image processing method, comprising: acquiring an input image; determining a first image based on the input image, wherein the first image is an area of the input image that includes text content, and the text in the text content is arranged in a line in the first image; extracting features of the first image by using a dense convolutional network to obtain target features, wherein the dense convolutional network does not include a pooling layer; obtaining the text content based on the target features; obtaining a second image based on the first image and the text content; and replacing the first image in the input image with the second image to generate an output image comprising the second image.
For example, in an image processing method provided by at least one embodiment of the present disclosure, in convolution kernels of the dense convolution network, a step size of at least one convolution kernel in a first direction is larger than a step size in a second direction.
For example, in an image processing method provided by at least one embodiment of the present disclosure, in convolution kernels of the dense convolution network, a step size of at least one convolution kernel in the first direction is 2, and a step size in the second direction is 1.
For example, in an image processing method provided in at least one embodiment of the present disclosure, the first direction is a height direction, and the second direction is a width direction.
For example, in an image processing method provided by at least one embodiment of the present disclosure, the input image includes an image of a teaching scene, and the text content includes a blackboard-writing content.
For example, in an image processing method provided by at least one embodiment of the present disclosure, obtaining a second image based on the first image and the text content includes: determining a text region and a non-text region in the first image based on the first image; determining a background image of the second image based on the non-text region and the text region; and obtaining the second image based on the text content and the background image.
For example, in an image processing method provided by at least one embodiment of the present disclosure, determining a text region and a non-text region in the first image based on the first image includes: performing graying processing on the first image to obtain a grayscale image; performing binarization processing on the grayscale image to obtain a binary image; and determining the text region and the non-text region in the first image according to the binary image.
For example, in an image processing method provided by at least one embodiment of the present disclosure, performing binarization processing on the grayscale image to obtain a binary image includes: performing binarization processing on the grayscale image by using an adaptive threshold algorithm to obtain the binary image.
For example, in an image processing method provided by at least one embodiment of the present disclosure, obtaining a background image of the second image based on the non-text region and the text region includes: obtaining a first pixel value based on each pixel value of the non-text region; and obtaining the background image based on the first pixel value and the text area.
For example, in an image processing method provided by at least one embodiment of the present disclosure, the first pixel value includes one of an average value, a maximum value, and a minimum value of respective pixel values of the non-text region.
For example, in an image processing method provided by at least one embodiment of the present disclosure, obtaining the second image based on the text content and the background image includes: obtaining a foreground image based on the text content and the size of the background image, wherein the foreground image comprises the text content; and overlaying the foreground image on the background image to obtain the second image.
For example, in an image processing method provided by at least one embodiment of the present disclosure, text images corresponding to the text content are uniformly arranged in the foreground image.
For example, in an image processing method provided by at least one embodiment of the present disclosure, the text content in the second image is presented in a print font.
For example, in an image processing method provided by at least one embodiment of the present disclosure, the text content in the first image is handwritten.
For example, an image processing method provided by at least one embodiment of the present disclosure further includes: acquiring an image sequence comprising a plurality of frames of the output images; extracting at least one frame of output image from the image sequence; and in response to detecting that there is an output image satisfying a first predetermined condition in the at least one frame of output images, storing the output image satisfying the first predetermined condition as a target frame output image, and storing text content corresponding to the target frame output image for use in locating positions within the image sequence.
For example, in an image processing method provided by at least one embodiment of the present disclosure, the first predetermined condition includes: the text similarity between the text content corresponding to the target frame output image and the text content corresponding to the previously detected frame of output image exceeds a first threshold; or, in the case where the first image determined based on the input image includes at least one first image, the difference between the number of first images included in the target frame output image and the number of first images included in the previously detected frame of output image exceeds a second threshold.
For example, in an image processing method provided by at least one embodiment of the present disclosure, the text similarity includes a normalized edit distance.
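For example, the normalized edit distance may be computed as the Levenshtein distance between two recognized strings divided by the length of the longer string. A minimal sketch of this similarity measure is given below; the normalization by the longer string length is an assumption made for illustration and is not specified by the present disclosure.

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance between a and b, normalized by the longer length."""
    if not a and not b:
        return 0.0
    m, n = len(a), len(b)
    # standard dynamic-programming table of edit distances between prefixes
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n] / max(m, n)
```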
For example, in an image processing method provided by at least one embodiment of the present disclosure, determining a first image based on the input image includes: performing text region detection on the input image to obtain a first image comprising the text content.
For example, in an image processing method provided by at least one embodiment of the present disclosure, performing text region detection on the input image to obtain a first image including the text content includes: performing text region detection on the input image using a segmentation-based differentiable binarization network to obtain the first image including the text content.
For example, in an image processing method provided by at least one embodiment of the present disclosure, obtaining the text content based on the target features includes: obtaining sequence features based on the target features through a recurrent neural network; and obtaining the text content based on the sequence features by using a first processing method.
For example, in an image processing method provided in at least one embodiment of the present disclosure, the recurrent neural network includes a bidirectional recurrent neural network.
For example, in an image processing method provided in at least one embodiment of the present disclosure, the first processing method includes a connectionist temporal classification (CTC) method.
For example, an image processing method provided in at least one embodiment of the present disclosure further includes: pre-processing the first image, the pre-processing comprising: at least one of an affine transformation, a thin plate spline transformation.
For example, at least one embodiment of the present disclosure also provides an image processing apparatus including: an acquisition unit configured to acquire an input image; a determination unit configured to determine a first image based on the input image, the first image being an area of the input image that includes text content, the text in the text content being arranged in a line in the first image; a feature extraction unit configured to extract features of the first image by using a dense convolutional network to obtain target features, wherein the dense convolutional network does not include a pooling layer; a recognition unit configured to obtain the text content based on the target features; a fusion unit configured to obtain a second image based on the first image and the text content; and a generating unit configured to replace the first image in the input image with the second image to generate an output image including the second image.
For example, at least one embodiment of the present disclosure also provides an electronic device including: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, wherein the computer readable instructions, when executed by the processor, perform the image processing method according to any embodiment of the disclosure.
For example, at least one embodiment of the present disclosure also provides a non-transitory storage medium that non-transitorily stores computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, perform the image processing method according to any one of the embodiments of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it should be apparent that the drawings described below only relate to some embodiments of the present disclosure and are not limiting on the present disclosure.
Fig. 1 is a frame of video image of a teaching scene provided in accordance with at least one embodiment of the present disclosure;
fig. 2 is a flowchart of an image processing method according to at least one embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating a result of text detection on an input image according to at least one embodiment of the present disclosure;
fig. 4 is a schematic diagram of a first image provided in at least one embodiment of the present disclosure;
fig. 5 is a schematic diagram of a preprocessed first image according to at least one embodiment of the present disclosure;
fig. 6 is a schematic diagram of a text recognition method according to at least one embodiment of the present disclosure;
fig. 7 is a schematic diagram of a connectionist temporal classification (CTC) method according to at least one embodiment of the present disclosure;
fig. 8 is a table comparing the structure of a dense convolutional network (DenseNet) according to at least one embodiment of the present disclosure with that of the conventional DenseNet-121;
FIG. 9 is a first image with a non-uniform density profile provided by at least one embodiment of the present disclosure;
fig. 10 is a processed first image provided by at least one embodiment of the present disclosure;
fig. 11 is a schematic flowchart corresponding to step S105 according to at least one embodiment of the present disclosure;
fig. 12 is a schematic flowchart corresponding to step S201 according to at least one embodiment of the present disclosure;
fig. 13 is a schematic diagram of a second image provided in accordance with at least one embodiment of the present disclosure;
fig. 14 is a flowchart of another image processing method provided by at least one embodiment of the present disclosure;
fig. 15 is a schematic diagram of an image processing method applied to a teaching scene according to at least one embodiment of the present disclosure;
fig. 16 is a schematic block diagram of an image processing apparatus according to at least one embodiment of the present disclosure;
fig. 17 is a schematic block diagram of an electronic device provided in at least one embodiment of the present disclosure;
fig. 18 is a schematic block diagram of another electronic device provided in at least one embodiment of the present disclosure; and
fig. 19 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
When recording classroom teaching videos, a camera is generally used to capture video toward the platform at the front of the classroom. In actual classroom teaching settings, the resolution of the cameras used to record the blackboard area differs between classrooms, and the differences between devices can be large. For example, in some classrooms, limited by low-bandwidth lines, some devices compress the video for transmission, resulting in a small file size but poor quality. In a complex classroom scene, because the blackboard is generally wide, the distances from the camera to the middle and to the two sides of the blackboard differ considerably, so the focus varies across the board and parts of the blackboard writing in the video easily end up with poor quality. For example, fig. 1 is a frame of a video image of a teaching scene provided in at least one embodiment of the present disclosure; as shown in fig. 1, the video image has low resolution, especially in the blackboard-writing area on the left, where the blackboard-writing characters are difficult to read.
To address the above problems of poor quality and illegible blackboard-writing characters in video images, three methods are commonly used. The first is indiscriminate blind repair of regions in the video image, such as sharpening, edge enhancement, histogram equalization, and Laplacian filtering. However, such indiscriminate blind repair often performs poorly on complex scenes with shadows or varied backgrounds, and adjusting the entire image may degrade the quality of non-text regions. The second is to separate text from background using a fixed threshold or the like. However, processing the text area with various fixed thresholds makes it difficult to adapt to classrooms with different writing surfaces (e.g., blackboard or whiteboard) and different lighting (e.g., half of the blackboard in direct sunlight and the other half not); the robustness is poor, and it is difficult to accurately distinguish the handwritten blackboard writing from the background. The third is to apply a deep-learning-based image enhancement model to the text regions. However, this method enhances the text region at the pixel level, so its robustness depends heavily on model performance, and it does not exploit the semantic information of the text.
The inventors of the present application have found that if the blackboard-writing information is reconstructed on a new background, missing even a few elements in a long video degrades the viewing experience, whereas enhancing the blackboard-writing content on the original video avoids this drawback.
At least one embodiment of the present disclosure provides an image processing method, including: acquiring an input image; determining a first image based on the input image, wherein the first image is an area of the input image that includes text content, and the text in the text content is arranged in a line in the first image; extracting features of the first image by using a dense convolutional network to obtain target features, wherein the dense convolutional network does not include a pooling layer; obtaining the text content based on the target features; obtaining a second image based on the first image and the text content; and replacing the first image in the input image with the second image to generate an output image comprising the second image.
At least one embodiment of the present disclosure also provides an image processing apparatus, an electronic device, and a non-transitory storage medium corresponding to the above-described image processing method.
According to the image processing method provided by at least one embodiment of the present disclosure, by generating the output image including the second image, the text content in the input image can be made more regular, clearer, and of higher resolution while the background of the input image is preserved, so that the viewing quality of the text content is improved. By adopting a dense convolutional network without a pooling layer, the feature information of the first image is retained to a greater extent, thereby improving the accuracy of text recognition.
In the following, the image processing method according to at least one embodiment of the present disclosure is described in a non-limiting manner with reference to the drawings. As described below, different features of these specific examples or embodiments may be combined with each other without conflict to obtain new examples or embodiments, which also fall within the protection scope of the present disclosure. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components have been omitted. When any component of an embodiment of the present disclosure appears in more than one drawing, that component is represented by the same or a similar reference numeral in each drawing.
Fig. 2 is a flowchart of an image processing method according to at least one embodiment of the present disclosure. For example, as shown in fig. 2, the image processing method 10 may be applied to a computing device, which includes any electronic device with computing capability, such as a smartphone, a notebook computer, a tablet computer, a desktop computer, a server, and the like; the embodiments of the present disclosure do not limit this. For example, the computing device has a central processing unit (CPU) or a graphics processing unit (GPU), and further includes a memory. The memory is, for example, a non-volatile memory (e.g., a read-only memory (ROM)) on which the code of an operating system is stored. For example, the memory further stores code or instructions, and by executing the code or instructions, the image processing method provided by the embodiments of the present disclosure can be implemented.
For example, as shown in fig. 2, the image processing method 10 may include the following steps S101 to S106. It should be noted that, in the embodiments of the present disclosure, steps S101 to S106 may be executed sequentially or in another adjusted order, and some or all of the operations in steps S101 to S106 may also be executed in parallel. For example, in some examples, the image processing method 10 provided by at least one embodiment of the present disclosure may selectively perform some of steps S101 to S106, or may perform additional steps other than steps S101 to S106; the embodiments of the present disclosure are not particularly limited in this respect.
Step S101: an input image is acquired.
For example, in at least one embodiment of the present disclosure, the input image may be captured by a camera, downloaded from a network (e.g., the internet), or read locally, etc., which is not limited in this respect. For example, the text content is included in the input image. For example, the text content may refer to words, letters, symbols, figures, and the like, and the embodiments of the present disclosure are not limited thereto.
For example, in at least one embodiment of the present disclosure, the input image may comprise an input image of a teaching scene, and the text content may comprise blackboard-writing content. For example, the input image may be an image as shown in fig. 1; in a typical teaching scene, a teacher lectures at the platform, supplemented by various blackboard writing. For example, in the embodiments of the present disclosure, a writing board refers to a carrier that can be used to display the teacher's teaching content, including a blackboard, a whiteboard, the projection area of a PPT, and the like; for example, on a writing board, a teacher may explain the teaching content through blackboard-writing content (i.e., text content) such as characters, figures, and symbols. It should be understood that a writing board such as a blackboard, a whiteboard, or a PPT projection area is considered a writing board regardless of whether specific blackboard-writing content is present on it.
For example, the input image of the teaching scene may include a photo and a video image obtained by shooting the direction of the platform of the teaching scene through a camera (e.g., a camera of a camera, a camera of a smartphone, etc.). For example, the picture of the input image of the teaching scene generally includes the picture of the aforementioned blackboard-writing (i.e., the blackboard-writing area).
For example, in some embodiments, the input image may be a color image. For example, color images include, but are not limited to, color images having three color channels, and the like. For example, the three color channels include a first color channel, a second color channel, and a third color channel. For example, the three color channels correspond to three primary colors, respectively. For example, in some embodiments, the first color channel is a red (R) channel, the second color channel is a green (G) channel, and the third color channel is a blue (B) channel, that is, the color image may be a color image in RGB format, and it should be noted that the embodiments of the present disclosure include but are not limited thereto. For example, in other embodiments, the input image may also be a grayscale image.
Step S102: based on the input image, a first image is determined, the first image is an area including text content in the input image, and text in the text content is arranged in a line in the first image.
For example, in at least one embodiment of the present disclosure, the first image may refer to an image of a region of the input image where text content appears, and the text in the text content is arranged in a line in the first image. For example, in some embodiments, the first image may be a text box covering only the length of one line of text. Generally, for text recognition in a complex scene, the position of the text is located first, that is, text region detection is performed to obtain a text region image (e.g., the first image), so that the subsequent text recognition operation can be performed more efficiently and accurately. For example, where the input image is an image of a teaching scene, there may be multiple lines of text content in the blackboard writing in the image. The image processing method provided by the embodiments of the present disclosure processes one line of text content in the blackboard writing at a time; when multiple lines of text content in the blackboard writing need to be processed, the corresponding operations of the image processing method only need to be performed for each line of text content in turn, so that all the text content in the blackboard writing can be processed.
For example, in at least one embodiment of the present disclosure, step S102 may include: performing text detection on the input image to obtain a first image comprising the text content.
Fig. 3 is a schematic diagram of a result of an input image after text detection according to at least one embodiment of the present disclosure. For example, the input image may include one or more lines of text. For example, in one example, the input image comprises a line of text, for example only one first image, in which case a subsequent text recognition operation may be performed on the detected first image. For example, in another example, as shown in fig. 3, the input image may include a plurality of lines of text, for example, a plurality of first images, in which case, the text recognition operation may be sequentially performed on the plurality of first images, and the text content of the plurality of first images may be output, which is not limited by the embodiment of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, various conventional text detection methods may be used to detect the text region of the input image, and the embodiments of the present disclosure are not limited thereto as long as the first image including the text content in the input image is obtained.
Currently, text detection methods fall roughly into two categories: regression-based methods and segmentation-based methods. In general, for complex scenes (e.g., teaching scenes), a segmentation-based scene text detection method is often more accurate. For example, in at least one embodiment, a segmentation-based differentiable binarization network may be employed to perform text region detection on the input image, resulting in a first image including the text content.
For example, in some embodiments, after the input image is fed into a segmentation-based differentiable binarization network, a feature map is obtained through operations such as feature extraction and upsampling fusion; a probability map and a threshold map are predicted from the feature map, and an approximate binary map is calculated from the probability map and the threshold map, so that the first image (e.g., a text region or a text box) can be inferred from the probability map or the approximate binary map. In the above method, the probability map outputs the probability that each pixel belongs to text, indicating the pixel region where the text is located. An advantage of this method is that adaptive binarization is performed at each pixel, the binarization threshold is learned by the network, and the binarization step is fully integrated into network training, so the final output is more robust to the choice of threshold. Owing to the threshold map, the network can learn text boundaries well, so the detection effect on tilted, distorted, and similar text is good.
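As a rough illustration of how the probability map and threshold map can be combined, the following sketch reproduces the approximate binary map formula commonly used in the published differentiable binarization method; the amplification factor k = 50 is an assumption taken from that method rather than from the present disclosure.

```python
import numpy as np

def approximate_binary_map(prob_map: np.ndarray, thresh_map: np.ndarray,
                           k: float = 50.0) -> np.ndarray:
    # A steep sigmoid around the learned per-pixel threshold, so the
    # binarization step itself is differentiable and can be trained.
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```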
For example, by adopting the segmentation-based differentiable binarization network, a polygonal text region can be output, which avoids the problem of inaccurate fitting with rectangular boxes. Of course, the text region may also take other shapes such as a rectangle or a stepped shape; the embodiments of the present disclosure are not limited in this regard.
For example, in some embodiments, a Connectionist Text Proposal Network (CTPN) may be employed to perform text detection on the input image. For example, CTPN detects small-scale text boxes, connects the text boxes using a recurrent mechanism, and refines the text line edges. It should be noted that the embodiments of the present disclosure do not limit the text detection method.
For example, in some embodiments, an image of an area located within a gray quadrilateral frame as shown in fig. 3 may be taken as the first image. For example, the input image shown in fig. 3 includes six lines of characters (i.e., six quadrangular text boxes), six first images may be detected based on the input image of fig. 3, so that a text recognition operation is subsequently performed on the six first images in sequence.
For example, in some embodiments, each edge of the first image (text box or text detection area) is within a threshold distance of the nearest character, for example, 2-5 mm. For example, when the text box is a rectangle, the distances from the top and bottom edges of the rectangle to the top and bottom of the line of characters do not exceed a threshold, and the distances from the left and right edges of the rectangle to the two ends of the line do not exceed a threshold. For example, the threshold may be adjusted as needed; that is, the size of the first image may be defined and adjusted according to the text detection method used.
For example, in at least one embodiment of the present disclosure, text recognition may be performed on one or more first images obtained through step S102 to determine text content in each first image. It should be noted that various conventional text recognition methods may be adopted, and the embodiment of the disclosure is not limited thereto as long as the corresponding text content can be recognized based on the first image.
For example, in at least one embodiment of the present disclosure, to improve the accuracy of text recognition, the first image may be preprocessed before performing text recognition on the first image.
Fig. 4 is a schematic diagram of a first image according to at least one embodiment of the present disclosure, and fig. 5 is a schematic diagram of a preprocessed first image according to at least one embodiment of the present disclosure. For example, in one example, after text region detection is performed on the input image, a first image as shown in fig. 4 is obtained; as shown in fig. 4, the first image is a tilted quadrangular text region. In order to improve the accuracy of the subsequent text recognition operation, the tilted quadrangular text region is converted into a rectangular region by preprocessing operations such as an affine transformation or a thin plate spline (TPS) transformation, as shown in fig. 5. For example, an affine transformation is a linear transformation of a vector space followed by a translation into another vector space; it allows the graphic to be arbitrarily tilted and arbitrarily stretched in two directions. For example, the thin plate spline (TPS) transform is typically based on 2D interpolation and is often applied in image registration: N matching points are found in two images, and applying TPS deforms these N points to their corresponding positions while defining the deformation of the whole space. For example, a TPS transform can further correct local distortions based on adaptive sampling points, improving the accuracy of subsequent text recognition.
It should be noted that the preprocessing operation may include other operations, such as perspective transformation, in addition to the above affine transformation and thin-plate spline transformation, and the embodiment of the present disclosure is not limited thereto.
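As a minimal sketch of such a rectification step, the following example warps a tilted quadrilateral text box to an axis-aligned rectangle with a perspective transform; this is a simpler stand-in for the affine or TPS transforms described above, and the corner ordering and output size are illustrative assumptions.

```python
import cv2
import numpy as np

def rectify_text_region(image: np.ndarray, quad, out_w: int = 100, out_h: int = 32) -> np.ndarray:
    # `quad` holds the four detected corners in top-left, top-right,
    # bottom-right, bottom-left order (an assumed detector output format).
    src = np.asarray(quad, dtype=np.float32)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, matrix, (out_w, out_h))
```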
For example, in at least one embodiment of the present disclosure, the neural network model may be used to implement text recognition on the first image, and of course, other conventional text recognition methods may also be used.
For example, in one example, a neural network model as shown in fig. 6 is employed for text recognition. For example, the network structure of the neural network model may be composed of three parts: from bottom to top, a convolutional layer, a recurrent layer, and a transcription layer. For example, the convolutional layer may employ any convolutional neural network (CNN), such as the CNN at the bottom of fig. 6, for extracting features from the first image (e.g., the first image in fig. 5) to obtain a convolutional feature map. For example, in one example, the first image has a size of (32, 100, 3) and, after passing through the CNN, is converted into a convolutional feature map of size (1, 25, 512), where the size is expressed as (height, width, channels).
In order to feed the features into the recurrent layer, a sequence of feature vectors is extracted from the feature map output by the CNN; each feature vector is generated column by column from left to right on the feature map. For example, in the above example, each feature vector extracted from the convolutional feature map of size (1, 25, 512) contains 512-dimensional features, and these feature vectors form the feature sequence. The feature sequence is provided as the input to the recurrent layer, with one feature vector fed to the recurrent neural network (RNN) at each time step; in the above example of a feature map of size (1, 25, 512), there are a total of 25 time steps. For example, the recurrent layer may employ any recurrent neural network (RNN), such as a bidirectional RNN (BiRNN) or a bidirectional long short-term memory (BiLSTM) network, such as the RNN shown in fig. 6, for processing the feature sequence: it learns from each feature vector in the sequence and outputs the sequence features (e.g., probability distributions over all labels). For example, each time step has one input feature vector, and for each input feature vector one sequence feature, such as a probability vector, is output.
For example, the transcription layer may employ a first processing method, such as connectionist temporal classification (CTC), to predict the label sequence with the highest combined probability from the sequence features (probability vectors) output by the RNN, thereby obtaining the text content. For example, a label may be a character or a word, which may be set according to actual requirements. The CTC method can handle the alignment between inputs of variable length and output character sequences of variable length.
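For example, a minimal skeleton of this convolutional layer / recurrent layer / transcription head structure is sketched below; the layer sizes, the height-collapsing pooling, and the two-layer placeholder CNN are illustrative assumptions (the backbone actually used in this disclosure is the adjusted DenseNet described later), and the output is the per-time-step class scores that a CTC loss or decoder would consume.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes: int, feat_channels: int = 512):
        super().__init__()
        # placeholder convolutional feature extractor; height is collapsed to 1
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_channels, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),
        )
        self.rnn = nn.LSTM(feat_channels, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_classes)  # per-time-step label scores for CTC

    def forward(self, x):                  # x: (B, 3, H, W)
        f = self.cnn(x)                    # (B, C, 1, T)
        f = f.squeeze(2).permute(0, 2, 1)  # (B, T, C) feature sequence
        seq, _ = self.rnn(f)               # (B, T, 512)
        return self.fc(seq)                # (B, T, num_classes)
```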
For example, fig. 7 is a schematic diagram of a CTC method provided in at least one embodiment of the present disclosure. The CTC method solves the alignment problem by predicting blank characters. For example, in one example, as shown in fig. 7, the CTC decoding may comprise: (1) merging repeated characters in the prediction; (2) removing blank characters; (3) outputting the remaining characters. As shown in fig. 7, for example, several consecutive "h" characters are predicted; since there is no blank character between them, they are merged. In other words, repeated characters are merged only when no blank character separates them. For example, there is a blank between the two "l" characters of "hello" in the figure, so two "l" characters remain after the merge. Therefore, if a string contains 8 repeated "h" characters, it must appear as "h-h-h-h-h-h-h-h" (with blanks in between) before step (2); to predict these 8 repeated "h" characters, the model must be able to predict at least 15 characters, otherwise some of the "h" characters would be merged.
It should be noted that, for brevity, the CTC process is described only briefly herein, and reference is made specifically to the relevant literature on the CTC process.
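For example, a minimal sketch of this greedy CTC decoding (merge repeats, then drop blanks) is given below; representing labels as integer ids with blank id 0 is an assumption made for illustration.

```python
def ctc_greedy_decode(label_ids, blank: int = 0):
    # Collapse a per-time-step best-path sequence as in fig. 7:
    # merge repeated labels, then drop the blank label.
    out, prev = [], None
    for idx in label_ids:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out

# e.g. ctc_greedy_decode([8, 8, 0, 5, 5, 12, 0, 12]) -> [8, 5, 12, 12]
```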
For example, by using the text recognition method described in fig. 6, it can be recognized that the text content in the first image is "fourteenth reading (two) of scholar chinese" based on the first image in fig. 5. In the embodiment of the present disclosure, other text recognition methods may be used as long as the corresponding text content can be obtained based on the first image.
Since the length and width of the first image (text box or text region) are not fixed, and in most scenes (for example, teaching scenes) the first image contains handwritten text, the detected first image often has non-uniform character density or large blank areas; directly applying some conventional text recognition models therefore gives unsatisfactory results. Thus, in at least one embodiment of the present disclosure, part of the network structure is adjusted based on the neural network model shown in fig. 6.
Step S103: and extracting the characteristics of the first image by adopting a dense convolution network to obtain the target privilege, wherein the dense convolution network does not comprise a pooling layer.
For example, in at least one embodiment, the convolutional neural network in the convolutional layer described above may employ a dense convolutional network (DenseNet) for image feature extraction.
For example, in a conventional convolutional neural network, information can only be passed forward layer by layer, whereas in a dense convolutional network all layers are directly connected while maximizing information flow between layers of the network, so the information in the features is better utilized and the recognition accuracy of the model is improved. For example, compared with a residual network (ResNet), a dense convolutional network further alleviates the vanishing-gradient problem, improves feature reuse, and reduces the number of network parameters.
Fig. 8 is a table comparing the structure of a dense convolutional network (DenseNet) according to at least one embodiment of the present disclosure with that of the conventional DenseNet-121.
For example, in at least one embodiment of the present disclosure, as shown in fig. 8, the adjusted dense convolutional network (e.g., DenseNet in fig. 8) does not include a pooling layer, in contrast to a conventional dense convolutional network (e.g., DenseNet-121 in fig. 8). For example, the separate pooling layer in the conventional DenseNet-121 is removed, and the output dimensions of each layer are adjusted only by means of the convolution kernel stride, so as to prevent the network from losing pixel-level features too quickly; the pixel information of the first image is thus retained to a greater extent, improving the accuracy of text recognition.
For example, in at least one embodiment of the present disclosure, in the convolution kernels of the dense convolution network, the step size of at least one convolution kernel in the first direction is larger than the step size in the second direction, so that the feature map dimension decreases faster in the first direction than in the second direction. For example, the step size of the at least one convolution kernel in the first direction is 2 and the step size in the second direction is 1. For example, the first direction is a height direction, and the second direction is a width direction. It should be noted that the step size in the first direction and the step size in the second direction may also be other specific values, and the embodiment of the disclosure is not limited thereto.
For example, in at least one embodiment of the present disclosure, the convolution kernel step size of each layer in the dense convolution network may be set to be 2 in height step size and 1 in width step size.
For example, in at least one embodiment of the present disclosure, as shown in fig. 8, the conventional DenseNet-121 structurally includes layers 1-8; assuming an input dimension of 224 × 224, the final output dimension is 7 × 7. For the DenseNet network provided by at least one embodiment of the present disclosure, assuming an input dimension of 224 × 32, the final output dimension is 56 × 1. As is apparent from fig. 8, compared with DenseNet-121, the DenseNet network provided by at least one embodiment of the present disclosure does not include the pooling layer at layer 2, so the degree of dimension reduction in the early stage of the network is lowered and the network retains more pixel information, thereby improving the accuracy of text recognition. Because the feature map dimensions are reduced by convolution operations with learnable parameters rather than by parameter-free pooling operations, more feature information is preserved.
For example, compared with the conventional DenseNet-121, the DenseNet network provided by at least one embodiment of the present disclosure replaces the stride (2,2) with a convolution kernel stride of (1,2) in the transition layer modules of the 4th and 8th layers, that is, a stride of 1 in the width direction and 2 in the height direction. Therefore, only the pixel dimension in the height direction is reduced while the dimension in the width direction is not, which improves the network model's ability to recognize dense characters and thus the accuracy of text recognition.
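For example, a minimal sketch of such a pooling-free transition block is given below, reducing dimensions only through the convolution stride, with stride 2 along the height and 1 along the width (PyTorch expresses stride as (height, width)); the channel counts and the use of a 3 × 3 strided convolution are illustrative assumptions rather than the exact layer configuration of fig. 8.

```python
import torch.nn as nn

class PoolFreeTransition(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            # strided convolution instead of average pooling:
            # halves the feature map height, keeps the width unchanged
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=(2, 1), padding=1, bias=False),
        )

    def forward(self, x):
        return self.block(x)
```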
For example, with the adjusted dense convolutional network described above, the final output dimension is 56 × 1, and the number of characters the model can theoretically recognize in a single sample is 0-56. In some cases (e.g., where all the characters in a single line are identical), the number of recognizable characters is 0-28, since in the aforementioned CTC method blank characters are required between repeated characters so that they are not merged. By contrast, with a conventional dense convolutional network (e.g., DenseNet-121 in fig. 8) operating on images of the same dimension 224 × 32 with a convolution kernel stride of (2,2), when the output height is 1 the width is 7, so the number of theoretically recognizable characters is 0-7, and in some cases (e.g., where all the characters in a single line are identical) only 4 characters can be recognized; to recognize more characters per line, the input resolution or the width of the input image would have to be increased. Further, even if the theoretically recognizable number of characters meets the actual demand, a model that can theoretically recognize at most 7 characters has difficulty predicting all 7 due to the non-uniform density of the handwritten blackboard writing in the text line (the first image) or large blank spaces.
Fig. 9 is a first image with non-uniform density distribution according to at least one embodiment of the present disclosure, and fig. 10 is a processed first image according to at least one embodiment of the present disclosure.
For example, when performing CTC-based prediction on the first image shown in fig. 9, the feature vector is divided into N uniform regions and at most N characters are predicted. The recognition result is good when the text is uniformly distributed, but may be poor when it is not. For example, in one example, when the model operates at a certain resolution, e.g., 224 × 32, the output dimension is 7 × 1. As shown in fig. 10, the uniformly divided text region contains three blank regions, so at most 4 valid characters can be predicted. By contrast, the model described above with an output dimension of 56 × 1 has sufficient resolution to distinguish the characters.
Step S104: text content is obtained based on the target features.
For example, in conjunction with the image-based text recognition method as shown in fig. 6, step S104 may include obtaining sequence features based on the target features through a recurrent neural network (e.g., bi-directional LSTM); and deriving the textual content based on the sequence features using a first processing method (e.g., a CTC method). For example, in one example, after DenseNet shown in fig. 8, feature vectors with height dimension of 1 and width of T may be sequentially input into bi-directional LSTM, sequence features are extracted, and then corresponding text content is predicted by using a CTC method. It should be noted that, for the detailed description of step S104, reference may be made to the related description of the text recognition method of fig. 6, and details are not repeated here.
Step S105: and obtaining a second image based on the first image and the text content.
For example, in the embodiments of the present disclosure, step S105 achieves the effect of realistically replacing the text within the textured background of the original scene; for example, illegible handwritten text is replaced with a regular, appropriately sized print font, thereby improving the viewing quality of the text content in the video.
Fig. 11 is a schematic flowchart corresponding to step S105 according to at least one embodiment of the present disclosure, fig. 12 is a schematic flowchart corresponding to step S201 according to at least one embodiment of the present disclosure, and fig. 13 is a schematic diagram of a second image according to at least one embodiment of the present disclosure.
For example, in at least one embodiment of the present disclosure, as shown in fig. 11, for step S105, deriving the second image based on the first image and the text content may include the following steps S201 to S203.
Step S201: based on the first image, a text region and a non-text region in the first image are determined.
For example, in at least one embodiment of the present disclosure, for a first image (e.g., a text region or a text box), the text region and the non-text region need to be separated. For example, as shown in fig. 12, for step S201, the following steps S301 to S303 may be included.
Step S301: and carrying out graying processing on the first image to obtain a grayscale image.
For example, in some embodiments, the input image is a color image (e.g., a color image in RGB format), and accordingly, the first image is also a color image; in this case, the color image (e.g., the first image) may be converted into a grayscale image by a commonly used conversion formula. Taking the example of converting a color image in RGB format into a grayscale image, the conversion can be performed by the following conversion formula:
Gray=R*0.299+G*0.587+B*0.114,
where Gray represents the luminance of the grayscale image, and R, G, and B represent the red information (i.e., the data of the red channel), the green information (i.e., the data of the green channel), and the blue information (i.e., the data of the blue channel) of the color image in RGB format, respectively. Of course, other conversion formulas may be used to convert the color image into a grayscale image, and the embodiments of the present disclosure are not limited in this respect.
For example, in other embodiments, the input image is a grayscale image, and accordingly, the first image is also a grayscale image; in this case, the first image may be directly taken as a grayscale image.
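For example, a minimal sketch of the above conversion formula applied to a whole color image is given below; the (H, W, 3) RGB channel ordering is an assumption (images loaded with OpenCV are in BGR order and would need reordering first).

```python
import numpy as np

def to_gray(rgb: np.ndarray) -> np.ndarray:
    # Gray = 0.299 * R + 0.587 * G + 0.114 * B, applied per pixel
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return (rgb.astype(np.float32) @ weights).astype(np.uint8)
```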
Step S302: and carrying out binarization processing on the gray level image to obtain a binary image.
For example, in at least one embodiment of the present disclosure, a suitable value may be selected as a threshold value, and the grayscale image may be subjected to binarization processing to obtain a binary image. For example, in one example, the average of the gray levels of all pixels in a gray-scale image is selected as the threshold. It is noted that the present disclosure includes but is not limited thereto. For example, in practical applications, the threshold value in the binarization processing may also be determined in any other feasible manner.
For example, in at least one embodiment of the present disclosure, in order to adapt to an image of a complex scene, such as an image of a teaching scene (which typically has different illumination conditions in different regions of the image), an adaptive threshold algorithm may be employed to binarize the grayscale image to obtain a binary image. For example, the adaptive threshold algorithm may adapt well to local pixel blocks. The adaptive thresholding algorithm determines the threshold for a pixel based on a small region around the pixel. Thus, different threshold values can be obtained for different areas of the same image, which provides better binarization results for images with varying illuminance.
It should be noted that various conventional methods for generating an adaptive threshold may be used, and the embodiments of the present disclosure are not limited thereto.
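For example, a minimal sketch using OpenCV's mean-based adaptive thresholding is given below; the image path, the neighborhood size (31), and the offset constant (10) are illustrative assumptions that would be tuned in practice.

```python
import cv2

gray = cv2.imread("first_image.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input path
binary = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C,  # per-pixel threshold = local mean - C
    cv2.THRESH_BINARY,
    31,                          # neighborhood (block) size, must be odd
    10,                          # constant C subtracted from the local mean
)
```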
Step S303: a text region and a non-text region in the first image are determined based on the binary image.
For example, in at least one embodiment of the present disclosure, the number of text pixels in the first image is usually smaller than the number of non-text pixels; therefore, based on the binary image, it is easy to distinguish the text region from the non-text region in the first image. For example, for an input image of a teaching scene, it is also necessary to determine whether the writing surface is a whiteboard or a blackboard. For example, by comparing the pixel counts of the two classes in the binary image, it can be determined whether the writing surface is a whiteboard or a blackboard.
Step S202: a background image of the second image is determined based on the non-text region and the text region.
For example, in at least one embodiment of the present disclosure, in order to make the background image of the second image substantially consistent with the background image of the first image, the background image of the second image retains a non-text region of the first image and fills the text region of the first image with the first pixel values. For example, step S202 may include: obtaining a first pixel value based on each pixel value of the non-text region; and obtaining a background image based on the first pixel value and the text area. For example, in one example, the text region is filled with first pixel values to obtain a background image of the second image. For example, the first pixel value may be an average value, a maximum value, a minimum value, and the like of respective pixel values of a non-text region of the first image, and of course, the embodiment of the present disclosure does not limit this.
Step S203: and obtaining a second image based on the text content and the background image.
For example, in at least one embodiment of the present disclosure, step S203 includes obtaining a foreground image of the second image based on the text content and the size of the background image, where the foreground image includes the text content; the foreground image is overlaid on the background image to obtain a second image, as shown in fig. 13. For example, in some embodiments, as shown in fig. 13, text images corresponding to the text content are arranged uniformly in the foreground image of the second image.
For example, in some embodiments, based on the detected size of the text region or text box (e.g., the size of the first image or of the background image) and the number of characters in the recognized text content, the font size required for rendering the text may be determined, and the text may be arranged uniformly within the text region or text box, thereby obtaining the foreground image of the second image. For example, the size of the foreground image is the same as the size of the background image. For example, by converting the recognized text content into a printed font of suitable size (e.g., Song, regular script, etc.), problems such as illegible handwriting or an overly small font in the input image making the text hard to read can be avoided.
For example, in some embodiments, overlaying the foreground image on the background image to obtain the second image may be implemented by multiplying pixel blocks corresponding to the foreground image by pixel blocks corresponding to the background image, which is not limited by embodiments of the present disclosure.
For example, in one example, the handwriting area (i.e., the text region) in the first image is first filled with the first pixel value, thereby wiping off the handwriting. For example, taking a whiteboard background with black characters as an example, assume that the first pixel value is (r, g, b), the width of the first image is w, and its height is h; the pixels of the text region are filled with the first pixel value to obtain the background image. The number N of characters in the recognized text of the first image is then obtained; since the width of the first image is w, the pixel width of each rendered character is obtained as w/N. A three-channel image with width w, height h (the same size as the background image) and all three channel values equal to 1 is generated, and a preset font (e.g., a printed font such as Song or regular script) is loaded to render the black character string on it, obtaining the foreground image. Finally, the background image is multiplied pixel by pixel with the foreground image of the same size to obtain a rendered image similar in effect to fig. 13, i.e., the second image.
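The following is a minimal sketch of this rendering flow, assuming the wiped background image from the previous step; Pillow is used to draw the recognized string with a preset font, and the foreground and background are combined by pixel-wise multiplication. The font file path, the vertical centering, and the size rule are illustrative assumptions.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_second_image(background, text, font_path="simsun.ttc"):
    # background: H x W x 3 uint8 array whose text region has already been wiped.
    h, w = background.shape[:2]
    n = max(len(text), 1)

    # Character size from the w / N rule described above, capped by the region height.
    char_size = max(1, min(w // n, h))
    font = ImageFont.truetype(font_path, char_size)

    # Foreground: a white canvas of the same size with the text drawn in black.
    foreground = Image.new("RGB", (w, h), (255, 255, 255))
    ImageDraw.Draw(foreground).text(
        (0, (h - char_size) // 2), text, fill=(0, 0, 0), font=font)

    # Pixel-wise multiplication of the normalized foreground with the background.
    fg = np.asarray(foreground, dtype=np.float32) / 255.0
    second = (background.astype(np.float32) * fg).clip(0, 255).astype(np.uint8)
    return second
```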
Step S106: the second image is substituted for the first image in the input images to generate an output image comprising the second image.
For example, comparing the first image of fig. 4 or fig. 5 with the second image of fig. 13, it can be seen that the viewing quality of the text content in the second image is effectively improved. Compared with the first image, the second image eliminates the pixel diffusion around the handwritten strokes in the first image; the handwritten characters of different sizes and uneven arrangement in the first image are re-arranged uniformly; the handwriting with personal characteristics in the first image is converted into a standard font, such as a printed font; and the contrast between the font color and the background is higher in the second image, because the characters in the second image are re-rendered from vector glyphs in a font library, so a color with higher contrast against the background can be chosen and the pixel values within the rendered characters are more uniform. Therefore, the image processing method 10 can replace the original handwritten text with more regular, higher-resolution printed text while keeping the background of the input image, thereby improving the viewing quality of the text content. In addition, text recognition accuracy is improved by adopting the adjusted dense convolutional network.
For example, in a teaching scene, a complete video of a class is usually long, perhaps tens of minutes or even one or two hours, and the video changes little over time. Changes are generally concentrated in the blackboard writing, the teacher's position, and the PPT projection. Since the blackboard writing is a concise summary of the lecture content, the text corresponding to the blackboard-writing content can be used as information for searching or locating positions in the video, which facilitates video positioning. Therefore, when a user or student wants to find the explanation of a certain knowledge point in a class video and listen to it again, only the relevant keywords are needed for searching, and the relevant part of the video can be watched directly, reducing the time cost of manually scanning a long video to locate the explanation of a specific knowledge point.
Fig. 14 is a flowchart of another image processing method according to at least one embodiment of the present disclosure. For example, in order to implement the above-mentioned video positioning function, as shown in fig. 14, at least one embodiment of the present disclosure further provides an image processing method 100, where the image processing method 100 includes steps S401 to S409.
It should be noted that steps S401 to S406 in fig. 14 are substantially the same as steps S101 to S106 in fig. 2, and the achieved technical effects are also substantially the same, and for brevity, the description of steps S401 to S406 is not repeated in the embodiment of the present disclosure, and reference may be made to relevant contents of steps S101 to S106.
Step S407: an image sequence comprising a plurality of frames of output images is acquired.
For example, the operations of generating the output images may be implemented by using the foregoing steps S401 to S406, so as to obtain an image sequence including multiple frames of output images, and specific implementation processes and details may refer to the foregoing related descriptions, and are not repeated herein.
Step S408: at least one frame of output image in the image sequence is extracted.
For example, in some embodiments, one frame of output image may be extracted every predetermined time interval (e.g., every 30 seconds) from a continuous image sequence, resulting in at least one frame of output image. As another example, 10 to 20 frames of output images may also be randomly extracted from the obtained image sequence, and the disclosure includes but is not limited thereto.
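A minimal sketch of interval-based frame sampling is shown below for illustration; it reads frames with OpenCV and keeps one frame per interval, and the same sampling applies to a sequence of output images produced by steps S401 to S406. The 30-second default follows the example above, and the fallback frame rate is an assumption for videos whose metadata lacks an FPS value.

```python
import cv2

def sample_frames(video_path, interval_seconds=30):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # assumed fallback when FPS metadata is missing
    step = max(1, int(fps * interval_seconds))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)              # keep one frame every interval_seconds
        index += 1
    cap.release()
    return frames
```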
Step S409: in response to detecting that there is an output image satisfying a first predetermined condition in at least one frame of output images, storing the output image satisfying the first predetermined condition as a target frame of output image, and storing text content corresponding to the target frame of output image for use in positioning the image sequence.
For example, in some embodiments, the first predetermined condition may be that a text similarity of text content corresponding to the target frame output image and text content corresponding to the detected previous frame output image exceeds a first threshold.
For example, in some embodiments, the text similarity may be calculated using a normalized edit distance, and the embodiments of the present disclosure are not limited thereto; the text similarity may also be calculated using other algorithms such as TF/IDF (term frequency/inverse document frequency), cosine distance, SimHash distance, and the like.
For example, in one example, the first predetermined condition may be that the normalized edit distance between the text content corresponding to the target frame output image and the text content corresponding to the previously detected output frame exceeds 10%. Of course, the setting of the first threshold is not limited in the embodiments of the present disclosure, and may be adjusted according to the actual situation.
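For illustration, a small sketch of the normalized edit distance and the corresponding archiving test follows; the function names are assumptions, and the 10% default threshold mirrors the example above.

```python
def normalized_edit_distance(a, b):
    # Levenshtein distance computed by dynamic programming, normalized by the
    # length of the longer string so that the result lies in [0, 1].
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 0.0
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n] / max(m, n)

def satisfies_first_condition(current_text, previous_text, threshold=0.10):
    # Archive the current frame when its text differs from the previously
    # archived frame's text by more than the threshold (10% in the example above).
    return normalized_edit_distance(current_text, previous_text) > threshold
```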
For example, in some embodiments, in the case where the first image determined based on the input image includes at least one first image, the first predetermined condition may be that the difference between the number of first images included in the target frame output image and the number of first images included in the previously detected output frame exceeds a second threshold.
For example, in some embodiments, when the target frame output image includes 2 more first images than the detected previous frame output image, the target frame output image and the text content of the target frame output image are archived once as content for subsequent image sequence positioning or searching. Of course, the setting of the second threshold is not limited in the embodiments of the present disclosure, and may be adjusted according to actual situations.
Thus, with the above-described image processing method 100, in addition to enhancing the text content in video images, a function of locating or searching within an image sequence can be implemented. For example, for a teaching video, the teacher's blackboard writing can be stored in a structured manner (e.g., including the blackboard-writing position, the blackboard-writing content, and the video frame in which the blackboard writing appears, as in the sketch below), which makes it convenient for a user (e.g., a student) to locate a video position by searching for knowledge points.
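A minimal sketch of such structured storage and keyword search is given below; the record fields and function names are illustrative assumptions rather than a format prescribed by the present disclosure.

```python
def archive_record(records, frame_index, text_content, text_box):
    # One record per archived target frame: where the blackboard writing is,
    # what it says, and which video frame it belongs to.
    records.append({
        "frame_index": frame_index,
        "text_box": text_box,            # e.g. [x, y, width, height] of the first image
        "text_content": text_content,
    })
    return records

def search_records(records, keyword):
    # Return the frame indices whose archived text mentions the keyword, so a
    # student can jump directly to the explanation of a knowledge point.
    return [r["frame_index"] for r in records if keyword in r["text_content"]]
```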
Fig. 15 is a schematic diagram of an image processing method applied to a teaching scene according to at least one embodiment of the present disclosure. For example, in one example, a teaching video (e.g., an image sequence) is input, and single-frame input images are obtained from it; based on the input image, any text detection method (e.g., the text detection method provided by embodiments of the present disclosure) is employed to determine the first image; any text recognition method (e.g., the text recognition method provided by embodiments of the present disclosure) is employed to determine the text content corresponding to the first image, so that a blackboard-writing retrieval or positioning function for the teaching video can be provided to the user. For example, the original text content in the first image can be removed to obtain the background image, the recognized text content can be combined with it to obtain a re-rendered second image, and the second image can replace the first image, yielding an image frame with enhanced blackboard-writing content and thereby improving the viewing quality of the blackboard-writing content.
Since the steps in fig. 15 correspond to the above description of the image processing methods 10 and 100, they are not repeated here for brevity.
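As an illustration of the per-frame flow in fig. 15, a high-level sketch follows; detect_text_regions, recognize_text, and render_region are hypothetical placeholders standing in for the detection model, the recognition model, and the background/foreground fusion described above.

```python
def enhance_frame(frame, detect_text_regions, recognize_text, render_region):
    # frame: one input image taken from the teaching video (H x W x 3 array).
    output = frame.copy()
    records = []
    for box in detect_text_regions(frame):               # text detection -> first images
        x, y, w, h = box
        first_image = frame[y:y + h, x:x + w]
        text = recognize_text(first_image)                # text recognition
        second_image = render_region(first_image, text)   # wipe handwriting, re-render text
        output[y:y + h, x:x + w] = second_image           # replace the first image in place
        records.append({"text_box": list(box), "text_content": text})
    return output, records
```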
It should be noted that, in the embodiments of the present disclosure, the flow of the image processing method 10/100 may include more or fewer operations, and these operations may be executed sequentially or in parallel. Although the flow of the image processing method described above includes a plurality of operations that occur in a certain order, it should be clearly understood that the order of the operations is not limited. The image processing method described above may be executed once or a plurality of times in accordance with a predetermined condition.
By generating the output image including the second image, the image processing method 10/100 provided by the embodiments of the present disclosure can, while preserving the background of the input image, make the text content in the input image more regular, clearer, and of higher resolution, thereby enhancing the viewing quality of the text content; by adopting the adjusted dense convolutional network, the feature information of the first image is retained to a greater extent, improving the accuracy of text recognition; and a function of locating or searching within an image sequence can also be implemented.
At least one embodiment of the present disclosure also provides an image processing apparatus. The image processing apparatus can, while keeping the background of the input image, make the text content in the input image more regular, clearer, and of higher resolution, thereby enhancing the viewing quality of the text content; it can also improve the accuracy of text recognition.
Fig. 16 is a schematic block diagram of an image processing apparatus according to at least one embodiment of the present disclosure. As shown in fig. 16, the image processing apparatus 20 includes an acquisition unit 21, a determination unit 22, a feature extraction unit 23, a recognition unit 24, a fusion unit 25, and a generation unit 26. For example, the image processing apparatus 20 may be applied to applications such as video recording software and smart classroom software, and may also be applied to any device or system that requires writing and displaying, which is not limited in this respect by the embodiments of the present disclosure.
The acquisition unit 21 is configured to acquire an input image. For example, the acquisition unit 21 may execute step S101 of the image processing method shown in fig. 2. The determination unit 22 is configured to determine a first image based on the input image, the first image being an area including text content in the input image, with the text in the text content arranged in a line in the first image. For example, the determination unit 22 may execute step S102 of the image processing method shown in fig. 2. The feature extraction unit 23 is configured to perform feature extraction on the first image by using a dense convolutional network to obtain a target feature, where the dense convolutional network does not include a pooling layer. For example, the feature extraction unit 23 may execute step S103 of the image processing method shown in fig. 2. The recognition unit 24 is configured to obtain the text content based on the target feature. For example, the recognition unit 24 may execute step S104 of the image processing method shown in fig. 2. The fusion unit 25 is configured to obtain the second image based on the first image and the text content. For example, the fusion unit 25 may execute step S105 of the image processing method shown in fig. 2. The generation unit 26 is configured to replace the first image in the input image with the second image to generate an output image including the second image. For example, the generation unit 26 may execute step S106 of the image processing method shown in fig. 2.
For example, the acquisition unit 21, the determination unit 22, the feature extraction unit 23, the recognition unit 24, the fusion unit 25, and the generation unit 26 may be hardware, software, firmware, and any feasible combination thereof. For example, the acquiring unit 21, the determining unit 22, the feature extracting unit 23, the identifying unit 24, the fusing unit 25, and the generating unit 26 may be dedicated or general circuits, chips, devices, or the like, or may be a combination of a processor and a memory. With regard to specific implementation forms of the obtaining unit 21, the determining unit 22, the feature extracting unit 23, the identifying unit 24, the fusing unit 25, and the generating unit 26, embodiments of the present disclosure are not limited thereto.
It should be noted that, in the embodiments of the present disclosure, each unit of the image processing apparatus 20 corresponds to a step of the foregoing image processing method 10; for the specific functions of the image processing apparatus 20, reference may be made to the description of the image processing method above, and details are not repeated here. The components and configuration of the image processing apparatus 20 shown in fig. 16 are exemplary only, not limiting, and the image processing apparatus 20 may further include other components and configurations as needed. For example, in some examples, the image processing apparatus 20 may further include an image sequence acquisition unit, an extraction unit, a judgment unit, and the like. For example, the image sequence acquisition unit is configured to acquire an image sequence including a plurality of frames of output images. The extraction unit is configured to extract at least one frame of output image in the image sequence. The judgment unit is configured to, in response to detecting that there is an output image satisfying the first predetermined condition in the at least one frame of output images, store the output image satisfying the first predetermined condition as a target frame output image, and store the text content corresponding to the target frame output image for use in positioning the image sequence. That is, the image sequence acquisition unit, the extraction unit, and the judgment unit may perform steps S407 to S409 shown in fig. 14.
At least one embodiment of the present disclosure also provides an electronic device comprising a processor and a memory, one or more computer program modules being stored in the memory and configured to be executed by the processor, the one or more computer program modules comprising instructions for implementing the image processing method provided by any embodiment of the present disclosure. The electronic device can, while keeping the background of the input image, make the text content in the input image more regular, clearer, and of higher resolution, thereby enhancing the viewing quality of the text content; it can also retain the feature information of the first image to a greater extent, improving the accuracy of text recognition.
Fig. 17 is a schematic block diagram of an electronic device according to at least one embodiment of the present disclosure. As shown in fig. 17, the electronic device 30 includes a memory 32 and a processor 31.
Memory 32 is used to store non-transitory computer readable instructions (e.g., one or more computer program modules). The processor 31 is configured to execute non-transitory computer readable instructions, which when executed by the processor 31 may perform one or more of the steps of the image processing method described above. The memory 32 and the processor 31 may be interconnected by a bus system and/or other form of connection mechanism (not shown). For example, the electronic device 30 may adopt an operating system such as Windows and Android, and the image processing method according to the embodiment of the present disclosure is implemented by an application running in the operating system, or by a browser installed in the operating system accessing a website provided by a cloud server.
For example, the processor 31 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP) or other form of processing unit having data processing capability and/or program execution capability, such as a Field Programmable Gate Array (FPGA), or the like; for example, the Central Processing Unit (CPU) may be an X86 or ARM architecture or the like. The processor 31 may be a general-purpose processor or a special-purpose processor and may control other components in the electronic device 30 to perform desired functions.
For example, the memory 32 may comprise any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory, or the like. Non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable Compact Disc Read Only Memory (CD-ROM), USB memory, flash memory, and the like. One or more computer program modules may be stored on the computer-readable storage medium and executed by the processor 31 to implement various functions of the electronic device 30. Various applications and various data, as well as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
It should be noted that, in the embodiment of the present disclosure, reference may be made to the above description on the image processing method 10/100 for specific functions and technical effects of the electronic device 30, and details are not described here.
Fig. 18 is a schematic block diagram of another electronic device provided in at least one embodiment of the present disclosure. The electronic device 40 is, for example, suitable for implementing the image processing method provided by the embodiment of the present disclosure. It should be noted that the electronic device 40 shown in fig. 18 is only an example, and does not bring any limitation to the functions and the use range of the embodiment of the present disclosure.
As shown in fig. 18, the electronic apparatus 40 may include a processing device (e.g., a central processing unit, a graphic processor, etc.) 41, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)42 or a program loaded from a storage device 48 into a Random Access Memory (RAM) 43. In the RAM 43, various programs and data necessary for the operation of the electronic apparatus 40 are also stored. The processing device 41, the ROM 42, and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
Generally, the following devices may be connected to the I/O interface 45: input devices 46 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 47 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 48 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 49. The communication device 49 may allow the electronic device 40 to communicate wirelessly or by wire with other electronic devices to exchange data. While fig. 18 illustrates an electronic device 40 having various components, it is to be understood that not all of the illustrated components are required to be implemented or provided, and the electronic device 40 may alternatively be implemented with, or provided with, more or fewer components.
For example, the image processing method 10/100 shown in fig. 2 and 14 may be implemented as a computer software program according to an embodiment of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program comprising program code for performing the image processing method described above. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 49, or installed from the storage means 48, or installed from the ROM 42. When the computer program is executed by the processing device 41, the functions defined in the image processing method provided by the embodiment of the present disclosure can be realized.
At least one embodiment of the present disclosure also provides a storage medium for storing non-transitory computer-readable instructions that, when executed by a computer, can implement the image processing method according to any embodiment of the present disclosure. By using the storage medium, the background of the input image can be kept while the text content in the input image is made more regular, clearer, and of higher resolution, so that the viewing quality of the text content is enhanced; the feature information of the first image can also be retained to a greater extent, improving the accuracy of text recognition.
Fig. 19 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure. As shown in fig. 19, the storage medium 50 is used to store non-transitory computer readable instructions 51. For example, the non-transitory computer readable instructions 51, when executed by a computer, may perform one or more steps according to the image processing method 10/100 described above.
For example, the storage medium 50 may be applied to the electronic device described above. The storage medium 50 may be, for example, the memory 32 in the electronic device 30 shown in fig. 17. For example, the related description about the storage medium 50 may refer to the corresponding description of the memory 32 in the electronic device 30 shown in fig. 17, and will not be repeated here.
The following points need to be explained:
(1) the drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to common designs.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims (26)

1. An image processing method comprising:
acquiring an input image;
determining a first image based on the input image, wherein the first image is an area comprising text content in the input image, and texts in the text content are arranged in a line in the first image;
performing feature extraction on the first image by adopting a dense convolution network to obtain a target feature, wherein the dense convolution network does not comprise a pooling layer;
obtaining the text content based on the target characteristics;
obtaining a second image based on the first image and the text content;
replacing a first image of the input images with the second image to generate an output image comprising the second image.
2. The method of claim 1, wherein, among the convolution kernels of the dense convolutional network, a step size of at least one convolution kernel in a first direction is larger than a step size in a second direction.
3. The method of claim 2, wherein, of the convolution kernels of the dense convolutional network, a step size of at least one convolution kernel in the first direction is 2 and a step size in the second direction is 1.
4. A method according to claim 2 or 3, wherein the first direction is a height direction and the second direction is a width direction.
5. The method of claim 1, wherein the input image comprises an image of a teaching scene and the textual content comprises blackboard-writing content.
6. The method of claim 1, wherein deriving the second image based on the first image and the textual content comprises:
determining a text region and a non-text region in the first image based on the first image;
determining a background image of the second image based on the non-text region and the text region;
and obtaining the second image based on the text content and the background image.
7. The method of claim 6, wherein determining text regions and non-text regions in the first image based on the first image comprises:
carrying out graying processing on the first image to obtain a grayscale image;
carrying out binarization processing on the gray level image to obtain a binary image;
and determining a text region and a non-text region in the first image according to the binary image.
8. The method according to claim 7, wherein the binarizing the grayscale image to obtain the binary image comprises:
and carrying out binarization processing on the gray level image by adopting an adaptive threshold algorithm to obtain the binary image.
9. The method of claim 6, wherein deriving the background image of the second image based on the non-text regions and the text regions comprises:
obtaining a first pixel value based on each pixel value of the non-text region;
and obtaining the background image based on the first pixel value and the text area.
10. The method of claim 9, wherein the first pixel value comprises one of an average, a maximum, and a minimum of respective pixel values of the non-text region.
11. The method of claim 6, wherein deriving the second image based on the textual content and the background image comprises:
obtaining a foreground image based on the text content and the size of the background image, wherein the foreground image comprises the text content;
and covering the foreground image on the background image to obtain the second image.
12. The method of claim 11, wherein the text content corresponds to text images that are arranged uniformly in the foreground image.
13. The method of claim 1, wherein the text content in the second image is represented in a printed font.
14. The method of claim 1, wherein the textual content in the first image is represented in handwriting.
15. The method of claim 1, further comprising:
acquiring an image sequence comprising a plurality of frames of the output images;
extracting at least one frame of output image in the image sequence;
in response to detecting that there is an output image satisfying a first predetermined condition in the at least one frame of output images, storing the output image satisfying the first predetermined condition as a target frame of output image, and storing text content corresponding to the target frame of output image for use in positioning the image sequence.
16. The method of claim 15, wherein the first predetermined condition comprises:
the text similarity between the text content corresponding to the target frame output image and the text content corresponding to the detected previous frame output image exceeds a first threshold, or
In a case where the first image determined based on the input image includes at least one first image, a difference between the number of first images included in the target frame output image and the number of first images included in the detected previous frame output image exceeds a second threshold.
17. The method of claim 16, wherein the text similarity comprises a normalized edit distance.
18. The method of claim 1, wherein determining the first image based on the input image comprises:
and detecting a text area of the input image to obtain a first image comprising the text content.
19. The method of claim 18, wherein performing text region detection on the input image to obtain a first image comprising the text content comprises:
and detecting a text region of the input image by adopting a segmentation-based differentiable binarization network to obtain a first image comprising the text content.
20. The method of claim 1, wherein deriving the textual content based on the target feature comprises:
obtaining sequence features based on the target features through a recurrent neural network;
and obtaining the text content based on the sequence features by adopting a first processing method.
21. The method of claim 20, wherein the recurrent neural network comprises a bidirectional recurrent neural network.
22. A method according to claim 20 or 21, wherein the first processing method comprises a connectionist temporal classification (CTC) method.
23. The method of claim 1, further comprising:
pre-processing the first image,
wherein the pre-processing comprises at least one of an affine transformation and a thin-plate spline transformation.
24. An image processing apparatus comprising:
an acquisition unit configured to acquire an input image;
a determining unit configured to determine a first image based on the input image, wherein the first image is an area including text content in the input image, and text in the text content is arranged in a line in the first image;
the feature extraction unit is configured to extract features of the first image by adopting a dense convolutional network to obtain target features, wherein the dense convolutional network does not comprise a pooling layer;
the identification unit is configured to obtain the text content based on the target feature;
the fusion unit is configured to obtain a second image based on the first image and the text content;
a generating unit configured to replace the first image in the input image with the second image to generate an output image including the second image.
25. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions, wherein the computer readable instructions, when executed by the processor, perform the image processing method of any of claims 1-23.
26. A storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the image processing method of any one of claims 1-23.
CN202110597964.9A 2021-05-31 2021-05-31 Image processing method, image processing apparatus, electronic device, and storage medium Pending CN113436222A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597964.9A CN113436222A (en) 2021-05-31 2021-05-31 Image processing method, image processing apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597964.9A CN113436222A (en) 2021-05-31 2021-05-31 Image processing method, image processing apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN113436222A true CN113436222A (en) 2021-09-24

Family

ID=77803269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597964.9A Pending CN113436222A (en) 2021-05-31 2021-05-31 Image processing method, image processing apparatus, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN113436222A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023130966A1 (en) * 2022-01-10 2023-07-13 杭州睿胜软件有限公司 Image processing method, image processing apparatus, electronic device and storage medium
CN115546073A (en) * 2022-11-29 2022-12-30 昆明理工大学 Method and device for removing shadow of floor tile image, computer equipment and storage medium
CN117033308A (en) * 2023-08-28 2023-11-10 中国电子科技集团公司第十五研究所 Multi-mode retrieval method and device based on specific range
CN117033308B (en) * 2023-08-28 2024-03-26 中国电子科技集团公司第十五研究所 Multi-mode retrieval method and device based on specific range

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination