CN110008922B - Image processing method, device, apparatus, and medium for terminal device

Info

Publication number: CN110008922B
Authority: CN (China)
Prior art keywords: image, processed, neural network, face, features
Legal status: Active
Application number: CN201910294110.6A
Other languages: Chinese (zh)
Other versions: CN110008922A
Inventors: 梅利健, 黄生辉, 陈卫东
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Events: Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN201910294110.6A; publication of CN110008922A; application granted; publication of CN110008922B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161: Detection; Localisation; Normalisation
    • G06V 40/165: Detection; Localisation; Normalisation using facial parts and geometric relationships

Abstract

The present disclosure provides an image processing method, device, apparatus, and medium for a terminal device. The image processing method for a terminal device includes: performing image processing on an image to be processed to generate a tag list of the image to be processed; and performing image conversion on the image to be processed based on a conversion rule and the tag list, wherein the conversion rule corresponds to tags in the tag list, and the tag list includes at least one of a feature tag list representing image features included in the image to be processed and an event tag list representing event features included in the image to be processed.

Description

Image processing method, device, apparatus and medium for terminal device
Technical Field
The present disclosure relates to the field of image processing, and in particular, to an image processing method, device, apparatus, and medium for a terminal device.
Background
With the development of technology, neural networks are increasingly widely applied in the field of image processing. A neural network can be used to perform target recognition, feature detection, and the like on an input image, and image conversion can be performed on the image based on the results output by the neural network. For video, key frame images may be extracted from the video and subjected to image conversion processing using a neural network, thereby automatically generating a video with special effects based on the processed images.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided an image processing method for a terminal device, including: performing image processing on an image to be processed to generate a tag list of the image to be processed; and performing image conversion on the image to be processed based on a conversion rule and the tag list, wherein the conversion rule corresponds to tags in the tag list, and the tag list includes at least one of a feature tag list representing image features included in the image to be processed and an event tag list representing event features included in the image to be processed.
According to some embodiments of the present disclosure, performing image processing on an image to be processed to generate a tag list of the image to be processed includes: performing image detection on the image to be processed by using a first neural network for detecting image features, so as to generate a feature tag list of the image to be processed.
According to some embodiments of the present disclosure, the first neural network comprises a face processing neural network, and generating the feature tag list of the image to be processed comprises: generating a feature tag list corresponding to face features, wherein the face processing neural network comprises a face detection network for detecting face features and a face key point detection network for detecting face key points.
According to some embodiments of the present disclosure, the first neural network includes a target processing neural network for detecting target features, and generating the feature tag list of the image to be processed includes: generating a feature tag list corresponding to the target features.
According to some embodiments of the present disclosure, performing image processing on an image to be processed to generate a tag list of the image to be processed comprises: performing image detection on the image to be processed by using a second neural network for detecting event features, so as to generate an event tag list of the image to be processed, wherein the event features correspond to action features of a face.
According to some embodiments of the disclosure, performing image conversion on the image to be processed comprises: adding special effect features to the image to be processed based on a conversion rule and the tag list.
According to some embodiments of the present disclosure, the method further comprises generating position information corresponding to a tag, wherein performing image conversion on the image to be processed comprises: adding special effect features at the positions in the image to be processed that correspond to the position information, based on the conversion rule and the tag list.
According to some embodiments of the disclosure, the method further comprises: extracting a key frame from a video to serve as the image to be processed; and generating a converted video based on the image-converted image.
According to another aspect of the present disclosure, there is also provided a terminal device for image processing, including: an image processing unit configured to perform image processing on an image to be processed to generate a tag list of the image to be processed; and an image conversion unit configured to perform image conversion on the image to be processed based on a conversion rule and the tag list, wherein the conversion rule corresponds to a tag in the tag list, and the tag list includes at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed.
According to some embodiments of the present disclosure, the image processing unit performs image detection on the image to be processed by using a first neural network for detecting image features to generate a feature tag list of the image to be processed. The first neural network includes a face processing neural network, and generating the feature tag list includes generating a feature tag list corresponding to face features, where the face processing neural network includes a face detection network for detecting face features and a face key point detection network for detecting face key points. The first neural network further includes a target processing neural network for detecting target features, and generating the feature tag list includes generating a feature tag list corresponding to the target features.
According to some embodiments of the present disclosure, the image processing unit performs image detection on the image to be processed by using a second neural network for detecting event features to generate an event tag list of the image to be processed corresponding to the event features, wherein the event features correspond to action features of a human face.
According to some embodiments of the disclosure, the image processing unit is further configured to generate position information corresponding to a tag, and the image conversion unit adds a special effect feature at the position in the image to be processed that corresponds to the position information, based on the conversion rule and the tag list.
According to yet another aspect of the present disclosure, there is also provided a processing apparatus including: a video processing unit configured to extract a key frame from a video as the image to be processed; an image processing unit configured to perform image processing on the image to be processed to generate a tag list of the image to be processed; and an image conversion unit configured to add a special effect to the image to be processed based on a conversion rule and the tag list. The video processing unit is further configured to generate a converted video based on the image to be processed to which the special effect has been added, wherein the conversion rule corresponds to a tag in the tag list, and the tag list includes at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed.
According to still another aspect of the present disclosure, there is also provided an apparatus for image processing, including: one or more processors; and one or more memories, wherein the memories have stored therein computer readable code which, when executed by the one or more processors, performs the image processing method for a terminal device as described above.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon instructions, which, when executed by a processor, cause the processor to execute the image processing method for a terminal device as described above.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flowchart of an image processing method for a terminal device according to an embodiment of the present disclosure;
FIG. 2 illustrates an example flow diagram for face detection of an image to be processed using a first neural network in accordance with an embodiment of this disclosure;
FIG. 3A is a schematic diagram of an image to be processed having human face features;
FIG. 3B shows a schematic diagram of an image with face keypoints;
FIG. 4A shows a schematic diagram of a list of feature labels for a human face, in accordance with an embodiment of the present disclosure;
FIG. 4B shows a schematic diagram of an event tag list, according to an embodiment of the present disclosure;
FIG. 5 illustrates an example flow diagram for target detection of an image to be processed using a first neural network in accordance with an embodiment of this disclosure;
FIG. 6 shows a schematic diagram of a feature tag list for a target in accordance with an embodiment of the present disclosure;
FIG. 7A shows a schematic diagram of image transformation for a human face, according to an embodiment of the disclosure;
FIG. 7B shows yet another schematic diagram of image transformation for a human face according to an embodiment of the disclosure;
FIG. 8A shows a schematic diagram of image transformation for a target according to an embodiment of the present disclosure;
FIG. 8B illustrates yet another schematic diagram for image conversion of a target according to an embodiment of the present disclosure;
FIG. 9 shows a flow diagram of video processing according to an embodiment of the present disclosure;
fig. 10A shows a schematic block diagram of a terminal device for image processing according to an embodiment of the present disclosure;
FIG. 10B shows a schematic block diagram of a processing device according to an embodiment of the present disclosure;
FIG. 11 shows a schematic diagram of an apparatus for image processing according to an embodiment of the present disclosure;
FIG. 12 shows a schematic diagram of an architecture of an exemplary computing device, in accordance with embodiments of the present disclosure;
FIG. 13 shows a schematic diagram of a storage medium according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. The described embodiments are merely some, not all, of the embodiments of the present disclosure. All other embodiments that can be derived by a person skilled in the art from the embodiments disclosed herein without any inventive step fall within the scope of the present disclosure.
The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Similarly, the word "comprising" or "comprises", and the like, means that the element or item preceding the word comprises the element or item listed after the word and its equivalent, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Flow charts are used in this disclosure to illustrate steps of methods according to embodiments of the disclosure. It should be understood that the preceding or subsequent steps need not be performed in the exact order shown. Rather, various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or steps may be removed from the processes.
In applications that need to process video data, the video data (hereinafter referred to as video) is typically uploaded from a terminal device to a cloud server, the images in the video are analyzed and processed by the cloud server, and the processed video is then downloaded back to the terminal device. For example, a cloud device such as a cloud server may use a neural network and its own computing resources to perform image recognition, detection, conversion, and the like on the video, for example on the images of certain frames, so as to generate a converted, processed video.
However, analyzing and processing the video with a cloud server in this way requires uploading the video from the terminal device over a network connection, which makes video processing dependent on the network; uploading and downloading the video also take extra transmission time, which reduces the speed of video processing and degrades the user experience. In addition, when many terminal devices need to process the videos stored on them, those videos consume the computing resources of the cloud server, which may cause data congestion, processing delays, and the like, further reducing the speed of video processing.
According to an aspect of the present disclosure, an image processing method for a terminal device is provided, in which an image (such as a key frame in a video) can be processed in the terminal device without uploading the video to a cloud server, so that the processing of the video is no longer dependent on a network connection and computing resources of the cloud server. The image processing method according to the present disclosure can be applied to, for example, the field of artificial intelligence related to video analysis, processing, and the like.
Fig. 1 shows a flowchart of an image processing method for a terminal device according to an embodiment of the present disclosure. As shown in fig. 1, first, in step S101, an image to be processed is subjected to image processing to generate a tag list of the image to be processed.
According to embodiments of the present disclosure, the terminal device may be any device with computing capability, and the image processing method of the present application, for example video processing, may be implemented on such a device without uploading the video to a cloud server, thereby reducing dependence on network connections and cloud computing resources. For example, the analysis and processing of a video may be completed automatically on a mobile device.
According to embodiments of the present disclosure, a key frame can be extracted from a video to serve as the image to be processed. A key frame is a frame that captures a key moment in the video, such as a character's or object's motion or a change of action; key frames can be extracted with a video processing tool such as FFmpeg (Fast Forward MPEG). A video may include one or more key frames, each of which may be subjected to the image processing described above, and a processed video may be generated based on the processed key frame images, thereby implementing the processing of the video.
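By way of illustration only, the following sketch shows one way to extract key frames (I-frames) with the FFmpeg command-line tool mentioned above; the file names are assumptions, and the disclosure does not prescribe any particular invocation.

```python
import subprocess

def extract_key_frames(video_path: str, out_pattern: str = "key_%03d.png") -> None:
    """Extract the I-frames (key frames) of a video with the ffmpeg CLI.

    The select filter keeps only frames whose picture type is I, and
    -vsync vfr stops ffmpeg from duplicating frames to hold a constant
    frame rate, so exactly one image is written per key frame.
    """
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", "select='eq(pict_type,I)'",
         "-vsync", "vfr",
         out_pattern],
        check=True,
    )

extract_key_frames("input.mp4")  # writes key_001.png, key_002.png, ...
```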
According to an embodiment of the present disclosure, the tag list includes at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed. For example, based on the calculation result of the image processing performed on the image to be processed, a feature tag list of image features included in the image to be processed and an event tag list related to event features, i.e., motion information, included in the image to be processed may be generated, which will be described in detail below.
Next, in step S102, image conversion is performed on the image to be processed based on a conversion rule and the tag list. Conversion rules may be predefined with respect to the tag list; for example, when the feature tag list indicates that a certain image feature is included in the image to be processed, the conversion rule may be to add a special effect at the corresponding position of that feature. According to some embodiments of the present disclosure, the conversion rules correspond to tags one to one, that is, each tag in the tag list has its own conversion rule, and the image to be processed is converted according to the rules that match the generated tag list. According to other embodiments of the present disclosure, the correspondence need not be one to one: several tags may correspond to one conversion rule, or one tag may correspond to several conversion rules.
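As a minimal sketch of how such a tag-to-rule correspondence could be represented in code (all names and rules here are illustrative assumptions, not taken from the disclosure), the rules can be kept as (predicate, action) pairs, which naturally allows many-to-one and one-to-many correspondences:

```python
from typing import Callable, Dict, List, Tuple

Tags = Dict[str, object]

def add_sticker(image: dict, name: str, side: str = "top") -> dict:
    # Stub that records the effect instead of actually drawing it.
    image.setdefault("effects", []).append((name, side))
    return image

# (predicate, action) pairs: several tags may share one rule, and one
# tag may trigger several rules.
RULES: List[Tuple[Callable[[Tags], bool], Callable[[dict], dict]]] = [
    (lambda t: t.get("age") == "0-3",
     lambda im: add_sticker(im, "baby_growth_diary")),
    (lambda t: "smile" in t.get("events", ()),
     lambda im: add_sticker(im, "sun", side="left")),
]

def convert(image: dict, tags: Tags) -> dict:
    for matches, apply_effect in RULES:
        if matches(tags):
            image = apply_effect(image)
    return image

print(convert({}, {"age": "0-3", "events": ["smile"]}))
```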
According to some embodiments of the present disclosure, performing image processing on the image to be processed to generate its tag list may include performing image detection on the image using a first neural network for detecting image features, so as to generate a feature tag list of the image to be processed. For example, the first neural network may be a face processing neural network that processes face features to generate the corresponding feature tags. According to embodiments of the present disclosure, the face processing neural network includes a face detection network for detecting face features and a face key point detection network for detecting face key points.
FIG. 2 illustrates an example flow diagram for face detection of an image to be processed using a face processing neural network in accordance with an embodiment of this disclosure. As shown in fig. 2, the face processing neural network may include, for example, a face detection network 202 and a face keypoint detection network 204, and is configured to extract image feature information related to face features in an image to be processed and generate a tag list of the image to be processed. It should be noted that the face processing neural network shown in fig. 2 is only an example and does not constitute a limitation to the present disclosure, and other types of neural networks may also be applied to image-process the image to be processed to generate the tag list of the image to be processed.
Next, as shown in fig. 2, the obtained image to be processed 201 may be first input to the face detection network 202. The face detection network 202 is configured to detect a face feature, that is, detect whether a face feature exists in the image 201 to be processed, and if the face feature exists, the network 202 may output a corresponding detection result.
For example, the face detection network 202 may adopt a Multi-task Cascaded Convolutional Neural Network (MTCNN), which is widely used in face detection applications thanks to its fast processing speed, real-time capability, and high accuracy. MTCNN processes the image to be processed stage by stage through a three-stage cascade of convolutional networks while detecting face features and marking face feature points. If face features are detected in the image to be processed, MTCNN outputs the locations of 5 key points of each detected face, e.g., points on the two eyes, the two mouth corners, and the nose.
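For orientation, the following sketch uses the third-party `mtcnn` Python package (an off-the-shelf MTCNN, not the simplified network of this disclosure) to obtain the face frame and the 5 key points described above; the input file name is an assumption.

```python
import cv2
from mtcnn import MTCNN  # third-party package, assumed available

image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
for face in MTCNN().detect_faces(image):
    print(face["box"])        # [x, y, width, height] of the face frame
    print(face["keypoints"])  # left_eye, right_eye, nose, mouth_left, mouth_right
```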
So that the terminal device can run the face detection network 202, such as MTCNN, to process the image to be processed, embodiments of the present disclosure may also simplify MTCNN to increase the processing speed of the neural network on the terminal device.
For example, the parameterized activation layers (PReLU) in the convolutional neural network may be replaced with plain activation layers (ReLU), and downsampling (such as pooling layers) may be introduced after the convolutional layers to reduce the size of the feature maps, further increasing the processing speed of the network.
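A minimal PyTorch sketch of this simplification (the layer sizes are illustrative; the disclosure does not fix a concrete architecture):

```python
import torch.nn as nn

# MTCNN-style block: parameterized activation, no downsampling.
original = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3),
    nn.PReLU(10),                 # one learned slope per channel
)

# Simplified block for the terminal device: parameter-free ReLU plus a
# pooling layer that halves the feature map and the downstream compute.
simplified = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3),
    nn.ReLU(inplace=True),        # no extra parameters to store or apply
    nn.MaxPool2d(kernel_size=2),  # downsampling shrinks the feature image
)
```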
Therefore, the face detection network 202 shown in fig. 2 can be used to detect whether a face feature exists in the image to be processed 201 and, when one is detected, to output the 5 key points corresponding to the face feature, which represent the position information of the face feature.
As shown in fig. 2, face alignment 203 may then be performed based on the 5 output key points of the face features. For example, the alignment may be implemented with a similarity transform, which can scale, rotate, and translate the face features, so as to obtain an aligned image.
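A sketch of such an alignment, assuming scikit-image and OpenCV are available; the 112 × 112 template coordinates follow the widely used ArcFace alignment template and are an assumption, not values from the disclosure:

```python
import numpy as np
import cv2
from skimage.transform import SimilarityTransform

# Reference positions of the 5 key points in a 112x112 aligned face.
TEMPLATE = np.array([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def align_face(image: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Warp the face so its 5 detected key points match the template.

    A similarity transform only scales, rotates, and translates, so the
    face is aligned without being distorted.
    """
    tform = SimilarityTransform()
    tform.estimate(keypoints.astype(np.float32), TEMPLATE)
    return cv2.warpAffine(image, tform.params[:2], (112, 112))
```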
Then, as shown in fig. 2, the face-aligned image to be processed may be input to the face key point detection network 204 to generate key point data for the face features. The face key point detection network 204 may be, for example, a MobileNetV2 neural network with an inverted residual structure, configured to output 115 key points corresponding to the face features in the image to be processed.
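For reference, the inverted residual structure that characterizes MobileNetV2 can be sketched as follows (a stride-1, channel-preserving block; purely illustrative):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual block (illustrative).

    Expand with a 1x1 conv, filter with a cheap depthwise 3x3 conv, then
    project back down with a linear 1x1 conv; the residual connection is
    used because the block keeps the spatial size and channel count.
    """
    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),  # linear bottleneck: no activation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)

out = InvertedResidual(32)(torch.randn(1, 32, 28, 28))  # shape preserved
```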
Fig. 3A is a schematic diagram of an image to be processed with face features. For example, fig. 3A may be the image to be processed after face alignment, containing one face feature. Inputting the image shown in fig. 3A into a face key point detection network 204 such as MobileNetV2 yields an image with face key points, as shown in fig. 3B.
Based on the processing results of the face detection network 202 and the face key point detection network 204, a feature tag list 205 corresponding to the face features can be generated for the image to be processed 201.
For example, in some embodiments according to the present disclosure, detection units for detecting the number of faces, gender, age, action features (such as smiling, raising the eyebrows, and other facial actions), and the like may be added to the face key point detection network 204. Such a detection unit may be a neural network with a feature detection function or another image detection algorithm, which is not limited here.
Fig. 4A shows a schematic diagram of a feature tag list corresponding to face features according to an embodiment of the present disclosure. Taking the processing flow of fig. 2 as an example, after the image to be processed 201 has been processed, the number of face features and their position information (i.e., the positions of the face frames) may be output based on the position coordinates of the 5 key points generated by the face detection network 202, and the coordinates of 115 face key points may be output based on the processing result of the face key point detection network 204. The figure of 115 key points is only an example; the number of face key points is not limited to 115, and other numbers of key points may be output. In addition, other tags related to face features may be generated, such as gender, age, attractiveness, and expression. It should be noted that the tag list shown in fig. 4A is only an example: a tag list may include only part of what is shown (for example, only the face frame position), and may also include other tags not shown in fig. 4A.
Thus, a feature tag list according to embodiments of the present disclosure may identify the kind, number, position, and other information of the features included in the image to be processed 201; the information in the feature tag list can be regarded as static classification information.
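One possible in-memory representation of such a static feature tag list (the field names are assumptions chosen to mirror fig. 4A):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class FaceFeatureTags:
    """Illustrative container for the static tags of fig. 4A."""
    face_count: int = 0
    face_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)
    key_points: List[List[Point]] = field(default_factory=list)  # e.g. 115 per face
    gender: List[str] = field(default_factory=list)
    age: List[str] = field(default_factory=list)
    expression: List[str] = field(default_factory=list)

tags = FaceFeatureTags(face_count=1, face_boxes=[(20, 40, 80, 80)])
```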
According to some embodiments of the present disclosure, performing image processing on the image to be processed to generate its tag list may further include performing image detection on the image using a second neural network for detecting event features, so as to generate an event tag list of the image to be processed, where the event features correspond to action features of a face. Compared with the feature tag list, the information in the event tag list can be regarded as dynamic classification information that characterizes types of actions.
According to embodiments of the present disclosure, the second neural network used to generate the event tag list may be the same as the first neural network; for example, in an embodiment detecting face features, the second neural network may include the face detection network 202 and the face key point detection network 204 shown in fig. 2, and generate the event tag list based on their processing results. In embodiments detecting features other than faces, the second neural network may also have a structure different from that of the first neural network.
Fig. 4B shows a schematic diagram of an event tag list according to an embodiment of the present disclosure. In embodiments detecting face features, the event tag list may include tags related to facial actions such as opening the eyes, raising the eyebrows, furrowing the eyebrows, opening the mouth, smiling, and sticking out the tongue, as shown in fig. 4B.
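As an illustration of how one such event tag might be derived from key point geometry (the landmark names and threshold are assumptions, not the disclosure's 115-point scheme):

```python
import numpy as np

def mouth_open(key_points: dict, threshold: float = 0.4) -> bool:
    """Flag 'mouth open' when the inner-lip gap is large relative to the
    mouth width; key_points maps hypothetical landmark names to (x, y)."""
    gap = np.linalg.norm(np.subtract(key_points["upper_lip_bottom"],
                                     key_points["lower_lip_top"]))
    width = np.linalg.norm(np.subtract(key_points["mouth_left"],
                                       key_points["mouth_right"]))
    return gap / width > threshold

events = []
if mouth_open({"upper_lip_bottom": (0, 10), "lower_lip_top": (0, 18),
               "mouth_left": (-8, 14), "mouth_right": (8, 14)}):
    events.append("mouth_open")
print(events)
```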
It should be noted that the structure of the neural networks may be set according to the requirements of the practical application; for example, the structure of the first or second neural network may be replaced or simplified as required, and neural networks for detecting other features may be added to the flow shown in fig. 2, for example to extend the kinds of tags included in the feature tag list of fig. 4A or the event tag list of fig. 4B.
According to some embodiments of the present disclosure, the first neural network may further include a target processing neural network for detecting target features, and generating the feature tag list of the image to be processed includes generating a feature tag list corresponding to the target features.
For example, the target processing neural network may be a neural network for detecting the features of a target (e.g., a cat or a dog) and generating a feature tag list associated with the detected target features.
FIG. 5 illustrates an example flow diagram for target detection for an image to be processed using a first neural network in accordance with an embodiment of this disclosure. FIG. 6 shows a schematic diagram of a feature tag list for a target feature in accordance with an embodiment of the present disclosure. The process of performing object detection on the image to be processed to generate a feature tag list corresponding to an object feature will be described in detail below with reference to fig. 5 and 6.
In the target detection flow shown in fig. 5, the image to be processed 301 may be, for example, an image including a target feature, where the target may be, for example, a cat or a dog, or an object such as a car, a building, or an airplane. In embodiments according to the present disclosure, a Single Shot multibox Detector (SSD) target detection network may be employed as the first neural network. As shown in fig. 5, the first neural network may include an SSD large network 302 and an SSD small network 305.
For example, the image to be processed 301 is processed by the SSD large network 302 to generate a target frame position 303. The SSD large network 302 detects whether cat or dog features exist in the image 301; if so, it outputs the position coordinates of 3 key points of the feature (such as a cat face), which may include the left-eye center, right-eye center, and nose center of the cat face (or dog face). These key points serve as the target frame position and determine the specific position of the cat face in the image 301.
So that the neural network of fig. 5 can run faster on the terminal device when performing target detection on the input image, the SSD large network 302 may be simplified. For example, a MobileNetV1 network structure, which is better suited to terminal devices, may be used as the backbone of the SSD large network 302, and the size of the input image may be reduced accordingly (for example, to 150 × 150 pixels) to reduce the amount of computation. In addition, the number of feature extraction layers in the MobileNetV1 network can be reduced (to 5 layers, for example). According to embodiments of the present disclosure, other simplifications may also be adopted to make the network more suitable for the terminal device, increasing processing speed while meeting the required detection accuracy.
Next, as shown in fig. 5, when the target feature is detected, for example when a cat face is detected in the image to be processed, a key feature region of interest (ROI) may be extracted based on the output target frame position 303, and the extracted ROI may be further expanded to obtain the ROI-expanded image 304.
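A sketch of such an ROI expansion (the expansion factor is an assumption; the disclosure does not specify one):

```python
def expand_roi(box, scale: float = 1.4, image_shape=None):
    """Grow a detected box around its centre before cropping, so that
    ears, whiskers, etc. stay inside the crop handed to the small network."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    w, h = w * scale, h * scale
    x0, y0 = int(cx - w / 2), int(cy - h / 2)
    x1, y1 = int(cx + w / 2), int(cy + h / 2)
    if image_shape is not None:            # clip to the image borders
        height, width = image_shape[:2]
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
    return x0, y0, x1, y1

crop_box = expand_roi((40, 30, 100, 80), image_shape=(480, 640))
```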
The ROI-expanded image can then be processed by the SSD small network 305, which further extracts the target face feature key points 306. In other words, the SSD large network 302 detects whether a target feature exists in the image to be processed and, if so, outputs its 3 key points; an ROI is then extracted around those key points, and the SSD small network 305 extracts the key points of the target feature from it. If the SSD small network 305 detects no cat face or dog face features, no target face key points are output.
According to an embodiment of the present disclosure, the SSD small network 305 may also have a MobileNetV1 structure. Similarly, the SSD small network 305 may be simplified and the size of its input image (the ROI-expanded image) reduced, for example to 96 × 96 pixels, to reduce the amount of computation. In addition, the number of feature extraction layers in the SSD small network can be reduced (to 2 layers, for example).
According to an embodiment of the present disclosure, a network structure for detecting target features may also be added to the SSD small network 305 to generate a feature tag list for the target features. The feature tag list may be as shown in fig. 6 and may include, for example, the number of targets and the target features generated by the SSD large network 302 and the SSD small network 305. The target features may further include the target category, the target frame position, and the target key points. In embodiments where the detected targets are cats and dogs, the target category may include, for example, a cat's body, a cat's face, a dog's body, or a dog's face; it may also be a car, a house, and so on.
According to some embodiments of the present disclosure, the image to be processed may be further processed with the target processing neural network to generate an event tag list for the target features. For example, the image to be processed may be processed based on the SSD large network 302 and the SSD small network 305 illustrated in fig. 5 to generate an event tag list of the image. Where the detected target is an airplane, the event tag list may correspond to actions of the airplane, such as taking off and landing. In other embodiments according to the present disclosure, other neural networks may be adopted to process the image to be processed to obtain the event tag list for the target features.
According to some embodiments of the disclosure, performing image conversion on the image to be processed comprises: adding special effect features to the image to be processed based on a conversion rule and the tag list, where the conversion rule corresponds to the tag list. In addition, the method further comprises generating position information corresponding to a tag, and the image conversion comprises adding special effect features at the positions in the image to be processed that correspond to the position information, based on the conversion rule and the tag list.
For example, when the feature tag list indicates that a certain image feature is included in the image to be processed, the conversion rule may be to add a certain special effect mark at the corresponding position of that feature. As described above, the conversion rules may correspond to tags one to one, or several tags may correspond to one conversion rule, or one tag may correspond to several conversion rules; the image to be processed is then converted according to the rules that match the generated tag list.
Fig. 7A shows a schematic diagram of image conversion for a human face according to an embodiment of the present disclosure, and fig. 7B shows another schematic diagram of image conversion for a human face according to an embodiment of the present disclosure. Fig. 8A shows a schematic diagram of image conversion for a target according to an embodiment of the present disclosure, and fig. 8B shows a further schematic diagram of image conversion for a target according to an embodiment of the present disclosure.
A process of image conversion of an image to be processed based on a conversion rule and a tag list according to an embodiment of the present disclosure will be described in detail below with reference to fig. 7A, 7B, 8A, and 8B.
For example, in an embodiment of detecting face features, as shown in fig. 7A, when the feature tag list for face features indicates that the image to be processed contains an infant, for example when the age tag in the feature tag list of fig. 4A is 0 to 3 years, a special effect mark reading "baby growth diary" may be added to the face feature, thereby implementing the image conversion. The conversion rule here may be: add a "baby growth diary" mark to a baby. In this case, one conversion rule corresponds to one tag in the tag list.
When the expression of the face feature is identified as smiling in the feature tag list of fig. 4A, or when the event tag list of fig. 4B indicates that the face included in the image to be processed is smiling, a sun special effect mark may be added on the left side of the face feature, as shown in fig. 7B; that is, the special effect mark is added at the corresponding position of the face based on the conversion rule, the tag list, and the position information. The conversion rule here may be: add a sun mark to the left of a smiling face. In this case, one conversion rule corresponds to multiple tags in the tag list.
For example, in an embodiment of detecting target features, when the feature tag list for target features indicates that the image to be processed contains a cat face, for example when the target category in the feature tag list of fig. 6 is identified as a cat face, a "cat-" special effect mark may be added at the upper left of the cat face, as shown in fig. 8A; the conversion rule here may be: add a "cat-" mark to the left of the cat face features. Alternatively, as shown in fig. 8B, a heart-shaped special effect mark may be added at the top of the cat face; the corresponding conversion rule may be: add a heart-shaped mark above the cat face features. In this case, multiple conversion rules correspond to one tag in the tag list.
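A sketch of applying such a rule, assuming Pillow and transparent sticker images (the file names and paste position are illustrative):

```python
from PIL import Image

def add_effect(frame_path: str, sticker_path: str, position) -> Image.Image:
    """Paste a sticker (e.g. a sun or heart mark) at the position derived
    from the tag's location information."""
    frame = Image.open(frame_path).convert("RGBA")
    sticker = Image.open(sticker_path).convert("RGBA")
    frame.paste(sticker, position, mask=sticker)  # mask keeps transparency
    return frame.convert("RGB")

result = add_effect("key_001.png", "sun.png", position=(20, 40))
```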
As described above, conversion rules may be preset for the contents of the tag list according to needs, preferences, or the like; after the tag list of the image to be processed is generated, image conversion is performed automatically according to the rules that correspond to the contents of the tag list. The conversion includes, but is not limited to, adding special effects, for example beautification, image style transfer, and the like, which are not limited here.
According to some embodiments of the disclosure, the method further comprises: generating a converted video based on the converted image. For example, image conversion such as adding a special effect may be performed on the image to be processed based on the generated tag list and preset conversion rules to obtain a converted image, and the converted image may be inserted back at its position in the video to obtain the converted video.
Fig. 9 shows a flowchart of video processing according to an embodiment of the present disclosure. As shown in fig. 9, an input video is first obtained; it may be a video shot by the terminal device, a video received over a network connection, or the like. Next, the video is decoded, and a key frame is extracted from the decoded video; for example, the currently extracted key frame is denoted as the i-th frame image.
According to embodiments of the present disclosure, the i-th frame image may be processed with the face detection flow of fig. 2 to obtain a feature tag list and an event tag list for face features, and with the target detection flow of fig. 5 to obtain a feature tag list for target features. A tag list of the i-th frame image is then generated based on the face detection and target detection results, and image conversion is performed on the i-th frame based on the conversion rules and the tag list. The next key frame in the video is then extracted as the (i+1)-th frame and the same image processing is applied, finally yielding the converted video, that is, implementing the automatic conversion of the video. The video processing flow of fig. 9 can be applied, for example, to automatic video editing, such as adding special effects to images in a video: the image conversion described above is carried out automatically based on the preset conversion rules and the generated tag list, without manual editing.
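Putting the pieces together, the loop of fig. 9 can be sketched as follows; `detect_tags` and `apply_rules` are hypothetical stand-ins for the detection flows of figs. 2 and 5 and for the rule-driven conversion, and `extract_key_frames` is the earlier FFmpeg sketch:

```python
import glob
from PIL import Image

def detect_tags(frame) -> dict:
    # Stub for the face (fig. 2) and target (fig. 5) detection flows.
    return {"events": ["smile"], "face_boxes": [(20, 40, 80, 80)]}

def apply_rules(frame, tags):
    # Stub for the rule-driven image conversion described above.
    return frame

extract_key_frames("input.mp4")  # from the FFmpeg sketch above
for i, path in enumerate(sorted(glob.glob("key_*.png"))):
    frame = Image.open(path)                       # the i-th key frame
    converted = apply_rules(frame, detect_tags(frame))
    converted.save(f"converted_{i:03d}.png")
# The converted frames are then re-encoded into the output video.
```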
In the image processing method for a terminal device described above, a key frame in a video is extracted as the image to be processed, and the image is processed by neural networks to generate its tag list, which identifies whether the image includes features such as a face, a cat face, or a dog face, together with the position information of those features. Image conversion such as adding a special effect is performed on the image to be processed based on the generated tag list and preset conversion rules to obtain a converted image, and the converted video is generated based on the converted key frame images.
The image processing method for a terminal device provided by the present disclosure can be applied to automatic video editing: based on the generated feature tag list and event tag list, corresponding special effect marks are automatically added to key frame images in the video according to the established conversion rules, without manually editing the video. The image processing method according to the present disclosure can thus be applied in fields of artificial intelligence related to image and video analysis, processing, and the like.
According to another aspect of the present disclosure, a terminal device for image processing is also provided. Fig. 10A shows a schematic block diagram of a terminal device for image processing according to an embodiment of the present disclosure. As shown in fig. 10A, the terminal apparatus 1000 for image processing may include an image processing unit 1010 and an image conversion unit 1020.
According to an embodiment of the present disclosure, the image processing unit 1010 may be configured to perform image processing on an image to be processed to generate a tag list of the image to be processed. The image conversion unit 1020 is configured to perform image conversion on the image to be processed based on a conversion rule and the tag list, where the conversion rule corresponds to a tag in the tag list, and the tag list includes at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed.
According to some embodiments of the present disclosure, the terminal device 1000 for image processing may further include: an extraction unit (not shown). The extraction unit may be configured to extract a key frame from a video as the image to be processed.
According to some embodiments of the present disclosure, the image processing unit 1010 may perform image detection on the image to be processed by using a first neural network for detecting image features, to generate a feature tag list of the image to be processed. According to an embodiment of the present disclosure, the first neural network may include a face processing neural network, and generating the feature tag list of the image to be processed includes generating a feature tag list corresponding to face features, where the face processing neural network includes a face detection network for detecting face features and a face key point detection network for detecting face key points. The first neural network may further include a target processing neural network for detecting target features, and generating the feature tag list of the image to be processed includes generating a feature tag list corresponding to the target features.
According to some embodiments of the present disclosure, the image processing unit 1010 may be configured to perform image detection on the image to be processed by using a second neural network for detecting event features, to generate an event tag list of the image to be processed corresponding to the event features, where the event features correspond to action features of a face. According to some embodiments of the present disclosure, the first neural network includes a face processing neural network, and the image processing unit 1010 is configured to generate a feature tag list corresponding to face features.
According to some embodiments of the present disclosure, the first neural network comprises a target processing neural network, and the image processing unit 1010 is configured to generate a feature tag list corresponding to a target feature.
According to some embodiments of the present disclosure, the image conversion unit 1020 adds a special effect feature in the image to be processed based on a conversion rule and the tag list.
According to some embodiments of the present disclosure, the image processing unit 1010 is further configured to generate position information corresponding to a tag, and the image conversion unit 1020 is further configured to add a special effect feature at the position in the image to be processed that corresponds to the position information, based on the conversion rule and the tag list.
According to some embodiments of the present disclosure, the image conversion unit 1020 is further configured to generate a converted video based on the converted image.
According to still another aspect of the present disclosure, there is also provided a processing apparatus. Fig. 10B shows a schematic block diagram of a processing device according to an embodiment of the present disclosure. As shown in fig. 10B, the processing apparatus may include a video processing unit 1110, an image processing unit 1120, and an image conversion unit 1130. The video processing unit 1110 is configured to extract a key frame from a video as the image to be processed. The image processing unit 1120 is configured to perform image processing on the image to be processed to generate a tag list of the image to be processed. The image conversion unit 1130 is configured to add a special effect to the image to be processed based on a conversion rule and the tag list. Furthermore, the video processing unit 1110 is further configured to generate a converted video based on the image to be processed to which the special effect has been added. According to an embodiment of the present disclosure, the conversion rule corresponds to a tag in the tag list, and the tag list includes at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed.
According to still another aspect of the present disclosure, there is also provided an apparatus for image processing. Fig. 11 shows a schematic diagram of an apparatus 2000 for image processing according to an embodiment of the present disclosure.
As shown in fig. 11, the apparatus 2000 may include one or more processors 2010 and one or more memories 2020. Wherein the memory 2020 has stored therein computer readable code which, when executed by the one or more processors 2010, may perform an image processing method for a terminal device as described above.
Methods or apparatus according to embodiments of the present disclosure may also be implemented with the architecture of computing device 3000 as shown in FIG. 12. As shown in fig. 12, computing device 3000 may include a bus 3010, one or more CPUs 3020, a Read Only Memory (ROM) 3030, a Random Access Memory (RAM) 3040, a communication port 3050 connected to a network, an input/output component 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used for processing and/or communication of the image processing method for a terminal device provided by the present disclosure and program instructions executed by the CPU. Computing device 3000 can also include user interface 3080. Of course, the architecture shown in FIG. 12 is merely exemplary, and one or more components of the computing device shown in FIG. 12 may be omitted as needed in implementing different devices.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium. Fig. 13 shows a schematic diagram 4000 of a storage medium according to the present disclosure.
As shown in fig. 13, the computer storage medium 4020 has computer readable instructions 4010 stored thereon. When the computer readable instructions 4010 are executed by a processor, the image processing method for a terminal device according to the embodiments of the present disclosure described with reference to the above drawings may be performed. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory, and the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like.
Those skilled in the art will appreciate that the disclosure may be susceptible to variations and modifications. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.
Further, while the present disclosure makes various references to certain elements of a system according to embodiments of the present disclosure, any number of different elements may be used and run on the client and/or server. The units are illustrative only, and different aspects of the systems and methods may use different units.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present disclosure is not limited to any specific form of combination of hardware and software.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although a few exemplary embodiments of this disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It is to be understood that the foregoing is illustrative of the present disclosure and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The disclosure is defined by the claims and their equivalents.

Claims (13)

1. An image processing method for a terminal device, comprising:
performing image processing on an image to be processed to generate a tag list of the image to be processed;
performing image conversion on the image to be processed based on a conversion rule and the tag list, wherein,
the conversion rule corresponds to a tag in the tag list, the tag list including at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed; and is
The image processing of the image to be processed to generate the tag list of the image to be processed includes: performing image detection on the image to be processed by using a first neural network for detecting image features to generate a feature tag list of the image to be processed, wherein the first neural network comprises: a face processing neural network for detecting face features and a target processing neural network for detecting target features, the image features including the face features and the target features, an
The face processing neural network includes: the system comprises a face detection network for detecting face features and a face key point detection network for detecting face key points, wherein the face detection network is realized based on a multitask cascade convolution neural network MTCNN (multiple convolutional neural network) which simplifies a neural network layer, and the face key point detection network is realized based on a lightweight neural network MobileNet V2 neural network with an inverted residual error structure;
the target processing neural network includes: the device comprises a first target detection network and a second target detection network, wherein the scale of the first target detection network is larger than that of the second target detection network, and the first target detection network and the second target detection network are realized on the basis of a lightweight neural network MobileNet V1 neural network which simplifies a neural network layer.
2. The method of claim 1, wherein generating the feature tag list of the image to be processed comprises:
generating a feature tag list corresponding to the face features, and
generating a feature tag list corresponding to the target features.
3. The method of claim 1, wherein performing image processing on the image to be processed to generate the tag list of the image to be processed comprises:
performing image detection on the image to be processed by using a second neural network for detecting event features to generate an event tag list of the image to be processed corresponding to the event features, wherein
the event features correspond to action features of a human face.
4. The method of claim 1, wherein performing image conversion on the image to be processed comprises:
adding special effect features to the image to be processed based on the conversion rule and the tag list.
5. The method of claim 4, further comprising generating position information corresponding to a tag, wherein performing image conversion on the image to be processed comprises:
adding special effect features at positions corresponding to the position information in the image to be processed based on the conversion rule and the tag list.
6. The method of claim 1, further comprising:
extracting key frames from a video to serve as the image to be processed; and
generating a converted video based on the image to be processed after the image conversion.
7. A terminal device for image processing, comprising:
an image processing unit configured to perform image processing on an image to be processed to generate a tag list of the image to be processed; and
an image conversion unit configured to perform image conversion on the image to be processed based on a conversion rule and the tag list, wherein
the conversion rule corresponds to a tag in the tag list, the tag list including at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed; and
the image processing unit performs image detection on the image to be processed by using a first neural network for detecting image features to generate the feature tag list of the image to be processed, wherein the first neural network includes: a face processing neural network for detecting face features and a target processing neural network for detecting target features, the image features including the face features and the target features, and
the face processing neural network includes: a face detection network for detecting the face features and a face key point detection network for detecting face key points, wherein the face detection network is implemented based on a multi-task cascaded convolutional neural network (MTCNN) with a simplified neural network layer, and the face key point detection network is implemented based on a lightweight MobileNetV2 neural network with an inverted residual structure; and
the target processing neural network includes: a first target detection network and a second target detection network, wherein the scale of the first target detection network is larger than that of the second target detection network, and the first target detection network and the second target detection network are implemented based on a lightweight MobileNetV1 neural network with a simplified neural network layer.
8. The device of claim 7, wherein generating the feature tag list of the image to be processed comprises:
generating a feature tag list corresponding to the face features, and
generating a feature tag list corresponding to the target features.
9. The device of claim 7, wherein the image processing unit performs image detection on the image to be processed by using a second neural network for detecting event features to generate an event tag list of the image to be processed corresponding to the event features, wherein the event features correspond to action features of a human face.
10. The device of claim 7, wherein the image processing unit is further configured to generate position information corresponding to a tag, and the image conversion unit adds special effect features at positions in the image to be processed corresponding to the position information based on the conversion rule and the tag list.
11. A processing device, comprising:
a video processing unit configured to extract key frames from a video to serve as the image to be processed;
an image processing unit configured to perform image processing on the image to be processed to generate a tag list of the image to be processed;
an image conversion unit configured to add special effects to the image to be processed based on a conversion rule and the tag list; and
the video processing unit further configured to generate a converted video based on the image to be processed after the special effects are added,
wherein the conversion rule corresponds to a tag in the tag list, the tag list including at least one of a feature tag list representing an image feature included in the image to be processed and an event tag list representing an event feature included in the image to be processed; and
the image processing unit performs image detection on the image to be processed by using a first neural network for detecting image features to generate the feature tag list of the image to be processed, wherein the first neural network includes: a face processing neural network for detecting face features and a target processing neural network for detecting target features, the image features including the face features and the target features, and
the face processing neural network includes: a face detection network for detecting the face features and a face key point detection network for detecting face key points, wherein the face detection network is implemented based on a multi-task cascaded convolutional neural network (MTCNN) with a simplified neural network layer, and the face key point detection network is implemented based on a lightweight MobileNetV2 neural network with an inverted residual structure; and
the target processing neural network includes: a first target detection network and a second target detection network, wherein the scale of the first target detection network is larger than that of the second target detection network, and the first target detection network and the second target detection network are implemented based on a lightweight MobileNetV1 neural network with a simplified neural network layer.
12. An apparatus for image processing, comprising:
one or more processors; and
one or more memories, wherein the memories have computer-readable code stored therein which, when executed by the one or more processors, causes the one or more processors to perform the method of any one of claims 1-6.
13. A computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-6.
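Illustrative sketch (not part of the patent text): the tag-list generation of claims 1-3 can be outlined in Python as below. Every detector here (detect_faces, detect_face_keypoints, detect_targets, detect_events) is a hypothetical stub standing in for the simplified-layer MTCNN face detector, the MobileNetV2 key point network, the two MobileNetV1 target detectors, and the event network named in the claims; real model loading and inference are omitted.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TagList:
    feature_tags: List[str] = field(default_factory=list)   # image-feature tags
    event_tags: List[str] = field(default_factory=list)     # event-feature tags
    positions: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # tag -> (x, y), cf. claim 5

def detect_faces(image) -> List[Tuple[int, int, int, int]]:
    # Stub for the MTCNN-based face detection network (simplified layers).
    return [(40, 40, 120, 120)]  # one dummy face box

def detect_face_keypoints(image, box) -> List[Tuple[int, int]]:
    # Stub for the MobileNetV2 (inverted residual) key point network.
    return [(60, 70), (100, 70), (80, 100)]  # dummy eyes and mouth

def detect_targets(image, large_scale: bool) -> List[str]:
    # Stub for the two MobileNetV1-based target detectors; the first
    # operates at a larger scale than the second.
    return ["cat"] if large_scale else ["cup"]

def detect_events(keypoints) -> List[str]:
    # Stub for the second neural network of claim 3: event tags derived
    # from face action features (e.g. an opened mouth).
    return ["mouth_open"]

def generate_tag_list(image) -> TagList:
    tags = TagList()
    for box in detect_faces(image):                       # face features
        tags.feature_tags.append("face")
        keypoints = detect_face_keypoints(image, box)
        tags.positions["face"] = keypoints[0]
        tags.event_tags.extend(detect_events(keypoints))  # event features
    tags.feature_tags.extend(detect_targets(image, large_scale=True))   # coarse targets
    tags.feature_tags.extend(detect_targets(image, large_scale=False))  # fine targets
    return tags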
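Continuing the sketch, the conversion step of claims 4-6 (and the video flow of claim 11) might look as follows. The rule table, overlay_effect, and the fixed key-frame stride are illustrative assumptions, not the patented implementation; the tagger is passed in as a parameter so the block stands alone (generate_tag_list from the previous sketch would fit).

from typing import Callable, Dict, List

def overlay_effect(image: list, position: tuple) -> list:
    # Hypothetical special effect: mark one pixel of an image held as a 2-D list.
    x, y = position
    image[y][x] = "*"
    return image

# A conversion rule maps a tag to an effect applied at the tag's position.
CONVERSION_RULES: Dict[str, Callable[[list, tuple], list]] = {
    "face": overlay_effect,        # e.g. a sticker anchored near a face
    "mouth_open": overlay_effect,  # e.g. an animation triggered by the event
}

def convert_image(image, tags) -> list:
    # Claims 4-5: apply each rule whose tag appears in the tag list, at the
    # position recorded for that tag (default (0, 0) if none was generated).
    for tag in tags.feature_tags + tags.event_tags:
        rule = CONVERSION_RULES.get(tag)
        if rule:
            image = rule(image, tags.positions.get(tag, (0, 0)))
    return image

def convert_video(frames: List[list], tagger, keyframe_stride: int = 5) -> List[list]:
    # Claim 6: only key frames are processed, then the video is rebuilt.
    # Here every keyframe_stride-th frame is treated as a key frame.
    out = []
    for i, frame in enumerate(frames):
        if i % keyframe_stride == 0:
            frame = convert_image(frame, tagger(frame))
        out.append(frame)
    return out

if __name__ == "__main__":
    from types import SimpleNamespace
    img = [["." for _ in range(4)] for _ in range(4)]
    tags = SimpleNamespace(feature_tags=["face"], event_tags=[],
                           positions={"face": (1, 2)})
    print(convert_image(img, tags))  # the cell at x=1, y=2 becomes "*"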
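As background on the MobileNetV2 building block the claims rely on, the inverted residual structure (1x1 expansion, 3x3 depthwise convolution, 1x1 linear projection, with a skip connection when stride is 1 and channel counts match) is the published design and can be written in PyTorch as below. The patent does not disclose its exact layer configuration, so the widths and expansion factor here are generic placeholders.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),   # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),      # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),  # 1x1 linear projection (no ReLU)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_skip else out

# Example: a stride-1 block keeps spatial size and channels, so the skip applies.
blk = InvertedResidual(32, 32)
print(blk(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])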
CN201910294110.6A 2019-04-12 2019-04-12 Image processing method, device, apparatus, and medium for terminal device Active CN110008922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910294110.6A CN110008922B (en) 2019-04-12 2019-04-12 Image processing method, device, apparatus, and medium for terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910294110.6A CN110008922B (en) 2019-04-12 2019-04-12 Image processing method, device, apparatus, and medium for terminal device

Publications (2)

Publication Number Publication Date
CN110008922A (en) 2019-07-12
CN110008922B (en) 2023-04-18

Family

ID=67171494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910294110.6A Active CN110008922B (en) 2019-04-12 2019-04-12 Image processing method, device, apparatus, and medium for terminal device

Country Status (1)

Country Link
CN (1) CN110008922B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379209B (en) * 2019-07-22 2021-11-09 捻果科技(深圳)有限公司 Flight operation flow node specification monitoring and alarming method
CN111770375B (en) * 2020-06-05 2022-08-23 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN114416074A (en) * 2022-01-24 2022-04-29 脸萌有限公司 Image processing method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229269A (en) * 2016-12-31 2018-06-29 深圳市商汤科技有限公司 Method for detecting human face, device and electronic equipment
CN110599557B (en) * 2017-08-30 2022-11-18 深圳市腾讯计算机系统有限公司 Image description generation method, model training method, device and storage medium
CN107993191B (en) * 2017-11-30 2023-03-21 腾讯科技(深圳)有限公司 Image processing method and device
CN108596011A (en) * 2017-12-29 2018-09-28 中国电子科技集团公司信息科学研究院 A kind of face character recognition methods and device based on combined depth network
CN108346130B (en) * 2018-03-20 2021-07-23 北京奇虎科技有限公司 Image processing method and device and electronic equipment
CN108764370B (en) * 2018-06-08 2021-03-12 Oppo广东移动通信有限公司 Image processing method, image processing device, computer-readable storage medium and computer equipment
CN109271929B (en) * 2018-09-14 2020-08-04 北京字节跳动网络技术有限公司 Detection method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063581A (en) * 2017-10-20 2018-12-21 奥瞳系统科技有限公司 Enhanced Face datection and face tracking method and system for limited resources embedded vision system
CN109543549A (en) * 2018-10-26 2019-03-29 北京陌上花科技有限公司 Image processing method and device, mobile end equipment, server for more people's Attitude estimations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sheng Chen et al. MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices. CCBR 2018. 2018, 428-438. *
Zeng Jiajian. Face key point detection and face attribute analysis based on deep learning. China Master's Theses Full-text Database, Information Science and Technology, 2018, I138-1347. *

Also Published As

Publication number Publication date
CN110008922A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
US10936911B2 (en) Logo detection
CN111553267B (en) Image processing method, image processing model training method and device
CN110008922B (en) Image processing method, device, apparatus, and medium for terminal device
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN111783749A (en) Face detection method and device, electronic equipment and storage medium
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN113627402B (en) Image identification method and related device
Zia et al. An adaptive training based on classification system for patterns in facial expressions using SURF descriptor templates
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN113590854A (en) Data processing method, data processing equipment and computer readable storage medium
Kapadia et al. Improved CBIR system using Multilayer CNN
CN113343981A (en) Visual feature enhanced character recognition method, device and equipment
CN111709473A (en) Object feature clustering method and device
CN111783734B (en) Original edition video recognition method and device
CN113705307A (en) Image processing method, device, equipment and storage medium
CN114529719A (en) Method, system, medium and device for semantic segmentation of ground map elements
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
CN112419249A (en) Special clothing picture conversion method, terminal device and storage medium
CN113128277A (en) Generation method of face key point detection model and related equipment
CN111666878B (en) Object detection method and device
CN113221920B (en) Image recognition method, apparatus, device, storage medium, and computer program product
CN117173731B (en) Model training method, image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant