CN114283461A - Image processing method, apparatus, device, storage medium, and computer program product - Google Patents

Image processing method, apparatus, device, storage medium, and computer program product

Info

Publication number: CN114283461A
Authority: CN (China)
Prior art keywords: image, processed, reference object, posture, model
Legal status: Pending
Application number: CN202111141278.7A
Other languages: Chinese (zh)
Inventors: 刘浩, 周勤, 赵远远, 郑青青, 李琛
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202111141278.7A
Publication of CN114283461A

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses an image processing method, apparatus, device, storage medium, and computer program product. The method comprises the following steps: acquiring an image to be processed of a reference object; classifying the image to be processed to obtain the matching probability between each candidate pose and the image to be processed; and performing weighted fusion on the pose characterization parameters corresponding to first poses based on the matching probabilities between the first poses and the image to be processed to obtain a fusion pose characterization parameter, and taking the pose characterized by the fusion pose characterization parameter as the pose of the reference object corresponding to the image to be processed, where a first pose is a candidate pose whose matching probability with the image to be processed satisfies a first condition. The pose characterized by the fusion pose characterization parameter can be regarded as a pose obtained by fusing one or more candidate poses according to their matching probabilities; the pose of the reference object corresponding to the image to be processed determined in this way has high accuracy, and the image processing quality is good.

Description

Image processing method, apparatus, device, storage medium, and computer program product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method, an image processing apparatus, an image processing device, a storage medium, and a computer program product.
Background
With the development of computer technology, images are processed in more and more application scenarios. One such scenario is processing an image of a reference object to determine the pose of the reference object corresponding to the image, where the reference object is an object whose pose can change.
In the related art, the candidate pose with the highest degree of matching with the image is selected from a plurality of candidate poses of the reference object and then used as the pose of the reference object corresponding to the image. The pose determined by this image processing method is limited to that single candidate pose, and the candidate poses of the reference object are generally set coarsely, so the accuracy of the determined pose is low and the image processing quality is poor.
Disclosure of Invention
The embodiments of the present application provide an image processing method, apparatus, device, storage medium, and computer program product, which can improve the accuracy of the determined pose of the reference object corresponding to an image and thereby improve the image processing quality. The technical solution is as follows:
in one aspect, an embodiment of the present application provides an image processing method, where the method includes:
acquiring an image to be processed of a reference object;
classifying the image to be processed to obtain the matching probability between each candidate pose of the reference object and the image to be processed;
based on the matching probability between a first pose and the image to be processed, performing weighted fusion on the pose characterization parameters corresponding to the first pose to obtain a fusion pose characterization parameter, and taking the pose characterized by the fusion pose characterization parameter as the pose of the reference object corresponding to the image to be processed, where the first pose is a candidate pose whose matching probability with the image to be processed satisfies a first condition.
In another aspect, there is provided an image processing apparatus, the apparatus including:
a first acquisition unit configured to acquire an image to be processed of a reference object;
a classification unit, configured to classify the image to be processed to obtain the matching probability between each candidate pose of the reference object and the image to be processed;
and a second obtaining unit, configured to perform weighted fusion on the pose characterization parameters corresponding to a first pose based on the matching probability between the first pose and the image to be processed to obtain a fusion pose characterization parameter, and take the pose characterized by the fusion pose characterization parameter as the pose of the reference object corresponding to the image to be processed, where the first pose is a candidate pose whose matching probability with the image to be processed satisfies a first condition.
In a possible implementation manner, the second obtaining unit is further configured to obtain a target reference object virtual model matching the fusion posture characterizing parameter, and take a posture that the target reference object virtual model has as a posture that the fusion posture characterizing parameter characterizes.
In a possible implementation manner, the second obtaining unit is further configured to generate the target reference object virtual model based on the fusion pose characterization parameter.
In one possible implementation, the fusion pose characterization parameter is used to indicate a variable parameter of a reference pose characterization parameter matched with a reference object virtual model, and the reference object virtual model changes with the adjustment of the reference pose characterization parameter; the second obtaining unit is further configured to adjust the reference attitude characterization parameter according to the fusion attitude characterization parameter, and use a virtual model obtained after a reference object virtual model changes along with the adjustment of the reference attitude characterization parameter as the target reference object virtual model.
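As a minimal, non-limiting sketch of the adjustment described above (assuming, purely for illustration, that the reference object virtual model is driven by a numeric parameter vector and that the fusion pose characterization parameter acts as the adjustment applied to that vector; all names below are illustrative and not taken from this application):

```python
import numpy as np

def adjust_reference_model(reference_params: np.ndarray,
                           fusion_params: np.ndarray) -> np.ndarray:
    # The reference pose characterization parameters are adjusted according to
    # the fusion pose characterization parameters; the virtual model obtained
    # after this change is taken as the target reference object virtual model.
    return reference_params + fusion_params

reference_params = np.zeros(4)                    # hypothetical neutral parameters
fusion_params = np.array([0.3, -0.1, 0.05, 0.0])  # output of the weighted fusion
target_params = adjust_reference_model(reference_params, fusion_params)
```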
In a possible implementation manner, the classification unit is configured to obtain an image classification model, where the image classification model is obtained by training based on a sample image of the reference object and a classification label corresponding to the sample image, where the classification label corresponding to the sample image is used to indicate a standard pose corresponding to the sample image, and the standard pose is one candidate pose in the candidate poses; and calling the image classification model to classify the images to be processed to obtain the matching probability of each candidate gesture and the images to be processed respectively.
In a possible implementation manner, the image classification model includes a convolution model, an attention model and a full-connection model which are connected in sequence, and the classification unit is configured to call the convolution model to perform feature extraction on the image to be processed to obtain a first image feature; calling the attention model to perform feature extraction on the first image features to obtain second image features; and calling the full-connection model to classify the second image features to obtain the matching probability of each candidate gesture and the image to be processed respectively.
In a possible implementation manner, the convolution model includes at least one convolution submodel connected in sequence, any convolution submodel includes a convolution layer, a pooling layer and an activation layer connected in sequence, and the classification unit is configured to invoke the convolution layer, the pooling layer and the activation layer in a first convolution submodel in the convolution model to perform feature extraction on the image to be processed, so as to obtain an activation feature output by the first convolution submodel; and calling a convolution layer, a pooling layer and an activation layer in the next convolution submodel from the second convolution submodel in the convolution model to perform feature extraction on the activation features output by the last convolution submodel to obtain the activation features output by the next convolution submodel until the activation features output by the last convolution submodel are obtained, and taking the activation features output by the last convolution submodel as the first image features.
In one possible implementation, the first obtaining unit is configured to obtain an original image, where the original image includes a sub-image of the reference object; determining the area of the sub-image of the reference object in the original image; and acquiring the image to be processed based on the area of the sub-image of the reference object in the original image.
In one possible implementation, the reference object is a part of the first object, and the apparatus further includes:
an adjustment unit for obtaining a first object virtual model, the first object virtual model comprising sub-models of the reference object; and carrying out attitude adjustment on the sub-model of the reference object by utilizing the attitude of the reference object corresponding to the image to be processed.
In one possible implementation, the reference object is a mouth, and the apparatus further comprises:
and the recognition unit is used for recognizing lip language contents matched with the postures of the mouths corresponding to the images to be processed.
In another aspect, a computer device is provided, which includes a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded by the processor and executed to cause the computer device to implement any one of the image processing methods described above.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor to make a computer implement any of the above-mentioned image processing methods.
In another aspect, a computer program product is also provided, which comprises a computer program or computer instructions, which is loaded and executed by a processor, so as to make a computer implement any of the image processing methods described above.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
according to the technical scheme, the gesture represented by the fusion gesture characterization parameters is used as the gesture of the reference object corresponding to the image to be processed, wherein the fusion gesture characterization parameters are obtained by fusing the gesture characterization parameters corresponding to the first gesture according to the matching probability of the first gesture in each candidate gesture and the image to be processed. The gesture of the fused gesture characterization parameter characterization can be regarded as the gesture obtained after one or more candidate gestures are fused according to the matching probability, the gesture of the reference object corresponding to the image to be processed determined in the mode is less influenced by the rough setting of the candidate gestures, the accuracy is high, and the image processing quality is good.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application;
fig. 2 is a flowchart of an image processing method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a candidate mouth pose provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an image classification model provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for determining a pose of a reference object corresponding to an image to be processed according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In an exemplary embodiment, the image processing method provided by the embodiment of the present application can be applied to the technical field of artificial intelligence. The artificial intelligence technique is described next.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like. The image processing method provided by the embodiments of the present application relates to computer vision technology and machine learning technology.
Computer Vision (CV) technology is a science that studies how to make machines "see"; it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs image processing, so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (Three-Dimensional) technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, and intelligent transportation, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and researched in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical services, smart customer service, internet of vehicles, automatic driving, smart traffic and the like.
In an exemplary embodiment, the image processing method provided in the embodiments of the present application is implemented in a blockchain system, and the data involved in the image processing method, such as the matching probabilities between each candidate pose of the reference object and the image to be processed and the fusion pose characterization parameters, are stored on a blockchain in the blockchain system and shared among the node devices in the blockchain system, so as to ensure the security and reliability of the data.
Fig. 1 is a schematic diagram illustrating an implementation environment of an image processing method provided by an embodiment of the present application. The implementation environment includes: a terminal 11 and a server 12.
The image processing method provided by the embodiment of the present application may be executed by the terminal 11, may also be executed by the server 12, and may also be executed by both the terminal 11 and the server 12, which is not limited in the embodiment of the present application. For the image processing method provided by the embodiment of the application, when the terminal 11 and the server 12 execute together, the server 12 undertakes the primary calculation work, and the terminal 11 undertakes the secondary calculation work; or, the server 12 undertakes the secondary computing work, and the terminal 11 undertakes the primary computing work; alternatively, the server 12 and the terminal 11 perform cooperative computing by using a distributed computing architecture.
In one possible implementation, the terminal 11 may be any electronic product capable of performing human-Computer interaction with a user through one or more modes of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction or handwriting equipment, for example, a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a PPC (Pocket PC, palmtop), a tablet Computer, a smart car machine, a smart television, a smart speaker, a vehicle-mounted terminal, and the like. The server 12 may be a server, a server cluster composed of a plurality of servers, or a cloud computing service center. The terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
It should be understood by those skilled in the art that the above-mentioned terminal 11 and server 12 are only examples, and other existing or future terminals or servers may be suitable for the present application and are included within the scope of the present application and are herein incorporated by reference.
Based on the implementation environment shown in fig. 1, the embodiment of the present application provides an image processing method, where the image processing method is executed by a computer device, and the computer device may be the server 12 or the terminal 11, which is not limited in the embodiment of the present application. As shown in fig. 2, the image processing method provided in the embodiment of the present application includes the following steps 201 to 203.
In step 201, an image to be processed of a reference object is acquired.
The image to be processed of the reference object is an image to be processed corresponding to the reference object, and the reference object is an object with a variable posture. The type of the reference object is not limited in the embodiments of the present application, and for example, the reference object refers to a mouth, or the reference object refers to an eye, or the reference object refers to a movable limb, or the like. The postures of the reference objects are different under different types of reference objects, for example, if the reference object is a mouth, the postures of the reference objects include but are not limited to mouth closing, mouth opening and the like; if the reference object refers to an eye, the posture of the reference object includes, but is not limited to, open eyes, closed eyes, etc.; if the reference object is a moveable limb, the pose of the reference object includes, but is not limited to, limb extension, limb flexion, and the like.
In an exemplary embodiment, an original image including a sub-image of the reference object can be obtained by image-capturing the reference object, and in an exemplary embodiment, after obtaining the original image, the original image may be stored in an image library for subsequent direct extraction.
In an exemplary embodiment, the image to be processed of the reference object is obtained based on the original image. Before the image to be processed is acquired, the original image needs to be acquired. The embodiment of the present application does not limit the manner of acquiring the original image. Exemplary ways in which the computer device acquires the raw image include, but are not limited to: the computer device extracts an original image from an image library for storing the original image; the image acquisition equipment which is in communication connection with the computer equipment sends an original image obtained by image acquisition of the reference object to the computer equipment; the computer equipment acquires an original image and the like which are manually uploaded.
After the original image is acquired, an image to be processed is acquired based on the original image. The embodiment of the present application does not limit the manner of acquiring the to-be-processed image based on the original image. In an exemplary embodiment, the manner of obtaining the image to be processed based on the original image is: and directly taking the original image as an image to be processed. In an exemplary embodiment, the manner of obtaining the image to be processed based on the original image is: determining the area of the sub-image of the reference object in the original image; and acquiring the image to be processed based on the area of the sub-image of the reference object in the original image.
The sub-image of the reference object refers to the partial image of the original image that represents the reference object. The embodiment of the present application does not limit the manner of determining the region in which the sub-image of the reference object is located in the original image. For example, the region in which the sub-image of the reference object is located in the original image may be determined as follows: an image segmentation model is called to segment the original image and segment out the region in which the sub-image of the reference object is located. The image segmentation model may be a U-Net (U-shaped network) model, an FCN (Fully Convolutional Network) model, a DSN (Deeply-Supervised Nets) model, or the like.
For another example, the region in which the sub-image of the reference object is located in the original image may be determined as follows: key point detection is performed on the original image, and the region where the detected key points belonging to the reference object are located is taken as the region in which the sub-image of the reference object is located in the original image. Key point detection on an original image refers to detecting the key points belonging to a specific object in the original image; the key points belonging to a specific object are determined according to the object features of that object and can locate the position of the object. Since the reference object is a specific object, the key points belonging to the reference object can be detected. For example, the algorithm for performing key point detection on the original image may be an ASM (Active Shape Model) algorithm, an AAM (Active Appearance Model) algorithm, a CPR (Cascaded Pose Regression) algorithm, or the like.
For example, taking the reference object as a mouth as an example, the original image is a face image including a sub-image of the mouth, and by performing key point detection on the face image, key points belonging to various parts (such as mouth, eyes, nose, and the like) can be detected, then key points belonging to the mouth are extracted from the key points belonging to the various parts, and a region where the key points belonging to the mouth are located is taken as a region where the sub-image of the mouth is located in the face image.
After the region in which the sub-image of the reference object is located in the original image is determined, the image to be processed is acquired based on that region. In an exemplary embodiment, the image to be processed is acquired as follows: an image of the region containing the sub-image of the reference object is cropped from the original image, and the image to be processed is acquired based on the cropped image.
In an exemplary embodiment, the image to be processed may be an image of an arbitrary size, and the cropped image may be used directly as the image to be processed. In an exemplary embodiment, the image to be processed is an image of a reference size, and the image to be processed of the reference size can be obtained by resizing (Resize) the cropped image. The reference size is set empirically or flexibly adjusted according to the actual application scenario, which is not limited in the embodiments of the present application. For example, the reference size is 64 × 64 (pixels).
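As an illustration of the cropping and resizing described above, the following sketch assumes that key points of the reference object (here a mouth) have already been detected as pixel coordinates in the original image; the function name, the margin, and the use of OpenCV are assumptions made for the example, not requirements of the method:

```python
import cv2
import numpy as np

def crop_image_to_be_processed(original: np.ndarray,
                               keypoints: np.ndarray,
                               reference_size: int = 64,
                               margin: float = 0.15) -> np.ndarray:
    # Bounding box of the key points belonging to the reference object.
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    # Pad the box slightly so the whole sub-image of the reference object is kept.
    pad_x = int((x_max - x_min) * margin)
    pad_y = int((y_max - y_min) * margin)
    h, w = original.shape[:2]
    x0, y0 = max(0, int(x_min) - pad_x), max(0, int(y_min) - pad_y)
    x1, y1 = min(w, int(x_max) + pad_x), min(h, int(y_max) + pad_y)
    crop = original[y0:y1, x0:x1]
    # Resize the cropped image to the reference size (64 x 64 in the example above).
    return cv2.resize(crop, (reference_size, reference_size))
```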
The resolution of the image to be processed is not limited, and if the resolution of the image to be processed is higher, the posture of the reference object corresponding to the image to be processed can be more effectively analyzed; if the resolution of the image to be processed is low, the performance overhead of the image processing process can be saved. For example, the size of the image to be processed may be flexibly adjusted according to the actual application scenario, and the image to be processed with different sizes may be adopted in different application scenarios.
In step 202, the image to be processed is classified, and the matching probability between each candidate pose of the reference object and the image to be processed is obtained.
After the image to be processed is obtained, the image to be processed is classified to obtain the matching probability between each candidate pose of the reference object and the image to be processed. Illustratively, the matching probabilities between the candidate poses of the reference object and the image to be processed together form the classification result corresponding to the image to be processed. The larger the matching probability between a candidate pose and the image to be processed, the more likely it is that the pose of the reference object corresponding to the image to be processed is that candidate pose.
A candidate pose is a preset pose that the reference object may have. The embodiment of the present application does not limit the manner of identifying a candidate pose. For example, a candidate pose is identified by its name; or each candidate pose has a number, and a candidate pose is identified by its number; or a candidate pose is identified by a reference object virtual model whose pose is that candidate pose.
The types and number of candidate poses are related to the type of the reference object, and may be set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiment of the present application. In an exemplary embodiment, taking the reference object as the mouth as an example, the candidate poses (or candidate mouth poses) include, but are not limited to, the following 12 poses: closed mouth, open mouth, slightly open mouth, wide-open mouth, rounded mouth, hissing mouth, pouting mouth, slightly pouting mouth, upper lip over lower lip, lower lip over upper lip, tongue out, and pursed lips. Illustratively, a mouth pose may also be referred to as a mouth shape.
Illustratively, the closed mouth is a pose in which the lips are closed and the mouth is relaxed, as shown in (1) in fig. 3; the open mouth is a pose in which the upper lip stays essentially unchanged and the lower lip opens downward by a medium amount, the opening being larger than the width of the lower lip but smaller than the sum of the widths of the upper and lower lips, as shown in (2) in fig. 3; the slightly open mouth is a pose in which the lips are parted only slightly, the gap between them is visible, and the opening is smaller than the width of the lower lip, as shown in (3) in fig. 3; the wide-open mouth is a pose in which the mouth is opened to a large extent, both corners of the mouth are stretched slightly outward, and the opening is larger than the sum of the widths of the upper and lower lips, as shown in (4) in fig. 3.
The rounded mouth is a pose in which the mouth opens into a roughly circular shape, the two sides of the mouth draw slightly toward the middle, and the upper lip curls up slightly, as shown in (5) in fig. 3; compared with the open mouth, the rounded mouth opens wider and its left and right sides tend to close toward the middle. The hissing mouth is a pose in which the corners of the mouth are pulled outward to both sides, the mouth is slightly open, and the mouth is slightly longer than normal, as shown in (6) in fig. 3; for example, when pronouncing the Chinese word for "one", the mouth takes the hissing pose. The pouting mouth is a pose in which the corners of the mouth purse toward the middle of the lips, the middle of the mouth is nearly closed or leaves only a slit, and the upper and lower lips are everted, as shown in (7) in fig. 3. The slightly pouting mouth is a pose in which the corners of the mouth purse toward the middle of the lips, the mouth is open to a certain extent, and the upper and lower lips are everted, as shown in (8) in fig. 3; compared with the pouting mouth, the degree of eversion of the lips is smaller.
Upper lip over lower lip refers to a pose in which the upper lip bites the lower lip, so that the upper lip is exposed and the lower lip is hidden, as shown in (9) in fig. 3; lower lip over upper lip refers to a pose in which the lower lip bites the upper lip, so that the lower lip is exposed and the upper lip is hidden, as shown in (10) in fig. 3; tongue out refers to a pose in which the tongue sticks out in any direction beyond the plane of the lips, as shown in (11) in fig. 3; pursed lips refers to a pose in which the upper and lower lips are pressed inward into the mouth so that they are barely visible, as shown in (12) in fig. 3.
In one possible implementation, the image to be processed is classified to obtain the matching probability between each candidate pose of the reference object and the image to be processed as follows: key points belonging to the reference object are detected in the image to be processed, a pose characterization feature corresponding to the image to be processed is computed based on the key points, the similarity between this pose characterization feature and the candidate pose characterization feature corresponding to a candidate pose is taken as the matching probability between that candidate pose and the image to be processed, and the matching probability between each candidate pose and the image to be processed is obtained in this way.
Illustratively, the manner of calculating the pose representation feature corresponding to the image to be processed based on the key points is set according to the type of the reference object, for example, the reference object is a mouth, the key points belonging to the reference object include a left mouth corner key point, a right mouth corner key point, an upper lip key point and a lower lip key point, the distance between the left mouth corner and the right mouth corner, the distance between the upper lip and the lower lip, the distance between the highest point of the upper lip and the lowest point of the lower lip, and the like are calculated according to the key points, and the calculated information is used as the pose representation feature corresponding to the image to be processed.
Illustratively, the pose characterization feature corresponding to the image to be processed and the candidate pose characterization feature corresponding to a candidate pose are features of the same type, so that they are comparable. For example, the candidate pose characterization features correspond one-to-one with the candidate poses, and when the candidate poses are set, the candidate pose characterization feature corresponding to each candidate pose can be set at the same time.
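A minimal sketch of this keypoint-based implementation is given below; the specific features, the cosine similarity, and the softmax normalization are illustrative assumptions rather than choices stated in this application:

```python
import numpy as np

def mouth_pose_features(left_corner, right_corner, upper_lip, lower_lip):
    # Pose characterization features: corner-to-corner and lip-to-lip distances.
    return np.array([
        np.linalg.norm(np.subtract(right_corner, left_corner)),
        np.linalg.norm(np.subtract(upper_lip, lower_lip)),
    ])

def matching_probabilities(features, candidate_features):
    # Similarity between the image's pose characterization feature and each
    # candidate pose characterization feature, normalized so the values can be
    # read as matching probabilities.
    sims = np.array([
        np.dot(features, c) / (np.linalg.norm(features) * np.linalg.norm(c) + 1e-8)
        for c in candidate_features
    ])
    exp = np.exp(sims - sims.max())
    return exp / exp.sum()
```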
In another possible implementation manner, the manner of classifying the image to be processed to obtain the matching probability between each candidate pose of the reference object and the image to be processed is as follows: acquiring an image classification model; and calling an image classification model to classify the images to be processed to obtain the matching probability of each candidate posture with the images to be processed respectively.
The image classification model is used for classifying the image to be processed so as to obtain the matching probability of each candidate posture with the image to be processed. The image classification model is obtained by training based on a sample image of a reference object and a classification label corresponding to the sample image, wherein the classification label corresponding to the sample image is used for indicating a standard posture corresponding to the sample image, and the standard posture is one candidate posture in each candidate posture.
The implementation process of calling the image classification model to classify the image to be processed is an internal processing process of the image classification model, and is related to the structure of the image classification model, and the implementation process of calling the image classification models with different structures to classify the image to be processed may be different. Illustratively, the image classification model refers to a convolutional neural network model, and classification of the image to be processed is realized through deep learning.
In one possible implementation, the image classification model includes a convolution model, an attention model, and a fully connected model connected in sequence. In this case, the process of calling the image classification model to classify the image to be processed to obtain the matching probability between each candidate pose and the image to be processed includes the following steps 1 to 3:
step 1: and calling a convolution model to extract the features of the image to be processed to obtain first image features.
The implementation process of calling the convolution model to extract the features of the image to be processed is an internal processing process of the convolution model and is related to the structure of the convolution model. After the image to be processed is input into the convolution model, the convolution model outputs a first image characteristic. The first image feature can characterize the image to be processed to a certain extent.
In a possible implementation manner, the convolution model includes at least one convolution sub-model connected in sequence, and the process of calling the convolution model to perform feature extraction on the image to be processed is as follows: and calling at least one convolution submodel to extract the characteristics of the image to be processed. In the process of calling at least one convolution submodel to extract the characteristics of the image to be processed, the input of a first convolution submodel is the image to be processed, the input of a next convolution submodel is the output of a previous convolution submodel from a second convolution submodel, and the output of a last convolution submodel is the first image characteristics. The number of convolution submodels included in a convolution model may be set empirically, or may be flexibly adjusted according to an actual application scenario, which is not limited in this application embodiment, and exemplarily, the number of convolution submodels included in a convolution model is 2, or the number of convolution submodels included in a convolution model is 3, or the like.
In an exemplary embodiment, any convolution submodel includes a convolution layer, a pooling layer, and an activation layer connected in sequence. Illustratively, the size of the convolution kernel of the convolution layer is set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiment of the present application; for example, the size of the convolution kernel is 3 × 3. Taking the image to be processed as a two-dimensional image as an example, a convolution layer with a convolution kernel size of 3 × 3 may be denoted as Conv2d 3 × 3. For example, the type of the pooling layer may be a max pooling layer (Maxpool) or an average pooling layer (Avgpool), which is not limited in this embodiment. For example, the activation function used by the activation layer is set empirically or flexibly adjusted according to the actual application scenario, which is not limited in this embodiment of the application; for example, the activation function used by the activation layer is a ReLU (Rectified Linear Unit) function, a Sigmoid function, or the like.
In an exemplary embodiment, the sizes of the convolution kernels in the convolutional layers in different convolutional submodels may be the same or different; the types of pooling layers in different convolution submodels may be the same or different; the activation functions used by the activation layers in the different convolution submodels may be the same or different.
In an exemplary embodiment, for the case where the convolution model includes at least one convolution submodel connected in sequence and any convolution submodel includes a convolution layer, a pooling layer, and an activation layer connected in sequence, calling the convolution model to perform feature extraction on the image to be processed to obtain the first image feature includes: calling the convolution layer, the pooling layer, and the activation layer in the first convolution submodel in the convolution model to perform feature extraction on the image to be processed, so as to obtain the activation feature output by the first convolution submodel; and, starting from the second convolution submodel in the convolution model, calling the convolution layer, the pooling layer, and the activation layer in the next convolution submodel to perform feature extraction on the activation feature output by the previous convolution submodel, so as to obtain the activation feature output by the next convolution submodel, until the activation feature output by the last convolution submodel is obtained, and taking the activation feature output by the last convolution submodel as the first image feature.
In an exemplary embodiment, the process principle of calling the convolution layer, the pooling layer, and the activation layer in each convolution submodel to perform feature extraction on the input to obtain the output features is the same, and the process of calling the convolution layer, the pooling layer, and the activation layer in the first convolution submodel in the convolution model to perform feature extraction on the image to be processed to obtain the activation features output by the first convolution submodel is taken as an example for description.
In an exemplary embodiment, the process of calling a convolution layer, a pooling layer and an activation layer in a first convolution submodel in a convolution model to perform feature extraction on an image to be processed to obtain an activation feature output by the first convolution submodel is as follows: calling a convolution layer in a first convolution submodel in the convolution model to perform convolution on an image to be processed to obtain convolution characteristics output by the convolution layer in the first convolution submodel; pooling the convolution characteristics output by the convolution layer in the first convolution submodel by calling the pooling layer in the first convolution submodel to obtain pooling characteristics output by the pooling layer in the first convolution submodel; and calling an activation layer in the first convolution submodel to activate the pooling feature output by the pooling layer in the first convolution submodel to obtain the activation feature output by the first convolution submodel.
It should be noted that any convolution submodel described above including a convolution layer, a pooling layer, and an activation layer connected in sequence is merely an exemplary description, and the embodiments of the present application are not limited thereto. Illustratively, any convolution submodel may further include a convolution layer, an activation layer, and a pooling layer connected in sequence.
In an exemplary embodiment, the order of the activation layer and the pooling layer is interchangeable, since the activation function employed by the activation layer is typically a monotonically increasing function that does not change the ordering of its inputs. Because the pooling layer down-samples its input, it effectively reduces the amount of data that passes through the activation layer, so applying the pooling layer before the activation layer reduces the amount of computation. For example, if the pooling size of the pooling layer is 2 × 2, passing through the pooling layer before the activation layer reduces the computation in the activation layer by 3/4.
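The interchangeability noted above can be checked with a small PyTorch snippet (an illustrative verification, not part of this application):

```python
import torch
import torch.nn.functional as F

# Because ReLU is monotonically non-decreasing, max pooling followed by ReLU
# gives the same result as ReLU followed by max pooling, while the former
# applies the activation to 1/4 as many values for 2 x 2 pooling.
x = torch.randn(1, 8, 32, 32)
a = F.relu(F.max_pool2d(x, kernel_size=2))   # pool first, then activate
b = F.max_pool2d(F.relu(x), kernel_size=2)   # activate first, then pool
assert torch.allclose(a, b)
```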
Step 2: and calling the attention model to perform feature extraction on the first image features to obtain second image features.
After the first image feature is obtained, the first image feature is input into an attention model, feature extraction is carried out on the first image feature by the attention model, the feature output after the attention model carries out feature extraction on the first image feature is taken as a second image feature, and the second image feature can be further characterized than the first image feature on the image to be processed.
The attention model is designed based on an attention mechanism, and can improve the attention of local features, and the embodiment of the present application does not limit the type of the attention model, and the attention model is, for example, an SPP (Spatial Pyramid Pooling) model, and the SPP model can extract feature information from multiple scales. For example, the attention model may also refer to a Transformer (Transformer), etc., which is not limited in the embodiments of the present application. The implementation process of calling the attention model to perform feature extraction on the first image feature is an internal processing process of the attention model, and is related to the type of the attention model, and this is not limited in this embodiment of the application.
And step 3: and calling a full-connection model to classify the second image characteristics to obtain the matching probability of each candidate posture with the image to be processed respectively.
And after the second image characteristics are obtained, inputting the second image characteristics into the full-connection model, classifying the second image characteristics by the full-connection model, and outputting the matching probability of each candidate posture and the image to be processed.
The implementation process of calling the fully connected model to classify the second image feature is an internal processing process of the fully connected model and is related to the structure of the fully connected model, which is not limited in the embodiment of the present application. In an exemplary embodiment, the fully connected model includes at least one fully connected layer and a logistic regression layer connected in sequence, where the fully connected layers are used to extract features and the logistic regression layer is used to output the matching probabilities. Illustratively, the logistic regression layer is a softmax function layer. The number of fully connected layers included in the fully connected model is not limited in the embodiments of the present application; for example, the fully connected model includes 2 or 3 fully connected layers. In an exemplary embodiment, a dropout module may further be included in at least one fully connected layer; the dropout module is used during model training and not during model inference, and during training it temporarily drops neural network units in the fully connected layer with a certain probability so as to avoid overfitting.
Exemplarily, for the case that the fully connected model includes at least one fully connected layer and a logistic regression layer which are sequentially connected, the fully connected model is called to classify the second image features, and the matching probability between each candidate pose and the image to be processed is obtained in the following manner: calling the first full connection layer to process the second image characteristics to obtain full connection characteristics output by the first full connection layer; from the second full connection layer, calling the next full connection layer to process the full connection characteristics output by the previous full connection layer, so as to obtain the full connection characteristics output by the next full connection layer until the full connection characteristics output by the last full connection layer are obtained; and calling a logistic regression layer to process the fully connected features output by the last fully connected layer to obtain the matching probability of each candidate posture with the image to be processed.
In an exemplary embodiment, in the process of calling the first fully connected layer to process the second image feature, the second image feature is first flattened (Flatten) into a one-dimensional feature, and then the first fully connected layer is called to process it. Flatten is used to turn a multidimensional input into a one-dimensional one; it is implemented by stretching a high-dimensional array along its axes into a one-dimensional array (i.e., a one-dimensional feature).
Illustratively, the structure of the image classification model is as shown in fig. 4: the image classification model includes three convolution submodels, an attention model (SPP), two fully connected layers (FC1 and FC2), and a logistic regression layer (softmax) connected in sequence, where the fully connected layer FC1 is connected with a dropout module used during training. Each convolution submodel includes a convolution layer (Conv2d 3 × 3) with a convolution kernel size of 3 × 3, a max pooling layer (Maxpool), and an activation layer (ReLU).
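A minimal PyTorch sketch of a model with this structure is given below. The channel widths, the SPP pyramid levels, the hidden size, and the dropout rate are illustrative assumptions; only the overall layout (three Conv2d 3 × 3 + Maxpool + ReLU submodels, an SPP attention model, FC1 with dropout, FC2, and softmax) follows fig. 4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSubmodel(nn.Module):
    """One convolution submodel: Conv2d 3x3 -> max pooling -> ReLU activation."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.act = nn.ReLU()

    def forward(self, x):
        # Pooling is applied before activation, as discussed above.
        return self.act(self.pool(self.conv(x)))

class SPP(nn.Module):
    """Spatial pyramid pooling: fixed-length features pooled at several scales."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        feats = [F.adaptive_max_pool2d(x, level).flatten(1) for level in self.levels]
        return torch.cat(feats, dim=1)

class MouthPoseClassifier(nn.Module):
    def __init__(self, num_candidate_poses: int = 12):
        super().__init__()
        self.convs = nn.Sequential(ConvSubmodel(3, 16),
                                   ConvSubmodel(16, 32),
                                   ConvSubmodel(32, 64))
        self.spp = SPP()
        spp_dim = 64 * (1 + 4 + 16)                      # channels x pyramid bins
        self.fc1 = nn.Linear(spp_dim, 128)               # FC1
        self.dropout = nn.Dropout(0.5)                   # active only in training mode
        self.fc2 = nn.Linear(128, num_candidate_poses)   # FC2

    def forward(self, x):                        # x: (N, 3, 64, 64) images to be processed
        first_features = self.convs(x)           # first image features
        second_features = self.spp(first_features)   # second image features (flattened)
        h = self.dropout(torch.relu(self.fc1(second_features)))
        return torch.softmax(self.fc2(h), dim=1)      # matching probability per candidate pose
```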
It should be noted that, the above description is only an exemplary description of obtaining the matching probability between each candidate pose and the image to be processed by invoking the image classification model, which is introduced in the case that the image classification model includes a convolution model, an attention model, and a fully connected model that are connected in sequence, and the embodiment of the present application is not limited thereto. In an exemplary embodiment, the image classification model may further not include an attention model, that is, the image classification model includes a convolution model and a full-connected model which are connected in sequence, in this case, after the convolution model is called to perform feature extraction on the image to be processed to obtain the first image feature, the full-connected model may be called to classify the first image feature to obtain the matching probability between each candidate pose and the image to be processed. In an exemplary embodiment, the structure of the image classification model is a relatively simplified structure to reduce the amount of computation, so that the image classification model can be deployed on a mobile terminal for use.
In an exemplary embodiment, the obtaining of the image classification model in the embodiment of the present application may refer to extracting an image classification model obtained through pre-training and stored, or may refer to obtaining the image classification model through a real-time training mode, which is not limited in the embodiment of the present application. In either way, the image classification model needs to be trained.
In an exemplary embodiment, the way to train the obtained image classification model is as follows: obtaining a sample image of a reference object and a classification label corresponding to the sample image; calling an initial classification model to classify the sample images to obtain sample classification results, wherein the sample classification results comprise the matching probability of each candidate gesture and the sample images respectively; obtaining a loss function between a sample classification result and a classification label corresponding to a sample image; and training the initial classification model by using a loss function to obtain an image classification model. The classification label corresponding to the sample image is used for indicating a standard posture corresponding to the sample image, and the standard posture is one candidate posture in each candidate posture.
In an exemplary embodiment, the computer device may acquire the sample image of the reference object in such a manner that a partial image is extracted as a sample image from an image library for storing original images; an image enhancement operation may also be performed on the extracted partial image, resulting in a sample image. The generalization capability of the classification model can be improved by the sample image obtained after the image enhancement operation is executed. Illustratively, the image enhancement operations include, but are not limited to, image flipping, image rotation, image noising, and the like.
In an exemplary embodiment, there are multiple sample images, and the multiple sample images include both frontal images of the reference object and side images of the reference object, which broadens the applicable range of the classification model so that it can classify multiple types of images to be processed more accurately.
The classification label corresponding to the sample image is used for indicating the standard posture corresponding to the sample image, and the standard posture is one candidate posture in the candidate postures. Illustratively, the corresponding standard pose of the sample image is specified empirically by a human.
For example, the loss function between the sample classification result and the classification label corresponding to the sample image is used to represent the difference between the sample classification result and the classification label corresponding to the sample image, and the embodiment of the present application does not limit the manner of obtaining the loss function between the sample classification result and the classification label corresponding to the sample image, for example, obtaining a cross entropy loss function between the sample classification result and the classification label corresponding to the sample image, or obtaining a mean square error loss function between the sample classification result and the classification label corresponding to the sample image, and the like.
If the training termination condition is not met, the loss function is obtained again and training continues; if the training termination condition is met, the classification model obtained when the training termination condition is met is taken as the image classification model. Illustratively, satisfying the training termination condition includes, but is not limited to, the number of training iterations reaching a threshold, the loss function converging, the loss function being less than a loss threshold, and the like.
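A minimal training sketch under the model assumptions above is shown below; `MouthPoseClassifier` is the illustrative model sketched earlier, `loader` is assumed to yield (sample image, label) batches where each label is the index of the standard pose for that sample, and the cross-entropy loss is computed on the output matching probabilities:

```python
import torch
import torch.nn as nn

model = MouthPoseClassifier(num_candidate_poses=12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
nll = nn.NLLLoss()   # applied to log-probabilities, i.e. cross-entropy loss

def train(loader, epochs=10):
    model.train()                                  # dropout active during training
    for _ in range(epochs):
        for images, labels in loader:
            probs = model(images)                  # matching probability per candidate pose
            loss = nll(torch.log(probs + 1e-8), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    model.eval()                                   # dropout disabled for inference
```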
In an exemplary embodiment, after the image classification model is obtained, the classification performance of the image classification model is tested. The test results show that the top-3 accuracy of the image classification model (i.e., the proportion of test images whose standard pose is among the 3 candidate poses with the highest matching probability for that image) reaches 98.1%, and the top-5 accuracy (i.e., the proportion of test images whose standard pose is among the 5 candidate poses with the highest matching probability for that image) reaches 99.7%. This indicates that the classification accuracy of the image classification model is high.
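For reference, the top-k accuracy used above can be computed as in the following sketch (assuming `probs` holds the matching probabilities output for a batch of test images and `labels` holds the index of each image's standard pose):

```python
import torch

def topk_accuracy(probs: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    # probs: (N, num_candidate_poses) matching probabilities, labels: (N,)
    topk = probs.topk(k, dim=1).indices               # k most probable candidate poses
    hits = (topk == labels.unsqueeze(1)).any(dim=1)   # is the standard pose among the top k?
    return hits.float().mean().item()
```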
In step 203, based on the matching probability of the first pose and the image to be processed, the pose characterization parameters corresponding to the first pose are weighted and fused to obtain fusion pose characterization parameters, the pose represented by the fusion pose characterization parameters is used as the pose of the reference object corresponding to the image to be processed, and the first pose is a candidate pose of which the matching probability with the image to be processed meets a first condition.
After the matching probability of each candidate pose with the image to be processed is obtained, the candidate poses whose matching probability with the image to be processed satisfies a first condition are taken as first poses. The first condition is set according to experience or flexibly adjusted according to the actual application scenario, which is not limited in the embodiments of the present application. For example, if the matching probability of every candidate pose with the image to be processed satisfies the first condition, each candidate pose is a first pose, and the number of first poses equals the total number of candidate poses.
Illustratively, satisfying the first condition means that the matching probability with the image to be processed is among the K (K is an integer not less than 1) largest matching probabilities over all candidate poses; in this case, the number of first poses is K. The candidate poses whose matching probability satisfies the first condition are then the candidate poses with the higher matching probabilities, which improves the accuracy of determining the pose of the reference object corresponding to the image to be processed.
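A minimal sketch of this top-K selection (the value of K and the array layout are illustrative assumptions):

```python
import numpy as np


def select_first_poses(match_probs: np.ndarray, k: int = 3):
    """Return the indices and matching probabilities of the K candidate poses whose
    matching probability with the image to be processed is the largest (the first poses)."""
    idx = np.argsort(match_probs)[::-1][:k]      # sort descending, keep the top K
    return idx, match_probs[idx]
```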
After the first posture is determined, the posture of the reference object corresponding to the image to be processed is determined based on the first posture and the matching probability of the first posture and the image to be processed. The posture of the reference object corresponding to the image to be processed refers to the posture of the reference object recognized from the image to be processed. In the embodiment of the present application, based on the first posture and the matching probability between the first posture and the image to be processed, the process of determining the posture of the reference object corresponding to the image to be processed is as follows: and performing weighted fusion on the attitude characterization parameters corresponding to the first attitude based on the matching probability of the first attitude and the image to be processed to obtain fusion attitude characterization parameters, and taking the attitude characterized by the fusion attitude characterization parameters as the attitude of the reference object corresponding to the image to be processed.
The pose characterization parameter corresponding to the first pose is used to uniquely characterize the first pose. In an exemplary embodiment, the pose characterization parameter corresponding to one pose includes one or more sub-parameters; different sub-parameters define the pose of the reference object at different angles, and the poses at these different angles together form the pose of the reference object. Illustratively, the number of sub-parameters included in the pose characterization parameter and the angle of the pose defined by each sub-parameter are determined according to the type of the reference object, which is not limited in the embodiments of the present application. For example, if the reference object is the mouth, the pose characterization parameter corresponding to one pose includes, but is not limited to, a sub-parameter defining the distance between the upper and lower lips, a sub-parameter defining the distance between the left and right mouth corners, a sub-parameter defining how far the upper lip curls upward, a sub-parameter defining how far the lower lip curls downward, a sub-parameter defining the direction and degree to which the tongue sticks out, and the like.
The gestures and the gesture characterization parameters have a one-to-one correspondence relationship, different gestures correspond to different gesture characterization parameters, and for the case that the number of the first gestures is multiple, the gesture characterization parameters corresponding to the first gestures need to be acquired respectively.
In an exemplary embodiment, the posture characterization parameters corresponding to the candidate postures are preset and stored corresponding to the candidate postures, and after the first posture is determined from the candidate postures, the posture characterization parameters corresponding to the first posture can be extracted from the storage.
After the pose characterization parameters corresponding to the first poses are acquired, the matching probability of each first pose with the image to be processed is used as the weight of the pose characterization parameter corresponding to that first pose, the pose characterization parameters corresponding to the first poses are then weighted and fused, and the pose characterization parameter obtained after weighted fusion is taken as the fusion pose characterization parameter. After the fusion pose characterization parameter is obtained, the pose characterized by the fusion pose characterization parameter is used as the pose of the reference object corresponding to the image to be processed. Because the fusion pose characterization parameter is obtained by weighted fusion of the pose characterization parameters corresponding to the first poses, the pose it characterizes can be regarded as a fusion of the first poses, which is more finely expressive and helps ensure the accuracy of the determined pose of the reference object corresponding to the image to be processed.
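A minimal sketch of the weighted fusion, assuming the pose characterization parameters of the K first poses are stacked into a (K, D) array; renormalizing the weights over the first poses is an assumption here, since the embodiment only states that the matching probabilities serve as weights:

```python
import numpy as np


def fuse_pose_parameters(first_probs: np.ndarray, first_params: np.ndarray) -> np.ndarray:
    """first_probs: (K,) matching probabilities of the first poses.
    first_params: (K, D) pose characterization parameters of the first poses.
    Returns the (D,) fusion pose characterization parameter."""
    weights = first_probs / first_probs.sum()    # renormalize over the first poses (assumption)
    return (weights[:, None] * first_params).sum(axis=0)
```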
In a possible implementation manner, before the pose characterized by the fusion pose characterization parameter is used as the pose of the reference object corresponding to the image to be processed, the pose characterized by the fusion pose characterization parameter needs to be determined. This is done as follows: acquire a target reference object virtual model matching the fusion pose characterization parameter, and take the pose of the target reference object virtual model as the pose characterized by the fusion pose characterization parameter. Illustratively, a reference object virtual model is a virtual three-dimensional model of the reference object, and the pose of the target reference object virtual model is the pose characterized by the fusion pose characterization parameter.
In an exemplary embodiment, the fusion pose characterization parameter refers to a generation parameter of a reference object virtual model, and in this case, the manner of obtaining the target reference object virtual model matched with the fusion pose characterization parameter is as follows: and generating a target reference object virtual model based on the fusion attitude characterization parameters. Illustratively, the fusion pose characterization parameters are input into the virtual model generator, i.e. the virtual model of the target reference object is generated by the virtual model generator. The virtual model generator may be designed by a designer, and the embodiment of the present application is not limited thereto.
In an exemplary embodiment, the fusion pose characterization parameter is used to indicate a variable parameter of the reference pose characterization parameter matching the reference object virtual model, and the reference object virtual model changes with the adjustment of the reference pose characterization parameter, in this case, the manner of obtaining the target reference object virtual model matching the fusion pose characterization parameter is as follows: and adjusting the reference attitude characterization parameters according to the fusion attitude characterization parameters, and taking a virtual model obtained after the reference object virtual model changes along with the adjustment of the reference attitude characterization parameters as a target reference object virtual model.
The fusion attitude characterization parameters are used for indicating variable parameters of the reference attitude characterization parameters, so that the reference attitude characterization parameters can be adjusted according to the fusion attitude characterization parameters, and the reference object virtual model changes along with the adjustment of the reference attitude characterization parameters, so that a changed reference object virtual model can be obtained after the reference attitude characterization parameters are adjusted according to the fusion attitude characterization parameters, and the changed reference object virtual model is the target reference object virtual model.
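A minimal sketch of this case, assuming the simplest additive form in which the fusion pose characterization parameters act as offsets on the reference pose characterization parameters (the additive form is an assumption; the embodiment only states that the reference parameters are adjusted according to the fusion parameters and that the virtual model follows the adjusted parameters):

```python
import numpy as np


def adjust_reference_parameters(reference_params: np.ndarray,
                                fusion_deltas: np.ndarray) -> np.ndarray:
    """Adjust the reference pose characterization parameters with the fusion pose
    characterization parameters; the reference object virtual model is assumed to be
    re-posed from the returned parameters by the model-driving side."""
    return reference_params + fusion_deltas      # additive adjustment (illustrative assumption)
```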
In an exemplary embodiment, the process of determining the pose of the reference object corresponding to the image to be processed is shown in fig. 5: an original image is obtained; key point detection is performed on the original image; and after key point detection, the image to be processed is obtained through preprocessing such as cropping and resizing. The image classification model is then called to classify the image to be processed, giving the matching probability of each candidate pose with the image to be processed; a first pose is determined based on these matching probabilities; the pose characterization parameters corresponding to the first pose are weighted and fused according to the matching probability of the first pose with the image to be processed to obtain the fusion pose characterization parameter; and the pose characterized by the fusion pose characterization parameter is taken as the pose of the reference object corresponding to the image to be processed.
In a possible implementation manner, the reference object is a part of the first object, and after the pose represented by the fusion pose representation parameter is taken as the pose of the reference object corresponding to the image to be processed, the method further includes: acquiring a first object virtual model, wherein the first object virtual model comprises a sub-model of a reference object; and carrying out attitude adjustment on the sub-model of the reference object by utilizing the attitude of the reference object corresponding to the image to be processed. The first object refers to an object including the reference object, for example, the reference object is a mouth, and the first object refers to a human face. The first object virtual model is a virtual three-dimensional stereo model of the first object, and since the first object comprises the reference object, the first object virtual model comprises a sub-model of the reference object.
And carrying out attitude adjustment on the sub-model of the reference object according to the attitude of the reference object corresponding to the image to be processed, so that the attitude of the sub-model of the reference object can be the attitude of the reference object corresponding to the image to be processed. In an exemplary embodiment, the sub-model of the reference object in the first object virtual model has a default posture, in which case, the process of performing posture adjustment on the sub-model of the reference object by using the posture of the reference object corresponding to the image to be processed may refer to a process of directly driving the posture that the sub-model of the reference object has to the posture of the reference object corresponding to the image to be processed. In an exemplary embodiment, the sub-model of the reference object in the first object virtual model has a pose determined according to another method, and in this case, the process of performing pose adjustment on the sub-model of the reference object by using the pose of the reference object corresponding to the image to be processed may refer to a process of correcting the pose determined by another method so that the sub-model of the reference object has the pose of the reference object corresponding to the image to be processed.
In an exemplary embodiment, the reference object is a mouth, and some lip language contents can be recognized according to a posture of the mouth, in this case, after taking a posture represented by the fusion posture representing parameter as a posture of the reference object corresponding to the image to be processed, the method further includes: and identifying lip language content matched with the posture of the mouth corresponding to the image to be processed. In an exemplary embodiment, different lip language contents correspond to different postures of the mouth, and the lip language contents matched with the postures of the mouth corresponding to the image to be processed can be analyzed by recognizing the postures of the mouth corresponding to the image to be processed.
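A minimal sketch of such matching, assuming each lip language content is associated with a reference mouth pose vector and that nearest-neighbor matching on the pose characterization parameters is used (both are assumptions for illustration; the embodiment only states that different lip language contents correspond to different mouth poses):

```python
import numpy as np


def match_lip_content(mouth_pose: np.ndarray, pose_to_content: dict) -> str:
    """Return the lip language content whose reference mouth pose is closest to the
    recognized mouth pose of the image to be processed."""
    best_content, best_dist = None, float("inf")
    for content, ref_pose in pose_to_content.items():
        dist = float(np.linalg.norm(mouth_pose - np.asarray(ref_pose)))
        if dist < best_dist:
            best_content, best_dist = content, dist
    return best_content
```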
Illustratively, when the reference object is a mouth, the process of recognizing the pose of the reference object can be regarded as mouth shape recognition. With the continuous development of face recognition technologies, application scenarios keep expanding, and the understanding of people is no longer limited to liveness detection, identity authentication and the like. Local features of a person, such as the poses of the facial features, expressions, voice, lip language and body movements, are also being understood in finer detail. Mouth shape recognition can assist in the analysis of speech, lip-read text, expression states and the like, has wide application scenarios, and can further support virtual anchor mouth-shape and expression driving, real-time speech/text understanding and generation, and so on.
In the embodiments of the present application, mouth shape classification is implemented based on a convolutional neural network model, and the output is normalized into standardized matching probabilities: while a class is predicted, the matching probability of every class is also output. Various mouth shape results (referred to as composite mouth shape classes) are expressed through a weighted combination of the mouth shape classes, so the expression of mouth shape recognition results is richer, and the mouth pose of a reconstructed virtual three-dimensional figure can be effectively driven directly or corrected indirectly.
In an exemplary embodiment, the embodiments of the present application may be applied to virtual anchor mouth-shape and expression driving, real-time speech/text understanding and generation, and the like. In virtual anchor mouth-shape driving scenarios, a 3D Morphable Model (3DMM) technique for faces is usually adopted: it captures the expressions of a real person and drives the three-dimensional mesh of a virtual cartoon model in real time to reproduce the same mouth shape and expression. However, this technique strongly depends on the accuracy of the face key point detection algorithm, which often makes the expression insensitive or deficient; moreover, face key point detection is based on two-dimensional information and lacks three-dimensional output information, which leads to erroneous results at many face angles. Using the embodiments of the present application for assistance, the problems of insensitive and deficient expression can be avoided, and side faces can be effectively corrected.
According to the image processing method provided by the embodiment of the application, the gesture represented by the fusion gesture characterization parameters is used as the gesture of the reference object corresponding to the image to be processed, wherein the fusion gesture characterization parameters are obtained by fusing the gesture characterization parameters corresponding to the first gesture according to the matching probability of the first gesture in each candidate gesture and the image to be processed. The gesture of the fused gesture characterization parameter characterization can be regarded as the gesture obtained after one or more candidate gestures are fused according to the matching probability, the gesture of the reference object corresponding to the image to be processed determined in the mode is less influenced by the rough setting of the candidate gestures, the accuracy is high, and the image processing quality is good.
Next, an implementation procedure of the image processing method provided in the embodiment of the present application in an application scenario in which the reference object is a mouth is described. For example, when the reference object is a mouth, the image to be processed of the mouth may be referred to as a mouth image, and in this case, the image processing method is: acquiring a mouth image; classifying the mouth images to obtain the matching probability of each candidate gesture of the mouth and the mouth images; and performing weighted fusion on the posture characterization parameters corresponding to the first posture based on the matching probability of the first posture and the mouth image to obtain fusion posture characterization parameters, and taking the posture represented by the fusion posture characterization parameters as the posture of the mouth corresponding to the mouth image. The first pose is a candidate pose of which the matching probability with the mouth image meets a first condition.
Referring to fig. 6, an embodiment of the present application provides an image processing apparatus, including:
a first acquisition unit 601 configured to acquire an image to be processed of a reference object;
the classification unit 602 is configured to classify the image to be processed to obtain matching probabilities between each candidate pose of the reference object and the image to be processed;
the second obtaining unit 603 is configured to perform weighted fusion on the pose characterization parameters corresponding to the first pose based on the matching probability between the first pose and the image to be processed to obtain fusion pose characterization parameters, and use the pose characterized by the fusion pose characterization parameters as the pose of the reference object corresponding to the image to be processed, where the first pose is a candidate pose whose matching probability with the image to be processed satisfies a first condition.
In a possible implementation manner, the second obtaining unit 603 is further configured to obtain a target reference object virtual model matching the fusion posture characterizing parameter, and take a posture that the target reference object virtual model has as a posture characterized by the fusion posture characterizing parameter.
In a possible implementation manner, the second obtaining unit 603 is further configured to generate a target reference object virtual model based on the fusion pose characterization parameters.
In one possible implementation, the fusion attitude characterization parameter is used to indicate a variable parameter of a reference attitude characterization parameter matched with the reference object virtual model, and the reference object virtual model changes along with the adjustment of the reference attitude characterization parameter; the second obtaining unit 603 is further configured to adjust the reference posture characterization parameter according to the fusion posture characterization parameter, and use a virtual model obtained after the reference object virtual model changes along with the adjustment of the reference posture characterization parameter as the target reference object virtual model.
In a possible implementation manner, the classifying unit 602 is configured to obtain an image classification model, where the image classification model is obtained by training based on a sample image of a reference object and a classification label corresponding to the sample image, the classification label corresponding to the sample image is used to indicate a standard posture corresponding to the sample image, and the standard posture is one candidate posture in each candidate posture; and calling an image classification model to classify the images to be processed to obtain the matching probability of each candidate posture with the images to be processed respectively.
In a possible implementation manner, the image classification model includes a convolution model, an attention model, and a full-connection model, which are connected in sequence, and the classification unit 602 is configured to call the convolution model to perform feature extraction on the image to be processed, so as to obtain a first image feature; calling an attention model to perform feature extraction on the first image features to obtain second image features; and calling a full-connection model to classify the second image characteristics to obtain the matching probability of each candidate posture with the image to be processed respectively.
In a possible implementation manner, the convolution model includes at least one convolution submodel connected in sequence, and any convolution submodel includes a convolution layer, a pooling layer and an activation layer connected in sequence, and the classification unit 602 is configured to invoke the convolution layer, the pooling layer and the activation layer in a first convolution submodel in the convolution model to perform feature extraction on an image to be processed, so as to obtain an activation feature output by the first convolution submodel; and calling a convolution layer, a pooling layer and an activation layer in the next convolution submodel from the second convolution submodel in the convolution model to perform feature extraction on the activation features output by the last convolution submodel to obtain the activation features output by the next convolution submodel until the activation features output by the last convolution submodel are obtained, and taking the activation features output by the last convolution submodel as the first image features.
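As a hedged sketch of the structure described above (convolution submodels each consisting of a convolution layer, a pooling layer and an activation layer, followed by an attention model and a fully connected model), in PyTorch-style code; the channel sizes, number of attention heads, and the use of spatial positions as attention tokens are assumptions for illustration, not details fixed by the disclosure:

```python
import torch
import torch.nn as nn


class ConvSubmodel(nn.Module):
    """One convolution submodel: convolution layer, pooling layer, activation layer."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.pool(self.conv(x)))


class PoseClassifier(nn.Module):
    """Convolution model -> attention model -> fully connected model."""
    def __init__(self, num_poses: int, channels=(3, 32, 64)):
        super().__init__()
        self.convs = nn.Sequential(*[ConvSubmodel(channels[i], channels[i + 1])
                                     for i in range(len(channels) - 1)])
        self.attn = nn.MultiheadAttention(embed_dim=channels[-1], num_heads=4,
                                          batch_first=True)
        self.fc = nn.Linear(channels[-1], num_poses)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.convs(x)                      # first image features (B, C, H, W)
        seq = feat.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        attended, _ = self.attn(seq, seq, seq)    # second image features
        logits = self.fc(attended.mean(dim=1))    # pool over positions, then classify
        return logits                             # softmax(logits) gives matching probabilities
```

Applying softmax to the returned logits yields the matching probability of each candidate pose with the image to be processed, which is how this sketch would plug into the training-loop sketch given earlier.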
In one possible implementation, a first obtaining unit 601 configured to obtain an original image, where the original image includes a sub-image of a reference object; determining the area of the sub-image of the reference object in the original image; and acquiring the image to be processed based on the area of the sub-image of the reference object in the original image.
In one possible implementation, the reference object is a part of the first object, see fig. 7, the apparatus further comprising:
an adjusting unit 604 for obtaining a first object virtual model, the first object virtual model comprising sub-models of reference objects; and carrying out attitude adjustment on the sub-model of the reference object by utilizing the attitude of the reference object corresponding to the image to be processed.
In one possible implementation, where the reference object is a mouth, see fig. 7, the apparatus further comprises:
a recognition unit 605 for recognizing lip language contents matched with the pose of the mouth corresponding to the image to be processed.
The image processing device provided by the embodiment of the application takes the gesture represented by the fusion gesture characterization parameters as the gesture of the reference object corresponding to the image to be processed, wherein the fusion gesture characterization parameters are obtained by fusing the gesture characterization parameters corresponding to the first gesture according to the matching probability of the first gesture in each candidate gesture and the image to be processed. The gesture of the fused gesture characterization parameter characterization can be regarded as the gesture obtained after one or more candidate gestures are fused according to the matching probability, the gesture of the reference object corresponding to the image to be processed determined in the mode is less influenced by the rough setting of the candidate gestures, the accuracy is high, and the image processing quality is good.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional units is illustrated, and in practical applications, the above functions may be distributed by different functional units according to needs, that is, the internal structure of the apparatus may be divided into different functional units to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
In an exemplary embodiment, a computer device is also provided, the computer device comprising a processor and a memory, the memory having at least one computer program stored therein. The at least one computer program is loaded and executed by one or more processors to cause the computer apparatus to implement any of the image processing methods described above. The computer device may be a server or a terminal, which is not limited in this embodiment of the present application. Next, the structures of the server and the terminal will be described, respectively.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application, where the server may generate a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 801 and one or more memories 802, where the one or more memories 802 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 801, so as to enable the server to implement the image Processing method provided by each method embodiment. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application. Illustratively, the terminal may be a PC, a mobile phone, a smart phone, a PDA, a wearable device, a PPC, a tablet computer, a smart in-vehicle device, a smart television, a smart speaker, a vehicle-mounted terminal, or the like. A terminal may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
Generally, a terminal includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores. The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU, which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 902 is used for storing at least one instruction, which is used for being executed by the processor 901 to enable the terminal to implement the image processing method provided by the method embodiment in the present application.
In some embodiments, the terminal may further include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. The camera assembly 906 is used to capture images or video.
Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The positioning component 908 is used to locate the current geographic Location of the terminal to implement navigation or LBS (Location Based Service). The power supply 909 is used to supply power to each component in the terminal. The power source 909 may be alternating current, direct current, disposable or rechargeable.
In some embodiments, the terminal also includes one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 may detect the magnitude of acceleration on three coordinate axes of a coordinate system established with the terminal. The gyroscope sensor 912 can detect the body direction and the rotation angle of the terminal, and the gyroscope sensor 912 and the acceleration sensor 911 cooperate to acquire the 3D motion of the user on the terminal. The pressure sensor 913 may be disposed on a side frame of the terminal and/or under the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal, the user's holding signal to the terminal may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the display screen 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 905.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. The optical sensor 915 is used to collect ambient light intensity. A proximity sensor 916, also known as a distance sensor, is typically provided on the front panel of the terminal. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal.
Those skilled in the art will appreciate that the configuration shown in fig. 9 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
In an exemplary embodiment, there is also provided a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor of a computer apparatus to cause the computer to implement any one of the image processing methods described above.
In one possible implementation, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising a computer program or computer instructions which is loaded and executed by a processor to cause a computer to implement any of the image processing methods described above.
It should be noted that the terms "first," "second," and the like in this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. The implementations described in the above exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed of a reference object;
classifying the images to be processed to obtain the matching probability of each candidate gesture of the reference object and the images to be processed respectively;
based on the matching probability of a first gesture and the image to be processed, performing weighted fusion on the gesture characterization parameters corresponding to the first gesture to obtain fusion gesture characterization parameters, taking the gesture characterized by the fusion gesture characterization parameters as the gesture of a reference object corresponding to the image to be processed, wherein the first gesture is a candidate gesture meeting a first condition with the matching probability of the image to be processed.
2. The method according to claim 1, wherein before the pose characterized by the fused pose characterization parameters is taken as the pose of the reference object corresponding to the image to be processed, the method further comprises:
and acquiring a target reference object virtual model matched with the fusion attitude characterization parameters, and taking the attitude of the target reference object virtual model as the attitude characterized by the fusion attitude characterization parameters.
3. The method of claim 2, wherein the obtaining a virtual model of a target reference object that matches the fused pose characterization parameters comprises:
and generating the target reference object virtual model based on the fusion attitude characterization parameters.
4. The method of claim 2, wherein the fused pose characterization parameter is used to indicate a variable parameter of a reference pose characterization parameter matching a reference object virtual model, the reference object virtual model varying as the reference pose characterization parameter is adjusted; the acquiring of the target reference object virtual model matched with the fusion attitude characterization parameters comprises:
and adjusting the reference attitude characterization parameters according to the fusion attitude characterization parameters, and taking a virtual model obtained after a reference object virtual model changes along with the adjustment of the reference attitude characterization parameters as the target reference object virtual model.
5. The method according to any one of claims 1 to 4, wherein the classifying the image to be processed to obtain the matching probability between each candidate pose of the reference object and the image to be processed, comprises:
acquiring an image classification model, wherein the image classification model is obtained based on a sample image of the reference object and a classification label corresponding to the sample image, the classification label corresponding to the sample image is used for indicating a standard posture corresponding to the sample image, and the standard posture is one candidate posture in each candidate posture;
and calling the image classification model to classify the images to be processed to obtain the matching probability of each candidate gesture and the images to be processed respectively.
6. The method according to claim 5, wherein the image classification model includes a convolution model, an attention model and a full-connection model which are connected in sequence, and the step of calling the image classification model to classify the image to be processed to obtain the matching probability between each candidate pose and the image to be processed comprises:
calling the convolution model to perform feature extraction on the image to be processed to obtain a first image feature;
calling the attention model to perform feature extraction on the first image features to obtain second image features;
and calling the full-connection model to classify the second image features to obtain the matching probability of each candidate gesture and the image to be processed respectively.
7. The method according to claim 6, wherein the convolution model comprises at least one convolution submodel connected in sequence, any convolution submodel comprises a convolution layer, a pooling layer and an activation layer connected in sequence, and the step of calling the convolution model to perform feature extraction on the image to be processed to obtain a first image feature comprises:
calling a convolution layer, a pooling layer and an activation layer in a first convolution submodel in the convolution model to perform feature extraction on the image to be processed to obtain activation features output by the first convolution submodel;
and calling a convolution layer, a pooling layer and an activation layer in the next convolution submodel from the second convolution submodel in the convolution model to perform feature extraction on the activation features output by the last convolution submodel to obtain the activation features output by the next convolution submodel until the activation features output by the last convolution submodel are obtained, and taking the activation features output by the last convolution submodel as the first image features.
8. The method according to any one of claims 1-4 and 6-7, wherein the acquiring the to-be-processed image of the reference object comprises:
acquiring an original image, wherein the original image comprises a sub-image of the reference object;
determining the area of the sub-image of the reference object in the original image;
and acquiring the image to be processed based on the area of the sub-image of the reference object in the original image.
9. The method according to any one of claims 1-4 and 6-7, wherein the reference object is a part of a first object, and after the pose represented by the fused pose characterization parameter is used as the pose of the reference object corresponding to the image to be processed, the method further comprises:
obtaining a first object virtual model, wherein the first object virtual model comprises a sub-model of the reference object;
and carrying out attitude adjustment on the sub-model of the reference object by utilizing the attitude of the reference object corresponding to the image to be processed.
10. The method according to any one of claims 1-4 and 6-7, wherein the reference object is a mouth, and after the pose represented by the fusion pose characterization parameter is taken as the pose of the reference object corresponding to the image to be processed, the method further comprises:
and identifying lip language content matched with the posture of the mouth corresponding to the image to be processed.
11. An image processing apparatus, characterized in that the apparatus comprises:
a first acquisition unit configured to acquire an image to be processed of a reference object;
the classification unit is used for classifying the image to be processed to obtain the matching probability of each candidate gesture of the reference object and the image to be processed;
and the second obtaining unit is used for performing weighted fusion on the posture characterization parameters corresponding to the first posture based on the matching probability of the first posture and the image to be processed to obtain fusion posture characterization parameters, taking the posture characterized by the fusion posture characterization parameters as the posture of a reference object corresponding to the image to be processed, and taking the first posture as a candidate posture of which the matching probability with the image to be processed meets a first condition.
12. A computer device, characterized in that the computer device comprises a processor and a memory, in which at least one computer program is stored, which is loaded and executed by the processor, to cause the computer device to carry out the image processing method according to any one of claims 1 to 10.
13. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor to cause a computer to implement the image processing method according to any one of claims 1 to 10.
14. A computer program product, characterized in that it comprises a computer program or computer instructions, which are loaded and executed by a processor, to cause a computer to implement an image processing method according to any one of claims 1 to 10.
CN202111141278.7A 2021-09-28 2021-09-28 Image processing method, apparatus, device, storage medium, and computer program product Pending CN114283461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111141278.7A CN114283461A (en) 2021-09-28 2021-09-28 Image processing method, apparatus, device, storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111141278.7A CN114283461A (en) 2021-09-28 2021-09-28 Image processing method, apparatus, device, storage medium, and computer program product

Publications (1)

Publication Number Publication Date
CN114283461A true CN114283461A (en) 2022-04-05

Family

ID=80868604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111141278.7A Pending CN114283461A (en) 2021-09-28 2021-09-28 Image processing method, apparatus, device, storage medium, and computer program product

Country Status (1)

Country Link
CN (1) CN114283461A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310012A (en) * 2023-05-25 2023-06-23 成都索贝数码科技股份有限公司 Video-based three-dimensional digital human gesture driving method, device and system
CN116310012B (en) * 2023-05-25 2023-07-25 成都索贝数码科技股份有限公司 Video-based three-dimensional digital human gesture driving method, device and system

Similar Documents

Publication Publication Date Title
Mahmood et al. Facial expression recognition in image sequences using 1D transform and gabor wavelet transform
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
Kamarol et al. Spatiotemporal feature extraction for facial expression recognition
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
WO2023098128A1 (en) Living body detection method and apparatus, and training method and apparatus for living body detection system
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
Gou et al. Cascade learning from adversarial synthetic images for accurate pupil detection
Ekbote et al. Indian sign language recognition using ANN and SVM classifiers
CN113449700B (en) Training of video classification model, video classification method, device, equipment and medium
CN113705316A (en) Method, device and equipment for acquiring virtual image and storage medium
CN112633425B (en) Image classification method and device
Ravi et al. Sign language recognition with multi feature fusion and ANN classifier
CN113569607A (en) Motion recognition method, motion recognition device, motion recognition equipment and storage medium
CN115131849A (en) Image generation method and related device
Neverova Deep learning for human motion analysis
CN114283461A (en) Image processing method, apparatus, device, storage medium, and computer program product
Das et al. A fusion of appearance based CNNs and temporal evolution of skeleton with LSTM for daily living action recognition
Guo et al. Facial expression recognition: a review
Rahaman et al. A real-time hand-signs segmentation and classification system using fuzzy rule based RGB model and grid-pattern analysis.
CN113743186B (en) Medical image processing method, device, equipment and storage medium
CN114283460A (en) Feature extraction method and device, computer equipment and storage medium
CN116029912A (en) Training of image processing model, image processing method, device, equipment and medium
Shane et al. Sign Language Detection Using Faster RCNN Resnet

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40071033

Country of ref document: HK