CN112915539B - Virtual object detection method and device and readable storage medium


Info

Publication number
CN112915539B
Authority
CN
China
Prior art keywords
virtual object
image
processed
wearing
detection
Prior art date
Legal status
Active
Application number
CN202110357207.4A
Other languages
Chinese (zh)
Other versions
CN112915539A (en)
Inventor
高威
王君乐
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110357207.4A priority Critical patent/CN112915539B/en
Publication of CN112915539A publication Critical patent/CN112915539A/en
Application granted granted Critical
Publication of CN112915539B publication Critical patent/CN112915539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/53 Controlling the output signals based on the game progress involving additional visual information provided to the game scene, e.g. by overlay to simulate a head-up display [HUD] or displaying a laser sight in a shooting game
    • A63F13/537 Controlling the output signals based on the game progress involving additional visual information provided to the game scene, e.g. by overlay to simulate a head-up display [HUD] or displaying a laser sight in a shooting game using indicators, e.g. showing the condition of a game character on screen
    • A63F13/5372 Controlling the output signals based on the game progress involving additional visual information provided to the game scene, e.g. by overlay to simulate a head-up display [HUD] or displaying a laser sight in a shooting game using indicators, e.g. showing the condition of a game character on screen for tagging characters, objects or locations in the game scene, e.g. displaying a circle under the character controlled by the player
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/822 Strategy games; Role-playing games
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/837 Shooting of targets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/66 Methods for processing data by generating or executing the game program for rendering three dimensional images
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F2300/807 Role playing or strategy games
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F2300/8076 Shooting

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Optics & Photonics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a virtual object detection method, a virtual object detection device and a readable storage medium. The virtual object detection method includes the following steps: acquiring an image to be processed, and performing virtual object detection on the image to be processed; if a virtual object exists in the image to be processed, acquiring object position information of the virtual object in the image to be processed, and performing region division on the virtual object according to the object position information to obtain at least two key part regions; performing feature extraction on the at least two key part regions respectively to obtain a picture region feature corresponding to each key part region, and performing wearing detection on the at least two picture region features respectively to obtain virtual object wearing information corresponding to the image to be processed, where the virtual object wearing information is used for performing abnormal rendering analysis on a virtual wearing article associated with the virtual object. By the method and the device, the efficiency and the accuracy of virtual object detection can be improved.

Description

Virtual object detection method and device and readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for detecting a virtual object, and a readable storage medium.
Background
With the continuous development of mobile communication technology, virtual objects have become the main commercial output content of many products (such as games and videos), and the appearance of virtual objects has become richer and more diversified. Correspondingly, decorative articles of virtual objects, such as hats, backpacks, coats and trousers, are characterized by a large magnitude, many styles and frequent changes, so that whether resource loading and rendering are normal has become an essential test point in the resource testing process of these products.
In the existing resource testing scheme, rendering abnormalities of the virtual object are checked by manually observing whether the decorative articles of the virtual object in the picture are completely rendered. The detection process is repetitive and time-consuming, consumes a large amount of labor cost, and has low detection efficiency and accuracy.
Disclosure of Invention
The embodiment of the application provides a virtual object detection method, a virtual object detection device and a readable storage medium, which can improve the efficiency and accuracy of virtual object detection.
An embodiment of the present application provides a virtual object detection method, including:
acquiring an image to be processed, and performing virtual object detection on the image to be processed;
if the virtual object exists in the image to be processed, object position information of the virtual object in the image to be processed is obtained, and the virtual object is subjected to region division according to the object position information to obtain at least two key part regions;
performing feature extraction on the at least two key part regions respectively to obtain a picture region feature corresponding to each key part region, and performing wearing detection on the at least two picture region features respectively to obtain virtual object wearing information corresponding to the image to be processed; the virtual object wearing information is used for performing abnormal rendering analysis on the virtual wearing article associated with the virtual object.
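For ease of understanding only, the following minimal Python sketch chains the three steps described above (virtual object detection, region division, wearing detection). All function and parameter names are illustrative assumptions rather than an API disclosed by the application.

```python
from typing import Callable, Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) pixel coordinates of a detection frame

def analyze_image(
    image,
    detect_object: Callable[[object], Optional[Box]],
    divide_regions: Callable[[object, Box], Dict[str, object]],
    predict_wearing: Callable[[object], int],
) -> Optional[Dict[str, int]]:
    """Chain the three steps: object detection, region division, wearing detection."""
    box = detect_object(image)                 # step 1: virtual object detection
    if box is None:                            # no virtual object in the image to be processed
        return None
    regions = divide_regions(image, box)       # step 2: at least two key part regions
    # step 3: wearing detection per key part region (1 = worn, 0 = not worn);
    # the resulting mapping is the virtual object wearing information.
    return {name: predict_wearing(crop) for name, crop in regions.items()}
```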
An embodiment of the present application provides a virtual object detection method, including:
acquiring a sample image, and carrying out virtual object detection on the sample image;
if the sample virtual object exists in the sample image, acquiring prediction object position information and a virtual object wearing label of the sample virtual object in the sample image, and performing region division on the sample virtual object according to the prediction object position information to obtain at least two key part regions;
inputting at least two key part areas into an initial object classification network, respectively extracting features of the at least two key part areas in the initial object classification network to obtain picture area features respectively corresponding to each key part area, and respectively performing wearing detection on the at least two picture area features to obtain predicted virtual object wearing information corresponding to a sample image;
generating a first target loss function according to the virtual object wearing label and the predicted virtual object wearing information, and adjusting network parameters in the initial object classification network according to the first target loss function to obtain an object classification network; the object classification network is used for identifying virtual object wearing information corresponding to a virtual object in the image to be processed, and the virtual object wearing information is used for performing abnormal rendering analysis on a virtual wearing article associated with the virtual object.
An embodiment of the present application provides a virtual object detection apparatus in one aspect, including:
the object detection module is used for acquiring an image to be processed and carrying out virtual object detection on the image to be processed;
the area dividing module is used for acquiring object position information of the virtual object in the image to be processed if the virtual object exists in the image to be processed, and performing area division on the virtual object according to the object position information to obtain at least two key part areas;
the classification detection module is used for respectively extracting the characteristics of at least two key part areas to obtain the picture area characteristics corresponding to each key part area, and respectively performing wearing detection on the characteristics of at least two picture areas to obtain the virtual object wearing information corresponding to the image to be processed; the virtual object wearing information is used for conducting abnormal rendering analysis on the virtual wearing article associated with the virtual object.
The object detection module is specifically used for acquiring an image to be processed, inputting the image to be processed into an object detection network, extracting features of the image to be processed in the object detection network to obtain a picture feature matrix, and generating at least two detection frames according to the picture feature matrix; carrying out non-maximum suppression processing on at least two detection frames to obtain detection frames to be processed; and if the virtual object is contained in the detection frame to be processed, determining that the virtual object exists in the image to be processed.
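For illustration only, the non-maximum suppression referred to above can be sketched with a generic greedy routine of the following kind; it is a textbook-style example rather than code taken from the application, and the IoU threshold value is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two detection frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, confidences, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the most confident frame, drop frames
    that overlap it too much, and repeat. Returns the kept (box, confidence) pairs."""
    order = sorted(zip(boxes, confidences), key=lambda bc: bc[1], reverse=True)
    kept = []
    for box, conf in order:
        if all(iou(box, k) <= iou_threshold for k, _ in kept):
            kept.append((box, conf))
    return kept
```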
Wherein, the area division module includes:
the region expansion unit is used for performing region expansion on the detection frame to be processed associated with the virtual object to obtain a target detection frame if the virtual object exists in the image to be processed; the side length of the target detection frame is larger than that of the detection frame to be processed;
the information acquisition unit is used for acquiring the detection frame position information of the target detection frame in the image to be processed and determining the detection frame position information as the object position information of the virtual object in the image to be processed; acquiring the area division ratio between at least two key part reference areas and a virtual object reference area; each key part reference area is positioned in the virtual object reference area;
and the area dividing unit is used for generating at least two groups of area coordinates according to the area dividing proportion and the object position information, and performing area division on the target detection frame according to the at least two groups of area coordinates to obtain at least two key part areas.
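For ease of understanding, a minimal sketch of the region division described above is given below, assuming that each area division ratio is expressed as a pair of vertical fractions of the target detection frame; the ratio values and region names are illustrative assumptions, not values disclosed by the application.

```python
def divide_into_regions(box, ratios):
    """Split a target detection frame into key part regions by vertical ratios.

    box:    (x1, y1, x2, y2) of the target detection frame in the image to be processed.
    ratios: mapping of region name -> (top, bottom) fractions of the frame height.
    """
    x1, y1, x2, y2 = box
    height = y2 - y1
    regions = {}
    for name, (top, bottom) in ratios.items():
        regions[name] = (x1, int(y1 + top * height), x2, int(y1 + bottom * height))
    return regions

# Example with made-up ratios: a whole-body region plus four sub-regions.
example_regions = divide_into_regions(
    (100, 50, 300, 650),
    {"whole_body": (0.0, 1.0), "head": (0.0, 0.2),
     "upper_body": (0.2, 0.55), "lower_body": (0.55, 0.85), "feet": (0.85, 1.0)},
)
```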
Wherein, the classification detection module includes:
the characteristic extraction unit is used for inputting the at least two key part areas into an object classification network, and respectively extracting the characteristics of the at least two key part areas in the object classification network to obtain the image area characteristics corresponding to each key part area;
the pooling unit is used for respectively carrying out global average pooling processing on the at least two picture region characteristics through a global average pooling layer in the object classification network to obtain at least two initial category characteristics;
the full-connection unit is used for respectively performing feature integration on at least two initial class features through a full-connection layer in the object classification network to obtain at least two target class features;
and the output unit is used for outputting the classification probability corresponding to each target class characteristic through an output layer in the object classification network, and determining the virtual object wearing information corresponding to the image to be processed according to at least two classification probabilities.
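For orientation only, the forward pass described above (feature extraction, global average pooling layer, fully connected layer, output layer) could look roughly like the following PyTorch-style sketch; the use of PyTorch, the backbone and the layer sizes are assumptions, not the network architecture disclosed by the application.

```python
import torch
import torch.nn as nn

class RegionWearingHead(nn.Module):
    """Illustrative per-region head: global average pooling + fully connected layer + sigmoid."""
    def __init__(self, backbone: nn.Module, feature_channels: int = 512):
        super().__init__()
        self.backbone = backbone                  # produces the picture region feature map
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling layer
        self.fc = nn.Linear(feature_channels, 1)  # fully connected layer

    def forward(self, region_image: torch.Tensor) -> torch.Tensor:
        feature_map = self.backbone(region_image)     # picture region feature
        pooled = self.gap(feature_map).flatten(1)     # initial category feature
        logits = self.fc(pooled)                      # target category feature
        return torch.sigmoid(logits)                  # classification probability
```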
Wherein the at least two target category features comprise a target category feature T_i corresponding to a key part region S_i, where i is a positive integer and i is less than or equal to the number of key part regions;
the output unit is specifically used for outputting, through an output layer in the object classification network, the classification probability corresponding to the target category feature T_i, and determining the worn state or the unworn state indicated by the classification probability corresponding to the target category feature T_i as the predicted wearing state corresponding to the key part region S_i; the worn state refers to the state indicated when the classification probability corresponding to the target category feature T_i is greater than the worn probability threshold; the unworn state refers to the state indicated when the classification probability corresponding to the target category feature T_i is greater than the unworn probability threshold; and determining the predicted wearing states respectively corresponding to each key part region as the virtual object wearing information corresponding to the image to be processed;
the above apparatus further comprises:
a rendering analysis module, configured to obtain the actual wearing state of the virtual wearing article of the virtual object in the key part region S_i, and if the actual wearing state is different from the predicted wearing state corresponding to the key part region S_i, determine that the rendering result for the virtual wearing article in the key part region S_i is an abnormal rendering result.
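For ease of understanding, the comparison between the actual wearing state and the predicted wearing state can be sketched as follows; the region names and the 0/1 state encoding are illustrative assumptions.

```python
def find_abnormal_regions(predicted_states, actual_states):
    """Flag key part regions where the predicted and actual wearing states differ.
    Both arguments map region name -> 1 (worn) or 0 (not worn)."""
    return [region for region, actual in actual_states.items()
            if predicted_states.get(region) is not None
            and predicted_states[region] != actual]

# e.g. the configuration says a hat should be rendered on the head, but the classifier
# predicts "not worn" -> the head region is reported as an abnormal rendering result.
abnormal = find_abnormal_regions({"head": 0, "feet": 1}, {"head": 1, "feet": 1})
```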
Wherein, the above apparatus further includes:
the scene recognition module is used for recognizing the service scene type in the image to be processed and acquiring a template image corresponding to the service scene type; the template image includes a reference virtual object;
the offset analysis module is used for acquiring reference position information of a reference virtual object in the template image and generating a position offset according to the reference position information and the object position information; if the position offset is larger than the offset threshold, determining that the virtual object is in an abnormal position in the image to be processed, and determining that the rendering result of the virtual object in the image to be processed is an abnormal rendering result; and if the position offset is smaller than or equal to the offset threshold, determining that the virtual object is at a normal position in the image to be processed, and determining that the rendering result of the virtual object in the image to be processed is a normal rendering result.
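For illustration only, the position offset check described above might be sketched as follows; using the distance between detection frame centers as the offset metric is an assumption, since the application does not fix a particular metric.

```python
def position_offset_is_abnormal(reference_box, detected_box, offset_threshold):
    """Compare the reference virtual object position in the template image with the
    detected object position; returns True for an abnormal rendering result."""
    def center(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    (rx, ry), (dx, dy) = center(reference_box), center(detected_box)
    offset = ((rx - dx) ** 2 + (ry - dy) ** 2) ** 0.5
    return offset > offset_threshold
```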
The scene identification module is specifically used for acquiring a scene identification configuration file; the scene identification configuration file includes an association relationship between a scene element and a service scene type; performing service scene recognition on the image to be processed to obtain pixel coordinates and element classes of scene elements to be detected in the image to be processed; and performing matching search in the scene identification configuration file according to the pixel coordinates and the element classes, determining the scene element that matches the pixel coordinates and the element classes in the scene identification configuration file as a target scene element, and determining the service scene type having an association relationship with the target scene element as the service scene type of the image to be processed.
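For ease of understanding, one possible shape of the scene identification configuration file and the matching search is sketched below; the field names, element classes, coordinates and tolerance are illustrative assumptions, not the application's actual configuration format.

```python
# Hypothetical scene identification configuration: each entry associates a scene
# element (class + expected pixel coordinates) with a service scene type.
scene_config = [
    {"element_class": "minimap", "pixel_coords": (1700, 80), "scene_type": "battle"},
    {"element_class": "shop_icon", "pixel_coords": (60, 1000), "scene_type": "lobby"},
]

def match_scene_type(element_class, pixel_coords, config, tolerance=20):
    """Return the service scene type whose configured element matches the detected one."""
    for entry in config:
        same_class = entry["element_class"] == element_class
        close = all(abs(a - b) <= tolerance
                    for a, b in zip(entry["pixel_coords"], pixel_coords))
        if same_class and close:
            return entry["scene_type"]
    return None
```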
An aspect of an embodiment of the present application provides a virtual object detection apparatus, including:
the object detection module is used for acquiring a sample image and carrying out virtual object detection on the sample image;
the area dividing module is used for acquiring prediction object position information and a virtual object wearing label of the sample virtual object in the sample image if the sample virtual object exists in the sample image, and performing area division on the sample virtual object according to the prediction object position information to obtain at least two key part areas;
the classification detection module is used for inputting the at least two key part areas into an initial object classification network, respectively extracting the characteristics of the at least two key part areas in the initial object classification network to obtain the image area characteristics corresponding to each key part area, respectively performing wearing detection on the at least two image area characteristics to obtain the predicted virtual object wearing information corresponding to the sample image;
the adjusting module is used for generating a first target loss function according to the virtual object wearing label and the predicted virtual object wearing information, and adjusting network parameters in the initial object classification network according to the first target loss function to obtain an object classification network; the object classification network is used for identifying virtual object wearing information corresponding to a virtual object in the image to be processed, and the virtual object wearing information is used for performing abnormal rendering analysis on a virtual wearing article associated with the virtual object.
The object detection module is specifically configured to decode video sample data including a sample virtual object to obtain a plurality of continuous video frames, and perform frame extraction processing on the plurality of continuous video frames to obtain a sample image; inputting a sample image into an object detection network, extracting the characteristics of the sample image in the object detection network to obtain a picture characteristic matrix, and generating at least two detection frames according to the picture characteristic matrix; carrying out non-maximum suppression processing on at least two detection frames to obtain detection frames to be processed; and if the to-be-processed detection frame contains the sample virtual object, determining that the sample virtual object exists in the sample image.
Wherein, the above apparatus further includes:
the network training module is used for carrying out virtual object labeling on the sample image to obtain an actual labeling frame, inputting the labeled sample image into an initial object detection network, and outputting a prediction detection frame for labeling the sample virtual object and prediction detection frame position information corresponding to the prediction detection frame through the initial object detection network; acquiring the number of the actual labeling frames and the actual object position information of the actual labeling frames in the sample image; generating a quantity loss function according to the quantity of the actual labeling frames and the quantity of the prediction detection frames, generating a position loss function according to the actual object position information and the prediction detection frame position information, and generating a second target loss function according to the quantity loss function and the position loss function; and adjusting the network parameters in the initial object detection network according to the second target loss function to obtain the object detection network.
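For illustration only, one way the second target loss function could combine a quantity loss and a position loss is sketched below; the concrete loss forms (L1 count error, smooth L1 box regression) and the weighting are assumptions, not specified by the application.

```python
import torch.nn.functional as F

def second_target_loss(pred_count, actual_count, pred_boxes, actual_boxes, alpha=1.0):
    """Illustrative combination of a quantity loss (predicted vs. actual number of
    labeling frames) and a position loss (predicted vs. actual frame coordinates)."""
    quantity_loss = F.l1_loss(pred_count.float(), actual_count.float())
    position_loss = F.smooth_l1_loss(pred_boxes, actual_boxes)
    return quantity_loss + alpha * position_loss
```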
The classification detection module is specifically configured to input the area sample images corresponding to the at least two key part areas into an initial object classification network, and perform feature extraction on the at least two area sample images in the initial object classification network to obtain picture area features corresponding to each area sample image; respectively carrying out global average pooling on at least two picture region characteristics through a global average pooling layer in the initial object classification network to obtain at least two initial category characteristics; respectively performing feature integration on at least two initial category features through a full connection layer in an initial object classification network to obtain at least two target category features; and outputting classification probabilities respectively corresponding to each target class characteristic through an output layer in the initial object classification network, determining a predictive sub-label respectively corresponding to each regional sample image according to at least two classification probabilities, and determining at least two predictive sub-labels as predictive virtual object wearing information corresponding to the sample images.
The virtual object wearing label includes an actual sub-label corresponding to each area sample image; the at least two area sample images include an area sample image X_j, where j is a positive integer and j is less than or equal to the number of area sample images;
the adjusting module is specifically configured to generate, according to the actual sub-label corresponding to the area sample image X_j and the predicted sub-label corresponding to the area sample image X_j, a sub-loss function corresponding to the area sample image X_j; and generate the first target loss function according to the sub-loss function corresponding to each area sample image.
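For ease of understanding, a sketch of the first target loss function as a combination of per-region sub-loss functions is given below; the choice of binary cross-entropy as the sub-loss is an assumption, since the application only states that a sub-loss is generated for each area sample image and then combined.

```python
import torch
import torch.nn.functional as F

def first_target_loss(predicted_probs, actual_sub_labels):
    """Sum of per-region sub-loss functions.

    predicted_probs:   list of scalar tensors in (0, 1), one per area sample image X_j
    actual_sub_labels: list of scalar tensors, 1.0 if the article is worn in X_j else 0.0
    """
    sub_losses = [
        F.binary_cross_entropy(pred, label)
        for pred, label in zip(predicted_probs, actual_sub_labels)
    ]
    return torch.stack(sub_losses).sum()
```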
An aspect of an embodiment of the present application provides a computer device, including: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the method in the embodiment of the present application.
An aspect of the present embodiment provides a computer-readable storage medium, in which a computer program is stored, where the computer program is adapted to be loaded by a processor and to execute the method in the present embodiment.
In one aspect, embodiments of the present application provide a computer program product or a computer program, where the computer program product or the computer program includes computer instructions stored in a computer-readable storage medium, and a processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the method in the embodiments of the present application.
According to the method and the device, virtual object detection can be performed on the obtained image to be processed. When a virtual object is detected in the image to be processed, the object position information of the virtual object in the image to be processed can be further obtained, and the virtual object can be divided into at least two key part regions according to the object position information. Feature extraction can then be performed on each key part region to obtain the picture region feature corresponding to each key part region, and wearing detection can be performed on the at least two picture region features to finally obtain the virtual object wearing information corresponding to the image to be processed, according to which abnormal rendering analysis can be performed on the virtual wearing articles associated with the virtual object. Therefore, the deep-learning-based resource (including virtual wearing article) abnormal rendering detection scheme can automatically detect whether the virtual wearing article in each region of the virtual object is rendered abnormally, so that it can satisfy the abnormal rendering detection of virtual objects in various products, save labor cost, accelerate the test flow, and improve the efficiency and accuracy of detecting the virtual object.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a diagram illustrating a system architecture according to an embodiment of the present disclosure;
fig. 2a to fig. 2c are schematic views of a scene of virtual object detection provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a virtual object detection method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a virtual object detection process provided in an embodiment of the present application;
fig. 5 is a scene schematic diagram of regional expansion provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an object classification network according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a virtual object detection method according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a virtual object detection method according to an embodiment of the present application;
fig. 9a to fig. 9b are schematic diagrams of a scenario of position offset detection provided by an embodiment of the present application;
fig. 10 is a schematic flowchart of a virtual object detection method according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of a training object classification network according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a sample image of different regions according to an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of a scenario of model testing provided in an embodiment of the present application;
fig. 14 is a schematic flowchart of a training object detection network according to an embodiment of the present application;
FIG. 15 is a schematic interface diagram of a virtual object annotation provided in an embodiment of the present application;
fig. 16 is a schematic structural diagram of a virtual object detection apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a virtual object detection apparatus according to an embodiment of the present application;
FIG. 18 is a schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 19 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision technology (CV) is a science that studies how to make a machine "see". More specifically, it refers to using a camera and a computer, in place of human eyes, to perform machine vision such as identification and measurement on a target, and to further process the image so that it becomes an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include data processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, deep learning technology and other technologies, and the specific process is explained by the following embodiment.
Please refer to fig. 1, which is a schematic diagram of a system architecture according to an embodiment of the present application. The system architecture may include a service server 100 and a terminal cluster, and the terminal cluster may include: terminal device 200a, terminal device 200b, terminal device 200c, …, and terminal device 200n, wherein a communication connection may exist between terminal clusters, for example, a communication connection exists between terminal device 200a and terminal device 200b, and a communication connection exists between terminal device 200a and terminal device 200c. Meanwhile, any terminal device in the terminal cluster may have a communication connection with the service server 100, for example, a communication connection exists between the terminal device 200a and the service server 100, where the communication connection is not limited to a connection manner, and may be directly or indirectly connected through a wired communication manner, may also be directly or indirectly connected through a wireless communication manner, and may also be through other manners, which is not limited in this application.
It should be understood that each terminal device in the terminal cluster shown in fig. 1 may be installed with an application client, and when the application client runs in each terminal device, the application client may perform data interaction with the service server 100 shown in fig. 1, so that the service server 100 may receive service data from each terminal device. The application client can be an application client with a function of displaying data information such as characters, images, audios and videos, such as a game application, a video editing application, a social contact application, an instant messaging application, a live broadcast application, a short video application, a music application, a shopping application, a novel application, a payment application and a browser. The application client may be an independent client, or may be an embedded sub-client integrated in a certain client (e.g., an instant messaging client, a social client, a video client, etc.), which is not limited herein.
In an embodiment, taking a game application as an example, the service server 100 in fig. 1 may be a set of multiple servers corresponding to the game application, such as a gateway server, a scene server, a world server, a database proxy server, an AI server, and a chat manager, so that each terminal device may perform data transmission with the service server 100 through the application client corresponding to the game application. For example, each terminal device may participate in the same game with other terminal devices through the service server 100, such as an MMORPG (Massively Multiplayer Online Role Playing Game) or an FPS (First-Person Shooting) game. During a game, a player may manipulate a corresponding virtual object and interact in real time with virtual objects controlled by other players in the virtual game space. In addition, the player can update the virtual wearing articles on the virtual object controlled by the player at any time and any place as required, and display them on the game picture. The virtual wearing articles used for decorating the virtual object are various, such as helmets and hats worn on the head of the virtual object, coats, trousers and skirts worn on the body, shoes worn on the feet, and virtual firearms to be used, and the style of each virtual wearing article is different.
In one embodiment, taking a video editing application as an example, the system shown in fig. 1 may represent a distributed multi-machine networked system in a video editing scenario. Research personnel can construct a virtual scene and a virtual object in advance. In order to express the integrity of space and time in a video, a plurality of lens virtual cameras can be arranged in the virtual scene; it should be noted that, different from real video cameras, the lens virtual cameras do not obstruct one another in the virtual scene, and during virtual shooting the lens virtual cameras are invisible in the virtual scene. As shown in fig. 1, each terminal device in the terminal cluster is installed with a video editing application, and each terminal device may be connected to the same virtual scene through a network, where the service server 100 may be configured to generate and manage the virtual scene. Among the terminal device 200a, the terminal device 200b, the terminal device 200c, …, and the terminal device 200n, one part may control a virtual object through the video editing application, including the action, expression and the like of the virtual object, and change the equipment of the virtual object, including a hat, clothing, a virtual weapon, and other virtual articles; another part may control a lens virtual camera through the video editing application to perform animation shooting, for example, the lens virtual camera may be controlled to move and shoot between different virtual animation characters, or different lens virtual cameras may be switched to realize shooting at different viewing angles. When the virtual object needs to change equipment in a certain scene, that is, a virtual wearing article corresponding to the virtual object needs to be changed, the corresponding terminal device can perform the corresponding rendering and drawing in the video picture.
Accordingly, in order to correctly display the virtual wearing article in the game screen or the video screen, the related terminal device or the service server 100 needs to perform abnormal rendering analysis on the virtual wearing article in the screen. Taking the terminal device 200a and the service server 100 as an example, the terminal device 200a may respond to related operations to display a picture (which may include a game picture, a video picture, and the like) after rendering and drawing a virtual wearing article, use the picture as an image to be processed, and send the image to be processed to the service server 100. The service server 100 may perform virtual object detection on the image to be processed; when a virtual object exists in the image to be processed, it may further obtain the object position information of the virtual object in the image to be processed, and perform region division on the virtual object according to the object position information to obtain at least two key part regions. It may then perform feature extraction on each key part region to obtain the picture region feature corresponding to each key part region, perform wearing detection on the at least two picture region features respectively, and finally obtain the virtual object wearing information corresponding to the image to be processed. According to the virtual object wearing information, it may be determined whether the rendering result of the virtual wearing article in the image to be processed is a normal rendering result or an abnormal rendering result, and research personnel may subsequently perform targeted test analysis on the region where an abnormal rendering result occurs.
Optionally, it may be understood that the system architecture may include a plurality of service servers, one terminal device may be connected to one service server, and each service server may obtain an image to be processed displayed by the terminal device connected to the service server, so that the virtual object wearing information may be obtained by performing virtual object detection, area division, and wearing detection on the image to be processed.
Optionally, it may be understood that each terminal device may also obtain an image to be processed, so that the virtual object wearing information may be obtained by performing virtual object detection, area division, and wearing detection on the image to be processed.
It should be noted that the above abnormal rendering analysis scheme may be applied to various scenes with virtual wearing articles, such as games, videos, instant messaging, and the like, and the embodiment of the present application only takes a game application and a video editing application as examples for relevant description.
It is understood that the method provided by the embodiment of the present application may be executed by a computer device, which includes, but is not limited to, a terminal device or a service server. The service server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud database, a cloud service, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, domain name service, security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a palm computer, a Mobile Internet Device (MID), a wearable device (e.g., a smart watch, a smart bracelet, etc.), a smart computer, etc. that may operate the application client. The terminal device and the service server may be directly or indirectly connected in a wired or wireless manner, which is not limited in this embodiment of the present application.
It should be noted that the service server may also be a node on the blockchain network. The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm, and is mainly used for sorting data according to a time sequence and encrypting the data into an account book, so that the data cannot be falsified and forged, and meanwhile, the data can be verified, stored and updated. It is understood that one or more intelligent contracts may be included in the blockchain system, and these intelligent contracts may refer to code that nodes (including common nodes) of the blockchain can understand and execute, and may execute any logic and obtain a result. A plurality of nodes may be included in a blockchain nodal system, which may correspond to blockchain networks (including but not limited to blockchain networks corresponding to federation chains), and the plurality of nodes may specifically include the aforementioned service servers.
For ease of understanding, the terminal device 200a and the service server 100 are specifically described below as an example.
Please refer to fig. 2a to fig. 2c together, which are schematic views of a scene for detecting a virtual object according to an embodiment of the present application. The implementation process of the virtual object detection scenario may be performed in the service server 100 shown in fig. 1, or in a terminal device (e.g., any one of the terminal device 200a, the terminal device 200b, the terminal device 200c, or the terminal device 200n shown in fig. 1), or by the terminal device and the service server together, which is not limited herein; the embodiment of the present application is described by taking the case where the terminal device 200a and the service server 100 perform it together as an example. As shown in fig. 2a, a developer has a binding relationship with the terminal device 200a, and a plurality of applications (for example, a game application, a video application, an instant messaging application, and the like) may be installed on the terminal device 200a. If the developer needs to test for abnormal situations when one of the applications, for example the target application A1, renders a virtual object, the terminal device 200a may respond to a trigger operation (for example, a click operation) for the target application A1 and display a default display interface corresponding to the target application A1 on its screen. If the target application A1 is a game application, the terminal device 200a may connect to the service server 100 through the application client of the target application A1 to initiate a login request; the service server 100 then initiates an identity data verification query and returns an authentication result after completing the data query. If the identity verification is passed, the service server 100 may continue to query and return account status data (for example, information such as the role, equipment, level, attributes, the scene where the last login took place, and the scene server of the last login), and may further send the account status data to the corresponding scene server. Further, after receiving the authentication result, the service server 100 establishes a connection with the corresponding scene server, and the research and development staff successfully logs in to the scene server, so that the game screen 300a shown in fig. 2a may be displayed. As shown in fig. 2a, a virtual object (which may include a virtual wearing article on the virtual object) manipulated by the developer, the virtual scene in which the virtual object is located, other virtual objects appearing near the virtual object, and the like may be displayed in the game screen 300a. Subsequently, the scene server starts to write all object behaviors into the log, and simultaneously sends the relevant data or query requests of the research and development personnel to the service server 100. It is understood that the game screen 300a is updated continuously and the virtual object is not necessarily displayed at every moment; for convenience of the following description, only the game screen 300a containing the virtual object is taken as an example.
Further, the terminal device 200a may transmit the game screen 300a as a to-be-processed image to the service server 100, and for convenience of description, the game screen 300a will be referred to as the to-be-processed image 300a in the following. As shown in fig. 2b, after receiving the image 300a to be processed, the service server 100 may perform virtual object detection on the image 300a to be processed, that is, detect whether a virtual object exists in the image 300a to be processed, where the virtual object detection may be implemented by a detection model based on a deep neural network, and the specific process may refer to step S101 in the embodiment corresponding to fig. 3. Assuming that the service server 100 may determine that the virtual object 300B exists in the image 300a to be processed through the detection model, and may obtain object position information of the virtual object 300B in the image 300a to be processed, further, in order to obtain a more accurate test result and increase the test speed, the service server 100 may perform region division on the virtual object 300B according to the object position information to obtain at least two key part regions, optionally, as shown in fig. 2B, the service server 100 may divide the virtual object 300B into 5 key part regions including a key part region B1, a key part region B2, a key part region B3, a key part region B4, and a key part region B5, and as can be seen, the key part region B1 is a whole body region, the key part region B2 is a head region, the key part region B3 is an upper body region, the key part region B4 is a lower body region, and the key part region B5 is a foot region.
It should be noted that, it can be understood that the virtual object may be a dynamic or static image such as a virtual plant, a virtual animal, a virtual building, a virtual vehicle, a virtual article, etc. besides the virtual character shown as the virtual object 300b, and therefore, research and development personnel may set different division rules and division areas according to actual needs, which is not limited in the embodiment of the present application.
Further, the service server 100 may input the 5 key part regions into the integrated detector 300c. Fig. 2c is a scene schematic diagram in which the integrated detector 300c processes all the key part regions; the integrated detector 300c is also a detection model based on a deep neural network. In the integrated detector 300c, feature extraction may be performed on the key part region B1, the key part region B2, the key part region B3, the key part region B4, and the key part region B5 respectively, so as to obtain a picture region feature C1 corresponding to the key part region B1, a picture region feature C2 corresponding to the key part region B2, a picture region feature C3 corresponding to the key part region B3, a picture region feature C4 corresponding to the key part region B4, and a picture region feature C5 corresponding to the key part region B5. Wearing detection may then be performed on the picture region feature C1, the picture region feature C2, the picture region feature C3, the picture region feature C4, and the picture region feature C5 respectively, so as to obtain the virtual object wearing information 300d corresponding to the image 300a to be processed, according to which abnormal rendering analysis may be performed on the virtual wearing articles associated with the virtual object 300b. Referring to fig. 2b again, the virtual object wearing information 300d herein may include wearing states corresponding to 6 different types of virtual wearing items, specifically, head virtual wearing items such as "hat/mask/helmet", upper body virtual wearing items such as "backpack" and "coat" (here, "backpack" and "coat" are treated as different types of virtual wearing items), lower body virtual wearing items such as "trousers/skirt", foot virtual wearing items such as "shoes", and whole body virtual wearing items such as "virtual gun", where the non-wearing state may be represented by the number "0" and the wearing state may be represented by the number "1". The virtual object wearing information 300d obtained through the above process may indicate, for example, that the virtual object 300b wears a backpack and shoes; by comparing this with the actual wearing state of the virtual object 300b in the image 300a to be processed, it can be determined whether the rendering result of each virtual wearing article associated with the virtual object 300b is a normal rendering result or an abnormal rendering result.
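For ease of understanding, the layout of the virtual object wearing information 300d described above might be represented as follows; the 0/1 values are made-up examples, and the six item categories mirror the example of fig. 2b.

```python
# Illustrative representation of the virtual object wearing information 300d;
# the values shown here are examples only, not detection results from the application.
wearing_info_300d = {
    "hat/mask/helmet": 0,   # head item not detected
    "backpack": 1,          # upper-body item detected
    "coat": 0,
    "trousers/skirt": 0,
    "shoes": 1,             # foot item detected
    "virtual gun": 0,       # whole-body item not detected
}
# Comparing each entry with the actual wearing state of virtual object 300b reveals
# which virtual wearing article, if any, was rendered abnormally.
```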
The above abnormal rendering detection process for the virtual wearing article may also be executed by the terminal device 200a, which is only described by taking the service server 100 as an example, and the embodiment of the present application does not limit this. It should be noted that the above abnormal rendering detection scheme may be applied to various scenes with virtual wearing articles, such as games, videos, instant messaging, and the like, and the embodiment of the present application is only described by taking a game application as an example, and the detection process in other scenes is consistent with the above described process, and is not described here again.
The detection model involved in the above process may be obtained by the service server 100 (or the terminal device 200 a) training an initial detection model by using a video database with a large amount of game videos, and the specific process may refer to the following embodiments corresponding to fig. 10 and fig. 15.
As can be seen from the above, the embodiment of the present application does not rely on a manual screening scheme that consumes time and labor, but directly determines whether a virtual wearing article is rendered abnormally based on a deep neural network. That is, the embodiment of the present application provides a resource (including virtual wearing article) abnormal rendering detection scheme: virtual object detection is performed on the acquired image to be processed; when a virtual object is detected in the image to be processed, the object position information of the virtual object in the image to be processed can be further acquired, and the virtual object is subjected to region division according to the object position information to obtain at least two key part regions; then feature extraction can be performed on each key part region to obtain the picture region feature corresponding to each key part region, and wearing detection can be performed on the at least two picture region features, so as to finally obtain the virtual object wearing information corresponding to the image to be processed. Because the embodiment of the present application can automatically detect whether the virtual wearing article in each region of the virtual object in the image to be processed is rendered abnormally, it can satisfy the abnormal rendering detection of virtual objects in all kinds of products, save labor cost, accelerate the test flow, and improve the efficiency and accuracy of detecting the virtual object.
Referring to fig. 3, fig. 3 is a schematic flowchart of a virtual object detection method according to an embodiment of the present disclosure. The virtual object detection method may be executed by a computer device, and the computer device may include a terminal device or a service server as described in fig. 1. As shown in fig. 3, the virtual object detection method may include at least the following steps S101 to S103:
step S101, acquiring an image to be processed, and performing virtual object detection on the image to be processed;
specifically, the computer device may first obtain an image to be processed, then input the image to be processed into a pre-trained object detection network, and perform feature extraction on the image to be processed in the object detection network to obtain a picture feature matrix. In a general case, at least two detection frames may be generated according to the picture feature matrix, where a detection frame may also be referred to as a bounding box and is used to locate the position of the target detection object (a virtual object in the embodiment of the present application) in the image to be processed. Non-Maximum Suppression (NMS) may then be performed on the at least two detection frames to obtain a detection frame to be processed; if the detection frame to be processed contains a virtual object, it may be determined that a virtual object exists in the image to be processed; otherwise, it may be determined that no virtual object exists in the image to be processed. The image to be processed refers to one or more images that need to be subjected to abnormal rendering detection, and includes but is not limited to game picture images and video picture images. The virtual object refers to the target object to be detected in the image to be processed, and includes, but is not limited to, dynamic or static objects such as a virtual character, a virtual plant, a virtual animal, a virtual building, a virtual vehicle, and a virtual article.
In the virtual object detection process, the object detection network based on deep learning includes two subtasks: object classification and object positioning. In the embodiment of the present application, detection is mainly performed on virtual objects, so the object classification subtask mainly detects objects of the virtual-object type in the image to be processed, while the object positioning subtask needs to predict the position of the virtual object in the image to be processed. Optionally, the object positioning process can predict not only a detection frame of the virtual object in the image to be processed, but also a confidence (confidence score) for each detection frame. The confidence carries two pieces of information: one is the probability that the detection frame contains the virtual object, and the other is the accuracy of the detection frame; therefore, the confidence may be regarded as the probability of whether the detection frame contains the virtual object. Since a plurality of detection frames may be generated in the virtual object detection process to frame objects that may be virtual objects in the image to be processed, the detection frames with a confidence lower than a preset confidence threshold (for example, 0.3) may be ignored first, and then non-maximum suppression processing may be performed on the remaining detection frames. The purpose of the non-maximum suppression processing is to remove redundant, repeated detection frames and extract the independent detection frame with the maximum confidence from the predicted detection frames as the detection result, that is, the detection frame to be processed. Finally, the detection result is mapped to the detection frame position information of the detection frame to be processed in the image to be processed (for example, the pixel coordinates of the detection frame to be processed in the image to be processed).
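For illustration only, the confidence filtering and non-maximum suppression described above can be sketched in Python as follows; this is a minimal sketch rather than the detection network itself, and the IoU criterion, the 0.5 overlap threshold, and the box format are assumptions (only the 0.3 confidence threshold comes from the example above).

```python
def iou(box_a, box_b):
    # Boxes are assumed to be [x1, y1, x2, y2] in pixel coordinates of the image to be processed.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, conf_thresh=0.3, iou_thresh=0.5):
    """Ignore low-confidence frames, then suppress redundant overlapping frames."""
    candidates = [i for i, s in enumerate(scores) if s >= conf_thresh]
    candidates.sort(key=lambda i: scores[i], reverse=True)   # highest confidence first
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept   # indices of independent detection frames; kept[0] is the detection frame to be processed
```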
It should be noted that the object detection network may adopt either a one-stage object detection algorithm or a two-stage object detection algorithm. A two-stage object detection algorithm, for example the R-CNN series (Region-CNN, an object detection technique implemented based on algorithms such as convolutional neural networks, linear regression, and support vector machines), may first generate preselected frames (region proposals) that may contain the object to be detected, and then perform fine-grained object detection on each preselected frame. One-stage object detection algorithms such as YOLO (a single-neural-network-based object detection system), SSD (Single Shot MultiBox Detector, a method that realizes object detection and recognition by using a single deep neural network model), and SqueezeDet (a fully convolutional neural network for real-time object detection) can directly extract features in the network to predict object classification and positions, that is, all detection frames can be predicted at one time. Which algorithm is used may be selected according to actual needs, which is not limited in the embodiment of the present application.
Please refer to fig. 4, which is a schematic flowchart of a virtual object detection process according to an embodiment of the present disclosure. In an optional implementation manner, considering that there is a certain requirement on the detection time, the lightweight network MobileNetV2-SSDLite, which has a smaller model and a faster speed, may be selected as the object detection network to perform the virtual object detection. As shown in fig. 4, in the object detection network, the lightweight network MobileNetV2 may be used to replace the VGG part in the SSD network and serve as the basic convolutional layers to extract the low-scale picture features corresponding to the image to be processed; the low-scale picture features finally output by the basic convolutional layers may then be input into the auxiliary convolutional layers, the high-scale picture features may be extracted by the auxiliary convolutional layers, and the position information and classification information of each point in the high-scale picture features may be predicted by the prediction convolutional layers, so that target positioning and classification, that is, the virtual object detection, can finally be achieved. The standard convolutions of the object detection network can be replaced with depthwise separable convolutions so as to reduce the number of parameters and the computation cost. Optionally, other versions of MobileNet or networks such as ShuffleNet (a lightweight convolutional neural network) may also be used as the basic convolutional layers, and it can be understood that the auxiliary convolutional layers may be designed according to actual needs.
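A minimal TensorFlow/Keras sketch of such a MobileNetV2-plus-SSD-style arrangement is given below for orientation only; the input size, the number and widths of the auxiliary convolutional layers, and the anchor count are assumptions and do not reproduce the exact MobileNetV2-SSDLite definition.

```python
import tensorflow as tf

NUM_CLASSES = 2          # assumed: background vs. virtual object
ANCHORS_PER_CELL = 6     # assumed anchor count per feature-map cell

def build_detector(input_shape=(300, 300, 3)):
    # MobileNetV2 replaces the VGG backbone of the original SSD as the basic convolutional layers.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    features = [base.output]                       # low-scale picture feature
    x = base.output
    for filters in (512, 256, 256):                # auxiliary convolutional layers (assumed sizes)
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same",
                                   activation="relu")(x)
        features.append(x)                         # high-scale picture features
    outputs = []
    for f in features:                             # prediction convolutional layers
        cls = tf.keras.layers.Conv2D(ANCHORS_PER_CELL * NUM_CLASSES, 3, padding="same")(f)
        loc = tf.keras.layers.Conv2D(ANCHORS_PER_CELL * 4, 3, padding="same")(f)
        outputs.extend([cls, loc])                 # classification + localization per scale
    return tf.keras.Model(inputs=base.input, outputs=outputs)
```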
Step S102, if a virtual object exists in the image to be processed, object position information of the virtual object in the image to be processed is obtained, and the virtual object is subjected to region division according to the object position information to obtain at least two key part regions;
specifically, through the virtual object detection in step S101, the computer device may determine whether a virtual object exists in the image to be processed. If a virtual object exists in the image to be processed, in order to avoid important information of the virtual object not being framed by the detection frame to be processed (for example, a part of the hat or mask worn by the virtual object extends beyond the detection frame), the detection frame to be processed associated with the virtual object may be subjected to region expansion to obtain the target detection frame. In the embodiment of the present application, the detection frames to be processed are all rectangular detection frames, so the region expansion means expanding the detection frame to be processed by a certain proportion (for example, 0.02) in the specified directions, and it can be understood that the side lengths of the target detection frame are greater than the side lengths of the detection frame to be processed. Please refer to fig. 5, which is a scene diagram of region expansion according to an embodiment of the present application. As shown in fig. 5, it is assumed that a virtual object 400b exists in the image 400a to be processed, and a detection frame 400c to be processed can be obtained through virtual object detection; it can be seen from the figure that the detection frame 400c to be processed does not completely frame the mask 400d worn by the virtual object 400b, so the detection frame 400c to be processed needs to be expanded in area, so as to obtain a target detection frame 400e that can completely frame the mask 400d and the virtual object 400b.
Further, the computer device may acquire the detection frame position information of the target detection frame in the image to be processed and determine the detection frame position information as the object position information of the virtual object in the image to be processed; it may further acquire the region division ratios between at least two key part reference regions and the virtual object reference region, generate at least two sets of region coordinates according to the region division ratios and the object position information, and perform region division on the target detection frame according to the at least two sets of region coordinates, thereby obtaining the at least two key part regions. Each key part reference region is located in the virtual object reference region, and the region division ratio between a key part reference region and the virtual object reference region can be obtained by statistics over a large amount of picture sample data. In an optional implementation manner, when the virtual object is a virtual character, the region division ratios may be set as: head region: [0.21, 0, 0.77, 0.25], upper body region: [0.15, 0.1797, 0.85, 0.4537], lower body region: [0, 0.4483, 1, 0.8954], foot region: [0, 0.8758, 1, 1], whole body region: [0, 0, 1, 1]. That is, the region division ratio may take the form [x1, y1, x2, y2], where (x1, y1) and (x2, y2) respectively represent the position information of two points located on the target detection frame at diagonal positions; correspondingly, each set of region coordinates may also represent the specific position of the corresponding key part region in the image to be processed in a similar form. For example, see points A and B shown in fig. 5, which are two points located at diagonal positions on the target detection frame 400e and can be used to represent the whole body region of the virtual object 400b. Referring again to fig. 2b, in the scene shown in fig. 2b, the computer device may divide the virtual object 300b into a whole body region B1, a head region B2, an upper body region B3, a lower body region B4 and a foot region B5 according to the above region division ratios and the object position information of the virtual object 300b in the image 300a to be processed.
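The region expansion and the ratio-based division can be illustrated with the following sketch; the 0.02 expansion ratio and the division ratios are taken from the example values above (the foot and whole-body entries are completed on the assumption that they follow the same [x1, y1, x2, y2] form), and the clamping to the image borders is an added assumption.

```python
# Division ratios in [x1, y1, x2, y2] form, relative to the (expanded) target detection frame.
REGION_RATIOS = {
    "head":       [0.21, 0.0,    0.77, 0.25],
    "upper_body": [0.15, 0.1797, 0.85, 0.4537],
    "lower_body": [0.0,  0.4483, 1.0,  0.8954],
    "foot":       [0.0,  0.8758, 1.0,  1.0],
    "whole_body": [0.0,  0.0,    1.0,  1.0],
}

def expand_box(box, img_w, img_h, ratio=0.02):
    """Expand the detection frame to be processed so worn items near the border are kept."""
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return [max(0, x1 - dw), max(0, y1 - dh), min(img_w, x2 + dw), min(img_h, y2 + dh)]

def split_key_regions(target_box):
    """Generate one set of region coordinates per key part region."""
    x1, y1, x2, y2 = target_box
    w, h = x2 - x1, y2 - y1
    return {name: [x1 + r[0] * w, y1 + r[1] * h, x1 + r[2] * w, y1 + r[3] * h]
            for name, r in REGION_RATIOS.items()}
```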
It can be understood that the purpose of the region division is mainly to conveniently detect whether the virtual wearing article is abnormally rendered or not in the following process. In addition, the area division ratio indicates the relative proportion of each key part area to the target detection frame, so that the area division ratio is not fixed, can be adjusted according to actual conditions, and can also be divided into different key part areas according to different forms of virtual objects.
Step S103, respectively extracting the characteristics of at least two key part areas to obtain the picture area characteristics corresponding to each key part area, and respectively carrying out wearing detection on the characteristics of at least two picture areas to obtain the virtual object wearing information corresponding to the image to be processed; the virtual object wearing information is used for conducting abnormal rendering analysis on the virtual wearing article associated with the virtual object.
Specifically, please refer to fig. 6, which is a schematic structural diagram of an object classification network according to an embodiment of the present application. As shown in fig. 6, the computer device may input the obtained at least two key part regions into a pre-trained object classification network and perform feature extraction on the at least two key part regions in the object classification network to obtain the picture region features corresponding to each key part region. Further, a global average pooling layer (Global Average Pooling) in the object classification network may perform global average pooling processing on the at least two picture region features to obtain at least two initial category features, the fully connected layers in the object classification network may then perform feature integration on the at least two initial category features to obtain at least two target category features, and finally the output layer in the object classification network may output the classification probability corresponding to each target category feature. The predicted wearing state corresponding to each key part region is determined according to the at least two classification probabilities, and whether the virtual wearing article in the corresponding key part region is rendered normally is determined according to the predicted wearing state, where a virtual wearing article may refer to an article worn or held by the virtual object.
The object classification network may specifically be a classification network based on MobileNetV2 (a lightweight basic network for object classification) or on ResNet (a residual neural network). In an optional implementation, the MobileNetV2 network may be used as the basic network to extract the picture region features corresponding to the key part regions, since the MobileNetV2 network has a high prediction speed, produces a small model, and is convenient to deploy. The global average pooling layer can be used for down-sampling, that is, it structurally regularizes the whole object classification network and prevents overfitting. As shown in fig. 6, the fully connected layers may include fully connected layer 1 (Hidden 1), fully connected layer 2 (Hidden 2), and fully connected layer 3 (Hidden 3), and a random deactivation layer (Dropout layer) may be inserted between fully connected layer 1 and fully connected layer 2 to prevent overfitting. The output layer can perform normalization using the softmax function (normalized exponential function) so as to obtain the classification probabilities. Assuming that feature extraction is performed on the input key part region to obtain a picture region feature with a dimension of 7 × 128, an intermediate category feature with a dimension of 100 may be output through fully connected layer 1, and a target category feature with a dimension of 2 may be output through fully connected layer 2 and fully connected layer 3.
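For reference, a hedged TensorFlow/Keras sketch of such a classification network is given below; the input resolution, the dropout rate, the width of the second fully connected layer, and the backbone output dimensions are assumptions, while the global average pooling layer, the three fully connected layers, the dropout layer between the first two, and the 2-dimensional softmax output follow the structure described above.

```python
import tensorflow as tf

def build_wear_classifier(input_shape=(224, 224, 3)):
    # MobileNetV2 backbone used as the basic feature-extraction network.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)   # initial category feature
    x = tf.keras.layers.Dense(100, activation="relu", name="hidden1")(x)
    x = tf.keras.layers.Dropout(0.5)(x)                         # random deactivation layer (rate assumed)
    x = tf.keras.layers.Dense(100, activation="relu", name="hidden2")(x)  # width assumed
    x = tf.keras.layers.Dense(2, name="hidden3")(x)             # target category feature (dimension 2)
    out = tf.keras.layers.Softmax()(x)                          # classification probability
    return tf.keras.Model(base.input, out)
```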
The specific process of determining the virtual object wearing information corresponding to the image to be processed according to the classification probabilities is as follows: assuming that the at least two target category features include a target category feature Ti corresponding to a key part region Si, where i is a positive integer and i is less than or equal to the number of key part regions, the computer device may output, through the output layer in the object classification network, the classification probability corresponding to the target category feature Ti, and determine the worn state or the unworn state indicated by the classification probability corresponding to the target category feature Ti as the predicted wearing state corresponding to the key part region Si. The worn state refers to the state indicated when the classification probability corresponding to the target category feature Ti is greater than a worn probability threshold, the unworn state refers to the state indicated when the classification probability corresponding to the target category feature Ti is greater than an unworn probability threshold, and the worn probability threshold and the unworn probability threshold may be preset as needed. The judgment process of the predicted wearing state corresponding to the other key part regions is consistent with that of the key part region Si and is not described in detail here. Finally, the predicted wearing states respectively corresponding to each key part region can be jointly determined as the virtual object wearing information corresponding to the image to be processed. Further, the computer device may acquire the actual wearing state of the virtual object in the key part region Si with respect to the virtual wearing article; if the actual wearing state is different from the predicted wearing state corresponding to the key part region Si, it can be determined that the rendering result for the virtual wearing article in the key part region Si is an abnormal rendering result; if the actual wearing state is the same as the predicted wearing state corresponding to the key part region Si, it can be determined that the rendering result for the virtual wearing article in the key part region Si is a normal rendering result. For example, referring again to the scene shown in fig. 2B, if the comprehensive detector 300c finally outputs that the predicted wearing state corresponding to the foot region B5 is the worn state, which is the same as the actual wearing state of the foot region B5, the rendering result for the virtual wearing article "shoes" in the foot region B5 is a normal rendering result; if the predicted wearing state corresponding to the foot region B5 is the unworn state, which is not the same as the actual wearing state of the foot region B5, the rendering result for the virtual wearing article "shoes" in the foot region B5 is an abnormal rendering result (that is, the virtual wearing article "shoes" is not correctly displayed in the foot region B5).
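The decision logic described in this paragraph can be sketched as follows; the 0.5 threshold values and the two-element probability layout are assumptions.

```python
WORN_THRESHOLD = 0.5      # assumed worn probability threshold
UNWORN_THRESHOLD = 0.5    # assumed unworn probability threshold

def predict_wearing_state(probabilities):
    """probabilities: softmax output [p_unworn, p_worn] for one key part region."""
    p_unworn, p_worn = probabilities
    return "worn" if p_worn > WORN_THRESHOLD else "unworn"

def rendering_result(predicted_state, actual_state):
    """Compare the predicted and actual wearing states of one key part region."""
    return "normal rendering" if predicted_state == actual_state else "abnormal rendering"

# Example: the foot region is actually wearing shoes, but the detector predicts 'unworn'.
print(rendering_result(predict_wearing_state([0.9, 0.1]), "worn"))  # -> abnormal rendering
```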
Please refer to fig. 7, which is a flowchart illustrating a virtual object detection method according to an embodiment of the present disclosure. As shown in fig. 7, for the scene in which the virtual object is a virtual character, virtual object detection is performed on the image to be processed through the above step S101; when no virtual object exists in the image to be processed, no subsequent processing is required, and the process ends. When a virtual object exists in the image to be processed, the virtual object is divided into 5 parts in the above step S102. Optionally, in combination with step S102, the virtual object can be divided into a head region for detecting virtual wearing articles such as helmets, hats, and masks, an upper body region for detecting virtual wearing articles such as backpacks and jackets (e.g., coats) (here, backpacks and jackets can be different types of virtual wearing articles), a lower body region (also referred to as a leg region) for detecting virtual wearing articles such as trousers and skirts, a foot region for detecting virtual wearing articles such as shoes, and a whole body region for detecting virtual wearing articles such as virtual guns. Further, the different types of key part regions are input into a comprehensive detector consisting of object classification networks for classification detection, the predicted wearing state corresponding to each key part region can be obtained, and the comprehensive detection results are then output together. It should be noted that one object classification network may only perform classification detection on one type of virtual wearing article; for example, at this time, the virtual wearing articles may be classified into 6 different types, and object classification networks with the same structure may be used for the different types of virtual wearing articles. That is, 6 object classification networks of the structure shown in fig. 6 may be combined together to form the comprehensive detector, and each key part region is input into the corresponding object classification network in the comprehensive detector according to the region type, so that it can be detected whether the six types of virtual wearing articles, namely helmet/hat/mask, backpack, jacket, trousers/skirt/shorts, shoes, and virtual gun, exist respectively, and finally the detection results are output together through the comprehensive detector.
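A minimal sketch of how such a comprehensive detector might route each key part region to its per-article classification networks is given below; the model objects, the dictionary layout, and the 0.5 threshold are assumptions.

```python
# Assumed layout: each region type maps to the classification networks responsible for it,
# so six networks cover five key part regions (the upper body region carries two article types).
# detectors = {
#     "head":       [("hat/helmet/mask", head_model)],
#     "upper_body": [("backpack", backpack_model), ("jacket", jacket_model)],
#     "lower_body": [("trousers/skirt/shorts", lower_model)],
#     "foot":       [("shoes", shoe_model)],
#     "whole_body": [("virtual gun", gun_model)],
# }

def detect_wearing(key_regions, detectors):
    """key_regions: {region_name: image array}; returns {article: 1 (worn) or 0 (not worn)}."""
    wearing_info = {}
    for region_name, crop in key_regions.items():
        for article, model in detectors.get(region_name, []):
            p_unworn, p_worn = model.predict(crop[None, ...])[0]   # Keras-style classifier output
            wearing_info[article] = 1 if p_worn > 0.5 else 0
    return wearing_info
```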
The embodiment of the present application provides a virtual resource (including virtual wearing article) abnormal rendering detection scheme: virtual object detection is performed on the obtained image to be processed; when a virtual object is detected in the image to be processed, the object position information of the virtual object in the image to be processed can be further obtained, and the virtual object is divided into regions according to the object position information to obtain at least two key part regions; then feature extraction can be performed on each key part region to obtain the picture region features respectively corresponding to each key part region, wearing detection is performed on the at least two picture region features, and finally the virtual object wearing information corresponding to the image to be processed can be obtained. The embodiment of the present application does not depend on a time-consuming and labor-consuming manual screening scheme, but directly judges whether the rendering of a virtual wearing article is abnormal based on a deep neural network; that is, the embodiment of the present application can quickly and automatically detect whether the virtual wearing article in each region of the virtual object in the image to be processed is rendered abnormally, so it can satisfy the abnormal rendering detection requirements of virtual objects in various products, save labor cost, speed up the testing process, and, since the network model is small and easy to deploy, improve the efficiency and accuracy of detecting the virtual object.
Referring to fig. 8, fig. 8 is a schematic flowchart of a virtual object detection method according to an embodiment of the present disclosure. The virtual object detection method may be executed by a computer device, and the computer device may include a terminal device or a service server as described in fig. 1. As shown in fig. 8, the virtual object detection method may include at least the following steps S201 to S205:
step S201, acquiring an image to be processed, and performing virtual object detection on the image to be processed;
the specific execution process of this step may refer to step S101 in the embodiment corresponding to fig. 3, which is not described herein again.
Step S202, if a virtual object exists in the image to be processed, object position information of the virtual object in the image to be processed is obtained, and the virtual object is subjected to region division according to the object position information to obtain at least two key part regions;
the specific execution process of this step may refer to step S102 in the embodiment corresponding to fig. 3, which is not described herein again.
Step S203, respectively extracting characteristics of at least two key part areas to obtain picture area characteristics corresponding to each key part area, respectively performing wearing detection on the at least two picture area characteristics to obtain virtual object wearing information corresponding to the image to be processed;
the specific implementation process of this step may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
Step S204, identifying the service scene type in the image to be processed, and acquiring a template image corresponding to the service scene type; the template image includes a reference virtual object;
specifically, the computer device may first obtain a scene identification configuration file, where the scene identification configuration file may be preconfigured according to service requirements and contains the association relationships between different scene elements and service scene types. For example, for a game scene, the scene elements may include, but are not limited to, game characters (including teammates and enemies), monsters, buildings, vehicles, props, blood bars, skill states, numbers, and icons, and the service scene types may include, but are not limited to, a room scene type, a desert scene type, a forest scene type, an island scene type, and a high-altitude scene type. Further, service scene recognition may be performed on the image to be processed to obtain the pixel coordinates and element categories of the scene elements to be detected in the image to be processed; the scene elements to be detected in the image to be processed may be recognized by using various algorithms such as template matching, gradient template matching, feature point matching, target detection, and deep neural networks, and the recognition result may be output externally. Then, matching search can be carried out in the scene identification configuration file according to the obtained pixel coordinates and element categories, the scene element in the scene identification configuration file that matches the pixel coordinates and element category is determined as the target scene element, the service scene type having an association relation with the target scene element can further be determined as the service scene type of the image to be processed, and then the template image corresponding to the service scene type can be obtained, where the template image contains the reference virtual object.
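A hedged sketch of the configuration-file lookup described above follows; the element names, scene types, areas, and matching rule are illustrative assumptions only.

```python
# Assumed structure of a scene identification configuration file:
# each entry associates a scene element (by category and a rough pixel area) with a scene type.
SCENE_CONFIG = [
    {"element": "palm_tree_icon", "area": (0, 0, 400, 300),    "scene_type": "island"},
    {"element": "sand_texture",   "area": (0, 300, 1280, 720), "scene_type": "desert"},
]

def match_scene_type(detected_elements):
    """detected_elements: list of (element_category, (x, y)) from service scene recognition."""
    for category, (x, y) in detected_elements:
        for entry in SCENE_CONFIG:
            x1, y1, x2, y2 = entry["area"]
            if category == entry["element"] and x1 <= x <= x2 and y1 <= y <= y2:
                return entry["scene_type"]        # target scene element found
    return None                                    # no matching scene element
```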
Step S205, obtaining reference position information of the reference virtual object in the template image, and determining a rendering result of the virtual object in the image to be processed according to the reference position information and the object position information.
Specifically, the computer device may obtain reference position information of the reference virtual object in the template image, and optionally, the reference position information may be represented by using a position scale of the reference virtual object in the template image. Further, a position offset may be generated according to the reference position information and the object position information of the virtual object obtained in step S202, and if the position offset is greater than a preset offset threshold, it may be determined that the virtual object is at an abnormal position in the image to be processed, so as to determine that a rendering result of the virtual object in the image to be processed is an abnormal rendering result; if the position offset is smaller than or equal to the offset threshold, it may be determined that the virtual object is at a normal position in the image to be processed, so as to determine that a rendering result of the virtual object in the image to be processed is a normal rendering result.
Please refer to fig. 9 a-9 b together, which are schematic diagrams of a position deviation detection scenario provided in an embodiment of the present application. As shown in fig. 9a, a virtual object 501a exists in the image to be processed 500a, a detection frame to be processed 502a containing the virtual object 501a can be obtained through steps S201-S202, and the detection frame position information of the detection frame to be processed 502a in the image to be processed 500a can be used as the object position information X of the virtual object 501 a. It is assumed that the service scene type of the image 500a to be processed is identified as a field scene type by performing service scene identification, and a template image 500b corresponding to the field scene type can be acquired. Generally, in the same service scene, the position of the virtual object under normal conditions is relatively fixed, as shown in fig. 9b, a reference virtual object 501b exists in the template image 500b, reference position information Y corresponding to an area 502b where the reference virtual object 501b is located can be further obtained, the object position information X is compared with the reference position information Y, it can be found that the two have position offset, and the calculated position offset is greater than an offset threshold, so that it can be determined that the virtual object 501a is in an abnormal position in the image to be processed 500 a. Optionally, only the reference position information or the position offset may be output, and a developer may determine whether the virtual object has a position offset.
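As a hedged sketch of the offset comparison between the object position information X and the reference position information Y: the center-distance definition of the offset and the 0.1 threshold below are assumptions; the description above only states that an offset exceeding a preset threshold indicates an abnormal position.

```python
def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def position_offset(object_box, reference_box, img_w, img_h):
    """Normalized distance between the detected and reference virtual-object centers."""
    (cx, cy), (rx, ry) = center(object_box), center(reference_box)
    return (((cx - rx) / img_w) ** 2 + ((cy - ry) / img_h) ** 2) ** 0.5

OFFSET_THRESHOLD = 0.1   # assumed offset threshold

def position_rendering_result(object_box, reference_box, img_w, img_h):
    offset = position_offset(object_box, reference_box, img_w, img_h)
    return "abnormal rendering" if offset > OFFSET_THRESHOLD else "normal rendering"
```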
It should be noted that step S204 and step S205 may be executed in parallel with step S203; that is, regardless of whether the rendering result of the virtual wearing article is an abnormal rendering result or a normal rendering result, the position deviation detection in step S204 and step S205 may be performed on the virtual object. Alternatively, step S204 and step S205 may be executed only when the rendering result of the virtual wearing article is a normal rendering result, in which case there is a sequential order between step S203 and step S204.
It is understood that the rendering result for the virtual object position obtained in step S205 and the virtual object wearing information obtained in step S203 may be used together to perform abnormal rendering analysis on the virtual resource (including the virtual object position and the virtual wearing article).
The embodiment of the present application provides a virtual resource abnormal rendering detection scheme: virtual object detection is performed on the obtained image to be processed through an object detection network, the virtual object is divided into regions according to the detection result, and wearing detection is performed on each key part region through the object classification networks, so that the virtual object wearing information corresponding to the image to be processed is obtained and it can be determined whether the virtual wearing article in each key part region is rendered normally. In addition, the service scene type of the image to be processed can be obtained by performing service scene recognition on the image to be processed, the template image containing the reference virtual object is obtained according to the service scene type, and the position of the reference virtual object in the template image can then be compared with the position of the virtual object in the image to be processed, so that whether the virtual object has a position deviation in the image to be processed can be detected quickly and accurately. Therefore, the embodiment of the present application does not depend on a time-consuming and labor-consuming manual screening scheme, but directly judges whether the virtual wearing article and the virtual object are rendered abnormally based on deep neural networks; that is, problems in the testing process such as virtual object position deviation and abnormal rendering of virtual wearing articles can be detected quickly and automatically, so the abnormal rendering detection requirements of virtual objects in various products can be satisfied, labor cost is saved, the testing process is accelerated, the network models are small and easy to deploy, and the efficiency and accuracy of virtual object detection can be improved.
Please refer to fig. 10, which is a flowchart illustrating a virtual object detection method according to an embodiment of the present disclosure. The virtual object detection method may be executed by a computer device, and the computer device may include a terminal device or a service server as described in fig. 1. As shown in fig. 10, the virtual object detection method may include at least the following steps S301 to S304:
step S301, acquiring a sample image, and performing virtual object detection on the sample image;
specifically, please refer to fig. 11, which is a flowchart illustrating a training object classification network according to an embodiment of the present application. As shown in fig. 11, the computer device may obtain video sample data including a sample virtual object, decode the video sample data to obtain a plurality of continuous video frames, and further perform frame extraction processing on the plurality of video frames to obtain a sample image (that is, the image set in fig. 11 may include a plurality of images), where the computer device may perform frame extraction according to a set extraction interval, or may perform extraction randomly, which is not limited in this embodiment of the present application. The video sample data may be different video data for different application scenes, and may be game video data or animation video data, for example. It is to be appreciated that embodiments of the present application also support the direct collection of image data as sample images for model training and testing.
Further, the computer device may input the obtained sample image into a pre-trained object detection network for virtual object detection, in the object detection network, feature extraction may be performed on the sample image to obtain a picture feature matrix, at least two detection frames may be generated according to the picture feature matrix, and then non-maximum suppression processing may be performed on the at least two detection frames to obtain a detection frame to be processed, if the detection frame to be processed contains the sample virtual object, it may be determined that the sample virtual object exists in the sample image, and at the same time, detection frame position information of the detection frame to be processed in the sample image may be obtained; otherwise, it may be determined that the sample virtual object does not exist in the sample image. The procedure of the non-maximum suppression processing can be referred to in step S101 in the embodiment corresponding to fig. 3.
The training process of the object detection network may refer to the following embodiment corresponding to fig. 14.
Step S302, if a sample virtual object exists in a sample image, obtaining prediction object position information and a virtual object wearing label of the sample virtual object in the sample image, and performing region division on the sample virtual object according to the prediction object position information to obtain at least two key part regions;
specifically, if a sample virtual object exists in the sample image, the computer device may perform region expansion on the detection frame to be processed to obtain a target detection frame, and may further obtain detection frame position information of the target detection frame in the sample image, and determine the detection frame position information as prediction object position information of the sample virtual object in the sample image, and may further obtain a region division ratio between at least two key region reference regions and a virtual object reference region, and may generate at least two groups of region coordinates according to the region division ratio and the prediction object position information, and perform region division on the target detection frame according to the at least two groups of region coordinates, and may obtain at least two key region regions. The specific process of region expansion and region division may refer to step S102 in the embodiment corresponding to fig. 3. In addition, the computer device may obtain a virtual object wearing label corresponding to the sample image, where the virtual object wearing label may be obtained by performing a wearing label on the sample image in advance (for example, label "with virtual wearing article Q" or "without virtual wearing article Q").
Referring to fig. 11 again, through the above processing, data sets corresponding to different types of key region regions (for example, a data set of a head region, a data set of an upper body region, and the like) may be obtained, and then, the data sets may be divided into a training set and a test set, where the training set is used to estimate parameters in a model, and the test set is used to verify generalization performance of the model.
Step S303, inputting at least two key part areas into an initial object classification network, respectively extracting features of the at least two key part areas in the initial object classification network to obtain picture area features respectively corresponding to each key part area, and respectively performing wearing detection on the at least two picture area features to obtain predicted virtual object wearing information corresponding to the sample image;
specifically, the data sets obtained in step S302 may be further divided into more detailed data sets according to whether the virtual wearing article is present. Please refer to fig. 12, which is a schematic diagram of sample images of different regions provided in the embodiment of the present application. As shown in fig. 12, taking the upper body region sample images corresponding to the upper body region and the head region sample images corresponding to the head region as examples, the upper body region sample images are divided into upper body region sample image set A and upper body region sample image set B, where the upper body region sample image set A may include sample image A1, sample image A2, sample image A3, …, sample image Am, and the virtual wearing article "backpack" exists in each sample image in the upper body region sample image set A; correspondingly, the upper body region sample image set B may include sample image B1, sample image B2, sample image B3, …, sample image Bn, and the virtual wearing article "backpack" does not exist in any sample image in the upper body region sample image set B. Similarly, the head region sample images may be divided into head region sample image set C and head region sample image set D, where the head region sample image set C may include sample image C1, sample image C2, sample image C3, …, sample image Ci, and each sample image in the head region sample image set C has the virtual wearing article "helmet/hat/mask", while the head region sample image set D may include sample image D1, sample image D2, sample image D3, …, sample image Dj, and each sample image in the head region sample image set D has no virtual wearing article "helmet/hat/mask". The division of the sample images for the other regions is consistent with the above process and will not be described here again.
Further, the computer device may input the divided at least two region sample images into the initial object classification network. In the initial object classification network, feature extraction may be performed on the at least two region sample images respectively to obtain the picture region features corresponding to each region sample image; the global average pooling layer in the initial object classification network may then perform global average pooling processing on the at least two picture region features respectively to obtain at least two initial category features; feature integration may then be performed on the at least two initial category features respectively by the fully connected layers in the initial object classification network to obtain at least two target category features; further, the classification probability corresponding to each target category feature may be output by the output layer in the initial object classification network, and the predicted sub-label corresponding to each region sample image is then determined according to the at least two classification probabilities (that is, it is determined whether each region sample image is in the worn state or the unworn state). Finally, the at least two predicted sub-labels may be jointly determined as the predicted virtual object wearing information corresponding to the sample images; the specific determination process may refer to step S103 in the embodiment corresponding to fig. 3, and it can be understood that the worn probability threshold and the unworn probability threshold required in the wearing detection process may be set in advance as needed. The specific structure of the object classification network may refer to the structural schematic diagram shown in fig. 6, and in the embodiment of the present application, the MobileNetV2 network may be used as the basic network to perform feature extraction.
In an optional implementation manner, when training the initial object classification network, a developer may first set the parameters of the network and the related configuration items, and then train the initial object classification network under the TensorFlow framework (an open-source deep learning computation framework released by Google that implements various deep learning algorithms well and is applied to a series of technologies such as natural language processing, machine translation, image description, and image classification). If a pre-training model based on MobileNetV2 (i.e., the initial object classification network) is used, the learning rate may be set to 0.0001 and the batch size (the number of samples fed into the network in each training step) to 32, and training may be stopped after iterating for 32 epochs (one epoch being the process in which all sample data are sent into the network to complete one forward computation and backward propagation). It can be understood that the related settings mentioned here can be adjusted according to the actual situation.
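A hedged training sketch under the TensorFlow/Keras assumption follows; the learning rate, batch size, and 32-epoch budget come from the description above, while the optimizer choice, the loss name, and the dataset objects are assumptions (build_wear_classifier refers to the classification-network sketch given earlier).

```python
import tensorflow as tf

model = build_wear_classifier()                    # classification network from the earlier sketch
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),   # learning rate 0.0001
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

# train_ds / val_ds are assumed tf.data.Dataset objects built from the per-region sample images,
# already batched with batch_size=32.
model.fit(train_ds, validation_data=val_ds, epochs=32)
```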
Step S304, generating a first target loss function according to the virtual object wearing label and the predicted virtual object wearing information, and adjusting network parameters in the initial object classification network according to the first target loss function to obtain an object classification network; the object classification network is used for identifying virtual object wearing information corresponding to a virtual object in the image to be processed, and the virtual object wearing information is used for performing abnormal rendering analysis on a virtual wearing article associated with the virtual object.
Specifically, it is assumed that the virtual object wearing label obtained in step S302 includes the actual sub-label corresponding to each region sample image, and that the at least two region sample images include a region sample image Xj, where j is a positive integer and j is less than or equal to the number of region sample images. The computer device may then generate a sub-loss function corresponding to the region sample image Xj according to the actual sub-label corresponding to the region sample image Xj and the predicted sub-label corresponding to the region sample image Xj, and generate the first target loss function according to the sub-loss functions corresponding to each region sample image. Further, the computer device may adjust the network parameters in the initial object classification network according to the calculated first target loss function, thereby obtaining the object classification network.
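The following sketch shows one way the first target loss function could be formed from the per-image sub-losses; the cross-entropy form of each sub-loss and the averaging are assumptions, since the description only states that sub-losses are generated from actual and predicted sub-labels and then combined.

```python
import tensorflow as tf

def first_target_loss(actual_sub_labels, predicted_probs):
    """actual_sub_labels: shape (N,), 0 = not worn, 1 = worn, one per region sample image Xj;
    predicted_probs: shape (N, 2) classification probabilities from the initial object classification network."""
    sub_losses = tf.keras.losses.sparse_categorical_crossentropy(
        actual_sub_labels, predicted_probs)        # one sub-loss per region sample image
    return tf.reduce_mean(sub_losses)              # combined first target loss
```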
The object classification network is configured to identify virtual object wearing information corresponding to a virtual object in an image to be processed, where the virtual object wearing information is used to perform exception rendering analysis on a virtual wearing article associated with the virtual object, and a specific application process of the object classification network may refer to step S103 in the embodiment corresponding to fig. 3. Referring to fig. 11 again, different object classification networks can be trained from the area sample images of different area types, and then a plurality of object classification networks can be combined into a detection model (such as the integrated detector shown in fig. 7) for classification detection.
In addition, the detection effect and universality of the trained detection model need to be tested. The embodiment of the present application can perform the test using images or directly using videos so as to observe false detections and missed detections. Specifically, research personnel can record videos in which different virtual objects wear different clothes, carry different virtual firearms, and the like, then input each frame of the videos into the comprehensive detector as a test set, and observe false detections and missed detections according to the detection results of the virtual wearing articles. Please refer to fig. 13, which is a scene diagram of a model test according to an embodiment of the present disclosure. As shown in fig. 13, a test image 600a in the test set is input into a comprehensive detector 600b (containing 6 object classification networks), and a detection result 600c can be output by the comprehensive detector 600b, where the format of the detection result may be "object": [exist, score, description], in which object represents the name of the virtual wearing article, exist represents whether the virtual wearing article exists, score represents the classification probability, and description represents the relevant description of the virtual wearing article. As shown in fig. 13, the detection results corresponding to the 6 kinds of virtual wearing articles (including "backpack", "jacket", "hat/helmet", "trousers/skirt/shorts", "shoes", and "virtual gun") can be seen in the detection result 600c; for example, the classification probability corresponding to the virtual wearing article "backpack" is 0.996233761, which is greater than the worn probability threshold, and the corresponding description is "backpack exists". Comparison with the test image 600a shows that the detection result for "backpack" is correct, i.e., the "backpack" in the test image 600a is rendered normally.
In one embodiment, the object classification network provided by the embodiment of the present application has high accuracy; in the absence of interfering objects, the actual accuracy of the object classification network can reach 99.9%. The overall accuracy of the comprehensive detector in detecting whether the virtual wearing articles are rendered is 96.8%, and the accuracy for each region is as follows:
a) Head region detection result: 2992/3087 (96.9%) identified correctly, where "2992" refers to the number of correctly detected test images, "3087" refers to the total number of test images, and the subsequent figures can be interpreted analogously;
b) Upper body region (here, only for "upper body") detection result: 3011/3087 (97.5%) identified correctly;
c) Lower body region detection result: 3013/3087 (97.6%) identified correctly;
d) Foot region detection result: 2753/3087 (89.2%) identified correctly;
e) "Backpack" detection result: 3086/3087 (99.9%) identified correctly;
f) Whole body region detection result: 3022/3087 (97.9%) identified correctly.
As can be seen from the above, the accuracy of the detection model for the foot region is relatively low, while the accuracy of the detection models for the other regions can reach 97%.
Please refer to fig. 14, which is a schematic flowchart of a training object detection network according to an embodiment of the present application. As shown in fig. 14, the computer device may also obtain video sample data including a sample virtual object, decode the video sample data to obtain a plurality of consecutive video frames, and further perform frame extraction on the plurality of video frames to obtain a sample image (i.e., the image set in fig. 14, which may include a plurality of images). The video sample data may be different video data for different application scenes, and may be game video data or animation video data, for example. It is to be appreciated that embodiments of the present application also support the direct collection of image data as sample images for model training and testing.
Further, after determining the type of the object to be detected (e.g., a virtual character), a developer may perform virtual object labeling on the sample images divided into the training set through a labeling training platform (e.g., LabelImg) on the computer device to obtain actual annotation frames; specifically, refer to fig. 15, which is an interface schematic diagram for virtual object labeling provided in the embodiment of the present application. As shown in fig. 15, the annotation interface 700a is an interface of an optional labeling training platform, and thumbnails of some of the sample images, for example, thumbnails of sample image 1, sample image 2, …, and sample image 8, may be displayed in the display area 700b of the annotation interface 700a; it can be seen that there are 1784 sample images to be annotated in total at this time, and the process of annotating each sample image is the same, so only sample image 1 is taken as an example for description here. As shown in fig. 15, after the thumbnail of sample image 1 in the display area 700b is clicked, the corresponding complete large image may be displayed, and it can be seen that a virtual object 700c exists in sample image 1; a developer may therefore draw a rectangular frame in sample image 1 through an input device such as a mouse, the drawn rectangular frame 700d may be used as the actual annotation frame corresponding to the virtual object 700c, and in the operation panel 700e, the target detection label corresponding to the virtual object 700c may be edited, for example, edited as "human body", which indicates that the virtual object 700c is a virtual character.
After all the sample images are labeled with the virtual object, data in a VOC format (a data set labeling format) can be generated, the computer device can input the labeled sample images into the initial object detection network, and can output the predicted detection frames for labeling the sample virtual object and the predicted detection frame position information corresponding to the predicted detection frames through the initial object detection network, further, the number of the actual labeling frames included in the labeled sample images and the actual object position information of the actual labeling frames in the sample images can be obtained, further, a number loss function can be generated according to the number of the actual labeling frames and the number of the predicted detection frames, meanwhile, a position loss function can be generated according to the actual object position information and the predicted detection frame position information, a second target loss function can be generated according to the number loss function and the position loss function, and finally, network parameters in the initial object detection network are adjusted according to the second target loss function, so that a trained object detection network (i.e., the identification model in fig. 14) can be obtained.
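For illustration, one possible form of the second target loss function combining a number loss and a position loss is sketched below; the absolute-difference count penalty, the Huber (smooth-L1) position penalty, and the weighting factor are assumptions.

```python
import tensorflow as tf

def number_loss(num_actual, num_predicted):
    # Penalize a mismatch between how many frames were annotated and how many were predicted.
    return tf.abs(tf.cast(num_actual, tf.float32) - tf.cast(num_predicted, tf.float32))

def position_loss(actual_boxes, predicted_boxes):
    # Smooth-L1 / Huber penalty between matched actual annotation frames and predicted detection frames.
    return tf.keras.losses.Huber()(actual_boxes, predicted_boxes)

def second_target_loss(num_actual, num_predicted, actual_boxes, predicted_boxes, alpha=1.0):
    return number_loss(num_actual, num_predicted) + alpha * position_loss(actual_boxes, predicted_boxes)
```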
In an alternative embodiment, when training the initial object detection network, a developer may first set the parameters of the network and the related configuration items, and then train the initial object detection network under the TensorFlow framework. If a pre-training model based on MobileNetV2-SSDLite is used (i.e., the initial object detection network; the network structure may be as shown in fig. 4 above), the learning rate may be set to 0.0001, the number of iterations to 100000, and the batch size to 8, and training may be stopped when the loss is small and the maximum number of iterations is reached. It is to be understood that the relevant settings mentioned here may be adjusted according to the actual situation.
In addition, the trained identification model needs to be tested for its detection effect and universality. The embodiment of the present application can perform the test using images or directly using videos; specifically, research personnel can record videos in which different virtual objects wear different clothes, carry different virtual firearms, and the like, then input each frame of the videos into the identification model as a test set, and observe false detections and missed detections according to whether the virtual objects are detected. In one implementation, the detection accuracy of the identification model can reach 99.9%. Subsequently, after the identification model test is completed, the identification model and the relevant configuration file can be uploaded to the service server.
It can be understood that the identification model based on the object detection network, the comprehensive detector based on the object classification network, and the module with the area division function are packaged together, and a web service is built, and a subsequent user can directly call a related API (Application Programming Interface) to perform abnormal rendering analysis on an input image or video, so that a lot of manpower and detection time can be saved, and the detection process is accelerated. In one embodiment, a single examination is shortened from a human elapsed time of 2 days to 5-7 hours (where the script runs for 4-6 hours and the human looks at the picture for 1 hour).
According to the method and the device, the obtained sample image is subjected to virtual object detection through the object detection network, further, the sample virtual object can be subjected to area division according to the detection result, a data set (comprising a training set and a testing set) with different types of virtual wearing articles is obtained, the initial object classification network is trained on the training set, network parameters are adjusted through a loss function, the trained object classification network is tested through the testing set, and finally, the comprehensive detector for detecting whether different virtual wearing articles are rendered or not can be obtained.
Fig. 16 is a schematic structural diagram of a virtual object detection apparatus according to an embodiment of the present application. The virtual object detection means may be a computer program (comprising program code) running on a computer device, for example the virtual object detection means being an application software; the device can be used for executing corresponding steps in the virtual object detection method provided by the embodiment of the application. As shown in fig. 16, the virtual object detection apparatus 1 may include: an object detection module 11, a region division module 12 and a classification detection module 13;
the object detection module 11 is configured to acquire an image to be processed and perform virtual object detection on the image to be processed;
the object detection module 11 is specifically configured to acquire an image to be processed, input the image to be processed into an object detection network, perform feature extraction on the image to be processed in the object detection network to obtain a picture feature matrix, and generate at least two detection frames according to the picture feature matrix; carrying out non-maximum suppression processing on at least two detection frames to obtain detection frames to be processed; if the detection frame to be processed contains the virtual object, determining that the virtual object exists in the image to be processed;
the region dividing module 12 is configured to, if a virtual object exists in the image to be processed, obtain object position information of the virtual object in the image to be processed, and perform region division on the virtual object according to the object position information to obtain at least two key part regions;
the classification detection module 13 is configured to perform feature extraction on at least two key part regions respectively to obtain picture region features corresponding to each key part region, and perform wearing detection on the at least two picture region features respectively to obtain virtual object wearing information corresponding to an image to be processed; the virtual object wearing information is used for conducting abnormal rendering analysis on the virtual wearing article associated with the virtual object.
The specific functional implementation manner of the object detection module 11 may refer to step S101 in the embodiment corresponding to fig. 3, the specific functional implementation manner of the area division module 12 may refer to step S102 in the embodiment corresponding to fig. 3, and the specific functional implementation manner of the classification detection module 13 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 16, the virtual object detection apparatus 1 may further include: a rendering analysis module 14;
the rendering analysis module 14 is configured to obtain an actual wearing state of the virtual object in the key portion area Si about the virtual wearing article, and determine that a rendering result for the virtual wearing article in the key portion area Si is an abnormal rendering result if the actual wearing state is different from a predicted wearing state corresponding to the key portion area Si.
The specific functional implementation manner of the rendering analysis module 14 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 16, the virtual object detection apparatus 1 may further include: a scene recognition module 15 and an offset analysis module 16;
the scene recognition module 15 is configured to recognize a service scene type in the image to be processed, and acquire a template image corresponding to the service scene type; the template image includes a reference virtual object;
the scene recognition module 15 is specifically configured to obtain a scene recognition configuration file; the scene identification configuration file comprises an incidence relation between a scene element and a service scene type; performing service scene recognition on the image to be processed to obtain pixel coordinates and element classes of scene elements to be detected in the image to be processed; matching search is carried out in the scene identification configuration file according to the pixel coordinates and the element categories, scene elements matched with the pixel coordinates and the element categories in the scene identification configuration file are determined as target scene elements, and the service scene type having an incidence relation with the target scene elements is determined as the service scene type of the image to be processed;
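The matching search against the scene identification configuration file can be sketched as follows; the configuration entries, element names, pixel coordinates and matching tolerance are invented for illustration and are not taken from this application.

```python
# Hypothetical scene identification configuration: scene elements associated with service scene types.
SCENE_CONFIG = [
    {"element": "minimap", "category": "hud_icon", "pixel": (60, 60), "scene_type": "battle_scene"},
    {"element": "shop_button", "category": "hud_icon", "pixel": (1200, 650), "scene_type": "lobby_scene"},
]

def identify_scene(detected_category, detected_pixel, tolerance=30):
    """Return the service scene type whose configured element matches the detected element."""
    for entry in SCENE_CONFIG:
        dx = abs(entry["pixel"][0] - detected_pixel[0])
        dy = abs(entry["pixel"][1] - detected_pixel[1])
        if entry["category"] == detected_category and dx <= tolerance and dy <= tolerance:
            return entry["scene_type"]        # target scene element found
    return None

print(identify_scene("hud_icon", (1195, 655)))  # -> 'lobby_scene'
```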
the offset analysis module 16 is configured to obtain reference position information of the reference virtual object in the template image, and generate a position offset according to the reference position information and the object position information; if the position offset is larger than the offset threshold, determining that the virtual object is in an abnormal position in the image to be processed, and determining that the rendering result of the virtual object in the image to be processed is an abnormal rendering result; and if the position offset is smaller than or equal to the offset threshold, determining that the virtual object is at a normal position in the image to be processed, and determining that the rendering result of the virtual object in the image to be processed is a normal rendering result.
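The offset analysis can be illustrated with a minimal sketch; it assumes the positions are (x, y) centres of the detection frames and uses the Euclidean distance as the position offset, and the threshold value is an arbitrary example; the description above only requires that some offset be compared against an offset threshold.

```python
# Sketch of comparing a position offset against an offset threshold.
import math

def rendering_position_check(reference_pos, object_pos, offset_threshold=40.0):
    offset = math.dist(reference_pos, object_pos)   # position offset
    if offset > offset_threshold:
        return offset, "abnormal rendering result"  # virtual object at an abnormal position
    return offset, "normal rendering result"

print(rendering_position_check((640, 360), (655, 372)))   # small shift -> normal
print(rendering_position_check((640, 360), (760, 500)))   # large shift -> abnormal
```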
The specific functional implementation manner of the scene recognition module 15 may refer to step S204 in the embodiment corresponding to fig. 8, and the specific functional implementation manner of the offset analysis module 16 may refer to step S205 in the embodiment corresponding to fig. 8, which is not described herein again.
Referring to fig. 16, the area dividing module 12 may include: a region expanding unit 121, an information acquiring unit 122, and a region dividing unit 123;
the region expansion unit 121 is configured to, if a virtual object exists in the image to be processed, perform region expansion on the detection frame to be processed associated with the virtual object to obtain a target detection frame; the side length of the target detection frame is larger than that of the detection frame to be processed;
an information obtaining unit 122, configured to obtain detection frame position information of the target detection frame in the image to be processed, and determine the detection frame position information as object position information of the virtual object in the image to be processed; acquiring the area division proportion between at least two key part reference areas and a virtual object reference area; each key part reference area is positioned in the virtual object reference area;
and the area dividing unit 123 is configured to generate at least two sets of area coordinates according to the area dividing proportion and the object position information, and perform area division on the target detection frame according to the at least two sets of area coordinates to obtain at least two key part areas.
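A minimal sketch of the region expansion and the proportional division is given below; the expansion ratio, the image size and the vertical division proportions (roughly head, upper body and lower body) are assumptions made for the example, and only the overall procedure follows the description above.

```python
# Sketch: expand the to-be-processed detection frame, then divide it into key part regions.
def expand_box(box, ratio=0.1, image_w=1920, image_h=1080):
    """Enlarge the detection frame so the target detection frame has longer sides."""
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, x1 - dw), max(0, y1 - dh), min(image_w, x2 + dw), min(image_h, y2 + dh))

def divide_regions(box, proportions=((0.0, 0.25), (0.25, 0.6), (0.6, 1.0))):
    """Split the target detection frame vertically according to a division proportion."""
    x1, y1, x2, y2 = box
    height = y2 - y1
    return [(x1, y1 + top * height, x2, y1 + bottom * height) for top, bottom in proportions]

target_box = expand_box((400, 200, 600, 800))
for region in divide_regions(target_box):
    print(region)
```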
For specific functional implementation manners of the region expanding unit 121, the information obtaining unit 122, and the region dividing unit 123, reference may be made to step S102 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 16, the classification detecting module 13 may include: a feature extraction unit 131, a pooling unit 132, a full connection unit 133, and an output unit 134;
a feature extraction unit 131, configured to input at least two key part regions into an object classification network, and perform feature extraction on the at least two key part regions in the object classification network, respectively, to obtain picture region features corresponding to each key part region;
the pooling unit 132 is configured to perform global average pooling processing on the at least two image region features through a global average pooling layer in the object classification network, so as to obtain at least two initial category features;
a full connection unit 133, configured to perform feature integration on at least two initial category features through a full connection layer in the object classification network, respectively, to obtain at least two target category features;
the output unit 134 is configured to output classification probabilities respectively corresponding to each target class feature through an output layer in the object classification network, and determine virtual object wearing information corresponding to the to-be-processed image according to at least two classification probabilities;
in one embodiment, the at least two target category features include a target category feature Ti corresponding to the key part region Si; i is a positive integer, and i is less than or equal to the number of the key part regions;
the output unit 134 is specifically configured to output, through the output layer in the object classification network, the classification probability corresponding to the target category feature Ti, and determine the worn state or the unworn state indicated by the classification probability corresponding to the target category feature Ti as the predicted wearing state corresponding to the key part region Si; the worn state refers to a state indicated when the classification probability corresponding to the target category feature Ti is greater than the worn probability threshold; the unworn state refers to a state indicated when the classification probability corresponding to the target category feature Ti is greater than the unworn probability threshold; and determine the predicted wearing states respectively corresponding to each key part region as the virtual object wearing information corresponding to the image to be processed.
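The structure of such an object classification head can be sketched in PyTorch as follows; the backbone layers, feature sizes and the 0.5 probability threshold are illustrative assumptions, and the sketch only mirrors the sequence of feature extraction, global average pooling, full connection and probability output described above.

```python
# Minimal PyTorch sketch of a per-region worn/unworn classification head.
import torch
import torch.nn as nn

class WearClassifier(nn.Module):
    def __init__(self, num_states=2):
        super().__init__()
        self.features = nn.Sequential(                 # picture region features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)             # global average pooling layer
        self.fc = nn.Linear(64, num_states)            # full connection layer
        self.out = nn.Softmax(dim=1)                   # output layer -> classification probabilities

    def forward(self, region_batch):
        x = self.features(region_batch)
        x = self.gap(x).flatten(1)                     # initial category features
        x = self.fc(x)                                 # target category features
        return self.out(x)

model = WearClassifier()
regions = torch.randn(3, 3, 96, 96)                    # three key part regions of one virtual object
probs = model(regions)
worn = probs[:, 1] > 0.5                               # worn state when probability exceeds the threshold
print([("worn" if w else "unworn") for w in worn])
```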
The specific functional implementation manners of the feature extraction unit 131, the pooling unit 132, the full connection unit 133, and the output unit 134 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
The embodiment of the application provides an abnormal rendering detection scheme for virtual resources (including virtual wearing articles). Virtual object detection is performed on the acquired image to be processed; when a virtual object is detected in the image to be processed, object position information of the virtual object in the image to be processed can further be acquired, and the virtual object is divided into regions according to the object position information to obtain at least two key part regions. Feature extraction is then performed on each key part region to obtain the picture region features respectively corresponding to each key part region, wearing detection is performed on the at least two picture region features, and the virtual object wearing information corresponding to the image to be processed is finally obtained. The embodiment of the application does not rely on a time-consuming and labor-intensive manual screening scheme, but directly judges, based on a deep neural network, whether the rendering of the virtual wearing articles is abnormal; that is, the embodiment of the application can quickly and automatically detect whether the virtual wearing articles in each region of the virtual object in the image to be processed are rendered abnormally. It can therefore serve the abnormal rendering detection of virtual objects in various products, saves labor cost, accelerates the testing process, and, because the network model is small and easy to deploy, improves both the efficiency and the accuracy of virtual object detection.
Fig. 17 is a schematic structural diagram of a virtual object detection apparatus according to an embodiment of the present application. The virtual object detection apparatus may be a computer program (comprising program code) running on a computer device, for example, the virtual object detection apparatus may be application software; the apparatus can be used for executing the corresponding steps in the virtual object detection method provided by the embodiments of the present application. As shown in fig. 17, the virtual object detection apparatus 2 may include: an object detection module 21, an area division module 22, a classification detection module 23, and an adjustment module 24;
an object detection module 21, configured to obtain a sample image and perform virtual object detection on the sample image;
the object detection module 21 is specifically configured to decode video sample data including a sample virtual object to obtain a plurality of continuous video frames, and perform frame extraction processing on the plurality of continuous video frames to obtain a sample image; inputting a sample image into an object detection network, extracting the characteristics of the sample image in the object detection network to obtain an image characteristic matrix, and generating at least two detection frames according to the image characteristic matrix; carrying out non-maximum suppression processing on at least two detection frames to obtain detection frames to be processed; if the detection frame to be processed contains the sample virtual object, determining that the sample virtual object exists in the sample image;
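Decoding the video sample data and extracting frames can be sketched with OpenCV as follows; the file name and the fixed stride of 30 frames are assumptions, and any frame extraction strategy producing sample images would fit the description above.

```python
# Hedged sketch: decode video sample data into consecutive frames and keep every stride-th frame.
import cv2

def extract_sample_frames(video_path, stride=30):
    """Decode the video into consecutive frames and keep every `stride`-th frame as a sample image."""
    capture = cv2.VideoCapture(video_path)
    samples, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % stride == 0:
            samples.append(frame)          # sample image for the detection network
        index += 1
    capture.release()
    return samples

frames = extract_sample_frames("gameplay_recording.mp4")   # hypothetical file name
print(f"extracted {len(frames)} sample images")
```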
the region dividing module 22 is configured to, if a sample virtual object exists in the sample image, obtain prediction object position information and a virtual object wearing tag of the sample virtual object in the sample image, and perform region division on the sample virtual object according to the prediction object position information to obtain at least two key part regions;
the classification detection module 23 is configured to input the at least two key part regions into an initial object classification network, perform feature extraction on the at least two key part regions in the initial object classification network, to obtain picture region features corresponding to each key part region, and perform wear detection on the at least two picture region features, to obtain predicted virtual object wear information corresponding to the sample image;
the classification detection module 23 is specifically configured to input the area sample images corresponding to the at least two key part areas into an initial object classification network, and perform feature extraction on the at least two area sample images in the initial object classification network, so as to obtain picture area features corresponding to each area sample image; respectively carrying out global average pooling on at least two picture region characteristics through a global average pooling layer in the initial object classification network to obtain at least two initial category characteristics; respectively performing feature integration on at least two initial category features through a full connection layer in an initial object classification network to obtain at least two target category features; outputting classification probabilities respectively corresponding to each target class characteristic through an output layer in the initial object classification network, determining prediction sub-labels respectively corresponding to each regional sample image according to at least two classification probabilities, and determining at least two prediction sub-labels as predicted virtual object wearing information corresponding to the sample images;
the adjusting module 24 is configured to generate a first target loss function according to the virtual object wearing label and the predicted virtual object wearing information, and adjust network parameters in the initial object classification network according to the first target loss function to obtain an object classification network; the object classification network is used for identifying virtual object wearing information corresponding to a virtual object in the image to be processed, and the virtual object wearing information is used for performing abnormal rendering analysis on a virtual wearing article associated with the virtual object;
in one embodiment, the virtual object wearing label includes an actual sub-label corresponding to each area sample image; the at least two area sample images include an area sample image Xj; j is a positive integer, and j is less than or equal to the number of the area sample images;
the adjusting module 24 is specifically configured to generate a sub-loss function corresponding to the area sample image Xj according to the actual sub-label corresponding to the area sample image Xj and the predicted sub-label corresponding to the area sample image Xj; and generate the first target loss function according to the sub-loss functions respectively corresponding to each area sample image.
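A minimal sketch of such a first target loss is given below; it assumes a binary cross-entropy sub-loss per area sample image and an average over the sub-losses, which is one plausible choice rather than the specific loss form used in this application.

```python
# Sketch: per-region sub-losses combined into a first target loss.
import torch
import torch.nn.functional as F

def first_target_loss(predicted_probs, actual_sub_labels):
    """predicted_probs, actual_sub_labels: float tensors of shape (num_area_sample_images,)."""
    sub_losses = F.binary_cross_entropy(predicted_probs, actual_sub_labels, reduction="none")
    return sub_losses.mean()               # combine the per-region sub-loss functions

predicted = torch.tensor([0.9, 0.2, 0.7])  # worn probabilities for three area sample images
actual = torch.tensor([1.0, 0.0, 1.0])     # actual sub-labels from the virtual object wearing label
print(first_target_loss(predicted, actual))
```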
The specific functional implementation manner of the object detection module 21 may refer to step S301 in the embodiment corresponding to fig. 10, the specific functional implementation manner of the region division module 22 may refer to step S302 in the embodiment corresponding to fig. 10, the specific functional implementation manner of the classification detection module 23 may refer to step S303 in the embodiment corresponding to fig. 10, and the specific functional implementation manner of the adjustment module 24 may refer to step S304 in the embodiment corresponding to fig. 10, which is not described herein again.
Referring to fig. 17, the virtual object detection apparatus 2 may further include: a network training module 25;
the network training module 25 is configured to perform virtual object labeling on the sample image to obtain an actual labeling frame, input the labeled sample image into an initial object detection network, and output a prediction detection frame for labeling the sample virtual object and prediction detection frame position information corresponding to the prediction detection frame through the initial object detection network; acquiring the number of the actual labeling frames and the actual object position information of the actual labeling frames in the sample image; generating a quantity loss function according to the quantity of the actual labeling frames and the quantity of the prediction detection frames, generating a position loss function according to the actual object position information and the prediction detection frame position information, and generating a second target loss function according to the quantity loss function and the position loss function; and adjusting the network parameters in the initial object detection network according to the second target loss function to obtain the object detection network.
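The combination of a quantity loss and a position loss into a second target loss can be sketched as follows; the squared count difference, the smooth L1 position term, the one-to-one pairing of frames and the equal weighting are assumptions for illustration only.

```python
# Sketch: second target loss built from a quantity loss and a position loss.
import torch
import torch.nn.functional as F

def second_target_loss(pred_boxes, actual_boxes, count_weight=1.0, position_weight=1.0):
    # quantity loss: squared difference between the numbers of predicted and actual labeling frames
    quantity_loss = (float(len(pred_boxes)) - float(len(actual_boxes))) ** 2
    # position loss: smooth L1 over the frames that can be paired one-to-one
    paired = min(len(pred_boxes), len(actual_boxes))
    position_loss = F.smooth_l1_loss(pred_boxes[:paired], actual_boxes[:paired]) if paired else torch.tensor(0.0)
    return count_weight * quantity_loss + position_weight * position_loss

pred = torch.tensor([[100., 50., 220., 400.], [500., 60., 640., 420.]])
actual = torch.tensor([[102., 48., 218., 396.]])
print(second_target_loss(pred, actual))
```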
The specific functional implementation manner of the network training module 25 may refer to the steps in the embodiment corresponding to fig. 14, which is not described herein again.
The method and the device have the advantages that virtual object detection is performed on the acquired sample images through the object detection network, and the sample virtual objects can then be divided into regions according to the detection results, so that a data set (comprising a training set and a test set) covering different types of virtual wearing articles is obtained; the initial object classification network is trained on the training set, the network parameters are adjusted through a loss function, the trained object classification network is tested on the test set, and a comprehensive detector for detecting whether the different virtual wearing articles are rendered or not is finally obtained.
Fig. 18 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 18, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection communication among these components. The user interface 1003 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 18, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 18, the network interface 1004 may provide a network communication function; the user interface 1003 is mainly used to provide an interface for user input; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement:
acquiring an image to be processed, and performing virtual object detection on the image to be processed;
if the virtual object exists in the image to be processed, object position information of the virtual object in the image to be processed is obtained, and the virtual object is subjected to region division according to the object position information to obtain at least two key part regions;
respectively extracting the characteristics of at least two key part areas to obtain picture area characteristics corresponding to each key part area, and respectively carrying out wearing detection on the characteristics of at least two picture areas to obtain virtual object wearing information corresponding to an image to be processed; the virtual object wearing information is used for conducting abnormal rendering analysis on the virtual wearing article associated with the virtual object.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the virtual object detection method in any embodiment corresponding to fig. 3 and fig. 8, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Fig. 19 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 19, the computer device 2000 may include: a processor 2001, a network interface 2004, and a memory 2005; the computer device 2000 may further include: a user interface 2003 and at least one communication bus 2002. The communication bus 2002 is used to implement connection communication between these components. The user interface 2003 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional user interface 2003 may further include a standard wired interface and a standard wireless interface. The network interface 2004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 2005 may be a high-speed RAM memory, or may be a non-volatile memory, such as at least one disk memory. The memory 2005 may optionally also be at least one storage device located remotely from the aforementioned processor 2001. As shown in fig. 19, the memory 2005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 2000 shown in fig. 19, the network interface 2004 may provide a network communication function; and the user interface 2003 is primarily used to provide an interface for user input; and processor 2001 may be used to invoke the device control application stored in memory 2005 to implement:
acquiring a sample image, and carrying out virtual object detection on the sample image;
if the sample virtual object exists in the sample image, acquiring prediction object position information and a virtual object wearing label of the sample virtual object in the sample image, and performing region division on the sample virtual object according to the prediction object position information to obtain at least two key part regions;
inputting at least two key part areas into an initial object classification network, respectively extracting features of the at least two key part areas in the initial object classification network to obtain picture area features respectively corresponding to each key part area, and respectively performing wearing detection on the at least two picture area features to obtain predicted virtual object wearing information corresponding to a sample image;
generating a first target loss function according to the virtual object wearing label and the predicted virtual object wearing information, and adjusting network parameters in the initial object classification network according to the first target loss function to obtain an object classification network; the object classification network is used for identifying virtual object wearing information corresponding to a virtual object in the image to be processed, and the virtual object wearing information is used for performing abnormal rendering analysis on a virtual wearing article associated with the virtual object.
It should be understood that the computer device 2000 described in this embodiment of the present application may perform the description of the virtual object detection method in the embodiment corresponding to fig. 10, and is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer program executed by the aforementioned virtual object detection apparatus 1 and the virtual object detection apparatus 2 is stored in the computer-readable storage medium, and the computer program includes program instructions, and when the processor executes the program instructions, the processor can execute the description of the virtual object detection method in the embodiment corresponding to any one of fig. 3, fig. 8, and fig. 10, and therefore, the description will not be repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
The computer-readable storage medium may be the virtual object detection apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash memory card (flash card), and the like provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Further, here, it is to be noted that: embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by any one of the corresponding embodiments in fig. 3, fig. 8, and fig. 10.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only the preferred embodiments of the present application and is certainly not intended to limit the scope of the claims of the present application; therefore, the scope of the present application shall be subject to the appended claims.

Claims (15)

1. A virtual object detection method, comprising:
acquiring an image to be processed, and carrying out virtual object detection on the image to be processed;
if the virtual object exists in the image to be processed, object position information of the virtual object in the image to be processed is obtained, and the virtual object is subjected to region division according to the object position information to obtain at least two key part regions;
respectively extracting the characteristics of the at least two key part areas to obtain picture area characteristics corresponding to each key part area, and respectively carrying out wearing detection on the at least two picture area characteristics to obtain virtual object wearing information corresponding to the image to be processed; the virtual object wearing information comprises predicted wearing states corresponding to the at least two key part areas respectively; the predicted wearing state refers to a worn state or an unworn state of the at least two key part areas; the at least two key part areas comprise a key part area Si; i is a positive integer, and i is less than or equal to the number of key part areas;
acquiring an actual wearing state of the virtual object in the key part area Si with respect to the virtual wearing article, and if the actual wearing state is different from the predicted wearing state corresponding to the key part area Si, determining that the rendering result for the virtual wearing article in the key part area Si is an abnormal rendering result.
2. The method according to claim 1, wherein the acquiring the image to be processed and the performing the virtual object detection on the image to be processed comprise:
acquiring an image to be processed, inputting the image to be processed into an object detection network, extracting the characteristics of the image to be processed in the object detection network to obtain a picture characteristic matrix, and generating at least two detection frames according to the picture characteristic matrix;
carrying out non-maximum suppression processing on the at least two detection frames to obtain detection frames to be processed;
and if the to-be-processed detection frame contains the virtual object, determining that the virtual object exists in the to-be-processed image.
3. The method according to claim 2, wherein if a virtual object exists in the image to be processed, acquiring object position information of the virtual object in the image to be processed, and performing region division on the virtual object according to the object position information to obtain at least two key part areas, includes:
if a virtual object exists in the image to be processed, carrying out region expansion on the detection frame to be processed associated with the virtual object to obtain a target detection frame; the side length of the target detection frame is larger than that of the detection frame to be processed;
acquiring detection frame position information of the target detection frame in the image to be processed, and determining the detection frame position information as object position information of the virtual object in the image to be processed;
acquiring the area division proportion between at least two key part reference areas and a virtual object reference area; each key part reference region is positioned in the virtual object reference region;
and generating at least two groups of area coordinates according to the area division proportion and the object position information, and performing area division on the target detection frame according to the at least two groups of area coordinates to obtain at least two key part areas.
4. The method according to claim 1, wherein the performing feature extraction on the at least two key part regions respectively to obtain picture region features corresponding to each key part region, and performing wear detection on the at least two picture region features respectively to obtain virtual object wear information corresponding to the image to be processed comprises:
inputting the at least two key part areas into an object classification network, and respectively extracting the characteristics of the at least two key part areas in the object classification network to obtain the image area characteristics corresponding to each key part area;
performing global average pooling processing on at least two image region characteristics through a global average pooling layer in the object classification network to obtain at least two initial category characteristics;
respectively performing feature integration on the at least two initial class features through a full connection layer in the object classification network to obtain at least two target class features;
and outputting classification probabilities respectively corresponding to each target class characteristic through an output layer in the object classification network, and determining virtual object wearing information corresponding to the image to be processed according to at least two classification probabilities.
5. The method according to claim 4, wherein the at least two target class features comprise a target class feature Ti corresponding to the key part area Si; the outputting, through an output layer in the object classification network, classification probabilities respectively corresponding to each target class feature, and determining virtual object wearing information corresponding to the image to be processed according to at least two classification probabilities, includes:
outputting, through the output layer in the object classification network, the classification probability corresponding to the target class feature Ti, and determining a worn state or an unworn state indicated by the classification probability corresponding to the target class feature Ti as the predicted wearing state corresponding to the key part area Si; the worn state refers to a state indicated when the classification probability corresponding to the target class feature Ti is greater than a worn probability threshold; the unworn state refers to a state indicated when the classification probability corresponding to the target class feature Ti is greater than an unworn probability threshold;
and determining the predicted wearing states respectively corresponding to the key part areas as the virtual object wearing information corresponding to the image to be processed.
6. The method of claim 1, further comprising:
identifying the type of a service scene in the image to be processed, and acquiring a template image corresponding to the type of the service scene; the template image comprises a reference virtual object;
acquiring reference position information of the reference virtual object in the template image, and generating a position offset according to the reference position information and the object position information;
if the position offset is larger than an offset threshold, determining that the virtual object is in an abnormal position in the image to be processed, and determining that a rendering result of the virtual object in the image to be processed is an abnormal rendering result;
and if the position offset is smaller than or equal to the offset threshold, determining that the virtual object is at a normal position in the image to be processed, and determining that the rendering result of the virtual object in the image to be processed is a normal rendering result.
7. The method of claim 6, wherein the identifying the type of the traffic scene in the image to be processed comprises:
acquiring a scene identification configuration file; the scene identification configuration file comprises an incidence relation between a scene element and a service scene type;
performing service scene recognition on the image to be processed to obtain pixel coordinates and element categories of scene elements to be detected in the image to be processed;
and performing matching search in the scene identification configuration file according to the pixel coordinates and the element categories, determining scene elements matched with the pixel coordinates and the element categories in the scene identification configuration file as target scene elements, and determining the service scene type having an association relationship with the target scene elements as the service scene type of the image to be processed.
8. A virtual object detection method, comprising:
acquiring a sample image, and carrying out virtual object detection on the sample image;
if a sample virtual object exists in the sample image, acquiring predicted object position information and a virtual object wearing label of the sample virtual object in the sample image, and performing area division on the sample virtual object according to the predicted object position information to obtain at least two key part areas;
inputting at least two key part areas in the sample image into an initial object classification network, respectively extracting features of the at least two key part areas in the sample image in the initial object classification network to obtain picture area features respectively corresponding to each key part area, and respectively performing wearing detection on the at least two picture area features to obtain predicted virtual object wearing information corresponding to the sample image; the predicted virtual object wearing information comprises predicted wearing states corresponding to at least two key part areas in the sample image respectively; the predicted wearing state refers to a worn state or an unworn state of at least two key part areas in the sample image;
generating a first target loss function according to the virtual object wearing label and the predicted virtual object wearing information, and adjusting network parameters in the initial object classification network according to the first target loss function to obtain an object classification network; the object classification network is used for identifying virtual object wearing information corresponding to a virtual object in an image to be processed; the virtual object wearing information comprises predicted wearing states corresponding to at least two key part areas in the image to be processed respectively; the predicted wearing state refers to a worn state or an unworn state of at least two key part areas in the image to be processed.
9. The method of claim 8, wherein the obtaining a sample image for virtual object detection comprises:
decoding video sample data containing a sample virtual object to obtain a plurality of continuous video frames, and performing frame extraction processing on the plurality of continuous video frames to obtain a sample image;
inputting the sample image into an object detection network, extracting the characteristics of the sample image in the object detection network to obtain a picture characteristic matrix, and generating at least two detection frames according to the picture characteristic matrix;
carrying out non-maximum suppression processing on the at least two detection frames to obtain detection frames to be processed;
and if the to-be-processed detection frame comprises the sample virtual object, determining that the sample virtual object exists in the sample image.
10. The method of claim 9, further comprising:
carrying out virtual object labeling on the sample image to obtain an actual labeling frame, inputting the labeled sample image into an initial object detection network, and outputting a prediction detection frame for labeling the sample virtual object and prediction detection frame position information corresponding to the prediction detection frame through the initial object detection network;
acquiring the number of the actual labeling frames and the actual object position information of the actual labeling frames in the sample image;
generating a quantity loss function according to the quantity of the actual labeling frames and the quantity of the prediction detection frames, generating a position loss function according to the actual object position information and the prediction detection frame position information, and generating a second target loss function according to the quantity loss function and the position loss function;
and adjusting the network parameters in the initial object detection network according to the second target loss function to obtain the object detection network.
11. The method according to claim 8, wherein the inputting at least two key region areas in the sample image into an initial object classification network, respectively performing feature extraction on the at least two key region areas in the sample image in the initial object classification network to obtain picture region features corresponding to each key region area, respectively performing wearing detection on the at least two picture region features to obtain predicted virtual object wearing information corresponding to the sample image, comprises:
inputting area sample images corresponding to at least two key part areas in the sample images into an initial object classification network, and respectively extracting the characteristics of the at least two area sample images in the initial object classification network to obtain picture area characteristics corresponding to each area sample image;
respectively carrying out global average pooling on at least two image region characteristics through a global average pooling layer in the initial object classification network to obtain at least two initial category characteristics;
respectively performing feature integration on the at least two initial category features through a full connection layer in the initial object classification network to obtain at least two target category features;
and outputting classification probabilities respectively corresponding to each target class feature through an output layer in the initial object classification network, determining prediction sub-labels respectively corresponding to each regional sample image according to at least two classification probabilities, and determining at least two prediction sub-labels as predicted virtual object wearing information corresponding to the sample images.
12. The method of claim 11, wherein the virtual object wearing label comprises an actual sub-label corresponding to each of the area sample images; the at least two area sample images comprise an area sample image Xj; j is a positive integer, and j is less than or equal to the number of area sample images; the generating a first target loss function according to the virtual object wearing label and the predicted virtual object wearing information comprises:
generating a sub-loss function corresponding to the area sample image Xj according to the actual sub-label corresponding to the area sample image Xj and the predicted sub-label corresponding to the area sample image Xj;
and generating a first target loss function according to the sub-loss function corresponding to each area sample image.
13. A virtual object detection apparatus, comprising:
the object detection module is used for acquiring an image to be processed and carrying out virtual object detection on the image to be processed;
the area dividing module is used for acquiring object position information of the virtual object in the image to be processed if the virtual object exists in the image to be processed, and performing area division on the virtual object according to the object position information to obtain at least two key part areas;
a classification detection module, configured to respectively extract features of at least two key part areas to obtain picture area features corresponding to each key part area, and respectively perform wearing detection on the at least two picture area features to obtain virtual object wearing information corresponding to the image to be processed; the virtual object wearing information comprises predicted wearing states corresponding to the at least two key part areas respectively; the predicted wearing state refers to a worn state or an unworn state of the at least two key part areas; the at least two key part areas comprise a key part area Si; i is a positive integer, and i is less than or equal to the number of key part areas;
a rendering analysis module, configured to acquire an actual wearing state of the virtual object in the key part area Si with respect to the virtual wearing article, and if the actual wearing state is different from the predicted wearing state corresponding to the key part area Si, determine that the rendering result for the virtual wearing article in the key part area Si is an abnormal rendering result.
14. A computer device, comprising: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide data communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method of any of claims 1-12.
15. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded by a processor and to carry out the method of any one of claims 1 to 12.
CN202110357207.4A 2021-04-01 2021-04-01 Virtual object detection method and device and readable storage medium Active CN112915539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110357207.4A CN112915539B (en) 2021-04-01 2021-04-01 Virtual object detection method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110357207.4A CN112915539B (en) 2021-04-01 2021-04-01 Virtual object detection method and device and readable storage medium

Publications (2)

Publication Number Publication Date
CN112915539A CN112915539A (en) 2021-06-08
CN112915539B true CN112915539B (en) 2023-01-06

Family

ID=76173773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110357207.4A Active CN112915539B (en) 2021-04-01 2021-04-01 Virtual object detection method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN112915539B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113971667B (en) * 2021-11-02 2022-06-21 上海可明科技有限公司 Training and optimizing method for target detection model of surgical instrument in storage environment
CN115454250A (en) * 2022-09-20 2022-12-09 北京字跳网络技术有限公司 Method, apparatus, device and storage medium for augmented reality interaction
CN116071527B (en) * 2022-12-19 2024-06-21 支付宝(杭州)信息技术有限公司 Object processing method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378247A (en) * 2019-06-26 2019-10-25 腾讯科技(深圳)有限公司 Virtual objects recognition methods and device, storage medium and electronic device
CN112364734A (en) * 2020-10-30 2021-02-12 福州大学 Abnormal dressing detection method based on yolov4 and CenterNet

Also Published As

Publication number Publication date
CN112915539A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN112915539B (en) Virtual object detection method and device and readable storage medium
WO2021073364A1 (en) Face liveness detection method, apparatus and device, and storage medium
CN111626218B (en) Image generation method, device, equipment and storage medium based on artificial intelligence
US11830118B2 (en) Virtual clothing try-on
US20200324205A1 (en) Method and system for real-time animation generation using machine learning
CN108229559B (en) Clothing detection method, clothing detection device, electronic device, program, and medium
CN102656595B (en) Robust object recognition by dynamic modeling in augmented reality
CN111461089A (en) Face detection method, and training method and device of face detection model
CN110249304A (en) The Visual intelligent management of electronic equipment
CN111325226B (en) Information presentation method and device
CN108304757A (en) Personal identification method and device
CN113822254B (en) Model training method and related device
CN112733802A (en) Image occlusion detection method and device, electronic equipment and storage medium
CN114758362A (en) Clothing changing pedestrian re-identification method based on semantic perception attention and visual masking
CN109670517A (en) Object detection method, device, electronic equipment and target detection model
Kumar et al. Enhancing Face Mask Detection Using Data Augmentation Techniques
CN116091667A (en) Character artistic image generation system based on AIGC technology
CN112990154B (en) Data processing method, computer equipment and readable storage medium
CN113610953A (en) Information processing method and device and computer readable storage medium
KR102508765B1 (en) User-customized meta content providing system based on artificial neural network and method therefor
CN115731620A (en) Method for detecting counter attack and method for training counter attack detection model
CN116580054A (en) Video data processing method, device, equipment and medium
CN115861572B (en) Three-dimensional modeling method, device, equipment and storage medium
Pikula et al. FlexComb: a facial landmark-based model for expression combination generation
US20240087266A1 (en) Deforming real-world object using image warping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40047813

Country of ref document: HK

GR01 Patent grant