CN114302157A - Attribute tag identification and proxy broadcast event detection method, device, equipment and medium - Google Patents

Attribute tag identification and proxy broadcast event detection method, device, equipment and medium

Info

Publication number
CN114302157A
Authority
CN
China
Prior art keywords
attribute
face image
label
model
attribute label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111591536.1A
Other languages
Chinese (zh)
Other versions
CN114302157B (en)
Inventor
苏正航
陈增海
贺亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jinhong Network Media Co ltd
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Jinhong Network Media Co ltd
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jinhong Network Media Co ltd, Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Jinhong Network Media Co ltd
Priority to CN202111591536.1A priority Critical patent/CN114302157B/en
Publication of CN114302157A publication Critical patent/CN114302157A/en
Application granted granted Critical
Publication of CN114302157B publication Critical patent/CN114302157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses an attribute label identification method, a proxy broadcast event detection method, and a corresponding apparatus, device and medium. The identification method comprises the following steps: acquiring a video frame from a live video stream, the live video stream originating from a media server; recognizing a face image in the video frame with a face recognition model trained to convergence; performing attribute label prediction on the face image with an attribute label recognition student model to obtain the attribute labels corresponding to the face image, the student model having been trained to convergence in advance using video frames from any live video stream provided by the media server as training samples; and labeling the corresponding face image in the video frame with the attribute labels and outputting the result. With the method and device, the recognition capability of the attribute label recognition model can be improved by exploiting the massive live video streams of a webcast platform, and that recognition capability can in turn serve downstream tasks such as proxy broadcast behavior detection.

Description

Attribute tag identification and proxy broadcast event detection method, device, equipment and medium
Technical Field
The present application relates to the field of webcast technologies, and in particular, to an attribute tag identification method, a proxy broadcast event detection method, and corresponding apparatuses, computer devices, computer-readable storage media, and computer program products.
Background
With the rapid development of big data and deep learning, face attribute label recognition has been widely applied in the security industry. Face attributes include age, gender, attractiveness, hair color and the like. Most existing face attribute label recognition methods recognize face regions with purpose-designed neural networks; to balance the accuracy and deployment cost of each category, a cascade model or multi-task training is often used, yet the accuracy frequently falls short of expectations and cannot meet industry requirements.
One existing scheme simultaneously performs age estimation, gender identification and race classification on a face image by means of multi-task learning. The three single-task networks are trained separately, the weights of the slowest-converging network are then selected to initialize the shared part of the multi-task network, and the task-specific parts of the multi-task network are randomly initialized; the multi-task network is then trained to obtain a multi-task CNN model; finally, the trained multi-task CNN model is used to analyze the age, gender and race attributes of an input face image.
The convolutional neural networks of the various existing schemes depend on manually annotated training samples, so the training cost is high. Because manually annotated samples may be sparse and fail to fit practical application scenarios, the trained model struggles to acquire generalization ability quickly, making the training process slow and inefficient; and once the model is put into production, particularly for streaming media requiring fast response, good recognition performance is difficult to obtain.
In summary, the related art of face attribute label recognition still has considerable room for improvement, and the applicant has therefore conducted the related research.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and to provide an attribute tag identification method, a proxy broadcast event detection method, and corresponding apparatuses, a computer device, a computer-readable storage medium, and a computer program product.
To serve the various purposes of the present application, the following technical solutions are adopted:
The attribute tag identification method provided in accordance with one purpose of the present application comprises the following steps:
acquiring a video frame in a live video stream, wherein the live video stream is sourced from a media server;
recognizing a face image in the video frame by using a face recognition model trained to convergence;
adopting attribute labels to identify a student model, and performing attribute label prediction on the face image to obtain one or more attribute labels corresponding to the face image; the attribute label recognition student model is trained to a convergence state in advance by taking a video frame in any live video stream provided by the media server as a training sample, in the training process, an attribute label recognition instructor model trained to the convergence state in advance predicts an attribute label for the same training sample, and performs semi-supervised training on the attribute label recognition student model by using the attribute label, wherein the attribute label recognition instructor model is subjected to supervised training in advance to reach the convergence state;
and labeling and outputting the corresponding face image in the video frame by using the attribute label.
In an extended embodiment, the training process of the attribute label recognition student model comprises the following steps:
collecting a single video frame from any live video stream output by a media server as a training sample;
recognizing a face image in the video frame by using the face recognition model trained to convergence;
performing attribute label prediction on the face image with the attribute label recognition instructor model, trained to convergence in advance, to obtain one or more attribute labels corresponding to the face image, which form a soft label;
performing attribute label prediction on the face image with the attribute label recognition student model, to obtain one or more attribute labels corresponding to the face image, which form a result label;
calculating the loss value of the result label by referring to the soft label, judging whether the loss value reaches a preset threshold value, and terminating the training process when the loss value reaches the preset threshold value; otherwise, gradient updating is carried out, and the training samples are collected again to carry out iterative training on the attribute label recognition student model.
In a further embodiment, the face recognition model performs the following steps:
extracting image characteristic information of a plurality of scales of the video frame from a plurality of convolutional layers respectively;
performing feature fusion on the image feature information in a feature pyramid network to obtain corresponding fusion feature information;
extracting one or more bounding boxes according to the fusion characteristic information and mapping the bounding boxes to a classification space to obtain a classification result of the bounding boxes;
and cropping the corresponding face image from the video frame according to each bounding box that the classification result indicates is valid.
In an embodiment, the labeling output of the corresponding face image in the video frame with the attribute tag includes the following steps:
for each predicted face image, acquiring the bounding box identified by the face recognition model, and determining the position information of the face image in the video frame according to the bounding box;
generating an information image for representing the attribute label of each predicted face image, wherein the information image has a transparent background;
and superimposing, according to the position information, the information image at the corresponding image position in the video frame and its adjacent video frames, so as to achieve position-associated labeling of the corresponding face image and display the information image corresponding to each face image when the live video stream is played by a client device.
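To make this overlay-style labeling concrete, below is a minimal sketch, assuming OpenCV/NumPy and a BGRA information image; the function names render_label_image and overlay_label are illustrative and are not taken from the application.

```python
import cv2
import numpy as np

def render_label_image(text: str) -> np.ndarray:
    """Draw attribute-label text on a BGRA canvas whose background stays fully transparent."""
    canvas = np.zeros((32, 160, 4), dtype=np.uint8)          # H x W x BGRA, alpha = 0
    cv2.putText(canvas, text, (2, 22), cv2.FONT_HERSHEY_SIMPLEX,
                0.6, (255, 255, 255, 255), 1, cv2.LINE_AA)   # opaque white text
    return canvas

def overlay_label(frame_bgr: np.ndarray, label_bgra: np.ndarray, x: int, y: int) -> np.ndarray:
    """Alpha-composite the information image just above the face bounding box at (x, y)."""
    h, w = label_bgra.shape[:2]
    y0 = max(0, y - h)                                        # place the tag above the face
    roi = frame_bgr[y0:y0 + h, x:x + w]
    tag = label_bgra[:roi.shape[0], :roi.shape[1]]
    alpha = tag[:, :, 3:4].astype(np.float32) / 255.0
    roi[:] = ((1.0 - alpha) * roi + alpha * tag[:, :, :3]).astype(np.uint8)
    return frame_bgr
```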
In another embodiment, the labeling output of the corresponding face image in the video frame with the attribute tag includes the following steps:
for each predicted face image, acquiring the bounding box identified by the face recognition model, and determining the position information of the face image in the video frame according to the bounding box;
for each predicted face image, packaging the attribute label, the position information and the timestamp of the corresponding video frame in the live video stream into mapping relation data;
and outputting the mapping relation data to the same client device in synchronization with the live video stream, so that after the client device parses the mapping relation data, the rendered label information of the attribute labels is displayed correspondingly when the live video stream is played on the client device.
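As an illustration of the mapping relation data described in these steps, the sketch below packages the labels, position and timestamp for transport to the client; the field names and the JSON encoding are assumptions rather than the application's actual wire format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class FaceLabelMapping:
    attribute_labels: List[str]   # e.g. ["female", "age:20-25"]
    bbox: List[int]               # [x, y, width, height] of the face in the video frame
    timestamp_ms: int             # presentation timestamp of the video frame in the stream

def pack_mappings(mappings: List[FaceLabelMapping]) -> bytes:
    """Serialize the per-face mapping data for delivery in sync with the live video stream."""
    return json.dumps([asdict(m) for m in mappings]).encode("utf-8")

# The client parses the payload and draws the rendered label information at bbox when the
# frame whose timestamp matches timestamp_ms is displayed.
payload = pack_mappings([FaceLabelMapping(["female", "age:20-25"], [320, 180, 96, 96], 1_234_567)])
```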
In another embodiment, the labeling output of the corresponding face image in the video frame with the attribute tag includes the following steps:
comparing the attribute labels corresponding to each face image with the user registration information of the anchor user of the live video stream, and marking, according to the comparison result, whether the attribute labels of any one face image are consistent with the user registration information, the attribute labels comprising gender and age;
and when the mark indicates inconsistency, determining that the anchor user of the live video stream is engaging in proxy broadcast behavior, and triggering a background notification message to be sent to a preset network address.
The proxy broadcast event detection method provided in accordance with one purpose of the present application comprises the following steps:
acquiring a video frame in a live video stream, wherein the live video stream is sourced from a media server;
recognizing a face image in the video frame by using a face recognition model trained to convergence;
performing attribute label prediction on the face image by using an attribute label recognition student model, to obtain the attribute labels corresponding to the face image, the attribute labels comprising a gender label and/or an age label; the attribute label recognition student model has been trained to convergence in advance using video frames from any live video stream provided by the media server as training samples; during that training, an attribute label recognition instructor model, itself trained to convergence in advance through supervised training, predicts attribute labels for the same training samples, and those labels are used to perform semi-supervised training of the student model;
matching the attribute labels corresponding to each face image against the user registration information of the anchor user of the live video stream, and determining that a proxy broadcast event exists for the anchor user if at least one attribute label of any face image does not match the user registration information;
and monitoring the duration for which the proxy broadcast event persists for each anchor user, and triggering a notification event when the duration reaches a preset threshold.
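The following is a minimal sketch of this detection logic under assumed data shapes; the threshold value, the field names and the in-memory timer are illustrative only and not specified by the application.

```python
import time
from typing import Dict, List, Optional

DURATION_THRESHOLD_S = 300               # assumed preset threshold: 5 minutes of sustained mismatch
_mismatch_since: Dict[str, float] = {}   # anchor_id -> time the proxy broadcast event started

def labels_match_registration(face_labels: Dict[str, str], registration: Dict[str, str]) -> bool:
    """A face matches if its predicted gender equals the registered gender and its
    predicted age bracket contains the registered age."""
    if face_labels.get("gender") != registration.get("gender"):
        return False
    low, high = (int(v) for v in face_labels.get("age", "0-200").split("-"))
    return low <= int(registration["age"]) <= high

def check_proxy_broadcast(anchor_id: str, faces: List[Dict[str, str]],
                          registration: Dict[str, str]) -> Optional[str]:
    """Return a notification message once a proxy broadcast event has persisted long enough."""
    mismatch = any(not labels_match_registration(f, registration) for f in faces)
    if not mismatch:
        _mismatch_since.pop(anchor_id, None)        # event cleared, reset the timer
        return None
    started = _mismatch_since.setdefault(anchor_id, time.time())
    if time.time() - started >= DURATION_THRESHOLD_S:
        return f"proxy broadcast suspected for anchor {anchor_id}"
    return None
```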
An attribute tag identification apparatus adapted to one of the objects of the present application comprises an image acquisition module, a face recognition module, a label prediction module and a label output module. The image acquisition module is configured to acquire a video frame from a live video stream, the live video stream originating from a media server; the face recognition module is configured to recognize a face image in the video frame with a face recognition model trained to convergence; the label prediction module is configured to perform attribute label prediction on the face image with an attribute label recognition student model, to obtain one or more attribute labels corresponding to the face image, the student model having been trained to convergence in advance using video frames from a live video stream provided by the media server as training samples, during which an attribute label recognition instructor model, itself trained to convergence in advance through supervised training, predicts attribute labels for the same training samples and those labels are used to perform semi-supervised training of the student model; and the label output module is configured to label the corresponding face image in the video frame with the attribute labels and output the result.
In an extended embodiment, a first training module is run to train the attribute label recognition student model, and the first training module comprises: a sample acquisition submodule, configured to collect a single video frame from any live video stream output by the media server as a training sample; a face extraction submodule, configured to recognize a face image in the video frame with the face recognition model trained to convergence; an instructor processing submodule, configured to perform attribute label prediction on the face image with the attribute label recognition instructor model, trained to convergence in advance, to obtain one or more attribute labels corresponding to the face image, which form a soft label; a student prediction submodule, configured to perform attribute label prediction on the face image with the attribute label recognition student model, to obtain one or more attribute labels corresponding to the face image, which form a result label; and an update iteration submodule, configured to calculate the loss value of the result label with reference to the soft label, judge whether the loss value reaches a preset threshold, terminate the training process when it does, and otherwise perform a gradient update and collect training samples again for iterative training of the attribute label recognition student model.
In a further embodiment, the face recognition module constructs a face extraction sub-module during operation, and the face extraction sub-module includes: a convolution extraction unit, configured to extract image feature information of multiple scales of the video frame from multiple convolution layers, respectively; the characteristic fusion unit is used for carrying out characteristic fusion on the plurality of image characteristic information in the characteristic pyramid network to obtain corresponding fusion characteristic information; the classification mapping unit is used for extracting one or more bounding boxes according to the fusion characteristic information and mapping the bounding boxes to a classification space to obtain a classification result; and the image cutting unit is used for cutting out a corresponding face image from the video frame according to the boundary box represented as the effective classification result.
In an embodied embodiment, the label output module comprises: an image positioning submodule, configured to acquire, for each predicted face image, the bounding box identified by the face recognition model and determine the position information of the face image in the video frame according to the bounding box; an information rendering submodule, configured to generate, for each predicted face image, an information image representing its attribute labels, the information image having a transparent background; and an association labeling submodule, configured to superimpose, according to the position information, the information image at the corresponding image position in the video frame and its adjacent video frames, achieving position-associated labeling of the corresponding face image so that the information image corresponding to each face image is displayed when the live video stream is played by a client device.
In another embodiment, the label output module comprises: an image positioning submodule, configured to acquire, for each predicted face image, the bounding box identified by the face recognition model and determine the position information of the face image in the video frame according to the bounding box; a data packaging submodule, configured to package, for each predicted face image, the attribute labels, the position information and the timestamp of the corresponding video frame in the live video stream into mapping relation data; and a data pushing submodule, configured to output the mapping relation data to the same client device in synchronization with the live video stream, so that after the client device parses the mapping relation data, the rendered label information of the attribute labels is displayed correspondingly when the live video stream is played on the client device.
In yet another embodiment, the label output module comprises: an information comparison submodule, configured to compare the attribute labels corresponding to each face image with the user registration information of the anchor user of the live video stream and to mark, according to the comparison result, whether the attribute labels of any one face image are consistent with the user registration information, the attribute labels comprising gender and age; and a proxy broadcast determination submodule, configured to determine, when the mark indicates inconsistency, that the anchor user of the live video stream is engaging in proxy broadcast behavior, and to trigger a background notification message to be sent to a preset network address.
A proxy broadcast event detection device adapted to one of the objects of the present application comprises an image acquisition module, a face recognition module, a label prediction module, a proxy broadcast detection module and a monitoring notification module. The image acquisition module is configured to acquire a video frame from a live video stream, the live video stream originating from a media server; the face recognition module is configured to recognize a face image in the video frame with a face recognition model trained to convergence; the label prediction module is configured to perform attribute label prediction on the face image with an attribute label recognition student model, to obtain the attribute labels corresponding to the face image, the attribute labels comprising a gender label and/or an age label, the student model having been trained to convergence in advance using video frames from any live video stream provided by the media server as training samples, during which an attribute label recognition instructor model, itself trained to convergence in advance through supervised training, predicts attribute labels for the same training samples and those labels are used to perform semi-supervised training of the student model; the proxy broadcast detection module is configured to match the attribute labels corresponding to each face image against the user registration information of the anchor user of the live video stream, and to determine that a proxy broadcast event exists for the anchor user if at least one attribute label of any face image does not match the user registration information; and the monitoring notification module is configured to monitor the duration for which the proxy broadcast event persists for each anchor user and to trigger a notification event when the duration reaches a preset threshold.
A computer device adapted for one of the purposes of the present application comprises a central processing unit and a memory, the central processing unit being configured to invoke execution of a computer program stored in the memory to perform the steps of the attribute tag identification method described herein.
A computer-readable storage medium, which stores in the form of computer-readable instructions a computer program implemented according to the method for attribute tag identification, which when invoked by a computer performs the steps included in the method.
A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the application has the following advantages:
In the present application, an attribute label recognition student model trained to convergence in advance is used to recognize attribute labels from the video stream transmitted in real time during webcasting, that is, from the person images in the live video stream output by the media server, so as to obtain the corresponding attribute labels with which the corresponding face images are annotated. The student model is trained on training samples collected from the live video streams output by that same media server, and thus acquires its attribute recognition capability from data of the same type and the same source as the data it later processes in production within the webcast application scenario. During training, the massive number of video streams on the webcast platform therefore supplies a huge volume of video frames carrying strong live-broadcast feature expressions as training samples; these training samples generalize well and are generated automatically, so the training cost is low, convergence is fast, and the recognition capability learned by the student model is better.
Secondly, a live video stream in webcasting is streaming media transmitted in real time, and the platform side always needs to grasp its content information promptly. Under this constraint, the face images in the live video stream are annotated in time to provide basic data for downstream tasks, which is made possible by the student model being very lightweight.
In addition, since the models are applied, in both the training and production stages, to live video streams from the same media server source, the network platform side effectively mines its own massive video data to serve different downstream tasks, which saves a great deal of model training cost, serves different requirements, and embodies an advantage of economies of scale.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of an attribute tag identification method of the present application;
FIG. 2 is a schematic diagram of a network architecture of an attribute tag identification model of the present application in implementing knowledge distillation;
FIG. 3 is a schematic flow chart of a process for identifying trained student models according to the present application;
FIG. 4 is a schematic flow chart of the working process of the face recognition model of the present application;
FIG. 5 is a schematic diagram of a network architecture of a face recognition model according to the present application;
FIG. 6 is a flow chart illustrating a process of outputting attribute tags in an extended embodiment of the present application;
FIG. 7 is a schematic diagram illustrating the effect of an attribute tab of the present application being displayed on a graphical user interface of a live broadcast room;
FIG. 8 is a flow chart illustrating a process of outputting attribute tags in an extended embodiment of the present application;
FIG. 9 is a schematic flow chart diagram illustrating an exemplary embodiment of the proxy broadcast event detection method of the present application;
FIG. 10 is a functional block diagram of the attribute tag identification apparatus of the present application;
FIG. 11 is a functional block diagram of the proxy broadcast event detection device of the present application;
fig. 12 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server for implementation by a client remotely invoking an online service interface provided by a capture server for access, or may be deployed directly and run on the client for access.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and used for remote call at a client, and can also be deployed in a client with qualified equipment capability for direct call.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations therefrom.
The attribute tag identification method can be programmed into a computer program product and is deployed in a client or a server to run, so that the method can be executed by accessing an open interface after the computer program product runs and performing man-machine interaction with a process of the computer program product through a graphical user interface.
Referring to fig. 1, in an exemplary embodiment of the attribute tag identification method of the present application, the method includes the following steps:
step S1100, acquiring a video frame in a live video stream, wherein the live video stream is from a media server:
the live video stream refers to a video stream which is provided by a network live platform and opens a live room service, is output in real time from a media server of the live room according to the live room implementation logic and is transmitted to a live room user for analysis and display. The live video stream is generally pushed by the anchor user side, and is sent to other online audience users in the live broadcast room after corresponding audio and video processing is carried out by the media server.
When the video frames are acquired, the media server may decode the live video stream received from the anchor user side and extract frames from it, or the live video stream output after encoding by the media server may be decoded specifically to obtain its video frames. In the present application, when attribute tag identification is performed on the live video stream, the video frames required for identification can be obtained from the live video stream in either way.
Depending on the specific requirements of live video stream identification, every video frame of the live video stream may be identified throughout the broadcast, or video frames may be sampled from the live video stream at a certain time or frame interval for identification; those skilled in the art can implement this flexibly.
Step S1200, recognizing the face image in the video frame by adopting the face recognition model trained to be convergent:
In the method, a face recognition model trained to convergence is used to extract the face images from the video frame. The face recognition model may be any of the various existing target recognition models known to those skilled in the art, or any target recognition model found to perform better. These target recognition models are implemented on convolutional neural networks, for example EfficientDet, Yolo, and the like; this role may be filled by a pre-trained target recognition model, or a person skilled in the art may initialize the state of a target recognition model and train it to convergence themselves for use in the present application.
The video frame obtained from the live video stream is input into the face recognition model. The face recognition model first performs representation learning on the video frame to obtain its deep semantic information, then outputs one or more bounding boxes according to that information together with a classification result indicating whether each bounding box belongs to a face, so that the bounding boxes corresponding to face images can be determined from the classification results and the corresponding face images can be cropped from the original video frame according to the bounding box positions. Because one or more persons may appear in the video frame, one or more face images may be output accordingly. Where multiple face images exist, the subsequent steps may perform attribute tag recognition on each face image one by one, or only the face image with the best imaging conditions may be selected as the main image for attribute tag recognition alone, where the imaging conditions include the corresponding bounding box having the largest area, the bounding box having higher confidence, and the like. Those skilled in the art can handle this flexibly.
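For illustration, the sketch below shows the post-detection handling described here: cropping each detected face and optionally selecting a single main face by bounding-box area with confidence as a tie-breaker. The detector interface and all names are assumed, not a specific library API.

```python
from typing import List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]          # (x, y, w, h) in frame coordinates

def crop_faces(frame: np.ndarray, boxes: List[Box]) -> List[np.ndarray]:
    """Cut each detected face region out of the original video frame."""
    h, w = frame.shape[:2]
    crops = []
    for (x, y, bw, bh) in boxes:
        x0, y0 = max(0, x), max(0, y)
        crops.append(frame[y0:min(h, y + bh), x0:min(w, x + bw)].copy())
    return crops

def select_main_face(boxes: List[Box], scores: List[float]) -> int:
    """Pick one face to label when several are present: largest area, ties broken by confidence."""
    areas = [bw * bh for (_, _, bw, bh) in boxes]
    return max(range(len(boxes)), key=lambda i: (areas[i], scores[i]))
```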
Step S1300, performing attribute label prediction on the face image by using an attribute label recognition student model, to obtain one or more attribute labels corresponding to the face image; the attribute label recognition student model has been trained to convergence in advance using video frames from any live video stream provided by the media server as training samples; during that training, an attribute label recognition instructor model, itself trained to convergence in advance through supervised training, predicts attribute labels for the same training samples, and those labels are used to perform semi-supervised training of the student model:
Attribute label recognition is performed on the face image with an attribute label recognition model. Here, the attribute label recognition model is an attribute label recognition student model obtained with a knowledge distillation technique: an attribute label recognition instructor model with the same network architecture guides the training of the student model, so the instructor model performs semi-supervised training of the student model until the student model converges, and the student model finally learns the capability of deriving one or more suitable attribute labels from a face image.
Referring to the exemplary network architecture for knowledge distillation shown in fig. 2, during knowledge distillation the instructor model and the student model make predictions for the same video frame; the attribute label predicted by the instructor model serves as the supervision label for the attribute label predicted by the student model, a corresponding loss value is computed for gradient update, and the student model is fitted to the instructor model. The instructor model is trained to convergence in advance on a small number of manually labeled samples, and then performs semi-supervised training of the student model during knowledge distillation. As a result, the attribute label recognition model does not need to be trained on massive labeled samples all at once: the instructor model alone supervises the student model, which can be trained to convergence without further labeled samples, saving a great deal of labeling cost. For a webcast platform, the massive video frames generated by live broadcasts every day can serve as unlabeled samples in the knowledge distillation process; training the attribute label recognition model with knowledge distillation in a webcast application scenario to obtain the attribute label recognition student model required for production therefore has high economic value.
When the attribute label recognition student model is trained in the present application, the training is performed directly on video frames from the live video streams provided by the media server of the webcast platform. This not only supplies massive unlabeled samples for the training process; because the live video streams contain rich person images in widely varying poses, the samples have broadly generalized characteristics, which helps improve the feature generalization ability of the trained student model and thus its ability to recognize attribute labels.
The face images obtained in the previous step are input one by one into the attribute label recognition student model trained to convergence; after representation learning on each face image, the student model maps it to a multi-classification space and obtains the corresponding one or more attribute labels. It will be understood that, for a video frame containing multiple face images, each face image obtains its own one or more attribute labels.
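As a concrete illustration of this multi-classification mapping, the following PyTorch sketch gives a face-attribute student model with one classification head per attribute; the backbone, the attribute set and the class counts are assumptions for illustration and are not specified by the application.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AttributeStudent(nn.Module):
    def __init__(self, classes_per_attr=None):
        super().__init__()
        # assumed attribute set and class counts, for illustration only
        classes_per_attr = classes_per_attr or {"gender": 2, "age_bracket": 8, "hair_color": 5}
        backbone = models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])   # global-pooled features
        self.heads = nn.ModuleDict({name: nn.Linear(512, n) for name, n in classes_per_attr.items()})

    def forward(self, faces: torch.Tensor) -> dict:
        feats = self.features(faces).flatten(1)                          # (B, 512)
        # one probability distribution per attribute; argmax gives the predicted label index
        return {name: head(feats).softmax(dim=-1) for name, head in self.heads.items()}

# usage sketch: a batch of 4 cropped faces resized to 112x112
preds = AttributeStudent()(torch.randn(4, 3, 112, 112))
labels = {attr: p.argmax(dim=-1) for attr, p in preds.items()}
```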
Step S1400, labeling and outputting the corresponding face image in the video frame by the attribute label:
After the attribute labels corresponding to each face image are obtained, a correspondence can be established between the attribute labels and each face image; the labels may be superimposed into the corresponding video frames of the live video stream, sent to the client devices of the audience users in the live room for parsing and processing, or sent to a corresponding network address, and so on. Through these correspondences the attribute labels annotate the corresponding face images, so that the network platform obtains annotation information about the face images in the dynamic content of the live video stream provided by the media server.
Since the live video stream is streaming media in which the video frames are organized along a time axis, when the video picture is annotated for a face image, the annotation information corresponding to that face image can be kept on screen for a preset duration so that users can view the label information conveniently. In practice, when attribute tags are identified for the live video stream, the video frames can be identified frame by frame or at certain time intervals, and when a previously annotated face image disappears, the corresponding annotation information need not continue to be displayed. Those skilled in the art can implement this flexibly to match the needs of the user experience.
With the present exemplary embodiment, it can be appreciated that the implementation of the present application has many positive advantages, including but not limited to the following:
In the present application, an attribute label recognition student model trained to convergence in advance is used to recognize attribute labels from the video stream transmitted in real time during webcasting, that is, from the person images in the live video stream output by the media server, so as to obtain the corresponding attribute labels with which the corresponding face images are annotated. The student model is trained on training samples collected from the live video streams output by that same media server, and thus acquires its attribute recognition capability from data of the same type and the same source as the data it later processes in production within the webcast application scenario. During training, the massive number of video streams on the webcast platform therefore supplies a huge volume of video frames carrying strong live-broadcast feature expressions as training samples; these training samples generalize well and are generated automatically, so the training cost is low, convergence is fast, and the recognition capability learned by the student model is better.
Secondly, a live video stream in webcasting is streaming media transmitted in real time, and the platform side always needs to grasp its content information promptly. Under this constraint, the face images in the live video stream are annotated in time to provide basic data for downstream tasks, which is made possible by the student model being very lightweight.
In addition, since the models are applied, in both the training and production stages, to live video streams from the same media server source, the network platform side effectively mines its own massive video data to serve different downstream tasks, which saves a great deal of model training cost, serves different requirements, and embodies an advantage of economies of scale.
Referring to fig. 3, in an extended embodiment, the training process of the attribute tag recognition student model is implemented by using a knowledge distillation idea, which can be combined with the network architecture shown in fig. 2 to enhance understanding, and the process includes the following steps:
step S2100, collecting a single video frame from any live video stream output by the media server as a training sample:
as mentioned above, when training, any live video stream output by the media server is directly adopted to obtain training samples required by training. Preferably, the video frame can be extracted from an image space after the media server decodes the live video stream provided by the anchor user side, and a single video frame can be extracted during each training and used as a training sample.
Step S2200, recognizing the face image in the video frame by adopting the face recognition model trained to be convergent:
as for the implementation of this step, please refer to step S1200, the two techniques are completely the same, and only one or more face images in the video frame serving as the training sample need to be recognized by using the face recognition model.
Step S2300, adopting an attribute label recognition instructor model trained to a convergence state in advance, performing attribute label prediction on the face image to obtain one or more attribute labels corresponding to the face image, and forming a soft label:
as shown in fig. 2, the attribute label recognition instructor model and the attribute label recognition student model are actually two different instances of the same attribute label recognition model, wherein the instructor model is pre-trained to a convergence state by using a relatively small number of manually labeled training samples as described above, and thus the output attribute labels have the capability of guiding the student models to perform representation learning.
The attribute label identification model is similarly implemented by adopting a convolutional neural network model, such as Resnet, RegNet and the like, and is used for representing and learning the face image, extracting image feature information in the face image, and then performing multi-classification mapping according to the image feature information to obtain one or more corresponding attribute labels so as to predict the attribute labels of the face image.
In an exemplary example of preparing the instructor model, RegNet-4.0G is adopted as the model; based on the face regions corresponding to the face images and manually annotated labels (generally small in quantity), a high-precision model is obtained by training with a softmax loss with label smoothing. This RegNet-4.0G is the instructor model and can generate soft labels in real time while the small model, i.e. the student model, is trained. Since live video streams are continuous, the small model can be trained with a huge amount (on the order of millions or even tens of millions) of data carrying soft labels.
The cross entropy function employed by the classifier Softmax is as follows:
L = -\frac{1}{n}\sum_{i=1}^{n} y^{(i)} \log\left(p_i\right)
where n is the total number of training samples, p_i is the network's prediction for the i-th sample, and y^{(i)} is the manual label of the i-th sample.
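For reference, a minimal PyTorch sketch of the label-smoothed softmax cross-entropy used to pre-train the instructor model follows; the smoothing factor 0.1 is an assumed value, not one stated in the application.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 5)              # instructor-model outputs for 8 faces, 5 classes
targets = torch.randint(0, 5, (8,))     # manually annotated hard labels y(i)
loss = criterion(logits, targets)       # averaged over the n samples in the batch
```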
In the knowledge distillation process, a teacher model and a student model synchronously carry out representation learning and prediction on the same face image, wherein the attribute labels obtained by prediction of the teacher model are used as soft labels for supervising the attribute labels predicted by the student model.
Step S2400, performing attribute label prediction on the face image with the attribute label recognition student model, to obtain one or more attribute labels corresponding to the face image, which form a result label:
in the other path parallel to the instructor model, after the attribute label recognition student model performs representation learning on the same face image in the same way, one or more attribute labels are mapped through the classifier and serve as result labels.
Step S2500, calculating a loss value of the result label by referring to the soft label, judging whether the loss value reaches a preset threshold value, and terminating the training process when the loss value reaches the preset threshold value; otherwise, updating the gradient, and acquiring the training sample again to carry out iterative training on the attribute label recognition student model:
As described above, the soft label of the instructor model supervises the result label of the student model: a loss value between the result label and the soft label is calculated using a hyperparameter. When the loss value does not reach the preset threshold, the student model has not converged; a gradient update is applied to the student model according to the loss value, and a new video frame is collected from any live video stream as the next training sample for iterative training, that is, steps S2100 to S2500 are repeated. Otherwise the student model has converged and training can be terminated. During training, a manual label, i.e. a hard label, can also be provided for a given video frame, and the student model is then updated with the weighted sum of the loss computed against the hard label and the loss computed against the soft label, which improves the convergence speed of the student model.
One of the highlights here is that the instructor model guides the student model in real time and the multi-task training is performed in a semi-supervised form, so that the student model ultimately performs better than the instructor model while its network complexity is far lower. When training the student model, two loss functions are used: one is the distillation loss (Loss_1), which fits the student model to the teacher model, and the other is the Softmax loss (Loss_2), which fits the student model to the manual labels. The distillation loss and the Softmax loss are essentially identical in form, except that the soft label replaces the manual label.
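A minimal sketch of one student update combining the two losses is given below; following the description above, the distillation loss is written as a cross-entropy against the instructor's soft label (which requires a recent PyTorch where cross_entropy accepts probabilistic targets), and the weighting factor alpha is an assumed hyperparameter, not a value from the application.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, faces, hard_labels=None, alpha=0.7):
    """One semi-supervised update: soft-label (distillation) loss, plus an optional
    hard-label softmax loss when a manual label is available for this video frame."""
    with torch.no_grad():
        soft_labels = F.softmax(teacher(faces), dim=-1)          # instructor's soft label
    student_logits = student(faces)
    loss_1 = F.cross_entropy(student_logits, soft_labels)        # distillation loss (Loss_1)
    if hard_labels is None:
        loss = loss_1
    else:
        loss_2 = F.cross_entropy(student_logits, hard_labels)    # softmax loss vs. manual label (Loss_2)
        loss = alpha * loss_1 + (1.0 - alpha) * loss_2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```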
Actual measurements show that when the trained small model is deployed with C++ on TensorRT, a processing rate of one frame every 5 seconds can be achieved, and the anchor attributes of all channels of the same webcast platform can be updated in real time at the second level.
The advantage of this embodiment is that the instructor model only needs a small number of manually labeled samples to be pre-trained to convergence, while the student model can be trained without manually labeled samples: the output of the instructor model serves as the soft label that supervises the student model's training, so the training process can proceed without manually labeled samples, and the video frames in the live video streams on the network platform's media server can be used directly as unlabeled samples. Applying the knowledge distillation idea thus saves model training cost and exploits the resource advantage of the massive live video streams on the webcast platform as a source of unlabeled samples, achieving a form of recycling, mining the value of the data, and obtaining economies of scale.
Referring to fig. 4, in a further embodiment, the face recognition model performs the following steps:
step S3100, extracting image feature information of a plurality of scales of the video frame from each of a plurality of convolutional layers:
in this embodiment, the EfficientDet model is recommended to be used as the face recognition model of the present application, and is trained to a convergence state in advance, and a pre-trained example can be directly adopted.
As shown in fig. 5, according to the inherent architecture of the EfficientDet model, after a video frame is input to a face recognition model, deep semantic information of the video frame is extracted at different scales through convolution layers of multiple levels constituting a backbone network, so as to obtain corresponding image feature information. Among the image characteristic information with different scales, the larger scale is favorable for capturing the global characteristics of the image in the video frame, and the smaller scale is favorable for capturing the local characteristics of the image in the video frame, so that the method is suitable for different sizes of the figures in the video frame, and can capture the face characteristics of the figure image in the video frame with high fidelity.
Step S3200, performing feature fusion on the plurality of image feature information in the feature pyramid network to obtain corresponding fusion feature information:
the image feature information of different scales obtained in the previous step is input into a feature pyramid network (BiFPN) for feature interaction, so that bidirectional feature fusion is realized, and corresponding fusion feature information is obtained, and then the fusion feature information is output into a classification network and a bounding box prediction network for further classification mapping and bounding box prediction.
Step S3300, extracting one or more bounding boxes according to the fusion feature information and mapping the bounding boxes to a classification space to obtain a classification result:
On the one hand, the classification network maps the fully connected fused feature information to its classification space, which is a binary space; that is, the classification network can adopt a binary classifier suited to making a classification judgment on each corresponding bounding box predicted by the bounding box prediction network, to indicate whether that bounding box is valid.
On the other hand, the bounding box prediction network determines one or more bounding boxes by capturing contour information in the fused feature information, and obtains corresponding bounding box coordinates.
Step S3400, cutting out a corresponding face image from the video frame according to the boundary box represented as effective by the classification result:
After the coordinates of the bounding boxes present in the video frame and their corresponding classification results have been determined in the previous steps, the valid bounding boxes are identified from the classification results in this step. The image area of the video frame mapped by the coordinates of each valid bounding box is then determined and cut out, yielding the face image corresponding to that bounding box; these face images can then be supplied to the attribute label recognition student model of the present application for further attribute labeling. Of course, the face images may have different sizes; for this, a person skilled in the art can flexibly scale them to the input specification of the attribute label recognition student model.
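As a rough illustration of step S3400 only, the following sketch crops face images from a decoded frame using boxes judged valid by the classifier and resizes them to an assumed student-model input size. The use of OpenCV, the confidence threshold and the 112x112 output size are assumptions of the example, not values prescribed by the present application.

# Sketch: crop valid face boxes from a frame and resize to the student input size.
import cv2  # OpenCV; frame is an H x W x 3 array decoded from the live stream

def crop_faces(frame, boxes, scores, score_thresh=0.5, out_size=(112, 112)):
    """boxes: iterable of (x1, y1, x2, y2) pixel coordinates; scores: classifier
    confidence that each box really contains a face (thresholds are assumed)."""
    h, w = frame.shape[:2]
    faces = []
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        if s < score_thresh:
            continue                               # box characterised as invalid
        x1, y1 = max(0, int(x1)), max(0, int(y1))  # clamp to the frame
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 <= x1 or y2 <= y1:
            continue
        crop = frame[y1:y2, x1:x2]
        faces.append(cv2.resize(crop, out_size))   # scale to the assumed input spec
    return faces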
In fact, those skilled in the art may also train the face recognition model by themselves, and construct a data set required for training in advance, where the data set includes video frames serving as training samples and their corresponding artificial boundary box labels, and accordingly, it is sufficient to perform iterative training on the face recognition model by using the training samples until the face recognition model converges.
In the training process of the face recognition model, focal loss is used to calculate the cross-entropy loss over the prediction results of all classes that are not ignored, and a smooth L1 loss is used to calculate the regression loss of the target bounding box, wherein:
the calculation formula for the focal loss is:

FL(p, y) = -α·y·(1-p)^γ·log(p) - (1-α)·(1-y)·p^γ·log(1-p)

where p is the predicted value of the neural network, y is the manual label, and α and γ are adjustable hyper-parameters. The smooth L1 loss is calculated as:

L_reg = Σ_{i∈{x,y,w,h}} smooth_L1(v̂_i - v_i), with smooth_L1(d) = 0.5·d² if |d| < 1 and |d| - 0.5 otherwise,

where v = (v_x, v_y, v_w, v_h) represents the box coordinates of the sample and v̂ = (v̂_x, v̂_y, v̂_w, v̂_h) represents the predicted bounding box coordinates; that is, the loss is computed for each of the four coordinates separately and the results are summed as the regression error of the bounding box.
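For illustration, the two training losses above can be written as the following Python (PyTorch) sketch. The default values of α and γ and the mean reduction are common choices assumed for the example, not values stated in this application.

# Hedged sketch of the focal loss (classification branch) and smooth L1 loss
# (box regression branch) described above.
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """p: predicted probability in (0, 1); y: 0/1 label; alpha/gamma assumed."""
    p = p.clamp(eps, 1.0 - eps)
    return -(alpha * y * (1 - p) ** gamma * torch.log(p)
             + (1 - alpha) * (1 - y) * p ** gamma * torch.log(1 - p)).mean()

def smooth_l1(v_pred, v_true):
    """v_*: tensors of box coordinates (x, y, w, h); the losses of the four
    coordinates are summed, as in the text."""
    d = (v_pred - v_true).abs()
    per_coord = torch.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return per_coord.sum(dim=-1).mean()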
In the embodiment, the face recognition model realized based on EfficientDet is optimized, and the depth feature interaction is performed on the basis of capturing the multi-scale image features of the video frame by using the convolution layers and the feature pyramid network of a plurality of levels, so that the face image recognition of the video frame with complex dynamic change features, such as a live broadcast picture, is realized, the face image can be recognized from the live broadcast video stream to the greatest extent, the false recognition is avoided, the face image recognition accuracy is ensured, and the accuracy of recognizing the attribute label of the face image is improved.
Referring to fig. 6, in an embodiment, the step S1400 of labeling and outputting the corresponding face image in the video frame with the attribute tag includes the following steps:
step S1411, for each predicted face image, acquiring a bounding box identified by the face recognition model, and determining position information of the face image in the video frame according to the bounding box:
After the attribute labels of the face images have been identified in the previous steps of the present application, i.e., steps S1100 to S1300, this embodiment first obtains the bounding boxes output in step S1200. Using the mapping relationship between the face images and the video frame, the position information of each face image in the video frame is derived from the coordinates of its valid bounding box, so that the position corresponding to every identified face image is determined for further labeling of the face images.
Step S1412, generating an information image for representing the attribute label of each predicted face image, where the information image has a transparent background:
For each predicted face image, the information carried by each corresponding attribute label needs to be annotated, so each attribute label can be converted into an information image. For example, the attribute labels are first converted into character information such as '{gender: female; age: 20; ...}', and this character information is then rendered into an information image whose background is kept transparent (including semi-transparent), thereby completing the construction of the annotation information corresponding to the face image.
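A minimal sketch of step S1412 follows, assuming Pillow as the imaging library; the image width, line height, colours and the default bitmap font are placeholders of the example rather than requirements of this application.

# Sketch: turn the attribute labels of one face into a small RGBA information
# image with a semi-transparent background.
from PIL import Image, ImageDraw, ImageFont

def render_label_image(labels, width=160, line_height=22):
    """labels: escaped tag text such as {"gender": "female", "age": "20"}."""
    lines = [f"{k}: {v}" for k, v in labels.items()]
    img = Image.new("RGBA", (width, line_height * len(lines) + 8), (0, 0, 0, 96))
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real TTF would normally be loaded here
    for i, text in enumerate(lines):
        draw.text((4, 4 + i * line_height), text, fill=(255, 255, 255, 255), font=font)
    return img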
Step S1413, according to the position information, superimposing the information image to the corresponding image position of the video frame and the adjacent video frame to realize position-related labeling with the corresponding face image, so that the information image corresponding to each face image is displayed when the live video stream is played by the client device:
the position information of each predicted face image in the video frame is determined, so that an anchor point of an information image corresponding to each face image can be determined in the video frame according to the position information, the anchor point can be synchronously changed in association with the change of the boundary frame, and then the information image is positioned and superposed into the anchor point to be superposed into the image of the video frame, so that the association labeling of the face image in the video frame and the corresponding information image is realized.
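For illustration, the positioning and superposition described above might look like the following sketch, which pastes the transparent information image at an anchor point derived from the face's bounding box. Placing the anchor just above the box, and the use of Pillow's alpha compositing, are assumptions of the example.

# Sketch: composite the transparent info image onto the frame at the anchor point.
def overlay_label(frame_img, label_img, box):
    """frame_img: PIL RGBA image of the video frame; box: (x1, y1, x2, y2)."""
    x1, y1, _, _ = box
    anchor = (int(x1), max(0, int(y1) - label_img.height))  # assumed: above the face box
    frame_img.alpha_composite(label_img, dest=anchor)        # in-place superposition
    return frame_img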
After the associated labeling process is completed, when the live video stream reaches the client device of the live room and is played and displayed, the character image can be displayed, and simultaneously, the corresponding information image can be displayed along with the face of the character image, namely, the attribute label corresponding to the face image in the character image is displayed. Fig. 7 shows the corresponding effect diagram.
In this embodiment, the attribute tag obtained by the present application is used for performing attribute tagging of a face image on a live video stream in a background, and when the attribute tag is transmitted to a client device for display, the attribute tag facilitates display of attribute information corresponding to a person in the live video stream by a terminal device, and user experience can be improved.
Referring to fig. 8, in another embodiment, the step S1400 of labeling and outputting the corresponding face image in the video frame with the attribute tag includes the following steps:
step S1421, for each predicted face image, acquiring a bounding box identified by the face recognition model, and determining position information of the face image in the video frame according to the bounding box:
After the attribute labels of the face images have been identified in the previous steps of the present application, i.e., steps S1100 to S1300, this embodiment first obtains the bounding boxes output in step S1200. Using the mapping relationship between the face images and the video frame, the position information of each face image in the video frame is derived from the coordinates of its valid bounding box, so that the position corresponding to every identified face image is determined for further labeling of the face images.
Step S1422, for each predicted face image, encapsulate the attribute tag, the location information, and the timestamp of the corresponding video frame in the live video stream into mapping relationship data:
The key difference of this embodiment is that, whereas in the previous embodiment the background is responsible for superimposing the information corresponding to the attribute tags onto the live video stream, this embodiment sends the information indicated by the attribute tags to the client devices in the live broadcast room, and each client device is responsible for rendering its interface to display the corresponding annotation information of the person image.
Therefore, in this step, for each predicted face image, the attribute tag, the position information, and the timestamp of the identified video frame in the live video stream may be encapsulated as mapping relationship data to form one notification message in the information stream of the live room.
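As a non-limiting sketch, the mapping relationship data of step S1422 could be serialised as the following JSON message for the live-room information stream. The field names are assumptions of the example; only the three ingredients named in the text (attribute tags, position information, timestamp) are essential.

# Sketch: encapsulate attribute tags, position and timestamp as one message.
import json

def build_mapping_message(face_id, attr_tags, box, pts_ms):
    payload = {
        "face_id": face_id,                  # assumed identifier for the face image
        "attribute_tags": attr_tags,         # e.g. {"gender": "female", "age": "20"}
        "position": {"x1": box[0], "y1": box[1], "x2": box[2], "y2": box[3]},
        "timestamp_ms": pts_ms,              # presentation time of the video frame
    }
    return json.dumps(payload, ensure_ascii=False)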
Step S1423, synchronizing the mapping relationship data with the live video stream and outputting the synchronized live video stream to the same client device, so that after the mapping relationship data is analyzed by the client device, when the live video stream is played and displayed by the client device, the escape tag information of the attribute tag of the live video stream is correspondingly displayed:
Then, through the information stream of the live broadcast room, the notification message encapsulating the mapping relationship data is broadcast, along with the live video stream, to the client devices of all audience users in the room. Each client device thus obtains the mapping relationship data and can automatically parse it to recover the attribute label, position information and timestamp corresponding to each face image. When that timestamp of the live video stream arrives, the client renders in its interface a display component at the position of the face image within the playing window, and shows in that component the escaped label information of the attribute label, that is, the textual or graphical information determined from the mapping between the value in the attribute label and its attribute type. The interface effect can likewise be seen with reference to fig. 7.
According to the embodiment, different from the previous embodiment, the display of the attribute tags is transferred to the client device to be realized, and after the mapping relation data containing the display tags are acquired and automatically analyzed through the information flow of the live network system by each client device, the corresponding escape tag information is automatically rendered and displayed on the interface, so that the operation load is dispersed, and the burden of a server of a live network platform is reduced.
In another embodiment, the step S1400, labeling and outputting the corresponding face image in the video frame with the attribute tag, includes the following steps:
step S1431, comparing the attribute tags corresponding to all the face images with the user registration information of the anchor user of the live video stream, and labeling, according to a comparison result, whether an attribute tag of a face image is consistent with the user registration information, where the attribute tag includes gender and age:
In this embodiment, the attribute tags identified in steps S1100 to S1300 of the present application may be used to detect proxy-broadcast behavior in a live broadcast room, that is, to detect whether the face image in the live video stream belongs to a registered user of the live room, so as to guide anchor users who have signed contracts with the network platform away from breach-of-contract behavior.
Accordingly, it is necessary to invoke user registration information of the anchor user to which the identified live video stream belongs, where the user registration information is personal account information corresponding to the anchor user, and the user registration information can be invoked directly by the network platform side, and attribute data of a type corresponding to the attribute tag, such as gender, age, and the like, are stored therein.
Furthermore, the attribute labels of the face images obtained from the video frames can be correspondingly compared with the attribute data in the user registration information of the anchor user in the live broadcast room, and when the attribute labels are consistent with each other or most of the attribute labels are consistent with each other, the live broadcast room can be confirmed to have no alternate broadcast behavior.
Step S1432, when the labels are inconsistent, determining that there is a multicast action for the anchor user of the live video stream, and triggering a background notification message to send to a preset network address:
Otherwise, when the two are inconsistent, or remain inconsistent across multiple identifications performed over a preset duration, it can be judged that the anchor user exhibits proxy-broadcast behavior, and accordingly a background notification message is triggered and sent to a preset network address. The network address may be an instant-messaging port preset by an administrator on the network platform side, a database storage address on the network platform side, an instant-messaging receiving port of the anchor user, and the like, which a person skilled in the art can realize according to the specific service logic.
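Purely for illustration, the following sketch posts such a background notification message to an assumed HTTP endpoint. The endpoint URL, payload fields and the choice of HTTP as transport are placeholders of the example and are not prescribed by the present application.

# Sketch: push a background notification once proxy-broadcast behavior is judged.
import json
import urllib.request

def notify_proxy_broadcast(room_id, anchor_id, endpoint="http://example.internal/notify"):
    body = json.dumps({"room_id": room_id, "anchor_id": anchor_id,
                       "event": "proxy_broadcast"}).encode("utf-8")
    req = urllib.request.Request(endpoint, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:   # assumed 5 s timeout
        return resp.status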
In the embodiment, the attribute tag corresponding to the face image in the live video stream is judged through the face image by utilizing an image recognition technology, and the proxy broadcasting behavior is investigated according to the consistency between the attribute tag and the user registration information of the anchor user of the live video stream, so that the safety recognition capability in the network live broadcasting process can be improved, and the ecological health and stable development of the network live broadcasting can be ensured.
Referring to fig. 9, in an exemplary embodiment of the present invention, a method for detecting a multicast event includes the following steps:
step S5100, obtain a video frame in a live video stream, the live video stream originating from a media server:
The live video stream refers to a video stream of a live-room service provided by a webcast platform; according to the live-room implementation logic it is output in real time from the media server of the live room and transmitted to live-room users for decoding and display. The live video stream is generally pushed from the anchor user side and, after corresponding audio and video processing by the media server, is sent to the other online audience users in the live broadcast room.
When the live video stream is obtained, the live video stream received from the anchor user side may be decoded and extracted by the media server, or the live video stream output after being encoded by the media server may be specially decoded to obtain the video frame therein. In the method, when attribute tag identification is carried out on the live video stream, the video frames required by the identification can be obtained from the live video stream.
The method and the device meet the specific requirement of the live video stream identification, can identify each video frame in the live video stream in the whole live broadcast process, and can also acquire the video frames from the live video stream at intervals of a certain time or frame number for identification, so that the technical personnel in the field can flexibly implement the method and the device.
Step S5200, recognizing the face image in the video frame by using the face recognition model trained to converge:
in the method, a face recognition model trained to a convergence state is adopted for extracting the face image of the video frame. The face recognition model may be any of various existing object recognition models known to those skilled in the art, or an object recognition model that has been found to perform better. These target recognition models are implemented based on convolutional neural networks, for example, EfficientDet, Yolo, etc., and may be assumed by the pre-trained target recognition model, or may be used in the present application by a person skilled in the art by self-initializing the state of the target recognition model and self-training it to converge.
Video frames obtained from the live video stream are input into the face recognition model. The face recognition model first performs representation learning on a video frame to obtain its deep semantic information, then outputs one or more bounding boxes according to that information, together with a classification result indicating whether each bounding box belongs to a face. The bounding boxes corresponding to face images can thus be determined from the classification results, and the corresponding face images are cropped from the original video frame according to the position information of those bounding boxes. Because one or more persons may appear in a video frame, one or more face images may be output. Where multiple face images exist, attribute tag recognition may in the subsequent steps be performed on each face image one by one, or only the face image with the best image conditions may be selected as the main image for attribute tag recognition alone, the image conditions including the bounding box with the largest area, a higher bounding box confidence, and the like. A person skilled in the art may handle this flexibly.
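The "main image" selection mentioned above might, for illustration, be sketched as follows; selecting by box area or by detector confidence are the two criteria named in the text, and the helper itself is an assumption of the example.

# Sketch: when several faces are detected, pick one main face for recognition.
def pick_main_face(boxes, scores, by="area"):
    """boxes: list of (x1, y1, x2, y2); scores: detector confidences."""
    if not boxes:
        return None
    if by == "area":
        key = lambda i: (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
    else:
        key = lambda i: scores[i]          # fall back to highest confidence
    return max(range(len(boxes)), key=key)  # index of the selected face image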
Step S5300, identifying a student model by adopting attribute tags, performing attribute tag prediction on the face image, and obtaining one or more attribute tags corresponding to the face image, wherein the attribute tags comprise a gender tag and/or an age tag; the attribute label recognition student model is trained to a convergence state in advance by taking a video frame in any live video stream provided by the media server as a training sample, in the training process, an attribute label recognition instructor model trained to the convergence state in advance predicts an attribute label for the same training sample, and implements semi-supervised training on the attribute label recognition student model by using the attribute label, wherein the attribute label recognition instructor model is implemented with supervised training in advance to reach the convergence state:
Attribute label recognition is performed on the face image by an attribute label recognition model, which here is an attribute label recognition student model obtained through knowledge distillation: an attribute label recognition instructor model with the same network architecture guides the training of the student model, so that the student model undergoes semi-supervised training under the supervision of the instructor model until it converges, and finally the student model learns the ability to obtain one or more suitable attribute labels from a face image.
Referring to the exemplary network architecture built to implement the knowledge distillation training idea of the present application as shown in fig. 2, in the knowledge distillation process the instructor model and the student model make predictions on the same video frame; the attribute label predicted by the instructor model serves as the supervision label for the attribute label predicted by the student model, the corresponding loss value is calculated for gradient updating, and the student model is thereby fitted to the instructor model. The instructor model is trained to convergence in advance with a small number of manually labeled samples, and the student model is then trained semi-supervised in the knowledge distillation process. Accordingly, there is no need to train the attribute label recognition model on massive labeled samples at once: the instructor model only needs to supervise the student model, and the student model can be trained to convergence without labeled samples, saving considerable labeling cost. A webcast platform generates massive video frames every day through live broadcasting, which can serve as unlabeled samples in the knowledge distillation process; therefore, in a webcast application scenario, training the attribute label recognition model by knowledge distillation to obtain the attribute label recognition student model required for production has high economic value.
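A minimal sketch of one distillation step follows. The converged instructor model produces soft labels for unlabeled live-stream faces and the student is updated against them; the use of a KL-divergence loss with a temperature, and the optimiser, are assumptions of this sketch, since the text only specifies that a loss is computed between the student's result labels and the instructor's soft labels.

# Sketch of one knowledge-distillation training step (teacher -> soft labels -> student).
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, face_batch, temperature=2.0):
    teacher.eval()
    with torch.no_grad():
        soft = F.softmax(teacher(face_batch) / temperature, dim=-1)   # soft labels
    student.train()
    log_p = F.log_softmax(student(face_batch) / temperature, dim=-1)  # result labels
    loss = F.kl_div(log_p, soft, reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()  # compared against a preset threshold to decide when to stop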
When training the attribute label recognition student model, the present application directly uses video frames from the live video streams provided by the media server of the webcast platform. This not only supplies massive unlabeled samples for the training process; because the live video streams contain rich person images in widely varied poses, the samples have broadly generalized characteristics, which helps improve the feature generalization ability of the trained student model and thereby its ability to recognize attribute labels.
After the face images obtained in the previous step are input into the attribute label recognition student models which are trained to be convergent one by one, the student models are mapped to a multi-classification space after representing and learning each face image, and corresponding one or more attribute labels are obtained. It is understood that, for a video frame containing a plurality of face images, each face image can obtain one or more attribute labels corresponding to the face image.
In accordance with the requirements of the present embodiment for detecting a multicast event, the attribute tag suitably includes an age tag and/or a gender tag so as to indicate specific information corresponding to the age and/or gender of the person in the face image.
Step S5400, matching attribute labels corresponding to all face images with user registration information of a main broadcasting user of the live video stream, and if at least one attribute label exists in all face images and is not matched with the user registration information, judging that the main broadcasting user has a broadcasting agency event:
the attribute labels identified before are used for detecting the alternate broadcasting behavior in the live broadcasting room, namely detecting whether the face image in the live video stream is the registered user of the live broadcasting room so as to guide the anchor user signed with the network platform to avoid the default behavior.
Accordingly, it is necessary to invoke user registration information of the anchor user to which the identified live video stream belongs, where the user registration information is personal account information corresponding to the anchor user, and the user registration information can be invoked directly by the network platform side, and attribute data of a type corresponding to the attribute tag, for example, information corresponding to types such as gender, age, and the like, is stored therein.
Furthermore, the attribute labels, namely the age labels and/or gender labels, obtained by the student model from the face images in the video frames can be compared item by item with the corresponding data, such as age and gender, in the user registration information of the anchor user of the live room. When they are completely consistent, the attribute labels of the face image match the user registration information, and it can be confirmed that no proxy-broadcast behavior exists in the live room; conversely, if the attribute labels of all face images cannot be completely matched with the corresponding items in the anchor user's registration information, it can be judged that proxy-broadcast behavior exists in the live room, and the corresponding proxy-broadcast event is triggered.
In the alternative embodiment, for the case that a plurality of face images exist, all attribute tags of each face image are matched with the user registration information, and the matching is only formed when the attribute tags are completely the same as the user registration information, otherwise, the face images are not matched as long as one attribute tag is different from the corresponding item in the user registration information. If all the face images are not matched, the broadcasting-agency event can be triggered; otherwise, if a face image is matched, the multicast event is not triggered.
In another alternative embodiment, for the case where there are a plurality of face images, all attribute tags of each face image are matched with the user registration information, and as long as one of the attribute tags is identical to the corresponding item in the user registration information, a match is formed, and when all of the attribute tags are required to be different from the corresponding item in the user registration information, the face image is not matched. If all the face images are not matched, the broadcasting-agency event can be triggered; otherwise, if a face image is matched, the multicast event is not triggered.
In one embodiment, considering that members of the same team may appear alongside the same anchor user, a white list may be provided, into which data such as the age and gender of additional members are added. When it is necessary to determine whether proxy-broadcast behavior exists, the attribute tags obtained by the model are compared and matched against the corresponding data of all members in the whole white list, and the proxy-broadcast event is triggered only when no match can be made. This enhances the extensibility of the detection process and accommodates live broadcasts in which multiple persons legitimately participate.
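For illustration, the matching of step S5400, extended with the white list of this embodiment, could be sketched as below. The strict "all attribute tags must agree" policy corresponds to the first alternative described above; the dictionary representation of registration data is an assumption of the example.

# Sketch: match face attribute tags against the anchor's registration data and a whitelist.
def face_matches(face_tags, registered, whitelist=()):
    """face_tags / registered: dicts such as {"gender": "female", "age": "20"}."""
    candidates = [registered, *whitelist]
    return any(all(person.get(k) == v for k, v in face_tags.items())
               for person in candidates)

def proxy_event_exists(all_face_tags, registered, whitelist=()):
    if not all_face_tags:
        return False   # no face detected in this frame: nothing to judge
    # Per the first alternative: trigger only if no detected face matches.
    return all(not face_matches(t, registered, whitelist) for t in all_face_tags)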
Step S5500, monitor the duration of the multicast event existing in each anchor user, and when the duration reaches a preset threshold, trigger a notification event:
For the same live video stream, after the first proxy-broadcast event is detected, a timer is started to count the duration of subsequent proxy-broadcast events, and a preset threshold is set against which the duration is compared. Within the time range of the preset threshold, if any single detection finds that the attribute labels output by the student model match the user registration information, this indicates that the anchor user has resumed hosting the live broadcast, and the timer can be stopped. If instead proxy-broadcast events continue to be detected until the duration exceeds the preset threshold, the proxy-broadcast behavior of the anchor user is deemed established and constitutes a breach, and a corresponding notification event is triggered. The notification event includes a notification sent to a message notification interface of the anchor user and/or to a background management interface of the webcast platform, delivered to the corresponding address through the corresponding interface.
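A minimal sketch of this monitoring logic is given below; the 300-second threshold and the per-stream, in-memory timer are assumptions of the example.

# Sketch of step S5500: time consecutive proxy-broadcast detections per anchor
# and fire a notification once the assumed duration threshold is exceeded.
import time

class ProxyEventMonitor:
    def __init__(self, threshold_s=300, notify=print):   # 300 s is an assumed threshold
        self.threshold_s = threshold_s
        self.notify = notify
        self.started_at = None                            # per-stream timer

    def report(self, anchor_id, proxy_detected, now=None):
        now = time.time() if now is None else now
        if not proxy_detected:
            self.started_at = None                        # anchor resumed hosting
            return
        if self.started_at is None:
            self.started_at = now                         # first proxy event: start timer
        elif now - self.started_at >= self.threshold_s:
            self.notify(f"proxy broadcast by {anchor_id} exceeded the preset threshold")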
As can be seen from the above disclosure of the proxy event method of the present application, the attribute tag identification method of the present application is established in the proxy event detection method of the present application, and therefore, the technical solutions implemented in the embodiments of the attribute tag identification method of the present application can be adopted by the proxy event method of the present application as long as there is no conflict, for example, the training principle and the process of the student model, and therefore, for the same technical solution, details are not repeated here.
In a further modified embodiment, the proxy-broadcast event detection method may additionally match the face image against the user avatar in the user registration information to obtain an avatar matching result; this result and the attribute-based detection are considered jointly, and when either check indicates a mismatch, the existence of a proxy-broadcast event is determined, so that missed detections are avoided and the accuracy of proxy-broadcast event detection is improved.
In this embodiment, the face image is recognized to obtain its corresponding attribute tags, the attribute tags are compared with the user registration information of the anchor user, proxy-broadcast events are detected in time through this comparison, and a corresponding notification message is sent when a proxy-broadcast event satisfies the violation condition.
Referring to fig. 10, an attribute tag identification apparatus adapted for one of the purposes of the present application includes: the system comprises an image acquisition module 1100, a face recognition module 1200, a label prediction module 1300, and an annotation output module 1400, wherein the image acquisition module 1100 is configured to acquire a video frame in a live video stream, and the live video stream is from a media server; the face recognition module 1200 is configured to recognize a face image in the video frame by using a face recognition model trained to converge; the label prediction module 1300 is configured to perform attribute label prediction on the face image by using an attribute label recognition student model, to obtain one or more attribute labels corresponding to the face image, and in a training process, predict an attribute label for the same training sample by using an attribute label recognition instructor model trained to a convergence state in advance, perform semi-supervised training on the attribute label recognition student model by using the attribute label, where the attribute label recognition instructor model is implemented with supervised training in advance to reach the convergence state; the attribute label recognition student model is trained to a convergence state in advance by taking a video frame in a live video stream provided by the media server as a training sample; the annotation output module 1400 is configured to perform annotation output on the corresponding face image in the video frame according to the attribute tag.
In an extended embodiment, a first training module is run to train the attribute label recognition student model, and the first training module comprises: the sample acquisition submodule is used for acquiring a single video frame from any live video stream output by the media server to serve as a training sample; the face extraction submodule is used for identifying a face image in the video frame by adopting a face recognition model trained to be convergent; the instructor processing submodule is used for adopting an attribute label recognition instructor model which is trained to be in a convergence state in advance to carry out attribute label prediction on the face image, and obtaining one or more attribute labels corresponding to the face image to form a soft label; the student prediction submodule is used for adopting the attribute tags to identify a student model, carrying out attribute tag prediction on the face image, obtaining one or more attribute tags corresponding to the face image and forming result tags; the updating iteration submodule is used for calculating the loss value of the result label by referring to the soft label, judging whether the loss value reaches a preset threshold value or not, and terminating the training process when the loss value reaches the preset threshold value; otherwise, gradient updating is carried out, and the training samples are collected again to carry out iterative training on the attribute label recognition student model.
In a further embodiment, the face recognition module constructs a face extraction sub-module during operation, and the face extraction sub-module includes: a convolution extraction unit, configured to extract image feature information of multiple scales of the video frame from multiple convolution layers, respectively; the characteristic fusion unit is used for carrying out characteristic fusion on the plurality of image characteristic information in the characteristic pyramid network to obtain corresponding fusion characteristic information; the classification mapping unit is used for extracting one or more bounding boxes according to the fusion characteristic information and mapping the bounding boxes to a classification space to obtain a classification result; and the image cutting unit is used for cutting out a corresponding face image from the video frame according to the boundary box represented as the effective classification result.
In an embodied embodiment, the annotation output module 1400 includes: the image positioning sub-module is used for acquiring a boundary frame of each predicted face image, which is identified by the face identification model, and determining the position information of the face image in the video frame according to the boundary frame; the information escape submodule is used for generating an information image for representing the attribute label of each predicted face image, and the information image has a transparent background; and the associated labeling submodule is used for superposing the information image to the corresponding image position of the video frame and the adjacent video frame according to the position information to realize position associated labeling with the corresponding face image so as to display the information image corresponding to each face image when the live video stream is played by the client equipment.
In another embodiment, the annotation output module 1400 includes: the image positioning sub-module is used for acquiring a boundary frame of each predicted face image, which is identified by the face identification model, and determining the position information of the face image in the video frame according to the boundary frame; the data packaging submodule is used for packaging the attribute label, the position information and the timestamp of the corresponding video frame in the live video stream into mapping relation data aiming at each predicted face image; and the data pushing submodule is used for outputting the mapping relation data to the same client equipment in a manner of synchronizing with the live video stream, so that after the mapping relation data is analyzed by the client equipment, the escape tag information of the attribute tag of the live video stream is correspondingly displayed when the live video stream is played and displayed by the client equipment.
In another embodiment, the annotation output module 1400 includes: the information comparison submodule is used for comparing the attribute labels corresponding to all the face images with the user registration information of the anchor user of the live video stream, and marking whether the attribute label of one face image is consistent with the user registration information or not according to the comparison result, wherein the attribute label comprises gender and age; and the substitute broadcasting judging submodule is used for judging that a main broadcasting user of the live video stream has a substitute broadcasting behavior when the marks are inconsistent, and triggering a background notification message to be sent to a preset network address.
Referring to fig. 11, a device for detecting a broadcast-substituting event according to the purpose of the present application includes: the system comprises an image acquisition module 5100, a face recognition module 5200, a label prediction module 5300, a proxy broadcast detection module 5400 and a monitoring notification module 5500, wherein the image acquisition module 5100 is used for acquiring video frames in a live video stream, and the live video stream is sourced from a media server; the face recognition module 5200 is configured to recognize a face image in the video frame by using a face recognition model trained to converge; the label prediction module 5300 is configured to identify a student model by using an attribute label, perform attribute label prediction on the face image, and obtain an attribute label corresponding to the face image, where the attribute label includes a gender label and/or an age label; the attribute label recognition student model is trained to a convergence state in advance by taking a video frame in any live video stream provided by the media server as a training sample, in the training process, an attribute label recognition instructor model trained to the convergence state in advance predicts an attribute label for the same training sample, and performs semi-supervised training on the attribute label recognition student model by using the attribute label, wherein the attribute label recognition instructor model is subjected to supervised training in advance to reach the convergence state; the agent broadcast detection module 5400 is configured to match attribute tags corresponding to all facial images with user registration information of an anchor user of the live broadcast video stream, and determine that an agent broadcast event exists for the anchor user if at least one attribute tag in all facial images does not match with the user registration information; the monitoring notification module 5500 is configured to monitor a duration of a multicast event existing for each anchor user, and trigger a notification event when the duration reaches a preset threshold.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. As shown in fig. 12, the internal structure of the computer device is schematically illustrated. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions can enable the processor to realize the attribute label identification method when being executed by the processor. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform the attribute tag identification method of the present application. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 10 and 11, and the memory stores program codes and various data required for executing the modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data necessary for executing all modules/sub-modules in the attribute tag identification device of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.
The present application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the attribute tag identification method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
To sum up, the present application makes full use of the massive live video streams available on a webcast platform as training samples for the attribute label recognition model, improving the recognition capability of the model; that capability then serves various downstream tasks, such as detecting proxy-broadcast behavior during webcasting, and provides technical support for the healthy development of live webcasting.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application, and it should be noted that, for those skilled in the art, several modifications and decorations can be made without departing from the principle of the present application, and these modifications and decorations should also be regarded as the protection scope of the present application.

Claims (10)

1. An attribute tag identification method is characterized by comprising the following steps:
acquiring a video frame in a live video stream, wherein the live video stream is sourced from a media server;
recognizing a face image in the video frame by adopting a face recognition model trained to be convergent;
adopting attribute labels to identify a student model, and performing attribute label prediction on the face image to obtain one or more attribute labels corresponding to the face image; the attribute label recognition student model is trained to a convergence state in advance by taking a video frame in any live video stream provided by the media server as a training sample, in the training process, an attribute label recognition instructor model trained to the convergence state in advance predicts an attribute label for the same training sample, and performs semi-supervised training on the attribute label recognition student model by using the attribute label, wherein the attribute label recognition instructor model is subjected to supervised training in advance to reach the convergence state;
and labeling and outputting the corresponding face image in the video frame by using the attribute label.
2. The attribute tag identification method according to claim 1, wherein the training process of the attribute tag identification student model comprises the following steps:
collecting a single video frame from any live video stream output by a media server as a training sample;
recognizing a face image in the video frame by adopting a face recognition model trained to be convergent;
adopting an attribute label recognition instructor model which is trained to be in a convergence state in advance, performing attribute label prediction on the face image, and obtaining one or more attribute labels corresponding to the face image to form a soft label;
adopting the attribute labels to identify a student model, performing attribute label prediction on the face image, obtaining one or more attribute labels corresponding to the face image, and forming a result label;
calculating the loss value of the result label by referring to the soft label, judging whether the loss value reaches a preset threshold value, and terminating the training process when the loss value reaches the preset threshold value; otherwise, gradient updating is carried out, and the training samples are collected again to carry out iterative training on the attribute label recognition student model.
3. The attribute tag identification method according to claim 1, wherein the face recognition model performs the steps of:
extracting image characteristic information of a plurality of scales of the video frame from a plurality of convolutional layers respectively;
performing feature fusion on the image feature information in a feature pyramid network to obtain corresponding fusion feature information;
extracting one or more bounding boxes according to the fusion characteristic information and mapping the bounding boxes to a classification space to obtain a classification result of the bounding boxes;
and cutting out a corresponding face image from the video frame according to the boundary box which is characterized as effective by the classification result.
4. The method for identifying the attribute tag according to any one of claims 1 to 3, wherein the labeling output of the corresponding face image in the video frame by the attribute tag comprises the following steps:
for each predicted face image, acquiring a boundary frame identified by the face identification model, and determining the position information of the face image in the video frame according to the boundary frame;
generating an information image for representing the attribute label of each predicted face image, wherein the information image has a transparent background;
and according to the position information, the information image is superposed to the corresponding image position of the video frame and the adjacent video frame to realize position association labeling with the corresponding face image, so that the information image corresponding to each face image is displayed when the live video stream is played by client equipment.
5. The method for identifying the attribute tag according to any one of claims 1 to 3, wherein the labeling output of the corresponding face image in the video frame by the attribute tag comprises the following steps:
for each predicted face image, acquiring a boundary frame identified by the face identification model, and determining the position information of the face image in the video frame according to the boundary frame;
for each predicted face image, packaging the attribute label, the position information and the timestamp of the corresponding video frame in the live video stream into mapping relation data;
and outputting the mapping relation data to the same client equipment in a manner of synchronizing the live video stream, so that after the mapping relation data is analyzed by the client equipment, the escape label information of the attribute label of the live video stream is correspondingly displayed when the live video stream is played and displayed by the client equipment.
6. The method for identifying the attribute tag according to any one of claims 1 to 3, wherein the labeling output of the corresponding face image in the video frame by the attribute tag comprises the following steps:
comparing attribute labels corresponding to all the facial images with user registration information of a main broadcasting user of the live broadcast video stream, and marking whether the attribute label of one facial image is consistent with the user registration information or not according to a comparison result, wherein the attribute label comprises gender and age;
and when the marks are inconsistent, judging that the anchor user of the live video stream has a broadcasting-substituting behavior, and triggering a background notification message to be sent to a preset network address.
7. A method for detecting a proxy event, comprising the steps of:
acquiring a video frame in a live video stream, wherein the live video stream is sourced from a media server;
recognizing a face image in the video frame by adopting a face recognition model trained to be convergent;
adopting attribute labels to identify a student model, and performing attribute label prediction on the face image to obtain attribute labels corresponding to the face image, wherein the attribute labels comprise a gender label and/or an age label; the attribute label recognition student model is trained to a convergence state in advance by taking a video frame in any live video stream provided by the media server as a training sample, in the training process, an attribute label recognition instructor model trained to the convergence state in advance predicts an attribute label for the same training sample, and performs semi-supervised training on the attribute label recognition student model by using the attribute label, wherein the attribute label recognition instructor model is subjected to supervised training in advance to reach the convergence state;
matching attribute labels corresponding to all the face images with user registration information of a main broadcasting user of the live video stream, and judging that the main broadcasting user has a broadcasting agency event if at least one attribute label in all the face images is not matched with the user registration information;
and monitoring the duration of the multicast event existing in each anchor user, and triggering a notification event when the duration reaches a preset threshold.
8. An attribute tag identification apparatus, comprising:
the system comprises an image acquisition module, a media server and a video processing module, wherein the image acquisition module is used for acquiring video frames in a live video stream, and the live video stream is sourced from the media server;
the face recognition module is used for recognizing a face image in the video frame by adopting a face recognition model trained to be convergent;
the label prediction module is used for adopting attribute labels to identify a student model, carrying out attribute label prediction on the face image and obtaining one or more attribute labels corresponding to the face image; the attribute label recognition student model is trained to a convergence state in advance by taking a video frame in a live video stream provided by the media server as a training sample, in the training process, an attribute label recognition instructor model trained to the convergence state in advance predicts an attribute label for the same training sample, and performs semi-supervised training on the attribute label recognition student model by using the attribute label, wherein the attribute label recognition instructor model is subjected to supervised training in advance to reach the convergence state;
and the annotation output module is used for performing annotation output on the corresponding face image in the video frame by using the attribute label.
9. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.
CN202111591536.1A 2021-12-23 2021-12-23 Attribute tag identification and substitution event detection methods, device, equipment and medium thereof Active CN114302157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111591536.1A CN114302157B (en) 2021-12-23 2021-12-23 Attribute tag identification and substitution event detection methods, device, equipment and medium thereof

Publications (2)

Publication Number Publication Date
CN114302157A true CN114302157A (en) 2022-04-08
CN114302157B CN114302157B (en) 2023-11-17

Family

ID=80968984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111591536.1A Active CN114302157B (en) 2021-12-23 2021-12-23 Attribute tag identification and substitution event detection methods, device, equipment and medium thereof

Country Status (1)

Country Link
CN (1) CN114302157B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821272A (en) * 2022-06-28 2022-07-29 上海蜜度信息技术有限公司 Image recognition method, image recognition system, image recognition medium, electronic device, and target detection model
CN116016983A (en) * 2022-12-29 2023-04-25 阿波罗智联(北京)科技有限公司 Method, device, equipment and storage medium for identifying competition picture

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019136946A1 (en) * 2018-01-15 2019-07-18 中山大学 Deep learning-based weakly supervised salient object detection method and system
CN111178146A (en) * 2019-12-06 2020-05-19 北京工业大学 Method and device for identifying anchor based on face features
CN111291627A (en) * 2020-01-16 2020-06-16 广州酷狗计算机科技有限公司 Face recognition method and device and computer equipment
CN111382623A (en) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 Live broadcast auditing method, device, server and storage medium
CN111464819A (en) * 2020-03-30 2020-07-28 腾讯音乐娱乐科技(深圳)有限公司 Live image detection method, device, equipment and storage medium
CN111507285A (en) * 2020-04-22 2020-08-07 腾讯科技(深圳)有限公司 Face attribute recognition method and device, computer equipment and storage medium
CN111586427A (en) * 2020-04-30 2020-08-25 广州华多网络科技有限公司 Anchor identification method and device for live broadcast platform, electronic equipment and storage medium
US20200334538A1 (en) * 2019-04-16 2020-10-22 Microsoft Technology Licensing, Llc Conditional teacher-student learning for model training
CN112055230A (en) * 2020-09-03 2020-12-08 北京中润互联信息技术有限公司 Live broadcast monitoring method and device, computer equipment and readable storage medium
US20210216825A1 (en) * 2020-01-09 2021-07-15 International Business Machines Corporation Uncertainty guided semi-supervised neural network training for image classification

Also Published As

Publication number Publication date
CN114302157B (en) 2023-11-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant