CN114220175B - Motion pattern recognition method and device, equipment, medium and product thereof - Google Patents

Motion pattern recognition method and device, equipment, medium and product thereof

Info

Publication number
CN114220175B
CN114220175B (granted publication of application CN202111555402.4A)
Authority
CN
China
Prior art keywords
video frame
image
information
frame
current video
Prior art date
Legal status
Active
Application number
CN202111555402.4A
Other languages
Chinese (zh)
Other versions
CN114220175A (en)
Inventor
苏正航
陈增海
贺亮亮
Current Assignee
Guangzhou Jinhong Network Media Co ltd
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Jinhong Network Media Co ltd
Guangzhou Cubesili Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Jinhong Network Media Co ltd and Guangzhou Cubesili Information Technology Co Ltd
Priority claimed from CN202111555402.4A
Publication of CN114220175A
Application granted
Publication of CN114220175B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/213 — Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/24 — Classification techniques
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/02 — Neural networks
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a motion pattern recognition method and a corresponding apparatus, device, medium and product. The method comprises the following steps: acquiring a frame difference information image corresponding to a current video frame in a live video stream, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to a preceding video frame that is not adjacent to it; performing representation learning on the frame difference information image by adopting an image feature extraction model trained in advance to a convergence state, to obtain image feature information; performing context combing on the image feature information by adopting a semantic memory model trained in advance to a convergence state, with reference to the image feature information corresponding to the preceding video frame, to obtain comprehensive feature information; and mapping the comprehensive feature information to a classification space with a preset classifier, and determining the motion pattern of the person image in the current video frame according to the classification result. The method and apparatus can accurately identify the motion pattern corresponding to the motion behavior of the person image in a live video stream.

Description

Motion pattern recognition method and device, equipment, medium and product thereof
Technical Field
The present disclosure relates to the field of network live broadcasting technology, and in particular, to a motion pattern recognition method, a corresponding apparatus, a computer device, a computer readable storage medium, and a computer program product.
Background
Behavior recognition is an extremely important and very active research direction in computer vision, one that has been studied for decades. Because people use actions to handle things and to express emotion, behavior recognition has very wide fields of application that are still not fully solved, such as intelligent monitoring systems, human-computer interaction, virtual reality and robotics. In previous approaches, RGB image sequences, depth image sequences, video, or specific fusions of these modalities (e.g., RGB + optical flow) have all been used with good results.
In the field of network live broadcasting, related technologies have been tried for recognizing various user action behaviors, but for action behaviors with high real-time requirements the currently existing schemes help little. For example, recognition of dancing, martial arts and other actions performed during live broadcasting demands extremely high real-time performance (on the order of seconds), whereas conventional technical solutions are too complex and often take several seconds to obtain a recognition result, so they are difficult to put into practice.
For this reason, the existing behavior recognition models for live webcast scenes rely on sequences of more than 8 RGB frames, depth image sequences, or specific fusions of these modalities (such as RGB + optical flow). Owing to their high complexity, such methods often cannot meet the real-time requirement when deployed: multiple RGB frames must be accumulated over multiple moments, and depth images or optical flow images cannot be acquired in real time, so these schemes cannot be applied in the live broadcast field where real-time performance is paramount.
In view of this, the applicant has attempted the related exploration as a pioneer in the art.
Disclosure of Invention
It is a primary object of the present application to solve at least one of the above problems and to provide a motion pattern recognition method and corresponding apparatus, computer device, computer readable storage medium, computer program product.
In order to meet the purposes of the application, the application adopts the following technical scheme:
a motion pattern recognition method according to one of the objects of the present application, comprising the steps of:
acquiring a frame difference information image corresponding to a current video frame in a live video stream, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to a discontinuous previous video frame;
Performing representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information;
performing context combing on the image characteristic information by adopting a semantic memory model trained in advance to a convergence state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information;
and mapping the comprehensive characteristic information to a classification space by adopting a preset classifier, and judging the motion mode of the human image in the current video frame according to the classification result.
In a specific embodiment, acquiring a frame difference information image corresponding to a current video frame in a live video stream includes the following steps:
acquiring two discontinuous video frames from a live video stream processed by a media server, wherein the two discontinuous video frames comprise a prior video frame and a current video frame;
generating a frame difference information image corresponding to the current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame.
In an extended embodiment, before the step of obtaining the frame difference information image corresponding to the current video frame in the live video stream, the method includes the following training process:
acquiring, as a training sample, two sample video frames obtained by sampling video of the same motion pattern, wherein the two sample video frames comprise a current video frame and a preceding video frame that is earlier in time, and the motion pattern is a dance performance;
generating a frame difference information image corresponding to a current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame;
performing representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information;
performing context combing on the image characteristic information by adopting a semantic memory model in a training state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information;
mapping the comprehensive characteristic information to a classification space by using a classifier in a training state to obtain a corresponding classification label;
and calculating the loss value of the classification label based on the supervision label corresponding to the training sample, terminating the training task when the loss value reaches a preset threshold, and otherwise, calling the next training sample to implement iterative training.
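Purely by way of illustration, the training process described above can be sketched in PyTorch as follows; the module names, the binary cross-entropy loss, the assumption that the memory model is an nn.LSTM with batch_first=True, and the concrete threshold value are choices made for this sketch only and are not prescribed by the application.

    # Minimal sketch of the training process above (names, loss and threshold are assumed).
    import torch
    import torch.nn as nn

    def train_semantic_memory(feature_extractor, memory_model, classifier,
                              sample_loader, loss_threshold=0.05, lr=1e-3):
        feature_extractor.eval()                     # already trained to convergence, frozen
        criterion = nn.BCEWithLogitsLoss()
        optimizer = torch.optim.Adam(
            list(memory_model.parameters()) + list(classifier.parameters()), lr=lr)

        hidden = None                                # carries the context of the preceding frame
        for frame_diff_image, label in sample_loader:
            with torch.no_grad():
                feat = feature_extractor(frame_diff_image)           # image feature information
            fused, hidden = memory_model(feat.unsqueeze(1), hidden)  # context combing
            logit = classifier(fused[:, -1])                         # map to classification space
            loss = criterion(logit.squeeze(-1), label.float())       # loss against supervision label

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            hidden = tuple(h.detach() for h in hidden)               # keep memory, cut the graph

            if loss.item() <= loss_threshold:        # loss value reaches the preset threshold
                break                                # terminate the training task
        return memory_model, classifier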
In a deepened embodiment, generating a frame difference information image corresponding to a current video frame includes the following steps:
calculating pixel level difference values of a previous video frame and a current video frame to obtain first frame difference information corresponding to the current video frame;
smoothing the first frame difference information to obtain second frame difference information so as to highlight edge information therein;
performing point multiplication operation on the current video frame and the second frame difference information to obtain a motion mode saliency map integrating motion information of the current video frame relative to the previous video frame;
and merging the motion mode saliency map and the gray scale map of the previous video frame to form a frame difference information image.
In a specific embodiment, a preset classifier is adopted to map the comprehensive feature information to a classification space, and a motion mode of a human image in a current video frame is determined according to a classification result, and the method comprises the following steps:
mapping the comprehensive characteristic information to a classification space by adopting a preset classifier to obtain a binarization classification result;
according to the classification result, when the classification result represents a true value, determining that the person image in the current video frame is in a specific motion mode;
and when the live video stream is in the specific motion mode, adding a highlight label to a live broadcasting room for providing the live video stream, and improving the sorting priority of the live broadcasting room in a display list corresponding to the specific motion mode.
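Purely by way of illustration, the downstream handling of the binary result described in this embodiment might look like the following sketch; the service object and its methods are hypothetical and not part of this application.

    # Hypothetical downstream handling of the binary classification result (assumed API).
    def handle_classification(room_id, is_specific_motion, room_service, motion_tag="dance"):
        if not is_specific_motion:                       # the classifier did not output a true value
            return
        # add a highlight label to the live broadcasting room providing the stream
        room_service.add_label(room_id, label=f"highlight:{motion_tag}")
        # raise the room's sorting priority in the display list for this motion mode
        room_service.boost_priority(room_id, list_name=motion_tag, delta=1)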
In an extended embodiment, before the step of obtaining the frame difference information image corresponding to the current video frame in the live video stream, the method includes the following training process:
randomly initializing two image feature extraction models to be trained, wherein one image feature extraction model is used as a training target, and the other image feature extraction model is used as a supervision target;
the method comprises the steps of obtaining a sample picture, dividing the sample picture into two paths, and respectively carrying out random data enhancement processing to obtain two data enhancement views, wherein the sample picture is a frame difference information image;
respectively inputting the two data enhancement views into the representation layers of the two image feature extraction models to perform representation learning, and obtaining two corresponding intermediate feature information;
extracting semantic information from the two corresponding intermediate feature information through a multi-layer perceptron of the two image feature extraction models respectively to obtain two corresponding image feature information;
and calculating a loss value of the image characteristic information of the training target according to the image characteristic information of the supervision target, carrying out gradient update on the training target according to the loss value, and carrying out iterative training until the training target reaches a convergence state.
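The two-branch self-supervised training described above can be read as a BYOL/SimSiam-style scheme; the following sketch is one possible concretization, in which the negative cosine-similarity loss, the augmentation set, the 4-channel input adaptation and the layer sizes are assumptions rather than requirements of this application.

    # Illustrative sketch of the two-branch self-supervised training (assumed loss and sizes).
    import torch
    import torch.nn.functional as F
    import torchvision.transforms as T
    from torchvision.models import resnet18

    augment = T.Compose([T.RandomResizedCrop(224), T.RandomHorizontalFlip()])

    def make_model():
        backbone = resnet18(num_classes=256)                          # representation layer
        backbone.conv1 = torch.nn.Conv2d(4, 64, 7, 2, 3, bias=False)  # 4-channel frame diff input
        mlp = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(),
                                  torch.nn.Linear(256, 128))          # multi-layer perceptron
        return torch.nn.Sequential(backbone, mlp)

    train_target = make_model()                                       # randomly initialized, updated
    supervise_target = make_model()                                   # randomly initialized, frozen
    for p in supervise_target.parameters():
        p.requires_grad = False                                       # supervision target is not updated

    optimizer = torch.optim.Adam(train_target.parameters(), lr=1e-3)

    def training_step(sample_picture):                                # a frame difference information image
        view_a, view_b = augment(sample_picture), augment(sample_picture)  # two data enhancement views
        feat_a = train_target(view_a.unsqueeze(0))                    # image feature info (training target)
        with torch.no_grad():
            feat_b = supervise_target(view_b.unsqueeze(0))            # image feature info (supervision target)
        loss = -F.cosine_similarity(feat_a, feat_b).mean()            # assumed similarity loss
        optimizer.zero_grad(); loss.backward(); optimizer.step()      # gradient update of the training target
        return loss.item()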
A motion pattern recognition apparatus provided in accordance with one of the objects of the present application, comprising: the system comprises a frame difference acquisition module, a representation learning module, a memory carding module and a classification judging module, wherein the frame difference acquisition module is used for acquiring a frame difference information image corresponding to a current video frame in a live video stream, and the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to a previous video frame which is discontinuous with the current video frame; the representation learning module is used for carrying out representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information; the memory combing module is used for carrying out context combing on the image characteristic information by adopting a semantic memory model which is trained in advance to a convergence state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information; the classification judging module is used for mapping the comprehensive characteristic information to a classification space by adopting a preset classifier, and judging the motion mode of the human image in the current video frame according to the classification result.
In a specific embodiment, the frame difference acquisition module includes: the image sampling sub-module is used for acquiring two discontinuous video frames from the live video stream processed by the media server, wherein the two discontinuous video frames comprise a prior video frame and a current video frame; the frame difference generation sub-module is used for generating a frame difference information image corresponding to the current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame.
In an extended embodiment, the motion pattern recognition device of the present application further includes: the sample calling training-time module is used for acquiring, as a training sample, two sample video frames obtained by sampling video of the same motion pattern, wherein the two sample video frames comprise a current video frame and a preceding video frame that is earlier in time, and the motion pattern is a dance performance; the frame difference generation training-time module is used for generating a frame difference information image corresponding to a current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame; the representation learning training-time module is used for performing representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information; the memory combing training-time module is used for carrying out context combing on the image characteristic information by adopting a semantic memory model in a training state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information; the classification judgment training-time module is used for mapping the comprehensive characteristic information to a classification space by adopting a classifier in a training state to obtain a corresponding classification label; and the gradient updating training-time module is used for calculating the loss value of the classification label based on the supervision label corresponding to the training sample, terminating the training task when the loss value reaches a preset threshold, and otherwise calling the next training sample to implement iterative training.
In a further embodiment, the frame difference generation sub-module and the frame difference generation training-time module include the following sub-modules: the difference value calculation sub-module is used for calculating pixel level difference values of the previous video frame and the current video frame to obtain first frame difference information corresponding to the current video frame; the smoothing filter sub-module is used for carrying out smoothing filter processing on the first frame difference information to obtain second frame difference information so as to highlight edge information therein; the information synthesis sub-module is used for carrying out dot multiplication operation on the current video frame and the second frame difference information to obtain a motion mode saliency map integrating motion information of the current video frame relative to the previous video frame; and the channel merging sub-module is used for merging the motion mode saliency map and the gray level map of the previous video frame to form a frame difference information image.
In a specific embodiment, the classification determination module includes: the classification mapping sub-module is used for mapping the comprehensive characteristic information to a classification space by adopting a preset classifier to obtain a binarization classification result; the mode judging sub-module is used for judging that the person image in the current video frame is in a specific motion mode according to the classification result and when the classification result represents a true value result; and the high-light marking sub-module is used for adding a high-light label to a live broadcasting room for providing the live video stream when the live broadcasting room is in a specific motion mode, and improving the sorting priority of the live broadcasting room in a display list corresponding to the specific motion mode.
In an extended embodiment, the motion pattern recognition device of the present application further includes: the model initialization module is used for randomly initializing two image feature extraction models to be trained, wherein one image feature extraction model is used as a training target, and the other image feature extraction model is used as a supervision target; the data enhancement module is used for obtaining a sample picture, dividing the sample picture into two paths and respectively carrying out random data enhancement processing to obtain two data enhancement views, wherein the sample picture is a frame difference information image; the feature extraction module is used for respectively inputting the two data enhancement views into the representation layers of the two image feature extraction models to perform representation learning so as to obtain two corresponding intermediate feature information; the perception extraction module is used for extracting semantic information from the two corresponding intermediate feature information through the multi-layer perception machine of the two image feature extraction models respectively to obtain two corresponding image feature information; and the gradient updating module is used for calculating the loss value of the image characteristic information of the training target according to the image characteristic information of the supervision target, carrying out gradient updating on the training target according to the loss value, and carrying out iterative training until the training target reaches a convergence state.
A computer device provided in accordance with one of the objects of the present application comprises a central processor and a memory, the central processor being adapted to invoke the steps of executing a computer program stored in the memory to perform the movement pattern recognition method described herein.
A computer readable storage medium adapted to another object of the present application is provided, which stores in the form of computer readable instructions a computer program implemented according to the motion pattern recognition method, which computer program, when being called by a computer for execution, performs the steps comprised by the method.
A computer program product is provided adapted for another object of the present application, comprising a computer program/instruction which, when executed by a processor, carries out the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the method has the following advantages:
Firstly, a frame difference information image is obtained from a current video frame and an earlier, preceding video frame in the live video stream; the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the preceding video frame, which is not adjacent to it. On this basis, representation learning is performed on the frame difference information image to obtain deep semantic information, which is then processed by a semantic memory model having context combing capability and acting as a sequence organizer. A representation learning effect on the motion pattern is thus achieved with only two time-ordered video frames, so that a determination on the motion pattern of the person image in the live video can be made from the corresponding deep semantic information.
Secondly, for relatively rapid user actions such as dancing and martial arts in a live broadcasting room, which are often related to the promotion of that room, the present application rapidly discriminates the motion pattern of the person in the room's live video stream and thereby intelligently identifies the live activity taking place, so that downstream tasks can be guided promptly; for example, advertisement information of a live broadcasting room in which dancing is performed can be pushed to related users according to the identified dancing behavior. It is therefore easy to understand that implementing the present application has a remarkable boosting effect on improving the user experience of the webcast service and on increasing user traffic.
In addition, massive live video streams are generated in the webcast scene and contain various motion patterns. Video frames sampled from these streams can provide the data samples needed to train the various neural network models applied by the webcast platform, improving the feature generalization capability of the models, and the trained models can in turn serve the same platform to identify the motion patterns of person images in the live streams of a massive number of live broadcasting rooms. A closed loop with continuous bidirectional promotion is thus formed, enabling the webcast platform to obtain economies of scale.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of an exemplary embodiment of a motion pattern recognition method of the present application;
FIG. 2 is a schematic diagram of a neural network model network architecture for implementing the technical solution of the present application;
FIG. 3 is a schematic flow chart of a process of generating a frame difference information image according to an embodiment of the present application;
FIG. 4 is an example of an effect diagram of various intermediate stages in the generation of a frame difference information image of the present application;
FIG. 5 is a flow chart illustrating a process of training the semantic memory model of the present application based on the network architecture shown in FIG. 2;
FIG. 6 is a flowchart illustrating a process of performing a specific motion mode determination to perform a live room ordering task according to an embodiment of the present application;
FIGS. 7 and 8 are graphical user interfaces of an example of the present application, each displaying a list of living rooms of the same network living platform, wherein FIG. 8 produces a change in ordering relative to FIG. 7;
FIG. 9 is a flow chart of a process of training an image feature extraction model according to the present application;
FIG. 10 is a schematic diagram of a network architecture of the present application for training an image feature extraction model of the present application;
FIG. 11 is a schematic block diagram of a motion pattern recognition device of the present application;
fig. 12 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combination of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, "client," "terminal device," and "terminal device" are understood by those skilled in the art to include both devices that include only wireless signal receivers without transmitting capabilities and devices that include receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device such as a personal computer, tablet, or the like, having a single-line display or a multi-line display or a cellular or other communication device without a multi-line display; a PCS (Personal Communications Service, personal communication system) that may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant ) that can include a radio frequency receiver, pager, internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System ) receiver; a conventional laptop and/or palmtop computer or other appliance that has and/or includes a radio frequency receiver. As used herein, "client," "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or adapted and/or configured to operate locally and/or in a distributed fashion, at any other location(s) on earth and/or in space. As used herein, a "client," "terminal device," or "terminal device" may also be a communication terminal, an internet terminal, or a music/video playing terminal, for example, a PDA, a MID (Mobile Internet Device ), and/or a mobile phone with music/video playing function, or may also be a device such as a smart tv, a set top box, or the like.
The hardware referred to by the names "server", "client", "service node" and the like in the present application is essentially an electronic device having the performance of a personal computer, and is a hardware device having necessary components disclosed by von neumann's principle, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, and an output device, and a computer program is stored in the memory, and the central processing unit calls the program stored in the external memory to run in the memory, executes instructions in the program, and interacts with the input/output device, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application is equally applicable to the case of a server farm. The servers should be logically partitioned, physically separate from each other but interface-callable, or integrated into a physical computer or group of computers, according to network deployment principles understood by those skilled in the art. Those skilled in the art will appreciate this variation and should not be construed as limiting the implementation of the network deployment approach of the present application.
One or several technical features of the present application, unless specified in the plain text, may be deployed either on a server to implement access by remotely invoking an online service interface provided by the acquisition server by a client, or directly deployed and run on the client to implement access.
The neural network model cited or possibly cited in the application can be deployed on a remote server and used for implementing remote call on a client, or can be deployed on a client with sufficient equipment capability for direct call unless specified in a clear text, and in some embodiments, when the neural network model runs on the client, the corresponding intelligence can be obtained through migration learning so as to reduce the requirement on the running resources of the hardware of the client and avoid excessively occupying the running resources of the hardware of the client.
The various data referred to in the present application, unless specified in the plain text, may be stored either remotely in a server or in a local terminal device, as long as it is suitable for being invoked by the technical solution of the present application.
Those skilled in the art will appreciate that: although the various methods of the present application are described based on the same concepts so as to be common to each other, the methods may be performed independently, unless otherwise indicated. Similarly, for each of the embodiments disclosed herein, the concepts presented are based on the same inventive concept, and thus, the concepts presented for the same description, and concepts that are merely convenient and appropriately altered although they are different, should be equally understood.
The various embodiments to be disclosed herein, unless the plain text indicates a mutually exclusive relationship with each other, the technical features related to the various embodiments may be cross-combined to flexibly construct a new embodiment, so long as such combination does not depart from the inventive spirit of the present application and can satisfy the needs in the art or solve the deficiencies in the prior art. This variant will be known to the person skilled in the art.
The motion pattern recognition method can be programmed into a computer program product, deployed in a client or a server and operated, and can be executed by accessing an interface opened after the computer program product is operated and performing man-machine interaction with the process of the computer program product through a graphical user interface.
Referring to fig. 1, in an exemplary embodiment, the motion pattern recognition method of the present application includes the following steps:
step S1100, obtaining a frame difference information image corresponding to a current video frame in a live video stream, where the frame difference information image includes state information of the current video frame and motion information of the current video frame relative to a previous video frame discontinuous with the current video frame:
The live video stream is the stream of a live broadcasting room service opened on a webcast platform; it is output in real time from the platform's media server according to the logic of the live broadcasting room, and is transmitted to the viewer-side client devices of that room, which parse and display it. The live video stream is generally pushed from the anchor user side, decoded and given the corresponding audio and video processing by the media server, and then sent to the other online audience users in the live broadcasting room. Therefore, in an alternative implementation, the technical solution of the present application may also be implemented in the computer device at the anchor user side, with the resulting data reported to the live broadcasting room service.
The live video stream may be obtained after the media server decodes the stream received from the anchor user side, or the stream encoded and output by the media server may be specially decoded to obtain its video frames, which are then re-encoded and output. In the present application, once discrimination of the motion pattern of a live video stream is started, the video frames required for identification can be acquired from that live video stream.
To suit the situation in which a live video stream is being identified, any video frame in the whole live broadcast process can be used as the current video frame to be identified. When identification is carried out on the current video frame, an earlier, preceding video frame is referred to, so that in acquiring the frame difference information image corresponding to the current video frame, the motion information of the image is obtained with reference to that preceding video frame.
The current video frame and the prior video frame can be controlled within a certain preset duration or a preset frame number range in consideration of the requirement that the motion mode is only represented by images at a plurality of moments, so that the prior video frame and the current video frame are discontinuous in frame sequence. For example, in connection with the actual measurement of the present application, a value may be arbitrarily selected from 0.1 seconds to 1.2 seconds, especially from 0.2 seconds to 0.8 seconds, as a preset duration between the previous video frame and the current video frame, for example, 0.4 seconds, where the previous video frame of the first timestamp may be acquired from the live video stream according to the preset duration, and the current video frame corresponding to the delay of 0.4 seconds from the first timestamp may be acquired. Of course, the preset duration can be prolonged appropriately according to different duration requirements of different motion modes, for example, the preset duration can be arbitrarily valued even between 1 second and 2 seconds for the slow motion mode of the Taiji boxing performance.
Of course, referring to the preset duration and the frame rate of the live video stream, the preset duration may also be converted into a preset frame number to be set, for example, for a live video stream with a frame rate of 24 frames/s, taking 1 second as an example of the equivalent preset duration, taking the 1 st frame as the previous video frame, and then taking the current video frame as the 25 th frame. It can be seen that it is equivalent whether the time distance between the current video frame and its preceding video frame is controlled by a preset duration or a preset number of frames.
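By way of a hedged illustration only, the conversion from preset duration to preset frame number and the sampling of the two frames might be implemented roughly as follows with OpenCV; the 0.4-second interval and the 24 frames/s fallback are simply the example values mentioned above.

    # Sampling a preceding frame and a current frame separated by a preset duration.
    import cv2

    def sample_frame_pair(stream_url, preset_duration_s=0.4):
        cap = cv2.VideoCapture(stream_url)
        fps = cap.get(cv2.CAP_PROP_FPS) or 24.0                   # fall back to 24 frames/s
        frame_interval = max(1, round(preset_duration_s * fps))   # equivalent preset frame number

        ok, preceding = cap.read()                                # the earlier (preceding) video frame
        for _ in range(frame_interval - 1):                       # skip the in-between frames
            cap.grab()
        ok2, current = cap.read()                                 # the current video frame
        cap.release()
        if not (ok and ok2):
            raise RuntimeError("could not sample two frames from the stream")
        return preceding, current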
From these examples, those skilled in the art will appreciate that the preset duration between the preceding video frame and the current video frame can be set flexibly for different motion patterns, according to the speed at which the person acts. Since the actions themselves take place on the order of seconds, setting this duration too long has no positive significance for action recognition; this also shows, from another angle, that the various action recognition models in the prior art cannot satisfy the recognition of actions such as dancing and martial arts, which highlights the inventive motivation of the present application.
For each current video frame, the frame difference information image of the current video frame not only comprises the state information of the current video frame, but also comprises the motion information of the current video frame relative to the previous video frame, and the state information and the motion information can be represented by carrying out image processing on the current video frame and the previous video frame, so that the frame difference information image has the capability of representing the motion process of the character image in the live video stream, thereby being convenient for judging the motion mode of the character image according to the state information and the motion information.
Because a live broadcast is a continuous process, and the moment at which the person image performs an action is, for a computer program, uncertain in time, an application program implemented according to the present application can continuously acquire the current video frame and the preceding video frame in the background throughout the live broadcast, and judge the action behavior of the anchor user in the live broadcasting room in time, so as to grasp the development of the live activity promptly and provide the activity state data of the live broadcasting room to downstream tasks. Accordingly, it can be understood that the current video frame at a certain moment becomes the preceding video frame of the next moment, one preset duration later; and the preceding video frame at a certain moment was the current video frame of the previous moment, one preset duration earlier. A frame difference information image is generated for each current video frame, so that a sequence of frame difference information images is continuously formed during the user's live broadcast.
Step S1200, performing representation learning on the frame difference information image by using an image feature extraction model trained in advance to a convergence state, so as to obtain image feature information:
referring to the network architecture shown in fig. 2, in order to obtain deep semantic information of a frame difference information image corresponding to each current video frame, an image feature extraction model based on a convolutional neural network structure is adopted to perform representation learning on the frame difference information image, and image feature information is correspondingly obtained.
The image feature extraction model may be a pre-trained model, or may be trained to a converged state by a person skilled in the art using samples corresponding to the frame difference information image. Various convolutional neural network models suitable for extracting feature information from images, such as a plain CNN, ResNet or EfficientNet, may serve as the image feature extraction model and can be chosen for best performance; for example, the present application recommends the ResNet series, which has been measured in practice to give good performance.
The image characteristic information is a representation of the frame difference information image in deep semantic, and because the frame difference information image contains the relative motion information between the prior video frame and the current video frame and the state information of the current video frame, the image characteristic extraction model can pay attention to the action information of the character image in the frame difference information image and can correlate the state change before and after the action, and the image characteristic information corresponding to the frame difference information image correspondingly shows the action information and the state information after the characteristic extraction.
Step 1300, performing context combing on the image feature information by adopting a semantic memory model trained in advance to a convergence state and referring to the image feature information corresponding to the previous video frame, so as to obtain comprehensive feature information:
In this step, with continued reference to the network architecture shown in Fig. 2, a semantic memory model with a long short-term memory architecture is used to perform context combing on the image feature information. The semantic memory model can be an LSTM, BiLSTM, Transformer or another neural network model suitable for processing sequence data, and can be flexibly selected by a person skilled in the art; the present application recommends, for example, the LSTM. The image feature information of the successively input frame difference information images is treated as sequence data, and after context combing according to the motion information and state information among the image feature information, comprehensive feature information is obtained, which makes the subsequent motion pattern discrimination more accurate. Likewise, the semantic memory model is trained in advance to a converged state by a person skilled in the art in accordance with the principles disclosed herein.
Step S1400, mapping the comprehensive characteristic information to a classification space by adopting a preset classifier, and judging the motion mode of the human image in the current video frame according to the classification result:
With continued reference to the network architecture shown in Fig. 2, the comprehensive feature information passes through a fully connected layer into a classifier and is mapped to a classification space. The classifier may use a binary classification space, so as to determine whether the person image in the current video frame is in the specific motion mode targeted during pre-training, such as a dance performance. Accordingly, it is easy to understand that when the semantic memory model is trained for a specific motion mode, an earlier and a later video frame corresponding to that motion mode are used as the preceding video frame and the current video frame respectively to determine the frame difference information image of the current video frame; the image feature information of the frame difference information image is obtained through the image feature extraction model and is then input into the semantic memory model for context combing; and the binarized result output by the classifier is supervised with the corresponding manual labels to drive the semantic memory model to convergence. That is, in the present application, the motion pattern determination made by the classifier outputs yes or no for one and the same specific motion mode. Taking dance recognition as an example, after the semantic memory model is connected to the classifier, the corresponding image feature information is used to train the output of whether the input belongs to dancing, and the output is supervised with the manually annotated label of whether the image belongs to dancing; training then continues with positive and negative samples, and a model suitable for judging whether the person image in a live video stream matches the specific motion mode of dancing is finally obtained.
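To make the data flow of Fig. 2 concrete, a minimal PyTorch sketch of the three stages is given below; the choice of ResNet-18, the LSTM hidden size and the 4-channel input adaptation follow the recommendations above but remain assumptions, not a definitive implementation of this application.

    # Minimal sketch of the pipeline in Fig. 2: ResNet backbone -> LSTM memory -> binary classifier.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class MotionPatternRecognizer(nn.Module):
        def __init__(self, feat_dim=512, hidden_dim=256):
            super().__init__()
            backbone = resnet18(weights=None)
            backbone.conv1 = nn.Conv2d(4, 64, 7, 2, 3, bias=False)  # 4-channel frame diff image
            backbone.fc = nn.Identity()                    # keep the 512-d feature vector
            self.feature_extractor = backbone              # image feature extraction model
            self.memory = nn.LSTM(feat_dim, hidden_dim, batch_first=True)  # semantic memory model
            self.classifier = nn.Linear(hidden_dim, 1)     # full connection into a binary space

        def forward(self, frame_diff_image, hidden=None):
            feat = self.feature_extractor(frame_diff_image)          # image feature information
            fused, hidden = self.memory(feat.unsqueeze(1), hidden)   # context combing
            logit = self.classifier(fused[:, -1])                    # comprehensive feature -> class
            return torch.sigmoid(logit), hidden            # probability of the specific motion mode

    # Usage: call once per sampled current video frame, carrying `hidden` between calls
    # so the memory model refers to the preceding frame's image feature information.
    model = MotionPatternRecognizer().eval()
    prob1, state = model(torch.randn(1, 4, 224, 224))            # first frame difference image
    prob2, state = model(torch.randn(1, 4, 224, 224), state)     # the next one, with context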
The result information output and judged by the classifier can be provided for downstream tasks for further processing, such as corresponding labeling of live broadcast rooms and anchor users of the live video stream, or promotion of notification messages of the live video stream in a specific motion mode to other users, etc.
From the present exemplary embodiment, it can be appreciated that the practice of the present application has many positive advantages, including but not limited to the following:
Firstly, a frame difference information image is obtained from a current video frame and an earlier, preceding video frame in the live video stream; the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the preceding video frame, which is not adjacent to it. On this basis, representation learning is performed on the frame difference information image to obtain deep semantic information, which is then processed by a semantic memory model having context combing capability and acting as a sequence organizer. A representation learning effect on the motion pattern is thus achieved with only two time-ordered video frames, so that a determination on the motion pattern of the person image in the live video can be made from the corresponding deep semantic information.
Secondly, for relatively rapid user actions such as dancing and martial arts in a live broadcasting room, which are often related to the promotion of that room, the present application rapidly discriminates the motion pattern of the person in the room's live video stream and thereby intelligently identifies the live activity taking place, so that downstream tasks can be guided promptly; for example, advertisement information of a live broadcasting room in which dancing is performed can be pushed to related users according to the identified dancing behavior. It is therefore easy to understand that implementing the present application has a remarkable boosting effect on improving the user experience of the webcast service and on increasing user traffic.
In addition, massive live video streams are generated in the webcast scene and contain various motion patterns. Video frames sampled from these streams can provide the data samples needed to train the various neural network models applied by the webcast platform, improving the feature generalization capability of the models, and the trained models can in turn serve the same platform to identify the motion patterns of person images in the live streams of a massive number of live broadcasting rooms. A closed loop with continuous bidirectional promotion is thus formed, enabling the webcast platform to obtain economies of scale.
In a specific embodiment, the step S1100 of acquiring a frame difference information image corresponding to a current video frame in a live video stream includes the following steps:
step S1110, acquiring two discontinuous video frames from the live video stream processed by the media server, including a previous video frame and a current video frame:
in this embodiment, a server cluster in which a computer program implemented by the technical scheme of the present application is deployed on a network live broadcast platform is taken as an example to describe the present embodiment, so that a live video stream can be more conveniently acquired from a media server therein to identify a motion mode.
When extracting video frames from the live video stream, the live video stream can be decoded and mapped to an image space, from which each video frame can be acquired. As described above, for each live video stream, the present application continuously acquires video frames at intervals of the preset duration; two video frames are acquired for each identification procedure, the earlier one serving as the preceding video frame and the later one as the current video frame. When the recognition program next acquires two video frames, the preceding video frame may be the current video frame of the previous identification procedure. Because the two video frames are acquired according to the preset duration or preset frame number, they are discontinuous with respect to the frame sequence of the live video stream.
The media server concurrently serves the stream pushing of a large number of live broadcasting rooms, so that a plurality of live video streams exist at the same time.
Step S1120, generating a frame difference information image corresponding to the current video frame, where the frame difference information image includes state information of the current video frame and motion information of the current video frame relative to the previous video frame:
for each identification procedure, the current video frame and the previous video frame are obtained and then can be used for generating a frame difference information image corresponding to the current video frame.
Referring to fig. 3, in one embodiment deepened on the basis of the present embodiment, the generating a frame difference information image corresponding to a current video frame in the present application may be implemented as the following specific steps:
step S2100, calculating a pixel level difference value between a previous video frame and a current video frame, to obtain first frame difference information corresponding to the current video frame:
A preset image filter is called to make a pixel-by-pixel difference between the preceding video frame and the current video frame and then take the modulus, obtaining the difference value corresponding to each pixel between the preceding video frame and the current video frame, i.e. the first frame difference information. In this embodiment, it is recommended to compute the first frame difference information by applying the frame difference method with the tools provided by OpenCV. OpenCV is written in C++ and is characterized by high execution efficiency.
The basic principle of the frame difference method is to extract the motion region in an image by pixel-based temporal differencing and thresholding between two or three adjacent frames of an image sequence. Specifically, the corresponding pixel values of adjacent frames are subtracted to obtain a differential image, and the differential image is binarized: under the condition that the ambient brightness changes little, if the change of a pixel value is smaller than a predetermined threshold, the pixel can be regarded as a background pixel; if the pixel values of an image area change greatly, this can be considered to be caused by a moving object in the image, those areas are marked as foreground pixels, and the position of the moving object in the image can be determined from the marked pixel areas. Because the time interval between two adjacent frames is very short, using the previous frame image as the background model for the current frame gives good real-time performance; the background is not accumulated, the update speed is fast, the algorithm is simple, and the amount of calculation is small. Therefore, computing the frame difference between the preceding video frame and the current video frame according to this principle to obtain the first frame difference information is efficient and fast.
Step S2200, performing smoothing filter processing on the first frame difference information to obtain second frame difference information so as to highlight edge information therein:
On the basis of the first frame difference information, a smoothing filter of a preset scale is adopted to perform convolution filtering on the first frame difference information; the filter scale is recommended to be set to 3×3. The second frame difference information is obtained after the convolution operation, and through this filtering the edge information of the human body image in the video frame is highlighted.
Step S2300, performing a dot product operation on the current video frame and the second frame difference information, to obtain a motion pattern saliency map that integrates motion information of the current video frame relative to the previous video frame:
a dot multiplication operation is performed on the current video frame and the second frame difference information, and the edge information in the second frame difference information is used to soften and amplify the image edges in the current video frame, obtaining the motion pattern saliency map, which represents the motion information of the current video frame relative to the previous video frame.
Step S2400, merging the motion pattern saliency map and the gray scale map of the previous video frame to form a frame difference information image:
finally, the motion pattern saliency map is treated as three RGB channels, the gray scale map of the previous video frame is taken as one channel, and the two are concatenated to obtain a four-channel frame difference information image. The gray scale channel corresponding to the previous video frame represents the state information of the human body image at the previous moment, while the three color channels corresponding to the saliency map represent the motion information of the human body image between the two video frames. The frame difference information image therefore integrates the motion information between the two video frames and the state information of the previous video frame, the former being dynamic change information and the latter static information, so that it can subsequently guide the semantic memory model to pay attention to the motion information in the human body image and to correlate the preceding and following state changes in the live video stream, and the motion pattern can be identified from the frame difference information image.
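By way of a non-limiting illustration, steps S2200 to S2400 may be sketched as follows, assuming curr_frame is a BGR color image, prev_gray is the gray scale map of the previous video frame, and diff is the first frame difference information computed on grayscale frames; the variable names and the choice of a 3×3 mean kernel for smoothing are illustrative assumptions.

import cv2
import numpy as np

def build_frame_difference_image(curr_frame, prev_gray, diff):
    kernel = np.ones((3, 3), np.float32) / 9.0       # 3x3 smoothing filter
    diff2 = cv2.filter2D(diff, -1, kernel)           # second frame difference information
    weight = diff2.astype(np.float32) / 255.0
    saliency = (curr_frame.astype(np.float32) *
                weight[..., None]).astype(np.uint8)  # dot product: motion pattern saliency map
    fd_image = np.dstack([saliency, prev_gray])      # three color channels + one grayscale channel
    return fd_image                                  # four-channel frame difference information image

The resulting array can then be fed to the image feature extraction model in place of an ordinary three-channel picture.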
It should be noted that the process of generating the frame difference information image corresponding to the current video frame may be applied not only when the various models of the present application are used in production, but also when they are trained, according to the need of the models to obtain frame difference image information; for example, the subsequent step S3200 of the present application also refers to this process.
For example, in fig. 4, I1 is the previous video frame and I2 is the current video frame, both sampled from the same live video stream; both are color images (rendered as grayscale images to comply with the provisions of the patent examination guidelines). The frame difference is calculated to obtain the image of the first frame difference information shown as M1, which contains the action information between the previous video frame and the current video frame; smoothing yields the image M2 corresponding to the second frame difference information; and a dot product operation of the second frame difference information M2 and the current video frame I2 yields the motion pattern saliency map shown as S2, while the dot product result of the first frame difference information M1 and the current video frame I2, corresponding to the image S1, is given for reference only. Fig. 4 graphically shows the effect of the various intermediate states in the process of calculating the frame difference information image, which is convenient for the reader to understand. The image S2 is then combined with the gray scale image of I1 to obtain the frame difference information image of the present application.
According to the above embodiment, on the basis of extracting pairs of video frames from a live video stream, an image algorithm is adopted to quickly calculate the frame difference between the previous video frame and the current video frame, a saliency map is obtained after smoothing filtering and synthesis, and the saliency map is then channel-concatenated with the gray scale map of the previous video frame to obtain the frame difference information image. The frame difference information image thus synthesizes the motion information between the current video frame and the previous video frame and also carries the state information of the previous video frame, providing the rich and comprehensive semantic representation required for motion behavior recognition, so that the subsequent models can be effectively guided to perform motion pattern recognition, and the process is efficient and fast.
Referring to fig. 5, in an extended embodiment, before the step of acquiring the frame difference information image corresponding to the current video frame in the live video stream, namely step S1100, the method includes the following training process:
step S3100, obtaining two sample video frames obtained by video sampling in the same motion mode as training samples, where the two sample video frames include a current video frame and a preceding video frame whose time sequence is preceding, and the motion mode is dance performance:
one or more videos corresponding to the same motion pattern are prepared in advance, video frames are collected from the videos as training samples at preset time intervals, and the training samples are manually labeled in advance: training samples whose two video frames contain dance motion behavior are labeled as positive samples, and training samples that do not contain dance performance behavior are labeled as negative samples. After labeling is completed, the data set required for training is constructed. Each training sample includes at least two to three sample video frames; two are generally used.
When training the models required for a recognition task, the training samples used are those obtained for the same motion pattern; for example, for the specific motion pattern of dance performance, the training samples obtained are images sampled corresponding to the motion pattern of dance performance. It should be noted that positive samples are samples manually determined to be dance performances, and negative samples are samples manually determined not to be dance performances, which may be suspected dance performances or clearly non-dance performances.
In a more efficient manner, in combination with the webcast application scene, sampling may be performed according to the live activity state of a webcast user, for example when the user starts a live performance activity function, from the corresponding live video stream on the media server of the network live broadcast platform or from the historical videos of the webcast user participating in the corresponding activity; the supervision label of the sampled training sample can then be automatically generated according to the labeling information provided by the live activity state. In this way, the value of the massive video data in the network live broadcast platform can be mined and the cost of manual labeling can be saved.
When the training samples are collected from the video, the two sample video frames in a training sample are sampled at intervals of a preset duration or a preset frame number, where the preset duration or preset frame number is determined according to the principle of the exemplary embodiment of the present application. Therefore, of the two sample video frames, the sample video frame with the earlier time stamp is the previous video frame, and the one with the later time stamp is the current video frame.
In addition, a person skilled in the art can flexibly collect continuously shot pictures of a dance performance process as training samples. Similarly, if the network architecture being trained is intended to recognize another specific motion pattern, such as a martial arts performance, training samples containing the corresponding motion behaviors need to be collected accordingly to construct the data set required for training. A person skilled in the art can adapt flexibly in this regard.
Step S3200, generating a frame difference information image corresponding to the current video frame, where the frame difference information image includes state information of the current video frame and motion information of the current video frame relative to the previous video frame:
the principle and process of this step are identical to those of step S1120, to which reference is made; its specific process may refer to steps S2100 to S2400 described above, and is not repeated here.
Step S3300, performing representation learning on the frame difference information image by using an image feature extraction model trained in advance to a convergence state, so as to obtain image feature information:
in this step, the image feature extraction model is used to perform representation learning on the frame difference information image; the principle and process are the same as those of step S1200, and the corresponding image feature information is obtained from the frame difference information image, so the identical parts are not described again.
Step S3400, performing context combing on the image characteristic information by adopting a semantic memory model in a training state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information:
in this step, the semantic memory model is used to perform context combing on the image feature information; the principle and process are the same as those of step S1300, and the corresponding comprehensive feature information is obtained, so the identical parts are not described again. It can be understood that, because the semantic memory model is based on a long short-term memory model and holds memory data of the state of the preceding frame difference information image in the frame difference information image sequence, during context combing it performs semantic extraction on the image feature information with reference to the image feature information of the preceding frame difference information image, so that the resulting comprehensive feature information integrates the overall motion information in the frame difference information image sequence.
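By way of a non-limiting illustration, a semantic memory model based on a long short-term memory network may be sketched as follows; the feature dimension of 512 and the single-layer configuration are illustrative assumptions rather than requirements of the present application.

import torch.nn as nn

class SemanticMemory(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feats, state=None):
        # feats: (batch, 1, feat_dim) image feature information of the current frame difference image;
        # state carries the memory of preceding frame difference images in the sequence
        out, state = self.lstm(feats, state)
        return out[:, -1], state   # comprehensive feature information and the updated memory state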
Step S3500, mapping the comprehensive feature information to a classification space by using a classifier in a training state to obtain a corresponding classification label:
As in step S1400, the comprehensive feature information obtained after context combing by the semantic memory model is fed through a fully connected layer into the classifier for classification, obtaining a corresponding classification label. The classification label is a binarized result; for example, a normalized value of 1 or 0 indicates that the motion information contained between the two sample video frames does or does not belong to the specific motion pattern, taking the current video frame as reference.
Step S3600, calculating a loss value of the classification label based on the supervision label corresponding to the training sample, when the loss value reaches a preset threshold value, terminating the training task, otherwise, calling the next training sample to implement iterative training:
as described above, when the data set is constructed, the corresponding supervision label of each training sample is manually labeled, including the supervision labels of the positive samples and those of the negative samples. Therefore, for the classification label output by the classifier, the cross entropy loss can be calculated with the supervision label of the training sample to obtain the corresponding loss value. When the loss value reaches the preset threshold, the semantic memory model has been trained to a convergence state, so the training task can be terminated; when the loss value does not reach the preset threshold, the loss function of the semantic memory model has not converged, the loss value is used to perform a gradient update on the semantic memory model and correct the relevant weight parameters of the intermediate process, and then the next training sample is called from the data set to iteratively train the semantic memory model until it is trained to a convergence state.
According to the above embodiment, training samples are obtained by appropriate sampling and used to prepare frame difference information images for training the semantic memory model, so that the semantic memory model acquires the recognition capability. Because the training samples are conveniently sampled from the massive live video streams of the network live broadcast platform, a large amount of data cost can be saved; and because training samples sampled from the activity state data in the live video streams can be labeled automatically, a great deal of manpower and material resources can be saved, the construction efficiency of the data set is improved, and the training speed of the semantic memory model is further improved. In addition, under the guidance of the massive training samples provided by the network live broadcast platform, the semantic memory model can enhance its feature generalization capability, be trained to a convergence state more quickly, and more easily and accurately recognize the specific motion pattern in a live video stream.
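By way of a non-limiting illustration, one iteration of this training step may be sketched as follows; the classifier structure, the loss threshold and the optimizer settings are illustrative assumptions.

import torch
import torch.nn as nn

classifier = nn.Linear(512, 2)                     # maps comprehensive features to the classification space
criterion = nn.CrossEntropyLoss()                  # cross entropy against the supervision label
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

def train_step(comprehensive_feat, supervision_label, loss_threshold=0.05):
    # comprehensive_feat: (batch, 512); supervision_label: LongTensor of class indices (1 positive, 0 negative)
    logits = classifier(comprehensive_feat)
    loss = criterion(logits, supervision_label)
    if loss.item() <= loss_threshold:              # loss reaches the preset threshold: treated as converged
        return True
    optimizer.zero_grad()
    loss.backward()                                # gradient update of the relevant weight parameters
    optimizer.step()
    return False                                   # call the next training sample and iterate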
Referring to fig. 6, in a specific embodiment, the step S1400 of mapping the integrated feature information to a classification space by using a preset classifier, and determining a motion mode of a person image in a current video frame according to a classification result includes the following steps:
step S1410, mapping the integrated feature information to a classification space by using a preset classifier, to obtain a binarized classification result:
As described above, the classifier connected to the semantic memory model maps the comprehensive feature information obtained by the semantic memory model to a binary classification space to obtain the corresponding binary classification result, in which the classification probabilities corresponding to the specific motion mode and the non-specific motion mode are logically represented, typically in the vector form obtained after normalization: [1,0] or [0,1].
Step S1420, according to the classification result, when the classification result represents the truth value result, it is determined that the human image in the current video frame is in a specific motion mode:
given that the classification probabilities are normalized, whether the human image in the current video frame is in the specific motion mode can be determined from the value of each classification label in the classification result. For example, for [1,0], the first element is the true value, indicating that the human image in the current video frame matches the specific motion pattern, and the classification result can be regarded as a true value result; when the result is [0,1], the classification result can be regarded as a false value result. When the specific motion mode is the dance performance mode, whether a dance performance is being carried out is judged according to whether the classification result is true or false: when it is true, it is judged that the character image in the live video stream is carrying out a dance performance, and when it is false, it is judged that no dance performance is being carried out.
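By way of a non-limiting illustration, reading the binarized classification result may be sketched as follows, assuming probs is the normalized two-element vector and its first element corresponds to the specific motion mode.

import torch

def is_specific_motion_mode(probs):
    # probs: 1-D tensor of two normalized probabilities, e.g. close to [1, 0] for a dance performance
    return bool(torch.argmax(probs, dim=-1).item() == 0)   # first element is the true value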
Step S1430, adding a highlight label to a live broadcasting room providing the live video stream when the live broadcasting room is in a specific motion mode, and improving the sorting priority of the live broadcasting room in a display list corresponding to the specific motion mode:
adapting to the business logic of the network live broadcast platform, when a live broadcasting room is judged to be in the specific motion mode, a highlight label can be added to the live broadcasting room providing the live video stream, and the highlight label can be used by downstream tasks. In the business logic of the network live broadcast platform, a downstream task can parse the highlight label to improve the exposure of the live broadcasting room: in a display list matching the specific motion mode, for example the "hot dance live broadcasting room list" of the network live broadcast platform, the access entrance of the live broadcasting room is displayed, and the sorting priority of that access entrance in the display list is raised according to the highlight label of the live broadcasting room.
In the graphical user interface of the "hot dance live broadcasting room list" at a first moment shown in fig. 7, since the host user has not yet entered the dance performance state, the network live broadcast platform applying the technical scheme of the present application detects that there is no dance performance activity in the live video stream; therefore, the ranking of the live broadcasting room in the list is not promoted and it remains in a lower position.
In the graphical user interface of the "hot dance live broadcasting room list" at a second moment shown in fig. 8, since the host user has entered the dance performance state, the network live broadcast platform applying the technical scheme of the present application detects and judges that a dance performance activity is being held in the live video stream, and accordingly adds a highlight label to the live broadcasting room so that its sorting priority is promoted and it is switched to a higher position in the list.
In this embodiment, in combination with the front-end presentation of the network live broadcast platform, the technical scheme of the present application is applied to the network live broadcast platform to obtain the relevant technical effects: by detecting the specific motion mode in the live video stream of a live broadcasting room, whether the live broadcasting room is holding an activity matching the specific motion mode can be rapidly determined, and the exposure of the live broadcasting room can then be adjusted accordingly, achieving a good recommendation effect. This has a remarkable positive effect on improving the exposure of live broadcasting rooms, increasing the traffic of active live broadcasting rooms, and improving the information acquisition efficiency of audience users.
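By way of a non-limiting illustration, the downstream re-ranking logic may be sketched as follows; the LiveRoom record and its fields are purely hypothetical names introduced for illustration and are not part of the present application.

from dataclasses import dataclass

@dataclass
class LiveRoom:
    room_id: str
    highlight: bool = False    # set when the specific motion mode is detected in the room's stream
    base_score: float = 0.0    # the platform's ordinary ranking score

def rank_display_list(rooms):
    # rooms carrying the highlight label float to the top of the matching display list
    return sorted(rooms, key=lambda r: (r.highlight, r.base_score), reverse=True)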
Referring to fig. 9, in an extended embodiment, before the step of acquiring the frame difference information image corresponding to the current video frame in the live video stream, namely step S1100, the method includes the following training process:
Step S4100, randomly initializing two image feature extraction models to be trained, wherein one image feature extraction model is used as a training target, and the other image feature extraction model is used as a supervision target:
in the prior art, a pre-trained image feature extraction model often lacks the capability to generalize the features of images corresponding to a certain specific motion pattern, and its features often deviate in practice; therefore, this embodiment further provides a scheme for self-training the image feature extraction model.
It is recommended to use a well-performing ResNet as the backbone network of the image feature extraction model, with an external multi-layer perceptron for further semantic extraction. Accordingly, the architecture shown in fig. 10 is adopted, that is, two instances of the image feature extraction model are randomly initialized and then trained. One image feature extraction model instance serves as the training target and outputs a prediction result label, while the other serves as the supervision target and provides the former with soft labels for supervised training.
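By way of a non-limiting illustration, the two instances may be sketched as follows, assuming a recent torchvision is available and the input is the four-channel frame difference information image; the ResNet-18 backbone, the layer sizes and the variable names are illustrative assumptions.

import torch.nn as nn
import torchvision.models as models

def build_feature_extractor(out_dim=512):
    backbone = models.resnet18(weights=None)               # randomly initialized backbone network
    backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
                               padding=3, bias=False)       # accept 4-channel frame difference images
    backbone.fc = nn.Identity()                              # keep the 512-d backbone feature
    mlp = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True),
                        nn.Linear(512, out_dim))             # external multi-layer perceptron
    return nn.Sequential(backbone, mlp)

student = build_feature_extractor()    # training target
teacher = build_feature_extractor()    # supervision target, supplies the soft labels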
Step S4200, dividing the acquired sample picture into two paths, and performing random data enhancement processing to obtain two data enhancement views, where the sample picture is a frame difference information image:
In order to make the image feature extraction model more suitable for the motion pattern recognition required in the present application, a sample picture may be obtained from the dataset, which may be a normal picture including a human body image. The present embodiment recommends the step of generating the frame difference information image corresponding to the current video frame according to the present application, that is, sampling from the video stream by means of step S3100, and then using the frame difference information image generated by means of step S3200 as the sample picture. The frame difference information image is used as a sample picture, so that the generalization capability of the image feature extraction model to the image features in the specific motion mode can be enhanced, the representation learning effect of the image feature extraction model is better, and the obtained image feature information can be used for accurately identifying the specific motion mode.
The sample picture is divided into two paths corresponding to the two instances of the image feature extraction model, and random data enhancement processing is performed on each path; the corresponding data enhancement means may be cropping, scaling, flipping, illumination or brightness change, Gaussian blur, and the like, so that two data enhancement views are obtained and differentiation is achieved on the basis of the same sample picture.
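By way of a non-limiting illustration, the two random data enhancement branches may be sketched as follows, assuming the sample picture is a three-channel PIL image; applying the same idea to the four-channel frame difference information image would require channel-aware equivalents, and all parameter values are illustrative assumptions.

import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),        # cropping and scaling
    T.RandomHorizontalFlip(),        # flipping
    T.ColorJitter(brightness=0.4),   # illumination / brightness change
    T.GaussianBlur(kernel_size=5),   # Gaussian blur
    T.ToTensor(),
])

def two_views(sample_picture):
    # two independent random draws yield two differentiated data enhancement views
    return augment(sample_picture), augment(sample_picture)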
Step S4300, respectively inputting the two data enhancement views into the representation layers of the two image feature extraction models to perform representation learning, and obtaining two corresponding intermediate feature information:
The two data enhancement views are then respectively provided to the two instances of the image feature extraction model and enter their representation layers, for example the backbone network constructed with ResNet, for representation learning, extracting deep semantic information from the data enhancement views and obtaining two corresponding pieces of intermediate feature information.
Step S4400, extracting semantic information from two corresponding intermediate feature information through a multi-layer perceptron of two image feature extraction models to obtain two corresponding image feature information:
each image feature extraction model instance applies a fully connected operation to the corresponding intermediate feature information to integrate and further extract semantic information, and outputs the corresponding image feature information. The two instances accordingly obtain two pieces of image feature information.
Step S4500, calculating a loss value of the image feature information of the training target according to the image feature information of the supervision target, and performing gradient update on the training target according to the loss value, and performing iterative training until the training target reaches a convergence state:
as described above, one of the image feature extraction model instances serves as the supervision target, and the image feature information it outputs serves as a soft label for supervising the image feature information obtained by the other instance. Therefore, the loss value of the latter is calculated using the L2 loss function. When the loss value reaches the preset threshold, the image feature extraction model serving as the training target has been trained to a convergence state, so the training task can be terminated; when the loss value does not reach the preset threshold, the loss function of the image feature extraction model serving as the training target has not converged, the loss value is used to perform a gradient update on it and correct the relevant weight parameters of the intermediate process, and then the next training sample is called from the data set to iteratively train the image feature extraction model serving as the training target until it is trained to a convergence state.
In this embodiment, training is performed on the same sample picture by means of two instances of the same image feature extraction model, and the soft label output by one instance is used to supervise the image feature information output by the other, so that semi-supervised learning is realized, labeling cost is saved, the training process is simplified, and training efficiency is improved. In particular, when frame difference information images are used for training, the image feature extraction model serving as the training target acquires the ability to accurately capture the semantic information corresponding to the action behaviors, and the image feature information it obtains helps improve the accuracy of the semantic memory model in recognizing the motion pattern.
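By way of a non-limiting illustration, one self-training step may be sketched as follows, assuming student and teacher are the two instances sketched after step S4100 above; the teacher output is detached so that it only supplies a soft label, and the optimizer settings are illustrative assumptions.

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def self_train_step(view_a, view_b):
    pred = student(view_a)                 # image feature information of the training target
    with torch.no_grad():
        target = teacher(view_b)           # soft label from the supervision target
    loss = F.mse_loss(pred, target)        # L2 (mean squared error) loss between the two feature vectors
    optimizer.zero_grad()
    loss.backward()                        # gradient update of the training target only
    optimizer.step()
    return loss.item()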
Referring to fig. 11, a motion pattern recognition apparatus according to one of the objects of the present application includes: a frame difference acquisition module 1100, a representation learning module 1200, a memory combing module 1300 and a classification judging module 1400, wherein the frame difference acquisition module 1100 is used for acquiring a frame difference information image corresponding to a current video frame in a live video stream, and the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to a previous video frame which is discontinuous with the current video frame; the representation learning module 1200 is configured to perform representation learning on the frame difference information image by using an image feature extraction model trained in advance to a convergence state, so as to obtain image feature information; the memory combing module 1300 is configured to perform context combing on the image feature information, with reference to the image feature information corresponding to the previous video frame, by using a semantic memory model trained in advance to a convergence state, so as to obtain comprehensive feature information; the classification judging module 1400 is configured to map the comprehensive feature information to a classification space by using a preset classifier, and determine a motion mode of a person image in the current video frame according to the classification result.
In an embodiment, the frame difference obtaining module 1100 includes: the image sampling sub-module is used for acquiring two discontinuous video frames from the live video stream processed by the media server, wherein the two discontinuous video frames comprise a prior video frame and a current video frame; the frame difference generation sub-module is used for generating a frame difference information image corresponding to the current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame.
In an extended embodiment, the motion pattern recognition device of the present application further includes: the sample calling training time module is used for acquiring two sample video frames obtained by video sampling of the same movement pattern as a training sample, wherein the two sample video frames comprise a current video frame and a preceding video frame with a preceding time sequence, and the movement pattern is a dance performance; the frame difference generation training time module is used for generating a frame difference information image corresponding to a current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame; the representation learning training time module is used for performing representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information; the memory combing training time module is used for carrying out context combing on the image characteristic information by adopting a semantic memory model in a training state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information; the classification judgment training time module is used for mapping the comprehensive characteristic information to a classification space by adopting a classifier in a training state to obtain a corresponding classification label; and the gradient updating training time module is used for calculating the loss value of the classification label based on the supervision label corresponding to the training sample, terminating the training task when the loss value reaches a preset threshold, and otherwise, calling the next training sample to implement iterative training.
In a further embodiment, the frame difference generation sub-module and the frame difference generation training time module each include the following sub-modules: the difference value calculation sub-module, used for calculating pixel level difference values of the previous video frame and the current video frame and obtaining first frame difference information corresponding to the current video frame; the smoothing filter sub-module, used for carrying out smoothing filter processing on the first frame difference information to obtain second frame difference information so as to highlight edge information in the second frame difference information; the information synthesis sub-module, used for carrying out dot multiplication operation on the current video frame and the second frame difference information to obtain a motion pattern saliency map integrating motion information of the current video frame relative to the previous video frame; and the channel merging sub-module, used for merging the motion pattern saliency map and the gray level map of the previous video frame to form a frame difference information image.
In an embodiment, the classification determination module 1400 includes: the classification mapping sub-module is used for mapping the comprehensive characteristic information to a classification space by adopting a preset classifier to obtain a binarization classification result; the mode judging sub-module is used for judging that the person image in the current video frame is in a specific motion mode according to the classification result and when the classification result represents a true value result; and the high-light marking sub-module is used for adding a high-light label to a live broadcasting room for providing the live video stream when the live broadcasting room is in a specific motion mode, and improving the sorting priority of the live broadcasting room in a display list corresponding to the specific motion mode.
In an extended embodiment, the motion pattern recognition device of the present application further includes: the model initialization module is used for randomly initializing two image feature extraction models to be trained, wherein one image feature extraction model is used as a training target, and the other image feature extraction model is used as a supervision target; the data enhancement module is used for obtaining a sample picture, dividing the sample picture into two paths and respectively carrying out random data enhancement processing to obtain two data enhancement views, wherein the sample picture is a frame difference information image; the feature extraction module is used for respectively inputting the two data enhancement views into the representation layers of the two image feature extraction models to perform representation learning so as to obtain two corresponding intermediate feature information; the perception extraction module is used for extracting semantic information from the two corresponding intermediate feature information through the multi-layer perception machine of the two image feature extraction models respectively to obtain two corresponding image feature information; and the gradient updating module is used for calculating the loss value of the image characteristic information of the training target according to the image characteristic information of the supervision target, carrying out gradient updating on the training target according to the loss value, and carrying out iterative training until the training target reaches a convergence state.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Fig. 12 schematically shows the internal structure of the computer device. The computer device includes a processor, a computer readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store a control information sequence, and the computer readable instructions, when executed by the processor, can enable the processor to realize a motion pattern recognition method. The processor of the computer device is used to provide computing and control capabilities, supporting the operation of the entire computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform the motion pattern recognition method of the present application. The network interface of the computer device is used for communicating with a connected terminal. It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
The processor in this embodiment is configured to execute the specific functions of each module and its sub-modules in fig. 11, and the memory stores the program codes and various data required for executing these modules or sub-modules. The network interface is used for data transmission with the user terminal or the server. The memory in this embodiment stores the program codes and data required for executing all modules/sub-modules in the motion pattern recognition apparatus of the present application, and the server can call these program codes and data to execute the functions of all the sub-modules.
The present application also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the motion pattern recognition method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which when executed by one or more processors implement the steps of the method described in any of the embodiments of the present application.
Those skilled in the art will appreciate that implementing all or part of the above-described methods of embodiments of the present application may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed, may comprise the steps of embodiments of the methods described above. The storage medium may be a computer readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
In summary, the motion mode corresponding to the motion behavior of the human body image in the live video stream can be accurately identified, basic data is provided for the downstream task of the network live platform, and user experience can be improved.
Those of skill in the art will appreciate that the various operations, methods, steps in the flow, actions, schemes, and alternatives discussed in the present application may be alternated, altered, combined, or eliminated. Further, other steps, means, or steps in a process having various operations, methods, or procedures discussed in this application may be alternated, altered, rearranged, split, combined, or eliminated. Further, steps, measures, schemes in the prior art with various operations, methods, flows disclosed in the present application may also be alternated, altered, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application, and it should be noted that, for a person skilled in the art, several improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (9)

1. A method for motion pattern recognition, comprising the steps of:
Acquiring a frame difference information image corresponding to a current video frame in a live video stream, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to a discontinuous previous video frame;
performing representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information;
performing context combing on the image characteristic information by adopting a semantic memory model trained in advance to a convergence state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information;
and mapping the comprehensive characteristic information to a classification space by adopting a preset classifier to obtain a binary classification result, judging that a human image in a current video frame is in a specific motion mode when the classification result represents a true value result according to the classification result, and adding a highlight label to a live broadcasting room for providing the live broadcasting video stream when the classification result is in the specific motion mode to promote the sequencing priority of the live broadcasting room in a display list corresponding to the specific motion mode.
2. The motion pattern recognition method according to claim 1, wherein acquiring a frame difference information image corresponding to a current video frame in a live video stream comprises the steps of:
Acquiring two discontinuous video frames from a live video stream processed by a media server, wherein the two discontinuous video frames comprise a prior video frame and a current video frame;
generating a frame difference information image corresponding to the current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame.
3. The motion pattern recognition method according to claim 1, wherein before the step of acquiring the frame difference information image corresponding to the current video frame in the live video stream, the method comprises the following training process:
acquiring two sample video frames obtained by video sampling of the same movement pattern as training samples, wherein the two sample video frames comprise a current video frame and a preceding video frame with a preceding time sequence, and the movement pattern is a dance performance;
generating a frame difference information image corresponding to a current video frame, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to the previous video frame;
performing representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information;
Performing context combing on the image characteristic information by adopting a semantic memory model in a training state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information;
mapping the comprehensive characteristic information to a classification space by using a classifier in a training state to obtain a corresponding classification label;
and calculating the loss value of the classification label based on the supervision label corresponding to the training sample, terminating the training task when the loss value reaches a preset threshold, and otherwise, calling the next training sample to implement iterative training.
4. The motion pattern recognition method according to claim 2, wherein generating the frame difference information image corresponding to the current video frame, comprises the steps of:
calculating pixel level difference values of a previous video frame and a current video frame to obtain first frame difference information corresponding to the current video frame;
smoothing the first frame difference information to obtain second frame difference information so as to highlight edge information therein;
performing point multiplication operation on the current video frame and the second frame difference information to obtain a motion mode saliency map integrating motion information of the current video frame relative to the previous video frame;
And merging the motion mode saliency map and the gray scale map of the previous video frame to form a frame difference information image.
5. The motion pattern recognition method according to claim 1, wherein before the step of acquiring the frame difference information image corresponding to the current video frame in the live video stream, the method comprises the following training process:
randomly initializing two image feature extraction models to be trained, wherein one image feature extraction model is used as a training target, and the other image feature extraction model is used as a supervision target;
the method comprises the steps of obtaining a sample picture, dividing the sample picture into two paths, and respectively carrying out random data enhancement processing to obtain two data enhancement views, wherein the sample picture is a frame difference information image;
respectively inputting the two data enhancement views into the representation layers of the two image feature extraction models to perform representation learning, and obtaining two corresponding intermediate feature information;
extracting semantic information from the two corresponding intermediate feature information through a multi-layer perceptron of the two image feature extraction models respectively to obtain two corresponding image feature information;
and calculating a loss value of the image characteristic information of the training target according to the image characteristic information of the supervision target, carrying out gradient update on the training target according to the loss value, and carrying out iterative training until the training target reaches a convergence state.
6. A motion pattern recognition apparatus, comprising:
the frame difference acquisition module is used for acquiring a frame difference information image corresponding to a current video frame in the live video stream, wherein the frame difference information image comprises state information of the current video frame and motion information of the current video frame relative to a discontinuous previous video frame;
the representation learning module is used for carrying out representation learning on the frame difference information image by adopting an image feature extraction model which is trained in advance to a convergence state to obtain image feature information;
the memory combing module is used for carrying out context combing on the image characteristic information by adopting a semantic memory model which is trained in advance to a convergence state and referring to the image characteristic information corresponding to the previous video frame to obtain comprehensive characteristic information;
the classification judging module is used for mapping the comprehensive characteristic information to a classification space by adopting a preset classifier to obtain a binary classification result, judging that a person image in a current video frame is in a specific motion mode when the classification result represents a true value result according to the classification result, adding a highlight label to a live broadcasting room for providing the live broadcasting video stream when the person image is in the specific motion mode, and improving the sorting priority of the live broadcasting room in a display list corresponding to the specific motion mode.
7. A computer device comprising a central processor and a memory, characterized in that the central processor is arranged to invoke a computer program stored in the memory for performing the steps of the method according to any of claims 1 to 5.
8. A computer-readable storage medium, characterized in that it stores in the form of computer-readable instructions a computer program implemented according to the method of any one of claims 1 to 5, which, when invoked by a computer, performs the steps comprised by the corresponding method.
9. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 5.
CN202111555402.4A 2021-12-17 2021-12-17 Motion pattern recognition method and device, equipment, medium and product thereof Active CN114220175B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111555402.4A CN114220175B (en) 2021-12-17 2021-12-17 Motion pattern recognition method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111555402.4A CN114220175B (en) 2021-12-17 2021-12-17 Motion pattern recognition method and device, equipment, medium and product thereof

Publications (2)

Publication Number Publication Date
CN114220175A CN114220175A (en) 2022-03-22
CN114220175B true CN114220175B (en) 2023-04-25

Family

ID=80703879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111555402.4A Active CN114220175B (en) 2021-12-17 2021-12-17 Motion pattern recognition method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN114220175B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233304B (en) * 2022-11-30 2024-04-05 荣耀终端有限公司 Schedule-based equipment state synchronization system, method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110624B (en) * 2019-04-24 2023-04-07 江南大学 Human body behavior recognition method based on DenseNet and frame difference method characteristic input
CN110135386B (en) * 2019-05-24 2021-09-03 长沙学院 Human body action recognition method and system based on deep learning
CN110765860B (en) * 2019-09-16 2023-06-23 平安科技(深圳)有限公司 Tumble judging method, tumble judging device, computer equipment and storage medium
CN111626202B (en) * 2020-05-27 2023-08-29 北京百度网讯科技有限公司 Method and device for identifying video
CN112804561A (en) * 2020-12-29 2021-05-14 广州华多网络科技有限公司 Video frame insertion method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114220175A (en) 2022-03-22

Similar Documents

Publication Publication Date Title
KR102297393B1 (en) Gating Models for Video Analysis
WO2021088510A1 (en) Video classification method and apparatus, computer, and readable storage medium
CN107481327B (en) About the processing method of augmented reality scene, device, terminal device and system
US11520824B2 (en) Method for displaying information, electronic device and system
CN113365147B (en) Video editing method, device, equipment and storage medium based on music card point
CN114550053A (en) Traffic accident responsibility determination method, device, computer equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN110956059B (en) Dynamic gesture recognition method and device and electronic equipment
CN114302157B (en) Attribute tag identification and substitution event detection methods, device, equipment and medium thereof
CN114220175B (en) Motion pattern recognition method and device, equipment, medium and product thereof
CN115909127A (en) Training method of abnormal video recognition model, abnormal video recognition method and device
CN115114439A (en) Method and device for multi-task model reasoning and multi-task information processing
CN113313098B (en) Video processing method, device, system and storage medium
CN104881647B (en) Information processing method, information processing system and information processing unit
CN107656760A (en) Data processing method and device, electronic equipment
CN114038067B (en) Coal mine personnel behavior detection method, equipment and storage medium
CN114581994A (en) Class attendance management method and system
CN112165626B (en) Image processing method, resource acquisition method, related equipment and medium
CN112261321B (en) Subtitle processing method and device and electronic equipment
CN112101387A (en) Salient element identification method and device
CN112580750A (en) Image recognition method and device, electronic equipment and storage medium
CN112784631A (en) Method for recognizing face emotion based on deep neural network
CN112115740A (en) Method and apparatus for processing image
CN112906679B (en) Pedestrian re-identification method, system and related equipment based on human shape semantic segmentation
Pisnyi et al. AR Intelligent Real-time Method for Cultural Heritage Object Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant