CN112188140A - Face tracking video chat method, system and storage medium - Google Patents


Info

Publication number
CN112188140A
Authority
CN
China
Prior art keywords
video
face
frame
session
chat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011048611.5A
Other languages
Chinese (zh)
Inventor
康登立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Konka Electronic Technology Co Ltd
Original Assignee
Shenzhen Konka Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Konka Electronic Technology Co Ltd filed Critical Shenzhen Konka Electronic Technology Co Ltd
Priority to CN202011048611.5A priority Critical patent/CN112188140A/en
Publication of CN112188140A publication Critical patent/CN112188140A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/142Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems


Abstract

The invention discloses a face tracking video chat method, system, and storage medium. The method comprises the following steps: a capture device collects a video image, performs face recognition on the video image to output a recognition result, processes the face region according to the recognition result, generates an encoded video frame, and sends it to a chat device; the chat device obtains the device list in the current session, pulls the video stream of each remote device's client, simultaneously obtains the local video stream of the capture device, and sends the video streams to the player for decoding and playing. The invention achieves the effect that the video picture follows the human face within the visual range, and the audio and video data establish a direct session between the requesting devices through a hole-punching service, thereby overcoming the problems that forwarding video data through a server is costly and that the number of devices that can chat online simultaneously is limited.

Description

Face tracking video chat method, system and storage medium
Technical Field
The invention relates to the technical field of computer applications, and in particular to a face tracking video chat method, system, and storage medium.
Background
In a video chat application, automatic tracking of user movement can be achieved: a capture device connected to a computing device captures a user in its field of view and identifies a sub-frame of pixels locating the user's head, neck, and shoulders within the capture frame; this sub-frame is displayed to a remote user at a remote computing device participating in the video chat. When the user moves to a new location within the capture area, the capture device automatically tracks the position of the user's head, neck, and shoulders, and a new sub-frame of pixels containing them is identified and displayed to the remote user.
In such schemes the face recognition engine resides in the computing device (which is also the display device) rather than being integrated into the capture device, so application scenarios are limited; video chat terminals such as video conference systems or smart televisions are expensive compared with capture devices, so upgrading them is costly. The computing device must recognize several body parts (head, neck, shoulders, and so on), which places high demands on its NPU computing power and I/O performance and raises production cost. Detecting multiple parts requires loading several different algorithm models; complete recognition information for the current image frame is available only after all models have run, and is then applied to the next image frame, which limits the frame rate of images fed to the NPU and can cause delayed, misplaced, or erroneous markings. As video evolves from high definition to ultra-high definition, the volume of real-time audio and video data grows, and the costs of computing resources, server CPU and memory, network bandwidth, and network traffic for forwarding the data to other devices are high. Relaying data through a server also makes image delay and stuttering likely, and limits the number of terminals that can video-chat online simultaneously.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The main object of the present invention is to provide a face tracking video chat method, system, and storage medium, aiming to solve the problems in the prior art that face tracking video chat scenarios are limited and costly.
In order to achieve the above object, the present invention provides a face tracking video chat method applied to a face tracking video chat system, wherein the face tracking video chat system comprises a capture device and a chat device, and the face tracking video chat method comprises the following steps:
the capture device collects a video image, performs face recognition on the video image to output a recognition result, processes the face region according to the recognition result, generates an encoded video frame, and sends it to the chat device;
the chat device obtains the device list in the current session, pulls the video stream of each remote device's client, simultaneously obtains the local video stream of the capture device, and sends the video streams to the player for decoding and playing.
In the face tracking video chat method, the step in which the capture device collects a video image, performs face recognition on the video image to output a recognition result, processes the face region according to the recognition result to generate an encoded video frame, and sends it to the chat device specifically comprises:
acquiring a video image according to a default configuration frame rate;
acquiring a YUV frame, and inputting the YUV frame into a face recognition algorithm model to output a recognition result;
calculating the area of the face in the current visual field according to the recognition result, performing external expansion processing on the face area, applying the external expansion processing to the next frame of image, and performing cutting and scaling processing;
and acquiring a YUV frame, inputting the YUV frame into an encoder for video encoding, generating an encoded video frame, and sending the encoded video frame to the chat device.
In the face tracking video chat method, the step in which the chat device obtains the device list in the current session, pulls the video stream of each remote device's client, simultaneously obtains the local video stream of the capture device, and sends the video streams to the player for decoding and playing specifically comprises:
sending a point-to-point connection request to a hole-punching server, and acquiring the public network IP and port number as well as the intranet IP and port number of each client in the session request;
acquiring a device list in the current session, pulling a video stream of a client of a remote device, and sending the video stream to a player;
acquiring a session state and a device list, acquiring a video stream of a device in the session, and decoding to obtain a YUV frame;
and drawing a play display window according to the session equipment list and the state, converting the acquired YUV frame into RGB data, and rendering the RGB data to a corresponding play window.
In the face tracking video chat method, after calculating the region of the face within the current visual field according to the recognition result, expanding the face region outward, applying it to the next image frame, and performing cropping and scaling, the method further comprises:
if no person appears in the current visual field and the time since the last face tracking action has expired, restoring the original image preview.
In the face tracking video chat method, before sending the point-to-point connection request to the hole-punching server, the method further comprises:
triggering the hole-punching service during the user's calling or called process, and initializing the connection with the hole-punching server.
In the face tracking video chat method, after sending the point-to-point connection request to the hole-punching server and acquiring the public network IP and port number as well as the intranet IP and port number of each client in the session request, the method further comprises:
the local client on the chat device establishing a direct session with the clients in the current session request and updating the device list in the session.
In the face tracking video chat method, after obtaining the device list in the current session, pulling the video stream of each remote device's client, and sending it to the player, the method further comprises:
triggering the capture device to collect video through the UVC protocol, storing the collected video frames, sending them into the memory pool corresponding to the player, and simultaneously pushing them to the remote chat device clients in the session.
In the face tracking video chat method, the recognition result is description information of the facial features of the face.
In addition, to achieve the above object, the present invention further provides a face tracking video chat system, wherein the face tracking video chat system comprises: a capture device and a chat device, the capture device being connected to the chat device through USB;
the capture device includes:
the video acquisition module is used for acquiring video images according to a default configuration frame rate and sending the acquired YUV frames to the face recognition algorithm module and the first data processing module through the memory pool;
the face recognition algorithm module is used for acquiring YUV frames, inputting the YUV frames into the face recognition algorithm model to output a recognition result, and sending the output result to the first data processing module;
the first data processing module is used for calculating the area of the face in the current visual field range according to the recognition result, performing outward expansion processing on the face area, applying the outward expansion processing to the next frame of image, and performing cutting and scaling processing;
the video encoding module is used for acquiring a YUV frame, inputting the YUV frame into an encoder for video encoding, generating an encoded video frame, and sending the encoded video frame to the chat device;
the chat device includes:
the P2P hole-punching service module is used for sending a point-to-point connection request to the hole-punching server and acquiring the public network IP and port number as well as the intranet IP and port number of each client in the session request, so as to establish a direct session connection with the remote chat device;
the second data processing module is used for acquiring a device list in the current session, pulling a video stream of a client of the remote device, acquiring a local video stream of the capture device and sending the local video stream to the player;
the player module is used for acquiring a session state and a device list, acquiring and decoding a video stream of a device in the session, and sending an acquired YUV frame to the display module;
and the display module is used for drawing a play display window according to the session equipment list and the state, converting the acquired YUV frames into RGB data and rendering the RGB data to a corresponding play window.
In addition, in order to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores a face tracking video chat program, and the face tracking video chat program, when executed by a processor, implements the steps of the face tracking video chat method as described above.
In the present invention, a capture device collects a video image, performs face recognition on it to output a recognition result, processes the face region according to the recognition result to generate an encoded video frame, and sends it to the chat device; the chat device obtains the device list in the current session, pulls the video stream of each remote device's client, simultaneously obtains the local video stream of the capture device, and sends the video streams to the player for decoding and playing. The invention achieves the effect that the video picture follows the human face within the visual range, and the audio and video data establish a direct session between the requesting devices through a hole-punching service, thereby overcoming the problems that forwarding video data through a server is costly and that the number of devices that can chat online simultaneously is limited.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a face tracking video chat method of the present invention;
FIG. 2 is a software framework diagram of the preferred embodiment of the face tracking video chat method of the present invention;
FIG. 3 is a schematic diagram of a capture device for video capture in accordance with a preferred embodiment of the face tracking video chat method of the present invention;
FIG. 4 is a schematic diagram illustrating a chat device performing video chat according to a preferred embodiment of the face tracking video chat method of the present invention;
FIG. 5 is a schematic flow chart of the capturing device performing video capture in the preferred embodiment of the face tracking video chat method of the present invention;
FIG. 6 is a flow chart of image recognition performed by a capture device in accordance with a preferred embodiment of the face tracking video chat method of the present invention;
FIG. 7 is a flow chart of data processing performed by the capture device in the preferred embodiment of the face tracking video chat method of the invention;
FIG. 8 is a flow chart of video encoding by a capture device in the preferred embodiment of the face tracking video chat method of the invention;
FIG. 9 is a flow chart of the hole-punching service performed by the chat device in the preferred embodiment of the face tracking video chat method of the invention;
FIG. 10 is a flow chart of data processing performed by the chat device in the preferred embodiment of the face tracking video chat method of the invention;
FIG. 11 is a flow chart of player processing data in the chat device according to the preferred embodiment of the face tracking video chat method;
FIG. 12 is a flow chart of a chat device displaying video chat in the preferred embodiment of the face tracking video chat method of the present invention;
fig. 13 is a schematic diagram of a preferred embodiment of the face tracking video chat system of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
As shown in fig. 1, the face tracking video chat method according to the preferred embodiment of the present invention is applied to a face tracking video chat system; as shown in fig. 2, the face tracking video chat system includes a capture device (i.e., the video capture device in fig. 2) and a chat device (i.e., the video chat device in fig. 2), and the capture device is connected to the chat device through USB.
As shown in fig. 1, the face tracking video chat method includes the following steps:
and step S10, the capturing device collects the video image, carries out face recognition on the video image to output a recognition result, processes the face area according to the recognition result, generates a coded video frame and sends the coded video frame to the chatting device.
Specifically, as shown in FIG. 3, the capture device (i.e., the video capture device in FIG. 3) captures video images at the default configured frame rate; it acquires YUV frames and inputs them into the face recognition algorithm model to output a recognition result. The recognition result is description information of facial features, mainly comprising the face keypoint coordinates of each person in the image, face quality scores, distances, liveness scores, and other information. The face frame coordinates and size are calculated from the keypoint coordinates, and an appropriate outward expansion is applied to obtain the region where the (whole) person is located; this region is applied to the next image frame for cropping and scaling (i.e., data processing). YUV frames are then acquired (YUV is a color coding method in which Y represents luminance and U and V represent chrominance; common formats include YUV420P, YUV420SP, and YUV422) and input into an encoder for video encoding, generating encoded video frames that are sent to the chat device (i.e., the video chat device in FIG. 3).
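The face-box computation described above (keypoints to bounding box, then outward expansion clamped to the frame) can be sketched as follows; the 0.4 expansion ratio and the keypoint format are illustrative assumptions, since the patent only specifies an "appropriate" expansion:

```python
def face_box_from_keypoints(points, frame_w, frame_h, expand=0.4):
    """Compute an axis-aligned face box from keypoints, then expand it.

    `points` is a list of (x, y) face keypoints; `expand` is a hypothetical
    outward-expansion ratio applied to each side of the box.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    w, h = x1 - x0, y1 - y0
    # Expand on every side, clamped to the frame boundaries.
    x0 = max(0, int(x0 - w * expand))
    y0 = max(0, int(y0 - h * expand))
    x1 = min(frame_w, int(x1 + w * expand))
    y1 = min(frame_h, int(y1 + h * expand))
    return x0, y0, x1, y1
```

The resulting box is what the data-processing module would then apply to the next frame for cropping and scaling.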
Further, as shown in fig. 5, after the video capture process starts, drivers such as the sensor and UVC (USB Video Class, a protocol standard defined for USB video capture devices) drivers are loaded to initialize the media system. It is then judged whether video chat is started; if not, the process ends directly. When video chat is started, the sensor (e.g., a light sensor) captures an optical signal, converts it into an electrical signal, and acquires RAW data; the Image Signal Processing unit, mainly used to process the output of front-end image sensors and to match image sensors of different manufacturers, converts the RAW data into RGB data. During video acquisition, the RGB image is converted into a YUV image frame and stored; the YUV image frames are placed into a memory pool (a fixed-size-block memory allocation method) and fed to the face recognition algorithm module and the first data processing module. Finally, it is judged whether video data acquisition should stop; if so, the current flow ends, and if not, the sensor continues to capture optical signals and acquire images.
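The RGB-to-YUV conversion step in this pipeline can be illustrated per pixel; this sketch uses the standard full-range BT.601 coefficients, which is an assumption, as the patent does not name the exact matrix:

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to full-range YUV (BT.601 coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.5 * b + 128   # chroma centered at 128
    v = 0.5 * r - 0.419 * g - 0.081 * b + 128
    return round(y), round(u), round(v)
```

For example, pure white maps to Y=255 with neutral chroma (U=V=128), and pure black to Y=0 with the same neutral chroma.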
Further, as shown in fig. 6, after the face recognition process starts, the NPU (Neural-network Processing Unit) is initialized and the driver and face recognition algorithm model are loaded; YUV frames (i.e., YUV image frames) are acquired from the memory pool. It is judged whether the resolution of the YUV frame matches the requirements of the face recognition algorithm model; if not, the YUV frame is scaled to the required resolution. It is then judged whether the input frame rate exceeds the processing capacity of the model; if so, secondary frame-rate control is performed, and if not, the YUV frame is input into the face recognition algorithm model for face recognition. The NPU runs and returns a face recognition result (description information of facial features), which is placed into a queue. Finally, it is judged whether recognition should stop; if so, the current flow ends, and if not, the flow returns to continue acquiring YUV frames from the memory pool.
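The secondary frame-rate control step, dropping frames when the input rate exceeds the model's capacity, might look like the simple striding gate below; the exact dropping policy is an assumption, since the patent only states that secondary frame-rate control is performed:

```python
def admit_frame(frame_idx, input_fps, model_fps):
    """Decide whether frame `frame_idx` is fed to the recognition model.

    When the input rate exceeds the model's capacity, admit roughly
    `model_fps` out of every `input_fps` frames by striding; otherwise
    admit everything.
    """
    if input_fps <= model_fps:
        return True
    stride = input_fps / model_fps
    return int(frame_idx % stride) == 0
```

With a 30 fps input and a model that handles 10 fps, this admits every third frame (indices 0, 3, 6, ...).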
Further, as shown in fig. 7, after the data processing flow starts, a face recognition result is obtained from the queue, and it is judged whether no person appears in the current visual field. When no person appears, it is judged whether the time since the last face tracking action has expired; if no person appears in the current visual field and that time has expired, the original image preview is restored, with no shifting, cropping, or scaling. If a person appears in the current visual field, the face region is calculated. It is judged whether the face region is too small (for example, smaller than a preset size); if so, the face region is corrected, and an appropriate outward expansion is applied to obtain the position and size of the region where the person is located (i.e., the crop region). When the crop region is of suitable size, its information is applied to the next image frame for cropping; after the YUV image frame is cropped, it is scaled to the preview resolution. After cropping and scaling, the resulting YUV frames are sent into the memory pool and passed to the video encoding module. Finally, it is judged whether data processing should stop; if so, the current flow ends, and if not, the flow returns to continue obtaining face recognition results from the queue.
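The hold-crop-then-restore behavior (keep the last crop region, grow a too-small region, and fall back to the original preview once no face has been seen past the timeout) can be sketched as a small state machine; the minimum size, the timeout value, and the injected clock are illustrative assumptions:

```python
import time

ORIGINAL = None  # sentinel: no crop, show the full-frame preview

class FaceTracker:
    """Minimal sketch of the data-processing step described above."""

    def __init__(self, min_size=64, timeout=3.0, clock=time.monotonic):
        self.min_size = min_size      # hypothetical "too small" threshold
        self.timeout = timeout        # hypothetical tracking-expiry window
        self.clock = clock
        self.region = ORIGINAL
        self.last_seen = None

    def update(self, face_box):
        """Feed one recognition result; returns the crop region to apply."""
        now = self.clock()
        if face_box is None:  # nobody in the current visual field
            if self.last_seen is not None and now - self.last_seen > self.timeout:
                self.region = ORIGINAL  # restore the original-image preview
            return self.region
        x0, y0, x1, y1 = face_box
        # Correct a too-small region by growing it to the minimum size.
        if x1 - x0 < self.min_size:
            x1 = x0 + self.min_size
        if y1 - y0 < self.min_size:
            y1 = y0 + self.min_size
        self.region = (x0, y0, x1, y1)
        self.last_seen = now
        return self.region
```

A fake clock makes the expiry behavior easy to exercise: the crop is held through a brief absence and released only after the timeout elapses.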
Further, as shown in fig. 8, after the video encoding flow starts, the codec (encoder) driver is loaded to initialize the codec; YUV frames are acquired from the memory pool and input into the codec for encoding. The encoded video frames output by the codec (for example, H.264/H.265/MJPEG frames) are acquired and sent to the chat device through UVC for playback display and pushing to other clients. Finally, it is judged whether encoding should stop; if so, the current flow ends, and if not, the flow returns to continue acquiring YUV frames from the memory pool.
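The encoding worker loop (pull YUV from the memory pool, encode, hand off to the UVC sender, stop on request) can be sketched with injected stand-ins for the codec and transport, since the real encoder API is hardware-specific and not named in the patent:

```python
import queue

def encode_loop(pool, send, encode, stop):
    """Sketch of the video-encoding worker.

    `pool` is the memory pool (modeled as a queue of YUV frames), `encode`
    stands in for the hardware codec, `send` for the UVC transport, and
    `stop` is a callable polled to decide when to exit.
    """
    while not stop():
        try:
            yuv = pool.get(timeout=0.1)
        except queue.Empty:
            continue  # pool momentarily empty; re-check the stop condition
        send(encode(yuv))
```

Run single-threaded with `stop=pool.empty`, the loop drains the pool in order and exits.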
Step S20, the chat device obtains the device list in the current session, pulls the video stream of each remote device's client, obtains the local video stream of the capture device, and sends the video streams to the player for decoding and playing.
Specifically, as shown in fig. 4, the chat device triggers the hole-punching service during the user's calling or called process and initializes a connection with the hole-punching server; it sends a peer-to-peer (P2P) connection request to the hole-punching server (P2P is an internet architecture without a central server, in which information is exchanged directly among peers, reducing intermediate nodes and thus the risk of data loss) and obtains the public network IP and port number as well as the intranet IP and port number of each client in the session request. The local client on the chat device attempts hole punching to establish a direct session with the clients in the current session request and updates the device list in the session. According to the device list in the current session, the video streams of the remote device clients are pulled and sent to the player; meanwhile, the capture device is triggered through the UVC protocol to collect video, and the collected video frames are stored, sent into the memory pool corresponding to the player, and pushed to the remote chat device clients in the session (for example, a session established between video chat device A and video chat device B). The player obtains the session state and device list, and obtains and decodes the video streams of the devices in the session to obtain YUV frames; playback display windows are drawn according to the session device list and state, and the acquired YUV frames are converted into RGB data and rendered to the corresponding playback windows.
Further, as shown in fig. 9, after the hole-punching service flow starts, the chat device initializes a connection with the hole-punching server; it sends a P2P connection request, asking the hole-punching server to assist in establishing communication with the remote clients. It receives the response to the P2P connection request and acquires the public network IP and port number as well as the intranet IP and port number of the other clients (i.e., the remote clients) in the session request. The local client (the application running on the chat device) sends hole-punching packets to the other clients in the session request, and it is judged whether hole punching succeeded: if not, the current flow ends; if so, the device list of the current session is updated and the current flow ends.
Here, hole punching is a technique in computer networks for establishing a direct connection between two parties when one or both are located behind a firewall or a router using NAT (Network Address Translation, a technique that rewrites the source or destination IP address of an IP packet as it passes through a router or firewall, commonly used in private networks that have many hosts but access the internet through a single public IP address). With the assistance of an intermediate server, hole punching creates the relevant entries on each party's NAT gateway so that the messages sent by the two P2P-connected parties can pass directly through the other party's NAT gateway, achieving interconnection between P2P clients.
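One detail a hole-punching client must decide is which of the two addresses returned by the rendezvous (hole-punching) server, the public mapping or the intranet address, to try first. The ordering heuristic below, trying the intranet address first when both peers share a public IP (i.e., sit behind the same NAT), is a common convention and an assumption here, not taken from the patent text:

```python
def candidate_addresses(local, remote):
    """Order the remote addresses to try when punching toward `remote`.

    Each endpoint is described as ((public_ip, public_port),
    (private_ip, private_port)), as reported by the hole-punching server.
    """
    (l_pub, _), (r_pub, r_priv) = local, remote
    if l_pub[0] == r_pub[0]:
        # Same public IP: both peers are behind the same NAT, so the
        # intranet address is likely reachable directly.
        return [r_priv, r_pub]
    return [r_pub, r_priv]
```

The client would then send hole-punching packets to these addresses in order until one elicits a reply.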
Further, as shown in fig. 10, after the data processing flow starts, the device list of the current session request is obtained; the video frames of the remote device clients in the list are pulled, sent into the memory pool, and passed to the player (the player runs on the chat device and is mainly used to decode the received encoded video frames). Meanwhile, the video frames of the local capture device are acquired, pushed to the remote devices (i.e., the remote chat device clients), sent into the memory pool, and passed to the player. It is judged whether the session is interrupted; if so, the current flow ends, and if not, the device list of the current session request continues to be acquired.
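The fan-out of each locally captured frame, into the player's memory pool and to every remote client in the session, reduces to a few lines; `push` stands in for the real P2P send and the pool is modeled as a plain list:

```python
def fan_out_local_frame(frame, player_pool, session_peers, push):
    """Sketch of the chat-device data-processing step for local frames.

    Every locally captured frame is both queued for the local player
    (so the user sees their own picture) and pushed to each remote
    client in the current session.
    """
    player_pool.append(frame)
    for peer in session_peers:
        push(peer, frame)
```

The same loop structure applies symmetrically to pulled remote frames, which go only into the pool.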
Further, as shown in fig. 11, after the player data processing flow starts, the player is initialized; the device list in the session is obtained; the video stream of each device in the session is taken out of the memory pool; video decoding is performed to obtain YUV frames; and the YUV frames are placed in the memory pool and passed to the display module.
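For the decode step, the patent refers to "YUV frames" without naming a layout; assuming the common I420 (YUV 4:2:0) layout, the per-frame buffer sizes that the player and memory pool must handle work out as below.

```python
def i420_plane_sizes(width, height):
    """Plane byte sizes for an I420 (YUV 4:2:0) frame: one full-resolution
    luma plane plus two quarter-resolution chroma planes."""
    y = width * height
    c = (width // 2) * (height // 2)
    return y, c, c

def i420_frame_bytes(width, height):
    """Total bytes per decoded frame; for 4:2:0 this is 1.5 bytes/pixel."""
    return sum(i420_plane_sizes(width, height))
```

For a 1080p stream this is about 3.1 MB per decoded frame, which is why the flow keeps frames in a pool rather than copying them between modules.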
Further, as shown in fig. 12, after the video chat display flow starts, the device list in the session is obtained; a video playback window is drawn; a YUV frame is taken out of the memory pool and converted into RGB data; and the RGB data is rendered to the corresponding playback window.
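The YUV-to-RGB conversion performed in the display step can be sketched per sample as below; this assumes limited-range BT.601 coefficients, which the patent does not specify.

```python
def clamp8(x):
    """Clamp a value to the displayable 0-255 range."""
    return max(0, min(255, int(round(x))))

def yuv_to_rgb(y, u, v):
    """Convert one limited-range BT.601 YUV sample to 8-bit RGB,
    as the display module needs before rendering to the window."""
    c, d, e = y - 16, u - 128, v - 128
    r = clamp8(1.164 * c + 1.596 * e)
    g = clamp8(1.164 * c - 0.392 * d - 0.813 * e)
    b = clamp8(1.164 * c + 2.017 * e)
    return r, g, b
```

In practice this per-pixel conversion would be done vectorized or on the GPU, but the arithmetic is the same.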
The invention provides an automatic face tracking video chat method based on face recognition and P2P technology. A user triggers the capture device attached to the chat device to collect images; the collected YUV frames are fed into a face recognition algorithm model to obtain face key point information; the picture region where the person is located is determined from facial features such as the key points; that region is applied to the next image frame for cropping and scaling; and the preprocessed YUV frames are encoded and then transmitted to the chat device over UVC. The chat devices in the session request each initialize a connection with the P2P hole-punching server; during a call, the local client sends a request to the server to obtain the public network IP and port number and the intranet IP and port number of the called remote device. The local client then tries to send hole-punching packets to the remote device; once punching completes, the devices establish a direct session through the intranet-traversing connection, the locally encoded video stream is pushed to the remote device and the remote stream is pulled, and the video streams of all devices in the session (including the local one) are passed through the memory pool to the player for decoding and playback.
Further, as shown in fig. 13, based on the above face tracking video chat method, the present invention further provides a face tracking video chat system, which includes: a capture device 10 (which may be understood as a video capture subsystem deployed in the capture device) and a chat device 20 (which may be understood as a video chat subsystem deployed in the chat device), the capture device 10 being connected to the chat device 20 via USB.
Specifically, the capture device 10 includes:
the video acquisition module 11 is used for acquiring video images according to a default configuration frame rate and sending the acquired YUV frames to the face recognition algorithm module 12 and the first data processing module 13 through the memory pool;
the face recognition algorithm module 12 is configured to obtain a YUV frame, input the YUV frame into the face recognition algorithm model to output a recognition result, and send the output result to the first data processing module 13;
the first data processing module 13 is configured to calculate the region where the face is located in the current field of view according to the recognition result, expand the face region outward, and apply the expanded region to the next image frame for cropping and scaling;
the video encoding module 14 is configured to obtain a YUV frame, input the YUV frame into an encoder to perform video encoding, generate an encoded video frame, and send the encoded video frame to the chat device 20.
Specifically, the chat device 20 includes:
the P2P hole-punching service module 21 is configured to send a point-to-point connection request to the hole-punching server and obtain the public network IP and port number and the intranet IP and port number of each client in the session request, so as to establish a direct session connection with the remote chat device;
a second data processing module 22, configured to obtain a list of devices in the current session, pull a video stream of the client of the remote device, obtain a local video stream of the capture device, and send the local video stream to the player (i.e., the player module 23);
the player module 23 is configured to obtain a session state and a device list, obtain a video stream of a device in a session, decode the video stream, and send an obtained YUV frame to the display module 24;
and the display module 24 is configured to draw a play display window according to the session device list and the state, convert the acquired YUV frames into RGB data, and render the RGB data to a corresponding play window.
The video acquisition module 11 is responsible for acquiring video images at the default configured frame rate after being triggered by the user, and the acquired YUV frames are sent to the face recognition algorithm module 12 and the first data processing module 13 through the memory pool. The face recognition algorithm module 12 takes YUV data from the memory pool, feeds it into the face recognition algorithm model, places the returned recognition result in a queue, and passes it to the first data processing module 13. The first data processing module 13 calculates the region where the face is located in the current field of view according to the recognition result, moderately expands the face region, applies it to the cropping and scaling of the next image frame, places the preprocessed YUV frame in the memory pool, and passes it to the video encoding module 14. If no person appears in the current field of view and the last face tracking action has timed out, the original full-frame preview is restored. The video encoding module 14 takes the preprocessed YUV frames from the memory pool and sends them to the codec for video encoding, producing encoded video frames such as H.264/H.265/MJPEG, which are sent over UVC to the chat device 20 for local display and for pushing to the other clients.
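The expand-then-crop step performed by the first data processing module 13 can be sketched as below; the 20% margin is an illustrative assumption for the patent's "moderate" expansion, and the returned rectangle is what would be applied to the next frame before scaling to the preview resolution.

```python
def expand_and_clamp(box, frame_w, frame_h, margin=0.2):
    """Expand a detected face box (x, y, w, h) outward by a margin on each
    side, clamped to the frame bounds, yielding the crop region to apply
    to the NEXT image frame."""
    x, y, w, h = box
    dx, dy = w * margin, h * margin
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(frame_w, x + w + dx)
    y1 = min(frame_h, y + h + dy)
    return int(x0), int(y0), int(x1 - x0), int(y1 - y0)
```

Applying the region computed from frame N to frame N+1 is what lets a single lightweight model keep up with the capture frame rate, as the advantages section below argues.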
The P2P hole-punching service module 21 triggers the hole-punching service when the user places or receives a call: it initializes the connection with the hole-punching server, sends a P2P connection request, and obtains the public network IP and port number and the intranet IP and port number of the other clients in the session request; the local client then tries to send hole-punching packets, and once punching succeeds, it establishes a direct session with the clients in the current session request and updates the device list in the session. The second data processing module 22 obtains the device list of the current session and pulls the video frames of the remote device clients; these frames are placed in the memory pool and passed to the player module 23. Meanwhile, the second data processing module 22 triggers the capture device over UVC to collect video, stores the collected video frames, places them in the memory pool corresponding to the player, and pushes them to the remote chat device clients in the session. The player module 23 obtains the session state and device list, takes the video streams of the devices in the session out of the memory pool, decodes them, places the resulting YUV frames in the corresponding memory pool, and passes them to the display module 24. The display module 24 draws the playback windows according to the session device list and state, then takes YUV frames out of the memory pool, converts them into RGB data, and renders the data to the corresponding playback windows.
The invention has the following advantages:
(1) The NPU is located in the capture device, whereas in the prior art the face recognition engine is located in the display device; this removes the problems of high cost and limited application scenarios caused by having to purchase both a capture device and a display device in the prior art.
(2) The invention achieves face following relying only on facial features, which places lower demands on NPU computing power, I/O performance, and the like, greatly reducing cost. It also avoids the delay, stuttering, and misalignment problems that may arise in the prior art, which depends on multiple positional features and requires several algorithm models whose results can only be applied to the next frame after computation completes.
(3) In the invention, the audio and video data travel over direct P2P connections established by hole punching, with sessions set up directly between the requesting devices. This avoids the bandwidth and traffic costs of relaying through a server, and removes the limit on how many devices can chat online simultaneously.
The invention also provides a storage medium, wherein the storage medium stores a face tracking video chat program which, when executed by a processor, implements the steps of the face tracking video chat method described above.
In summary, the present invention provides a face tracking video chat method, system, and storage medium. In the method, the capture device collects a video image, performs face recognition on it to output a recognition result, calculates the face region from the result, moderately expands that region to obtain a cropping region, scales the cropped image to the preview resolution, sends the scaled YUV image frame to an encoder to generate an encoded video frame, and sends the encoded frame to the chat device; the chat device obtains the device list of the current session, pulls the video streams of the remote device clients, obtains the local video stream of the capture device, and sends the streams to the player for decoding and playback. The invention makes the video picture follow the face within the visual range, and through the hole-punching service the audio and video data establish sessions directly between the requesting devices, overcoming the high cost of forwarding video data through a server and the limit on the number of devices that can chat online simultaneously.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program may be stored in a computer readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A face tracking video chatting method is applied to a face tracking video chatting system, the face tracking video chatting system comprises a capturing device and a chatting device, and the face tracking video chatting method is characterized by comprising the following steps:
the method comprises the steps that a capturing device collects a video image, carries out face recognition on the video image to output a recognition result, processes a face area according to the recognition result, generates a coded video frame and sends the coded video frame to a chatting device;
the chat device obtains the device list in the current session, pulls the video stream of the client of the remote device, simultaneously obtains the local video stream of the capture device, and sends the video stream to the player for decoding and playing.
2. The face tracking video chat method according to claim 1, wherein the capturing device collects a video image, performs face recognition on the video image to output a recognition result, processes a face region according to the recognition result to generate a coded video frame, and sends the coded video frame to the chat device, and specifically comprises:
acquiring a video image according to a default configuration frame rate;
acquiring a YUV frame, and inputting the YUV frame into a face recognition algorithm model to output a recognition result;
calculating the area of the face in the current visual field according to the recognition result, performing external expansion processing on the face area, applying the external expansion processing to the next frame of image, and performing cutting and scaling processing;
and acquiring a YUV frame, inputting the YUV frame into an encoder for video encoding, generating an encoded video frame, and sending the encoded video frame to the chatting equipment.
3. The face tracking video chat method according to claim 2, wherein the chat device obtains a list of devices in the current session, pulls a video stream of a client of the remote device, obtains a local video stream of the capture device, and sends the video stream to the player for decoding and playing, specifically comprising:
sending a point-to-point connection request to a hole-punching server, and acquiring the public network IP and port number and the intranet IP and port number of each client in the session request;
acquiring a device list in the current session, pulling a video stream of a client of a remote device, and sending the video stream to a player;
acquiring a session state and a device list, acquiring a video stream of a device in the session, and decoding to obtain a YUV frame;
and drawing a play display window according to the session equipment list and the state, converting the acquired YUV frame into RGB data, and rendering the RGB data to a corresponding play window.
4. The face tracking video chat method according to claim 2, wherein the calculating of the region where the face is located in the current field of view according to the recognition result, performing the outward expansion processing on the face region, applying it to the next image frame, and performing the cropping and scaling processing, further comprises:
and if no person appears in the current field of view and the last face tracking action has timed out, restoring the original full-frame preview.
5. The face tracking video chat method according to claim 3, wherein sending a point-to-point connection request to the hole-punching server further comprises:
and triggering the hole-punching service when the user places or receives a call, and initializing the connection with the hole-punching server.
6. The face tracking video chat method according to claim 3, wherein after sending the point-to-point connection request to the hole-punching server and acquiring the public network IP and port number and the intranet IP and port number of the client in the session request, the method further comprises:
and the local client on the chat device establishes a direct session with the client in the current session request and updates a device list in the session.
7. The face tracking video chat method according to claim 3, wherein the obtaining a list of devices in the current session, pulling a video stream of the remote device client, and sending the video stream to the player, further comprises:
triggering the capture device to collect the video through the UVC protocol, storing the collected video frame, sending the video frame into a memory pool corresponding to the player, and simultaneously pushing the video frame to a remote chat device client in the session.
8. The face tracking video chat method of claim 1, wherein the recognition result is descriptive information of facial features of the human face.
9. A face tracking video chat system, the face tracking video chat system comprising: the device comprises a capturing device and a chatting device, wherein the capturing device is connected to the chatting device through a USB;
the capture device includes:
the video acquisition module is used for acquiring video images according to a default configuration frame rate and sending the acquired YUV frames to the face recognition algorithm module and the first data processing module through the memory pool;
the face recognition algorithm module is used for acquiring YUV frames, inputting the YUV frames into the face recognition algorithm model to output a recognition result, and sending the output result to the first data processing module;
the first data processing module is used for calculating the area of the face in the current visual field range according to the recognition result, performing outward expansion processing on the face area, applying the outward expansion processing to the next frame of image, and performing cutting and scaling processing;
the video coding module is used for acquiring a YUV frame, inputting the YUV frame into an encoder for video coding, generating a coded video frame and sending the coded video frame to the chatting equipment;
the chat device includes:
the P2P hole-punching service module is used for sending a point-to-point connection request to a hole-punching server, and acquiring the public network IP and port number and the intranet IP and port number of each client in the session request, so as to establish a direct session connection with the remote chat device;
the second data processing module is used for acquiring a device list in the current session, pulling a video stream of a client of the remote device, acquiring a local video stream of the capture device and sending the local video stream to the player;
the player module is used for acquiring a session state and a device list, acquiring and decoding a video stream of a device in the session, and sending an acquired YUV frame to the display module;
and the display module is used for drawing a play display window according to the session equipment list and the state, converting the acquired YUV frames into RGB data and rendering the RGB data to a corresponding play window.
10. A storage medium storing a face tracking video chat program, the face tracking video chat program when executed by a processor implementing the steps of the face tracking video chat method of any of claims 1-8.
CN202011048611.5A 2020-09-29 2020-09-29 Face tracking video chat method, system and storage medium Pending CN112188140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011048611.5A CN112188140A (en) 2020-09-29 2020-09-29 Face tracking video chat method, system and storage medium


Publications (1)

Publication Number Publication Date
CN112188140A true CN112188140A (en) 2021-01-05

Family

ID=73945574


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949427A (en) * 2021-02-09 2021-06-11 北京奇艺世纪科技有限公司 Person identification method, electronic device, storage medium, and apparatus

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1924894A (en) * 2006-09-27 2007-03-07 北京中星微电子有限公司 Multiple attitude human face detection and track system and method
CN102055912A (en) * 2009-10-29 2011-05-11 北京中星微电子有限公司 Video application system, video special effect processing system and method
CN102348093A (en) * 2011-08-23 2012-02-08 太原理工大学 Intelligent base of Android mobile phone for video chat
CN102541085A (en) * 2010-12-30 2012-07-04 深圳亚希诺科技有限公司 Spherical-bottom device and method for tracking video target and controlling posture of video target
US20160361653A1 (en) * 2014-12-11 2016-12-15 Intel Corporation Avatar selection mechanism
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
CN107992832A (en) * 2017-12-08 2018-05-04 阳光暖果(北京)科技发展有限公司 A kind of face tracking System and method for
CN110892411A (en) * 2017-07-28 2020-03-17 高通股份有限公司 Detecting popular faces in real-time media


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105